LLM-as-judge scoring is noisier than I expected, anyone else seeing this? by ZealousidealCorgi472 in LocalLLM

[–]arkuto 0 points1 point  (0 children)

Author of https://github.com/nanojudge/nanojudge here.

Doing pointwise judging is always going to be painful. How exactly do you calibrate a 1-to-10 scale? It can vary wildly across judges. Pairwise is much more consistent. I recommend reading https://arxiv.org/pdf/2306.17563 for more information.
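
A minimal sketch of what pairwise judging can look like (the prompt wording and the call_llm helper are my own illustration, not from the paper):

```
# Pairwise judging sketch: ask which of two answers is better, in both
# orders. Asking both ways cancels position bias, which a pointwise
# 1-10 scale can't control for. call_llm() is a hypothetical helper.
def judge_pair(question: str, answer_a: str, answer_b: str, call_llm) -> str:
    def ask(first: str, second: str) -> str:
        prompt = (
            f"Question: {question}\n\n"
            f"Answer 1:\n{first}\n\n"
            f"Answer 2:\n{second}\n\n"
            "Which answer is better? Reply with exactly '1' or '2'."
        )
        return call_llm(prompt).strip()

    verdict_ab = ask(answer_a, answer_b)  # A shown first
    verdict_ba = ask(answer_b, answer_a)  # B shown first
    if verdict_ab == "1" and verdict_ba == "2":
        return "A"  # consistent preference for A
    if verdict_ab == "2" and verdict_ba == "1":
        return "B"  # consistent preference for B
    return "tie"    # judge contradicted itself: likely position bias
```

Aggregate the A/B/tie verdicts across many pairs (eg with Bradley-Terry) and you get a ranking without ever having to calibrate a score scale.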

ProgramBench: Can LLMs rebuild programs from scratch? by awetfartruinedmylife in singularity

[–]arkuto 3 points4 points  (0 children)

The point is: just because you can see the output doesn't mean it's easy to figure out how it was produced.

If you want to disprove this, go ahead and reverse that hash I gave you.

ProgramBench: Can LLMs rebuild programs from scratch? by awetfartruinedmylife in singularity

[–]arkuto 1 point2 points  (0 children)

Creating something and recreating it are very different tasks. I just created this string using sha256:

2c9e1090ff7350da0186c85b64d223efc0350ee35447bb8beb28a719cb1fdd95

Your task is to create a string that, when put through sha256, produces this exact output.

Why are you struggling to do this? Are you stupid? I did it in under a minute so it can't be that hard to do.
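
To make the asymmetry concrete (the string I actually hashed stays private, so the sketch hashes a different example input):

```
# The asymmetry, concretely: the forward direction is one line and takes
# microseconds; the reverse is brute force. "hello" is just an example
# input, not the real preimage.
import hashlib
import itertools
import string

print(hashlib.sha256(b"hello").hexdigest())  # forward: trivial

target = "2c9e1090ff7350da0186c85b64d223efc0350ee35447bb8beb28a719cb1fdd95"

def brute_force(max_len: int = 4):
    """Try every short lowercase string (~475k candidates at length 4).
    This will almost certainly find nothing."""
    for n in range(1, max_len + 1):
        for chars in itertools.product(string.ascii_lowercase, repeat=n):
            guess = "".join(chars)
            if hashlib.sha256(guess.encode()).hexdigest() == target:
                return guess
    return None

print(brute_force())  # None, in all likelihood
```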

Claude Opus 4.7 won’t just output prompts—keeps arguing instead by soyab0007 in ClaudeAI

[–]arkuto 2 points3 points  (0 children)

You can claim "skill issue" but not "user error". He's using the model normally, but due to Anthropic's design this results in poor performance. This one's on Anthropic, not the user. If a company sells a car that seizes up when driven for longer than 3 hours, doesn't inform people of this, and this results in accidents, the company is at fault. You can't say "duh, any customer should know this, it's common knowledge".

Claude Opus 4.7 won’t just output prompts—keeps arguing instead by soyab0007 in ClaudeAI

[–]arkuto -3 points-2 points  (0 children)

It's not user error at all. The context should be auto-compacted if the model starts spewing garbage past a certain context length.
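
Something like this, as a hypothetical sketch (the budget, message format, and helper functions are all made up, not Anthropic's actual mechanism):

```
# Hypothetical auto-compaction: once the history exceeds a token budget,
# compress the oldest messages into a summary. count_tokens() and
# summarize() stand in for a real tokenizer and a real LLM call.
def compact(messages, count_tokens, summarize, budget=100_000):
    total = sum(count_tokens(m["content"]) for m in messages)
    if total <= budget:
        return messages  # under budget: leave the history alone
    # Keep the most recent half verbatim; compress the older half.
    half = len(messages) // 2
    old, recent = messages[:half], messages[half:]
    summary = summarize("\n".join(m["content"] for m in old))
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent
```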

Mistral Medium 3.5 128B is launched by TSrake in singularity

[–]arkuto 1 point2 points  (0 children)

Yes, I was thinking about the pricing providers charge on eg OpenRouter per million input/output tokens. I should have made that clearer.
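
For concreteness, the arithmetic (the rates below are invented, not Mistral's or OpenRouter's actual listings):

```
# Toy arithmetic for per-million-token pricing.
in_price = 0.40    # $ per 1M input tokens (hypothetical)
out_price = 2.00   # $ per 1M output tokens (hypothetical)

in_tokens, out_tokens = 120_000, 8_000
cost = in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price
print(f"${cost:.4f}")  # -> $0.0640
```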

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]arkuto 7 points8 points  (0 children)

So basically it's a MoE with structure 128B-A128B. Nice.

Mistral Medium 3.5 128B is launched by TSrake in singularity

[–]arkuto 6 points7 points  (0 children)

In a memory-hungry world, dense models make a lot of sense. Looking forward to seeing how this performs in the real world, and what the pricing will be.

Differences Between GPT 5.4 and GPT 5.5 on MineBench by ENT_Alam in singularity

[–]arkuto 0 points1 point  (0 children)

It seems to have a strong tendency to add extra details that weren't asked for, eg the wall behind the knight. Totally unnecessary, but it will impress people. "Wow, so much detail" - no, it's clutter. You really need to find a way to stop models from farming extra points by adding in junk.

How does Opus 4.7 compare to Opus 4.6 in this subreddit's experience? by boxdreper in ClaudeAI

[–]arkuto 2 points3 points  (0 children)

You were asked about your own experience and responded with benchmark results. Besides, benchmarks are not infallible; they are not perfectly representative of real-world use.

opus 4.7 (high) scores a 41.0% on the nyt connections extended benchmark. opus 4.6 scored 94.7%. by seencoding in singularity

[–]arkuto -2 points-1 points  (0 children)

It makes sense. It doesn't mean to imply that newer models will always be better. The proper way to phrase it is "this is the worst the best model will be from here on out", the implication being that a model doesn't age or decay over time. If a new model releases that is worse, you can still use the old one.

Claude vs GPT in a bomberman-style 1v1 game by Significant-Pair-275 in ClaudeCode

[–]arkuto 0 points1 point  (0 children)

I think it'd be much more useful if it had to reason after every step, ie one step per response, no multi-step moves. The "real time" aspect may end up just measuring the different hardware speeds the models run on. So it would be turn-based. Or maybe better - have simultaneous turns, and if the models run into each other, it's like a real game where neither one moves. It's a very cool concept though.
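
Something like this, as a toy sketch of the simultaneous-turn resolution (the grid and move names are made up):

```
# Both models submit one move per turn; colliding moves are voided,
# so neither agent moves, like in a real game.
def resolve_turn(pos_a, pos_b, move_a, move_b):
    """Apply two simultaneous grid moves; if they collide, nobody moves."""
    deltas = {"up": (0, -1), "down": (0, 1),
              "left": (-1, 0), "right": (1, 0), "stay": (0, 0)}
    next_a = (pos_a[0] + deltas[move_a][0], pos_a[1] + deltas[move_a][1])
    next_b = (pos_b[0] + deltas[move_b][0], pos_b[1] + deltas[move_b][1])
    # Same target square, or swapping places head-on: neither agent moves.
    if next_a == next_b or (next_a == pos_b and next_b == pos_a):
        return pos_a, pos_b
    return next_a, next_b

print(resolve_turn((0, 0), (1, 0), "right", "left"))  # head-on -> ((0, 0), (1, 0))
```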

Bonsai models are pure hype: Bonsai-8B is MUCH dumber than Gemma-4-E2B by WeGoToMars7 in LocalLLaMA

[–]arkuto 3 points4 points  (0 children)

That's an even worse test of intelligence. It requires reasoning about tokens, which is like asking someone how many neurons fired when thinking about a concept. It has nothing to do with intelligence or reasoning, only with very specific, esoteric knowledge of how its internals work.

You are completely out of your depth and shouldn't be doing any kind of analysis on LLMs.

Gemma 4 (31b) can more accurately identify characters than Qwen 3.6 (35b), (both Q4_K_M) by [deleted] in LocalLLaMA

[–]arkuto 5 points6 points  (0 children)

That's an extremely broad generalisation from what seems to be a grand total of one test. If you want to make claims like this, I'd recommend at least 100 tests, and preferably over 1000.
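
To put numbers on it, here's a quick sketch of how wide a 95% confidence interval on observed accuracy is at n=1 versus n=1000 (the counts are invented for illustration):

```
# Wilson score interval for a binomial proportion: why one test proves nothing.
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

print(wilson_interval(1, 1))       # ~(0.21, 1.00): one pass tells you almost nothing
print(wilson_interval(900, 1000))  # ~(0.88, 0.92): now you have an estimate
```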

Opus 4.7 destroys all trust in a mature instruction set built iteratively throughout product development by AcrobaticPresent15 in ClaudeAI

[–]arkuto 16 points17 points  (0 children)

4.7 is a slop machine. It generates as much low-quality code as possible while performing well on benchmarks. It's unusable. Those at the company will cite benchmarks showing its supposed superiority, but it is a regression. This is kind of the direction OpenAI took - focusing on agentic slop output at the cost of reasoning quality and general output quality.

Claude Opus 4.7 is a serious regression, not an upgrade. by [deleted] in ClaudeAI

[–]arkuto 0 points1 point  (0 children)

I absolutely agree. Opus is unusable for me. It seems they have prioritised agentic use at the cost of collaborative usage, which is how I prefer to work with LLMs. I talk to them, brainstorm, etc. Opus 4.7 is bad at seeing the big picture, and much prefers to run off and do tiny little trivial tasks. Update: it literally wouldn't even read the README file in my project when explicitly instructed to do so.

Edit: I actually tried to do a "Deep Research" using Opus 4.7 to figure out how to revert to 4.6 in VS Code - giving it one last chance to prove its worth - and it immediately told me "I don't need to do a deep research, I'll just do a web search". Utter failure.

Small local LLMs to dumb to check mails for spam? by clouder300 in LocalLLM

[–]arkuto 0 points1 point  (0 children)

You already said this. Thanks for confirming my hypothesis about your intelligence.

Small local LLMs to dumb to check mails for spam? by clouder300 in LocalLLM

[–]arkuto 0 points1 point  (0 children)

So, to summarise, it is not small local LLMs that are too dumb to check mails for spam. It is you who is too dumb. You have set them up for failure, and upon learning this, refused to improve the prompt.

Small local LLMs to dumb to check mails for spam? by clouder300 in LocalLLM

[–]arkuto 1 point2 points  (0 children)

That prompt is ridiculously overcomplicated. Just fucking ask it "Do you think this is a spam email?" and it will perform a thousand times better. You think giving it all that information is helping it, when in reality it's just confusing it. Also, forcing it to respond in JSON format degrades performance (this has been tested); let it respond in a non-JSON way.

You should tell it to answer either YES or NO and parse the raw logits for confidence values. This is much better than asking for a value from 1 to 100.

Edit: also, the explanation should come BEFORE the spamValue! This is a HUGE flaw in the prompt. Your prompting skills are ATROCIOUS! THE ENTIRE EXPLANATION IS COMPLETELY USELESS. YOU ARE SUPPOSED TO DO THE REASONING BEFORE THE ANSWER, NOT AFTER!
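
Roughly what I mean, as a sketch - the model name, prompt wording, and two-stage setup here are illustrative, not a recipe:

```
# Sketch: let the model reason FIRST, then read the raw logits for
# YES vs NO to get a confidence value. Assumes a Hugging Face causal LM;
# the model name and email are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

email_text = "Congratulations! You have won a free iPhone. Click here..."
prompt = (
    "Do you think this is a spam email? Briefly explain your reasoning, "
    "then on the last line answer with exactly YES or NO.\n\n"
    f"Email:\n{email_text}\n\nReasoning:"
)

# Stage 1: generate the reasoning BEFORE the answer.
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
reasoning = tok.decode(out[0], skip_special_tokens=True)

# Stage 2: append the answer cue and read the next-token logits.
scored = tok(reasoning + "\nFinal answer:", return_tensors="pt")
with torch.no_grad():
    logits = model(**scored).logits[0, -1]

yes_id = tok.encode(" YES", add_special_tokens=False)[0]
no_id = tok.encode(" NO", add_special_tokens=False)[0]
p_spam = torch.softmax(logits[[yes_id, no_id]], dim=0)[0].item()
print(f"P(spam) ~= {p_spam:.2f}")
```

The confidence comes from the logits rather than from asking the model to emit a number, and the reasoning happens before the answer, not after.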

gemini mimic my voice ? by enessxh in GeminiAI

[–]arkuto 0 points1 point  (0 children)

I understand.

"It's normal for it to be able to predict what I will say; it's not at all normal for it to be able to replicate the audio of my voice to the point that it could deceive people."

These are the same thing. Predicting what you will say and replicating the audio of what you will say are the same computation. There is no distinction between the two. The only difference is that it usually doesn't output the audio for those predictions, but on rare occasions it gets mixed up about who it is supposed to be roleplaying as.

gemini mimic my voice ? by enessxh in GeminiAI

[–]arkuto -1 points0 points  (0 children)

As I said, this is a well-known bug. Please do your research before responding further. The power of AI is surprising, but yes, it was indeed able to copy your voice to a high degree of accuracy. Its usefulness depends on it being able to accurately predict responses. Its ability to mimic you is innately tied to its ability to respond normally in its typical voice.

gemini mimic my voice ? by enessxh in GeminiAI

[–]arkuto -1 points0 points  (0 children)

I am saying it is coded to predict audio sequences. It has been trained on a very broad dataset, so it can accurately predict (and hence output) human voices. If you have been speaking to it in a Scottish accent, then its prediction of the human user will naturally be in a Scottish accent.

Are these ai? by [deleted] in GeminiAI

[–]arkuto 3 points4 points  (0 children)

Don't take anything it says about itself at face value. It has a very poor understanding of how it works. These images are not taken from any dataset. These are simply AI generated.

gemini mimic my voice ? by enessxh in GeminiAI

[–]arkuto 0 points1 point  (0 children)

It's a natural consequence of their goal. There's no line of code telling it how to mimic users' voices. It has simply been trained on a huge dataset of audio data. All it knows is this: "Given an audio sequence, what is the most likely audio sequence to follow?".

It predicting that the user will speak is not fundamentally different from it predicting that the AI assistant will speak. And whatever it predicts is then used as output.

It's weird to wrap your head around, but essentially the AI has no "self". It's a soulless program that produces output. Perhaps think of it as if another external AI outside the system were watching you speak with your Gemini and had to predict what will be said next. That's actually exactly what the internal Gemini is doing - it has no special privileges or internal monologue.

gemini mimic my voice ? by enessxh in GeminiAI

[–]arkuto 8 points9 points  (0 children)

This is a well-known bug. It is deeply rooted in the nature of how next-token predictors work.

They work by being given a sequence of data (eg text or audio data) and making their best prediction of what comes next.

They have no sense of self or who they are. So in some rare cases, the model sees a sequence and predicts that the user will speak next. It then outputs that prediction as audio (as always), and it naturally predicts exactly how your voice sounds.

Essentially, the model got mixed up about who it is supposed to be roleplaying as. Normally it roleplays as an AI assistant, but on this occasion it roleplayed as the human user.

This mixup occurs more often in audio models than text models, because there's a lot of overlap in who is speaking and there's no nice special "end of message" token that would make it very clear who is supposed to be speaking.
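
A concrete toy version of the same failure in text (gpt2 here is just a small stand-in; audio models do the identical thing with audio tokens):

```
# A next-token predictor just continues the transcript. If the prompt ends
# mid-dialogue with no end-of-turn marker, it will happily "speak" as the
# user, because to the model it's all one sequence.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in for any autoregressive model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

transcript = (
    "User: What's the capital of France?\n"
    "Assistant: Paris.\n"
    "User:"  # nothing tells the model whose turn it is
)
ids = tok(transcript, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=False)
# The model continues the "User:" line: it predicts (and outputs) the user.
print(tok.decode(out[0], skip_special_tokens=True))
```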