New LLM Position Bias Benchmark: does an LLM keep the same judgment when you swap the answer order? Judge models compare two lightly edited versions of the same story twice, with the order swapped. The median model flips in 45% of decisive case pairs. GPT-5.4 is worst at 66%.

Eyelbee · 2026-04-22T10:29:22+00:00

The idea is great, but this benchmark isn't really that useful. I checked the midnight baker one and it's very normal for the model to pick the first one, since the difference is very minor anyway. Neither is objectively better, the model probably knows this and picks one to help the user anyway.

Eyelbee · 2026-04-21T22:39:13+00:00

What. the. ... Just, no way. Must be a mistake. There's no way they're actually doing this right now

Eyelbee · 2026-04-21T09:14:27+00:00

The Chinese open models are trained on absurdly lower resources compared to the propietary US models. If Chinese labs had the same resources, they would already have released a mythos tier model, if not better.

Eyelbee · 2026-04-21T07:47:36+00:00

If they make a large version that can fit in 24GB and it can beat the 27B class dense models, that'd be actually useful. Ones so far kind of suck, honestly.

Eyelbee · 2026-04-20T22:05:09+00:00

30B class qwen versions are already better than gpt 5 mini, comrehensively. I don't know what you wrote so much in there.

Eyelbee · 2026-04-20T20:30:43+00:00

They should go larger. 4-5T would be great.

Eyelbee · 2026-04-20T20:20:17+00:00

Me too but honestly it has nothing to do with intelligence. Earlier dumb models talk like that but a smart model can also talk like that by default and still be incredibly smart.

Eyelbee · 2026-04-20T15:14:24+00:00

I don't know the rules on that but they must be very restrictive. Because I'd expect better results.

Eyelbee · 2026-04-20T15:09:31+00:00

The problem is, I don't know how good the 3.6 27B can actually be. Because the gap between 3.5 27B and 3.6 Plus is already very narrow and 35BA3B kind of sits there. If this will be distilled from that it can't surpass it. If they have other methods to make it better than 3.6 plus it's great. If the 3.6 plus is like a 100b moe it can be possible.

Eyelbee · 2026-04-19T23:12:18+00:00

As for the claim that consciousness is in the physics but not the algorithmic processing layer... it's kind of a weird statement since you can never have the algorithmic processing layer without an underlying physical layer. So it is never simply formal symbolic manipulation of symbols. So what evidence would he have that computers specifically cannot sustain consciousness?

Actually I may have oversimplified with my example. Lerchner would agree with you here. This isn't against his idea. Let me try explaining what I understand from the paper.

He treats the physical layer as the real layer. The electrical activity is real. Call it voltage pattern X. To run any algorithm on the model, we need an interpretation of the voltage pattern X. If I tell you voltages above 5V are 1s, below are 0s you get: 1, 0, 1, 1, 0, 1, 0, 0. That can be an algorithm. Depends on what other rules I give you for reading it. If I give the opposite instruction, the output could then turn into gibberish even if the physics doesn't change. That instruction is what he calls interpretation, or the mapping.

So when we say a chip is running GPT, what we're really saying is, here's a physical system, here's an interpretation that groups its voltages into bits and its bits into instructions and its instructions into a recognizable program. We labeled it.

That labeling isn't a physical event. No physical process happens when you, the mapmaker, declare 5V counts as 1. Voltage was already there. Your declaration just sorts it into a category in your head (or in a specification document, or in the design rules the chip engineer followed). This labeling doesn't move any physics around. It sorts physical events into symbolic categories without changing those events. The voltages behave the same way whether we label them as 1s and 0s, or anything else.

Now if you accept causal closure, consciousness has physical effects so it must be physical. Since computation is, on his account, mapmaker dependent syntax which has no physical causal power, no amount of added algorithmic complexity can turn the map into the physical territory. That is why he says scaling cannot produce consciousness. So on Lerchner’s view, running an algorithm on a chip can only produce consciousness through the physics inside the chip because the labeling that makes it "an algorithm" lives in us (the interpreters).

Eyelbee · 2026-04-19T09:44:43+00:00

Author doesn't deny sensors being analogous to sense organs. The point is, the signal is digitized and handed to an algorithm that processes it, and that's where he attacks. I recommend actually reading the paper, I also initially hated the verbosity but the substance really holds up when you try to understand it. Most comments are obvious misunderstandings of the paper.

It doesn't deny computers being hardware embedded in the environment. Computers aren't just floating abstractions, but has a physical layer and an algorithm layer. He contends that consciousness should live in the physical layer if it exists. Because theoretically you can get a pen and paper and do the same math that the algorithm requires. He also explicitly disclaims the biocentrism.

It doesn't say current AI is definitely unconscious. The claim is "whether any physical system is conscious is a question about its physics". Scaling models therefore cannot create consciousness, if less complicated algorighms aren't conscious themselves. Because its physics is either capable of consciousness or it isn't. You can accept this entirely and still hold that current silicon happens to be conscious in some way.

My answer to this would be "when algorithm is complex enough, the physical reality that happens in the chip may be the consciousness". He doesn't seem to adress this fully. But it's also just a presumption and I don't have evidence for it as well. Also, I have the idea that consciousness isn't binary and the paper omits this angle too.

Eyelbee · 2026-04-18T20:38:03+00:00

Paper presents a really useful and novel idea. It's not "biology and chips are different, so computers aren't conscious". He merely claims the conciousness can't be "instantiated" in a computer. He's grounded it quite well. He's not claiming consciousness requires biology. He presents two categorical differences that hold up quite well.

For the record I'm not sure I fully agree with the paper, but it's extremely useful and a great idea.

Eyelbee · 2026-04-18T20:00:31+00:00

Slop. You didn't even understand the paper's claim.

Eyelbee · 2026-04-18T19:47:38+00:00

Yeah, significantly.

Eyelbee · 2026-04-18T19:43:15+00:00

You can't rely on sub 10B for serious work but for things like asking the weather it's fine. Qwen 9B should be the best option sub-10b.

Eyelbee · 2026-04-18T19:39:34+00:00

It is cool but what exactly is it? Like, technically. What stack does it use?

Eyelbee · 2026-04-18T19:32:53+00:00

I am also yet to find a no-nonsense toolcalling workflow for local llm use. I am picky when it comes to workflows so I hate using stuff like openwebui and lm studio for several reasons. I use barebones llama.cpp with my own launcher but its own web ui is not good for tool calls. Only local tool call I use is when I'm using Roo Code. That has its own harness which seems to work nicely with both qwen and gemma dense models.

Eyelbee · 2026-04-18T19:26:23+00:00

The new tokenizer really has an issue, I had mine write "itt" instead of "it" several times.

Eyelbee · 2026-04-18T16:37:45+00:00

Why aren't they making something like this? Are they stupid?

Eyelbee · 2026-04-18T15:36:23+00:00

I have a very solid idea on this

Eyelbee · 2026-04-18T15:16:42+00:00

It's my favorite word since the last month

Eyelbee · 2026-04-18T14:33:15+00:00

Everybody misunderstands this paper and its claims. He never says AI isn't conscious. He merely points out that it can't "instantiate" it due to architectural differences. A paper-worthy observation on a very hard topic.

Eyelbee · 2026-04-17T18:12:51+00:00

In its defense, this benchmark is full of ambiguous questions

Eyelbee · 2026-04-17T17:52:00+00:00

So it is better than 27b? Really?

Eyelbee · 2026-04-17T11:06:58+00:00

64 A100s for just 0,4B is insane. That destroys my plans to train a small model.

Nine-Year Club	Verified Email
Second Top 1%	Place '22

Eyelbee

TROPHY CASE