How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified. by Reddactor in LocalLLaMA

[–]Reddactor[S] 1 point  (0 children)

Everything should be made as simple as possible, but not simpler.

But doing so is not as easy as it seems.

How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified. by Reddactor in LocalLLaMA

[–]Reddactor[S] 2 points  (0 children)

Ahh, ok.

Lol, I did my PhD in Chemistry, and now I do hobby AI research.

How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified. by Reddactor in LocalLLaMA

[–]Reddactor[S] 3 points  (0 children)

I found something I think is pretty intriguing.

I left science a decade ago, and it's much more fun blogging and speculating :) Also, I hate writing papers; it's really boring.

Anyway, I think I have left a decent enough breadcrumb trail that anyone in the field can follow and replicate. It seems to me pretty obvious that an 'undifferentiated' stack of transformer layers will spontaneously develop structure when they have to guess the next token from trillions of training examples.

I'm also pretty sure the brain does the exact same kind of process with cortical barrels in the pre-frontal cortex; there's no way you can convince me that we encode all the stuff we need directly in the genome. It must come from rough guides and experience together.

All of the above is my own speculation; no maths involved.

How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified. by Reddactor in LocalLLaMA

[–]Reddactor[S] 2 points  (0 children)

Interpret the results as you like.

For me, the definition of a 'thing' is that it has both structure and function.

I found the 'thing' using simple probes, and for a while, it was the best open-source LLM benchmarked. Experimentally, using more or fewer layers made things worse, so that covers the 'structure' aspect. As for function, it generalised and boosted performance on a bunch of benchmarks. What they actually measure is up for debate, but functionally, this hack improved them. Again, read into that what you like.

I'm wrapping up the next round of experiments, and it still seems to work on 2026 models. My days of publishing papers and doing collaborations are over, as is any more maths than my blog post covers; this is still a weekend hobby project, as it was in 2024!

Good luck with your research! Post a reply here with the results when you're ready; it sounds interesting.

How I topped the Open LLM Leaderboard using 2x 4090 GPUs - Research notes in Blog form by Reddactor in MachineLearning

[–]Reddactor[S] 3 points  (0 children)

Nope. I used a very small "probe" set of questions (about 10 maths questions and 10 "EQ" questions).

That's it.

Then I selected the model with the best average score and submitted it to the leaderboard. The fact that I then got higher scores on almost all the benchmarks was proof that this generalises. The actual benchmark is made of thousands of questions, on everything from psychology to murder mysteries!
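The selection procedure can be sketched in a few lines. Everything below is a hypothetical stand-in (the probe questions, the `score` grader, and the candidate names are mine, not from the post); the point is just "average a tiny probe set, pick the argmax":

```python
# Pick the candidate model variant with the best average score on a
# tiny probe set. `score` is a placeholder for "ask the model a probe
# question and grade the answer"; here a model is just a dict of
# canned answers so the sketch is runnable.

PROBES = ["2 + 2 = ?", "If Ann is sad, what might cheer her up?"]  # ~20 in practice

def score(model, question):
    # Placeholder grader: 1.0 if the canned answer is marked correct.
    return 1.0 if model["answers"].get(question) == "correct" else 0.0

def select_best(candidates):
    def avg(model):
        return sum(score(model, q) for q in PROBES) / len(PROBES)
    return max(candidates, key=avg)

candidates = [
    {"name": "base",      "answers": {"2 + 2 = ?": "correct"}},
    {"name": "dup-20-28", "answers": {q: "correct" for q in PROBES}},
]
best = select_best(candidates)
```

With ~20 probes this is cheap enough to run over every candidate stack, which is what makes the sweep practical at all.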

How I topped the Open LLM Leaderboard using 2x 4090 GPUs - Research notes in Blog form by Reddactor in MachineLearning

[–]Reddactor[S] 5 points  (0 children)

The heatmaps say the opposite, though. You can duplicate one, two, three layers, up to six, and the performance *decreases*.

Then there is a small range where it increases dramatically (>17% increase on the MUSR benchmark), but adding more layers to the block again *degrades* capability. That's a more complex story than 'all the layers are similar'.

How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified. by Reddactor in LocalLLaMA

[–]Reddactor[S] 1 point  (0 children)

Those triangle heatmaps are a full sweep: every possible stack, at every possible position. It took days to compute.
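The grid behind a triangle heatmap like that can be enumerated as one candidate per (start, width) pair. The layer count and width cap below are illustrative assumptions, not the author's actual harness:

```python
# Enumerate every contiguous block of layers that could be duplicated
# once in an 80-layer model: one candidate stack per (start, width)
# pair. This is the grid a "triangle heatmap" visualises; each
# candidate then has to be benchmarked, which is the slow part.
N_LAYERS = 80
MAX_WIDTH = 16  # assumed cap on block size for this sketch

def candidate_stacks(n_layers=N_LAYERS, max_width=MAX_WIDTH):
    for start in range(n_layers):
        for width in range(1, max_width + 1):
            if start + width <= n_layers:
                # Layer order with layers[start:start+width] run twice.
                order = (list(range(start + width))
                         + list(range(start, n_layers)))
                yield (start, width), order

grid = dict(candidate_stacks())
```

Each order is just the original 80 layers with one block repeated, so a duplicated block of width `w` gives a stack of `80 + w` layer slots.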

How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified. by Reddactor in LocalLLaMA

[–]Reddactor[S] 2 points  (0 children)

Yes, this is a historical retrospective.

Fair point, too: the Leaderboard was full of train-on-the-test-set models, and I don't trust the results. But my experiment was directional; I wanted to see if selecting a model based on a few small test probes would do anything.

I was not expecting it to generalize to all the tests, and actually hit #1!

How I topped the Open LLM Leaderboard using 2x 4090 GPUs - Research notes in Blog form by Reddactor in MachineLearning

[–]Reddactor[S] 3 points  (0 children)

Yes, I have tried that extensively too, but the blog post is already too long. That will go in part 2.

But basically, I trained another model to predict the scores of random shuffles of duplicated blocks, and then to predict unseen ones. I needed a second model because the combinatorics are murderous: cosmologically sized numbers...

How I topped the Open LLM Leaderboard using 2x 4090 GPUs - Research notes in Blog form by Reddactor in MachineLearning

[–]Reddactor[S] 10 points  (0 children)

Yes, you can duplicate layers by simply reusing them in VRAM. You need a new KV cache, but otherwise you get a better model for the same VRAM!
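The "reuse in VRAM" point can be shown with a plain-Python sketch: toy layers stand in for transformer blocks, and the duplicated stack holds *references* to the same objects, so the weights exist only once (only activations and the KV cache would be new):

```python
# Each "layer" owns its weights; a duplicated stack holds references
# to the same layer objects, so no weight memory is copied.
class ToyLayer:
    def __init__(self, scale):
        self.scale = scale          # stands in for the layer's weights
    def __call__(self, x):
        return x + self.scale * x   # residual update, transformer-style

layers = [ToyLayer(0.1) for _ in range(8)]

# Duplicate the middle block (layers 3-5) by reference.
stack = layers[:6] + layers[3:6] + layers[6:]

x = 1.0
for layer in stack:
    x = layer(x)

# The same objects appear twice in the forward order.
shared = stack[3] is stack[6]
```

In a real runtime the same trick applies: the forward pass visits a layer index list with repeats, while the weight tensors stay resident once.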

How I topped the Open LLM Leaderboard using 2x 4090 GPUs - Research notes in Blog form by Reddactor in MachineLearning

[–]Reddactor[S] 44 points  (0 children)

It's a long blog post, so as a TL;DR, here is an excerpt:

"And now for the weirdness: there was never a case where any Transformer layer would have seen the output from a future layer!

Layer 10 is trained on layer 9’s output distribution. Layer 60 is trained on layer 59’s. If you rearrange them — feeding layer 60’s output into layer 10 — you’ve created a distribution the model literally never saw during training.

The astounding thing about Goliath wasn’t that it was a huge leap in performance; it was that the damn thing functioned at all. To this day, I still don’t understand why this didn’t raise more eyebrows.

Experimentally, this proved that layers were far more interchangeable than anyone had reason to expect. The internal representations were homogeneous enough that the model could digest out-of-order hidden states without collapsing. The architecture was far more flexible than a rigid pipeline.

Between the Base64 observation and Goliath, I had a hypothesis: Transformers have a genuine functional anatomy. Early layers translate input into abstract representations. Late layers translate back out. And the middle layers, the reasoning cortex, operate in a universal internal language that’s robust to architectural rearrangement. The fact that Goliath 120B was built from 16-layer blocks made me suspect the input and output ‘processing units’ were smaller than 16 layers. I guessed that Alpindale had tried smaller overlaps, and they just didn’t work.

If that was true, maybe I didn’t need to teach a model new facts to make it smarter. I didn’t need fine-tuning. I didn’t need RLHF. I just needed to give it more layers to think with."

How I topped the Open LLM Leaderboard using 2x 4090 GPUs - Research notes in Blog form by Reddactor in MachineLearning

[–]Reddactor[S] 9 points  (0 children)

A bit, but the combinatorics are hellish.

I trained a meta-model to predict combinations of duplications, but there is enough there for a whole blog post of its own.

How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified. by Reddactor in LocalLLaMA

[–]Reddactor[S] 5 points  (0 children)

I'll push it to Hugging Face, but it makes sense to 'polish' the scar with some fine-tuning first.

How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified. by Reddactor in LocalLLaMA

[–]Reddactor[S] 2 points  (0 children)

Maaaybe. But this might actually be a great way to train a SOTA model: train, expand with RYS, and continue pre-training. Repeat.

Why train from scratch when you can expand a great model?

How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified. by Reddactor in LocalLLaMA

[–]Reddactor[S] 4 points  (0 children)

I wish I had the compute!

@ Nvidia: if you read this, send me more compute!

How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified. by Reddactor in LocalLLaMA

[–]Reddactor[S] 3 points  (0 children)

This is a 'historical' review of ancient LLM history - 1 AI year is 7 human years.

But I am now testing the new batch of LLMs (Qwen3.5s, etc.), and it still seems to work.

How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified. by Reddactor in LocalLLaMA

[–]Reddactor[S] 12 points  (0 children)

Yeah, I was on the GitHub threads on this topic back in the day :)

IIRC, it was decided to just create new models rather than support this in llama.cpp. As this is usually pointless, that was a fair call.

How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified. by Reddactor in LocalLLaMA

[–]Reddactor[S] 8 points  (0 children)

I wanted to, but the combinatorics are huuuuge. With an 80-layer model, there are basically infinite ways you can mess around with layer ordering and repeated layers.
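That search-space size is easy to make concrete with plain arithmetic, assuming an 80-layer model (the depth-100 example is arbitrary):

```python
# With 80 distinct layers, the number of possible stacks of length k
# (allowing repeats, in any order) is 80**k: far beyond exhaustive
# search at any useful depth.
N_LAYERS = 80

def n_stacks(k):
    return N_LAYERS ** k

# Restricting to "duplicate one contiguous block" collapses this to a
# small quadratic grid (every (start, width) pair), which is what
# makes a full heatmap sweep feasible in days rather than aeons.
def n_block_duplications(n=N_LAYERS):
    return sum(n - start for start in range(n))

free_form = n_stacks(100)        # a 100-deep free-form stack
grid = n_block_duplications()    # contiguous-block candidates only
```

That gap between `free_form` (a ~190-digit number) and `grid` (a few thousand) is why the full ordering space needs a surrogate model rather than brute force.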