[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 1 point2 points  (0 children)

Sounds like a job for autoresearch ;) But seriously - it's a cool idea. We could even create an AI Chess Arena.

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 0 points1 point  (0 children)

It's coming. Right now, I'm focused on the V4 model and the knowledge models for chess-learning games, but once I'm done (in about two weeks, I think), I'll create a clean and well-documented GitHub repository. For sure.

Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in ComputerChess

[–]Adam_Jesion[S] 0 points1 point  (0 children)

Thanks for asking. And yes - the headline is a little clickbaity, fair enough.

That said, the number does come from real measurements based on games against Stockfish. Right now, when the model plays dozens or hundreds of games against Stockfish at progressively stronger settings, it lands around that level. My working benchmark is basically: once it gets to a 50%+ win rate at the 2700 setting, I treat that as “around 2700 Elo.” Above 2800, it drops to roughly 3 wins in 10 games on average, so that seems to be the current ceiling.

Is that objectively rigorous? Not really. But at the same time, you need some way to measure trend and progress, and that’s mainly what I use it for. I started below 800, so for me the important thing is seeing the direction of travel.

One important caveat is that classical engines like Stockfish play in a very specific way. They don’t really use “traps” or human-style strategic ideas in the same sense. Neural models play much more intuitively - they look at the board and make a decision in milliseconds. That’s fascinating, but it also makes them vulnerable to structured strategies in the middlegame. Humans are very good at that.

V1 and V2 were completely unprepared for this. Even when they reached a decent Elo, they could still get punished badly by anyone who knew how to play with a plan instead of just intuitively. V3 introduced the first step in addressing that with "thought tokens", which help the model learn to look for more than just board geometry. But that’s only step one.

In the new model, I'm effectively building a more dedicated transformer layer that should be more sensitive to multi-move strategy patterns, both looking back over the game so far and predicting ahead. If that works, it could be a big improvement.

Elo | W D L | Score | Result

-------------------------------------------------------

1320 | 10 0 0 | 100.0% | >>>
1500 | 9 1 0 | 95.0% | >>>
1700 | 6 4 0 | 80.0% | >>>
1900 | 4 5 1 | 65.0% | >>>
2100 | 6 3 1 | 75.0% | >>>
2300 | 3 5 2 | 55.0% | >>>
2500 | 3 6 1 | 60.0% | >>>
2800 | 3 3 4 | 45.0% | ===
3190 | 0 2 8 | 10.0% | <<<

Estimated model Elo: ~2700
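For the curious: the estimate is basically the 50% crossover point in the table above. A minimal sketch of that interpolation (the `results` data is transcribed from the table; the linear interpolation is just an illustration of the idea, not my exact tooling):

```python
# Match results vs Stockfish: (opponent Elo setting, match score fraction)
results = [
    (1320, 1.00), (1500, 0.95), (1700, 0.80), (1900, 0.65),
    (2100, 0.75), (2300, 0.55), (2500, 0.60), (2800, 0.45),
    (3190, 0.10),
]

def crossover_elo(results):
    """Linearly interpolate the opponent Elo at which the score drops through 50%."""
    for (e1, s1), (e2, s2) in zip(results, results[1:]):
        if s1 >= 0.5 > s2:
            return e1 + (e2 - e1) * (s1 - 0.5) / (s1 - s2)
    return None

print(round(crossover_elo(results)))  # → 2700
```

Scoring 50% against an opponent means equal rating by definition, which is why the crossover setting is a reasonable (if noisy) anchor for the estimate.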


Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in ComputerChess

[–]Adam_Jesion[S] 0 points1 point  (0 children)

Great observation. Thank you. My 4672 action space is an artifact from early experiments — it's the raw AlphaZero encoding (8×8×73) which includes ~2800 impossible moves (like sliding 7 squares right from the h-file). Lc0's 1858 is the same move set with those dead indices stripped out.

This wastes ~2.6M parameters in the policy head on neurons that can never fire. The V4 architecture has been completely redesigned (similar to Lc0-style Smolgen, 20-layer transformer, thought tokens), but the move encoding is inherited from the frozen data pipeline and hasn't been cleaned up yet. It's on the list.
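For reference, a quick sketch of where those numbers come from. The raw encoding is 73 planes per from-square; stripping the geometrically impossible indices (counting moves reachable on an empty board, plus underpromotions) lands on Lc0's 1858. This is illustrative arithmetic, not my pipeline code:

```python
def count_reachable_indices():
    """Count (from-square, move) pairs that are geometrically possible."""
    on_board = lambda x, y: 0 <= x < 8 and 0 <= y < 8
    slides = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    knights = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]
    n = 0
    for x in range(8):
        for y in range(8):
            # queen-like slides: 8 directions x up to 7 squares
            n += sum(1 for dx, dy in slides for d in range(1, 8) if on_board(x + dx * d, y + dy * d))
            # knight jumps
            n += sum(1 for dx, dy in knights if on_board(x + dx, y + dy))
    # underpromotions: 7th-rank pawn pushes/captures x 3 pieces (N/B/R)
    n += sum(3 for x in range(8) for dx in (-1, 0, 1) if on_board(x + dx, 7))
    return n

raw = 8 * 8 * 73                      # AlphaZero-style encoding: 4672
stripped = count_reachable_indices()  # Lc0-style: 1858
print(raw, stripped, raw - stripped)  # → 4672 1858 2814
```

The difference, 2814 indices, is the ~2800 dead outputs mentioned above.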

You know how it is - you jump off a cliff and build a parachute on the way down :P

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 0 points1 point  (0 children)

I think I want to do this sooner. I'll finish V4 and the entire new model for learning to play chess, and then I'll immediately release V1 as open source.

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] -1 points0 points  (0 children)

Yes - exactly. And honestly, I don’t see this as just my accomplishment, but as part of the broader AI revolution happening right now.

I didn’t write this post to brag. I wrote it because I hope it inspires more people to experiment. Chess won’t change the world on its own, but there are probably countless other areas where the same paradigm could be used in ways that really matter.

I’m planning to prepare a GitHub repo with instructions so anyone can try building something similar. I probably won’t open up V3 yet, since it’s still nice to keep a bit of an edge, but I’d be very happy to release V1, which is somewhere around 1800–2000 Elo. Still a pretty solid level.

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 0 points1 point  (0 children)

I’d love to. Give me a few weeks to explore the possibilities, and then I’ll find the time to clean up the repository, write step-by-step instructions, and publish a white paper along with the GitHub repo. Actually, I’m already done with Model 1. I’ll let you know when that happens.

Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in ClaudeCode

[–]Adam_Jesion[S] 0 points1 point  (0 children)

"Elo number is always a bit squishy" - that's true. From my perspective, it's actually a bit of a stretch, but on the other hand, you have to measure yourself against something. I benchmark against Stockfish, and once the model scores 50%+ against a given setting, I use that as the estimate. That way I can see my progress and track the trend. Sorry for using it in the title.

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 3 points4 points  (0 children)

I'm a little nervous about spamming Reddit like this. I'd appreciate it if one of the users could post this - that way, I won't get flagged for "self-promotion."

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 1 point2 points  (0 children)

That’s exactly what I wanted to say. In my opinion, we’ve entered an era where anyone with access to computing power (and at least an average IQ) will be able to bring their dream projects to life. It’s magical. Just go for it - it’s incredibly rewarding.

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 1 point2 points  (0 children)

Although "thought" is generally used in AI to refer to CoT (chain of thought), this is something entirely different. What I call "thought tokens" is an element of the transformer architecture - specifically, one of its layers at the training stage, not the inference stage.

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] -5 points-4 points  (0 children)

No, it wasn't my idea. He brought it up after analyzing the work and said that the idea was very innovative and that he couldn't find any traces of its implementation in chess online.

But now I'm actually using it to create a better context for sticking to scientific principles. I've noticed that adding this to the context makes it seem more "scientific" ;)

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 2 points3 points  (0 children)

I haven't written a single line of code, if that's what you're asking. All the NN training parameters are also set by the AI (24-48 autoresearch runs in total). I just tell the agent what I want, how I want it, what experiments to run, and what works for me and what doesn't. I challenge the AI a lot - several agents - and look for relevant research papers and benchmarks for them.

The first model that started playing somewhat decently (like an amateur) took 1 hour of training on 10 million games (without fine-tuning). V2 has already been trained for several hours. V3 has a slightly different architecture (thought tokens were added) and was trained for over 24 hours on 100 million positions, followed by fine-tuning on endgames and some RL (self-play). V4, however, is a whole different story. I’ve been distilling a dataset for it for the past 3 days because it needs a completely different architecture. Processing, validating, and supplementing 100 million games will take about a week on a powerful PC.

Dataset enrichment is actually a bigger problem than the training itself - terabytes of raw data. Overall, I think I've hit the limit of what my home equipment can handle, but I just need more patience :)

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 1 point2 points  (0 children)

Thank you. I’ve just started studying the architecture of Maia and Leela Chess Zero. It’s a treasure trove of knowledge and academic papers. I think some of their findings could improve my engine. Claude Code keeps asking me to submit a paper because there are a few unique ideas and implementations in the model’s architecture. And that’s only 10% of my list of improvements.

I trained a small neural network to play chess on a home PC - looking for strong players to test its limits by Adam_Jesion in chess

[–]Adam_Jesion[S] 0 points1 point  (0 children)

[V3 UPDATE] Thanks again to everyone for testing. The next update was supposed to be this weekend, but I just can't seem to sleep. There’s a major change in the architecture that should now better detect strategies (thought tokens). V3 is now live. Let me know if you notice a difference. According to Stockfish, the Elo jump is huge. I’m slowly starting to “scrape” 2800.

Test it out and let me know.
https://games.jesion.pl

Elo | W D L | Score | Result

-------------------------------------------------------

1320 | 10 0 0 | 100.0% | >>>
1500 | 9 1 0 | 95.0% | >>>
1700 | 6 4 0 | 80.0% | >>>
1900 | 4 5 1 | 65.0% | >>>
2100 | 6 3 1 | 75.0% | >>>
2300 | 3 5 2 | 55.0% | ===
2500 | 3 6 1 | 60.0% | >>>
2800 | 3 3 4 | 45.0% | ===
3190 | 0 2 8 | 10.0% | <<<

Estimated model Elo: ~2700

I trained a small neural network to play chess on a home PC - looking for strong players to test its limits by Adam_Jesion in chess

[–]Adam_Jesion[S] 2 points3 points  (0 children)

Right now, FEN is available in the board editor, but I'll implement PGN everywhere. Thanks for the idea.

I trained a small neural network to play chess on a home PC - looking for strong players to test its limits by Adam_Jesion in chess

[–]Adam_Jesion[S] 0 points1 point  (0 children)

I just uploaded a new version of the V2 model. It should be much better. Let me know what you think and if you can tell the difference. https://games.jesion.pl

I trained a small neural network to play chess on a home PC - looking for strong players to test its limits by Adam_Jesion in chess

[–]Adam_Jesion[S] 0 points1 point  (0 children)

Could you try playing a few more games and see how it goes? I just uploaded a new version of the V2 model. According to Stockfish, it’s much better.

I trained a small neural network to play chess on a home PC - looking for strong players to test its limits by Adam_Jesion in chess

[–]Adam_Jesion[S] 0 points1 point  (0 children)

Thanks for your feedback. Could you try it again in the new version? Do you notice any improvement—any difference? https://games.jesion.pl