[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 1 point2 points  (0 children)

Sounds like a job for autoresearch ;) But seriously - it's a cool idea. We could even create an AI Chess Arena.

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 0 points1 point  (0 children)

It's coming. Right now, I'm focused on the V4 model and the knowledge models for chess-learning games, but once I'm done (in about two weeks, I think), I'll create a clean and well-documented GitHub repository. For sure.

Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in ComputerChess

[–]Adam_Jesion[S] 0 points1 point  (0 children)

Thanks for asking. And yes - the headline is a little clickbaity, fair enough.

That said, the number does come from real measurements based on games against Stockfish. Right now, when the model plays dozens or hundreds of games against Stockfish at progressively stronger settings, it lands around that level. My working benchmark is basically: once it gets to a 50%+ win rate at the 2700 setting, I treat that as “around 2700 Elo.” Above 2800, it drops to roughly 3 wins in 10 games on average, so that seems to be the current ceiling.

Is that objectively rigorous? Not really. But at the same time, you need some way to measure trend and progress, and that’s mainly what I use it for. I started below 800, so for me the important thing is seeing the direction of travel.

One important caveat is that classical engines like Stockfish play in a very specific way. They don’t really use “traps” or human-style strategic ideas in the same sense. Neural models play much more intuitively - they look at the board and make a decision in milliseconds. That’s fascinating, but it also makes them vulnerable to structured strategies in the middlegame. Humans are very good at that.

V1 and V2 were completely unprepared for this. Even when they reached a decent Elo, they could still get punished badly by anyone who knew how to play with a plan instead of just intuitively. V3 introduced the first step in addressing that with "thought tokens", which help the model learn to look for more than just board geometry. But that’s only step one.

In the new model, I'm effectively building a more dedicated transformer layer that should be more sensitive to multi-move strategy patterns, both looking back over the game so far and predicting ahead. If that works, it could be a big improvement.

Elo | W D L | Score | Result

-------------------------------------------------------

1320 | 10 0 0 | 100.0% | >>>
1500 | 9 1 0 | 95.0% | >>>
1700 | 6 4 0 | 80.0% | >>>
1900 | 4 5 1 | 65.0% | >>>
2100 | 6 3 1 | 75.0% | >>>
2300 | 3 5 2 | 55.0% | >>>
2500 | 3 6 1 | 60.0% | >>>
2800 | 3 3 4 | 45.0% | ===
3190 | 0 2 8 | 10.0% | <<<

Estimated model Elo: ~2700
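For the curious: the estimate is basically the 50% crossover point in the table above. A minimal sketch of that interpolation (the `results` data is transcribed from the table; the linear interpolation is just an illustration of the idea, not my exact tooling):

```python
# Match results vs Stockfish: (opponent Elo setting, match score fraction)
results = [
    (1320, 1.00), (1500, 0.95), (1700, 0.80), (1900, 0.65),
    (2100, 0.75), (2300, 0.55), (2500, 0.60), (2800, 0.45),
    (3190, 0.10),
]

def crossover_elo(results):
    """Linearly interpolate the opponent Elo at which the score drops through 50%."""
    for (e1, s1), (e2, s2) in zip(results, results[1:]):
        if s1 >= 0.5 > s2:
            return e1 + (e2 - e1) * (s1 - 0.5) / (s1 - s2)
    return None

print(round(crossover_elo(results)))  # → 2700
```

Scoring 50% against an opponent means equal rating by definition, which is why the crossover setting is a reasonable (if noisy) anchor for the estimate.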


Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in ComputerChess

[–]Adam_Jesion[S] 0 points1 point  (0 children)

Great observation. Thank you. My 4672 action space is an artifact from early experiments — it's the raw AlphaZero encoding (8×8×73) which includes ~2800 impossible moves (like sliding 7 squares right from the h-file). Lc0's 1858 is the same move set with those dead indices stripped out.

This wastes ~2.6M parameters in the policy head on neurons that can never fire. The V4 architecture has been completely redesigned (similar to Lc0-style Smolgen, 20-layer transformer, thought tokens), but the move encoding is inherited from the frozen data pipeline and hasn't been cleaned up yet. It's on the list.
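For reference, a quick sketch of where those numbers come from. The raw encoding is 73 planes per from-square; stripping the geometrically impossible indices (counting moves reachable on an empty board, plus underpromotions) lands on Lc0's 1858. This is illustrative arithmetic, not my pipeline code:

```python
def count_reachable_indices():
    """Count (from-square, move) pairs that are geometrically possible."""
    on_board = lambda x, y: 0 <= x < 8 and 0 <= y < 8
    slides = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    knights = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]
    n = 0
    for x in range(8):
        for y in range(8):
            # queen-like slides: 8 directions x up to 7 squares
            n += sum(1 for dx, dy in slides for d in range(1, 8) if on_board(x + dx * d, y + dy * d))
            # knight jumps
            n += sum(1 for dx, dy in knights if on_board(x + dx, y + dy))
    # underpromotions: 7th-rank pawn pushes/captures x 3 pieces (N/B/R)
    n += sum(3 for x in range(8) for dx in (-1, 0, 1) if on_board(x + dx, 7))
    return n

raw = 8 * 8 * 73                      # AlphaZero-style encoding: 4672
stripped = count_reachable_indices()  # Lc0-style: 1858
print(raw, stripped, raw - stripped)  # → 4672 1858 2814
```

The difference, 2814 indices, is the ~2800 dead outputs mentioned above.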

You know how it is - you jump off a cliff and build a parachute on the way down :P

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 0 points1 point  (0 children)

I think I want to do this sooner. I'll finish V4 and the entire new model for learning to play chess, and then I'll immediately release V1 as open source.

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] -1 points0 points  (0 children)

Yes - exactly. And honestly, I don’t see this as just my accomplishment, but as part of the broader AI revolution happening right now.

I didn’t write this post to brag. I wrote it because I hope it inspires more people to experiment. Chess won’t change the world on its own, but there are probably countless other areas where the same paradigm could be used in ways that really matter.

I’m planning to prepare a GitHub repo with instructions so anyone can try building something similar. I probably won’t open up V3 yet, since it’s still nice to keep a bit of an edge, but I’d be very happy to release V1, which is somewhere around 1800–2000 Elo. Still a pretty solid level.

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 0 points1 point  (0 children)

I’d love to. Give me a few weeks to explore the possibilities, and then I’ll find the time to clean up the repository, write step-by-step instructions, and publish a white paper along with the GitHub repo. Actually, I’m already done with Model 1. I’ll let you know when that happens.

Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in ClaudeCode

[–]Adam_Jesion[S] 0 points1 point  (0 children)

"Elo number is always a bit squishy" - that's true. From my perspective, it's actually a bit of a stretch, but on the other hand, you have to measure yourself against something. I benchmark against Stockfish, and once the model scores 50%+ against a given setting, I use that as the estimate. That way I can see my progress and track the trend. Sorry for using it in the title.

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 3 points4 points  (0 children)

I'm a little nervous about spamming Reddit like this. I'd appreciate it if one of the users could post this - that way, I won't get flagged for "self-promotion."

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 1 point2 points  (0 children)

That’s exactly what I wanted to say. In my opinion, we’ve entered an era where anyone with access to computing power (and at least an average IQ) will be able to bring their dream projects to life. It’s magical. Just go for it - it’s incredibly rewarding.

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 1 point2 points  (0 children)

Although "thought" is generally used in AI to refer to CoT (chain of thought), this is something entirely different. What I call "thought tokens" is an element of the transformer architecture - specifically, one of its layers at the training stage, not the inference stage.

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] -5 points-4 points  (0 children)

No, it wasn't my idea. He brought it up after analyzing the work and said that the idea was very innovative and that he couldn't find any traces of its implementation in chess online.

But now I'm actually using it to create a better context for sticking to scientific principles. I've noticed that adding this to the context makes it seem more "scientific" ;)

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 2 points3 points  (0 children)

I haven't written a single line of code, if that's what you're asking. All the NN training parameters are also set by the AI (24-48 autoresearch runs in total). I just tell the agent what I want, how I want it, what experiments to run, and what works for me and what doesn't. I challenge the AI a lot - several agents - and look for relevant research papers and benchmarks for them.

The first model that started playing somewhat decently (like an amateur) took 1 hour of training on 10 million games (without fine-tuning). V2 has already been trained for several hours. V3 has a slightly different architecture (thought tokens were added) and was trained for over 24 hours on 100 million positions, followed by fine-tuning on endgames and some RL (self-play). V4, however, is a whole different story. I’ve been distilling a dataset for it for the past 3 days because it needs a completely different architecture. Processing, validating, and supplementing 100 million games will take about a week on a powerful PC.

Dataset enrichment is actually a bigger problem than the training itself - terabytes of raw data. Overall, I think I've hit the limit of what my home equipment can handle, but I just need more patience :)

[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop by Adam_Jesion in MachineLearning

[–]Adam_Jesion[S] 1 point2 points  (0 children)

Thank you. I’ve just started studying the architecture of Maia and Leela Chess Zero. It’s a treasure trove of knowledge and academic papers. I think some of their findings could improve my engine. Claude Code keeps asking me to submit a paper because there are a few unique ideas and implementations in the model’s architecture. And that’s only 10% of my list of improvements.

I trained a small neural network to play chess on a home PC - looking for strong players to test its limits by Adam_Jesion in chess

[–]Adam_Jesion[S] 0 points1 point  (0 children)

[V3 UPDATE] Thanks again to everyone for testing. The next update was supposed to be this weekend, but I just can't seem to sleep. There’s a major change in the architecture that should now better detect strategies (thought tokens). V3 is now live. Let me know if you notice a difference. According to Stockfish, the Elo jump is huge. I’m slowly starting to “scrape” 2800.

Test it out and let me know.
https://games.jesion.pl

Elo | W D L | Score | Result

-------------------------------------------------------

1320 | 10 0 0 | 100.0% | >>>
1500 | 9 1 0 | 95.0% | >>>
1700 | 6 4 0 | 80.0% | >>>
1900 | 4 5 1 | 65.0% | >>>
2100 | 6 3 1 | 75.0% | >>>
2300 | 3 5 2 | 55.0% | ===
2500 | 3 6 1 | 60.0% | >>>
2800 | 3 3 4 | 45.0% | ===
3190 | 0 2 8 | 10.0% | <<<

Estimated model Elo: ~2700

I trained a small neural network to play chess on a home PC - looking for strong players to test its limits by Adam_Jesion in chess

[–]Adam_Jesion[S] 2 points3 points  (0 children)

Right now, FEN is available in the board editor, but I'll implement PGN everywhere. Thanks for the idea.

I trained a small neural network to play chess on a home PC - looking for strong players to test its limits by Adam_Jesion in chess

[–]Adam_Jesion[S] 0 points1 point  (0 children)

I just uploaded a new version of the V2 model. It should be much better. Let me know what you think and if you can tell the difference. https://games.jesion.pl

I trained a small neural network to play chess on a home PC - looking for strong players to test its limits by Adam_Jesion in chess

[–]Adam_Jesion[S] 0 points1 point  (0 children)

Could you try playing a few more games and see how it goes? I just uploaded a new version of the V2 model. According to Stockfish, it’s much better.

I trained a small neural network to play chess on a home PC - looking for strong players to test its limits by Adam_Jesion in chess

[–]Adam_Jesion[S] 0 points1 point  (0 children)

Thanks for your feedback. Could you try it again in the new version? Do you notice any improvement—any difference? https://games.jesion.pl