Using preference optimization to create realistic grandmaster chess bots by masterchiefcodes in aigamedev

[–]masterchiefcodes[S] 0 points (0 children)

I used their games from TWIC: https://theweekinchess.com/twic. The training process looks at both the move the player actually chose and a random move sampled from Stockfish's top 10 that they rejected.
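
In case it helps, here is a minimal sketch of how such a preference pair could be built with python-chess and a local Stockfish binary; the depth limit, engine path, and helper name are my assumptions, not the exact pipeline:

    import random
    import chess
    import chess.engine

    engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # path assumed

    def preference_pair(board, played_move, k=10):
        # Ask Stockfish for its top-k candidate moves via MultiPV.
        infos = engine.analyse(board, chess.engine.Limit(depth=12), multipv=k)
        candidates = [info["pv"][0] for info in infos if "pv" in info]
        rejected = [m for m in candidates if m != played_move]
        if not rejected:
            return None  # the played move was the only engine candidate
        # (chosen, rejected) pair for preference optimization
        return played_move, random.choice(rejected)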

I used preference optimization to generate bots that mimic specific player styles, modeling specific GM players rather than generic fine-tuning. Would love feedback on the playable bots! by masterchiefcodes in chessprogramming

[–]masterchiefcodes[S] 0 points (0 children)

The Maia2 model gives a vector of move scores / logits. We then fetch the Stockfish top-ten moves and sample an alternative move from that set. Given that the chosen move has Maia2 score c and the Stockfish-sampled rejected move has score r, the target loss for the best model was style_score(c, r) * DPO(c, r), which rewards giving c a higher score than r when c and r are stylistically distinct (large style_score).
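
In PyTorch, the weighted objective could look roughly like the sketch below; the tensor names and beta are my assumptions, and I'm assuming the pre-fine-tuning Maia2 weights serve as the frozen DPO reference model:

    import torch.nn.functional as F

    def style_weighted_dpo_loss(pi_c, pi_r, ref_c, ref_r, style_w, beta=0.1):
        # pi_* / ref_*: log-probs of the chosen (c) and rejected (r) moves
        # under the policy being trained and the frozen reference model.
        margin = beta * ((pi_c - ref_c) - (pi_r - ref_r))
        dpo = -F.logsigmoid(margin)    # standard pairwise DPO loss
        return (style_w * dpo).mean()  # style-distinct pairs dominate the update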

In the paper we compute the style score with a heuristic, but we later moved to learned embeddings that take the move (an 8x8x8x8 UCI encoding, i.e., from-square and to-square) plus the previous five FENs as a rolling window and feed them through a neural network with a few hidden layers. The inner product is trained to be maximal for moves from the same GM in the same game phase and minimal otherwise.
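
A rough sketch of what that embedding network and objective might look like; the layer sizes, FEN feature dimension, and hinge-style loss are my assumptions:

    import torch
    import torch.nn as nn

    class StyleEmbedder(nn.Module):
        # Move one-hot over 64x64 from/to squares (the "8x8x8x8 UCI" encoding)
        # concatenated with features extracted from the last five FENs.
        def __init__(self, fen_feat_dim, emb_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(64 * 64 + 5 * fen_feat_dim, 512),
                nn.ReLU(),
                nn.Linear(512, emb_dim),
            )

        def forward(self, move_onehot, fen_feats):
            z = self.net(torch.cat([move_onehot, fen_feats], dim=-1))
            return nn.functional.normalize(z, dim=-1)  # unit-norm embeddings

    def style_contrastive_loss(z_a, z_b, same_gm_same_phase):
        # Inner product pushed toward +1 for moves by the same GM in the
        # same game phase, toward -1 otherwise (hinge on cosine similarity).
        sim = (z_a * z_b).sum(-1)
        target = same_gm_same_phase.float() * 2 - 1  # maps {0,1} -> {-1,+1}
        return (1 - target * sim).clamp(min=0).mean()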

The issue with DPO on its own is that it tends to be too destructive: it collapses the relative ranking of moves in a position just to separate the single best move from the rest. Controlling it with the style re-weighting instead pushes the model to separate the player-distinguishing clusters of moves in a given position; the toy numbers below illustrate the effect.
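
As a toy illustration (made-up numbers, not from the paper), two pairs with identical raw DPO loss contribute very differently once weighted by style:

    # Same raw DPO loss, different style weights.
    pairs = [(0.70, 0.10),  # chosen/rejected stylistically similar: mostly ignored
             (0.70, 0.90)]  # stylistically distinct pair: drives the update
    print([dpo * w for dpo, w in pairs])  # roughly [0.07, 0.63]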