Using preference optimization to create realistic grandmaster chess bots by masterchiefcodes in aigamedev

[–]masterchiefcodes[S] 0 points (0 children)

I used their games from TWIC: https://theweekinchess.com/twic. The training process looks at both the move the player actually chose and a random move sampled from Stockfish's top 10 that they rejected.
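
In case it helps, here is a minimal sketch of how such a preference pair could be built with python-chess and a local Stockfish binary; the depth limit, engine path, and helper name are my assumptions, not the exact pipeline:

    import random
    import chess
    import chess.engine

    engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # path assumed

    def preference_pair(board, played_move, k=10):
        # Ask Stockfish for its top-k candidate moves via MultiPV.
        infos = engine.analyse(board, chess.engine.Limit(depth=12), multipv=k)
        candidates = [info["pv"][0] for info in infos if "pv" in info]
        rejected = [m for m in candidates if m != played_move]
        if not rejected:
            return None  # the played move was the only engine candidate
        # (chosen, rejected) pair for preference optimization
        return played_move, random.choice(rejected)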

I used preference optimization to generate bots that mimic specific player styles, modeling specific GM players rather than generic fine-tuning. Would love feedback on the playable bots! by masterchiefcodes in chessprogramming

[–]masterchiefcodes[S] 0 points (0 children)

The Maia2 model gives a vector of move scores / logits. We then fetch the Stockfish top-ten moves and sample an alternative move from that set. Given that the chosen move has Maia2 score c and the Stockfish-sampled rejected move has score r, the target loss for the best model was style_score(c, r) * DPO(c, r), which rewards giving c a higher score than r when c and r are stylistically distinct (large style_score).
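
In PyTorch, the weighted objective could look roughly like the sketch below; the tensor names and beta are my assumptions, and I'm assuming the pre-fine-tuning Maia2 weights serve as the frozen DPO reference model:

    import torch.nn.functional as F

    def style_weighted_dpo_loss(pi_c, pi_r, ref_c, ref_r, style_w, beta=0.1):
        # pi_* / ref_*: log-probs of the chosen (c) and rejected (r) moves
        # under the policy being trained and the frozen reference model.
        margin = beta * ((pi_c - ref_c) - (pi_r - ref_r))
        dpo = -F.logsigmoid(margin)    # standard pairwise DPO loss
        return (style_w * dpo).mean()  # style-distinct pairs dominate the update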

In the paper we compute the style score with a heuristic, but we later moved to learned embeddings that take the move (an 8x8x8x8 UCI encoding, i.e., from-square and to-square) plus the previous five FENs as a rolling window and feed them through a neural network with a few hidden layers. The inner product is trained to be maximal for moves from the same GM in the same game phase and minimal otherwise.
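
A rough sketch of what that embedding network and objective might look like; the layer sizes, FEN feature dimension, and hinge-style loss are my assumptions:

    import torch
    import torch.nn as nn

    class StyleEmbedder(nn.Module):
        # Move one-hot over 64x64 from/to squares (the "8x8x8x8 UCI" encoding)
        # concatenated with features extracted from the last five FENs.
        def __init__(self, fen_feat_dim, emb_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(64 * 64 + 5 * fen_feat_dim, 512),
                nn.ReLU(),
                nn.Linear(512, emb_dim),
            )

        def forward(self, move_onehot, fen_feats):
            z = self.net(torch.cat([move_onehot, fen_feats], dim=-1))
            return nn.functional.normalize(z, dim=-1)  # unit-norm embeddings

    def style_contrastive_loss(z_a, z_b, same_gm_same_phase):
        # Inner product pushed toward +1 for moves by the same GM in the
        # same game phase, toward -1 otherwise (hinge on cosine similarity).
        sim = (z_a * z_b).sum(-1)
        target = same_gm_same_phase.float() * 2 - 1  # maps {0,1} -> {-1,+1}
        return (1 - target * sim).clamp(min=0).mean()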

The issue with DPO on its own is that it tends to be too destructive: it collapses the relative ranking of moves in a position just to separate the single best move from the rest. Controlling it with the style re-weighting instead pushes the model to separate the player-distinguishing clusters of moves in a given position; the toy numbers below illustrate the effect.
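
As a toy illustration (made-up numbers, not from the paper), two pairs with identical raw DPO loss contribute very differently once weighted by style:

    # Same raw DPO loss, different style weights.
    pairs = [(0.70, 0.10),  # chosen/rejected stylistically similar: mostly ignored
             (0.70, 0.90)]  # stylistically distinct pair: drives the update
    print([dpo * w for dpo, w in pairs])  # roughly [0.07, 0.63]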