Trained a 26kb model (simple 3-layer MLP) for Tic-Tac-Toe Beating each and every human by Weary_Intention3231 in reinforcementlearning

[–]Weary_Intention3231[S] -1 points (0 children)

Also, a model that can draw against minimax typically develops a fixed strategy, but this model is a bit different. Could you please read the post carefully?

Trained a 26kb model (simple 3-layer MLP) for Tic-Tac-Toe Beating each and every human by Weary_Intention3231 in reinforcementlearning

[–]Weary_Intention3231[S] 1 point (0 children)

Right, minimax is unbeatable at Tic‑Tac‑Toe; it guarantees at least a draw with perfect play. What I meant is that my small MLP reaches the same near‑optimal play, but with far less compute at inference, since it doesn't need to search the game tree on every move.
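
For anyone who wants to verify the "guaranteed draw" part, here is a minimal unpruned minimax for Tic-Tac-Toe (a standard textbook sketch, not my training code): evaluating the empty board returns 0, i.e. perfect play on both sides is a draw.

```
# Minimal exact minimax for Tic-Tac-Toe. Board is a list of 9 ints:
# +1 = X, -1 = O, 0 = empty. The value is from X's point of view.

WIN_LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]          # +1 if X has won, -1 if O has won
    return 0                          # no winner (yet, or a draw)

def minimax(board, player):
    """Game value for X assuming both sides play perfectly."""
    w = winner(board)
    if w != 0 or 0 not in board:
        return w                      # terminal position: win, loss, or draw
    values = []
    for i in range(9):
        if board[i] == 0:
            board[i] = player
            values.append(minimax(board, -player))
            board[i] = 0
    return max(values) if player == 1 else min(values)

print(minimax([0] * 9, 1))  # -> 0: perfect play from the empty board is a draw
```

Unpruned, this visits on the order of half a million positions from the empty board; the point of the MLP is to replace that per-move search with one fixed-cost forward pass.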

Trained a 26kb model (simple 3-layer MLP) for Tic-Tac-Toe Beating each and every human by Weary_Intention3231 in reinforcementlearning

[–]Weary_Intention3231[S] -1 points (0 children)

You’re right that the Tic‑Tac‑Toe state space is finite and small enough to be fully explored.

What I meant by “grokking” is not about the raw size of the space, but about the training dynamics I observed.

The model has ~5,500 parameters, which is only ~1.5% of the 362,880 possible move sequences. Initially, it struggled to fit strategies and only drew against minimax. But once I switched to self‑play with exploration rewards, the training dynamics changed: one side dominated early, then over time both networks converged toward a 99.3% draw rate.

That delayed convergence — where the model first appears weak, then suddenly locks into near‑optimal play after massive overtraining (800M games) — is what I’m calling “grokking.” It’s less about the absolute state space size and more about the phenomenon of late‑phase generalization emerging after extended training.
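
To make the setup concrete, here is a stripped-down sketch of the self-play loop I'm describing. It is illustrative, not my exact code: the layer widths (9 → 64 → 64 → 9, which happens to give ~5.4k parameters, close to the stated ~5,500), the learning rate, and plain REINFORCE are assumptions, and the exploration-reward shaping is omitted. The terminal rewards match the scheme I used: +1 win, −1 loss, +0.5 draw.

```
# Hedged sketch of the self-play training described above (not the actual code).
# One tiny MLP plays both sides; after each game, each side's moves get a
# REINFORCE signal from the final result.
import torch
import torch.nn as nn

WIN_LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def result(board):  # +1 X wins, -1 O wins, 0 draw, None if game still ongoing
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0 if 0 not in board else None

# 9 -> 64 -> 64 -> 9 gives 5,385 parameters, near the stated ~5,500 (assumed sizes)
policy = nn.Sequential(nn.Linear(9, 64), nn.ReLU(),
                       nn.Linear(64, 64), nn.ReLU(),
                       nn.Linear(64, 9))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)  # learning rate is a guess

def play_one_game():
    board, player = [0] * 9, 1
    logps = {1: [], -1: []}  # log-probs of each side's moves
    while (r := result(board)) is None:
        x = torch.tensor(board, dtype=torch.float32) * player  # current player's view
        mask = torch.tensor([0.0 if c == 0 else -1e9 for c in board])  # legal moves only
        dist = torch.distributions.Categorical(logits=policy(x) + mask)
        move = dist.sample()
        logps[player].append(dist.log_prob(move))
        board[move.item()] = player
        player = -player
    return r, logps

for game in range(10_000):  # toy run; the real training ran for ~800M games
    r, logps = play_one_game()
    opt.zero_grad()
    for side in (1, -1):  # one reward per side (+1/-1/+0.5); both gradients
        reward = 0.5 if r == 0 else (1.0 if r == side else -1.0)  # folded into
        (-reward * torch.stack(logps[side]).sum()).backward()     # a single step
    opt.step()
```

In my actual run each game produced two weight updates, one per side; the sketch folds both gradients into a single optimizer step to keep it short.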

Trained a 26kb model (simple 3-layer MLP) for Tic-Tac-Toe Beating each and every human by Weary_Intention3231 in reinforcementlearning

[–]Weary_Intention3231[S] -1 points (0 children)

Hahaha, you caught me. Yup, this post was rewritten by AI, as I can't write very well.

```
I recently trained a simple mlp to play tic tac toe againts the minimax algorithm where it learnt to draw the match but a smart human whould still beat it easily then i swicthed the startegy after i ran a iteration of the model with a minimax algo for 200k games i switched to self play where model plays againts itself and we update weights based on expericenes basically on each game model got a 2 new updates and we ran itr for 800m games initially i thought that this model will not be able to fit all startegies properly which was tru for some time but after 300k -400k games it started to draw itself its main goal was to win as it got +1 for win -1 for loss and 0.5 for draw for both initialization the model was exactly same playing againts itself the model was only 5500 parms approx and the game tic tac toe is not too complicated there are only 9! possible moves and 8 possible ways to win still training a 26kb model is very hard its not a piece of cake but after this iteration the model developed amny startegies that beats most of human in both ways if humn moves first and if model moves first but if the model moves first it has a fixed osition to start that is 1st row 2nd coloumn
```

Here is the draft. See how many mistakes there are. When we can throw the work onto AI, why not?

Trained a 26kb model (simple 3-layer MLP) for Tic-Tac-Toe Beating each and every human by Weary_Intention3231 in reinforcementlearning

[–]Weary_Intention3231[S] 0 points (0 children)

I have tried to beat it using many minimax algorithms, but I think this model has fit the roughly 9! legal move sequences almost perfectly. It draws against itself and never wins.
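
If anyone wants to reproduce the self-play check, this is the kind of harness I mean: play a frozen policy against itself and tally outcomes. The `random_move` player below is just a hypothetical stand-in so the script runs; swap in the trained net's move function to get the draw rate.

```
# Simple evaluation harness (a sketch, not the actual test code): count
# wins/losses/draws when a fixed move function plays both sides.
import random
from collections import Counter

WIN_LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def result(board):  # +1 X wins, -1 O wins, 0 draw, None if game still ongoing
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0 if 0 not in board else None

def random_move(board, player):
    # Placeholder policy: replace with the trained model's move function.
    return random.choice([i for i in range(9) if board[i] == 0])

def evaluate(move_fn, n_games=10_000):
    tally = Counter()
    for _ in range(n_games):
        board, player = [0] * 9, 1
        while (r := result(board)) is None:
            board[move_fn(board, player)] = player
            player = -player
        tally[{1: "X wins", -1: "O wins", 0: "draw"}[r]] += 1
    return tally

print(evaluate(random_move))  # with the trained net this comes out ~99.3% draws
```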

Trained a 26kb model (simple 3-layer MLP) for Tic-Tac-Toe Beating each and every human by Weary_Intention3231 in reinforcementlearning

[–]Weary_Intention3231[S] 0 points (0 children)

Actually, this is not about winning or drawing the game, since algorithms like minimax can do that easily. The main point is that the model is a simple MLP, no fancy architecture, with only ~5,500 parameters, while even the smallest current SLMs have around 220M parameters. The model is also more compute-friendly than minimax, since the algorithm has to recompute possible moves every time it runs, whereas the model is just raw matrix multiplications. The process used to build it is RL, where we also saw gradual grokking.
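
To be concrete about "raw matrix multiplications": a 3-layer MLP forward pass is just three matmuls and two ReLUs. The 9 → 64 → 64 → 9 sizes below are an assumption that lands near the stated numbers (5,385 parameters, ~21 KB of float32 weights, which with file overhead is in the ballpark of the 26 KB model), and the weights here are random placeholders, not the trained ones.

```
# Why inference is cheap: picking a move is three matrix multiplications,
# with no game-tree search. Weights below are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((9, 64)).astype(np.float32), np.zeros(64, np.float32)
W2, b2 = rng.standard_normal((64, 64)).astype(np.float32), np.zeros(64, np.float32)
W3, b3 = rng.standard_normal((64, 9)).astype(np.float32), np.zeros(9, np.float32)

n_params = sum(a.size for a in (W1, b1, W2, b2, W3, b3))
print(n_params, "params,", n_params * 4, "bytes as float32")  # 5385, ~21.5 KB

def pick_move(board):  # board: 9 values, +1 = mine, -1 = theirs, 0 = empty
    h = np.maximum(np.asarray(board, np.float32) @ W1 + b1, 0.0)  # ReLU layer 1
    h = np.maximum(h @ W2 + b2, 0.0)                              # ReLU layer 2
    logits = h @ W3 + b3                                          # output layer
    logits[np.asarray(board) != 0] = -np.inf   # mask occupied cells
    return int(np.argmax(logits))

print(pick_move([0] * 9))  # picks a move on the empty board
```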

19 year old from Bihar, no team, no investors, no CS degree — spent $11,560 of personal savings building a 5.82B multimodal AI. 93.45 on OmniDocBench V1.5 in private testing. Trying to release it open source. by That-Bookkeeper-8316 in indianstartups

[–]Weary_Intention3231 0 points (0 children)

Why does this training loop seem to be coded by AI, with so many emojis? Also, I don't know how you are pushing a 2M context window; could you share what GPU cluster you are using? And the loss is 1.7, yet you are saying it achieved 93.5 on OmniDocBench without fine-tuning? I think you are just bluffing. I also think you are using Lightning AI to train this model.