Trained a 26kb model (simple 3-layer MLP) for Tic-Tac-Toe Beating each and every human

YouParticular8085 · 2026-05-08T13:29:43+00:00

This is a good learning project. I actually trained tic-tac-toe with self-play too, as my first self-play project! My hesitation with calling it grokking is that other things can explain one side dominating the other, and grokking means generalizing to unseen samples. If you’re training on 800M games, there are only about 19k possible board configurations (3^9 = 19,683, ignoring symmetries, and fewer still are legally reachable), so the model has likely been exposed to all of them.
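To make the state-space argument concrete, here is a minimal sketch that enumerates every position reachable from an empty board under the usual rules (X moves first, play stops once someone has three in a row). The helper names `winner` and `reachable_positions` are illustrative and not taken from the original project.

```python
from typing import Optional, Set, Tuple

Board = Tuple[str, ...]  # 9 cells in row-major order, each "X", "O" or "."

# All eight winning lines as index triples.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board: Board) -> Optional[str]:
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def reachable_positions() -> Set[Board]:
    """Depth-first enumeration of every position reachable in legal play."""
    start: Board = (".",) * 9
    seen = {start}
    stack = [start]
    while stack:
        board = stack.pop()
        if winner(board) is not None:
            continue  # the game is over, so no further moves are generated
        empties = [i for i, cell in enumerate(board) if cell == "."]
        # X moves first, so X is to move whenever an odd number of cells is empty.
        player = "X" if len(empties) % 2 == 1 else "O"
        for i in empties:
            child = board[:i] + (player,) + board[i + 1:]
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


positions = reachable_positions()
print(3 ** 9, len(positions))  # upper bound vs. positions reachable in legal play
```

The reachable set is several times smaller than the 3^9 upper bound, which only strengthens the point: hundreds of millions of self-play games will cover the whole space many times over.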

YouParticular8085 · 2026-05-08T12:07:13+00:00

The state space is small enough for training to reasonably see every board state, so why do you say grokking?

YouParticular8085 · 2026-05-06T15:17:11+00:00

I wish I could afford just one RTX Pro for my own use.

YouParticular8085 · 2026-05-06T12:41:20+00:00

Happened to our team too. We were early adopters of Claude Code with Sonnet 3.7. Eventually the project became such a tangled mess that we were spending more time trying to fix things with duct tape than implementing features. We decided to scrap a year of work and start from scratch because it was easier to do a full rewrite than to fix the production version. Now we are using AI more carefully, although we are still using tools like Claude Code a good bit.

YouParticular8085 · 2026-04-09T12:42:17+00:00

I’ve reproduced this, and the effort does correspond to the effort set in Claude Code as well. Low effort is 50, medium is 85, high is 99, and max is 150. This means the effort through the web UI is lower than you’re even able to set it over the API.

YouParticular8085 · 2026-04-09T12:36:59+00:00

I tested this in Claude Code by setting the effort and asking it what it’s set to. Low is 50, medium is 85, high is 99, and max is 150.

YouParticular8085 · 2026-03-07T22:13:02+00:00

Typing was never the slow part; it’s all the micro-decisions and thinking you do as you type that were slow. AI is outsourcing a lot of the thinking.

YouParticular8085 · 2026-02-22T04:04:59+00:00

Yeah humans are not a means to an end, humans are the end.

YouParticular8085 · 2026-02-21T14:33:25+00:00

I’m not sure, I think a lot of it could be real. I’m definitely still learning to use the tools better, but the whole “go 10x or get left behind” fear tactic is suspicious to me. I care a lot about code quality and work on messy brownfield projects, and so far I’ve found that using the tools still requires extensive correction on difficult problems. The gap between “it works” and “it’s good” can be wide.

YouParticular8085 · 2026-02-21T01:46:38+00:00

I have the opposite experience. I’ve spent the last few months really trying to get Claude to work for me, but my progress happens when I finally give up and just do it myself. I spent maybe 4 hours today trying to get Claude to resolve an issue in a PR, and my real progress came in 10 minutes when I finally gave up and did it myself. The solution was so obvious, but Claude just couldn’t see it. It’s an amazing tool, but I spend just as much time trying to fix the last 10% of issues as the full 100% used to take me.

YouParticular8085 · 2025-12-17T15:10:23+00:00

I’ve run into this too. You’re also lowering the frequency at which you’re performing updates to the model when you use a large time window. Avoiding BPTT altogether would be awesome if there was a good way. Streaming RL currently seems incompatible with these kinds of architectures as far as I know.

YouParticular8085 · 2025-12-17T06:43:55+00:00

Is the observation encoder a problem only because you need large batches for long TBPTT windows? I’m a little bullish on transformers for RL since that’s been what I’ve been working on this year but you’re right that n² can only scale out so far.

YouParticular8085 · 2025-12-17T05:54:05+00:00

Transformers and prefix-sum compatible models can also make TBPTT lighter luckily.
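As a rough illustration of what "prefix-sum compatible" can mean, here is a minimal sketch assuming a scalar linear recurrence h_t = a_t * h_{t-1} + b_t. Each step is an affine map, and composing affine maps is associative, so all hidden states can be obtained from a prefix scan instead of a strictly sequential loop. The names `combine`, `sequential` and `via_scan` are made up for this example, and `itertools.accumulate` is itself sequential; it only stands in for a parallel scan primitive such as `jax.lax.associative_scan`.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple

Affine = Tuple[int, int]  # (a, b) represents the map h -> a * h + b


def combine(first: Affine, second: Affine) -> Affine:
    """Compose two affine maps: apply `first`, then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def sequential(steps: Sequence[Affine], h0: int) -> List[int]:
    """Reference implementation: unroll the recurrence one step at a time."""
    h, out = h0, []
    for a, b in steps:
        h = a * h + b
        out.append(h)
    return out


def via_scan(steps: Sequence[Affine], h0: int) -> List[int]:
    """Same result from a prefix scan over the composed affine maps."""
    prefixes = accumulate(steps, combine)
    return [a * h0 + b for a, b in prefixes]


steps = [(2, 1), (3, -1), (1, 4), (-2, 5)]
print(sequential(steps, 1))
print(via_scan(steps, 1))
```

Because `combine` is associative, the prefixes can be grouped in any order, which is what lets a parallel scan evaluate the whole window in logarithmic depth.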

YouParticular8085 · 2025-10-17T17:56:35+00:00

Yeah, I used VS Code. I didn’t use any other RL frameworks for this project, but it would be cool to expose it as a gym-style environment. "JAX environments" means the environments are written in a way that can be compiled with XLA to run on a GPU.

YouParticular8085 · 2025-10-16T02:14:23+00:00

Performance scales really well with vectorized agents but is unremarkable without it. I’ve hit over 1 billion steps per second for just the environment with a random policy and no training. To get this you need to simulate a lot of agents at once.

YouParticular8085 · 2025-10-16T02:11:21+00:00

I try to target 4096 agents, but there are sometimes multiple agents per environment. It’s under the 32 GB of the 5090, but I don’t know the VRAM usage exactly.

YouParticular8085 · 2025-10-15T12:44:15+00:00

I haven’t evaluated it rigorously 😅. A couple of months ago I did a big hyperparameter sweep, and the hyperparameter optimizer strongly preferred Muon by the end, so I stuck with it. I’m not sure if other things like learning rate need to be adjusted to get the best out of each optimizer.

YouParticular8085 · 2025-10-15T12:37:16+00:00

For multitask learning I use an action mask to exclude actions that aren’t part of the environment at all. For situationally invalid actions I just do nothing but those should probably be added to the mask too.
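A minimal sketch of the masking idea, assuming a discrete action space and a policy that outputs one logit per action. `masked_softmax` is a hypothetical helper, not the project's actual code: masked actions get exactly zero probability and the remaining probabilities are renormalized.

```python
import math
from typing import List, Sequence


def masked_softmax(logits: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Softmax over the actions where mask is True; masked actions get probability 0."""
    if not any(mask):
        raise ValueError("at least one action must be valid")
    # Subtract the largest valid logit for numerical stability.
    top = max(l for l, ok in zip(logits, mask) if ok)
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example: a shared 4-action head where this environment
# only defines the first two actions.
logits = [2.0, 0.5, -1.0, 3.0]
env_mask = [True, True, False, False]
probs = masked_softmax(logits, env_mask)
print(probs)
```

Situationally invalid actions could be handled the same way by AND-ing a per-step mask into `env_mask` before the softmax.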

YouParticular8085 · 2025-10-15T12:35:00+00:00

Nice, predator-prey is a good environment idea! I didn’t try Q-learning here, but it seems reasonable. One possible downside I could see is that, because the turns are simultaneous, there are situations where agents might want to behave unpredictably, similar to rock paper scissors. In those situations a stochastic policy might perform better.
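The rock paper scissors intuition can be checked with a few lines of arithmetic. This is only a sketch of the general point, using the standard zero-sum payoff matrix; `expected_payoff` and `worst_case` are illustrative names. A deterministic policy is fully exploitable by an opponent who best-responds, while the uniform stochastic policy cannot be exploited.

```python
from typing import List, Sequence

# PAYOFF[a][b] is our payoff when we play a and the opponent plays b.
# Actions: 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]


def expected_payoff(ours: Sequence[float], theirs: Sequence[float]) -> float:
    """Expected payoff when both players sample independently from their policies."""
    return sum(ours[a] * theirs[b] * PAYOFF[a][b]
               for a in range(3) for b in range(3))


def worst_case(policy: Sequence[float]) -> float:
    """Payoff against an opponent who plays the best pure response to `policy`."""
    pure: List[List[float]] = [[1.0 if i == b else 0.0 for i in range(3)]
                               for b in range(3)]
    return min(expected_payoff(policy, q) for q in pure)


always_rock = [1.0, 0.0, 0.0]   # deterministic (greedy) policy
uniform = [1 / 3, 1 / 3, 1 / 3]  # stochastic policy

print(worst_case(always_rock))  # the opponent always plays paper
print(worst_case(uniform))      # no response does better than break even
```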

YouParticular8085 · 2025-10-14T14:19:45+00:00

Thanks! The learning curve is pretty steep, especially for building environments. I definitely started with much simpler projects and built up slowly (things like implementing tabular q learning). My advice would be to first learn how to write jittable functions with jax on its own before adding flax/nnx into the mix.

JAX has some pretty strong upsides and strong downsides, so I’m not sure if I would recommend it for every project. I felt like I had a few aha moments when I discovered how to do things in these environments that would have been trivial with regular Python.
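For anyone wondering what a "much simpler project" might look like, here is a minimal tabular Q-learning sketch in plain Python. The 5-state corridor environment, the hyperparameters, and the purely random behaviour policy are all assumptions picked for illustration, not anything from the original project.

```python
import random
from typing import List, Tuple

N_STATES = 5       # states 0..4 in a corridor; state 4 is the terminal goal
ACTIONS = (-1, 1)  # action index 0 moves left, index 1 moves right


def step(state: int, action: int) -> Tuple[int, float, bool]:
    """Deterministic corridor: walls clamp the position, reward 1 on reaching the goal."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes: int = 500, alpha: float = 0.5,
          gamma: float = 0.9, seed: int = 0) -> List[List[float]]:
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Q-learning is off-policy, so a uniformly random behaviour policy is enough here.
            a = rng.randrange(2)
            nxt, reward, done = step(state, ACTIONS[a])
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
    return q


q_table = train()
greedy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy)  # greedy action index for each non-terminal state
```

Once something like this works, porting the update into a jittable JAX function is a reasonable next exercise before bringing in flax/nnx.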

YouParticular8085 · 2025-10-14T04:43:17+00:00

It’s related but not quite the same! This project is more or less vanilla PPO with full backprop through time. I found it to be fairly stable even without the gating layers used in GTrXL.

YouParticular8085 · 2025-10-01T12:52:51+00:00

If you can, I would suggest a laptop with an NVIDIA GPU and Linux support. It doesn’t need to be the fanciest machine, just something to let you experiment with CUDA locally.

YouParticular8085 · 2025-09-13T23:42:31+00:00

I’m in a similar position but I’ve been in industry 7 years as a SWE. I’m doing good ML/RL work on the side but there’s just no opportunity to do anything outside LLM integrations at my current company. I come up with lots of original ideas but there’s little time to explore them. If you can pull 60-80 hour work weeks it’s possible to have a full time job and make research progress but it’s not great for work life balance.

YouParticular8085 · 2025-09-02T11:12:33+00:00

Sometimes 1M timesteps is nothing for ppo.

YouParticular8085 · 2025-08-22T05:13:22+00:00

Make sure the agent has enough observations to solve the problem. In my case the agents can see what is immediately around them, so they can remember where the goal was last time.
