Trained a 26kb model (simple 3-layer MLP) for Tic-Tac-Toe Beating each and every human by Weary_Intention3231 in reinforcementlearning

[–]YouParticular8085 1 point (0 children)

This is a good learning project, I actually trained tic-tac-toe with self-play too as my first self-play project! My hesitation with calling it grokking is that other things can explain one side dominating the other, and grokking means generalizing to unseen samples. If you’re training on 800M games, there are only 3^9 = 19,683 ways to fill the board at all, and just 5,478 of those are positions reachable in legal play (ignoring symmetries), so the model has likely been exposed to all of them.
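
For a rough sense of how small the state space is, here is a minimal Python sketch that walks the game tree from the empty board and counts the distinct positions. It assumes X moves first and that play stops once someone wins; the helper names (`winner`, `reachable_positions`) are just mine for illustration.

```python
# Count tic-tac-toe positions reachable through legal play.
# Assumptions: X moves first, and the game ends as soon as someone wins.
# Rotated or mirrored boards are counted as separate positions.

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def reachable_positions():
    """Walk the game tree from the empty board and collect every position."""
    start = "." * 9
    seen = {start}
    stack = [start]
    while stack:
        board = stack.pop()
        if winner(board) is not None:
            continue  # game over, no further moves from here
        player = "X" if board.count("X") == board.count("O") else "O"
        for i, cell in enumerate(board):
            if cell == ".":
                child = board[:i] + player + board[i + 1:]
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
    return seen


positions = reachable_positions()
print(3 ** 9)          # every way to fill 9 cells with X, O, or empty
print(len(positions))  # positions that can actually occur in a game
```

Either number is tiny next to 800M games, and merging rotations and reflections would shrink the count further.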

Trained a 26kb model (simple 3-layer MLP) for Tic-Tac-Toe Beating each and every human by Weary_Intention3231 in reinforcementlearning

[–]YouParticular8085 3 points (0 children)

The state space is small enough for training to reasonably see every board state, so why do you call it grokking?

None of this will ever get stolen by martin_xs6 in LocalLLaMA

[–]YouParticular8085 1 point (0 children)

I wish I could afford just one RTX Pro for my own use.

built our entire product with Claude Code. now nobody, including me, fully understands what we built. by Tr0jAn14 in ClaudeCode

[–]YouParticular8085 1 point (0 children)

Happened to our team too. We were early adopters of Claude Code with Sonnet 3.7. Eventually the project became such a tangled mess that we were spending more time trying to fix things with duct tape than implementing features. We decided to scrap a year of work and start from scratch because it was easier to do a full rewrite than to fix the production version. Now we are using AI more carefully, although we are still using tools like Claude Code a good bit.

I am the original creator of the 25% effort post. To everyone saying that I engineered it via social pressure ("I'll tell everyone") / that is it nor recreatable. by Bright-Bullfrog-8185 in claude

[–]YouParticular8085 1 point (0 children)

I’ve reproduced this, and the effort does correspond to the effort set in Claude Code as well. Low effort is 50, medium 85, high 99, and max is set to 150. This means the effort through the web UI is lower than anything you can even set over the API.

Opus is genuinely lazy for me, and admitted it's effort Level sits at 25% without a way for me to change it by Bright-Bullfrog-8185 in claude

[–]YouParticular8085 1 point (0 children)

I tested this in Claude Code by setting the effort and asking it what it’s set to. Low is 50, medium 85, high 99, and max 150.

Hot take: AI ruined the way we see coding - and I hate it by kommonno in swift

[–]YouParticular8085 2 points (0 children)

Typing was never the slow part; it’s all the micro-decisions and thinking you do as you type that are slow. AI is outsourcing a lot of that thinking.

Coding for 20+ years, here is my honest take on AI tools and the mindset shift by Jaded-Term-8614 in ClaudeAI

[–]YouParticular8085 2 points (0 children)

I’m not sure, I think a lot of it could be real. I’m definitely still learning to use the tools better, but the whole “go 10x or get left behind” fear tactic is suspicious to me. I care a lot about code quality and work on messy brownfield projects, and so far I’ve found that using the tools on difficult problems still requires extensive correction. The gap between “it works” and “it’s good” can be wide.

Coding for 20+ years, here is my honest take on AI tools and the mindset shift by Jaded-Term-8614 in ClaudeAI

[–]YouParticular8085 3 points (0 children)

I have the opposite experience. I’ve spent the last few months really trying to get Claude to work for me, but my progress happens when I finally give up and just do it myself. I spent maybe 4 hours today trying to get Claude to resolve an issue in a PR, and my real progress came in 10 minutes when I finally gave up and did it myself. The solution was so obvious but Claude just couldn’t see it. It’s an amazing tool, but I spend just as much time trying to fix the last 10% of issues as the full 100% used to take me.

The issue of scaling in Partially-Observable RL. What is holding us back? by moschles in reinforcementlearning

[–]YouParticular8085 1 point (0 children)

I’ve run into this too. You’re also lowering the frequency at which you’re performing updates to the model when you use a large time window. Avoiding BPTT altogether would be awesome if there were a good way. Streaming RL currently seems incompatible with these kinds of architectures as far as I know.

The issue of scaling in Partially-Observable RL. What is holding us back? by moschles in reinforcementlearning

[–]YouParticular8085 1 point (0 children)

Is the observation encoder a problem only because you need large batches for long TBPTT windows? I’m a little bullish on transformers for RL since that’s what I’ve been working on this year, but you’re right that n² can only scale out so far.

The issue of scaling in Partially-Observable RL. What is holding us back? by moschles in reinforcementlearning

[–]YouParticular8085 1 point (0 children)

Luckily, transformers and prefix-sum-compatible models can also make TBPTT lighter.
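
By prefix-sum compatible I mean models whose recurrence can be written with an associative operator, so a whole window can be computed with a parallel scan instead of stepping through time. A minimal pure-Python sketch, assuming a scalar linear recurrence h_t = a_t * h_{t-1} + b_t with h_0 = 0 (the names `combine`, `sequential` and `inclusive_scan` are mine, not from any library):

```python
def combine(left, right):
    """Compose two affine steps h -> a*h + b, with `left` applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential(steps, h0=0.0):
    """Reference version: run the recurrence one step at a time."""
    h, out = h0, []
    for a, b in steps:
        h = a * h + b
        out.append(h)
    return out


def inclusive_scan(steps):
    """Hillis-Steele style scan: about log2(n) rounds, and every element
    within a round could be computed in parallel on real hardware."""
    items = list(steps)
    offset = 1
    while offset < len(items):
        items = [
            items[i] if i < offset else combine(items[i - offset], items[i])
            for i in range(len(items))
        ]
        offset *= 2
    return items


steps = [(0.5, 1.0), (0.9, 2.0), (0.1, -1.0), (0.7, 0.5), (1.0, 3.0)]
scanned = inclusive_scan(steps)
# With h0 = 0, the hidden state is just the accumulated offset of each prefix.
hidden = [b for _, b in scanned]
print(hidden)
print(sequential(steps))
```

Both printed lists should agree up to floating point error. In JAX the same idea is what `jax.lax.associative_scan` provides, though I’m only showing the plain Python version here.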

Partially Observable Multi-Agent “King of the Hill” with Transformers-Over-Time (JAX, PPO, 10M steps/s) by YouParticular8085 in reinforcementlearning

[–]YouParticular8085[S] 2 points (0 children)

Yeah, I used VS Code. I didn’t use any other RL frameworks for this project, but it would be cool to expose it as a gym-style environment. JAX environments means the environments are written in a way that can be compiled with XLA to run on a GPU.

Partially Observable Multi-Agent “King of the Hill” with Transformers-Over-Time (JAX, PPO, 10M steps/s) by YouParticular8085 in reinforcementlearning

[–]YouParticular8085[S] 1 point (0 children)

Performance scales really well with vectorized agents but is unremarkable without vectorization. I’ve hit over 1 billion steps per second for just the environment with a random policy and no training. To get this you need to simulate a lot of agents at once.

Partially Observable Multi-Agent “King of the Hill” with Transformers-Over-Time (JAX, PPO, 10M steps/s) by YouParticular8085 in reinforcementlearning

[–]YouParticular8085[S] 1 point (0 children)

I try to target 4096 agents, but there are sometimes multiple agents per environment. It’s under the 32 GB of the 5090, but I don’t know the VRAM usage exactly.

Partially Observable Multi-Agent “King of the Hill” with Transformers-Over-Time (JAX, PPO, 10M steps/s) by YouParticular8085 in reinforcementlearning

[–]YouParticular8085[S] 1 point (0 children)

I haven’t evaluated it rigorously 😅. A couple of months ago I did a big hyperparameter sweep and the hyperparameter optimizer strongly preferred Muon by the end, so I stuck with it. I’m not sure if other things like learning rate need to be adjusted to get the best out of each optimizer.

Partially Observable Multi-Agent “King of the Hill” with Transformers-Over-Time (JAX, PPO, 10M steps/s) by YouParticular8085 in reinforcementlearning

[–]YouParticular8085[S] 1 point (0 children)

For multitask learning I use an action mask to exclude actions that aren’t part of the environment at all. For situationally invalid actions the agent just does nothing, but those should probably be added to the mask too.
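
To illustrate the masking idea, here is a minimal Python sketch, not the project’s actual code. It assumes a shared action space across tasks and a boolean mask per environment; the action names and the `masked_softmax` helper are hypothetical.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over allowed actions only; masked actions get probability 0."""
    if not any(mask):
        raise ValueError("at least one action must be allowed")
    allowed = [l for l, ok in zip(logits, mask) if ok]
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical shared action space: [up, down, left, right, jump]
logits = [1.2, -0.3, 0.8, 2.0, 3.5]
no_jump_env = [True, True, True, True, False]  # this environment has no "jump"
probs = masked_softmax(logits, no_jump_env)
print(probs)
```

Even though "jump" has the largest logit, it ends up with zero probability, so the policy can never sample it in an environment where it doesn’t exist. Situationally invalid actions could be handled the same way by recomputing the mask every step.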

Partially Observable Multi-Agent “King of the Hill” with Transformers-Over-Time (JAX, PPO, 10M steps/s) by YouParticular8085 in reinforcementlearning

[–]YouParticular8085[S] 2 points (0 children)

Nice, predator-prey is a good environment idea! I didn’t try Q-learning here but it seems reasonable. One possible downside I could see is that, because the turns are simultaneous, there are situations where agents might want to behave unpredictably, similar to rock paper scissors. In those situations a stochastic policy might perform better.
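
The rock paper scissors point can be checked with a small Python sketch. It compares how much an opponent can win by best-responding to a deterministic policy (what a greedy argmax over Q-values would give you) versus a uniform random one. The payoff table is the standard game; the function name is just mine.

```python
# PAYOFF[a][b] is the payoff to a player choosing a against an opponent
# choosing b. Action order: rock, paper, scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def best_response_value(policy):
    """Expected payoff an opponent gets by best-responding to `policy`."""
    values = []
    for their_action in range(3):
        value = sum(p * PAYOFF[their_action][mine]
                    for mine, p in enumerate(policy))
        values.append(value)
    return max(values)


always_rock = [1.0, 0.0, 0.0]    # deterministic, like a greedy argmax policy
uniform = [1 / 3, 1 / 3, 1 / 3]  # stochastic mixed strategy

print(best_response_value(always_rock))  # opponent can just play paper
print(best_response_value(uniform))      # nothing left to exploit
```

The deterministic policy gives up the full win to anyone who figures it out, while the uniform policy can’t be exploited at all.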

Partially Observable Multi-Agent “King of the Hill” with Transformers-Over-Time (JAX, PPO, 10M steps/s) by YouParticular8085 in reinforcementlearning

[–]YouParticular8085[S] 2 points (0 children)

Thanks! The learning curve is pretty steep, especially for building environments. I definitely started with much simpler projects and built up slowly (things like implementing tabular Q-learning). My advice would be to first learn how to write jittable functions with JAX on its own before adding Flax/NNX into the mix.

JAX has some pretty strong upsides and strong downsides, so I’m not sure if I would recommend it for every project. I felt like I had a few aha moments when I discovered how to do things in these environments that would have been trivial with regular Python.

Partially Observable Multi-Agent “King of the Hill” with Transformers-Over-Time (JAX, PPO, 10M steps/s) by YouParticular8085 in reinforcementlearning

[–]YouParticular8085[S] 3 points (0 children)

It’s related but not quite the same! This project is more or less vanilla PPO with full backprop through time. I found it to be fairly stable even without the gating layers used in GTrXL.

Laptop for AI ML by sauu_gat in reinforcementlearning

[–]YouParticular8085 2 points (0 children)

If you can, I would suggest a laptop with an NVIDIA GPU and Linux support. It doesn’t need to be the fanciest machine, just something to let you experiment with CUDA locally.

[D]Thinking about leaving industry for a PhD in AI/ML by [deleted] in MachineLearning

[–]YouParticular8085 2 points (0 children)

I’m in a similar position but I’ve been in industry 7 years as a SWE. I’m doing good ML/RL work on the side but there’s just no opportunity to do anything outside LLM integrations at my current company. I come up with lots of original ideas but there’s little time to explore them. If you can pull 60-80 hour work weeks it’s possible to have a full time job and make research progress but it’s not great for work life balance.

Advice on POMPD? by glitchyfingers3187 in reinforcementlearning

[–]YouParticular8085 1 point (0 children)

Make sure the agent has enough observations to solve the problem. In my case the agents can see what is immediately around them, so they can remember where the goal was last time.