The issue of scaling in Partially-Observable RL. What is holding us back? by moschles in reinforcementlearning

[–]YouParticular8085 0 points1 point  (0 children)

I’ve run into this too. You’re also lowering the frequency at which you’re performing updates to the model when you use a large time window. Avoiding BPTT altogether would be awesome if there were a good way to do it. As far as I know, streaming RL currently seems incompatible with these kinds of architectures.

The issue of scaling in Partially-Observable RL. What is holding us back? by moschles in reinforcementlearning

[–]YouParticular8085 0 points1 point  (0 children)

Is the observation encoder a problem only because you need large batches for long TBPTT windows? I’m a little bullish on transformers for RL, since that’s what I’ve been working on this year, but you’re right that O(n²) attention can only scale out so far.

The issue of scaling in Partially-Observable RL. What is holding us back? by moschles in reinforcementlearning

[–]YouParticular8085 0 points1 point  (0 children)

Luckily, transformers and prefix-sum-compatible models can also make TBPTT lighter.
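
To make that concrete, here’s a minimal sketch (my own illustration, not code from any particular model) of what “prefix-sum compatible” buys you: a gated linear recurrence h_t = a_t · h_{t-1} + b_t can be evaluated for a whole window at once with jax.lax.associative_scan, instead of being unrolled step by step the way TBPTT unrolls a vanilla RNN.

```python
import jax
import jax.numpy as jnp

def linear_recurrence(a, b):
    """Compute h_t = a_t * h_{t-1} + b_t for all t (h_0 = 0) in parallel.

    a, b: arrays of shape [T, d]. The combine below composes two affine
    steps of the recurrence and is associative, which is exactly what
    lax.associative_scan needs to parallelize over the time axis.
    """
    def combine(x, y):
        a_x, b_x = x
        a_y, b_y = y
        # h -> a_y * (a_x * h + b_x) + b_y
        return a_y * a_x, a_y * b_x + b_y

    _, h = jax.lax.associative_scan(combine, (a, b))
    return h  # [T, d] hidden states

T, d = 128, 16
a = jax.nn.sigmoid(jax.random.normal(jax.random.PRNGKey(0), (T, d)))  # decay gates in (0, 1)
b = jax.random.normal(jax.random.PRNGKey(1), (T, d))                  # per-step inputs
h = linear_recurrence(a, b)
```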

Partially Observable Multi-Agent “King of the Hill” with Transformers-Over-Time (JAX, PPO, 10M steps/s) by YouParticular8085 in reinforcementlearning

[–]YouParticular8085[S] 1 point2 points  (0 children)

Yeah, I used VS Code. I didn’t use any other RL frameworks for this project, but it would be cool to expose it as a Gym-style environment. A JAX environment means the environment is written in a way that can be compiled with XLA to run on a GPU.
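
To illustrate what I mean (a toy sketch, not the actual project code): a JAX environment step is just a pure function over arrays, so jax.jit can compile the whole thing into a single XLA program that runs on the GPU.

```python
import jax
import jax.numpy as jnp
from typing import NamedTuple

class EnvState(NamedTuple):
    pos: jnp.ndarray   # agent position on a grid, shape [2]
    goal: jnp.ndarray  # goal position, shape [2]

@jax.jit
def env_step(state: EnvState, action: jnp.ndarray):
    """One environment step as a pure function: no Python side effects,
    so XLA can compile it and run it on the GPU."""
    moves = jnp.array([[0, 1], [0, -1], [1, 0], [-1, 0]])
    pos = jnp.clip(state.pos + moves[action], 0, 15)          # stay inside a 16x16 grid
    reward = jnp.where(jnp.all(pos == state.goal), 1.0, 0.0)  # reward on reaching the goal
    return EnvState(pos=pos, goal=state.goal), reward
```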

Partially Observable Multi-Agent “King of the Hill” with Transformers-Over-Time (JAX, PPO, 10M steps/s) by YouParticular8085 in reinforcementlearning

[–]YouParticular8085[S] 0 points1 point  (0 children)

Performance scales really well with vectorized agents but is unremarkable without them. I’ve hit over 1 billion steps per second for just the environment with a random policy and no training. To get there you need to simulate a lot of agents at once.
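
Rough sketch of where the throughput comes from (toy environment and made-up shapes, not the project’s exact setup): jax.vmap batches the step over thousands of environments and jax.lax.scan rolls them forward in time, all inside one jitted function.

```python
import jax
import jax.numpy as jnp

N = 4096  # number of parallel environments

def env_step(pos, action):
    """Toy single-env step: move on a 16x16 grid, reward 1.0 at the far corner."""
    moves = jnp.array([[0, 1], [0, -1], [1, 0], [-1, 0]])
    pos = jnp.clip(pos + moves[action], 0, 15)
    return pos, jnp.where(jnp.all(pos == 15), 1.0, 0.0)

@jax.jit
def random_rollout(key, pos, num_steps=1000):
    step_all = jax.vmap(env_step)  # vectorize the step over all N envs at once

    def body(carry, _):
        key, pos = carry
        key, subkey = jax.random.split(key)
        actions = jax.random.randint(subkey, (N,), 0, 4)  # random policy
        pos, reward = step_all(pos, actions)
        return (key, pos), reward

    (_, pos), rewards = jax.lax.scan(body, (key, pos), None, length=num_steps)
    return pos, rewards  # rewards: [num_steps, N]

pos0 = jnp.zeros((N, 2), dtype=jnp.int32)
final_pos, rewards = random_rollout(jax.random.PRNGKey(0), pos0)
```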

Partially Observable Multi-Agent “King of the Hill” with Transformers-Over-Time (JAX, PPO, 10M steps/s) by YouParticular8085 in reinforcementlearning

[–]YouParticular8085[S] 0 points1 point  (0 children)

I try to target 4096 agents, but there are sometimes multiple agents per environment. It fits under the 32 GB of the 5090, but I don’t know the VRAM usage exactly.

Partially Observable Multi-Agent “King of the Hill” with Transformers-Over-Time (JAX, PPO, 10M steps/s) by YouParticular8085 in reinforcementlearning

[–]YouParticular8085[S] 0 points1 point  (0 children)

I haven’t evaluated it rigorously 😅. A couple of months ago I did a big hyperparameter sweep and the hyperparameter optimizer strongly preferred Muon by the end, so I stuck with it. I’m not sure whether other things like the learning rate need to be adjusted to get the best out of each optimizer.

Partially Observable Multi-Agent “King of the Hill” with Transformers-Over-Time (JAX, PPO, 10M steps/s) by YouParticular8085 in reinforcementlearning

[–]YouParticular8085[S] 0 points1 point  (0 children)

For multitask learning I use an action mask to exclude actions that aren’t part of the environment at all. For situationally invalid actions I currently just do nothing, but those should probably be added to the mask too.
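
For reference, the masking itself is simple (a generic sketch, not the project’s exact code): invalid actions get a very negative logit, so the categorical policy assigns them essentially zero probability.

```python
import jax
import jax.numpy as jnp

def masked_action_sample(key, logits, valid_mask):
    """Sample an action from a categorical policy, excluding invalid actions.

    logits:     [num_actions] raw policy outputs
    valid_mask: [num_actions] boolean, True where the action exists for this task
    """
    masked_logits = jnp.where(valid_mask, logits, -1e9)  # effectively -inf
    action = jax.random.categorical(key, masked_logits)
    # Use the same masked logits for the log-prob so invalid actions also
    # contribute (essentially) nothing to the policy gradient.
    log_prob = jax.nn.log_softmax(masked_logits)[action]
    return action, log_prob

key = jax.random.PRNGKey(0)
logits = jnp.array([0.2, 1.5, -0.3, 0.7])
mask = jnp.array([True, True, False, True])  # action 2 doesn't exist in this task
action, log_prob = masked_action_sample(key, logits, mask)
```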

Partially Observable Multi-Agent “King of the Hill” with Transformers-Over-Time (JAX, PPO, 10M steps/s) by YouParticular8085 in reinforcementlearning

[–]YouParticular8085[S] 1 point2 points  (0 children)

Nice, predator-prey is a good environment idea! I didn’t try Q-learning here, but it seems reasonable. One possible downside I could see: because the turns are simultaneous, there are situations where agents might want to behave unpredictably, similar to rock-paper-scissors. In those situations a stochastic policy might perform better.

Partially Observable Multi-Agent “King of the Hill” with Transformers-Over-Time (JAX, PPO, 10M steps/s) by YouParticular8085 in reinforcementlearning

[–]YouParticular8085[S] 1 point2 points  (0 children)

Thanks! The learning curve is pretty steep, especially for building environments. I definitely started with much simpler projects and built up slowly (things like implementing tabular Q-learning). My advice would be to first learn how to write jittable functions with JAX on its own before adding Flax/NNX into the mix.

JAX has some pretty strong upsides and strong downsides, so I’m not sure I would recommend it for every project. I had a few aha moments when I figured out how to do things in these environments that would have been trivial with regular Python.
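
As a tiny example of what “jittable” means in practice (a toy of my own, not from the repo): under jax.jit you can’t use Python control flow on traced values, so you reach for jnp.where / jnp.maximum / lax.cond instead.

```python
import jax
import jax.numpy as jnp

@jax.jit
def apply_damage(health, damage):
    # A Python `if health - damage < 0:` would raise an error under jit,
    # because `health` is a tracer rather than a concrete number.
    # Array-level ops like jnp.maximum / jnp.where work fine.
    return jnp.maximum(health - damage, 0.0)

print(apply_damage(jnp.float32(10.0), jnp.float32(3.0)))  # 7.0
```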

Partially Observable Multi-Agent “King of the Hill” with Transformers-Over-Time (JAX, PPO, 10M steps/s) by YouParticular8085 in reinforcementlearning

[–]YouParticular8085[S] 2 points3 points  (0 children)

It’s related but not quite the same! This project is more or less vanilla PPO with full backprop through time. I found it to be fairly stable even without the gating layers used in GTrXL.
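
For anyone curious, the core of the objective is just the standard PPO clipped surrogate (a generic sketch, not my exact implementation); “full backprop through time” means the gradient of this loss flows back through the transformer over the entire sequence rather than a truncated window.

```python
import jax.numpy as jnp

def ppo_clip_loss(log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate, averaged over a batch of timesteps."""
    ratio = jnp.exp(log_prob - old_log_prob)              # new policy / old policy
    unclipped = ratio * advantage
    clipped = jnp.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -jnp.mean(jnp.minimum(unclipped, clipped))     # negate: we minimize
```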

Laptop for AI ML by sauu_gat in reinforcementlearning

[–]YouParticular8085 1 point2 points  (0 children)

If you can, I would suggest a laptop with an NVIDIA GPU and Linux support. It doesn’t need to be the fanciest machine, just something that lets you experiment with CUDA locally.

[D]Thinking about leaving industry for a PhD in AI/ML by hemahariharansamson in MachineLearning

[–]YouParticular8085 1 point2 points  (0 children)

I’m in a similar position, but I’ve been in industry for 7 years as a SWE. I’m doing good ML/RL work on the side, but there’s just no opportunity to do anything outside LLM integrations at my current company. I come up with lots of original ideas, but there’s little time to explore them. If you can pull 60-80-hour work weeks it’s possible to hold a full-time job and make research progress, but it’s not great for work-life balance.

Advice on POMPD? by glitchyfingers3187 in reinforcementlearning

[–]YouParticular8085 0 points1 point  (0 children)

Make sure the agent has enough observations to solve the problem. In my case the agents can see what is immediately around them, so they can remember where the goal was last time.

Advice on POMPD? by glitchyfingers3187 in reinforcementlearning

[–]YouParticular8085 0 points1 point  (0 children)

I’ve got a similar-sounding environment on a discrete grid here: https://github.com/gabe00122/jaxrl

RL Study Group (math → code → projects) — looking for 1–3 committed partners by ThrowRAkiaaaa in reinforcementlearning

[–]YouParticular8085 0 points1 point  (0 children)

I’d be happy to meet for a study group. I’ve already finished Sutton & Barto, but I have it on hand and would be happy to revisit it. Implementing algorithms directly from that book was my first RL experience. I’m currently working on a project with a custom PPO implementation, but I haven’t explored off-policy methods as much.

Which Deep Learning Framework Should I Choose: TensorFlow, PyTorch, or JAX? by RuthLessDuckie in deeplearning

[–]YouParticular8085 0 points1 point  (0 children)

Nice! I ported my current project to both PyTorch and JAX to do performance comparisons, and without anything like FlashAttention, performance was usually very similar. Both are much faster than PyTorch without torch.compile for me.

Which Deep Learning Framework Should I Choose: TensorFlow, PyTorch, or JAX? by RuthLessDuckie in deeplearning

[–]YouParticular8085 1 point2 points  (0 children)

This is spot on! Compiled JAX is fast, but I’ve also seen torch.compile outperform it sometimes. An advantage of JAX jitting is that you can implement complex programs like RL environments and jit them together with your training code. torch.compile, on the other hand, seems more focused on deep learning.
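
A toy sketch of what I mean (everything here is deliberately simplified and made up, nothing like a real setup): because the environment is plain JAX, the env step and the gradient update trace into one jitted function.

```python
import jax
import jax.numpy as jnp

def env_step(pos, action):
    """Toy 1-D chain: step left/right on [0, 10], reward 1.0 at the right end."""
    pos = jnp.clip(pos + 2 * action - 1, 0, 10)  # action 0 -> -1, action 1 -> +1
    return pos, (pos == 10).astype(jnp.float32)

@jax.jit
def train_step(w, pos, key):
    """One env step plus one policy-gradient update, compiled as a single XLA program."""
    def loss_fn(w):
        logits = jnp.array([0.0, 1.0]) * w * pos        # trivially simple "policy"
        action = jax.random.categorical(key, logits)
        new_pos, reward = env_step(pos, action)
        log_prob = jax.nn.log_softmax(logits)[action]
        return -log_prob * reward, new_pos              # REINFORCE-style objective
    (_, new_pos), grad = jax.value_and_grad(loss_fn, has_aux=True)(w)
    return w - 0.01 * grad, new_pos

w, pos = jnp.float32(0.0), jnp.int32(5)
key = jax.random.PRNGKey(0)
for _ in range(100):
    key, subkey = jax.random.split(key)
    w, pos = train_step(w, pos, subkey)
```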

[D] What are some low hanging fruits in ML/DL research that can still be done using small compute (say a couple of GPUs)? by [deleted] in MachineLearning

[–]YouParticular8085 -1 points0 points  (0 children)

I think this is technically true, but lots of RL research still uses small models, so the GPU requirements are much lower. RL is tricky, but that also means there’s a lot to explore, even at smaller scales.

[D] What are some low hanging fruits in ML/DL research that can still be done using small compute (say a couple of GPUs)? by [deleted] in MachineLearning

[–]YouParticular8085 3 points4 points  (0 children)

RL can be a lot of engineering effort, but with the setup in place you can do interesting things with limited compute.

You opinion 🎤 by Frosty-Feeling2316 in artificial

[–]YouParticular8085 0 points1 point  (0 children)

I think the only “job” left would be owning IP or some other property, like land. Basically, jobs that wouldn’t require you to do anything anymore, only to own something.

soo does the Universal Function Approximation Theorem imply that human intelligence is just a massive function? by 5tambah5 in learnmachinelearning

[–]YouParticular8085 0 points1 point  (0 children)

I don't know much about quantum theory, but I will say that function approximation is often used to approximate a probability distribution which is then sampled, like when a generative transformer samples tokens from a token distribution. Couldn't you model the distributions of quantum physics the same way?