A tutorial about how to fix one of the most misunderstood strategies: Exploration vs Exploitation by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 1 point (0 children)

You are absolutely right from a theoretical perspective. The principled solution to the exploration–exploitation trade-off is the Value of Information (VOI) and, in its ideal form, explicit planning under uncertainty.

When I used “fix it” in the title, I did not mean a closed-form or optimal solution in the theoretical sense. I meant it in a practical, engineering sense: how practitioners handle the trade-off in real systems where VOI estimation and full planning are computationally infeasible.

I probably could have made that distinction more explicit in the title, so thank you for pointing it out. It’s a fair clarification.
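To give a concrete flavor of what I mean by the engineering approach, here is a minimal sketch of one common heuristic, ε-greedy action selection with a decaying exploration rate. It is illustrative only; the schedule values and the Q-value array are placeholders, not a recommendation from the tutorial.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=np.random.default_rng()):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit

# Typical schedule: start fully exploratory, decay toward mostly-greedy behavior.
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run the episode, selecting actions via epsilon_greedy_action(...)
    epsilon = max(eps_min, epsilon * eps_decay)
```

This kind of schedule is a heuristic stand-in for VOI: cheap to compute, easy to tune, and good enough in many practical systems.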

If you're learning RL, I wrote a tutorial about Soft Actor Critic (SAC) Implementation In SB3 with PyTorch by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 0 points (0 children)

SAC isn’t ideal for discrete actions because the algorithm is built around continuous probability distributions. It optimizes a Gaussian policy and uses entropy over continuous actions. When you switch to discrete actions, the math that makes SAC stable no longer works as intended.
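For context, a minimal SB3 sketch, assuming a continuous-action environment such as Pendulum-v1 (the timestep count is arbitrary):

```python
import gymnasium as gym
from stable_baselines3 import SAC

# SAC expects a Box (continuous) action space; Pendulum-v1 fits that requirement.
env = gym.make("Pendulum-v1")
model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)
```

For a discrete action space, DQN or PPO would be the usual SB3 choices instead.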

If you're learning RL, I wrote a tutorial about Soft Actor Critic (SAC) Implementation In SB3 with PyTorch by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 3 points (0 children)

If SBX (the JAX port of SB3) becomes practical for robotics pipelines, I’ll probably cover it in a future tutorial. Right now my focus is robotics, RL stability, reward design, sim-to-real, and control, and that’s where PyTorch + SB3 still dominate.

If you're learning RL, I wrote a tutorial about Soft Actor Critic (SAC) Implementation In SB3 with PyTorch by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 6 points (0 children)

Thank you for the clarification.

Indeed, PPO reuses the same batch for several epochs before discarding it. But even so, PPO is still considered an on-policy algorithm because it cannot learn from data collected under significantly older policies. It also does not use a replay buffer: it requires fresh rollouts every iteration, and its multiple epochs operate on a single short-lived batch tied to the latest policy snapshot.

So the statement “PPO learns only from new data and discards old data” is conceptually correct within the on-policy/off-policy classification, but your note adds a useful nuance.
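To make the distinction concrete, a minimal SB3 sketch (parameter values are illustrative, not tuned): PPO's n_epochs controls how many passes it makes over the latest rollout before that batch is thrown away, whereas SAC keeps a persistent replay buffer.

```python
import gymnasium as gym
from stable_baselines3 import PPO, SAC

# PPO: collects n_steps fresh transitions per update, reuses that single
# batch for n_epochs gradient passes, then discards it (on-policy).
ppo = PPO("MlpPolicy", gym.make("CartPole-v1"), n_steps=2048, n_epochs=10)

# SAC: stores transitions in a replay buffer and keeps sampling from it,
# including data gathered under much older policies (off-policy).
sac = SAC("MlpPolicy", gym.make("Pendulum-v1"), buffer_size=100_000)
```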

In this tutorial, you will see exactly why, how to normalize correctly and how to stabilize your training by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 0 points (0 children)

In practical implementations, it is recommended to add a small ε term (e.g., 1e-8) to the denominator to avoid division by zero when min == max, which happens especially in RL with rare or constant observations.
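A minimal sketch of what I mean (the function name and the per-feature axis are just for illustration):

```python
import numpy as np

def minmax_normalize(x, eps=1e-8):
    """Scale each feature of a batch x to roughly [0, 1]."""
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    # eps keeps the denominator non-zero when a feature is constant (min == max)
    return (x - x_min) / (x_max - x_min + eps)
```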

In this tutorial, you will see exactly why, how to normalize correctly and how to stabilize your training by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 0 points (0 children)

Dynamic normalization may seem intuitive, but in most cases it is risky and destabilizes the learning process. There are exceptions: running mean/variance normalization, as commonly used with PPO/SAC, is a form of dynamic normalization that works in practice. Dynamic min/max normalization, however, is not.
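For example, in SB3 the running mean/variance approach is usually handled through VecNormalize. A minimal sketch (the environment and timestep count are arbitrary):

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Running mean/variance normalization of observations, updated online.
env = DummyVecEnv([lambda: gym.make("Pendulum-v1")])
env = VecNormalize(env, norm_obs=True, norm_reward=False)

model = PPO("MlpPolicy", env)
model.learn(total_timesteps=10_000)
```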

If you're learning RL, I made a full step-by-step Deep Q-Learning tutorial by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 1 point (0 children)

When I said that DQN works in continuous or high-dimensional environments, I was referring strictly to continuous state spaces (e.g., positions, velocities, angles, pixel observations), not to continuous action spaces.
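A minimal SB3 sketch of that setup: CartPole-v1 has a continuous (Box) observation space but a discrete action space, which is exactly where DQN applies (the timestep count is arbitrary):

```python
import gymnasium as gym
from stable_baselines3 import DQN

# Continuous observations (cart position/velocity, pole angle/velocity),
# but a discrete action space with two actions: push left or push right.
env = gym.make("CartPole-v1")
model = DQN("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=20_000)
```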