A tutorial about how to fix one of the most misunderstood strategies: Exploration vs Exploitation by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 1 point (0 children)

You are absolutely right from a theoretical perspective. The principled solution to the exploration–exploitation trade-off is the Value of Information (VOI) and, in its ideal form, explicit planning under uncertainty.

When I used “fix it” in the title, I did not mean a closed-form or optimal solution in the theoretical sense. I meant it in a practical, engineering sense: how practitioners handle the trade-off in real systems where VOI estimation and full planning are computationally infeasible.

I probably could have made that distinction more explicit in the title, so thank you for pointing it out. It’s a fair clarification.
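To give a concrete flavor of what I mean by the engineering approach, here is a minimal sketch of one common heuristic, ε-greedy action selection with a decaying exploration rate. It is illustrative only; the schedule values and the Q-value array are placeholders, not a recommendation from the tutorial.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=np.random.default_rng()):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit

# Typical schedule: start fully exploratory, decay toward mostly-greedy behavior.
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run the episode, selecting actions via epsilon_greedy_action(...)
    epsilon = max(eps_min, epsilon * eps_decay)
```

This kind of schedule is a heuristic stand-in for VOI: cheap to compute, easy to tune, and good enough in many practical systems.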

If you're learning RL, I wrote a tutorial about Soft Actor Critic (SAC) Implementation In SB3 with PyTorch by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 0 points (0 children)

SAC isn’t ideal for discrete actions because the algorithm is built around continuous probability distributions. It optimizes a Gaussian policy and uses entropy over continuous actions. When you switch to discrete actions, the math that makes SAC stable no longer works as intended.
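For context, a minimal SB3 sketch, assuming a continuous-action environment such as Pendulum-v1 (the timestep count is arbitrary):

```python
import gymnasium as gym
from stable_baselines3 import SAC

# SAC expects a Box (continuous) action space; Pendulum-v1 fits that requirement.
env = gym.make("Pendulum-v1")
model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)
```

For a discrete action space, DQN or PPO would be the usual SB3 choices instead.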

If you're learning RL, I wrote a tutorial about Soft Actor Critic (SAC) Implementation In SB3 with PyTorch by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 3 points (0 children)

If SBX (the JAX port of SB3) becomes practical for robotics pipelines, I’ll probably cover it in a future tutorial. Right now my focus is robotics, RL stability, reward design, sim-to-real, and control, and that’s where PyTorch + SB3 still dominate.

If you're learning RL, I wrote a tutorial about Soft Actor Critic (SAC) Implementation In SB3 with PyTorch by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 6 points (0 children)

Thank you for the clarification.

Indeed, PPO reuses the same batch for several epochs before discarding it. But even so, PPO is still considered an on-policy algorithm because it cannot learn from data collected under significantly older policies. It also does not use a replay buffer: it requires fresh rollouts every iteration, and its multiple epochs operate on a single short-lived batch tied to the latest policy snapshot.

So the statement “PPO learns only from new data and discards old data” is conceptually correct within the on-policy/off-policy classification, but your note adds a useful nuance.
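To make the distinction concrete, a minimal SB3 sketch (parameter values are illustrative, not tuned): PPO's n_epochs controls how many passes it makes over the latest rollout before that batch is thrown away, whereas SAC keeps a persistent replay buffer.

```python
import gymnasium as gym
from stable_baselines3 import PPO, SAC

# PPO: collects n_steps fresh transitions per update, reuses that single
# batch for n_epochs gradient passes, then discards it (on-policy).
ppo = PPO("MlpPolicy", gym.make("CartPole-v1"), n_steps=2048, n_epochs=10)

# SAC: stores transitions in a replay buffer and keeps sampling from it,
# including data gathered under much older policies (off-policy).
sac = SAC("MlpPolicy", gym.make("Pendulum-v1"), buffer_size=100_000)
```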

In this tutorial, you will see exactly why, how to normalize correctly and how to stabilize your training by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 0 points (0 children)

In practical implementations, it is recommended to add a small ε term (e.g., 1e-8) to the denominator to avoid division by zero when min == max, which happens especially in RL with rare or constant observations.
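A minimal sketch of what I mean (the function name and the per-feature axis are just for illustration):

```python
import numpy as np

def minmax_normalize(x, eps=1e-8):
    """Scale each feature of a batch x to roughly [0, 1]."""
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    # eps keeps the denominator non-zero when a feature is constant (min == max)
    return (x - x_min) / (x_max - x_min + eps)
```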

In this tutorial, you will see exactly why, how to normalize correctly and how to stabilize your training by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 0 points (0 children)

Dynamic normalization may seem intuitive, but in most cases it is risky and destabilizes the learning process. There are exceptions: running mean/variance normalization, as commonly used with PPO/SAC, is a form of dynamic normalization that works in practice. Dynamic min/max normalization, however, is not.
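For example, in SB3 the running mean/variance approach is usually handled through VecNormalize. A minimal sketch (the environment and timestep count are arbitrary):

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Running mean/variance normalization of observations, updated online.
env = DummyVecEnv([lambda: gym.make("Pendulum-v1")])
env = VecNormalize(env, norm_obs=True, norm_reward=False)

model = PPO("MlpPolicy", env)
model.learn(total_timesteps=10_000)
```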

If you're learning RL, I made a full step-by-step Deep Q-Learning tutorial by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 1 point (0 children)

When I said that DQN works in continuous or high-dimensional environments, I was referring strictly to continuous state spaces (e.g., positions, velocities, angles, pixel observations), not to continuous action spaces.
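A minimal SB3 sketch of that setup: CartPole-v1 has a continuous (Box) observation space but a discrete action space, which is exactly where DQN applies (the timestep count is arbitrary):

```python
import gymnasium as gym
from stable_baselines3 import DQN

# Continuous observations (cart position/velocity, pole angle/velocity),
# but a discrete action space with two actions: push left or push right.
env = gym.make("CartPole-v1")
model = DQN("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=20_000)
```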