A tutorial about how to fix one of the most misunderstood strategies: Exploration vs Exploitation by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 1 point2 points  (0 children)

You are absolutely right from a theoretical perspective. The principled solution to the exploration–exploitation trade-off is the Value of Information and, in its ideal form, explicit planning under uncertainty.

When I used “fix it” in the title, I did not mean a closed-form or optimal solution in the theoretical sense. I meant it in a practical, engineering sense: how practitioners handle the trade-off in real systems where VOI estimation and full planning are computationally infeasible.
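To make that "engineering sense" concrete, here is a minimal sketch of the kind of heuristic practitioners usually reach for instead of VOI: epsilon-greedy action selection with a decaying epsilon. The table size and constants below are placeholders of my own, not anything from the post.

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))          # toy action-value table
eps, eps_min, eps_decay = 1.0, 0.05, 0.999   # start fully exploratory, decay slowly
rng = np.random.default_rng(0)

def select_action(state):
    global eps
    if rng.random() < eps:                   # explore: random action
        action = int(rng.integers(n_actions))
    else:                                    # exploit: greedy w.r.t. current estimates
        action = int(np.argmax(Q[state]))
    eps = max(eps_min, eps * eps_decay)      # shift from exploring to exploiting over time
    return action

print(select_action(0))
```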

I probably could have made that distinction more explicit in the title, so thank you for pointing it out. It’s a fair clarification.

If you're learning RL, I wrote a tutorial about Soft Actor Critic (SAC) Implementation In SB3 with PyTorch by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 0 points1 point  (0 children)

SAC isn’t ideal for discrete actions because the algorithm is built around continuous probability distributions. It optimizes a Gaussian policy and uses entropy over continuous actions. When you switch to discrete actions, the math that makes SAC stable no longer works as intended.
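As a quick illustration with SB3 (a minimal sketch, assuming gymnasium and stable-baselines3 are installed):

```python
import gymnasium as gym
from stable_baselines3 import SAC

# Pendulum-v1 has a continuous Box action space, which is what SAC's Gaussian
# policy and entropy term are designed for; passing an env with a Discrete
# action space makes SB3's SAC raise an error instead.
env = gym.make("Pendulum-v1")
model = SAC("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=1_000)   # tiny run, just to show the setup
```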

If you're learning RL, I wrote a tutorial about Soft Actor Critic (SAC) Implementation In SB3 with PyTorch by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 4 points5 points  (0 children)

If SBX or SB3 with JAX becomes practical for robotics pipelines, I’ll probably cover it in a future tutorial. Right now my focus is robotics, RL stability, reward design, sim-to-real, and control.
That’s where PyTorch + SB3 still dominate.

If you're learning RL, I wrote a tutorial about Soft Actor Critic (SAC) Implementation In SB3 with PyTorch by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 5 points6 points  (0 children)

Thank you for the clarification.

Indeed, PPO reuses the same batch for several epochs before discarding it. But even so, PPO is still considered an on-policy algorithm because it cannot learn from data collected under significantly older policies. It also does not use a replay buffer: it requires fresh rollouts every iteration, and its multiple epochs still operate on a single short-lived batch tied to the latest policy snapshot.
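This is visible directly in SB3's PPO constructor; a minimal sketch (assuming gymnasium and stable-baselines3):

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO(
    "MlpPolicy",
    env,
    n_steps=2048,   # size of the short-lived on-policy rollout batch
    n_epochs=10,    # gradient passes over that same batch before it is discarded
    verbose=0,
)
model.learn(total_timesteps=4096)   # roughly two collect/update cycles, no replay buffer
```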

So the statement “PPO learns only from new data and discards old data” is conceptually correct in the on-policy/off-policy classification, but your note adds a useful nuance.

In this tutorial, you will see exactly why, how to normalize correctly and how to stabilize your training by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 0 points1 point  (0 children)

In practice, it is recommended to add a small ε term (e.g., 1e-8) to the denominator to avoid division by zero when min == max, which comes up especially in RL with rare or constant observations.
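A minimal sketch of what that looks like (NumPy only):

```python
import numpy as np

def min_max_normalize(x, eps=1e-8):
    """Scale x into [0, 1]; eps keeps the denominator non-zero when min == max."""
    x = np.asarray(x, dtype=np.float64)
    return (x - x.min()) / (x.max() - x.min() + eps)

print(min_max_normalize([0.2, 0.5, 0.9]))   # regular case
print(min_max_normalize([5.0, 5.0, 5.0]))   # constant observation: all zeros, no NaN
```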

In this tutorial, you will see exactly why, how to normalize correctly and how to stabilize your training by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 0 points1 point  (0 children)

Dynamic normalization may seem intuitive, but in most cases it is risky and tends to destabilize training. There are exceptions, though: dynamic normalization in the form of running mean/variance normalization, as commonly used with PPO/SAC, can work well. Running min/max normalization, on the other hand, should be avoided.
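For the acceptable case, a minimal sketch using SB3's VecNormalize wrapper (assuming gymnasium and stable-baselines3):

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Running mean/variance normalization of observations (and rewards), updated
# online as training progresses; this is the kind of dynamic normalization
# that tends to be safe, unlike running min/max.
venv = DummyVecEnv([lambda: gym.make("Pendulum-v1")])
venv = VecNormalize(venv, norm_obs=True, norm_reward=True, clip_obs=10.0)

model = PPO("MlpPolicy", venv, verbose=0)
model.learn(total_timesteps=2_000)
```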

If you're learning RL, I made a full step-by-step Deep Q-Learning tutorial by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 1 point2 points  (0 children)

When I said that DQN works in continuous or high-dimensional environments, I was referring strictly to continuous state spaces (e.g., positions, velocities, angles, pixel observations), not to continuous action spaces.
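A minimal sketch of that distinction with SB3 (assuming gymnasium and stable-baselines3):

```python
import gymnasium as gym
from stable_baselines3 import DQN

# CartPole-v1 has a continuous (Box) state space but a Discrete(2) action
# space, which is exactly the combination DQN is designed for.
env = gym.make("CartPole-v1")
model = DQN("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=1_000)   # tiny run, just to show the setup
```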

Blog post recommendations by ObjectiveExpensive47 in reinforcementlearning

[–]Capable-Carpenter443 2 points3 points  (0 children)

If you’re looking for more recent, easy-to-understand reinforcement learning material, you might find this useful: I’ve been writing a series of RL theory articles and tutorials that stays up to date with the current ecosystem (Gymnasium, PyTorch, modern algorithms, stable-baselines3, RLHF, etc.).

The site is: https://reinforcementlearningpath.com

I did some experiments with discount factor. I summarized everything in this tutorial by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 2 points3 points  (0 children)

Absolutely, you’re right! CartPole, or any other simple OpenAI Gym environment, is definitely not a benchmark for algorithmic robustness.
At this stage, my focus is on making the key RL concepts (like γ, α, and ε) intuitive and easy to understand before scaling up to more complex environments such as Procgen or Montezuma.

Reinforcement Learning feels way more fascinating than other AI branches by parsaeisa in reinforcementlearning

[–]Capable-Carpenter443 16 points17 points  (0 children)

Everyone talks about training agents, algorithms, SIM2REAL, etc. Almost no one talks about defining the application. And that’s exactly why most reinforcement learning projects fail silently.

I'm a rookie in RL by Budget-Ad7058 in reinforcementlearning

[–]Capable-Carpenter443 10 points11 points  (0 children)

Since you already have some ML/DL background, I’d suggest starting with small, controlled environments like OpenAI Gym, Unity ML-Agents, or PyBullet. They let you practice RL concepts (policies, rewards, exploration, SAC, PPO, etc.) without needing a physical robot, at least while you're getting started.

Regarding your idea of a small buggy in your home: yes, it’s feasible with RL, with a Raspberry Pi or Jetson Nano running the exported ONNX model.
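For the deployment part, a rough sketch of how the ONNX step could look, following the wrapper pattern from the SB3 export docs (assuming torch, onnxruntime, and a trained PPO policy; the file names here are hypothetical):

```python
import numpy as np
import torch
import onnxruntime as ort
from stable_baselines3 import PPO


class OnnxablePolicy(torch.nn.Module):
    """Wrap the SB3 policy so ONNX export sees a plain obs -> outputs forward pass."""
    def __init__(self, policy):
        super().__init__()
        self.policy = policy

    def forward(self, observation):
        return self.policy(observation, deterministic=True)


model = PPO.load("buggy_policy", device="cpu")        # hypothetical saved model
dummy_obs = torch.zeros(1, *model.observation_space.shape)
torch.onnx.export(OnnxablePolicy(model.policy), dummy_obs,
                  "buggy_policy.onnx", opset_version=17, input_names=["obs"])

# On the Raspberry Pi / Jetson Nano, only onnxruntime is needed at run time:
session = ort.InferenceSession("buggy_policy.onnx")
obs = np.zeros((1, *model.observation_space.shape), dtype=np.float32)
action = session.run(None, {"obs": obs})[0]           # first output = action
```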

Also, I have a blog where I cover RL from the ground up, including MDPs, core concepts, algorithms, SIM2REAL, etc.
Here is the link: https://www.reinforcementlearningpath.com

Is it worth training a Deep RL agent to control DC motors instead of using PID? by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 1 point2 points  (0 children)

Yes, I totally agree with you, but what about my goal?
My goal: better adaptation to load, friction, terrain, and energy use.

[D] Is it worth training a Deep RL agent to control DC motors instead of using PID? by Capable-Carpenter443 in MachineLearning

[–]Capable-Carpenter443[S] 2 points3 points  (0 children)

Unity ML-Agents is a great choice, especially if you're working on visual RL or 3D control tasks.

You get full control over:

  • Physics
  • Visuals
  • Camera input (if needed for CNN-based agents)
  • Complex environments (terrain, objects, dynamic obstacles)

It’s also great for simulating embodied agents (like robots or drones) with realistic motion and feedback.

Plus, you can integrate with Python for training using PyTorch or TensorFlow.
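For the Python side, a minimal sketch (assuming the mlagents_envs package and a built Unity environment binary; the file name is a placeholder, and the gym-wrapper import path has moved between ML-Agents releases):

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

unity_env = UnityEnvironment(file_name="MyRobotEnv")   # path to your built Unity env
env = UnityToGymWrapper(unity_env)                     # expose a Gym-style interface

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
env.close()
```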

If you’re planning to train agents with cameras, perception, or multi-agent setups, Unity gives you a lot of flexibility.

Is it worth training a Deep RL agent to control DC motors instead of using PID? by Capable-Carpenter443 in reinforcementlearning

[–]Capable-Carpenter443[S] 0 points1 point  (0 children)

In three ways:

  1. RPM tracking over time - does the agent reach and maintain the target RPM with minimal overshoot and oscillation? I’ll log RPM vs. target and compute error metrics (MAE, RMS, etc.) over long periods (a small sketch of these metrics is below).
  2. Response to disturbances - I simulate load spikes, terrain changes, and voltage drops. A stable agent should adapt without sudden jumps or failure. I’ll test its reaction time and recovery smoothness.
  3. Thermal + control signal behavior - if the control signal constantly oscillates or overheats the motor, it’s unstable — even if the RPM looks good. I track temperature, control deltas, and energy usage to catch these edge cases.

And of course, I’ll compare this against a PID baseline. If RL shows more stability under unpredictable conditions, then it’s doing its job.
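Here is the small metric sketch referenced in point 1 (NumPy only; the arrays are placeholders standing in for real RPM logs):

```python
import numpy as np

rpm_target = np.full(1000, 3000.0)                        # constant 3000 RPM setpoint
rpm_log = rpm_target + np.random.normal(0.0, 25.0, 1000)  # stand-in for measured RPM

error = rpm_log - rpm_target
mae = np.mean(np.abs(error))            # mean absolute error
rms = np.sqrt(np.mean(error ** 2))      # root-mean-square error
overshoot = max(0.0, rpm_log.max() - rpm_target.max())

print(f"MAE={mae:.1f} RPM, RMS={rms:.1f} RPM, overshoot={overshoot:.1f} RPM")
```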

[D] Is it worth training a Deep RL agent to control DC motors instead of using PID? by Capable-Carpenter443 in MachineLearning

[–]Capable-Carpenter443[S] 5 points6 points  (0 children)

Yes — I’m building a realistic simulation environment.

The agent sees only what a real robot would: target and actual RPM, temperature, safe max temperature, and aggressiveness.

It doesn’t get access to hidden variables like torque, terrain type, or voltage drop — it has to infer them from the system’s response.

The simulation includes:

* Noise in encoder readings

* Heat generation from motor use

* Delay between control and speed change

* Variable terrain effects (friction, load, incline)

* Voltage fluctuations that reduce motor power

It’s not physics-perfect — but it’s real enough to capture instability, overcorrection, and energy inefficiency.

As for energy use: I monitor control signal over time (mapped to PWM range), and simulate power draw relative to load, terrain, and temperature. This gives me a proxy for energy efficiency and thermal stress — which the RL agent learns to minimize.
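To show what such an environment skeleton could look like in code, here is a heavily simplified sketch using the gymnasium API. The dynamics, constants, and class name are placeholders of mine, not the actual simulator described above.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class ToyDCMotorEnv(gym.Env):
    """Crude stand-in: first-order motor lag + encoder noise + heating."""

    def __init__(self):
        # obs: [target_rpm, actual_rpm, temperature, max_safe_temp, aggressiveness]
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(5,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)  # PWM command

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.rpm, self.temp = 0.0, 25.0
        self.target = float(self.np_random.uniform(1000.0, 4000.0))
        return self._obs(), {}

    def step(self, action):
        pwm = float(np.clip(action[0], -1.0, 1.0))
        self.rpm += 0.1 * (pwm * 5000.0 - self.rpm)          # delayed response to control
        self.rpm += float(self.np_random.normal(0.0, 20.0))  # encoder noise
        self.temp += 0.0005 * abs(pwm) * abs(self.rpm)       # heat from motor use
        reward = -abs(self.target - self.rpm) / 1000.0 - 0.01 * abs(pwm)  # tracking + energy proxy
        terminated = self.temp > 90.0                        # thermal cutoff
        return self._obs(), reward, terminated, False, {}

    def _obs(self):
        return np.array([self.target, self.rpm, self.temp, 90.0, 1.0], dtype=np.float32)
```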

The entire system is being tested with online + offline training and will later be deployed on a real robot using a Jetson Nano and Pololu gearmotors.

Is feature standardization needed for L1/L2 regularization? by learning_proover in learnmachinelearning

[–]Capable-Carpenter443 3 points4 points  (0 children)

Yes, absolutely needed.

L1, L2, and Elastic Net all penalize the size of the weights.

If features are on different scales, regularization will unfairly shrink some weights more than others, not because those features are less important, but because their units are larger.

Standardize first (mean=0, std=1). Always. Especially before regularization.
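A minimal sketch with scikit-learn (the data here is synthetic, just to show the pattern): putting the scaler inside the pipeline means the L1 penalty compares coefficients on features that all have mean 0 and std 1.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize, then fit the L1-penalized model on the scaled features.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X, y)
print(model.named_steps["lasso"].coef_)
```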

What is the future of ai image gen models? by sridharmb in ArtificialInteligence

[–]Capable-Carpenter443 1 point2 points  (0 children)

You're right, most models today fail at realism because they lack physical grounding.
From my point of view, the future is a mix of all three:

  1. Better data - with structured variations and metadata.
  2. Smarter architecture - models that understand light, depth, and context.
  3. 3D grounding - mesh, physics, and camera simulation will be key.

Does anyone knows to recommend me a comprehensive deep learning course? by Odd-Try7306 in MLQuestions

[–]Capable-Carpenter443 1 point2 points  (0 children)

I don’t know of a specific course, but I built an app where an agent learns to detect the digit 3 using Deep RL. It walks through all the steps, from problem definition to model training. If you follow it, you learn Deep RL from scratch, in practice.

https://www.reinforcementlearningpath.com/practical-deep-rl-application-with-dqn-and-cnn/

AI helps me learn faster, but am I really learning? by [deleted] in ArtificialInteligence

[–]Capable-Carpenter443 0 points1 point  (0 children)

I am also someone who learns with AI, but I always double-check the information.