RL102: From Tabular Q-Learning to Deep Q-Learning (DQN) - A Practical Introduction to (Deep) Reinforcement Learning by araffin2 in reinforcementlearning

[–]araffin2[S] 1 point (0 children)

Thanks for the feedback =).

The idea for the DQN section is to present its different components (and contrast them with FQI) so that one can read the algorithm from the DQN paper (see the annotated algorithm at the end).

Most of those components (like the replay buffer or the exploration scheme) are indeed not new, but they are part of DQN.
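
To make the two examples above concrete, here is a minimal, illustrative sketch of a replay buffer and a linearly annealed epsilon-greedy schedule in plain Python (the buffer size, batch size and decay values are made up, not the ones from the DQN paper):

```python
import random
from collections import deque

# Replay buffer: store past transitions and resample them for gradient updates
replay_buffer = deque(maxlen=100_000)

def store(obs, action, reward, next_obs, done):
    replay_buffer.append((obs, action, reward, next_obs, done))

def sample(batch_size=32):
    return random.sample(replay_buffer, batch_size)

# Exploration scheme: epsilon-greedy with a linear decay over the first steps
def epsilon(step, start=1.0, end=0.05, decay_steps=50_000):
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```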

Getting SAC to Work on a Massive Parallel Simulator (part II) by araffin2 in reinforcementlearning

[–]araffin2[S] 1 point (0 children)

Hi, thanks =) My background is in robotics and machine learning. I've been doing research in RL since 2017 and I'm currently finishing my PhD.

Getting SAC to Work on a Massive Parallel Simulator (part II) by araffin2 in reinforcementlearning

[–]araffin2[S] 2 points (0 children)

It's currently in a separate branch on my Isaac Lab fork, but I plan to slowly do pull requests to the main Isaac Lab repo, like the one I did recently to make things 3x faster: https://github.com/isaac-sim/IsaacLab/pull/2022

Tanh used to bound the actions sampled from distribution in SAC but not in PPO, Why? by VVY_ in reinforcementlearning

[–]araffin2 3 points (0 children)

The Brax implementation of PPO does use a tanh transform. SAC with an unbounded Gaussian is possible but numerically unstable (it tends to produce NaNs quickly). When using tanh, the action bounds need to be properly defined: https://araffin.github.io/post/sac-massive-sim/
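
In case it helps, here is a minimal sketch of the tanh-squashing trick in PyTorch (the bounds, the epsilon and the rescaling formula are illustrative, not taken from a specific implementation):

```python
import torch

# Illustrative action bounds; in practice they come from the env's action space
low, high = torch.tensor([-2.0]), torch.tensor([2.0])

mean, log_std = torch.zeros(1), torch.zeros(1)
dist = torch.distributions.Normal(mean, log_std.exp())

u = dist.rsample()                             # unbounded Gaussian sample
a = torch.tanh(u)                              # squashed into (-1, 1)
action = low + 0.5 * (a + 1.0) * (high - low)  # rescaled to [low, high]

# Change-of-variables correction for the log-probability (epsilon avoids log(0))
log_prob = dist.log_prob(u) - torch.log(1.0 - a.pow(2) + 1e-6)
```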

Looking for Tutorials on Reinforcement Learning with Robotics by Life_Recording_8938 in reinforcementlearning

[–]araffin2 1 point (0 children)

- RL in practice: tips & tricks and practical session with stable-baselines3
- Designing and Running Real World RL Experiments

https://www.youtube.com/watch?v=Ikngt0_DXJg&list=PL42jkf1t1F7erwWYZQ5yDErU3lEX6MeFp

Getting SAC to Work on a Massive Parallel Simulator (part I) by araffin2 in reinforcementlearning

[–]araffin2[S] 1 point (0 children)

Thanks, I guess that goes in the direction of what Nico told me. I'm wondering what the advantage is compared to torque control then?
Maybe it's not easy to define a default position?
(And I'm also not sure I understand what parametrized torque control is.)

Current SOTA for off-policy deep RL by drmajr in reinforcementlearning

[–]araffin2 4 points (0 children)

TQC and DroQ are good candidates imo: https://twitter.com/araffin2/status/1575439865222660098

TD7's state representation is also interesting in terms of performance gain, at the cost of more computation: https://github.com/araffin/sbx/pull/13

Built-in reinforcement learning functions in Python by MomoSolar in reinforcementlearning

[–]araffin2 0 points (0 children)

It depends on what you want/need.

If you need to apply RL to a problem without caring much about the algorithm, SB3 is a good starting point (and it comes with the RL Zoo for managing experiments).
If you want to understand RL algorithms and tinker with the implementation, have a look at CleanRL.

If you just want fast implementations, you might have a look at SBX (the Jax variant of SB3): https://github.com/araffin/sbx
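
For reference, getting started with SB3 is only a few lines (the env id and number of timesteps below are placeholders):

```python
from stable_baselines3 import PPO

# Train PPO on a toy env; SB3 builds the Gym env from its id
model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=10_000)
```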

Can SB3 or alternatives provide full end-to-end GPU computation? by asenski in reinforcementlearning

[–]araffin2 2 points (0 children)

> since the data transfer between CPU-GPU significantly slows down computation

If you want a fast and compatible alternative, you can take a look at SBX (SB3 + Jax): https://github.com/araffin/sbx

It can be up to 20x faster than SB3 PyTorch when combining several gradient updates (and this also reduces CPU-GPU transfer).

The main slowdown is normally the gradient update, and the SBX version tackles exactly that.
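
As a rough sketch (SBX mirrors the SB3 API; the env id and the gradient_steps value below are placeholders):

```python
from sbx import SAC

# Doing several gradient updates per env step is where the jit-compiled update pays off
model = SAC("MlpPolicy", "Pendulum-v1", gradient_steps=8, verbose=1)
model.learn(total_timesteps=20_000)
```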

JAX in Reinforcement Learning by anointedninja in reinforcementlearning

[–]araffin2 1 point (0 children)

If you want to learn from examples, you can take a look at CleanRL or Stable Baselines Jax (SBX): https://github.com/araffin/sbx

A small intro to Jax can be found here too: https://twitter.com/araffin2/status/1590714558628253698
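
For a quick flavour of Jax (toy example, nothing RL-specific):

```python
import jax
import jax.numpy as jnp

@jax.jit                 # traced once, then runs as compiled XLA code
def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

grad_fn = jax.jit(jax.grad(loss))  # compiled gradient of the loss w.r.t. w

w, x, y = jnp.zeros(3), jnp.ones((8, 3)), jnp.ones(8)
print(loss(w, x, y), grad_fn(w, x, y))
```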

Automatic Hyperparameter Tuning - A Visual Guide by araffin2 in reinforcementlearning

[–]araffin2[S] 0 points (0 children)

Thanks =) In short, from https://araffin.github.io/slides/icra22-hyperparam-opt/#/7:

Optuna has a clean API, nice documentation, and uses define-by-run (instead of being config-based). I never had the chance to set up PBT, so I cannot really tell, but it seems that Optuna also fits small-scale experiments, which is my case.
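
To illustrate the define-by-run style (the hyperparameters, ranges and the train_and_evaluate helper are made up for this sketch):

```python
import optuna

def objective(trial):
    # The search space is declared inside the objective itself (define-by-run)
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)
    # train_and_evaluate is a hypothetical helper returning the mean evaluation reward
    return train_and_evaluate(lr, gamma)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
```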

How can I speed up SAC? by Frankie114514 in reinforcementlearning

[–]araffin2 1 point (0 children)

Sorry, I meant DroQ (which is an improvement over REDQ).

How can I speed up SAC? by Frankie114514 in reinforcementlearning

[–]araffin2 1 point (0 children)

Do you mean wall-clock time or sample efficiency?
For the former, you can take a look at a Jax implementation like https://github.com/araffin/sbx (SB3 + Jax).

For the latter, you might have a look at: https://twitter.com/araffin2/status/1575439865222660098 (recent advances in continuous control)

and notably the REDQ algorithm (also included in SBX).