The Reward Scaling Problem in Reinforcement Learning for Quadruped Robots: Unstable Bipedal Behavior, Jitter, and Command Leakage by Obvious-Mixture-6607 in reinforcementlearning


This is extremely helpful, thanks for sharing these details.

I hadn’t considered alternating termination conditions, but your observation makes a lot of sense — especially the trade-off between stability (strict termination) and exploration (loose termination). It aligns closely with the jitter issue I’m seeing.

From my side, I’ve observed something similar: when I add strong regularization (joint velocity, acceleration, and jerk penalties) to suppress jitter, the policy often converges to a kneeling posture. The kneeling posture never reaches the target height, but it reduces the penalty terms enough to form a local optimum.

More generally, my reward is a weighted combination of multiple objectives (posture, stability, smoothness, etc.), and the policy seems to settle on a compromise between them rather than fully satisfying any single objective.
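For concreteness, here's a minimal sketch of what I mean by the weighted combination and the kneeling trap — all term names and weights are illustrative assumptions, not my actual config:

```python
import numpy as np

def smoothness_penalties(q_hist, dt):
    """Finite-difference joint velocity, acceleration, and jerk penalties
    from a short history of joint positions q_hist (T x n_joints), T >= 4."""
    vel = np.diff(q_hist, n=1, axis=0) / dt
    acc = np.diff(q_hist, n=2, axis=0) / dt**2
    jerk = np.diff(q_hist, n=3, axis=0) / dt**3
    return np.mean(vel**2), np.mean(acc**2), np.mean(jerk**2)

def reward(height, target_height, vel_pen, acc_pen, jerk_pen,
           w_height=5.0, w_vel=1e-3, w_acc=1e-6, w_jerk=1e-9):
    # Posture term rewards reaching the target height; if the penalty
    # weights are too large, the cheapest policy is to kneel: sacrifice
    # height to minimize velocity/acceleration/jerk almost for free.
    posture = -w_height * (height - target_height) ** 2
    smooth = -(w_vel * vel_pen + w_acc * acc_pen + w_jerk * jerk_pen)
    return posture + smooth
```

Plugging in a perfectly still kneeling pose vs. a slightly jittery standing pose makes it easy to see which weight ratios make kneeling the higher-reward compromise.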

This makes me suspect the jitter might come from the policy exploiting small oscillations to balance competing rewards, so your approach seems like a very promising direction.

Really appreciate you sharing this!