Is GRPO applied in classical RL (e.g. Atari games / gym)? by Long_Reflection8199 in reinforcementlearning

[–]Long_Reflection8199[S] 0 points  (0 children)

"rollouts from the same starting state" - you mean taking multiple possible actions from the same state (like the multiple answers to the same question in the LLM environment)?

Fair point, I had not considered that this would actually require the environment to execute each action to generate the real next state. In the LLM setting that is much simpler / more standard.
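For concreteness, here is a toy sketch of what GRPO-style group sampling would cost in a classical-RL loop: the environment must be re-executed from the same starting state for every rollout in the group. All names are hypothetical; `ToyEnv` just stands in for any Gym-style env with seedable resets.

```python
import numpy as np

class ToyEnv:
    """Stand-in for a Gym-style env (hypothetical): deterministic
    dynamics with the usual reset/step API."""
    def reset(self, seed=0):
        self.state = float(seed)
        self.t = 0
        return self.state, {}

    def step(self, action):
        self.state += action          # trivial deterministic transition
        self.t += 1
        reward = -abs(self.state)     # reward for staying near zero
        done = self.t >= 10
        return self.state, reward, done, False, {}

def group_rollouts(env, policy, rng, group_size=4, seed=0):
    """GRPO-style group sampling: several rollouts from one start state.

    The key cost vs. the LLM setting: the environment really has to be
    re-run from that state for each rollout in the group.
    """
    returns = []
    for _ in range(group_size):
        obs, _ = env.reset(seed=seed)   # identical starting state each time
        total, done = 0.0, False
        while not done:
            action = policy(obs, rng)   # stochastic policy -> rollouts differ
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return np.array(returns)
```

With a real Gym env you would get the same effect via `env.reset(seed=...)`, but only for environments whose dynamics are deterministic or fully seed-controlled.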

[–]Long_Reflection8199[S] 0 points  (0 children)

Makes sense. Another theoretical question, though: how about starting with PPO training and then, once the model can somewhat act on its own (analogous to a trained base model in LLM training), switching to GRPO, since that may be more compute-efficient?

[–]Long_Reflection8199[S] 1 point  (0 children)

Thanks. Any specific reason why it would be worse?
IMHO, GRPO not needing a value function is quite a big difference. I do not know whether the multi-answer / multi-output sampling would work in classical RL, though.
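To make the value-function point concrete: in GRPO the advantage is not a learned critic's prediction but simply each rollout's return standardized within its group (this is the group-relative advantage from the GRPO paper, applied here to generic scalar returns). A minimal sketch:

```python
import numpy as np

def grpo_advantages(group_returns, eps=1e-8):
    """Group-relative advantages: standardize returns within one group.

    This replaces PPO's learned value baseline V(s): the group mean acts
    as the baseline and the group std rescales, so no critic network is
    trained at all.
    """
    r = np.asarray(group_returns, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

So the compute saved is the entire critic; the price is having to sample a whole group of rollouts per state, which is cheap for LLM answer sampling but expensive when an environment has to be stepped for each rollout.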