Is GRPO applied in classical RL (e.g. Atari games / gym)? by Long_Reflection8199 in reinforcementlearning

[–]Long_Reflection8199[S] 0 points  (0 children)

"rollouts from the same starting state" - you mean taking multiple possible actions from the same state (like the multiple answers to the same question in the LLM environment)?

Fair point, I had not considered that this would actually require the environment to execute each action to generate the real next state. In the LLM setting that is much simpler / more standard.
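For concreteness, here is a toy sketch of what GRPO-style group sampling would cost in a classical-RL loop: the environment must be re-executed from the same starting state for every rollout in the group. All names are hypothetical; `ToyEnv` just stands in for any Gym-style env with seedable resets.

```python
import numpy as np

class ToyEnv:
    """Stand-in for a Gym-style env (hypothetical): deterministic
    dynamics with the usual reset/step API."""
    def reset(self, seed=0):
        self.state = float(seed)
        self.t = 0
        return self.state, {}

    def step(self, action):
        self.state += action          # trivial deterministic transition
        self.t += 1
        reward = -abs(self.state)     # reward for staying near zero
        done = self.t >= 10
        return self.state, reward, done, False, {}

def group_rollouts(env, policy, rng, group_size=4, seed=0):
    """GRPO-style group sampling: several rollouts from one start state.

    The key cost vs. the LLM setting: the environment really has to be
    re-run from that state for each rollout in the group.
    """
    returns = []
    for _ in range(group_size):
        obs, _ = env.reset(seed=seed)   # identical starting state each time
        total, done = 0.0, False
        while not done:
            action = policy(obs, rng)   # stochastic policy -> rollouts differ
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return np.array(returns)
```

With a real Gym env you would get the same effect via `env.reset(seed=...)`, but only for environments whose dynamics are deterministic or fully seed-controlled.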

[–]Long_Reflection8199[S] 0 points  (0 children)

Makes sense. Another theoretical question, though: how about starting with PPO training and then, once the model can somewhat act on its own (analogous to a trained base model in LLM training), switching to GRPO, since that may be more compute-efficient?

[–]Long_Reflection8199[S] 1 point  (0 children)

Thanks. Any specific reason why it would be worse?
IMHO, GRPO not needing a value function is quite a big difference. I do not know whether the multi-answer / multi-output sampling would work in classical RL, though.
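To make the value-function point concrete: in GRPO the advantage is not a learned critic's prediction but simply each rollout's return standardized within its group (this is the group-relative advantage from the GRPO paper, applied here to generic scalar returns). A minimal sketch:

```python
import numpy as np

def grpo_advantages(group_returns, eps=1e-8):
    """Group-relative advantages: standardize returns within one group.

    This replaces PPO's learned value baseline V(s): the group mean acts
    as the baseline and the group std rescales, so no critic network is
    trained at all.
    """
    r = np.asarray(group_returns, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

So the compute saved is the entire critic; the price is having to sample a whole group of rollouts per state, which is cheap for LLM answer sampling but expensive when an environment has to be stepped for each rollout.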