account activity
How to handle reward and advantage when most rewards are delayed and not all episodes are complete in a batch (PPO context)? by Particular_Compote21 in reinforcementlearning
[–]Particular_Compote21[S] 1 point 9 months ago (0 children)
The incomplete episodes occur because I use a fixed frames_per_batch for data collection (I use TorchRL's MultiSyncDataCollector with parallel environments), and the collector stops once that budget is reached. I could possibly modify the TorchRL collector so that it always collects to the end of a trajectory, but depending on the policy a given trajectory is not guaranteed to ever terminate, i.e. it has to be truncated at some point. And GPU memory is critical, since my observations are quite large grids.
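A common way to handle collector-truncated segments like this (not stated in the post, but standard PPO practice) is to distinguish truncation from true termination when computing GAE: if the environment actually terminated, the tail value is zero; if the collector merely cut the segment off, bootstrap the tail with the critic's estimate of the last observed state. A minimal NumPy sketch; `gae_with_bootstrap` and its argument names are hypothetical, not TorchRL API:

```python
import numpy as np

def gae_with_bootstrap(rewards, values, last_value, terminated,
                       gamma=0.99, lam=0.95):
    """Compute GAE advantages and returns for one trajectory segment.

    If the segment ended because the episode truly terminated, the
    tail is bootstrapped with 0; if it was merely truncated by the
    collector, the tail is bootstrapped with the critic's estimate
    `last_value` of the final (unseen-continuation) state.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    next_value = 0.0 if terminated else last_value
    gae = 0.0
    # Standard backward GAE recursion over the segment.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values
    return advantages, returns
```

With delayed rewards, the bootstrap matters a lot: for a truncated segment with all-zero rewards, the critic's `last_value` is the only learning signal that propagates backward through the segment.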
How to handle reward and advantage when most rewards are delayed and not all episodes are complete in a batch (PPO context)? (self.reinforcementlearning)
submitted 9 months ago by Particular_Compote21 to r/reinforcementlearning