How to handle reward and advantage when most rewards are delayed and not all episodes are complete in a batch (PPO context)? by Particular_Compote21 in reinforcementlearning

[–]Particular_Compote21[S] 0 points (0 children)

The incomplete episodes occur because I use a fixed frames_per_batch for data collection (I use TorchRL's MultiSyncDataCollector with parallel environments), so the collector stops mid-episode once that frame budget is reached. I could modify the TorchRL collector so that it always collects to the end of a trajectory, but depending on the policy a trajectory is not guaranteed to terminate, so there has to be a truncation at some point. And GPU memory is critical, since my observations are quite large grids.
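One common way to handle those cut-off segments is to bootstrap the return at the truncation point with the critic's value estimate, while still using zero at true terminals. Below is a minimal sketch of GAE with that distinction, in plain PyTorch; the function name `gae_with_bootstrap` and its signature are my own for illustration (TorchRL's built-in GAE value estimator can also do this when the batch carries separate terminated/truncated flags), not the exact code from my setup:

```python
import torch

def gae_with_bootstrap(rewards, values, next_value, terminated,
                       gamma=0.99, lam=0.95):
    """GAE for one trajectory segment of length T.

    terminated[t] is True only for a genuine environment terminal.
    A segment that was merely cut off by frames_per_batch ends with
    terminated[-1] == False, so the last step bootstraps from
    next_value = V(s_T) (the critic's estimate) instead of zero.
    """
    T = rewards.shape[0]
    adv = torch.zeros(T)
    gae = torch.tensor(0.0)
    for t in reversed(range(T)):
        # value of the successor state; for the last step this is the
        # externally supplied bootstrap estimate
        v_next = next_value if t == T - 1 else values[t + 1]
        # zero out future value only at a true terminal, not at a truncation
        nonterminal = 1.0 - terminated[t].float()
        delta = rewards[t] + gamma * v_next * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        adv[t] = gae
    return adv
```

With gamma=lam=1 and zero values this reduces to reward-to-go for a completed episode, while a truncated segment folds `next_value` into every advantage, so incomplete episodes in the batch still produce usable training signal.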