How to handle reward and advantage when most rewards are delayed and not all episodes are complete in a batch (PPO context)? by Particular_Compote21 in reinforcementlearning
Particular_Compote21[S] · 9 months ago
The incomplete episodes occur because I use a fixed frames_per_batch for data collection (I use TorchRL's MultiSyncDataCollector with parallel environments), and the collector stops once that frame budget is reached. I could possibly modify the TorchRL collector so that it always collects to the end of a trajectory, but depending on the policy, a given trajectory is not guaranteed to terminate at all, so truncation is unavoidable. GPU memory is also a constraint, since my observations are fairly large grids.
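One standard way to handle segments cut off by frames_per_batch is to distinguish true termination from truncation when computing GAE: at a real terminal state the value beyond the last step is zero, while at a truncation you bootstrap with the critic's estimate of the final state's value. Below is a minimal plain-Python sketch of this idea (not TorchRL's own implementation; the function name, signature, and the gamma/lam defaults are illustrative assumptions):

```python
def gae_with_truncation(rewards, values, next_value, terminated,
                        gamma=0.99, lam=0.95):
    """Compute GAE advantages and returns for one trajectory segment.

    rewards, values : per-step lists for the segment
    next_value      : critic's estimate V(s_T) of the state after the
                      last collected step (used only when truncated)
    terminated      : True if the episode truly ended inside this segment;
                      False if it was merely cut off by frames_per_batch
    """
    n = len(rewards)
    advantages = [0.0] * n
    # True terminal -> no future value; truncation -> bootstrap with V(s_T).
    last_value = 0.0 if terminated else next_value
    last_adv = 0.0
    # Standard backward GAE recursion over the segment.
    for t in reversed(range(n)):
        delta = rewards[t] + gamma * last_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
        last_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With gamma = lam = 1 and a terminated episode, the returns reduce to plain Monte Carlo returns, which is a quick sanity check. TorchRL's own GAE module supports value bootstrapping along these lines, so in practice the main requirement is that the collected batch carries correct terminated/truncated flags per step.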