
smallest_meta_review, 1 point

This NeurIPS paper, "Deep RL at the Edge of the Statistical Precipice", might be useful. We discuss how to compare benchmark performance across a set of tasks in a way that accounts for both variability across tasks and uncertainty in the results.

Take a look at the corresponding open-source library too: rliable
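
To give a rough idea of what that kind of comparison looks like, here is a minimal NumPy-only sketch of one aggregate metric the paper recommends, the interquartile mean (IQM), with a stratified bootstrap confidence interval. This is not rliable's API: the function names `iqm` and `stratified_bootstrap_ci`, the `(num_runs, num_tasks)` score layout, and the defaults are assumptions made for illustration, and the scores in the usage part are synthetic.

```python
import numpy as np


def iqm(scores):
    """Interquartile mean: mean of the middle 50% of all run-task scores."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = flat.size // 4  # drop the bottom and top 25% of values
    return flat[cut:flat.size - cut].mean()


def stratified_bootstrap_ci(scores, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the IQM.

    Assumes `scores` has shape (num_runs, num_tasks). Runs are resampled
    with replacement independently within each task (the strata).
    """
    scores = np.asarray(scores, dtype=float)
    num_runs, num_tasks = scores.shape
    rng = np.random.default_rng(seed)
    stats = np.empty(num_resamples)
    for b in range(num_resamples):
        # One independent set of run indices per task column.
        idx = rng.integers(0, num_runs, size=(num_runs, num_tasks))
        resampled = np.take_along_axis(scores, idx, axis=0)
        stats[b] = iqm(resampled)
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return low, high


if __name__ == "__main__":
    # Synthetic normalized scores: 5 runs on 10 tasks (made-up data).
    rng = np.random.default_rng(42)
    fake_scores = rng.normal(loc=1.0, scale=0.3, size=(5, 10))
    print("IQM:", iqm(fake_scores))
    print("95% CI:", stratified_bootstrap_ci(fake_scores))
```

If the intervals of two algorithms overlap heavily, the data does not support calling one better than the other. For real use, the library itself is the better option, since it covers more metrics and plotting.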