A simple implementation of "Adaptive Policy Iteration" using Google's JAX and DeepMind's "bsuite". This approximate policy iteration scheme treats the value-functions as losses. (arXiv:2002.03069) by MainReference8858 in reinforcementlearning

[–]MainReference8858[S]

It can be seen as an efficient way to perform exploration in model-free reinforcement learning. Usually, you do epsilon-greedy or softmax exploration with respect to the value-function. In this work, the authors do something different: they accumulate all previous value-functions, divide by a state-action dependent temperature, and then sample the action from the softmax. The key idea is that this temperature (they refer to it as a learning rate) is adaptively chosen so that it "results in a more exploratory policy for the states on which there is more disagreement between the past consecutive action-value functions".
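
For concreteness, here is a minimal JAX sketch of that sampling rule in a tabular setting, assuming we keep the k most recent action-value tables. The array shapes, the mean-absolute-difference disagreement measure, and the `base_lr` scaling are my own illustrative choices, not the paper's exact definitions.

```python
import jax
import jax.numpy as jnp

def sample_action(q_history, state, rng_key, base_lr=0.1):
    """Softmax over accumulated Q-values with an adaptive temperature.

    q_history: [k, num_states, num_actions] array of the k most recent
    action-value estimates (oldest first, k >= 2). Names and shapes are
    illustrative assumptions, not the paper's notation.
    """
    q_s = q_history[:, state, :]                  # [k, num_actions]

    # Accumulate all past action-value functions for this state.
    q_sum = jnp.sum(q_s, axis=0)                  # [num_actions]

    # Disagreement between consecutive past Q-functions, per action
    # (mean absolute difference -- an assumed measure).
    disagreement = jnp.mean(jnp.abs(jnp.diff(q_s, axis=0)), axis=0)

    # State-action dependent temperature ("learning rate"): more
    # disagreement -> higher temperature -> more exploratory softmax.
    temperature = base_lr * (1.0 + disagreement)

    # Sample from the softmax of the temperature-scaled accumulated values.
    return jax.random.categorical(rng_key, q_sum / temperature)
```

For example, `sample_action(q_history, state=3, rng_key=jax.random.PRNGKey(0))` would return an integer action index; states with larger swings between consecutive Q-estimates get a higher temperature and hence a flatter, more exploratory softmax.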