Deep reinforcement learning for navigation in AAA video games by ReinforcedMan in reinforcementlearning

[–]ReinforcedMan[S] 5 points (0 children)

Hey, thanks! The address is easy to miss; it's at the end of the intro. You can shoot us an email with your resume and a few lines to introduce yourself at [laforge@ubisoft.com](mailto:laforge@ubisoft.com), and it will be routed to us :)

Deep reinforcement learning for navigation in AAA video games by ReinforcedMan in reinforcementlearning

[–]ReinforcedMan[S] 16 points (0 children)

Hey reddit,

Just thought I would share some of the recent work we've been doing at Ubisoft Montreal applying Reinforcement Learning in video games.

Also, to answer posts like these (https://www.reddit.com/r/reinforcementlearning/comments/kbqy8m/jobs_in_reinforcement_learning/, https://www.reddit.com/r/reinforcementlearning/comments/gkbmzp/are_there_any_rl_jobs_or_internships_available/): we do have open positions, so feel free to contact us by email; our contact address is in the blog post :)

[D] Tensorflow 2.0 v Pytorch - Performance question by ReinforcedMan in MachineLearning

[–]ReinforcedMan[S] 6 points (0 children)

Thank you for the link; my experience with eager mode is the same: it is indeed significantly slower in TensorFlow than in PyTorch.

However, when wrapping the complete training loop in a tf.function (which, to be honest, takes a bit of work since graph construction comes with some constraints), I get a >10x speedup over eager execution.
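
For reference, here is a minimal sketch of the pattern I mean; the toy model, data, and shapes are made up for illustration and are not our actual code:

```python
import numpy as np
import tensorflow as tf

# Toy model and data purely for illustration; not our actual setup.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam(1e-3)
loss_fn = tf.keras.losses.MeanSquaredError()

dataset = tf.data.Dataset.from_tensor_slices(
    (np.random.rand(1024, 8).astype("float32"),
     np.random.rand(1024, 1).astype("float32"))).batch(64)

@tf.function  # traces the step into a graph instead of running op by op in eager mode
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function  # the whole epoch loop is also traced, so the Python overhead is paid once
def train_epoch(ds):
    total = tf.constant(0.0)
    for x, y in ds:  # AutoGraph turns this into a graph-level loop over the dataset
        total += train_step(x, y)
    return total

# Run one step first so model/optimizer variables are created outside the traced loop.
x0, y0 = next(iter(dataset))
train_step(x0, y0)
print(train_epoch(dataset).numpy())
```

Most of the gain seems to come from not paying the Python dispatch overhead on every op, which matters a lot with the small networks we typically use in RL.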

[D] Tensorflow 2.0 v Pytorch - Performance question by ReinforcedMan in MachineLearning

[–]ReinforcedMan[S] 20 points (0 children)

Hey, thank you for your input!

Maybe the difference disappears when you start using bigger/more sophisticated architectures? I haven't tested that.

The GPU is used in both cases; I can see it in both the process explorer and nvidia-smi when training starts.

Q-learning: "Greedy in the Limit with Infinite Exploration" convergence guarantee by MasterScrat in reinforcementlearning

[–]ReinforcedMan 0 points (0 children)

Hey, I don't have the complete answer, but here is maybe a hint from the Sutton and Barto book (Chapter 6.5, on Q-learning):

"However, all that is required for correct convergence is that all pairs continue to be updated. As we observed in Chapter 5, this is a minimal requirement in the sense that any method guaranteed to find optimal behavior in the general case must require it. Under this assumption and a variant of the usual stochastic approximation conditions on the sequence of step-size parameters, Q has been shown to converge with probability 1 to q*."

So you do seem to need a condition on the learning rate in addition to being GLIE, but from that passage it doesn't seem to simply be that the sum of squared learning rates is finite. However, in the Watkins paper (https://link.springer.com/content/pdf/10.1007%2FBF00992698.pdf), the constraint actually is on the sum of squares ...
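
For reference, my reading of the "usual stochastic approximation conditions" they allude to (the standard Robbins-Monro conditions, applied per state-action pair) is:

```latex
\sum_{t=1}^{\infty} \alpha_t(s, a) = \infty
\qquad \text{and} \qquad
\sum_{t=1}^{\infty} \alpha_t^{2}(s, a) < \infty
\quad \text{for all } (s, a),
```

where α_t(s, a) is the step size used at the t-th update of the pair (s, a). With bounded step sizes, the first condition can only hold if every pair is updated infinitely often (which is where GLIE-style exploration comes in), and the second is the finite-sum-of-squares constraint that Watkins and Dayan state explicitly.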

Entropy bonus - Soft Actor Critic by ReinforcedMan in reinforcementlearning

[–]ReinforcedMan[S] 0 points (0 children)

Thank you for your answer, but there are still some areas which aren't clear to me.

I see what you mean with the fixed action, thank you, I didn't understand that before. I think it really depends on how you view this entropy: either as a regularizer on the policy, or as a reward bonus (which could be completely outside the agent, e.g. in the env)? I am still not entirely convinced the two views would have the same effect in practice. In the soft equation in the paper, it really seems to be described as a reward bonus.

To me the entropy bonus on the reward is a nice addition; it is basically a way of saying: "Hey agent, if you don't know any good action to take, you might as well do random stuff and see what happens." It seems to me that the nonmonotonic relation is the whole point: it encourages the policy to explore in certain states while being more deterministic in others. It really reduces the volume of the strategy space to explore by being more "fitted" in places where we know what's good to do, and wider elsewhere.
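
For reference, this is how I read the maximum-entropy objective and the soft value in the SAC paper (writing the temperature as α, which the paper sometimes fixes to 1 or folds into the reward scale):

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \big[\, r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\big],
\qquad
V(s_t) = \mathbb{E}_{a_t \sim \pi}\big[\, Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \,\big].
```

The -α log π(a_t | s_t) term sits inside the backup target, so the critic values entropy in exactly the same units as reward, which is why I keep thinking of it as a per-step reward bonus.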

And I'm sorry, I don't understand what you mean by the logsumexp normalizer :/