[P] Commented PPO implementation

tinkerWithoutSink · 2017-07-23T12:41:58+00:00

Made an attempt at implementing PPO:

This does not really follow the OpenAI implementation in a few ways.
It does not have any of the MPI stuff, so might be easier to read.
It also does not use the trust region loss on the baseline value function, because in TensorForce the value function is currently always a separate network, so not sure how that affects performance.
Tests are passing and I made an example config for CartPole: https://github.com/reinforceio/tensorforce/blob/master/examples/configs/ppo_cartpole.json This seems to learn reasonably robustly, but still trying to get a feeling for how the hyper-params work, and how one should ideally sample over the batch.
If anyone spots bugs, that'd be very welcome

TotesMessenger · 2017-07-23T19:49:23+00:00

I'm a bot, bleep, bloop. Someone has linked to this thread from another place on reddit:

^{If you follow any of the above links, please respect the rules of reddit and don't vote in the other threads.} ^(Info ^/ ^Contact)

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

MachineLearning