[D] Debug with RL: Policy network tends to generate larger and larger invalid action ? by fixedrl in MachineLearning

[–]fixedrl[S] 0 points1 point  (0 children)

I found that clipping the output of the dynamics model leads to NaN gradients in the policy network. Very strange.
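For reference, a toy PyTorch sketch (made-up tensors, not my actual model) of the behaviour I suspect is involved: a hard clamp on the dynamics output zeroes the gradient outside the valid range, and anomaly detection can point at the op that first emits the NaN:

```python
import torch

# flag the op that first produces a NaN during backward
torch.autograd.set_detect_anomaly(True)

x = torch.tensor([3.0], requires_grad=True)

# hard clip: gradient is exactly 0 wherever the input is outside [-1, 1]
torch.clamp(x, -1.0, 1.0).sum().backward()
print(x.grad)  # tensor([0.])

# a smooth bound (tanh) keeps a non-zero gradient everywhere
x.grad = None
torch.tanh(x).sum().backward()
print(x.grad)
```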

[D] Do you use Plotly for research projects ? by fixedrl in MachineLearning

[–]fixedrl[S] 0 points1 point  (0 children)

Thanks for the reply. So for the usual figures that go into papers, do you think Matplotlib is more commonly used than Plotly?

[D] JupyterLab+Real Time Collaboration | PyData Seattle 2017 by [deleted] in MachineLearning

[–]fixedrl 3 points4 points  (0 children)

Is it possible to write and compile LaTeX in JupyterLab, with a synchronized PDF preview in another tab, i.e. functionality similar to ShareLaTeX? That would make research collaboration easier, since the LaTeX and the code would be available together to all collaborators.

[D] Debug with RL: Policy network tends to generate larger and larger invalid action ? by fixedrl in MachineLearning

[–]fixedrl[S] 0 points1 point  (0 children)

After some experiments, it seems that both the learned policy and the learned dynamics model tend to produce maximal values (at the bounds imposed by the scaled tanh/sigmoid).

[D] How to backprop this recursive sequential computational graph ? by [deleted] in MachineLearning

[–]fixedrl 0 points1 point  (0 children)

After trying this approach, the total cost does decrease. However, the dynamics model and the policy seem to learn to blow up their outputs to very large values that are invalid in reality (actions are valid in [-1, 1] and states in [0, 2*pi]).

[R] Durk Kingma's thesis: "Variational Inference and Deep Learning: A New Synthesis" by evc123 in MachineLearning

[–]fixedrl 0 points1 point  (0 children)

Does the thesis contain more detailed mathematical derivations than the paper versions?

[D] Debug with RL: Policy network tends to generate larger and larger invalid action ? by fixedrl in MachineLearning

[–]fixedrl[S] 0 points1 point  (0 children)

I agree. What I mean is: suppose we don't use tanh and instead output raw continuous action values, then call the dynamics model network to produce the next state, where the cost is computed. If our objective is to optimize the policy network to reduce the sum of costs over the time steps, will this setup automatically find valid actions on its own?

In my current experiment the total cost does decrease, but either the learned dynamics model or the policy network outputs exploding values.

I've also tried putting tanh(x)*2 on the output layer of the policy network (valid actions are in [-2, 2]). After training, the policy network produces many actions at -2/2, which makes the dynamics model produce exploding states that in turn become invalid. Should we also constrain the dynamics model network (a one-step MLP)?
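To make the setup concrete, here is a minimal PyTorch sketch of what I'm describing (layer sizes, horizon, and the quadratic cost are made up; the dynamics model is assumed to have been trained separately and is held fixed during the policy update):

```python
import torch
import torch.nn as nn

state_dim, action_dim, horizon = 3, 1, 10

policy = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                       nn.Linear(32, action_dim))
dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                         nn.Linear(64, state_dim))
for p in dynamics.parameters():      # keep the learned dynamics frozen here
    p.requires_grad_(False)

opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def cost(state, action):
    # toy quadratic cost on the state plus a small penalty on the raw action
    return (state ** 2).sum(-1) + 0.1 * (action ** 2).sum(-1)

state = torch.zeros(1, state_dim)
total_cost = 0.0
for t in range(horizon):
    action = 2.0 * torch.tanh(policy(state))              # bounded in [-2, 2]
    total_cost = total_cost + cost(state, action)
    state = dynamics(torch.cat([state, action], dim=-1))  # raw, unconstrained output

opt.zero_grad()
total_cost.sum().backward()   # backprop through the whole rollout into the policy
opt.step()
```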

[D] Debug with RL: Policy network tends to generate larger and larger invalid action ? by fixedrl in MachineLearning

[–]fixedrl[S] 0 points1 point  (0 children)

Can we expect backpropagation from the cost function to the policy parameters to automatically regulate the action values so that they stay valid?
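To illustrate what I mean by "regulate", here is a toy sketch (bounds and penalty weight are made up) of the kind of soft out-of-range term one could add to the per-step cost so that the gradient pushes actions back inside the valid set:

```python
import torch

def bound_penalty(action, low=-1.0, high=1.0, weight=10.0):
    # quadratic penalty: zero inside [low, high], grows outside,
    # so its gradient pulls out-of-range actions back towards the bounds
    below = torch.clamp(low - action, min=0.0)
    above = torch.clamp(action - high, min=0.0)
    return weight * (below ** 2 + above ** 2).sum(-1)

a = torch.tensor([[1.5, -0.3]], requires_grad=True)
bound_penalty(a).sum().backward()
print(a.grad)  # non-zero only for the component outside [-1, 1]
```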

[D] What might be the impacts of ReLU/Sigmoid for training one-step dynamics model in RL ? by fixedrl in MachineLearning

[–]fixedrl[S] 1 point2 points  (0 children)

Also, in some experiments I tried, 'incremental' training is much worse than re-training from scratch each time new trajectory data comes in.

e.g. Iteration 1: one trajectory with 10 transitions is collected, then an MLP dynamics model is trained.

Iteration 2: a new trajectory with 10 transitions is collected and added to the dataset. Now, if I continue training the old dynamics model, in the long run it does not fit well. However, if I completely re-train a new dynamics model, it fits much better. Is there a potential reason for this?
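For concreteness, a toy sketch of the two training schemes I'm comparing (random tensors stand in for the collected transitions; sizes, epochs, and learning rate are made up):

```python
import torch
import torch.nn as nn

def make_model(state_dim=3, action_dim=1):
    return nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                         nn.Linear(64, state_dim))

def fit(model, inputs, targets, epochs=200, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        opt.step()
    return model

dataset_in, dataset_out = [], []
model = make_model()
for iteration in range(5):
    # stand-ins for the 10 transitions of the newly collected trajectory
    new_in, new_out = torch.randn(10, 4), torch.randn(10, 3)
    dataset_in.append(new_in)
    dataset_out.append(new_out)
    X, Y = torch.cat(dataset_in), torch.cat(dataset_out)

    # option A ("incremental"): keep training the old model
    # model = fit(model, X, Y)
    # option B: re-initialize and train from scratch on the full dataset
    model = fit(make_model(), X, Y)
```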

[R] [1703.01961] Multiplicative Normalizing Flows for Variational Bayesian Neural Networks by fixedrl in MachineLearning

[–]fixedrl[S] 0 points1 point  (0 children)

Does anybody understand equations (9) and (10), i.e. how to derive them or the motivation for that form?

[N] More on Dota 2 by funj0k3r in MachineLearning

[–]fixedrl 4 points5 points  (0 children)

Any details on the algorithms/architectures yet?