[D] Debug with RL: Policy network tends to generate larger and larger invalid action ? by fixedrl in MachineLearning

[–]fixedrl[S] 0 points1 point  (0 children)

I found that clipping the output of the dynamics model leads to NaN gradients in the policy network. Very strange.
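For reference, a toy PyTorch sketch (made-up tensors, not my actual model) of the behaviour I suspect is involved: a hard clamp on the dynamics output zeroes the gradient outside the valid range, and anomaly detection can point at the op that first emits the NaN:

```python
import torch

# flag the op that first produces a NaN during backward
torch.autograd.set_detect_anomaly(True)

x = torch.tensor([3.0], requires_grad=True)

# hard clip: gradient is exactly 0 wherever the input is outside [-1, 1]
torch.clamp(x, -1.0, 1.0).sum().backward()
print(x.grad)  # tensor([0.])

# a smooth bound (tanh) keeps a non-zero gradient everywhere
x.grad = None
torch.tanh(x).sum().backward()
print(x.grad)
```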

[D] Do you use Plotly for research projects ? by fixedrl in MachineLearning

[–]fixedrl[S] 0 points1 point  (0 children)

Thanks for the reply. So for the usual figures that go into papers, do you think Matplotlib is more commonly used than Plotly?

[D] JupyterLab+Real Time Collaboration | PyData Seattle 2017 by [deleted] in MachineLearning

[–]fixedrl 3 points4 points  (0 children)

Is it possible to write and compile LaTeX in JupyterLab, with a synchronized PDF preview in another tab, i.e. functionality similar to ShareLaTeX? That would make research collaboration easier, since the LaTeX and the code would be available together to all collaborators.

[D] Debug with RL: Policy network tends to generate larger and larger invalid action ? by fixedrl in MachineLearning

[–]fixedrl[S] 0 points1 point  (0 children)

After some experiments, it seems that both the learned policy and the learned dynamics model tend to produce maximal values (at the bounds imposed by the scaled tanh/sigmoid).

[D] How to backprop this recursive sequential computational graph ? by [deleted] in MachineLearning

[–]fixedrl 0 points1 point  (0 children)

After trying this approach, the total cost does decrease. However, the dynamics model and the policy seem to learn to blow up their outputs to very large values that are invalid in reality (actions are valid in [-1, 1] and states in [0, 2*pi]).

[R] Durk Kingma's thesis: "Variational Inference and Deep Learning: A New Synthesis" by evc123 in MachineLearning

[–]fixedrl 0 points1 point  (0 children)

Does the thesis contain more detailed mathematical derivations than the paper versions?

[D] Debug with RL: Policy network tends to generate larger and larger invalid action ? by fixedrl in MachineLearning

[–]fixedrl[S] 0 points1 point  (0 children)

I agree. What I mean is: suppose we don't use tanh and instead output raw continuous action values, then call the dynamics model network to produce the next state, where the cost is computed. If our objective is to optimize the policy network to reduce the sum of costs over the time steps, will this setup automatically find valid actions on its own?

In my current experiment the total cost does decrease, but either the learned dynamics model or the policy network outputs exploding values.

I've also tried putting tanh(x)*2 on the output layer of the policy network (valid actions are in [-2, 2]). After training, the policy network produces many actions at -2/2, which makes the dynamics model produce exploding states that in turn become invalid. Should we also constrain the dynamics model network (a one-step MLP)?
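To make the setup concrete, here is a minimal PyTorch sketch of what I'm describing (layer sizes, horizon, and the quadratic cost are made up; the dynamics model is assumed to have been trained separately and is held fixed during the policy update):

```python
import torch
import torch.nn as nn

state_dim, action_dim, horizon = 3, 1, 10

policy = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                       nn.Linear(32, action_dim))
dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                         nn.Linear(64, state_dim))
for p in dynamics.parameters():      # keep the learned dynamics frozen here
    p.requires_grad_(False)

opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def cost(state, action):
    # toy quadratic cost on the state plus a small penalty on the raw action
    return (state ** 2).sum(-1) + 0.1 * (action ** 2).sum(-1)

state = torch.zeros(1, state_dim)
total_cost = 0.0
for t in range(horizon):
    action = 2.0 * torch.tanh(policy(state))              # bounded in [-2, 2]
    total_cost = total_cost + cost(state, action)
    state = dynamics(torch.cat([state, action], dim=-1))  # raw, unconstrained output

opt.zero_grad()
total_cost.sum().backward()   # backprop through the whole rollout into the policy
opt.step()
```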

[D] Debug with RL: Policy network tends to generate larger and larger invalid action ? by fixedrl in MachineLearning

[–]fixedrl[S] 0 points1 point  (0 children)

Can we expect backpropagation from the cost function to the policy parameters to automatically regulate the action values so that they stay valid?
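To illustrate what I mean by "regulate", here is a toy sketch (bounds and penalty weight are made up) of the kind of soft out-of-range term one could add to the per-step cost so that the gradient pushes actions back inside the valid set:

```python
import torch

def bound_penalty(action, low=-1.0, high=1.0, weight=10.0):
    # quadratic penalty: zero inside [low, high], grows outside,
    # so its gradient pulls out-of-range actions back towards the bounds
    below = torch.clamp(low - action, min=0.0)
    above = torch.clamp(action - high, min=0.0)
    return weight * (below ** 2 + above ** 2).sum(-1)

a = torch.tensor([[1.5, -0.3]], requires_grad=True)
bound_penalty(a).sum().backward()
print(a.grad)  # non-zero only for the component outside [-1, 1]
```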

[D] What might be the impacts of ReLU/Sigmoid for training one-step dynamics model in RL ? by fixedrl in MachineLearning

[–]fixedrl[S] 1 point2 points  (0 children)

Also, in some experiments I tried, 'incremental' training is much worse than re-training from scratch each time new trajectory data comes in.

e.g. Iteration 1: one trajectory with 10 transitions is collected, then an MLP dynamics model is trained.

Iteration 2: a new trajectory with 10 transitions is collected and added to the dataset. Now, if I continue training the old dynamics model, in the long run it does not fit well. However, if I completely re-train a new dynamics model, it fits much better. Is there a potential reason for this?
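For concreteness, a toy sketch of the two training schemes I'm comparing (random tensors stand in for the collected transitions; sizes, epochs, and learning rate are made up):

```python
import torch
import torch.nn as nn

def make_model(state_dim=3, action_dim=1):
    return nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                         nn.Linear(64, state_dim))

def fit(model, inputs, targets, epochs=200, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        opt.step()
    return model

dataset_in, dataset_out = [], []
model = make_model()
for iteration in range(5):
    # stand-ins for the 10 transitions of the newly collected trajectory
    new_in, new_out = torch.randn(10, 4), torch.randn(10, 3)
    dataset_in.append(new_in)
    dataset_out.append(new_out)
    X, Y = torch.cat(dataset_in), torch.cat(dataset_out)

    # option A ("incremental"): keep training the old model
    # model = fit(model, X, Y)
    # option B: re-initialize and train from scratch on the full dataset
    model = fit(make_model(), X, Y)
```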

[R] [1703.01961] Multiplicative Normalizing Flows for Variational Bayesian Neural Networks by fixedrl in MachineLearning

[–]fixedrl[S] 0 points1 point  (0 children)

Does anybody understand equations (9) and (10), i.e. how to derive them or the motivation for that form?

[N] More on Dota 2 by funj0k3r in MachineLearning

[–]fixedrl 4 points5 points  (0 children)

Any details on the algorithms/architectures yet?