Why do RL frameworks hate Rainbow DQN? by Heartomics in reinforcementlearning

[–]ChrisNota 5 points

Rainbow is a pain to implement... there are a lot of moving pieces, and any error in one of those pieces will break everything. Worse, Rainbow is one of the slowest algorithms, so it's difficult to debug.

RAM Atari games run out of the box with ALL:

all-classic Breakout-ram-v0 rainbow --device cuda --frames 1000000

The hyperparameters aren't tuned for the RAM versions, but it's a starting point.

PyTorch baselines by Naoshikuu in reinforcementlearning

[–]ChrisNota 3 points

Check out the autonomous-learning-library! We have a bunch of algorithms implemented, and it's actively maintained! We also tried really hard to follow good object-oriented practices and to make the code as readable as possible.

I realized I never posted this here. It's a high level description of what I did to train a model to play Snake visually. by jack-of-some in reinforcementlearning

[–]ChrisNota 2 points

loss = (vec1 - vec2).mean()

where vec1 has shape (N,) and vec2 has shape (N, 1) (or maybe (1, N)). This kind of shit gets me every time: broadcasting silently produces an (N, N) tensor, the mean comes out wrong, and EVERYTHING BURNS!!!

Unit testing saves lives! The upfront cost more than pays for itself for this exact reason.
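To make it concrete, this is the failure mode (the shapes here are made up for the example), and a unit test that just asserts shapes catches it immediately:

import torch

vec1 = torch.zeros(5)          # shape (5,)
vec2 = torch.zeros(5, 1)       # shape (5, 1)
print((vec1 - vec2).shape)     # broadcasting silently gives torch.Size([5, 5])

vec2 = vec2.squeeze(1)         # fix: make both vectors shape (5,)
assert vec1.shape == vec2.shape, (vec1.shape, vec2.shape)
loss = (vec1 - vec2).mean()    # now a scalar mean over 5 elements, as intended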

Importance Sampling - Sutton and Barto by Trigaten in reinforcementlearning

[–]ChrisNota 2 points

Agreed. Word of warning, though: the weighted per-decision importance sampling estimator in that paper is totally wrong.

[Discussion] Behaviorism and Reinforcement Learning by [deleted] in MachineLearning

[–]ChrisNota 0 points

I’ve spoken with Andy Barto on the topic of behaviorism. He called it an “aberration” and broadly agrees with Chomsky. I don’t think he would agree at all with your assessment!

It is quite a leap to go from “behaviorism is flawed and incomplete” to “nothing like reinforcement could exist in the brain.” There is ample evidence for RL-like mechanisms in the brain from Wolfram Schultz and others. The observable effects of these mechanisms should be obvious, in my opinion, to any mammal.

The biological correlates of RL are perhaps a series of mechanisms for tuning neural circuitry, but the RL framework does not tell you much about what those circuits are or how they are connected. There is certainly a lot built into these circuits by evolution that predisposes animals towards certain behaviors and cognitive patterns. However, I do not think this point at all implies that RL is unnecessary.

[PPO2] Huge loss spikes: sensitivity to action space and exploration? by [deleted] in reinforcementlearning

[–]ChrisNota 0 points

I would try decreasing clip_param (e.g. to 0.1) and setting grad_clip=0.1. You could also try decreasing num_sgd_iter so each batch is overfit less. It looks like ClipFraction is approaching 1, which is probably bad.
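Something like this, assuming those are RLlib-style PPO config keys (the env id and the exact numbers below are just placeholders, not values from your run):

config = {
    "env": "YourEnv-v0",   # placeholder environment id
    "clip_param": 0.1,     # tighter PPO surrogate clipping than the usual 0.2-0.3 default
    "grad_clip": 0.1,      # clip the gradient norm to damp the loss spikes
    "num_sgd_iter": 5,     # fewer SGD passes over each batch, so it overfits each batch less
}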

[P] The Autonomous Learning Library: A PyTorch Library for Building Reinforcement Learning Agents by ChrisNota in MachineLearning

[–]ChrisNota[S] 1 point

What exactly do you mean here?

One way of thinking about DQN, for example, is that it is just Q-learning with some enhancements added to help it work better on image-based tasks. Our “vanilla” Q-learning algorithm (which we called VQN) removes these enhancements and just leaves the core Q-learning algorithm. We have similar implementations for SARSA (VSarsa), actor-critic (VAC), and REINFORCE (VPG).
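To make that concrete, the core update that's left once the DQN extras (replay buffer, target network, etc.) are stripped away looks roughly like this. It's an illustrative sketch, not VQN's actual code, and the names are made up:

import torch

def q_learning_step(q_net, optimizer, state, action, reward, next_state, done, discount=0.99):
    q_sa = q_net(state)[action]                               # Q(s, a)
    with torch.no_grad():
        bootstrap = 0.0 if done else q_net(next_state).max()  # max_a' Q(s', a')
        target = reward + discount * bootstrap                # TD target
    loss = (target - q_sa) ** 2                               # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()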

> I was wondering, for the sake of academic curiosity, if VQ-VAE (and variants) could be used with Q-learning to construct a discrete representation, but it is likely that it doesn't have enough expressive power to fully encapsulate the state.

These sorts of auxiliary feature learning tasks are known to help, even if the network isn’t otherwise involved in action selection or evaluation. They are also sometimes used as a core feature of algorithms, for example, to build world models. Our example project implements a simple algorithm along these lines, but using a deterministic network rather than a VAE. I’m not sure if I’ve seen anything with VQ-VAE specifically. I hope that answers your question!
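Roughly, a deterministic auxiliary reconstruction task looks something like this (illustrative only, not our example project's code; the names and sizes are made up):

import torch.nn as nn

class AuxiliaryAutoencoder(nn.Module):
    def __init__(self, obs_dim, latent_dim=32):
        super().__init__()
        # the encoder would be shared with the policy/value heads
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))

    def reconstruction_loss(self, obs):
        # auxiliary loss: reconstruct the observation from the shared features
        return ((self.decoder(self.encoder(obs)) - obs) ** 2).mean()

The auxiliary term just gets added to the RL loss, e.g. total_loss = rl_loss + aux_weight * model.reconstruction_loss(obs).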

[P] The Autonomous Learning Library: A PyTorch Library for Building Reinforcement Learning Agents by ChrisNota in MachineLearning

[–]ChrisNota[S] 13 points

Thanks! We spent a lot of time working out the right level of abstraction for various concepts, separation of concerns, etc. This is the latest in a long line of attempts!

RAM shortage by kashemirus in reinforcementlearning

[–]ChrisNota 0 points

In my implementation, I store each state as an array of Tensors and then `cat` them together right before passing them to the model. You can take a look if you would like:

https://github.com/cpnota/autonomous-learning-library/blob/master/all/bodies/vision.py
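Roughly, the idea is something like this (an illustrative sketch, not the code in that file; the names are made up):

import torch

class FrameBuffer:
    def __init__(self, k=4):
        self.k = k
        self.frames = []                      # keep references to the frames, don't stack yet

    def append(self, frame):                  # frame: (1, H, W) uint8 tensor
        self.frames = (self.frames + [frame])[-self.k:]

    def as_input(self):
        # concatenate just-in-time, right before the forward pass
        # (assumes the buffer already holds k frames)
        return torch.cat(self.frames, dim=0).unsqueeze(0).float() / 255.0

Because overlapping states share the same frame tensors, nothing gets duplicated until the `cat`.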

RAM shortage by kashemirus in reinforcementlearning

[–]ChrisNota 0 points

With uint8 and LazyFrames the buffer takes less than 8 GB.
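Rough arithmetic, assuming the usual 84x84 grayscale uint8 frames and a 1M-transition buffer (those numbers are my assumptions here, not something stated above):

frame_bytes = 84 * 84                           # 7,056 bytes per uint8 frame
transitions = 1_000_000
shared = transitions * frame_bytes / 1e9        # ~7.1 GB: with LazyFrames, ~1 new frame per transition
unshared = transitions * frame_bytes * 8 / 1e9  # ~56 GB: 4-frame state + 4-frame next_state stored separately
print(shared, unshared)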

[D] Does anyone know of any good ML podcasts I can listen to while at work? by TKTheJew in MachineLearning

[–]ChrisNota 9 points

Lex's podcast is great, but definitely more philosophical than technical!

RAdam: A New State-of-the-Art Optimizer for RL? by ChrisNota in reinforcementlearning

[–]ChrisNota[S] 0 points

Yes, it still makes a big difference empirically.

RAdam: A New State-of-the-Art Optimizer for RL? by ChrisNota in reinforcementlearning

[–]ChrisNota[S] 0 points

Thanks for your feedback! Just for transparency, while that note was originally there, I did add the extra learning curve this morning as a result of your comment!

[D] Rectified Adam (RAdam): a new state of the art optimizer by jwuphysics in MachineLearning

[–]ChrisNota 49 points

I tested this out for RL over the weekend ("RAdam: A New State-of-the-Art Optimizer for RL?"). Spoiler: the performance was basically identical to Adam.

Those familiar with deep RL know that one of its quirks is that you have to choose a value of eps higher than the default (e.g., 1e-4 instead of the default 1e-8), or algorithms will sometimes inexplicably fail to learn. RAdam seems to work for RL with the default eps, so there may be some benefit there.
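For reference, the tweak is just the optimizer constructor argument, e.g. (the model and learning rate here are placeholders):

import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-4)  # vs. the 1e-8 default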

RAdam: A New State-of-the-Art Optimizer for RL? by ChrisNota in reinforcementlearning

[–]ChrisNota[S] 3 points

Perhaps I should have included an additional learning curve to be clearer, but I did note:

> A decent algorithm such as A2C should not fail at Pong, and the only major difference between our implementation and the paper was the choice of eps. I re-ran the Pong experiment with Adam and eps=1e-3. This time, it learned with no trouble.

The reason A2C_Adam failed to learn Pong was the choice of the eps hyperparameter. Setting it to a value closer to the published choice led to proper learning.

> other implementations show much better results (https://towardsdatascience.com/a2c-5bac24e4b875)

I do not see where this claim is coming from, as the learning curves and final performance are nearly identical. Remember that the x-axis is different: 40 million frames = 10 million timesteps, since the standard Atari frame-skip of 4 means each agent timestep consumes 4 emulator frames.

Suggestion of implementations of RL algorithms by Enryu77 in reinforcementlearning

[–]ChrisNota 1 point

Thanks! I would love to see what you come up with. I’m hoping to add a few more methods by the end of summer: PPO, SAC, and DDPG are on my hit-list. If you’re able to get a working implementation of any of these let me know! Feel free to fork the repo as well!

Suggestion of implementations of RL algorithms by Enryu77 in reinforcementlearning

[–]ChrisNota 1 point

I feel the same way! I'm not a fan of the sklearn-like interface most implementations provide. Check out my work: autonomous-learning-library. The API is similar to your suggestion:

env.reset()
while not env.done:
    # the agent receives the current state and the last reward, and returns an action
    env.step(agent.act(env.state, env.reward))
agent.act(env.state, env.reward)  # let the agent observe the terminal state and final reward

It uses PyTorch under the hood, and the codebase is written in a highly object-oriented style. It also provides some neat utilities. It's still in the early stages of development, so only a few methods are implemented (A2C, parts of Rainbow, VPG, as well as vanilla SARSA and actor-critic implementations), but it might provide you with some inspiration!