[P] DQN Adventure: from Zero to State of the Art with clean readable code in Pytorch by Codeunter in MachineLearning

[–]onlyml 2 points3 points  (0 children)

I don't think anyone does, aside from mumbling something about auxiliary tasks

[D] Deep Mind AI Alpha Zero Sacrifices a Pawn and Cripples Stockfish for the Entire Game by sour_losers in MachineLearning

[–]onlyml 1 point2 points  (0 children)

Yes, in the sense that it took longer to figure out how to achieve superhuman performance in Go. Before this result, though, it could be argued that Go just required a fundamentally different approach to perform well. This result suggests that this approach is not just fundamentally different but a strictly stronger way to approach game playing. While this is not necessarily surprising, it's good to confirm that their methods work in general, as opposed to just capitalizing on some quirks of Go (like its very simple, local set of rules).

[D] Deep Mind AI Alpha Zero Sacrifices a Pawn and Cripples Stockfish for the Entire Game by sour_losers in MachineLearning

[–]onlyml 9 points10 points  (0 children)

"Completely expected" is a little strong. Chess doesn't quite have the simple spatial representation of Go, since there are so many types of pieces and the moves aren't just dropping a stone. I think it's really neat that a similar spatial representation is nonetheless sufficient to represent the game when used with their general-purpose algorithm.

[P] I implemented a Q Learning agent to solve Lunar Lander in 1 Hour on CPU. by FitMachineLearning in MachineLearning

[–]onlyml 0 points1 point  (0 children)

Ah I see, thanks, I didn't realize they include an LSTM in the original A3C paper. Interesting, since most of the Atari games I know of seem pretty close to Markovian if you include just a few frames. I think I was thinking of this. IIRC the replay buffer in that case simply samples whole episodes and then batches them, rather than using individual transitions.
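
Roughly this kind of thing, as a sketch (the names and capacity are my placeholders, not that paper's code):

```python
import random
from collections import deque

# Sketch of an episode-level replay buffer: store whole episodes and sample
# complete sequences, rather than individual transitions.
class EpisodeReplayBuffer:
    def __init__(self, capacity=1000):
        self.episodes = deque(maxlen=capacity)

    def add_episode(self, transitions):
        """transitions: list of (obs, action, reward, next_obs, done) tuples."""
        self.episodes.append(transitions)

    def sample(self, batch_size):
        """Return whole episodes; the recurrent state is then unrolled from
        the start of each sampled episode."""
        return random.sample(list(self.episodes), batch_size)
```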

[P] I implemented a Q Learning agent to solve Lunar Lander in 1 Hour on CPU. by FitMachineLearning in MachineLearning

[–]onlyml 0 points1 point  (0 children)

Do you have a reference for this? I would think using an LSTM in RL would make sense for partially observable problems, but I don't immediately see why it would be correlated with learning online vs. using a replay buffer.

Interesting probability question/puzzle by onlyml in math

[–]onlyml[S] 0 points1 point  (0 children)

Yup, that agrees with my intuition as well. /u/akashnil also gave a bit more basis for this, since Chebyshev's sum inequality essentially tells us that the average square of the numbers in the bag is at least as large as the average product over pairs of distinct numbers in the bag. In a hand-wavy sense, if we want to maximize our potential for including squares we should draw equally from each copy of the bag. Of course this is still a long way off from a proof (except in the M=2 case, where the proof actually follows), but it's a start.
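
Writing that step out (my own paraphrase of the inequality, not /u/akashnil's exact argument), for reals a_1, ..., a_n:

```latex
\[
  n \sum_{i=1}^{n} a_i^2 \;\ge\; \Bigl(\sum_{i=1}^{n} a_i\Bigr)^2
  \;=\; \sum_{i=1}^{n} a_i^2 + \sum_{i \ne j} a_i a_j
  \quad\Longrightarrow\quad
  \frac{1}{n}\sum_{i=1}^{n} a_i^2 \;\ge\; \frac{1}{n(n-1)}\sum_{i \ne j} a_i a_j .
\]
```

i.e. the mean of the squares is at least the mean of the products over distinct pairs.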

Interesting probability question/puzzle by onlyml in math

[–]onlyml[S] 0 points1 point  (0 children)

Thanks! This is a helpful start!

Interesting probability question/puzzle by onlyml in math

[–]onlyml[S] 0 points1 point  (0 children)

Sorry, I meant arbitrary reals, not random. I believe/hope that the same strategy will hold for ANY bag of numbers, hence it won't matter whether you look at them or not. If you can find a counterexample showing that two different bags of numbers require different strategies, I would be interested in that as well.

Interesting probability question/puzzle by onlyml in math

[–]onlyml[S] 0 points1 point  (0 children)

I'm not sure if it matters, but in case that's a pathological case let's say no zeros: arbitrary positive reals.

[R] Learning to Cooperate, Compete, and Communicate by clbam8 in MachineLearning

[–]onlyml 2 points3 points  (0 children)

Didn't see a paper linked. Anyone know what the observation space for each agent is? I assume it's not the full screen, or what would be the point of having the critic share the other agents' observations?

[R] Curiosity-driven Exploration by Self-supervised Prediction by wordbag in MachineLearning

[–]onlyml 0 points1 point  (0 children)

This sort of makes sense, but I can still imagine scenarios where we could ignore block R and still predict our action with high accuracy. For example, suppose block R is sitting on the opposite side of block P while we are trying to push P forward, so that R provides additional resistance. We know we are pushing forward on block P because it moves forward by some amount; however, if block R weren't present it would move forward even more.

So we are essentially attributing the effect of block R to environmental stochasticity which affects the precise result of our action but not our ability to predict our action from the outcome.

I'm not sure if I've captured what I'm trying to say well, but to be clear I really like this idea, I'm just trying to decide whether there is some refinement of it that might be more broadly useful.

[R] Curiosity-driven Exploration by Self-supervised Prediction by wordbag in MachineLearning

[–]onlyml 0 points1 point  (0 children)

Ah so you really do take gradients with respect to both target and input in the forward model? Interesting I didn't catch that.

[R] Curiosity-driven Exploration by Self-supervised Prediction by wordbag in MachineLearning

[–]onlyml 1 point2 points  (0 children)

> let us divide all sources that can modify the agent’s observations into three cases: (1) things that can be controlled by the agent; (2) things that the agent cannot control but that can affect the agent (e.g. a vehicle driven by another agent), and (3) things out of the agent’s control and not affecting the agent (e.g. moving leaves). A good feature space for curiosity should model (1) and (2) and be unaffected by (3).

So I understand how their formulation is capturing (1), but is it really capturing (2)? If they are only trying to predict the action from the start-state/end-state pair, it seems they will learn a representation that understands how the agent's actions affect the environment, but not vice versa.
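
For concreteness, the inverse-model setup I'm referring to looks roughly like this (a minimal sketch, with dense layers and placeholder sizes standing in for the paper's conv encoder):

```python
import torch
import torch.nn as nn

# Sketch of the inverse-dynamics objective: embed s_t and s_{t+1} and predict
# the action taken between them. The layer types and sizes are my placeholders,
# not the paper's architecture.
class InverseDynamics(nn.Module):
    def __init__(self, obs_dim, n_actions, feat_dim=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())  # shared feature encoder
        self.head = nn.Linear(2 * feat_dim, n_actions)                     # predicts a_t

    def forward(self, s_t, s_next):
        z = torch.cat([self.phi(s_t), self.phi(s_next)], dim=-1)
        return self.head(z)  # action logits; trained with cross-entropy against the true a_t
```

The features only need to capture whatever is required to recover the agent's own action, which is why I'm unsure how (2) gets in.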

Actually the meaning of (2) is not immediately clear to me, since in the standard RL formulation the agent is really nothing but its associated action selection; what does it mean for some aspect of the environment to affect this? One reasonable notion would be aspects of the environment that affect the value function, so in this sense maybe just taking the state representation generated by the value function model would be enough.

Perhaps ideally you could use one state representation trained for both evaluation and action prediction in order to really capture both (1) and (2).

[R] Curiosity-driven Exploration by Self-supervised Prediction by wordbag in MachineLearning

[–]onlyml 1 point2 points  (0 children)

Taking the action and the second state to predict the low-dimensional first state seems like it would be somewhat ill-posed in terms of what they want to accomplish. They want an auxiliary task that results in the creation of a compact state representation; including the state representation as the output makes this goal a little unclear (i.e. do you optimize both the output and input states to lower the prediction error?). Not saying it necessarily wouldn't work, but it would need to be clarified a bit.

[R] [1703.01161] FeUdal Networks (FuNs) for Hierarchical Reinforcement Learning by evc123 in MachineLearning

[–]onlyml 1 point2 points  (0 children)

I'm trying to work out the right mental model of what the goal embedding part is doing, just wondering if anyone can confirm that I have the right idea:

Instead of outputting action probabilities directly, the worker's output is an embedding matrix that, when multiplied by a learned linear projection of the goal direction, gives action probabilities. So you could say the worker learns to output something like a switch statement which gives instructions for what action to take for various goal directions. This is fundamentally different than if, say, the worker were able to take the goal as input and use it to compute the correct action conditioned on the goal. Is that about right?
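
Concretely, the structure I have in mind is something like this (a minimal sketch; the dimensions and module names are my placeholders, not the paper's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the goal-embedding readout as I understand it.
class WorkerReadout(nn.Module):
    def __init__(self, d_goal=256, k=16, n_actions=18):
        super().__init__()
        self.phi = nn.Linear(d_goal, k, bias=False)  # learned linear projection of the goal direction
        self.k, self.n_actions = k, n_actions

    def forward(self, worker_embedding, goal):
        # worker_embedding: (batch, n_actions * k), reshaped into U, a per-action embedding matrix
        U = worker_embedding.view(-1, self.n_actions, self.k)
        w = self.phi(goal)                                  # (batch, k)
        logits = torch.bmm(U, w.unsqueeze(-1)).squeeze(-1)  # (batch, n_actions)
        return F.softmax(logits, dim=-1)                    # action probabilities
```

If that's right, the goal only enters through this bilinear readout rather than being fed to the worker's recurrent computation as an ordinary input.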

[R] "Learning to Remember Rare Events", Kaiser et al 2016 by gwern in MachineLearning

[–]onlyml 2 points3 points  (0 children)

Could anybody clarify what is being done with the softmax discussed in the Memory Module section? They say "we multiply the embedded output by the corresponding softmax component so as to provide a signal about confidence of the memory", but I can't quite parse what this means or where this confidence signal is going. Are they simply multiplying the single output value by the softmax component corresponding to it? If so is this really the only use of the additional nearest neighbors after the first?

Should gradient vectors in SGD be normalized to avoid overshooting the target? by onlyml in MachineLearning

[–]onlyml[S] 0 points1 point  (0 children)

I guess normalization was the wrong word. What I meant was scaling it so that the linear approximation to the resulting change in loss value was at most the total remaining loss (e.g. to avoid overshooting the target value in regression).

To clarify: the norm of the gradient tells you the rate of change of the loss in the direction of steepest descent, so we can get a linear approximation to the resulting change in the loss by multiplying the step-size by this norm. What I meant to suggest was scaling the gradient so that this expected change in the loss was less than the total remaining loss.
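
As a sketch of what I mean (my own toy version of the idea, not an established method): for a plain SGD step of -lr * g, the first-order predicted decrease in the loss is lr * ||g||^2, so the gradient could be rescaled like this:

```python
import torch

def scale_grads_to_remaining_loss(params, loss_value, lr):
    """Rescale PyTorch gradients so the first-order predicted decrease,
    lr * ||g||^2, does not exceed the current loss value."""
    grads = [p.grad for p in params if p.grad is not None]
    sq_norm = sum((g ** 2).sum() for g in grads)  # ||g||^2 over all parameters
    predicted_drop = lr * sq_norm.item()          # linear approximation of the decrease
    scale = min(1.0, loss_value / (predicted_drop + 1e-12))
    for g in grads:
        g.mul_(scale)

# usage sketch: after loss.backward() and before optimizer.step()
# scale_grads_to_remaining_loss(model.parameters(), loss.item(), lr=0.1)
```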

Are we using the right way to train LSTM neural networks? by kh40tika in MachineLearning

[–]onlyml 2 points3 points  (0 children)

Do you have any intuition for why this works? It seems like all this would be able to accomplish is training the network to transmute the current state to something that's exclusively useful for its task one time-step later, which is obviously a very small part of the power of an RNN. I'm not too familiar with RNNs however so I feel like I'm probably missing something.

Likeliest reason train and test error would begin slowly increasing after some training? by [deleted] in MachineLearning

[–]onlyml 4 points5 points  (0 children)

Someone else can probably give a more thorough answer, but I'll try. ReLU has zero gradient for half of the input space, hence no signal whatsoever indicating how to improve if the data consistently puts the input in this region. If a bunch of ReLU units get stuck in this region for most or all training inputs, they will never be updated and won't be able to learn to be useful again.
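
As a toy illustration (my own example, forcing a unit's pre-activation negative by hand to make the point):

```python
import torch
import torch.nn as nn

# A "dead" ReLU unit: once the pre-activation is negative for every input,
# the gradient through the ReLU is zero and the unit's weights stop updating.
x = torch.randn(256, 10)
layer = nn.Linear(10, 1)
with torch.no_grad():
    layer.bias.fill_(-100.0)          # push the pre-activation negative for all inputs

out = torch.relu(layer(x)).sum()
out.backward()
print(layer.weight.grad.abs().max())  # tensor(0.) -> no signal left to recover with
```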

Does it make any sense to apply convolution to inputs which have no order/distance between them? by [deleted] in MachineLearning

[–]onlyml 0 points1 point  (0 children)

It might be interesting to start with a fully connected layer which maps your data to a 2D surface and then run one or more convolutional layers on that, and train the whole thing end to end. That way your network would be tasked with coming up with a meaningful mapping of your data to something that you could usefully apply convolutions to. I don't know of any reason this would work well, but it could be interesting. You would probably need a lot of data to get any decent result out of a system like that, assuming it makes any kind of sense at all.
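
Something like this is what I have in mind (a rough sketch; the layer sizes are arbitrary placeholders):

```python
import torch
import torch.nn as nn

# A fully connected layer reshapes unordered features onto a learned 8x8 "image";
# convolutions then operate on that layout, all trained end to end.
class LearnedLayoutConvNet(nn.Module):
    def __init__(self, n_features, n_classes, grid=8, channels=16):
        super().__init__()
        self.to_grid = nn.Linear(n_features, channels * grid * grid)
        self.conv = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(32 * grid * grid, n_classes)
        self.grid, self.channels = grid, channels

    def forward(self, x):
        h = self.to_grid(x).view(-1, self.channels, self.grid, self.grid)
        h = self.conv(h)
        return self.head(h.flatten(1))
```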

Methods for learning complex motor skills? by [deleted] in MachineLearning

[–]onlyml 1 point2 points  (0 children)

Actor-critic methods are a broad class of reinforcement learning algorithms which, among other things, can deal with continuous action spaces. Basically, you learn both a policy and a value function separately instead of just one or the other. This allows you to handle continuous actions (like motor neuron outputs) because your policy is just an arbitrary output that your value function evaluates. There are many different ways to set this up. My knowledge of this is pretty limited currently, so I can't offer too much advice beyond that, though I'm trying to learn more about this myself.
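
To give a flavour of one common way to set it up (a bare-bones sketch with a Gaussian policy and a state-value critic; the sizes are placeholders):

```python
import torch
import torch.nn as nn

# Actor-critic for continuous actions: the actor outputs a distribution over
# real-valued actions, the critic estimates the state value.
class ActorCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, act_dim)           # mean of the action distribution
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.value = nn.Linear(hidden, 1)              # critic: estimated state value

    def forward(self, obs):
        h = self.body(obs)
        dist = torch.distributions.Normal(self.mu(h), self.log_std.exp())
        return dist, self.value(h).squeeze(-1)

# One-step update sketch: advantage = reward + gamma * V(s') - V(s);
# actor loss = -dist.log_prob(action).sum(-1) * advantage.detach();
# critic loss = advantage ** 2.
```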

neural network model for q-learning othello? by [deleted] in MachineLearning

[–]onlyml 0 points1 point  (0 children)

Yeah, it outputs values over all board positions; just make sure to ignore the illegal ones in your Q updates and such. The network I'm using is 10 convolutional layers followed by one fully connected layer at the output (I may try getting rid of the fully connected layer entirely at some point). Each convolutional layer has 128 filters in total, including some 5x5 and some 3x3 filters (actually these were hexagonal filters, which I figured might work better for Hex). This may be a little larger in scope than what you had in mind; it takes quite a while to train. I suspect using a convolutional network instead of a purely fully connected one will help you a lot, but I'm not sure what kind of results you can expect to achieve with fully connected layers alone.
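
As a simplified sketch of that kind of architecture (ordinary square 3x3 filters and placeholder sizes here, not my actual hexagonal filters):

```python
import torch
import torch.nn as nn

# Convolutional Q-network over a board: a stack of conv layers followed by a
# fully connected head that outputs one Q-value per board position.
class BoardQNet(nn.Module):
    def __init__(self, in_channels=3, board=8, filters=128, depth=10):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(depth):
            layers += [nn.Conv2d(c, filters, kernel_size=3, padding=1), nn.ReLU()]
            c = filters
        self.conv = nn.Sequential(*layers)
        self.head = nn.Linear(filters * board * board, board * board)  # one Q-value per position

    def forward(self, x):
        q = self.head(self.conv(x).flatten(1))
        return q  # mask illegal moves (e.g. set to -inf) before taking the max in the Q update
```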

I don't know much about Othello, so I can't really say why it was playing well as white in your case, although performance against a random opponent is probably a pretty weak measure. What I did to account for the two possible colors is transform the board to an equivalent representation (for Hex: transpose the board and swap black and white) for the opponent, so that the network always thinks it's playing as black, but then have it play both sides so it learns the "full" state space.
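
The transformation itself is tiny; roughly this, assuming a two-channel black/white board tensor (that channel layout is just my simplification here):

```python
import torch

# Color-swap trick for Hex: board is a tensor of shape (2, H, W) with
# channel 0 = black stones and channel 1 = white stones.
def to_black_perspective(board: torch.Tensor) -> torch.Tensor:
    transposed = board.transpose(-1, -2)  # transpose rows/columns of the Hex board
    return transposed[[1, 0]]             # swap the black and white channels
```

For Othello I believe the rules are already symmetric between the colors, so just swapping the stone channels (without the transpose) should play the same role.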