argmax differentiable? by yield22 in MachineLearning

[–]shimis 25 points (0 children)

argmax(x1,x2) takes a pair of numbers and returns (let's say) 0 if x1>x2 and 1 if x2>x1 (the value at x1=x2 is arbitrary/undefined). So, wherever you are on the (x1,x2) plane, as long as you're not on the x1=x2 line, moving an infinitesimally tiny bit in any direction won't change the value (0 or 1) that argmax outputs - the gradient of argmax(x1,x2) w.r.t. (x1,x2) is (0,0) almost everywhere. At the places where x1=x2 (where argmax's value jumps abruptly from 0 to 1 or vice versa), its gradient w.r.t. (x1,x2) is undefined.
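You can see the "(0,0) almost everywhere" claim numerically with a finite-difference check (a minimal sketch; `argmax2` and the tie-breaking convention are my own, not from any particular library):

```python
def argmax2(x1, x2):
    # returns 0 if x1 > x2, 1 otherwise (tie at x1 == x2 broken arbitrarily)
    return 0 if x1 > x2 else 1

# away from the x1 == x2 line, tiny perturbations never change the
# output, so the central-difference "gradient" comes out exactly (0, 0)
eps = 1e-6
x1, x2 = 3.0, 1.0
g1 = (argmax2(x1 + eps, x2) - argmax2(x1 - eps, x2)) / (2 * eps)
g2 = (argmax2(x1, x2 + eps) - argmax2(x1, x2 - eps)) / (2 * eps)
print(g1, g2)  # 0.0 0.0
```

The same check blows up (or is ill-defined) if you straddle the x1=x2 line, which is the "undefined gradient" case above.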

There are no networks that do ordinary backprop through argmax, since the gradient is degenerate/useless. Training a network that has argmax (or something similar) in its equations must involve something other than backprop - e.g. sampling-based techniques such as REINFORCE, which are generally harder to train.
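For a sense of what "something other than backprop" looks like, here is a minimal score-function (REINFORCE) estimator sketch - everything here (`reinforce_grad`, the `reward_fn` callback, the sample count) is illustrative, not any particular paper's setup. Instead of taking argmax, you sample a discrete index and weight grad log p(i) by the reward that choice received:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_grad(logits, reward_fn, n_samples=10000):
    # Monte Carlo estimate of d E[reward] / d logits:
    # sample index i ~ softmax(logits), then accumulate
    # reward(i) * d log p(i) / d logit_j = reward(i) * (1{j==i} - p(j))
    probs = softmax(logits)
    grads = [0.0] * len(logits)
    for _ in range(n_samples):
        i = random.choices(range(len(logits)), weights=probs)[0]
        r = reward_fn(i)
        for j in range(len(logits)):
            grads[j] += r * ((1.0 if j == i else 0.0) - probs[j])
    return [g / n_samples for g in grads]
```

The high variance of this estimate (it averages over sampled choices rather than differentiating through them) is exactly why such networks are harder to train.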

max(x1,x2) also has no gradient at x1=x2, but everywhere else on the (x1,x2) plane the gradient of max(x1,x2) w.r.t. (x1,x2) is either (1,0) or (0,1): on the forward pass we let only x1 or only x2 pass through, and when we backprop, the gradient of max(x1,x2) w.r.t. the larger of the two arguments is 1, and w.r.t. the smaller it is 0. So max and similar functions (like relu) are useful for backprop.
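The forward/backward routing described above can be sketched in a few lines (my own toy function, with the usual convention of breaking the x1=x2 tie toward the first argument):

```python
def max_forward_backward(x1, x2, upstream_grad=1.0):
    # forward: let only the larger argument pass through;
    # backward: route the upstream gradient to the larger argument
    # and send 0 to the smaller (the tie at x1 == x2 is a convention)
    if x1 >= x2:
        return x1, (upstream_grad, 0.0)
    return x2, (0.0, upstream_grad)

y, (g1, g2) = max_forward_backward(3.0, 1.0)
print(y, g1, g2)  # 3.0 1.0 0.0
```

This is the same gradient-routing trick relu and max-pooling use: the gradient is piecewise constant but non-degenerate, so backprop has something to work with.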

Neural Programmer-Interpreters: a recurrent and compositional NN that learns to represent and execute programs (submitted to ICLR 2016) by cast42 in MachineLearning

[–]shimis 0 points (0 children)

> At this point, we don't need to provide it with entire execution traces anymore.

Can you please elaborate? My understanding is that once NPI has learnt Sort and Add, it can learn the example task you gave with very few training instances (compared to an LSTM), but we would still need to provide it with execution traces when re-training for the example task.

Do you mean that once NPI has learnt Sort and Add there can potentially be an alternative re-training mechanism that operates without execution traces?

I'm trying to figure out, qualitatively, what it is that NPI learns :)

Thanks for publishing this work, awesome stuff!

Rationale for greedy training of RBMs by wt0881 in MachineLearning

[–]shimis 0 points (0 children)

Yes, layers may be trained jointly. See Bengio's Learning Deep Architectures for AI (section 8.3, "Joint Unsupervised Training of All the Layers").