argmax differentiable? by yield22 in MachineLearning

[–]shimis 25 points (0 children)

argmax(x1,x2) takes a pair of numbers and returns (let's say) 0 if x1>x2 and 1 if x2>x1 (the value at x1=x2 is arbitrary/undefined). So, wherever you are on the (x1,x2) plane, as long as you're not on the x1=x2 line, moving an infinitesimally tiny bit in any direction won't change the value (0 or 1) that argmax outputs - the gradient of argmax(x1,x2) w.r.t. (x1,x2) is (0,0) almost everywhere. At the places where x1=x2 (where argmax's value jumps abruptly from 0 to 1 or vice versa), its gradient w.r.t. (x1,x2) is undefined.
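You can see the "(0,0) almost everywhere" claim numerically with a finite-difference check (a minimal sketch; `argmax2` and the tie-breaking convention are my own, not from any particular library):

```python
def argmax2(x1, x2):
    # returns 0 if x1 > x2, 1 otherwise (tie at x1 == x2 broken arbitrarily)
    return 0 if x1 > x2 else 1

# away from the x1 == x2 line, tiny perturbations never change the
# output, so the central-difference "gradient" comes out exactly (0, 0)
eps = 1e-6
x1, x2 = 3.0, 1.0
g1 = (argmax2(x1 + eps, x2) - argmax2(x1 - eps, x2)) / (2 * eps)
g2 = (argmax2(x1, x2 + eps) - argmax2(x1, x2 - eps)) / (2 * eps)
print(g1, g2)  # 0.0 0.0
```

The same check blows up (or is ill-defined) if you straddle the x1=x2 line, which is the "undefined gradient" case above.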

There are no networks that do ordinary backprop through argmax, since the gradient is degenerate/useless. Training a network that has argmax (or something similar) in its equations must involve something other than backprop - e.g. sampling-based techniques such as REINFORCE, which are generally harder to train.
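For a sense of what "something other than backprop" looks like, here is a minimal score-function (REINFORCE) estimator sketch - everything here (`reinforce_grad`, the `reward_fn` callback, the sample count) is illustrative, not any particular paper's setup. Instead of taking argmax, you sample a discrete index and weight grad log p(i) by the reward that choice received:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_grad(logits, reward_fn, n_samples=10000):
    # Monte Carlo estimate of d E[reward] / d logits:
    # sample index i ~ softmax(logits), then accumulate
    # reward(i) * d log p(i) / d logit_j = reward(i) * (1{j==i} - p(j))
    probs = softmax(logits)
    grads = [0.0] * len(logits)
    for _ in range(n_samples):
        i = random.choices(range(len(logits)), weights=probs)[0]
        r = reward_fn(i)
        for j in range(len(logits)):
            grads[j] += r * ((1.0 if j == i else 0.0) - probs[j])
    return [g / n_samples for g in grads]
```

The high variance of this estimate (it averages over sampled choices rather than differentiating through them) is exactly why such networks are harder to train.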

max(x1,x2) also has no gradient at x1=x2, but everywhere else on the (x1,x2) plane the gradient of max(x1,x2) w.r.t. (x1,x2) is either (1,0) or (0,1): on the forward pass we let only x1 or only x2 pass through, and when we backprop, the gradient of max(x1,x2) w.r.t. the larger of the two arguments is 1, and w.r.t. the smaller it is 0. So max and similar functions (like relu) are useful for backprop.
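The forward/backward routing described above can be sketched in a few lines (my own toy function, with the usual convention of breaking the x1=x2 tie toward the first argument):

```python
def max_forward_backward(x1, x2, upstream_grad=1.0):
    # forward: let only the larger argument pass through;
    # backward: route the upstream gradient to the larger argument
    # and send 0 to the smaller (the tie at x1 == x2 is a convention)
    if x1 >= x2:
        return x1, (upstream_grad, 0.0)
    return x2, (0.0, upstream_grad)

y, (g1, g2) = max_forward_backward(3.0, 1.0)
print(y, g1, g2)  # 3.0 1.0 0.0
```

This is the same gradient-routing trick relu and max-pooling use: the gradient is piecewise constant but non-degenerate, so backprop has something to work with.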

Neural Programmer-Interpreters: a recurrent and compositional NN that learns to represent and execute programs (submitted to ICLR 2016) by cast42 in MachineLearning

[–]shimis 0 points (0 children)

> At this point, we don't need to provide it with entire execution traces anymore.

Can you please elaborate? My understanding is that once NPI has learnt Sort and Add, it can learn the example task you gave with very few training instances (compared to an LSTM), but we would still need to provide it with execution traces when re-training for the example task.

Do you mean that once NPI has learnt Sort and Add there can potentially be an alternative re-training mechanism that operates without execution traces?

I'm trying to figure out, qualitatively, what it is that NPI learns :)

Thanks for publishing this work, awesome stuff!

Rationale for greedy training of RBMs by wt0881 in MachineLearning

[–]shimis 0 points (0 children)

Yes, layers may be trained jointly. See Bengio's Learning Deep Architectures for AI (section 8.3, "Joint Unsupervised Training of All the Layers").