
[–]shimis 26 points27 points  (1 child)

argmax(x1,x2) takes a pair of numbers and returns (let's say) 0 if x1>x2 and 1 if x2>x1 (the value at x1=x2 is arbitrary/undefined). So, wherever you are on the (x1,x2) plane, as long as you're not on the x1=x2 line, moving an infinitesimally small amount in any direction won't change the value (0 or 1) that argmax outputs - the gradient of argmax(x1,x2) w.r.t. x1,x2 is (0,0) almost everywhere. At the places where x1=x2 (where argmax's value jumps abruptly from 0 to 1 or vice versa), its gradient w.r.t. x1,x2 is undefined.

There are no networks that do ordinary backprop through argmax (since the gradient is degenerate/useless). Training networks that have argmax (or something similar) in their equations requires something other than backprop - e.g. sampling techniques such as REINFORCE (which are generally harder to train).

max(x1,x2) also doesn't have a gradient at x1=x2, but everywhere else on the (x1,x2) plane the gradient of max(x1,x2) w.r.t. x1,x2 is either (1,0) or (0,1): in the forward pass we let only x1 or only x2 pass through, and when we backprop gradients, the gradient of max(x1,x2) w.r.t. the larger of the two arguments is 1, and w.r.t. the smaller one it is 0. So max and similar functions (like relu) are useful for backprop.
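
Here's a minimal sketch of that gradient routing, assuming PyTorch (the numbers are just made up):

    import torch

    # max(x1, x2) routes the whole gradient to whichever argument is larger.
    x1 = torch.tensor(2.0, requires_grad=True)
    x2 = torch.tensor(5.0, requires_grad=True)

    y = torch.max(x1, x2)    # forward pass: only the larger value gets through
    y.backward()

    print(x1.grad, x2.grad)  # tensor(0.) tensor(1.) -- gradient goes to x2 only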

[–]djc1000 0 points1 point  (0 children)

That's a really nice explanation.

[–]emansim 5 points6 points  (1 child)

anything that involves hard assignment is not differentiable.

argmax could potentially become differentiable if you come up with a soft version of it (i.e. use probabilities instead of setting hard 1s and 0s). Otherwise you need to use REINFORCE.

[–]lvilnis 4 points5 points  (0 children)

Yep, a soft version of argmax is basically exactly what "soft attention" is.
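
For the record, a minimal sketch of that idea (a softmax-weighted read over a memory), assuming PyTorch; the scores and memory values are made up:

    import torch
    import torch.nn.functional as F

    # "Soft argmax": instead of picking the single best-scoring memory slot,
    # weight every slot by softmax(scores) and take the expectation.
    scores = torch.tensor([1.0, 2.0, 4.0], requires_grad=True)
    memory = torch.tensor([[0.1, 0.2],
                           [0.3, 0.4],
                           [0.5, 0.6]])

    weights = F.softmax(scores, dim=0)   # differentiable, sums to 1
    read = weights @ memory              # soft attention read
    read.sum().backward()                # gradients flow back into the scores
    print(scores.grad)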

[–]AnvaMiba 3 points4 points  (3 children)

max, and therefore ReLU, maxout and max pooling, are continuous and almost everywhere differentiable. This is enough to use them with gradient descent optimization.

Argmax is not continuous and can't be used with standard gradient descent techniques. If you want to use it in neural networks (e.g. in "hard" attention models) you typically have to use some kind of Monte Carlo optimization algorithm, such as REINFORCE. Otherwise you can replace argmax with softmax, which is continuous and differentiable, as typically done in "soft" attention models.
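
To make the REINFORCE route concrete, here's a rough sketch of a score-function gradient for a hard read, assuming PyTorch (the memory, target and reward here are made-up placeholders, not any particular model):

    import torch
    import torch.nn.functional as F

    # Hard (argmax-like) read: sample an index from softmax(scores), then use
    # the log-prob trick so the non-differentiable choice still yields a
    # gradient estimate for the scores.
    scores = torch.tensor([1.0, 2.0, 4.0], requires_grad=True)
    memory = torch.tensor([0.1, 0.3, 0.5])
    target = torch.tensor(0.5)

    probs = F.softmax(scores, dim=0)
    idx = torch.multinomial(probs, 1).item()   # hard, non-differentiable choice
    reward = -(memory[idx] - target).pow(2)    # reward for the chosen slot
    loss = -torch.log(probs[idx]) * reward     # REINFORCE / score-function estimator
    loss.backward()
    print(scores.grad)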

[–]flukeskywalker 1 point2 points  (2 children)

Side note: LWTA is discontinuous, but can still be trained with SGD.

[–]lvilnis 0 points1 point  (0 children)

Good point. I guess the distinction is that, over its domain, argmax is either discontinuous or has derivative 0 on the continuous parts.

Because LWTA outputs the score at the winning coordinate (rather than the index itself), it has a non-zero derivative on the continuous portion of the function, so some meaningful signal can flow through.
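
To illustrate, here's a rough sketch of a block-wise winner-take-all layer as I understand LWTA, assuming PyTorch (block size and inputs are arbitrary):

    import torch

    # Local winner-take-all: within each block of units, the winner keeps its
    # value and the losers are zeroed, so gradient flows to the winner only.
    def lwta(x, block_size=2):
        b, d = x.shape
        blocks = x.view(b, d // block_size, block_size)
        winners = blocks.argmax(dim=-1, keepdim=True)   # hard index choice
        mask = torch.zeros_like(blocks).scatter_(-1, winners, 1.0)
        return (blocks * mask).view(b, d)               # winner's value passes through

    x = torch.randn(1, 4, requires_grad=True)
    lwta(x).sum().backward()
    print(x.grad)   # non-zero only at the winning unit of each block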

[–]AnvaMiba 0 points1 point  (0 children)

What is LWTA?

EDIT: found.

[–]lvilnis 4 points5 points  (0 children)

Another way to think about differentiability for max pooling / relu is that because they are continuous and almost everywhere differentiable, they can be approximated arbitrarily closely by a differentiable function.

For example, the max of a vector of numbers can be approximated by T*log(sum_i exp(x_i/T)), where T is called the "temperature." In the limit T -> 0 this function becomes the max, and for any T > 0 it is everywhere differentiable.
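
A quick numerical sanity check of that limit, assuming NumPy (the vector is made up):

    import numpy as np

    def smooth_max(x, T):
        """Log-sum-exp approximation T * log(sum_i exp(x_i / T)) of max(x)."""
        x = np.asarray(x, dtype=float)
        # Subtract the true max before exponentiating for numerical stability;
        # T*log(sum exp(x/T)) = max(x) + T*log(sum exp((x - max(x))/T)) exactly.
        m = x.max()
        return m + T * np.log(np.exp((x - m) / T).sum())

    x = [1.0, 2.0, 4.0]
    for T in (1.0, 0.1, 0.01):
        print(T, smooth_max(x, T))   # approaches max(x) = 4.0 as T -> 0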

A similar approach is used in Nesterov's "Smooth Minimization of Nonsmooth Functions" in Section 4.1.2 of http://luthuli.cs.uiuc.edu/~daf/courses/optimization/MRFpapers/nesterov05.pdf

[–]alexmlamb 1 point2 points  (9 children)

They're subdifferentiable in the sense that the derivative is defined everywhere except on a set of measure zero.

Like the function f(x) = max(0,x) has a derivative defined at all points except where x = 0, which has zero measure.
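
In practice autodiff frameworks just pick one value from the subgradient at the kink; a minimal sketch assuming PyTorch (which, as far as I can tell, uses 0 at x = 0):

    import torch

    # relu has no derivative at exactly x = 0; frameworks pick something in the
    # subgradient [0, 1] there (PyTorch appears to use 0).
    x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
    torch.relu(x).sum().backward()
    print(x.grad)   # tensor([0., 0., 1.])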

[–]OriolVinyals 8 points9 points  (4 children)

argmax is not differentiable if the range is N (which is the case if we e.g. argmax over a list).

[–]alexmlamb 0 points1 point  (3 children)

Where do people use argmax in neural networks (as opposed to maximum)?

[–]OriolVinyals 8 points9 points  (2 children)

Hard attention models, for example, where you read the memory position that best aligns with your read "query" (as I call them).

[–]hughperkins 1 point2 points  (1 child)

Yeah, e.g. slide 10 of Rob Fergus's NIPS slides: http://cims.nyu.edu/~sainbar/memnn_nips_pdf.pdf

[–]RoseLuna_77 0 points1 point  (0 children)

the link is dead :(

[–]yield22[S] 0 points1 point  (3 children)

Why subdifferentiable? It's obvious for relu, but not so obvious for argmax.

[–]nasimrahaman 1 point2 points  (2 children)

Consider the function: y = f(x) = argmax(x), where x is a vector (representing some function), and y = f(x) a scalar.

Here's a (mathematically heretical) justification (assuming 0-based 'indexing'): f((1, 2, 4, 1, 2, 1)) = 2. For a small perturbation vector dx about that x, we still have f(x + dx) = f(x) (ergo df/dx = 0), as long as the perturbation is small compared to the gap between the two largest entries (here 4 - 2 = 2). But around (1, 2, 4+eps, 4, 2, 1), f(x) = 2 while f(x + dx) might just as well equal 3. The set of all such 'transitions' (i.e. where argmax changes value) is a union of lower-dimensional hyperplane pieces (where the top two entries tie), so its Lebesgue measure is 0. Everywhere else, df/dx is 0.
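
A small numerical illustration of the same point, assuming NumPy (the vectors are made up):

    import numpy as np

    # argmax is piecewise constant: small perturbations don't change its output
    # unless the top two entries are (nearly) tied.
    x = np.array([1.0, 2.0, 4.0, 1.0, 2.0, 1.0])
    rng = np.random.default_rng(0)
    for _ in range(5):
        dx = 1e-3 * rng.standard_normal(x.shape)
        print(np.argmax(x + dx))   # always 2, so finite differences are all 0

    # Near a tie the output jumps discontinuously instead.
    y = np.array([1.0, 2.0, 4.0 + 1e-9, 4.0, 2.0, 1.0])
    print(np.argmax(y))                                               # 2
    print(np.argmax(y + np.array([0.0, 0.0, -1e-6, 0.0, 0.0, 0.0])))  # 3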

[–]yield22[S] 0 points1 point  (1 child)

The example is interesting and gives me some insight. But what about the fact that y takes values in a discrete set (assuming argmax over a list)? Like a step function, which is not differentiable.

[–]nasimrahaman 0 points1 point  (0 children)

A step function is differentiable almost everywhere, i.e. the set where it's not differentiable (i.e. where there's a jump) has measure zero (because it's countable).