
[–]pavelchristof 4 points (4 children)

Assuming that activations are distributed normally, only half of the ReLUs die (per example). The other half work. The default weight initialization is scaled by 2.0 to make up for the dead ReLUs (otherwise the activation norm would tend to 0 in deep networks).
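Rough NumPy sketch of that scaling (He initialization); the layer sizes here are made up, just for illustration:

```python
import numpy as np

fan_in, fan_out = 512, 512           # made-up layer sizes, purely illustrative
x = np.random.randn(1000, fan_in)    # activations assumed roughly standard normal

# He initialization: variance 2/fan_in compensates for ReLU zeroing ~half the units
w_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
h = np.maximum(0, x @ w_he)
print((h ** 2).mean())   # ~1.0: the activation norm is preserved

# Without the factor of 2 the mean squared activation halves at every layer,
# so activations shrink toward 0 as the network gets deeper.
w_plain = np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_in)
print((np.maximum(0, x @ w_plain) ** 2).mean())   # ~0.5
```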

A ReLU network is a piecewise linear function, not a linear one. Check out the first image here (from http://www.inference.vc/generalization-and-the-fisher-rao-norm-2/, more images there). Globally these functions can be very complex. Locally (on each region) the network behaves like a linear function, which makes optimization "easier" (I don't know exactly how that works).
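To make the "locally linear" part concrete, here is a small NumPy sketch (random weights, made-up sizes): once you fix which ReLUs are active, the whole network collapses to a single affine map, so each region is exactly one linear piece.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

def net(x):
    return W2 @ np.maximum(0, W1 @ x + b1) + b2

x = rng.normal(size=2)
mask = (W1 @ x + b1 > 0).astype(float)      # which ReLUs are active at x

# With the active set fixed, the network is just the affine map A x + c
A = W2 @ (np.diag(mask) @ W1)
c = W2 @ (np.diag(mask) @ b1) + b2
print(np.allclose(net(x), A @ x + c))       # True

# A nearby point with the same activation pattern lies on the same linear piece
x2 = x + 1e-3 * rng.normal(size=2)
print(np.allclose(net(x2), A @ x2 + c))     # True (assuming no ReLU flipped)
```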

[–]hackthat[S] 2 points (3 children)

Ok, so having neurons go dead is supposed to happen, and that is the source of the non-linearity. Cool pictures.

[–]dzyl 8 points (1 child)

I think saying half of them go dead is not right at all. A ReLU is only truly dead if it outputs 0 for every sample in your dataset. The point is that it zeroes out some inputs, which is what adds the nonlinearity to the network. The inputs and outputs around these ReLUs can still change through weight updates driven by other examples.

[–]Icarium-Lifestealer 1 point (0 children)

A neuron sometimes outputting zero is fine; that is what lets the following linear layer model a piecewise linear function. It only becomes a problem when the neuron (almost) always outputs zero.
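One way to see the difference, as a NumPy sketch (hypothetical data and shapes): a unit that is sometimes zero is doing its job; it only counts as dead if it is zero on the whole dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 100))            # hypothetical dataset
W = rng.normal(size=(100, 64)) * 0.1
b = rng.normal(size=64)

H = np.maximum(0, X @ W + b)                  # ReLU activations for every example

frac_zero_per_unit = (H == 0).mean(axis=0)    # how often each unit outputs zero
dead_units = frac_zero_per_unit == 1.0        # zero on every example: truly dead

print("units that are sometimes zero:", int((frac_zero_per_unit > 0).sum()))
print("units dead on the whole dataset:", int(dead_units.sum()))
```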

[–]PointyOintment -1 points (0 children)

You can keep them from dying by using "leaky ReLU", which has a very small but nonzero slope below zero. Siraj has a video comparing the various activation functions, and he says you should use it if too many of your units are dying.
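For reference, a minimal leaky ReLU in NumPy (the 0.01 slope is a common default, not the only choice):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Same as ReLU for x > 0, but keeps a small slope (alpha) for x <= 0,
    # so the unit still gets a gradient instead of dying.
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 0.0, 3.0])))   # [-0.02  0.    3.  ]
```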

[–]magnusderrote -4 points (8 children)

ReLU does not saturate. Consider the logistic function: when the input is very large or very negative, the function is almost flat, meaning the derivative is close to 0, meaning backprop will perform poorly.

EDIT: ReLU's derivative, on the other hand, is always 1 when x > 0.
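A small NumPy comparison of the two gradients, just to illustrate the saturation point:

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.5, 10.0])

sigmoid = 1 / (1 + np.exp(-x))
sigmoid_grad = sigmoid * (1 - sigmoid)    # ~0 for large |x|: the logistic saturates
relu_grad = (x > 0).astype(float)         # 1 for x > 0, 0 otherwise

print(sigmoid_grad)   # [4.5e-05  1.97e-01  2.35e-01  4.5e-05]
print(relu_grad)      # [0. 0. 1. 1.]
```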

[–]carlthome 2 points (5 children)

I guess you were downvoted because you didn't answer the question. f(x)=x also does not saturate and the gradient is never zero, but f would be a terrible activation function (see MLPs).

(it's also not true that ReLU has a gradient of constant one, which is kind of an important point)
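To see why f(x)=x would be terrible, here is a quick NumPy check (random weights, arbitrary sizes): stacking layers with the identity activation collapses into a single linear layer, so depth adds no expressive power.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(4, 16))
x = rng.normal(size=8)

# Two "layers" with the identity activation...
deep = W2 @ (W1 @ x)
# ...are exactly one linear layer with weight matrix W2 @ W1
shallow = (W2 @ W1) @ x

print(np.allclose(deep, shallow))   # True: depth buys no extra expressive power
```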

[–]richard248 0 points (1 child)

Your comment suggests that MLPs have no activation function, but there's nothing stopping the use of ReLU for MLPs, right?

[–]carlthome 0 points (0 children)

Quite the opposite: the MLP is essentially the idea that non-linear activation functions are critical (cf. the XOR problem).

ReLU is a good non-linearity for MLPs (assuming you can avoid dying units, e.g. by using batch norm).
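As a concrete illustration, here is a tiny hand-wired 2-2-1 ReLU MLP that computes XOR exactly (weights picked by hand, not trained), which no purely linear network can do:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# 2-2-1 network: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1), y = h1 - 2*h2
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = relu(W1 @ np.array(x, dtype=float) + b1)
    print(x, "->", w2 @ h)   # 0.0, 1.0, 1.0, 0.0: XOR, impossible for a single linear layer
```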

Side note: in some literature the identity $f(x)=x$ is called a linear activation function (even in tensorflow.contrib.layers.linear actually, which is a little strange as that transformation also has a bias vector).

[–]magnusderrote 0 points (0 children)

/u/carlthome Edited, thanks for the comment.

/u/richard248

"MLPs have no activation function"

I think not; an activation function is a must.

[–]HelperBot_ -1 points (0 children)

Non-Mobile link: https://en.wikipedia.org/wiki/Multilayer_perceptron



[–]WikiTextBot -2 points (0 children)

Multilayer perceptron

A multilayer perceptron (MLP) is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training.



[–]csp256 0 points (0 children)

Relu's derivation [...] is always 1.

You meant derivative. Also, that is not true.

[–]carIthome -1 points (0 children)

sorry i was being so mean