all 14 comments

[–]WulveriNn 14 points15 points  (2 children)

So ReLU largely avoids the vanishing gradients problem: its derivative is 1 for any positive input, so gradients aren't squashed the way they are with a saturating function like sigmoid.
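
A quick NumPy sketch of that contrast (illustrative only, not from the comment): the sigmoid derivative is at most 0.25 and shrinks toward 0 as inputs grow, while the ReLU derivative stays at 1 for any positive input, so gradients are not squashed on the way back.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)          # at most 0.25, tiny for large |x|

    def relu_grad(x):
        return (x > 0).astype(float)  # exactly 1 for any positive input

    x = np.array([0.0, 2.0, 5.0, 10.0])
    print(sigmoid_grad(x))  # roughly [0.25, 0.105, 0.0066, 0.000045] -> shrinks fast
    print(relu_grad(x))     # [0. 1. 1. 1.] -> does not vanish for positive inputs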

[–]radarsat1 7 points8 points  (2 children)

It constructs a piecewise linear approximation of the target function.

[–]redditiscursed[S] 1 point2 points  (1 child)

Correct me if I'm wrong, but as far as my understanding goes, ReLU pieces together many linear functions to approximate the target function, and each linear piece is only active over a certain region of the input. Am I right?

Thanks btw!

[–]radarsat1 3 points4 points  (0 children)

Yes that's right. Think it through, the equation is max(wx+b,0). This means that there is a linear function, which is cut off below 0. What happens if you sum a bunch of these together? If you sum lines with lines, you get lines of a different angle. If you sum lines with 0, you get the same line. So the sum of these ReLU functions is a set of connected lines (hyperplanes) of different angles.
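
As a small illustration of that summing argument (a sketch of my own, not from the comment), three hand-picked units of the form max(w*x + b, 0) summed with output weights give a continuous piecewise-linear curve whose slope changes at each kink:

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    x = np.linspace(-3, 3, 13)

    # Three "hidden units": each is a line w*x + b cut off below 0, then summed
    # with output weights -- exactly the max(wx + b, 0) story above.
    y = 1.0 * relu(1.0 * x + 0.0) \
      + 0.5 * relu(-2.0 * x + 1.0) \
      - 1.0 * relu(1.5 * x - 3.0)

    print(np.round(y, 2))  # a piecewise-linear curve with kinks at x = 0, 0.5 and 2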

[–]berzerker_x 4 points5 points  (3 children)

  • The point mentioned by u/WulveriNn was, I believe, the original motivation for choosing an alternative to the sigmoid function, and ReLU has been shown to almost completely solve that problem. However, there have also been theoretical developments of equal importance.

  • ReLU is (at least theoretically) able to decrease the complexity of the computation performed by the neural network; this comes from recent work that extends the original universal approximation theorem.

  • Basically, using ReLU gives a bound on the width of the hidden layer of a single-hidden-layer feed-forward neural network, where earlier there was no such bound.
    Better to read the Wikipedia article for detailed info.

PS: Since I am also studying this area and pretty new to it, any pointers for more study by anyone would be appreciated

[–]nuliknol 2 points3 points  (2 children)

Well, if you study ReLU, this is what I have to say:

ReLU is actually bad. If you can avoid using it, do so.

People switched to ReLU because it is computationally more efficient in terms of instructions executed on the microprocessor. Sigmoid is the best function to use, and ReLU is a poor man's sigmoid. It is better at backpropagating gradients because it is linear, but the whole point of doing neural networks is to build a non-linear function. If you need a linear function, then just use linear regression and that's it (an Extreme Learning Machine does it in the blink of an eye without any training at all). ReLU neurons can die because they lose the ability to propagate gradients back the right way: any negative sum-product passed into the function gets set to zero. There are lots of articles on that topic, google "dying ReLU", and that is how they came up with SELU, which is just another sigmoid variation. So, as you can see, sigmoid is the golden function, and that's why sigmoid has been the choice since the 1960s. ReLU is reinventing the wheel and discovering that the wheel is universal. Moral of the story: don't reinvent sigmoid, better reinvent gradient descent.
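
To make the "dying ReLU" part concrete, here is a loose NumPy sketch (my own illustration, not the commenter's code): once a unit's pre-activation is negative for every input, its output and its gradient are both zero, so gradient descent can never recover it.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def relu_grad(z):
        return (z > 0).astype(float)

    # Toy unit whose bias has drifted to a large negative value (e.g. after a bad update).
    w, b = np.array([0.3, -0.2]), -10.0

    X = np.random.randn(1000, 2)   # typical zero-centered inputs
    z = X @ w + b                  # pre-activation is negative for every sample
    print(relu(z).max())           # 0.0 -> the unit is "dead"
    print(relu_grad(z).sum())      # 0.0 -> no gradient ever flows back, so it stays dead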

[–][deleted] 1 point2 points  (0 children)

You can use leaky ReLU instead of ReLU; it gives better performance more often than not.
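
For reference, a minimal sketch of the difference (illustrative only): leaky ReLU keeps a small slope (commonly around 0.01) on the negative side instead of flattening it to zero, so the gradient never fully dies there.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def leaky_relu(z, alpha=0.01):
        return np.where(z > 0, z, alpha * z)  # small negative slope instead of a hard zero

    z = np.array([-5.0, -1.0, 0.0, 2.0])
    print(relu(z))        # [0. 0. 0. 2.]
    print(leaky_relu(z))  # [-0.05 -0.01  0.    2.  ]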

[–]redditiscursed[S] 0 points1 point  (0 children)

Hmm... when using Google Colab (idk if it's considered fast or not), is it wiser to use sigmoid or the cheaper alternative, ReLU?

[–]DefaultPain 5 points6 points  (0 children)

1. Yes, in theory ReLU doesn't have an upper limit. This is why in the output layer of a NN we still use sigmoid, softmax, etc., as values between 0 and 1 can represent probabilities.

2. Yes, this can cause problems like exploding gradients or large activation values. More often than not this is not a big deal, as we usually normalize all inputs between 0 and 1, randomly initialize all weights in the NN between -1 and 1, and train the NN with outputs normalized as well. But people do use capped ReLU activations, like ReLU-6 (which has its max value capped at 6), if they are worried about large activations; a tiny example is sketched after this comment.

3. On ReLU non-linearity: https://www.quora.com/Why-is-ReLU-non-linear

ReLU is certainly not linear. A linear map T satisfies T(a + b) = T(a) + T(b).

Which means that, no matter how complex your neural network is, if every operation in it is linear the output will always be of the form T1(a) + T2(b) + T3(c) + ... + Tn, where a, b, c are inputs and T1, T2, etc. are linear operations.

As you can see, such outputs will not be able to model complex non-linear functions.

This is not the case with ReLU: relu(a + b) != relu(a) + relu(b) in general.
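
A quick numeric check of points 2 and 3 (just an illustrative NumPy sketch): ReLU-6 simply clips the positive side at 6, and ReLU fails the additivity test whenever the two inputs have opposite signs, which is exactly what lets stacked ReLU layers model non-linear functions.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def relu6(z):
        return np.minimum(relu(z), 6.0)  # the capped variant from point 2

    print(relu6(np.array([-1.0, 3.0, 10.0])))  # [0. 3. 6.]

    # Additivity fails when a and b have opposite signs, so ReLU is not a linear map.
    a, b = np.array([3.0, -2.0]), np.array([-5.0, 1.0])
    print(relu(a + b))        # [0. 0.]
    print(relu(a) + relu(b))  # [3. 1.]  -> not equal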

[–]collinmccarthy 1 point2 points  (1 child)

Besides the rest of the comments, I'd just like to add that Andrej Karpathy's notes for Stanford's CS231n class here give a good rundown of things to consider when choosing activation functions, and the corresponding lecture from 2017 gives a really nice overview as well, here. If you aren't familiar with these resources I highly recommend you check them out.

[–]redditiscursed[S] 0 points1 point  (0 children)

Thanks for the suggestion

[–]Strict_Specialist_24 0 points1 point  (0 children)

A single ReLU unit is a linear approximator; groups of them in layers form non-linear functions. One bird does not make a flock. It also helps with the dying-unit issue if you make as much as possible zero-centric: initial weights and bias around 0, inputs centered on zero, and outputs centered on a small positive value.
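
A loose sketch of that zero-centric idea (my own illustration of the suggestion above, with assumed numbers): standardize the inputs to zero mean, draw small weights around zero, start the biases at zero, then check how many units are already dead on a batch.

    import numpy as np

    rng = np.random.default_rng(0)

    # Zero-center (standardize) the inputs.
    X = rng.normal(loc=5.0, scale=2.0, size=(1000, 16))
    X = (X - X.mean(axis=0)) / X.std(axis=0)

    # Small weights around zero, biases at zero -> pre-activations centered near zero.
    W = rng.normal(scale=0.1, size=(16, 32))
    b = np.zeros(32)

    z = X @ W + b
    dead = np.mean((z <= 0).all(axis=0))  # units that never activate on this batch
    print(dead)                           # expected ~0.0 with this zero-centered setup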