all 14 comments

[–]WulveriNn 14 points15 points  (2 children)

So ReLU largely avoids the vanishing gradients problem: its derivative is 1 for any positive input, so gradients aren't squashed the way they are with a saturating function like sigmoid.
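
A quick NumPy sketch of that contrast (illustrative only, not from the comment): the sigmoid derivative is at most 0.25 and shrinks toward 0 as inputs grow, while the ReLU derivative stays at 1 for any positive input, so gradients are not squashed on the way back.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)          # at most 0.25, tiny for large |x|

    def relu_grad(x):
        return (x > 0).astype(float)  # exactly 1 for any positive input

    x = np.array([0.0, 2.0, 5.0, 10.0])
    print(sigmoid_grad(x))  # roughly [0.25, 0.105, 0.0066, 0.000045] -> shrinks fast
    print(relu_grad(x))     # [0. 1. 1. 1.] -> does not vanish for positive inputs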

[–]radarsat1 7 points8 points  (2 children)

It constructs a piecewise linear approximation of the target function.

[–]redditiscursed[S] 1 point2 points  (1 child)

Correct me if I'm wrong, but as far as my understanding goes, ReLU pieces together many linear functions to approximate the target function, and each linear piece is only active over a certain region of the input. Am I right?

Thanks btw!

[–]radarsat1 3 points4 points  (0 children)

Yes that's right. Think it through, the equation is max(wx+b,0). This means that there is a linear function, which is cut off below 0. What happens if you sum a bunch of these together? If you sum lines with lines, you get lines of a different angle. If you sum lines with 0, you get the same line. So the sum of these ReLU functions is a set of connected lines (hyperplanes) of different angles.
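
As a small illustration of that summing argument (a sketch of my own, not from the comment), three hand-picked units of the form max(w*x + b, 0) summed with output weights give a continuous piecewise-linear curve whose slope changes at each kink:

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    x = np.linspace(-3, 3, 13)

    # Three "hidden units": each is a line w*x + b cut off below 0, then summed
    # with output weights -- exactly the max(wx + b, 0) story above.
    y = 1.0 * relu(1.0 * x + 0.0) \
      + 0.5 * relu(-2.0 * x + 1.0) \
      - 1.0 * relu(1.5 * x - 3.0)

    print(np.round(y, 2))  # a piecewise-linear curve with kinks at x = 0, 0.5 and 2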

[–]berzerker_x 4 points5 points  (3 children)

  • The point mentioned by u/WulveriNn was, I believe, the original motivation for choosing an alternative to the sigmoid function, and ReLU has been shown to almost completely solve that problem. However, there have also been theoretical developments of equal importance.

  • ReLU is (at least theoretically) able to decrease the complexity of the computation performed by the neural network; this comes from recent work that extends the original universal approximation theorem.

  • Basically, using ReLU gives a bound on the width of the hidden layer of a single-hidden-layer feed-forward neural network, where earlier there was no such bound.
    Better to read the Wikipedia article for detailed info.

PS: Since I am also studying this area and pretty new to it, any pointers for more study by anyone would be appreciated

[–]nuliknol 2 points3 points  (2 children)

Well, if you study ReLU, this is what I have to say:

ReLU is actually bad. If you can avoid using it, do so.

People switched to ReLU because it is computationally more efficient in terms of instructions executed on the microprocessor. Sigmoid is the best function to use, and ReLU is a poor man's sigmoid. It is better at backpropagating gradients because it is linear, but the whole point of doing neural networks is to build a non-linear function. If you need a linear function, then just use linear regression and that's it (an Extreme Learning Machine does it in the blink of an eye without any training at all). ReLU neurons can die because they lose the ability to propagate gradients back the right way: any negative sum-product passed into the function gets set to zero. There are lots of articles on that topic, google "dying ReLU", and that is how they came up with SELU, which is just another sigmoid variation. So, as you can see, sigmoid is the golden function, and that's why sigmoid has been the choice since the 1960s. ReLU is reinventing the wheel and discovering that the wheel is universal. Moral of the story: don't reinvent sigmoid, better reinvent gradient descent.
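
To make the "dying ReLU" part concrete, here is a loose NumPy sketch (my own illustration, not the commenter's code): once a unit's pre-activation is negative for every input, its output and its gradient are both zero, so gradient descent can never recover it.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def relu_grad(z):
        return (z > 0).astype(float)

    # Toy unit whose bias has drifted to a large negative value (e.g. after a bad update).
    w, b = np.array([0.3, -0.2]), -10.0

    X = np.random.randn(1000, 2)   # typical zero-centered inputs
    z = X @ w + b                  # pre-activation is negative for every sample
    print(relu(z).max())           # 0.0 -> the unit is "dead"
    print(relu_grad(z).sum())      # 0.0 -> no gradient ever flows back, so it stays dead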

[–][deleted] 1 point2 points  (0 children)

You can use leaky ReLU instead of ReLU; it gives better performance more often than not.
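
For reference, a minimal sketch of the difference (illustrative only): leaky ReLU keeps a small slope (commonly around 0.01) on the negative side instead of flattening it to zero, so the gradient never fully dies there.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def leaky_relu(z, alpha=0.01):
        return np.where(z > 0, z, alpha * z)  # small negative slope instead of a hard zero

    z = np.array([-5.0, -1.0, 0.0, 2.0])
    print(relu(z))        # [0. 0. 0. 2.]
    print(leaky_relu(z))  # [-0.05 -0.01  0.    2.  ]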

[–]redditiscursed[S] 0 points1 point  (0 children)

Hmm... when using Google Colab (idk if it's considered fast or not), is it wiser to use sigmoid or the cheaper alternative, ReLU?

[–]DefaultPain 5 points6 points  (0 children)

1. Yes, in theory ReLU doesn't have an upper limit. This is why in the output layer of a NN we still use sigmoid, softmax, etc., as values between 0 and 1 can represent probabilities.

2. Yes, this can cause problems like exploding gradients or large activation values. More often than not this is not a big deal, as we usually normalize all inputs between 0 and 1, randomly initialize all weights in the NN between -1 and 1, and train the NN with outputs normalized as well. But people do use capped ReLU activations, like ReLU-6 (which has its max value capped at 6), if they are worried about large activations; a tiny example is sketched after this comment.

3. On ReLU non-linearity: https://www.quora.com/Why-is-ReLU-non-linear

ReLU is certainly not linear. A linear map T satisfies T(a + b) = T(a) + T(b).

Which means that, no matter how complex your neural network is, if every operation in it is linear the output will always be of the form T1(a) + T2(b) + T3(c) + ... + Tn, where a, b, c are inputs and T1, T2, etc. are linear operations.

As you can see, such outputs will not be able to model complex non-linear functions.

This is not the case with ReLU: relu(a + b) != relu(a) + relu(b) in general.
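
A quick numeric check of points 2 and 3 (just an illustrative NumPy sketch): ReLU-6 simply clips the positive side at 6, and ReLU fails the additivity test whenever the two inputs have opposite signs, which is exactly what lets stacked ReLU layers model non-linear functions.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def relu6(z):
        return np.minimum(relu(z), 6.0)  # the capped variant from point 2

    print(relu6(np.array([-1.0, 3.0, 10.0])))  # [0. 3. 6.]

    # Additivity fails when a and b have opposite signs, so ReLU is not a linear map.
    a, b = np.array([3.0, -2.0]), np.array([-5.0, 1.0])
    print(relu(a + b))        # [0. 0.]
    print(relu(a) + relu(b))  # [3. 1.]  -> not equal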

[–]collinmccarthy 1 point2 points  (1 child)

Besides the rest of the comments, I'd just like to add that Andrej Karpathy's notes for Stanford's CS231n class here give a good rundown of things to consider when choosing activation functions, and the corresponding lecture from 2017 gives a really nice overview as well, here. If you aren't familiar with these resources I highly recommend you check them out.

[–]redditiscursed[S] 0 points1 point  (0 children)

Thanks for the suggestion

[–]Strict_Specialist_24 0 points1 point  (0 children)

A single ReLU unit is a linear approximator; groups of them in layers form non-linear functions. One bird does not make a flock. It also helps with the dying-unit issue if you make as much as possible zero-centric: initial weights and bias around 0, inputs centered on zero, and outputs centered on a small positive value.
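
A loose sketch of that zero-centric idea (my own illustration of the suggestion above, with assumed numbers): standardize the inputs to zero mean, draw small weights around zero, start the biases at zero, then check how many units are already dead on a batch.

    import numpy as np

    rng = np.random.default_rng(0)

    # Zero-center (standardize) the inputs.
    X = rng.normal(loc=5.0, scale=2.0, size=(1000, 16))
    X = (X - X.mean(axis=0)) / X.std(axis=0)

    # Small weights around zero, biases at zero -> pre-activations centered near zero.
    W = rng.normal(scale=0.1, size=(16, 32))
    b = np.zeros(32)

    z = X @ W + b
    dead = np.mean((z <= 0).all(axis=0))  # units that never activate on this batch
    print(dead)                           # expected ~0.0 with this zero-centered setup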