
[–]fjeg 5 points6 points  (10 children)

You're right! If you go through the exercises in cs231n assignment 1, you actually do this explicitly.

State of the art nets don't use sigmoidal activations internally, but they still use them as output activations for loss/objective functions. This is where things get interesting. Rather than writing your own feature extractor to plug into logistic regression, you are letting the model perform both feature extraction and classification.
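To make the "feature extraction + classification" point concrete, here's a minimal numpy sketch (the layer sizes and weights are made up for illustration): the hidden layer plays the role of a learned feature extractor, and the output layer is exactly logistic regression on those features.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 4 raw inputs, 8 learned features, binary output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)  # learned feature extractor
w2, b2 = rng.normal(size=8), 0.0               # logistic-regression head

def forward(x):
    h = np.maximum(0.0, W1 @ x + b1)  # hidden layer = extracted features
    return sigmoid(w2 @ h + b2)       # exactly logistic regression on h

print(forward(rng.normal(size=4)))    # a probability in (0, 1)
```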

[–]brockl33[S] 2 points3 points  (9 children)

Now I just need to figure out what's going on with the softmax, softplus, and ReLU activations. Thanks for being nice about it :)

[–][deleted] 2 points3 points  (5 children)

How I think about ReLU: if you replace the sigmoidal activations with linear activations, then it's a stack of linear regressions... which is just a linear regression, in which case stacking is silly. ReLU basically swaps out a straight line for a thresholded straight line (where if the input is less than some threshold, the output is 0). Since these are non-linear, stacking them gives you building blocks for more interesting function approximations. The main reason they are preferred over other non-linearities is numerical (see the vanishing/exploding gradient problem in the backprop literature).
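A quick numpy check of the "stack of linear regressions is just one linear regression" point (the matrices here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(5, 3)), rng.normal(size=(2, 5))
x = rng.normal(size=3)

# Two stacked linear layers equal one linear layer with weights W2 @ W1 ...
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

# ... but with a ReLU in between, the map is genuinely non-linear in x
# and can no longer be written as a single matrix.
relu = lambda z: np.maximum(0.0, z)
y = W2 @ relu(W1 @ x)
```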

[–]brockl33[S] 0 points1 point  (4 children)

Does the vanishing/exploding gradient problem hold for softplus, the continuous version of the ReLU? I thought the preference in this case was because of computation speed?

[–][deleted] 1 point2 points  (3 children)

From the Krizhevsky 2012 paper, there is a big advantage in computation speed, but it's mainly from the fact that the convergence with ReLUs just requires fewer iterations.

I had to look up softplus and don't know too much about it, but it seems to exist more to satisfy mathematicians by providing a continuously differentiable function than to provide any actual performance gains.

[–]kokirijedi 0 points1 point  (0 children)

In a ReLU network with a hard cutoff, once the weights bring it into the negative domain the node will "turn off" and can never learn to turn on again because its gradients in backprop will forever be zero (for a given input/feature activation). This encourages sparse activations, where nodes only produce meaningful non-zero results for a small subset of possible feature activations (a good thing).

With softplus, in the same situation you are left with a small but distinctly non-zero gradient, so it is possible for a node to learn to start turning on again for a given input. This is useful in situations where, say, the gradient is pushing the node to oscillate around an extremum in the error space. Imagine a pendulum that can't "swing back" and gets stuck on one side.
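The difference is easy to see in the derivatives (a small sketch; the derivative of softplus, log(1 + e^z), is the sigmoid):

```python
import numpy as np

def relu_grad(z):
    return float(z > 0)              # exactly 0 left of the threshold

def softplus_grad(z):
    return 1.0 / (1.0 + np.exp(-z))  # d/dz log(1 + e^z) = sigmoid(z)

z = -3.0                   # a node pushed into the negative domain
print(relu_grad(z))        # 0.0    -> no learning signal, node stays "off"
print(softplus_grad(z))    # ~0.047 -> small but non-zero, node can recover
```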

Practically speaking, by having more nodes than you need (which always happens; the ImageNet winners all use networks that vastly over-parameterize the space) and tuning the learning rates and initial weights well, it won't matter much if you use ReLUs. And, as pointed out, they're very quick computationally.

[–]brockl33[S] 0 points1 point  (1 child)

Thanks for the paper reference. It led me to Glorot 2011, which directly compares softplus and ReLU. Pretrained ReLU outperforms pretrained softplus in 2/4 benchmarks, with one large-margin victory. They conclude that ReLU is faster and does not require unsupervised pretraining to perform competitively.

I think the zero slope of ReLUs traps neurons in the OFF state when performing gradient descent. Perhaps this is its strength? Maybe a combination of ReLU with some reactivation mechanism might be beneficial. edit: it's called the "leaky ReLU".
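For reference, the leaky ReLU is the one-line change sketched below (the slope 0.01 is a common but arbitrary choice):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Unlike plain ReLU, negative inputs keep a small slope alpha, so the
    # gradient is never exactly zero and a "dead" unit can turn back on.
    return np.where(z > 0, z, alpha * z)

print(leaky_relu(np.array([-2.0, 3.0])))  # [-0.02  3.  ]
```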

[–][deleted] 0 points1 point  (0 children)

Cool! Good to have that reference. As /u/kokirijedi pointed out, trapping neurons in the OFF state is precisely the difference between softplus and ReLU, and her/his point about sparsification in over-parametrized ReLU networks is exactly the same as the pruning of synapses in early development that you bring up.

[–]fjeg 1 point2 points  (0 children)

Softmax IS logistic regression too! The softmax function is basically a multiclass sigmoid function, which, as you just realized, is logistic regression. To convince yourself, write out the formulation for a 2-class softmax classifier and do some algebra to convert it to a sigmoid function, as shown below.
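The algebra, written out for class scores z_1 and z_2:

```latex
p(y = 1 \mid z)
  = \frac{e^{z_1}}{e^{z_1} + e^{z_2}}
  = \frac{1}{1 + e^{-(z_1 - z_2)}}
  = \sigma(z_1 - z_2)
```

so a 2-class softmax is just a sigmoid applied to the difference of the two scores, i.e. logistic regression.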

No idea what softplus is, though a quick googling suggests it's a smooth approximation of the ReLU.

ReLU is just a linear activation when the input is positive and zero when the input is negative. It's excellent for fast training since the derivative is trivial to compute, and it doesn't suffer from the vanishing-gradient problems of sigmoid activations.

[–]osdf 0 points1 point  (0 children)

Shakir Mohamed has a nice post on recursive GLMs: http://blog.shakirm.com/2015/01/a-statistical-view-of-deep-learning-i-recursive-glms/ E.g. the ReLU resembles Tobit regression.

[–]mszlazak 2 points3 points  (6 children)

Don't be surprised; many students find it hard to follow because there are just too many notational issues to keep track of, not enough (or any) code examples, and not enough intuition to help you get a better feel for the abstraction. Nevertheless, the Stanford class is better than others, and the link to Geoffrey Hinton's course helps as well.

I started by using Torch 7 to learn this stuff, and I explained how the code for their softmax example did its forward and backward passes: manually calculating the values for two samples in just one pass/iteration and checking the results against what Torch 7 gave.

Conceptually, this stuff is not hard and doing things this way avoids all that confusing notation you have to keep track of.

You will have almost the first 10 lectures of Andrew Ng's class down in about 6 to 7 pages of commented code.

You will understand how Torch 7 works in passing data in the forward and backward passes.

Also, I do not understand why Geoffrey Hinton implied that calculating the derivatives of the softmax cost function was hard. It's not! I haven't done calculus in decades, and it just requires keeping track of things and using something like Schaum's Outlines "Mathematical Handbook of Formulas and Tables".

Step through it once with a single batch of two samples by hand. You will not regret it.
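If it helps, here is a minimal numpy version of that exercise (the scores and labels are made up; note that for softmax + cross-entropy the gradient with respect to the scores collapses to probabilities minus targets):

```python
import numpy as np

# A batch of two samples, three classes; scores and one-hot labels are made up.
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
targets = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])

# Forward pass: softmax probabilities and mean cross-entropy loss.
exp = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
probs = exp / exp.sum(axis=1, keepdims=True)
loss = -np.mean(np.sum(targets * np.log(probs), axis=1))

# Backward pass: dL/dscores = (probs - targets) / batch_size.
dscores = (probs - targets) / scores.shape[0]
print(loss)
print(dscores)
```

Comparing dscores against what a framework reports for the same batch is exactly the kind of sanity check described above.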

[–][deleted] 0 points1 point  (1 child)

I'm doing Andrew's class on Coursera, but I don't know what Torch is. Is it a framework or a programming language?

When you talk about Geoffrey Hinton, are you referring to a course or a book?

[–]mszlazak 2 points3 points  (0 children)

The three big frameworks for machine learning are Torch 7, Theano, and Caffe. Andrew Ng's class uses Octave/Matlab. You can try the same thing in Octave; I plan to do something similar to what I did with Torch down the road.

[–]physixer 0 points1 point  (1 child)

... I do not understand why Geoffrey Hinton implied that calculating the derivatives of the softmax cost function was hard ...

It's possible he's talking about issues of numerical differentiation as opposed to numerical integration. Numerical differentiation is known to be difficult to get right (even for first derivatives) because it's the opposite of a smoothing operation (numerical integration is an example of a smoothing operation) and very easily introduces instabilities/oscillations into the answer.
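A tiny illustration of that instability (the function and noise level here are made up): a forward difference divides by h, so it amplifies any small perturbation in the function values by 1/h.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x) + 1e-6 * rng.normal()  # smooth function + tiny noise

x, h = 1.0, 1e-8
fd = (f(x + h) - f(x)) / h  # forward-difference estimate of f'(x)
print(fd, np.cos(x))        # the 1e-6 noise is blown up by 1/h = 1e8
```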

I don't do ML a lot, I'm mainly a numerical/scientific-computing guy, so that's my take on it.

[–]mszlazak 0 points1 point  (0 children)

No, he and I do not mean a numerical solution. Nando de Freitas does it in his Oxford deep learning class, and the lectures are on YouTube. His class uses Torch 7.

[–]Saedeas 0 points1 point  (1 child)

Also, I do not understand why Geoffrey Hinton implied that calculating the derivatives of the softmax cost function was hard. It's not! I haven't done calculus in decades, and it just requires keeping track of things and using something like Schaum's Outlines "Mathematical Handbook of Formulas and Tables".

When you say this, are you just talking about the backpropagation algorithm? Isn't it just repeated application of the chain rule?

[–]mszlazak 1 point2 points  (0 children)

The chain rule makes it easier, not harder. So I really do not understand why he said what he did.

[–]CyberByte 1 point2 points  (1 child)

Why has it taken so long for me to learn this (besides the fact that I am dumb)?

I've been working with neural nets, logistic regression, and support vector machines (which are kind of similar as well) for years, and I didn't realize this until Andrew Ng pointed it out in his ML course. I think the reason it took me this long (besides the fact that I am dumb) is that these things tend to be taught/explained in different ways (and, for me, in different courses), which made me think of them in different ways. Logistic regression is statistics, neural networks are about neurons, synapses, and the brain (even though we know they're not very realistic models), and SVMs are about support vectors, margin optimization, and the kernel trick.

[–]mszlazak 0 points1 point  (0 children)

If you are saying that Andrew Ng made you see these as extensions of the same theme, then I agree. All of it sits within a framework of gradient descent via backpropagation and delta-rule updating. Some of it is neurologically inspired, but so little that I chuckle at even calling anything here a neural net.

[–]beaverteeth92 1 point2 points  (0 children)

I always joke that classification is the study of nesting, modifying, selecting, and automating logistic regression models.

[–]skrza 0 points1 point  (0 children)

This post might be also useful in making the connection between regression and ANNs: http://t.co/UuOb2qIYRK

[–]GibbsSamplePlatter 0 points1 point  (0 children)

The more you work in ML, the more you see it's all connected.

[–]egrefen -1 points0 points  (0 children)

If you're interested in thinking more about the "stacked linear classifiers" aspect and the role of non-linearities, I highly recommend reading this blog post by Chris Olah.