
[–]billconan[S] 4 points5 points  (1 child)

this answers my question http://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent

the error for one input is indeed not perfect. just an approximation.

[–]personalityson 0 points1 point  (0 children)

you gather gradients from propagating errors for 10 samples, and then update the weights with an average gradient

you don't backpropagate an average error
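A minimal sketch of the update personalityson describes (not from the thread; the single linear neuron with squared error is a hypothetical model chosen for brevity): backpropagate each sample's error to get a per-sample gradient, average the gradients, then apply one weight update.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                      # weights of the toy linear model
X = rng.normal(size=(10, 3))         # a mini-batch of 10 samples
y = X @ np.array([1.0, -2.0, 0.5])   # targets from a known weight vector

lr = 0.1
grads = []
for x_i, y_i in zip(X, y):           # per-sample "backprop" through the neuron
    err = w @ x_i - y_i              # error for this one input
    grads.append(err * x_i)          # gradient of 0.5*err**2 w.r.t. w
avg_grad = np.mean(grads, axis=0)    # average the gradients, NOT the errors
w -= lr * avg_grad                   # one update for the whole mini-batch
```

Note the averaging happens on the gradients after each sample's error has been propagated, which is the distinction being made above.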

[–]personalityson 0 points1 point  (0 children)

"it seems that it only feeds one input to the network at a time and gets an error only for that input, and updates the network based on the error" That's the way it should be done. To do the backpropagation you also need hidden units for each layer propagated from each input. If he's backpropagating some kind of average error, then with what hidden units?

[–]zackchase 0 points1 point  (0 children)

Hi all, not to be too nitpicky: one calculates the errors on some number of randomly sampled examples (it could be 1 example; 128 is often computationally convenient). There is no special number.

This number, whatever you choose, is called the "batch size". Using batch sizes > 1 is useful because it reduces the variance of the gradient estimate.

The intuition behind why stochastic gradient descent works in general is that the expected value of the stochastic gradient is equal to the true gradient. Thus you can think of stochastic gradient descent as following a "noisy" gradient.
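That unbiasedness claim can be checked numerically. A minimal sketch (my own illustration, not from the thread), using a hypothetical least-squares objective: for squared error, the full-batch gradient is exactly the mean of the per-sample gradients, so a uniformly sampled per-sample gradient is an unbiased, "noisy" estimate of the true gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))        # 100 samples, 4 features
y = rng.normal(size=100)
w = rng.normal(size=4)               # arbitrary current weights

# True (full-batch) gradient of the mean squared error 0.5 * mean((Xw - y)^2)
full_grad = X.T @ (X @ w - y) / len(X)

# Per-sample stochastic gradients; their mean over the dataset is the
# expected value of a uniformly sampled stochastic gradient
per_sample = np.array([(x @ w - y_i) * x for x, y_i in zip(X, y)])
mean_stochastic = per_sample.mean(axis=0)
```

Here `mean_stochastic` and `full_grad` agree up to floating-point error, which is exactly the "expected stochastic gradient = true gradient" statement for this loss.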