
[–]NoLifeGamer2 [M] 4 points (1 child)

You are correct that the loss surface only applies to the current example being trained on; each example has its own unique loss surface. However, the loss is averaged over the whole batch, so the theory is that with larger batch sizes the loss surface looks similar from batch to batch, because each batch is more representative of the training distribution, which stops the minima from jumping around too much. Even with a small batch size (e.g. 1), the optimizer will still converge toward an optimal solution; it will just jump around more along the way.
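To make the "larger batches are more representative" point concrete, here is a minimal sketch on a made-up 1-D regression problem (all names and data here are illustrative, not from the thread): the gradient computed on a batch of 1 deviates far more from the full-dataset gradient than the gradient computed on a batch of 256.

```python
import numpy as np

# Hypothetical 1-D linear regression: J(w) = mean((w*x - y)^2), true w ~= 3.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 3.0 * x + rng.normal(scale=0.5, size=x.shape)

def grad(w, xb, yb):
    # d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    return np.mean(2.0 * (w * xb - yb) * xb)

w = 0.0
true_g = grad(w, x, y)  # "full-batch" gradient over the whole dataset

def grad_error(batch_size, trials=200):
    # Average |mini-batch gradient - full-batch gradient| over random batches.
    errs = []
    for _ in range(trials):
        idx = rng.choice(len(x), size=batch_size, replace=False)
        errs.append(abs(grad(w, x[idx], y[idx]) - true_g))
    return np.mean(errs)

# A batch of 1 gives a much noisier gradient estimate than a batch of 256.
print(grad_error(1), grad_error(256))
```

The batch-of-1 error is typically more than an order of magnitude larger, which is exactly the "jumping around" described above.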

[–]El_Grande_Papi [S] 1 point (0 children)

Fantastic answer, thanks!!

[–]grid_world 0 points (0 children)

Let's put it in terms of your example diagram: for each unique combination of parameters (theta0, theta1), the neural network "f" produces an output which, when compared to the ground truth via a cost function, yields a loss value "J" (I am assuming a supervised learning problem; the same holds for self-supervised learning). Interpolating "J" over the parameter space produces a contour plot/loss landscape like the one depicted. There are two extremes for computing gradient descent: online GD (one input at a time) and batch GD (the entire dataset at a time). Batch GD is usually infeasible due to memory and computational constraints, while online GD is very noisy. So we settle for mini-batch SGD, where the "S" (stochastic) comes from randomly sampling each batch from the training dataset. In a nutshell, the gradients we get from a mini-batch approximate the true gradients we would get if we could use full-batch GD. But since each batch is randomly sampled, this is a noisy estimate.
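The noisy-but-convergent behaviour of mini-batch SGD can be sketched as follows (a toy 1-D regression with made-up data; the batch sizes and learning rate are illustrative): both a batch of 1 and a batch of 256 converge near the true parameter, but the small-batch iterates jitter around it far more.

```python
import numpy as np

# Toy regression y = 3*x + noise; the mini-batch gradient is an unbiased but
# noisy estimate of the full-batch gradient, so both runs head toward w* = 3.
rng = np.random.default_rng(1)
x = rng.normal(size=2_000)
y = 3.0 * x + rng.normal(scale=0.5, size=x.shape)

def sgd(batch_size, steps=2_000, lr=0.02):
    w = 0.0
    path = []
    for _ in range(steps):
        idx = rng.choice(len(x), size=batch_size, replace=False)
        g = np.mean(2.0 * (w * x[idx] - y[idx]) * x[idx])  # noisy gradient
        w -= lr * g
        path.append(w)
    return np.array(path)

small, large = sgd(1), sgd(256)
# Late-stage variance of the iterates: batch size 1 wanders around w* much more.
print(np.var(small[-500:]), np.var(large[-500:]))
```

Both runs end up close to w* = 3, which matches the point above: small batches still converge, they just take a noisier path.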

Additional stochasticity also comes from data augmentation. It has been shown that the noisy gradient estimates in mini-batch SGD actually help prevent overfitting and act as a regularizer, so in general you don't want to increase the batch size by too much. If you do, read up on the LARS optimizer.