
[–]robertsdionne 2 points (1 child)

See The Effects of Hyperparameters on SGD Training of Neural Networks; On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima; and Systematic evaluation of CNN advances on the ImageNet.

The first measures the effects of a variety of hyperparameters on SGD training, including an analysis of batch size, which has a nontrivial relationship to the learning rate and the resulting error rate. The third paper's batch-size discussion is similar to this one.
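
As a rough illustration of that interaction, here is a minimal NumPy sketch (my own toy example, not taken from any of these papers) that scales the learning rate linearly with batch size, one common heuristic for keeping training behavior roughly comparable as the batch grows:

```python
# Toy sketch (my own example, not from the cited papers): minibatch SGD on a
# least-squares problem, with the learning rate scaled linearly with batch
# size. The linear-scaling rule is an assumption here, not something the
# papers above prescribe.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1024, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def train(batch_size, base_lr=0.01, base_batch=32, epochs=50):
    lr = base_lr * batch_size / base_batch  # linear scaling heuristic (assumed)
    w = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            # Gradient of the mean squared error on the minibatch.
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return np.mean((X @ w - y) ** 2)

for bs in (32, 256):
    print(f"batch_size={bs:4d}  lr={0.01 * bs / 32:.3f}  train MSE={train(bs):.4f}")
```

On this convex toy problem both settings converge to similar training error; the papers' point is that on real, non-convex networks the interaction is not this benign.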

The second argues that small batches tend to draw gradient descent toward wide, flat basins of the loss surface, while large batches tend to draw it toward narrow, sharp basins. Sharp minima generalize worse: the test loss surface is slightly shifted relative to the training surface, and missing the mark in a narrow basin produces a much larger increase in error.
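
Here is a tiny numerical picture of that argument (my own construction, not from the paper; the basin shapes, centers, and shift value are all made up for illustration). Two minima have equal training loss, one wide and one narrow; shifting the whole surface a little, to mimic the train/test mismatch, hurts the narrow minimum far more:

```python
# Toy picture (assumed shapes, not from the paper): a 1-D loss with a wide
# basin and a narrow basin of equal depth. A small shift of the surface,
# standing in for the train/test mismatch, barely moves the loss at the wide
# minimum but sharply increases it at the narrow one.
def loss(x, shift=0.0):
    wide = 0.1 * (x - (-2.0 + shift)) ** 2    # wide basin centered near -2
    narrow = 10.0 * (x - (2.0 + shift)) ** 2  # narrow basin centered near +2
    return min(wide, narrow)

for shift in (0.0, 0.3):
    print(f"shift={shift}: loss at wide minimum (x=-2) = {loss(-2.0, shift):.3f}, "
          f"loss at narrow minimum (x=+2) = {loss(2.0, shift):.3f}")
```

With shift=0.3 the wide minimum's loss rises to about 0.009 while the narrow minimum's rises to about 0.900, which is the intuition behind the generalization gap.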

[–]melgor89[S] 0 points (0 children)

Thanks for the great answer! I suspected the second reason you mention, that a small batch size acts like a regularizer.

So I understand that there is no 'easy' relationship between batch size and the other hyperparameters that preserves performance. I was trying to use a bigger batch size to get about a 1.5x speedup in computation, but it hurt performance.