machine-learning, neural-network, gradient-descent, stochastic, mini-batch

Stochastic Gradient Descent (SGD) vs Mini-batch size 1


Is stochastic gradient descent basically the name given to mini-batch training with a batch size of 1, where each training row is selected at random? I.e., is it the same as 'normal' gradient descent, with only the manner in which the training data is supplied making the difference?
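
To make the comparison concrete, this is how I picture the parameter updates (my own notation: loss $L$, parameters $\theta$, learning rate $\eta$, $N$ training examples):

$$\text{batch:} \quad \theta \leftarrow \theta - \eta \, \nabla_\theta \frac{1}{N} \sum_{i=1}^{N} L(x_i, y_i, \theta)$$

$$\text{mini-batch:} \quad \theta \leftarrow \theta - \eta \, \nabla_\theta \frac{1}{|B|} \sum_{i \in B} L(x_i, y_i, \theta), \qquad B \text{ a random subset of the data}$$

$$\text{SGD:} \quad \theta \leftarrow \theta - \eta \, \nabla_\theta L(x_i, y_i, \theta), \qquad i \text{ a random index}$$

With $|B| = 1$ the mini-batch update appears to be exactly the SGD update, which is why I'm asking.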

One thing that confuses me is that I've seen people say that even with SGD you can supply more than one data point and use larger batches, so wouldn't that just make it 'normal' mini-batch gradient descent?


Solution

  • On Optimization Terminology

    Optimization algorithms that use only a single example at a time are sometimes called stochastic, as you mentioned. Optimization algorithms that use the entire training set are called batch or deterministic gradient methods.

    Most algorithms used for deep learning fall somewhere in between, using more than one but fewer than all the training examples. These were traditionally called minibatch or minibatch stochastic methods, and it is now common to call them simply stochastic methods.

    Hope that makes the terminology clearer; a minimal sketch after the reference below shows all three variants as the same update loop with different batch sizes. Source:

    Deep Learning (deeplearningbook.org) by Goodfellow, Bengio, and Courville, pp. 275-276
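
    To make the distinction concrete, here is a minimal NumPy sketch (not from the book; the toy data and the gradient_descent helper are my own illustrative choices) in which batch, minibatch, and stochastic gradient descent are literally the same update loop, differing only in the batch size:

        import numpy as np

        rng = np.random.default_rng(0)

        # Toy regression data, made up purely for illustration.
        X = rng.normal(size=(1000, 5))
        true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
        y = X @ true_w + rng.normal(scale=0.1, size=1000)

        def gradient_descent(X, y, batch_size, lr=0.01, epochs=20):
            """Gradient descent on mean squared error.

            batch_size == len(X)     -> batch ("deterministic") gradient descent
            1 < batch_size < len(X)  -> minibatch gradient descent
            batch_size == 1          -> classic stochastic gradient descent
            """
            w = np.zeros(X.shape[1])
            n = len(X)
            for _ in range(epochs):
                order = rng.permutation(n)              # random example order each epoch
                for start in range(0, n, batch_size):
                    idx = order[start:start + batch_size]
                    Xb, yb = X[idx], y[idx]
                    grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)  # MSE gradient on this batch
                    w -= lr * grad
            return w

        w_batch = gradient_descent(X, y, batch_size=len(X))  # "batch" / deterministic
        w_mini  = gradient_descent(X, y, batch_size=32)       # "minibatch" (the common case)
        w_sgd   = gradient_descent(X, y, batch_size=1)        # "stochastic" in the strict sense

    With batch_size=1 the inner loop visits one randomly ordered example per parameter update, which is exactly the case the question describes.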