The key difference between traditional (batch) gradient descent (GD) and stochastic gradient descent (SGD) is how much data each update uses. SGD updates the model weights by using a single training example at a time, picked at random at each iteration [1]. Gradient descent computes the gradient over the entire training dataset before each parameter update. This difference in data usage makes each SGD update far cheaper to compute and makes SGD easier to scale to large datasets. On the other hand, SGD converges more noisily than GD, because a single example might not be representative of the whole dataset, so an individual update can move the parameters in a slightly “wrong” direction. However, this same randomness can make SGD faster overall and sometimes better suited to nonconvex optimization problems, because the noise can help it escape shallow local minima and saddle points.
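To make the contrast concrete, here is a minimal NumPy sketch of both update rules on a small least-squares problem. The synthetic dataset, the learning rates, the step counts and the function names (`make_data`, `grad_mse`, `gradient_descent`, `stochastic_gradient_descent`) are illustrative assumptions and are not taken from any particular library.

```python
import numpy as np

def make_data(n=200, seed=0):
    """Synthetic linear-regression data (an assumption for illustration)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    true_w = np.array([2.0, -3.0])
    y = X @ true_w + 0.1 * rng.normal(size=n)
    return X, y, true_w

def grad_mse(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y)**2) with respect to w."""
    residual = X @ w - y
    return X.T @ residual / len(y)

def gradient_descent(X, y, lr=0.1, steps=200):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # Every update uses the entire dataset.
        w -= lr * grad_mse(w, X, y)
    return w

def stochastic_gradient_descent(X, y, lr=0.01, steps=5000, seed=1):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # Every update uses one randomly picked example.
        i = rng.integers(len(y))
        w -= lr * grad_mse(w, X[i:i + 1], y[i:i + 1])
    return w

if __name__ == "__main__":
    X, y, true_w = make_data()
    print("true weights:", true_w)
    print("GD estimate: ", gradient_descent(X, y))
    print("SGD estimate:", stochastic_gradient_descent(X, y))
```

In this sketch, each GD step touches all 200 rows, while each SGD step touches one. SGD therefore needs many more steps, but each step is far cheaper, and its final estimate jitters around the solution instead of settling exactly on it.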

Strictly speaking, SGD was originally defined to update parameters by using exactly one training sample at a time. In modern usage, the term “SGD” is often used loosely to mean “minibatch gradient descent,” a variant of GD in which each update uses a small batch of training examples. The major advantage of using a subset of the data rather than a single sample is a lower noise level, because the gradient used for the update is the average of the per-example gradients over the minibatch. For this reason, minibatch gradient descent is the default in deep learning, whereas strict single-example SGD is rarely used in practice. Machine learning libraries such as PyTorch and TensorFlow reflect this loose usage: their optimizers are named “SGD” even though they are typically run on minibatch gradients.
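The noise-reduction effect of averaging can be checked numerically. The sketch below is a rough illustration under assumed conditions (a synthetic regression problem, squared-error loss, batches sampled without replacement). It measures how far a minibatch gradient lands from the full-dataset gradient, on average, for a given batch size. The helper names `per_example_grads`, `minibatch_grad` and `gradient_noise` are hypothetical.

```python
import numpy as np

def per_example_grads(w, X, y):
    """Row i is the gradient of 0.5 * (x_i . w - y_i)**2 with respect to w."""
    return (X @ w - y)[:, None] * X

def minibatch_grad(w, X, y, batch_size, rng):
    """Average of the per-example gradients over a random minibatch."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    return per_example_grads(w, X[idx], y[idx]).mean(axis=0)

def gradient_noise(batch_size, trials=2000, seed=0):
    """Mean distance between a minibatch gradient and the full gradient."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(512, 3))  # synthetic data, assumed for illustration
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=512)
    w = np.zeros(3)
    full = per_example_grads(w, X, y).mean(axis=0)
    errors = [
        np.linalg.norm(minibatch_grad(w, X, y, batch_size, rng) - full)
        for _ in range(trials)
    ]
    return float(np.mean(errors))

if __name__ == "__main__":
    for b in (1, 8, 64, 512):
        print(f"batch size {b:>3}: mean gradient error {gradient_noise(b):.4f}")
```

A batch size of 1 corresponds to strict SGD, and a batch size equal to the dataset size (512 here) reproduces the full gradient up to floating-point rounding. The sizes in between trade per-step cost against noise.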

The following illustration provides a clearer depiction of how increasing the sample size of training data reduces oscillations and “noise.”