Distributed training applies distributed ML techniques to spread model training across multiple devices. It is most often used with large neural networks: when the network, the training dataset or both are too large for a single processor, distributed training splits the workload across multiple servers, GPUs or machines.
Stochastic gradient descent (SGD) is an optimization algorithm that splits the training dataset into mini-batches and computes the gradient of the loss function on each mini-batch rather than on the full dataset. Because each update uses only a small batch, the updates are far cheaper to compute, which makes training more efficient.
The loss function measures the error in the model’s predictions, and SGD’s goal is to step down the gradient until the loss is minimized. As in standard model training, training is considered complete when the model reaches convergence: the point at which further SGD updates no longer meaningfully reduce the loss.
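As a concrete illustration, the sketch below runs mini-batch SGD on a toy linear-regression problem with a squared-error loss. The dataset, learning rate, batch size and convergence threshold are all hypothetical choices made for this example rather than values from any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset: y = X @ w_true plus a little noise, squared-error loss.
X = rng.normal(size=(1000, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(3)          # model parameters
lr = 0.05                # learning rate
batch_size = 32

for epoch in range(100):
    # Shuffle, then split the dataset into mini-batches.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]

        # Gradient of the mean squared error on this mini-batch only.
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)

        # Step down the gradient to reduce the loss.
        w -= lr * grad

    # Convergence check: stop once the full-dataset gradient is near zero.
    if np.linalg.norm(2 * X.T @ (X @ w - y) / len(y)) < 1e-2:
        break

print("learned weights:", np.round(w, 3))
```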
Nodes process mini-batches in parallel, which is possible because each mini-batch can be processed independently of the others within an iteration. Each node computes a gradient on its own mini-batch, then shares that gradient with the other nodes in the network. The other worker nodes apply the updates they receive to their own copies of the model, helping ensure that all copies remain identical throughout the training process.
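The sketch below simulates this data-parallel pattern in plain NumPy: several "worker" replicas start from identical parameters, each computes a gradient on its own mini-batch, and the averaged gradient is applied to every replica so the copies stay in sync. It is a single-process simulation of the idea, not a real multi-node implementation, and the toy regression problem is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: learn w for y = X @ w with a squared-error loss.
X = rng.normal(size=(256, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

n_workers = 4
lr = 0.1
# Every "node" starts from an identical copy of the parameters.
replicas = [np.zeros(4) for _ in range(n_workers)]

for step in range(100):
    # Each worker draws its own mini-batch and computes a local gradient.
    local_grads = []
    for w in replicas:
        idx = rng.choice(len(X), size=32, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)
        local_grads.append(grad)

    # The gradients are shared and averaged (the job AllReduce does in practice),
    # and every replica applies the same update, so all copies stay identical.
    avg_grad = np.mean(local_grads, axis=0)
    for i in range(n_workers):
        replicas[i] -= lr * avg_grad

print("replica 0 weights:", np.round(replicas[0], 3))
print("all replicas identical:", all(np.allclose(replicas[0], r) for r in replicas))
```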
AllReduce is a collective communication operation that lets every node share its result and receive the aggregated result in return, allowing all nodes to synchronize their model parameter updates and stay consistent. Long used in high-performance computing, AllReduce was popularized in ML by frameworks such as Horovod.
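Below is a minimal sketch of how an AllReduce call is commonly invoked, using PyTorch's torch.distributed package with the Gloo (CPU) backend; the address, port and world size are arbitrary values chosen for the example. Each process contributes a stand-in "gradient" tensor, all_reduce leaves the sum on every rank, and each rank divides by the number of workers to obtain the average.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # Each process joins the group; Gloo runs on CPU, so no GPUs are required.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Stand-in for a locally computed gradient: each rank holds a different value.
    grad = torch.tensor([float(rank + 1)])

    # AllReduce sums the tensors from every rank and leaves the result on all of them.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= world_size  # average, so every replica applies the same update
    print(f"rank {rank}: averaged gradient = {grad.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```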
SGD can be run synchronously or asynchronously. Synchronous SGD waits for every node’s gradient before applying a single shared update, which keeps all replicas consistent at the cost of potential delays if some nodes lag behind. Asynchronous SGD applies each update as soon as it is ready, so no node waits, but some updates may be computed from parameter values that don’t reflect the most recent changes.
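The toy simulation below contrasts the two modes on a one-parameter regression problem: the synchronous loop averages all worker gradients before applying one global update, while the asynchronous loop applies each worker's update immediately, with some workers deliberately computing their gradients from stale weights. It is an illustrative single-process sketch under these simplifying assumptions, not a real asynchronous training system.

```python
def grad(w, shard):
    # Gradient of mean squared error for a 1-D linear model y = w * x (toy problem).
    return sum(2 * x * (w * x - y) for x, y in shard) / len(shard)

data = [(x, 3.0 * x) for x in range(1, 9)]       # true weight is 3.0
shards = [data[i::4] for i in range(4)]          # one data shard per worker
lr = 0.01

# Synchronous SGD: wait for every worker, average, then apply one global update.
w_sync = 0.0
for _ in range(50):
    grads = [grad(w_sync, s) for s in shards]    # all workers read the same weights
    w_sync -= lr * sum(grads) / len(grads)

# Asynchronous SGD: each worker applies its update as soon as it finishes,
# so later workers may have computed their gradient from stale weights.
w_async = 0.0
for _ in range(50):
    stale = w_async                              # snapshot some workers started from
    for i, s in enumerate(shards):
        base = stale if i % 2 else w_async       # every other worker uses stale weights
        w_async -= lr * grad(base, s)

print(f"synchronous:  w = {w_sync:.3f}")
print(f"asynchronous: w = {w_async:.3f}")
```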
By reducing the computational load on each device, distributed training can significantly shorten training times. Because training is so compute-intensive, it is one of the primary use cases for distributed ML.