The latest AI trends, brought to you by experts
Get curated insights on the most important—and intriguing—AI news. Subscribe to our weekly Think newsletter. See the IBM Privacy Statement.
Distributed machine learning (ML) is an approach to large-scale ML tasks where workloads are spread across multiple devices or processors instead of running on a single computer. Distributed ML is most often used for training large and complex models, such as deep neural networks, where computational demands are especially high.
As artificial intelligence (AI) models scale in size and complexity, the computational resources needed for training can exceed the capacity of a single device. Likewise, complex large-scale tasks can be too demanding or require more memory than one device can provide. Distributed systems partition either the dataset or the AI model across multiple devices, or nodes, to improve performance, accuracy and efficiency.
Distributed ML is a subset of distributed computing, which encompasses broader efforts to spread computing across multiple processing systems.
Get curated insights on the most important—and intriguing—AI news. Subscribe to our weekly Think newsletter. See the IBM Privacy Statement.
Approaches to distributed machine learning include:
Data parallelism: splitting the dataset across multiple devices
Model parallelism: partitioning a model across multiple devices
Hybrid parallelism: combining aspects of data and model parallelism
The parameter server framework for distributed ML uses one or more server nodes to maintain model parameter updates, while worker nodes handle the data and computation.
Data parallelism partitions a large dataset being used as training data into a number of groups equal to the number of nodes in the system. Each node hosts a copy of the AI model and feeds it a subset of the distributed data. Data parallelism operations can be run either synchronously or asynchronously. Synchronous training is more common with deep learning models because it ensures consistency, but at the cost of speed.
Because each node runs a full copy of the model, the hardware must have enough memory to host it. However, the AI workload per node is reduced since each processes only part of the dataset.
Federated learning resembles data parallelism in that multiple nodes train copies of a model on subsets of data. However, the data partitions are not created from a centralized data pool but consist of local data to each node. Model updates are aggregated between nodes, while the data never leaves its original source.
Because each dataset is localized, data scientists often use federated learning for distributed deep learning applications involving high data security or legal regulations. Centralized data pools can accidentally expose personally identifiable information (PII) during transmission. Federated learning mitigates this risk by isolating each dataset to its own node.
In model parallelism, a model is split into component portions, such as layers of a deep neural network, that can run independently and simultaneously across separate nodes. Each portion of the model uses the same data. Model parallelism is also known as network parallelism.
When a single ML model is too large for a single machine to handle, model parallelism allows the model to run across multiple nodes. Each node handles a portion of the model, so its compute and memory requirements are lower than if one single machine had to run the full model.
Model parallelism is typically more difficult to implement than data parallelism. The selected distributed machine learning algorithm directly affects the scalability of the method, with some lending themselves to the technique more than others. The system must be built in a way that minimizes the amount of data sharing between nodes. High-performance model parallelism systems require expert-level design and optimization.
Model parallelism is often combined with data parallelism so that each segment of the model processes a different portion of the input data. The results are aggregated across the network.
Pipeline parallelism is a type of model parallelism that partitions a model sequentially. Each stage of the model is hosted on its own node. Batches of data are processed in order through the stages—similar to how an old-fashioned bucket brigade would pass a bucket of water from one person to the next, from the water source to the fire.
Pipeline parallelism can improve throughput—the volume of data a model can process at once. However, this boost comes with the cost of increased latency, which is the amount of time it takes to generate a result after receiving an input.
The increase in latency is due to the ramp-up phase during which the initial micro-batches of data are passed through the sequence. Stages later in the machine learning pipeline cannot initialize until the first micro-batches of data pass through the previous stages.
Tensor parallelism comes into play when even a single layer of a deep neural network requires too much computational power or takes up too much space for a single device. Self-attention and embedding layers—two cornerstones of the transformer architecture—can grow very large, which means that virtually all large language model (LLM) development involves tensor parallelism. Without tensor parallelism, it would be practically impossible to train models LLMs because the layers are too big for any one device.
In tensor parallelism, the parameters of a single layer are housed on multiple GPUs (graphics processing units) or TPUs (tensor processing units). Each device computes part of a single layer’s operations, and the partial results are aggregated to produce the layer’s output.
Compared to data parallelism and many other types of model parallelism, tensor parallelism workflows require much more communication between nodes. High-bandwidth networks can help reduce communication bottlenecks.
Data and model parallelism techniques are not often used in isolation. They are often combined in various hybrid parallelism configurations. The open source deep learning frameworks PyTorch and TensorFlow, both of which support Python, are commonly used to construct distributed machine learning systems.
Most large-scale language models, including the GPT family, rely on hybrid parallelism to efficiently train at scale.
Distributed ML can augment each stage of the ML pipeline: the process of building, training and deploying machine learning models. Preprocessing, training, fine-tuning, validation, inference and deployment
Distributed data preprocessing
Distributed training
Distributed fine-tuning
Distributed validation
Distributed inference
Distributed deployment
Distributed data preprocessing uses linked networks of nodes—the multiple processors or devices, not the individual neurons of the neural network, which are also sometimes referred to as “nodes”—to prepare large datasets for analysis and further use.
A central controlling node manages the workflow, splitting the data and assigning it to worker nodes. Decentralizing the work through parallel processing increases scalability and efficiency when compared to traditional models that use a single device.
Distributed training leverages distributed ML techniques to spread model training across devices. For example, this technique is often used with large neural networks. When either the network, the training dataset or both is too large for one processor, distributed training spreads the workload across multiple servers, GPUs or machines.
Stochastic gradient descent (SGD) is a learning algorithm that splits the dataset into mini-batches and computes the gradient of the loss function after each batch. Using mini-batches instead of the full dataset makes training more efficient.
The loss function measures the error in the model’s predictions, and SGD’s goal is to descend the gradient to minimize the value of the function. As with standard model training, the training process is deemed complete when the model reaches convergence: when the SGD algorithm successfully minimizes the function’s value.
Nodes process mini-batches in parallel, which is possible because each batch is processed independently of the others within each iteration. Each node computes its gradient, then pushes the updated gradient value to the other nodes in the network. The other worker nodes implement the updates that they receive into their own models, helping ensure that all copies of the model remain identical throughout the training process.
The AllReduce function is a collective communication operation which enables each node to share its results and propagate the aggregated results through the network. AllReduce allows all nodes to synchronize model parameter updates and maintain consistency. AllReduce, long used in high-performance computing, was popularized in ML frameworks such as Horovod.
SGD can be run synchronously or asynchronously. Synchronous SGD updates all nodes at the same time, which maintains consistency at the cost of potential delays if some nodes lag behind. Asynchronous SGD updates parameters as soon as an update is ready, but some nodes might receive updates that don’t include the most current values.
By reducing the computational resources needed per device, distributed training can speed up training times. Because it is so compute-intensive, training is one of the primary use cases for distributed ML.
The same principles and benefits of distributed training apply to distributed fine-tuning. Fine-tuning further trains a pre-trained model to specialize in more specific tasks. Applying distributed ML techniques makes the process faster, more efficient and more scalable.
Validation is the process of evaluating the performance of a trained model. Distributing the validation dataset or a large model across multiple nodes provides the same benefits as the rest of the distributed training process.
Inference is the process by which a trained AI model processes new data to recognize patterns and generate outputs or predictions. Distributing the workload across multiple devices makes it possible to operate AI models that are too large for a single machine. Distributed inference can also facilitate greater throughput and lower latency.
Distributed deployment manages the operation of a software application across a network of nodes. Load balancing across worker nodes helps mitigate bottlenecks and optimize resource efficiency, increasing throughput and lowering latency.
The benefits of distributed machine learning include:
Efficiency
Scalability
Redundancy
Automating and dividing the workload among multiple devices reduces the burden on any one device. Nodes can work in parallel to complete long-term tasks faster, then aggregate their outputs into the final result.
Training an image recognition model on ImageNet (a dataset with over 14 million labeled images), would take weeks on a single GPU. With distributed ML, even a small startup could perform this task in hours.
Enterprises do not need to invest in ultra-powerful computers to run LLMs and other resource intensive systems. Distributed computing enables cloud services providers to coordinate vast infrastructure across many servers and data centers, making it available to enterprise customers on demand.
Intensive data science tasks such as data processing with big data can be completed without significant infrastructure investment. Large-scale data processing enables even smaller startups to offer their users services such as recommendation systems or chatbots.
Meanwhile, hyperscale data centers lie at the other end of the scalability spectrum. Where resources allow, organizations are building massive server clusters to run the most advanced deep neural networks. It wouldn’t be possible to operate models this large across thousands of GPUs without distributed ML.
Many real-world systems rely on fault tolerance—the ability to continue operating even if individual nodes fail. Model providers need to ensure that individual users as well as applications connected with APIs can enjoy uninterrupted access. In distributed ML, redundancy preserves uptime by replicating processes across nodes so that failures don’t interrupt service.
The challenges of effective distributed ML implementation include:
Network bottlenecks
Synchronization overhead
Energy consumption
Insufficient bandwidth between nodes is a frequent cause of bottlenecks in distributed ML networks. Methods such as tensor parallelism that require more communication are the most bandwidth-hungry. If the network is unable to provide the required bandwidth, distributed ML projects will face increased training times and reduced scalability.
Synchronization overhead is a delay that occurs when one task cannot begin until another task is complete. In a synchronous system, all worker nodes must upload shared data before moving to the next stage of training. This moment is known as a synchronization barrier because the next phase does not begin until all nodes are synchronized.
Straggler nodes—workers that take longer than others to complete their task—thereby slow down the entire process. Asynchronous systems remove the barrier at the risk of having some nodes operating with obsolete parameter configurations.
Distributed networks can be high energy consumers, not only due to the nodes but also the communication between them. Energy consumption can vary depending on the implementation and architecture of the distributed ML system.
Computational requirements: The high-performance GPUs needed for many challenging ML tasks are energy-intensive.
Communication: Nodes need high-speed networks to communicate effectively and minimize synchronization overhead.
Cloud or edge computing: The centralized hyperscale data centers powering the leading AI models consume massive amounts of energy. Edge computing can help lower network costs.
Algorithm and data processing choices: Choosing the right algorithm and following good data processing practices such as feature engineering can make models more efficient.
Distributed ML frameworks implement machine learning while optimizing memory and compute resource use. They can also help scale ML implementations, shorten training times and control costs.
Notable distributed ML frameworks include:
PyTorch Distributed
Apache Spark
TensorFlow Distributed
Ray Train
InstructLab
Available in the popular PyTorch ML framework, PyTorch Distributed is a set of tools for building and scaling deep learning models across multiple devices. The torch.distributed package covers intra-node communication, such as with AllReduce. Built-in support for data parallelism and model parallelism allows for a range of distributed training approaches.
Apache Spark is a longstanding ML framework with support for distributed training. Users can build end-to-end ML pipelines that integrate with the greater Spark ecosystem, including Spark SQL for database manipulation. Spark offers two libraries: the original MLlib and the newer SparkML.
The tf.distribute.Strategy API within TensorFlow opens the door to distributed ML and contains support for a range of schemes: multiple GPUs on one machine, multiple GPUs on multiple machines and more. For example, ParameterServerStrategy stores parameters on dedicated servers, from which worker nodes access them.
Ray Train is the scalable distributed training and fine-tuning library within the Ray ML framework for distributed computing. Ray Train is compatible with both PyTorch and TensorFlow. The Ray Tune library supports distributed hyperparameter tuning across multiple devices.
InstructLab’s novel approach to distributed ML eschews the traditional GPU cluster in favor of a community-based approach—almost like crowd-sourced fundraising. Community members contribute parameter updates to a centralized repository and can fine-tune models on their personal devices.
Train, validate, tune and deploy generative AI, foundation models and machine learning capabilities with IBM watsonx.ai, a next-generation enterprise studio for AI builders. Build AI applications in a fraction of the time with a fraction of the data.
Move your applications from prototype to production with the help of our AI development solutions.
Reinvent critical workflows and operations by adding AI to maximize experiences, real-time decision-making and business value.