What is distributed machine learning?

Authors

Ivan Belcic

Staff writer

Cole Stryker

Staff Editor, AI Models

IBM Think


Distributed machine learning (ML) is an approach to large-scale ML tasks where workloads are spread across multiple devices or processors instead of running on a single computer. Distributed ML is most often used for training large and complex models, such as deep neural networks, where computational demands are especially high. 

As artificial intelligence (AI) models scale in size and complexity, the computational resources needed for training can exceed the capacity of a single device. Likewise, complex large-scale tasks can be too demanding or require more memory than one device can provide. Distributed systems partition either the dataset or the AI model across multiple devices, or nodes, to improve performance, accuracy and efficiency. 

Distributed ML is a subset of distributed computing, which encompasses broader efforts to spread computing across multiple processing systems.


Types of distributed machine learning

Approaches to distributed machine learning include: 

  • Data parallelism: splitting the dataset across multiple devices 

  • Model parallelism: partitioning a model across multiple devices 

  • Hybrid parallelism: combining aspects of data and model parallelism

The parameter server framework for distributed ML uses one or more server nodes to maintain model parameter updates, while worker nodes handle the data and computation.
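As a rough illustration of the pattern (plain Python with toy classes, not a production framework), a server object can hold the parameters while worker logic pulls them, computes gradients on its own data and pushes updates back:

```python
import torch

class ParameterServer:
    """Toy server node: stores parameters and applies pushed gradient updates."""

    def __init__(self, params, lr=0.01):
        self.params = {name: p.clone() for name, p in params.items()}
        self.lr = lr

    def pull(self):
        # Workers fetch the current parameter values.
        return {name: p.clone() for name, p in self.params.items()}

    def push(self, grads):
        # The server applies each pushed gradient as an SGD update.
        for name, g in grads.items():
            self.params[name] -= self.lr * g

def worker_step(params, x, y):
    """Toy worker node: computes gradients of a linear model on its local data."""
    w = params["w"].requires_grad_(True)
    b = params["b"].requires_grad_(True)
    loss = ((x @ w + b - y) ** 2).mean()
    loss.backward()
    return {"w": w.grad, "b": b.grad}

server = ParameterServer({"w": torch.zeros(3, 1), "b": torch.zeros(1)})
worker_data = [(torch.randn(16, 3), torch.randn(16, 1)) for _ in range(2)]

for step in range(10):
    for x, y in worker_data:              # each worker handles its own data
        grads = worker_step(server.pull(), x, y)
        server.push(grads)                # server maintains the parameter state
```

Real parameter server systems run the server and workers as separate processes communicating over the network, and they often shard the parameters across several server nodes.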

Data parallelism

Data parallelism partitions a large dataset being used as training data into a number of groups equal to the number of nodes in the system. Each node hosts a copy of the AI model and feeds it a subset of the distributed data. Data parallelism operations can be run either synchronously or asynchronously. Synchronous training is more common with deep learning models because it ensures consistency, but at the cost of speed. 

Because each node runs a full copy of the model, the hardware must have enough memory to host it. However, the AI workload per node is reduced since each processes only part of the dataset.
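As a minimal sketch of data parallelism in PyTorch (the toy model, dataset and hyperparameters are placeholders), each process launched by torchrun holds a full replica wrapped in DistributedDataParallel and trains on its own shard of the data:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def train():
    # torchrun supplies rank, world size and the rendezvous address through
    # environment variables; "gloo" lets the sketch run on CPU-only machines.
    dist.init_process_group(backend="gloo")

    torch.manual_seed(0)  # so every process builds the same toy dataset
    model = torch.nn.Linear(10, 1)        # illustrative toy model
    ddp_model = DDP(model)                # each process holds a full replica

    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)  # gives each process its own shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()   # DDP all-reduces gradients across replicas here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    train()
```

Launched with, for example, `torchrun --nproc_per_node=4 train_script.py`, each of the four processes trains on its own quarter of the data while DDP keeps the replicas synchronized.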

Federated learning

Federated learning resembles data parallelism in that multiple nodes train copies of a model on subsets of data. However, the data partitions are not created from a centralized data pool but consist of data local to each node. Model updates are aggregated between nodes, while the data never leaves its original source. 

Because each dataset is localized, data scientists often use federated learning for distributed deep learning applications involving high data security or legal regulations. Centralized data pools can accidentally expose personally identifiable information (PII) during transmission. Federated learning mitigates this risk by isolating each dataset to its own node.
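A toy sketch of this idea, using federated averaging in plain PyTorch (the clients are simulated within one process and all names are illustrative), keeps each client's data local and shares only model weights with the aggregation step:

```python
import copy
import torch

# Toy global model; in practice this would be a real architecture.
global_model = torch.nn.Linear(10, 1)

# Simulated clients, each with private local data that never leaves the client.
client_data = [(torch.randn(64, 10), torch.randn(64, 1)) for _ in range(4)]

def local_update(model, x, y, epochs=1, lr=0.01):
    """Train a local copy on the client's own data and return its weights."""
    local = copy.deepcopy(model)
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(local(x), y).backward()
        opt.step()
    return local.state_dict()

for round_ in range(5):
    # Each client trains locally; only the resulting weights are shared.
    client_states = [local_update(global_model, x, y) for x, y in client_data]

    # The server aggregates by averaging parameters (federated averaging).
    avg_state = {
        key: torch.stack([s[key] for s in client_states]).mean(dim=0)
        for key in client_states[0]
    }
    global_model.load_state_dict(avg_state)
```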


Model parallelism

In model parallelism, a model is split into component portions, such as layers of a deep neural network, that can run independently and simultaneously across separate nodes. Each portion of the model uses the same data. Model parallelism is also known as network parallelism.

When a single ML model is too large for a single machine to handle, model parallelism allows the model to run across multiple nodes. Each node handles a portion of the model, so its compute and memory requirements are lower than if a single machine had to run the full model. 
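As an illustrative sketch (the device names, layer sizes and two-way split are assumptions, and it requires at least two GPUs), a PyTorch model can place its early layers on one device and its later layers on another, moving activations between them in the forward pass:

```python
import torch
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    """Toy model split across two GPUs: early layers on cuda:0, later layers on cuda:1."""

    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(512, 256), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(256, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Activations cross devices here; this transfer is the main
        # communication cost of naive model parallelism.
        return self.part2(x.to("cuda:1"))

model = TwoDeviceModel()
out = model(torch.randn(8, 512))  # requires at least two CUDA devices
```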

Model parallelism is typically more difficult to implement than data parallelism. The selected distributed machine learning algorithm directly affects the scalability of the method, with some lending themselves to the technique more than others. The system must be built in a way that minimizes the amount of data sharing between nodes. High-performance model parallelism systems require expert-level design and optimization. 

Model parallelism is often combined with data parallelism so that each segment of the model processes a different portion of the input data. The results are aggregated across the network.

Pipeline parallelism 

Pipeline parallelism is a type of model parallelism that partitions a model sequentially. Each stage of the model is hosted on its own node. Batches of data are processed in order through the stages—similar to how an old-fashioned bucket brigade would pass a bucket of water from one person to the next, from the water source to the fire. 

Pipeline parallelism can improve throughput—the volume of data a model can process at once. However, this boost comes with the cost of increased latency, which is the amount of time it takes to generate a result after receiving an input. 

The increase in latency is due to the ramp-up phase during which the initial micro-batches of data are passed through the sequence. Stages later in the machine learning pipeline cannot initialize until the first micro-batches of data pass through the previous stages.
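A purely conceptual sketch of this micro-batch flow (plain Python functions standing in for model stages on separate devices; no real parallel execution happens here):

```python
# Each "stage" stands in for a partition of the model hosted on its own node.
def stage1(x):
    return x + 1  # placeholder for the first block of layers

def stage2(x):
    return x * 2  # placeholder for the middle block of layers

def stage3(x):
    return x - 3  # placeholder for the final block of layers

stages = [stage1, stage2, stage3]

def run_pipeline(batch, num_micro_batches=4):
    # Split the batch into micro-batches so that, in a real system, different
    # stages can work on different micro-batches at the same time.
    size = len(batch) // num_micro_batches
    micro_batches = [batch[i * size:(i + 1) * size] for i in range(num_micro_batches)]

    outputs = []
    for mb in micro_batches:
        # Later stages cannot start on the first micro-batch until the earlier
        # stages finish it: this is the ramp-up ("bubble") cost described above.
        for stage in stages:
            mb = [stage(x) for x in mb]
        outputs.extend(mb)
    return outputs

print(run_pipeline(list(range(8))))
```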

Tensor parallelism

Tensor parallelism comes into play when even a single layer of a deep neural network requires too much computational power or takes up too much space for a single device. Self-attention and embedding layers, two cornerstones of the transformer architecture, can grow very large, which means that virtually all large language model (LLM) development involves tensor parallelism. Without tensor parallelism, it would be practically impossible to train modern LLMs because their layers are too big for any one device. 

In tensor parallelism, the parameters of a single layer are housed on multiple GPUs (graphics processing units) or TPUs (tensor processing units). Each device computes part of a single layer’s operations, and the partial results are aggregated to produce the layer’s output. 
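A small numerical sketch of the idea (plain PyTorch on CPU, with the weight split and shard count chosen arbitrarily) splits one linear layer's weight matrix column-wise and recombines the partial outputs:

```python
import torch

# Toy column-wise tensor parallelism for one linear layer, y = x @ W.
# In a real system each shard would live on its own GPU or TPU; here the
# shards stay on CPU purely to illustrate the math.
x = torch.randn(4, 8)   # input batch
W = torch.randn(8, 6)   # full weight matrix of one layer

# Split the weight's output dimension into two shards ("devices").
W_shard0, W_shard1 = W.chunk(2, dim=1)

# Each device computes a partial result using only its shard.
y_shard0 = x @ W_shard0
y_shard1 = x @ W_shard1

# Aggregating the partial results (an all-gather in practice) reproduces
# the full layer output.
y = torch.cat([y_shard0, y_shard1], dim=1)
assert torch.allclose(y, x @ W)
```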

Compared to data parallelism and many other types of model parallelism, tensor parallelism workflows require much more communication between nodes. High-bandwidth networks can help reduce communication bottlenecks.

Hybrid parallelism

Data and model parallelism techniques are rarely used in isolation; they are typically combined in various hybrid parallelism configurations. The open source deep learning frameworks PyTorch and TensorFlow, both of which support Python, are commonly used to construct distributed machine learning systems. 

Most large-scale language models, including the GPT family, rely on hybrid parallelism to efficiently train at scale.

How distributed machine learning works

Distributed ML can augment each stage of the ML pipeline, the process of building, training and deploying machine learning models: 

  • Distributed data preprocessing 

  • Distributed training

  • Distributed fine-tuning 

  • Distributed validation 

  • Distributed inference 

  • Distributed deployment 

Distributed data preprocessing

Distributed data preprocessing uses linked networks of nodes—the multiple processors or devices, not the individual neurons of the neural network, which are also sometimes referred to as “nodes”—to prepare large datasets for analysis and further use. 

A central controlling node manages the workflow, splitting the data and assigning it to worker nodes. Decentralizing the work through parallel processing increases scalability and efficiency when compared to traditional models that use a single device.
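As a minimal single-machine sketch of this controller/worker pattern (the records and cleaning step are placeholders), Python's multiprocessing pool splits the records into chunks and assigns them to worker processes:

```python
from multiprocessing import Pool

def preprocess(record):
    """Toy per-record preprocessing step (cleaning, normalization, etc.)."""
    return record.strip().lower()

if __name__ == "__main__":
    raw_records = ["  Alpha ", "BETA", "  Gamma  "] * 1000  # stand-in dataset

    # The controlling process splits the records into chunks and assigns them
    # to worker processes, which preprocess their chunks in parallel.
    with Pool(processes=4) as pool:
        cleaned = pool.map(preprocess, raw_records, chunksize=250)

    print(cleaned[:3])
```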

Distributed training

Distributed training applies distributed ML techniques to spread model training across devices and is most often used with large neural networks. When the network, the training dataset or both are too large for one processor, distributed training spreads the workload across multiple servers, GPUs or machines. 

Stochastic gradient descent (SGD) is a learning algorithm that splits the dataset into mini-batches and computes the gradient of the loss function after each batch. Using mini-batches instead of the full dataset makes training more efficient. 

The loss function measures the error in the model’s predictions, and SGD’s goal is to descend the gradient to minimize the value of the function. As with standard model training, the training process is deemed complete when the model reaches convergence: when the SGD algorithm successfully minimizes the function’s value. 

Nodes process mini-batches in parallel, which is possible because each batch is processed independently of the others within each iteration. Each node computes its gradient, then pushes the updated gradient value to the other nodes in the network. The other worker nodes implement the updates that they receive into their own models, helping ensure that all copies of the model remain identical throughout the training process. 

The AllReduce function is a collective communication operation which enables each node to share its results and propagate the aggregated results through the network. AllReduce allows all nodes to synchronize model parameter updates and maintain consistency. AllReduce, long used in high-performance computing, was popularized in ML frameworks such as Horovod. 
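A minimal sketch of this gradient-averaging step with torch.distributed (it assumes the process group has already been initialized, as in the earlier data-parallel example):

```python
import torch
import torch.distributed as dist

def average_gradients(model):
    """Sum each parameter's gradient across all workers, then divide by the
    world size so every replica applies the same averaged update."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```

In practice, wrappers such as PyTorch's DistributedDataParallel or Horovod perform this all-reduce automatically during the backward pass.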

SGD can be run synchronously or asynchronously. Synchronous SGD updates all nodes at the same time, which maintains consistency at the cost of potential delays if some nodes lag behind. Asynchronous SGD updates parameters as soon as an update is ready, but some nodes might receive updates that don’t include the most current values. 

By reducing the computational resources needed per device, distributed training can speed up training times. Because it is so compute-intensive, training is one of the primary use cases for distributed ML.

Distributed fine-tuning 

The same principles and benefits of distributed training apply to distributed fine-tuning. Fine-tuning further trains a pre-trained model to specialize in more specific tasks. Applying distributed ML techniques makes the process faster, more efficient and more scalable.

Distributed validation 

Validation is the process of evaluating the performance of a trained model. Distributing the validation dataset or a large model across multiple nodes provides the same benefits as the rest of the distributed training process. 

Distributed inference

Inference is the process by which a trained AI model processes new data to recognize patterns and generate outputs or predictions. Distributing the workload across multiple devices makes it possible to operate AI models that are too large for a single machine. Distributed inference can also facilitate greater throughput and lower latency.

Distributed deployment 

Distributed deployment manages the operation of a software application across a network of nodes. Load balancing across worker nodes helps mitigate bottlenecks and optimize resource efficiency, increasing throughput and lowering latency.

Benefits of distributed ML

The benefits of distributed machine learning include: 

  • Efficiency 

  • Scalability

  • Redundancy

Efficiency

Automating and dividing the workload among multiple devices reduces the burden on any one device. Nodes can work in parallel to complete long-term tasks faster, then aggregate their outputs into the final result. 

Training an image recognition model on ImageNet (a dataset with over 14 million labeled images) can take weeks on a single GPU. With distributed ML, even a small startup can perform this task in hours.

Scalability

Enterprises do not need to invest in ultra-powerful computers to run LLMs and other resource-intensive systems. Distributed computing enables cloud service providers to coordinate vast infrastructure across many servers and data centers, making it available to enterprise customers on demand.

Intensive data science tasks such as big data processing can be completed without significant infrastructure investment. Large-scale data processing enables even smaller startups to offer their users services such as recommendation systems or chatbots.

Meanwhile, hyperscale data centers lie at the other end of the scalability spectrum. Where resources allow, organizations are building massive server clusters to run the most advanced deep neural networks. Operating models this large across thousands of GPUs would not be possible without distributed ML.

Redundancy

Many real-world systems rely on fault tolerance—the ability to continue operating even if individual nodes fail. Model providers need to ensure that individual users as well as applications connected with APIs can enjoy uninterrupted access. In distributed ML, redundancy preserves uptime by replicating processes across nodes so that failures don’t interrupt service.

Challenges of distributed ML

The challenges of effective distributed ML implementation include: 

  • Network bottlenecks

  • Synchronization overhead 

  • Energy consumption

Network bottlenecks

Insufficient bandwidth between nodes is a frequent cause of bottlenecks in distributed ML networks. Methods such as tensor parallelism that require more communication are the most bandwidth-hungry. If the network is unable to provide the required bandwidth, distributed ML projects will face increased training times and reduced scalability.

Synchronization overhead

Synchronization overhead is a delay that occurs when one task cannot begin until another task is complete. In a synchronous system, all worker nodes must upload shared data before moving to the next stage of training. This moment is known as a synchronization barrier because the next phase does not begin until all nodes are synchronized. 

Straggler nodes—workers that take longer than others to complete their task—thereby slow down the entire process. Asynchronous systems remove the barrier at the risk of having some nodes operating with obsolete parameter configurations.

Energy consumption

Distributed networks can consume large amounts of energy, not only because of the nodes themselves but also because of the communication between them. How much energy a system consumes depends on its implementation and architecture: 

  • Computational requirements: The high-performance GPUs needed for many challenging ML tasks are energy-intensive. 

  • Communication: Nodes need high-speed networks to communicate effectively and minimize synchronization overhead. 

  • Cloud or edge computing: The centralized hyperscale data centers powering the leading AI models consume massive amounts of energy. Edge computing can help lower network costs. 

  • Algorithm and data processing choices: Choosing the right algorithm and following good data processing practices such as feature engineering can make models more efficient.

Distributed machine learning frameworks

Distributed ML frameworks implement machine learning while optimizing memory and compute resource use. They can also help scale ML implementations, shorten training times and control costs.

Notable distributed ML frameworks include:

  • PyTorch Distributed

  • Apache Spark

  • TensorFlow Distributed

  • Ray Train

  • InstructLab
PyTorch Distributed

Available in the popular PyTorch ML framework, PyTorch Distributed is a set of tools for building and scaling deep learning models across multiple devices. The torch.distributed package handles communication between processes, whether on the same machine or across nodes, including collective operations such as AllReduce. Built-in support for data parallelism and model parallelism allows for a range of distributed training approaches.

Apache Spark

Apache Spark is a longstanding distributed computing framework with support for distributed training. Users can build end-to-end ML pipelines that integrate with the greater Spark ecosystem, including Spark SQL for database manipulation. Spark offers two machine learning APIs: the original RDD-based MLlib and the newer DataFrame-based Spark ML.
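A short sketch using the DataFrame-based API (the tiny in-memory dataset and app name are placeholders for a real distributed DataFrame):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

# Spark distributes both the DataFrame and the training work across the cluster.
spark = SparkSession.builder.appName("distributed-ml-sketch").getOrCreate()

# Tiny illustrative dataset; a real job would load a large distributed DataFrame.
train_df = spark.createDataFrame(
    [
        (1.0, Vectors.dense([0.0, 1.1, 0.1])),
        (0.0, Vectors.dense([2.0, 1.0, -1.0])),
        (0.0, Vectors.dense([2.0, 1.3, 1.0])),
        (1.0, Vectors.dense([0.0, 1.2, -0.5])),
    ],
    ["label", "features"],
)

# The DataFrame-based API (spark.ml) trains the model across the cluster.
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train_df)
print(model.coefficients)

spark.stop()
```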

TensorFlow Distributed

The tf.distribute.Strategy API within TensorFlow opens the door to distributed ML and contains support for a range of schemes: multiple GPUs on one machine, multiple GPUs on multiple machines and more. For example, ParameterServerStrategy stores parameters on dedicated servers, from which worker nodes access them.
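As a brief sketch (the model architecture and optimizer are placeholders), MirroredStrategy replicates a Keras model across the local GPUs and synchronizes gradients with an all-reduce:

```python
import tensorflow as tf

# MirroredStrategy replicates the model across the GPUs on one machine and
# averages gradients with an all-reduce under the hood.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Calling model.fit(train_dataset) then trains the replicated model; the
# strategy splits each batch across the available devices automatically.
```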

Ray Train

Ray Train is the scalable distributed training and fine-tuning library within the Ray ML framework for distributed computing. Ray Train is compatible with both PyTorch and TensorFlow. The Ray Tune library supports distributed hyperparameter tuning across multiple devices.

InstructLab

InstructLab’s novel approach to distributed ML eschews the traditional GPU cluster in favor of a community-based approach—almost like crowd-sourced fundraising. Community members contribute parameter updates to a centralized repository and can fine-tune models on their personal devices.
