
Scaling TensorFlow and Caffe to 256 GPUs


Deep learning has taken the world by storm over the last four years, powering hundreds of consumer web and mobile applications that we use every day. But extremely long training times in most frameworks present a hurdle that is curtailing its broader proliferation. Training large AI models on big data sets to the right accuracy levels can currently take days or even weeks.

At the crux of this problem is a technical limitation: the popular open-source deep learning frameworks do not run efficiently across multiple servers. So while most data scientists use servers with four or eight GPUs, they can’t scale beyond that single node. For example, when we tried to train a model on the ImageNet-22K data set using a ResNet-101 model, it took us 16 days on a single Power Systems server (S822LC for High Performance Computing) with four NVIDIA P100 GPU accelerators.

16 days – that’s a lot of time you could be spending elsewhere.

And since model training is an iterative task, where a data scientist tweaks hyper-parameters, models, and even the input data, and trains the AI models multiple times, these kinds of long training runs delay time to insight and can limit productivity.

IBM Research invents the jet engine of deep learning

The IBM Research team took on this challenge and, through innovative clustering methods, built a “Distributed Deep Learning” (DDL) library that hooks into popular open-source machine learning frameworks like TensorFlow, Caffe, Torch, and Chainer. DDL enables these frameworks to scale to tens of IBM servers leveraging hundreds of GPUs. Effectively, IBM Research has invented the jet engine of deep learning.

With the DDL library, it took us just 7 hours to train ImageNet-22K using ResNet-101 on 64 IBM Power Systems servers that have a total of 256 NVIDIA P100 GPU accelerators in them. 16 days down to 7 hours changes the workflow of data scientists. That’s a 58x speedup!

The distributed deep learning (DDL) library is available as a technology preview in our latest version 4 release of the PowerAI deep learning software distribution. DDL presents an application programming interface (API) that each of the deep learning frameworks can hook into, to scale across multiple servers. PowerAI makes this cluster scaling feature available to organizations using deep learning for training their AI models.
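DDL’s actual API is not shown in this post, but the pattern it accelerates – synchronous data-parallel training, where every GPU computes gradients on its own slice of the batch and an allreduce averages them before each weight update – can be sketched in plain NumPy. The `ring_allreduce` function below is an illustrative in-process simulation of the classic ring algorithm, not DDL’s interface:

```python
import numpy as np

def ring_allreduce(worker_grads):
    """Average per-worker gradient vectors with a simulated ring allreduce.

    Real implementations run the same two phases (reduce-scatter, then
    allgather) over the network between p nodes; here the "sends" are
    in-process copies so the algorithm can be followed end to end.
    """
    p = len(worker_grads)
    # Each worker starts with its own gradient, split into p chunks.
    chunks = [list(np.array_split(g.astype(float), p)) for g in worker_grads]

    # Phase 1: reduce-scatter. After p - 1 steps, worker i holds the
    # fully summed chunk (i + 1) % p.
    for t in range(p - 1):
        sends = [(i, (i - t) % p, chunks[i][(i - t) % p].copy()) for i in range(p)]
        for i, c, val in sends:
            chunks[(i + 1) % p][c] += val

    # Phase 2: allgather. The reduced chunks circulate around the ring
    # until every worker holds all of them.
    for t in range(p - 1):
        sends = [(i, (i + 1 - t) % p, chunks[i][(i + 1 - t) % p].copy()) for i in range(p)]
        for i, c, val in sends:
            chunks[(i + 1) % p][c] = val

    # Every worker now holds the same summed gradient; return the average.
    return np.concatenate(chunks[0]) / p

# Four simulated GPUs, each with a different gradient for the same weights.
grads = [np.full(8, float(i)) for i in range(4)]
print(ring_allreduce(grads))  # every element is (0 + 1 + 2 + 3) / 4 = 1.5
```

The appeal of the ring topology is that per-node traffic stays near twice the model size regardless of how many nodes participate, which is what makes efficient scaling to hundreds of GPUs plausible at all.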

And it scales efficiently – Running across multiple nodes is only half the battle; doing it efficiently is the harder half. Fortunately, building on a rich experience in HPC and analytics, IBM Research was able to scale deep learning frameworks across up to 256 GPUs with up to 95 percent efficiency!
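Scaling efficiency here is the measured speedup divided by the ideal linear speedup from adding GPUs. As a quick sanity check, the same arithmetic applied to the ResNet-101 run above (16 days on 4 GPUs down to 7 hours on 256 GPUs, using the rounded figures quoted in this post) comes out around 86 percent; the 95 percent figure refers to the ResNet-50/ImageNet-1K run shown in Figure 1:

```python
def scaling_efficiency(time_base_h, gpus_base, time_scaled_h, gpus_scaled):
    """Measured speedup divided by the ideal (linear) speedup."""
    speedup = time_base_h / time_scaled_h
    ideal = gpus_scaled / gpus_base
    return speedup / ideal

# ResNet-101 / ImageNet-22K run from this post, with rounded times:
# 16 days (384 h) on 4 GPUs down to 7 h on 256 GPUs.
eff = scaling_efficiency(16 * 24, 4, 7, 256)
print(f"speedup {16 * 24 / 7:.1f}x, efficiency {eff:.0%}")  # speedup 54.9x, efficiency 86%
```

Note that the rounded times give roughly a 55x speedup rather than the 58x headline, which presumably reflects the unrounded run times.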


Figure 1: Scaling results using Caffe to train a ResNet-50 model using the ImageNet-1K data set on 64 Power Systems servers that have a total of 256 NVIDIA P100 GPU accelerators in them.

PowerAI is evolving rapidly – It’s hard to believe that it’s been less than a year since we launched PowerAI, given the pace of innovations and enhancements we’ve added to the Deep Learning Suite since its initial release. We announced our first results just 10 months ago. Today, I’m proud to announce the fourth release of PowerAI. This release includes the distributed deep learning library and a technology preview of the vision capability we announced in May. The vision capability in PowerAI generates trained deep learning models when given labelled video or image input data sets.

PowerAI can be tried on the Nimbix Power cloud at https://power.jarvice.com/ or downloaded from https://www.ibm.biz/powerai to run on an IBM Power Systems server.

VP, HPC, AI & Machine Learning, IBM Cognitive Systems


Andree Jacobson

Sumit, this is impressive! Congrats. Were the tests run on Nimbix as well, or on some internal IBM cluster? I’m curious: how much, if any, did the interconnect play a part in this scaling? Is there any further information, like a white paper, available on these results?

Sumit Gupta

Hi Andree

This was run on an internal IBM cluster of 64 Minsky Power S822LC systems, connected to each other using Mellanox InfiniBand. The interconnect definitely plays an important part in achieving the efficiency that we did in the scaling. The software scales even without a very good network, but not to the same level of efficiency.
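A rough back-of-the-envelope sketch shows why the interconnect matters: with a ring allreduce, each node transfers roughly 2 * (p - 1) / p times the model size per gradient synchronization, so the time per sync is bounded by network bandwidth. The model size and link speeds below are illustrative assumptions, not measurements from this cluster:

```python
def sync_time_s(model_bytes, n_nodes, bandwidth_gbps):
    """Approximate ring-allreduce wall time for one gradient sync.

    Each node sends and receives about 2 * (p - 1) / p * model_bytes;
    latency and overlap with computation are ignored.
    """
    volume = 2 * (n_nodes - 1) / n_nodes * model_bytes
    return volume / (bandwidth_gbps * 1e9 / 8)  # Gbit/s -> bytes/s

model = 170e6  # ~170 MB of FP32 gradients, on the order of a ResNet-101 (assumption)
for name, gbps in [("10 GbE", 10), ("100 Gb/s InfiniBand", 100)]:
    print(f"{name}: {sync_time_s(model, 64, gbps) * 1e3:.0f} ms per sync")
# 10 GbE: 268 ms per sync
# 100 Gb/s InfiniBand: 27 ms per sync
```

At hundreds of syncs per epoch, an order-of-magnitude gap in sync time is the difference between communication hiding behind computation and communication dominating the run.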

The technical paper is at


Sunil

Sumit – this is a significant achievement indeed. Do you think we can extrapolate the same scaling efficiency, speedup, and accuracy to other ML/DL workloads?

Sumit Gupta

Hi Sunil

We built Distributed Deep Learning (DDL) as a library and have incorporated it into TensorFlow, Caffe, and Torch, but it can be integrated into any framework.

Ziju Feng

Great work! Any plans for when this feature will be publicly available as part of a cloud service, and what the pricing structure will be for up to 256 GPUs?

Steve Hebert


This is available now publicly in the Nimbix Cloud. Please contact us if we can assist.

