Summary: IBM Research publishes in arXiv close to ideal scaling with new distributed deep learning software which achieved record communication overhead and 95% scaling efficiency on the Caffe deep learning framework over 256 NVIDIA GPUs in 64 IBM Power systems. Previous best scaling was demonstrated by Facebook AI Research of 89% for a training run on Caffe2, at higher communication overhead. IBM Research also beat Facebook’s time by training the model in 50 minutes, versus the 1 hour Facebook took. Using this software, IBM Research achieved a new image recognition accuracy of 33.8% for a neural network trained on a very large data set (7.5M images). The previous record published by Microsoft demonstrated 29.8% accuracy.
A technical preview of this IBM Research Distributed Deep Learning code is available today in IBM PowerAI 4.0 distribution for TensorFlow and Caffe.
Deep learning is a widely used AI method to help computers understand and extract meaning from images and sounds through which humans experience much of the world. It holds promise to fuel breakthroughs in everything from consumer mobile app experiences to medical imaging diagnostics. But progress in accuracy and the practicality of deploying deep learning at scale is gated by technical challenges, such as the need to run massive and complex deep learning based AI models – a process for which training times are measured in days and weeks.
For our part, my team in IBM Research has been focused on reducing these training times for large models with large data sets. Our objective is to reduce the wait-time associated with deep learning training from days or hours to minutes or seconds, and enable improved accuracy of these AI models. To achieve this, we are tackling grand-challenge scale issues in distributing deep learning across large numbers of servers and NVIDIA GPUs.
Most popular deep learning frameworks scale to multiple GPUs in a server, but not to multiple servers with GPUs. Specifically, our team (Minsik Cho, Uli Finkler, David Kung, Sameer Kumar, David Kung, Vaibhav Saxena, Dheeraj Sreedhar) wrote software and algorithms that automate and optimize the parallelization of this very large and complex computing task across hundreds of GPU accelerators attached to dozens of servers.
Our software does deep learning training fully synchronously with very low communication overhead. As a result, when we scaled to a large cluster with 100s of NVIDAI GPUs, it yielded record image recognition accuracy of 33.8% on 7.5M images from the ImageNet-22k dataset vs the previous best published result of 29.8% by Microsoft. A 4% increase in accuracy is a big leap forward; typical improvements in the past have been less than 1%. Our innovative distributed deep learning (DDL) approach enabled us to not just improve accuracy, but also to train a ResNet-101 neural network model in just 7 hours, by leveraging the power of 10s of servers, equipped with 100s of NVIDIA GPUs; Microsoft took 10 days to train the same model. This achievement required we create the DDL code and algorithms to overcome issues inherent to scaling these otherwise powerful deep learning frameworks.
These results are on a benchmark designed to test deep learning algorithms and systems to the extreme, so while 33.8% might not sound like a lot, it’s a result that is noticeably higher than prior publications. Given any random image, this trained AI model will gives its top choice object (Top-1 accuracy), amongst 22,000 options, with an accuracy of 33.8%. Our technology will enable other AI models trained for specific tasks, such as detecting cancer cells in medical images, to be much more accurate and trained in hours, re-trained in seconds.
As Facebook AI Research described the problem in a June 2017 research paper explaining their own excellent result using a smaller data set (ImageNet 1k) and a smaller neural network (ResNet 50):
“Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress.”
Ironically, this problem of orchestrating and optimizing a deep learning problem across many servers is made much more difficult as GPUs get faster. This has created a functional gap in deep learning systems that drove us to create a new class of DDL software to make it possible to run popular open source codes like Tensorflow, Caffe, Torch and Chainer over massive scale neural networks and data sets with very high performance and very high accuracy.
Here a variant of the “Blind Men and the Elephant” parable is helpful in describing the problem that we are solving and context for the promising early results we have achieved. Per Wikipedia:
“…Each blind man feels a different part of the elephant body, but only one part, such as the side or the tusk. They then describe the elephant based on their partial experience and their descriptions are in complete disagreement on what an elephant is.”
Now, despite initial disagreement, if these people are given enough time, they can share enough information to piece together a pretty accurate collective picture of an elephant.
Similarly, if you have a bunch of GPUs slogging through the task of processing elements of a deep learning training problem – in parallel over days or weeks, as is typically the case today – you can synch these learning results fairly easily.
But as GPUs get much faster, they learn much faster, and they have to share their learning with all of the other GPUs at a rate that isn’t possible with conventional software. This puts stress on the system network and is a tough technical problem. Basically, smarter and faster learners (the GPUs) need a better means of communicating, or they get out of sync and spend the majority of time waiting for each other’s results. So, you get no speedup–and potentially even degraded performance–from using more, faster-learning GPUs.
Our ability to address this functional gap with (DDL) software is most visible when you look at scaling efficiency or how close to perfect system performance scales as you add GPUs. This measurement provides a view into how well the 256 GPUs were “talking” about what each other were learning.
The best scaling for 256 GPUs shown before is by a team from Facebook AI Research (FAIR). FAIR used a smaller deep learning model, ResNet-50, on a smaller dataset ImageNet-1K, which has about 1.3 million images, both of which reduce computational complexity, and used a larger batch size of 8192, and achieved 89% scaling efficiency on a 256 NVIDIA P100 GPU accelerated cluster using the Caffe2 deep learning software. For a ResNet-50 model and same dataset as Facebook, the IBM Research DDL software achieved an efficiency of 95% using Caffe as shown in the chart below. This was run on a cluster of 64 “Minsky” Power S822LC systems, with four NVIDIA P100 GPUs each.
Scaling Performance of IBM DDL across 256 GPUs (log scale)
For training the much larger ResNet-101 model on 7.5M images from the ImageNet-22K data set, with an image batch size of 5120, we achieved a scaling efficiency of 88%.
We also achieved a record in fastest absolute training time of 50 minutes compared to Facebook’s previous record of 1 hour. We trained the ResNet-50 model with ImageNet-1K model by scaling Torch using DDL to 256 GPUs. Facebook trained a similar model using Caffe2.
For developers and data scientists, the IBM Research (DDL) software presents an API (application programming interface) that each of the deep learning frameworks can hook into, to scale to multiple servers. A technical preview is available now in version 4 of the PowerAI enterprise deep learning software offering, making this cluster scaling feature available to any organization using deep learning for training their AI models. We expect that by making this DDL feature available to the AI community, we will see many more higher accuracy runs as others leverage the power of clusters for AI model training.