Deep Learning Advances from IBM Research


Today, with contributions from IBM Research scientists, IBM introduces Deep Learning as a Service within Watson Studio, a rich set of cloud-based tools for developers and data scientists that helps remove the barriers to training deep learning models in the enterprise.

Deep learning and machine learning require expensive hardware and software resources, as well as even more expensive skilled scientists and developers. Deep learning in particular requires users to be experts at multiple levels of the stack, from neural network design to new hardware, and making them more effective requires cross-stack innovation and software/hardware co-design. The challenges of creating AI models and applications have recently been gaining attention, with Berkeley’s Joe Hellerstein highlighting the AI Engineering Gap, a new academic conference (SysML) focused solely on the intersection of systems and machine learning, and the Stanford DAWN initiative calling out the “lack of systems and tools for end-to-end machine learning development.”

Deep Learning as a Service within Watson Studio makes strides toward addressing these challenges, increasing the productivity of data scientists and software engineers as well as the quality and maintainability of their AI creations. In this blog, we take a deeper dive into some of the innovations developed by IBM Research scientists in partnership with our product teams.

IBM's deep learning as a service architecture

The Deep Learning as a Service architecture spans several layers, including hardware accelerators, open source DL frameworks, Kubernetes for container orchestration, and services to manage and monitor the training runs.

Training in the cloud

Many teams working with deep learning models spend their time not on data science tasks, but on configuring esoteric hardware, installing drivers, managing distributed processes, dealing with failures, or figuring out how to fund specialized hardware such as GPUs. We want to let users continue to design their models in the framework of their choice, but train them in an optimized hardware and software stack offered as a managed cloud service.

The challenges here stem from the fact that off-the-shelf deep learning frameworks aren’t designed to run in a multi-tenant cloud environment, and cloud software typically handles stateless Web apps, whose profile is distinctly different from that of compute- and data-intensive deep learning training jobs. In creating this capability, we tackled cross-stack performance optimizations, support for heterogeneous hardware, security, scale-out, and usability challenges, so data scientists can focus on their models and data, getting them closer to a ‘serverless’ deep learning experience. We have recently released some of this core capability [1, 2] as the Fabric for Deep Learning (FfDL), an open source, cloud-native, microservices-based fabric on top of Kubernetes, and we invite the community to participate, experiment, and contribute to the innovation possibilities in this exciting space.

Automating the selection of neural network hyperparameters

Determining the hyperparameters of a neural network effectively is a challenging problem due to the extremely large configuration space (for instance: how many nodes per layer, activation functions, learning rates, dropout rates, filter sizes, etc.) and the computational cost of evaluating a proposed configuration (evaluating a single configuration can take hours to days). To address this problem we use a model-based global optimization algorithm called RBFOpt [3] that does not require derivatives. Similarly to Bayesian optimization – which fits a Gaussian process model to the unknown objective function – our approach fits a radial basis function model. While collaborating closely with the product team to integrate our technology, we continue to explore novel ways to build neural networks based on, for example, incremental learning [4] and bandit-based search for neural architectures [5].
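To make the surrogate-model idea concrete, here is a deliberately simplified sketch (not the actual RBFOpt algorithm, which is open source [3]): fit a radial basis function interpolant to the configurations evaluated so far, then propose the next configuration by minimizing that cheap surrogate instead of running another expensive training job. The objective function and the two-dimensional search space below are purely hypothetical stand-ins.

```python
import numpy as np

def rbf_surrogate(X, y):
    """Fit a thin-plate-spline RBF interpolant to observed (config, loss) points."""
    def phi(r):
        with np.errstate(divide="ignore", invalid="ignore"):
            return np.where(r > 0, r**2 * np.log(r), 0.0)
    # Pairwise distances between observed configurations
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Small ridge term keeps the interpolation system well conditioned
    w = np.linalg.solve(phi(D) + 1e-8 * np.eye(len(X)), y)
    def predict(x):
        r = np.linalg.norm(X - x, axis=-1)
        return phi(r) @ w
    return predict

def suggest_next(X, y, bounds, n_candidates=2000, rng=None):
    """Propose the candidate configuration that minimizes the surrogate model."""
    rng = rng or np.random.default_rng(0)
    model = rbf_surrogate(np.asarray(X, dtype=float), np.asarray(y, dtype=float))
    lo, hi = bounds
    cands = rng.uniform(lo, hi, size=(n_candidates, len(lo)))
    scores = np.array([model(c) for c in cands])
    return cands[np.argmin(scores)]

# Hypothetical objective standing in for an expensive training run,
# over (log10 learning rate, dropout rate):
def val_loss(cfg):
    lr, dropout = cfg
    return (lr + 3) ** 2 + (dropout - 0.5) ** 2
```

In a real loop, the suggested configuration would be evaluated by actually training the model, appended to the history, and the surrogate refit; RBFOpt adds careful candidate-selection strategies on top of this basic scheme.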

A dashboard for checking experiments

Machine learning models are increasingly at the core of applications and systems. The process of developing these models is highly iterative and experiment-driven, and its often non-linear, non-deterministic nature results in a large number of diverse models. We found that data scientists tend to manage models using ad hoc methods such as notebooks, spreadsheets, file system folders, or PowerPoint slides. These ad hoc methods record the models themselves, but not the higher-level experiment. From this observation, a data scientist’s dashboard [6] emerged, which we refined continuously with our internal end users to arrive at the current design with the product teams. The dashboard allows data scientists to compare versions of models across experimental runs and to see how each individual parameter affects the resulting accuracy of the model. The system can also display samples of the input data, plot accuracy/loss curves for individual runs, and track the provenance of the model’s data and code artifacts.
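The internals of the dashboard aren’t detailed here, but the record it manages can be sketched as a simple structure: each run stores its hyperparameters, its metrics, and pointers to data and code artifacts, which is already enough to compare runs and trace provenance. The names below are illustrative, not the product’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    run_id: str
    params: dict                                   # hyperparameters used for this run
    metrics: dict                                  # e.g. {"val_acc": 0.91, "val_loss": 0.31}
    artifacts: dict = field(default_factory=dict)  # data/code versions for provenance

def best_run(runs, metric="val_acc"):
    """Compare the runs of an experiment and return the best one by a metric."""
    return max(runs, key=lambda r: r.metrics[metric])

def effect_of(runs, param, metric="val_acc"):
    """Pair a hyperparameter value with the resulting metric across runs."""
    return sorted((r.params[param], r.metrics[metric]) for r in runs)
```

Storing the experiment at this level, rather than as loose model files, is what lets the dashboard answer questions like “which learning rate gave the best validation accuracy?” across dozens of runs.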

Visual programming for deep learning

The main research challenge we addressed in this work [7, 8] is to promote and instill a visual programming paradigm for deep learning. The domain of deep learning is growing faster than developers and software engineers can be trained to build applications with it. We abstract the programming paradigm for building deep learning models so that the learning curve for adopting deep learning in development is drastically reduced. In this work, we:

  1. Developed a visual programming paradigm with an intuitive drag-and-drop interface
  2. Provided a platform-agnostic representation for capturing deep learning model design and architecture

Effectively, the developer should be able to design a deep learning architecture in an abstract way, agnostic of the library in which the code is generated or the model is trained. In the labs, we are experimenting with expediting the creation of neural nets through auto-generation of code from deep learning research papers [8].
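As an illustration of what a platform-agnostic representation can look like (this is a hypothetical sketch, not DARVIZ’s actual format [7]), a model can be captured as plain data and translated into framework-specific code, here targeting Keras:

```python
# Hypothetical abstract model description, independent of any DL library:
model_spec = [
    {"layer": "dense", "units": 128, "activation": "relu"},
    {"layer": "dropout", "rate": 0.5},
    {"layer": "dense", "units": 10, "activation": "softmax"},
]

def to_keras(spec):
    """Emit Keras model-definition code from the abstract representation."""
    lines = ["model = keras.Sequential(["]
    for l in spec:
        if l["layer"] == "dense":
            lines.append(f'    layers.Dense({l["units"]}, activation="{l["activation"]}"),')
        elif l["layer"] == "dropout":
            lines.append(f'    layers.Dropout({l["rate"]}),')
    lines.append("])")
    return "\n".join(lines)
```

A second emitter targeting, say, PyTorch could consume the same `model_spec`, which is the point of keeping the design separate from any one library.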

With these innovations now in the hands of customers, we will continue to collaborate with the IBM Watson team, pushing the boundaries of what is possible with deep learning and helping advance what people can do with AI technologies.

To learn more about Deep Learning as a Service within Watson Studio visit

Authors: Rania Khalaf, David Kung, Senthil K Mani, Todd Mummert, Vinod Muthusamy, and Horst Samulowitz, IBM Research


[1] B. Bhattacharjee et al., “IBM Deep Learning Service,” IBM Journal of Research and Development, vol. 61, no. 4, pp. 10:1-10:11, July-Sept. 2017.

[2] S. Boag et al., “Scalable Multi-Framework Multi-Tenant Lifecycle Management of Deep Learning Training Jobs,” Workshop on ML Systems at NIPS’17, 2017.

[3] RBFOpt: A blackbox optimization library in Python,

[4] R. Istrate, A. C. I. Malossi, C. Bekas, D. S. Nikolopoulos, “Incremental Training of Deep Convolutional Neural Networks,” AutoML@PKDD/ECML 2017, pp. 41-48.

[5] M. Wistuba, “Finding Competitive Network Architectures Within a Day Using UCT.”

[6] Runway: machine learning model experiment management tool

[7] A. Sankaran, R. Aralikatte, S. Mani, S. Khare, N. Panwar, N. Gantayat, “DARVIZ: Deep Abstract Representation, Visualization, and Verification of Deep Learning Models,” ICSE-NIER 2017.

[8] A. Sethi, A. Sankaran, N. Panwar, S. Khare, S. Mani, “DLPaper2Code: Auto-generation of Code from Deep Learning Research Papers,” AAAI 2018.
