Expert Q&A: Accelerate deep learning on IBM Cloud Pak for Data

Reduce costs and speed innovation by integrating deep learning on a unified data and AI platform

4 minute read | October 28, 2020

From speech recognition and sentiment analysis to risk management, medical diagnostics and public safety, deep learning is playing a crucial role across industries. Deep learning empowers companies to accelerate pattern recognition and streamline time-consuming operations, helping to reduce costs and innovate toward digital transformation.

Implementing deep learning on a data and AI platform helps organizations automate AI lifecycles and accelerate model training and inference on a multicloud architecture. With IBM® Watson® Machine Learning Accelerator, a key capability within IBM Watson Studio on IBM Cloud Pak® for Data, organizations can speed time to results and minimize the total cost of ownership using a shared GPU service for deep learning.

To discuss this solution available in IBM Cloud Pak for Data, I spoke with two offering managers from IBM Data and AI, Steve Roberts and Kelvin Lui.

Why is deep learning becoming more important than ever to business?

Classical machine learning works well when there is a well-understood set of features, variables and rules used to train an AI model. As the number of features increases, especially when business prediction rules are not well-defined or may not even exist, deep learning can be a game changer. For example, it can take years for a person to become competent at identifying defects, because a defect can vary almost infinitely in sight, sound and other attributes. In contrast, a deep learning model trained on a set of known defect examples can identify new defects, including variations never recognized before. That's because deep learning uses neural networks, which mimic the behavior of the human brain to find patterns and make predictions. Deep learning often involves extremely large, complex and high-value data sets and types such as speech, audio, images and video, as well as unstructured text such as emails or handwritten notes.
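
To make the pattern-recognition idea concrete, here is a minimal, purely illustrative sketch: a single artificial neuron, the building block that deep learning stacks into many-layered networks, trained on a handful of invented "sensor readings" labeled good or defective. The data and features are made up for illustration; a real defect-inspection model would be a deep network trained on images or audio.

```python
def train_perceptron(samples, labels, epochs=50, lr=0.1):
    """Train a single artificial neuron on labeled examples by
    nudging its weights toward each misclassified example."""
    n = len(samples[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred  # 0 when correct; +/-1 when wrong
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Invented "sensor readings" (vibration, noise): defective parts read high.
good = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3)]
defective = [(0.9, 0.8), (0.8, 0.95), (0.85, 0.7)]
samples = good + defective
labels = [0, 0, 0, 1, 1, 1]
w, b = train_perceptron(samples, labels)

# A previously unseen reading is flagged as defective.
print(predict(w, b, (0.9, 0.9)))  # → 1
```

The key point the interviewees make survives even in this toy: no rule for "defective" is ever written down; the model infers the boundary from labeled examples.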

How does IBM Cloud Pak for Data address deep learning?

IBM Cloud Pak for Data provides a unified platform that integrates data and AI services to help you build, run and manage AI. Watson Machine Learning Accelerator is a capability designed to accelerate deep learning with end-to-end transparency and visibility. Running deep learning workloads on a unified platform simplifies the distribution of training and inference workloads. GPUs can be distributed based on fair share allocation or priority scheduling without interrupting jobs. Data scientists can share GPUs and avoid GPU scheduling delays. Resource sharing and usage monitoring happen behind the scenes to optimize deployment.
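
As a rough illustration of the fair-share idea (this is a sketch, not the actual Watson Machine Learning Accelerator scheduler), the following hands out GPUs one at a time to whichever tenant is currently furthest below its share weight, capped by what each tenant actually asked for. The team names, demands and weights are invented:

```python
def fair_share(total_gpus, demands, shares):
    """Allocate GPUs one at a time: each GPU goes to the tenant with
    the lowest allocated-to-weight ratio among those still wanting more."""
    alloc = {t: 0 for t in demands}
    for _ in range(total_gpus):
        wanting = [t for t in demands if alloc[t] < demands[t]]
        if not wanting:
            break  # every demand is satisfied; leave the rest idle
        pick = min(wanting, key=lambda t: (alloc[t] / shares[t], t))
        alloc[pick] += 1
    return alloc

# Hypothetical cluster: 8 GPUs, teamA has double the share weight.
alloc = fair_share(
    total_gpus=8,
    demands={"teamA": 6, "teamB": 6, "teamC": 1},
    shares={"teamA": 2, "teamB": 1, "teamC": 1},
)
print(alloc)  # → {'teamA': 5, 'teamB': 2, 'teamC': 1}
```

Note how teamC gets only the single GPU it asked for, and the slack flows to the heavier-weighted teamA, which is the behavior fair-share policies aim for.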

Watson Machine Learning Accelerator is like the wizard behind the curtain from The Wizard of Oz!

What is unique about Watson Machine Learning Accelerator?

Hardened over decades, Watson Machine Learning Accelerator brings an execution engine from high performance computing (HPC) to AI training and inference. IBM acquired Platform Computing in 2011 for its HPC technology, and its IBM Spectrum® Computing technology is now part of IBM Cloud Pak for Data. Our job execution technology runs on production grids with more than 100,000 compute cores, thousands of jobs and hundreds of users. For example, one of our clients ran more than 200,000 hyperparameter optimization jobs from hundreds of users over a period of six months. Another proof point of unique innovation is our elastic distributed inferencing engine with autoscaling, load balancing and high availability, which has demonstrated 45% more throughput than open source inference engines.
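
Hyperparameter optimization at that scale is feasible because each trial is an independent job that the grid can run anywhere in parallel. A toy random-search sketch, where `train_eval` is a hypothetical stand-in for a real training job that returns a validation score:

```python
import random

def train_eval(lr, batch_size):
    """Stand-in for one training job. A real job would train a model
    and return its validation score; this invented objective just
    peaks near lr=0.01, batch_size=64."""
    return -(lr - 0.01) ** 2 - (batch_size - 64) ** 2 * 1e-6

def random_search(n_trials, seed=0):
    """Each trial samples hyperparameters independently, so all
    n_trials jobs could run in parallel on a grid."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            "lr": 10 ** rng.uniform(-4, -1),  # log-uniform learning rate
            "batch_size": rng.choice([16, 32, 64, 128]),
        }
        score = train_eval(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

score, params = random_search(50)
print(params)
```

With a fixed seed, running more trials can only match or improve the best score found, which is why clients are willing to submit hundreds of thousands of such jobs.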

How are IBM clients using deep learning?

Deep learning can help clients improve safety, reduce risk, meet compliance and improve process efficiency. A national railway visually inspects trains for defects as they pass through inspection stations, improving passenger and employee safety and reducing costs through proactive maintenance. In healthcare, research hospitals rely on deep learning to assist radiologists with diagnosis to help improve patient outcomes. A retailer uses deep learning to help forecast inventory demands for better efficiency and customer satisfaction. In investment banking, businesses can run hundreds of thousands of models to help meet their regulatory requirements, including capital compliance.

Why is it important for data scientists to share GPU resources?

GPUs are still expensive, and it's not economically practical to provide each data scientist with an individual server or a small cluster of tens or hundreds of GPUs. Dynamic sharing of GPUs helps bridge the "productivity vs. cost" divide. Idle GPUs can be reassigned to finish jobs faster. If all GPUs are busy and another data scientist launches a new training run, GPUs can be reallocated to run the new job without disruption. When jobs don't each need a dedicated GPU, data scientists can share the same GPU through a feature called GPU packing. GPU sharing increases agility and minimizes management overhead.
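
GPU packing can be pictured as a bin-packing problem: jobs that each need only a fraction of a GPU's memory are placed together on the same device instead of each claiming a whole one. A first-fit sketch with invented memory figures (illustrative only, not the actual packing algorithm):

```python
def pack_jobs(jobs, gpu_mem=16):
    """First-fit packing: place each job on the first GPU with enough
    free memory, opening a new GPU only when none fits.
    jobs is a list of (name, memory_in_GB) pairs."""
    free = []        # remaining memory per GPU already opened
    placement = []   # (job name, GPU index)
    for name, mem in jobs:
        for i, avail in enumerate(free):
            if mem <= avail:
                free[i] -= mem
                placement.append((name, i))
                break
        else:
            free.append(gpu_mem - mem)
            placement.append((name, len(free) - 1))
    return placement, len(free)

# Hypothetical mix: small inference jobs packed around one big training job.
placement, n_gpus = pack_jobs([
    ("infer-a", 4), ("infer-b", 6), ("train-c", 12), ("infer-d", 5),
])
print(n_gpus)  # → 2 GPUs instead of 4 dedicated ones
```

Here three inference jobs fit together on one 16 GB GPU, leaving the other for the large training job, which is the cost saving GPU packing is after.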

What is new with Watson Machine Learning Accelerator?

Watson Machine Learning Accelerator will be available on IBM Cloud Pak for Data later in the year, running on Red Hat® OpenShift® Container Platform. This offering can help businesses simplify the deep learning process from data preparation to model training, deployment, inference and governance as part of the AI lifecycle. New capabilities include GPU-optimized elastic distributed training and inference, GPU packing, multi-tenant resource policies, hyperparameter optimization for deep learning and improved monitoring of resource allocations and training results.

As open source frameworks and tooling constantly evolve, we maintain our focus on supporting the current, widely adopted technologies and apply our unique innovation to support enterprise use cases. We are also working with the Kubernetes community to enable parallel job execution and resource quota management for deep learning use cases. Further, we are exploring how to extend our elastic distributed inferencing engine to support inference at the network edge such as a handheld device or instrumented equipment.

Next steps: