Learn from Watson: How Containers Scale AI Workloads



I want to share an AI story that I recently told a group of analysts. AI workloads are growing exponentially, and containers are a natural place to host them. So, let’s look at how IBM Cloud Kubernetes Service has been hosting Watson’s AI workload so that you can see how AI workloads are different, and better, in the cloud.


The rise of AI has coincided with the rise of the cloud. Limitless capacity, elasticity, and specialized hardware delivered from the cloud have enabled massive improvements in machine learning technology. Companies now see the benefit of infusing AI into a variety of cloud-native apps, and they are growing AI within existing workloads.

In this article, you’ll learn why containers are a natural choice for AI workloads, particularly machine learning. The secret is in how you design containers and Kubernetes clusters for these unique workloads so that you can focus on building your app. You’ll see how Watson uses IBM Cloud Kubernetes Service to do the following:

  • Provide AI-ready capabilities for building cloud-native apps

  • Add cloud services to existing apps

  • Relieve pains around security, scale, and infrastructure management

By 2020, 85% of CIOs will pilot AI programs

Gartner predicts that by 2020, 85% of CIOs will pilot AI programs (1).

Yet, AI brings a new set of challenges to traditional hardware procurement. Machine learning requires specialized hardware, and that hardware quickly becomes obsolete. Forrester states: “AI chips you buy or use in the cloud today will be obsolete in about one year because AI chip innovation is so rapid.” (2)

What’s different about AI workloads?

Let’s explore how AI workloads are familiar in some ways while posing unique challenges. To start, AI projects, such as deep learning and machine learning, require the strength of high-performance computing (HPC) and graphics processing units (GPUs). Today, that usually means Nvidia GPUs, although other chips are emerging in the market from Intel, from startups, and in Google’s cloud offerings.

For your AI workloads on IBM Cloud, you can choose which IBM Cloud Kubernetes Service option fits your processing needs:

  • GPU bare metal, mg1c.16x128: 1 Tesla K80 physical card that has 2 graphics processing units (GPUs) per card for a total of 2 GPUs

  • GPU bare metal, mg1c.28x256: 2 Tesla K80 physical cards that have 2 GPUs per card for a total of 4 GPUs
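Once GPU workers like these are in your cluster, a pod requests GPUs through the standard Kubernetes `nvidia.com/gpu` resource. This is a minimal sketch, assuming the Nvidia device plugin is running on the workers; the pod name, image, and command are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job            # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: tensorflow/tensorflow:latest-gpu   # substitute your DL framework image
    command: ["python", "train.py"]           # hypothetical entry point
    resources:
      limits:
        nvidia.com/gpu: 2       # both GPUs on an mg1c.16x128 worker
```

The scheduler places the pod only on a node with two free GPUs, so you never have to pin training jobs to specific machines by hand.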

AI apps, like many modern apps, grow over time. The ability to scale your AI workload in tandem with your AI apps is essential. The public cloud gives you scaling—at your own pace.

Whether it’s healthcare data or financial information, training data can include sensitive information. But that doesn’t mean you can’t safely host deep-learning training workloads on the cloud, including the data itself. Cloud-based machine learning tools now offer built-in security, and your options for data storage on the cloud are easier to use and more secure than ever.

Enterprise workload example: Watson scale and complexity

So let’s get more specific and use Watson as an example of AI workloads. Watson’s cloud offerings have components that are common to most AI workloads:

  • An umbrella framework—Watson Studio

  • APIs for learning—data access, search, ingestion, curation, enrichment, training, and monitoring/management

  • Catalog of services—conversation, natural language processing (NLP), deep learning as a service (DLaaS), sentiment, tone, and so on

  • AI models—both the data and training

  • SaaS offerings—Watson Customer Care and Watson Cybersecurity are examples of how AI workloads become SaaS apps

Watson’s AI workload is a great cloud-native story. These workloads are best served by the marriage of the following:

  • Microservices and a service mesh: Example services include a recommendation engine, a conversational store, UIs, and dialog

  • Containers and Kubernetes orchestration: Including tolerance for changes and outages

  • DevOps: For the continuous delivery and integration pipeline

Hardened Kubernetes: Up-time and elasticity

The key indicator of Watson’s success is its ability to run at scale, which demonstrates the high reliability and availability of Kubernetes. In fact, IBM Cloud Kubernetes Service today successfully runs over 27,000 clusters across a variety of workloads, and Watson was one of our first big customers.

Watson’s biggest win—and a win for many customers—is not managing Kubernetes itself. Because IBM manages Kubernetes, infrastructure upgrades, security, and more, the Watson project team gets to focus on its AI mission.

  • 12 Watson services/apps represented as 800+ Kubernetes services

  • One deployment example: 3000+ pods on 500+ nodes

  • “We no longer worry about managing the infrastructure because IBM Cloud Kubernetes Service takes care of that for us.” – Watson Project Team

Designing containers and clusters for AI: IBM Cloud Kubernetes Service + open-source DL framework

Now that we’ve explored how Watson runs its AI workloads, it’s time to turn to how you design containers and clusters for AI. IBM has ready-made assets to make it easy: IBM Cloud Kubernetes Service plus an open-source DL framework.

The beginning: An IBM Design Thinking partnership used an iterative process to refine both the AI workload and the underlying Kubernetes service that’s hardened for production today. The architecture for the workload reflects the IBM One Cloud architecture for AI: infrastructure at the bottom, then Kubernetes and IBM Cloud services, with the Watson AI layers across the top.


AI developers naturally are attracted to IBM Cloud Kubernetes Service for many reasons:

  • They can collaborate easily in Dev/Test clusters with DL frameworks deployed in them.

  • Cluster sizing (CPU, memory, and storage) can be fit to the DL framework. For example, developers can pick storage and processor flavors and add more as needed.

  • They focus on coding DL apps instead of managing Kubernetes.

A few tips for developers:

  • Use Nvidia container images whenever possible (available on IBM Cloud).

  • Use shared file systems for long-running training jobs or when many training sessions use the same training data set.

  • Mount buckets to kick off a training job immediately, without the need to download the training data set into a shared file system, which could take many hours.
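The shared-file-system tip above can be sketched as two Kubernetes objects: a claim for shared file storage, and a training pod that mounts it read-only. This assumes a storage class that supports `ReadWriteMany` access; all names are illustrative:

```yaml
# Claim shared file storage once; many training pods can mount it concurrently
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data           # hypothetical claim for the shared data set
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: train-session-1         # hypothetical training session
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: tensorflow/tensorflow:latest-gpu
    volumeMounts:
    - name: data
      mountPath: /data          # training code reads the data set from here
      readOnly: true
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: training-data
```

Because the claim is shared, a second or third training session mounts the same data set without copying it again.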


AI’s moving fast, so you need to deploy your apps even faster. Be prepared with tools that work on IBM Cloud Kubernetes Service:

  • Toolchains with IBM Cloud Continuous Delivery to automate builds and deployments

  • Helm charts for repeatable processes

  • Rollouts/rollbacks with kubectl for rolling deployments

  • Other options: Codeship or Istio
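For example, the kubectl rollout workflow for a model-serving deployment looks like the following. The commands assume access to a cluster, and the deployment and image names are hypothetical:

```shell
# Roll out a new model-serving image and watch the rollout progress
kubectl set image deployment/scoring-api scoring-api=registry.example.com/scoring-api:v2
kubectl rollout status deployment/scoring-api

# If the new model version misbehaves, roll back to the previous revision
kubectl rollout undo deployment/scoring-api
```

Kubernetes replaces pods incrementally during the rollout, so the service keeps answering requests while the new model version comes online.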

Fabric for Deep Learning (FfDL)

Developers don’t have to start from scratch for AI. FfDL jumpstarts your training models. FfDL (pronounced “fiddle”) is an open-source project that makes deep learning easily accessible to the people to whom it matters most, such as data scientists and AI developers. FfDL provides a consistent way to deploy, train, and visualize deep learning jobs across multiple frameworks like TensorFlow, Caffe, PyTorch, and Keras. FfDL is developed in close collaboration with IBM Research and IBM Watson, and it forms the open-source core of Watson’s Deep Learning service. Get started with FfDL.

Secrets of Kubernetes for AI

I’ll let you in on some insider secrets for designing containers and clusters for AI’s unique workloads.

The first secret of AI on Kubernetes is how you achieve higher availability. Multizone clusters, along with other Kubernetes features, ensure HA for AI workloads, including the following:

  • Resiliency and fault tolerance

  • Intelligent scheduling

  • Self-healing

  • Service discovery and load balancing
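In a multizone cluster, for instance, pod anti-affinity can ask the scheduler to spread a service's replicas across zones so that a zone outage doesn't take the whole service down. This is a sketch with illustrative names; newer clusters use the `topology.kubernetes.io/zone` label instead of the legacy one shown here:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nlp-api                 # hypothetical AI microservice
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nlp-api
  template:
    metadata:
      labels:
        app: nlp-api
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: nlp-api
              # Prefer placing replicas in different availability zones
              topologyKey: failure-domain.beta.kubernetes.io/zone
      containers:
      - name: nlp-api
        image: registry.example.com/nlp-api:v1   # hypothetical image
```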

The next secret of AI on Kubernetes is scaling. You have quick access to specialized AI chips, and Kubernetes provides easy horizontal scaling, so you can scale to meet AI demands.
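Kubernetes drives horizontal scaling with a simple control loop in the Horizontal Pod Autoscaler. A minimal Python sketch of its core formula, ignoring the tolerance band and stabilization details:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """HPA core formula: scale replica count in proportion to observed load."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 inference pods averaging 90% CPU against a 60% target: scale out to 6
print(desired_replicas(4, 90.0, 60.0))
```

The same formula scales in: if load drops to half the target, the desired replica count halves (rounded up), so idle inference capacity is released automatically.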

Another secret of AI on Kubernetes is tailored infrastructure. You can choose bare metal for mathematically intensive workloads such as high-performance computing, machine learning, and 3D applications, or for significant storage needs. When you need new infrastructure, access to specialized hardware is key: you can pick what you need now and add more later. And because the workloads are containerized, your tailored infrastructure stays portable.

The next secret of AI on Kubernetes is low-code/no-code with IBM Cloud services instead of net-new coding. Low-code frees you up, whether you want to:

  • Add onto your AI models with FfDL

  • Extend app functionality with IBM Storage, Watson, or Analytics services

  • Keep using your favorite turn-key and open-source tools

The final secret is no secret at all: running workloads on the cloud means security needs to come baked in. IBM Cloud Kubernetes Service provides a full spectrum of built-in security features.


Bringing this full circle, containers + IBM Cloud Kubernetes Service are at the heart of the AI cloud journey. IBM simplifies infrastructure management by managing the Kubernetes master, IaaS, and operational components such as Ingress and storage; monitoring the health and recovery of worker nodes; and providing global compute so that developers don’t have to stand up infrastructure in the geographies where their workloads and data need to reside. AI developers aren’t distracted by host administration.

Get coding. We’ll manage the rest.


(1): Predicts 2018: Stimulate Creativity to Generate Success — A Gartner Trend Insight Report

(2): AI Deep Learning Workloads Demand A New Approach To Infrastructure: GPUs Dominate Now, But A Broader Landscape Of AI Chips And Systems Is Evolving Quickly (Forrester)
