August 25, 2020 By Jason McGee 3 min read

After years of running Kubernetes, I’ve learned a few things about scale.

Automating the management of more than 1,000,000 containerized applications across the globe tends to expose weaknesses in your management approach, as standard systems and solutions fail under workloads of that magnitude.

But that upsurge in scale also revealed some critical best practices.

In its six-year lifespan, Kubernetes has solved a fundamental challenge—how to build a platform that lets app developers focus on building their apps instead of focusing on all the plumbing and the infrastructure for running those apps.

What we did 20 years ago in Java with app servers, we’re doing now with cloud. And we’re doing it on top of containers. Kubernetes is the open source container orchestration platform of choice. Surveys find that more than 70% of enterprises that run DevOps and use containers are using Kubernetes for web and API applications, databases, data warehouses, machine learning, blockchain, IoT applications, and high-volume websites.

Of course, you can manage one cluster by hand. Add a few more, in the 2-10 cluster range, and your familiar tools will function satisfactorily. But more than 10? That’s when those stable tools are pressure-tested. They fail. Now, it’s time for help.

Many organizations initially lifted and shifted their applications to the cloud as monoliths. The next wave of cloud native applications is being built using microservices based on containers that primarily run on Kubernetes. Currently, monolithic and cloud native applications are being deployed in roughly equal numbers, while the early monolithic applications are often being modified with microservice-based extensions that make it easier to add functionality.

On a growth curve where the number of users dramatically outpaced the size of our development team, we learned that running cloud at scale required a solution to two persistent issues.

  1. How does the team manage such a vast system?
  2. How do we gain visibility into what’s running in our clusters and update them?

Adapt the system, not the team—build all operational work where the team is

If you want a small team to be able to manage a large environment, you have to make everything as efficient for the people as possible. This means adapting the technology to the people. Switching between tools and systems was slowing us down. The “aha” moment was when we realized that the team spent their entire day talking to each other on Slack and if we could bring the management system to Slack, we could all go faster.

The insight led to a ChatOps model, in which all of the team’s data and the operations needed to manage production could be handled by bots integrated into the conversations the team was already having. Now, pushing an update, handling an incident, identifying the right runbook, accessing systems, and collecting audit data can all happen without ever leaving the conversation.
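To make the ChatOps idea concrete, here is a minimal sketch of how a bot might route chat messages to operational commands while keeping an audit trail. It is not the tooling described in this article: the command names, the `RUNBOOKS` table, and `handle_message` are all hypothetical, and a real bot would sit behind a chat platform's API and an access-control layer.

```python
from typing import Callable, Dict, List

# Hypothetical command registry and audit trail (illustrative names only).
HANDLERS: Dict[str, Callable[[List[str]], str]] = {}
AUDIT_LOG: List[dict] = []

# Hypothetical alert-to-runbook table; real entries would live elsewhere.
RUNBOOKS: Dict[str, str] = {"disk-full": "runbooks/disk-full.md"}


def command(name: str):
    """Register a function as the handler for a chat command."""
    def register(func: Callable[[List[str]], str]):
        HANDLERS[name] = func
        return func
    return register


@command("deploy")
def deploy(args: List[str]) -> str:
    # A real handler would enqueue the update; this one only acknowledges it.
    if len(args) != 2:
        return "usage: deploy <service> <version>"
    service, version = args
    return f"queued update of {service} to {version}"


@command("runbook")
def runbook(args: List[str]) -> str:
    if len(args) != 1:
        return "usage: runbook <alert>"
    return RUNBOOKS.get(args[0], f"no runbook for {args[0]}")


def handle_message(user: str, text: str) -> str:
    """Parse one chat message, run the matching command, record it for audit."""
    parts = text.split()
    if not parts:
        return "no command given"
    name, args = parts[0], parts[1:]
    handler = HANDLERS.get(name)
    reply = handler(args) if handler else f"unknown command: {name}"
    AUDIT_LOG.append({"user": user, "command": text, "reply": reply})
    return reply
```

Because every command passes through one function, the audit record is produced as a side effect of normal work rather than as a separate step.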

Focus on managing change efficiently

At the start, we used a traditional Jenkins-based CI/CD model to update our systems. But it didn’t scale and was too slow to deploy. It was fragile. The rules governing deployment decisions became too complex. So, we built a different system to help us manage and inventory deployments at scale.

Switch to a pull-based, self-updating cluster model

Instead of pushing applications into production, all clusters could pull changes and update themselves. This allowed us to scale easily and maintain control over what was running.
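The pull model can be sketched as a small reconcile step that each cluster runs for itself: compare what is running with what should be running, and apply the difference. This is an illustration under assumptions, not the system described here. The `Cluster` class and `reconcile` function are hypothetical, and the desired state is passed in as a plain dictionary where a real agent would fetch it from a central service.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cluster:
    """Hypothetical cluster record: maps app name to running version."""
    name: str
    running: Dict[str, str] = field(default_factory=dict)


def reconcile(cluster: Cluster, desired: Dict[str, str]) -> List[str]:
    """Move the cluster toward the desired state; return the actions taken."""
    actions: List[str] = []
    # Install or upgrade everything the desired state asks for.
    for app, version in desired.items():
        current = cluster.running.get(app)
        if current is None:
            actions.append(f"install {app}@{version}")
        elif current != version:
            actions.append(f"upgrade {app} {current}->{version}")
        cluster.running[app] = version
    # Remove anything that is running but no longer desired.
    for app in list(cluster.running):
        if app not in desired:
            actions.append(f"remove {app}")
            del cluster.running[app]
    return actions


cluster = Cluster("example-cluster", {"web": "1", "old": "3"})
first_pass = reconcile(cluster, {"web": "2", "api": "1"})
second_pass = reconcile(cluster, {"web": "2", "api": "1"})
```

Running the same step twice yields no further actions on the second pass, which is what lets every cluster poll on its own schedule without central coordination.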

Flexible rule- and label-based configuration

Having tens of thousands of clusters means you’re not operating on individual clusters but on fleets of systems. To do this, we established rules to decide where applications ran and used labels within the environment to give us the fine-grained control we needed over the system.
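As a rough illustration of rule- and label-based targeting, the sketch below decides which applications belong on a cluster by matching each rule's selector against the cluster's labels. The rule format, label keys, and function names are assumptions made for this example. The matching logic is similar in spirit to Kubernetes equality-based label selectors, where every key in the selector must be present with the same value.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(selector: Labels, labels: Labels) -> bool:
    """True when every key/value pair in the selector appears in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def apps_for_cluster(rules: List[dict], cluster_labels: Labels) -> List[str]:
    """Return the apps whose rule selects a cluster with these labels."""
    return [rule["app"] for rule in rules if matches(rule["selector"], cluster_labels)]


# Hypothetical rules: an empty selector targets the whole fleet.
rules = [
    {"app": "logging-agent", "selector": {}},
    {"app": "gpu-driver", "selector": {"hardware": "gpu"}},
    {"app": "eu-audit", "selector": {"region": "eu", "env": "prod"}},
]

eu_prod = apps_for_cluster(rules, {"region": "eu", "env": "prod"})
us_gpu = apps_for_cluster(rules, {"region": "us", "hardware": "gpu"})
```

Adding a cluster to a fleet then means giving it the right labels rather than editing a deployment list.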

We also needed the systems to report for themselves: what was running in every cluster and what capabilities were deployed in each system around the world.
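Self-reporting can be pictured as each cluster sending a small inventory record that a central service folds into a fleet-wide index. The report shape and the `summarize` function below are hypothetical and only meant to show the idea.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def summarize(reports: List[dict]) -> Dict[Tuple[str, str], List[str]]:
    """Group cluster names by the (app, version) pairs they report running."""
    index: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for report in reports:
        for app, version in report["apps"].items():
            index[(app, version)].append(report["cluster"])
    return dict(index)


# Hypothetical self-reports from three clusters.
reports = [
    {"cluster": "c1", "apps": {"web": "2", "api": "1"}},
    {"cluster": "c2", "apps": {"web": "1"}},
    {"cluster": "c3", "apps": {"web": "2"}},
]

inventory = summarize(reports)
```

With an index like this, the question "which clusters still run the old version of web?" becomes a single lookup.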

This new approach allowed us to grow to tens of thousands of managed clusters that we can reliably update thousands of times every week without having to grow our team. In the spirit of sharing what we learned, we even open-sourced our tools at

Beyond one million

Operations teams are typically tasked with running deployments across at least hundreds—usually thousands—of containers. In any enterprise’s IT infrastructure, the need to schedule and automate deployment, availability, and scalability is critical. Kubernetes is the de facto solution.

So, no matter how high the growth curve climbs, where users outnumber developers, Kubernetes empowers small teams to operate at scale in public and hybrid environments.

Discover how Red Hat® OpenShift® on IBM does this with velocity, market responsiveness, scalability, and reliability.
