After years of running Kubernetes, I’ve learned a few things about scale.

Automating the management of more than 1,000,000 containerized applications across the globe tends to expose weaknesses in your management approach, as standard systems and solutions fail under workloads of that magnitude.

But that upsurge in scale also revealed some critical best practices.

In its six-year lifespan, Kubernetes has solved a fundamental challenge—how to build a platform that lets app developers focus on building their apps instead of focusing on all the plumbing and the infrastructure for running those apps.

What we did 20 years ago in Java with app servers, we’re doing now with cloud. And we’re doing it on top of containers. Kubernetes is the open source container orchestration platform of choice. Surveys find that more than 70% of enterprises that run DevOps and use containers are using Kubernetes for web and API applications, databases, data warehouses, machine learning, blockchain, IoT applications, and high-volume websites.

Of course, you can manage one cluster by hand. Add a few more, somewhere in the range of 2 to 10 clusters, and your familiar tools will function satisfactorily. But more than 10? That’s when those stable tools are pressure-tested. They fail. Now, it’s time for help.

Many organizations initially lifted and shifted their applications to the cloud as monoliths. The next wave of cloud native applications is being built using microservices based on containers that primarily run on Kubernetes. Currently, monolithic and cloud native applications are being deployed in roughly equal numbers, while the early monolithic applications are often being modified with microservice-based extensions that make it easier to add functionality.

On a growth curve where the number of users dramatically outpaced the size of our development team, we learned that running cloud at scale required a solution to two persistent issues.

  1. How does the team manage such a vast system?
  2. How do we gain visibility into what’s running in our clusters and update them?

Adapt the system, not the team—build all operational work where the team is

If you want a small team to be able to manage a large environment, you have to make everything as efficient for the people as possible. This means adapting the technology to the people. Switching between tools and systems was slowing us down. The “aha” moment was when we realized that the team spent their entire day talking to each other on Slack and if we could bring the management system to Slack, we could all go faster.

The insight led to a ChatOps model, where all of the team data and operations needed to manage production could be handled by bots integrated into the conversations the team was already having. Now, pushing an update, handling an incident, identifying the right runbook, accessing systems, and collecting audit data can all happen without ever having to leave the conversation.
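As a rough illustration of the ChatOps idea, the Python sketch below routes chat messages that start with `!` to registered handlers and records each command for audit. The `ChatOpsBot` class, the `!runbook` command, and the runbook path are hypothetical names chosen for this example; this is not our production bot, and it does not use any real Slack API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ChatOpsBot:
    """Routes chat messages to operational handlers and keeps an audit trail."""

    handlers: Dict[str, Callable[[List[str]], str]] = field(default_factory=dict)
    audit_log: List[dict] = field(default_factory=list)

    def register(self, name: str, handler: Callable[[List[str]], str]) -> None:
        self.handlers[name] = handler

    def handle(self, user: str, message: str) -> str:
        # Assumed command form: "!<command> <arg> <arg> ..."
        if not message.startswith("!"):
            return ""  # ordinary conversation, not a command
        parts = message[1:].split()
        if not parts:
            return "usage: !<command> [args]"
        command, args = parts[0], parts[1:]
        handler = self.handlers.get(command)
        reply = handler(args) if handler else f"unknown command: {command}"
        # Every parsed command, known or not, is recorded for later audit.
        self.audit_log.append({"user": user, "command": command, "args": args})
        return reply


def runbook(args: List[str]) -> str:
    # Hypothetical runbook index; a real one would live in a docs system.
    books = {"etcd-latency": "runbooks/etcd-latency.md"}
    if not args:
        return "usage: !runbook <topic>"
    return books.get(args[0], f"no runbook for {args[0]}")


bot = ChatOpsBot()
bot.register("runbook", runbook)
print(bot.handle("alice", "!runbook etcd-latency"))
```

The point of the design is that the audit record is a side effect of the conversation itself, so nobody has to remember to log what they did.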

Focus on managing change efficiently

At the start, we used a traditional Jenkins-based CI/CD model to update our systems. But it didn’t scale and was too slow to deploy. It was fragile. The rules governing deployment decisions became too complex. So, we built a different system to help us manage and inventory deployments at scale.

Switch to a pull-based, self-updating cluster model

Instead of pushing applications into production, all clusters could pull changes and update themselves. This allowed us to scale easily and maintain control over what was running.
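One way to picture the pull model is as a reconcile step that each cluster runs on its own schedule. The sketch below is a simplified assumption, not our actual tooling: `fetch_desired` stands in for whatever call a cluster makes to a central source of truth, and state is reduced to a mapping of application name to version.

```python
from typing import Callable, Dict, Optional, Tuple

# Simplified state: application name -> version string.
State = Dict[str, str]
Action = Tuple[str, Optional[str]]


def plan(desired: State, running: State) -> Dict[str, Action]:
    """Compute the actions needed to make `running` match `desired`."""
    actions: Dict[str, Action] = {}
    for app, version in desired.items():
        if app not in running:
            actions[app] = ("install", version)
        elif running[app] != version:
            actions[app] = ("upgrade", version)
    for app in running:
        if app not in desired:
            actions[app] = ("remove", None)
    return actions


def reconcile_once(fetch_desired: Callable[[], State], running: State) -> Dict[str, Action]:
    """One pull cycle: fetch the desired state, then apply the difference locally."""
    actions = plan(fetch_desired(), running)
    for app, (verb, version) in actions.items():
        if verb == "remove":
            del running[app]
        else:
            running[app] = version
    return actions


# Hypothetical cluster that is one release behind and carries a retired app.
running = {"api": "1.0", "legacy": "0.9"}
print(reconcile_once(lambda: {"api": "1.1", "web": "2.0"}, running))
```

Because the cluster initiates the pull, the central system never needs credentials for, or a network path into, every cluster, and a cluster that was offline simply catches up on its next cycle.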

Flexible rule- and label-based configuration

Having tens of thousands of clusters means you’re never operating on an individual cluster, but on fleets of systems. To do this, we established rules to decide where applications ran and used labels within the environment to give us the fine-grained control we needed over the system.
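The following sketch shows, under simplifying assumptions, how rules and labels can combine to decide placement across a fleet. The rule format (an `app` plus an equality-only `selector`) and the cluster names and labels are invented for illustration; real selectors, including Kubernetes label selectors, support richer operators.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(selector: Labels, labels: Labels) -> bool:
    """A cluster matches when it carries every key/value pair in the selector."""
    return all(labels.get(key) == value for key, value in selector.items())


def placements(rules: List[dict], clusters: Dict[str, Labels]) -> Dict[str, List[str]]:
    """For each cluster, list the apps whose rule selector matches its labels."""
    return {
        name: sorted({rule["app"] for rule in rules if matches(rule["selector"], labels)})
        for name, labels in clusters.items()
    }


# Hypothetical rules: an empty selector targets the whole fleet.
rules = [
    {"app": "logging-agent", "selector": {}},
    {"app": "gpu-driver", "selector": {"hardware": "gpu"}},
    {"app": "eu-audit", "selector": {"region": "eu", "env": "prod"}},
]
clusters = {
    "cluster-a": {"region": "eu", "env": "prod"},
    "cluster-b": {"region": "us", "env": "prod", "hardware": "gpu"},
}
print(placements(rules, clusters))
```

Adding a cluster to the fleet then means labeling it correctly; no rule has to name the cluster explicitly.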

We also needed the systems to report for themselves: what was running in every cluster and what capabilities were deployed in each system around the world.
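Self-reporting can be sketched as each cluster periodically submitting what it runs, with a central inventory answering fleet-wide questions. The `Inventory` class and its methods below are hypothetical and hold everything in memory; they only illustrate the shape of the idea.

```python
from collections import defaultdict
from typing import Dict, List


class Inventory:
    """Keeps the latest self-reported state of every cluster."""

    def __init__(self) -> None:
        self._reports: Dict[str, Dict[str, str]] = {}

    def record(self, cluster: str, running: Dict[str, str]) -> None:
        # The most recent report from a cluster replaces its previous one.
        self._reports[cluster] = dict(running)

    def versions_of(self, app: str) -> Dict[str, List[str]]:
        """Group the clusters running `app` by the version they report."""
        by_version: Dict[str, List[str]] = defaultdict(list)
        for cluster, running in sorted(self._reports.items()):
            if app in running:
                by_version[running[app]].append(cluster)
        return dict(by_version)


inventory = Inventory()
inventory.record("cluster-a", {"api": "1.0"})
inventory.record("cluster-b", {"api": "1.1"})
print(inventory.versions_of("api"))
```

A query like `versions_of` is what makes it possible to see how far a rollout has progressed without asking any individual cluster.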

This new approach allowed us to grow to tens of thousands of managed clusters that we can reliably update thousands of times every week without having to grow our team. In the spirit of sharing what we learned, we even open-sourced our tools at

Beyond one million

Operations teams are typically tasked with running deployments across at least hundreds—usually thousands—of containers. In any enterprise’s IT infrastructure, the need to schedule and automate deployment, availability, and scalability is critical. Kubernetes is the de facto solution.

So, no matter how high the growth curve climbs, where users outnumber developers, Kubernetes empowers small teams to operate at scale in public and hybrid environments.

Discover how Red Hat® OpenShift® on IBM does this with velocity, market responsiveness, scalability, and reliability.

