August 25, 2020 By Jason McGee 3 min read

After years of running Kubernetes, I’ve learned a few things about scale.

Automating the management of more than 1,000,000 containerized applications across the globe tends to expose weaknesses in your management approach, as standard systems and solutions fail under workloads of that magnitude.

But that upsurge in scale also revealed some critical best practices.

In its six-year lifespan, Kubernetes has solved a fundamental challenge—how to build a platform that lets app developers focus on building their apps instead of focusing on all the plumbing and the infrastructure for running those apps.

What we did 20 years ago in Java with app servers, we’re doing now with cloud. And we’re doing it on top of containers. Kubernetes is the open source container orchestration platform of choice. Surveys find that more than 70% of enterprises than run DevOps and use containers are using Kubernetes for web and API applications, databases, data warehouses, machine learning, blockchain, IoT applications, and high-volume websites.

Of course, you can manage one cluster by hand. Add a few more–in the 2-10 cluster-range—and your familiar tools will function satisfactorily. But more than 10? That’s when those stable tools are pressure-tested. They fail. Now, it’s time for help.

Many organizations initially lifted and shifted their applications to the cloud as monoliths. The next wave of cloud native applications are being built using microservices based on containers that primarily run of Kubernetes. Currently, monolithic and cloud native applications are being deployed in roughly equal numbers, while the early monolithic applications are often being modified with extensions based on microservices that make it easier to add additional functionality.

On a growth curve where the number of users dramatically outpaced the size of our development team, we learned that running cloud at scale required a solution to two persistent issues.

  1. How does the team manage such a vast system?
  2. How do we gain visibility into what’s running in our clusters and update them?

Adapt the system, not the team—build all operational work where the team is

If you want a small team to be able to manage a large environment, you have to make everything as efficient for the people as possible. This means adapting the technology to the people. Switching between tools and systems was slowing us down. The “aha” moment was when we realized that the team spent their entire day talking to each other on Slack and if we could bring the management system to Slack, we could all go faster.

The insight led to a ChatOps model, where all of the team data and operations to manage production could take place using bots integrated in the conversations the team was having. Now, pushing an update, handling an incident, identifying the right runbook, access systems, and collecting audit data can all happen without ever having to leave the conversation.

Focus on managing change efficiently

At the start, we used a traditional Jenkins based CI/CD model to update our systems. But it didn’t scale and was too slow to deploy. It was fragile. Its rules over deployment decisions became too complex. So, we built a different system to help us manage and inventory deployments at scale.

Switched to pull-based self-updated cluster model

Instead of pushing applications into production, all clusters could pull changes and update themselves. This allowed us to scale easily and maintain control over what was running.

Flexible rule- and label based-configuration

Having tens of thousands of clusters means you’re not doing anything on an individual cluster, but fleets of systems. To do this, we established rules to decide where applications ran and used labels within the environment to give us the fine-grained controls we needed over the system.

We also needed the systems to report for themselves: what was running in every cluster and what capabilities were deployed in each system around the world.

This new approach allowed us to grow to tens of thousands of managed clusters that we can reliably update 1000s of times every week without having to grow our team. In the spirit of sharing what we learned, we even open-sourced our tools at razee.io.

Beyond one million

Operations teams are typically tasked with running deployments across at least hundreds—usually thousands—of containers. In any enterprise’s IT infrastructure, the need to schedule and automate deployment, availability, and scalability is critical. Kubernetes is the de facto solution.

So, no matter how high the growth curve climbs, where users outnumber developers, Kubernetes empowers small teams to operate at scale in public and hybrid environments.

Discover how Red Hat® OpenShift® on IBM does this with velocity, market responsiveness, scalability, and reliability.

Was this article helpful?
YesNo

More from Cloud

IBM Tech Now: April 8, 2024

< 1 min read - ​Welcome IBM Tech Now, our video web series featuring the latest and greatest news and announcements in the world of technology. Make sure you subscribe to our YouTube channel to be notified every time a new IBM Tech Now video is published. IBM Tech Now: Episode 96 On this episode, we're covering the following topics: IBM Cloud Logs A collaboration with IBM watsonx.ai and Anaconda IBM offerings in the G2 Spring Reports Stay plugged in You can check out the…

The advantages and disadvantages of private cloud 

6 min read - The popularity of private cloud is growing, primarily driven by the need for greater data security. Across industries like education, retail and government, organizations are choosing private cloud settings to conduct business use cases involving workloads with sensitive information and to comply with data privacy and compliance needs. In a report from Technavio (link resides outside ibm.com), the private cloud services market size is estimated to grow at a CAGR of 26.71% between 2023 and 2028, and it is forecast to increase by…

Optimize observability with IBM Cloud Logs to help improve infrastructure and app performance

5 min read - There is a dilemma facing infrastructure and app performance—as workloads generate an expanding amount of observability data, it puts increased pressure on collection tool abilities to process it all. The resulting data stress becomes expensive to manage and makes it harder to obtain actionable insights from the data itself, making it harder to have fast, effective, and cost-efficient performance management. A recent IDC study found that 57% of large enterprises are either collecting too much or too little observability data.…

IBM Newsletters

Get our newsletters and topic updates that deliver the latest thought leadership and insights on emerging trends.
Subscribe now More newsletters