After years of running Kubernetes, I’ve learned a few things about scale.
Automating the management of more than 1,000,000 containerized applications across the globe tends to expose weaknesses in your management approach, as standard systems and solutions fail under workloads of that magnitude.
But that upsurge in scale also revealed some critical best practices.
In its six-year lifespan, Kubernetes has solved a fundamental challenge—how to build a platform that lets app developers focus on building their apps instead of focusing on all the plumbing and the infrastructure for running those apps.
What we did 20 years ago in Java with app servers, we’re doing now with cloud. And we’re doing it on top of containers. Kubernetes is the open source container orchestration platform of choice. Surveys find that more than 70% of enterprises that run DevOps and use containers are using Kubernetes for web and API applications, databases, data warehouses, machine learning, blockchain, IoT applications, and high-volume websites.
Of course, you can manage one cluster by hand. Add a few more, somewhere in the range of 2 to 10 clusters, and your familiar tools will still function satisfactorily. But more than 10? That’s when those stable tools are pressure-tested. They fail. Now, it’s time for help.
Many organizations initially lifted and shifted their applications to the cloud as monoliths. The next wave of cloud native applications is being built using microservices based on containers that primarily run on Kubernetes. Currently, monolithic and cloud native applications are being deployed in roughly equal numbers, while the early monolithic applications are often being extended with microservices that make it easier to add functionality.
On a growth curve where the number of users dramatically outpaced the size of our development team, we learned that running cloud at scale required a solution to two persistent issues.
- How does the team manage such a vast system?
- How do we gain visibility into what’s running in our clusters and update them?
Adapt the system, not the team—build all operational work where the team is
If you want a small team to be able to manage a large environment, you have to make everything as efficient for the people as possible. This means adapting the technology to the people. Switching between tools and systems was slowing us down. The “aha” moment was when we realized that the team spent their entire day talking to each other on Slack and if we could bring the management system to Slack, we could all go faster.
The insight led to a ChatOps model, where all of the team data and operations needed to manage production could be handled by bots integrated into the conversations the team was already having. Now, pushing an update, handling an incident, identifying the right runbook, accessing systems, and collecting audit data can all happen without ever having to leave the conversation.
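To make the ChatOps idea concrete, here is a minimal sketch of a command dispatcher. It is not our actual bot: the `ChatOpsBot` class, the `runbook` command, and the runbook path are all hypothetical, and a real bot would sit behind a chat platform's API rather than being called directly. The point it illustrates is that when every operation goes through the conversation, the audit trail comes for free.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ChatOpsBot:
    """Hypothetical bot that maps chat commands to operational handlers."""
    handlers: Dict[str, Callable[[List[str]], str]] = field(default_factory=dict)
    audit_log: List[dict] = field(default_factory=list)

    def command(self, name: str):
        """Register a handler for messages whose first word is `name`."""
        def register(func):
            self.handlers[name] = func
            return func
        return register

    def handle(self, user: str, message: str) -> str:
        """Dispatch one chat message and record who asked for what."""
        parts = message.split()
        if not parts or parts[0] not in self.handlers:
            reply = "unknown command"
        else:
            reply = self.handlers[parts[0]](parts[1:])
        # Every request is recorded, so audit data is a by-product of the chat.
        self.audit_log.append({"user": user, "message": message, "reply": reply})
        return reply


bot = ChatOpsBot()


@bot.command("runbook")
def find_runbook(args: List[str]) -> str:
    # Illustrative lookup table; a real bot would query a runbook store.
    runbooks = {"disk-full": "runbooks/disk-full.md"}
    if not args:
        return "usage: runbook <alert>"
    return runbooks.get(args[0], "no runbook found")
```

Calling `bot.handle("alice", "runbook disk-full")` returns the runbook location and appends an entry to `bot.audit_log`, so the same message that does the work also documents it.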
Focus on managing change efficiently
At the start, we used a traditional Jenkins-based CI/CD model to update our systems. But it didn’t scale and was too slow to deploy. It was fragile. The rules governing deployment decisions became too complex. So, we built a different system to help us manage and inventory deployments at scale.
Switch to a pull-based, self-updating cluster model
Instead of pushing applications into production, all clusters could pull changes and update themselves. This allowed us to scale easily and maintain control over what was running.
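The pull model can be sketched as a small reconcile loop that runs inside each cluster. This is an illustration under assumptions, not the code we run: `ClusterAgent` and `fetch_desired` are hypothetical names, and a real agent would fetch its desired state over the network and apply Kubernetes resources instead of updating a dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class ClusterAgent:
    """Hypothetical in-cluster agent that pulls desired state and applies it."""
    fetch_desired: Callable[[], Dict[str, str]]  # returns {app_name: version}
    running: Dict[str, str] = field(default_factory=dict)

    def reconcile(self) -> Dict[str, Optional[str]]:
        """Run one pull cycle and return the changes that were applied.

        A value of None in the result means the app was removed.
        """
        desired = self.fetch_desired()
        changes: Dict[str, Optional[str]] = {}
        # Install or upgrade anything whose version differs from the desired one.
        for app, version in desired.items():
            if self.running.get(app) != version:
                self.running[app] = version
                changes[app] = version
        # Remove anything that is no longer desired.
        for app in list(self.running):
            if app not in desired:
                del self.running[app]
                changes[app] = None
        return changes
```

Because each cluster only ever asks "what should I be running?", the central system never has to hold a connection to, or credentials for, every cluster. Running `reconcile()` again with no change in desired state returns an empty result, which is what makes it safe to call on a timer.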
Flexible rule- and label-based configuration
Having tens of thousands of clusters means you’re no longer working on individual clusters, but on fleets of systems. To do this, we established rules to decide where applications ran and used labels within the environment to give us the fine-grained controls we needed over the system.
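As a rough sketch of how rules and labels can work together, the snippet below resolves which application versions a single cluster should run. The `Rule` type, the label keys, and the "later rules override earlier ones" policy are assumptions made for this example, not a description of our production rule engine.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rule:
    """Hypothetical rule: run `app` at `version` on clusters matching `selector`."""
    app: str
    version: str
    selector: Dict[str, str]


def matches(labels: Dict[str, str], selector: Dict[str, str]) -> bool:
    """A cluster matches when it carries every label the selector asks for."""
    return all(labels.get(key) == value for key, value in selector.items())


def desired_state(labels: Dict[str, str], rules: List[Rule]) -> Dict[str, str]:
    """Resolve the apps one cluster should run; later rules override earlier ones."""
    state: Dict[str, str] = {}
    for rule in rules:
        if matches(labels, rule.selector):
            state[rule.app] = rule.version
    return state
```

With this shape, rolling a new version out to one region first is a matter of adding a more specific rule (for example, a selector with both `env` and `region`) after the general one, rather than touching any cluster directly.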
We also needed the systems to report for themselves: what was running in every cluster and what capabilities were deployed in each system around the world.
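One way to picture self-reporting is as a stream of small reports that a central service folds into an inventory. The report shape used here (a `cluster` id plus a `running` map of app to version) and the `build_inventory` function are hypothetical, chosen only to show the idea.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_inventory(reports: Iterable[dict]) -> Dict[str, Dict[str, List[str]]]:
    """Group self-reported cluster state as {app: {version: [cluster ids]}}."""
    inventory: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for report in reports:
        for app, version in report["running"].items():
            inventory[app][version].append(report["cluster"])
    # Convert to plain dicts so callers get ordinary, comparable data.
    return {app: dict(versions) for app, versions in inventory.items()}
```

An inventory grouped this way answers the question that matters during a rollout or an incident: which clusters are still on the old version of a given application.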
This new approach allowed us to grow to tens of thousands of managed clusters that we can reliably update thousands of times every week without having to grow our team. In the spirit of sharing what we learned, we even open-sourced our tools at razee.io.
Beyond one million
Operations teams are typically tasked with running deployments across at least hundreds—usually thousands—of containers. In any enterprise’s IT infrastructure, the need to schedule and automate deployment, availability, and scalability is critical. Kubernetes is the de facto solution.
So, no matter how high the growth curve climbs or how far users outnumber developers, Kubernetes empowers small teams to operate at scale in public and hybrid environments.