Cloud native applications and the “golden age” of service management
I’ve spent the last year helping clients migrate to IBM’s Cloud environments, specifically with the management and monitoring of those environments. Some clients expressed the concern that adopting microservices-oriented cloud architectures would make operating these environments more complicated due to multiple moving parts, dynamic topologies, and higher SLA requirements.
Coming from a background that began in development and transitioned into Monitoring/Service Management, I have found just the opposite: monitoring operations are easier in a cloud-native environment.
Back in the so-called “traditional era” when things were presumably simpler, it wasn’t easy to convince development teams to sit down and discuss the inner workings of their services, expose APIs to make monitoring easier, and (the worst!) keep their documentation and runbooks up-to-date. The topology of the monolith application may have been reasonably static, but it could change one night when a new version was deployed and suddenly the operations personnel would be trying to monitor processes that no longer existed!
When we tried to implement Application Performance Management (APM) or perform deep monitoring of a monolithic application, it meant getting the development teams to adopt whatever tools the operations team knew and embedding them in the application during development or just prior to deployment into production. These tools were usually not used in development environments, and the development team had little interest in them. Worse yet, the dev teams would sometimes have their own APM tools and skip the standard tools (aka “shadow IT” (*)).
(*) The use of multiple tools is not in-and-of-itself a bad thing, but a “wild west” environment where one team ignores the requirements of other teams is a problem!
Moving forward with DevOps and cloud-native deployments
The widespread adoption of DevOps was a significant step forward, but the emphasis remained on the development cycle; often the “Ops” of DevOps was limited to successful deployment. Cloud-native applications naturally favor a microservices architecture; these approaches to application development and deployment complement each other, simplifying the jobs of both the development and operations teams.
Looking at the “golden triangle” of people, processes and technologies depicted below, let’s consider what these changes mean to Cloud Service Management & Operations (CSMO) in general and monitoring in particular.
Although I do recognize that technology’s complexity has increased, tools have more than helped handle these issues. For example:
Containers can be monitored externally, or monitoring code can be embedded into the container itself, reducing the complexity of the overall solution. In contrast, older monitoring tools required deploying agents into VMs.
Modern development frameworks enable developers to expose the inner workings of the services in a way that would have required significant effort in the past (“observability” is the operative buzzword).
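As a hedged illustration of what “exposing the inner workings” can look like, here is a minimal sketch of a service publishing internal counters on a `/metrics` endpoint. The names `METRICS` and `render_metrics` and the port are hypothetical; real services typically use a client library such as prometheus_client instead of hand-rolling this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real service would update these
# from its request-handling code.
METRICS = {"requests_total": 0, "errors_total": 0}

def render_metrics(metrics):
    """Render counters in a Prometheus-style plain-text format."""
    return "\n".join(f"{name} {value}" for name, value in sorted(metrics.items()))

class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the counters on /metrics so an external monitor can scrape them."""
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics(METRICS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To actually serve, you would run:
# HTTPServer(("", 8080), MetricsHandler).serve_forever()
```

The point is that the service itself decides what to expose, and any monitoring tool that speaks HTTP can consume it.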
Looking at the process part of the triangle, DevOps and accompanying concepts such as blue/green deployment actually make handling problems simpler and easier than before, because if the service is properly configured, customers can easily be routed from a failed section to a working one.
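The routing idea behind blue/green deployment can be sketched in a few lines; the pool names and backends below are purely illustrative, and a real platform would do this in its load balancer or service mesh rather than in application code.

```python
class BlueGreenRouter:
    """Toy blue/green router: all traffic goes to the active pool,
    and a failed deployment is handled by flipping back to the other pool."""

    def __init__(self):
        # Hypothetical backend names for the two deployment pools.
        self.pools = {"blue": ["blue-1", "blue-2"], "green": ["green-1", "green-2"]}
        self.active = "blue"
        self._next = 0

    def route(self):
        """Pick a backend from the active pool (simple round-robin)."""
        pool = self.pools[self.active]
        backend = pool[self._next % len(pool)]
        self._next += 1
        return backend

    def switch(self):
        """Flip traffic between blue and green, e.g. after a failed rollout."""
        self.active = "green" if self.active == "blue" else "blue"
```

Because customers only ever see the active pool, switching away from a failed deployment is a single routing change rather than an emergency repair.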
The most important part of the triangle is, of course, the people. I’ve spent some time talking to developers who are starting the migration process from monolithic DevOps to cloud-native development on IBM Cloud Private. Without exception, they were excited by the control that cloud-native development puts into their hands, especially in the domains of Monitoring and Service Management.
Let’s consider some of the advantages that the cloud-native approach brings to Service Management in more detail.
Health checks are easier with cloud-native apps
Let’s look at one of the most typical operations issues in a monolithic application: Is a given process actually doing work? How can I distinguish between a process that is idling, waiting for new transactions, and a zombie that has choked on a transaction?
Traditionally, answering these questions required an examination of the externalities of the process. Is there anything waiting in the queue? Is the memory/CPU consumption flat-lining or changing? Should I send a synthetic transaction at regular intervals and see if it succeeds?
All these tests are useful, but can never give you full confidence that “Yes, the process will complete the next transaction”. Furthermore, mapping out the different requirements and metrics to measure was a lengthy, human-centric process requiring coordination between operations and development (“Which queue is this process feeding off?”, “What kind of synthetic test can I run against the APIs?”, etc.). The obvious solution to this issue is a health check that will do the internal diagnosis.
The first step toward modern development practice, the adoption of DevOps, means that health checks enter the consciousness of developers as part of their responsibility and not just “something the operations guys are asking for”.
The next step, which is the move to microservices, means that health checks are not just part of a best practice, but are part of the development framework. This is especially true for microservices that are orchestrated under frameworks such as Kubernetes.
The health checks are a natural, integrated product of the development cycle – not an afterthought or a burden, as they all too often are in monolithic development.
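A sketch of such an internal health check, assuming the service records when a transaction starts and finishes, could distinguish “idle” from “stuck” like this (the class name and threshold are hypothetical):

```python
import time

class HealthCheck:
    """Internal health check: an idle process waiting for work is healthy,
    but a transaction stuck in flight past `stall_seconds` marks it unhealthy."""

    def __init__(self, stall_seconds=30):
        self.stall_seconds = stall_seconds
        self.in_flight_since = None  # None means idle (no transaction running)

    def transaction_started(self, now=None):
        self.in_flight_since = now if now is not None else time.time()

    def transaction_finished(self):
        self.in_flight_since = None

    def is_healthy(self, now=None):
        if self.in_flight_since is None:
            return True  # idling while waiting for new transactions is fine
        now = now if now is not None else time.time()
        # A zombie that choked on a transaction shows up as a stalled one.
        return (now - self.in_flight_since) < self.stall_seconds
```

An orchestrator such as Kubernetes can call a check like this via a liveness probe and restart the instance when it reports unhealthy, so no external guesswork about queues or synthetic transactions is needed.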
Consolidated logs are easier with cloud-native apps
Below are two questions that operators often ask themselves:
Why did the storage run out on this disk?
Where is the log for that process?!
All too often, the answer to both questions would be that a runaway (often undocumented) log file was taking up too much space. Adding insult to injury, invariably the “missing log entry” is the one containing the message needed to find the cause of the problem!
Of course, a monolithic application could have a well-defined catalog of logs that are scooped up by a collection agent, avoiding this irksome scenario. But traditional development leaves the log writing to whatever practices the developer chooses. In contrast, cloud-native/microservices-based practices (aka 12 factor development) require developers to treat all their logs as a simple event stream. Moreover, while in a traditional environment you can get away with not sending your logs to the central repository since they are recorded “somewhere”, in a cloud-native environment, the logs will not be persisted unless they are collected properly.
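In practice, treating logs as an event stream can be as simple as writing one structured line per event to stdout and letting the platform’s collector own persistence. A minimal sketch, with illustrative field names:

```python
import json
import sys
import time

def format_event(level, message, **fields):
    """Serialize one log event as a single JSON line (field names illustrative)."""
    event = {"ts": time.time(), "level": level, "message": message}
    event.update(fields)
    return json.dumps(event)

def log(level, message, **fields):
    # 12-factor style: write to stdout only and never manage log files;
    # the platform's log collector owns routing and persistence.
    sys.stdout.write(format_event(level, message, **fields) + "\n")
```

Because every service emits the same kind of stream, the collection agent needs no per-application catalog of log file locations.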
Bad development practices in the traditional world lead to logs getting lost, which leads to more work for the operators. This inefficiency may even be implicitly tolerated, since “throwing manpower” at a problem may seem easier than solving the root cause. In contrast, in the cloud-native world, a log that’s not written correctly will never be recovered. If that leads to unsolved issues, it must be properly addressed at the root cause.
Fortunately, as shown in the following table, writing logs the “right way” in cloud-native environments is a win-win scenario: it’s both the best and the easiest way of doing it! On the flip side, the penalty for writing logs the “wrong way” is not only extra work, but a severe handicap in debugging. The result is a development team that’s motivated to write code that’s naturally easier to monitor post-deployment and an operations team that shares a common vision of proper monitoring and logging.
Cloud-native applications are naturally more resilient
No matter what we do, there will always be infrastructure problems and application bugs. How do we solve these problems? More importantly, how do we avoid exposing them to the consumers of the services? In a traditional environment, you would spend time and effort designing your application to be highly available when there’s a problem from which the application can recover; you would apply business continuity patterns to it so the application continues handling customers’ requests, even during the recovery period. However, every technology, company, and team invariably has its own approach, and ownership of the problem falls somewhere between development and operations.
In the case of cloud-native development, most of this is “baked in” the development and operational frameworks. For example, an application that is developed using 12 factor procedures will inherently be able to run in multiple instances and each instance can be restarted and continue working. A cloud platform, such as IBM Cloud or IBM Cloud Private, will be able to leverage the combination of multiple instances of the application, harmless application restarts, and built-in health checks as mentioned earlier to keep the entire service healthy, even if a few instances are down.
This is outstanding! It means that some failures are expected and will not even be noticed by end users. There may not actually be fewer failures in the absolute sense, but this resiliency means that operators have more time and less urgency to solve the issue because it’s less likely to impact customers.
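The “harmless restart” behavior rests on instances that stop cleanly between transactions. A sketch of a disposable worker that honors SIGTERM, with hypothetical names:

```python
import signal

class GracefulWorker:
    """Sketch of a disposable, restart-tolerant worker: on SIGTERM it finishes
    the current unit of work and exits cleanly, so the orchestrator can start
    a fresh instance without dropping transactions mid-flight."""

    def __init__(self):
        self.should_stop = False
        # Ask the platform to tell us to stop instead of killing us outright.
        signal.signal(signal.SIGTERM, self._request_stop)

    def _request_stop(self, signum, frame):
        self.should_stop = True

    def run(self, work_items):
        completed = []
        for item in work_items:
            if self.should_stop:
                break  # stop between transactions, never mid-transaction
            completed.append(item)  # stand-in for real transaction handling
        return completed
```

Combined with multiple stateless instances behind a router, restarting any single worker like this is invisible to customers.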
Change & problem management is easier with cloud-native apps
The previous examples showed how cloud-native development improves incident management – problems are both easier to diagnose and to solve, leading to a reduction in the Mean-Time-To-Repair (MTTR). But what about reducing the Mean-Time-Between-Failures (MTBF)? How do the new development concepts of DevOps and cloud-native reduce the number of failures and improve service stability?
This example will cover the change process of deploying a new fix to production. Let’s assume the root cause is an out-of-memory condition. How does the diagnosis and resolution differ for a monolithic application and a microservices-based one?
In both cases, a post-mortem is done using the 5-whys technique, and the following tasks are created:
Add a monitoring threshold for early detection of the issue.
Preemptively restart the application to avoid the possibility of the memory leak causing a problem during regular working hours.
Fix the bug!
Traditional Process:
Operations adds a new monitoring threshold. Operations updates their runbooks with problem resolution instructions.
Operations restarts the application every midnight. If the application has been developed with High Availability and/or Business Continuity in mind, there is little to no disruption.
Development creates the bug fix in their backlog; it will be deployed with the next version in a few months.
In the meantime, the operations team automates the application restart.
After a few months, Development deploys the fix and the memory leak never re-occurs.
With any luck, operations remembers to stop the automated nightly restart.
Cloud Native Process:
Development updates the internal health check to detect the memory leak and trigger a restart as soon as memory exceeds a threshold (or simply sets the memory limit of the container and lets the orchestrator do the checking and restarting).
Since the service has been designed as state-less and disposable, there is no disruption when processes are restarted.
Development creates the bug fix in their backlog; it will be deployed with the next release at the end of the week.
At the end of the week, Development deploys the fix and the memory leak never re-occurs.
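The memory-based health check in the cloud-native process can be sketched as follows; the 512 MB limit is an assumed per-instance budget, and the stdlib `resource` module used here is Unix-only:

```python
import resource
import sys

# Hypothetical per-instance memory budget; in Kubernetes this would
# typically be the container's memory limit instead of application code.
MEMORY_LIMIT_BYTES = 512 * 1024 * 1024

def rss_bytes():
    """Peak resident set size of this process, in bytes."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but bytes on macOS.
    return rss if sys.platform == "darwin" else rss * 1024

def memory_ok(limit=MEMORY_LIMIT_BYTES, current=None):
    """Health check verdict: fail once memory crosses the threshold so the
    orchestrator restarts this instance before the leak causes an outage."""
    current = rss_bytes() if current is None else current
    return current < limit
```

Wiring a check like this into the liveness probe means the preemptive restart happens automatically and only when actually needed, with no midnight cron job for operations to remember to remove.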
Note these key differences:
Resolution of the root cause is addressed sooner with cloud-native.
Since the monolithic service was not designed with Business Continuity in mind, there may be a service disruption.
Most importantly: in the cloud-native environment, no actions are required from operations when the root cause is an application bug! Operations personnel are responsible for infrastructure corrections; development is responsible for application corrections.
These examples do not lead to an environment where developers do all the operations work (the so-called “NoOps” scenario). Just because logs are centrally collected and containers do not have local storage does not mean that disk space is infinite. What it does mean is that the problems will be global. In the case of IBM Cloud Private, for example, the Elastic Stack indexing the logs may run out of space.
Briefly, for cloud-native microservices-based applications:
Operators cannot (and should not) be responsible for the health of specific services and applications.
If the developer isn’t writing code properly, the operators can’t “throw manpower” at the problem to solve it.
If there is a problem, it’s often faster and easier for the developer to fix it than for the operator to introduce a workaround to keep the application going.
Cloud-native development allows developers to use the tools and practices they are most comfortable with and normalizes the results in a way which enables the operations side of the organization to work in the most efficient way possible.
While it is technically possible to develop traditional applications to be resilient, and developers should write logs so they’re easily found, in practice, designing and implementing systems with solid reliability and serviceability is often prioritized below delivering the next new feature. Cloud-native development enables organizations to jumpstart their Service Management maturity because the tools and practices are so well suited for it.
That is, instead of the operations team demanding that the development team implement health checks and observability in a specific way that suits serviceability, cloud-native frameworks make it natural for developers to create health checks without being asked. In turn, the operations team no longer needs to know what the health check is or how it works, as long as the result is manageable.
While getting to this stage does require investment in tools and processes, the end result will be something much more automated, simpler and easier to use than the old “traditional” tools. Without exception, the developers I’ve introduced these concepts to have been enthusiastic about the extra power and flexibility they have in monitoring and managing cloud-native applications.
Within IBM, we call the process of building a manageable service Build-to-Manage. These practices include concepts that are purely in the hands of development (e.g., writing code that is Observable) and concepts that are shared between developers, DevOps and operations (e.g., best practices in activities and decision-making authority: RACI – Responsible/Accountable/Consulted/Informed). These practices are documented in the Service management architecture for IT and cloud services. The IBM Cloud Garage consulting team also offers hands-on assistance with cloud service management and operations based on actual client experience.