Identify the root cause of service outages quickly by monitoring these four key metrics
Modern cloud applications are often built on a distributed microservices architecture. This architectural style simplifies development by delineating responsibilities and fostering reuse, and it increases agility and supports a DevOps approach because each service can be modified, tested, and deployed without affecting other services. Another benefit is that each microservice can be based on different technologies and code bases – easier for developers, but potentially complicated when a problem is encountered. With potentially hundreds of microservices, it can be very time consuming to understand all the dependencies and drill down into the specific technology causing the issue to find and fix the root cause.
Fortunately, there’s a better way: Monitoring just a few “golden signals” will point the way to the root cause. In this video, I’ll define these signals and show how Application Performance Management from IBM uses them to rapidly pinpoint the service(s) causing the outage without pulling in an army of SMEs.
SRE and simplifying monitoring for complex modern applications
Today, I’d like to talk a little about the Site Reliability Engineering (SRE) discipline and how we can apply it to simplify monitoring for complex modern applications. This will help us identify root causes more quickly and drastically reduce the mean time to recovery, so that we can maintain the end-user performance we want for our applications.
Operations gets an alert. What usually happens next?
So first, let’s take a look at what happens before we have applied these SRE principles to our monitoring. Let’s say that I am the owner of an application, and I’ve gotten an alert that says I am having a latency issue.
Now, my application is really critical for this business, so I need to find the root cause quickly. But, because I am part of this complex microservice topology, it can be really difficult to figure out where exactly the root cause is coming from.
Diagnosis should not require expertise in multiple technologies
And to make things more complex, all of my dependencies could be based on different technologies. So let’s say one is built on Node.js, one is a Db2 database, another is written in Swift and so on. Now, all of these have different metrics that are typically monitored and I may not be an expert in any of these different technologies.
So it may be difficult for me personally to go in and figure out what the problem is. So I would have to call in an expert for each of these technologies. Now, as you can imagine, this is time consuming for everyone to go through their service, figure out if there is a problem, or if I need to keep going downstream. And all the while, my users are still experiencing this latency issue.
Four SRE Golden Signals, defined
Now, what if there was a better way? This is what we can learn from the SRE discipline, which tells us that there are really only four key performance indicators we need to monitor – not all the different metrics for each technology.
And we call these “Golden Signals”. They are defined as:
Latency – the time it takes to service a request,
Errors – the rate at which requests fail,
Traffic – demand being placed on the system,
Saturation – a view of utilization against maximum capacity.
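To make the four definitions concrete, here is a minimal sketch of how these signals could be computed for one service from a window of request records. The `Request` record, its field names, and the CPU-based saturation measure are all illustrative assumptions, not part of any particular APM product.

```python
from dataclasses import dataclass

# Hypothetical request record; field names are illustrative.
@dataclass
class Request:
    duration_ms: float
    is_error: bool

def golden_signals(requests, window_seconds, cpu_used, cpu_capacity):
    """Compute the four golden signals for one service over a time window."""
    total = len(requests)
    # Latency: average time it takes to service a request. Successful
    # requests only, since failures often return fast and skew the average.
    ok = [r.duration_ms for r in requests if not r.is_error]
    latency_ms = sum(ok) / len(ok) if ok else 0.0
    # Errors: fraction of requests that failed.
    error_rate = sum(r.is_error for r in requests) / total if total else 0.0
    # Traffic: demand being placed on the system, as requests per second.
    traffic_rps = total / window_seconds
    # Saturation: utilization against maximum capacity (here, CPU).
    saturation = cpu_used / cpu_capacity
    return {
        "latency_ms": latency_ms,
        "error_rate": error_rate,
        "traffic_rps": traffic_rps,
        "saturation": saturation,
    }
```

Note that the same four numbers come out no matter what technology the service is built on, which is exactly what makes them useful across a heterogeneous topology.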
Now, let’s go back to our initial example and see how this would work, applying the “Golden Signals”.
Example: Alert diagnosis with SRE Golden Signals
So my service, which we will call Service A, has a latency issue. Now, we know that latency is typically a symptom, and if we examine the service and don’t see any of the causes, we know we have to keep looking downstream. But we don’t want to go back to this complicated microservice topology and try to figure it all out. Some APM tools can help here by identifying only the services that are one hop away from the service in question. So, let’s say we have Services B, C, and D that are connected to my Service A that’s having the problem.
Now, no matter what technology these services are built on, all we need to do is go look at the “Golden Signals”. So let’s say we look at the “golden signals” for Service B and everything looks fine. So we know Service B is not the problem, and let’s say Service C is the same scenario. We don’t see any issues, so we can eliminate that as the problem.
Now, Service D, let’s say we are seeing an issue with our saturation, which is trending upwards. So right there, after only a few minutes, we have identified Service D is likely our root cause.
So now, instead of having to pull in the experts for each of these different services, we can go directly to Service D and let them know that we have identified them as a likely cause of this issue. And they can go about fixing it.
And, what’s even better is if they are using Golden Signals to monitor their service, it’s very likely they have already identified this and are already working on the fix.
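The triage walk-through above can be sketched as a small routine: given a dependency graph and each service’s golden signals, check only the services one hop downstream of the alerting service. The threshold values and service names here are hypothetical placeholders, not recommendations.

```python
# Hypothetical thresholds; real values depend on each service's SLOs.
THRESHOLDS = {"latency_ms": 500, "error_rate": 0.05, "saturation": 0.8}

def unhealthy_signals(signals):
    """Return the names of golden signals that breach their thresholds."""
    return [name for name, limit in THRESHOLDS.items()
            if signals.get(name, 0) > limit]

def triage_one_hop(dependencies, signals_by_service, alerting_service):
    """Flag likely root-cause candidates among direct downstream services.

    Only services one hop away from the alerting service are examined,
    rather than the whole topology.
    """
    suspects = {}
    for svc in dependencies.get(alerting_service, []):
        bad = unhealthy_signals(signals_by_service[svc])
        if bad:
            suspects[svc] = bad
    return suspects
```

With the scenario from the example, where Services B and C look healthy but Service D shows saturation trending above its limit, `triage_one_hop` would return only Service D with its breached saturation signal, so the team owning D can be contacted directly.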
Save yourself time and headaches
As you can see, this process drastically reduces the time it takes to work through a complex topology and many different technologies to find your root cause and identify exactly how to fix it. So when you are evaluating an APM tool, such as Application Performance Management from IBM, make sure that it offers these Golden Signals and this one-hop dependency view, so you can quickly identify root causes and get the service restored as quickly as possible.