June 23, 2020 By IBM Instana Team 8 min read

Modern microservices and container-based applications have attributes that introduce challenges for operating and monitoring these environments:

  1. Scale: Modern applications can be constructed out of hundreds or thousands of µServices and containers, which are loosely coupled but communicate with each other. These applications can dynamically scale up and down depending on system load and can be spread across cloud data centers and availability zones all over the world.
  2. Speed: µServices can be deployed independently from each other and are built for continuous delivery (CD) processes, so changes happen at a high frequency.
  3. Steadiness: µServices are no longer steady. They can be started and stopped just for a single functional call, and containers are moved around by tools like Kubernetes or Apache Mesos to make better use of the hardware. This process requires that monitoring tools identify these services and understand their role in the overall context—without relying on manual tagging.
  4. Polyglot: Using the right tool for the problem is one of the paradigms of modern applications. This paradigm leads to multiple languages, frameworks and even persistence models within a single application. In particular, new distributed caching and database models are being adopted quickly.
Figure 1

Adrian Cockcroft has built a simulation tool for such environments called Spigo. See the screenshot in Figure 1 for a small µService environment. One of his intentions was to give monitoring tool providers the test bed they need to test for scalability and rapid change.

The challenges of monitoring

The arithmetic is simple: if modern applications involve 10 times more components, monitoring them is at least 10 times more difficult. What you really care about—the reason for monitoring in the first place—is an understanding of the operational health of your application: performance, availability, reliability and so on. Understanding an application’s health implies knowing how everything works together and the entire surrounding context.

The current generation of monitoring tools has no understanding of an environment—the tools only have the concept of single components, with metrics, traces and business transactions for understanding health and performance. The interpretation and understanding of how all these components work together to find the root cause of problems is left to the user:

  • Understanding what components are affected by an issue
  • Understanding what metrics to look at
  • Interpreting and getting the right information out of the metrics and data
  • Correlating metrics and data to find the root cause of the problem

Some tools even provide “war rooms” for collaboration, because whole teams of experts are normally needed to find the root cause of a problem.

As a simple example, a trace could show that the performance of a request was bad because of a slow Elasticsearch query. Finding the root cause means looking at different components—maybe the payload of the query, the number of resulting documents, the thread pool of the app server, the performance of the caching infrastructure and the configuration of the Elasticsearch cluster. This is already an expert task—even bigger companies typically have only a small team of one to three people who troubleshoot these problems based on their experience, using current tools to get indicators of where to look.

If hundreds of µServices are involved, distributed caching and database technologies are in use, and changes are deployed multiple times a day—all in a highly dynamic, fluid cloud and containerized environment—the task of finding the root cause is like looking for a needle in a haystack.

The dynamic graph

The core technology powering the IBM® Instana® platform is what we call the dynamic graph. The graph is a model of your application that understands all physical and logical dependencies of components. Components are the building blocks of your application: hosts, operating systems, Java Virtual Machines (JVMs), Cassandra nodes, MySQL instances and so on.

The graph has more than the physical components—it also includes logical components like traces, applications, services, clusters or tablespaces. Components and their dependencies are discovered automatically by the Instana Agent and Sensors, so the graph is kept continuously up to date. Every node in the graph is also continuously updated with state information such as metrics, configuration data and a calculated health value based on semantic knowledge and a machine learning (ML) approach.

The same intelligence analyzes the dependencies in the graph to find logical groupings like services and applications, understand the impact at that level and derive the criticality of issues. The whole graph is persisted, and the Instana application can go back and forth in time to use the knowledge in the graph for many operational use cases.
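To make the idea more concrete, here is a minimal sketch of what such a graph model could look like in Java. The class and field names (DynamicGraph, GraphNode, Health and so on) are illustrative assumptions, not the actual Instana implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of a dependency graph: each node carries
// configuration properties, live metric values and a computed health
// state; edges record typed relationships such as "runs on".
enum Health { GREEN, YELLOW, RED }

class GraphNode {
    final String id;                                         // e.g. "jvm-4711"
    final String type;                                       // e.g. "host", "jvm", "elasticsearch-node"
    final Map<String, String> properties = new HashMap<>();  // e.g. JVM_Version=1.7.21
    final Map<String, Double> metrics = new HashMap<>();     // latest metric values
    Health health = Health.GREEN;

    GraphNode(String id, String type) {
        this.id = id;
        this.type = type;
    }
}

class Edge {
    final GraphNode from;
    final GraphNode to;
    final String relationship;                               // e.g. "runs on", "member of", "calls"

    Edge(GraphNode from, GraphNode to, String relationship) {
        this.from = from;
        this.to = to;
        this.relationship = relationship;
    }
}

class DynamicGraph {
    final Map<String, GraphNode> nodes = new HashMap<>();
    final List<Edge> edges = new ArrayList<>();

    GraphNode addNode(String id, String type) {
        return nodes.computeIfAbsent(id, key -> new GraphNode(key, type));
    }

    void addEdge(String fromId, String toId, String relationship) {
        edges.add(new Edge(nodes.get(fromId), nodes.get(toId), relationship));
    }

    // Nodes that directly depend on the given node.
    List<GraphNode> dependentsOf(GraphNode node) {
        List<GraphNode> dependents = new ArrayList<>();
        for (Edge edge : edges) {
            if (edge.to == node) {
                dependents.add(edge.from);
            }
        }
        return dependents;
    }
}
```

Keeping properties, metrics and health directly on each node is what lets a model like this answer both "what is this component?" and "how is it doing right now?" in one lookup.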

Figure 2

Based on the dynamic graph, we calculate the impact of changes and issues on the application or service and, if the impact is critical, we combine a set of correlated issues and changes into an incident. An incident shows how issues and changes evolve over time, enabling the Instana platform to point directly to the root cause. Every change is discovered automatically, and we calculate its impact on surrounding nodes. A change can be a degradation of health (which we call an “issue”), a configuration change, a deployment, or the appearance or disappearance of a process, container or server.
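As a rough sketch of how such changes and issues could be represented, the example below treats every observation as a timestamped event and groups correlated events into an incident. The types and names are assumptions made for illustration, not the platform’s real data model.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Kinds of change described above: a health degradation ("issue"),
// a configuration change, a deployment, and the appearance or
// disappearance of a process, container or server.
enum ChangeKind { ISSUE, CONFIG_CHANGE, DEPLOYMENT, APPEARANCE, DISAPPEARANCE }

class ChangeEvent {
    final Instant at;
    final String nodeId;          // graph node the change happened on
    final ChangeKind kind;
    final String description;

    ChangeEvent(Instant at, String nodeId, ChangeKind kind, String description) {
        this.at = at;
        this.nodeId = nodeId;
        this.kind = kind;
        this.description = description;
    }
}

// An incident bundles correlated changes and issues in time order, so
// the earliest entries point toward the probable root cause.
class Incident {
    private final List<ChangeEvent> events = new ArrayList<>();

    void add(ChangeEvent event) {
        events.add(event);
        events.sort(Comparator.comparing(e -> e.at));
    }

    Optional<ChangeEvent> earliestEvent() {
        return events.isEmpty() ? Optional.empty() : Optional.of(events.get(0));
    }
}
```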

To make this a real-world example, I’ll describe how we would model and understand a simple application that uses an Elasticsearch cluster to search for a product through a web interface. In fact, this could be just one µService, but it shows how we understand clusters and dependencies in the Instana platform.

Understanding a dynamic application

Let’s develop a model of the dynamic graph for an Elasticsearch cluster to understand how this process works and why it’s useful in distributed and fluid environments.

We start with a single Elasticsearch node. An Elasticsearch node is technically a Java application, so the graph looks like what’s shown in Figure 3.

Figure 3

The nodes show the automatically discovered components on the host and their relationships. For an Elasticsearch node, we would discover a JVM, a process, a Docker container—if the node runs inside a container—and the host that it’s running on. If it’s running in a cloud environment like Amazon Web Services (AWS), we would also discover the availability zone it’s running in and add the zone to the graph.

Each node has properties like JVM_Version=1.7.21, along with all the relevant metrics in real time—for example, the I/O and network statistics of the host, the garbage collection statistics of the JVM and the number of documents indexed by the Elasticsearch node.

The edges between the nodes describe their relationships. In this case, these edges are “runs on” relationships. So, the Elasticsearch node runs on the JVM.
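Using the hypothetical DynamicGraph from the earlier sketch, the discovered stack for a single Elasticsearch node could be assembled roughly like this; all identifiers and relationship names are made up for illustration.

```java
// Assemble the "runs on" chain for one discovered Elasticsearch node,
// reusing the illustrative DynamicGraph from the earlier sketch.
class ElasticsearchNodeExample {

    static DynamicGraph singleNodeGraph() {
        DynamicGraph graph = new DynamicGraph();

        graph.addNode("aws-zone-eu-west-1a", "availability-zone");
        graph.addNode("host-10-0-0-12", "host");
        graph.addNode("container-es-1", "docker-container");
        graph.addNode("process-4711", "process");
        GraphNode jvm = graph.addNode("jvm-4711", "jvm");
        graph.addNode("es-node-1", "elasticsearch-node");

        // Discovered configuration and metrics are attached to the nodes.
        jvm.properties.put("JVM_Version", "1.7.21");
        jvm.metrics.put("gc.pause.ms", 12.5);

        // "runs on" relationships from the Elasticsearch node down to the zone.
        graph.addEdge("es-node-1", "jvm-4711", "runs on");
        graph.addEdge("jvm-4711", "process-4711", "runs on");
        graph.addEdge("process-4711", "container-es-1", "runs on");
        graph.addEdge("container-es-1", "host-10-0-0-12", "runs on");
        graph.addEdge("host-10-0-0-12", "aws-zone-eu-west-1a", "runs on");

        return graph;
    }
}
```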

For an Elasticsearch cluster, we would have multiple nodes that form the cluster.

Figure 4

In this case, what we added to the graph is a cluster node that represents the state and health of the whole cluster. It has dependencies on all four Elasticsearch nodes that form the cluster.

The logical unit of Elasticsearch is the index. The index is used by applications to access documents in Elasticsearch. An index is physically structured into shards that are distributed across the Elasticsearch nodes in the cluster.

We add the index to the graph to understand the statistics and health of the index used by applications.
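Continuing the same illustrative model, the logical cluster and index could be added on top of the four Elasticsearch nodes; the relationship names used here are assumptions.

```java
// Add the logical cluster and index on top of the four Elasticsearch
// nodes, continuing the illustrative DynamicGraph model.
class ElasticsearchClusterExample {

    static DynamicGraph clusterGraph() {
        DynamicGraph graph = new DynamicGraph();

        graph.addNode("es-cluster-shop", "elasticsearch-cluster");
        graph.addNode("es-index-products", "elasticsearch-index");

        for (int i = 1; i <= 4; i++) {
            String nodeId = "es-node-" + i;
            graph.addNode(nodeId, "elasticsearch-node");
            // The cluster's health depends on every node that forms it.
            graph.addEdge("es-cluster-shop", nodeId, "member");
            // The index is physically stored as shards on the nodes.
            graph.addEdge("es-index-products", nodeId, "shard on");
        }

        // The index is the logical unit applications use to access the cluster.
        graph.addEdge("es-index-products", "es-cluster-shop", "stored in");
        return graph;
    }
}
```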

Figure 5

To take this a little further, we assume that we access the Elasticsearch index with a simple Spring Boot application.

Now the graph includes the Spring Boot application.

Figure 6

Because our sensor for the Java application injects instrumentation for tracing distributed transactions, the Instana platform will automatically “see” when the Spring Boot application accesses an Elasticsearch index.
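For completeness, a product-search endpoint of the kind described here could look like the minimal Spring Boot sketch below, which simply forwards the query to Elasticsearch’s REST search API. The index name, URL and request parameter are assumptions; the point is that the outgoing HTTP call is exactly the kind of exit call tracing instrumentation can observe.

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

// Minimal sketch of a product-search service that queries an
// Elasticsearch index over its REST API. The index name ("products")
// and the Elasticsearch URL are illustrative assumptions.
@SpringBootApplication
@RestController
public class ProductSearchApplication {

    private final RestTemplate restTemplate = new RestTemplate();

    @GetMapping("/products/search")
    public String search(@RequestParam("q") String query) {
        // Forward the user query to the Elasticsearch _search endpoint;
        // this outgoing HTTP call is the dependency a tracing sensor
        // would record as part of the distributed trace.
        String url = "http://localhost:9200/products/_search?q=name:{q}";
        return restTemplate.getForObject(url, String.class, query);
    }

    public static void main(String[] args) {
        SpringApplication.run(ProductSearchApplication.class, args);
    }
}
```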

Figure 7

You can see in the screenshot in Figure 7 that the transaction includes a waterfall chart showing the calls to different services, and Figure 8 shows the details of one service call to an Elasticsearch index, including performance and payload data.

Figure 8

We insert this trace and its relationship to the logical components into the graph and track statistics and health on the different traces.

Using this graph, we can understand different Elasticsearch issues and show how we analyze the impact on the overall service health.

Let’s assume that we have two different problems:

  • An I/O problem on one host causes reads and writes of index and shard data to be slow
  • The thread pool in one Elasticsearch node is overloaded, so requests are queued because they can’t be handled until a thread is free

In this case, the host (1) starts having I/O problems and our health intelligence would set the health of the host to yellow and fire an issue to our issue tracker. A few minutes later, the Elasticsearch node (2) would be affected by this issue and our health intelligence would see that the throughput on this node is degraded to a level where we mark this node as yellow—firing an issue again.

Our engine would then correlate the two issues and add them to one incident, which wouldn’t yet be marked as problematic because, in this case, the cluster health is still good and the service quality isn’t affected.

Then, on another Elasticsearch node (3), the thread pool for processing queries fills up and requests are queued. As performance is badly affected, our engine marks the status of the node as red. This issue affects the Elasticsearch cluster (4), which turns yellow as throughput decreases. The two issues generated are aggregated into the initial incident.

As the cluster affects the performance of the index (5), we mark the index as yellow and add the issue to the incident. Now the performance of the product search transactions is affected, and our performance health analytics marks the transaction as yellow (6), which also affects the health of the application (7).

As the application and the transaction are affected, our incident will actually fire with a yellow status, stating that the product search performance is decreasing and users are affected—and showing the path to the two root causes: the I/O problem and the thread pool problem. As seen in the screenshot, the Instana platform will show the evolution of the incident, and the user can drill into the components at the time the issue was happening—including the exact historic environment and metrics at that point in time.
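A heavily simplified version of this impact analysis, expressed against the hypothetical graph model sketched earlier, might look like the following. The real engine applies semantic, per-component health rules and ML-based analysis at each step rather than treating every dependent as affected.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

// Simplified impact analysis: starting from a node whose health has
// degraded, walk to everything that depends on it directly or
// transitively. These are the components whose health needs to be
// re-evaluated and that may join the same incident.
class ImpactAnalysisExample {

    static Set<GraphNode> potentiallyAffected(DynamicGraph graph, GraphNode degraded) {
        Set<GraphNode> affected = new LinkedHashSet<>();
        Deque<GraphNode> toVisit = new ArrayDeque<>();
        toVisit.push(degraded);

        while (!toVisit.isEmpty()) {
            GraphNode current = toVisit.pop();
            for (GraphNode dependent : graph.dependentsOf(current)) {
                if (affected.add(dependent)) {
                    toVisit.push(dependent);
                }
            }
        }
        // For the scenario above: host -> ES node -> cluster -> index
        // -> trace -> application.
        return affected;
    }
}
```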

These examples show the special capabilities of the Instana platform:

  • Combining physical, process and trace information using the graph to understand their dependencies
  • Understanding the health of single components as well as the health of clusters, applications and traces
  • Analyzing impact intelligently to understand whether an issue is critical
  • Showing the root cause of a problem and giving actionable information and context
  • Keeping the history of the graph, its properties, metrics, changes and issues, and providing a “timeshift” feature to analyze any given problem with a clear view of the state and dependencies of all components

Finding the root cause in modern environments will only get more challenging in the coming years. The simple example described here has shown that finding the root cause isn’t a trivial task without an understanding of the context, dependencies and impact. Now think of “liquid” systems based on µServices that add and remove services all the time, with new releases pushed out frequently. The Instana platform keeps track of state and health in real time and understands the impact of these changes and issues—all without any manual configuration.

The IBM Instana platform helps keep your application healthy and dramatically reduces the time it takes to find the root causes of problems or opportunities for optimization.

Get started with IBM Instana and sign up for the free trial