Leveraging the Dynamic Graph
The Dynamic Graph is a model of your application that understands all the physical and logical dependencies of components such as Host, OS, JVM, Cassandra Node, MySQL, etc. The graph also includes logical components such as traces, applications, services, clusters, and tablespaces. Components and their dependencies are automatically discovered by our agent and sensors, which means the graph is kept up to date in real time.
Every node in the graph is also continuously updated with state information like metrics, configuration data, and a calculated health value based on semantic knowledge and a machine learning approach. The engine also analyses the dependencies in the graph to find logical groupings, like services and applications, to understand impact at that level and derive the criticality of issues. The whole graph is persisted, meaning the Instana application can go back and forth in time and leverage the entire knowledge base of the graph for many operational use cases.
Based on the Dynamic Graph, we calculate the impact of changes and issues on the application or service, and, if the impact is critical, we combine a set of correlated issues and changes into an Incident. An incident shows how issues and changes evolve over time, enabling Instana to point directly to the root cause of the incident. Every change is automatically discovered, and we calculate its impact on surrounding nodes. A change can be a degradation of health (which we call an “Issue”), a configuration change, a deployment, or the appearance or disappearance of a process, container, or server.
To make this concrete, let's look at how we would model and understand a simple application that uses an Elasticsearch cluster to search for products through a web interface. This could well be a single µService, but it shows how we understand clusters and dependencies in Instana.
Let’s develop a model of the Dynamic Graph for an Elasticsearch cluster to understand how this works and why this is useful in distributed and fluid environments.
We start with a single Elasticsearch node, which technically is a Java application, so the graph looks like this:
The nodes show the automatically discovered components on the host and their relationships. For an Elasticsearch node, we would discover a JVM, a Process, a Docker container (if the node runs inside of a container), and the host on which it is running. If it is running in a cloud environment like Amazon AWS, we would also discover its availability zone and add that to the graph.
Each node has properties (like JVM_Version=1.7.21) and all the relevant metrics in real time, e.g. I/O and network statistics of the Host, Garbage Collection statistics of the JVM, and the number of documents indexed by the ES node.
The edges between the nodes describe their relationships. In this case, these are “runs on” relationships. For example, the ES node "runs on" the JVM.
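The graph just described can be sketched in code. The following is a hypothetical illustration only, not Instana's actual data model – the component names, properties, and metric values are made up for the example:

```python
# Hypothetical sketch of the single-node graph: components are nodes with
# properties and live metrics; edges capture "runs on" relationships.
graph = {
    "nodes": {
        "es-node-1":   {"type": "elasticsearch", "metrics": {"docs_indexed": 120_000}},
        "jvm-1":       {"type": "jvm", "properties": {"JVM_Version": "1.7.21"},
                        "metrics": {"gc_pause_ms": 12}},
        "process-1":   {"type": "process"},
        "container-1": {"type": "docker-container"},
        "host-1":      {"type": "host", "metrics": {"io_wait_pct": 2.1}},
    },
    "edges": [  # (source, relationship, target)
        ("es-node-1", "runs_on", "jvm-1"),
        ("jvm-1", "runs_on", "process-1"),
        ("process-1", "runs_on", "container-1"),
        ("container-1", "runs_on", "host-1"),
    ],
}

def runs_on_chain(graph, start):
    """Follow 'runs_on' edges from a component down to the host."""
    edges = {src: dst for src, rel, dst in graph["edges"] if rel == "runs_on"}
    chain, current = [start], start
    while current in edges:
        current = edges[current]
        chain.append(current)
    return chain
```

Following the "runs on" edges from the ES node yields the complete physical stack it depends on, down to the host.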
For an Elasticsearch cluster, we would have multiple nodes that together form the cluster.
In this case, we added a cluster node to the graph that represents the state and health of the whole cluster. It has dependencies on all four Elasticsearch nodes that comprise the cluster.
The logical unit of Elasticsearch is the index – the index is used by applications to access documents in Elasticsearch. It is physically structured in shards that are distributed to the ES nodes in the cluster.
We add the index to the graph to understand the statistics and health of the index used by applications.
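The logical layer can be sketched the same way. Again this is a hypothetical illustration under assumed names – a cluster node depending on four ES nodes, and an index whose shards are distributed across them:

```python
# Hypothetical sketch of the logical layer of the graph. The cluster node
# depends on its member ES nodes; the index depends on the nodes that hold
# its shards. All names and the shard layout are illustrative assumptions.
cluster = {
    "name": "es-cluster",
    "members": ["es-node-1", "es-node-2", "es-node-3", "es-node-4"],
}
index = {
    "name": "products",
    "shards": {  # shard -> ES node holding it
        "shard-0": "es-node-1",
        "shard-1": "es-node-2",
        "shard-2": "es-node-3",
        "shard-3": "es-node-4",
    },
}

def nodes_backing_index(index):
    """ES nodes the index physically depends on, via its shards."""
    return sorted(set(index["shards"].values()))
```

The shard mapping is what lets the model connect the health of a logical index back to the physical nodes (and, transitively, hosts) it runs on.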
In addition, we assume that we access the Elasticsearch index with a simple Spring Boot application.
Now the graph includes the Spring Boot application.
As the Instana Java sensor records distributed traces, Instana will know whether the Spring Boot application accesses an Elasticsearch index. We correlate these traces with the logical components in the graph and track statistics and health on the different traces.
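The correlation step could be sketched roughly like this – a minimal, assumed example of deriving a graph edge from trace spans (the span format, service name, and index name are made up, not Instana's trace model):

```python
# Hypothetical sketch: spans of a recorded distributed trace reveal that
# the Spring Boot service queries the Elasticsearch index, so an
# "accesses" edge between the logical components can be derived.
trace = [
    {"span": "GET /products", "service": "spring-boot:product-search"},
    {"span": "es.query", "service": "spring-boot:product-search",
     "target": "index:products"},
]

def derive_edges(trace):
    """Turn spans that name a backend target into logical graph edges."""
    edges = set()
    for span in trace:
        if "target" in span:
            edges.add((span["service"], "accesses", span["target"]))
    return edges
```

Once such an edge exists, health and statistics tracked on the index can be related to the application that depends on it.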
Using this graph, we can understand different Elasticsearch issues and show how we analyze the impact on the overall service health.
Let’s assume that we have two different problems:
- I/O problem on one host causing read/write on index/shard data to be slow.
- Thread pool in one Elasticsearch node is overloaded so that requests are queued as they cannot be handled until a thread is free.
In this case, the Host (1) starts having I/O problems. Our health intelligence would display the host's health as yellow and fire an issue to our issue tracker. A few minutes later, the ES (Elasticsearch) Node (2) is affected by this: our health intelligence sees that throughput on this node has degraded to a level where we mark it as yellow, again firing an issue. Our engine then correlates the two issues into one incident, which isn't yet marked as problematic because the cluster health is still good and service quality is not affected.
Then, on another ES node (3), the thread pool for processing queries fills up and requests are queued. As performance is badly affected by this, our engine marks the status of the node as red. This affects the ES cluster (4), which turns yellow as its throughput decreases. The two new issues are aggregated into the initial incident.
As the cluster affects the performance of the index (5), we mark the index as yellow and add the issue to the incident. Now the performance of the product search transactions is affected, and our performance health analytics mark the transaction as yellow (6), which also affects the health of the application (7).
As both the application and the transaction are affected, our incident actually fires with a yellow status, saying that product search performance is decreasing and users are affected. The paths to the two root causes – the I/O problem and the thread pool problem – are highlighted. As seen in the screenshot, Instana shows the evolution of the incident, and the user can drill into the components at the time the issue was happening – including the exact historic environment and metrics at that point in time.
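The impact walk described in this scenario can be sketched as a traversal over dependency edges. The dependency map, component names, and correlation rule below are simplified assumptions for illustration, not Instana's actual algorithm:

```python
# Hypothetical sketch: issues propagate upward along "affects" edges, and
# issues whose impact sets overlap are correlated into one incident.
dependents = {  # component -> components affected when it degrades
    "host-1": ["es-node-2"],
    "es-node-2": ["es-cluster"],
    "es-node-3": ["es-cluster"],
    "es-cluster": ["index:products"],
    "index:products": ["trace:product-search"],
    "trace:product-search": ["app:shop"],
}

def propagate(root, dependents):
    """Collect every component reachable from an issue's root cause."""
    impacted, stack = set(), [root]
    while stack:
        current = stack.pop()
        for nxt in dependents.get(current, []):
            if nxt not in impacted:
                impacted.add(nxt)
                stack.append(nxt)
    return impacted

# The two root causes (I/O on host-1, thread pool on es-node-3) impact
# overlapping components, so their issues belong to the same incident.
shared_impact = propagate("host-1", dependents) & propagate("es-node-3", dependents)
```

Because both root causes reach the same cluster, index, trace, and application nodes, the walkthrough's six issues end up grouped into a single incident with two highlighted root-cause paths.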
This shows the unique capabilities of Instana:
- Combining physical, process, and trace information using the graph to understand their dependencies.
- Intelligence to understand not only the health of single components, but also the health of clusters, applications, and traces.
- Intelligent impact analysis to understand if an issue is critical or not.
- Showing the root cause of a problem and giving actionable information and context.
- Keeping the history of the graph, its properties, metrics, changes, and issues, and providing a “timeshift” feature to analyze any given problem with a clear view on the state and dependencies of all components.
Finding the root cause in modern environments will only get more challenging in the coming years. The simple example above has shown that finding the root cause is not a trivial task without an understanding of the context, dependencies, and impact. Now think of “liquid” systems based on µServices that add and remove services all the time, with new releases pushed out frequently – Instana keeps track of state and health in real time and understands the impact of any of these changes or issues, all without any manual configuration.
The Dynamic Graph is created and updated automatically. The definition of some components, like services, can be further specified through service configuration.
Graph traversal and scoping can be accomplished using our powerful dynamic focus ability.