Root cause analysis

Concept

DevOps practitioners face significant problems in today’s world of dynamic applications that are composed of hundreds or possibly thousands of components. First, when things break, they need to detect and understand the problem as soon as possible, ideally before users start to feel the service impact. Second, after restoring the service as quickly as possible, they need to identify and fix the exact root cause to ensure that the problem does not occur again. Practitioners trawl through log files, look at metrics, comb through events, consult crystal balls, and do whatever it takes to find the answer. It can take hours or days to identify the root cause of an issue, and often the cause is left unidentified, lurking in the background and waiting to reappear.

Thankfully, Instana has made significant strides in managing incidents and accelerating the identification of the root cause. Instana automatically detects changes, issues, and incidents to help you detect, understand, and investigate quality of service issues in your applications.

Changes

A Change is an event representing anything from a server start or stop to a deployment or a configuration change on a system. Changes are further separated into:

  • Changes - Changed configuration of components, for example versions, environment variable values, and so on.
  • Offline/Online - Tracking presence of components under management.

Change events are important information that is used together with the Dynamic Graph to automatically relate configuration changes to incidents.

Issues

An Issue is an event that is created when an application, a service, or any part of it becomes unhealthy. Instana comes with several hundred curated, out-of-the-box health signatures that detect various problems, ranging from degradations of service quality, to complex infrastructure issues, to disk saturation. Issues are automatically resolved as soon as the metrics, events, or metadata return to the expected values.
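The open-and-resolve lifecycle described above can be illustrated with a minimal sketch. It assumes a single fixed threshold on one metric, which is far simpler than a curated health signature; the Issue class, the track_issue function, and the sample values are hypothetical and are not part of Instana.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Issue:
    """Hypothetical issue record; not Instana's data model."""
    title: str
    start: int                  # timestamp of the sample that opened the issue
    end: Optional[int] = None   # None while the issue is still active


def track_issue(samples, threshold, title="metric above threshold"):
    """Open an issue when a value exceeds the threshold and resolve it
    as soon as a later value returns to the expected range."""
    issues = []
    active = None
    for timestamp, value in samples:
        if value > threshold and active is None:
            active = Issue(title=title, start=timestamp)
            issues.append(active)
        elif value <= threshold and active is not None:
            active.end = timestamp
            active = None
    return issues


# (timestamp, value) pairs; the values are made up for illustration.
samples = [(0, 40), (1, 95), (2, 97), (3, 50), (4, 92)]
issues = track_issue(samples, threshold=90)
```

With these samples, the first issue opens at timestamp 1 and is resolved at timestamp 3, and a second issue opens at timestamp 4 and stays active.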

In addition to built-in issues, you can define custom events to detect problems that are specific to your system.

To see all issues (both built-in and custom issues) that are detected by Instana, go to the Events view, and click the Issues tab. You can use Dynamic Focus to filter issues.

Each Instana issue contains the following information:

  • Severity - can be CRITICAL or WARNING. CRITICAL means that there is a direct or indirect risk of data loss or of the service being unavailable. WARNING means any other performance issue that might impact the user experience or lead to a problem in the long term
  • Start, end time and duration of the issue
  • Affected entities - one or more entities that are affected by the problem
  • Details - a description that provides additional context and measures to resolve the problem
  • Metrics - metric charts showing metric values relevant to the problem around the time the problem occurred
  • Where applicable, users can navigate to Unbounded Analytics to investigate traces, calls, or page loads affected by the issue.
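As a rough illustration, the fields listed above could be modeled as a small record. This is only a sketch: the class and field names, the host name, and the metric name are hypothetical and do not reflect Instana's API or storage format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    WARNING = "WARNING"
    CRITICAL = "CRITICAL"


@dataclass
class IssueRecord:
    """Illustrative container for the listed fields (hypothetical names)."""
    severity: Severity
    start: datetime
    end: Optional[datetime]          # None while the issue is ongoing
    affected_entities: List[str]
    details: str = ""
    metric_names: List[str] = field(default_factory=list)

    def duration(self, now: datetime) -> timedelta:
        """Elapsed time; an ongoing issue is measured up to `now`."""
        return (self.end or now) - self.start


issue = IssueRecord(
    severity=Severity.WARNING,
    start=datetime(2024, 1, 1, 12, 0),
    end=datetime(2024, 1, 1, 12, 30),
    affected_entities=["host-1"],          # made-up entity name
    details="CPU steal time is higher than expected",
    metric_names=["cpu.steal"],            # made-up metric name
)
```

Because this issue has an end time, issue.duration(...) returns 30 minutes regardless of the value passed for now.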

In this example, the CPU steal time on one Linux machine is suspicious and is therefore marked as an issue. An issue by itself does not trigger an alert; Instana simply notes that it happened. If the service that this system is connected to starts to behave badly, this issue becomes part of the incident. This methodology is one of the major benefits of Instana because it frees you from manually correlating events and performance problems. Just because something uses too much CPU for a while doesn’t mean that there is a problem as such. The information becomes relevant only when a service is impacted.

Check out Manage Built-in Events for more information on managing built-in and custom issues.

Because Instana knows all dependencies between monitored services, it triggers Incidents for all quality of service issues that impact users. In addition, some critical infrastructure issues, such as disk saturation and Elasticsearch cluster split-brain situations, trigger incidents because their end result is most likely data loss.

Applications, services, or endpoints that receive infrequent traffic (for example, one call every 15 minutes) are not considered to provide a sufficient basis for issue detection. The severity of an issue can change during its lifetime. The severity of an issue represents the highest severity that this particular issue ever reached.
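The rule that an issue carries the highest severity it ever reached can be sketched as follows. The numeric values only establish an ordering for this example and are an assumption; the Severity enum and peak_severity function are hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Arbitrary ordering values chosen for this sketch.
    WARNING = 1
    CRITICAL = 2


def peak_severity(observed):
    """Return the highest severity seen over an issue's lifetime."""
    return max(observed)


# An issue that escalates to CRITICAL and later drops back to WARNING.
history = [Severity.WARNING, Severity.CRITICAL, Severity.WARNING]
reported = peak_severity(history)
```

In this sketch, reported stays at CRITICAL even though the last observed severity is WARNING.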

Incidents

Incidents yield the highest severity level. They are created when edge services that are accessed by users are impacted or when there is an imminent risk of impact. By using the Dynamic Graph, all relevant events are correlated for each incident to provide context and root cause analysis hypotheses.

Below is an example of an incident. A service is suddenly responding slower than usual, which is called a sudden increase in average latency. The incident is automatically marked in yellow as a warning. The color is shown as long as the incident is active. After the incident is resolved, the color changes to gray, and the incident remains available in the drill-down menu.

The incident detail view is organized into three parts:

  1. The header contains basic information about the key facts of the incident.

    • Start time;
    • End time (current if it is still ongoing);
    • The number of the still active events;
    • The number of changes involved;
    • The number of affected entities.

    You can see the incident start date, the end date (if available), how many events are still active, how many changes belong to this incident, and the number of affected entities:

  2. The second section provides a visual representation of how the incident developed over time. The chart shows the complete timeframe from start to end and all events, sorted by start time. The view is limited to seven events when collapsed. If your incident contains more than seven events, press the expand button to see the full view. Clicking any of the bars opens the detail view for that issue:

  3. The third section contains the details for the graph view in section 2. A list of all events, sorted by start time, shows all available information for each event. Click an event to expand it:

The details help you understand the event and are followed by multiple charts that plot the corresponding metrics. If an event is still active, the chart continues to render new incoming metric values. Two flags are available: one emphasizes that the event affects a service, and the other that the event triggered the incident. If available, the flags are placed at the top of each event in the list.

When you focus on an event, the detail section provides the same information as described for the incident's event list in point 3.

Automatic probable root cause (technology preview)

To reduce the mean time to remediation (MTTR) for DevOps practitioners, Instana automates the identification of the probable root cause with an algorithm that dynamically analyzes trace statistics and topology by using Causal AI. This algorithm identifies the probable root cause entity of a failure, which enables DevOps practitioners to quickly determine the probable source of an application's failure.

Currently, you can access the Probable Root Cause section panel on any incident that is created from a smart alert.

The Probable Root Cause section panel is organized into two primary sections:

  1. The probable cause entity
  2. The events that are associated with the probable cause entity.

Currently, the Causal AI algorithm pinpoints an entity (or multiple entities) that is likely to be the source of the problem. The entity can be any physical or logical entity in the Instana dynamic graph and is displayed as the Probable root cause entity. The displayed entity links to the entity page, which describes the state of the entity at the time of the incident. The entity events are all open issues and incidents that occurred on the probable root cause entity. With detailed entity events, DevOps practitioners can quickly identify issues and incidents that caused the problem.

On the Probable Root Cause section panel, Instana displays up to three entities that most likely failed. To see other entities that are possible candidates for probable root cause, use the page selector at the lower right of the probable cause. These entities are sorted by the likelihood of failure, so the most likely root cause is the first one shown. The probability levels are shown in the upper right, under the title Probability level, and can be labeled as low, medium, or high, where high indicates the most likely to have failed. A tooltip is displayed when you hover your cursor on the Probability level, which further explains the meaning for each label.
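The ordering and labeling described above can be sketched in a few lines. This is not the Causal AI algorithm: the numeric scores, the label cut-offs, and all names (Candidate, probability_label, top_candidates, the entity names) are assumptions made only to illustrate sorting candidates by likelihood and keeping at most three.

```python
from dataclasses import dataclass

MAX_CANDIDATES = 3  # the panel displays up to three entities


@dataclass
class Candidate:
    entity: str
    score: float  # hypothetical likelihood in [0, 1]


def probability_label(score):
    """Map a score to a label; the cut-offs are made up for illustration."""
    if score >= 0.7:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"


def top_candidates(candidates, limit=MAX_CANDIDATES):
    """Return the most likely candidates first, capped at `limit`."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    return ranked[:limit]


candidates = [
    Candidate("payment-db", 0.35),
    Candidate("checkout-service", 0.82),
    Candidate("cart-service", 0.55),
    Candidate("gateway", 0.10),
]
shown = [(c.entity, probability_label(c.score)) for c in top_candidates(candidates)]
```

Here shown lists checkout-service (high) first, followed by cart-service (medium) and payment-db (low), and gateway is dropped because it is the fourth candidate.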

Events view

To see all events that are detected by Instana, go to the "Events" view and choose the "Incidents", "Issues", "Changes", or "All" tab to see the corresponding event types. Searching through events that are discovered by Instana relies on the Dynamic Focus feature. When you click one bar or select multiple bars in the events bar chart at the top, the events table lists only the events that are included in the selected bars. This allows detailed inspection of events without changing the current time interval.

In addition, you can use the search box to find specific items by the data shown in the columns “Title” or “On” (the name of the service on which the incident occurred) in the overview table. In this example, the search query is event.text:"Error rate". The result is a list of all events containing the phrase "Error rate" in the title:
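To make the effect of such a query concrete, the following sketch filters an in-memory list of events by a phrase in the title. It is not Instana's query engine: the event dictionaries, the matches_text helper, and the case-insensitive comparison are assumptions for illustration only.

```python
def matches_text(event, phrase):
    """True when the event title contains the phrase.
    Case-insensitive matching is an assumption of this sketch."""
    return phrase.lower() in event["title"].lower()


# Made-up events with the "Title" and "On" columns of the overview table.
events = [
    {"title": "Error rate is higher than normal", "on": "checkout"},
    {"title": "Sudden increase in average latency", "on": "cart"},
    {"title": "Sudden increase in error rate", "on": "catalog"},
]

hits = [e for e in events if matches_text(e, "Error rate")]
```

In this sketch, hits contains the events on checkout and catalog, and the latency event on cart is filtered out.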