Root cause analysis

Instana manages incidents and accelerates the identification of the probable root cause. Instana automatically detects incidents, issues, and changes to help you detect, understand, and investigate quality of service issues in your applications.

DevOps practitioners face significant problems in today’s world of dynamic applications that are composed of hundreds or possibly thousands of components. When things break, they need to detect and understand the problem as soon as possible, ideally before users start to feel the service impact. After they restore the service as quickly as possible, they need to fix the exact root cause and make sure that the problem does not occur again. Identifying the root cause of an issue can take hours or days, and often the cause remains unidentified.

Incidents

Incidents have the highest severity level. They are created when edge services that are accessed by users are impacted or when an imminent risk of impact exists. By using the Dynamic Graph, Instana correlates all relevant events for each incident to provide context and root cause analysis hypotheses.

For example, when a service suddenly responds slower than usual, the incident is called a sudden increase in average latency. The incident is automatically marked in yellow as a warning. The color remains while the incident is active. After the incident is resolved, the color changes to gray and the incident remains available for drill-down. See the following example of an incident.

The incident detail view is organized into three parts:

The header contains the key facts of the incident:
- Start time
- End time (the current time if the incident is still ongoing)
- The number of events that are still active
- The number of changes that belong to the incident
- The number of affected entities

Figure 2. Incident KPIs

The second section provides a visual representation of how the incident develops over time. The chart shows the complete time frame from start to end and all events, sorted by start time. When collapsed, the view is limited to seven events. If the incident contains more than seven events, click the expand button to see the full view. Click any of the bars to open the detail view for that issue:

Figure 3. Incident population

The third section contains the details for the graph view in the second section. A list of all events, sorted by start time, shows all available information for each event. Click an event to expand its details:

Figure 4. Expanded incident event

The details help you understand the event and are followed by multiple charts that plot the corresponding metrics. If an event is still active, the chart continues to render new incoming metric values. Two flags are available: one indicates that an event affects a service, and the other indicates that an event triggered the incident. If available, the flags are placed over each event in the list.

When you focus on an event, the detail section provides the same information that is described for the event list in the third section of the incident detail view.

Automatic probable root cause (public preview)

To reduce the mean time to remediation (MTTR) for DevOps practitioners, Instana automates the identification of a probable root cause for an incident. Instana's probable root cause engine uses a statistical, non-deterministic analysis model instead of relying on fixed rules. Instana uses the model's causal AI algorithm to dynamically analyze trace statistics and topology, evaluating any discovered patterns, dependency relationships, anomaly correlations, and telemetry confidence scores to estimate the component that is most likely causing the incident.

The causal AI algorithm identifies one or more entities that are likely to be the source of the problem. The Probable Root Cause section displays up to three entities that are identified as the most likely root causes. These entities are sorted by the likelihood of causing the issue, so the most likely root cause is shown first. A displayed entity can be any physical or logical entity that is monitored by Instana. Each displayed entity links to the details page for the entity, which describes the state of the entity at the time of the incident. With this identified probable root cause, Instana enables DevOps practitioners to more quickly determine the actual cause and resolution for an application's failure.

The Probable root cause for an incident only appears on the details page for that incident when the AI model reaches a sufficient level of confidence for the identified probable root cause. If the confidence level is not high enough, Instana intentionally does not display the probable root cause or corresponding UI section to avoid indicating a misleading or incorrect cause for the incident.
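
The display behavior described above (up to three entities, most likely first, and nothing shown when confidence is too low) can be pictured as a filter-and-sort step. The following Python sketch is illustrative only. The `Candidate` type, the `likelihood` score, and the `CONFIDENCE_THRESHOLD` value are assumptions made for this example; they do not describe Instana's internal model or its actual threshold.

```python
from dataclasses import dataclass

# Hypothetical threshold; the confidence level that Instana uses is not documented here.
CONFIDENCE_THRESHOLD = 0.7
MAX_DISPLAYED = 3  # the section displays up to three entities


@dataclass
class Candidate:
    entity: str
    likelihood: float  # assumed score between 0 and 1


def probable_root_causes(candidates):
    """Return the entities to display, most likely first.

    An empty list means that no candidate is confident enough,
    in which case the section would not be shown at all.
    """
    confident = [c for c in candidates if c.likelihood >= CONFIDENCE_THRESHOLD]
    confident.sort(key=lambda c: c.likelihood, reverse=True)
    return confident[:MAX_DISPLAYED]
```

In this sketch, an empty result corresponds to the case where the Probable Root Cause section is intentionally hidden.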

Instana only analyzes and identifies a probable root cause for incidents that are created from a Smart Alert on the following entity types:

Application perspectives
Services
Endpoints
Service Level Objectives on application perspectives

When a likely cause is identified and the Probable Root Cause section for an incident does display on the details page, the section includes the following information:

The most likely probable root cause entity, and any additional identified probable root causes, as well as the related infrastructure or application information. Links to the details page for the entity from the shown Hierarchy are also included.
The evidence used to determine the entity to help your DevOps practitioners understand the reason why a specific entity is identified as a likely probable root cause.
The list of recommended actions for identified probable root causes.
An option (UI button) to run an intelligent incident investigation that uses advanced LLM-based investigation capabilities to provide additional insights. Learn more.
An option (UI button) to view the related events that are associated with the probable root cause entity and the probability level that indicates the likelihood of failure. The associated events are all recent events that occurred on the probable root cause entity. With detailed associated events, DevOps practitioners can quickly identify issues, incidents, or change events that caused the problem.
An option (UI button) to view the trace error messages and logs that are relevant to the probable root cause, which helps you uncover additional details of the problem at first glance.
- The trace error messages are extracted through traces that flow through the probable cause (if there are any trace errors being logged by your system). The table displays both the error message itself, and the count of occurrences of that particular message that was logged during the defined time frame.
- The trace logs are a more comprehensive record of the events in the system's call flow. The trace logs are ordered by count and include log levels such as ERROR and WARN.
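
The error message table described above pairs each message with the number of times it was logged in the selected time frame. A minimal Python sketch of that aggregation follows. The `(timestamp, message)` tuple format is an assumption for illustration and is not the format that Instana uses internally.

```python
from collections import Counter


def error_message_table(trace_errors, start, end):
    """Count error messages logged in [start, end], most frequent first.

    `trace_errors` is a hypothetical list of (timestamp, message) tuples.
    """
    counts = Counter(msg for ts, msg in trace_errors if start <= ts <= end)
    return counts.most_common()
```

Each row of the result is a `(message, count)` pair, which mirrors the two columns that the table displays.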

Issues

An Issue is an event that is created when an application, service, or any part of it is degraded. Instana comes with several hundred curated health signatures that detect various problems that range from degradation of service quality, to complex infrastructure issues, to disk saturation. Issues are automatically resolved when the metrics, events, or metadata return to the expected values.

In addition to built-in issues, you can define custom events to detect problems that are specific to your system.

To see all issues that are detected by Instana (both built-in and custom issues), go to the Events view, and select the Issues tab. You can use Dynamic Focus to filter issues.

Each Instana issue contains the following information:

Severity: This information can be CRITICAL or WARNING. CRITICAL means that there is a direct or indirect risk of data loss or that services are not available. WARNING means any other performance issue that might impact the user experience or lead to a problem in the long term.
Start, end time, and duration of the issue.
Affected entities: One or more entities are affected by the problem.
Details: Extra description that provides more context and measures to resolve the problem.
Metrics: Metric charts that show metric values that are relevant to the problem around the time the problem happened.
Where applicable, go to Unbounded Analytics to investigate traces, calls, or page loads that are affected by the issue.

In this example, the CPU steal time on one Linux machine is suspicious and is therefore marked as an issue. An issue by itself does not trigger an alert, but Instana does note that it happened. If the service to which this system is connected behaves badly, this issue becomes part of the incident. This methodology is one of the major benefits of Instana because you do not need to manually correlate events and performance problems. Just because something is using too much CPU for a while does not mean that a problem exists. This information becomes relevant only when it impacts a service.

For more information about managing built-in and custom issues, see Manage Built-in Events.

Because Instana knows all dependencies between monitored services, it triggers incidents for all quality of service issues that impact users. It also triggers incidents for critical infrastructure issues, such as disk saturation and Elasticsearch cluster split-brain situations, because these issues are likely to cause data loss.
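
The idea that an issue becomes part of an incident only when a connected service is impacted can be pictured as a walk over the dependency graph. The following Python sketch is a simplified illustration and is not how the Dynamic Graph is implemented. The `dependencies` and `issues_by_entity` dictionaries and all entity names are hypothetical.

```python
from collections import deque


def issues_for_incident(dependencies, impacted_service, issues_by_entity):
    """Collect open issues on the impacted service and everything it depends on.

    `dependencies` maps an entity to the entities it depends on.
    `issues_by_entity` maps an entity to a list of issue titles.
    """
    seen = {impacted_service}
    queue = deque([impacted_service])
    related = []
    while queue:
        entity = queue.popleft()
        related.extend(issues_by_entity.get(entity, []))
        for dep in dependencies.get(entity, []):
            if dep not in seen:  # visit each entity once, even in shared dependencies
                seen.add(dep)
                queue.append(dep)
    return related
```

In this sketch, a CPU issue on a host is collected only when a service that depends on that host is the impacted one. Issues on unrelated entities are left out.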

Note: Applications, services, or endpoints that receive infrequent traffic (for example, one call every 15 minutes) do not provide a sufficient basis for issue detection. The severity of an issue can change during its lifetime. The severity that is shown represents the highest severity that the issue ever reached.
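
The severity rule in the note can be expressed in a few lines. This Python sketch assumes an ordering in which CRITICAL ranks above WARNING. The numeric values are placeholders chosen for the example and are not Instana's internal severity codes.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Placeholder values; only the ordering matters for this example.
    WARNING = 1
    CRITICAL = 2


def reported_severity(observed):
    """Return the highest severity that an issue reached during its lifetime."""
    return max(observed)
```

For example, an issue that starts as WARNING, escalates to CRITICAL, and later drops back to WARNING is still reported as CRITICAL.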

Impacted Users for application issues (private preview)

This feature is in private preview. To be included in this program, contact Instana technical support.

By using this feature, you can see the users that are impacted by a specific event and get insights into how events affect your users, so that you can quickly identify and address issues that impact user experience.

Availability

To use this feature, ensure that the following conditions are met:

Both your front-end (website or mobile app) and back-end servers are monitored by Instana.
The correlation between front-end and back-end monitoring functions as expected. For more information, see [Backend correlation](../website_monitoring/backend_correlation.md).
The Impacted Users feature is currently supported only for application issues.

What is an impacted user?

An impacted user is a user whose experience is negatively affected by an application issue that triggers an event. For example, an impacted user might be someone whose journey or visit to your website or mobile app is disrupted by a back-end server error in one of the following ways:

The user encounters a critical error page and cannot continue using the site or app.
The user experiences significant delays or timeouts, leading to a disrupted experience.
The user's actions (such as form submissions or transactions) fail to complete due to server-side issues.

Event data correlation and impact analysis

When an event is triggered, the system correlates data from your front-end and back-end monitoring to identify which end users are impacted. Then, you can view detailed information about the affected users and understand the scope and impact of the issue.
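
Conceptually, this correlation resembles a join between front-end records and the back-end traces that are affected by the event. The following Python sketch illustrates that idea only. The `user_id` and `backend_trace_id` field names are hypothetical and do not represent Instana's data model.

```python
def impacted_users(beacons, affected_trace_ids):
    """Return the users whose front-end beacons link to affected back-end traces.

    `beacons` is a hypothetical list of dicts with `user_id` and
    `backend_trace_id` keys. `affected_trace_ids` is a set of back-end
    trace IDs that are associated with the event.
    """
    return {
        b["user_id"]
        for b in beacons
        if b.get("backend_trace_id") in affected_trace_ids and b.get("user_id")
    }
```

Beacons without a user identifier are skipped in this sketch, and a user who hits several affected traces is counted once.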

Changes

A Change is an event that represents a change on a system, such as a server start or stop, a deployment, or a configuration change. Changes are further separated into:

Changes - Changed configuration of components, for example versions, environment variable values, and other components
Offline/Online - Tracking the presence of components under management

Change events are important information that is used together with the Dynamic Graph to automatically detect how configuration changes relate to incidents.

Events view

To see all events that are detected by Instana, go to the Events dashboard and select the Incidents, Issues, Changes or All tabs to see the corresponding event types.

Filtering Capabilities for all Events

Dynamic Focus Query

Searching through events that are discovered by Instana relies on the Dynamic Focus feature. When you select one or more bars in the Events bar chart, the Events table lists only the events that are included in the selected bars, so you can inspect events in detail without changing the current time interval. You can also use the search box to find specific items by the data in the “Title” or “On” columns (the service where the incident occurred) in the Overview table. In this example, the search query is event.text:"Error rate". The result is a list of all events that contain the phrase "Error rate" in the title.
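
The effect of a query such as event.text:"Error rate" can be approximated as a phrase match on the event title. The following Python sketch illustrates that behavior on a local list of events. It is not the Dynamic Focus implementation, the `title` key is hypothetical, and the case-insensitive match is an assumption.

```python
def filter_by_title(events, phrase):
    """Keep the events whose title contains the phrase.

    Case-insensitive matching is an assumption made for this example.
    """
    needle = phrase.lower()
    return [e for e in events if needle in e["title"].lower()]
```

For example, filtering on "Error rate" keeps an event titled "Error rate is higher than normal" and drops one titled "Sudden increase in average latency".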

Filtering Table

The Events view provides filtering capabilities through dedicated UI filters. You can filter the event list by using three key filter options:

Transient Events: whether events are transient, non-transient, or both.
Event Type: whether events are Built-in or Custom.
Smart Alerts: whether events are triggered by Smart Alerts from Application, Website, Synthetics, Infrastructure, Mobile, Log, or SLO.

These filters can be used individually or in combination to quickly find relevant events and focus your troubleshooting efforts on what matters most.
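
Combining the filters narrows the list to events that match every selected option. The following Python sketch shows that behavior on a local list. The `Event` fields and the string values are hypothetical names chosen for the example and do not reflect Instana's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    title: str
    transient: bool
    event_type: str             # "built-in" or "custom" (hypothetical values)
    smart_alert: Optional[str]  # for example "Application", or None


def filter_events(events, transient=None, event_type=None, smart_alert=None):
    """Apply each filter that is set; None means the field is not filtered."""
    result = []
    for e in events:
        if transient is not None and e.transient != transient:
            continue
        if event_type is not None and e.event_type != event_type:
            continue
        if smart_alert is not None and e.smart_alert != smart_alert:
            continue
        result.append(e)
    return result
```

Leaving a filter unset keeps all values for that field, which corresponds to using the filters individually.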