Intelligent incident investigation
Intelligent incident investigation identifies the probable root cause of an incident across your entire environment by using agentic AI, causal AI analysis, and large language models (LLMs).
Modern cloud-native applications run across hundreds of microservices, containers, Kubernetes clusters, and infrastructure components. When a production incident occurs, DevOps and SRE teams must quickly determine what broke, where, and why. Manually correlating metrics, logs, traces, and change events across multiple tools can take hours and significantly increase mean time to resolution (MTTR).
The AI-powered intelligent incident investigation feature builds on Instana's full-stack observability platform to automatically analyze system topology, distributed tracing, application performance metrics, logs, and infrastructure events in parallel. It performs a multi-entity investigation (system-wide analysis of related services and infrastructure), providing an AI investigation experience that reduces MTTR by up to 60-80% and accelerates incident response automation. The investigation processes use generative AI to synthesize findings into plain-language recommendations, identifying root causes that might be several layers removed from where the symptoms appeared.
What is intelligent incident investigation?
- Causal AI-based probable root cause identification.
- Topology-aware, multi-entity LLM-powered agentic investigation across services, infrastructure, and Kubernetes resources.
- Plain-language investigative guidance to support incident response automation.
- Builds and refines hypotheses about the incident.
- Traverses the dynamic service and infrastructure graph.
- Correlates change events, configuration updates, and deployments with the incident.
- Produces a failure propagation chain, a prioritized list of affected components, and recommended remediation actions.
- DevOps and SRE teams that follow site reliability engineering (SRE) best practices.
- Platform teams that run Kubernetes, microservices, and cloud-native observability.
- Operations teams looking to reduce alert fatigue, improve production incident management, and shorten on-call incident response.
Key benefits
- Automated root cause analysis (RCA):
- Automatically identifies the most likely root cause across services, infrastructure, and Kubernetes resources.
- Reduces dependence on specific domain experts and ad-hoc queries.
- Reduced MTTR and faster incident resolution:
- Quickly identifies the source of problems. Typical investigations complete in minutes instead of hours of manual analysis.
- Helps you meet service level objectives (SLOs) and error budget targets by speeding up incident remediation.
- Full-stack, multi-entity investigation:
- Moves beyond single-service views and analyzes entire failure propagation chains across microservices, databases, queues, and infrastructure. This analysis covers the entire system topology rather than isolated components, so it captures how components interact and affect each other.
- Works with distributed tracing, log aggregation, and infrastructure monitoring to provide full-stack observability.
- Correlates changes to automatically identify relevant configuration changes that might have contributed to the incident.
- Evidence-driven recommendations:
- Provides clear supporting evidence (metrics, logs, traces, and change events) for each conclusion.
- Generates AI-authored remediation suggestions in plain language for incident response runbooks and incident war room discussions.
- Integrated with your observability platform and AIOps workflows:
- Integrates with existing Instana features such as Smart Alerts, probable root cause, topology, and application performance monitoring (APM).
- Can be combined with incident remediation capabilities and external automation tooling to move from investigation to incident response automation.
How intelligent investigation works
When you run an investigation, Instana performs a comprehensive analysis of your incident through an automated, multi-phase process. While the investigation is ongoing, you can continue working with the Instana application and explore the infrastructure context of the incident.
Prerequisites and getting started
- SaaS environment requirements
For SaaS environments, intelligent incident investigation is available by default, but your environment must still meet the following conditions:
- An application Smart Alert incident with probable root cause identified must exist.
- If you use Custom Events, you must migrate to Instana Smart Alerts.
- Permissions: No permissions are required to view an incident and run an investigation. However, to set up the creation of incidents, you must have an account with the Configuration of Events and Alerts permission (under Events and alerts management access).
For more information about setting permissions, see Managing user access.
- Self-hosted environment requirements
For Standard Edition and Custom Edition, you must enable the feature.rca.agentic.enabled feature flag to turn on the intelligent incident investigation capabilities.
- To configure the feature flag on Standard Edition, see Enabling optional features in Standard Edition.
- To configure the feature flag on Custom Edition, see Enabling optional features in Custom Edition.
- An application Smart Alert incident with probable root cause identified must exist.
- If you use Custom Events, you must migrate to Instana Smart Alerts.
- Permissions: No permissions are required to view an incident and run an investigation. However, to set up the creation of incidents, you must have an account with the Configuration of Events and Alerts permission (under Events and alerts management access).
For more information about setting permissions, see Managing user access.
Starting an investigation:
When your software system experiences an issue or outage, you can open the corresponding incident from the Instana Events dashboard to start the investigation into the incident.
- From the Instana menu, go to .
- Filter for Application Smart Alerts and apply the dynamic focus query (DFQ): event.rca.found:true to show the incidents that have a probable root cause.
- Click on an incident to open its details view and scroll to the Explore probable root cause section.
- Click Run investigation. The investigation starts automatically and streams results in real time.
You can close the investigation window while the investigation is running; it continues in the background. Meanwhile, you can keep working, for instance to explore the application and infrastructure context and the symptoms of the incident.
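The filter in the steps above narrows the list to Application Smart Alert incidents for which a probable root cause was found. The following is a minimal local sketch of that selection logic only. It does not call any Instana API, and the record shape and the field names `kind` and `rca_found` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    # Hypothetical record shape; this is not the Instana event schema.
    incident_id: str
    kind: str        # for example "application_smart_alert"
    rca_found: bool  # mirrors the intent of the query event.rca.found:true


def investigable(incidents):
    """Return the incidents that satisfy both conditions of the filter."""
    return [
        i for i in incidents
        if i.kind == "application_smart_alert" and i.rca_found
    ]


incidents = [
    Incident("inc-1", "application_smart_alert", True),
    Incident("inc-2", "application_smart_alert", False),
    Incident("inc-3", "custom_event", True),
]
print([i.incident_id for i in investigable(incidents)])
```

In this sketch only the first incident qualifies: the second has no probable root cause, and the third is not an Application Smart Alert.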
Viewing the results:
- Investigation results stream in real time as each phase completes.
- The final results summary for the investigation appears at the beginning of the page when the investigation completes.
- Review the Failure propagation chain to understand how the issue spread.
- Examine Trace logs and Trace error messages for detailed evidence.
- Check Recommended actions for specific resolution steps.
Investigation phases
- Initial insights
- Change event analysis
- Entity analysis
- Final report
- Initial insights
First, the investigation processes establish the foundation of the investigation by collecting incident details, probable root cause data with health scores, related events, initial trace error patterns, and topology relationships.
- Change event analysis
Next, the investigation processes examine system modifications within a window that runs from 60 minutes before the incident start to 20 minutes after it. The processes analyze up to 200 change events, including deployments, configuration changes, and scaling operations, to identify suspicious timing.
- Entity analysis
The agent runs a tool call to collect data about an entity of interest and the entities around it. It then completes a comprehensive analysis across six entity categories (focal entity, service callers, service callees, infrastructure parent, infrastructure children, and other infrastructure dependencies) by collecting trace logs, error messages, events, and metrics for each entity.
- Final report
Finally, the investigation processes synthesize all findings into a comprehensive report with root cause identification, fault conditions, failure propagation chain visualization, consolidated evidence, component classification, remediation suggestions, and timeline correlation. The overall process is designed to be non-intrusive and to operate asynchronously so that you can continue working while the investigation and analysis occur. Results are streamed in real time as they become available, giving you immediate visibility into the investigation progress.
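The way results stream phase by phase can be pictured as a generator that runs the four phases in order and hands back each result as soon as it is ready. This is only an illustrative sketch, not Instana's implementation: the phase names come from the list above, while `run_investigation`, the `handlers` mapping, and the shape of each result are assumptions.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Phase names are taken from the text; everything else is hypothetical.
PHASES: List[str] = [
    "Initial insights",
    "Change event analysis",
    "Entity analysis",
    "Final report",
]


def run_investigation(
    handlers: Dict[str, Callable[[dict], dict]],
) -> Iterator[Tuple[str, dict]]:
    """Run the phases in order and yield each result as soon as it is ready."""
    context: dict = {}
    for phase in PHASES:
        result = handlers[phase](context)
        context[phase] = result  # later phases can read earlier findings
        yield phase, result


# Placeholder handlers that only report how many earlier phases they can see.
handlers = {name: (lambda ctx: {"inputs": len(ctx)}) for name in PHASES}

for phase, result in run_investigation(handlers):
    print(f"{phase}: {result}")
```

Because the function is a generator, a caller can display each phase as it arrives instead of waiting for the final report, which is the behavior the streaming description implies.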
Data collection and analysis
The investigation collects data across multiple dimensions of your system, gathering evidence from each component in the failure propagation chain. Data is collected in batches and processed asynchronously by backend processes, so insights arrive in real time as the investigation progresses.
Investigation scope and entity categories
- Primary component: The service or component that is most likely responsible for the issue.
- Connected services: The investigation reviews both upstream and downstream services that interact with the primary component to assess potential ripple effects.
- Infrastructure context: The investigation examines the underlying systems and platforms hosting these services, as well as any dependent resources like databases or message queues.
- Application-level relationships: The investigation considers related application components and orchestration elements to ensure a complete picture.
- Change events: The investigation considers change events, built-in events, and Kubernetes events in the scope of the incident.
This approach helps to quickly isolate the root cause and understand its broader impact, enabling faster and more accurate resolution.
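To make the scope concrete, the six entity categories named in the Entity analysis phase can be derived from a small topology graph. The sketch below assumes a toy data model in which the topology is given as three sets of pairs; the function `categorize`, its parameters, and the example entity names are hypothetical and are not part of Instana.

```python
from enum import Enum


class Category(Enum):
    FOCAL = "focal entity"
    CALLER = "service caller"
    CALLEE = "service callee"
    INFRA_PARENT = "infrastructure parent"
    INFRA_CHILD = "infrastructure child"
    INFRA_OTHER = "other infrastructure dependency"


def categorize(focal, calls, hosted_on, depends_on):
    """Group the entities around `focal` into the six categories.

    calls:      set of (caller, callee) service pairs
    hosted_on:  set of (child, parent) infrastructure pairs
    depends_on: set of (entity, dependency) pairs, e.g. a database or queue
    """
    groups = {c: set() for c in Category}
    groups[Category.FOCAL].add(focal)
    for caller, callee in calls:
        if callee == focal:
            groups[Category.CALLER].add(caller)
        if caller == focal:
            groups[Category.CALLEE].add(callee)
    for child, parent in hosted_on:
        if child == focal:
            groups[Category.INFRA_PARENT].add(parent)
        if parent == focal:
            groups[Category.INFRA_CHILD].add(child)
    for entity, dependency in depends_on:
        if entity == focal:
            groups[Category.INFRA_OTHER].add(dependency)
    return groups


groups = categorize(
    focal="checkout",
    calls={("frontend", "checkout"), ("checkout", "payments")},
    hosted_on={("checkout", "pod-a"), ("sidecar", "checkout")},
    depends_on={("checkout", "orders-db")},
)
for category, members in groups.items():
    print(f"{category.value}: {sorted(members)}")
```

Only direct neighbors of the focal entity are grouped here. How far the real investigation traverses the graph is not stated in the text, so the one-hop limit is a simplification of this sketch.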
Investigation results
The intelligent incident investigation provides a comprehensive analysis that includes:
| Component | Description |
|---|---|
| Probable root cause identification | Precise determination of which component is the source of the problem. |
| Failure propagation chain | Visual representation of how the issue spread through your system. |
| Evidence summary | Consolidated metrics, logs, traces, and change events supporting the conclusion. |
| Component classification | Clear indication of which components are causing issues versus showing symptoms. |
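One way to picture how the four result components in the table fit together is a small data structure. The sketch below is an assumption made for illustration: the class names, field names, and the "cause" and "symptom" labels are hypothetical and do not describe the format in which Instana stores or returns results.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    name: str
    role: str  # "cause" or "symptom"; these labels are assumptions


@dataclass
class InvestigationReport:
    probable_root_cause: str
    failure_propagation_chain: List[str]  # ordered from cause to symptom
    evidence: List[str] = field(default_factory=list)
    components: List[Component] = field(default_factory=list)

    def chain_as_text(self) -> str:
        """Render the propagation chain as a single readable line."""
        return " -> ".join(self.failure_propagation_chain)

    def symptomatic(self) -> List[str]:
        """Names of components that show symptoms but are not the cause."""
        return [c.name for c in self.components if c.role == "symptom"]


report = InvestigationReport(
    probable_root_cause="orders-db",
    failure_propagation_chain=["orders-db", "checkout", "frontend"],
    evidence=["change event: orders-db configuration update"],
    components=[
        Component("orders-db", "cause"),
        Component("checkout", "symptom"),
        Component("frontend", "symptom"),
    ],
)
print(report.chain_as_text())
print(report.symptomatic())
```

Keeping the classification separate from the chain reflects the distinction the table draws between components that cause the issue and components that only show symptoms.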
Analysis time window
The investigation analyzes system data beginning from 60 minutes before the incident starts, providing comprehensive context about events and changes that led to the problem. During an investigation, Instana collects data in batches from all relevant entities and uses backend processes to analyze the data asynchronously. This approach is designed to return results quickly without requiring manual data correlation and to limit additional load on your monitored systems.
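As a worked example of the window arithmetic, the change event analysis phase looks from 60 minutes before the incident start to 20 minutes after it and analyzes up to 200 change events. The following sketch applies those three numbers to a list of timestamped events. The function name, the tuple shape, the inclusive window bounds, and the rule for choosing which events to keep when more than 200 match are all assumptions; the text does not specify them.

```python
from datetime import datetime, timedelta

LOOKBACK = timedelta(minutes=60)   # window opens 60 minutes before incident start
LOOKAHEAD = timedelta(minutes=20)  # window closes 20 minutes after incident start
MAX_CHANGE_EVENTS = 200            # upper bound on analyzed change events


def change_events_in_window(incident_start, events):
    """Return at most MAX_CHANGE_EVENTS events inside the analysis window.

    `events` is an iterable of (timestamp, description) tuples. Keeping the
    events closest to the incident start when the cap is exceeded is an
    assumption of this sketch, as are the inclusive window bounds.
    """
    start, end = incident_start - LOOKBACK, incident_start + LOOKAHEAD
    in_window = [e for e in events if start <= e[0] <= end]
    in_window.sort(key=lambda e: abs(e[0] - incident_start))
    return in_window[:MAX_CHANGE_EVENTS]


incident_start = datetime(2024, 1, 1, 12, 0)
events = [
    (datetime(2024, 1, 1, 10, 30), "deployment: too early"),
    (datetime(2024, 1, 1, 11, 45), "deployment: checkout v2"),
    (datetime(2024, 1, 1, 12, 10), "scaling: payments replicas"),
    (datetime(2024, 1, 1, 12, 30), "config: too late"),
]
selected = change_events_in_window(incident_start, events)
for ts, description in selected:
    print(ts.isoformat(), description)
```

With an incident start of 12:00, the window spans 11:00 to 12:20, so the 10:30 and 12:30 events fall outside it and only the two middle events are kept.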
Frequently asked questions (FAQ)
How do I manually get to the page for an incident with a probable root cause to run an investigation?
- From the Instana menu, go to .
- Filter for Application Smart Alerts and apply the dynamic focus query (DFQ): event.rca.found:true to show the incidents that have a probable root cause.
- Click on an incident to open its details view and scroll to the Explore probable root cause section.
- Click Run investigation.
The investigation starts automatically and streams results in real time. You can continue working while it runs in the background.
Requirements: The incident must be an Application Smart Alert with a probable root cause identified.
Can I run multiple investigations simultaneously?
Yes, you can run investigations on different incidents simultaneously. Each investigation operates independently and maintains its own state.
What happens if I close the browser during an investigation?
The investigation continues running in the background. When you return to the incident page, you can resume viewing the investigation results. The system caches the investigation state, allowing you to reconnect to ongoing investigations.
Can I export investigation results?
Investigation results are stored and can be accessed through the incident page. The complete investigation state, including all findings and evidence, is preserved for future reference.
What if the investigation doesn't identify a root cause?
If the investigation cannot definitively identify a root cause, it provides all collected evidence, failure propagation chains, and component classifications to help you investigate the incident manually. The consolidated evidence and AI analysis can help you troubleshoot more efficiently.