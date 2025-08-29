Site reliability engineering (SRE) and DevOps teams are exhausted. Sprawling IT estates, tool overload and the job’s on-call nature all play a role in an overarching issue: alert fatigue.
Alert fatigue (sometimes called alarm fatigue) refers to “a state of mental and operational exhaustion caused by an overwhelming number of alerts.” It erodes the responsiveness and efficacy of DevOps, security operations center (SOC), site reliability engineering (SRE) and other teams responsible for IT performance and security, and is a widespread, consequential problem.
Vectra’s “2023 State of Threat Detection” report (based on a survey of 2,000 IT security analysts at firms with 1,000 or more employees) found that SOC teams field an average of 4,484 alerts per day. Of these, 67% are ignored due to a high volume of false positives and alert fatigue. The report also found that 71% of analysts believed that their organization might already have been “compromised without their knowledge, due to lack of visibility and confidence in threat detection capabilities.”
While the Vectra report takes a security-specific focus, teams charged with monitoring application and infrastructure performance face a similar overload. For example, a single misconfiguration can cause hundreds or thousands of performance alerts, an “alert storm” that can distract or desensitize IT teams and cause delayed responses to critical alerts and real issues. Those real issues can be costly.
What’s driving this burnout, and can agentic AI be part of a scalable solution?
There are several culprits, and an overwhelming volume of telemetry is often cited as one of them, but a focus on data volume specifically obscures a core issue—data quality and context.
When teams are dealing with loads of low-quality, context-poor data spread across dozens of different threat intelligence or performance feeds, they are bound to encounter trouble. This is the sort of environment in which false positives and redundant alerts proliferate, and low-priority noise distracts from real threats and performance issues. These “false alarms” can grind the life out of IT, DevOps and security teams.
Simply feeding these massive telemetry streams into a large language model (LLM) isn’t a viable solution, either. For one, it’s a waste of compute. It’s also a great way to produce hallucinations.
A practical solution starts with developing a workflow that synthesizes raw data and aggregates the resulting higher-quality, context-rich data within a centralized platform. There it can be used for enterprise-wide observability and the training of local AI models.
Enterprises often use many performance and security monitoring solutions—large enterprises have an average of 76 security tools. These tools can be team- or product-specific, or specific to a certain IT environment (on-premises solutions vs. cloud solutions, for example).
Each one of these tools might be responsible for monitoring dozens or hundreds of applications, application programming interfaces (APIs) or servers, each feeding their own data pipeline. With such silos, separate tools can generate multiple alerts stemming from the same underlying issue. This lack of integration limits visibility, which hampers correlation and root cause analysis. SREs waste time chasing up each one of these alerts before identifying the redundancies.
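To make the redundancy problem concrete, here is a minimal sketch of collapsing alerts from separate tools into a single issue. It assumes each tool's alerts have already been normalized to a shared resource name and symptom label, which is itself a nontrivial integration step. The `Alert` fields, the tool names and the five-minute window are all hypothetical choices for illustration, not part of any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    tool: str         # monitoring tool that raised the alert (hypothetical names)
    resource: str     # affected host, service or API
    symptom: str      # normalized symptom label, e.g. "high_latency"
    timestamp: float  # seconds since epoch


def deduplicate(alerts, window_seconds=300.0):
    """Group alerts that share a resource and symptom within a time window.

    Assumption: alerts with the same (resource, symptom) that arrive close
    together stem from the same underlying issue.
    """
    groups = []      # each group is a list of alerts treated as one issue
    last_group = {}  # (resource, symptom) -> most recent group for that key
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.resource, alert.symptom)
        group = last_group.get(key)
        if group is not None and alert.timestamp - group[-1].timestamp <= window_seconds:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            last_group[key] = group
    return groups


alerts = [
    Alert("apm", "checkout-api", "high_latency", 1000.0),
    Alert("infra-monitor", "checkout-api", "high_latency", 1030.0),
    Alert("synthetics", "checkout-api", "high_latency", 1090.0),
    Alert("infra-monitor", "db-primary", "disk_full", 1100.0),
]
groups = deduplicate(alerts)
print(len(alerts), "alerts ->", len(groups), "issues")
```

In this toy data, three tools report the same latency problem on one API, so an on-call engineer would see one grouped issue for it instead of three separate pages.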
When data streams are not integrated into a comprehensive monitoring system, IT teams don’t have the system-wide observability needed for efficient alert correlation, root cause analysis and remediation.
What’s worse, this lack of integration hinders the efficacy of the automation tools for alert management (such as alert prioritization and correlation workflows) that are set up to assist in detection and resolution and to reduce the volume of alerts. Teams are left to manually connect the dots, an arduous and time-consuming (if not impossible) task.
A survey cited in Deloitte’s “Adaptive Defense: Custom Alerts for Modern Threats” report found that a “lack of visibility or context from security tools resulted in 47% of attacks being missed in a 12-month period.”
While individual agents don’t necessarily require centralization, a centralized platform where data from agents is aggregated facilitates system-wide analysis, storage and visualization.
Yes…with a focused strategy.
A recent MIT report ignited a firestorm with the claim that “95% of organizations are getting zero return” on their generative AI investments.
Setting aside the inflammatory stat and the cascade of opinions the report elicited, the report highlights a valuable theme: many AI projects fail because of “brittle workflows, lack of contextual learning, and misalignment with day-to-day operations.” As Marina Danilevsky, Senior Research Scientist at IBM, notes on a recent Mixture of Experts podcast, the most successful deployments are “focused, scoped and address a proper pain point.”
What the MIT report seems to reinforce is that companies that view AI as a sort of panacea, or as something that can be haphazardly shoehorned into a process, aren’t likely to see a return on their investment. Organizations that strategically implement AI tools in their workflows to solve a specific problem, and reinforce these tools over time, are better positioned for success.
An observability or security solution that can incorporate adaptive machine learning, contextual prioritization, explainable AI, AI-powered automation and real-time intelligence into an integrated strategy can enable teams to create stronger workflows that help correlate, prioritize and remediate performance or security alerts.
AI agents can improve traditional systems that rely on static rules and preset thresholds by bringing factors like asset importance, performance guarantees, risk profiles and historical trends to bear.
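As a rough illustration of contextual prioritization, the sketch below contrasts with a static threshold by blending tool-reported severity with asset importance, SLA breach risk and historical false-positive rate. The field names, the weights and the formula are assumptions chosen for readability. A real system would likely learn or tune them rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass
class AlertContext:
    name: str
    severity: float                        # 0.0-1.0, as reported by the detecting tool
    asset_importance: float                # 0.0-1.0, business criticality of the asset
    sla_breach_risk: float                 # 0.0-1.0, likelihood a performance guarantee is violated
    historical_false_positive_rate: float  # 0.0-1.0, from past alerts of this type


# Hypothetical weights; they sum to 1.0 so the base score stays in [0, 1].
WEIGHTS = {"severity": 0.4, "asset_importance": 0.3, "sla_breach_risk": 0.3}


def priority_score(ctx):
    """Weighted blend of context signals, discounted by historical noise."""
    base = (
        WEIGHTS["severity"] * ctx.severity
        + WEIGHTS["asset_importance"] * ctx.asset_importance
        + WEIGHTS["sla_breach_risk"] * ctx.sla_breach_risk
    )
    return base * (1.0 - ctx.historical_false_positive_rate)


# Two alerts with identical raw severity but very different context.
candidates = [
    AlertContext("batch-report-node", 0.8, 0.2, 0.1, 0.6),
    AlertContext("payments-db", 0.8, 0.9, 0.8, 0.05),
]
ranked = sorted(candidates, key=priority_score, reverse=True)
for ctx in ranked:
    print(f"{ctx.name}: {priority_score(ctx):.3f}")
```

A static rule keyed only on severity would treat both alerts the same. With context applied, the alert on the business-critical database ranks well above the historically noisy batch node.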
For example, consider a post-incident detection and remediation workflow, and how an AI agent might assist an SRE team.
A notification hits the alert system flagging high CPU usage for a node in a Kubernetes cluster. In a traditional system, SREs might need to comb through MELT data (metrics, events, logs, traces) and dependencies to identify the root cause.
In this hypothetical agentic workflow, the agent uses the observability tool’s knowledge graph and topology-aware correlation to pull only the telemetry related to the alert (such as logs for the services running on that node, recent deployments, telemetry from the Kubernetes API server or load balancers that route traffic to the node or cluster). With this additional information, the agent can enrich raw alerts and provide context-rich telemetry to a local AI model trained on the enterprise’s performance data and benchmarks.
The agent excludes irrelevant information, such as logs for unrelated services that happen to run on the same cluster. During this context gathering, the agent can also identify related signals and correlate alerts that likely stem from the same root cause and group these alerts together to be investigated as one incident.
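The context-gathering step can be sketched as a bounded graph traversal. In the example below, the topology is a hand-written adjacency map standing in for the observability tool's knowledge graph, and every entity name, the edge direction ("telemetry from this neighbor is relevant") and the two-hop limit are hypothetical assumptions rather than the behavior of any particular product.

```python
from collections import deque

# Hypothetical directed topology: entity -> entities whose telemetry is
# relevant when investigating it.
TOPOLOGY = {
    "node-7": ["checkout-svc", "k8s-api-server", "lb-east"],
    "checkout-svc": ["deploy-checkout-v42"],
    "node-9": ["reporting-svc"],  # same cluster, but not connected to node-7
}


def related_entities(start, topology, max_hops=2):
    """Breadth-first search that stops expanding after max_hops edges."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        entity, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in topology.get(entity, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return seen


def select_telemetry(records, scope):
    """Keep only telemetry records whose source is inside the scope."""
    return [r for r in records if r["source"] in scope]


telemetry = [
    {"source": "checkout-svc", "message": "request queue depth rising"},
    {"source": "deploy-checkout-v42", "message": "rollout completed"},
    {"source": "lb-east", "message": "traffic to node-7 doubled"},
    {"source": "reporting-svc", "message": "nightly export started"},
]

scope = related_entities("node-7", TOPOLOGY)
relevant = select_telemetry(telemetry, scope)
print(sorted(r["source"] for r in relevant))
```

Here the log line from `reporting-svc` is dropped because nothing in the graph links it to the alerting node, which mirrors the exclusion of unrelated services described above.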
With this information, the model can propose a hypothesis. The agent can also request more information (perhaps checking container configurations or time series data around the usage spike) to check and refine the model hypothesis, adding additional context before proposing a probable root cause.
The use of explainable AI and agents is a crucial part of solving the trust issue: “seeing inside the black box,” or the internal workings, of an AI tool.
Explainable artificial intelligence (XAI) “is a set of processes and methods that allows human users to comprehend and trust the results and output created by machine learning algorithms.”
In addition to the probable root cause, the agent can provide explainability through its chain of thought—its reasoning process—along with supporting evidence that demonstrates how it arrived at the proposed probable root cause. This explainability and supporting evidence:
- Enables humans to see why something has been recommended or filtered a certain way
- Provides the transparency needed to review the agent’s analysis and proposal, and judge whether it can be trusted
SRE analysis and assessment of agent recommendations can be fed back into the model to further improve accuracy.
There are several paths forward to resolution. Teams can decide how much autonomy to provide an agent, or define this autonomy based on incident type, severity, environment or other factors. The next steps include:
- Validation: An agent can generate steps to help SRE and DevOps teams validate that the root cause the agent identified is correct. This helps keep human input in the system.
- Runbook: When validated, the agent can produce a step-by-step guide of remediation steps (a runbook). This is a script that team members can follow to resolve the issue.
- Automation scripts: The agent can also take the actions it has suggested and build workflows (automation scripts). It might turn these runbook steps into an Ansible playbook snippet with the command syntax and parameters for the steps.
- Documentation: Agents can produce automatic documentation, such as a post-incident review, that summarizes the incident, actions taken and reasons for doing so. An agent can also produce an in-progress summary that helps those new to the task quickly understand what’s going on. This documentation can be used for reinforcement learning.
These steps all help optimize incident response and reduce mean time to repair.
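One way a team might encode the autonomy decision described above is as an explicit policy function. The sketch below is an assumption about how such a policy could look. The three autonomy levels, the severity labels and the rule that production incidents of high severity stop at a human-executed runbook are all hypothetical.

```python
from enum import Enum


class Autonomy(Enum):
    SUGGEST = "propose a root cause and validation steps only"
    RUNBOOK = "generate a runbook for humans to execute"
    AUTOMATE = "generate and run automation scripts"


def allowed_autonomy(severity, environment, root_cause_validated):
    """Return the most autonomy the agent may exercise for an incident.

    Hypothetical policy: nothing beyond suggestions until a human has
    validated the root cause, and no unattended automation for
    high-severity production incidents.
    """
    if not root_cause_validated:
        return Autonomy.SUGGEST
    if environment == "production" and severity in {"critical", "high"}:
        return Autonomy.RUNBOOK
    return Autonomy.AUTOMATE


print(allowed_autonomy("critical", "production", root_cause_validated=True).name)
print(allowed_autonomy("low", "staging", root_cause_validated=True).name)
print(allowed_autonomy("low", "staging", root_cause_validated=False).name)
```

Keeping the policy in one small function makes it easy to review and to tighten or relax per incident type as trust in the agent's recommendations grows.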
AI frameworks can be used to improve various aspects of alert fatigue, such as the prioritization of actionable alerts across an IT environment.
In a 2023 paper titled “That Escalated Quickly: An ML Framework for Alert Prioritization,” Gelman et al. introduce a machine learning framework designed to reduce alert fatigue with minimal changes to existing workflows through an alert-level and incident-level actionability scoring system. Run on real-world data, the TEQ model reduced response time to actionable incidents by 22.9% and suppressed 54% of false positives (with a 95.1% detection rate). It also reduced the number of alerts within singular incidents by 14% [1].
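The general idea of scoring at both the alert level and the incident level can be sketched as follows. This is not the TEQ implementation: the per-alert scores are made-up stand-ins for a model's output, and both the max-based incident aggregation and the suppression threshold are simplifying assumptions for illustration.

```python
from collections import defaultdict


def score_incidents(alert_scores):
    """Aggregate per-alert actionability scores into one score per incident.

    alert_scores: list of (incident_id, alert_id, score) tuples.
    Assumption: an incident is as actionable as its most actionable alert.
    """
    by_incident = defaultdict(list)
    for incident_id, _alert_id, score in alert_scores:
        by_incident[incident_id].append(score)
    return {incident_id: max(scores) for incident_id, scores in by_incident.items()}


def prioritize(alert_scores, suppress_below=0.2):
    """Rank incidents by score and drop alerts that fall under a threshold."""
    incident_scores = score_incidents(alert_scores)
    ranked_incidents = sorted(incident_scores, key=incident_scores.get, reverse=True)
    kept_alerts = [a for a in alert_scores if a[2] >= suppress_below]
    return ranked_incidents, kept_alerts


# Hypothetical scores standing in for the output of a trained model.
scored = [
    ("inc-1", "a1", 0.05),
    ("inc-1", "a2", 0.10),
    ("inc-2", "a3", 0.92),
    ("inc-2", "a4", 0.15),
    ("inc-3", "a5", 0.40),
]
ranked_incidents, kept_alerts = prioritize(scored)
print(ranked_incidents)
print([alert_id for _incident_id, alert_id, _score in kept_alerts])
```

Ranking incidents lets analysts reach the most actionable one first, while the alert-level threshold trims low-value alerts inside each incident. Where that threshold sits determines the trade-off between suppressed noise and missed detections.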
In “Advancing Autonomous Incident Response: Leveraging LLMs and Cyber Threat Intelligence,” Tellache et al. demonstrate how a retrieval-augmented generation (RAG)-based framework can improve incident resolution by integrating data from cyberthreat intelligence sources [2]. A similar solution that uses agents to build on the RAG approach could be used to add greater context to performance data, for example, by fetching agreed-upon performance thresholds from enterprise service level agreements (SLAs) to help decide which application alerts need to be prioritized.
An IT team might use several agents to improve alert processes, each designed to address a different facet of alert fatigue, such as an incident triage agent that pulls out critical threats for immediate attention, or a routing agent that fields prioritized alerts and routes them to the appropriate team along with documentation and analysis.
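A triage step followed by a routing step could be sketched like this. The routing table, team names, alert fields and the 0.8 cutoff for "immediate attention" are hypothetical, and the priority values are assumed to come from an upstream scoring step.

```python
# Hypothetical mapping from alert category to the owning on-call team.
ROUTES = {
    "database": "dba-oncall",
    "network": "netops-oncall",
    "security": "soc-tier2",
}
DEFAULT_TEAM = "sre-oncall"  # fallback when no category matches


def triage_and_route(alerts, critical_threshold=0.8):
    """Split alerts into immediate and queued tickets, each tagged with a team."""
    immediate, queued = [], []
    for alert in sorted(alerts, key=lambda a: a["priority"], reverse=True):
        ticket = {
            "team": ROUTES.get(alert["category"], DEFAULT_TEAM),
            "alert_id": alert["id"],
            # Pass along whatever analysis an upstream agent attached.
            "analysis": alert.get("analysis", "no analysis attached"),
        }
        if alert["priority"] >= critical_threshold:
            immediate.append(ticket)
        else:
            queued.append(ticket)
    return immediate, queued


incoming = [
    {"id": "a1", "category": "database", "priority": 0.93,
     "analysis": "connection pool exhausted after latest deploy"},
    {"id": "a2", "category": "network", "priority": 0.35},
    {"id": "a3", "category": "storage", "priority": 0.85},
]
immediate, queued = triage_and_route(incoming)
for ticket in immediate:
    print("page", ticket["team"], "for", ticket["alert_id"])
```

Splitting triage from routing keeps each agent's responsibility narrow, which fits the article's point that scoped, focused deployments tend to fare better.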
By routing data into a centralized hub, enterprises can help eliminate blind spots and present agents with a more comprehensive understanding of the environment they operate in. AI is most effective when working with high-quality, trustworthy data, and a centralized platform can help ensure the uniform application of data governance standards. As organizations scale AI solutions, this platform plays a crucial role in maintaining consistency in data management and agent deployment across business units.
Can an organization just “use AI” and mop up the alert deluge? No. Can well-trained models and agents help synthesize and analyze telemetry, and triage alerts to give IT teams a break? Much more cause to be optimistic there.
The successful use of AI and agents to alleviate alert fatigue hinges upon a few key factors: the targeting of a specific use case, strategic implementation and the AI’s ability to learn and improve alongside dynamic environments. Enterprise leaders must understand what’s required, be willing to make the cultural changes and assign the resources necessary to make the system work, and find a vendor whose tools can be customized to match their needs.
[1] “That Escalated Quickly: An ML Framework for Alert Prioritization,” Gelman, Taoufiq, Vörös, Berlin, 15 February 2023
[2] “Advancing Autonomous Incident Response: Leveraging LLMs and Cyber Threat Intelligence,” Tellache, Korba, Mokhtari, Moldovan, Ghamri-Doudane, 14 August 2025