When a component in an application's end-to-end workflow becomes unavailable and affects internal users or external clients, the clock starts ticking, and customer satisfaction can suffer significantly. Market research firm Aberdeen pegs the cost of an outage at about $260,000 per hour, and many businesses are ill-equipped to resolve such outages promptly.
But there’s some good news for frustrated CIOs. Market intelligence firm IDC predicts that, by 2024, enterprises that are powered by AI will be able to respond to customers, competitors, regulators, and partners 50% faster than those that are not using AI.
Dispersed Data, Imminent Problems
The central problem many IT departments face is that the vast volumes of data the modern enterprise constantly processes from various sources can’t be monitored in real time using traditional data analysis techniques or applications. When an issue occurs, it can take hours or even days to troubleshoot its root cause.
But there is a bright light on the horizon for CIOs. Over the last decade, the IT industry has seen the rise of a new set of frameworks for IT operations. DevOps and DataOps revolutionized the way IT departments integrate with the rest of the enterprise. Now a new industry methodology called AIOps further extends the ability of the IT department to respond to change and address issues in real time.
What Can AI Do?
AI is quickly becoming an essential component of today’s IT departments because it can be used to automate how enterprises detect, identify and respond to potentially costly or catastrophic IT anomalies during an event or even before they occur. AI solutions can address the vast volumes of data, structured and unstructured, that traditional system monitoring tools were not designed to oversee with a singular view.
AI can collect data from a heterogeneous array of sources across the IT infrastructure, from performance alerts to incident tickets. This data can be used, for example, to enable cost reductions and help achieve improved productivity by recognizing a specific time of day when demand on IT resources is low, and shifting compute resources automatically. If automatic adjustments are not desired, data can be displayed in a visual format that provides IT operations managers or Site Reliability Engineers (SREs) with recommended courses of action, and explains the rationale behind those recommendations. AI can automate tasks like shifting traffic from one router to another, freeing up space on a drive, or restarting an application. AI systems can also be trained to self-correct so IT managers and their teams can spend their time on higher value work, while simultaneously getting full visibility into the enterprise’s operations.
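As a rough illustration of the "specific time of day when demand on IT resources is low" idea, the sketch below averages utilization samples per hour and picks the quietest one. The function name `quietest_hour` and the sample data are hypothetical, and this is a simple average rather than anything a production AIOps system is known to use.

```python
from collections import defaultdict
from statistics import mean


def quietest_hour(samples):
    """Return the hour of day (0-23) with the lowest average utilization.

    `samples` is an iterable of (hour, utilization) pairs, where
    utilization is a fraction between 0.0 and 1.0 (hypothetical format).
    """
    by_hour = defaultdict(list)
    for hour, utilization in samples:
        by_hour[hour].append(utilization)
    # Compare hours by their mean utilization and keep the lowest.
    return min(by_hour, key=lambda h: mean(by_hour[h]))


# Made-up utilization samples for illustration only.
samples = [(9, 0.82), (9, 0.78), (14, 0.65), (3, 0.12), (3, 0.10), (22, 0.30)]
low_demand_hour = quietest_hour(samples)
print(low_demand_hour)
```

A scheduler could then use the returned hour as a candidate window for shifting compute resources, or simply surface it to an operations manager as a recommendation.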
Introducing IBM Watson AIOps
We’re excited to announce Watson AIOps, a new product that leverages machine learning, natural language understanding, explainable AI and other technologies to automate IT operations. Drawing from advances made at IBM Research, Watson AIOps gives businesses the ability to address and shape future outcomes, transitioning from a reactive posture toward proactive strategies. It is designed to introduce cost and personnel efficiencies, improve resilience across the enterprise’s information architecture, and speed issue resolution.
Watson AIOps is trained to connect the dots across data sources and common IT industry tools in real time, helping to quickly detect and identify issues. It extends beyond traditional structured sources of operational data, like metrics and alerts, to semi-structured and unstructured data, like logs and tickets. It combines these sources using machine learning and natural language understanding to create a synthesized, holistic problem report that helps teams identify and address the situation.
How It Works
Watson AIOps groups diverse sets of log anomalies and alerts based on spatial and temporal reasoning as well as similarity to past situations. Then, it provides a pointer to where the problem is occurring and identifies other services that might be affected, commonly known as the blast radius. It does this by showing details of the problem based on data from existing tools in the environment, all in the context of the application topology, distilling multiple signals into a succinct report.
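To make the grouping and blast-radius ideas concrete, here is a minimal sketch. It is not Watson AIOps code. It assumes alerts are plain dictionaries with a `ts` timestamp in seconds and that the application topology is available as a hypothetical `dependents` mapping from each service to the services that depend on it. It covers only time-based grouping and a topology walk, not similarity to past situations.

```python
from collections import deque


def group_by_time(alerts, window_seconds=60):
    """Group alerts so that each alert is within `window_seconds`
    of the previous alert in the same group."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        if groups and alert["ts"] - groups[-1][-1]["ts"] <= window_seconds:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


def blast_radius(dependents, failing_service):
    """Breadth-first walk over every service that depends, directly or
    indirectly, on `failing_service`."""
    affected = set()
    queue = deque([failing_service])
    while queue:
        service = queue.popleft()
        for dependent in dependents.get(service, []):
            if dependent not in affected and dependent != failing_service:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Made-up alerts and topology for illustration only.
alerts = [
    {"ts": 0, "service": "db"},
    {"ts": 20, "service": "orders"},
    {"ts": 45, "service": "checkout"},
    {"ts": 400, "service": "mail"},
]
dependents = {
    "db": ["orders", "inventory"],
    "orders": ["checkout"],
    "checkout": ["web"],
}

groups = group_by_time(alerts)
affected = blast_radius(dependents, "db")
print([len(g) for g in groups])
print(sorted(affected))
```

In this toy example the first three alerts land in one group because each arrives within a minute of the previous one, and a failure in `db` reaches every service downstream of it in the topology.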
Watson AIOps leverages IBM’s leading natural language processing (NLP) technology to understand the content in tickets to identify and extract resolution actions automatically. As a new issue is identified, Watson AIOps will identify similar past incidents and provide the recommended next best actions to address the current issue to restore service. With the insight from Watson AIOps, predictive and proactive capabilities can be leveraged to drive more automation, shifting operations teams to higher value work.
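The idea of matching a new issue to similar past incidents can be sketched with a deliberately simple stand-in for NLP: word-overlap (Jaccard) similarity between ticket summaries. This is an assumption made for illustration, not the technique Watson AIOps uses, and the incident records and field names (`summary`, `resolution`) are hypothetical.

```python
import re


def tokens(text):
    """Lowercase a ticket summary and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(new_issue, past_incidents):
    """Return the past incident whose summary best overlaps the new issue."""
    new_tokens = tokens(new_issue)
    return max(
        past_incidents,
        key=lambda incident: jaccard(new_tokens, tokens(incident["summary"])),
    )


# Made-up incident history for illustration only.
past_incidents = [
    {"summary": "Disk full on database host",
     "resolution": "Free space on the drive"},
    {"summary": "Router dropping packets in region east",
     "resolution": "Shift traffic to backup router"},
]

match = most_similar("Database host disk usage at 100 percent, disk full",
                     past_incidents)
print(match["resolution"])
```

A real system would likely use richer language models and weigh how well each past resolution worked, but the shape is the same: score past incidents against the new one and surface the resolution attached to the best match.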
IBM AI innovations are at the forefront of developing trusted and explainable technologies to help SREs interpret the reason behind a Watson AIOps recommendation. Maintaining transparency and explainability is critical to building trust in an AI system’s actions, and IBM continues to develop AI solutions that inspire confidence, including Watson OpenScale.
The Technology Behind the Service
As is the case with much of IBM’s AI development, significant portions of the technologies underlying Watson AIOps were born out of IBM Research. This new offering, part of what we’re calling AI for IT, is the culmination of years of research and development at IBM Research into how AI can be used to transform the IT lifecycle. Learn more on that here from IBM Research’s Chief Scientist, Ruchir Puri.
IBM Clients Using AIOps
IBM has partnered with Slack to provide what we think is a world-class ChatOps experience. ChatOps bypasses the traditional method of creating and responding to help tickets and support emails. When an issue arises, specific engineers or groups can be alerted by Watson AIOps from inside Slack. Then they can direct the system toward resolutions or deploy code, without needing to leave the chat environment. Everyone’s on the same page and all of these actions are logged in one convenient place. This partnership, along with integration with Box, represents an immediate solution to the current unplanned distribution of engineers who are working from home due to the COVID-19 pandemic and supports a future where IT services are more distributed as a matter of course.
Watson AIOps has also partnered with best-in-class monitoring solutions such as PagerDuty, LogDNA and Sysdig to deliver holistic insights across today’s IT environments. In addition, IBM Watson AIOps integrates with other IT Ops tools, is highly customizable, and uses Red Hat OpenShift to run on any cloud.
 IDC FutureScape: Worldwide Digital Transformation 2020 Predictions, Doc # US45569118, Oct 2019