Every day, billions of people globally use their computers or mobile devices to access the Internet. Invariably, some of those users attempt to access a website that is either slow to load or prone to crashing.
One possible reason a website underperforms is that too many people are trying to access it at the same time, overwhelming the servers. However, poor performance might also indicate a larger concern, such as DNS misconfiguration, a lasting server failure or a malicious attack from a bad actor.
Incidents are errors or complications in IT service that need remedying. Many incidents are temporary challenges that require a specific remedy. Those that point to underlying or more complicated issues, and that need a more comprehensive response, are called problems.
This explains the existence of both incident and problem management, two important processes for issue and error control, maintaining uptime, and ultimately, delivering a great service to customers and other stakeholders.
Organizations increasingly depend on digital technologies to serve their customers and collaborate with partners. An organization’s technology stack can create new and exciting opportunities to grow its business. But an error in service can also create exponential disruptions and damage to its reputation and financial health.
Incident management is how organizations identify, track, and resolve incidents that might disrupt normal business processes. It is often a reactive process where an incident occurs and the organization provides an incident response as quickly as possible.
As more organizations pursue digital transformation and other technology-driven operations, they depend more heavily on technology to deliver solutions to customers, which makes incident management even more important.
Organizations’ IT services are increasingly made up of a complex system of applications, software, hardware and other technologies, all of which can be interdependent. Individual processes can break down, disrupting the service that they provide to customers, costing the business money and creating reputational issues. Organizations have embraced advanced development operations (DevOps) procedures to minimize incidents, but they need a resolution process for when they occur.
Every day, organizations encounter and need to manage minor and major incidents, all of which have the potential to disrupt normal business functions. Organizations need to pay attention to several types of incidents, including unplanned interruptions like system outages, network configuration issues, bugs, security incidents, data loss and more.
As technology stacks have increased in complexity, it has become even more important to manage the incident management process strategically, so that everyone in the organization knows what to do if they encounter an incident.
Incident management systems have evolved from blunt tools, where employees recorded incidents that they observed (sometimes hours after the incidents occurred), to a robust, always-on practice with automation and self-service incident management software that enables anyone in the organization to report an incident to the service desk.
It is important to resolve incidents immediately and prevent them from happening again. This allows organizations to uphold their service-level agreement (SLA), which may guarantee a certain amount of uptime or access to services. Failing to adhere to an SLA might put your organization at legal or reputational risk.
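To make the uptime guarantee concrete, the following minimal sketch converts an uptime percentage into a downtime budget and checks recorded downtime against it. The function names, the 30-day default period and the 99.9% figure are illustrative assumptions, not terms from any specific SLA.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: int = 30) -> float:
    """Return the downtime budget, in minutes, implied by an uptime target.

    Both the target and the measurement period are assumptions here;
    a real SLA defines its own values and exclusions.
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


def sla_met(downtime_minutes: float, uptime_percent: float, period_days: int = 30) -> bool:
    """Return True if the recorded downtime fits within the budget."""
    return downtime_minutes <= allowed_downtime_minutes(uptime_percent, period_days)


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.9)
    print(f"Downtime budget: {budget:.1f} minutes")
    print("SLA met after a 60-minute outage:", sla_met(60, 99.9))
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of downtime, so a single hour-long outage would already breach it.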
The incident manager is the key stakeholder of the incident management process. An incident manager is responsible for managing the response to an incident and communicating progress to key stakeholders. It is a complex IT services role that requires the employee to perform under stressful conditions while communicating with stakeholders with different roles and priorities in the business.
Problem management is intended to prevent the incident from reoccurring by addressing the root cause. It logically follows incident management, especially if that incident has occurred several times and should likely be diagnosed as a problem or known error.
Incident management without problem management addresses only the symptoms and not the underlying cause (that is, the root cause), making it likely that similar incidents will occur in the future. Effective problem management identifies a permanent solution to problems, decreasing the number of incidents an organization will have to manage in the future.
A problem management team can engage in either reactive or proactive problem management, depending on what incidents they have observed and what historical data they have.
There is one major difference to consider when observing incidents versus problems: short-term versus long-term goals.
Incident management is more concerned with intervening in a specific instance of an issue, with the stated goal of getting the service back online without causing any additional issues. It is a short-term tool to keep the service running at that very moment.
Problem management focuses more on the long-term response, addressing any potential underlying cause as part of a larger potential issue (that is, a problem).
Organizations try to keep their IT infrastructure in good standing by using IT service management (ITSM) to govern the implementation, delivery, and management of services that meet the needs of end users. ITSM aims to minimize unscheduled downtime and ensure that every IT resource works as intended for every end user.
Issues arise regardless of how much effort organizations put into their ITSM. An organization’s ability to address and fix unforeseen issues before they turn into larger problems can be a huge competitive advantage. An IT service breaking down once is considered an incident.
For example, too many people trying to access a server may cause it to crash, creating an incident that your organization needs to fix. Incident management relates to fixing that particular issue affecting your users as quickly and carefully as possible. In this case, an incident manager can contact the organization’s employees and ask them to exit programs while the organization resolves the issue.
Incident management and problem management are both governed by the Information Technology Infrastructure Library (ITIL), a widely adopted guidance framework for implementing and documenting both management approaches. ITIL creates the structure for responding reactively to incidents as they occur. The most up-to-date release at the time of writing is ITIL 4.
It provides a library of best practices for managing IT assets and improving IT support and service levels. ITIL processes connect IT services to business operations so that they can change when business objectives change.
A key component of ITIL is the configuration management database (CMDB), which tracks and manages the interdependence of all software, IT components, documents, users and hardware that are required to deliver an IT service. ITIL also creates a distinction between incident management and problem management.
A constantly crashing server may represent a larger, systemic problem, like hardware failure or misconfiguration. The crashes may continue if the IT service team fails to uncover the root cause and map a solution to the underlying issue. In this case, the response may require an escalation to problem management, which is concerned with fixing repeated incidents.
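The interdependence that a CMDB tracks can be pictured as a dependency graph. The sketch below is a toy illustration, not how any particular CMDB product stores data: the configuration item names and the impacted_by helper are hypothetical. Given a failed component, it walks the graph to find every item that relies on it, directly or indirectly.

```python
from collections import deque

# Hypothetical configuration items: each key depends on the items in its list.
DEPENDS_ON = {
    "web-storefront": ["app-server", "dns"],
    "app-server": ["database", "auth-service"],
    "auth-service": ["database"],
    "database": ["storage-array"],
    "dns": [],
    "storage-array": [],
}


def impacted_by(failed_item, depends_on):
    """Return every item that directly or indirectly depends on failed_item."""
    # Invert the edges so we can look up "who depends on X".
    dependents = {}
    for item, requirements in depends_on.items():
        for requirement in requirements:
            dependents.setdefault(requirement, set()).add(item)

    # Breadth-first walk outward from the failed item.
    impacted = set()
    queue = deque([failed_item])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


if __name__ == "__main__":
    print(sorted(impacted_by("database", DEPENDS_ON)))
```

In this toy graph, a database failure affects the authentication service, the application server and the storefront, which is the kind of blast-radius question an incident manager would want answered quickly.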
Problem management provides a root cause analysis for the problem and a recommended solution, which identifies the required resources to prevent it from happening again.
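One simple way to spot incidents that may deserve escalation is to count how often the same symptom recurs on the same service. The sketch below assumes a hypothetical Incident record, a 30-day window and a threshold of three occurrences. These values are illustrative only; ITIL does not prescribe them, and a real team would tune them to its own data.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical minimal incident record."""
    service: str
    symptom: str
    opened_at: datetime


def problem_candidates(incidents, now, window=timedelta(days=30), threshold=3):
    """Return (service, symptom) pairs seen at least `threshold` times in the window."""
    cutoff = now - window
    counts = Counter(
        (incident.service, incident.symptom)
        for incident in incidents
        if cutoff <= incident.opened_at <= now
    )
    return sorted(pair for pair, count in counts.items() if count >= threshold)


if __name__ == "__main__":
    now = datetime(2024, 6, 30)
    log = [
        Incident("web-server", "crash", now - timedelta(days=2)),
        Incident("web-server", "crash", now - timedelta(days=9)),
        Incident("web-server", "crash", now - timedelta(days=20)),
        Incident("mail", "slow delivery", now - timedelta(days=5)),
    ]
    for service, symptom in problem_candidates(log, now):
        print(f"Consider opening a problem record: {service} / {symptom}")
```

A flagged pair is only a prompt for root cause analysis, not a diagnosis; the analysis itself still requires investigation by the problem management team.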
Effective incident and problem management encompasses a structured workflow that requires real-time monitoring, automation, and dedicated workers coordinating to resolve issues as quickly as possible to avoid unnecessary downtime or business interruptions. Both forms of management feature several recurring components that organizations should know.
Organizations often assess incident managers and the incident management process based on several key performance indicators (KPIs):
Companies with comprehensive problem and incident management plans can quickly respond to incidents and outperform their competition. The following are some benefits: