Every day, billions of people globally use their computers or mobile devices to access the Internet. Invariably, some of those users attempt to access a website that is either slow to load or prone to crashing. One possible reason is that too many people are trying to access the site at the same time, overwhelming the servers. However, poor performance could also indicate a larger concern, such as a DNS misconfiguration, a lasting server failure or a malicious attack by a bad actor.
Incidents are errors or complications in IT service that need remedying. Many of these incidents are temporary challenges that require a specific remedy. Those that point to underlying or more complicated issues and require a more comprehensive response are called problems.
This explains the existence of both incident and problem management, two important processes for issue and error control, maintaining uptime, and ultimately, delivering a great service to customers and other stakeholders. Organizations increasingly depend on digital technologies to serve their customers and collaborate with partners. An organization’s technology stack can create new and exciting opportunities to grow its business, but an error in service can also create exponential disruptions and damage to its reputation and financial health.
An increase in organizations pursuing digital transformation and other technology-driven operations makes incident management even more important given the dependence on technology to deliver solutions to customers.
Organizations’ IT services are increasingly made up of a complex system of applications, software, hardware and other technologies, all of which can be interdependent. Individual processes can break down, disrupting the service they provide to customers, costing the business money and creating reputational issues. Organizations have embraced advanced development operations (DevOps) procedures to minimize incidents, but they need a resolution process for when they occur.
Every day, organizations encounter and need to manage minor and major incidents, all of which have the potential to disrupt normal business functions. Organizations need to pay attention to several types of incidents, including unplanned interruptions like system outages, network configuration issues, bugs, security incidents, data loss and more.
As technology stacks have increased in complexity, it has become even more important to manage the incident management process strategically so that everyone in the organization knows what to do if they encounter an incident.
Incident management systems have evolved from blunt tools where employees recorded incidents that they observed (which could happen hours after occurring) to a robust, always-on practice with automation and self-service incident management software, enabling anyone in the organization to report an incident to the service desk.
It is important to resolve incidents immediately and prevent them from happening again. This allows organizations to uphold their service-level agreement (SLA), which may guarantee a certain amount of uptime or access to services. Failing to adhere to an SLA could put your organization at legal or reputational risk.
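To make the uptime guarantee concrete, the short sketch below converts an SLA uptime percentage into the downtime it permits over a period. The function name and the 30-day month are illustrative assumptions; a real SLA defines its own measurement window and exclusions.

```python
from datetime import timedelta


def allowed_downtime(uptime_percent: float, period: timedelta) -> timedelta:
    """Return the downtime an uptime target permits over the given period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period * (1 - uptime_percent / 100)


# Assumption: a 30-day month; real SLAs specify their own window.
month = timedelta(days=30)
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime(target, month).total_seconds() / 60
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of downtime, which shows how little room a single prolonged incident leaves.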
The incident manager is the key stakeholder of the incident management process. An incident manager is responsible for managing the response to an incident and communicating progress to key stakeholders. It is a complex IT services role that requires the employee to perform under stressful conditions while communicating with stakeholders with different roles and priorities in the business.
What is problem management?
Problem management is intended to prevent the incident from reoccurring by addressing the root cause. It logically follows incident management, especially if that incident has occurred several times and should likely be diagnosed as a problem or known error.
Incident management without problem management only addresses symptoms and not the underlying cause (i.e., root cause), leading to a likelihood that similar incidents will occur in the future. Effective problem management identifies a permanent solution to problems, decreasing the number of incidents an organization will have to manage in the future.
A problem management team can engage in either reactive or proactive problem management, depending on what incidents it has observed and what historical data it has.
Differences between incident management and problem management
There is one major difference to consider when observing incidents vs. problems: short-term vs. long-term goals.
Incident management is more concerned with intervening on an issue instance with the stated goal of getting that service back online without causing any additional issues. It is a short-term tool to keep service running at that very moment.
Problem management focuses more on the long-term response, addressing any potential underlying cause as part of a larger potential issue (i.e., a problem).
How do incident management and problem management work together?
Organizations try to keep their IT infrastructure in good standing by using IT service management (ITSM) to govern the implementation, delivery and management of services that meet the needs of end users. ITSM aims to minimize unscheduled downtime and ensure that every IT resource works as intended for every end user.
Issues will arise regardless of how much effort organizations put into their ITSM. An organization’s ability to address and fix unforeseen issues before they turn into larger problems can be a huge competitive advantage. An IT service breaking down once is considered an incident. For example, too many people trying to access a server may cause it to crash, creating an incident your organization needs to fix. Incident management relates to fixing that particular issue affecting your users as quickly and carefully as possible. In this case, an incident manager can contact the organization’s employees and ask them to exit programs while the organization resolves the issue.
Incident management and problem management are both governed by the Information Technology Infrastructure Library (ITIL), a widely adopted guidance framework for implementing and documenting both management approaches. ITIL creates the structure for responding reactively to incidents as they occur. The most up-to-date release at the time of writing is ITIL 4.
It provides a library of best practices for managing IT assets and improving IT support and service levels. ITIL processes connect IT services to business operations so that they can change when business objectives change.
A key component of ITIL is the configuration management database (CMDB), which tracks and manages the interdependence of all software, IT components, documents, users and hardware required to deliver an IT service. ITIL also creates a distinction between incident management and problem management.
A constantly crashing server may represent a larger, systemic problem, like hardware failure or misconfiguration. The crashes may continue if the IT service team fails to uncover the root cause and map a solution to the underlying issue. In this case, the response may require an escalation to problem management, which is concerned with fixing repeated incidents.
Problem management provides a root cause analysis for the problem and a recommended solution, which identifies the required resources to prevent it from happening again.
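One way to picture the escalation from repeated incidents to problem management is a simple recurrence check. The following is a minimal sketch, not a prescribed ITIL rule: the `Incident` class, the 30-day look-back window and the threshold of three incidents are all hypothetical choices made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    service: str           # affected IT service, e.g. "web-server-01"
    occurred_at: datetime


def services_needing_problem_review(
    incidents: list[Incident],
    now: datetime,
    window: timedelta = timedelta(days=30),  # assumed look-back window
    threshold: int = 3,                      # assumed recurrence threshold
) -> list[str]:
    """Return services with at least `threshold` incidents inside the window."""
    recent = Counter(
        i.service for i in incidents if now - window <= i.occurred_at <= now
    )
    return sorted(s for s, count in recent.items() if count >= threshold)


now = datetime(2024, 6, 30)
history = [
    Incident("web-server-01", datetime(2024, 6, 3)),
    Incident("web-server-01", datetime(2024, 6, 14)),
    Incident("web-server-01", datetime(2024, 6, 27)),
    Incident("dns", datetime(2024, 6, 20)),
    Incident("web-server-01", datetime(2024, 3, 1)),  # outside the window
]
print(services_needing_problem_review(history, now))
```

In this example only the repeatedly crashing server is flagged for problem review; the single DNS incident stays within incident management.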
Key components of incident and problem management
Effective incident and problem management encompasses a structured workflow that requires real-time monitoring, automation and dedicated workers coordinating to resolve issues as quickly as possible to avoid unnecessary downtime or business interruptions. Both forms of management feature several recurring components that organizations should know.
Incident identification: To resolve an incident, you must first observe it. Organizations increasingly automate systems to detect and send notifications when incidents occur, but many also require a human to ensure that an incident is happening, determine whether or not it requires intervention and confirm the correct approach. For instance, a server crash is a common incident with digital-first organizations. When the server goes offline, an automated tool or employee may identify the incident, initiating the incident management process.
Incident reporting: This is the formal process for cataloging an incident record that a machine or human observed. It includes incident logging, the process by which an individual or system assigns a respondent to the issue, categorizes the incident and identifies the impacted business unit and the resolution date.
Incident resolution prioritization: Software and IT services are often interdependent in modern organizations, so one incident can have a knock-on effect on other services. Sometimes an incident occurs as part of a larger systemic failure, which can set off a catastrophic chain of events. For example, if multiple servers crash, the business analytics team may be unable to access the data that they need, or the company’s knowledge workers may not be able to log in and access the software for their jobs. Or, if a company’s API fails, the organization’s customers may be unable to access the information they need to serve their end users. In both situations, the response team will have to assess the entire scope of the problem and prioritize which incidents to resolve first to minimize the short-term and long-term effects on the business. They can prioritize based on which incident has the greatest impact on the organization.
Incident response and containment: A response team—potentially aided by automated software or systems—then engages in troubleshooting the incident to minimize business interruptions. The response team usually comprises internal IT team members, external service providers and operations staff, as needed.
Incident resolution: This is critical for IT operations to return to normal services. Potential resolutions to an IT incident include taking the incorrectly working server offline, creating a patch, establishing a workaround or changing the hardware.
Incident documentation and communication: This is a crucial step of the incident lifecycle to help avoid future incidents. Many companies create knowledge bases for their incident reports where employees can search to help them solve an incident that may have occurred in the past. In addition, new employees can learn about what incidents the company has recently faced and the solutions applied, so they can more readily help with the next incident. Documentation is also critical for determining whether an issue is recurring and becoming a problem, increasing the need for problem management.
Problem assessment: The organization now must determine if the incident should be categorized as a problem record or if it is just an unrelated incident. The former means it now becomes a part of problem management.
Problem logging and categorization: The IT team now must log the identified problem and track each occurrence.
Root cause analysis: The organization should study the underlying issues behind these problems and develop a roadmap to create a long-term solution. One way to accomplish this is by asking recursive “why” questions at each step of the way until one can identify the root cause.
Problem-solving: An IT team that understands the problem and its root cause can now solve the problem. It could involve a quick or protracted response depending on the severity or complexity of the problem.
Postmortem: A postmortem where relevant employees discuss the incident(s), root causes and response to the problem is a critical component of any transparent organization interested in maintaining uptime and providing customers excellent service. Postmortems provide everyone an opportunity to discuss how to improve without judging any employee or casting blame for any issue. The purpose of the postmortem is to find out what happened and to define actions to improve the organization. It also can provide insights into how the team can better respond to future incidents. It can identify whether an organization requires change management to revitalize and streamline its incident and problem management. The best ideas and best results will come from postmortem meetings that are open and honest. Team culture should assure all members that this is a way to discover how the team can improve IT services and not a way to find someone to blame. Teams will quickly understand if this is an honest and supportive exercise or not.
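The incident reporting and prioritization steps above can be sketched as a small data structure and a sort. This is an illustration only: the `IncidentRecord` fields mirror the items named under incident reporting, while the 1-to-5 impact scale, the tie-break on report time and the sample data are assumptions rather than part of any standard.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass
class IncidentRecord:
    summary: str
    category: str            # e.g. "server", "network", "security"
    business_unit: str       # impacted business unit
    assignee: str            # respondent assigned to the incident
    reported_at: datetime
    impact: int              # assumed scale: 1 (low) to 5 (critical)
    target_resolution: Optional[date] = None


def prioritize(records: list[IncidentRecord]) -> list[IncidentRecord]:
    """Order incidents by highest impact first, then by earliest report."""
    return sorted(records, key=lambda r: (-r.impact, r.reported_at))


# Hypothetical queue of open incidents.
queue = [
    IncidentRecord("Analytics dashboard slow", "server", "Business analytics",
                   "a.lee", datetime(2024, 6, 1, 9, 5), impact=2),
    IncidentRecord("Public API returning errors", "api", "Customer platform",
                   "j.ortiz", datetime(2024, 6, 1, 9, 20), impact=5),
    IncidentRecord("Employee login failures", "server", "All staff",
                   "m.chen", datetime(2024, 6, 1, 9, 10), impact=4),
]
for record in prioritize(queue):
    print(record.impact, record.summary)
```

Sorting on a single impact score keeps the sketch short; a team could just as reasonably weigh urgency or the number of affected users alongside it.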
Incident and problem management key performance indicators
Organizations often assess incident managers and the incident management process based on several key performance indicators (KPIs):
Mean time to take action: An incident requires detection, response and repair. Organizations judge the health of their incident management service by the mean time to alert or acknowledge (MTTA), the mean time to respond and the mean time to repair (both abbreviated MTTR). Together, these metrics provide a clear picture of how quickly the organization can respond to incidents.
Mean time between failures (MTBF): The average time between incidents for any IT service. An MTBF that is shorter than expected, meaning failures happen more frequently, could signify larger problems requiring a more proactive stance.
Uptime: The time your services are available and working as intended. Too little uptime can put an organization at risk of violating its SLA with end users and otherwise losing business to competitors.
Incidents and problems reported: The number of incidents an incident manager has reported in a given time frame. Increasing incidents reported may be a sign of a larger problem.
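The KPIs above reduce to simple arithmetic over incident timestamps. The sketch below is one possible way to compute them, under stated assumptions: the `IncidentTimeline` fields and sample data are hypothetical, MTTR is taken as mean time to repair measured from detection to resolution, MTBF is taken as total uptime divided by the number of incidents, and incidents are assumed not to overlap.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    detected_at: datetime       # alert raised or incident reported
    acknowledged_at: datetime   # a responder took ownership
    resolved_at: datetime       # service restored


def mean(durations: list[timedelta]) -> timedelta:
    return sum(durations, timedelta()) / len(durations)


def kpis(incidents: list[IncidentTimeline],
         period_start: datetime, period_end: datetime) -> dict:
    """Compute MTTA, MTTR, MTBF and uptime for one reporting period."""
    if not incidents:
        raise ValueError("at least one incident is required")
    period = period_end - period_start
    # Assumption: incidents do not overlap, so downtime is a plain sum.
    downtime = sum((i.resolved_at - i.detected_at for i in incidents), timedelta())
    return {
        "mtta": mean([i.acknowledged_at - i.detected_at for i in incidents]),
        "mttr": mean([i.resolved_at - i.detected_at for i in incidents]),
        "mtbf": (period - downtime) / len(incidents),
        "uptime_percent": 100 * (1 - downtime / period),
    }


# Hypothetical data: two incidents in a 30-day reporting period.
june = [
    IncidentTimeline(datetime(2024, 6, 5, 10, 0), datetime(2024, 6, 5, 10, 10),
                     datetime(2024, 6, 5, 11, 0)),
    IncidentTimeline(datetime(2024, 6, 18, 14, 0), datetime(2024, 6, 18, 14, 20),
                     datetime(2024, 6, 18, 16, 0)),
]
result = kpis(june, datetime(2024, 6, 1), datetime(2024, 7, 1))
for name, value in result.items():
    print(name, value)
```

Definitions of these metrics vary between teams and tools, so the formulas here should be read as one reasonable choice and aligned with whatever definitions the organization's SLA and reporting already use.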
Incident management and problem management benefits
Companies with comprehensive problem and incident management plans can respond to incidents quickly and outperform their competition. The following are some benefits:
Increased customer satisfaction and loyalty: Customers expect that the services and products they pay for will work whenever needed. More and more products are software (or connected to software, like smart devices). A server crashing at a company making smart doorbells means people cannot enter their homes or apartments. A hotel booking website with a DNS error loses revenue that day and potentially loses a lifetime customer to a competitor. The impact of incidents and problems can weigh heavily on an organization. The ones that respond to incidents more quickly and minimize downtime will earn the loyalty of customers who would otherwise be likely to switch providers if they’re unhappy. A robust incident management strategy will save companies money by decreasing downtime and the likelihood of a customer or employee leaving, both of which are associated with hard costs.
Increased employee satisfaction: A severe IT incident affects employees as much as customers. Employees who can’t access critical business software can’t do their jobs. Their work will pile up as the company tries to get things back online. They may have to work overtime or during the weekend to catch up, creating stress and threatening their morale.
Meeting SLA requirements: Organizations detail customer expectations for their products and services in an SLA. An organization that fails to uphold the terms of service in its SLAs could be at risk of legal action and could lose customers to competitors.
Discover how to achieve proactive IT operations
IBM Turbonomic integrates with your existing ITOps solutions, bridges siloed teams and data, and turns manual, reactive processes into continuous application resource optimization while safely reducing cloud consumption by 33%.
IBM AIOps Insights helps accelerate remediation time for IT incidents. This SaaS-based solution uses intelligent automation and AI to aggregate information collected and correlated from various sources. IBM Cloud Pak for AIOps, the self-hosted option for incident management, achieves proactive incident management and automated remediation to reduce customer-facing outages by up to 50% and mean time to recovery (MTTR) by up to 50%.