As businesses adopt cloud to enable business transformation, there is an increasing focus on site reliability. The goals for Operations Management are now fewer problem tickets and faster resolution. A key service metric is Mean Time To Resolution/Repair (MTTR). It indicates how well an enterprise executes its incident management standard operating procedure (SOP).
What is MTTR?
The measurement unit for MTTR is time. It is the time period between the start of an incident and the end of an incident (issue resolved). MTTR can be divided into 4 component intervals:
- Mean Time To Identify (MTTI): Time period between the start of an incident and the time the incident is detected. The detection may be automatic via events/alarms seen in the event management system.
- Mean Time To Know (MTTK): Time period between the detection of the incident and the time the root cause of the incident is identified.
- Mean Time To Fix (MTTF): Time period between the isolation of the root cause of the incident and the time the issue is resolved.
- Mean Time To Verify (MTTV): Time period between the resolution of the issue and confirmation of successful resolution from the users or automated tests.
In short, MTTR is the sum of MTTI, MTTK, MTTF and MTTV (see Figure 1):
MTTR = MTTI + MTTK + MTTF + MTTV
Figure 1 Mean Time To Resolution (MTTR): Incident Management Steps
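To make the decomposition concrete, here is a minimal sketch of how the four intervals could be derived from incident timestamps. The record layout, field names and sample values are assumptions for illustration only; they are not taken from NOI or any other tool.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the field names are illustrative only.
INCIDENTS = [
    {
        "started": datetime(2024, 1, 8, 9, 0),
        "detected": datetime(2024, 1, 8, 9, 10),
        "diagnosed": datetime(2024, 1, 8, 10, 40),
        "fixed": datetime(2024, 1, 8, 11, 10),
        "verified": datetime(2024, 1, 8, 11, 20),
    },
    {
        "started": datetime(2024, 1, 9, 14, 0),
        "detected": datetime(2024, 1, 9, 14, 20),
        "diagnosed": datetime(2024, 1, 9, 15, 30),
        "fixed": datetime(2024, 1, 9, 15, 50),
        "verified": datetime(2024, 1, 9, 16, 0),
    },
]

# Each component interval is bounded by two consecutive timestamps.
COMPONENTS = {
    "MTTI": ("started", "detected"),
    "MTTK": ("detected", "diagnosed"),
    "MTTF": ("diagnosed", "fixed"),
    "MTTV": ("fixed", "verified"),
}


def minutes_between(incident, start_key, end_key):
    """Elapsed minutes between two timestamps of one incident."""
    return (incident[end_key] - incident[start_key]).total_seconds() / 60


def mean_intervals(incidents):
    """Mean duration in minutes of each component interval."""
    return {
        name: mean(minutes_between(i, start, end) for i in incidents)
        for name, (start, end) in COMPONENTS.items()
    }


def mttr(incidents):
    """Mean minutes from incident start to verified resolution."""
    return mean(minutes_between(i, "started", "verified") for i in incidents)


if __name__ == "__main__":
    for name, value in mean_intervals(INCIDENTS).items():
        print(f"{name}: {value:.1f} min")
    print(f"MTTR: {mttr(INCIDENTS):.1f} min")
```

Because the four intervals are consecutive and cover the whole incident, the component means add up to the MTTR computed directly from the start and verification timestamps.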
Of these four components, MTTK is usually the biggest contributor to lengthy MTTR (see Figure 2).
Figure 2 Mean Time To Resolution (MTTR): Component Intervals by Relative Time Percentage
Therefore, reducing MTTK offers the biggest gains in reducing MTTR. The following 3 factors are the top causes of lengthy MTTK, i.e. the time taken to find the root cause of an incident (see Figure 3):
- Factor 1: The monitoring tool has poor UI design
- Factor 2: The monitoring tool lacks context & visibility
- Factor 3: The monitoring tool is fragmented
Figure 3 Top 3 Factors Contributing to Lengthy MTTK
Factor 1: The Monitoring Tool Has Poor UI Design
The Operations Engineer noticed a threshold violation alarm for a WAN device interface on the Event Console. The engineer needed to know the reason for the threshold violation alarm and where this device was located in the context of the network topology.
The engineer opened the window for the Network Management tool and entered the device IP address. After confirming that the interface was a WAN link connecting the London office to the main office, the engineer turned to the Performance Management tool and looked up the device and its associated interface. There, the engineer was able to view the historical utilization of the interface, but could not tell what was causing the high traffic utilization. The engineer then resorted to the NetFlow Analyzer tool to understand the traffic breakdown of the interface.
What’s wrong with this scenario?
Number of monitoring tool windows opened: 4
- Event Console
- Network Management/Topology
- Network Performance Management
- NetFlow Analyzer
If it sounds like déjà vu, it is: this is the modern-day version of “swivel chair integration”. The Operations Engineer had to open 4 monitoring tool windows to arrive at a conclusion. This increased the Mean Time To Know (MTTK) and was error-prone because the tools were not interlinked.
Factor 2: The Monitoring Tool Lacks Context & Visibility
In a large enterprise, there might be several thousand devices in the network. A network management tool will poll the devices in the network using SNMP to discover existing and new devices. The purpose is to build an inventory of known devices in the network. Likewise, a network performance management tool will also discover the same devices in the network using SNMP.
What’s wrong with this scenario?
The network management tool and the network performance management tool take different snapshot views of the network inventory. As a result, discrepancies might occur between the two tools. Without data reconciliation, it will be impossible to achieve contextual drill-through from the topology view of devices in the network management tool to the performance metrics in the network performance management tool.
An unwanted side effect is that the network devices are polled by separate tools, which increases the load on the SNMP agents on the devices.
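As a rough illustration of what data reconciliation involves, the sketch below compares two inventory snapshots keyed by management IP address. The dictionaries, host names and the `reconcile` helper are hypothetical; real tools expose their inventories in their own formats.

```python
def reconcile(topology_inventory, performance_inventory):
    """Compare two device inventories keyed by management IP address."""
    topo = set(topology_inventory)
    perf = set(performance_inventory)
    return {
        "in_both": sorted(topo & perf),           # drill-through is possible
        "topology_only": sorted(topo - perf),     # no performance metrics
        "performance_only": sorted(perf - topo),  # no topology context
    }


# Hypothetical snapshots taken at different times by the two tools.
# Addresses come from the documentation range 192.0.2.0/24.
topology_tool = {
    "192.0.2.1": "london-wan-rtr",
    "192.0.2.2": "hq-core-sw",
    "192.0.2.3": "hq-edge-rtr",
}
performance_tool = {
    "192.0.2.1": "london-wan-rtr",
    "192.0.2.3": "hq-edge-rtr",
    "192.0.2.9": "lab-sw",
}

report = reconcile(topology_tool, performance_tool)
for category, addresses in report.items():
    print(f"{category}: {addresses}")
```

Only the devices listed under `in_both` can be followed from a topology view to their performance metrics; every other entry is a gap the engineer would have to resolve by hand during an incident.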
Factor 3: The Monitoring Tool Is Fragmented
Company X started out with one set of network monitoring tools. Over time, the number of network monitoring tools grew. There are myriad reasons for the increase in network monitoring tools, such as:
- Personnel: Preference of one team for one tool over another
- Scalability: The existing solution cannot scale, thus requiring multiple instances of the same tool for different coverage
- Corporate Growth: Due to mergers and acquisitions, overlapping network monitoring tools from the acquired companies are introduced
What’s wrong with this scenario?
The biggest risk here is misinformation and miscommunication - operations teams acting on differing views on the health of the network. Chaos would ensue should seismic events such as weather disruptions or DDoS attacks befall the company. It will be chaotic because there isn’t one single view of the health of the network. Teams will be tripping over each other resolving issues, thereby prolonging MTTK.
Ways to Reduce MTTR
Efficiency around network monitoring can be improved through:
- Solution 1: UI Consolidation
- Solution 2: Tools Consolidation
I will illustrate how IBM Netcool Operations Insight (NOI) supports both solutions through the out-of-the-box pre-integration of its component applications, with a focus on the network performance use case.
For the uninitiated, IBM Netcool Operations Insight (NOI) integrates infrastructure and operations management into a single coherent structure across business applications, virtualized servers, network devices and protocols, internet protocols, and security and storage devices. It can be optionally extended by integrating Network Management, Performance Management, and Service Management solution extensions.
Solution 1: UI Consolidation
We should aim to improve MTTK by making information available with the fewest mouse clicks possible, so that users do not lose time looking up information in various tools. To achieve this, the tools must be pre-integrated.
The IBM Tivoli Network Manager (ITNM) tool builds the network topology by discovering devices in the network through SNMP polling. ITNM shares the discovered devices with the network performance tool, IBM Network Performance Insight (NPI).
This pre-integration has the following benefits:
- Solves the issue highlighted by Factor 1 – The Monitoring Tool Has Poor UI Design
With ITNM sharing the discovered entities with NPI, both tools can now reference the same device. This makes it possible for the Operations user to select the device in one tool and then view it in another tool without the hassle of re-entering the device ID. NPI has an out-of-the-box (OOTB) dashboard to demonstrate this capability. It is called the “Device Dashboard”: a single-pane-of-glass view of the event list, network topology and associated performance metrics. More details on this dashboard follow further down.
- Solves the issue highlighted by Factor 2 – The Monitoring Tool Lacks Context & Visibility
With ITNM sharing the discovered entities with NPI, NPI does not need to discover the same devices again. ITNM takes the lead in discovering SNMP devices. This significantly reduces the chance of device discrepancies between ITNM and NPI.
Here’s an example of the workflow of an Operations Engineer using NOI upon detecting a threshold violation.
Step 1 (Identify/Detect): The Operations user right-clicks the event/alarm for the threshold violation and selects “Show Device Dashboard” (see Figure 4):
Figure 4 NOI Event Viewer
Step 2 (Isolate): In the Device Dashboard single pane of glass view, the Operations user can view (see Figure 5):
- Top left quadrant: the device topology
- Bottom left quadrant: the list of events
- Top right quadrant: the device and interface performance metrics and threshold indicators
- Bottom right quadrant: the time zoom of a selected performance metric
If the Operations user wants to know the cause of the high snmpInBandwidth, the user right-clicks an interface and clicks “Show Traffic”.
Figure 5 NPI Single Pane of Glass - Device Dashboard
Step 3 (Diagnose): In the “Traffic Details” view, which is based on NetFlow data, the Operations user can view the applications that consume the most bandwidth (see Figure 6).
Figure 6 NPI Traffic Details - Interface traffic composition from NetFlow data
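The diagnosis in Step 3 boils down to aggregating flow records per application for the interface in question. The sketch below shows that aggregation in plain Python. It is not NPI's implementation or API; the record fields, interface names and byte counts are invented for illustration.

```python
from collections import Counter

# Hypothetical, simplified flow records. Real NetFlow exports carry many
# more fields (addresses, ports, protocol, timestamps and so on).
FLOWS = [
    {"interface": "Gi0/1", "application": "https", "bytes": 700_000},
    {"interface": "Gi0/1", "application": "backup", "bytes": 2_500_000},
    {"interface": "Gi0/1", "application": "https", "bytes": 300_000},
    {"interface": "Gi0/1", "application": "dns", "bytes": 500_000},
    {"interface": "Gi0/2", "application": "backup", "bytes": 9_000_000},
]


def top_applications(flows, interface, limit=3):
    """Return (application, bytes, percent share) for the busiest applications."""
    totals = Counter()
    for flow in flows:
        if flow["interface"] == interface:
            totals[flow["application"]] += flow["bytes"]
    grand_total = sum(totals.values())
    # most_common() is empty when no flow matched, so no division by zero.
    return [
        (app, count, round(100 * count / grand_total, 1))
        for app, count in totals.most_common(limit)
    ]


if __name__ == "__main__":
    for app, count, share in top_applications(FLOWS, "Gi0/1"):
        print(f"{app}: {count} bytes ({share}%)")
```

In this made-up data set, the backup traffic accounts for most of the bytes on the interface, which is the kind of answer the engineer in the Factor 1 scenario needed a fourth tool to find.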
Solution 2: Tools Consolidation
Tools consolidation will improve efficiency by reducing the number of similar tools. The IT operations team would be able to:
- Maximize operational efficiency.
- Achieve application license economies of scale.
IBM Network Performance Insight (NPI) meets the criteria to be a unified network performance management tool in the following ways:
- Scalability
As described earlier, NPI relies on ITNM for SNMP device discovery. For SNMP performance metrics polling, NPI has its High Scale Collectors. These collectors can collect up to 100 million metric records per hour. This is equivalent to 10,000 devices or more (depending on polling frequency and metrics collected).
In addition, NPI uses Hadoop technology to store the huge amounts of NetFlow data. With the Hadoop platform, the data storage is horizontally scalable, i.e. capacity grows by adding nodes.
- Pre-built correlation between SNMP metrics and NetFlow data sources
Operations users can view the SNMP performance metrics of a device interface and, within a click, launch the NetFlow view of the same interface to see the traffic composition.
- Integration with Cacti
Cacti is a popular open source performance metrics collection tool. NPI can integrate with existing Cacti pollers to bring the performance metrics into NPI storage.
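The relationship between the collector capacity quoted above and the number of devices it can cover depends on polling frequency and metrics per device. The arithmetic can be sketched as follows. The 100 million records per hour comes from the text; the 800 metrics per device and the polling intervals are assumptions chosen purely for illustration.

```python
CAPACITY_PER_HOUR = 100_000_000  # metric records per hour, as quoted above


def records_per_hour(devices, metrics_per_device, polling_interval_minutes):
    """Metric records generated per hour by a device population."""
    polls_per_hour = 60 / polling_interval_minutes
    return devices * metrics_per_device * polls_per_hour


def max_devices(capacity_per_hour, metrics_per_device, polling_interval_minutes):
    """Largest whole number of devices that fits within the hourly capacity."""
    polls_per_hour = 60 / polling_interval_minutes
    return int(capacity_per_hour // (metrics_per_device * polls_per_hour))


if __name__ == "__main__":
    # Assumed profile: 800 metrics per device (hypothetical).
    print(records_per_hour(10_000, 800, 5))        # load of 10,000 devices at 5 min
    print(max_devices(CAPACITY_PER_HOUR, 800, 5))  # devices at 5-minute polling
    print(max_devices(CAPACITY_PER_HOUR, 800, 1))  # devices at 1-minute polling
```

Under these assumed numbers, 10,000 devices polled every 5 minutes stay just under the quoted capacity, while 1-minute polling of the same metric set would cut the supported device count to roughly a fifth.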
At the end of the day, MTTR is an indicator of the operational efficiency of an enterprise’s Operations Management. Reducing MTTR in operations is an iterative process. The business justifications for embarking on this journey are faster incident response and cost optimization. In this blog, I recommend that you look at optimizing MTTK (the time taken to diagnose the incident) first. It is low-hanging fruit because it is the area where the most time is consumed or lost. I also recommend focusing on monitoring tools consolidation, both at the UI level and at the back-end application level.
In short, identify the monitoring tools in your enterprise that work well together with minimum fuss and effort. I illustrated examples of such products working together using the IBM NOI offering.
For more information about IBM’s Netcool Operations Insight monitoring capability please follow this IBM Knowledge Center link:
For more information about IBM’s Network Performance Insight monitoring capability please follow this IBM Knowledge Center link: