IBM Support

What are the new features in Netcool Operations Insight all about?

Technical Blog Post


Abstract

What are the new features in Netcool Operations Insight all about?

Body

 

Netcool Operations Insight 1.3 introduces some exciting new capabilities - but what are they all about and how can they help me?  And what's the difference between the new capabilities introduced in version 1.3 and the existing ones?  This blog aims to provide some insights (no pun intended) into these new features - and how they can be used to great advantage to streamline your operations and reduce costs.

So, what is the difference between Seasonality, scope-based event grouping, and analytics-based related event grouping?  Seasonality, in a nutshell, is an analytics-based tool to help you identify chronic issues in your environment.  The two event grouping functionalities are designed to operate in a complementary manner - to sort and group events by incident - based on relationships I know about (scope-based) and the ones I don't (analytics-based).  Read on for more details on each.


What is Seasonality? (analytics-based)

Imagine you get a critical low disk space warning alarm that occurs every Monday at 4am when the backup jobs are running.  This cuts an auto-ticket, which is assigned to an L1 operator.  By the time the operator receives the ticket and checks the disk space on the target server, the disk level is OK again - and so the operator closes the ticket.  Next week, the same thing happens again - Monday morning at 4am - but there is a different operator on duty who gets the ticket.  This goes on for weeks and months because the operations staff is large - so a different person gets the ticket every week it happens - and, due to the volume of tickets operations deal with, nobody spots the pattern.  This, in turn, incurs significant and wasteful cost to the business.

The Seasonality function works by analysing the historic event archive (REPORTER schema in the Tivoli Data Warehouse) looking for individual events that occur with some degree of regularity.  For example, this could be at the same minute of the hour, the same hour of the day, the same day of the week, or the same day of the month - or a combination of these.
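To make the idea concrete, here is a minimal sketch - purely for illustration, and in no way the actual Event Analytics implementation - of how one might flag event identities whose occurrences cluster around the same day of the week and hour of the day.  The event identifiers, thresholds and sample data are all assumed for the example.

```python
# Conceptual sketch only - not the NOI Event Analytics implementation.
# Counts how often each event identity lands in a given (day-of-week, hour-of-day)
# bucket and flags identities whose occurrences cluster heavily in one bucket,
# which is a crude indicator of seasonality.
from collections import Counter, defaultdict
from datetime import datetime

def find_seasonal_candidates(events, threshold=0.8, min_occurrences=5):
    """events: iterable of (identifier, datetime) tuples from the archive."""
    buckets = defaultdict(Counter)
    totals = Counter()
    for identifier, ts in events:
        buckets[identifier][(ts.weekday(), ts.hour)] += 1
        totals[identifier] += 1
    candidates = {}
    for identifier, counter in buckets.items():
        if totals[identifier] < min_occurrences:
            continue  # not enough history to judge
        (weekday, hour), count = counter.most_common(1)[0]
        if count / totals[identifier] >= threshold:
            candidates[identifier] = (weekday, hour, count, totals[identifier])
    return candidates

# Example: a disk alarm that fires every Monday (weekday 0) at around 04:00
history = [("disk_low@server42", datetime(2015, 3, d, 4, 5)) for d in (2, 9, 16, 23, 30)]
print(find_seasonal_candidates(history))
```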

The value of this is that it helps to identify chronic issues in our environment - issues whose temporal characteristics would typically not be noticed by operators in the NOC - issues such as our recurring critical disk space alarms.  In many cases, the characteristics of the seasonality are clues to the cause of the underlying problem.  In this case, we might ask: "what happens every Monday at 4am?"  A seasoned operator with some "tribal knowledge" looking at the seasonality results would know that Monday at 4am is when the backups run.  A simple assessment of this issue could result in, for example, more disk being allocated to the backup job.

By resolving chronic issues like this, one can relatively easily calculate the monetary saving to the business of not cutting these tickets anymore.  One large North American bank was able to reduce its overall event numbers by 12% by identifying and rectifying issues that were causing seasonal events and resulting in costly trouble tickets.


Event grouping - what is the problem we're trying to solve?

Even when we have applied best practices and reduced the events we get to a minimum, there can still legitimately be large volumes of events during a major outage - many of them "critical".  Further, events may be generated from many different sources, technologies and business units, and touch many different stakeholders.  Finally, events for a single incident may arrive over an extended period of time, and there is no guarantee of the order in which they will arrive.

Ultimately this can lead to a lot of confusion when presented with such a "storm" of events.  This information overload can cause many parallel, duplicate investigations, usually in the form of trouble tickets being opened.  Not only does this add to the confusion, it also fragments the pieces of the puzzle that are needed to work out what has happened.

Traditional methods of discarding an increasing number of events and applying filters come with inherent risk - what if we inadvertently discard or filter out the wrong events?  What if, in the name of event reduction, we hide from our operators the level of detail they need to have a "full picture" of what is going on?

What is needed instead are tools that allow us to sort, prioritise and group events together by incident - so that one trouble ticket is created for each issue - and all the information pertaining to that issue is contained in one place.  This will reduce duplicate efforts, provide the engineer tasked with pin-pointing the causes with the "full picture", reduce mean-time-to-repair, and ultimately reduce costs.

Sounds like the "Holy Grail" of event management?  Read on...!

We create millions of measures per day, which translates into thousands of notifications or events per hour, which in turn translates into dozens of activities or tickets per day.  While the measures and notifications cost mere cents to generate, it is the activities - the ones where human action is required - that are the most costly.  Even trouble tickets that require no action - i.e. redundant duplicates - incur disproportionately large costs to our business.  The goal therefore is to minimise the tickets and associated activities to only what is needed.

One of IBM's clients in the telecommunications industry reported that sometimes dozens of duplicate tickets can be opened during a major outage.  The cost involved in managing this, not to mention fixing the fault or faults in a timely fashion, can be huge.  The business case is clear - clients need help to organise and sort this "big data" of events that they are tasked to manage and make sense of.  And this upward trend of increasing event volumes looks set to continue.


Event Grouping (scope-based) - event grouping based on the relationships I know about

Many of the monitored technologies in our infrastructures are highly structured - and often that structure is either encoded into the event data, or is stored somewhere we can access it.  The scope-based event grouping provides a productised framework whereby events relating to an incident - and occurring within the same time window - can be automatically grouped together based on their common scope.

The premise goes as follows: "if I receive a set of alarms - from the same place - at the same time - then it is highly likely that the set of alarms are related to the same problem."  The "same place" is another way of saying "scope".  Another way of thinking of scope is: "if something breaks and generates alarms, and other things are affected by that breakage and generate alarms too as a result, these things can all be considered to be within the same boundary of affectation - or scope".  If I then have a way to define and set this scope in my event set, I have a handle for containing my events based on that scope.

So what do we mean by "time window"?  If we receive a steady stream of alarms for a given scope, it makes sense to keep grouping those events together - as they're likely all related to the same issue.  If we stop receiving any further events from that scope for a "reasonable" length of time however, we can assume we have received all the alarms we are going to for that particular incident.  If we subsequently receive further alarms for that scope sometime later - i.e. after the defined "quiet period" - we can treat them as a separate incident, and hence create a new grouping.
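The "quiet period" behaviour can be pictured with a small sketch.  The following Python is conceptual only - the real grouping is performed by the productised framework inside the ObjectServer, not by external code like this - and the 600-second quiet period is simply an assumed figure for illustration.

```python
# Illustrative sketch of the "quiet period" idea described above - not the
# actual scope-based grouping logic shipped with the product.
QUIET_PERIOD = 600  # seconds of silence after which a scope's group is closed (assumed value)

def group_by_scope(events, quiet_period=QUIET_PERIOD):
    """events: list of (timestamp_seconds, scope_id, summary), assumed time-ordered."""
    open_groups = {}   # scope_id -> current group (list of events)
    last_seen = {}     # scope_id -> timestamp of the most recent event for that scope
    all_groups = []
    for ts, scope, summary in events:
        if scope in open_groups and ts - last_seen[scope] <= quiet_period:
            open_groups[scope].append((ts, summary))   # steady stream: extend the same incident
        else:
            open_groups[scope] = [(ts, summary)]        # quiet period elapsed: start a new incident
            all_groups.append((scope, open_groups[scope]))
        last_seen[scope] = ts
    return all_groups

events = [(0, "SITE12", "Air conditioning failure"),
          (120, "SITE12", "High room temperature"),
          (720, "SITE12", "Server unreachable"),   # within the quiet period -> same group
          (5000, "SITE12", "Link down")]            # long silence -> new group
for scope, grp in group_by_scope(events):
    print(scope, [summary for _, summary in grp])
```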

Consider a server room at a remote site - e.g. ScopeID = 'SITE12'.  We start receiving environmental alarms from SITE12 telling us that the air conditioning has failed - followed shortly afterwards by "high room temperature" warning events.  Ten minutes later, we receive a flurry of failures from systems located at SITE12.  We could reasonably deduce that all of those events are related - since they have all originated in the same server room - within the same time window.

By applying the scope-based event grouping function, we could define our "ScopeID" to be the site ID of the server room - i.e. "SITE12" - and set this in our incoming events.  So long as we receive a steady stream of events from SITE12, the framework will continue to add the incoming events to the grouping.

The grouping is visualised in the Netcool/OMNIbus Web GUI Event Viewer via the new twistie feature.  A synthetic containment event is created for the grouping - and the "real" events are grouped underneath it.  A trouble ticket can then be cut off the synthetic containment event, and the ticket number will automatically be propagated down to all unticketed child events.  There are two advantages to note about this approach.  First, events relating to the incident may come in, then clear, then come in again - making them very difficult to ticket against individually - whereas the synthetic containment event persists and gives the ticket a stable anchor.  Second, all unticketed events that get slotted into the grouping will automatically inherit the group ticket number from the synthetic containment parent event.  Even if the events drip-feed into the system over a period of hours, they will automatically be appended to the grouping - and hence inherit the ticket number from the parent - and hence propagate to the ticket - potentially providing key information.  This saves any manual effort in associating events with existing tickets - and helps avoid costly duplicate tickets being opened.
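The ticket-number inheritance can be illustrated with a trivial sketch.  This is conceptual only - the TicketNumber field name and the data shapes below are assumptions made for the example, not the actual ObjectServer schema or gateway mechanics.

```python
# Conceptual sketch of ticket-number inheritance from the synthetic parent event.
# Field names are illustrative, not the product's schema.
def propagate_ticket(parent, children):
    """Copy the parent's ticket number onto any child event that is not yet ticketed."""
    for child in children:
        if not child.get("TicketNumber"):
            child["TicketNumber"] = parent["TicketNumber"]
    return children

parent = {"Summary": "Incident at SITE12", "TicketNumber": "INC0012345"}
children = [{"Summary": "AC failure", "TicketNumber": ""},
            {"Summary": "High temp", "TicketNumber": "INC0011111"},  # already ticketed - left alone
            {"Summary": "Server down", "TicketNumber": ""}]
print(propagate_ticket(parent, children))
```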

We now have a basis for defining incident containment based on "scope".  To leverage this containment, one needs only to set ScopeID in the incoming event stream to suitable values - and the containment will then happen automatically.  IBM have embarked on a campaign of encoding this scope information into our Probes off-the-shelf, where it makes sense to do so.  This means the feature is truly "out-of-the-box" for these Probes, since a client could simply deploy the Probe and see the grouping happening from day one.  The framework is also open, of course, for clients to define their own set of ScopeIDs, in a manner that makes sense to their business.  Some clients elect to use event enrichment on the events via Netcool/Impact to set ScopeID.  The grouping function will work in any case, however ScopeID is set.
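As a hypothetical illustration of the enrichment approach, the sketch below maps a node name to a site identifier and writes it into ScopeID.  A real deployment would typically do this in Probe rules or in a Netcool/Impact policy rather than in Python, and the lookup table shown here is invented for the example.

```python
# Illustrative only - in practice ScopeID would be set in Probe rules or via
# Netcool/Impact enrichment; the site lookup table below is an assumed example.
SITE_LOOKUP = {"server42.example.com": "SITE12",
               "router7.example.com": "SITE12",
               "db01.example.com": "SITE03"}

def enrich_scope(event):
    """Set ScopeID from the node name, if we have a mapping for it."""
    event["ScopeID"] = SITE_LOOKUP.get(event.get("Node", ""), "")
    return event

print(enrich_scope({"Node": "server42.example.com", "Summary": "Disk failure"}))
```

However ScopeID is populated, the grouping framework behaves the same way once the field is set.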


Is there anything more we can do with these groups of events?

So now we have the events neatly grouped by incident.  What else can we do with this set of events to make sense of what's gone wrong in our network?

Typically, equipment vendors assign the severity of an alarm according to its impact on the service or users.  The events that report a service as unavailable, however, while critical to the business, are not typically the likely causes of an outage.  Environmental alarms, such as power failures for example, while not directly reflecting the status of a service, are more likely to be causes.  Ironically, very often likely causes are reported with a lower severity than the events reporting the service unavailability.  While impact is important in knowing how my business is affected, operations are more interested in the causes, since their primary responsibility may be to fix the problem.  Hence it is useful to differentiate the impact an alarm has on the business from the likelihood that the alarm is a contributor to the cause of a problem.

The scope-based event grouping function further introduces two new fields: ImpactWeight and CauseWeight.  As the names suggest, they are a representation of the weighting an alarm has - expressed as an integer - in terms of both its impact and how likely it is that the alarm is a cause.  The integer values mean nothing in themselves - but when compared to those of other alarms in a group, both the highest-impact events and the likely causes of an incident will quickly bubble to the top of an appropriately configured Event Viewer.  Moreover, the event grouping function considers the set of events as a whole and highlights, in the Summary of the synthetic containment event, a precis of what the main impact of this incident is - and its likely cause.
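Conceptually, that precis can be thought of as comparing the weights across the group and combining the "winners" into a single sentence.  The sketch below is a rough illustration of the idea only - it is not the product's actual algorithm - and the weight values shown are invented.

```python
# Rough illustration of deriving a parent Summary by comparing weights across a
# group - not the product's actual algorithm or field usage.
def summarise_group(events):
    """Pick the highest-impact and highest-cause-weight events and build a precis."""
    top_impact = max(events, key=lambda e: e["ImpactWeight"])
    top_cause = max(events, key=lambda e: e["CauseWeight"])
    return "%s caused by %s" % (top_impact["NormalisedAlarmName"],
                                top_cause["NormalisedAlarmName"])

group = [{"NormalisedAlarmName": "Performance failure", "ImpactWeight": 90, "CauseWeight": 10},
         {"NormalisedAlarmName": "Environmental alarm", "ImpactWeight": 20, "CauseWeight": 80},
         {"NormalisedAlarmName": "Link degradation",    "ImpactWeight": 40, "CauseWeight": 30}]
print(summarise_group(group))   # -> "Performance failure caused by Environmental alarm"
```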

A powerful element of this ability to compare events with each other is that the events can potentially come from any source - and be readily compared with each other simply based on the integer values.  Weights can also be dynamically adjusted, to increase or decrease an event's significance in terms of its impact or its likelihood of being a cause.  This function is entirely data-driven and dynamically updates several times a minute - so that the current status of the synthetic containment event accurately reflects the underlying event set.

This process is in many ways analogous to the way a human would consider the facts available at the moment, and then make a conclusion about the causes based on those facts.  If new information subsequently comes to-hand, this may alter the understanding of the situation, and any subsequent conclusions.

The process of applying impact and cause weights to our alarms is essentially encoding the SME knowledge into the alarm stream.  Just as with ScopeID, IBM is also going through a process of encoding ImpactWeight and CauseWeight into Probe rules out-of-the-box, where it makes sense to do so.

In the server room example above at SITE12, there may be a large number of events relating to this single problem.  By additionally applying an ImpactWeight and CauseWeight to the alarm set, not only will the operator be presented with a single row - which can be expanded to inspect the underlying events - they will also see a concise summary of the main impact and cause of the issue.  In this case, for example, it might be "Performance failure caused by Environmental alarms".  If the operator's Event Viewer is ordered by CauseWeight, they will also see the highest-weighted cause events conveniently bunched at the top of the grouping.  Note that very often there can be multiple contributors to a problem - and so having all the "likely causes" bunched at the top of the grouping can be extremely helpful.  From here, a single ticket can be cut from the top-level synthetic event - either automatically or manually by the operator - and the problem can be progressed in an efficient manner, taking into account all pieces of contributing information from all related events.

The scope-based event grouping groups events together based on relationships we know about.  The analytics-based event grouping, described next, groups events together based on the relationships we don't know about.  These two functions work together to achieve, as much as possible, the goal of sorting, organising and grouping events together by incident.

Quick reference to key fields used by scope-based event grouping:

  • ScopeID: string field containing the scope that events are automatically grouped on
  • ImpactWeight: integer field containing the relative impact this event has to our business, services or users
  • CauseWeight: integer field containing the relative likelihood that this event is a cause of a problem
  • NormalisedAlarmName: string field containing text that is used in the Summary field of the synthetic containment event

Simply set ScopeID and the grouping will happen.  Set the other three fields, and the impact and cause analysis will also occur.
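For illustration, an enriched event might carry values along these lines - the specific weights and names below are assumptions for the example, not shipped Probe rules content.

```python
# Example values only - the weights and names are assumed, not product defaults.
event = {
    "Summary": "Air conditioning unit failure",
    "ScopeID": "SITE12",                            # drives automatic grouping
    "ImpactWeight": 20,                             # low direct impact on services
    "CauseWeight": 80,                              # very likely a contributor to the cause
    "NormalisedAlarmName": "Environmental alarm",   # used in the parent event's Summary
}
```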


Related Event Grouping (analytics based) - event grouping based on the relationships I don't know about

The above section describes a framework for grouping events based on a scope or relationship I know about - relative to a time window.  But what about everything that is left?  What if my event data has no easily definable scope?  What if I don't necessarily understand the underlying relationships in my event data?

The analytics-based Related Event Grouping function analyses the historic event archive (REPORTER schema in the Tivoli Data Warehouse) looking for groups of events that always occur together - and within the same time window.  Identifying groups of events that always occur together - particularly when this has happened several times - provides strong evidence that these events are in some way related to each other - and potentially to the same fault.  Even if we cannot infer causation from this relationship, we can still establish correlation - and knowing that a group of events always occurs together can equip us with valuable insights into the underlying infrastructure and into any faults that occur.
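The underlying idea - spotting sets of events that always appear in the same historical time windows - can be sketched very simply.  The following is a much-simplified illustration only; the Event Analytics engine in NOI is considerably more sophisticated than this, and the sample event names are invented.

```python
# Much-simplified sketch of co-occurrence mining over historical time windows -
# not the NOI Event Analytics algorithm.
from itertools import combinations
from collections import Counter

def find_cooccurring_pairs(windows, min_support=3):
    """windows: list of sets of event identities seen in the same time window.
    Returns pairs that appeared together in at least min_support windows and
    never appeared apart ("always occur together")."""
    pair_counts = Counter()
    event_counts = Counter()
    for window in windows:
        event_counts.update(window)
        pair_counts.update(combinations(sorted(window), 2))
    related = []
    for (a, b), together in pair_counts.items():
        if together >= min_support and together == event_counts[a] == event_counts[b]:
            related.append((a, b, together))
    return related

history = [{"failover_test", "app_failure", "login_timeout"},
           {"failover_test", "app_failure"},
           {"failover_test", "app_failure", "disk_full"},
           {"disk_full"}]
print(find_cooccurring_pairs(history))   # -> [('app_failure', 'failover_test', 3)]
```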

Once the analytics engine has performed an analysis of the historic event archive, looking for groups of events that always occur together, an administrator can look through the resulting groupings.  The results dashboard provides a convenient portal through which to inspect the discovered groupings, the times the groupings have occurred previously, and the individual events that were present in each case.  Once the administrator is satisfied that any given grouping is valid, they can then either Watch, Archive or Deploy the grouping.  "Watch" means the system will continue to gather statistics about the grouping until further notice - an administrator may wish to check back in a month's time to see whether the grouping has accrued more occurrences before deciding whether or not it is valid.  "Archive" means the system will store the grouping away for later reference.  "Deploy" means the system will automatically group these events together in the Event Viewer if any of the events are seen again in the future.

Some customers that have evaluated the analytics-based related event grouping function have discovered non-obvious relationships within their event stream that they were previously unaware of.

One large European bank discovered from the grouping results that whenever they tested a failover/failback system each Monday morning, this triggered a number of application failures in another department.  Application users in that department had to live with their systems going offline for a while every Monday morning - without realising why.  The applications had a chain of dependencies back to the underlying system that wasn't obvious on the surface.  The related events analytics revealed this relationship - and led to the realisation that the failover/failback system wasn't actually working correctly, despite it reporting that it was - enabling the team responsible for it to investigate and rectify the problem.

Where this feature returns real value from an operations point of view is that NOI can be instructed, at the click of a button, to automatically group these related events together in the Netcool/OMNIbus Web GUI Event Viewer, should they ever occur again in the future.  This is done in a similar manner to the scope-based event grouping described earlier.  Indeed, these two functions have been designed to be complementary in their workings; both work to sort, organise and group events together as far as possible - and the groupings created by either mechanism can be visualised together within the same event view.

The value event grouping brings is clear: sorting and grouping related events together, reducing the rows presented in operator screens, reducing confusion when major outages occur, reducing costly duplicate tickets being opened, and reducing MTTR.  A key point regarding the analytics-based event grouping function is that the administrator has not had to write any Probe rules file, OMNIbus trigger or Impact policy.  The system learns from what has gone before - and does this work for you.  This is especially powerful in an environment that is constantly changing.

 

For a demo of these features and more, why not visit the Netcool Operations Insight demo hosted on IBM Service Engage:

https://www.ibmserviceengage.com/it-operations-management/explore

 

[{"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Product":{"code":"","label":""},"Component":"","Platform":[{"code":"","label":""}],"Version":"","Edition":"","Line of Business":{"code":"","label":""}}]

UID

ibm11082259