In my last blog I looked at correlating events under synthetic container events. That is useful when the potential root cause of an incident doesn't have an alarm associated with it. A more common form of correlation is to exploit a hierarchy in the topology, and ITNM and OMNIbus do that very effectively. The Netcool Knowledge Library (NcKL) ships with a set of automations that exploit known relationships between devices, the interfaces they have and the virtual circuits over those interfaces. However, if we understand how these correlation automations work, we can extend their use beyond the domain of IP networks and SNMP management.
Naming conventions are a familiar feature in IT, so familiar that we are probably unaware of them half the time. But consider the URL of a web page: that follows a naming convention. When networks were more hierarchical, network managers attempted to make sense of them by deploying naming conventions, and when the standards for mobile networking (GSM and 3G) were developed, network naming conventions were standardised. On these networks devices were given what was called a Distinguished Name (DN), a description that named the device uniquely (so a network wouldn't have 200 instances of "Router1") and in a way that defined its place in the network hierarchy. An example of a DN for a cellular radio transmitter might be:
A network operator would read that as being the second transmitter (TRX2) on a cell site (BTS3) attached via a circuit group (BCF145) to control region 2119 (BSC2119) on the GSM network. The network might have several thousand TRX2s but only one at that precise location, and the DN uniquely identifies it. Enterprises sometimes adopt the same principles, so that a Wi-Fi access point might be given the name HEADOFFICE/2NE/AP2 if it's the second AP serving the north east wing on the second floor of the head office building.
Apart from identifying a node uniquely, why are naming conventions useful in event management? Well, in the GSM network example a failure in the fibre backhaul network would take out not just one circuit group and cell site, but several. On the other hand, a power failure at a cell site would impact all the transmitters there but none anywhere else. We are back with the "single incident, multiple alarms" scenario, and we can exploit naming conventions to give us basic correlation that groups multiple alarms under a single one for incident management.
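To make the idea concrete, here is a minimal Python sketch of grouping alarms by a shared ancestor in their names. It is not part of NcKL or AdvCorr; the function name `group_by_ancestor`, the `depth` parameter and the extra access point names are assumptions made purely for illustration, using the '/'-delimited enterprise naming style described above.

```python
from collections import defaultdict


def group_by_ancestor(alarm_dns, depth, delimiter="/"):
    """Group alarm names by their first `depth` name components.

    Hypothetical helper: alarms whose names share the same leading
    components are treated as candidates for a single incident.
    """
    groups = defaultdict(list)
    for dn in alarm_dns:
        parts = dn.split(delimiter)
        ancestor = delimiter.join(parts[:depth])
        groups[ancestor].append(dn)
    return dict(groups)


# Illustrative names only, following the HEADOFFICE/2NE/AP2 pattern.
alarms = [
    "HEADOFFICE/2NE/AP1",
    "HEADOFFICE/2NE/AP2",
    "HEADOFFICE/3SW/AP1",
]
by_wing = group_by_ancestor(alarms, depth=2)
```

With `depth=2` the two access points in 2NE fall into one group and the 3SW access point into another, which is the kind of grouping a wing-level power failure would call for.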
Netcool Advanced Correlation (AdvCorr)
Netcool AdvCorr integrates the rules file standardisation of the Netcool Knowledge Library with the topology-based event correlation of Network Manager. It is a related-events correlation methodology, so it requires a means of identifying which events could be related. There are two stages to the process:
- pre-classification of events into whether they could be a root cause or whether they are always symptoms of another root cause
- identification of alarming objects into hierarchical classifications of Primary Objects, Root Objects and Secondary Objects
This is achieved in the probe rules. When the probe rules files provided by NcKL are used, these classifications are already taken care of, but for rules outside the NcKL framework we need to provide them ourselves.
This example uses a captured set of alarms from a real GSM cellular network, suitably anonymised of course. As the original source of alarms was a Nokia EMS, the rules file to be modified is one developed for Nokia NetAct. The first thing we need to do is to create a pre-classification file. This is where different alarm types are classified into root causes and symptoms. Fortunately the Nokia EMS, like most telco EMSs, gives each alarm type a unique number, so here the pre-classification can be done using a look-up table. (This file is in the zip archive linked to at the end of this blog.)
Pre-classification requires some domain knowledge, but not too much: enough to understand that a device not responding to a ping is a symptom rather than a cause, and that a Link Down alarm may be a cause but might also be the symptom of a higher order failure.
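The look-up idea can be sketched in a few lines of Python. This is only an illustration of the principle, not the contents of the pre-classification file in the zip archive: the alarm numbers, the descriptions in the comments and the labels "root_cause", "symptom" and "unknown" are all made up for the example.

```python
# Hypothetical pre-classification table keyed by EMS alarm number.
# The numbers and classifications below are invented for illustration.
PRECLASSIFICATION = {
    "1001": "root_cause",  # e.g. a power failure at a site
    "1002": "symptom",     # e.g. a device not responding to a ping
    "1003": "root_cause",  # e.g. link down: may be a cause, may also be a symptom
}


def classify(alarm_number):
    """Return the pre-classified cause type, or "unknown" if unlisted."""
    return PRECLASSIFICATION.get(str(alarm_number), "unknown")


cause = classify(1002)
```

Falling back to "unknown" for unlisted alarm numbers is a design choice in this sketch, so that new alarm types are not silently treated as either causes or symptoms.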
The second thing that needs to be done in the rules file is to extract the Root, Primary and Secondary Objects from the Distinguished Name. Since the DN has defined delimiters, in this case '/', it's straightforward to do this using an extract statement in the rules file (a sample rules file is in the zip archive). The three fields that need to be populated are:
- @LocalPriObj, which should be the full DN
- @LocalRootObj, which in this case is the circuit group
- @LocalSecObj, which is an intermediate extract between Primary and Root if one is possible
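The extraction logic for these three fields can be sketched in Python (the real work is done in the probe rules file, so this is only a model of it). The DN layout used here, the function name `extract_objects` and the `root_depth` parameter are assumptions, as is the choice to keep the leading path in the Root and Secondary Objects so that they stay unique.

```python
def extract_objects(dn, root_depth=3, delimiter="/"):
    """Split a '/'-delimited DN into Primary, Root and Secondary Objects.

    Assumes the circuit group is the `root_depth`-th component of the DN.
    """
    parts = dn.split(delimiter)
    if len(parts) < root_depth:
        raise ValueError(f"DN has fewer than {root_depth} components: {dn}")
    root = delimiter.join(parts[:root_depth])
    # A Secondary Object only exists if there is a level strictly
    # between the Root and the full DN.
    if len(parts) > root_depth + 1:
        secondary = delimiter.join(parts[:root_depth + 1])
    else:
        secondary = ""
    return {
        "LocalPriObj": dn,
        "LocalRootObj": root,
        "LocalSecObj": secondary,
    }


# Hypothetical DN layout: network / control region / circuit group / site / transmitter
fields = extract_objects("GSM/BSC2119/BCF145/BTS3/TRX2")
```

For an alarm raised directly on the level below the circuit group there is no intermediate level, so the sketch leaves the Secondary Object empty.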
The rules that extract the Primary and Secondary Objects must also define their relationship to the Root Object and to each other. Permitted values are Same (1), Alias (2) and Parent (3). Since an Object has multiple relationships, the rules multiply the Root to Secondary relationship by 4 and the Secondary to Primary relationship by 16 and add them together. Each relationship value fits in two bits, so the multiplications shift each one into its own pair of bits. This turns the LocalObjRelate field into a bit-packed byte in which every combination of relationships has a unique value.
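The arithmetic can be checked with a short Python sketch. The multipliers 4 and 16 come from the description above; placing the remaining relationship (Root to Primary) unmultiplied in the lowest two bits is an assumption of this sketch, and the function names are hypothetical.

```python
SAME, ALIAS, PARENT = 1, 2, 3


def pack_relations(root_to_pri, root_to_sec, sec_to_pri):
    """Pack three relationship values (0-3) into one integer.

    Assumed layout: bits 0-1 Root to Primary (x1, an assumption),
    bits 2-3 Root to Secondary (x4), bits 4-5 Secondary to Primary (x16).
    """
    for value in (root_to_pri, root_to_sec, sec_to_pri):
        if value not in (0, SAME, ALIAS, PARENT):
            raise ValueError(f"invalid relationship value: {value}")
    return root_to_pri + 4 * root_to_sec + 16 * sec_to_pri


def unpack_relations(packed):
    """Recover the three two-bit relationship values from the packed byte."""
    return packed & 3, (packed >> 2) & 3, (packed >> 4) & 3


packed = pack_relations(PARENT, SAME, ALIAS)  # 3 + 4*1 + 16*2
```

Because each value occupies its own pair of bits, unpacking always returns exactly the values that were packed, which is why every combination maps to a unique number.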
The final thing that needs to be done to the rules file is to incorporate (or include) the AdvCorr36.include.compat.rules file, which comes as part of the NcKL archive.
AdvCorr in operation
There are three automations in the AdvCorr group. The first, AdvCorr_SetCauseType, ensures compatibility with ITNM's RCA and sets the potential Cause Type to the same value as specified in the rules file lookup, unless there is a competing cause type set by ITNM. The other two automations populate root cause and symptom candidate tables, which are then used in an iterative fashion to perform the containerisation.
It can all seem complicated, but these automations have been available since OMNIbus 3.6 and simply work.
What is new, though, and complements these automations very well, is the Event Viewer in WebGUI 7.4. The relationships between symptoms and root causes can be set up as a new Relationship Definition, and the Event Viewer configured to group events by these relationships, with twisties available to expand and collapse the symptom events.
Like the synthetic event container automation covered in the last blog, these automations offer a way to reduce the number of alarms operators have to view without losing the detail needed to diagnose problems and assess impact.
A zip file containing sample files can be found here. The zip file also contains a Word document providing a fuller description.