In my last blog, describing a new approach to event grouping based on the assumption that alarms that occur at the same time in the same place probably have the same underlying cause, I ended by saying that event grouping by alarm scope needed three things:
- a definition of what "same time" means,
- a definition of what "same place" means,
- a way to select which of the group of alarms sharing those two attributes is the most likely cause.
In this blog I will go into more detail about how we implemented this in the OMNIbus fix packs. Before I do that, here is a link to the ObjectServer SQL file included in Fix Pack 4.
I will deal with "same place" first. As I said in an earlier blog, we started this work in response to a request from a major Asian telco. Most of their use cases revolved around correlating infrastructure and environmental alarms - power and air conditioning - to networking problems. One example was that if a cell site raised an alarm that it was switching to battery operation because of a mains power failure, and an hour later the cell site went off air, those alarms should be linked. It is, after all, a very reasonable assumption that the backup batteries were drained after an hour and that this is the reason for the cell site being down. Similarly it was expected that a fan failure would see equipment cabinet temperatures rising, and that as a result communications links might start clocking up framing errors or bit error rate test threshold alarms. So the first step was to populate alarms with the node location's site name. Ideally SiteName would be a unique identifier and would be included among the tokens sent by the element management system, as indeed it is in most cases. Failing that, a lookup statement in the probe rules or an Impact policy could enrich the alarm from an inventory file, as in the sketch below.
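Here is a minimal sketch of the lookup approach in probe rules syntax. The file name sites.lookup, its path, and the choice of @Node as the key are illustrative assumptions, not part of the fix pack:

```
# Declare the lookup table near the top of the rules file.
# sites.lookup maps node names to site names; path is illustrative.
table SiteTable = "/opt/netcool/etc/rules/sites.lookup"

# ... later, in the body of the rules, enrich the alarm ...
@SiteName = lookup(@Node, SiteTable)
```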
Grouping by site name does not, however, bring in related alarms from other sites. One site might have a cabinet power failure that results in a communications link failure. The site at the other end of that link will also generate alarms, for example Loss of Signal or Loss of Frame (or both), but as this is a different site those alarms won't be in the event grouping. This is where the concept of scope comes in. We define alarm scope as the extent to which the impact of an alarm can be felt. Thus, as a comms link failure can be detected at the remote end of the link, the scope of that alarm should cover both the A-end and B-end sites. It should also cover the link itself, because that link might be based on transmission equipment that is supporting other links that are also in alarm. This means the choice of ScopeID might need to be wider than just two sites.
Two new fields have been added to OMNIbus, @SiteName and @ScopeID. The diagram below represents how a GSM/3G radio access network (RAN) might be presented in this way.
In practice, then, for a cellular network the ScopeID can be set to the BSC name, and in many instances that name can be extracted from the Fully Distinguished Name in the alarm itself. A typical 3GPP standards-compliant DN might read: "GSM/BSC-43141/BCF-11/BTS-11/TRX-10". It's a simple task to extract "BSC-43141" out of there and use it to populate ScopeID, as in the sketch below.
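This is a minimal sketch in probe rules syntax, assuming the DN arrives in a token called $DN; the token name varies by probe and is an assumption here:

```
# If the Distinguished Name contains a BSC component, use it as the scope.
if (regmatch($DN, "BSC-[0-9]+"))
{
    @ScopeID = extract($DN, "(BSC-[0-9]+)")
}
```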
In real cellular networks, though, there might be multiple management domains using different inventories and naming conventions, so we have provided a third level of scoping, called ScopeAlias, so that the same scope known by different names can be linked together, as in this example:
Scope Aliasing is implemented using a custom table in OMNIbus that links ScopeIDs together.
These examples are for GSM/3G networks; other domains require a different approach. I will come back to determining ScopeID and ScopeAlias in a later blog.
The next step is to define "same time". To do that, let's reflect on how events and alarms are generated and sent. A network card may apparently send a "link down" event instantaneously after a cable is pulled, but in reality what has happened is that the card has detected that certain control signals - which may be as simple as a voltage on a pin - are no longer present. That will take a millisecond or two to detect. At the other end of the link the network card has all its physical level indicators still working, but it has detected that the logical framing of the carrier signal is no longer present. It may be many seconds before the automatic resynchronisation processes have been tried and have failed, and thus many seconds before the alarm is generated. Often, though, the physical problem is more of a dirty joint than a clean break, and in those circumstances the distant end may detect the problem through increased errors in a background error rate test, a test that takes minutes to run. Or the errors may be detected by a performance management application collecting SNMP metrics every fifteen minutes. "Same time" therefore has to be a time window rather than a fixed time.
The way this is implemented in the Event Grouping automation is to define a quiet period: a period after the first alarm during which new alarms can be added to the container. If that period is quiet, i.e. no new alarms are added, then the container is closed. The quiet period can be defined per alarm type; if none is defined, a default held as a property is applied. This is set to fifteen minutes in the initial installation, but most users will want to reduce that.
If a new alarm comes in during the quiet period it is added to the container, and it can extend the time window if its own quiet period reaches beyond the current closing time. For example, if the first alarm opens a ten-minute window and an alarm with a two-hour quiet period arrives five minutes in, the container stays open for the longer period.
As the quiet period can be defined in the rules file, it makes sense to set it according to the type of alarm. It can be fairly long: an alarm reporting that a device has switched to battery power should keep the container open for the hour or two it takes to drain the battery, because that is how long it will take for other effects to be noticed. On the other hand, low priority symptom alarms should not extend the quiet period and can be given a QuietPeriod of 1 second - not zero, as zero triggers the default to be applied. A sketch follows.
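Here is a minimal sketch of per-alarm-type quiet periods in probe rules syntax. The switch is keyed off @AlertKey purely for illustration, the vendor event identifiers are invented, and values are assumed to be in seconds, consistent with the 1-second example above:

```
# Set the quiet period according to the type of alarm.
switch(@AlertKey)
{
    case "mainsPowerFail":
        # Running on battery: keep the group open long enough
        # for knock-on failures to arrive (two hours).
        @QuietPeriod = 7200
    case "cabinetDoorOpen":
        # Low priority symptom: do not hold the group open.
        # 1 second, not 0, which would apply the default.
        @QuietPeriod = 1
    default:
        # Zero applies the default held as a property.
        @QuietPeriod = 0
}
```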
The remaining question is which alarm points to the underlying cause of the problem. In a previous blog I wrote about the different techniques used historically and their upsides and downsides. What we are doing here is a simplified codebook approach. Rather than score all the possible alarms against each other or create loads of cause and effect relationships, we have simply given each alarm a weighting; as this is an integer, determining which is the highest weighted alarm in a group is easy and efficient. And rather than do this for potentially hundreds of alarm types, we have defined sixteen generic alarm types, and in the rules file we map the vendor alarm codes to these. Our initial normalised alarm class list is as follows:
| Normalised Alarm Class | Code | Description | Cause Weight | Impact Weight |
|---|---|---|---|---|
| Physical | 160 | Control Shut Down | 160 | 10 |
| Sensor | 120 | Environmental Warning, inc Door Open and similar alarms | 120 | 50 |
| Operational | 80 | Inoperative State, Change of State | 80 | 120 |
| | 60 | Control Path Loss | 60 | 100 |
| | 50 | Operational Warning, inc running on backup | 50 | 110 |
| | 20 | Workarounds in execution | 20 | 140 |
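As an illustration, the vendor mapping in the rules include file can be as simple as a switch statement. This is a minimal sketch; the vendor event identifiers and the field name @NormalisedAlarmCode are assumptions for illustration:

```
# Map vendor alarm types onto the normalised alarm classes above.
switch(@EventId)
{
    case "EQUIPMENT_POWER_FAIL":
        @NormalisedAlarmCode = 160    # Physical
    case "CABINET_DOOR_OPEN":
        @NormalisedAlarmCode = 120    # Sensor
    case "LINK_LOS":
        @NormalisedAlarmCode = 80     # Operational
    default:
        log(DEBUG, "No normalised alarm code mapping for " + @EventId)
}
```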
The recommended way of implementing the necessary rules file changes is as follows:
- Create a rules include file mapping vendor alarm types to Normalised Alarm Codes and OSI levels, and setting individual quiet period times.
- Acquire a copy of the genericcorr.common.include file. This file calculates the cause and impact weightings generically.
- Add lines at the bottom of the existing rules to include the vendor-specific mapping file and genericcorr.common.include - in that order, as in the sketch after this list.
- Reload the probe rules.
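A minimal sketch of those final include lines, assuming the files live in the probe rules directory; the vendor file name and paths are illustrative:

```
# At the bottom of the existing probe rules file, in this order:
include "/opt/netcool/etc/rules/vendor-alarm-mapping.include"
include "/opt/netcool/etc/rules/genericcorr.common.include"
```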
That covers the changes needed to rules files. Next time I'll cover setting up Event Views and provide examples.