In this blog we come to the final piece of setting up Event Grouping, namely setting the time window. The time window is important because event grouping is based on the principle that events occurring in the same place at the same time are likely to have the same underlying cause. If time were not considered at all, event grouping could easily group alarms that have different causes, and the same risk applies if the time window is set too long. On the other hand, if the time window is set too short, event grouping may fail to catch alarms that take a little while to generate, such as those raised when performance thresholds are breached. Like ScopeID, then, setting the time window requires a bit of domain knowledge.
This diagram shows how we initially saw time windows working:
The actual implementation is slightly different, but it still follows the principle of the first alarm setting a time window which following alarms can extend if necessary. Because the container closes when the time window expires without further extension, we call the OMNIbus field that contains the time window length QuietPeriod, the concept being that the time window closes after a quiet period of that length.
The default QuietPeriod is held as a property in the master.properties table and is set to 900 seconds at install. This may be too long on busy systems but can easily be edited. The default value is used whenever the QuietPeriod field in an event is left at zero, which is that field's own default.
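The principle described above (the first alarm opens a time window, later alarms may extend it, and a QuietPeriod of zero falls back to the default) can be sketched in a few lines of Python. This is only an illustration of the idea, not the OMNIbus implementation, which as noted is slightly different. The `Container` class, the `add_alarm` function and the rule that an alarm can only push the expiry later, never pull it earlier, are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional

DEFAULT_QUIET_PERIOD = 900  # seconds; the install-time default mentioned above


def effective_quiet_period(quiet_period: int,
                           default: int = DEFAULT_QUIET_PERIOD) -> int:
    """A QuietPeriod of zero means 'use the default'."""
    return default if quiet_period == 0 else quiet_period


@dataclass
class Container:
    """Hypothetical stand-in for an event container (not an OMNIbus object)."""
    expires_at: float
    members: List[str] = field(default_factory=list)

    def is_open(self, now: float) -> bool:
        return now <= self.expires_at


def add_alarm(container: Optional[Container], name: str,
              arrival: float, quiet_period: int) -> Container:
    """Place an alarm in the open container, or start a new one.

    Assumption: an alarm extends the window only if its own quiet
    period would end later than the current expiry.
    """
    qp = effective_quiet_period(quiet_period)
    if container is None or not container.is_open(arrival):
        # The first alarm (or the first after a quiet period) opens a new window.
        container = Container(expires_at=arrival + qp)
    else:
        container.expires_at = max(container.expires_at, arrival + qp)
    container.members.append(name)
    return container


# Times are in seconds from the first alarm.
c = add_alarm(None, "link down", arrival=0, quiet_period=120)    # window ends at 120
c = add_alarm(c, "symptom", arrival=100, quiet_period=1)         # still ends at 120
c = add_alarm(c, "possible cause", arrival=110, quiet_period=60) # extended to 170
d = add_alarm(c, "late alarm", arrival=200, quiet_period=0)      # new container, ends at 1100
```

In this sketch the symptom alarm joins the container without lengthening the window, while the alarm arriving at 200 seconds finds the window already closed and starts a new container using the 900 second default.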
QuietPeriod can also be set on an alarm by alarm basis, and for that we need to consider how alarms are generated and published. These can be quite different. Some alarms are generated by instrumentation detecting a change in state and are then published immediately and unsolicited. The time between the cause occurring and the alarm being received by an event management application is short enough to be considered near real time. Most are probably not as fast as the reporting standard required for IEC 61850 compliance in electrical substations, which requires an alarm to reach the target system within 4 milliseconds of the condition arising, but then few IT systems could cause things to literally melt as a substation short circuit can. Even so, there are a large number of alarm types where arrival in OMNIbus is within seconds of the condition that caused the alarm. Not all arrive that quickly, though, and the cause of the delay varies:
- alarms that are not sent unsolicited but have to be retrieved by polling the event table inside a device will be delayed by the length of the polling cycle, which is typically one minute. Most OMNIbus Telco Service Monitors (TSM) included a polling application that ran every minute
- alarms may be delayed because a system goes through an automatic retry or reset process before reporting an alarm; this too may add up to a minute's delay before the alarm appears in OMNIbus
- alarms that are generated by testing a sample of data, for example bit error rate test (BERT) alarms, will be delayed by the length of the sampling period
- alarms that are created by external performance monitoring systems, reporting that a counter or the delta between counter values has exceeded a threshold, will be delayed by the delta period. This delay can be significant, with polling cycles of 15 minutes, 30 minutes or even an hour being common
QuietPeriod needs to be set so that a container does not miss these delayed alarms, though in the last case a different approach may ultimately be needed.
The other consideration is how long it might take for the impact of an alarm to be felt. Datalinks may fail, but if there is redundancy the only impact may be congestion detected an hour later. Another common delayed impact is when a server or equipment rack switches to battery power after the mains power fails. Only when the batteries drain, an hour or more later, will any impact be detected. However, given that in such a case the mains failure is almost certainly the underlying cause of the incident, we want the battery backup alarm inside the event container and not on its own in a different one.
A final consideration is the likelihood that a particular alarm is reporting a condition that will trigger other alarms. If the likelihood is high then the QuietPeriod should be set long enough to catch these symptom alarms, but an alarm that is clearly a symptom should not extend the time window. Nor should a Resolution event: any event where Type=2 should have QuietPeriod = 1 (remember that 0 triggers the default of 900).
Each environment will be different, but here I suggest this rule of thumb:
- Likely cause alarms have QuietPeriod = 120
- Possible cause alarms have QuietPeriod = 60
- Symptom alarms have QuietPeriod = 1
- Resolution events have QuietPeriod = 1
- Environmental alarms where the impact is likely to be delayed (e.g. power fail, fan failure) have QuietPeriod = 900 or more
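As an illustration, the rule of thumb above could be captured as a simple lookup. This is a sketch only: the category labels and the `suggest_quiet_period` function are hypothetical names invented for the example, and how an alarm gets classified into a category is left to whatever domain knowledge you have. The only OMNIbus details it relies on are those already mentioned: Type=2 marks a Resolution event, and a QuietPeriod of 0 falls back to the default.

```python
RESOLUTION_TYPE = 2  # Type=2 marks a Resolution event, as noted above

# Hypothetical category labels mapped to the rule-of-thumb values (seconds).
QUIET_PERIOD_BY_CATEGORY = {
    "likely_cause": 120,
    "possible_cause": 60,
    "symptom": 1,
    "delayed_environmental": 900,  # e.g. power fail, fan failure; may need more
}


def suggest_quiet_period(event_type: int, category: str) -> int:
    """Return a suggested QuietPeriod in seconds for one event.

    Resolution events always get 1 so that they never extend the window.
    Unclassified alarms get 0, which leaves them on the default QuietPeriod.
    """
    if event_type == RESOLUTION_TYPE:
        return 1
    return QUIET_PERIOD_BY_CATEGORY.get(category, 0)
```

Returning 0 for anything unclassified is a deliberate choice in this sketch: it leaves such alarms on the default rather than guessing a value, so if the 900 second default is too long for your system it is the default itself that should be changed.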
In the next blog I will describe how this EventGrouping has been implemented in the OMNIbus FixPacks.