In my previous blog (https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/entry/getting_more_out_of_omnibus_creating_a_threshold_event_with_hysteresis?lang=en) I showed how Node-RED with a TCP Socket probe lets us create a threshold event that does not clear until three consecutive monitoring periods have been within the threshold. Quite often another threshold breach occurs before the third clear period, and so the alarm does not clear. Additionally, if the period of the DeleteClears automation is longer than the threshold monitoring period, the threshold may be breached again while the previously cleared event is still in the system, in which case that event is updated. In short, it would be useful to indicate whether a threshold breach had been present for the entire life of the event or only for part of it and, if the latter, for how great a part.
In cases like the threshold events created by the Node-RED flow I described last time, where we know that an update occurs every five minutes, we can calculate how long the threshold was breached by multiplying @Tally by 300 seconds. Since the @FirstOccurrence and @LastOccurrence fields tell us how long the alarm has been active, we can express the breached time as a percentage of that life, and if we put the result in another field we have a KPI we can display.
We need an automation. The basic automation is quite simple:
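-- monitorPeriod and proportionInAlarm are assumed to be declared as integers in the trigger's declare section
-- find the events that the KPI applies to
for each row problem in alerts.status where problem.AlertGroup = 'Threshold Breach' and problem.Tally > 1
begin
    set monitorPeriod = 300; -- default monitoring period, in seconds
    -- percentage of the event's life spent in breach; monitorPeriod is added to the life (see the gotcha below)
    set proportionInAlarm = (problem.Tally * monitorPeriod * 100) / (monitorPeriod + problem.LastOccurrence - problem.FirstOccurrence);
    update alerts.status via problem.Identifier set ThresholdEventKPI = proportionInAlarm;
end;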
There is one little gotcha. If you take the life of the event to be simply the difference between LastOccurrence and FirstOccurrence, you will get strange results: percentages of 198%, for example. That is because the first occurrence of the event arrives at the end of a monitoring period, so we need to add the monitorPeriod value (300 seconds) to the life of the OMNIbus event to get the actual total monitored time.
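To make the arithmetic concrete, here is a small Python sketch of the same calculation, run outside OMNIbus. The function names and the sample timestamps are made up for illustration, and the use of integer division is an assumption meant to mirror integer fields in the ObjectServer rather than a statement of how it rounds.

```python
MONITOR_PERIOD = 300  # seconds between polls, as in the Node-RED flow


def naive_kpi(tally, first_occurrence, last_occurrence, period=MONITOR_PERIOD):
    """Percentage in breach, taking the event life as Last - First only."""
    life = last_occurrence - first_occurrence
    return tally * period * 100 // life


def threshold_kpi(tally, first_occurrence, last_occurrence, period=MONITOR_PERIOD):
    """Percentage in breach, adding one period for the poll that raised the event."""
    life = period + last_occurrence - first_occurrence
    return tally * period * 100 // life


# Hypothetical event: two breaches, the second arriving 303 seconds after the first.
print(naive_kpi(2, 0, 303))      # 198
print(threshold_kpi(2, 0, 303))  # 99

# Hypothetical event: three breaches spread over six monitoring periods.
print(threshold_kpi(3, 0, 1500))  # 50
```

The first pair of results shows where a figure like 198% comes from: two breached periods (600 seconds) are being compared with a life of only 303 seconds until the missing first period is added back.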
Note also that I created a new field, @ThresholdEventKPI, to hold the result. This field must exist before the automation is created, or OMNIbus will report an SQL error.
Unfortunately, this approach will not work where repeat alarms come in at random intervals, but wherever the source is a timed or polled collection of data this automation, with a few tweaks, should give some extra information.