Sitworld: Suppressing Situation Events By Time Schedule
John Alvord, IBM Corporation
How can situation events be suppressed during specific time schedules? There are many solutions, and this is a simple and reliable way to make that happen. See the section at the end for other methods. The example situations are available in a zip file linked here. There is a base situation and an until situation, which work together to achieve the goal.
Development Until Situation
The example until situation controls the time schedule. It must have the same sampling interval and the same distribution as the base situation.
Note that the situation uses the Local Time attribute as defined in the Unix OS Agent attribute group. All agents have Local Time and two others [Universal Time and Universal Message]. That means the formula is evaluated in the same context as the base situation and at about the same time. The formula must be True when you want the base situation events suppressed. In this case we are suppressing events from 8am to 5pm. This is exactly opposite to other solutions. After developing and testing you will also want to dissociate the until situation from the TEP navigation node. It is a helper situation and you do not need to create alerts for helper situations.
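The polarity of the until formula is easy to get backwards, so here is a minimal Python sketch of the same test. It is not ITM formula syntax; the function name and the constants are hypothetical, and the window simply mirrors the 8am to 5pm example above.

```python
from datetime import datetime, time

# Hypothetical constants mirroring the example: suppress from 8am to 5pm.
SUPPRESS_START = time(8, 0)
SUPPRESS_END = time(17, 0)


def until_is_true(local_now: datetime) -> bool:
    """Return True when base situation events should be suppressed.

    This models the until situation: True INSIDE the quiet window,
    False outside it (when base situation events are wanted).
    """
    return SUPPRESS_START <= local_now.time() < SUPPRESS_END
```

For instance, `until_is_true(datetime(2024, 1, 15, 12, 30))` is True (events suppressed) and `until_is_true(datetime(2024, 1, 15, 17, 0))` is False (events allowed again).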
The until situation can be used for any other situations that have the same sampling interval and the same distribution.
In production use you would have the until situation with Run at Startup and the sampling interval would be a more reasonable 15 minutes or whatever is appropriate for the business needs. You cannot predict when situations evaluate. Typically they run first at agent startup and then every [sampling interval] minutes. However that can change if the situation is manually stopped and started.
Development Base situation
The example situation is for Unix OS. A situation event is created if a certain process is missing.
Not seen here: under Advanced, Persist=2 has been set. An explanation follows.
Clicking on the Until tab we see
The *UNTIL/*SIT has been set to the previously created UNTIL situation based on a time schedule.
How it Works
When these situations evaluate periodically at the agent, results may be sent to the hub TEMS if they are true. The TEMS decides whether a situation event is warranted. When there is an Until situation defined and the condition is true [results exist], during TEMS evaluation any open situation event is closed *and* from then on the base situation results are discarded. That means no future situation events until the until situation becomes false and allows the base situation results to generate a new event.
The base situation uses persist=2 because it is unpredictable when the base results and the until results are processed. When everything is running perfectly it can happen within a second or less. However, if the TEMS or the Agent is workload stressed or if communications are flaky, one result may arrive significantly after the other. The persist=2 setting gives time for things to settle down. If you don't care about a few invalid alerts once in a while you can leave out that persist setting.
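The interaction between the until situation and the persist setting can be sketched as a small state machine. This is a simplified model for illustration only, not the actual TEMS implementation: the class and method names are hypothetical, and a late-arriving until result is modeled as one sample where the until condition looks false.

```python
from dataclasses import dataclass


@dataclass
class EventTracker:
    """Simplified, hypothetical model of TEMS event handling for one agent."""

    persist: int = 2           # consecutive true samples needed to open an event
    consecutive_true: int = 0  # true base results seen in a row
    event_open: bool = False

    def process(self, base_true: bool, until_true: bool) -> bool:
        """Process one sampling cycle and return whether an event is open."""
        if until_true:
            # Until results exist: close any open event, discard base result.
            self.event_open = False
            self.consecutive_true = 0
        elif base_true:
            self.consecutive_true += 1
            if self.consecutive_true >= self.persist:
                self.event_open = True
        else:
            # Base condition cleared: close the event and reset the count.
            self.event_open = False
            self.consecutive_true = 0
        return self.event_open


def run(persist: int, samples: list[tuple[bool, bool]]) -> list[bool]:
    """Feed (base_true, until_true) samples through a fresh tracker."""
    tracker = EventTracker(persist=persist)
    return [tracker.process(base, until) for base, until in samples]
```

Suppose the until result straggles in one cycle late at the start of the suppression window, so the samples are `[(True, False), (True, True)]`. In this model `run(1, ...)` gives `[True, False]`, a short-lived invalid alert, while `run(2, ...)` gives `[False, False]`.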
Because of a double timing cycle [one cycle at the Agent and a second cycle in the TEMS SQL processor] it can take twice the sampling interval for a true condition to be reported as an event. That should be planned for in deciding the sampling intervals. On the other hand, if the recovery from the issue is not that critical consider increasing the sampling interval. You get the full benefit from the events but don't pay the resource cost of constant activity.
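As a rough planning aid, the double timing cycle can be written out as arithmetic. The function name is hypothetical, and the sketch assumes both cycles run at the situation's sampling interval; it does not account for any extra delay added by a persist setting.

```python
def worst_case_event_delay_minutes(sampling_interval_minutes: float) -> float:
    """Worst-case delay before a true condition is reported as an event.

    Assumes one full cycle at the Agent plus one full cycle in the
    TEMS SQL processor, each equal to the sampling interval.
    """
    agent_cycle = sampling_interval_minutes
    tems_cycle = sampling_interval_minutes
    return agent_cycle + tems_cycle
```

With a 15 minute sampling interval, `worst_case_event_delay_minutes(15)` returns 30, so a condition could exist for up to half an hour before the event appears.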
Negative Aspects of Example Solution
It takes manual effort to create all these time schedule until situations.
Changing them, for example to add holidays, also takes manual effort.
It has more overhead than other solutions since situations keep evaluating the whole time.
Other Methods
1) The Overseer Workflow Policy Pattern is an extremely low overhead solution. The situations just do not run during the suppression period. On the other hand the development is more challenging.
2) You can add LocalTime tests to the Situation Formula. During TEMS processing that creates hidden sub-situations. That case is subject to the Persist=1 problem. That solution cannot be shared across situations, so manual work is higher. This also prevents use of DisplayItem and also action commands.
3) You can add *SIT tests to the formula, but it has the same drawbacks as (2).
4) You can have a situation override - so the formula itself changes by a schedule. That is another level of complexity and more moving parts.
5) Using a site program, you could have the situation distributed by a Managed System List and at the given times use tacmd editsystemlist functions to delete and add back in agents from the MSL. The challenge here involves handling cases where the TEMS recycles and ensuring that the external program doesn't stop.
6) The most modern solution is to add logic into the event receiver to ignore events from certain situations and certain agents during predefined conditions. That means some work in functions like Netcool/Impact but virtually eliminates the effort of forcing ITM to handle such work.
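To give a feel for method (6), here is a minimal Python sketch of receiver-side filtering. It is not Netcool/Impact policy code; the rule structure, the function names, and the example situation and agent names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class SuppressionRule:
    """Hypothetical rule: ignore matching events inside a daily time window."""

    situation_pattern: str  # shell-style pattern for the situation name
    agent_pattern: str      # shell-style pattern for the agent name
    start: time
    end: time

    def matches(self, situation: str, agent: str, when: datetime) -> bool:
        return (
            fnmatchcase(situation, self.situation_pattern)
            and fnmatchcase(agent, self.agent_pattern)
            and self.start <= when.time() < self.end
        )


def should_ignore(rules: list[SuppressionRule], situation: str,
                  agent: str, when: datetime) -> bool:
    """Return True if any rule says this event should be dropped."""
    return any(rule.matches(situation, agent, when) for rule in rules)
```

A rule such as `SuppressionRule("demo_process_missing", "*:KUX", time(8), time(17))` would drop that situation's events from any matching agent between 8am and 5pm, while ITM itself keeps running the situation unchanged.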
This shows how to create situations that will have situation events suppressed during a time schedule.