Gaining more insights using System Automation for z/OS and OMEGAMON
JuergenHoltz
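The System Automation for z/OS team has just released the small product enhancement (SPE) APAR OA43571, which contains two new functions: immediate message reporting on the Tivoli Enterprise Portal, and looping address space suppression.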
Let me highlight and explain the new functions in more detail.
Immediate message reporting on the Tivoli Enterprise Portal
With the System Automation for z/OS Monitoring Agent, you can bring your operational data into the Tivoli Enterprise Portal (TEP). There, you can display automated resources and their status side by side with performance monitoring data from the OMEGAMON, IBM Tivoli Monitoring (ITM) and IBM Tivoli Composite Application Management (ITCAM) products deployed in your environment.
On the TEP, you can define situations: conditions that you want to monitor continuously and be notified about as soon as they occur. Up to now, System Automation for z/OS has supported only sampled situations. In a sampled situation, the condition is re-evaluated periodically at a user-defined interval, and a situation event is raised if the result is true. As a consequence, a condition that occurs and is resolved again between two evaluations remains undetected. The longer the interval, the more conditions can slip through; the shorter the interval, on the other hand, the higher the overhead caused by the extra polling.
Now, with OA43571, System Automation for z/OS provides support for pure event situations. Unlike sampled situations, pure events can be generated by the Monitoring Agent itself and pushed to the TEP for immediate reporting. The first attribute group that supports this capability is the new “Message Events” attribute group introduced with this SPE.
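To make the difference between sampled situations and pure events a bit more tangible, here is a small Python simulation. It is only a sketch of the idea and says nothing about how ITM or the Monitoring Agent work internally; the function names, the timings and the three example conditions are all made up for illustration.

```python
# Illustrative simulation only. None of these names come from ITM,
# the TEP, or System Automation for z/OS.

def sampled_detections(conditions, interval, horizon):
    """Return the indexes of conditions that are active at a sampling time.

    conditions: list of (start, end) tuples; a condition holds for start <= t < end
    interval:   sampling interval in seconds
    horizon:    total simulated time in seconds
    """
    detected = set()
    for t in range(0, horizon + 1, interval):
        for i, (start, end) in enumerate(conditions):
            if start <= t < end:
                detected.add(i)
    return sorted(detected)


def pure_event_detections(conditions):
    """A pure event is pushed when the condition occurs, so each one is reported."""
    return list(range(len(conditions)))


# Three short-lived conditions within a ten-minute window (times in seconds).
conditions = [(20, 45), (130, 200), (310, 330)]

print(sampled_detections(conditions, interval=60, horizon=600))   # [1]
print(sampled_detections(conditions, interval=300, horizon=600))  # []
print(sampled_detections(conditions, interval=10, horizon=600))   # [0, 1, 2]
print(pure_event_detections(conditions))                          # [0, 1, 2]
```

With a 60-second interval only one of the three conditions is seen, and with a 5-minute interval none is. Sampling every 10 seconds catches all of them, but at the price of six times as many evaluations as the 60-second interval. The pure event model reports all three without any polling.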
When the Monitoring Agent is configured and started, System Automation for z/OS can now forward any message that is captured by its automation policy to the Monitoring Agent, which relays it to ITM as a pure event. If a situation is defined on the TEP for such a message, the event is displayed immediately on the Situation Event Console. Similarly, the navigator entry for this Monitoring Agent's node receives a decorator to indicate that situations exist on this node. Operators on the TEP are now notified automatically of operational conditions of interest, and none of these conditions goes undetected anymore.
For more information about capturing messages and using this new function, refer to What
Looping address space suppression
It is not uncommon for workloads on your system to fail by falling into silent, CPU-intensive loops. In some cases, they consume an extraordinary amount of processing power. Failing to detect this early is at the very least a waste of resources, but it can also hurt your business when the looping work blocks other, possibly more important, workloads that compete for the same processing power.
Of course, you can write your own monitoring scripts, identify the candidates yourself and decide case by case what to do. However, with OMEGAMON XE for z/OS and System Automation for z/OS, you now have a much better choice! You can gain better insight into what is running on your z/OS systems and combine that with the flexibility offered by the System Automation for z/OS automation policy to make the right automated decisions for each individual case.
With OA43571, System Automation for z/OS provides an out-of-the-box solution that monitors all address spaces, both SA-managed and non-SA-managed (such as batch jobs), and that can automatically discover and resolve these situations by giving looping address space candidates a specific treatment based on definitions in your automation policy.
Rather than performing detailed using and delay analysis itself, System Automation for z/OS periodically asks OMEGAMON XE for z/OS for address space bottleneck analysis data. Specifically, it asks for address spaces that have a very high CPU loop index. These are address spaces that, with high probability, are doing little more than consuming CP or adjunct processor (zIIP, zAAP) cycles, while doing almost no I/O and not waiting for any other reason.
So, there are candidates. What comes next? The solution is very flexible because it allows you to categorize your workloads. Simply put, you can distinguish test from production work, or discretionary work from business-critical work. For each category defined in the automation policy, you can define specific recovery rules. For example, you can decide to cancel all address spaces that fall into the test category, while leaving any address space in the business-critical category untouched. And as you might expect, there are other recovery options between these two extremes: System Automation for z/OS can also warn an operator, kick off an OMEGAMON XE inspection to gather more details about a potentially looping address space, RESET the address space to another service class, or apply a combination of these.
To help you become familiar with this new solution, System Automation for z/OS allows you to log candidates and potential recovery actions to the NetView netlog. Over time, when you feel comfortable with your categorization and recovery rules, you can switch from logging only to actual recovery.
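The interplay of candidate selection, categorization, recovery rules and the log-only mode can be sketched in a few lines of Python. Please read this purely as a thought model: the real definitions live in the automation policy, and every name below (the `Candidate` class, the job name patterns, the category and action names, the threshold of 90) is an assumption made up for this illustration, not a System Automation or OMEGAMON identifier.

```python
# Hypothetical sketch. None of these names, patterns or thresholds are
# taken from System Automation for z/OS or OMEGAMON XE for z/OS.
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class Candidate:
    jobname: str
    loop_index: float  # assumed score from 0 to 100, higher means "more likely looping"


# Assumed categorization by job name pattern; the first match wins.
CATEGORIES = [
    ("PAY*", "business_critical"),
    ("TST*", "test"),
    ("*", "discretionary"),
]

# Assumed recovery rules per category.
RECOVERY = {
    "test": ["cancel"],
    "discretionary": ["warn", "reset_service_class"],
    "business_critical": [],  # never touch these address spaces
}


def categorize(jobname):
    """Return the category of the first pattern that matches the job name."""
    for pattern, category in CATEGORIES:
        if fnmatchcase(jobname, pattern):
            return category
    return "discretionary"


def plan_recovery(candidates, threshold=90.0, log_only=True):
    """List the actions per looping candidate; in log-only mode nothing is executed."""
    mode = "LOG" if log_only else "EXECUTE"
    plan = []
    for c in candidates:
        if c.loop_index < threshold:
            continue  # not a looping candidate
        category = categorize(c.jobname)
        for action in RECOVERY[category]:
            plan.append((mode, c.jobname, category, action))
    return plan


candidates = [
    Candidate("TSTBATCH", 97.0),
    Candidate("PAYROLL1", 95.5),
    Candidate("RPTJOB", 93.0),
    Candidate("TSTQUIET", 12.0),
]
for entry in plan_recovery(candidates):
    print(entry)
```

In this sketch, the looping test job would be cancelled, the discretionary job would trigger a warning and a service class reset, and the payroll job is left alone even though its index is high. As long as `log_only` is true, the plan is only written out, which mirrors the idea of reviewing what would happen before you allow actual recovery.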
For more information about setting up this solution and how to use it, please refer to Loop