Managing looping address spaces

IBM Z® System Automation works with OMEGAMON® on z/OS® to detect and respond to looping address spaces.

Any address space can fall into a looping pattern, including a started task, batch job or TSO user. It can be difficult, however, to distinguish between a looping address space and an address space that is merely performing a CPU-intensive part of its normal execution sequence. Because mistakenly canceling a batch job can waste several CPU hours of work, for example, mechanisms are required to distinguish between real looping situations and false positives.

OMEGAMON on z/OS starts the detection cycle by calculating a CPU Loop Index value, which represents the amount of time that a job is using or waiting for the CPU. This value is designed to slowly increase over time, so that a high value indicates that the address space is doing nothing other than consuming CPU. Therefore, a high CPU Loop Index value indicates a high likelihood of a looping address space.

IBM Z System Automation allows you to define a pass-based recovery policy for address spaces with high CPU Loop Index values. This recovery policy categorizes each address space as one that should continue running, stop running, or be investigated further. The policy also specifies recovery actions for each of the categories, which can include various combinations of ignoring the address space, notifying operators, gathering diagnostics and automatically canceling the address space.
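The idea of a pass-based policy can be sketched in a few lines of Python. This is an illustration only and not the System Automation policy format: the category labels, the action names, and the assumption that a "pass" is one consecutive check in which the address space is still flagged are all hypothetical.

```python
from enum import Enum


class Category(Enum):
    """Hypothetical labels for the three outcomes the policy can assign."""
    KEEP = "continue running"
    STOP = "stop running"
    ANALYZE = "investigate further"


# Hypothetical policy: for each category, the actions to take on each pass.
# Assumption: a "pass" is one consecutive check in which the address space
# is still reported with a high CPU Loop Index. The final entry repeats.
POLICY = {
    Category.KEEP: [["ignore"]],
    Category.ANALYZE: [
        ["notify_operator"],
        ["notify_operator", "gather_diagnostics"],
    ],
    Category.STOP: [
        ["notify_operator"],
        ["gather_diagnostics"],
        ["cancel"],
    ],
}


def actions_for(category: Category, pass_number: int) -> list[str]:
    """Return the actions for a 1-based pass; stay on the final step afterwards."""
    if pass_number < 1:
        raise ValueError("pass_number starts at 1")
    steps = POLICY[category]
    return steps[min(pass_number, len(steps)) - 1]
```

With this sketch, `actions_for(Category.STOP, 1)` returns `["notify_operator"]`, while the third and every later pass returns `["cancel"]`, so a mistaken cancel requires the address space to stay flagged across several checks.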

Figure 1 shows how OMEGAMON on z/OS works with IBM Z System Automation to detect and respond to looping address spaces:
Figure 1. Managing looping address spaces
The diagram demonstrates a method for managing looping address spaces.
IBM Z System Automation first issues a SOAP query to the Tivoli® Enterprise Monitoring Server, requesting a list of address spaces with high CPU Loop Index values. For each address space that is returned, it applies the categorization policy and then takes one or more of the following actions, depending on the recovery policy:
  • Ignores the indication
  • Requests a detailed analysis from OMEGAMON on z/OS, which can identify the malfunctioning segment of the program
  • Notifies the operator that an address space is potentially looping
  • Requests that IBM® Workload Manager (WLM) adjust the service class of the address space
  • Requests that WLM suspend the address space
  • Stops the address space
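
The query-then-categorize flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's implementation: the SOAP query is replaced by an in-memory list, and the record fields, job-name patterns, action names and threshold value are all hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class AddressSpace:
    """One row of the (hypothetical) query result."""
    job_name: str
    loop_index: float


# Hypothetical categorization rules, checked in order: (job-name pattern, action).
RULES = [
    ("DB2*", "ignore"),
    ("PAY*", "notify_operator"),
    ("*", "stop"),
]


def categorize(space: AddressSpace) -> str:
    """Return the action of the first rule whose pattern matches the job name."""
    for pattern, action in RULES:
        if fnmatchcase(space.job_name, pattern):
            return action
    return "notify_operator"  # conservative fallback if no rule matches


def respond(spaces, threshold):
    """Stand-in for the query plus dispatch: map each flagged job to its action."""
    # Assumption: the real query returns only address spaces with a high index;
    # here a simple threshold filter plays that role.
    flagged = [s for s in spaces if s.loop_index >= threshold]
    return {s.job_name: categorize(s) for s in flagged}
```

For example, with an arbitrary threshold of 5.0, `respond([AddressSpace("DB2MSTR", 9.0), AddressSpace("PAYROLL1", 8.0), AddressSpace("TESTJOB", 7.5), AddressSpace("IDLEJOB", 0.5)], threshold=5.0)` returns `{"DB2MSTR": "ignore", "PAYROLL1": "notify_operator", "TESTJOB": "stop"}`. Ordering the rules from most to least specific keeps the catch-all pattern from overriding exemptions for known CPU-intensive jobs.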

A video is available on YouTube that provides additional details: Automating the suppression of a looping address space.

For additional information on these suite components, see the component products listed in the Overview.