Detecting CPU looping address spaces

Address spaces can occasionally fall into a CPU loop where a task executes instructions endlessly. Looping is usually unproductive, using CPU resources that could be better spent on other workloads. Looping may be a symptom of storage corruption within the application or of a design flaw that did not anticipate some rare set of environmental circumstances. Whatever the cause of a loop, it has proven difficult in the past to detect that an address space has begun to loop.

On the surface, detecting a CPU loop should be simple: just look for an address space that is using 100% CPU. On z/OS®, this is not a good strategy. Most logical partitions (LPARs) are defined with several logical processors (LPs) that can each run instructions for different dispatchable units such as tasks, enclaves, and service request blocks (SRBs). Under these circumstances, does 100% CPU refer to 100% of a single LP or 100% of all the logical processors in the LPAR? Typically, a looping address space is consuming a single logical processor.
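The ambiguity can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and the LP counts are assumptions chosen for the example, not values taken from any product. It shows how a job that saturates one logical processor appears when its usage is expressed as a share of the whole LPAR.

```python
def lpar_cpu_percent(single_lp_percent: float, logical_processors: int) -> float:
    """Express usage of one logical processor as a share of the whole LPAR.

    Hypothetical helper for illustration; assumes all LPs have equal capacity.
    """
    if logical_processors < 1:
        raise ValueError("an LPAR needs at least one logical processor")
    return single_lp_percent / logical_processors


# A looping job typically saturates a single LP (100% of that LP).
for lps in (1, 4, 8, 16):
    share = lpar_cpu_percent(100.0, lps)
    print(f"{lps:2d} LPs: looping job shows as {share}% of the LPAR")
```

With 8 logical processors, a job that fully consumes one of them registers as only 12.5% of the LPAR, so a threshold written against LPAR-wide usage would never fire.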

Another confounding factor is the z/OS dispatching algorithms, including Workload Manager (WLM), which actively try to distribute processor resources appropriately among all the competing dispatchable units. These algorithms interrupt a looping job to dispatch other work that is not getting the CPU resource it is due under the defined service policy. Eventually, the looping job is redispatched and squanders more CPU. But because of the interrupts, the job’s measured CPU percentage drops. Often, looping jobs are parked by the system policy so much that their measured CPU percentage is too low to be detected by simple threshold settings.

Another popular strategy for detecting CPU loops is to look for address spaces with high CPU usage and low or no I/O activity. But as already noted, it is not easy to define what “high CPU usage” means; can it be said with confidence that a job using little or no I/O is clearly misbehaving? What about an application that has much of its working data cached in memory so that it can be more responsive to transactions? This is a good performance strategy, but it means that the application will also present a profile of low or no I/O activity.

OMEGAMON XE on z/OS offers a metric, CPU Loop Index, which is designed to overcome these issues and make detecting CPU loops an easier task. The purpose of this metric is to characterize the intent of an address space to use the CPU. Looping jobs will show an unrelenting intent to use CPU to the exclusion of any other resource. Even when they are parked by WLM or other z/OS policy actions, their intent to use the CPU can be detected.
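The difference between measured CPU usage and intent to use the CPU can be sketched with a toy model. This is not the CPU Loop Index calculation, which is not described here; the state names (`using_cpu`, `cpu_delayed`, `idle`), the sample counts, and the function itself are assumptions made purely for illustration.

```python
from collections import Counter


def cpu_share_and_intent(samples: list[str]) -> tuple[float, float]:
    """Return (measured CPU %, CPU intent %) for a list of state samples.

    Toy model, not the product's formula. "Intent" here counts samples
    where the job is either running or ready to run but held off the CPU.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    counts = Counter(samples)
    total = len(samples)
    using = counts["using_cpu"]
    delayed = counts["cpu_delayed"]
    return 100 * using / total, 100 * (using + delayed) / total


# Hypothetical looping job: parked by the dispatcher 70% of the time,
# but it never waits on anything except the CPU.
looping = ["using_cpu"] * 3 + ["cpu_delayed"] * 7

# Hypothetical well-behaved job with cached data: same measured CPU,
# but it sits idle between transactions.
cached = ["using_cpu"] * 3 + ["idle"] * 7

print(cpu_share_and_intent(looping))  # (30.0, 100.0)
print(cpu_share_and_intent(cached))   # (30.0, 30.0)
```

Both jobs show the same measured CPU percentage and neither performs I/O, so a CPU threshold or a high-CPU, low-I/O rule cannot tell them apart. Only the looping job wants the CPU in every sample.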

The CPU Loop Index is calculated over 10 or more minutes for workspace reporting. For situations, the calculation is done over either 10 minutes or the refresh period of the situation, whichever is larger. There is no need to use persistence. Service classes determined to be of low importance automatically use longer calculation periods to avoid false positive indications.
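The window rule for situations amounts to taking the larger of two values. A minimal sketch follows, assuming intervals are expressed in seconds; the function and constant names are hypothetical. It does not model the longer periods used for low-importance service classes, because their length is not specified here.

```python
MIN_WINDOW_SECONDS = 10 * 60  # the 10-minute minimum stated above


def situation_window_seconds(refresh_seconds: int) -> int:
    """Return the calculation window for a situation, in seconds.

    Hypothetical helper: the larger of 10 minutes and the refresh period.
    """
    if refresh_seconds <= 0:
        raise ValueError("refresh period must be positive")
    return max(MIN_WINDOW_SECONDS, refresh_seconds)


print(situation_window_seconds(5 * 60))   # 600: the 10-minute minimum applies
print(situation_window_seconds(15 * 60))  # 900: the refresh period is larger
```

Because the window never drops below 10 minutes, a situation with a short refresh period still evaluates a value that already reflects sustained behavior.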