Solve CPU loops quickly with OMEGAMON Performance Management Suite by combining z/OS and MQ information
D.C.C.
The new IBM Tivoli OMEGAMON Performance Management Suite for z/OS V5.3.0 is generally available today. The suite provides both real-time and historical performance and availability management capabilities for your IBM z/OS operating system, networks, storage subsystems, IBM DB2, IBM CICS, IBM IMS, IBM MQ for z/OS, IBM Integration Bus for z/OS, and IBM WebSphere Application Server for z/OS. Its integrated tool set can help you manage your z/OS environment more efficiently and effectively. New enhancements include historical and integration capabilities in the Enhanced 3270 User Interface (Enhanced 3270UI), easier installation, configuration, and maintenance, and new functions that help reduce monitoring overhead. Additional enhancements are included in the individual Tivoli OMEGAMON products, as well as in the suites.
Here is an example of how IBM Tivoli OMEGAMON Performance Management Suite helps detect a looping problem in an IBM MQ application:
1. Jim, a subject matter expert, received a call from an operator, Annette, who reported a situation alert indicating that a queue depth was getting high. Jim logged on to the Enhanced 3270UI of IBM Tivoli OMEGAMON Performance Management Suite to take a look. In the Current Queue Manager Status view, he noticed that the Queue Health of queue manager Q721 was in Critical status, caused by a high depth queue.
2. Jim then selected the High Depth Queue Count column to see which queue had a high depth, and he noticed that the % Full value of queue P6.IN.Q1 was 91.2%.
3. He entered "S" to select the Queue Status Details for queue P6.IN.Q1. The Application with Open Handle for the Queue view showed that the queue was opened by two applications, A6INP and A6OUP. Because messages were piling up on the queue, the problem most likely lay with the application that reads from it, A6OUP, so its application details were the next thing to check.
4. To follow up on that suspicion, Jim entered "S" to select application A6OUP and view its application details. Besides the MQ information, the panel shows the z/OS Address Space CPU Details for the application, which makes it possible to zoom on the Job Name.
5. Zooming on the Job Name heading brings up a pop-up menu of options for navigating to various places in OMEGAMON/zOS, as well as some Take Action commands such as cancelling a task. The menu also includes an option to go to the Bottleneck Analysis for the address space, and Jim decided to use that option.
6. Option B took Jim to the Bottleneck Analysis panel. The Bottleneck Analysis for address space A6OUP showed that its CPU Loop Index was increasing. The CPU Loop Index is designed to make detecting CPU loops an easier task. The purpose of this metric is to characterize the intent of an address space to use the CPU. Looping jobs show a consistent intent to use the CPU to the exclusion of any other resource. Even when they are parked by WLM or other z/OS policy actions, their intent to use the CPU can be detected.
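The idea of "intent to use the CPU" can be illustrated with a small sketch. This is not the formula OMEGAMON uses for the CPU Loop Index, which is not described here; it only assumes that a sampler records one execution state per interval for an address space. The state names, the `cpu_intent_ratio` and `looks_like_loop` functions, the 0.95 threshold, and the sample data are all hypothetical.

```python
from collections import Counter

# Hypothetical execution states: a job shows "intent" to use the CPU when it
# is either running on a processor or queued for one (for example, while
# it is being held back by workload management).
CPU_INTENT_STATES = {"USING_CPU", "WAITING_FOR_CPU"}


def cpu_intent_ratio(samples):
    """Return the fraction of samples in which the job was using or queued for CPU."""
    if not samples:
        return 0.0
    counts = Counter(samples)
    intent = sum(counts[state] for state in CPU_INTENT_STATES)
    return intent / len(samples)


def looks_like_loop(samples, threshold=0.95):
    """Flag a job whose samples show almost nothing but CPU intent."""
    return cpu_intent_ratio(samples) >= threshold


# Hypothetical sample data, not taken from a real system.
looping_job = ["USING_CPU"] * 12 + ["WAITING_FOR_CPU"] * 8
normal_job = ["USING_CPU"] * 5 + ["IO_WAIT"] * 9 + ["IDLE"] * 6

print(cpu_intent_ratio(looping_job), looks_like_loop(looping_job))
print(cpu_intent_ratio(normal_job), looks_like_loop(normal_job))
```

Counting the queued-for-CPU samples together with the running ones is what lets a throttled loop still stand out: the job may receive little CPU time, yet it never waits for I/O or goes idle.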
So it looks like job A6OUP is looping. With OMEGAMON/zOS, Jim decided to Inspect what the job was doing at that moment.
7. The Inspect CPU Usage panel allows you to observe where in the executable code a z/OS address space is spending its time. The CPU Usage for ASID view contains data drilled down to the agent-selected level of granularity within each CSECT for the most active TCBs and sorted in descending order of CPU usage percentage. The Inspect agent returns data only for elements for which it observed CPU activities. Use the information in the view to identify where in the code an address space is spending its time.
From the table, Jim noticed that the load module CSQ4BCJC does not load any other module, so the loop is probably inside that module. This means that although job A6OUP has opened the P6.IN.Q1 queue for input, it is not doing any useful work with it.
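The kind of breakdown that the Inspect view reports, CPU usage percentage per location sorted in descending order, can be sketched as a simple aggregation over samples. This is an illustration only and not how the Inspect agent is implemented. The `cpu_usage_by_location` function, the CSECT names, and the sample counts are hypothetical; only the load module name CSQ4BCJC comes from the scenario.

```python
from collections import Counter


def cpu_usage_by_location(samples):
    """Aggregate (load_module, csect) samples into percentage rows.

    Each sample represents one observation of CPU activity. The result is a
    list of (load_module, csect, percent) tuples, highest percentage first.
    Locations with no observed CPU activity never appear in the output.
    """
    total = len(samples)
    if total == 0:
        return []
    counts = Counter(samples)
    rows = [
        (module, csect, round(100.0 * n / total, 1))
        for (module, csect), n in counts.items()
    ]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Hypothetical samples: every observation falls inside a single load module.
samples = [("CSQ4BCJC", "CSECT_A")] * 17 + [("CSQ4BCJC", "CSECT_B")] * 3

for module, csect, percent in cpu_usage_by_location(samples):
    print(f"{module:<10}{csect:<10}{percent:>6.1f}%")
```

When every row points at the same load module, as in this made-up data, the activity is confined to that module's own code, which is the pattern that led Jim to suspect a loop there.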
Jim reported the problem to the Application Team, asking them to check what was wrong with application A6OUP and whether he could cancel the job and restart it. After reviewing the symptoms, the Application Team Duty Manager told Jim that it was a known issue, that the PTF would be applied in the coming maintenance window, and that it was safe to cancel the job and restart it.