Don't be doomed to repeat history, use it! Use OMEGAMON XE on z/OS version 5.3.0 with history
Brian Kealy
One way or another we've all heard the quote “Those who don't know history are doomed to repeat it.” This is as true in managing problems in z/OS LPARs as anywhere else. History can teach us about recurring problems and about lingering problems. In this discussion we will show how the new OMEGAMON XE on z/OS version 5.3.0 Enhanced 3270 User Interface can help system managers identify DASD device contention, how long it has been happening, and who is involved.
Imagine for now that there is a z/OS operations support desk where our hero, Bob, receives a call asking why a client's jobs appear to be delayed.
Hi, I'm Bob. I work in operations support, and on 07/18/2014 at about 3 pm I received a call that a user's batch jobs BKEALIO1 and BKEALIO2 in the CVT53PLX sysplex were being delayed. I have the new OMEGAMON XE on z/OS version 5.3.0 with Near Term History support and the Enhanced 3270 User Interface application that allows me to explore my z/OS environment. This is how I used it to investigate what was happening.
I select the “V” option to see how my service classes are behaving right now.
Looking at Service Classes for Sysplex, I see nothing suspicious for the batch class, which should be where the jobs BKEALIO1 and BKEALIO2 are managed. But given the report, I decide to look further at batch period 2 for Workflow Analysis.
The menu tells me that option "D" will show me Workflow Analysis for the Service Class. This will tell me how various system resources like CPU, I/O activity, Enqueue, etc. are being used or contended for among the jobs in this service class.
Workflow Analysis for the BATCH service class shows me that the service class is experiencing over 31% I/O wait. I also see that there is significant I/O activity for device TDSL13. I decide to use the "H" option to explore near term history for this device. OMEGAMON XE on z/OS in this Enhanced 3270 User Interface shows me the last 2 hours of historical activity by default.
I see that the device has been experiencing a consistently high Activity Rate over these two hours. So now I would like to see who is using this device. I can use the "S" option on any row to see what address spaces are using the device at that time. Indeed, when I do that, I see that the batch jobs I was called about are delayed for the device. I also see that there are two started tasks contending for use of the device.
I'd like to learn more about when this contention started. I can use the history time configuration to look at the last 12 hours. I do this by selecting the View menu and, from there, selecting the History Timespan option. This gives me the History Selection pop-up.
I use option 2, set the period to the last 12 hours, and use OK to accept this change. When I refresh the Historical DASD Device Summary workspace and scroll to the bottom of the rows presented, I see that activity dramatically increased for this device at about 10:10 am.
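The scan I just did by eye, looking down the interval rows for the point where activity jumps, can also be sketched in a few lines of code. This is only a minimal illustration in Python, assuming the interval rows were available as (timestamp, activity rate) pairs. The function name `find_onset`, its thresholds, and the sample rates are all invented for the example and are not taken from the workspace or from any OMEGAMON interface.

```python
from datetime import datetime, timedelta


def find_onset(samples, factor=3.0, baseline_len=6, min_rate=1.0):
    """Return the timestamp of the first sample whose rate exceeds
    `factor` times the mean of the preceding `baseline_len` samples.

    `samples` is a time-ordered list of (timestamp, rate) pairs.
    `min_rate` is a floor on the threshold so that a near-idle
    baseline does not flag trivial activity. Returns None if no
    sample qualifies.
    """
    for i in range(baseline_len, len(samples)):
        window = [rate for _, rate in samples[i - baseline_len:i]]
        baseline = sum(window) / baseline_len
        threshold = max(factor * baseline, min_rate)
        timestamp, rate = samples[i]
        if rate > threshold:
            return timestamp
    return None


# Hypothetical 5-minute samples starting at 09:30 (illustrative values only).
start = datetime(2014, 7, 18, 9, 30)
rates = [5.0, 4.0, 6.0, 5.0, 5.0, 4.0, 6.0, 5.0, 150.0, 160.0, 155.0]
samples = [(start + timedelta(minutes=5 * i), r) for i, r in enumerate(rates)]

onset = find_onset(samples)
print(onset.strftime("%H:%M") if onset else "no jump found")
```

Comparing each interval against a short trailing average, rather than against a fixed number, keeps the check usable on devices with very different normal activity levels.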
Now I'm curious about what jobs were using the device at that time. I can use the "S" navigation on the 10:10 row to find out.
So the two started tasks I still see using the device started using it at about 10:10 am. I can use the navigation tab at the bottom of the screen to move forward in time, one 5-minute increment per jump. At around 11 am I see the batch jobs starting to use the device as well.
Moving another interval forward, I see that by 11:05 the two batch jobs are being significantly delayed for device usage by the started tasks. I can now tell the batch job user that his jobs are contending for I/O access with two started tasks. The owners of the jobs and the tasks should perhaps run them at different times or segregate their files so they are not on the same volume.
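Stepping forward one interval at a time until the batch jobs and the started tasks show up together is, in effect, a search for the first interval with shared device usage. A minimal sketch of that search in Python follows, assuming each interval is represented as a time label plus the set of address spaces using the device. The started task names `STCA` and `STCB` are placeholders, since the real names are not given here, and the interval contents are simplified for illustration.

```python
def first_contention(intervals, batch_jobs):
    """Return (label, other_users) for the first interval in which at
    least one of `batch_jobs` shares the device with some other
    address space, or None if that never happens.

    `intervals` is a time-ordered list of (label, set_of_names).
    """
    batch = set(batch_jobs)
    for label, users in intervals:
        others = users - batch
        if users & batch and others:
            return label, sorted(others)
    return None


# Placeholder started task names; the batch job names come from the call.
intervals = [
    ("10:10", {"STCA", "STCB"}),
    ("10:55", {"STCA", "STCB"}),
    ("11:00", {"STCA", "STCB", "BKEALIO1", "BKEALIO2"}),
    ("11:05", {"STCA", "STCB", "BKEALIO1", "BKEALIO2"}),
]

print(first_contention(intervals, ["BKEALIO1", "BKEALIO2"]))
```

With these sample intervals, the search reports the 11:00 interval and the two placeholder started tasks as the other users of the device.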