Sitworld: TEMS Audit Process and Tool
John Alvord, IBM Corporation
There have been cases every year where a TEMS was running with high enough CPU/Storage resource usage that the customer was concerned. In some cases, the TEMS experienced a steady storage growth and failure after some days. In all recent cases, this condition has been triggered by workloads or environmental conditions. Situations are the most common workload but SOAP calls, historical data collection and Portal Client workspaces can be involved. In addition there may be important error messages in the diagnostic log.
In 2008 I began work on a year old customer problem that took until Spring 2010 to resolve. The final conclusion was that a certain type of situation caused a severe storage fragmentation and a TEMS failure in 10-12 days. The customer decided to recycle the remote TEMSes once a week. That was probably the single most expensive PMR [for IBM and customer] I ever worked on.
Some 6 months later in Dec 2010 I had a customer with six AIX servers running at 95% utilization and all from remote TEMS processing. I was able to present a solution quickly. However the customer was unconvinced. I wrote a Perl program to summarize the results in a spreadsheet file. The customer was convinced, made the changes and those six systems dropped to 10% utilization. In March 2011 I published the process and tool as a technical note and it is now widely used.
TEMS Audit continues to be enhanced as new issues are encountered. The technical note as documentation became unwieldy so I reworked it as an install guide and a usage guide. All the recent changes have been documented and are included in the program objects below. The advisory messages are of special note since it points to specific issues or alternatively states there is no issue identified.
Here are recently published versions, In case there is a problem at one level you can always back up.
Correct some logic on incoming PostEvent messages
Handle a null SQL text continuation better
show more contributors to the resource interval report
Show duplicate agent online better - top 20
Add SimpleHeartbeat report - finds another type of duplicate agent
Add Major Jitter correlation report
Make command capture logic work on z/OS logs
Add result interval report
Add results interval count report
Restore ProcessTable report section
Add PostEvent report section
Add SQL report section and a maximum concurrent action command section
Add action command report section
Correct defect on z/OS log and tracking listen pipes
Add Soap Burst Advisory and SOAP Detail report
Add advisory for ulimit stack more then 10M
ProcessTable Summary, listen pipes. "No Matching Request" error summarized, Nofile advisory, improve -z option processing
1.10000 - last technote version
Advisory section. 16meg truncation warning
Identify and correct workload and configuration problems. I encourage anyone to share success stories, enhancement requests or problems found.
Note: Art Deco Cat sculpture