Sitworld: TEMS Audit Process and Tool
Version 1.22000 11 February 2014
John Alvord, IBM Corporation
There have been cases every year where a TEMS was running with CPU or storage usage high enough to concern the customer. In some cases, the TEMS experienced steady storage growth and failed after some days. In all recent cases, this condition has been triggered by workloads or environmental conditions. Situations are the most common workload, but SOAP calls, historical data collection and Portal Client workspaces can be involved. In addition, there may be important error messages in the diagnostic log.
In 2008 I began work on a year-old customer problem that took until spring 2010 to resolve. The final conclusion was that a certain type of situation caused severe storage fragmentation and a TEMS failure in 10-12 days. The customer decided to recycle the remote TEMSes once a week. That was probably the single most expensive PMR [for IBM and the customer] I ever worked on.
Some six months later, in December 2010, I had a customer with six AIX servers running at 95% utilization, all of it from remote TEMS processing. I was able to present a solution quickly. However, the customer was unconvinced. I wrote a Perl program to summarize the results in a spreadsheet file. The customer was convinced, made the changes, and those six systems dropped to 10% utilization. In March 2011 I published the process and tool as a technical note, and it is now widely used.
TEMS Audit continues to be enhanced as new issues are encountered. The technical note became unwieldy as documentation, so I reworked it as an install guide and a usage guide. All the recent changes have been documented and are included in the program objects below. The advisory messages are of special note since they point to specific issues or, alternatively, state that no issue was identified.
Here are recently published versions. If there is a problem at one level, you can always back up to an earlier one.
Add Soap Burst Advisory and SOAP Detail report
Add advisory for ulimit stack more than 10M
ProcessTable summary, listen pipes, "No Matching Request" error summarized, Nofile advisory, improved -z option processing
1.10000 - last technote version
Advisory section. 16meg truncation warning
Identify and correct workload and configuration problems. I encourage anyone to share success stories, enhancement requests or problems found.
Note: Art Deco Cat sculpture