Auditing TEMS for Improved Performance
By John Alvord, IBM Corporation
The following information has been updated and included in the new TEMS Audit distribution which is documented here.
People mostly ignore TEMS performance issues until the TEMS crashes or agents start going wildly online and offline. Until 2010 this was a constant challenge and issues would linger for months. At the end of 2010, I worked a perfect example. The customer had a single situation that, as it happened, caused six AIX servers to run at 95% utilization or higher. I saw from a diagnostic trace which situation was involved and why. An ad hoc summarization program demonstrated that 93% of the incoming workload came from that one situation, on each of the six AIX systems.
To help explain this to the customer, I exported the data from the analysis program into a comma-separated file. That way the customer could view a spreadsheet representation of the impact. The customer agreed to cut the situation into three pieces, so that the three situations together had the identical effect. The result was that utilization on all six AIX systems dropped to 10% or lower… not by 10%, but from 95% down to 10%! This was a case of a Too Big situation, where the WHERE clause was too large to send to the agent.
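The kind of summary described above can be sketched in a few lines. This is only an illustration, not the actual analysis program: it assumes the diagnostic trace has already been reduced to hypothetical `(situation_name, result_bytes)` pairs, and the function names, column names, and sample numbers are all made up for the example.

```python
import csv
import io
from collections import Counter


def summarize_workload(records):
    """Total the result bytes per situation and compute each share of the whole.

    `records` is an iterable of (situation_name, result_bytes) pairs. That
    input shape is an assumption for this sketch, not a real TEMS log format.
    Rows come back largest first.
    """
    totals = Counter()
    for situation, nbytes in records:
        totals[situation] += nbytes
    grand_total = sum(totals.values())
    rows = []
    for situation, nbytes in totals.most_common():
        share = 100.0 * nbytes / grand_total if grand_total else 0.0
        rows.append((situation, nbytes, round(share, 1)))
    return rows


def write_csv(rows, out):
    """Write the summary as comma-separated values for viewing in a spreadsheet."""
    writer = csv.writer(out)
    writer.writerow(["situation", "bytes", "percent_of_total"])
    writer.writerows(rows)


# Made-up sample data: one situation dominates the incoming workload.
sample = [("big_situation", 9300), ("other_a", 400), ("other_b", 300)]
rows = summarize_workload(sample)
buf = io.StringIO()
write_csv(rows, buf)
print(buf.getvalue())
```

Sorting largest first puts a dominant situation on the first row of the spreadsheet, which is what makes the impact obvious at a glance.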
These experiments resulted in a technote published in Spring 2011. It is now published in a blog post here.
There have been several years of experimentation and improvement. The technote contains a full description of the process and a Perl program to perform the analysis. Changes introduced over time were in response to specific customer cases or to performance concerns.
A number of customer and IBM sites use TEMS Audit regularly to identify issues before they have a high impact.
On 20 May 2013 I published version 1.00000. That version calculates the diagnostic log segments on the fly and copies them to a work directory to handle rare cases where segments are reused. With this level, you can run it periodically using a crontab task [Linux/Unix] or an AT command [Windows] and get a regular report on upcoming issues.
On 27 July 2013 I published version 1.05000. See below for a list of improvements. The trace report was added to measure a customer TEMS environment that was accidentally set up to run with maximum tracing continuously.
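To give a feel for what a trace rate measurement involves, here is a minimal sketch. It is not taken from TEMS Audit itself: it assumes each trace line has already been paired with its timestamp as a hypothetical `(epoch_seconds, line_text)` tuple, and the function name and sample entries are invented for the example.

```python
def trace_rate(entries):
    """Return (lines_per_minute, bytes_per_minute) over the observed time span.

    `entries` is an iterable of (epoch_seconds, line_text) pairs. That input
    shape is an assumption for this sketch; parsing real diagnostic log
    lines is out of scope here.
    """
    first = last = None
    lines = 0
    size = 0
    for ts, text in entries:
        first = ts if first is None else min(first, ts)
        last = ts if last is None else max(last, ts)
        lines += 1
        size += len(text.encode("utf-8"))
    if lines == 0:
        return 0.0, 0.0
    # Treat anything shorter than one minute as one minute to avoid
    # inflated rates from a very short capture.
    minutes = max((last - first) / 60.0, 1.0)
    return lines / minutes, size / minutes


# Made-up sample: four 4-byte lines spread over two minutes.
sample = [(0, "aaaa"), (30, "bbbb"), (60, "cccc"), (120, "dddd")]
lines_per_min, bytes_per_min = trace_rate(sample)
print(lines_per_min, bytes_per_min)
```

A rate that stays very high over a long capture is the sort of signal that would point to tracing left at maximum.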
On 12 August 2013 I published version 1.10000 with the Advisory section added.
12 August 2013 improvements
1) Add Advisory section
2) Fix hands off logic with Windows log
3) Add 16meg truncated result advisory
27 July 2013 improvements
1) Report on trace lines and trace size per minute
2) Add -inplace to skip copying diagnostic segments when not needed
3) Add count of remote SQL failures
4) Count "Filter object too big" messages in hands off mode and display counts
5) Correct defect when there are a few scattered historical data export messages
20 May 2013 improvements
1) Reports on historical exporting from TEMS
2) Reports on SOAP usage
3) Reports on historical exporting from Agent
4) Report on “Too Big” situations
5) Hands Off operation - operate directly off the active logs directory
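Counting the "Filter object too big" messages mentioned in the lists above is conceptually simple. The sketch below only assumes that the phrase appears somewhere in the affected log lines; the function name and the sample lines are made up, and real diagnostic log lines carry more detail than shown here.

```python
TOO_BIG_MARKER = "Filter object too big"


def count_too_big(lines):
    """Count the lines that contain the "Filter object too big" phrase."""
    return sum(1 for line in lines if TOO_BIG_MARKER in line)


# Made-up sample lines; only the marker phrase comes from the text above.
sample_lines = [
    "sample line A: Filter object too big",
    "sample line B: unrelated message",
    "sample line C: Filter object too big",
]
too_big = count_too_big(sample_lines)
print(too_big)
```

In practice the same function could be fed an open log file, since iterating over a file object yields its lines one at a time.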
I suggest everyone get this package and run it regularly. Everyone prefers to run without a crisis. Smooth running is best for everyone. If you have any interesting ideas on how to extend this work, I am always interested. Add a comment here or send an email.