Sitworld: TEMS Audit Process and Tool
John Alvord, IBM Corporation
There have been cases every year where a TEMS was running with high enough CPU/Storage resource usage that the customer was concerned. In some cases, the TEMS experienced a steady storage growth and failure after some days. This condition is usually caused by workloads, configuration or environmental conditions. Situations are the most common workload but SOAP calls, historical data collection and Portal Client workspaces can be involved. In addition, there may be important error messages in the diagnostic log.
In 2008 I began work on a year-old customer problem that took until Spring 2010 to resolve. The conclusion was that a certain type of situation caused a severe storage fragmentation and a TEMS failure in 10-12 days. The customer decided to recycle the affected remote TEMSes once a week. That was probably the single most expensive PMR [for IBM and customer] I ever worked on.
Some 6 months later in Dec 2010 I had a customer with six AIX servers running at 95% utilization and all from remote TEMS processing. The solution was presented but the customer was unconvinced. I wrote a Perl program to summarize the results in a spreadsheet file. The customer was convinced, made the changes and those six systems dropped to 10% utilization. In March 2011, I published the process and tool as a technical note and it is now widely used.
Since the last published version 1.37000 [28 advisories] 38 new advisories have been added and many report sections to explain the environment further. At level 1.64000 the documentation has been reworked for simplicity and clarity.
- The TEMS Audit report itself includes an explanation of each advisory and that is not duplicated in this document. The temsaud.pl program can be viewed to see each advisory explanation.
- The separate word file documents have been eliminated and information added here, revised and extended.
- Each report section is documented along with needed traces.
These changes will continue. The source in every case is based on diagnosing a specific customer issue and when possible adds an advisory or report section which would speed the diagnostic effort – even allowing the customer to correct the condition whenever possible. In recent months there have been several advisories and/or report sections added each month.
TL;DR - Too Long; Didn't Read
Sometimes you want immediate results and to heck with the details. Here are the instant use instructions. Unpack the most recent zip file and copy the temsaud.pl and testaud.ini files to a convenient location where Perl is installed [all Unix and Linux environments] and on the system where a TEMS is running. Let’s pick /tmp and the TEMS is installed at the default directory. Now run this command
perl temsaud.pl -v -logpath /opt/IBM/ITM/logs
and if there should happen to be more than one candidate inventory file,
perl temsaud.pl -v -logpath /opt/IBM/ITM/log hostname_ms_kdsmain.inv
and then view the report file temsaud.csv. The end of that file includes an explanation of each advisory. If you want more information about the TEMS workload then add the needed traces documented later.
Linux and Unix systems usually have Perl installed.
On Windows, you can install the community edition at www.activestate.com. You can also just copy the needed files to a system where Perl is installed.
For a z/OS TEMS trace, capture the RKLVLOG sysout log into a file and run temsaud.pl with the -z option. I worked to make that automatic but have not found the secret sauce.
Perl is usually preinstalled in Linux/Unix systems. . For Windows you may need to install from www.activestate.com or any other source. Alternatively you can copy the needed log files to a system where Perl is installed. The program only uses Perl core services, No CPAN modules are needed.
TEMS Audit has been tested with
This is perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x64-multi-thread
zLinux with Perl v5.8.7
There are two TEMS Audit program files in a zip file here.
1) temsaud.pl Perl program
2) temsaud.ini Option file for setting nominal values
Install them somewhere convenient.
Run Time Options.
There are many but temsaud.pl does not require any of them.
-b Show HEARTBEATs in Managed System report
-cmdall When cmd trace is present, report all commands
-expslot <mins> Historical export report, display slot size - default 60 mins
-flip Produce detailed report on agent location flipping, ITM 630 FP7
-h display help information
-jitall produce advisory on minor level heartbeat jitter
-jitter produce jitter reports, default no jitter reports
-logpath Directory path to TEMS logs – default current directory
-o Specify the output report filename
-odir Specify directory where output files will be stored
-ossdir What directory to write send status report
-nohdr Remove variable header lines – used for regression testing
-rd Result detail reports
-ri <secs> result workload report by interval, default 60 seconds
-rdslot Seconds in result detail slots, default 60 seconds
-sqldetail produce a SQL detail report by source and table involved
-sr SOAP Report minute by minute - soap_detail.txt
-ss produce a report send status – used for multiple remote TEMSes
-sum produce a one line summary file
-v Produce some progress messages in STDERR
-work Copy files to work directory before analyzing.
Environment variable Windows – TEMP, Linux/Unix tmp
-workpath Directory path to work directory, default is the system
-z z/OS RKLVLOG log
-eph Capture ephemeral file report
-ephdir Where to store ephemeral file report
The remaining parameter [if present] is a log file specification. Often no specification is needed. It will look at the -logpath setting and determine the current logs from the inventory file. Sometimes there is more than one candidate and you can specify it like
You can also specify log files to study as a partial name such as
That setting analyzes multiple log sections with the same file name starting string. That can be extremely useful of the .inv file is missing or was not sent with the diagnostic files to be studied. It is also the way to analyze a diagnostic log set that is not the most current.
In hands-free operation, do not specify a specific file. Instead use -logpath to specify a path to a normal TEMS logs directory or leave it off for the current directory. In hands-free mode the directory is examined to determine the current active log segments. If there are multiple .inv files the one with the most recent diagnostic logs is chosen. The log segments are processed in the sequential time order. When -work option is set, the diagnostic logs are copied to a work directory to avoid the issue where a log is deleted while being used. Specify the. inv file in case there are more than one candidate.
Note: If there is a one hour or higher time gap between the first and next segment, the first segment is processed only for certain messages. That said there are times when you will need to manually assemble a single combined log to analyze. That can be true if traces are turned on dynamically. Some of the reports use the time to calculate rates and isolating just the traced log makes that more accurate.
The default output report is temsaud.csv. Use the -o option to specify your own name for the report.
One section of the report is advisories. If the maximum advisory impact level is equal or greater than 50, the program ends with exit code 1 and otherwise 0. That can be used in shell files to control other actions – like sending email.
perl temsaud.pl cmsall.log
perl temsaud.pl -v
Reports and Diagnostic tracing needed
The reports are intended to be self-contained and easy to use. With no tracing beyond the default ERROR you get a lot of useful information. To get full value you need to set tracing. See post on how to set tracing. The following shows a report and how to get extra value by setting diagnostic tracing.
This labels the report and TEMS Audit version number, when run, the runtime parameters, the logs used [if -v is specified] and a line which is the TEMS nodeid and *LOCAL for hub TEMS or *REMOTE for remote TEMS. Occasionally the last line is missing because not in the logs. The header can be suppressed using the -nohdr option.
TEMS Audit report v1.63
Start: 2017-05-24 15:44:38
Runtime parameters: -sum -rd -flip -odir /ecurep/tmp/ITM_Health/Awn0gXM2ky/ -logpath /ecurep/pmr/4/8/48002,080,678/2017-05-24/48002.080.678.pdcollect-w94486.tar.Z_unpack/autopdzip/autopd/IBM/ITM/logs/logs w94486_ms_59243128
"working on 0 /ecurep/pmr/4/8/48002,080,678/2017-05-24/48002.080.678.pdcollect-w94486.tar.Z_unpack/autopdzip/autopd/IBM/ITM/logs/logs/w94486_ms_59243128-01.log"
The header is followed by advisory messages. These are indications of errors and warnings from the diagnostic log. The advisories are sorted in reverse order by impact. These impact values selected based on years of experience however feel free to use you own judgement. These are Advisory and not The Law. The following is just one recently created. There are 56 advisories at ths\is writing and more are added all the time.
Advisory Message Report - *NOTE* See advisory notes at report end
200,TEMSAUDIT1052E,HUB,Hub TEMS has lost connection to HUB 84 times
100,TEMSAUDIT1050W,PCB,Agent connection churning on  systems total - See following report
90,TEMSAUDIT1055W,Agent,Agent Location Flipping Changes Detected  - See following report
90,TEMSAUDIT1061E,TEMS,Situation [ddb_sanprb_x07c_aix_gen] with unknown attribute [K07_SANPATH.MonitoringSolution] - [*IF ( ( *VALUE K07_SANPATH.RC *EQ '20' ) *OR ( *VALUE K07_SANPATH.RC *EQ '21' ) *OR ( *VALUE K07_SANPATH.RC *EQ '22' ) *OR ( *VALUE K07_SANPATH.RC *EQ '23' ) ) *AND *VALUE K07_SANPATH.MonitoringSolution *EQ 'aix']
80,TEMSAUDIT1016W,Onlines,3 Agents with repeated onlines
80,TEMSAUDIT1035W,diagnostic,7 worrying diagnostic messages - see later report section
80,TEMSAUDIT1054W,SOAP,SOAP Errors Types  Detected - See following report
50,TEMSAUDIT1021W,timeout,4 Agent time out messages
0,TEMSAUDIT1005W,Stack,ulimit stack  is above nominal  (kbytes)
The pattern of each advisory is an impact level, an advisory identification, a category and a text message. Some reference later report sections. The first one above
200,TEMSAUDIT1052E,HUB,Hub TEMS has lost connection to HUB 84 times
Impact: 200 The highest current level.
ID: TEMSAUDIT1052E a sequential number
Category: HUB the hub TEMS is involved
Text: Hub TEMS has lost connection to HUB 84 times
At the end of the temsaud.csv report you find the following explanations of each advisory. Here is the one for the issue above.
Advisory code: TEMSAUDIT1052E
Text: Hub TEMS has lost connection to HUB count times
(58D77E39.0002-11A0:ko4mgtsk.cpp,133,"ManagedTask::sendStatus") Connection to HUB lost - stopping situation status insert
Meaning: This is produced when one hub TEMS task can no longer connect with the dataserver function of the hub TEMS. This is a severe error which is only recovered with a hub TEMS recycle.
The hub TEMS will not recover without a recycle.
This is a most severe issue and needs prompt diagnosis. It can be caused by many reasons including TEMS database problems, network problems and excessive workload.
Recovery plan: Contact IBM support for aid in diagnosis.
<end of advisory explanation>
The reports are documented in the temsaud.csv report at the end. Here is an example:
Text: Unable to get attributes for table <table> [<count> times]
(591ACD2E.0000-7C:kshcat.cpp,296,"RetrieveTableByTableName") Unable to get attributes for table tree TOBJACCL
Meaning: The SOAP request failed. This hints of an out of storage
issue, a broken attribute collection or mssing application
Recovery plan: Work with IBM Support to resolve issue.
Tracing Guide in a Separate Document
The Tracing guide is documented in a separate post because of a document size limit:
The tracing appendix is in a separate post because of a document size limit. It shows exactly how to set diagnostic traces for TEMS Audit both statically and dynamically.
This document shows how to collect performance data from a TEMS. As a result you can make operations more efficient and complete.
Here are recently published versions, In case there is a problem at one level you can always back up.
Improve report explanation on churning report
Add advisory and report on nodelist missing messages
Exclude (NULL) results from always true report
Add advisory on STH corrupted rows
always true report/advisory only when result row tracing present
Improve RB logic based on report ill-logical results
Add Situation True advisory and report
only product FTO Control message if there are any
Add FTO control message report<Correct Local Time calculations
Add PostEvent node status advisory and support
test advisory 1067 logic
Advisory when KBB_RAS1 starts with a single quote
handle RB capture better on earlier maintenance levels
Add portscan report over time
Add report of recent install and config operations
Advisory on KDEB_INTERFACELIST and KDCB0_HOSTNAME
Advisory on same agent name different affinity
Added inline report explanations.
Detect Seed file messages which mean duplicate index messages probably not important
Eliminate Bad Port report and add Multiple Listening port report
Add KFA Validity Node duplicate agent report
Revise GSKit error - handle all error codes
Capture ephemeral port information - what is physical system/port
More node status reports including base listening port usage
Extend some of the node status reports for easier access to diagnostic logs
and to allow some cases to be diagnosed immediately.
Report on churning based on 95% of total
make jitter report(s) optional
Add RB FindDupAgents.rex logic
A year of changes, too many to list here. See the temsaud.pl history section if interested.
Restructure for improved standalone running. Add several new advisories low nofiles and concurrent action commands
restructure advisory messages, add advisories for TEMS database errors
Correct some logic on incoming PostEvent messages
Handle a null SQL text continuation better
show more contributors to the resource interval report
Show duplicate agent online better - top 20
Add SimpleHeartbeat report - finds another type of duplicate agent
Add Major Jitter correlation report
Make command capture logic work on z/OS logs
Add result interval report
Add results interval count report
Restore ProcessTable report section
Add PostEvent report section
Add SQL report section and a maximum concurrent action command section
Add action command report section
Correct defect on z/OS log and tracking listen pipes
Add Soap Burst Advisory and SOAP Detail report
Add advisory for ulimit stack more then 10M
ProcessTable Summary, listen pipes. "No Matching Request" error summarized, Nofile advisory, improve -z option processing
1.10000 - last technote version
Advisory section. 16meg truncation warning
Identify and correct workload and configuration problems. I encourage anyone to share success stories, enhancement requests or problems found.
Note: Art Deco Cat sculpture