Sitworld: TEPS Audit
John Alvord, IBM Corporation
A recent case involved very slow Portal Client response time and general TEMS instability including failures. After some study the root cause turned out to be duplicate agent names.
ITM architecture depends unique agent names - no duplications. However, there is no logic that enforces that rule and indeed duplicate agent names occur in normal processing. For example, if an agent with two remote TEMS configured loses contact with one remote TEMS because of some communications outage, it can connect via a secondary remote TEMS. During that reconnection logic, the hub TEMS will note the duplicate agent name and accept it. Later on the first remote TEMS will notice the agent going offline [node status update missing], update the hub TEMS the agent is offline and the hub TEMS does the right thing and ignores it. So that duplicate agent name case is a transient normal condition.
However, this case involved some 15 MQ agents all of which had been configured with the same name. This causes the duplicate name logic to occur steadily, 15 times every 10 minutes by default. This project reports duplicate names using a very simple TEPS trace and a report program.
The TEMS must do work to manage the duplicate agent condition: stopping situations on the earlier agent instance and starting on the new agent instance. The TEPS maintains a topology of the ITM Agents and that topology map must be recalculated with every new arrival. Navigator Updates Pending alerts in Portal Client are also seen. In this case the effect was so strong that Portal Client was experienced very poor performance. In addition and worst, only one agent at a time was actually running situations. As the agents swapped in and out of connection situations kept starting and that produced a steady stream of situation events all apparently from the "same" agent and duplicating earlier events.
I worked on this once before Diagnosing the "Navigator Updates Pending" message and that technote has been very effective. However it requires manual work and analysis to correct.
I also added a potential duplicate agent report to the most recent 1.16000 level of Sitworld: ITM Database Health Checker. That report identifies some cases [agents reporting to different remote TEMSs and Agents reporting with different maintenance levels] but not others [agents with same maintenance level reporting to same remote TEMS]. Therefore something more was needed,
This project works with a TEPS trace and produces a full report on duplicate agent names. That works because the TEPS receives a stream of Situation Status Updates as each agent goes online and perhaps later offline.
At 0.53000 a summary of incoming events is provided. This can be very useful in diagnosing a TEPS that is overloaded with "too many events". More importantly, when there are event floods that usually means a violation of the prime ITM design: situation events should reflect rare and unusual events which can be worked on to prevent in the future. A violation causes excess work at the agent, the TEMS, the TEPS and the people who attempt to manage the event workload.
Preparing for the install
Perl is usually pre-installed in Linux/Unix systems. . For Windows you may need to install from www.activestate.com or any other source. The program only uses Perl core services and no CPAN modules are needed.
TEPS Audit has been tested with
This is perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x64-multi-thread
zLinux with Perl v5.8.7
A zip file is found found here. There is one file tepsaud.pl. Install it somewhere convenient.
Run Time Options
-h display help information
-v Produce some progress messages in STDERR
-log Produce a log file, default yes
-logpath Directory path to TEMS logs – default current directory
-nohdr Remove variable header lines – used for regression testing
-istone Show instance=1 cases
The remaining parameter [if present] is a log file specification. The log files to study can be specified as a partial name such as HUB_ibms98htems_ms_52052f96. That setting analyzes multiple log sections with the same file name starting string.
In hands-free operation do not specify a specific file. Instead use –logpath to specify a path to a normal TEPS logs directory or leave it off for the current directory. In this mode the directory is examined to determine the current active log segments. The log segments are processed in the sequential time order.
Note: If there is a one hour or higher time gap between the first and next segment, the first segment is processed only for certain error messages.
The report is written to tepsaud.csv in the current directory.
Run command examples:
perl tepsaud.pl cmsall.log
perl tepsaud.pl -v
Configuring TEPS trace for Duplicate Agent Report and Situation Event report
The diagnostic trace required is this: ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)
See Appendix 1 on how to enable that trace. Generally a one hour trace is sufficient for this purpose.
In addition, if you have any of the following set in cq.ini
remove or comment them out and restart the TEPS before beginning the capture. They suppress the raw data we need to create the report.
TEPS Audit Report
The report has a lot of details, so a screen image is tough to understand.
There are two sections. The first is the advisory messages [eventually there will be more messages].
|Object||relates to - agent in this case||ibm_test:LZ|
|Message||Node has connected via 62 ip addresses or versions or thrunodes: Likely duplicate name configuration error|
The TEPS Node Status Update Report follows. Each agent has a first line
|Agent Name||Managed System Name||ibm_test:LZ|
|Node Status Count||291|
Instances represent how the agent presented including four pieces of data
Thrunode [often remote TEMS]
Affinities [type of agent]
Hostaddr [including ip address
|(blank)||to line up with header line|
|Node Status Count||for this instance||7|
|Online Count||for this instance||7|
|Offline Count||for this instance||0|
|Product||Two character Agent Code||LZ|
|Version addendum||Other version information||A=00:lx8266;C=06.23.03.00:lx8266;G=06.23.03.00:lx8266;|
|Thrunode||How the agent connected||REMOTE_ibmtest101|
|HostAddr||ip address and hostname||ip.spipe:#188.8.131.52<NM>ibm_test</NM>|
|HostInfo||type of host||Linux~|
|Affinities||Type of agent and TEMA||%IBM.STATIC134 000000000I000Jyw0a7|
|Timestamps||When update occurred||1150911113710000:1150911114710000: ...|
IP and HOSTNAME and Property change report
There are cases where TEPS produces a shorter statement about conditions where an agent changes ip address, host name, or affinities change. If these are spotted a report section is generated for each. For example
The IP Change report starts with a header line
|Node||Managed System Name||ibm390:LZ|
The first line is followed by instances to represent the observed change
|(blank)||to line up with first line|
|IP Address||ip address of agent||127.0.0.1|
|Count||number of times seen||
In these cases, the from ip address and the to ip address are counted separately. Both are involved and need to be investigated. If the from ip address is blank, the report is ignored since that is the common condition when an agent first connects.
There are similar reports for hostname changes and property changes. These often overlap and solving one case will solve all of them.
Situation Event Report
The TEPS Situation Event Report follows. Each situation name has a first line
|Situation Name||Managed System Name||ibm_test_situation|
|Situation event count||Count of status||291|
The first line is followed by instances to represent how and when the situation event status was observed
|(blank)||to line up with header line|
|Node||Agent where result calculated||Primary:IBM-XXX1397-S:NT|
|Atomize||option distinguishing element||arc_IBM_22|
|Count||number of status||9|
|Open Count||situation event opens||5|
|Close Count||situation event closes||4|
|FullName||display Long situation name||ibm_test_situation|
|Sitmon Node||which TEMS made event||REMOTE_-IBM0067-S|
|Time_Status||Time and event status||1151221143708000_Y:1151221143938003_N|
The Time_Status is the ITM time stamp when processed by the TEMS and the Situation Event status. Y=Open, N=Close, S=Start, P=Stop and some rarer cases
Recovery Action Plan - Duplicate Agents
The first case was extreme. There were 115 agent names which showed with more than one connection instance. The next worst two were 62 and 54 connection instances.
The recovery action was very straightforward although laborious. Proper running of ITM requires that each agent have a different agent name. Each case needs to be investigated and corrected. For example in the case above - ibm_test:LZ - each of the 62 systems must be logged on and have the configuration manually changed and the agent recycled. In most cases, CTIRA_HOSTNAME is not needed. The hostname of the system is naturally different and without CTIRA_HOSTNAME specified the agent name will reflect the name of the agent. If you do to need to pick a CTIRA_HOSTNAME - it must be selected so all agents have unique names. In this case it is probably specified in the <installdir>/config directory in the lz.ini file or possibly in the lz.environment file. Edit that file, delete that CTIRA_HOSTNAME line or specify it so the names are unique.
Do that for every single case of duplicate names... using the report.
When you think you have made progress, redo the TEPS trace and see if any items remain. As the duplicate agent conditions are resolved ITM will begin running better and better.
The final step is to determine why this duplication exists. Often a "golden system image" is created and then cloned. Part of that process *MUST* be to configure the new agent so it has a unique name. Investigate that process thoroughly to avoid future problems.
This case was extreme, but I have seen a few such cases in almost every large ITM environment.
Recovery Action Plan - Excess Situation Events
In another case, the external symptom was Portal Client giving bad response times and sometimes failing. The Situation Status report showed that 70% of the situation event workload was coming from a single situation. It was alerting on cases where Windows services were in a non-running state. The summary showed that the events were opening and closing quite often. The situation was running on a lot of Windows systems and the combined effect was to bury the TEPS. At the same time, the situation was generating a lot of situation opens and quickly event closes. The recovery plan was to rethink that situation so it only alerted on rare and exceptional cases. The plan was to make the situation less complex by splitting into multiple situations, increase the persist count, and add more tests to ignore common conditions that did not need an alert.
Like many real life cases, once you have the information about what is driving the workload you can figure out how to make things better. In this case there were 550+ active situations and just a single one was causing the problem.
Identify and correct agent duplicate name configuration issues. If you find any anomalies which are hard to correct, please contact the author.
Here are recently published versions, In case there is a problem at one level you can always back up.
Avoid a divide by zero condition
Handle mixed trace diagnostic threads
Create Situation Event report section
warn on different ports from same system
look for two different forms for inventory file
Appendix 1 – Turning Tracing On and Off
A trace filter looks like this
ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)
The first token is a general level of tracing. The unit: specifies a local source unit. An optional Entry= specifies a function name.
From ITM 623 FP2 and ITM 630 GA there is a tacmd function to turn on tracing. You must do a tacmd login first.
Turn on trace:
Linux/Unix: ./tacmd settrace -m <tepsname> -p KBB_RAS1 -o 'ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)'
Windows: tacmd settrace -m <tepsname> -p KBB_RAS1 -o "ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)"
[one long line]
Turn off trace
Linux/Unix: ./tacmd settrace -m <tepsname> -p KBB_RAS1 –r
Windows: tacmd settrace -m <tepsname> -p KBB_RAS1 -r
A second way is to use the service console. If you are unfamiliar see this technote:
ras1 set ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)
ras1 set ERROR (UNIT:ctcmw ANY ER ) (UNIT:kv4 ANY ER)
The general rule is that an “any” specification will remove the effect of any prior set(s). The er [error] should be added back.
Sometimes you cannot use the diagnostic console. Here is what you do for the different platforms.
Add this to the TEPS ini – which has the name cq.ini
KBB_RAS1= ERROR (UNIT:ctcmw ANY ER ) (UNIT:kv4 ANY ER)
In MTEMS GUI
- right click on TEMS line
- Select Advanced
- Select Edit Trace Parms
- In the “Enter RAS1 Filters:” input area add
ERROR (UNIT:ctcmw ANY ER ) (UNIT:kv4 ANY ER)
To remove the traces, restore the line to “error” in all cases and recycle the TEPS.
Note: Electrical Heart of a New Cruise Ship