Sitworld: TEPS Audit
John Alvord, IBM Corporation
A recent case involved very slow Portal Client response time and general TEMS instability including failures. After some study the root cause turned out to be duplicate agent names.
ITM architecture depends unique agent names - no duplications. However, there is no logic that enforces that rule and indeed duplicate agent names occur in normal processing. For example, if an agent with two remote TEMS configured loses contact with one remote TEMS because of some communications outage, it can connect via a secondary remote TEMS. During that reconnection logic, the hub TEMS will note the duplicate agent name and accept it. Later on the first remote TEMS will notice the agent going offline [node status update missing], update the hub TEMS the agent is offline and the hub TEMS does the right thing and ignores it. So that duplicate agent name case is a transient normal condition.
However, this case involved some 15 MQ agents all of which had been configured with the same name. This causes the duplicate name logic to occur steadily, 15 times every 10 minutes by default. This project reports duplicate names using a very simple TEPS trace and a report program.
The TEMS must do work to manage the duplicate agent condition: stopping situations on the earlier agent instance and starting on the new agent instance. The TEPS maintains a topology of the ITM Agents and that topology map must be recalculated with every new arrival. Navigator Updates Pending alerts in Portal Client are also seen. In this case the effect was so strong that Portal Client was experienced very poor performance. In addition and worst, only one agent at a time was actually running situations. As the agents swapped in and out of connection situations kept starting and that produced a steady stream of situation events all apparently from the "same" agent and duplicating earlier events.
I worked on this once before Diagnosing the "Navigator Updates Pending" message and that technote has been very effective. However it requires manual work and analysis to correct.
I also added a potential duplicate agent report to the most recent 1.16000 level of Sitworld: ITM Database Health Checker. That report identifies some cases [agents reporting to different remote TEMSs and Agents reporting with different maintenance levels] but not others [agents with same maintenance level reporting to same remote TEMS]. Therefore something more was needed,
This project works with a TEPS trace and produces a full report on duplicate agent names. That works because the TEPS receives a stream of Situation Status Updates as each agent goes online and perhaps later offline.
There will likely be later additions to TEPS Audit but this is a valuable starting point.
Preparing for the install
Perl is usually pre-installed in Linux/Unix systems. . For Windows you may need to install from www.activestate.com or any other source. The program only uses Perl core services and no CPAN modules are needed.
TEPS Audit has been tested with
This is perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x64-multi-thread
zLinux with Perl v5.8.7
A zip file is found found here. There is one file tepsaud.pl. Install it somewhere convenient.
Run Time Options
-h display help information
-v Produce some progress messages in STDERR
-log Produce a log file, default yes
-logpath Directory path to TEMS logs – default current directory
-nohdr Remove variable header lines – used for regression testing
-istone Show instance=1 cases
The remaining parameter [if present] is a log file specification. The log files to study can be specified as a partial name such as HUB_ibms98htems_ms_52052f96. That setting analyzes multiple log sections with the same file name starting string.
In hands-free operation do not specify a specific file. Instead use –logpath to specify a path to a normal TEPS logs directory or leave it off for the current directory. In this mode the directory is examined to determine the current active log segments. The log segments are processed in the sequential time order.
Note: If there is a one hour or higher time gap between the first and next segment, the first segment is processed only for certain error messages.
The report is written to tepsaud.csv in the current directory.
Run command examples:
perl tepsaud.pl cmsall.log
perl tepsaud.pl -v
Configuring TEPS trace for Duplicate Agent Report
The diagnostic trace required is this: ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)
See Appendix 1 on how to enable that trace. Generally a one hour trace is sufficient for this purpose.
TEPS Audit Report
The report has a lot of details, so a screen image is tough to understand.
There are two sections. The first is the advisory messages [eventually there will be more messages].
|Object||relates to - agent in this case||ibm_test:LZ|
|Message||Node has connected via 62 ip addresses or versions or thrunodes: Likely duplicate name configuration error|
The TEPS Node Status Update Report follows. Each agent has a first line
|Agent Name||Managed System Name||ibm_test:LZ|
|Node Status Count||291|
Instances represent how the agent presented including four pieces of data
Thrunode [often remote TEMS]
Affinities [type of agent]
Hostaddr [including ip address
|(blank)||to line up with header line|
|Node Status Count||for this instance||7|
|Online Count||for this instance||7|
|Offline Count||for this instance||0|
|Product||Two character Agent Code||LZ|
|Version addendum||Other version information||A=00:lx8266;C=06.23.03.00:lx8266;G=06.23.03.00:lx8266;|
|Thrunode||How the agent connected||REMOTE_ibmtest101|
|HostAddr||ip address and hostname||ip.spipe:#220.127.116.11<NM>ibm_test</NM>|
|HostInfo||type of host||Linux~|
|Affinities||Type of agent and TEMA||%IBM.STATIC134 000000000I000Jyw0a7|
|Timestamps||When update occurred||1150911113710000:1150911114710000: ...|
Recovery Action Plan
This particular case was extreme. There were 115 agent names which showed with more than one connection instance. The worst two were 62 and 54 connection instances.
The recovery action is very straightforward although laborious. Proper running of ITM absolutely requires that each agent have a different agent name. Each case needs to be investigated and corrected. For example in the case above - ibm_test:LZ - each of the 62 systems must be logged on and have the configuration manually changed and the agent recycled. In most cases, CTIRA_HOSTNAME is not needed. The hostname of the system is naturally different and without CTIRA_HOSTNAME specified the agent name will reflect the name of the agent. If you do to need to pick a CTIRA_HOSTNAME - it must be selected so all agents have unique names. In this case it is probably specified in the <installdir>/config directory in the lz.ini file or possibly in the lz.environment file. Edit that file, delete that CTIRA_HOSTNAME line or specify it so the names are unique.
Do that for every single case of duplicate names... using the report.
When you think you are nearly complete, redo the TEPS trace and see if any items remain. As the duplicate agent conditions are resolved ITM will begin running better and better.
The final step is to determine why this duplication exists. Often a "golden system image" is created and then cloned. Part of that process *MUST* be to configure the new agent so it has a unique name. Investigate that process thoroughly to avoid future problems.
This case was extreme, but I have seen a few such cases in almost every large ITM environment.
Identify and correct agent duplicate name configuration issues. If you find any anomalies which are hard to correct, please contact the author.
Here are recently published versions, In case there is a problem at one level you can always back up.
Appendix 1 – Turning Tracing On and Off
A trace filter looks like this
ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)
The first token is a general level of tracing. The unit: specifies a local source unit. An optional Entry= specifies a function name.
From ITM 623 FP2 and ITM 630 GA there is a tacmd function to turn on tracing. You must do a tacmd login first.
Turn on trace:
Linux/Unix: ./tacmd settrace -m <tepsname> -p KBB_RAS1 -o 'ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)'
Windows: tacmd settrace -m <tepsname> -p KBB_RAS1 -o "ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)"
[one long line]
Turn off trace
Linux/Unix: ./tacmd settrace -m <tepsname> -p KBB_RAS1 –r
Windows: tacmd settrace -m <tepsname> -p KBB_RAS1 -r
A second way is to use the service console. If you are unfamiliar see this technote:
ras1 set ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)
ras1 set ERROR (UNIT:ctcmw ANY ER ) (UNIT:kv4 ANY ER)
The general rule is that an “any” specification will remove the effect of any prior set(s). The er [error] should be added back.
Sometimes you cannot use the diagnostic console. Here is what you do for the different platforms.
Add this to the TEPS ini – which has the name q.ini
KBB_RAS1= ERROR (UNIT:ctcmw ANY ER ) (UNIT:kv4 ANY ER)
In MTEMS GUI
- right click on TEMS line
- Select Advanced
- Select Edit Trace Parms
- In the “Enter RAS1 Filters:” input area add
ERROR (UNIT:ctcmw ANY ER ) (UNIT:kv4 ANY ER)
To remove the traces, restore the line to “error” in all cases and recycle the TEPS.
Note: Electrical Heart of a New Cruise Ship