Sitworld: Real Time Detection of Duplicate Agent Names
John Alvord, IBM Corporation
Once again I worked on a case where ITM misbehaved because some agents used duplicate names. This particular case involved "false alerts" where a situation event was observed: a missing process on a Linux system. When investigated, the Linux system did have that process running, so it was a false positive alert. These cases waste everyone's time and degrade the monitoring experience. After considerable time this was determined to be a duplicate agent name case: there were two different systems, one with the process missing and the other with it running. Each agent had the same name, and so the investigation was against the wrong system. There were 100+ such cases. The effort consumed meetings over several months and wasted time and energy.
Here is a list of observed problems over the last few years collected by a colleague:
Agents going offline
Agents going offline and online repeatedly
Agents switching back and forth between TEMSes
Situation does not fire as expected
Situation fires unexpectedly
Situation does not start as expected
The data in the situation is not correct
Agent does not respond to requests
RTEMS does not respond to requests
RTEMS is hung
RTEMS is disconnected
HUB does not respond to requests
HUB is hung
Unstable ITM environment
TEP shows many navigator updates pending
TEP agent positioning flipping around
High CPU or network usage related to TEPS
Duplicate Agent Name Progress up to now
There has been work ongoing to identify and resolve these cases. Here are useful tools.
The TEPS Audit blog post is a good first line of detection. You set a trace at the TEPS and then get a report with everything that TEPS sees.
The TEMS Audit blog post has some good reports - such as agents that repeatedly show online or reports at remote TEMS where the arrival of heartbeats is irregular.
The Database Health Checker blog post has a report section based on TEIBLOGT where you can see things like multiple additions to system generated MSLs, which can imply duplicate agent names.
We expect future progress in this area, including advanced tracing and reports which identify cases where two agents with the same name are connecting to the same remote TEMS.
This post discusses a new cross TEMS check report on current live data.
Node Status Table Correlation Report
Each TEMS has an in-storage table, INODESTS or Node Status table. A remote TEMS has entries corresponding to the nodes [agents] that are connected to it. In ideal cases, the hub TEMS and the remote TEMSes will contain the same information. If there are differences, such as the same agent name present in two different remote TEMSes, that is a very strong signal of a duplicate agent name. Detecting those differences is the goal of the current project.
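The correlation idea can be illustrated with a minimal Python sketch. This is not the logic of inodests_sum.pl itself. It assumes the Node Status rows from every TEMS have already been collected as simple tuples, and the function name, tuple layout, and sample values are all hypothetical.

```python
from collections import defaultdict

def find_conflicting_agents(rows):
    """Return agents that are seen with more than one (thrunode, hostaddr) view.

    rows: iterable of (source_tems, agent_name, thrunode, hostaddr) tuples,
    one per Node Status row collected from the hub and each remote TEMS.
    The tuple layout is an assumption made for this illustration.
    """
    views = defaultdict(set)
    for _source, agent, thrunode, hostaddr in rows:
        views[agent].add((thrunode, hostaddr))
    # An agent with a single consistent view is healthy; more than one is suspect.
    return {agent: sorted(v) for agent, v in views.items() if len(v) > 1}

# Hypothetical sample data: host1:LZ appears under two remote TEMSes.
sample_rows = [
    ("HUB",      "host1:LZ", "REMOTE_A", "ip.pipe:#10.0.0.1[7757]"),
    ("REMOTE_A", "host1:LZ", "REMOTE_A", "ip.pipe:#10.0.0.1[7757]"),
    ("REMOTE_B", "host1:LZ", "REMOTE_B", "ip.pipe:#10.0.0.2[7757]"),
    ("HUB",      "host2:LZ", "REMOTE_A", "ip.pipe:#10.0.0.3[7757]"),
    ("REMOTE_A", "host2:LZ", "REMOTE_A", "ip.pipe:#10.0.0.3[7757]"),
]

conflicts = find_conflicting_agents(sample_rows)
for agent, agent_views in conflicts.items():
    print(agent, agent_views)
```

Grouping by agent name and collecting distinct views keeps the check independent of which TEMS reported a row, so the hub and the remotes are treated alike.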
This package uses a TEPS utility to get the TEMS data for the report. Therefore it is run on the same system as the TEPS.
The following assumes TEPS was installed in the default directory. The data collection work is done on the system which runs the TEPS. If you are using a non-default install directory then you will need to set an environment variable or specify the install directory in a parameter.
The package is ==>here<==. It contains
1) Perl program inodests_sum.pl.
I suggest inodests_sum.pl be placed in an installation tmp directory. For Windows you need to create the <installdir>\tmp directory. For Linux/Unix create the sql directory. You can of course use any convenient directory.
Linux and Unix almost always come with Perl installed. For Windows you can install a no-cost Community version from www.activestate.com if needed.
Parameters for running inodests_sum.pl
All parameters are optional if defaults are taken
-h home installation directory for TEPS. Default is
This can also be supplied with an environment variable
Linux/Unix: export CANDLEHOME=/opt/IBM/ITM
Windows: set CANDLE_HOME=c:\IBM\ITM
-o Output file name
default is inodests_sum.csv in current directory
-h Help display
-work where to store TEMS database files, default is temp directory, period means current directory
-all record results for all agents, not just problem cases, default show only problem cases
-off include offline agents, usually not much value
-redo perform the report logic using the existing files. The hub.lst file must be manually determined and renamed. This is mostly for reporting defects to the author.
-aff handle one case of lst data from an older TEMS database level
-thrunode create thrunode.csv file for use in a TEMS Database File restoration project. These are consensus thrunodes based on hub and remote TEMSes. The new project recreates missing TNODELST NODETYPE=V records and TNODELST NODETYPE=M system generated Managed System List entries - which are sometimes missing.
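As a small illustration of how the install directory might be resolved from the parameter or the environment variables listed above, here is a hedged Python sketch. The function name is hypothetical, and the rule that a command line value wins over the environment variable is an assumption, not documented behavior of inodests_sum.pl. No built-in default is modeled.

```python
def resolve_install_dir(cli_value, environ, is_windows):
    """Pick the ITM install directory.

    cli_value: value given on the command line, or None (assumed to win).
    environ: mapping of environment variables, for example os.environ.
    is_windows: True on Windows, where the variable is CANDLE_HOME;
                Linux/Unix uses CANDLEHOME.
    Returns None when neither source supplies a value.
    """
    if cli_value:
        return cli_value
    variable = "CANDLE_HOME" if is_windows else "CANDLEHOME"
    return environ.get(variable)

# Example with explicit mappings instead of the real environment.
print(resolve_install_dir(None, {"CANDLEHOME": "/opt/IBM/ITM"}, is_windows=False))
print(resolve_install_dir(None, {"CANDLE_HOME": r"c:\IBM\ITM"}, is_windows=True))
```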
Running the inodests_sum.pl
In the temporary directory
See below for comments.
Rows 49 and 50 are identical in meaning. Column B is the source - which TEMS supplied the data. Column C is the THRUNODE - where the agent connected. Column D is the HOSTADDR - what system the agent was on and what the listening port was.
Row 48 shows the same agent name reporting to another remote TEMS and using a different IP address.
The conclusion here is that two agents are running on two different systems with the same name. This causes problems and should be stopped.
See below for comments on second report snippet.
Rows 7 and 8 are identical in meaning. Column B is the source - which TEMS supplied the data. Column C is the THRUNODE - where the agent connected. Column D is the HOSTADDR - what system the agent was on and what the listening port was.
Row 6 shows the same agent name reporting to another remote TEMS from the same system using a different listening port.
The conclusion here is that two agent instances are running on the same system. That is unusual and it should be stopped.
The general procedure is to investigate and resolve. When two agent instances are running on the same system (the second report), log in to that system and see why. Perhaps one was supposed to shut down and the shutdown failed. Perhaps there are actually two different agents installed. When the agents are on two different systems (the first report), the agents likely each have CTIRA_HOSTNAME configured but accidentally with the same value. One of the agents needs to be reconfigured.
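Telling the two situations apart comes down to comparing the address and the listening port in HOSTADDR. The following Python sketch shows one way to do that. It assumes HOSTADDR values look like `ip.pipe:#10.0.0.1[7757]<NM>host</NM>`, with the address after `#` and the port in square brackets. That layout and all names here are assumptions for illustration, and IPv6 addresses are not handled.

```python
import re

# Assumed layout: "<protocol>:#<address>[<port>]..." ; IPv4 only in this sketch.
HOSTADDR_RE = re.compile(r"#(?P<ip>[^\[\]]+)\[(?P<port>\d+)\]")

def parse_hostaddr(hostaddr):
    """Return (ip, port) from a HOSTADDR string, or None if it does not match."""
    match = HOSTADDR_RE.search(hostaddr)
    if match is None:
        return None
    return match.group("ip"), int(match.group("port"))

def classify_duplicate(hostaddr_a, hostaddr_b):
    """Describe how two HOSTADDR values for the same agent name differ."""
    a = parse_hostaddr(hostaddr_a)
    b = parse_hostaddr(hostaddr_b)
    if a is None or b is None:
        return "unknown"
    if a == b:
        return "same endpoint"
    if a[0] != b[0]:
        return "different systems"
    return "same system, different ports"

print(classify_duplicate("ip.pipe:#10.0.0.1[7757]<NM>h1</NM>",
                         "ip.pipe:#10.0.0.2[7757]<NM>h1</NM>"))
print(classify_duplicate("ip.pipe:#10.0.0.1[7757]<NM>h1</NM>",
                         "ip.pipe:#10.0.0.1[15949]<NM>h1</NM>"))
```

A "different systems" result points toward a shared CTIRA_HOSTNAME value, while "same system, different ports" points toward a second agent instance on one machine.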
Thrunode Report file
The -thrunode option creates the thrunode.txt file in the current directory. This file reports the calculated valid remote TEMS each agent is configured to. If there is a conflict [reporting to multiple remote TEMS] that agent is left out of the report. The thrunode.txt report is planned for use in a new project to restore some cases of missing TNODELST objects.
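The consensus rule described here, keep an agent only when every TEMS agrees on its thrunode, can be sketched in a few lines of Python. This is an illustration under assumptions, not the program's actual code. The function name, input layout, and sample values are hypothetical.

```python
from collections import defaultdict

def consensus_thrunodes(rows):
    """Map each agent to its thrunode when all sources agree on it.

    rows: iterable of (agent_name, thrunode) pairs gathered from the hub
    and remote TEMSes. Agents reported with more than one thrunode are
    treated as conflicts and left out of the result.
    """
    seen = defaultdict(set)
    for agent, thrunode in rows:
        seen[agent].add(thrunode)
    return {
        agent: next(iter(thrunodes))
        for agent, thrunodes in sorted(seen.items())
        if len(thrunodes) == 1
    }

# Hypothetical sample: host1:LZ is in conflict, host2:LZ is not.
pairs = [
    ("host1:LZ", "REMOTE_A"),
    ("host1:LZ", "REMOTE_B"),
    ("host2:LZ", "REMOTE_A"),
    ("host2:LZ", "REMOTE_A"),
]
print(consensus_thrunodes(pairs))
```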
The program captures TEMS output of the Node Status tables at each TEMS. If things do not work as expected, please capture those in a zip or compressed tar file and send to the author. I will endeavor to correct any issue promptly.
The information in the report will show cases where two or more TEMSes have differing information about particular agents. In the simplest cases that strongly suggests a case of duplicate agents.
option to export known good thrunodes - remote TEMSes that agents connect to
Note: Overhead Lights on New Cruise Ship