ITM Silver Blaze – Agent Responsiveness Checker
By John Alvord
Please note. While still interesting, this project has been largely superseded by
I have a great job. People come to me with puzzles and I get paid to investigate. A recent customer had 400+ Solaris systems running Unix OS Agent at ITM 622 FP5 and earlier levels. By chance they identified a single instance of a Unix OS agent that was not running situations. They were naturally worried there could be other cases.
In ITM, there are occasionally agents that report online but are not running situations. When real time data is requested, the request times out. I call them non-responsive agents and have puzzled for years about how to detect them easily.
If you suspect a non-responsive agent, you can attempt to view real time data and observe the time out condition. That requires expensive manual work for each agent and you can never be sure if things remain good. Once you find a single such case you will worry every day and night. A single situation not firing can be costly.
With this new inspiration, I remembered a famous Conan Doyle short story about Sherlock Holmes titled Silver Blaze. Sherlock resolved a mystery by noting that a watch dog did not bark in the night.
Silver Blaze Overview
There are three components to the Silver Blaze scheme: a situation, a workflow policy and a Perl program. An example implementation is available as a downloadable zip file. The goal of this example is to identify all non-responsive Linux OS Agents. The files can also be found here: https://github.com/jalvo2014/silverblaze
A) Timer situation: IBM_cycle_101
Formula: *IF *VALUE KLZ_System_Statistics.System_Name *NE xxx
Sampling Interval: 15 minutes
Run at Startup: No
The situation does not need to run at startup since it is used only by a workflow policy. The KLZ_System_Statistics attribute group is used because it has a System_Name, or agent name, among the attributes.
B) A very simple Workflow Policy: IBM_Policy_101
The take action command
The action command is
The Linux/Unix touch command creates a zero length file or updates the modify time on an existing file. See Appendix 1 for a Windows batch file, wintouch.bat, that accomplishes the same thing. The Windows example is included in the example files.
The example workflow policy has the same distribution as the situation: *LINUX_SYSTEM. This means the policy is active on each TEMS where a Linux OS agent connects. The Take Action options force the command to run on the hub TEMS. The workflow policy correlation is “managed system”.
When the policy is started [or auto_started], the situation is automatically started on each Linux OS Agent. The situation runs in results-only mode and does not create events. Every 15 minutes the agent sends a new result to the TEMS. The workflow processes the result and then runs a command on the hub TEMS.
In this example the /tmp/ directory was used for the touch files. You can of course pick any target directory.
After the situation sends results and the workflow policy runs, the /tmp directory fills with files having the names of Linux OS Agents which are active and processing situations.
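To make the effect of the touch step concrete, here is a minimal Python sketch of what it leaves behind: a zero length file per agent whose modify time is refreshed on every sample. This is only an illustration, not part of the example zip file; the function name touch_agent_file and the agent name host1:LZ are hypothetical.

```python
import tempfile
from pathlib import Path


def touch_agent_file(directory, agent_name):
    """Create a zero length file named after the agent, or refresh its modify time.

    Hypothetical helper that mimics the Linux/Unix touch command.
    On Windows a colon is not legal in a file name; see Appendix 1.
    """
    path = Path(directory) / agent_name
    path.touch(exist_ok=True)  # creates the file, or updates its modify time
    return path


if __name__ == "__main__":
    # A temporary directory stands in for /tmp so the demo leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        touched = touch_agent_file(tmp, "host1:LZ")
        print(touched.name, touched.stat().st_size)
```

Because the file is empty, the only information it carries is its name [which agent] and its modify time [when that agent last returned a result].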
C) An example Perl program itm_unresp.pl identifies problem OS Agents using this logic:
a. Determine what agents are online using tacmd listsystems
b. Collect the names of the touch files
c. Print out names of online agents which do not have touch files
d. Print out names of touch files which are not listed as online agents
e. Print out the names of touch files which are late by some predetermined number of seconds.
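Steps c and d above amount to two set differences. The following Python sketch shows that comparison under simple assumptions: the online agent names and touch file names have already been collected into lists, and the host names are made up. It is not the itm_unresp.pl source, only an illustration of the logic; the function name compare_agents is hypothetical.

```python
def compare_agents(online_agents, touch_files):
    """Return agents that are online without a touch file, and the reverse.

    Both arguments are iterables of managed system names (hypothetical inputs;
    the real program gets them from tacmd listsystems and a directory listing).
    """
    online = set(online_agents)
    touched = set(touch_files)
    return {
        "online_without_touch": sorted(online - touched),  # step c
        "touch_without_online": sorted(touched - online),  # step d
    }


if __name__ == "__main__":
    report = compare_agents(
        ["hostA:LZ", "hostB:LZ", "hostC:LZ"],  # made-up online agents
        ["hostA:LZ", "hostB:LZ", "hostD:LZ"],  # made-up touch file names
    )
    for name in report["online_without_touch"]:
        print(f"Online agent {name} missing from touch files")
    for name in report["touch_without_online"]:
        print(f"node {name} not in online capture")
```

An agent in the first list is the dog that did not bark: it claims to be online but has never delivered a situation result.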
This example Perl program is configured with user specified values at the beginning which tell how many seconds is considered late, user/password for tacmd, target directory for files, etc. The itm_unresp.pl has been tested on both Linux and Windows.
For Linux/Unix, the location of Perl is specified in the first line of itm_unresp.pl
If that is different on the system that itm_unresp.pl will run on, you will need to change that first line. On Windows the perl program libraries are present in the PATH after installation and that line is ignored. The "-w" enables certain warnings.
Controls for itm_unresp.pl
These controls are in the beginning of the itm_unresp.pl source. Modify them to match your requirements.
server hub name or ip address where hub TEMS runs
valid userid for tacmd login
password for userid
product code of agent [ux/nt/lz etc]
Directory where touch files stored – Windows c:/temp/
File extension of touch files
touch file lateness factor
When 1, adds a fake nodeid to test logic
Example Results Log
In the test environment, the log looked like this:
Online agent check start
node xxx180:LZ modify late
node xxx182.xxx.xxx.ibm.com:LZ modify late
Online agent XXX185.xxx.xxx.ibm.com:LZ missing from touch files
Online agent check complete
In this test lateness was defined as 30 seconds and the situation sampling interval was set to 60 seconds. This setup deliberately forced lateness messages. The modify value is the epoch seconds when the file was last modified. The late value is the current epoch time minus the lateness seconds defined in itm_unresp.pl.
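The lateness test described above can be sketched in a few lines of Python. The names is_late and lateness_seconds are hypothetical, and the strict less-than comparison is an assumption; itm_unresp.pl may treat the boundary case differently.

```python
import os
import time


def is_late(path, lateness_seconds, now=None):
    """Return True when the touch file was last modified before the late cutoff.

    'modify' and 'late' follow the meanings given for the log values:
    modify is the file's last-modified epoch time, late is the current
    epoch time minus the configured lateness seconds.
    """
    now = time.time() if now is None else now
    late = now - lateness_seconds
    modify = os.path.getmtime(path)
    return modify < late  # assumption: strictly older than the cutoff counts as late


if __name__ == "__main__":
    import tempfile

    with tempfile.NamedTemporaryFile() as f:
        stamp = time.time() - 60          # pretend the last touch was 60 seconds ago
        os.utime(f.name, (stamp, stamp))
        print(is_late(f.name, 30))        # 30 second lateness factor
        print(is_late(f.name, 120))       # 120 second lateness factor
```

With a 60 second sampling interval and a 30 second lateness factor, as in the test above, a healthy agent's file will often be older than the cutoff, which is why that setup forces lateness messages. In production the lateness factor should comfortably exceed the sampling interval.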
The missing touch file message was produced by an option to add a fake online node.
There is a message type “node $node not in online capture” which means there is a touch file present but the agent is not currently online. I suspect that means the agent has gone offline and the touch file should be deleted. That logic is not yet implemented.
It might be inconvenient or impossible to run the itm_unresp.pl program on the hub TEMS. If so pick any system with a Windows/Linux/Unix OS Agent. Change the workflow policy so the touch [or Windows wintouch.bat] command runs on that Agent. Then you can run the itm_unresp.pl summary program on that same system with the same results.
Linux/Unix systems usually come with Perl already installed. If your target is a Windows system, then install Perl from www.activestate.com which has an excellent free version. The itm_unresp.pl program only uses built in or core facilities.
Outstanding Customer Results
Using the Silver Blaze scheme, and after roughly an hour of effort, the customer determined that 167 agents were stalled. A study of 115 agent operations logs revealed evidence of a defect corrected at ITM 622 FP6 when the TEMA threading logic was reworked. An upgrade to ITM 623 FP2 was already underway and was thereby accelerated. Updating each Unix OS Agent was sufficient to resolve the issue for all ITM agents running on each system. In the meantime, stuck agents were recycled as needed and monitoring continued.
The underlying ITM issues have been resolved over time, but not everyone runs the latest maintenance level. In addition, the problem can be environmental, like a full mount point or some competing process in a loop. [See Appendix 2 for five APAR fix examples.]
Having a centralized facility to identify non-responsive agents will speed resolution of such issues. Until the problem can be corrected, early identification and recycling will reduce the time agents spend running in a non-responsive mode.
This scheme provides a way to view non-responsive agents reliably. It can also be used as a long term checker for these issues. After an initial scan and cleanup, the sampling interval should be changed to once a day or so.
Make log better looking.
Right now the itm_unresp.pl program needs to be run manually and the resulting log checked manually. Connect problem results to a monitored log to produce events.
Handle multiple agent types with one tacmd listsystems call.
Do you have any other ideas or edit suggestions? Please comment in blog entry or send email firstname.lastname@example.org. If you find improvements to the scheme, please let me know so everyone can benefit.
Appendix 1: Windows and the touch command
Windows does not have the touch command. In addition, managed system names contain a colon [:] which is illegal in a filename in the Windows file system.
Here is a small batch file wintouch.bat that does the same thing as Linux/Unix touch:
@SetLocal EnableDelayedExpansion
@Set FILE=%1
@type nul >>%FILE::=_%.touch & copy %FILE::=_%.touch +,, >nul 2>&1
Unless you are seriously deep into Windows geek-land, this surely seems mysterious. Here is an explanation:
The leading @ character suppresses the echoing of commands.
SetLocal EnableDelayedExpansion makes sure that environment variables are substituted line by line instead of all at once during the pre-execution phase.
Set FILE=%1 takes the first batch file argument, normally the managed system name, and stores it in the environment variable FILE.
%FILE::=_% creates a string from the FILE environment variable where each colon [:] is translated into an underline [_]. This avoids using the colon in the file name.
type nul >>filename - appends the null file to a given file, creating a zero length file if not present.
The ampersand [&] means to run the first command and then run the second command.
copy filename +,, - this copies a file onto itself, thereby updating the modify time. It can be expensive if the file is large but this is a zero length file.
>nul 2>&1 - this suppresses any standard output and standard error output from copy.
The file extension [here touch] must match the itm_unresp.pl program.
In practical use, you would create a wintouch.bat file based on the one in the example zip file and save it on the Windows system in a known location. In the Workflow Policy take action command, set the fully qualified name of the wintouch.bat command file. The itm_unresp.pl command is aware of the changed form of the Agent name and will make the right tests when run on a Windows system.
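The name change made by wintouch.bat can be mirrored on the checking side. The Python sketch below shows one way to derive the expected Windows touch file name from a managed system name; it is an illustration only, not the logic in itm_unresp.pl, and the function name touch_file_name and the sample agent name are hypothetical.

```python
def touch_file_name(managed_system, extension="touch"):
    """Map a managed system name to the file name wintouch.bat would create.

    Each colon becomes an underscore and the extension is appended,
    matching %FILE::=_%.touch in the batch file.
    """
    return managed_system.replace(":", "_") + "." + extension


if __name__ == "__main__":
    print(touch_file_name("Primary:HOST1:NT"))  # Primary_HOST1_NT.touch
```

Converting the online agent names forward and comparing them with the file names is safer than converting file names back, since a host name may itself contain an underscore and the reverse mapping would then be ambiguous.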
Appendix 2: Non-responsive agent APAR fix examples
These are examples of APAR fixes which handled cases where an agent might end up non-responsive. The list is not complete, but these are the ones I remember. These are rarely observed when an agent is running with up to date maintenance. There are also many environmental problems which can have the same result.
First is a case where the Agent Support [TEMA] threading model needed work to avoid a deadlock. It could theoretically happen on any Linux/Unix environment but in practice was only seen on the Solaris Unix OS Agent. Corrected in ITM 622 FP6 and in ITM 623 FP1.
DEPLOYMENT COMMAND SOMETIMES HANGS ON SOLARIS
Second is a case where a slow cinfo caused misbehavior, fixed in ITM 622 FP4.
WATCHDOG CONTINUES TO RESTART OS AGENT
Third is a case of an environmental problem that was not handled well, fixed in ITM 622 FP4.
NETWORK INTERFACE WITH MULTIPLE ADDRESSES BREAKS CINFO -R RESULTS IN KCAWD TERMINATING AGENT DUE TO HOSTNAME RESOLUTION
Fourth is a case where watchdog stopped the wrong process, corrected in ITM 622 FP6.
WATCHDOG CALLS TO CINFO SOMETIMES DO NOT TERMINATE.
Fifth is a case where a process type situation might take 30+ minutes to evaluate, corrected in ITM 622 FP6.
ITM KUXAGENT HAS PERFORMANCE PROBLEM IN RESOLVING TTY NAMES FOR PROCESSES ON HP-UX
These are rare cases and are often never seen by customers.