Sitworld: Diagnostic Snapshot Utility
John Alvord, IBM Corporation
I get called into a lot of unusual diagnostic cases. ITM has a terrific ability to capture detailed diagnostics and that works great for issues that can be recreated. However rarely occurring conditions - perhaps happening every month or so at random times - are a lot harder to capture. The diagnostic logs are a fixed maximum size [thank goodness] and by the time the condition is noticed and the logs captured, the relevant data is often lost. The condition is even tougher if you need to collect diagnostic data from multiple locations - say hub TEMS and remote TEMS and an agent and TEPS at roughly the same time.
I was reflecting about this one day while working on the case of a situation event that should not have occurred. I realized that if there was a workflow policy waiting for that situation result then some logic could be performed. [Workflow policies work on results and not the TEMS concept "situation events".] Also since the Workflow Take Action activity can be performed at any managed system, the flexibility exists to grab data from anywhere.
The next step was to create a "capture diagnostic data now" program that could be run multiple times without overlaying data. The following project was created in 2011, before the current blog and still has a lot of value for advanced diagnostic capture.
The snapshot.pl utility captures the current diagnostic log segments and the operations log and stores them in a compressed time stamped file. The utility can be triggered by time, by a situation event, by a workflow policy or by some external program. In this way the current diagnostic information can be collected even when substantial tracing is required. The utility has been tested on AIX, Windows, Linux, SunOS, and HP-UX.
Preconditions – Things to test and know before using snapshot.pl
1) The environment must have the Perl script interpreter installed. All the testing has been at Perl 5.8.2 and higher. One test at Perl 4 failed completely. The easiest way to check the installed level is to log in to the server where you plan to run snapshot.pl and enter the command "perl -v". In one of the test environments, Perl 4 was the default install but Perl 5 was installed in another directory path. That was resolved by changing the first line of the Perl program, where the location of the Perl interpreter can be recorded.
2) The Windows ITM environment must have a temporary directory such as c:\ibm\itm\tmp. The same logic is running in Windows/Linux/Unix and so having a uniform directory structure avoids platform specific controls.
3) The snapshot.pl default environment usually has the CANDLEHOME [Linux/Unix] or CANDLE_HOME [Windows] environment variables set. This represents the install directory. This environment variable will always be present when the utility is executed as a situation action command. It may have to be added when the utility is invoked manually or from a non-ITM environment such as a Windows AT command. Alternatively, the -ch option can be used to supply that information externally.
4) The snapshot.pl utility recovers information from the <installdir> logs directory and for Windows agents from the <installdir>\tmaitm6\logs directory. There are certain conditions that must hold before the recovery will work. These can be tested for by running the utility manually and correcting any problem conditions. Following are some checks that can be made ahead of time.
5) For all cases, examine the inventory files [*.inv] and see if they make sense. Inventory files keep track of the diagnostic log segments. For TEMS you should see a single file <hostname>_ms.inv. The user has some control and that could be different. For some agents there are multiple inventory files such as <hostname>_lz_klzagent.inv and <hostname>_lz_kcawd.inv. If the environment has been installed for some time, there could be outdated inventory files with different hostnames if, for example CTIRA_HOSTNAME began to be used or if the system hostname had been changed. If you discover such a case, then delete the unused inventory files.
6) The Linux/Unix TEMS operations log name is recovered from the logs directory ms.env file. You can verify this by doing
grep "Running: " <installdir>/logs/ms.env
On Windows the operations log is a file with a fixed name in the cms directory - kdsmain.msg. For Windows the kdsmain.ras file is also captured if present, which records exception callstacks.
7) The agent operations log has a filename that usually ends with
[Linux/Unix] <initial>:<uppercased product code>:LG0
[Windows] <initial>_<product code>.LG0
Look for duplicate LG0 files and delete them. The snapshot.pl utility will fail if there are no such files or multiple files. Note that the Linux/Unix names contain colons, so those files cannot be unpacked on a Windows system: filenames there cannot include a colon except in the initial disk specification.
8) On Linux/Unix, tar or tar/compress command(s) are used to create the compressed file. While Windows has the capacity to create zip compressed folders, there is no command line to do that work. The snapshot.pl logic overcomes this limitation by dynamically generating a .vbs file to do the needed work. This is created in the tmp directory and has the filename snapshot_<product>_zip.vbs. The base name used can be altered with the -base option.
9) The snapshot.pl utility creates a folder in the tmp directory named "snapshot_workdir". The base name used can be altered with the -base option.
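Several of the checks above can be scripted before the first manual run. The following is a minimal sketch for Linux/Unix, not part of snapshot.pl: the function names are hypothetical, the LG0 pattern simply follows the Linux/Unix naming given in item 7, and the logs directory and host name are supplied by the caller.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; none of these names come from snapshot.pl.

# Succeeds when a dotted Perl version such as "5.8.2" is at least 5.8.2,
# the lowest level the article reports as tested.
perl_version_ok() {
    echo "$1" | awk -F. '{
        major = $1 + 0; minor = $2 + 0; patch = $3 + 0
        ok = (major > 5) || (major == 5 && minor > 8) || (major == 5 && minor == 8 && patch >= 2)
        exit (ok ? 0 : 1)
    }'
}

# Lists inventory files [*.inv] whose names do not start with "<host>_".
# These are only candidates for review, not files to delete blindly.
stale_inventory_files() {
    logs_dir=$1
    host=$2
    for f in "$logs_dir"/*.inv; do
        [ -e "$f" ] || continue
        case "$(basename "$f")" in
            "$host"_*) ;;
            *) echo "$f" ;;
        esac
    done
}

# Counts operations log files named <initial>:<UPPERCASED product code>:LG0.
# The article says snapshot.pl needs exactly one such file.
count_lg0_files() {
    logs_dir=$1
    code=$(echo "$2" | tr '[:lower:]' '[:upper:]')
    count=0
    for f in "$logs_dir"/*:"$code":LG0; do
        [ -e "$f" ] && count=$((count + 1))
    done
    echo "$count"
}
```

A possible use is perl_version_ok "$(perl -e 'printf "%vd", $^V')" followed by stale_inventory_files and count_lg0_files against <installdir>/logs. Anything the helpers report still needs a human decision before files are removed.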
First Step - A Manual test
The snapshot.pl must be installed into the ITM environment. A zip file of that file is here. All testing was performed using the <installdir> bin directory, however almost any location will do. For Windows copying the snapshot.pl file is sufficient. For Linux/Unix the file must be prepared before use. First determine the needed attributes by doing a
ls -al <installdir>/bin/tacmd
Here is an example output
-rwxrwxrwx 1 root root 6464 Jul 16 2010 tacmd*
Use the following commands to make snapshot.pl have the same characteristics
chmod 777 snapshot.pl
chown <owner> snapshot.pl
chgrp <group> snapshot.pl
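Instead of reading the owner and group off the ls output by eye, they can be copied from the reference file. This is a small sketch under the assumption that ls -ld prints the owner and group in columns 3 and 4, which is the usual long listing layout; the helper names are hypothetical.

```shell
#!/bin/sh
# Prints "owner group" for a file, parsed from ls -ld output (columns 3 and 4).
owner_and_group() {
    ls -ld "$1" | awk '{print $3, $4}'
}

# Hypothetical helper: give the target file the same owner and group
# as the reference file. Needs sufficient privilege, just like chown/chgrp.
match_ownership() {
    reference=$1
    target=$2
    # Intentionally unquoted so the two words become $1 and $2.
    set -- $(owner_and_group "$reference")
    chown "$1" "$target" && chgrp "$2" "$target"
}
```

For example, match_ownership <installdir>/bin/tacmd snapshot.pl would replace the chown and chgrp steps; the chmod step is still done separately.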
Make the current directory <installdir>/bin and run the command by entering
perl snapshot.pl
Review any errors and correct as needed. For example, you might have to set the CANDLEHOME [Linux/Unix] or CANDLE_HOME [Windows] environment variable, or supply the -ch option, since you are not running in an ITM Action Command environment.
If this completes successfully, there will be a new compressed, time stamped file in the logs directory.
Unpack the file and verify that the expected diagnostic and operational files are present.
If you will be capturing agent logs, test with the product code or -t option
perl snapshot.pl -t ux
The goal is to run the snapshot.pl utility to collect diagnostic data near the time when the problem condition occurs. In every case, the Situation action command will look like this:
/usr/bin/perl $CANDLEHOME/bin/snapshot.pl [Linux/Unix]
C:\perl\bin\perl %CANDLE_HOME%\bin\snapshot.pl [Windows]
The fully qualified name of the Perl executable is needed because the ITM action command environment may not have the expected PATHs.
See section at end for complete parameter documentation.
Before doing any actual data capture, do a test with an always true situation [LocalTime < 250000]. After distributing to the target agent, start the situation and verify the tar/zip file has been created and has the expected contents. In that way you can verify that the target environment has Perl installed in the expected location and that the parameters are correct. If the tar/zip is not created, there will likely be an explanation in the TEMS or Agent operations log, which collects standard and error outputs from programs run in this way.
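One quick way to confirm the test situation produced something is to look at the most recently modified file in the logs directory. This is only a sketch; newest_file is a hypothetical helper name and the directory is supplied by the caller.

```shell
#!/bin/sh
# Prints the name of the most recently modified entry in a directory.
newest_file() {
    ls -t "$1" | head -n 1
}
```

Running newest_file <installdir>/logs just after the situation fires should show the new tar/zip file; if it shows an ordinary log segment instead, check the operations log for the reason.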
Use by invoking periodically
One way to capture data long term is to run the snapshot periodically, such as once an hour or less. The invocation can come from an ITM always true situation [LocalTime.TIME < 250000] with a sampling interval of, say, one hour. On Linux/Unix the action command takes the same form as the situation action command shown earlier. The situation action command must be configured to run at each interval.
The utility could also be performed with an external process such as a Unix crontab entry or a Windows AT command. In that case the -ch option would be used to set the install directory.
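As an illustration of the crontab route, the helper below prints an hourly crontab line that passes the install directory with -ch. It is a sketch only: snapshot_cron_line is a hypothetical name, and /usr/bin/perl and /opt/IBM/ITM are assumed example paths that must be replaced with the real Perl and ITM install locations.

```shell
#!/bin/sh
# Prints an hourly crontab line that runs snapshot.pl with the -ch option.
# Arguments: full path to perl, ITM install directory (both site-specific).
snapshot_cron_line() {
    perl_path=$1
    install_dir=$2
    printf '0 * * * * %s %s/bin/snapshot.pl -ch %s\n' \
        "$perl_path" "$install_dir" "$install_dir"
}

# Example with assumed paths; review the line before adding it via crontab -e.
snapshot_cron_line /usr/bin/perl /opt/IBM/ITM
```

Because cron does not run inside an ITM action command environment, the -ch option takes the place of the CANDLEHOME environment variable here.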
Use from a Situation - Universal Messages example
The condition might be detectable by a situation. In one case messages were written to the TEMS operations log like this
KO41039 Error in request compileOnDemand. Status= 1157. Reason= 155.
KO41039 Error in request sqlRequest. Status= 1102. Reason= 155
Of 13 examples of the KO41039 message [over 4 months], in four cases it was followed by the message indicating the problem:
KDS9142I The TEMS HUB_xxxx is disconnected from the hub TEMS
To capture this condition, snapshot.pl was installed and a situation was created against the Universal Messages attribute group. The formula used
( Category == KO41039 AND SCAN(Message Text) == 155)
The snapshot.pl -delay option was used to delay capture for 60 seconds. That way the following logic could be traced.
The situation was distributed to *ALL_CMS.
After that some intensive tracing was installed as defined by IBM Support.
When the condition occurred again, the diagnostic and operational logs were captured and progress was made.
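Before building a situation like this, it can help to see how often the trigger pattern already appears in a saved operations log. The sketch below mirrors the formula above with two grep filters; the function name is hypothetical and the log file path is whatever copy of the operations log is at hand.

```shell
#!/bin/sh
# Prints how many lines of a log file contain the message id KO41039
# and also the text 155, mirroring the situation formula.
count_ko41039_scan_155() {
    grep 'KO41039' "$1" | grep -c '155'
}
```

A count much higher than the number of real problem occurrences suggests the situation will fire often, which matters when each event triggers a snapshot and -max limits how many are kept.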
Use from a Situation - unexpected true event
A situation was used to alert on missing processes. On rare occasions the alert contained invalid data - for example a mount point with low free space where the name of the mount point was blank.
To capture this condition, a duplicate situation was created with snapshot.pl as the action command. The needed tracing was also installed. When the false event occurred, the snapshot diagnostic data was recorded.
Use from a Workflow Policy
Sometimes data needs to be captured from multiple ITM services such as a remote TEMS, a hub TEMS and an agent. In this case a situation will usually provide the triggering event. The snapshot.pl utility must be installed and tested on all servers where the command will be run. The Workflow Take Action can be set to execute at
- The agent
- The TEMS the agent reports to
- *any* managed system.
For this case, the snapshot.pl should be run as a separate process using a trailing ampersand for Linux/Unix or via a "start /min cmd /c …" for Windows. The reason is that remote commands have a hardcoded timeout of 50 seconds. This time limit will not apply to processes running in the background.
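On Linux/Unix the background launch can be wrapped as shown below. This is a sketch rather than a required form: run_detached is a hypothetical helper, and the use of nohup and the redirection to /dev/null are extra precautions beyond the trailing ampersand described above.

```shell
#!/bin/sh
# Starts a command in the background and returns at once.
# Output is discarded so the caller is not left waiting on open pipes.
run_detached() {
    nohup "$@" > /dev/null 2>&1 &
}
```

A Take Action command could then be along the lines of run_detached /usr/bin/perl "$CANDLEHOME/bin/snapshot.pl" -delay 60, so the remote command returns well inside the timeout while the capture continues on its own.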
Use from an external Process
Any running program can invoke the snapshot.pl utility. If you want to run it remotely, use SOAP to create a universal message at the target ITM service and have a situation waiting for that universal message. See an example of such usage in this technote:
Starting and Stopping ITM situations using external operations
Other use cases
There are probably more ways to use the snapshot.pl utility.
Reference - Command line options for the Snapshot.pl Utility
-h Produce help message and exit
-ch Install directory
-t Product code, default "ms"
-host Hostname, default the result of platform hostname command
-base Base name of snapshot and file work directory, default "snapshot"
-max Maximum number of snapshots, default 32
-nz No compression, compression on by default
-delay Delay seconds before capture, default 0 or no delay
-idir Sub-path of install directory where inserted files are found
-i File specification for inserted files. More than one specification can be used and wildcards [*] are processed. If -idir is not specified, the full path below the install directory must be specified.
-n A comment which will be recorded in a note.txt along with diagnostic files. -n must be the last option and the rest of the argument line is the comment.
If -ch is not supplied on the command line, the CANDLEHOME [Linux/Unix] or CANDLE_HOME [Windows] environment variable is used.
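To show how the options combine, and in particular that -n has to come last, here is a small sketch that assembles a command line from a product code, a delay and a free-form comment. The helper name and the sample values are illustrative only; the options themselves are the ones listed above.

```shell
#!/bin/sh
# Prints a snapshot.pl command line for an agent capture.
# Arguments: product code, delay in seconds, then the comment words.
# -n is placed last because the rest of the line becomes the comment.
snapshot_command() {
    product=$1
    delay=$2
    shift 2
    echo "perl snapshot.pl -t $product -delay $delay -n $*"
}

# Example with made-up values.
snapshot_command ux 60 capture after false alert
```

Any other option, such as -max or -base, would be added before -n in the same way.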
The snapshot.pl is *not* an officially supported part of the ITM product. Use it at your own risk. If problems arise, the author will work to resolve the issue. At the same time, if you have suggestions or feature requests or improvements, please communicate those to the author.
The snapshot.pl utility captures operations and diagnostics logs and added files as needed.
Note: Radar Bubble on new Cruise Ship