IBM Tivoli Monitoring Wonderful World of Situations
jalvord
During my normal work, I see many interesting puzzles about how to accomplish useful work in IBM Tivoli Monitoring [ITM]. Often these revolve around situations.
Over time I will present some basic education on the subject but first there are some interesting cases that will benefit from some interactions. So for the moment, I will assume you are familiar with ITM and its jargon.
ITM L2 Support
Please note. While still interesting, this project has been largely superseded by
I have a great job. People come to me with puzzles and I get paid to investigate. A recent customer had 400+ Solaris systems running Unix OS Agent at ITM 622 FP5 and earlier levels. By chance they identified a single instance of a Unix OS agent that was not running situations. They were naturally worried there could be other cases.
Formula: *IF *VALUE KLZ_System_Statistics.System_Name *NE xxx
The situation does not need to run at startup since it is used only by a workflow policy. The KLZ_System_Statistics attribute group is used because it has a System_Name, or agent name, among the attributes.
B) A very simple Workflow Policy: IBM_Policy_101
The take action command
The action command is
The Linux/Unix touch command creates a zero-length file or updates the modification time of an existing file. See Appendix 1 for a Windows batch file, wintouch.bat, that accomplishes the same thing. The Windows example is included in the example files.
C) An example Perl program itm_unresp.pl identifies problem OS Agents using this logic:
a. Determine what agents are online using tacmd listsystems
b. Collect the names of the touch files
c. Print out names of online agents which do not have touch files
d. Print out names of touch files which are not listed as online agents
e. Print out the names of touch files which are late by some predetermined number of seconds.
This example Perl program is configured with user-specified values at the beginning which tell how many seconds is considered late, the user/password for tacmd, the target directory for files, etc. The itm_unresp.pl program has been tested on both Linux and Windows.
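Steps a through d above amount to two set differences. Here is a minimal sketch of that comparison, written in Python rather than Perl and not taken from itm_unresp.pl. The function name and the sample managed system names are hypothetical. In practice the first list would come from tacmd listsystems and the second from a listing of the touch file directory.

```python
def classify_agents(online_agents, touch_files):
    """Compare online agent names with touch file names.

    Returns two sorted lists:
      1. online agents that have no touch file
      2. touch files that do not match any online agent
    """
    online = set(online_agents)
    touched = set(touch_files)
    missing_touch = sorted(online - touched)
    orphan_files = sorted(touched - online)
    return missing_touch, orphan_files


# Hypothetical sample data for illustration only.
online = ["host1:LZ", "host2:LZ", "host3:LZ"]
files = ["host1:LZ", "host3:LZ", "host9:LZ"]

missing, orphans = classify_agents(online, files)
print("online, no touch file:", missing)
print("touch file, not online:", orphans)
```

An online agent with no touch file is the interesting case: the agent is connected but the situation and policy never produced a file for it.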
For Linux/Unix, the location of Perl is specified in the first line of itm_unresp.pl
If that is different on the system where itm_unresp.pl will run, you will need to change that first line. On Windows the Perl program libraries are present in the PATH after installation and that line is ignored. The "-w" enables certain warnings.
These controls are in the beginning of the itm_unresp.pl source. Modify them to match your requirements.
In the test environment, the log looked like this:
In this test lateness was defined as 30 seconds and the situation sampling interval was set to 60 seconds. This setup deliberately forced lateness messages. The modify value is the epoch seconds when the file was last modified. The late value is the current epoch time minus the lateness seconds defined in itm_unresp.pl.
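The lateness test described above can be sketched as follows. This is an illustration in Python, not the code from itm_unresp.pl. The function name, the directory, and the file names are assumptions. A file counts as late when its modify value is smaller than the late value, that is, the current epoch time minus the lateness seconds.

```python
import os
import tempfile
import time


def late_files(directory, lateness_seconds, now=None):
    """Return (name, modify, late) for every file last modified before
    the cutoff, where late = now - lateness_seconds (epoch seconds)."""
    now = time.time() if now is None else now
    late = now - lateness_seconds
    result = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        modify = os.path.getmtime(path)
        if modify < late:
            result.append((name, int(modify), int(late)))
    return result


# Demonstration with two hypothetical touch files and a fixed "now".
demo_dir = tempfile.mkdtemp()
now = 1_000_000
for name, age in (("host1_LZ", 10), ("host2_LZ", 100)):
    path = os.path.join(demo_dir, name)
    open(path, "w").close()
    os.utime(path, (now - age, now - age))  # set access and modify times

found = late_files(demo_dir, lateness_seconds=30, now=now)
for name, modify, late in found:
    print(name, "modify", modify, "late", late)
```

With a 30 second lateness value, only the file touched 100 seconds before the fixed time is reported.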
This scheme provides a way to view non-responsive agents reliably. It can also be used as a long term checker for these issues. After an initial scan and cleanup, the sampling interval should be changed to once a day or so.
Windows does not have the touch command. In addition, managed system names contain a colon [:], which is illegal in a filename in the Windows file system.
>nul 2>&1 - this suppresses any standard output and standard error output from copy.
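For readers who prefer a script to a batch file, here is a hedged Python sketch of the same idea. It is not a translation of wintouch.bat. The choice of an underscore as the replacement for the colon and both function names are assumptions made for illustration.

```python
import os
import tempfile


def touch_file_name(managed_system_name, replacement="_"):
    """Map a managed system name to a name that is legal on Windows.

    The underscore replacement is an assumption for this sketch.
    """
    return managed_system_name.replace(":", replacement)


def touch(directory, managed_system_name):
    """Create a zero-length file, or update the modification time of an
    existing one, like the Linux/Unix touch command."""
    path = os.path.join(directory, touch_file_name(managed_system_name))
    with open(path, "a"):
        pass                  # opening in append mode creates the file if needed
    os.utime(path, None)      # set access and modify times to the current time
    return path


# Hypothetical managed system name for demonstration.
demo_dir = tempfile.mkdtemp()
created = touch(demo_dir, "host1:LZ")
print(os.path.basename(created))
```

Whatever mapping is chosen, the checking program has to apply the same mapping before it compares touch file names with online agent names.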
Second is a case where a slow cinfo caused misbehavior, fixed in ITM 622 FP4.
Third is a case of an environmental problem that was not handled well, fixed in ITM 622 FP4.
Fourth is a case where watchdog stopped the wrong process, corrected on ITM 622 FP6.
Fifth is a case where a process type situation might take 30+ minutes to evaluate, corrected on ITM 622 FP6.
These are rare cases and are often never seen by customers.
ITM TEMS Stress Tester Experiment, Version 0.10000
By John Alvord
IBM Corporation, 21 March 2013
There are times when the TEMS comes under heavy stress. There are sometimes severe outcomes such as a crash, out of storage failures and loss of communication with other TEMSes and agents. Often there are few outward signs before the failure. These sorts of failures are very costly in lost time and support efforts.
The average time stays close to the target because, after completing each cycle, the new target evaluation time is set to achieve the original goal. Thus if the sampling interval was 60 seconds and the evaluation was completed at time 63 seconds, the next target would be 57 seconds away. The average stays OK but the variations grow wider.
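A simplified model of this behavior, assuming targets stay on the original schedule and each evaluation completes some number of seconds after its target. The function name and the overrun values are hypothetical. This is not TEMS code, only arithmetic that reproduces the 63 and 57 second example.

```python
import statistics


def completion_intervals(interval, overruns):
    """Elapsed seconds between successive completions when every target
    stays at a multiple of `interval` and evaluation number N finishes
    overruns[N-1] seconds after its target."""
    intervals = []
    previous_completion = 0
    for cycle, overrun in enumerate(overruns, start=1):
        target = cycle * interval
        completion = target + overrun
        intervals.append(completion - previous_completion)
        previous_completion = completion
    return intervals


# Hypothetical overruns, in seconds, for four cycles of a 60 second interval.
intervals = completion_intervals(60, [3, 0, 5, 1])
print("intervals:", intervals)
print("mean:", statistics.mean(intervals))
print("standard deviation:", statistics.pstdev(intervals))
```

In this model the mean stays near 60 seconds no matter how large the overruns become, while the spread of the individual intervals grows with them. That is why the spread, rather than the average, is the quantity worth watching.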
The “sliding window” means that only a recent set of measurements is used. That is important so that recent conditions are given more importance. The oldest observation is dropped once the window limit is passed.
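A minimal sketch of a sliding window with a standard deviation, in Python. This is not the logic from the Perl program. The class name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which form is used.

```python
from collections import deque
import statistics


class SlidingWindow:
    """Keep only the most recent `size` observations."""

    def __init__(self, size):
        # A deque with maxlen drops the oldest entry automatically
        # when a new one is appended to a full window.
        self.values = deque(maxlen=size)

    def add(self, value):
        self.values.append(value)

    def mean(self):
        return statistics.mean(self.values)

    def stdev(self):
        # Population standard deviation; an assumption for this sketch.
        if len(self.values) < 2:
            return 0.0
        return statistics.pstdev(self.values)


# Hypothetical measurements, in seconds between evaluations.
window = SlidingWindow(size=3)
for seconds in (60, 60, 60, 63, 57):
    window.add(seconds)

print("window:", list(window.values))
print("mean:", window.mean())
print("standard deviation:", window.stdev())
```

After five observations with a window of three, the first two values have been dropped and only the last three contribute to the result.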
There are two program objects included here: a situation and a Perl program which you can access using the following link.
An example of the situation is included: a file IBM_TEMS_stress.xml which can be loaded using this command
The situation example uses a 2 minute sampling interval. You may adjust this time as needed. The time must be in coordination with the action command script. On a fairly powerful AIX server this was measured as 0.30 CPU seconds per invocation.
The action command is set to run on every true cycle. In addition, the action command must run on the TEMS since this is a probe based situation [Note 2].
The processing example program, itm_stress.pl, is also included. If you do discover issues and make corrections, please return them to the author for general usage. The logic for the sliding window calculations is documented in the comments with references.
Tailoring for the environment
$local_dir = "c:" . $local_dir if $gWin == 1; # Stress file in Windows
my $local_file = "stress.txt"; # current data file
my $local_log = "stress.log"; # Progress log
my $local_window = 60; # number of entries in sliding window
$local_dir is the directory where the two output files are kept.
$local_file is the current calculation file and keeps all data needed for the logic.
$local_log is the closest thing to an output file right now.
$local_window is the number of entries in the sliding window.
The test for Windows needs work.
The $local_window value and the situation sampling interval define how long a time will be measured. The default here is 60 cycles of 2 minutes each or 120 minutes.
Sample Stress Log Data
Here is what the data items mean
In this case the TEMS was not under any significant stress and so the standard deviation was very low. Epoch is the standard POSIX measurement of the number of seconds since 1 Jan 1970 UTC.
The hypothesis is that the standard deviation will be small for a TEMS under little stress and will increase when the stress is increased. That has been true in some unit testing environments.
1) This situation could run on an agent with the action performed at a reporting TEMS. In this way communications stress could be measured as well.
2) Using a Workflow policy, this could run on a remote TEMS and the command run on the hub TEMS. In this way the combined hub/remote TEMS stress could be measured.
3) If the theory proves valid and useful, the standard deviation could be used to generate alerts in several different ways.
Note 1: See TEMS audit process technical note:
Auditing TEMS for High Impact Workloads
Note 2: Situations that run at a remote agent are known as Intelligent Remote Agents. Situations run as called functions from the TEMS dataserver are known as probe based situations. This is a timer based situation running on the TEMS and is probe based. In such cases, the action command must be defined to run on the TEMS.