Sitworld: Event History Audit
John Alvord, IBM Corporation
This project had several inspirations.
First came from a desire to estimate situation workload impact without any extra tracing and in general discovering aspects of how results are processed at the agent and events are processed in the TEMS.
Second arose from a customer case involving a situation that did not fire [a guard dog that didn't bark!]. After lengthy effort, the root cause was determined: the situation used a DisplayItem to make sure that all the monitored database processes offline were reported. In this problem case, the DisplayItem attribute "database name" was blank or null. Since there were a several such disabled database processes, only one of the blank ones generated an event and the rest were suppressed. The situation was rewritten to use a different DisplayItem that always supplied database name and a re-test showed events were created for all the database processes. Thus passed two months of pain. Diagnosing some such cases faster or ahead of time would improve monitoring with minimal effort.
Third is a goal to calculate other metrics such as the time delay between data processed at the agent and data processed at the TEMS. This can point out system performance issues and even system clock issues.
Use the supplied script eventdat.pl to captiure the TSITSTSH table at the hub and each remote TEMS. Then use the eventaud.pl script to detect and correct any problems. You will get more valid situation events. The eventaud.csv report contains explanations of the advisory messages and the report sections. Read the section here on the summary report. There are many other report sections and these are documented at the end of the report itself.
When situations are started, processing runs on the agents which evaluate conditions defined by the Situation Formula. The situation formula is a filter that takes the available data and determines which results should be passed on to the TEMS. Some formulas are run periodically and those are called Sampled Situations. Other formulas are run as needed to filter incoming sequential log-type data and those are called Pure events.
Sometimes a formula can pass back multiple results. Consider a check for unix mount point with free space lower than desired. Multiple disks can be lower than desired and multiple result rows returned. TEMS uses a DisplayItem to create individual situation events. A DisplayItem is a 128 [or less] beginning of another attribute. When these DisplayItems create unique values per result row, TEMS can create individual events. If DisplayItem is needed but not supplied, only one event is created, thereby hiding real events. The event is chosen somewhat randomly. If the selected DisplayItem is present but does not supply unique values [clearly an Agent issue - why supply a DisplayItem that does not uniquely identify a result row?] you can again miss events.
That issue can also happen with Pure situations. In addition the TEMS has performance logic such that if the same pure event with the same DisplayItem arrives from the same agent in the same second, the second and subsequent results are merged. At the TEP display the merged events can be seen. However they are not transmitted to the event receiver. If a DisplayItem is not defined [or not available], you can again have missed events. Even if you have a DisplayItem, if a flurry arrive in the same second with the same DisplayItem, you again have missed events. Sometimes no one minds: after all if you have one event or twenty in the same second - does anyone really care. There is a way to configure the TEMS to guarantee that every pure result is processed into a single event [see product documentation here]. I must point out that such a configuration can create a serious impact on TEMS and event receiver logic - Is there any real purpose in generating scores of events per second for essentially duplicate issues? As an alternative many such agent sources like Tivoli Log Agent have agent-side controls to hide duplicate identical reports.
Result Workload Measurement
Measuring result workload also has value. One recent case involved a simple sort of situation designed to measure Unix OS Agent CPU usage and alert when too high [more that 50%]. The Event Audit report showed that a staggering 96% of the incoming workload on that remote TEMS was coming from that situation. The logic to work backward... from the Situation Status History table, to the agent side result creation and the TEMS handling of the incoming results was complex but extremely rewarding.
This can also be rewarding with a negative result. A recent case showed signs of overload, but a time based estimate of incoming results showed nothing special around the time of the instability. The TEMS was processing peak levels of 3-4 results per second and most often cruised along at 0.5 to 1.0 results per second. As a result a lot of customer work gathering data and doing restarts was avoided. The root cause was an external influence which blocked tcp communication.
Other reports can show cases where the situation fluctuates between open and close and open and close, etc. Often that suggests a useless situation since it fails the basic goal of situations - to be rare, exceptional and fixable.
Getting the report
You can make this report yourself as described using an available package. You can also create a Case with IBM Support and upload one or more hub or remote TEMS pdcollects. Ask in the Case that the eventaud.csv report be sent back to you.
The package is ==>here<==. It contains
1) Perl script eventdat.pl - to gather the needed data from the ITM environment
2) Perl script eventaud.pl - to process the data and produce an Event Audit report.
I suggest eventdat.pl be placed an ITM installation tmp directory. For Windows you need to create the <installdir>\tmp directory. For Linux/Unix use the existing <installdir>/tmp directory. You can of course use any convenient directory. Any examples following will use these defaults
Linux and Unix almost always come with the Perl shell processor installed. For Windows you can install a no cost Community version from www.activestate.com if needed. You can also extract the files and copy them to a system where Perl is installed. No CPAN [community donated source] packages are needed.
Parameters for running eventdat.pl
All parameters are optional if defaults are taken
-h home installation directory for TEPS. Default is
This can also be supplied with an environment variable
Linux/Unix: export CANDLEHOME=/opt/IBM/ITM
Windows: set CANDLE_HOME=c:\IBM\ITM
-v produce progress messages, default off
-tems specify a TEMS to get event history - can supply multiples using multiple -tems. Default is all online TEMSes except FTO mirror hub TEMS.
-work directory to store work and report files, default is current directory
-off include offline agents, default off
-aff needed in rare database cases where affinities are in different structure
-redo used during testing by reusing result files
The eventdat.pl script will create several files
and one or more situation event history files of this form tems_name.TSITSTSH.LST.
The tems_name is HUB for the hub TEMS and otherwise it is the TEMS nodeid.
The eventdat.pl shell command also creates two shell files: run_eventaud.cmd and run_eventaud.sh. They invoke the eventaud script with needed parameters. ".cmd" for Windows, ".sh" for Linux/Unix... only difference is the line end characters to make each environment happier.
Parameters for running eventaud.pl
All parameters are optional if defaults are taken.
-h Help display
-v Verbose display - more messages
-nohdr skip report header - used in regression testing
-txt read input from TEMS2SQL processed files QA1CSITF.DB.TXT, QA1DNAME.DB.TXT, QA1CSTSH.DB.TXT
-lst [with no following TEMS name] get data from HUB.QA1CSITF.LST. HUB.QA1DNAME.LST, HUB.QA1CSTSH.LST
-lst with following TEMS name, get data from HUB.QA1CSITF.LST. HUB.QA1DNAME.LST, <temsnodeid>.QA1CSTSH.LST
-allresults display more results data, especially when the default single result is present
-sum produce a summary file eventaud.txt
-time [to be implemented]
-days limit look back to this many days before the most recent event, default 7 days, value 0 means use all data,
-o name of output file, default is eventaud.csv
-odir directory for output files, default /tmp or c:\temp
-tsitstsh filename of Situation Status History extract file
-workpath directory to store working files
Running eventaud,pl is just putting the needed files into a directory and running it like this:
cd /tmp where HUB.QA1CSITF.LST. HUB.QA1DNAME.LST, HUB.QA1CSTSH.LST \are present
perl eventaud.pl -lst
If you had a remote TEMS with the name RTEMS1 the commands would be
perl eventaud.pl -lst -tsitstsh RTEMS1
The report file is eventaud.csv and contains the advisories, reports and explanations at the end of the report.
Summary Report format
The eventaud.csv report is lengthy with advisories and report sections. To avoid the burden of over-documentation that few will read, the report itself contains documentation at the end which describes each advisory and report section including the meaning, portions of an example and suggested recovery action if required. This internal documentation is under revision as new diagnoses are made. For this blog post we will show and discuss a summary report section and what it all means. Most of the exception cases in the summary are explained in more detail in the report sections that follow.
EVENTREPORT000: Event/Result Summary Budget Report
Duration: 4476 Seconds
==> The time between the oldest event history and the most recent in seconds. The default -days 7 will translate to 604800 seconds. This especially important for testing environments where you may have some years old events. The only times measured at the Y [open event], N [close event] and X [problem] DELTASTAT column values. The S [Start Situation] is used in order to limit event calculation since the last time situation was started - usually when the TEMS started. Start Situations are not included in the time calculation. They are usually followed with a flood of N [close event] records which are not very interesting.
Total Open/Close Events: 337 4.52/min
==> Count of Y [open] and N[close] records over the duration captured. Total count and rate per minute is displayed. This are usually far less than results since most of the result rows are conforming existing Open event conditions.
Total Results: 3507 47.01/min
==> These are result rows sent from the agents. Sampled situation sent a "confirming" result every sampling interval. In fact if such a confirming result is not observed for three sampling intervals, the TEMS will generated an automatic close of the event. This can be confusing especially if the condition is supposed to be always true. It usually means there is some agent or TEMS or Networking interference happening. Definitely not good. In well running systems this never happens. If this is a hub TEMS, the results apply to all the hub and remote TEMS together and thus do not reflect stress on the hub System only.
Total Non-Forwarded Results: 1444 19.36/min [41.17%]
==> This records how many of the result rows were for Situations which were not configured to an event receiver like Netcool/Omnibus. This is only counted if some Situations were configured to send data to an event receiver. These should be considered because if no event receiver is processing the conditions, the issues will never be fixed and thus it is often a waste of time and resources. The last number is the percent non-forwarded compared to total results. This impacts the agent and the remote TEMS but not the hub TEMS... the hub TEMS only sees the Y and N results.
Total Result Bytes: 3506892 45.91 K/min Worry[9.18%]
==> This is an estimate of the number of bytes sent from the agent and Kbytes per minute. The Worry% is how close this is to the 500K/min level which is where you should be worried. This is not an absolute number. The actual point where a remote TEMS starts to break down could be higher or lower depending on the system power, competing workload and network. Your own experience is the best guide. The largest number ever seen was 93 megabytes/min and of course the remote TEMS were malfunctioning badly. Event Audit numbers are estimates only you would need to use TEMS Audit and tracing to get more precise numbers.
Total Non-Forwarded Result Bytes: 744464 9.75/min [21.23%]
==> This shows the portion of arriving result data associated with non-forwarded situations. See above for more discussion.
Sampled Results Confirm: 3479 46.64/min
==> Many incoming result rows are used to confirm existing open situation events. This is often a high proportion of the total. If the remote TEMS is under stress, you can reduce the stress by increasing the sampling interval, by adding a new remote TEMS to divide the workload, or by using a more powerful system,
Sampled Results Confirm Bytes: 3421828 44.79 K/min, 97.57% of total results
==> This is the bytes require to confirm open situation events. the number at the end is the per cent of total incoming result bytes.
Missing DisplayItem: 26 0.35/min
==> This counts the number of cases where a situation does not have DisplayItem configured and this has hidden situation events. This degrades the monitoring performance by not creating situation events to resolve problems.
Duplicate DisplayItem: 27 0.36/min
==> This counts the number of cases where a situation has a DisplayItem but at the agent multiple results were seen with identical DisplayItem values. This may degrade the monitoring performance by not creating situation events to resolve problems. Usually this means you cannot use that DisplayItem reliably. It may be considered an Agent issue since the Agent should not offer a DisplayItem that does not uniquely identify the result rows. The practical solution is to select another DisplayItem that does uniquely identify result rows. For example in Unix select Process_ID and not Process Command Name.
Null DisplayItem: 100 1133.57/min
==> This is much like the previous Duplicate case but it means Null values were returned. This has the same negative impact and the same considerations.
Pure Merged Results: 6 0.08/min
==> This reports on cases where multiple Pure Situation results, with same DisplayItem value, were processed at a single second at the TEMS and were suppressed by a different mechanism. If this is important, a per TEMS configuration can be made to enforce one result equals one event logic.
Open/Open transitions: 0
==> This rare condition is when an open event is followed by a second open event. It is not well understood.
Close/Close transitions: 0
==> This rare condition is when an close event is followed by a second close event. It is not well understood.
Delay Estimate opens over_minimum  over_average [1.81 seconds]
==> This is a stress indicator of how how the agent and network and TEMS handle incoming results. It can report as low as 0.0 seconds. If the TEMS is heavily loaded the number will tend to be larger. If you see a very large number that often means there is one or more agents running on a system, with a time vastly different that the TEMS time. That can be normal or abnormal and there is a report section devoted to which agents are showing cases where the time is sometimes much larger than the minimum time observed.
If things do not work as expected, please capture input LST files in a zip or compressed tar file and send to the author. I will endeavor to correct any issue promptly. Also please pass on any interesting ideas.
The information in the report will show cases where two or more TEMSes having differing information about particular agents. In the simplest cases that strongly suggests a case of duplicate agents. You can expect to see other associated posts concerning different aspects of event history.
This project is also maintained in github.com/jalvo2014/eventaud and will often be more up to date [and less tested] compared the the point releases. You can also use this github distribution to review history and propose changes via pull requests.
Add/correct some table row sizes
Add report032 by number of events
Correct report031/2 by number of node
Correct report999, sorting large numbers
Add/correct some table row sizes
Report on Acknowledge/Resurface conditions
monitor MS_Offline type sits
Add Critical Issues reporting
Add report on number of events
And copy section to summary
Correct TFWD checking logic
Add/Correct some table sizes
Correct table sizes
use TNAME to show full name in one report
Correct/add table sizes
Correct report004 merged result report
Add some new/corrected table sizes and report titles
Correct delay mode calculation in node delay analysis
Add irregular event processing report and advisory
Correct delay calculation
Don't tag unknown situations for detailed report section.
Skip report006 if no too_close timestamps
Add report025 for close-close and open-open transitions
Detect cases where DisplayItem is not in proper form
Note: Building a new Cruise Ship 2018