Sitworld: Re-re-re-mem-member Situation Status Cache Growth Analysis
John Alvord, IBM Corporation
Draft #1 – 1 August 2015 - Level 0.51000
Recently I had two cases where a remote TEMS process size grew and grew and performance was horrible. To speed up analysis of such cases, the following project and tool were developed, and now anyone can figure out one common case.
When an agent connects to a TEMS [hub or remote] the TEMS gets the right situations running at the agent and evaluates the agent results to determine if a situation event should be opened or closed. The agent sends results and has no awareness of "events". In fact the results sent to the TEMS might be feeding a workflow policy or *UNTIL *SIT processing. The agent sends one set of results and the TEMS makes copies for all the different purposes.
The TEMS uses an in-storage table called Situation Status Cache - TSITSTSC, and you can view it as a disk table QA1CSTSC.DB/IDX. For example, if a result with data arrives and the situation event has already been presented, a second event must be avoided. This table maintains current status. If the situation is sampled and has Persist=4 configured, 4 results must be returned in a row before a situation event is presented. If three in a row arrive and then a 0 row or false result, no situation event is created.
Here are the major data items preserved in the in-storage copy of TSITSTSC:
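The Persist behavior described above can be sketched as a small counter. This is a simplified model for illustration only, not TEMS code; the class and function names are hypothetical, and the `count` field merely plays the role that SITCOUNT plays in the table.

```python
from dataclasses import dataclass


@dataclass
class PersistTracker:
    """Hypothetical model of the persist counter for one sampled situation."""
    persist: int            # consecutive true results required before an event
    count: int = 0          # plays the role of SITCOUNT
    is_open: bool = False   # True once an event has been presented

    def observe(self, result_true: bool) -> str:
        """Feed one sample result and return the action taken."""
        if not result_true:
            # A 0 row or false result resets the run of consecutive results.
            was_open = self.is_open
            self.count = 0
            self.is_open = False
            return "close" if was_open else "none"
        self.count += 1
        if self.count >= self.persist and not self.is_open:
            self.is_open = True
            return "open"
        # Either still counting, or the event is already open:
        # a second event must be avoided.
        return "none"


def run(samples, persist=4):
    """Replay a list of True/False sample results through one tracker."""
    tracker = PersistTracker(persist=persist)
    return [tracker.observe(s) for s in samples]
```

With Persist=4, `run([True, True, True, False])` never opens an event, while `run([True] * 5)` opens one on the fourth result and stays quiet on the fifth.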
|LCLTMSTMP|Time situation was evaluated at Agent|
|NODE|Managed System Name or Agent Name|
|SITNAME|Situation Name [can be index if long name]|
|DELTASTAT|Status - Y for open, N for close and others|
|ATOMIZE|DisplayItem if configured|
|SITCOUNT|Current Persist Count|
This is a very ordinary sort of processing assist table.
In one specific case, this in-storage table can grow "forever" or until the TEMS is recycled. That case combines the following conditions:
- Pure Situation - monitoring a log like Windows NT Event Log
- Large volume of results
- DisplayItem used and constantly changes - such as a Description that has an embedded date/time stamp
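To see why a constantly changing DisplayItem matters, here is a minimal sketch. It assumes the cache holds one row per distinct node, situation and DisplayItem combination; that keying is an assumption made for illustration, and the managed system and situation names are invented. The real table is searched linearly, so every extra row also adds search cost, which a Python dictionary does not model.

```python
def cache_rows(results):
    """Count rows in a toy status cache keyed by (node, sitname, atomize).

    The key layout is an assumption for illustration, not the TEMS format.
    """
    cache = {}
    for node, sitname, atomize in results:
        cache[(node, sitname, atomize)] = "Y"   # record current status
    return len(cache)


# Hypothetical pure situation on a log: same DisplayItem every time.
stable = [("host1:NT", "nt_event_log", "disk full") for _ in range(1000)]

# Same situation, but the DisplayItem embeds a changing time stamp.
changing = [
    ("host1:NT", "nt_event_log", f"disk full at tick {i:04d}")
    for i in range(1000)
]

stable_rows = cache_rows(stable)       # one row, reused for every result
changing_rows = cache_rows(changing)   # one new row per result
```

In this model the stable DisplayItem keeps the cache at a single row, while the changing DisplayItem adds a row for every result that arrives.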
When TEMSes run as 32-bit program objects [kdsmain], the upper limit is somewhere under 2 gigabytes. There is one Linux configuration which allows somewhat under 3 gigabytes. The storage growth from the Situation Status cache eventually causes a TEMS failure. It also forces higher and higher CPU Resource consumption because the in-storage table is searched linearly.
These days many TEMSes run as 64-bit program objects. The failure mode now is that TEMS size and resource consumption rise until someone notices and recycles the TEMS. On one memorable occasion an AIX LPAR actually exhausted system paging space and experienced a forced shutdown.
Why Create Situations which Cause such problems?
One reason is that it is convenient to have that DisplayItem filled in. On the Portal Client the Situation Event Console will tell you more about the impact or issue. For example, a full mount point situation can show which mount point. That can also be useful in programming event receiver logic. However, for the problem case above, with very high volume and long Descriptions, that reason is hard to justify.
A second reason involves a little-documented optimization whereby situation events can be merged. If a result with no DisplayItem arrives with 1) the same node, 2) the same situation and 3) the same time to the second as an earlier result, it can be logically merged with the matching result. Within Portal Client, the multiple events can be displayed [to a maximum of 10]. From an event receiver [e.g. Omnibus] standpoint, the second and subsequent results in the same second are never seen. If the DisplayItem is specified and different, each result will cause a separate situation event. In many cases this allows the event receiver to see all the events.
This merging can happen with Sampled situations but it is very rare and almost never causes a problem.
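The merging rule can be illustrated with a toy model. It treats two results as the same event when node, situation, time to the second and DisplayItem all match; this is a simplification for illustration, not the TEMS implementation, and it does not model the Portal Client limit of 10. All names and values are invented.

```python
def count_events(results):
    """Count distinct events seen by an event receiver in a toy model.

    Each result is (node, sitname, second, display_item); an empty string
    stands for "no DisplayItem configured".
    """
    seen = set()
    for node, sitname, second, display_item in results:
        seen.add((node, sitname, second, display_item))
    return len(seen)


# Three results in the same second, no DisplayItem: merged into one event.
merged = count_events([
    ("host1:LO", "log_error", 1438387200, ""),
    ("host1:LO", "log_error", 1438387200, ""),
    ("host1:LO", "log_error", 1438387200, ""),
])

# Same three results, each with a different DisplayItem: three events.
separate = count_events([
    ("host1:LO", "log_error", 1438387200, "error A"),
    ("host1:LO", "log_error", 1438387200, "error B"),
    ("host1:LO", "log_error", 1438387200, "error C"),
])
```

This is the trade-off the section describes: a distinct DisplayItem lets the receiver see every event, at the cost of one status row per distinct value.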
Since 2010 there has been a TEMS configuration to prevent such merging. See this technote for full details and implementation instructions.
You configure a TEMS that receives results from a given agent type to apply "one result creates one row" logic for a specific attribute group.
An even better solution is to use a modern agent like Tivoli Log Agent - which has been part of ITM for many years. That Agent can be configured to send results directly to the event receiver and thus not burden TEMS at all.
Identifying the Situation Status Cache Issue
The easiest way to recognize the issue is to check the size of the TEMS QA1CSTSC.DB file. If this is more than 32 megabytes *and* if it keeps growing over time, the problem may exist. If that file grows into the hundreds of megabytes, the TEMS is trending toward failure. You might have to check more than one remote TEMS depending on how the agent workload is configured.
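The size check can be scripted along these lines. This is only a sketch: the function names are hypothetical, the location of QA1CSTSC.DB depends on your installation and must be supplied by you, and the 200 megabyte figure is an assumed stand-in for "hundreds of megabytes".

```python
import os

MB = 1024 * 1024


def current_size(path):
    """Return the size in bytes of the file at path (path supplied by caller)."""
    return os.path.getsize(path)


def assess(sizes_bytes, warn_mb=32, critical_mb=200):
    """Classify a series of file sizes, oldest first.

    warn_mb follows the 32 megabyte rule of thumb in the text;
    critical_mb is an assumed threshold for "hundreds of megabytes".
    """
    if not sizes_bytes:
        raise ValueError("need at least one size sample")
    latest = sizes_bytes[-1]
    growing = len(sizes_bytes) >= 2 and latest > sizes_bytes[0]
    if latest >= critical_mb * MB:
        return "critical"
    if latest > warn_mb * MB and growing:
        return "suspect"
    return "ok"
```

For example, you might record `current_size(...)` once a day and pass the collected values to `assess`. A file that is large but not growing is reported as "ok", matching the *and* in the rule above.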
Until now, identifying the specific situations causing the issue has been extremely technical. This blog post and project will let you do it yourself anytime you want. This project contains a data capture command and an analysis program which will show you which situations are contributing to TSITSTSC growth in bytes per day. That report can be used to make the needed configuration changes and thus make monitoring stable and more efficient. This can also be done by IBM Support if needed.
The following assumes the default install directory. The data collection work is done on the system which runs the TEPS. You can certainly do this any number of ways. For example, you could capture the data at the TEPS and then copy the files somewhere else to process. If you are using a non-default install directory, then the shell files will need to be modified. The choice of where to store the program objects is arbitrary - pick whatever you want.
The package is here. It contains:
1) Perl program sitcache.pl - standing for Situation Status Cache.
2) A sitcached.cmd [Windows] shell command to run the SQL statements.
3) A sitcached.tar file which contains Linux/Unix versions of the SQL files and a sitcached.sh file. This avoids problems with the line endings. Just untar that into the install directory.
I suggest these all be placed in a single directory. For Windows you need to create the tmp directory and the sql subdirectory. For Linux/Unix create the sql directory.
4) Most often you want to investigate a specific remote TEMS. The sitcached shell/cmd file takes an optional parameter of the TEMS nodeid [not the hostname].
Running the Program.
Linux/Unix:
a) cd /opt/IBM/ITM/tmp/sql
b) If not using the default install directory, specify it like this: export CANDLEHOME=/opt/IBM/ITM
c) sh sitcached.sh - if interested in a specific remote TEMS: sh sitcached.sh <temsnodeid>
d) perl sitcache.pl -lst
Windows:
a) cd to the tmp\sql directory under the install directory
b) If not using the default install directory, specify it like this: SET CANDLE_HOME=c:\IBM\ITM
c) sitcached.cmd - if interested in a specific remote TEMS: sitcached.cmd <temsnodeid>
d) perl sitcache.pl -lst
One file is created - sitcache.csv.
Here is a view of the CSV file from LibreOffice Calc. Some rows were deleted for this presentation.
The Situation name presented is the FullName - as would be seen in Situation Editor. The report is shown in descending order by an estimate of the number of bytes held in storage and the growth in bytes per day. The Total line shows the number of seconds covered by the data -- about 12 days in this case.
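A growth rate in bytes per day can be derived from a byte estimate and the elapsed seconds as sketched below. This is one plausible reading of the report columns, not necessarily the formula sitcache.pl uses; the function name and the sample byte count are invented for illustration.

```python
SECONDS_PER_DAY = 86400


def growth_per_day(total_bytes, elapsed_seconds):
    """Scale an estimated byte total to a per-day growth rate."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return total_bytes * SECONDS_PER_DAY / elapsed_seconds


# Illustrative numbers only: 120,000,000 bytes accumulated over 12 days.
twelve_days = 12 * SECONDS_PER_DAY
rate = growth_per_day(120_000_000, twelve_days)
```

Here `rate` works out to 10,000,000 bytes per day, the kind of figure that lets you rank situations by how fast they fill the cache.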
This specific case showed two situations which made up 80% of the storage growth. Both were Unix Log Agent situations. When the remote TEMS was configured for Pure "one result, one row" processing and the DisplayItem was removed, the problem was resolved.
The Situation Cache tool was derived from Situation Distribution Report.
History and Earlier versions
If the current version of the Situation Cache tool does not work, you can try previously published binary object zip files. At the same time, please contact me to resolve the issues. If you discover an issue, try intermediate levels to isolate where the problem was introduced.
Photo Note: Attaching a Propulsion Unit to a New Cruise Ship in Italy