Sitworld: Event History #6 Lost events with Multiple Results with same DisplayItem at same TEMS second
John Alvord, IBM Corporation
Draft #1 – 17 April 2018 - Level 1.00000
The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.
This was seen in the Event Audit History Advisory section:
50,EVENTAUDIT1013W,TEMS,Situations  lost [merged] events Multiple Results with same DisplayItem at same TEMS second - see EVENTREPORT004
This particular case showed up when the DisplayItem was configured however two results were produced and processed at the TEMS at the same TEMS second. The Atomize value was identical however and therefore the second result did not produce an event. Therefore an event was lost.
And in that report section:
EVENTREPORT004: Situations with Multiple results at TEMS with same DisplayItem at same second
bnc_errpt_xulm_aixgr_01,Pure,1180410020221000,2,bnc_viomtl18x:KUL,ULLOGENT.UENTRYDESC,DC73C03A: SOFTWARE PROGRAM ERROR,
There is a situation bnc_errpt_xulm_aixgr_01. It is a Pure situation and at Agent second 1180410020221000 on agent bnc_viomtl18x:KUL there were 2 results. The DisplayItem was ULLOGENT.UENTRYDESC and the Atomize value was "DC73C03A: SOFTWARE PROGRAM ERROR". As a result an event was hidden.
Deep dive Into the report details
Scan or search ahead for Report 999. It is sorted by first node, then situation, then by Time at the TEMS. I will first describe what you see and the guidance from the column description line.
EVENTREPORT999: Full report sorted by Node/Situation/Time
Situation - Situation Name, which can be different from the Full Name that you see in situation editor, like too long or other cases.
Node - Managed System Name or Agent Name
Thrunode - The managed system that knows how to communicate with the agent, the remote TEMS in simple cases
Agent_Time - The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time is seen in multiple TEMS seconds because the Agent can produce data faster than then TEMS can process it at times. Simple cases have a last three digits of 999. Other cases will have tie breakers of 000,001,...,998 when a lot of data is being generated. This the UTC [earlier GMT] time at the agent.
TEMS_Time - The time as recorded at the TEMS during processing. This the UTC [earlier GMT] time.
Deltastat - event status. You generally see Y for open and N for close. There are more not recorded here.
Reeval - Sampling interval [re-evaluation] in seconds and 0 means a pure event.
Results - How many results were seen. The simplest cases are 1 and you would see that if you used -allresults control. In this report you only get a warning when there are multiple results.
Atomize - The table/column specification of the value used for Atomize. It can be null meaning not used.
DisplayItem - The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.
LineNumber - A debugging helper that tells what line of the TSITSTSH data dump supplied this information
PDT - The Predicate or Situation Formula as it is stored.
The Descriptor line - before we see the results.
bnc_errpt_xulm_aixgr_01,bnc_viomtl18x:KUL,REMOTE_camtram070nbfra,1180410020217999,1180410020221000,Y,0,2,ULLOGENT.UENTRYDESC,DC73C03A: SOFTWARE PROGRAM ERROR,2227,*IF *VALUE Log_Entries.Log_Name_U *EQ 'errlog' *AND *VALUE Log_Entries.Class *EQ Software *AND *SCAN Log_Entries.Description *NE 'A924A5FC:' *AND *SCAN Log_Entries.Description *NE '813FE820:' *AND *SCAN Log_Entries.Description *NE '8FED25B9:' *AND *SCAN Log_Entries.Description *NE 'C5C09FFA:' *AND *SCAN Log_Entries.Description *NE '1BA7DF4E:' *AND *SCAN Log_Entries.Description *NE 'A6DF45AA' *AND *VALUE Log_Entries.Type *IN ('P','T','U') *AND *SCAN Log_Entries.Description *NE 0873CF9F *AND *SCAN Log_Entries.Description *NE 2F64580C *AND *SCAN Log_Entries.Description *NE 573790AA *AND *SCAN Log_Entries.Description *NE FE2DEE00,
Situation was bnc_errpt_xulm_aixgr_01, agent was bnc_viomtl18x:KUL. thrunode was REMOTE_camtram070nbfra. Agent_time was 1180410020217999 and TEMS_time was 1180410020221000, 4 seconds later. It was an Open event [Y], the DisplayItem ULLOGENT.UENTRYDESC with a value of "DC73C03A: SOFTWARE PROGRAM ERROR". There is a line number 2227 of the input. The PDT is shown also,re was no DisplayItem, sampling interval was 300 seconds was and the situation formula checked for the existence of a certain mount point.
Following the descriptor line is one or more P [Predicate/formula] lines as used as the Agent logic, followed by the results contributing to the TEMS logic.
,,,,,,,P,*PREDICATE=ULLOGENT.ULOGNAME = N'errlog' AND ULLOGENT.ENTRYCLASS = 'S' AND STRSCAN(ULLOGENT.ENTRYDESC, 'A924A5FC:') = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, '813FE820:') = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, '8FED25B9:') = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, 'C5C09FFA:') = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, '1BA7DF4E:') = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, 'A6DF45AA') = 0 AND ( ULLOGENT.ENTRYTYPE = 'P' OR ULLOGENT.ENTRYTYPE = 'T' OR ULLOGENT.ENTRYTYPE = 'U' ) AND STRSCAN(ULLOGENT.ENTRYDESC, '0873CF9F') = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, '2F64580C') = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, '573790AA') = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, 'FE2DEE00') = 0,
Following the predicate is one or more result lines. These are all in the form of Attribute=value in the Table/Column=raw_data form. There is a leading count of the index of this result line. In this case there were many P lines and many result lines. More comments follow. Ignore the funny emoticons that some Browsers convert equal [=] followed by semicolon [;] into. If needed you can copy/paste the line into a line mode editor for study. Clearly the results were coming in very fast, but apparently they arrived in three separate bundles of 4 total results.
,,,,,,,0,ULLOGENT.ENTRYCLASS=S;ULLOGENT.ENTRYDESC=DC73C03A: SOFTWARE PROGRAM ERROR;ULLOGENT.ENTRYSRC=fscsi1;ULLOGENT.ENTRYSYS=viomtl18x;ULLOGENT.ENTRYTIME=1180410020000000;ULLOGENT.ENTRYTYPE=T;ULLOGENT.FREQTHRESH=0;ULLOGENT.LOGNAME=errlog;ULLOGENT.LOGPATH=/var/adm/ras/;ULLOGENT.ORIGINNODE=bnc_viomtl18x:KUL;ULLOGENT.PERIODTHRS=0;ULLOGENT.TIMESTAMP=1180410020043000;ULLOGENT.UENTRYDESC=DC73C03A: SOFTWARE PROGRAM ERROR;ULLOGENT.UENTRYSRC=fscsi1;ULLOGENT.ULOGNAME=errlog;ULLOGENT.ULOGPATH=/var/adm/ras/,
,,,,,,,1,ULLOGENT.ENTRYCLASS=S;ULLOGENT.ENTRYDESC=DC73C03A: SOFTWARE PROGRAM ERROR;ULLOGENT.ENTRYSRC=fscsi1;ULLOGENT.ENTRYSYS=viomtl18x;ULLOGENT.ENTRYTIME=1180410020000000;ULLOGENT.ENTRYTYPE=T;ULLOGENT.FREQTHRESH=0;ULLOGENT.LOGNAME=errlog;ULLOGENT.LOGPATH=/var/adm/ras/;ULLOGENT.ORIGINNODE=bnc_viomtl18x:KUL;ULLOGENT.PERIODTHRS=0;ULLOGENT.TIMESTAMP=1180410020045000;ULLOGENT.UENTRYDESC=DC73C03A: SOFTWARE PROGRAM ERROR;ULLOGENT.UENTRYSRC=fscsi1;ULLOGENT.ULOGNAME=errlog;ULLOGENT.ULOGPATH=/var/adm/ras/,
What is the problem and How to fix it?
As can be seen, the attributes are identical except for ULLOGENT.TIMESTAMP which is 1180410020043000 and 1180410020045000. It might not much difference that the same identical result is seen once or twice. However if it does make a difference, the TEMS can be configured for Pure Situations to force a separate situation event for each result row.
On the other hand, loading down the agent and the TEMS processing the same data over and over seems wasteful. The Tivoli Log Agent can be configured to filter duplicate events and thus save resources at the Agent and the TEMS. Or the situation formula can be altered to exclude this case.
A Warning Note
This situations may violate the general guidance that situations should be Rare, Exceptional and Fixable. It certainly doesn't seem rare. It feels like someone should fix the issue and thus avoid the monitoring overhead completely.
Tale #6 of using Event Audit History to understand and correct a type of Incorrect DisplayItem conditions and thus get more results.
History and Earlier versions
There are no binary objects associated with this project.