Sitworld: Event History #8 Situation Events Opening And Closing Frequently
John Alvord, IBM Corporation
Draft #1 – 21 April 2018 - Level 1.00000
The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.
This was seen in the Event Audit History Advisory section:
20,EVENTAUDIT1003W,TEMS,Situations  showing more than 1 open<->close transitions per hour per agent - see EVENTREPORT026
One situation showed a frequent rate of event opens and event closes
And in that report section:
EVENTREPORT026: Situations showing high Open<->Close rate
IBMSecOnGuardComm_W_Service,120,4.99,2,*IF *VALUE NT_Services.Current_State *NE Running *AND ( ( ( *VALUE NT_Services.Service_Name *EQ 'LS Communication Server' ) *OR ( *VALUE NT_Services.Service_Name *EQ TermService ) ) ),
There is a situation IBMSecOnGuardComm_W_Service. It is a Sampled situation at 120 seconds. About 5 times per hour the situation opens and closes on two different agents.
From EVENTREPORT011 we see
CorpSecOnGuardComm_W_Service,NTSERVICE,1468,120,1677,38.06%,0.17,2120790,3113319720,93.82%,0,0,0,0,0,0,2120790,3113319720,0,0,1677,2,*IF *VALUE NT_Services.Current_State *NE Running *AND ( ( ( *VALUE NT_Services.Service_Name *EQ 'LS Communication Server' ) *OR ( *VALUE NT_Services.Service_Name *EQ TermService ) ) ),
and it represented a whopping 38% of all events and 94% of the estimated incoming results workload on this remote TEMS. This is an estimate since it is a capture of what is seen in the Situation Event History table. Events that opened outside that capture could be influencing the results workload beyond what is seen here.
Deep dive Into the report details
Scan or search ahead for Report 999. It is sorted by first node, then situation, then by Time at the TEMS. I will first describe what you see and the guidance from the column description line.
This will show only a single open event and then close event, but there were many listed in the full report.
EVENTREPORT999: Full report sorted by Node/Situation/Time
Situation - Situation Name, which can be different from the Full Name that you see in situation editor, like too long or other cases.
Node - Managed System Name or Agent Name
Thrunode - The managed system that knows how to communicate with the agent, the remote TEMS in simple cases
Agent_Time - The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time is seen in multiple TEMS seconds because the Agent can produce data faster than then TEMS can process it at times. Simple cases have a last three digits of 999. Other cases will have tie breakers of 000,001,...,998 when a lot of data is being generated. This the UTC [earlier GMT] time at the agent.
TEMS_Time - The time as recorded at the TEMS during processing. This the UTC [earlier GMT] time.
Deltastat - event status. You generally see Y for open and N for close. There are more not recorded here.
Reeval - Sampling interval [re-evaluation] in seconds and 0 means a pure event.
Results - How many results were seen. The simplest cases are 1 and you would see that if you used -allresults control. In this report you only get a warning when there are multiple results.
Atomize - The table/column specification of the value used for Atomize. It can be null meaning not used.
DisplayItem - The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.
LineNumber - A debugging helper that tells what line of the TSITSTSH data dump supplied this information
PDT - The Predicate or Situation Formula as it is stored.
The Descriptor line - before we see the results.
IBMSecOnGuardComm_W_Service,Primary:LCDSA4173:NT,REMOTE_ibmz360,1180405132941999,1180405132941000,Y,120,1,NTSERVICE.SRVCNAME,LS Communication Server,362,*IF *VALUE NT_Services.Current_State *NE Running *AND ( ( ( *VALUE NT_Services.Service_Name *EQ 'LS Communication Server' ) *OR ( *VALUE NT_Services.Service_Name *EQ TermService ) ) ),
Situation was IBMSecOnGuardComm_W_Service, agent was Primary:LCDSA4173:NT. thrunode was REMOTE_ibmz360. Agent_time was 1180405132941999 and TEMS_time was 1180405132941000, the same second. It was an Open event [Y], sampling interval was 120, there was one result, the DisplayItem was NTSERVICE.SRVCNAME with a value of "LS Communication Server". The record came from line number 362 in the input. The PDT is shown at the end.
Following the descriptor line is one or more P [Predicate/formula] lines as used as the Agent logic, followed by the results contributing to the TEMS logic.
,,,,,,,P,*PREDICATE=NTSERVICE.CURRSTAT <> 'Running' AND ( ( ( NTSERVICE.SRVCNAME = 'LS Communication Server' ) OR ( NTSERVICE.SRVCNAME = 'TermService' ) ) ),
Following the predicate is one or more result lines. These are all in the form of Attribute=value in the Table/Column=raw_data form. There is a leading count of the index of this result line. In this case there was one P line and one result line. Sometimes there are many more but not this time.
,,,,,,,0,NTSERVICE.ACCONTID=LocalSystem;NTSERVICE.BINARYEX="C:\Program Files (x86)\OnGuard\Lnlcomsrvr.exe";NTSERVICE.CURRSTAT=Stopped;NTSERVICE.DISPNAME=LS Communication Server;NTSERVICE.LORGROUPNTSERVICE.ORIGINNODE=Primary:LCDSA4173:NT;NTSERVICE.SRVCNAME=LS Communication Server;NTSERVICE.STARTYPE=Automatic;NTSERVICE.TIMESTAMP=1180405132753000;NTSERVICE.UACCONTID=LocalSystem;NTSERVICE.UBINARYEX="C:\Program Files (x86)\OnGuard\Lnlcomsrvr.exe";NTSERVICE.UDISPNAME=LS Communication Server;NTSERVICE.USRVCNAME=LS Communication Server,
Next there is another descriptor line for the N or Close Event record 16 minutes after the open event.
IBMSecOnGuardComm_W_Service,Primary:LCDSA4173:NT,REMOTE_ibmz360,1180405134541999,1180405134541000,N,120,0,NTSERVICE.SRVCNAME,LS Communication Server,374,*IF *VALUE NT_Services.Current_State *NE Running *AND ( ( ( *VALUE NT_Services.Service_Name *EQ 'LS Communication Server' ) *OR ( *VALUE NT_Services.Service_Name *EQ TermService ) ) ),
What is the problem and How to fix it?
From this capture, 38% of the events and 94% of the result workload was from this situation. Doing that work constitutes a substantial investment. It fails the basic test of a good situation which is to be Rare, Exceptional, and Fixable. It is certainly not a rare condition, seems to be happening all the time. It is happening a lot and no one is "fixing" the condition.
The formula itself does not make sense. It fires if two particular Windows Services are not in Running state. It is very rare for a process to be always running. Perhaps if it is a highly computational task, like calculating how molecules fold you might see it. However most real life processes spend a good amount of time sleeping or waiting for new work or waiting for paging to complete or many other cases. That is why systems can run many tasks: the workload is shareable since most tasks are not running most of the time.
So this situation is not rare, not exceptional and clearly no one is "fixing" it. Therefore the situation should be rethought and reworked until it is rare, exceptional and fixable. If that is not possible, the situation should be stopped and deleted to make room for other useful work at the agent(s) and the TEMS and the event receivers.
A Warning Sign
Situations that open and close more than once an hour should be reviewed for reasonableness. They are a sign that no one really cases about the events.
Tale #8 of using Event Audit History to understand and review a possibly meaningless situation and thus save resources.
History and Earlier versions
There are no binary objects associated with this project.