• Share
  • ?
  • Profiles ▼
  • Communities ▼
  • Apps ▼

Blogs

  • My Blogs
  • Public Blogs
  • My Updates
  • Administration
  • Log in to participate

▼ Tags

 

▼ Archive

  • February 2019
  • October 2018
  • August 2018
  • July 2018
  • May 2018
  • April 2018
  • June 2017
  • May 2017
  • March 2017
  • December 2016
  • November 2016
  • September 2016
  • June 2016
  • May 2016
  • March 2016
  • January 2016
  • December 2015
  • September 2015
  • August 2015
  • April 2015
  • March 2015
  • February 2015
  • January 2015
  • November 2014
  • October 2014
  • July 2014
  • June 2014
  • May 2014
  • April 2014
  • March 2014
  • February 2014
  • January 2014
  • December 2013
  • November 2013
  • September 2013
  • August 2013
  • July 2013
  • June 2013
  • May 2013
  • April 2013
  • March 2013

▼ Blog Authors

IBM Tivoli Monitoring Wonderful World of Situations

View All Entries
Clicking the button causes a full page refresh. The user could go to the "Entry list" region to view the new content.) Entry list

Sitworld: Event History #8 Situation Events Opening And Closing Frequently

jalvord 1200009463 | ‎ | 2,967 Views

 

Sitworld: Event History #8 Situation Events Opening And Closing Frequently

 

image

John Alvord, IBM Corporation
jalvord@us.ibm.com

 

Draft #1 – 21 April 2018 - Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

 

This was seen in the Event Audit History Advisory section:

20,EVENTAUDIT1003W,TEMS,Situations [1] showing more than 1 open<->close transitions per hour per agent - see EVENTREPORT026 

 

One situation showed a frequent rate of event opens and event closes

 

And in that report section:

EVENTREPORT026: Situations showing high Open<->Close rate
Situation,Reeval,Rate,Node_ct,PDT
IBMSecOnGuardComm_W_Service,120,4.99,2,*IF *VALUE NT_Services.Current_State *NE Running *AND ( ( ( *VALUE NT_Services.Service_Name *EQ 'LS Communication Server' ) *OR ( *VALUE NT_Services.Service_Name *EQ TermService ) ) ),
  

 

There is a situation IBMSecOnGuardComm_W_Service. It is a Sampled situation at 120 seconds. About 5 times per hour the situation opens and closes on two different agents.

 

From EVENTREPORT011 we see

 

Situation,Table,Rowsize,Reeval,Event,Event%,Event/min,Results,ResultBytes,Result%,Miss,MissBytes,Dup,DupBytes,Null,NullBytes,SampConfirm,SampConfirmBytes,PureMerge,PureMergeBytes,transitions,nodes,PDT
CorpSecOnGuardComm_W_Service,NTSERVICE,1468,120,1677,38.06%,0.17,2120790,3113319720,93.82%,0,0,0,0,0,0,2120790,3113319720,0,0,1677,2,*IF *VALUE NT_Services.Current_State *NE Running *AND ( ( ( *VALUE NT_Services.Service_Name *EQ 'LS Communication Server' ) *OR ( *VALUE NT_Services.Service_Name *EQ TermService ) ) ),
 

and it represented  a whopping 38% of all events and 94% of the estimated incoming results workload on this remote TEMS. This is an estimate since it is a capture of what is seen in the Situation Event History table. Events that opened outside that capture could be influencing the results workload beyond what is seen here.

 

Deep dive Into the report details

Scan or search ahead for Report 999. It is sorted by first node, then situation, then by Time at the TEMS. I will first describe what you see and the guidance from the column description line.

This will show only a single open event and then close event, but there were many listed in the full report.

 

EVENTREPORT999: Full report sorted by Node/Situation/Time
Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT  

 

Situation - Situation Name, which can be different from the Full Name that you see in situation editor, like too long or other cases.

Node - Managed System Name or Agent Name

Thrunode - The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time - The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time is seen in multiple TEMS seconds because the Agent can produce data faster than then TEMS can process it at times. Simple cases have a last three digits of 999. Other cases will have tie breakers of 000,001,...,998 when a lot of data is being generated. This the UTC [earlier GMT] time at the agent.

TEMS_Time - The time as recorded at the TEMS during processing. This the UTC [earlier GMT] time.

Deltastat - event status. You generally see Y for open and N for close. There are more not recorded here.

Reeval - Sampling interval [re-evaluation] in seconds and 0 means a pure event.

Results - How many results were seen. The simplest cases are 1 and you would see that if you used -allresults control. In this report you only get a warning when there are multiple results.

Atomize - The table/column specification of the value used for Atomize. It can be null meaning not used.

DisplayItem - The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.

LineNumber - A debugging helper that tells what line of the TSITSTSH data dump supplied this information

PDT  - The Predicate or Situation Formula as it is stored.

 

 

The Descriptor line - before we see the results.

IBMSecOnGuardComm_W_Service,Primary:LCDSA4173:NT,REMOTE_ibmz360,1180405132941999,1180405132941000,Y,120,1,NTSERVICE.SRVCNAME,LS Communication Server,362,*IF *VALUE NT_Services.Current_State *NE Running *AND ( ( ( *VALUE NT_Services.Service_Name *EQ 'LS Communication Server' ) *OR ( *VALUE NT_Services.Service_Name *EQ TermService ) ) ),

 

Situation was IBMSecOnGuardComm_W_Service, agent was Primary:LCDSA4173:NT. thrunode was REMOTE_ibmz360. Agent_time was 1180405132941999 and TEMS_time was 1180405132941000, the same second. It was an Open event [Y], sampling interval was 120, there was one result, the DisplayItem was NTSERVICE.SRVCNAME with a value of  "LS Communication Server". The record came from line number 362 in the input. The PDT is shown at the end.

 

Following the descriptor line is one or more P [Predicate/formula] lines as used as the Agent logic, followed by the results contributing to the TEMS logic.

 

,,,,,,,P,*PREDICATE=NTSERVICE.CURRSTAT <> 'Running' AND ( ( ( NTSERVICE.SRVCNAME = 'LS Communication Server' ) OR ( NTSERVICE.SRVCNAME = 'TermService' ) ) ),

 

Following the predicate is one or more result lines. These are all in the form of Attribute=value in the Table/Column=raw_data form. There is a leading count of the index of this result line. In this case there was one P line and one result line. Sometimes there are many more but not this time. 


,,,,,,,0,NTSERVICE.ACCONTID=LocalSystem;NTSERVICE.BINARYEX="C:\Program Files (x86)\OnGuard\Lnlcomsrvr.exe";NTSERVICE.CURRSTAT=Stopped;NTSERVICE.DISPNAME=LS Communication Server;NTSERVICE.LORGROUP=;NTSERVICE.ORIGINNODE=Primary:LCDSA4173:NT;NTSERVICE.SRVCNAME=LS Communication Server;NTSERVICE.STARTYPE=Automatic;NTSERVICE.TIMESTAMP=1180405132753000;NTSERVICE.UACCONTID=LocalSystem;NTSERVICE.UBINARYEX="C:\Program Files (x86)\OnGuard\Lnlcomsrvr.exe";NTSERVICE.UDISPNAME=LS Communication Server;NTSERVICE.USRVCNAME=LS Communication Server,

 

Next there is another descriptor line for the N or Close Event record 16 minutes after the open event.

  

IBMSecOnGuardComm_W_Service,Primary:LCDSA4173:NT,REMOTE_ibmz360,1180405134541999,1180405134541000,N,120,0,NTSERVICE.SRVCNAME,LS Communication Server,374,*IF *VALUE NT_Services.Current_State *NE Running *AND ( ( ( *VALUE NT_Services.Service_Name *EQ 'LS Communication Server' ) *OR ( *VALUE NT_Services.Service_Name *EQ TermService ) ) ),

 

What is the problem and How to fix it?

From this capture, 38% of the events and 94% of the result workload was from this situation. Doing that work constitutes a substantial investment. It fails the basic test of a good situation which is to be Rare, Exceptional, and Fixable. It is certainly not a rare condition, seems to be happening all the time. It is happening a lot and no one is "fixing" the condition. 

The formula itself does not make sense. It fires if two particular Windows Services are not in Running state. It is very rare for a process to be always running. Perhaps if it is a highly computational task, like calculating how molecules fold you might see it. However most real life processes spend a good amount of time sleeping or waiting for new work or waiting for paging to complete or many other cases. That is why systems can run many tasks: the workload is shareable since most tasks are not running most of the time.

So this situation is not rare, not exceptional and clearly no one is "fixing" it. Therefore the situation should be rethought and reworked until it is rare, exceptional and fixable. If that is not possible, the situation should be stopped and deleted to make room for other useful work at the agent(s) and the TEMS and the event receivers.

 

A Warning Sign

Situations that open and close more than once an hour should be reviewed for reasonableness. They are a sign that no one really cases about the events.

Summary

Tale #8 of using Event Audit History to understand and review a possibly meaningless situation and thus save resources.

 

Sitworld: Table of Contents

 

History and Earlier versions

There are no binary objects associated with this project.

1.000000
initial release

 

Photo Note: Red Canisters - Cruise Ship Build 2016

Modified on by jalvord 1200009463
  • Add a Comment Add a Comment
  • Edit
  • More Actions v
  • Quarantine this Entry
Notify Other People
notification

Send Email Notification

+

Quarantine this entry

deleteEntry
duplicateEntry

Mark as Duplicate

  • Previous Entry
  • Main
  • Next Entry
Feed for Blog Entries | Feed for Blog Comments | Feed for Comments for this Entry