• Share
  • ?
  • Profiles ▼
  • Communities ▼
  • Apps ▼

Blogs

  • My Blogs
  • Public Blogs
  • My Updates

This community can have members from outside your organization. IBM Tivoli Monitoring Wonderful World of Situations

  • Log in to participate

▼ Tags

▼ Similar Entries

Use a non-zero NPGTH...

Blog: Db2 for z/OS ...
Paul_McWilliams 110000JT36
Updated
1 people likes thisLikes 1
CommentsComments 1

Linux on IBM Z perfo...

Blog: Ingolf's z/VS...
Ingolf24 120000DRN3
Updated
0 people like thisLikes 0
No CommentsComments 0

Building baas: (serv...

Blog: IBM and Googl...
MJonker 100000GP4U
Updated
1 people likes thisLikes 1
No CommentsComments 0

Cache Invalidation U...

Blog: CSE-WebSphere...
ShoebBihari 3100001AME
Updated
1 people likes thisLikes 1
No CommentsComments 0

How to Identify Clas...

Blog: Application I...
MicheleCalcavecchia 270000HCF1
Updated
1 people likes thisLikes 1
No CommentsComments 0

▼ Similar Ideas

Re: 2014 2nd Edition...

Ideation Blog: IBM PureData-...
shubho 270001FMSR
Updated
No Votes 0 No CommentsComments 0

Statistics in Netezz...

Ideation Blog: IBM PureData-...
DeepashriKrishnaraja 270001C7Y3
Updated
Votes 1 CommentsComments 3

Importance of settin...

Ideation Blog: IBM PureData-...
DeepashriKrishnaraja 270001C7Y3
Updated
Votes 2 CommentsComments 5

Understanding Netezz...

Ideation Blog: IBM PureData-...
vinoy 270001RPDP
Updated
Votes 2 CommentsComments 2

▼ Archive

  • April 2018
  • June 2017
  • May 2017
  • March 2017
  • December 2016
  • November 2016
  • September 2016
  • June 2016
  • May 2016
  • March 2016
  • January 2016
  • December 2015
  • September 2015
  • August 2015
  • April 2015
  • March 2015
  • February 2015
  • January 2015
  • November 2014
  • October 2014
  • July 2014
  • June 2014
  • May 2014
  • April 2014
  • March 2014
  • February 2014
  • January 2014
  • December 2013
  • November 2013
  • September 2013
  • August 2013
  • July 2013
  • June 2013
  • May 2013
  • April 2013
  • March 2013

▼ Blog Authors

IBM Tivoli Monitoring Wonderful World of Situations

View All Entries
Clicking the button causes a full page refresh. The user could go to the "Entry list" region to view the new content.) Entry list

Sitworld: SOAP Flash Flood

jalvord 1200009463 | | Tags:  situation performance ‎ | 6,091 Views

Sitworld: SOAP Flash Flood

image
John Alvord, IBM Corporation
jalvord@us.ibm.com

 

Follow on twitter

Inspiration

I was working at a customer site with occasional SOAP issues and ITM instability. There was large hub TEMS diagnostic tracing in place including SOAP request activity. One Monday afternoon about 1pm the ITM environment went unstable. All eight remote TEMS went offline and thus all agents showed as offline. An attempt to do a tacmd login did not complete. About an hour later remote TEMS and the connected agents started coming back online. After 2 hours everything was running normally.

 

Diagnostics

As it happened the hub TEMS was running with this trace

+52CEEACF.0000         KBB_RAS1: ERROR (UNIT:kpxrpcrq,Entry:"IRA_NCS_Sample" ER ST)  (UNIT:ko4sitma,Entry:"IBInterface::processOneResult" ER DET) (UNIT:kdsstc1,Entry:"ProcessTable" ER ST) (UNIT:kshreq,ENTRY:"buildSQL" ER DET)  (UNIT:kshprs,ENTRY:"WSQL_StartSoapRequest" ER FLOW)  (UNIT:kshstrt,ENTRY:"default_service" ALL) (UNIT:kdsruc1 ST ER)  (UNIT:kfaadloc ST ER) (UNIT:kdssqprs IN ER METRICS)  (UNIT:kfaprpst ST ER)(UNIT:kfansins FLOW ST ER)

In particular the SOAP request SQLs were captured. The log lines looked like this

(52CEF7ED.0015-ACC:kshreq.cpp,955,"buildSQL") Using pre-built SQL: SELECT NODELIST, NODE FROM O4SRV.TNODELST WHERE NODELIST='*HUB'

The SOAP request count totaled 10,097 in 244,194 seconds [roughly 2.8 days]. That is about one SOAP per 24 seconds. That is low compared to what I have seen,

I manually reviewed the time range when the instability was observed. There seemed to be a sudden burst of SOAP activity about that time.  I updated TEMS Audit in two ways

  1. Add a SOAP Burst advisory – if there were more then 300 requests in a minute
  2. Add a SOAP Detail report on a minute by minute basis [ignoring zero count minutes]

There had been a burst at that time. Here is the advisory.

Advisory: SOAP Burst requests per minute 1049% higher then nominal 300 at line 280017 in C:/pmrdata/59608,113,848…

Here is what that section of the SOAP detail report looked like [edited for space reasons].

Secs   Count Line   Log-segment            
.....................                      
114080     5  196164 C:/pmrdata/59608,113,848
114120  2053  227910 C:/pmrdata/59608,113,848
114180  3447  280017 C:/pmrdata/59608,113,848
117000  1439  345137 C:/pmrdata/59608,113,848
117001     1  345161 C:/pmrdata/59608,113,848

Looking at the log segments in detail, there were almost 7000 SOAP requests in 133 seconds.

 

Detective work

I first imagined an evil actor. However the SQL did not look strange or unusual.

SELECT FULLNAME
FROM O4SRV.TNAME
WHERE ID = "XX_NT0_EVL_WIE_ID4624_ADMIN

SELECT  ID,OBJCLASS,SEQUENCE,TYPE,LSTDATE,LSTUSRPRF,LSTRELEASE,
LOCFLAG,RESERVED,MAP FROM O4SRV.EVNTMAP
WHERE ID='XX_NT0_EVL_WIE_ID4624_ADMIN'

SELECT NODEL
FROM O4SRV.TOBJACCL
WHERE OBJNAME='XX_NT0_EVL_WIE_ID4624_ADMIN' AND
OBJCLASS = '5140' AND
SYSTEM.PARMA("QIBNODE", "QOMEGAVIEW", 32)

Almost half were TOBJACCL requests… about 3000. That happened to be the number of situations in the environment. TOBJACCL is the TEMS table that defines distribution - what objects should run at what managed systems.

Later in the week, the customer reported back that someone had routinely done a

tacmd login …
tacmd bulkExportSit -d

And now it made sense. That bulkExportSit function got a list of all the situations. The -d option meanst to include the distribution in the xml capture of the situation object.

The TOBJACCL table is kept in a TEMS database table – QA1DOBJA.DB/IDX. It was pretty large – about 8 megabytes. In a 133 second period, that table needed to be read over 3000 times. That totals 24 gigabytes of data. The TEMS got so backlogged with work that it did not communicate with the remote TEMS. The remote TEMS tasks went into offline status. The ITM environment did recover in 2 hours or so,

 

Results

The tacmd logic could do things more efficiently – like reading TOBJACCL and other tables just once. There is IBM Support work underway to determine if an APAR fix is warranted. The customer started doing that work on quiet times in the weekend,

 

Summary

A puzzling issue was fully diagnosed.

Sitworld: Table of Contents

 

Note: Looking South from Big Sur - Los Angeles is 300 miles away

 

 

Modified on by jalvord 1200009463
  • Add a Comment Add a Comment
  • Edit
  • More Actions v
  • Quarantine this Entry
Notify Other People
notification

Send Email Notification

+

Quarantine this entry

deleteEntry
duplicateEntry

Mark as Duplicate

  • Previous Entry
  • Main
  • Next Entry
Feed for Blog Entries | Feed for Blog Comments | Feed for Comments for this Entry