Sitworld: SOAP Flash Flood
John Alvord, IBM Corporation
I was working at a customer site with occasional SOAP issues and ITM instability. There was large hub TEMS diagnostic tracing in place including SOAP request activity. One Monday afternoon about 1pm the ITM environment went unstable. All eight remote TEMS went offline and thus all agents showed as offline. An attempt to do a tacmd login did not complete. About an hour later remote TEMS and the connected agents started coming back online. After 2 hours everything was running normally.
As it happened the hub TEMS was running with this trace
+52CEEACF.0000 KBB_RAS1: ERROR (UNIT:kpxrpcrq,Entry:"IRA_NCS_Sample" ER ST) (UNIT:ko4sitma,Entry:"IBInterface::processOneResult" ER DET) (UNIT:kdsstc1,Entry:"ProcessTable" ER ST) (UNIT:kshreq,ENTRY:"buildSQL" ER DET) (UNIT:kshprs,ENTRY:"WSQL_StartSoapRequest" ER FLOW) (UNIT:kshstrt,ENTRY:"default_service" ALL) (UNIT:kdsruc1 ST ER) (UNIT:kfaadloc ST ER) (UNIT:kdssqprs IN ER METRICS) (UNIT:kfaprpst ST ER)(UNIT:kfansins FLOW ST ER)
In particular the SOAP request SQLs were captured. The log lines looked like this
(52CEF7ED.0015-ACC:kshreq.cpp,955,"buildSQL") Using pre-built SQL: SELECT NODELIST, NODE FROM O4SRV.TNODELST WHERE NODELIST='*HUB'
The SOAP request count totaled 10,097 in 244,194 seconds [roughly 2.8 days]. That is about one SOAP per 24 seconds. That is low compared to what I have seen,
I manually reviewed the time range when the instability was observed. There seemed to be a sudden burst of SOAP activity about that time. I updated TEMS Audit in two ways
- Add a SOAP Burst advisory – if there were more then 300 requests in a minute
- Add a SOAP Detail report on a minute by minute basis [ignoring zero count minutes]
There had been a burst at that time. Here is the advisory.
Advisory: SOAP Burst requests per minute 1049% higher then nominal 300 at line 280017 in C:/pmrdata/59608,113,848…
Here is what that section of the SOAP detail report looked like [edited for space reasons].
Secs Count Line Log-segment
114080 5 196164 C:/pmrdata/59608,113,848
114120 2053 227910 C:/pmrdata/59608,113,848
114180 3447 280017 C:/pmrdata/59608,113,848
117000 1439 345137 C:/pmrdata/59608,113,848
117001 1 345161 C:/pmrdata/59608,113,848
Looking at the log segments in detail, there were almost 7000 SOAP requests in 133 seconds.
I first imagined an evil actor. However the SQL did not look strange or unusual.
WHERE ID = "XX_NT0_EVL_WIE_ID4624_ADMIN
LOCFLAG,RESERVED,MAP FROM O4SRV.EVNTMAP
WHERE OBJNAME='XX_NT0_EVL_WIE_ID4624_ADMIN' AND
OBJCLASS = '5140' AND
SYSTEM.PARMA("QIBNODE", "QOMEGAVIEW", 32)
Almost half were TOBJACCL requests… about 3000. That happened to be the number of situations in the environment. TOBJACCL is the TEMS table that defines distribution - what objects should run at what managed systems.
Later in the week, the customer reported back that someone had routinely done a
tacmd login …
tacmd bulkExportSit -d
And now it made sense. That bulkExportSit function got a list of all the situations. The -d option meanst to include the distribution in the xml capture of the situation object.
The TOBJACCL table is kept in a TEMS database table – QA1DOBJA.DB/IDX. It was pretty large – about 8 megabytes. In a 133 second period, that table needed to be read over 3000 times. That totals 24 gigabytes of data. The TEMS got so backlogged with work that it did not communicate with the remote TEMS. The remote TEMS tasks went into offline status. The ITM environment did recover in 2 hours or so,
The tacmd logic could do things more efficiently – like reading TOBJACCL and other tables just once. There is IBM Support work underway to determine if an APAR fix is warranted. The customer started doing that work on quiet times in the weekend,
A puzzling issue was fully diagnosed.
Note: Looking South from Big Sur - Los Angeles is 300 miles away