Sitworld: AOA Critical Issue - High Incoming Workload
John Alvord, IBM Corporation
In August 2014, the Database Health Checker began running at IBM ECUREP as an Analysis On Arrival task on each incoming hub and remote TEMS pdcollect. Since then TEMS Audit and Event History Audit reports have been added. The reports are very useful because they identify known error conditions and thus speed ITM diagnosis of issues. Each of the tools can be run by any customer, but the AOA reports are not immediately visible. Any customer could ask for them, but since they are not visible no one ever asks. At the same time the reports have become more complex and challenging to digest.
With a recent change, the process has been extended to create a short list of critical issues which is automatically added to the S/F Case or PMR as a short email text. That creates visibility for critical issues. This document presents one specific critical issue: high incoming workload, usually from situations.
Please note that the conditions identified may not be the issue the problem case was opened for. For example, one recent case involved an unexpected FTO hub TEMS switch to backup. After close study, the major issues were misconfigured agents, including duplicate agent names, Virtual Hub Table Update floods, and several other items. There are also rare cases where a report will be produced concerning an obsolete TEMS that is definitely installed but not in active use. In that case the report can be ignored, although uninstalling the TEMS would be a good idea.
Getting more information
If you are viewing this document as a customer working with IBM Support, you are welcome to request copies of the Analysis On Arrival reports if they are available. Be sure to mention the unpack directory from the AOA Critical Issue report.
TEMS Audit - temsaud.csv [any hub or remote TEMS]
Database Health Checker - datahealth.csv [any hub TEMS]
Event History Audit - eventaud.csv [any hub or remote TEMS]
There are cases when no report is generated. Sometimes that means there were no advisories. TEMS Audit is not produced when the relevant log files cannot be identified. Database Health Checker runs but is skipped if the TEMS appears to be a remote TEMS. Event History Audit and Database Health Checker are not run if errors are detected in the table extract process.
Visit the links above to access the AOA programs if you want to run them on your own schedule.
High TEMS workload indications
eventaud.crit: Estimated Incoming result rate $ppc_result_rate worried $ppc_worry_pc
temsaud.crit: Hub TEMS has lost connection to HUB $hublost_total times
temsaud.crit: High incoming results $trespermin per minute worried[$wpc]
TEMSes can be destabilized by high incoming workload. That is usually from agents sending situation result data. Additional sources are agents sending historical data, real-time data requests, and agents that issue SQL for internal purposes, such as ITCAM for Transactions. However, it is mostly situation results. When a situation is true, the agent sends confirmation results each sampling interval, and that composes most of the situation workload.
The usual worry point is 500K bytes/minute, or 100% worry. That choice is based on experience. Certainly installations can go higher, or run into problems at a lower point. It all depends on the system where the TEMS is running and how much capacity and network performance is available. The peak rate seen was 93 megabytes/minute, and the 128 remote [and 8 hub] TEMSes were just about killed.
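To make the arithmetic concrete, the worry percentage is simply the observed incoming rate as a fraction of the 500K bytes/minute nominal worry point. This is a minimal sketch of that calculation; the function and variable names are illustrative, not taken from the AOA tools.

```python
# Sketch: worry percentage relative to the nominal worry point.
# WORRY_LIMIT and worry_percent are illustrative names, not AOA internals.

WORRY_LIMIT = 500_000  # bytes per minute - the usual worry point

def worry_percent(bytes_per_minute: float) -> float:
    """Return incoming workload as a percentage of the worry point."""
    return 100.0 * bytes_per_minute / WORRY_LIMIT

# 500K bytes/minute is exactly the worry point, i.e. 100% worry.
assert worry_percent(500_000) == 100.0
# The observed peak of 93 megabytes/minute works out to 18600% worry.
assert worry_percent(93_000_000) == 18600.0
```

As the 18600% figure shows, real installations can run far past the nominal limit before collapsing; the 100% point is a threshold for investigation, not a hard ceiling.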
The eventaud.crit incoming results message gives an estimate of workload based on recent history. It sometimes underestimates the actual load because a situation can be true but not recorded in the 8192-entry wraparound history data. If the estimate is high, reality might well be higher.
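The shape of that estimate follows from the confirmation-result behavior described above: each true sampled situation sends its result rows once per sampling interval, so the rate is the sum of row size times rows returned, scaled to a per-minute figure. The following is a hypothetical sketch of that logic; the function and its inputs are assumptions for illustration, not the eventaud internals.

```python
# Sketch of an incoming-rate estimate. Each true sampled situation sends
# confirmation result rows every sampling interval. Names are illustrative.

def estimated_rate(situations):
    """Estimate incoming bytes/minute from currently-true situations.

    situations: iterable of (row_size_bytes, rows_true, interval_seconds).
    """
    total = 0.0
    for row_size, rows, interval_sec in situations:
        # Results arrive once per interval; scale to a per-minute rate.
        total += row_size * rows * (60.0 / interval_sec)
    return total

# Two true situations: 500-byte rows, 3 rows, every 60s  -> 1500 bytes/min;
# 1000-byte rows, 10 rows, every 30s -> 20000 bytes/min.
rate = estimated_rate([(500, 3, 60), (1000, 10, 30)])
assert rate == 21500.0
```

Note how a short sampling interval dominates: halving the interval doubles the rate, which is why situation sampling intervals are a natural tuning knob when the workload is too high.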
The parallel temsaud.crit incoming results message requires a TEMS trace to be present: KBB_RAS1=error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er). Some clients turn that on permanently; the added diagnostic trace output is minimal [one line per arriving result set].
The last indication, "Hub TEMS has lost connection to HUB", implies a severe hub TEMS work overload. The warning message is paradoxical but makes sense in context. The SITMON process is attempting to update a status using an SQL to the dataserver, and the SQL has not completed after a 20-minute timeout. Most times that is a severe workload issue; however, it could be other things, such as excessive TEMS action commands or an external process starving the hub TEMS of CPU time.
Often these need a proper TEMS Audit workload trace and analysis. When a situation is identified as a culprit it can be evaluated for reasonableness. Situations should be
4) Resources available to fix issue
If those do not apply, it is a waste of resources to run the situation and send out tickets. On one memorable occasion, 80% of a hub TEMS workload came from a single situation on a single Unix OS Agent running on a system that was supposed to be powered off and decommissioned. The situation was not associated with any TEPS node and was not forwarded to an event receiver; it was just sitting in the background burning up resources and hurting important processing.
The information in the report shows how to handle cases where the TEMS is being subjected to high workload.
Note: 2018 - Home Grown Meyer Lemons