Sitworld: AOA Critical Issue - TEMS to TEMS High Latency network connection
John Alvord, IBM Corporation
In August 2014, the Database Health Checker began running at IBM ECUREP as an Analysis On Arrival task on each incoming hub and remote TEMS pdcollect. Since then TEMS Audit and Event History Audit reports have been added. The reports are very useful for identifying known error conditions and thus speeding ITM diagnosis of issues. Each of the tools can be run by any customer, but the AOA reports are not immediately visible. Any customer could ask for them, but because they are not visible no one ever asks. At the same time the reports have become more complex and challenging to digest.
With a recent change, the process has been extended to create a short list of critical issues which will automatically be added to the S/F Case or PMR as a short email text. That creates visibility for critical issues. This document presents one specific critical issue - High Latency Connection between two TEMS - usually a hub TEMS and a remote TEMS.
Please note that the conditions identified may not be the issue the problem case was opened for. For example, one recent case was an FTO hub TEMS switch to backup that was unexpected. After close study, the major issues were mal-configured agents including duplicate name cases, Virtual Hub Table Update floods and several other items. There are also rare cases where a report will be produced concerning an obsolete TEMS that is still installed but not in active use. In that case the report could be ignored - although uninstalling the TEMS would be a good idea.
Getting more information
If you are viewing this document as a customer working with IBM Support, you are welcome to request copies of the Analysis On Arrival reports if they are available. Be sure to mention the unpack directory from the AOA Critical Issue report.
TEMS Audit - temsaud.csv [any hub or remote TEMS]
Database Health Checker - datahealth.csv [any hub TEMS]
Event History Audit - eventaud.csv [any hub or remote TEMS]
There are cases when no report is generated. Sometimes that means there were no advisories. TEMS Audit is not produced when the relevant log files cannot be identified. Database Health Checker is started but skips its work if the TEMS appears to be a remote TEMS. Event History Audit and Database Health Checker are not run if errors are detected in the table extract process.
Visit the links above to access the AOA programs if you want to run them on your own schedule.
temsaud.crit: Early remote SQL failures [&syncdist_early]
TEMS to TEMS communication requires relatively low latency and near zero packet loss. There is no absolute rule about when problems occur. However, many large customers keep latency under 20 milliseconds. Many customers run with latency at 100 milliseconds. At 250 milliseconds or more most customers have problems. The symptoms are many - basically the distant TEMS will show as offline and not do the expected work. This all depends on how much work is happening. With less work there is a better chance of success.
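The latency figures above can be turned into a rough triage helper. This is only a minimal sketch: the 20, 100 and 250 millisecond values come from the text, but the function name, the band names and the treatment of the range between 100 and 250 milliseconds are assumptions made for illustration. As noted, there is no absolute rule, so the result is a hint and not a verdict.

```python
# Thresholds quoted in the text above. The band names are illustrative
# assumptions, not ITM terminology.
TYPICAL_LARGE_SITE_MS = 20   # many large customers stay under this
COMMON_MS = 100              # many customers run at about this level
PROBLEM_MS = 250             # at or above this most customers have problems


def classify_latency(rtt_ms: float) -> str:
    """Return a rough band for a measured TEMS to TEMS round trip time."""
    if rtt_ms < 0:
        raise ValueError("latency cannot be negative")
    if rtt_ms < TYPICAL_LARGE_SITE_MS:
        return "low"
    if rtt_ms <= COMMON_MS:
        return "moderate"
    if rtt_ms < PROBLEM_MS:
        # The text gives no guidance for this range, so it is flagged
        # as a gray zone (assumption).
        return "elevated"
    return "high"


if __name__ == "__main__":
    for sample in (8, 100, 180, 300):
        print(sample, "ms ->", classify_latency(sample))
```

A result of "elevated" or "high" suggests looking at how much work the remote TEMS is doing, since the workload influences whether a given latency is tolerable.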
One useful tool is APM: ITM Communications Validation. Especially useful is the special form of ping that sends large packets in Do Not Fragment mode, which is basically how ITM sends its traffic.
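For readers who want to try the large packet, Do Not Fragment style of ping by hand, the following is a minimal sketch that builds the command line for the system ping utility. It is not the ITM Communications Validation tool. The function name, the host name and the default payload of 1472 bytes (a 1500 byte Ethernet MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header) are assumptions for illustration. Only the Linux iputils and Windows forms of ping are covered.

```python
import platform
import subprocess


def build_df_ping(host, payload_bytes=1472, count=4, system=None):
    """Build a ping command that sets Do Not Fragment with a large payload.

    Hypothetical helper: covers Linux (iputils ping) and Windows only.
    """
    system = system or platform.system()
    if system == "Windows":
        # -f sets Do Not Fragment, -l is the payload size, -n the count
        return ["ping", "-f", "-l", str(payload_bytes), "-n", str(count), host]
    if system == "Linux":
        # -M do prohibits fragmentation, -s is the payload size, -c the count
        return ["ping", "-M", "do", "-s", str(payload_bytes), "-c", str(count), host]
    raise NotImplementedError("no Do Not Fragment ping form known for " + system)


if __name__ == "__main__":
    # "rtems.example.com" is a placeholder for the distant TEMS host name.
    cmd = build_df_ping("rtems.example.com")
    print(" ".join(cmd))
    # Uncomment to actually run the ping from the TEMS server:
    # subprocess.run(cmd, check=False)
```

If the large packet ping fails while a default size ping succeeds, the path between the two TEMSes likely has a smaller MTU than expected. Round trip times reported by the ping give a first estimate of the latency.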
The message quoted in the heading above is often seen on high latency links. When first connecting to the hub TEMS, the remote TEMS copies a number of large tables using Remote SQL, which has a default timeout of 600 seconds. When that copy fails, the message is produced.
If the latency cannot be reduced, the usual workaround is to configure a hub TEMS at the distant site. It can send events to an event receiver, and that process is not latency sensitive.
This document shows how to manage high latency issues between two TEMSes.
Note: 2018 - Home Grown Meyer Lemons