Sitworld: AOA Critical Issue - TEMS Possible TCP Blockage
John Alvord, IBM Corporation
In August 2014, the Database Health Checker began running at IBM ECUREP as an Analysis On Arrival task on each incoming hub and remote TEMS pdcollect. Since then TEMS Audit and Event History Audit reports have been added. The reports are very useful for identifying known error conditions and thus speeding ITM diagnosis of issues. Each of the tools can be run by any customer, but the AOA reports are not immediately visible. Any customer could ask for them, but since they are not visible no one ever asks. At the same time the reports have become more complex and challenging to digest.
With a recent change, the process has been extended to create a short list of critical issues which will automatically be added to the S/F Case or PMR as a short email text. That creates visibility for critical issues. This document covers the issue where there is evidence of a possible TCP blockage at a TEMS.
Please note that the conditions identified may not be the issue the problem case was opened for. For example one recent case was an unexpected FTO hub TEMS switch to backup. After close study, the major issues were mal-configured agents including duplicate name cases, Virtual Hub Table Update floods and several other items. There are also rare cases where a report will be produced concerning an obsolete TEMS that is definitely installed but not in active use. In that case the report could be ignored - although uninstalling the TEMS would be a good idea.
We are still learning about the following rare condition. Some cases have been diagnosed and this document will be updated as we have new information.
Getting more information
If you are viewing this document as a customer working with IBM Support, you are welcome to request copies of the Analysis On Arrival reports if they are available. Be sure to mention the unpack directory from the AOA Critical Issue report.
TEMS Audit - temsaud.csv [any hub or remote TEMS]
Database Health Checker - datahealth.csv [any hub TEMS]
Event History Audit - eventaud.csv [any hub or remote TEMS]
There are cases when no report is generated. Sometimes that means there were no advisories. TEMS Audit is not produced when the relevant log files cannot be identified. Database Health Checker is skipped if the pdcollect appears to come from a remote TEMS. Event History Audit and Database Health Checker are not run if errors are detected in the table extract process.
Visit the links above to access the AOA programs if you want to run them on your own schedule.
TEMS Possible TCP Blockage
This error is identified in the TEMS Audit task:
tmsaud.crit: Possible TCP Blockage: Recv-Q[13,1290] Send-Q[10,33203]
This is a relatively unusual condition. TCP communications traffic normally flows smoothly, and if you run the command "netstat -an" on Linux/Unix you will see something like this:
Active Internet connections (including servers)
Proto  Recv-Q  Send-Q  Local Address        Foreign Address        (state)
tcp         0       0  *.21                 *.*                    LISTEN
tcp4        0       0  188.8.131.52.22      184.108.40.206.62599   ESTABLISHED
tcp         0       0  *.111                *.*                    LISTEN
tcp4        0       0  220.127.116.11.22    18.104.22.168.64999    ESTABLISHED
Recv-Q is the number of bytes in the receive queue... ready for reading but not processed yet.
Send-Q is the number of bytes in the send queue... ready to send but not delivered yet.
Local Address is on the current system. The two port 22 connections above are for the SSH daemon.
Foreign Address is the remote system.
The Critical Issue warning is limited to TCP sockets reading from or writing to ITM associated port numbers. The example critical issue text above means that there were 13 ITM related socket connections where the Recv-Q was over 1024 bytes and the maximum was 1290 bytes. There were also 10 ITM socket connections with Send-Q over 1024 bytes and the maximum was 33203 bytes. Normally you would review the TEMS Audit report itself to see the details.
This is most often seen at a remote TEMS, and the remote TEMS appears offline either solidly or intermittently. Recycling the remote TEMS often relieves the issue temporarily, but it often recurs. It was seen at a hub TEMS once.
The cases diagnosed so far follow.
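As an illustration of how the two pairs of numbers are derived, here is a minimal Python sketch that builds the same style of summary from captured "netstat -an" text. It is not the TEMS Audit code. The function names are hypothetical, and the port set (1918 and 3660, the usual ITM defaults) is an assumption you should replace with the ports your site actually uses.

```python
import re

# Assumption for illustration: ports treated as "ITM associated".
# Replace with the ports actually configured at your site.
ITM_PORTS = {1918, 3660}
THRESHOLD = 1024  # bytes, matching the description above


def port_of(addr):
    """Return the trailing port of a netstat address (host.port or host:port), or None."""
    m = re.search(r"[.:](\d+)$", addr)
    return int(m.group(1)) if m else None


def summarize(netstat_text, ports=ITM_PORTS, threshold=THRESHOLD):
    """Count ITM sockets whose Recv-Q or Send-Q exceeds threshold and report the maxima."""
    recv, send = [], []
    for line in netstat_text.splitlines():
        f = line.split()
        # Only TCP rows with numeric queue columns are of interest.
        if len(f) < 5 or not f[0].startswith("tcp"):
            continue
        if not (f[1].isdigit() and f[2].isdigit()):
            continue
        # Keep the row only if either end uses an ITM port.
        if port_of(f[3]) not in ports and port_of(f[4]) not in ports:
            continue
        recv_q, send_q = int(f[1]), int(f[2])
        if recv_q > threshold:
            recv.append(recv_q)
        if send_q > threshold:
            send.append(send_q)
    return "Recv-Q[%d,%d] Send-Q[%d,%d]" % (
        len(recv), max(recv, default=0), len(send), max(send, default=0))
```

Feeding it the saved output of "netstat -an" from the TEMS system gives a count and maximum for each queue, in the same layout as the critical issue text.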
1) A TEMS to Agent socket showed high Send-Q. When the ITM agents on that foreign system were stopped, the remote TEMS was stable after a TEMS restart. On the system where the ITM agents were stopped, there was a database server running with extremely high amounts of TCP traffic - mostly through localhost. The admins for that agent system were involved and they recycled the database server and the problem stopped happening. The ITM agents connected afterwards and ran fine with no impact on the TEMS.
2) A problem similar to (1), but at the system running the agents there was a very busy ITM Summarization and Pruning agent that was running almost 24x7. The S&P agent was re-configured to use only a single thread instead of eight threads. After that the remote TEMS ran without any impact.
3) Client had a large system with 5000+ mal-configured Windows OS Agents. In particular each OS Agent had the normal KDC_FAMILIES specified in the Windows Registry and also [invalidly] a KDE_TRANSPORT= line in the KNTENV file. This caused constant switching back and forth and this TCP blockage occurred many times a week at several remote TEMS. In this case the netstat -an showed a large number of foreign systems with high Recv-Q buffer bytes. When the Windows OS Agents were properly configured the TEMSes ran without incident.
4) Several large installations had many accidental duplicate agent name cases. This caused many issues and TCP Blockage was seen in some pdcollects.
5) In one case the customer had a single WPA [HD agent - collects historical data from agents for trans-shipment to the database]. At times this intense activity caused a TCP Blockage condition.
6) A site had a very high level of Virtual Hub Table updates. This causes intense communication loading every few minutes, all concentrated in a single second. This was seen to cause TCP blockage in some cases. See Sitworld: ITM Virtual Table Termite Control Project for how to correct the issue.
7) A remote TEMS was constantly going offline and then online at the hub TEMS. The netstat -an at the remote TEMS showed a single agent with high Send-Q and many agents with low Recv-Q. We logged into the system running the high Send-Q agent. It was a Unix OS Agent and that was the only ITM agent running. A normal stop was issued, ./itmcmd agent stop ux, and the stop failed. A forced stop was issued, ./itmcmd agent -f stop ux, and this worked. The Unix OS Agent was then started with ./itmcmd agent start ux. After that the remote TEMS behaved normally.
If there are a few large Send-Q buffer cases, identify what ITM agents are running on the foreign addresses and stop them after getting pdcollects for IBM Support to review. Remember that in most cases there will be a few problem cases and a lot more victims. Look for something at the agent system generating high TCP traffic.
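To find the few foreign addresses worth investigating first, one possible approach is to rank them by Send-Q from saved "netstat -an" output. The following is a hedged Python sketch with hypothetical function names. It does not filter by ITM port, so it assumes you either pass it only the ITM related lines or accept some unrelated sockets in the result.

```python
import re


def split_addr(addr):
    """Split a netstat address (host.port or host:port) into (host, port string)."""
    m = re.match(r"^(.*)[.:](\d+|\*)$", addr)
    return (m.group(1), m.group(2)) if m else (addr, "")


def high_send_q(netstat_text, threshold=1024):
    """Return [(foreign_host, largest_send_q)] for sockets over threshold, largest first."""
    worst = {}
    for line in netstat_text.splitlines():
        f = line.split()
        # Skip headers and anything that is not a TCP row with a numeric Send-Q.
        if len(f) < 5 or not f[0].startswith("tcp") or not f[2].isdigit():
            continue
        send_q = int(f[2])
        if send_q <= threshold:
            continue
        host, _port = split_addr(f[4])
        # Remember only the largest Send-Q seen per foreign host.
        worst[host] = max(send_q, worst.get(host, 0))
    return sorted(worst.items(), key=lambda kv: kv[1], reverse=True)
```

The hosts at the top of the returned list are the candidates for collecting pdcollects and then stopping agents. The many systems with small queues are more likely victims than causes.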
Look for general problems as seen in the TEMS Audit: duplicate agent name cases, agents with many listening ports, etc. If the remote TEMS or hub TEMS is overloaded that should be alleviated as part of the solution.
The information in the report explains how to manage an AOA Critical Issue concerning possible TCP blockage.
Added example case 7.
Note: 2018 - Home Grown Meyer Lemons