Sitworld: Adventures in Communications #1
John Alvord, IBM Corporation
Draft #1 – 5 July 2018 - Level 1.00000
There have been a lot of communication issues recently that were challenging to solve. It is interesting to look at the issues and their resolutions.
The symptom here was a stalled remote TEMS that did hardly any communication. Restarting the remote TEMS resolved the issue for a day or so, but eventually it got stuck again.
This was seen in the TEMS Audit:
Advisory: 99,TEMSAUDIT1088W,TCP,TCP Queue Delays 22 Send-Q [max 66131] Recv-Q [max 9448] - see Report TEMSREPORT051
This means that 22 TCP sockets were showing a non-zero buffer usage. The maximum Send-Q buffer was 66131 bytes.
In the Report051 section:
f1000e0005d0cbb8 tcp4 0 66131 22.214.171.124.65100 126.96.36.199.39482 ESTABLISHED
So the local address was a Warehouse Proxy Agent [WPA or HD] and the target was some system where agents were running. After reviewing the hub TEMS database, it appeared that a Tivoli Log Agent and also a Summarization and Pruning agent were running on that system.
This report section comes from a netstat -an capture. There were more high buffer values. In such cases usually one socket is the culprit and the rest are victims. A high Send-Q value is almost always the key indicator. Look at the foreign address - an agent system - and review that system. If it also shows high Send-Q/Recv-Q values, it needs a closer look. We suggest stopping all the ITM agents on systems where netstat -an shows a high Send-Q, with high meaning more than 8192 bytes. After stopping all the agents, recycle the affected TEMS. Ideally you review the potential problem agent systems, but you could just start the agents one at a time and watch for issues.
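The review described above can be roughed out in a few lines of Python. This is only a sketch, not part of TEMS Audit: the function names are made up for illustration, and it assumes the capture has the usual columns in the order protocol, Recv-Q, Send-Q, local address, foreign address, state, optionally preceded by a socket address token as in the Report051 line.

```python
from typing import Iterable, List, NamedTuple, Optional

SEND_Q_THRESHOLD = 8192  # bytes; the "high" cutoff suggested in the text


class Sock(NamedTuple):
    proto: str
    recv_q: int
    send_q: int
    local: str
    foreign: str
    state: str


def parse_line(line: str) -> Optional[Sock]:
    """Parse one netstat -an TCP line; return None for anything else."""
    fields = line.split()
    # Locate the protocol column so a leading socket address token is skipped.
    start = next((i for i, tok in enumerate(fields) if tok.startswith("tcp")), None)
    if start is None or len(fields) - start < 6:
        return None
    proto, recv_q, send_q, local, foreign, state = fields[start:start + 6]
    if not (recv_q.isdigit() and send_q.isdigit()):
        return None
    return Sock(proto, int(recv_q), int(send_q), local, foreign, state)


def high_send_q(lines: Iterable[str], threshold: int = SEND_Q_THRESHOLD) -> List[Sock]:
    """Return sockets whose Send-Q exceeds the threshold, worst first."""
    socks = [s for s in map(parse_line, lines) if s is not None]
    return sorted((s for s in socks if s.send_q > threshold),
                  key=lambda s: s.send_q, reverse=True)


def foreign_host(sock: Sock) -> str:
    """Strip the port, which follows the last '.' or ':' in the address."""
    addr = sock.foreign
    cut = max(addr.rfind("."), addr.rfind(":"))
    return addr[:cut] if cut > 0 else addr
```

Feeding the lines of a saved netstat -an capture to high_send_q and then calling foreign_host on each result gives the list of agent systems to review first.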
So What is the big Deal?
Well-running systems never show large Send or Receive buffer bytes pending. Whatever is there is always transient. When there are a lot of bytes pending, the buffer for doing new TCP work is exhausted and no new communications can proceed. This is a definite worst case and the condition often persists until the TEMS is recycled. In the meantime all the agents go offline as well as the TEMS. So it really is a bad condition: there is no monitoring going on at those agents and recovery is disruptive.
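Since healthy sockets only show transient backlog, one way to separate a real stall from a momentary spike is to compare several netstat -an captures taken some time apart. The sketch below is illustrative only; the function name and the input shape (one dictionary per capture, mapping a local/foreign address pair to Send-Q bytes) are assumptions, and the 8192 byte cutoff is the one suggested earlier.

```python
from typing import Dict, List, Sequence, Tuple

Conn = Tuple[str, str]  # (local address, foreign address)


def persistent_backlog(snapshots: Sequence[Dict[Conn, int]],
                       threshold: int = 8192) -> Dict[Conn, List[int]]:
    """Return connections whose Send-Q is above threshold in every snapshot.

    A connection that is high once and then drains is treated as transient
    and left out of the result.
    """
    if not snapshots:
        return {}
    stuck = None
    for snap in snapshots:
        high = {conn for conn, send_q in snap.items() if send_q > threshold}
        # Keep only connections that were also high in all earlier snapshots.
        stuck = high if stuck is None else stuck & high
    return {conn: [snap[conn] for snap in snapshots] for conn in sorted(stuck)}
```

A connection that appears in the result across captures several minutes apart is a good candidate for the culprit and victim review.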
The ITM TEMS is a real time system that defers to other processes - it does its best to be a good neighbor. If there is a batch process using a LOT of TCP, then the TEMS can be blocked out for long periods - and sometimes until it is recycled. If this happens at an agent system, a normal ITM process like the TEMS can attempt contact and be blocked itself. The agent side TCP issue then backs up and locks up the TEMS the agent is connected to. When the buffer space is full, the TEMS itself is logically blocked and unable to work properly.
What was happening HERE!!
In this case there was a Summarization and Pruning agent running. This is a vital service when you are collecting historical data. Without it the storage space would grow and grow "forever". The S&P agent was configured with 8 threads. That meant that when it was operational [starting at 2am and running for several hours] S&P would dominate all the TCP communications. It was running as a batch process, working as fast as it could. The ITM communications were blocked out. The TEMS and WPA services attempted to communicate, but that could not continue since the agent side system was blocked up. And thus the TEMS/WPA services were totally blocked. In the end the TEMS/WPA needed to be recycled. And the next night the same risk was present.
You might see the same thing happening on a system with a WPA. One WPA reviewed recently was using 30 threads and was running on the same system as the hub TEMS. Reducing that to 4 threads [the system had 8 cores] eliminated the conflict. Better yet would be to configure WPAs at each remote TEMS so the WPA communication workload would be spread out and the hub TEMS WPA would have little network competition.
The S&P agent was reconfigured to use only a single thread. It took a bit longer to complete overnight but now it "played nice" with communications and the TEMS/WPA ran smoothly.
Communications adventure #1 - caused by an over-active Summarization and Pruning agent.
History and Earlier versions
There are no binary objects associated with this project.