IBM Support

Probes running in P2P mode

Troubleshooting


Problem

The primary probe stops receiving heartbeats from the secondary probe. The secondary probe stops logging P2P messages, does not send events when the primary probe is unavailable, and grows in memory.

Symptom

The primary probe works normally, but the secondary probe shows unbounded memory growth. Depending on the incoming event rate (and whether the probe is 32-bit or 64-bit), the growth can exhaust all available memory on the host, causing the secondary probe to exit.

The primary probe log shows an increasing number of outstanding heartbeats as it fails to connect to the secondary probe.

Another symptom is that the TCP port used for the peer-to-peer heartbeat on the secondary probe gets stuck in a CLOSE_WAIT or SYN_RECV state.
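The socket state can be checked by filtering netstat output for the heartbeat port. The following Python sketch is illustrative only: the function name, the sample addresses and the port number 4100 are hypothetical, and it assumes that the connection state is the last column and that the local address is the first column containing a ':' or '.' separator. The netstat column layout differs between Linux, AIX and Solaris (which also reports the half-open state as SYN_RCVD), so verify the output format on your host before relying on it.

```python
import re

# States that suggest the heartbeat socket is stuck.
# SYN_RCVD is the spelling used by some platforms for SYN_RECV.
STUCK_STATES = {"CLOSE_WAIT", "SYN_RECV", "SYN_RCVD"}


def stuck_sockets(netstat_output, port):
    """Return netstat lines for the given local port that are in a stuck state.

    Assumes the state is the last column and the local address is the
    first column containing ':' or '.' (an assumption, not a guarantee).
    """
    wanted = str(port)
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if not fields or fields[-1] not in STUCK_STATES:
            continue
        # The first address-like field is taken to be the local address.
        local = next((f for f in fields if ":" in f or "." in f), "")
        match = re.search(r"[:.](\d+)$", local)
        if match and match.group(1) == wanted:
            hits.append(line.strip())
    return hits


# Example (hypothetical port 4100); paste the text of "netstat -an" here.
# for entry in stuck_sockets(netstat_text, 4100):
#     print(entry)
```

A non-empty result on the secondary probe host, taken together with the other symptoms described here, is consistent with this problem.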

The secondary probe will also not send events to the ObjectServer(s), even if the primary probe stops working or it loses contact with the primary probe.

The problem usually begins after the secondary probe has been running for a few days, although it can happen at any time.

The key symptom is the absence of P2P messages in the secondary probe log file (they are logged at INFO level). These should be logged every 2 seconds (unless a different -beatinterval has been specified).

Cause

A race condition in an internal library causes the thread responsible for peer-to-peer heartbeat in the secondary probe to hang under certain conditions.

A test fix is available for this problem. Without the test fix, the only resolution is to kill and restart the secondary probe.

Environment

Solaris, Linux and AIX only. Only affects the secondary probe in a peer-to-peer heartbeat configuration.

Diagnosing The Problem

Check the secondary probe log file for P2P messages (the probe must be running at a message level of at least INFO for them to be logged). P2P messages should be logged every 2 seconds unless -beatinterval has been set to a different value. If no P2P messages are seen, it is recommended to obtain the test fix.
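The log check can be scripted. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes that heartbeat messages contain the literal text "P2P", which should be confirmed against a log from a healthy secondary probe. It counts the matching lines and reports how many lines were written after the last match.

```python
def summarize_p2p(lines, marker="P2P"):
    """Return (match_count, lines_since_last_match) for a probe log.

    lines_since_last_match is None when the marker never appears.
    The marker text "P2P" is an assumption about the log wording.
    """
    match_count = 0
    last_index = None
    total = 0
    for index, line in enumerate(lines):
        total = index + 1
        if marker in line:
            match_count += 1
            last_index = index
    if last_index is None:
        return match_count, None
    return match_count, total - last_index - 1


# Example: pass the open log file, which is iterated line by line.
# with open("/path/to/secondary_probe.log", errors="replace") as log:
#     count, since = summarize_p2p(log)
#     print("P2P lines:", count, "lines since last P2P line:", since)
```

A count of zero, or a large number of lines since the last match while the probe is still logging other messages, suggests that the heartbeat thread has stopped.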

Resolving The Problem

Contact Support for a test fix for IV56455.

You will need to provide the output of listIU.sh and uname -a.

Product: Tivoli Netcool/OMNIbus
Component: Not Applicable
Platform: AIX, Linux, Solaris
Version: 7.4.0; 8.1.0
Edition: All Editions
Business Unit: IBM Software w/o TPS
Line of Business: Automation

Document Information

Modified date:
03 January 2023

UID

swg21683079