IBM Support

Probes running in P2P mode

Troubleshooting


Problem

The master probe stops receiving heartbeats from the slave probe. The slave probe stops logging P2P messages and will not begin sending events if the master probe becomes unavailable. The slave probe's memory usage also grows steadily.

Symptom

The master probe works fine; however, the slave probe displays unbounded memory growth. Depending on the incoming event rate (and whether the probe is 32-bit or 64-bit), the memory growth can reach the point where it exhausts all available memory on the host, causing the slave probe to exit.

The master probe log will show an increasing number of outstanding heartbeats and/or failures to connect to the slave probe.

Another symptom is that the TCP port on the slave probe used for peer-to-peer heartbeating gets stuck in a CLOSE_WAIT or SYN_RECV state.
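
The socket-state symptom above can be checked by filtering saved netstat -an output for the heartbeat port. The following is a minimal Python sketch, not an IBM-supplied tool. The function name stuck_sockets and the port number 9999 are hypothetical, and the sketch assumes the connection state is the last column of each line, which should be verified against the netstat output on the host in question. It matches both the host:port form (Linux) and the host.port form (Solaris, AIX), and it matches the port on either endpoint of the connection.

```python
# States that indicate a stuck heartbeat socket. SYN_RECV is the Linux
# spelling; SYN_RCVD is the spelling used by some other platforms.
STUCK_STATES = {"CLOSE_WAIT", "SYN_RECV", "SYN_RCVD"}


def stuck_sockets(netstat_output, port):
    """Return netstat lines for `port` whose last column is a stuck state.

    Assumption: the TCP state is the final whitespace-separated field.
    """
    suffixes = (f":{port}", f".{port}")
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if not fields or fields[-1] not in STUCK_STATES:
            continue
        # Match the port on either the local or the foreign address.
        if any(field.endswith(suffixes) for field in fields[:-1]):
            hits.append(line.strip())
    return hits


# Illustrative input only; addresses and port 9999 are made up.
sample = """\
tcp        0      0 10.0.0.5:9999      10.0.0.6:41234     CLOSE_WAIT
tcp        0      0 10.0.0.5:22        10.0.0.7:50000     ESTABLISHED
10.0.0.5.9999   10.0.0.6.41235  49640 0 49640 0 SYN_RCVD
"""

for line in stuck_sockets(sample, 9999):
    print(line)
```

A socket that stays in one of these states across several checks taken a few minutes apart is more significant than a single sighting, since both states can appear briefly during normal operation.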

The slave probe will also not send events to the ObjectServer(s), even if the master probe stops working or it loses contact with the master probe.

The problem usually begins after the slave probe has been running for a few days (although it can happen at any time).

The key symptom is the absence of P2P messages in the slave probe log file (they are logged at INFO level). These should be logged every 2 seconds (unless a different -beatinterval has been specified).

Cause

A race condition in an internal library causes the thread responsible for peer-to-peer heartbeating in the slave probe to hang under certain conditions.

A test fix is available for this problem. Otherwise, the only resolution is to kill and restart the slave probe.

Environment

Solaris, Linux and AIX only. Only affects the slave probe in a peer-to-peer heartbeating configuration.

Diagnosing The Problem

Check the slave probe log file for P2P messages (the probe must be running with a message level of at least INFO). If no P2P messages are seen (they should be logged every 2 seconds unless -beatinterval has been set otherwise), obtain the test fix.
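
The log check above can be automated roughly as follows. This is a minimal Python sketch under stated assumptions: it assumes that P2P messages contain the literal text "P2P" (the exact message wording is not given here), and the function name and the tail size of 200 lines are hypothetical choices. It only inspects the most recent lines of the log, so it is useful while the probe is still writing other messages. If the probe has stopped logging altogether, old P2P lines may still be in the tail, and the log file's modification time should be checked as well.

```python
from collections import deque


def heartbeat_thread_looks_hung(lines, tail_size=200, marker="P2P"):
    """Return True if the last `tail_size` log lines contain no `marker`.

    Assumption: every P2P heartbeat message includes the text in `marker`.
    An empty log gives False, because there is nothing to judge from.
    """
    tail = deque(lines, maxlen=tail_size)  # keeps only the newest lines
    return len(tail) > 0 and not any(marker in line for line in tail)


# Illustrative log lines only; real probe messages will look different.
healthy_log = [
    "Information: P2P heartbeat exchanged with peer",
    "Debug: event processed",
]
hung_log = ["Debug: event processed"] * 5

print(heartbeat_thread_looks_hung(healthy_log))
print(heartbeat_thread_looks_hung(hung_log))
```

To run this against a real log, open the slave probe log file and pass the file object as lines. Choose tail_size to cover comfortably more than one heartbeat interval at the probe's usual logging rate.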

Resolving The Problem

Contact Support for a test fix for IV56455.

You will need to provide the output of listIU.sh and uname -a.

Product: Tivoli Netcool/OMNIbus
Platform: AIX, Linux, Solaris
Version: 7.4.0; 8.1.0
Edition: All Editions

Document Information

Modified date:
17 June 2018

UID

swg21683079