Probe key performance indicators

Probes can be configured to generate ProbeWatch heartbeat events as a self-monitoring mechanism to help monitor performance, diagnose performance problems, and highlight performance bottlenecks before they affect the system.

Tip: For more information about setting up IBM Tivoli Monitoring agents, see the IBM Tivoli Monitoring information center at http://publib.boulder.ibm.com/infocenter/tivihelp/v15r1/index.jsp

The following KPIs can be established to monitor the health of probes:

Number of events received by a probe in the previous n seconds
The number of events received by the probe in the previous n seconds can be derived from the NumEventsProcessed column of the master.probestats table as a delta from the previous reported value for each probe. Probe throughput generates work for the ObjectServer, so a flood of events from a specific probe should be investigated. A flood might highlight a problem with the probe, the probe rules file, or the devices or applications that are being monitored by that probe. Compare the current value against the previous values for this KPI to identify abnormal behaviour.
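As an illustration only, the delta calculation and a simple comparison against earlier values might be sketched as follows. The sketch assumes that the NumEventsProcessed values for one probe have already been read from master.probestats into a list in report order. The function names, the restart handling, and the threshold factor of 3 are hypothetical choices, not part of the product.

```python
from statistics import mean
from typing import List


def event_deltas(samples: List[int]) -> List[int]:
    """Return the events received between consecutive reports.

    `samples` holds successive NumEventsProcessed values for ONE probe,
    oldest first (assumed to be a cumulative counter).
    """
    deltas = []
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            deltas.append(curr - prev)
        else:
            # Assumption: a lower value means the probe restarted and the
            # counter began again from zero, so the new value is the delta.
            deltas.append(curr)
    return deltas


def is_flood(deltas: List[int], factor: float = 3.0) -> bool:
    """Flag the latest delta if it exceeds `factor` times the mean of the
    earlier deltas. The factor is an illustrative threshold only."""
    if len(deltas) < 2:
        return False  # not enough history to compare against
    *history, latest = deltas
    baseline = mean(history)
    return baseline > 0 and latest > factor * baseline


if __name__ == "__main__":
    counts = [100, 250, 400, 1500]  # made-up sample values
    deltas = event_deltas(counts)
    print(deltas, is_flood(deltas))
```

A fixed multiple of the historical mean is only one possible baseline. A site with strong daily patterns might prefer to compare against the same time window on previous days.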
Probe CPU usage
The CPU usage of the probe is contained in the CPUTimeSec column of the master.probestats table. An IBM Tivoli Monitoring agent installed on the probe computer can also measure the CPU usage of the probe. CPU resources are finite. If the probe process is at maximum CPU, events are queued in the probe until the probe can process them. Consequently, probe input might build up, which can cause delays in processing or, depending on the probe, loss of data. Contributing factors can be the incoming event load and the rules file processing.
Probe memory footprint
The memory footprint of the probe is contained in the ProbeMemory column of the master.probestats table. An IBM Tivoli Monitoring agent installed on the probe computer can also measure the memory usage of the probe. Memory is a finite resource and probe memory should not grow unbounded. Memory usage of a probe process should be relatively stable, although some increase is expected as caches and buffers build.
The memory footprint of a probe increases when the first SIGHUP signal is sent to the probe to instruct it to reread its rules file. This increase is expected because the new rules file is read and parsed before the memory used by the existing rules file is released, which ensures that the probe always has a valid rules file. Subsequent SIGHUP signals to reread the rules file should cause only a comparatively small increase in memory usage. Use of associative arrays might also contribute to increased memory usage of the probe, because the arrays are built up by the events that are processed by the rules file.
The memory footprint of the nco_p_mttrapd probe is distinctive because the probe maintains a large buffer for incoming traps. This buffer can often account for over 50 MB of memory growth as the first 2000 traps are received. After the memory for the trap queue buffer has been allocated, the memory usage should settle down. Any other unexplained, unbounded memory growth should be investigated.
Average time spent processing rules
The average time spent processing the rules file is contained in the AvgRulesFileTime column of the master.probestats table.
Inefficiencies in the rules file can cause delays in event processing. The time spent processing the rules file is one of the major factors that limit the maximum throughput of a probe. For example, if rules file processing takes, on average, 5,000 microseconds (millionths of a second) per event, the probe can process at most 200 events per second (1,000,000 / 5,000 = 200).
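The throughput ceiling described above can be estimated with a short calculation. The following is a minimal sketch that assumes the AvgRulesFileTime value is expressed in microseconds per event, as the example above implies, and that rules file processing is the only limiting factor. The function name is hypothetical.

```python
def max_events_per_second(avg_rules_time_us: float) -> float:
    """Estimate the upper bound on probe throughput from the average
    rules file processing time (assumed to be in microseconds per event)."""
    if avg_rules_time_us <= 0:
        raise ValueError("average rules file time must be positive")
    # One second is 1,000,000 microseconds.
    return 1_000_000 / avg_rules_time_us


if __name__ == "__main__":
    print(max_events_per_second(5000))  # the 5,000 microsecond example above
```

The result is an upper bound only. Real throughput is likely to be lower, because the probe also spends time on work other than rules file processing.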