Monitoring poller capacity

To prevent problems from occurring on pollers, monitor the poller metrics by displaying them as bar charts on the command line. The metrics show when a poller is reaching capacity, for example, when poor database performance is causing it to fall behind.

By default, the metrics are written to a trace in $NCHOME/log/precision every 2 minutes. There is one trace for each poller. The file name of the trace has the format ncp_poller.SnmpPoller.domain.metrics for the default poller and ncp_poller.SnmpPoller.pollername.domain.metrics for all other pollers. For example, the trace for a poller named Poller23507 in the NCOMS domain is ncp_poller.SnmpPoller.Poller23507.NCOMS.metrics. The following table describes the metrics.

Table 1. Poller metrics
Metric | Measures | Measured in (the units of the y-axis of each bar chart)
Health | The percentage of devices that are polled during a policy cycle. If this value is 100%, the poller is working properly. If the value is below 100%, not all the devices are polled during the polling interval. The poller cannot keep up with policy load. | %
Memory | The memory that the poller is using. Memory usage increases as more devices are discovered or more policies are enabled. | MB
BatchQueueSize | The number of batches that are waiting for a thread. | Count
PollDataQueueSize | The number of INSERT statements that are queued to the NCPOLLDATA database. Shows whether the poller is successfully storing polling data. | Count
PollDataRowCount | The insertion rate to the raw poll data table ncpolldata.pollData, expressed as the number of records inserted during one hour. This metric is useful only if historical polling is used. | Count
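
The trace file naming convention described before the table can be sketched as a small shell helper. This is an illustrative sketch only: the function name metrics_trace_path is hypothetical, and it assumes that the NCHOME environment variable is set. It builds the expected path and does not check that the file exists.

```shell
# Hypothetical helper (not part of the product): print the expected
# metrics trace path for a poller.
# Usage: metrics_trace_path DOMAIN [POLLERNAME]
metrics_trace_path() {
    domain=$1
    poller=${2:-}
    # Assumption: NCHOME is set in the environment.
    dir="${NCHOME:?NCHOME is not set}/log/precision"
    if [ -z "$poller" ]; then
        # Default poller: the file name contains no poller name.
        printf '%s/ncp_poller.SnmpPoller.%s.metrics\n' "$dir" "$domain"
    else
        # Any other poller: the poller name precedes the domain.
        printf '%s/ncp_poller.SnmpPoller.%s.%s.metrics\n' "$dir" "$poller" "$domain"
    fi
}
```

For example, ls -l "$(metrics_trace_path NCOMS Poller23507)" would list the trace for Poller23507 in the NCOMS domain, if that trace exists.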

Before you begin

Ensure that the terminal on which you output the bar charts has a minimum width of 140 characters. Otherwise, the bar chart does not display properly because of the line wrapping.
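
The width requirement can be checked before you run the script. The following is a minimal sketch; the function name check_width is hypothetical, and the 140-character minimum is the only value taken from this topic.

```shell
# Hypothetical helper: succeed only if the given width is at least
# 140 columns, the minimum needed to display the bar charts.
# Usage: check_width COLUMNS
check_width() {
    cols=$1
    if [ "$cols" -ge 140 ]; then
        return 0
    fi
    echo "Terminal is $cols columns wide; at least 140 are needed." >&2
    return 1
}
```

On most systems the current width can be passed in with check_width "$(tput cols)".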

Procedure

Run the itnm_poller.pl script as shown in the examples.
The following example shows how to display the charts for the default poller, on the NCOMS domain, from the most recent time stamp, over the default 4-hour period:
ncp_perl itnm_poller.pl -domain NCOMS -metrics
The following example shows how to run the script for the same domain for a specific poller, over the last 12 hours:
ncp_perl itnm_poller.pl -domain NCOMS -poller Poller23507 -metrics -window 12
The following example shows how to run the script for the same domain and poller, from a specific time stamp, over a period of 8 hours:
ncp_perl itnm_poller.pl -domain NCOMS -poller Poller23507 -metrics -timestamp 2013-12-10T17:30:36 -window 8

Do not run the script with the -metrics option and the -status option simultaneously.
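
The three example invocations above follow one pattern, which can be sketched as a helper that assembles the command line. This is an illustration under stated assumptions: the function name poller_metrics_cmd is hypothetical, it uses only the options shown in the examples, and it prints the command instead of running it. Because it only ever adds -metrics, it never combines -metrics with -status.

```shell
# Hypothetical helper: print an itnm_poller.pl metrics command without
# running it. Only options that appear in the examples are used.
# Usage: poller_metrics_cmd DOMAIN [POLLER] [WINDOW_HOURS] [TIMESTAMP]
poller_metrics_cmd() {
    domain=$1
    poller=${2:-}
    window=${3:-}
    timestamp=${4:-}
    cmd="ncp_perl itnm_poller.pl -domain $domain"
    # Omit -poller to address the default poller.
    if [ -n "$poller" ]; then
        cmd="$cmd -poller $poller"
    fi
    cmd="$cmd -metrics"
    # Omit -timestamp to start from the most recent time stamp.
    if [ -n "$timestamp" ]; then
        cmd="$cmd -timestamp $timestamp"
    fi
    # Omit -window to use the default 4-hour period.
    if [ -n "$window" ]; then
        cmd="$cmd -window $window"
    fi
    printf '%s\n' "$cmd"
}
```

For example, poller_metrics_cmd NCOMS Poller23507 12 prints the second command shown above.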

Results

Pay close attention to the scale of the y-axis. A bar chart that appears flat for a long period, for example 24 hours, might show differences in values when the chart is viewed over a shorter period, for example 4 hours.

Example

The following example shows a sample chart for the Health metric.

Health (%) for Policy 'Default Chassis Ping'
PollDef 'My Poll Definition 1' (Type:SNMP Link State)
  A value less than 100% indicates the policy is behind and some devices were not polled during the last polling cycle

    100 ------------------+-----------------------------------------------------------++++--------------------------------
                          |+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++||||++++++++++++++++++++++++++++++++
                          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
                          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
                          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
     50 ------------------||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
                          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
                          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
                          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
                          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
      0 -+++++++++++++++++------------------------------------------------------------------------------------------------
         ^         ^         ^         ^         ^         ^         ^         ^         ^         ^         ^         ^
       13:46     14:06     14:26     14:46     15:06     15:26     15:46     16:06     16:26     16:46     17:06     17:26
                  Time Window (4 hours):  2013-12-08T13:46:25  to  2013-12-08T17:46:25  (Sample interval: 2 minutes)

What to do next

Review the problems and possible causes in the following table and take action as appropriate.

Table 2. Possible causes of problems with poller metrics
Metric Problem Possible cause Actions
Health Value is consistently below 100%. The percentage can fall temporarily below 100% after the poller is started, or if change information is received from the MODEL database.
  • Increase the polling interval by changing the poll policy.
  • Add more pollers.
Memory Memory grows unbounded. The connection to the database was lost. Alternatively, the polling load is too great to sustain, or the rate of data storage is too great to sustain.
  • Contact your database administrator.
  • Add more pollers.
BatchQueueSize The number of batches that are waiting for a thread is greater than 0 and increasing. The number of threads is exhausted, which can indicate that the downstream SNMP dispatcher is close to capacity. Although it is possible to increase the number of threads by setting the BatchExtraThreads property in the NcPollerSchema.cfg file, it is not the best solution. Increasing the number of threads might worsen the problem. Safer solutions are as follows:
  • Add more pollers.
  • Contact your system administrator to investigate adding RAM to the host.
Tip: Set a threshold on the number of batches that are in the queue for processing. You are alerted in the poller log when the threshold is breached. (See table note 1.)
PollDataQueueSize The number of INSERT statements in the queue grows exponentially. The connection to the database was lost or the frequency of INSERT statements is greater than the poller can handle.
  • Contact your database administrator.
  • Add more pollers.
PollDataRowCount The number of rows exceeds the threshold after pruning is completed. The default threshold is 5,000,000 and the default pruning interval is 1 hour. The polling load is too heavy and so the number of rows is too great to be pruned within the pruning interval. Alternatively, problems that occurred in the database are causing problems with pruning. Contact your database administrator.
Table notes:
  1. To set a threshold, change the value of the BatchQueueThreshold property in the $NCHOME/etc/precision/NcPollerSchema.cfg file to a suitable value. For example, to set the threshold to 10 batches of queued polls:
    update config.properties set BatchQueueThreshold = 10;
    When the queue exceeds the specified threshold, a message is written that is similar to the following example:
    2013-04-19T12:37:58:Poller:NCOMS:DataQueueSize:10;
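
Threshold messages of this kind can be picked out of a log with a short filter. The following sketch assumes that real messages match the single sample shown in the table note; the function name threshold_queue_sizes is hypothetical, and the location of the poller log is not given here, so the helper reads from standard input.

```shell
# Hypothetical helper: read log lines on standard input and print the
# queue size from each line that matches the sample threshold message.
# Assumption: messages contain ":DataQueueSize:<number>;" as in the sample.
threshold_queue_sizes() {
    sed -n 's/.*:DataQueueSize:\([0-9][0-9]*\);.*/\1/p'
}
```

For the sample message, the filter prints 10. Lines that do not match the assumed format produce no output.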

If the script displays an error, check in the $NCHOME/etc/precision/NcPollerSchema.cfg file whether the CollectPollerMetrics parameter is disabled. This parameter is enabled by default. If it is disabled, enable it. You can use the OQL interface to enable the parameter at run time. For example:

ncp_oql -domain NCOMS -service SnmpPoller -poller Poller23507 -query "update config.properties set CollectPollerMetrics=1;"