Monitoring poller capacity
To prevent problems from occurring on pollers, monitor the poller metrics by outputting them on the command-line interface. The metrics are displayed as bar charts. The metrics show when a poller is reaching capacity, for example, if poor database performance is causing it to fall behind.
By default, the metrics are written to a trace in $NCHOME/log/precision every 2 minutes. There is one trace for each poller. The file name of the trace has the format ncp_poller.SnmpPoller.domain.metrics for the default poller and ncp_poller.SnmpPoller.pollername.domain.metrics for all other pollers, for example, ncp_poller.SnmpPoller.Poller23507.NCOMS.metrics. The following table describes the metrics.
Metric | Measures | Measured in (the units of the y-axis of each bar chart) |
---|---|---|
Health | The percentage of devices that are polled during a policy cycle. If this value is 100%, the poller is working properly. If the value is below 100%, not all the devices are polled during the polling interval. The poller cannot keep up with policy load. | % |
Memory | The memory that the poller is using. Memory usage increases as more devices are discovered or more policies are enabled. | MB |
BatchQueueSize | The number of batches that are waiting for a thread. | Count |
PollDataQueueSize | The number of INSERT statements that are queued to the NCPOLLDATA database. Shows whether the poller is successfully storing polling data. | Count |
PollDataRowCount | The insertion rate to the raw poll data table ncpolldata.pollData, expressed as the number of records inserted during one hour. This metric is useful only if historical polling is used. | Count |
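The trace file naming convention described before the table can be sketched in a few lines of Python. This is a minimal illustration, not part of the product: the function names and the /opt/nchome directory are hypothetical placeholders, and only the file name pattern and the log/precision location come from the text above.

```python
from pathlib import PurePosixPath
from typing import Optional


def metrics_trace_name(domain: str, poller: Optional[str] = None) -> str:
    """Return the metrics trace file name for a poller in a domain.

    Pass poller=None for the default poller, whose file name has no
    poller name component.
    """
    parts = ["ncp_poller", "SnmpPoller"]
    if poller is not None:
        parts.append(poller)
    parts += [domain, "metrics"]
    return ".".join(parts)


def metrics_trace_path(nchome: str, domain: str,
                       poller: Optional[str] = None) -> PurePosixPath:
    """Join the trace name onto the log/precision directory under nchome."""
    return PurePosixPath(nchome) / "log" / "precision" / metrics_trace_name(domain, poller)


# "/opt/nchome" is a hypothetical stand-in for the value of $NCHOME.
print(metrics_trace_name("NCOMS"))
print(metrics_trace_path("/opt/nchome", "NCOMS", "Poller23507"))
```

A helper like this can be useful when a script needs to locate the trace for every poller in a domain without hard-coding each file name.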
Before you begin
Procedure
ncp_perl itnm_poller.pl -domain NCOMS -metrics
The following example shows how to run the script for the same domain for a specific poller, over the last 12 hours:
ncp_perl itnm_poller.pl -domain NCOMS -poller Poller23507 -metrics -window 12
The following example shows how to run the script for the same domain and poller, from a specific time stamp, over a period of 8 hours:
ncp_perl itnm_poller.pl -domain NCOMS -poller Poller23507 -metrics -timestamp 2013-12-10T17:30:36 -window 8
Do not run the script with the -metrics option and the -status option simultaneously.
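If the script is called from automation, the command variants above can be assembled programmatically. The following Python sketch is an illustration only: the function names are hypothetical, and it uses only the options that appear in the examples above (-domain, -poller, -metrics, -window, -timestamp) plus a check for the -metrics and -status conflict.

```python
import shlex
from typing import List, Optional


def build_metrics_command(domain: str,
                          poller: Optional[str] = None,
                          window_hours: Optional[int] = None,
                          timestamp: Optional[str] = None) -> List[str]:
    """Build the argument list for an itnm_poller.pl -metrics run."""
    argv = ["ncp_perl", "itnm_poller.pl", "-domain", domain]
    if poller is not None:
        argv += ["-poller", poller]
    argv.append("-metrics")
    if timestamp is not None:
        argv += ["-timestamp", timestamp]
    if window_hours is not None:
        argv += ["-window", str(window_hours)]
    return argv


def has_conflicting_options(argv: List[str]) -> bool:
    """Return True if both -metrics and -status are present."""
    return "-metrics" in argv and "-status" in argv


cmd = build_metrics_command("NCOMS", poller="Poller23507",
                            timestamp="2013-12-10T17:30:36", window_hours=8)
assert not has_conflicting_options(cmd)
print(shlex.join(cmd))
```

Building an argument list rather than a single string avoids quoting problems if the list is later passed to subprocess.run.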
Results
Example
The following example shows a sample chart for the Health metric.
Health (%) for Policy 'Default Chassis Ping'
PollDef 'My Poll Definition 1' (Type:SNMP Link State)
A value less than 100% indicates the policy is behind and some devices
were not polled during the last polling cycle
100 ------------------+-----------------------------------------------------------++++--------------------------------
|+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++||||++++++++++++++++++++++++++++++++
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
50 ------------------||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
0 -+++++++++++++++++------------------------------------------------------------------------------------------------
^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
13:46 14:06 14:26 14:46 15:06 15:26 15:46 16:06 16:26 16:46 17:06 17:26
Time Window (4 hours): 2013-12-08T13:46:25 to 2013-12-08T17:46:25 (Sample interval: 2 minutes)
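As a rough cross-check of the chart footer, the number of whole sample intervals in a time window can be computed from the two time stamps and the sample interval. The function name below is hypothetical. The sketch assumes the time stamp format shown in the footer and counts intervals only, so it does not say how the script itself places samples on the chart.

```python
from datetime import datetime, timedelta


def whole_sample_intervals(start: str, end: str, interval_minutes: int = 2) -> int:
    """Count the whole sample intervals that fit between two time stamps."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    window = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    # Dividing one timedelta by another with // yields an integer.
    return window // timedelta(minutes=interval_minutes)


# The 4-hour window from the sample chart, at the default 2-minute interval.
print(whole_sample_intervals("2013-12-08T13:46:25", "2013-12-08T17:46:25"))
```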
What to do next
Review the problems and possible causes in the following table and take action as appropriate.
Metric | Problem | Possible cause | Actions |
---|---|---|---|
Health | Value is consistently below 100%. | The percentage can fall temporarily below 100% after the poller is started, or if change information is received from the MODEL database. | |
Memory | Memory grows unbounded. | The connection to the database was lost. Alternatively, the polling load is too great to sustain, or the rate of data storage is too great to sustain. | |
BatchQueueSize | The number of batches that are waiting for a thread is greater than 0 and increasing. | The number of threads is exhausted, which can indicate that the downstream SNMP dispatcher is close to capacity. | Although it is possible to increase the number of threads by setting the BatchExtraThreads property in the NcPollerSchema.cfg file, it is not the best solution. It is possible that increasing the number of threads worsens the problem. Safer solutions are as follows: Tip: Set a threshold on the number of batches that are in the queue for processing. You are alerted in the poller log when the threshold is breached. (See table note 1.) |
PollDataQueueSize | The number of INSERT statements in the queue grows exponentially. | The connection to the database was lost or the frequency of INSERT statements is greater than the poller can handle. | |
PollDataRowCount | The number of rows exceeds the threshold after pruning is completed. The default threshold is 5,000,000 and the default pruning interval is 1 hour. | The polling load is too heavy and so the number of rows is too great to be pruned within the pruning interval. Alternatively, problems in the database are interfering with pruning. | Contact your database administrator. |
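Two of the problem conditions in the table, a Health value that stays below 100% and a batch queue that is greater than 0 and increasing, can be checked mechanically once the metric samples are available as lists of numbers. The following Python sketch is only an illustration. The function names and the five-sample window are assumptions, and how the samples are extracted from the trace is not covered here.

```python
from typing import Sequence


def consistently_below_full(health: Sequence[float], min_samples: int = 5) -> bool:
    """Return True if the last min_samples Health readings are all below 100%."""
    recent = health[-min_samples:]
    return len(recent) == min_samples and all(v < 100.0 for v in recent)


def queue_backing_up(sizes: Sequence[int], min_samples: int = 5) -> bool:
    """Return True if the last min_samples queue sizes are all above 0,
    never decrease, and end higher than they started."""
    recent = sizes[-min_samples:]
    if len(recent) < min_samples:
        return False
    all_positive = all(v > 0 for v in recent)
    non_decreasing = all(b >= a for a, b in zip(recent, recent[1:]))
    return all_positive and non_decreasing and recent[-1] > recent[0]


# A brief dip after a restart is not flagged; a sustained shortfall is.
print(consistently_below_full([100, 92, 100, 100, 100, 100]))
print(consistently_below_full([100, 97, 95, 96, 94, 93]))
print(queue_backing_up([0, 1, 2, 4, 7, 11]))
```

Requiring several consecutive samples reflects the table's point that Health can fall below 100% temporarily without indicating a capacity problem.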
Table notes:
If an error is displayed, check in the $NCHOME/etc/precision/NcPollerSchema.cfg file whether the CollectPollerMetrics parameter is disabled. This parameter is enabled by default, but, if it is disabled, enable it. You can use the OQL interface to enable the parameter at run time. For example:
ncp_oql -domain NCOMS -service SnmpPoller -poller Poller23507 -query "update config.properties set CollectPollerMetrics=1;"
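When the OQL command above is issued from a script, building it as an argument list keeps the query intact as a single argument regardless of shell quoting. The following Python sketch mirrors only the options shown in the example. The function name is hypothetical, and any additional options that your environment requires (for example, credentials) are not covered.

```python
import shlex
from typing import List


def build_enable_metrics_command(domain: str, poller: str) -> List[str]:
    """Build the ncp_oql argument list that sets CollectPollerMetrics to 1."""
    query = "update config.properties set CollectPollerMetrics=1;"
    return ["ncp_oql", "-domain", domain, "-service", "SnmpPoller",
            "-poller", poller, "-query", query]


argv = build_enable_metrics_command("NCOMS", "Poller23507")
# shlex.join renders the list as a shell-safe command line for logging.
print(shlex.join(argv))
```

The list can be passed directly to subprocess.run, in which case no shell quoting is involved at all.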