Self-monitoring thresholds

The Tivoli Netcool/OMNIbus self-monitoring threshold values are stored in the master.sm_thresholds table. You can modify the threshold values with the SQL Interactive Interface (nco_sql or isql) or with the Netcool/OMNIbus Administrator (nco_config).
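
As a rough illustration of an edit made through the SQL interactive interface, the sketch below builds an UPDATE statement for one row of master.sm_thresholds. The column names ThresholdName, Sev3, Sev4, and Sev5 are assumptions that are not stated on this page, and the replacement values are hypothetical. Check the actual table definition in your ObjectServer before you run anything like this.

```python
def build_threshold_update(name, sev3, sev4, sev5,
                           name_col="ThresholdName",
                           sev_cols=("Sev3", "Sev4", "Sev5")):
    """Return an UPDATE statement for one row of master.sm_thresholds.

    name_col and sev_cols are ASSUMED column names; adjust them to match
    the real table definition.
    """
    values = [int(sev3), int(sev4), int(sev5)]   # reject non-numeric input early
    safe_name = name.replace("'", "''")          # escape single quotes in the name
    assignments = ", ".join(f"{col} = {val}" for col, val in zip(sev_cols, values))
    return (f"update master.sm_thresholds set {assignments} "
            f"where {name_col} = '{safe_name}';")


# Hypothetical example: lower the journal count thresholds.
statement = build_threshold_update("sm_db_stats_journal_count", 10000, 20000, 30000)
print(statement)
```

The helper only produces text. The resulting statement would still have to be reviewed and run in nco_sql, or the same values entered through nco_config.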

The default threshold values are calculated to align with best practice for ObjectServer performance. However, you might need to alter the thresholds under the following conditions:
  • Local policy might specify a lower limit than the default value for the total number of journals that can be present at one time.
  • Load testing with a set of triggers and connected clients might reveal that a server can handle fewer resident events than the default allows. In this case, set the Sev 3 threshold to a lower value so that the first alert event is generated at a lower total count.

The following table provides a list of the self-monitoring thresholds, a description for each one, and the default threshold values for the Severity of the resulting alert event. Alert events are generated when the lowest Severity threshold (Severity 3) is breached. When an alert event is generated, the Severity of that alert event is determined by the value of the metric being monitored.

The Severity values are the threshold points. If the metric value is greater than or equal to one or more threshold values, the alert event takes the highest Severity whose threshold is met. For example, with the default sm_client_time_individual thresholds of 30, 40, and 50, a metric value of 29 generates no alert event, a value of 35 generates a Severity 3 alert event, and a value of 51 generates a Severity 5 alert event.
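
The comparison rule described above can be sketched in a few lines of Python. This is an illustrative model only, not the product's implementation, and the function name severity_for is hypothetical.

```python
from typing import Optional


def severity_for(metric: float, sev3: float, sev4: float, sev5: float) -> Optional[int]:
    """Return the alert Severity for a metric value, or None if no alert is due."""
    # Check the highest Severity first so that the highest threshold met wins.
    for severity, threshold in ((5, sev5), (4, sev4), (3, sev3)):
        if metric >= threshold:
            return severity
    return None


# Default sm_client_time_individual thresholds: 30, 40, 50
for value in (29, 35, 51):
    print(value, severity_for(value, 30, 40, 50))
# prints:
# 29 None
# 35 3
# 51 5
```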

Table 1. Self-monitoring thresholds

Threshold Name | Description | Thresholds (Sev 3 | Sev 4 | Sev 5)

sm_client_time_individual

The threshold for the time that is consumed by any individual connected client within a single ObjectServer granularity period. Any client that consumes 30 seconds or more of ObjectServer time within any one granularity period generates an alert event.

Sev 3: 30 | Sev 4: 40 | Sev 5: 50

sm_client_time_total

The threshold for the time that is consumed by all connected clients within a single ObjectServer granularity period. If all connected clients collectively consume 40 seconds or more of ObjectServer time within any one granularity period, an alert event is generated.

Sev 3: 40 | Sev 4: 60 | Sev 5: 90

sm_connections

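The threshold for the number of remaining available connections. If there are 50 or fewer available connections in the ObjectServer, an alert event is generated. Unlike the other thresholds, these values decrease as Severity increases, because fewer remaining connections is the more serious condition.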

Sev 3: 50 | Sev 4: 30 | Sev 5: 10

sm_db_stats_details_count

The threshold for the number of rows in the alerts.details table. If there are 3000 or more rows present, an alert event is generated.

Sev 3: 3000 | Sev 4: 10000 | Sev 5: 20000

sm_db_stats_event_count

The threshold for the number of rows in the alerts.status table. If there are 80000 or more rows present, an alert event is generated.

Sev 3: 80000 | Sev 4: 90000 | Sev 5: 100000

sm_db_stats_journal_count

The threshold for the number of rows in the alerts.journal table. If there are 20000 or more rows present, an alert event is generated.

Sev 3: 20000 | Sev 4: 35000 | Sev 5: 50000

sm_dbops_stats_details_inserts

The threshold for the number of inserts into the alerts.details table in each five-minute ObjectServer statistics period. If 10000 or more rows are inserted in a given five-minute period, a Severity 3 alert event is generated.

This metric never generates a Severity 4 (Major) or Severity 5 (Critical) event because it is a less critical performance indicator than, for example, the amount of time that triggers take to run. However, if other indicators are also raising alerts, the alert that is created by this threshold provides valuable forensic information for understanding the overall situation.

Sev 3: 10000 | Sev 4: 0 | Sev 5: 0

sm_dbops_stats_journal_inserts

The threshold for the number of inserts into the alerts.journal table in each five-minute ObjectServer statistics period. If 10000 or more rows are inserted in a given five-minute period, a Severity 3 alert event is generated.

This metric never generates a Severity 4 (Major) or Severity 5 (Critical) event because it is a less critical performance indicator than, for example, the amount of time that triggers take to run. However, if other indicators are also raising alerts, the alert that is created by this threshold provides valuable forensic information for understanding the overall situation.

Sev 3: 10000 | Sev 4: 0 | Sev 5: 0

sm_dbops_stats_status_inserts

The threshold for the number of inserts into the alerts.status table in each five-minute ObjectServer statistics period. If 10000 or more rows are inserted in a given five-minute period, a Severity 3 alert event is generated.

Note: This metric includes both inserts and reinserts.

This metric never generates a Severity 4 (Major) or Severity 5 (Critical) event because it is a less critical performance indicator than, for example, the amount of time that triggers take to run. However, if other indicators are also raising alerts, the alert that is created by this threshold provides valuable forensic information for understanding the overall situation.

Sev 3: 10000 | Sev 4: 0 | Sev 5: 0

sm_memstore

The threshold for the percentage of the soft limit memory allocation that is in use by the ObjectServer. If 75% or more of the soft limit is in use, an alert event is generated. If more than 95% of the soft limit is in use, the alert event Summary recommends increasing the soft limit.

Sev 3: 75 | Sev 4: 85 | Sev 5: 95

sm_time_to_display

The threshold for the average amount of time that is taken for events to get to the Display layer ObjectServers. If events are taking 60 seconds or more to propagate to the Display layer ObjectServers, an alert event is generated.

If a Display layer is not present in your deployment, this metric is not monitored.

Sev 3: 60 | Sev 4: 120 | Sev 5: 180

sm_top_classes

The threshold for the number of events that are received into the Aggregation layer, per Class, in each five-minute ObjectServer statistics period. If 600 or more events are received for any one Class in a given five-minute period, an alert event is generated.

This threshold takes into account all events that are received for a Class from both the Aggregation layer and the Collection layer, where present.

Sev 3: 600 | Sev 4: 800 | Sev 5: 1000

sm_top_nodes

The threshold for the number of events that are received into the Aggregation layer, per Node, in each five-minute ObjectServer statistics period. If 100 or more events are received for any one Node in a given five-minute period, an alert event is generated.

If only a Sev 5 alert event is needed, at a threshold of 500, and no Sev 3 or Sev 4 alert events are required, set all three thresholds (Sev 3, Sev 4, and Sev 5) to 500.

If a Sev 5 alert event is needed at a threshold of 500 and a Sev 4 alert event is needed at a threshold of 250, but no Sev 3 alert event is required, set the Sev 3 threshold to 250 as well.

This threshold takes into account all events that are received for a Node at both the Aggregation layer and the Collection layer, where present.

Sev 3: 100 | Sev 4: 200 | Sev 5: 500

sm_top_probes

The threshold for the number of events that are received into the Aggregation or Collection layer, per probe, in each five-minute ObjectServer statistics period. If 600 or more events are received from any one probe in a given five-minute period, an alert event is generated.

This threshold takes into account all probes that are connected to the Aggregation layer and the Collection layer, where present.

Sev 3: 600 | Sev 4: 800 | Sev 5: 1000

sm_triggers_individual

The threshold for the time that is consumed by any individual trigger within a single ObjectServer granularity period. Any trigger that takes 20 seconds or more to run within any one granularity period generates an alert event.

Sev 3: 20 | Sev 4: 25 | Sev 5: 30

sm_triggers_total

The threshold for the time that is consumed by all triggers collectively within a single ObjectServer granularity period. If the total trigger time is 50 seconds or more within a granularity period, an alert event is generated.

Sev 3: 50 | Sev 4: 70 | Sev 5: 90

sm_triggers_reporting_period

The threshold for the profiler reporting period. Under normal load, the reporting period stays within a fraction of a second of the granularity period (the default granularity is 60 seconds). If the reporting period is 61 seconds or more, an alert event is generated.

If the reporting period begins to increase, this indicates that the ObjectServer is overloaded. This can happen for several reasons, including a poorly performing trigger or excessive client interaction time.

If the reporting period is regularly greater than the granularity period, users might see delays in event presentation. Along with other alerts that are likely to be generated in this case, this information is useful to administrators when determining the cause of the overload.

Sev 3: 61 | Sev 4: 70 | Sev 5: 90
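
Two conventions from the table can be checked with a small Python model: setting equal thresholds to suppress the lower Severities (as described for sm_top_nodes), and the reversed direction of sm_connections, where fewer remaining connections is worse. This is a sketch under assumptions, not the product's logic. The function name severity_for and the lower_is_worse flag are hypothetical, and the "less than or equal" comparison for sm_connections is inferred from the "50 or fewer" wording and the descending default values.

```python
from typing import Optional


def severity_for(metric: float, sev3: float, sev4: float, sev5: float,
                 lower_is_worse: bool = False) -> Optional[int]:
    """Return the alert Severity for a metric value, or None if no alert is due."""
    # Check the highest Severity first so that the most severe match wins.
    for severity, threshold in ((5, sev5), (4, sev4), (3, sev3)):
        breached = metric <= threshold if lower_is_worse else metric >= threshold
        if breached:
            return severity
    return None


# sm_top_nodes with all thresholds at 500: only Sev 5 alerts are possible.
print(severity_for(499, 500, 500, 500))   # None
print(severity_for(500, 500, 500, 500))   # 5

# Sev 3 and Sev 4 both at 250, Sev 5 at 500: no Sev 3 alerts are possible.
print(severity_for(249, 250, 250, 500))   # None
print(severity_for(300, 250, 250, 500))   # 4

# sm_connections defaults (50, 30, 10): lower values are more severe.
print(severity_for(60, 50, 30, 10, lower_is_worse=True))   # None
print(severity_for(25, 50, 30, 10, lower_is_worse=True))   # 4
print(severity_for(10, 50, 30, 10, lower_is_worse=True))   # 5
```

Because the highest Severity is tested first, a value that meets several equal thresholds is always reported at the highest of them, which is why duplicating a threshold value removes the lower Severity.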