Threshold monitoring for system health

Threshold monitoring pre-requisites
If you did not use the IBM Spectrum Scale™ installation toolkit, or if you disabled the performance monitoring installation during system setup (./spectrumscale config perfmon -r off), make sure that your system meets the following configuration requirements:
  • IBM Spectrum Scale version 4.2.2 or later (on all nodes).
  • PMSensors and PMCollectors must be on version 4.2.2 or later.
  • CCR must be enabled on the cluster.
  • The GPFSPool and GPFSFileset sensors are enabled automatically when all of the above requirements are met.
The available capacity of a filesystem depends on the fullness of its fileset-inode spaces, capacity usage, and memory utilization in each data or metadata pool. Therefore, the predefined capacity threshold limit for a filesystem is broken down into the following threshold rules:
  • Fileset-inode spaces
  • Data pool capacity
  • Metadata pool capacity
  • Memory free utilization

The violation of any rule results in the parent filesystem receiving a capacity issue notification. The pmsensors such as GPFSPool and GPFSFileset are activated automatically and bound to the first collector node, and they track the inode and pool space usage of the filesystem. For more information on pmsensors, see Configuring the performance monitoring tool. For a new filesystem, this process can be slow; restarting the sensors on the first collector node can speed it up.

For the capacity utilization rules, the warn level is set to 80%, and the error level to 90%. For the memory utilization rule, the warn level is set to 100 MB, and the error level to 50 MB. An internal monitor process frequently compares the metric values with the rule boundaries. As soon as one of the metric values violates its threshold limit, the system health daemon receives an event notification from the monitoring process, generates a log event, and updates the health status of the filesystem that has the capacity problem.
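
The default levels above can be illustrated with a minimal Python sketch. This is not the product's monitoring code: the function names, the Level enum, and the choice to treat a value exactly on a boundary as a violation are all assumptions made for illustration. Note that the capacity rules fire when usage rises, while the memory free rule fires when the free amount falls.

```python
from enum import Enum


class Level(Enum):
    """Hypothetical severity levels, ordered from best to worst."""
    HEALTHY = 0
    WARN = 1
    ERROR = 2


def classify_usage(percent_used, warn=80.0, error=90.0):
    """Capacity rule: higher utilization is worse (defaults 80% / 90%).

    Assumption: a value exactly on a boundary counts as a violation.
    """
    if percent_used >= error:
        return Level.ERROR
    if percent_used >= warn:
        return Level.WARN
    return Level.HEALTHY


def classify_free_memory(free_mb, warn=100.0, error=50.0):
    """Memory free rule: lower free memory is worse (defaults 100 MB / 50 MB).

    Assumption: a value exactly on a boundary counts as a violation.
    """
    if free_mb <= error:
        return Level.ERROR
    if free_mb <= warn:
        return Level.WARN
    return Level.HEALTHY


def worst_level(levels):
    """Violation of any rule degrades the parent: report the worst level."""
    return max(levels, key=lambda level: level.value, default=Level.HEALTHY)


if __name__ == "__main__":
    # Example readings for one filesystem (made-up numbers).
    results = [
        classify_usage(72.0),   # fileset inode space
        classify_usage(85.5),   # data pool
        classify_usage(40.0),   # metadata pool
    ]
    print(worst_level(results).name)
```

In this sketch the parent state is simply the worst result across all rules, which mirrors the statement that a violation of any rule produces a capacity issue notification for the parent filesystem.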

Threshold monitoring known limitations
The filesystem health status might not be updated in the following situations:
  1. The pool or fileset capacity utilization returns from the error range to the warn range.
  2. Pools or inode spaces (independent filesets) have been removed. Workaround: the status is updated automatically with the next restart of the monitoring component on the collector node.
  3. If multiple threshold rules have overlapping entities in their filter scope for the same metric, the system invokes the metric value evaluation with different threshold boundaries in parallel and updates the entire state concurrently.
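
To make the overlap case in the last item concrete, the following Python sketch flags pairs of rules that watch the same metric and share at least one entity. The rule representation (a dict with name, metric, and entities keys) and all rule, metric, and entity names are hypothetical; this is a way to reason about rule definitions, not a product interface.

```python
from itertools import combinations


def overlapping_rules(rules):
    """Return (rule_a, rule_b, shared_entities) for every conflicting pair.

    Two rules conflict when they evaluate the same metric and their
    filter scopes have at least one entity in common.
    """
    conflicts = []
    for a, b in combinations(rules, 2):
        if a["metric"] != b["metric"]:
            continue  # different metrics never compete for the same state
        shared = a["entities"] & b["entities"]
        if shared:
            conflicts.append((a["name"], b["name"], sorted(shared)))
    return conflicts


if __name__ == "__main__":
    # Hypothetical rules: both cover pool "fs1/data" for the same metric.
    rules = [
        {"name": "rule_default", "metric": "pool_usage",
         "entities": {"fs1/data", "fs1/system"}},
        {"name": "rule_custom", "metric": "pool_usage",
         "entities": {"fs1/data"}},
        {"name": "rule_inodes", "metric": "inode_usage",
         "entities": {"fs1/data"}},
    ]
    for first, second, shared in overlapping_rules(rules):
        print(first, second, shared)
```

Keeping the filter scopes of same-metric rules disjoint avoids the situation in which two evaluations with different boundaries update the same state concurrently.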

New features for threshold monitoring

Starting with version 4.2.3, the predefined threshold rules are extended with a new threshold rule that monitors "memory free" utilization on cluster nodes. IBM Spectrum Scale users can also delete or add any or all of the existing threshold rules.