System self-monitoring

System health is evaluated by the Policy Engine, which runs on the Hub. The evaluation is based on a set of rules called policies.

Each policy searches the data collected in the Monitoring Data Cache for a particular issue and, when it finds one, performs issue recovery.

Policies are also responsible for raising and closing alerts via the Alerts Engine. Policies can be stateful, that is, they can keep local data between evaluations. This data can be used, for example, to track in-progress actions requested by a policy or to store information about alerts that were sent.

Each policy has access to the complete monitoring data. Policies run only when the Platform Manager is in the ACTIVE state.
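
The evaluation model described above can be illustrated with a minimal Python sketch. It is not the product's implementation: the names Policy, evaluate_policies, and ntpd_check, the shape of the monitoring data, and the action strings are all hypothetical. It only shows policies being skipped unless the manager state is ACTIVE, and a policy using local state to avoid re-requesting an action that is already in progress.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Policy:
    """Hypothetical policy: a check function plus local state kept between evaluations."""
    name: str
    check: Callable[[dict, dict], List[str]]  # (monitoring_data, state) -> requested actions
    state: dict = field(default_factory=dict)


def evaluate_policies(manager_state: str, monitoring_data: dict,
                      policies: List[Policy]) -> Dict[str, List[str]]:
    """Run every policy once; do nothing unless the manager state is ACTIVE."""
    if manager_state != "ACTIVE":
        return {}
    return {p.name: p.check(monitoring_data, p.state) for p in policies}


def ntpd_check(data: dict, state: dict) -> List[str]:
    """Assumed example check: request an ntpd start once per node while it is down."""
    in_progress = state.setdefault("in_progress", set())
    actions = []
    for node in sorted(data):
        if data[node].get("ntpd_running"):
            # Issue cleared: forget the in-progress marker for this node.
            in_progress.discard(node)
        elif node not in in_progress:
            # New issue: remember the request so it is not repeated next time.
            in_progress.add(node)
            actions.append(f"start ntpd on {node}")
    return actions
```

Because the in-progress set lives in the policy's own state, a second evaluation over the same monitoring data returns no new action for that policy until the node has recovered and failed again.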

A policy can:
  • Request a resource manager action or call an action script on a local or remote node.
  • Raise or close an alert.
  • Change system state, for instance, disable a node.

The following policies are responsible for self-healing:
  • Power cycle unreachable servers
  • Start core container of enabled node when it is not started
  • Start docker service when it is not running on enabled node
  • Start floating IPs on the Master node when they are not started
  • Start platform management service on enabled node
  • Activate platform management service on a node on active system
  • Deactivate a node when it is reported as Failed by resmgr
  • Start ntpd service if it is not running
  • Trigger time synchronization when the time is not synchronized
  • Trigger FSN battery reconditioning when it is needed
  • Start Call Home Daemon on the Hub node when it is not running
  • Start GPFS cluster when it is not running
  • Start GPFS node when it is not started
  • Mount GPFS file system on a node when it is not mounted
  • Start GPFS NSD when it is not started
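
As an illustration only, one of the policies listed above ("Start docker service when it is not running on enabled node") could be sketched in Python as follows. The class name, the event tuples, and the "enabled" and "docker_running" keys are assumptions made for this example and do not reflect the actual Policy Engine or Alerts Engine interfaces. The sketch shows a recovery request combined with an alert that is raised once and closed when the issue clears.

```python
from typing import Dict, List, Set, Tuple

Event = Tuple[str, str]  # (kind, node), for example ("request_action", "node1")


class StartDockerPolicy:
    """Hypothetical sketch of a self-healing policy with alert tracking."""

    def __init__(self) -> None:
        # Local state kept between evaluations: nodes that have an open alert.
        self.alerted: Set[str] = set()

    def evaluate(self, nodes: Dict[str, Dict[str, bool]]) -> List[Event]:
        events: List[Event] = []
        for name in sorted(nodes):
            info = nodes[name]
            # Only enabled nodes are expected to run the docker service.
            broken = info.get("enabled", False) and not info.get("docker_running", False)
            if broken:
                # Request recovery on every evaluation while the issue persists.
                events.append(("request_action", name))
                if name not in self.alerted:
                    # Raise the alert only once per occurrence of the issue.
                    self.alerted.add(name)
                    events.append(("raise_alert", name))
            elif name in self.alerted:
                # Issue cleared (or node disabled): close the open alert.
                self.alerted.remove(name)
                events.append(("close_alert", name))
        return events
```

In this sketch a disabled node never triggers recovery, and the stored set of alerted nodes is what keeps the same alert from being raised again on every evaluation.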