System self-monitoring

System health evaluation is performed by the Policy Engine, which runs on the Hub. The evaluation is based on a set of rules called policies.

Each policy searches the data collected in the Monitoring Data Cache for one particular issue and, when it finds that issue, performs the recovery for it.

Policies are also responsible for raising and closing alerts via the Alerts Engine. Policies can be stateful, that is, they can keep local data between evaluations. This data can be used to track in-progress actions requested by the policy, to store information about alerts that were already sent, and so on.
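As an illustration of a stateful policy, the following is a minimal Python sketch. It is not the product's implementation: the class name NtpdPolicy, the evaluate method, the shape of the monitoring data, and the operation names are all assumptions made for this example. The sketch only shows how local state kept between evaluations can prevent a recovery action from being requested twice and can record which alerts are open.

```python
from dataclasses import dataclass, field


@dataclass
class NtpdPolicy:
    """Hypothetical policy: restart ntpd on nodes where it is not running."""

    # Local state kept between evaluations (all names are illustrative).
    pending_restarts: set = field(default_factory=set)
    open_alerts: set = field(default_factory=set)

    def evaluate(self, monitoring_data: dict) -> list:
        """Return (operation, node) tuples for one evaluation pass."""
        operations = []
        for node, status in sorted(monitoring_data.items()):
            if status["ntpd_running"]:
                # Issue is gone: forget the pending action, close the alert.
                self.pending_restarts.discard(node)
                if node in self.open_alerts:
                    self.open_alerts.remove(node)
                    operations.append(("close_alert", node))
            elif node not in self.pending_restarts:
                # Issue seen for the first time: request recovery once.
                self.pending_restarts.add(node)
                self.open_alerts.add(node)
                operations.append(("restart_ntpd", node))
                operations.append(("raise_alert", node))
        return operations


policy = NtpdPolicy()
first = policy.evaluate({"node1": {"ntpd_running": False}})
second = policy.evaluate({"node1": {"ntpd_running": False}})
third = policy.evaluate({"node1": {"ntpd_running": True}})
print(first, second, third)
```

In this sketch the second evaluation returns no operations because the restart is already tracked as in progress, and the third evaluation closes the alert once the service is reported as running.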

Each policy has access to the complete monitoring data. Policies run only when the Platform Manager is in the ACTIVE state.
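The gating described above can be sketched in Python as follows. This is an assumption-laden illustration, not the Policy Engine's actual code: PlatformState, run_evaluation, docker_policy, and the NOT_ACTIVE member are hypothetical names, and only the ACTIVE state comes from the text. The sketch passes the same complete monitoring data to every policy and skips evaluation entirely when the state is not ACTIVE.

```python
from enum import Enum
from typing import Callable, Dict, List


class PlatformState(Enum):
    # Only ACTIVE is named in the text; NOT_ACTIVE is a placeholder.
    ACTIVE = "ACTIVE"
    NOT_ACTIVE = "NOT_ACTIVE"


Policy = Callable[[dict], List[str]]


def run_evaluation(
    state: PlatformState,
    policies: Dict[str, Policy],
    monitoring_data: dict,
) -> Dict[str, List[str]]:
    """Run every policy on the full monitoring data, but only when ACTIVE."""
    if state is not PlatformState.ACTIVE:
        return {}
    return {name: policy(monitoring_data) for name, policy in policies.items()}


def docker_policy(data: dict) -> List[str]:
    """Hypothetical policy: find enabled nodes where docker is not running."""
    return [
        f"start docker on {node}"
        for node, status in sorted(data.items())
        if status["enabled"] and not status["docker_running"]
    ]


data = {
    "node1": {"enabled": True, "docker_running": False},
    "node2": {"enabled": True, "docker_running": True},
}
policies = {"docker": docker_policy}
active_result = run_evaluation(PlatformState.ACTIVE, policies, data)
inactive_result = run_evaluation(PlatformState.NOT_ACTIVE, policies, data)
print(active_result, inactive_result)
```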

A policy can:

Request a resource manager action or call an action script on a local or remote node.
Raise or close an alert.
Change system state, for instance, disable a node.
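One way to picture the three kinds of operations listed above is as three small record types. The Python sketch below is purely illustrative: RunAction, Alert, SetNodeEnabled, and describe are hypothetical names and do not reflect the product's interfaces.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class RunAction:
    """Resource manager action or action script on a local or remote node."""
    node: str
    action: str


@dataclass(frozen=True)
class Alert:
    """Request to raise (raise_it=True) or close (raise_it=False) an alert."""
    name: str
    raise_it: bool


@dataclass(frozen=True)
class SetNodeEnabled:
    """System state change, for instance disabling a node."""
    node: str
    enabled: bool


Operation = Union[RunAction, Alert, SetNodeEnabled]


def describe(op: Operation) -> str:
    """Render an operation as text; a real engine would dispatch it instead."""
    if isinstance(op, RunAction):
        return f"run '{op.action}' on {op.node}"
    if isinstance(op, Alert):
        verb = "raise" if op.raise_it else "close"
        return f"{verb} alert '{op.name}'"
    if isinstance(op, SetNodeEnabled):
        verb = "enable" if op.enabled else "disable"
        return f"{verb} node {op.node}"
    raise TypeError(f"unknown operation: {op!r}")


requested = [
    RunAction(node="node2", action="start_docker"),
    Alert(name="docker_down", raise_it=True),
    SetNodeEnabled(node="node3", enabled=False),
]
descriptions = [describe(op) for op in requested]
print(descriptions)
```

Keeping the operations as plain data separates what a policy decides from how the decision is carried out, which makes individual policies easier to test in isolation.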

The following policies are responsible for self-healing:

Power cycle unreachable servers
Start core container of enabled node when it is not started
Start docker service when it is not running on enabled node
Start floating IPs on the Master node when they are not started
Start platform management service on enabled node
Activate platform management service on a node on active system
Deactivate a node when it is reported as Failed by resmgr
Start ntpd service if it is not running
Trigger time synchronization when the time is not synchronized
Trigger FSN battery reconditioning when it is needed
Start Call Home Daemon on the Hub node when it is not running
Start GPFS cluster when it is not running
Start GPFS node when it is not started
Mount GPFS file system on a node when it is not mounted
Start GPFS NSD when it is not started