System self-monitoring
System health evaluation is performed by the Policy Engine running on the Platform Manager Hub. The evaluation is based on a set of rules called policies.
One of the three control plane nodes is selected by Platform Manager to be the Hub node. This means that
the node runs the Policy Engine and provides some extra logs for monitoring and diagnosing
problems. To see which node is the Platform Manager Hub, run ap node -d.
Each policy searches the data collected in the Monitoring Data Cache for a particular issue and, when it finds that issue, performs recovery.
Policies are also responsible for raising and closing alerts through the Alerts Engine. Policies can be stateful, that is, they can keep local data between evaluations. This data can be used to track in-progress actions requested by a policy, to store information about sent alerts, and so on.
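The following is a minimal sketch of how a stateful policy might behave. It is not the actual Policy Engine API: the class names AlertsEngine and NtpdPolicy, the evaluate method, the ntpd_running field, and the start_ntpd action name are all hypothetical and are used only to illustrate keeping local state between evaluations so that a recovery action and an alert are requested once per issue.

```python
from dataclasses import dataclass, field


@dataclass
class AlertsEngine:
    """Hypothetical stand-in for the Alerts Engine: tracks open alerts by key."""
    open_alerts: set = field(default_factory=set)

    def raise_alert(self, key):
        self.open_alerts.add(key)

    def close_alert(self, key):
        self.open_alerts.discard(key)


@dataclass
class NtpdPolicy:
    """Hypothetical policy: looks for nodes where ntpd is not running."""
    alerts: AlertsEngine
    # Local state kept between evaluations: nodes with a restart in progress.
    pending: set = field(default_factory=set)

    def evaluate(self, cache):
        """Inspect one snapshot of monitoring data and return requested actions."""
        actions = []
        for node, data in sorted(cache.items()):
            key = f"ntpd_down:{node}"
            if data["ntpd_running"]:
                # Issue is gone: forget the in-progress action and close the alert.
                self.pending.discard(node)
                self.alerts.close_alert(key)
            elif node not in self.pending:
                # New issue: request recovery once and raise an alert.
                self.pending.add(node)
                self.alerts.raise_alert(key)
                actions.append(("start_ntpd", node))
        return actions


alerts = AlertsEngine()
policy = NtpdPolicy(alerts)
first = policy.evaluate({"node1": {"ntpd_running": False}})   # requests recovery
second = policy.evaluate({"node1": {"ntpd_running": False}})  # no duplicate request
third = policy.evaluate({"node1": {"ntpd_running": True}})    # recovered, alert closed
```

Because the pending set survives between calls, the second evaluation does not repeat the request while the first one is still in progress.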
Each policy has access to the complete monitoring data. Policies run only when Platform Manager is in the ACTIVE state. A policy can take the following actions:
- Request a resource manager action or call an action script on a local or remote node
- Raise or close an alert
- Change the system state, for instance, disable a node
- Power cycle unreachable servers
- Start core container of enabled node when it is not started
- Start docker service when it is not running on enabled node
- Start floating IPs on the Master node when they are not started
- Start platform management service on enabled node
- Activate platform management service on a node on active system
- Deactivate a node when it is reported as Failed by resmgr
- Start ntpd service if it is not running
- Trigger time synchronization when the time is not synchronized
- Trigger FSN battery reconditioning when it is needed
- Start Call Home Daemon on the Hub node when it is not running
- Start GPFS cluster when it is not running
- Start GPFS node when it is not started
- Mount GPFS file system on a node when it is not mounted
- Start GPFS NSD when it is not started
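A few of the actions listed above can be pictured as condition and action pairs that are evaluated only while Platform Manager is ACTIVE. The sketch below is an assumption-laden illustration, not the product's implementation: the RULES table, the run_policies function, the field names in the sample data, the action names, and the MAINTENANCE state name are all hypothetical.

```python
# Hypothetical rule table: (description, check on node data, action name).
RULES = [
    ("docker service not running on enabled node",
     lambda n: n["enabled"] and not n["docker_running"], "start_docker"),
    ("GPFS file system not mounted",
     lambda n: not n["gpfs_mounted"], "mount_gpfs"),
    ("node reported as Failed by resmgr",
     lambda n: n["resmgr_status"] == "Failed", "deactivate_node"),
]


def run_policies(manager_state, cache):
    """Return (node, action) pairs; evaluate nothing unless the state is ACTIVE."""
    if manager_state != "ACTIVE":
        return []
    requested = []
    for node, data in sorted(cache.items()):
        for _description, check, action in RULES:
            if check(data):
                requested.append((node, action))
    return requested


# Sample monitoring data (invented for illustration only).
cache = {
    "node1": {"enabled": True, "docker_running": False,
              "gpfs_mounted": True, "resmgr_status": "OK"},
    "node2": {"enabled": True, "docker_running": True,
              "gpfs_mounted": False, "resmgr_status": "Failed"},
}
active_result = run_policies("ACTIVE", cache)
inactive_result = run_policies("MAINTENANCE", cache)  # hypothetical non-ACTIVE state
```

Keeping the checks in a table separates issue detection from the decision about whether policies may run at all, which mirrors the rule that no policy is evaluated outside the ACTIVE state.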