Managing alerts in Cloud Pak for Data System
Platform Manager comes with a set of policies that are used to detect problems in the
system, based on the monitoring data of hardware and software components. The policies determine
which recovery actions are taken to restore system health. Sometimes optimal health cannot be
restored automatically and user attention is required. In such cases, you can define which action
the system takes.
In Platform Manager, alerts are always raised and closed
by policies. A set of default alerts is provided with Cloud Pak for Data System, and you can modify the alert rules.
There are two types of alerts:
- Event: a stateless alert related to a point-in-time event, such as the failure of an action or
the unexpected restart of a component. For example:
  - ACTION_FAILED: Container start-up action failed.
  - APPLIANCE_EVENT: Node disabled by user.
  - APPLIANCE_EVENT: Unreachable node restart requested.
- Issue: a stateful alert for an ongoing issue that remains open until the problem is fixed. For
example:
  - SW_NEEDS_ATTENTION: GPFS node failed to start.
  - APPLIANCE_APPLICATION_DOWN: Appliance application went down due to disabled node.
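The difference between the two alert types can be illustrated with a small Python model. This is a conceptual sketch only, not the Platform Manager API: the `Alert` class, its fields, and the `close` method are hypothetical names chosen for illustration. It shows that an event is recorded once and has no open state, while an issue stays open until it is closed.

```python
from dataclasses import dataclass, field
from enum import Enum


class AlertType(Enum):
    EVENT = "event"  # stateless, point-in-time
    ISSUE = "issue"  # stateful, open until the problem is fixed


@dataclass
class Alert:
    """Hypothetical alert record; not a Platform Manager data structure."""

    reason: str
    message: str
    alert_type: AlertType
    is_open: bool = field(init=False, default=False)

    def __post_init__(self):
        # Only issues carry an open/closed state; events are never open.
        self.is_open = self.alert_type is AlertType.ISSUE

    def close(self):
        """Mark an issue as closed once the underlying problem is fixed."""
        if self.alert_type is AlertType.EVENT:
            raise ValueError("events are stateless and cannot be closed")
        self.is_open = False


# Reason codes and messages are taken from the examples in the list above.
event = Alert("ACTION_FAILED", "Container start-up action failed.", AlertType.EVENT)
issue = Alert("SW_NEEDS_ATTENTION", "GPFS node failed to start.", AlertType.ISSUE)
```

In this model, `issue.is_open` stays true until `issue.close()` is called, whereas calling `event.close()` raises an error because an event has no state to change.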
Depending on the type of alert, the system handles it in one of the following ways:
- Informational: log only
- Actionable: send an email, an SNMP trap, or both
- Serviceable: use Call Home to open a problem management report (PMR)
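The mapping from handling category to system action can be sketched as a simple dispatch. This is an illustration under stated assumptions, not product code: the `Handling` enum, the `planned_actions` function, and its `email` and `snmp` flags are hypothetical, and the returned strings only name the actions listed above rather than performing them.

```python
from enum import Enum


class Handling(Enum):
    INFORMATIONAL = "informational"
    ACTIONABLE = "actionable"
    SERVICEABLE = "serviceable"


def planned_actions(handling, email=True, snmp=False):
    """Return the names of the actions for a handling category.

    Hypothetical helper: `email` and `snmp` stand in for whichever
    notification channels are configured for actionable alerts.
    """
    if handling is Handling.INFORMATIONAL:
        return ["log"]
    if handling is Handling.ACTIONABLE:
        actions = []
        if email:
            actions.append("send email")
        if snmp:
            actions.append("send SNMP trap")
        return actions
    if handling is Handling.SERVICEABLE:
        return ["open PMR through Call Home"]
    raise ValueError(f"unknown handling category: {handling!r}")
```

For example, `planned_actions(Handling.ACTIONABLE, email=True, snmp=True)` returns both notification actions, which corresponds to the "or both" case in the list.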