Managing alerts in Cloud Pak for Data System

Platform Manager comes with a set of policies that detect problems in the system based on the monitoring data of hardware and software components. The policies determine which recovery actions are taken to restore system health. Sometimes optimal health cannot be restored automatically and user attention is required. In such cases, you can define which action the system should take. In Platform Manager, alerts are always raised and closed by policies. A set of default alerts is provided with Cloud Pak for Data System, and you can modify the alert rules.
There are two types of alerts:
  • Event: a stateless alert related to a point-in-time event, such as the failure of an action or the unexpected restart of a component. For example:
    • ACTION_FAILED: Container start-up action failed.
    • APPLIANCE_EVENT: Node disabled by user.
    • APPLIANCE_EVENT: Unreachable node restart requested.
  • Issue: a stateful alert for an ongoing problem that remains open until the problem is fixed. For example:
    • SW_NEEDS_ATTENTION: GPFS node failed to start.
    • APPLIANCE_APPLICATION_DOWN: Appliance application went down due to disabled node.
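The difference between the two alert types can be sketched in a few lines of Python. This is only an illustrative model, not the Platform Manager implementation or API: the `Alert` class, the `AlertType` enum, the field names, and the ID counter are all hypothetical. Only the reason codes and messages are taken from the examples above.

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import count


class AlertType(Enum):
    EVENT = "event"  # stateless: records something that happened at a point in time
    ISSUE = "issue"  # stateful: stays open until the problem is fixed


# Hypothetical ID source; the real system assigns its own unique IDs.
_next_id = count(1)


@dataclass
class Alert:
    reason_code: str
    message: str
    alert_type: AlertType
    alert_id: int = field(default_factory=lambda: next(_next_id))
    is_open: bool = field(init=False, default=False)

    def __post_init__(self):
        # Only issues carry an open/closed state; an event is never "open".
        self.is_open = self.alert_type is AlertType.ISSUE

    def close(self):
        # In the product, policies close alerts; here it is a plain method.
        if self.alert_type is AlertType.EVENT:
            raise ValueError("events are stateless and cannot be closed")
        self.is_open = False


event = Alert("ACTION_FAILED", "Container start-up action failed.", AlertType.EVENT)
issue = Alert("SW_NEEDS_ATTENTION", "GPFS node failed to start.", AlertType.ISSUE)
issue.close()  # the issue is closed once the underlying problem is resolved
```

In this sketch an event is simply recorded, while an issue keeps an `is_open` flag that changes only when the alert is closed.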
Each alert has a unique ID. For information on all alert attributes, see Alert attributes.
Depending on the type of alert, the system can manage it by taking one of the following actions:
  • Informational - log only
  • Actionable - send an email, an SNMP trap, or both
  • Serviceable - use Call Home to open a PMR (Problem Management Report)
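As a rough illustration of how the three handling categories differ, the following Python sketch maps each category to the notifications it would trigger. The `Handling` enum, the `planned_actions` function, its `email` and `snmp` flags, and the returned labels are assumptions made for this example; they do not correspond to Platform Manager commands or configuration keys.

```python
from enum import Enum


class Handling(Enum):
    INFORMATIONAL = "informational"
    ACTIONABLE = "actionable"
    SERVICEABLE = "serviceable"


def planned_actions(handling, email=True, snmp=True):
    """Return hypothetical labels for what the system would do with an alert.

    The email and snmp flags model the "email and/or SNMP trap" choice
    for actionable alerts; they are ignored for the other categories.
    """
    if handling is Handling.INFORMATIONAL:
        return ["log"]  # log only, no notification
    if handling is Handling.ACTIONABLE:
        actions = []
        if email:
            actions.append("email")
        if snmp:
            actions.append("snmp_trap")
        return actions
    if handling is Handling.SERVICEABLE:
        return ["call_home"]  # Call Home opens a PMR
    raise ValueError(f"unknown handling category: {handling!r}")


for category in Handling:
    print(category.value, planned_actions(category))
```

The sketch keeps each category to exactly the action listed above; whether actionable or serviceable alerts are also logged is not stated here, so it is left out.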
Read the following topics to learn more about configuring different types of alerts and their actions.