Managing alerts in Integrated Analytics System

The appliance platform manager comes with a set of policies that detect problems in the system based on monitoring data from hardware and software components. The policies determine which recovery actions are taken to restore system health. When optimal health cannot be restored automatically, user attention is required, and you can define which action the system takes in such cases. In platform manager, alerts are always raised and closed by policies. Integrated Analytics System provides a set of default alerts, and you can modify the alert rules.
There are two types of alerts:
  • Event - stateless alert - related to a point-in-time event, such as the failure of an action or the unexpected restart of a component. For example:
    • ACTION_FAILED: Container start-up action failed.
    • APPLIANCE_EVENT: Node disabled by user.
    • APPLIANCE_EVENT: Unreachable node restart requested.
  • Issue - stateful alert - ongoing issue that remains open until the problem is fixed. For example:
    • SW_NEEDS_ATTENTION: GPFS node failed to start.
    • APPLIANCE_APPLICATION_DOWN: Appliance application went down due to disabled node.
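The difference between the two alert types can be illustrated with a small model. This is only a sketch for illustration, not how platform manager stores alerts: the class name `Alert` and its fields (`alert_id`, `alert_type`, `title`, `stateful`, `closed`) are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Illustrative model of an alert; all field names are hypothetical."""
    alert_id: int       # each alert has a unique ID
    alert_type: str     # for example "SW_NEEDS_ATTENTION"
    title: str
    stateful: bool      # True for an issue, False for an event
    closed: bool = False

    @property
    def kind(self) -> str:
        return "issue" if self.stateful else "event"

    @property
    def is_open(self) -> bool:
        # Only issues carry an open/closed state; events are point-in-time records.
        return self.stateful and not self.closed

    def close(self) -> None:
        if not self.stateful:
            raise ValueError("events are stateless and cannot be closed")
        self.closed = True


event = Alert(1, "ACTION_FAILED", "Container start-up action failed.", stateful=False)
issue = Alert(2, "SW_NEEDS_ATTENTION", "GPFS node failed to start.", stateful=True)

print(event.kind, event.is_open)  # event False
print(issue.kind, issue.is_open)  # issue True
issue.close()                     # in the real system, a policy closes the issue
print(issue.is_open)              # False
```

In this model an event never has an open state, while an issue stays open until it is explicitly closed, which mirrors the stateless and stateful behavior described above.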
Each alert has a unique ID. For information on all alert attributes, see Alert attributes.
You can view a list of all possible alerts in your system by running the following command:
[root@node0101 ~]# ap issues --show_registry

Alerts Registry
+-------------+----------------------------+-------+---------------------------------------------------------------+----------+
| Reason Code | Type                       | Group | Title                                                         | Stateful |
+-------------+----------------------------+-------+---------------------------------------------------------------+----------+
| 101         | HW_SERVICE_REQUESTED       | HW    | Server is unreachable and cannot be recovered                 | YES      |
| 102         | HW_SERVICE_REQUESTED       | HW    | Server failed and was disabled                                | YES      |
| 103         | HW_SERVICE_REQUESTED       | HW    | Major component is unreachable                                | YES      |
| 104         | HW_SERVICE_REQUESTED       | HW    | Major component failed                                        | YES      |
| 105         | HW_SERVICE_REQUESTED       | HW    | Subcomponent failed                                           | YES      |
| 106         | HW_SERVICE_REQUESTED       | HW    | FSP unrecoverable events detected                             | NO       |
| 107         | HW_SERVICE_REQUESTED       | HW    | FSN thermal issues                                            | YES      |
| 108         | HW_SERVICE_REQUESTED       | HW    | Subcomponent is unreachable                                   | YES      |
...
| 849         | APPLIANCE_EVENT            | SW    | Application initialization has failed                         | NO       |
| 901         | STORAGE_UTILIZATION        | SW    | Storage utilization above threshold                           | YES      |
+-------------+----------------------------+-------+---------------------------------------------------------------+----------+

Generated: 2019-08-07 10:55:39
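
If you want to process the registry in a script, the table can be parsed into records. The following is a minimal sketch that assumes the output keeps the layout shown above (rows delimited by `|`, border lines starting with `+`). The function name `parse_registry` and the shortened `SAMPLE` text are assumptions made for this example and are not part of the product.

```python
def parse_registry(text):
    """Parse the ASCII table printed by `ap issues --show_registry`.

    Assumes every data row starts and ends with '|' and that no cell
    contains a '|' character.
    """
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the title, border lines, "..." and the Generated line
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            header = cells  # the first '|' row holds the column names
            continue
        rows.append(dict(zip(header, cells)))
    return rows


# Shortened copy of the output shown above; column widths do not matter.
SAMPLE = """\
Alerts Registry
+-------------+----------------------+-------+-----------------------------------------------+----------+
| Reason Code | Type                 | Group | Title                                         | Stateful |
+-------------+----------------------+-------+-----------------------------------------------+----------+
| 101         | HW_SERVICE_REQUESTED | HW    | Server is unreachable and cannot be recovered | YES      |
| 106         | HW_SERVICE_REQUESTED | HW    | FSP unrecoverable events detected             | NO       |
| 901         | STORAGE_UTILIZATION  | SW    | Storage utilization above threshold           | YES      |
+-------------+----------------------+-------+-----------------------------------------------+----------+

Generated: 2019-08-07 10:55:39
"""

alerts = parse_registry(SAMPLE)
stateful_codes = [a["Reason Code"] for a in alerts if a["Stateful"] == "YES"]
print(stateful_codes)  # ['101', '901']
```

On the appliance, the text could instead be captured from the command itself, for example with `subprocess.run(["ap", "issues", "--show_registry"], capture_output=True, text=True).stdout`, provided the real output matches the layout assumed here.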

Depending on the type of alert, the system handles it in one of the following ways:
  • Informational - log only
  • Actionable - send an email, an SNMP trap, or both
  • Serviceable - use Call Home to open a PMR (Problem Management Record)

Read the following topics to learn more about configuring different types of alerts and their actions.