Managing alerts in Integrated Analytics System
The appliance platform manager comes with a set of policies that are used to detect
problems in the system, based on the monitoring data of hardware and software components. The
policies determine which recovery actions are taken to restore system health. Sometimes the optimal
health cannot be restored automatically and user attention is required. In such cases, you can
define which action should be taken by the system.
In platform manager, alerts are always raised and closed by policies. A set of default alerts is
provided with Integrated Analytics System, and you can modify alert rules.
There are two types of alerts:
- Event - a stateless alert related to a point-in-time event, such as the failure of an action or
  the unexpected restart of a component. For example:
  - ACTION_FAILED: Container start-up action failed.
  - APPLIANCE_EVENT: Node disabled by user.
  - APPLIANCE_EVENT: Unreachable node restart requested.
- Issue - a stateful alert for an ongoing problem that remains open until the problem is fixed. For
  example:
  - SW_NEEDS_ATTENTION: GPFS node failed to start.
  - APPLIANCE_APPLICATION_DOWN: Appliance application went down due to disabled node.
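The difference between the two types can be illustrated with a minimal Python sketch. It is a toy model only, not the platform manager's implementation: the AlertTracker class and its method names are hypothetical and exist purely to show that an event is recorded once, while an issue stays open until it is closed.

```python
from dataclasses import dataclass, field


@dataclass
class AlertTracker:
    """Toy model (hypothetical): events are recorded, issues stay open until closed."""

    event_log: list = field(default_factory=list)
    open_issues: dict = field(default_factory=dict)

    def raise_event(self, alert_type: str, title: str) -> None:
        # Stateless: there is nothing to close later, so only record the occurrence.
        self.event_log.append((alert_type, title))

    def open_issue(self, alert_type: str, title: str) -> None:
        # Stateful: keep the issue until the underlying problem is fixed.
        self.open_issues[(alert_type, title)] = "OPEN"

    def close_issue(self, alert_type: str, title: str) -> None:
        # Closing an issue that is not open is a no-op.
        self.open_issues.pop((alert_type, title), None)


tracker = AlertTracker()
tracker.raise_event("APPLIANCE_EVENT", "Node disabled by user.")
tracker.open_issue("SW_NEEDS_ATTENTION", "GPFS node failed to start.")
```

In this sketch the event remains in the log as a historical record, whereas the issue disappears from open_issues only after close_issue is called for it.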
You can view a list of all possible alerts in your system by running the following
command:
[root@node0101 ~]# ap issues --show_registry
Alerts Registry
+-------------+----------------------------+-------+---------------------------------------------------------------+----------+
| Reason Code | Type | Group | Title | Stateful |
+-------------+----------------------------+-------+---------------------------------------------------------------+----------+
| 101 | HW_SERVICE_REQUESTED | HW | Server is unreachable and cannot be recovered | YES |
| 102 | HW_SERVICE_REQUESTED | HW | Server failed and was disabled | YES |
| 103 | HW_SERVICE_REQUESTED | HW | Major component is unreachable | YES |
| 104 | HW_SERVICE_REQUESTED | HW | Major component failed | YES |
| 105 | HW_SERVICE_REQUESTED | HW | Subcomponent failed | YES |
| 106 | HW_SERVICE_REQUESTED | HW | FSP unrecoverable events detected | NO |
| 107 | HW_SERVICE_REQUESTED | HW | FSN thermal issues | YES |
| 108 | HW_SERVICE_REQUESTED | HW | Subcomponent is unreachable | YES |
...
| 849 | APPLIANCE_EVENT | SW | Application initialization has failed | NO |
| 901 | STORAGE_UTILIZATION | SW | Storage utilization above threshold | YES |
+-------------+----------------------------+-------+---------------------------------------------------------------+----------+
Generated: 2019-08-07 10:55:39
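If you want to process the registry in a script, for example to separate stateful issues from stateless events, the table can be parsed as text. The following Python sketch assumes that the output has exactly the five pipe-delimited columns shown above and that titles do not contain a pipe character. The AlertDef type and the parse_registry function are illustrative names, not part of the product.

```python
from typing import NamedTuple


class AlertDef(NamedTuple):
    reason_code: int
    alert_type: str
    group: str
    title: str
    stateful: bool


def parse_registry(text: str) -> list[AlertDef]:
    """Parse registry rows from text captured from `ap issues --show_registry`."""
    alerts = []
    for line in text.splitlines():
        line = line.strip()
        # Data rows start and end with a pipe; border rows start with '+'.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # Skip the header row and any row without a numeric reason code.
        if len(cells) != 5 or not cells[0].isdigit():
            continue
        code, alert_type, group, title, stateful = cells
        alerts.append(AlertDef(int(code), alert_type, group, title, stateful == "YES"))
    return alerts


# Two rows copied from the sample output above.
SAMPLE = """\
| Reason Code | Type | Group | Title | Stateful |
| 106 | HW_SERVICE_REQUESTED | HW | FSP unrecoverable events detected | NO |
| 901 | STORAGE_UTILIZATION | SW | Storage utilization above threshold | YES |
"""

if __name__ == "__main__":
    for alert in parse_registry(SAMPLE):
        kind = "issue (stateful)" if alert.stateful else "event (stateless)"
        print(alert.reason_code, alert.alert_type, kind)
```

Border lines, the header row, the elided rows, and the Generated timestamp are all skipped because they do not have a numeric first cell.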
Depending on the type of alert, the system can manage it by taking one of the following
actions:
- Informational - log only
- Actionable - send an email, an SNMP trap, or both
- Serviceable - use Call Home to open a problem management record (PMR)
Read the following topics to learn more about configuring different types of alerts and their actions.