Event type and monitoring status for system health

An event might trigger a change in the state of a system.

The following three types of events are reported in the system:

State-changing events: These events change the state of a component or entity from good to bad or from bad to good depending on the corresponding state of the event.
Note: An event is raised when the health status of the component goes from good to bad. For example, an event is raised that changes the status of a component from HEALTHY to DEGRADED. However, if the state was already DEGRADED because of another active event, the status of the component does not change. Similarly, if the state of the component was FAILED, a DEGRADED event does not change the component's state, because the FAILED status takes precedence over the DEGRADED status.
Tip: These events are similar to state-changing events, but can be hidden by the user. Like a state-changing event, a tip is removed automatically when the problem is resolved. A tip event always changes the state of a component from HEALTHY to TIPS if the event is not hidden.
Note: If the state of a component changes to TIPS, it can be hidden. However, you can still view the active hidden events using the mmhealth node show ComponentName --verbose command, if the cause for the event still exists.
Information events: These are short notification events that will only be shown in the event log, but do not change the state of the components.
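
The three event types above can be modeled with a minimal sketch. The names EventKind, Event, and affects_state are hypothetical and are used for illustration only; they are not part of mmhealth or any IBM Spectrum Scale API. The sketch only encodes the rules stated above: information events never change a component's state, and a tip contributes to the state only while it is not hidden.

```python
from dataclasses import dataclass
from enum import Enum


class EventKind(Enum):
    """Hypothetical labels for the three event types described above."""
    STATE_CHANGING = "state_changing"
    TIP = "tip"
    INFO = "info"


@dataclass(frozen=True)
class Event:
    name: str
    kind: EventKind
    hidden: bool = False  # assumption: only meaningful for tip events


def affects_state(event: Event) -> bool:
    """Return True if an active event contributes to the component state."""
    if event.kind is EventKind.INFO:
        return False  # information events appear only in the event log
    if event.kind is EventKind.TIP:
        return not event.hidden  # a hidden tip stays active but does not set TIPS
    return True  # state-changing events always contribute


# Illustrative event names; these are not real mmhealth event names.
events = [
    Event("example_degraded", EventKind.STATE_CHANGING),
    Event("example_tip", EventKind.TIP, hidden=True),
    Event("example_info", EventKind.INFO),
]
contributing = [e.name for e in events if affects_state(e)]
print(contributing)
```

In this sketch a hidden tip is filtered out of the state calculation but remains in the list of active events, which mirrors how hidden tips can still be viewed with the --verbose option.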

The monitoring interval is between 15 and 30 seconds, depending on the component. However, some services are monitored less often (for example, once every 30 minutes) to save system resources. You can find more information about the events on the Monitoring > Events page in the IBM Spectrum Scale GUI or by issuing the mmhealth event show command.

The following are the possible statuses of nodes and services:

UNKNOWN - Status of the node or the service hosted on the node is not known.
HEALTHY - The node or the service hosted on the node is working as expected. There are no active error events.
CHECKING - The monitoring of a service or a component hosted on the node is currently starting. This state is transient and is updated when the startup is completed.
TIPS - There might be an issue with the configuration and tuning of the components. This status is set only by a tip event.
DEGRADED - The node or the service hosted on the node is not working as expected. That is, a problem occurred with the component but it did not result in a complete failure.
FAILED - The node or the service hosted on the node failed due to errors or cannot be reached anymore.
DEPEND - The node or the services hosted on the node have failed due to the failure of some components. For example, an NFS or SMB service shows this status if authentication has failed.

The statuses are ranked as follows: HEALTHY < TIPS < DEGRADED < FAILED. For example, the status of a service hosted on a node becomes FAILED if there is at least one active event in the FAILED status for that service. When the status of a service is set, FAILED takes priority over DEGRADED, which is followed by TIPS and then HEALTHY. That is, if a service has an active event with a HEALTHY status and another active event with a FAILED status, the system sets the status of the service to FAILED.
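
The ranking rule can be illustrated with a short sketch. The STATUS_RANK table and the service_status function are hypothetical names chosen for this example, not the product's implementation. The sketch assumes that a service with no active events is reported as HEALTHY, and it leaves out UNKNOWN, CHECKING, and DEPEND because the ranking above does not place them.

```python
# Rank taken from the ordering HEALTHY < TIPS < DEGRADED < FAILED.
STATUS_RANK = {"HEALTHY": 0, "TIPS": 1, "DEGRADED": 2, "FAILED": 3}


def service_status(active_event_statuses):
    """Return the highest-ranked status among the active events.

    Assumption: with no active events the service counts as HEALTHY.
    A status outside STATUS_RANK raises KeyError, because the ranking
    in the text does not cover it.
    """
    return max(
        active_event_statuses,
        key=STATUS_RANK.__getitem__,
        default="HEALTHY",
    )


print(service_status(["HEALTHY", "FAILED"]))     # FAILED
print(service_status(["TIPS", "DEGRADED"]))      # DEGRADED
print(service_status(["DEGRADED", "DEGRADED"]))  # DEGRADED
print(service_status([]))                        # HEALTHY
```

The third call shows the case from the earlier note: a second DEGRADED event leaves the status unchanged, because the result depends only on the most severe active status.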

Some directed maintenance procedures (DMPs) are available to solve issues caused by tip events. For information on DMPs, see Directed maintenance procedures for tip events.