Event type and monitoring status for system health

An event might trigger a change in the state of a system.

The following three types of events are reported in the system:
  • State-changing events: These events change the state of a component or entity from good to bad or from bad to good, depending on the state that is associated with the event.
    Note: An event is raised when the health status of the component goes from good to bad. For example, an event is raised that changes the status of a component from HEALTHY to DEGRADED. However, if the state was already DEGRADED because of another active event, the status of the component does not change. Likewise, if the state of the component was FAILED, a DEGRADED event does not change the component's state because the FAILED status takes precedence over the DEGRADED status.
  • Tip: Tips are similar to state-changing events, but the user can hide them. Like state-changing events, a tip is removed automatically when the problem is resolved. A tip event always changes the state of a component from HEALTHY to TIPS if the event is not hidden.
    Note: If the state of a component changes to TIPS, the tip event can be hidden. However, as long as the cause of the event still exists, you can view the active hidden events by issuing the mmhealth node show ComponentName --verbose command.
  • Information events: Information events are notices that are shown in the event log or in brackets in the output of the mmhealth node show command. They do not change the state of the component. They disappear after 24 hours or when they are resolved by the mmhealth event resolve command.
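
The three event types above differ in whether they can influence the state of a component. The following Python snippet is a minimal sketch of that rule only. It is not the mmhealth implementation, and the names EventType, Event, affects_state, and the example event names are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    STATE_CHANGE = "state_change"
    TIP = "tip"
    INFO = "info"


@dataclass
class Event:
    name: str                 # hypothetical event name
    event_type: EventType
    hidden: bool = False      # only meaningful for tip events


def affects_state(event: Event) -> bool:
    """Return True if the event can contribute to the component state."""
    if event.event_type is EventType.STATE_CHANGE:
        return True
    if event.event_type is EventType.TIP:
        # A tip changes the state only while it is not hidden.
        return not event.hidden
    # Information events never change the state.
    return False


events = [
    Event("example_state_event", EventType.STATE_CHANGE),
    Event("example_tip", EventType.TIP, hidden=True),
    Event("example_info", EventType.INFO),
]
contributing = [e.name for e in events if affects_state(e)]
print(contributing)  # ['example_state_event']
```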

The monitoring interval is 15 to 30 seconds, depending on the component. However, some services are monitored less often (for example, once every 30 minutes) to save system resources. You can find more information about the events on the Monitoring > Events page in the IBM Storage Scale GUI or by issuing the mmhealth event show command.

The following are the possible statuses of nodes and services:
  • UNKNOWN - The status of the node or the service that is hosted on the node is not known because of a problem with monitoring. In most cases, this is accompanied by an exception in the /var/adm/ras/mmsysmonitor.log file, where the root cause of the problem is described.
  • HEALTHY - The node or the service that is hosted on the node is working as expected. There are no active error events.
  • CHECKING - The monitoring of a service or a component that is hosted on the node is currently starting. This is a transient state that changes to another state when the mmsysmon daemon initialization is completed.
  • TIPS - There might be an issue with the configuration and tuning of the components. This status is assigned only by a tip event.
  • DEGRADED - The node or the service that is hosted on the node is not working as expected. This means that a problem with the component did not cause a complete component failure.
  • FAILED - The node or the service that is hosted on the node failed due to errors or cannot be reached anymore.
  • DEPEND - The node or the services that are hosted on the node failed due to the failure of some components. For example, an NFS or SMB service shows this status if authentication failed.
    Figure 1. IBM Storage Scale components dependency

The status is graded as follows: HEALTHY < TIPS < DEGRADED < FAILED. For example, the status of the service that is hosted on a node becomes FAILED if there is at least one active event in the FAILED status for that service. When the status of the service is set, FAILED takes priority over DEGRADED, which is followed by TIPS and then HEALTHY. That is, if a service has an active event with a HEALTHY status and another active event with a FAILED status, the system sets the status of the service to FAILED.
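
The grading above amounts to taking the highest-ranked status among the active events. The following Python snippet is a minimal sketch of that precedence rule, not the product's actual logic. The names STATUS_RANK and dominant_status are hypothetical, and the sketch assumes that a service with no active events is HEALTHY. UNKNOWN, CHECKING, and DEPEND are left out because the grading does not rank them.

```python
# Rank of each graded status: HEALTHY < TIPS < DEGRADED < FAILED.
STATUS_RANK = {"HEALTHY": 0, "TIPS": 1, "DEGRADED": 2, "FAILED": 3}


def dominant_status(active_event_statuses):
    """Return the highest-ranked status among the active events.

    Assumption: with no active events, the service is HEALTHY.
    """
    return max(
        active_event_statuses,
        key=STATUS_RANK.__getitem__,
        default="HEALTHY",
    )


print(dominant_status(["HEALTHY", "FAILED"]))  # FAILED
print(dominant_status(["TIPS", "DEGRADED"]))   # DEGRADED
```

A lookup table keeps the ordering in one place, so the comparison does not depend on the alphabetical order of the status names.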

Some directed maintenance procedures (DMPs) are available to solve issues caused by tip events. For more information, see Directed maintenance procedures for tip events.

New encryption events are added, each identified by a unique ID. Events with different IDs can be raised multiple times, but each unique ID is listed only once. Therefore, multiple events can be displayed at the same time, one for each unique ID, regardless of how many times they are raised.
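
The one-entry-per-ID listing can be pictured as deduplication keyed on the event ID. The following Python snippet is a minimal sketch under that assumption. The function list_unique_events, the event name example_encryption_event, and the IDs are hypothetical and do not reflect real event names or the internal data model.

```python
def list_unique_events(raised):
    """Collapse raised events to one entry per unique ID.

    raised: iterable of (event_name, event_id) tuples, in the order raised.
    Returns a list of (event_name, event_id), first occurrence per ID.
    """
    first_seen = {}
    for name, event_id in raised:
        # Keep only the first occurrence of each ID; later raises are ignored.
        first_seen.setdefault(event_id, name)
    return [(name, event_id) for event_id, name in first_seen.items()]


raised = [
    ("example_encryption_event", "id-1"),
    ("example_encryption_event", "id-1"),  # raised again, same ID
    ("example_encryption_event", "id-2"),
]
print(list_unique_events(raised))
# [('example_encryption_event', 'id-1'), ('example_encryption_event', 'id-2')]
```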

These events are cleared by using the mmhealth event resolve <event name> <event id> command.

For more information, see Encryption events.