mmhealth command

Monitors health status of nodes.

Synopsis

mmhealth node show [ GPFS | NETWORK [ UserDefinedSubComponent ] 
                   | FILESYSTEM [UserDefinedSubComponent ] | DISK | CES | AUTH | AUTH_OBJ 
                   | BLOCK | CESNETWORK | NFS | OBJECT | SMB | CLOUDGATEWAY | GUI
                   | PERFMON ] [-N {Node[,Node..] | NodeFile | NodeClass}] 
                   [--verbose] [--unhealthy]

mmhealth node eventlog [[--hour | --day | --week | --month] | [--verbose]]

Start of change or End of change

mmhealth event show [ EventName | EventID ]

Start of change or End of change

mmhealth cluster show [ NODE | GPFS | NETWORK [ UserDefinedSubComponent ] 
                   | FILESYSTEM  | DISK | CES | AUTH | AUTH_OBJ 
                   | BLOCK | CESNETWORK | NFS | OBJECT | SMB | CLOUDGATEWAY | GUI
                   | PERFMON ] [--verbose]

Start of change or End of change

mmhealth thresholds list

Availability

Available with IBM Spectrum Scale™ Express Edition or higher.

Description

Use the mmhealth command to monitor the health of the node and services hosted on the node in IBM Spectrum Scale.

By using this command, the IBM Spectrum Scale administrator can monitor the health of each node and services hosted on that node. This command also shows the events that are responsible for the unhealthy status of the services hosted on that node. This data might be helpful for monitoring and analyzing the reasons for the unhealthy status of the node. So, the mmhealth command acts as a problem determination tool to identify which services of the node are unhealthy and events responsible for their unhealthy status.

Start of change The mmhealth command also monitors the state of all the IBM Spectrum Scale RAID components such as array, pdisk, vdisk and enclosure of the nodes that belong to the recovery group. End of change

For more information about the system monitoring feature, see IBM Spectrum Scale: Administration Guide

The mmhealth command also shows the details of threshold rules to avoid file systems out of space errors. The space availability of the filesystem component depends upon the occupancy level of fileset-inode spaces and capacity usage in each data or metadata pool. The violation of any single rule triggers the parent filesystem capacity issue notification. The capacity metrics are frequently compared with the rules boundaries by internal monitor process. If any of the metric values exceeds their threshold limit, then the system health (deamon/service) will receive an event notification from monitor process and generate a RAS event for the filesystem for space issues. For all threshold rules the warning level is set to 80%, and the error level to 90%. You can use the mmlspool command to track the inode and pool space usage.

Parameters

node

Displays the health status, specifically, at node level.

show

Displays the health status of the specified component with:

Displays the detailed health status of the specified component.

UserDefinedSubComponent: Displays services that are named by the customer, categorized by one of the other hosted services. For example, a file system named gpfs0 is a subcomponent of file system.

-N

Allows the system to make remote calls to the other nodes in the cluster for:

Node[,Node....]: Specifies the node or list of nodes that must be monitored for the health status.

NodeFile: Specifies a file, containing a list of node descriptors, one per line, to be monitored for health status.

NodeClass: Specifies a node class that must be monitored for the health status.

--verbose

Shows the detailed health status of a node, including its sub-components.

--unhealthy

Displays the unhealthy components only.

eventlog

Shows the event history for a specified period of time. If no time period is specified, it displays all the events by default:

[--hour | --day | --week| --month]: Displays the event history for the specified time period.

[--verbose]: Displays additional information about the event like component name and event ID in the eventlog.

event show

Shows the detailed description of the specified event:

EventName: Displays the detailed description of the specified event name.
EventID: Displays the detailed description of the specified event ID.

cluster

Displays the health status of all nodes and monitored node components in the cluster.

show

Displays the health status of the specified component with:

NODE | GPFS | NETWORK | FILESYSTEM | DISK | CES | AUTH | AUTH_OBJ | BLOCK | CESNETWORK | NFS | OBJECT | SMB | CLOUDGATEWAY | GUI | PERFMON: Displays the detailed health status of the specified component.

--verbose: Shows the detailed health status of a node, including its sub-components.

thresholds list

Displays the list of the threshold rules defined for the system. End of change

Exit status

0: Successful completion.
nonzero: A failure has occurred.

Security

You must have root authority to run the mmhealth command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. See the information about the requirements for administering a GPFS system in the IBM Spectrum Scale: Administration Guide.

Examples

To show the health status of the current node, issue this command:

mmhealth node show

The system displays output similar to this: Start of change

Node name:      test_node
Node status:    HEALTHY
Status Change:  39 min. ago

Component          Status        Status Change    Reasons
-------------------------------------------------------------------
GPFS               HEALTHY       39 min. ago       -
NETWORK            HEALTHY       40 min. ago       -
FILESYSTEM         HEALTHY       39 min. ago       -
DISK               HEALTHY       39 min. ago       -
CES                HEALTHY       39 min. ago       -
PERFMON            HEALTHY       40 min. ago       -

To view the health status of a specific node, issue this command:

mmhealth node show -N test_node2

The system displays output similar to this: Start of change

Node name:      test_node2
Node status:    CHECKING
Status Change:  Now

Component       Status        Status Change    Reasons
-------------------------------------------------------------------
GPFS            CHECKING      Now              -
NETWORK         HEALTHY       Now              -
FILESYSTEM      CHECKING      Now              -
DISK            CHECKING      Now              -
CES             CHECKING      Now              -
PERFMON         HEALTHY       Now              -

To view the health status of all the nodes, issue this command:

mmhealth node show -N all

The system displays output similar to this: Start of change

Node name:    test_node
Node status:  DEGRADED

Component           Status        Status Change     Reasons
-------------------------------------------------------------
GPFS                HEALTHY          Now             -
CES                 FAILED           Now             smbd_down
FileSystem          HEALTHY          Now             -

Node name:            test_node2
Node status:          HEALTHY

Component           Status        Status Change    Reasons
------------------------------------------------------------
GPFS                HEALTHY       Now              -
CES                 HEALTHY       Now              -
FileSystem          HEALTHY       Now              -

To view the detailed health status of the component and its sub-component, issue this command:

mmhealth node show ces

The system displays output similar to this: Start of change

Node name:      test_node

Component       Status        Status Change    Reasons
-------------------------------------------------------------------
CES             HEALTHY       2 min. ago       -
  AUTH          DISABLED      2 min. ago       -
  AUTH_OBJ      DISABLED      2 min. ago       -
  BLOCK         DISABLED      2 min. ago       -
  CESNETWORK    HEALTHY       2 min. ago       -
  NFS           HEALTHY       2 min. ago       -
  OBJECT        DISABLED      2 min. ago       -
  SMB           HEALTHY       2 min. ago       -

To view the health status of only unhealthy components, issue this command:

mmhealth node show --unhealthy

The system displays output similar to this: Start of change

Node name:        test_node
Node status:       FAILED
Status Change:  1 min. ago

Component       Status        Status Change    Reasons
-------------------------------------------------------------------
GPFS            FAILED        1 min. ago       gpfs_down, quorum_down
FILESYSTEM      DEPEND        1 min. ago       unmounted_fs_check
CES             DEPEND        1 min. ago       ces_network_ips_down, nfs_in_grace

To view the health status of sub-components of a node's component, issue this command:

mmhealth node show --verbose

The system displays output similar to this: Start of change

Node name:      gssio1-hs.gpfs.net
Node status:    HEALTHY

Component                                    Status              Reasons
-------------------------------------------------------------------
GPFS                                         DEGRADED            -
NETWORK                                      HEALTHY             -
  bond0                                      HEALTHY             -
  ib0                                        HEALTHY             -
  ib1                                        HEALTHY             -
FILESYSTEM                                   DEGRADED            stale_mount, stale_mount, stale_mount
  Basic1                                     FAILED              stale_mount
  Basic2                                     FAILED              stale_mount
  Custom1                                    HEALTHY             -
  gpfs0                                      FAILED              stale_mount
  gpfs1                                      FAILED              stale_mount
DISK                                         DEGRADED            disk_down
  rg_gssio1_hs_Basic1_data_0                 HEALTHY             -
  rg_gssio1_hs_Basic1_system_0               HEALTHY             -
  rg_gssio1_hs_Basic2_data_0                 HEALTHY             -
  rg_gssio1_hs_Basic2_system_0               HEALTHY             -
  rg_gssio1_hs_Custom1_data1_0               HEALTHY             -
  rg_gssio1_hs_Custom1_system_0              DEGRADED            disk_down
  rg_gssio1_hs_Data_8M_2p_1_gpfs0            HEALTHY             -
  rg_gssio1_hs_Data_8M_3p_1_gpfs1            HEALTHY             -
  rg_gssio1_hs_MetaData_1M_3W_1_gpfs0        HEALTHY             -
  rg_gssio1_hs_MetaData_1M_4W_1_gpfs1        HEALTHY             -
  rg_gssio2_hs_Basic1_data_0                 HEALTHY             -
  rg_gssio2_hs_Basic1_system_0               HEALTHY             -
  rg_gssio2_hs_Basic2_data_0                 HEALTHY             -
  rg_gssio2_hs_Basic2_system_0               HEALTHY             -
  rg_gssio2_hs_Custom1_data1_0               HEALTHY             -
  rg_gssio2_hs_Custom1_system_0              HEALTHY             -
  rg_gssio2_hs_Data_8M_2p_1_gpfs0            HEALTHY             -
  rg_gssio2_hs_Data_8M_3p_1_gpfs1            HEALTHY             -
  rg_gssio2_hs_MetaData_1M_3W_1_gpfs0        HEALTHY             -
  rg_gssio2_hs_MetaData_1M_4W_1_gpfs1        HEALTHY             -
NATIVE_RAID                                  DEGRADED            gnr_pdisk_replaceable, gnr_rg_failed, enclosure_needsservice
  ARRAY                                      DEGRADED            -
    rg_gssio2-hs/DA1                         HEALTHY             -
    rg_gssio2-hs/DA2                         HEALTHY             -
    rg_gssio2-hs/NVR                         HEALTHY             -
    rg_gssio2-hs/SSD                         HEALTHY             -
  ENCLOSURE                                  DEGRADED            enclosure_needsservice
    SV52122944                               DEGRADED            enclosure_needsservice
    SV53058375                               HEALTHY             -
  PHYSICALDISK                               DEGRADED            gnr_pdisk_replaceable
    rg_gssio2-hs/e1d1s01                     FAILED              gnr_pdisk_replaceable
    rg_gssio2-hs/e1d1s07                     HEALTHY             -
    rg_gssio2-hs/e1d1s08                     HEALTHY             -
    rg_gssio2-hs/e1d1s09                     HEALTHY             -
    rg_gssio2-hs/e1d1s10                     HEALTHY             -
    rg_gssio2-hs/e1d1s11                     HEALTHY             -
    rg_gssio2-hs/e1d1s12                     HEALTHY             -
    rg_gssio2-hs/e1d2s07                     HEALTHY             -
    rg_gssio2-hs/e1d2s08                     HEALTHY             -
    rg_gssio2-hs/e1d2s09                     HEALTHY             -
    rg_gssio2-hs/e1d2s10                     HEALTHY             -
    rg_gssio2-hs/e1d2s11                     HEALTHY             -
    rg_gssio2-hs/e1d2s12                     HEALTHY             -
    rg_gssio2-hs/e1d3s07                     HEALTHY             -
    rg_gssio2-hs/e1d3s08                     HEALTHY             -
    rg_gssio2-hs/e1d3s09                     HEALTHY             -
    rg_gssio2-hs/e1d3s10                     HEALTHY             -
    rg_gssio2-hs/e1d3s11                     HEALTHY             -
    rg_gssio2-hs/e1d3s12                     HEALTHY             -
    rg_gssio2-hs/e1d4s07                     HEALTHY             -
    rg_gssio2-hs/e1d4s08                     HEALTHY             -
    rg_gssio2-hs/e1d4s09                     HEALTHY             -
    rg_gssio2-hs/e1d4s10                     HEALTHY             -
    rg_gssio2-hs/e1d4s11                     HEALTHY             -
    rg_gssio2-hs/e1d4s12                     HEALTHY             -
    rg_gssio2-hs/e1d5s07                     HEALTHY             -
    rg_gssio2-hs/e1d5s08                     HEALTHY             -
    rg_gssio2-hs/e1d5s09                     HEALTHY             -
    rg_gssio2-hs/e1d5s10                     HEALTHY             -
    rg_gssio2-hs/e1d5s11                     HEALTHY             -
    rg_gssio2-hs/e2d1s07                     HEALTHY             -
    rg_gssio2-hs/e2d1s08                     HEALTHY             -
    rg_gssio2-hs/e2d1s09                     HEALTHY             -
    rg_gssio2-hs/e2d1s10                     HEALTHY             -
    rg_gssio2-hs/e2d1s11                     HEALTHY             -
    rg_gssio2-hs/e2d1s12                     HEALTHY             -
    rg_gssio2-hs/e2d2s07                     HEALTHY             -
    rg_gssio2-hs/e2d2s08                     HEALTHY             -
    rg_gssio2-hs/e2d2s09                     HEALTHY             -
    rg_gssio2-hs/e2d2s10                     HEALTHY             -
    rg_gssio2-hs/e2d2s11                     HEALTHY             -
    rg_gssio2-hs/e2d2s12                     HEALTHY             -
    rg_gssio2-hs/e2d3s07                     HEALTHY             -
    rg_gssio2-hs/e2d3s08                     HEALTHY             -
    rg_gssio2-hs/e2d3s09                     HEALTHY             -
    rg_gssio2-hs/e2d3s10                     HEALTHY             -
    rg_gssio2-hs/e2d3s11                     HEALTHY             -
    rg_gssio2-hs/e2d3s12                     HEALTHY             -
    rg_gssio2-hs/e2d4s07                     HEALTHY             -
    rg_gssio2-hs/e2d4s08                     HEALTHY             -
    rg_gssio2-hs/e2d4s09                     HEALTHY             -
    rg_gssio2-hs/e2d4s10                     HEALTHY             -
    rg_gssio2-hs/e2d4s11                     HEALTHY             -
    rg_gssio2-hs/e2d4s12                     HEALTHY             -
    rg_gssio2-hs/e2d5s07                     HEALTHY             -
    rg_gssio2-hs/e2d5s08                     HEALTHY             -
    rg_gssio2-hs/e2d5s09                     HEALTHY             -
    rg_gssio2-hs/e2d5s10                     HEALTHY             -
    rg_gssio2-hs/e2d5s11                     HEALTHY             -
    rg_gssio2-hs/e2d5s12ssd                  HEALTHY             -
    rg_gssio2-hs/n1s02                       HEALTHY             -
    rg_gssio2-hs/n2s02                       HEALTHY             -
  RECOVERYGROUP                              DEGRADED            gnr_rg_failed
    rg_gssio1-hs                             FAILED              gnr_rg_failed
    rg_gssio2-hs                             HEALTHY             -
  VIRTUALDISK                                DEGRADED            -
    rg_gssio2_hs_Basic1_data_0               HEALTHY             -
    rg_gssio2_hs_Basic1_system_0             HEALTHY             -
    rg_gssio2_hs_Basic2_data_0               HEALTHY             -
    rg_gssio2_hs_Basic2_system_0             HEALTHY             -
    rg_gssio2_hs_Custom1_data1_0             HEALTHY             -
    rg_gssio2_hs_Custom1_system_0            HEALTHY             -
    rg_gssio2_hs_Data_8M_2p_1_gpfs0          HEALTHY             -
    rg_gssio2_hs_Data_8M_3p_1_gpfs1          HEALTHY             -
    rg_gssio2_hs_MetaData_1M_3W_1_gpfs0      HEALTHY             -
    rg_gssio2_hs_MetaData_1M_4W_1_gpfs1      HEALTHY             -
    rg_gssio2_hs_loghome                     HEALTHY             -
    rg_gssio2_hs_logtip                      HEALTHY             -
    rg_gssio2_hs_logtipbackup                HEALTHY             -
PERFMON                                      HEALTHY             -

To view the eventlog history of the node for the last hour, issue this command:

mmhealth node eventlog --hour