System health monitoring use cases

The following sections describe use cases for the mmhealth command.

Use case 1: Checking the health status of the nodes and their corresponding services by using the following commands:
  1. To show the health status of the current node, issue this command:
    mmhealth node show
    The system displays output similar to this:
    Node name:      test_node
    Node status:    HEALTHY
    Status Change:  39 min. ago
    
    Component          Status        Status Change    Reasons
    -------------------------------------------------------------------
    GPFS               HEALTHY       39 min. ago       -
    NETWORK            HEALTHY       40 min. ago       -
    FILESYSTEM         HEALTHY       39 min. ago       -
    DISK               HEALTHY       39 min. ago       -
    CES                HEALTHY       39 min. ago       -
    PERFMON            HEALTHY       40 min. ago       -
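In a monitoring script, the overall node state can be pulled out of the summary lines shown above. The following is a minimal sketch that assumes the output format shown in this step; the sample text is embedded so the sketch runs without a cluster.

```shell
# Extract the overall node state from `mmhealth node show` output.
# The sample is embedded here; against a live system you would pipe
# the command itself instead, e.g.:
#   mmhealth node show | awk '/^Node status:/ {print $3}'
sample='Node name:      test_node
Node status:    HEALTHY
Status Change:  39 min. ago'

status=$(printf '%s\n' "$sample" | awk '/^Node status:/ {print $3}')
echo "$status"   # prints: HEALTHY
```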
  2. To view the health status of a specific node, issue this command:
    mmhealth node show -N test_node2
    The system displays output similar to this:
    Node name:      test_node2
    Node status:    CHECKING
    Status Change:  Now
    
    Component       Status        Status Change    Reasons
    -------------------------------------------------------------------
    GPFS            CHECKING      Now              -
    NETWORK         HEALTHY       Now              -
    FILESYSTEM      CHECKING      Now              -
    DISK            CHECKING      Now              -
    CES             CHECKING      Now              -
    PERFMON         HEALTHY       Now              -
  3. To view the health status of all the nodes, issue this command:
    mmhealth node show -N all
    The system displays output similar to this:
    Node name:    test_node
    Node status:  DEGRADED
    
    Component           Status        Status Change     Reasons
    -------------------------------------------------------------
    GPFS                HEALTHY          Now             -
    CES                 FAILED           Now             smbd_down
    FileSystem          HEALTHY          Now             -
    
    Node name:            test_node2
    Node status:          HEALTHY
    
    Component           Status        Status Change    Reasons
    ------------------------------------------------------------
    GPFS                HEALTHY       Now              -
    CES                 HEALTHY       Now              -
    FileSystem          HEALTHY       Now              -
  4. To view the detailed health status of a component and its subcomponents, issue this command:
    mmhealth node show ces
    The system displays output similar to this:
    Node name:      test_node
    
    Component       Status        Status Change    Reasons
    -------------------------------------------------------------------
    CES             HEALTHY       2 min. ago       -
      AUTH          DISABLED      2 min. ago       -
      AUTH_OBJ      DISABLED      2 min. ago       -
      BLOCK         DISABLED      2 min. ago       -
      CESNETWORK    HEALTHY       2 min. ago       -
      NFS           HEALTHY       2 min. ago       -
      OBJECT        DISABLED      2 min. ago       -
      SMB           HEALTHY       2 min. ago       -
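The sub-service column above can be filtered in the same way as any tabular output. The following sketch lists the CES sub-services that are reported as DISABLED; the sample rows are embedded so the sketch runs without a cluster.

```shell
# List CES sub-services reported as DISABLED in `mmhealth node show ces`
# output. The sample rows are embedded; live use would be:
#   mmhealth node show ces | awk '$2 == "DISABLED" {print $1}'
sample='  AUTH          DISABLED      2 min. ago       -
  CESNETWORK    HEALTHY       2 min. ago       -
  OBJECT        DISABLED      2 min. ago       -'

disabled=$(printf '%s\n' "$sample" | awk '$2 == "DISABLED" {print $1}')
printf '%s\n' "$disabled"
```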
  5. To view the health status of only unhealthy components, issue this command:
    mmhealth node show --unhealthy
    The system displays output similar to this:
    Node name:      test_node
    Node status:    FAILED
    Status Change:  1 min. ago
    
    Component       Status        Status Change    Reasons
    -------------------------------------------------------------------
    GPFS            FAILED        1 min. ago       gpfs_down, quorum_down
    FILESYSTEM      DEPEND        1 min. ago       unmounted_fs_check
    CES             DEPEND        1 min. ago       ces_network_ips_down, nfs_in_grace
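The --unhealthy view lends itself to automated checks, for example from cron. The following sketch exits successfully only when at least one component line reports FAILED; the sample rows are embedded so the sketch runs without a cluster.

```shell
# Report when any component line shows FAILED. The sample rows are
# embedded; against a live system you would use:
#   mmhealth node show --unhealthy | awk '$2 == "FAILED" {found=1} END {exit !found}'
sample='GPFS            FAILED        1 min. ago       gpfs_down, quorum_down
FILESYSTEM      DEPEND        1 min. ago       unmounted_fs_check'

if printf '%s\n' "$sample" | awk '$2 == "FAILED" {found=1} END {exit !found}'; then
  echo "at least one component is FAILED"
fi
```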
  6. To view the health status of all components of a node along with their subcomponents, issue this command:
    mmhealth node show --verbose
    The system displays output similar to this:
    Node name:      gssio1-hs.gpfs.net
    Node status:    HEALTHY
    
    Component                                    Status              Reasons
    --------------------------------------------------------------------------
    GPFS                                         DEGRADED            -
    NETWORK                                      HEALTHY             -
      bond0                                      HEALTHY             -
      ib0                                        HEALTHY             -
      ib1                                        HEALTHY             -
    FILESYSTEM                                   DEGRADED            stale_mount, stale_mount, stale_mount
      Basic1                                     FAILED              stale_mount
      Basic2                                     FAILED              stale_mount
      Custom1                                    HEALTHY             -
      gpfs0                                      FAILED              stale_mount
      gpfs1                                      FAILED              stale_mount
    DISK                                         DEGRADED            disk_down
      rg_gssio1_hs_Basic1_data_0                 HEALTHY             -
      rg_gssio1_hs_Basic1_system_0               HEALTHY             -
      rg_gssio1_hs_Basic2_data_0                 HEALTHY             -
      rg_gssio1_hs_Basic2_system_0               HEALTHY             -
      rg_gssio1_hs_Custom1_data1_0               HEALTHY             -
      rg_gssio1_hs_Custom1_system_0              DEGRADED            disk_down
      rg_gssio1_hs_Data_8M_2p_1_gpfs0            HEALTHY             -
      rg_gssio1_hs_Data_8M_3p_1_gpfs1            HEALTHY             -
      rg_gssio1_hs_MetaData_1M_3W_1_gpfs0        HEALTHY             -
      rg_gssio1_hs_MetaData_1M_4W_1_gpfs1        HEALTHY             -
      rg_gssio2_hs_Basic1_data_0                 HEALTHY             -
      rg_gssio2_hs_Basic1_system_0               HEALTHY             -
      rg_gssio2_hs_Basic2_data_0                 HEALTHY             -
      rg_gssio2_hs_Basic2_system_0               HEALTHY             -
      rg_gssio2_hs_Custom1_data1_0               HEALTHY             -
      rg_gssio2_hs_Custom1_system_0              HEALTHY             -
      rg_gssio2_hs_Data_8M_2p_1_gpfs0            HEALTHY             -
      rg_gssio2_hs_Data_8M_3p_1_gpfs1            HEALTHY             -
      rg_gssio2_hs_MetaData_1M_3W_1_gpfs0        HEALTHY             -
      rg_gssio2_hs_MetaData_1M_4W_1_gpfs1        HEALTHY             -
    NATIVE_RAID                                  DEGRADED            gnr_pdisk_replaceable, gnr_rg_failed, enclosure_needsservice
      ARRAY                                      DEGRADED            -
        rg_gssio2-hs/DA1                         HEALTHY             -
        rg_gssio2-hs/DA2                         HEALTHY             -
        rg_gssio2-hs/NVR                         HEALTHY             -
        rg_gssio2-hs/SSD                         HEALTHY             -
      ENCLOSURE                                  DEGRADED            enclosure_needsservice
        SV52122944                               DEGRADED            enclosure_needsservice
        SV53058375                               HEALTHY             -
      PHYSICALDISK                               DEGRADED            gnr_pdisk_replaceable
        rg_gssio2-hs/e1d1s01                     FAILED              gnr_pdisk_replaceable
        rg_gssio2-hs/e1d1s07                     HEALTHY             -
        rg_gssio2-hs/e1d1s08                     HEALTHY             -
        rg_gssio2-hs/e1d1s09                     HEALTHY             -
        rg_gssio2-hs/e1d1s10                     HEALTHY             -
        rg_gssio2-hs/e1d1s11                     HEALTHY             -
        rg_gssio2-hs/e1d1s12                     HEALTHY             -
        rg_gssio2-hs/e1d2s07                     HEALTHY             -
        rg_gssio2-hs/e1d2s08                     HEALTHY             -
        rg_gssio2-hs/e1d2s09                     HEALTHY             -
        rg_gssio2-hs/e1d2s10                     HEALTHY             -
        rg_gssio2-hs/e1d2s11                     HEALTHY             -
        rg_gssio2-hs/e1d2s12                     HEALTHY             -
        rg_gssio2-hs/e1d3s07                     HEALTHY             -
        rg_gssio2-hs/e1d3s08                     HEALTHY             -
        rg_gssio2-hs/e1d3s09                     HEALTHY             -
        rg_gssio2-hs/e1d3s10                     HEALTHY             -
        rg_gssio2-hs/e1d3s11                     HEALTHY             -
        rg_gssio2-hs/e1d3s12                     HEALTHY             -
        rg_gssio2-hs/e1d4s07                     HEALTHY             -
        rg_gssio2-hs/e1d4s08                     HEALTHY             -
        rg_gssio2-hs/e1d4s09                     HEALTHY             -
        rg_gssio2-hs/e1d4s10                     HEALTHY             -
        rg_gssio2-hs/e1d4s11                     HEALTHY             -
        rg_gssio2-hs/e1d4s12                     HEALTHY             -
        rg_gssio2-hs/e1d5s07                     HEALTHY             -
        rg_gssio2-hs/e1d5s08                     HEALTHY             -
        rg_gssio2-hs/e1d5s09                     HEALTHY             -
        rg_gssio2-hs/e1d5s10                     HEALTHY             -
        rg_gssio2-hs/e1d5s11                     HEALTHY             -
        rg_gssio2-hs/e2d1s07                     HEALTHY             -
        rg_gssio2-hs/e2d1s08                     HEALTHY             -
        rg_gssio2-hs/e2d1s09                     HEALTHY             -
        rg_gssio2-hs/e2d1s10                     HEALTHY             -
        rg_gssio2-hs/e2d1s11                     HEALTHY             -
        rg_gssio2-hs/e2d1s12                     HEALTHY             -
        rg_gssio2-hs/e2d2s07                     HEALTHY             -
        rg_gssio2-hs/e2d2s08                     HEALTHY             -
        rg_gssio2-hs/e2d2s09                     HEALTHY             -
        rg_gssio2-hs/e2d2s10                     HEALTHY             -
        rg_gssio2-hs/e2d2s11                     HEALTHY             -
        rg_gssio2-hs/e2d2s12                     HEALTHY             -
        rg_gssio2-hs/e2d3s07                     HEALTHY             -
        rg_gssio2-hs/e2d3s08                     HEALTHY             -
        rg_gssio2-hs/e2d3s09                     HEALTHY             -
        rg_gssio2-hs/e2d3s10                     HEALTHY             -
        rg_gssio2-hs/e2d3s11                     HEALTHY             -
        rg_gssio2-hs/e2d3s12                     HEALTHY             -
        rg_gssio2-hs/e2d4s07                     HEALTHY             -
        rg_gssio2-hs/e2d4s08                     HEALTHY             -
        rg_gssio2-hs/e2d4s09                     HEALTHY             -
        rg_gssio2-hs/e2d4s10                     HEALTHY             -
        rg_gssio2-hs/e2d4s11                     HEALTHY             -
        rg_gssio2-hs/e2d4s12                     HEALTHY             -
        rg_gssio2-hs/e2d5s07                     HEALTHY             -
        rg_gssio2-hs/e2d5s08                     HEALTHY             -
        rg_gssio2-hs/e2d5s09                     HEALTHY             -
        rg_gssio2-hs/e2d5s10                     HEALTHY             -
        rg_gssio2-hs/e2d5s11                     HEALTHY             -
        rg_gssio2-hs/e2d5s12ssd                  HEALTHY             -
        rg_gssio2-hs/n1s02                       HEALTHY             -
        rg_gssio2-hs/n2s02                       HEALTHY             -
      RECOVERYGROUP                              DEGRADED            gnr_rg_failed
        rg_gssio1-hs                             FAILED              gnr_rg_failed
        rg_gssio2-hs                             HEALTHY             -
      VIRTUALDISK                                DEGRADED            -
        rg_gssio2_hs_Basic1_data_0               HEALTHY             -
        rg_gssio2_hs_Basic1_system_0             HEALTHY             -
        rg_gssio2_hs_Basic2_data_0               HEALTHY             -
        rg_gssio2_hs_Basic2_system_0             HEALTHY             -
        rg_gssio2_hs_Custom1_data1_0             HEALTHY             -
        rg_gssio2_hs_Custom1_system_0            HEALTHY             -
        rg_gssio2_hs_Data_8M_2p_1_gpfs0          HEALTHY             -
        rg_gssio2_hs_Data_8M_3p_1_gpfs1          HEALTHY             -
        rg_gssio2_hs_MetaData_1M_3W_1_gpfs0      HEALTHY             -
        rg_gssio2_hs_MetaData_1M_4W_1_gpfs1      HEALTHY             -
        rg_gssio2_hs_loghome                     HEALTHY             -
        rg_gssio2_hs_logtip                      HEALTHY             -
        rg_gssio2_hs_logtipbackup                HEALTHY             -
    PERFMON                                      HEALTHY             -		
  7. To view the event log history of the node for the last hour, issue this command:
    mmhealth node eventlog --hour
    The system displays output similar to this:
    Node name:      test-21.localnet.com
    Timestamp                             Event Name                Severity   Details
    2016-10-28 06:59:34.045980 CEST       monitor_started           INFO       The IBM Spectrum Scale monitoring 
                                                                               service has been started
    2016-10-28 07:01:21.919943 CEST       fs_remount_mount          INFO       The filesystem objfs was mounted internal
    2016-10-28 07:01:32.434703 CEST       disk_found                INFO       The disk disk1 was found
    2016-10-28 07:01:32.669125 CEST       disk_found                INFO       The disk disk8 was found
    2016-10-28 07:01:36.975902 CEST       filesystem_found          INFO       Filesystem objfs was found
    2016-10-28 07:01:37.226157 CEST       unmounted_fs_check        WARNING    The filesystem objfs is probably needed, 
                                                                               but not mounted
    2016-10-28 07:01:52.113691 CEST       mounted_fs_check          INFO       The filesystem objfs is mounted
    2016-10-28 07:01:52.283545 CEST       fs_remount_mount          INFO       The filesystem objfs was mounted normal
    2016-10-28 07:02:07.026093 CEST       mounted_fs_check          INFO       The filesystem objfs is mounted
    2016-10-28 07:14:58.498854 CEST       ces_network_ips_down      WARNING    No CES relevant NICs detected
    2016-10-28 07:15:07.702351 CEST       nodestatechange_info      INFO       A CES node state change: 
                                                                               Node 1 add startup flag
    2016-10-28 07:15:37.322997 CEST       nodestatechange_info      INFO       A CES node state change: 
                                                                               Node 1 remove startup flag
    2016-10-28 07:15:43.741149 CEST       ces_network_ips_up        INFO       CES-relevant IPs are served by found NICs
    2016-10-28 07:15:44.028031 CEST       ces_network_vanished      INFO       CES NIC eth0 has vanished
  8. To view the event log history of the node for the last hour with additional details, such as the component and the event ID, issue this command:
    mmhealth node eventlog --hour --verbose
    The system displays output similar to this:
    Node name:      test-21.localnet.com
    Timestamp                         Component    Event Name            Event ID Severity Details
    2016-10-28 06:59:34.045980 CEST   gpfs         monitor_started       999726   INFO     The IBM Spectrum Scale monitoring service has been started
    2016-10-28 07:01:21.919943 CEST   filesystem   fs_remount_mount      999306   INFO     The filesystem objfs was mounted internal
    2016-10-28 07:01:32.434703 CEST   disk         disk_found            999424   INFO     The disk disk1 was found
    2016-10-28 07:01:32.669125 CEST   disk         disk_found            999424   INFO     The disk disk8 was found
    2016-10-28 07:01:36.975902 CEST   filesystem   filesystem_found      999299   INFO     Filesystem objfs was found
    2016-10-28 07:01:37.226157 CEST   filesystem   unmounted_fs_check    999298   WARNING  The filesystem objfs is probably needed, but not mounted
    2016-10-28 07:01:52.113691 CEST   filesystem   mounted_fs_check      999301   INFO     The filesystem objfs is mounted
    2016-10-28 07:01:52.283545 CEST   filesystem   fs_remount_mount      999306   INFO     The filesystem objfs was mounted normal
    2016-10-28 07:02:07.026093 CEST   filesystem   mounted_fs_check      999301   INFO     The filesystem objfs is mounted
    2016-10-28 07:14:58.498854 CEST   cesnetwork   ces_network_ips_down  999426   WARNING  No CES relevant NICs detected
    2016-10-28 07:15:07.702351 CEST   gpfs         nodestatechange_info  999220   INFO     A CES node state change: Node 1 add startup flag
    2016-10-28 07:15:37.322997 CEST   gpfs         nodestatechange_info  999220   INFO     A CES node state change: Node 1 remove startup flag
    2016-10-28 07:15:43.741149 CEST   cesnetwork   ces_network_ips_up    999427   INFO     CES-relevant IPs are served by found NICs
    2016-10-28 07:15:44.028031 CEST   cesnetwork   ces_network_vanished  999434   INFO     CES NIC eth0 has vanished
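Because each event line carries a severity column, the event log is easy to summarize. The following sketch counts WARNING entries; the sample lines are embedded so the sketch runs without a cluster.

```shell
# Count WARNING entries in the event log history. The sample lines are
# embedded; against a live system you would use:
#   mmhealth node eventlog --hour | grep -c WARNING
sample='2016-10-28 07:01:37.226157 CEST   filesystem   unmounted_fs_check    999298   WARNING  The filesystem objfs is probably needed, but not mounted
2016-10-28 07:01:52.113691 CEST   filesystem   mounted_fs_check      999301   INFO     The filesystem objfs is mounted
2016-10-28 07:14:58.498854 CEST   cesnetwork   ces_network_ips_down  999426   WARNING  No CES relevant NICs detected'

count=$(printf '%s\n' "$sample" | grep -c 'WARNING')
echo "$count"   # 2 in this sample
```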
  9. To view the detailed description of an event, issue the mmhealth event show command. This example shows the quorum_down event:
    mmhealth event show quorum_down
    The system displays output similar to this:
    Event Name:         quorum_down
    Event ID:           999289
    Description:        Reasons could be network or hardware issues, or a shutdown of the cluster service.
                        The event does not necessarily indicate an issue with the cluster quorum state.
    Cause:              The local node does not have quorum. The cluster service might not be running.
    User Action:        Check if the cluster quorum nodes are running and can be reached over the network. 
                        Check the local firewall settings.
    Severity:           ERROR
    State:              DEGRADED  
  10. To view a health status summary of the cluster, issue the mmhealth cluster show command:
    mmhealth cluster show
    The system displays output similar to this:
    Component     Total   Failed   Degraded   Healthy   Other
    -----------------------------------------------------------------
    NODE             50        1          1        48       -
    GPFS             50        1          -        49       -
    NETWORK          50        -          -        50       -
    FILESYSTEM        3        -          -         3       -
    DISK             50        -          -        50       -
    CES               5        -          5         -       -
    CLOUDGATEWAY      2        -          -         2       -
    PERFMON          48        -          5        43       -
    Note: The cluster must have a minimum release level of 4.2.2.0 or later to use the mmhealth cluster show command. This command is not supported on the Windows operating system.
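The Failed column of the cluster summary can be reduced to a single number, for example for a dashboard. The following sketch sums that column, treating a dash as zero; the sample rows are embedded so the sketch runs without a cluster.

```shell
# Sum the Failed column of `mmhealth cluster show` output. The sample
# rows are embedded; against a live system you would skip the two
# header lines, e.g.:
#   mmhealth cluster show | awk 'NR > 2 && $3 != "-" {sum += $3} END {print sum+0}'
sample='NODE             50        1          1        48       -
GPFS             50        1          -        49       -
NETWORK          50        -          -        50       -'

failed=$(printf '%s\n' "$sample" | awk '$3 != "-" {sum += $3} END {print sum+0}')
echo "$failed"   # 2 in this sample
```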
  11. To view more detailed information about the cluster health status, issue this command:
    mmhealth cluster show --verbose
    The system displays output similar to this:
    Component     Total   Failed   Degraded   Healthy   Other
    -----------------------------------------------------------------
    NODE             50        1          1        48       -
    GPFS             50        1          -        49       -
    NETWORK          50        -          -        50       -
    FILESYSTEM
      FS1            15        -          -        15       -
      FS2             5        -          -         5       -
      FS3            20        -          -        20       -
    DISK             50        -          -        50       -
    CES               5        -          5         -       -
      AUTH            5        -          -         -       5
      AUTH_OBJ        5        5          -         -       -
      BLOCK           5        -          -         -       5
      CESNETWORK      5        -          -         5       -
      NFS             5        -          -         5       -
      OBJECT          5        -          -         5       -
      SMB             5        -          -         5       -
    CLOUDGATEWAY      2        -          -         2       -
    PERFMON          48        -          5        43       -
  12. To view the state of the file systems on the node, issue this command:
    mmhealth node show filesystem -v
    The system displays output similar to this:
    Node name: ibmnode1.ibm.com
    Component     Status        Status Change          Reasons
    --------------------------------------------------------------------------------------------------------
    FILESYSTEM    HEALTHY       2019-01-30 14:32:24    fs_maintenance_mode(gpfs0), unmounted_fs_check(gpfs0)
    gpfs0         SUSPENDED     2019-01-30 14:32:22    fs_maintenance_mode(gpfs0), unmounted_fs_check(gpfs0)
    objfs         HEALTHY       2019-01-30 14:32:22    -
    The command also displays the events that are active for each file system, similar to the following example:
    Event                 Parameter     Severity    Active Since          Event Message
    -------------------------------------------------------------------------------------------------------
    fs_maintenance_mode   gpfs0         INFO        2019-01-30 14:32:20   Filesystem gpfs0 is set to 
                                                                          maintenance mode.
    unmounted_fs_check    gpfs0         WARNING     2019-01-30 14:32:21   The filesystem gpfs0 is  
                                                                          probably needed, but not mounted
    fs_working_mode       objfs         INFO        2019-01-30 14:32:21   Filesystem objfs is 
                                                                          not in maintenance mode.
    mounted_fs_check      objfs         INFO        2019-01-30 14:32:21   The filesystem objfs is mounted
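A file system in maintenance mode shows up with the SUSPENDED status, as in the output above. The following sketch lists the file systems in that state; the sample rows are embedded so the sketch runs without a cluster.

```shell
# List file systems whose status is SUSPENDED (for example, set to
# maintenance mode). The sample rows are embedded; live use would be:
#   mmhealth node show filesystem | awk '$2 == "SUSPENDED" {print $1}'
sample='FILESYSTEM    HEALTHY       2019-01-30 14:32:24    fs_maintenance_mode(gpfs0)
gpfs0         SUSPENDED     2019-01-30 14:32:22    fs_maintenance_mode(gpfs0)
objfs         HEALTHY       2019-01-30 14:32:22    -'

suspended=$(printf '%s\n' "$sample" | awk '$2 == "SUSPENDED" {print $1}')
echo "$suspended"   # gpfs0
```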