System health monitoring use cases
The following sections describe the use cases for the mmhealth command.
Use case 1: Checking the health status of the nodes and their corresponding services by using the following commands:
- To show the health status of the current node, issue this command:

  mmhealth node show

  The system displays output similar to this:

  Node name:      test_node
  Node status:    HEALTHY
  Status Change:  39 min. ago

  Component      Status      Status Change    Reasons
  -------------------------------------------------------------------
  GPFS           HEALTHY     39 min. ago      -
  NETWORK        HEALTHY     40 min. ago      -
  FILESYSTEM     HEALTHY     39 min. ago      -
  DISK           HEALTHY     39 min. ago      -
  CES            HEALTHY     39 min. ago      -
  PERFMON        HEALTHY     40 min. ago      -
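If the node status needs to be consumed by a script, the "Key: value" header lines of this output are easy to parse. The following is a minimal Python sketch; `parse_header` is an illustrative helper name (not part of mmhealth), and the sample text mirrors the output shown above:

```python
# Sample header lines from `mmhealth node show`, as shown above.
sample = """Node name:      test_node
Node status:    HEALTHY
Status Change:  39 min. ago"""

def parse_header(text):
    """Return a dict built from the 'Key: value' header lines."""
    info = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()
    return info

print(parse_header(sample)["Node status"])  # HEALTHY
```

A check such as `parse_header(...)["Node status"] == "HEALTHY"` can then gate an alerting or automation step.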
- To view the health status of a specific node, issue this command:

  mmhealth node show -N test_node2

  The system displays output similar to this:

  Node name:      test_node2
  Node status:    CHECKING
  Status Change:  Now

  Component      Status      Status Change    Reasons
  -------------------------------------------------------------------
  GPFS           CHECKING    Now              -
  NETWORK        HEALTHY     Now              -
  FILESYSTEM     CHECKING    Now              -
  DISK           CHECKING    Now              -
  CES            CHECKING    Now              -
  PERFMON        HEALTHY     Now              -
- To view the health status of all the nodes, issue this command:

  mmhealth node show -N all

  The system displays output similar to this:

  Node name:      test_node
  Node status:    DEGRADED

  Component      Status      Status Change    Reasons
  -------------------------------------------------------------
  GPFS           HEALTHY     Now              -
  CES            FAILED      Now              smbd_down
  FileSystem     HEALTHY     Now              -

  Node name:      test_node2
  Node status:    HEALTHY

  Component      Status      Status Change    Reasons
  ------------------------------------------------------------
  GPFS           HEALTHY     Now              -
  CES            HEALTHY     Now              -
  FileSystem     HEALTHY     Now              -
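Because the output for multiple nodes repeats the "Node name:" and "Node status:" header pair per node, a script can reduce it to a per-node status map. A minimal Python sketch, where `node_statuses` is an illustrative helper name and the sample mirrors the two-node output above:

```python
# Two-node sample from `mmhealth node show -N all`, as shown above.
sample = """\
Node name:      test_node
Node status:    DEGRADED

Component      Status      Status Change    Reasons
-------------------------------------------------------------
GPFS           HEALTHY     Now              -
CES            FAILED      Now              smbd_down
FileSystem     HEALTHY     Now              -

Node name:      test_node2
Node status:    HEALTHY

Component      Status      Status Change    Reasons
------------------------------------------------------------
GPFS           HEALTHY     Now              -
CES            HEALTHY     Now              -
FileSystem     HEALTHY     Now              -
"""

def node_statuses(text):
    """Map each node name to its overall node status."""
    result = {}
    name = None
    for line in text.splitlines():
        if line.startswith("Node name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("Node status:") and name:
            result[name] = line.split(":", 1)[1].strip()
    return result

print(node_statuses(sample))
# {'test_node': 'DEGRADED', 'test_node2': 'HEALTHY'}
```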
- To view the detailed health status of a component and its subcomponents, issue this command:

  mmhealth node show ces

  The system displays output similar to this:

  Node name:      test_node

  Component      Status      Status Change    Reasons
  -------------------------------------------------------------------
  CES            HEALTHY     2 min. ago       -
    AUTH         DISABLED    2 min. ago       -
    AUTH_OBJ     DISABLED    2 min. ago       -
    BLOCK        DISABLED    2 min. ago       -
    CESNETWORK   HEALTHY     2 min. ago       -
    NFS          HEALTHY     2 min. ago       -
    OBJECT       DISABLED    2 min. ago       -
    SMB          HEALTHY     2 min. ago       -
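A DISABLED state means the service is not configured on the node, so a script inspecting this output usually wants to separate disabled subcomponents from failed ones. A minimal Python sketch; `disabled_subcomponents` is an illustrative helper name and the sample mirrors the CES output above:

```python
# CES subcomponent rows from `mmhealth node show ces`, as shown above.
sample = """\
CES            HEALTHY     2 min. ago       -
  AUTH         DISABLED    2 min. ago       -
  AUTH_OBJ     DISABLED    2 min. ago       -
  BLOCK        DISABLED    2 min. ago       -
  CESNETWORK   HEALTHY     2 min. ago       -
  NFS          HEALTHY     2 min. ago       -
  OBJECT       DISABLED    2 min. ago       -
  SMB          HEALTHY     2 min. ago       -
"""

def disabled_subcomponents(text):
    """Return the names of rows whose status column is DISABLED."""
    names = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) > 1 and tokens[1] == "DISABLED":
            names.append(tokens[0])
    return names

print(disabled_subcomponents(sample))
# ['AUTH', 'AUTH_OBJ', 'BLOCK', 'OBJECT']
```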
- To view the health status of only unhealthy components, issue this command:

  mmhealth node show --unhealthy

  The system displays output similar to this:

  Node name:      test_node
  Node status:    FAILED
  Status Change:  1 min. ago

  Component      Status      Status Change    Reasons
  -------------------------------------------------------------------
  GPFS           FAILED      1 min. ago       gpfs_down, quorum_down
  FILESYSTEM     DEPEND      1 min. ago       unmounted_fs_check
  CES            DEPEND      1 min. ago       ces_network_ips_down, nfs_in_grace
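The Reasons column lists the active events behind each unhealthy state, comma-separated, which is what an operator typically feeds into `mmhealth event show`. A minimal Python sketch that collects those reasons per component; `component_reasons` is an illustrative helper name, and the simple column split assumes relative-time values ending in "ago" as in the sample above:

```python
# Unhealthy component rows from `mmhealth node show --unhealthy`, as shown above.
sample = """\
Component      Status      Status Change    Reasons
-------------------------------------------------------------------
GPFS           FAILED      1 min. ago       gpfs_down, quorum_down
FILESYSTEM     DEPEND      1 min. ago       unmounted_fs_check
CES            DEPEND      1 min. ago       ces_network_ips_down, nfs_in_grace
"""

def component_reasons(text):
    """Map each unhealthy component to its list of event reasons."""
    reasons = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) >= 2 and tokens[1] in {"FAILED", "DEGRADED", "DEPEND"}:
            # Everything after the relative timestamp ('... ago') is the reason list.
            tail = line.split("ago", 1)[1] if "ago" in line else ""
            reasons[tokens[0]] = [r.strip() for r in tail.split(",") if r.strip()]
    return reasons

print(component_reasons(sample)["GPFS"])  # ['gpfs_down', 'quorum_down']
```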
- To view the health status of the subcomponents of a node's components, issue this command:

  mmhealth node show --verbose

  The system displays output similar to this:

  Node name:      gssio1-hs.gpfs.net
  Node status:    HEALTHY

  Component                              Status      Reasons
  --------------------------------------------------------------------------
  GPFS                                   DEGRADED    -
  NETWORK                                HEALTHY     -
    bond0                                HEALTHY     -
    ib0                                  HEALTHY     -
    ib1                                  HEALTHY     -
  FILESYSTEM                             DEGRADED    stale_mount, stale_mount, stale_mount
    Basic1                               FAILED      stale_mount
    Basic2                               FAILED      stale_mount
    Custom1                              HEALTHY     -
    gpfs0                                FAILED      stale_mount
    gpfs1                                FAILED      stale_mount
  DISK                                   DEGRADED    disk_down
    rg_gssio1_hs_Basic1_data_0           HEALTHY     -
    rg_gssio1_hs_Basic1_system_0         HEALTHY     -
    rg_gssio1_hs_Basic2_data_0           HEALTHY     -
    rg_gssio1_hs_Basic2_system_0         HEALTHY     -
    rg_gssio1_hs_Custom1_data1_0         HEALTHY     -
    rg_gssio1_hs_Custom1_system_0        DEGRADED    disk_down
    rg_gssio1_hs_Data_8M_2p_1_gpfs0      HEALTHY     -
    rg_gssio1_hs_Data_8M_3p_1_gpfs1      HEALTHY     -
    rg_gssio1_hs_MetaData_1M_3W_1_gpfs0  HEALTHY     -
    rg_gssio1_hs_MetaData_1M_4W_1_gpfs1  HEALTHY     -
    rg_gssio2_hs_Basic1_data_0           HEALTHY     -
    rg_gssio2_hs_Basic1_system_0         HEALTHY     -
    rg_gssio2_hs_Basic2_data_0           HEALTHY     -
    rg_gssio2_hs_Basic2_system_0         HEALTHY     -
    rg_gssio2_hs_Custom1_data1_0         HEALTHY     -
    rg_gssio2_hs_Custom1_system_0        HEALTHY     -
    rg_gssio2_hs_Data_8M_2p_1_gpfs0      HEALTHY     -
    rg_gssio2_hs_Data_8M_3p_1_gpfs1      HEALTHY     -
    rg_gssio2_hs_MetaData_1M_3W_1_gpfs0  HEALTHY     -
    rg_gssio2_hs_MetaData_1M_4W_1_gpfs1  HEALTHY     -
  NATIVE_RAID                            DEGRADED    gnr_pdisk_replaceable, gnr_rg_failed, enclosure_needsservice
    ARRAY                                DEGRADED    -
      rg_gssio2-hs/DA1                   HEALTHY     -
      rg_gssio2-hs/DA2                   HEALTHY     -
      rg_gssio2-hs/NVR                   HEALTHY     -
      rg_gssio2-hs/SSD                   HEALTHY     -
    ENCLOSURE                            DEGRADED    enclosure_needsservice
      SV52122944                         DEGRADED    enclosure_needsservice
      SV53058375                         HEALTHY     -
    PHYSICALDISK                         DEGRADED    gnr_pdisk_replaceable
      rg_gssio2-hs/e1d1s01               FAILED      gnr_pdisk_replaceable
      rg_gssio2-hs/e1d1s07               HEALTHY     -
      rg_gssio2-hs/e1d1s08               HEALTHY     -
      rg_gssio2-hs/e1d1s09               HEALTHY     -
      rg_gssio2-hs/e1d1s10               HEALTHY     -
      rg_gssio2-hs/e1d1s11               HEALTHY     -
      rg_gssio2-hs/e1d1s12               HEALTHY     -
      rg_gssio2-hs/e1d2s07               HEALTHY     -
      rg_gssio2-hs/e1d2s08               HEALTHY     -
      rg_gssio2-hs/e1d2s09               HEALTHY     -
      rg_gssio2-hs/e1d2s10               HEALTHY     -
      rg_gssio2-hs/e1d2s11               HEALTHY     -
      rg_gssio2-hs/e1d2s12               HEALTHY     -
      rg_gssio2-hs/e1d3s07               HEALTHY     -
      rg_gssio2-hs/e1d3s08               HEALTHY     -
      rg_gssio2-hs/e1d3s09               HEALTHY     -
      rg_gssio2-hs/e1d3s10               HEALTHY     -
      rg_gssio2-hs/e1d3s11               HEALTHY     -
      rg_gssio2-hs/e1d3s12               HEALTHY     -
      rg_gssio2-hs/e1d4s07               HEALTHY     -
      rg_gssio2-hs/e1d4s08               HEALTHY     -
      rg_gssio2-hs/e1d4s09               HEALTHY     -
      rg_gssio2-hs/e1d4s10               HEALTHY     -
      rg_gssio2-hs/e1d4s11               HEALTHY     -
      rg_gssio2-hs/e1d4s12               HEALTHY     -
      rg_gssio2-hs/e1d5s07               HEALTHY     -
      rg_gssio2-hs/e1d5s08               HEALTHY     -
      rg_gssio2-hs/e1d5s09               HEALTHY     -
      rg_gssio2-hs/e1d5s10               HEALTHY     -
      rg_gssio2-hs/e1d5s11               HEALTHY     -
      rg_gssio2-hs/e2d1s07               HEALTHY     -
      rg_gssio2-hs/e2d1s08               HEALTHY     -
      rg_gssio2-hs/e2d1s09               HEALTHY     -
      rg_gssio2-hs/e2d1s10               HEALTHY     -
      rg_gssio2-hs/e2d1s11               HEALTHY     -
      rg_gssio2-hs/e2d1s12               HEALTHY     -
      rg_gssio2-hs/e2d2s07               HEALTHY     -
      rg_gssio2-hs/e2d2s08               HEALTHY     -
      rg_gssio2-hs/e2d2s09               HEALTHY     -
      rg_gssio2-hs/e2d2s10               HEALTHY     -
      rg_gssio2-hs/e2d2s11               HEALTHY     -
      rg_gssio2-hs/e2d2s12               HEALTHY     -
      rg_gssio2-hs/e2d3s07               HEALTHY     -
      rg_gssio2-hs/e2d3s08               HEALTHY     -
      rg_gssio2-hs/e2d3s09               HEALTHY     -
      rg_gssio2-hs/e2d3s10               HEALTHY     -
      rg_gssio2-hs/e2d3s11               HEALTHY     -
      rg_gssio2-hs/e2d3s12               HEALTHY     -
      rg_gssio2-hs/e2d4s07               HEALTHY     -
      rg_gssio2-hs/e2d4s08               HEALTHY     -
      rg_gssio2-hs/e2d4s09               HEALTHY     -
      rg_gssio2-hs/e2d4s10               HEALTHY     -
      rg_gssio2-hs/e2d4s11               HEALTHY     -
      rg_gssio2-hs/e2d4s12               HEALTHY     -
      rg_gssio2-hs/e2d5s07               HEALTHY     -
      rg_gssio2-hs/e2d5s08               HEALTHY     -
      rg_gssio2-hs/e2d5s09               HEALTHY     -
      rg_gssio2-hs/e2d5s10               HEALTHY     -
      rg_gssio2-hs/e2d5s11               HEALTHY     -
      rg_gssio2-hs/e2d5s12ssd            HEALTHY     -
      rg_gssio2-hs/n1s02                 HEALTHY     -
      rg_gssio2-hs/n2s02                 HEALTHY     -
    RECOVERYGROUP                        DEGRADED    gnr_rg_failed
      rg_gssio1-hs                       FAILED      gnr_rg_failed
      rg_gssio2-hs                       HEALTHY     -
    VIRTUALDISK                          DEGRADED    -
      rg_gssio2_hs_Basic1_data_0         HEALTHY     -
      rg_gssio2_hs_Basic1_system_0       HEALTHY     -
      rg_gssio2_hs_Basic2_data_0         HEALTHY     -
      rg_gssio2_hs_Basic2_system_0       HEALTHY     -
      rg_gssio2_hs_Custom1_data1_0       HEALTHY     -
      rg_gssio2_hs_Custom1_system_0      HEALTHY     -
      rg_gssio2_hs_Data_8M_2p_1_gpfs0    HEALTHY     -
      rg_gssio2_hs_Data_8M_3p_1_gpfs1    HEALTHY     -
      rg_gssio2_hs_MetaData_1M_3W_1_gpfs0 HEALTHY    -
      rg_gssio2_hs_MetaData_1M_4W_1_gpfs1 HEALTHY    -
      rg_gssio2_hs_loghome               HEALTHY     -
      rg_gssio2_hs_logtip                HEALTHY     -
      rg_gssio2_hs_logtipbackup          HEALTHY     -
  PERFMON                                HEALTHY     -
- To view the eventlog history of the node for the last hour, issue this command:

  mmhealth node eventlog --hour

  The system displays output similar to this:

  Node name:      test-21.localnet.com

  Timestamp                        Event Name            Severity   Details
  2016-10-28 06:59:34.045980 CEST  monitor_started       INFO       The IBM Spectrum Scale monitoring service has been started
  2016-10-28 07:01:21.919943 CEST  fs_remount_mount      INFO       The filesystem objfs was mounted internal
  2016-10-28 07:01:32.434703 CEST  disk_found            INFO       The disk disk1 was found
  2016-10-28 07:01:32.669125 CEST  disk_found            INFO       The disk disk8 was found
  2016-10-28 07:01:36.975902 CEST  filesystem_found      INFO       Filesystem objfs was found
  2016-10-28 07:01:37.226157 CEST  unmounted_fs_check    WARNING    The filesystem objfs is probably needed, but not mounted
  2016-10-28 07:01:52.113691 CEST  mounted_fs_check      INFO       The filesystem objfs is mounted
  2016-10-28 07:01:52.283545 CEST  fs_remount_mount      INFO       The filesystem objfs was mounted normal
  2016-10-28 07:02:07.026093 CEST  mounted_fs_check      INFO       The filesystem objfs is mounted
  2016-10-28 07:14:58.498854 CEST  ces_network_ips_down  WARNING    No CES relevant NICs detected
  2016-10-28 07:15:07.702351 CEST  nodestatechange_info  INFO       A CES node state change: Node 1 add startup flag
  2016-10-28 07:15:37.322997 CEST  nodestatechange_info  INFO       A CES node state change: Node 1 remove startup flag
  2016-10-28 07:15:43.741149 CEST  ces_network_ips_up    INFO       CES-relevant IPs are served by found NICs
  2016-10-28 07:15:44.028031 CEST  ces_network_vanished  INFO       CES NIC eth0 has vanished
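When scanning a long eventlog, it is often the WARNING and ERROR rows that matter. A minimal Python sketch that filters the event rows by severity column; `events_by_severity` is an illustrative helper name, and the sample is a subset of the eventlog above:

```python
# Subset of `mmhealth node eventlog --hour` rows, as shown above.
sample = """\
2016-10-28 06:59:34.045980 CEST  monitor_started       INFO     The IBM Spectrum Scale monitoring service has been started
2016-10-28 07:01:37.226157 CEST  unmounted_fs_check    WARNING  The filesystem objfs is probably needed, but not mounted
2016-10-28 07:14:58.498854 CEST  ces_network_ips_down  WARNING  No CES relevant NICs detected
2016-10-28 07:15:43.741149 CEST  ces_network_ips_up    INFO     CES-relevant IPs are served by found NICs
"""

def events_by_severity(text, severity):
    """Return the event names whose severity column matches."""
    names = []
    for line in text.splitlines():
        # Columns: date, time, timezone, event name, severity, details...
        tokens = line.split()
        if len(tokens) > 4 and tokens[4] == severity:
            names.append(tokens[3])
    return names

print(events_by_severity(sample, "WARNING"))
# ['unmounted_fs_check', 'ces_network_ips_down']
```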
- To view the eventlog history of the node for the last hour, including the component name and event ID for each event, issue this command:

  mmhealth node eventlog --hour --verbose

  The system displays output similar to this:

  Node name:      test-21.localnet.com

  Timestamp                        Component   Event Name            Event ID  Severity   Details
  2016-10-28 06:59:34.045980 CEST  gpfs        monitor_started       999726    INFO       The IBM Spectrum Scale monitoring service has been started
  2016-10-28 07:01:21.919943 CEST  filesystem  fs_remount_mount      999306    INFO       The filesystem objfs was mounted internal
  2016-10-28 07:01:32.434703 CEST  disk        disk_found            999424    INFO       The disk disk1 was found
  2016-10-28 07:01:32.669125 CEST  disk        disk_found            999424    INFO       The disk disk8 was found
  2016-10-28 07:01:36.975902 CEST  filesystem  filesystem_found      999299    INFO       Filesystem objfs was found
  2016-10-28 07:01:37.226157 CEST  filesystem  unmounted_fs_check    999298    WARNING    The filesystem objfs is probably needed, but not mounted
  2016-10-28 07:01:52.113691 CEST  filesystem  mounted_fs_check      999301    INFO       The filesystem objfs is mounted
  2016-10-28 07:01:52.283545 CEST  filesystem  fs_remount_mount      999306    INFO       The filesystem objfs was mounted normal
  2016-10-28 07:02:07.026093 CEST  filesystem  mounted_fs_check      999301    INFO       The filesystem objfs is mounted
  2016-10-28 07:14:58.498854 CEST  cesnetwork  ces_network_ips_down  999426    WARNING    No CES relevant NICs detected
  2016-10-28 07:15:07.702351 CEST  gpfs        nodestatechange_info  999220    INFO       A CES node state change: Node 1 add startup flag
  2016-10-28 07:15:37.322997 CEST  gpfs        nodestatechange_info  999220    INFO       A CES node state change: Node 1 remove startup flag
  2016-10-28 07:15:43.741149 CEST  cesnetwork  ces_network_ips_up    999427    INFO       CES-relevant IPs are served by found NICs
  2016-10-28 07:15:44.028031 CEST  cesnetwork  ces_network_vanished  999434    INFO       CES NIC eth0 has vanished
- To view the detailed description of an event, issue the mmhealth event show command. The following example shows the quorum_down event:

  mmhealth event show quorum_down

  The system displays output similar to this:

  Event Name:   quorum_down
  Event ID:     999289
  Description:  Reasons could be network or hardware issues, or a shutdown of the cluster
                service. The event does not necessarily indicate an issue with the cluster
                quorum state.
  Cause:        The local node does not have quorum. The cluster service might not be running.
  User Action:  Check if the cluster quorum nodes are running and can be reached over the
                network. Check local firewall settings.
  Severity:     ERROR
  State:        DEGRADED
- To view the health status summary of the entire cluster, issue the mmhealth cluster show command:

  mmhealth cluster show

  The system displays output similar to this:

  Component         Total   Failed   Degraded   Healthy   Other
  -----------------------------------------------------------------
  NODE              50      1        1          48        -
  GPFS              50      1        -          49        -
  NETWORK           50      -        -          50        -
  FILESYSTEM        3       -        -          3         -
  DISK              50      -        -          50        -
  CES               5       -        5          -         -
  CLOUDGATEWAY      2       -        -          2         -
  PERFMON           48      -        5          43        -
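Each row of this summary partitions the component instances into Failed, Degraded, Healthy, and Other, so the per-row counts should sum to Total. A minimal Python sketch that verifies this invariant when post-processing the output; `check_totals` is an illustrative helper name, and the sample is a subset of the table above:

```python
# Subset of the `mmhealth cluster show` summary table, as shown above.
sample = """\
Component         Total   Failed   Degraded   Healthy   Other
-----------------------------------------------------------------
NODE              50      1        1          48        -
GPFS              50      1        -          49        -
CES               5       -        5          -         -
"""

def check_totals(text):
    """Verify Failed+Degraded+Healthy+Other == Total for each row ('-' counts as 0)."""
    ok = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) == 6 and tokens[1].isdigit():
            total = int(tokens[1])
            counts = [0 if t == "-" else int(t) for t in tokens[2:]]
            ok[tokens[0]] = (sum(counts) == total)
    return ok

print(check_totals(sample))
# {'NODE': True, 'GPFS': True, 'CES': True}
```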
Note: The cluster must have a minimum release level of 4.2.2.0 or higher to use the mmhealth cluster show command. Also, this command is not supported on the Windows operating system.
- To view more information about the cluster health status, issue this command:

  mmhealth cluster show --verbose

  The system displays output similar to this:

  Component         Total   Failed   Degraded   Healthy   Other
  -----------------------------------------------------------------
  NODE              50      1        1          48        -
  GPFS              50      1        -          49        -
  NETWORK           50      -        -          50        -
  FILESYSTEM
    FS1             15      -        -          15        -
    FS2             5       -        -          5         -
    FS3             20      -        -          20        -
  DISK              50      -        -          50        -
  CES               5       -        5          -         -
    AUTH            5       -        -          -         5
    AUTH_OBJ        5       5        -          -         -
    BLOCK           5       -        -          -         5
    CESNETWORK      5       -        -          5         -
    NFS             5       -        -          5         -
    OBJECT          5       -        -          5         -
    SMB             5       -        -          5         -
  CLOUDGATEWAY      2       -        -          2         -
  PERFMON           48      -        5          43        -
- To view the state of the file systems on a node, issue this command:

  mmhealth node show filesystem -v

  The system displays output similar to this:

  Node name:      ibmnode1.ibm.com

  Component      Status       Status Change        Reasons
  --------------------------------------------------------------------------------------------------------
  FILESYSTEM     HEALTHY      2019-01-30 14:32:24  fs_maintenance_mode(gpfs0), unmounted_fs_check(gpfs0)
    gpfs0        SUSPENDED    2019-01-30 14:32:22  fs_maintenance_mode(gpfs0), unmounted_fs_check(gpfs0)
    objfs        HEALTHY      2019-01-30 14:32:22  -

  Event                 Parameter   Severity   Active Since          Event Message
  -------------------------------------------------------------------------------------------------------
  fs_maintenance_mode   gpfs0       INFO       2019-01-30 14:32:20   Filesystem gpfs0 is set to maintenance mode.
  unmounted_fs_check    gpfs0       WARNING    2019-01-30 14:32:21   The filesystem gpfs0 is probably needed, but not mounted
  fs_working_mode       objfs       INFO       2019-01-30 14:32:21   Filesystem objfs is not in maintenance mode.
  mounted_fs_check      objfs       INFO       2019-01-30 14:32:21   The filesystem objfs is mounted
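The event table keys each active event to a file system in the Parameter column, so a script can collect the events for one file system at a time. A minimal Python sketch; `events_for` is an illustrative helper name and the sample mirrors the event table above:

```python
# Event rows from `mmhealth node show filesystem -v`, as shown above.
sample = """\
fs_maintenance_mode   gpfs0   INFO      2019-01-30 14:32:20   Filesystem gpfs0 is set to maintenance mode.
unmounted_fs_check    gpfs0   WARNING   2019-01-30 14:32:21   The filesystem gpfs0 is probably needed, but not mounted
fs_working_mode       objfs   INFO      2019-01-30 14:32:21   Filesystem objfs is not in maintenance mode.
mounted_fs_check      objfs   INFO      2019-01-30 14:32:21   The filesystem objfs is mounted
"""

def events_for(text, fs):
    """Return (event name, severity) pairs whose Parameter column matches fs."""
    pairs = []
    for line in text.splitlines():
        # Columns: event, parameter, severity, date, time, message...
        tokens = line.split()
        if len(tokens) > 2 and tokens[1] == fs:
            pairs.append((tokens[0], tokens[2]))
    return pairs

print(events_for(sample, "gpfs0"))
# [('fs_maintenance_mode', 'INFO'), ('unmounted_fs_check', 'WARNING')]
```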