mmhealth command

Monitors health status of nodes.

Synopsis

mmhealth node show [ GPFS | NETWORK [UserDefinedSubComponent]
                   | FILESYSTEM [UserDefinedSubComponent] | DISK [UserDefinedSubComponent]
                   | CES | AUTH | AUTH_OBJ | BLOCK | CESNETWORK | NFS | OBJECT | SMB
                   | HADOOP | CLOUDGATEWAY | GUI | PERFMON | THRESHOLD
                   | AFM [UserDefinedSubComponent] ]
                   [-N {Node[,Node..] | NodeFile | NodeClass}]
                   [-Y] [--verbose] [--unhealthy]

or

mmhealth node eventlog [[--hour | --day | --week | --month] | [--clear] | [--verbose]]
                      [-N {Node[,Node..] | NodeFile | NodeClass}]
                      [-Y]

or

mmhealth event show [ EventName | EventID ] [-N {Node[,Node..] | NodeFile | NodeClass}]

or

mmhealth event hide [ EventName [Entity_Name]]

or

mmhealth event unhide [ EventName [Entity_Name]]

or

mmhealth event list HIDDEN

or

mmhealth cluster show [ NODE | GPFS | NETWORK [UserDefinedSubComponent]
                      | FILESYSTEM [UserDefinedSubComponent] | DISK [UserDefinedSubComponent]
                      | CES | AUTH | AUTH_OBJ | BLOCK | CESNETWORK | NFS | OBJECT | SMB
                      | HADOOP | CLOUDGATEWAY | GUI | PERFMON | THRESHOLD
                      | AFM [UserDefinedSubComponent] ]
                      [-Y] [--verbose]

or

mmhealth thresholds list [--verbose]

or

mmhealth thresholds add { metric [:sum | avg | min | max | rate] | measurement }
                        [--errorlevel {threshold error limit} [--warnlevel {threshold warn limit}]
                        | --direction {high | low}]
                        [--sensitivity {bucketsize}] [--hysteresis {percentage}]
                        [--filterBy] [--groupBy] [--name {ruleName}]
                        [--errormsg {user defined action description}]
                        [--warnmsg {user defined action description}]

or

mmhealth thresholds delete { ruleName | all }

or

mmhealth config interval [OFF | LOW | MEDIUM | DEFAULT | HIGH]

Availability

Available on all IBM Spectrum Scale™ editions.

Description

Use the mmhealth command to monitor the health of the node and services hosted on the node in IBM Spectrum Scale.

The IBM Spectrum Scale administrator can monitor the health of each node and the services hosted on that node using the mmhealth command. The mmhealth command also shows the events that are responsible for the unhealthy status of the services hosted on the node. This data can be used to monitor and analyze the reasons for the unhealthy status of the node. The mmhealth command acts as a problem determination tool to identify which services of the node are unhealthy, and find the events responsible for the unhealthy state of the service.

The mmhealth command also monitors the state of all the IBM Spectrum Scale RAID components such as array, pdisk, vdisk, and enclosure of the nodes that belong to the recovery group.

For more information about the system monitoring feature, see the Monitoring system health by using the mmhealth command section in the IBM Spectrum Scale: Problem Determination Guide.

The mmhealth command shows the details of the threshold rules. This detail helps to avoid out-of-space errors for file systems. The space availability of the file system component depends on the occupancy level of the fileset inode spaces and the capacity usage in each data or metadata pool. The violation of any single rule triggers the parent file system's capacity-issue events. The internal monitoring process frequently compares the capacity metrics with the rule boundaries. If any metric value exceeds its threshold limit, the system health daemon receives an event notification from the monitoring process and generates a RAS event for the file system to report the space issue. For the predefined capacity utilization rules, the warn level is set to 80% and the error level to 90%. For the memory utilization rule, the warn level is set to 100 MB and the error level to 50 MB. You can use the mmlsfileset and mmlspool commands to track inode and pool space usage.
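
For example, a quick way to review the rules that are currently in effect and the resulting health state of threshold monitoring on the local node is a command sequence of the following form (a minimal sketch that uses only the syntax shown in the Synopsis; the output depends on your configuration):

    mmhealth thresholds list
    mmhealth node show THRESHOLD --verbose

As an arithmetic illustration of the predefined levels, a data pool with 10 TB of total space and 0.5 TB free is 95% utilized ((10 - 0.5) / 10), which exceeds the 90% error level of the predefined data pool capacity rule and therefore raises a capacity-issue event for the parent file system.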

Parameters

node
Displays the health status at the node level.
show
Displays the health status of the specified component:
GPFS™ | NETWORK | FILESYSTEM | DISK | CES | AUTH | AUTH_OBJ | BLOCK | CESNETWORK | NFS | OBJECT | SMB | HADOOP | CLOUDGATEWAY | GUI | PERFMON | THRESHOLD | AFM
Displays the detailed health status of the specified component.
UserDefinedSubComponent
Displays the status of services that are named by the customer and categorized under one of the hosted components. For example, a file system named gpfs0 is a subcomponent of the FILESYSTEM component.
-N
Allows the system to make remote calls to other nodes in the cluster. Specify one of the following:
Node[,Node....]
Specifies the node or list of nodes that must be monitored for the health status.
NodeFile
Specifies a file, containing a list of node descriptors, one per line, to be monitored for health status.
NodeClass
Specifies a node class that must be monitored for the health status.
-Y
Displays the command output in a parseable format with a colon (:) as a field delimiter. Each column is described by a header.
Note: Fields that have a colon (:) are encoded to prevent confusion. For the set of characters that might be encoded, see the command documentation of mmclidecode. Use the mmclidecode command to decode the field.
--verbose
Shows the detailed health status of a node, including its sub-components.
--unhealthy
Displays the unhealthy components only.
eventlog
Shows the event history for a specified period of time. If no time period is specified, it displays all the events by default:
[--hour | --day | --week | --month]
Displays the event history for the specified time period.
[--clear]
Clears the event log's database. This action cannot be reversed.
CAUTION:
The events database is used by the mmhealth node eventlog command as well as by the mmces events list command. If you clear the database, the output of mmces events list is also affected. Use the --clear option with caution.
-Y
Displays the command output in a parseable format with a colon (:) as a field delimiter. Each column is described by a header.
Note: Fields that have a colon (:) are encoded to prevent confusion. For the set of characters that might be encoded, see the command documentation of mmclidecode. Use the mmclidecode command to decode the field.
[--verbose]
Displays additional information about each event in the event log, such as the component name and the event ID.
event
Gives the details of various events:
show
Shows the detailed description of the specified event:
EventName
Displays the detailed description of the specified event name.
EventID
Displays the detailed description of the specified event ID.
hide
Hides the specified TIP events.
unhide
Reveals the TIP events that were previously hidden by using the hide option.
list HIDDEN
Shows all the TIP events that are added to the list of hidden events.
cluster
Displays the health status of all nodes and monitored node components in the cluster.
show
Displays the health status of the specified component:
NODE | GPFS | NETWORK | FILESYSTEM | DISK | CES | AUTH | AUTH_OBJ | BLOCK | CESNETWORK | NFS | OBJECT | SMB | HADOOP | CLOUDGATEWAY | GUI | PERFMON | THRESHOLD | AFM
Displays the detailed health status of the specified component.
-Y
Displays the command output in a parseable format with a colon (:) as a field delimiter. Each column is described by a header.
Note: Fields that have a colon (:) are encoded to prevent confusion. For the set of characters that might be encoded, see the command documentation of mmclidecode. Use the mmclidecode command to decode the field.
--verbose
Shows the detailed health status of the cluster, including the sub-components of each monitored component.
thresholds list
Displays the list of the threshold rules defined for the system.
thresholds add
Creates a new threshold rule for the specified metric or measurement, and activates the monitoring process for this rule.
Note: A measurement is a value calculated using more than one metric in a pre-defined formula.
metric [: SUM | AVG | MIN | MAX | RATE ]
Creates a threshold for the specified metric. All metrics that are supported by the performance monitoring tool can be used, either with their raw values or downsampled by an aggregator (sum, avg, min, max, rate). For a list of metrics supported by the performance monitoring tool, see List of performance metrics.
measurement
Creates a threshold for the specified measurement. The following measurements are supported:
Fileset_inode
Fileset Inode Capacity Utilization. Calculated as:

(sum(gpfs_fset_allocInodes)-sum(gpfs_fset_freeInodes))/sum(gpfs_fset_maxInodes)

DataPool_capUtil
Data Pool Capacity Utilization. Calculated as:

(sum(gpfs_pool_total_dataKB)-sum(gpfs_pool_free_dataKB))/sum(gpfs_pool_total_dataKB)

MetaDataPool_capUtil
MetaData Pool Capacity Utilization. Calculated as:

(sum(gpfs_pool_total_metaKB)-sum(gpfs_pool_free_metaKB))/sum(gpfs_pool_total_metaKB)

--errorlevel
Defines the threshold error limit. The threshold error limit can be a percentage or an integer, depending on the metric on which the threshold value is being set.
--warnlevel
Defines the threshold warn limit. The threshold warn limit can be a percentage or an integer, depending on the metric on which the threshold value is being set.
--direction
Defines the direction for the threshold limit. The allowed values are high or low.
--groupby
Groups the result based on the group key. The following values are allowed for the group key:
  • gpfs_cluster_name
  • gpfs_disk_name
  • gpfs_diskpool_name
  • gpfs_disk_usage_name
  • gpfs_fset_name
  • gpfs_fs_name
  • mountPoint
  • netdev_name
  • node
--filterby
Filters the result based on the filter key. The following values are allowed for the filter key:
  • gpfs_cluster_name
  • gpfs_disk_name
  • gpfs_diskpool_name
  • gpfs_disk_usage_name
  • gpfs_fset_name
  • gpfs_fs_name
  • mountPoint
  • netdev_name
  • node
--sensitivity
Defines the sample interval value in seconds. It is set to 300 by default. If a sensor is configured with an interval period greater than 300 seconds, the --sensitivity value is set to the same value as the sensor period. The minimum allowed value is 120 seconds. If a sensor is configured with an interval period less than 120 seconds, the --sensitivity value is set to 120 seconds.
--hysteresis
Defines the percentage that the observed value must be under (or over) the current threshold level to switch back to the previous state. The default value is 0.0, while the recommended value is 5.0.
--name
Defines the name of the rule. It can be an alphanumeric string with up to 30 characters. If the rule name is not specified, a default name is set. The default name is the metric or measurement name followed by an underscore and the suffix "custom", for example, MetaDataPool_capUtil_custom.
--errormsg
A user-defined description for the error condition. The message can be up to 256 bytes long. It must be enclosed in double quotes (""); otherwise the system returns an error.
--warnmsg
A user-defined description for the warning condition. The message can be up to 256 bytes long. It must be enclosed in double quotes (""); otherwise the system returns an error.
Important:
  • The mathematical aggregations AVG, SUM, MAX, MIN, and RATE can be used to determine how the metric values are merged in the evaluation source. The aggregation operations are not supported for measurements.

  • For each rule, the user can configure up to two conditions that trigger an event state change, --errorlevel and --warnlevel. At least one level limit setting is required. For example, the thresholds add command must include one of the following option combinations (see also the sketch after these notes):
    • mmhealth thresholds add { metric[:sum|avg|min|max|rate] | measurement }
      --errorlevel {threshold error limit} --direction {high|low}
    • mmhealth thresholds add { metric[:sum|avg|min|max|rate] | measurement }
      --errorlevel {threshold error limit} --warnlevel {threshold warn limit}
    • mmhealth thresholds add { metric[:sum|avg|min|max|rate] | measurement }
      --errorlevel {threshold error limit} --warnlevel {threshold warn limit} --direction {high|low}
  • The customer can also influence the measuring quantity and precision by specifying the sensitivity, groupby, filterby, hysteresis, or rule name option settings.

  • For each condition level, the customer can provide an output message text by using the --errormsg or --warnmsg option. This text is integrated into the state change event notification message that is triggered when the condition is exceeded.
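
For example, a rule that combines an error limit, a warning limit, and a direction (the third combination listed above) might take the following form. This is a sketch only; the rule name myFsetInodeRule and the chosen limits are illustrative, and the group keys match those of the predefined fileset inode rule:

    mmhealth thresholds add Fileset_inode --errorlevel 90.0 --warnlevel 80.0
                            --direction high --name myFsetInodeRule
                            --groupby gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name

If --name is omitted, the rule receives a default name such as Fileset_inode_custom (the measurement name followed by an underscore and the suffix "custom").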

thresholds delete
Deletes the threshold rules from the system.
ruleName
Deletes a specific threshold rule.
all
Deletes all the threshold rules.
Note: Using the mmhealth thresholds delete command to delete a rule will accomplish the following tasks:
  • The rule will be removed from the thresholds rules specification file and active monitoring process.
  • All the current health information created by this particular rule will be removed as well.
config interval
Sets the monitoring interval for the whole cluster.
off
The monitoring will be off for the whole cluster.
low
Monitoring is set for every (default monitoring time *10) seconds.
medium
Monitoring is set for every (default monitoring time *5) seconds.
default
Monitoring is set for every 15-30 seconds, based on the service being monitored.
high
Monitoring is set for every (default monitoring time /2) seconds.
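
As an illustration of these settings, a service whose default monitoring interval is 30 seconds is polled roughly every 300 seconds with LOW, every 150 seconds with MEDIUM, and every 15 seconds with HIGH; the exact values depend on the per-service default interval of 15-30 seconds. Examples 16 and 17 below show how to query and change this setting.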

Exit status

0
Successful completion.
nonzero
A failure has occurred.

Security

You must have root authority to run the mmhealth command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. See the information about the requirements for administering a GPFS system in the IBM Spectrum Scale: Administration Guide.

Examples

  1. To show the health status of the current node, issue this command:
    mmhealth node show
    The system displays output similar to this:
    Node name:      test_node
    Node status:    HEALTHY
    Status Change:  39 min. ago
    
    Component          Status        Status Change    Reasons
    -------------------------------------------------------------------
    GPFS               HEALTHY       39 min. ago       -
    NETWORK            HEALTHY       40 min. ago       -
    FILESYSTEM         HEALTHY       39 min. ago       -
    DISK               HEALTHY       39 min. ago       -
    CES                HEALTHY       39 min. ago       -
    PERFMON            HEALTHY       40 min. ago       -
    THRESHOLD          HEALTHY       40 min. ago       -
  2. To view the health status of a specific node, issue this command:
    mmhealth node show -N test_node2
    The system displays output similar to this:
    Node name:      test_node2
    Node status:    CHECKING
    Status Change:  Now
    
    Component       Status        Status Change    Reasons
    -------------------------------------------------------------------
    GPFS            CHECKING      Now              -
    NETWORK         HEALTHY       Now              -
    FILESYSTEM      CHECKING      Now              -
    DISK            CHECKING      Now              -
    CES             CHECKING      Now              -
    PERFMON         HEALTHY       Now              -
  3. To view the health status of all the nodes, issue this command:
    mmhealth node show -N all
    The system displays output similar to this:
    Node name:    test_node
    Node status:  DEGRADED
    
    Component           Status        Status Change     Reasons
    -------------------------------------------------------------
    GPFS                HEALTHY          Now             -
    CES                 FAILED           Now             smbd_down
    FileSystem          HEALTHY          Now             -
    
    Node name:            test_node2
    Node status:          HEALTHY
    
    Component           Status        Status Change    Reasons
    ------------------------------------------------------------
    GPFS                HEALTHY       Now              -
    CES                 HEALTHY       Now              -
    FileSystem          HEALTHY       Now              -
  4. To view the detailed health status of the component and its sub-component, issue this command:
    mmhealth node show ces
    The system displays output similar to this:
    Node name:      test_node
    
    Component       Status        Status Change    Reasons
    -------------------------------------------------------------------
    CES             HEALTHY       2 min. ago       -
      AUTH          DISABLED      2 min. ago       -
      AUTH_OBJ      DISABLED      2 min. ago       -
      BLOCK         DISABLED      2 min. ago       -
      CESNETWORK    HEALTHY       2 min. ago       -
      NFS           HEALTHY       2 min. ago       -
      OBJECT        DISABLED      2 min. ago       -
      SMB           HEALTHY       2 min. ago       -
  5. To view the health status of only unhealthy components, issue this command:
    mmhealth node show --unhealthy
    The system displays output similar to this:
    Node name:        test_node
    Node status:       FAILED
    Status Change:  1 min. ago
    
    Component       Status        Status Change    Reasons
    -------------------------------------------------------------------
    GPFS            FAILED        1 min. ago       gpfs_down, quorum_down
    FILESYSTEM      DEPEND        1 min. ago       unmounted_fs_check
    CES             DEPEND        1 min. ago       ces_network_ips_down, nfs_in_grace
  6. To view the health status of sub-components of a node's component, issue this command:
    mmhealth node show --verbose
    The system displays output similar to this:
    Node name:      gssio1-hs.gpfs.net
    Node status:    HEALTHY
    
    Component                                    Status              Reasons
    -------------------------------------------------------------------
    GPFS                                         DEGRADED            -
    NETWORK                                      HEALTHY             -
      bond0                                      HEALTHY             -
      ib0                                        HEALTHY             -
      ib1                                        HEALTHY             -
    FILESYSTEM                                   DEGRADED            stale_mount, stale_mount, stale_mount
      Basic1                                     FAILED              stale_mount
      Basic2                                     FAILED              stale_mount
      Custom1                                    HEALTHY             -
      gpfs0                                      FAILED              stale_mount
      gpfs1                                      FAILED              stale_mount
    DISK                                         DEGRADED            disk_down
      rg_gssio1_hs_Basic1_data_0                 HEALTHY             -
      rg_gssio1_hs_Basic1_system_0               HEALTHY             -
      rg_gssio1_hs_Basic2_data_0                 HEALTHY             -
      rg_gssio1_hs_Basic2_system_0               HEALTHY             -
      rg_gssio1_hs_Custom1_data1_0               HEALTHY             -
      rg_gssio1_hs_Custom1_system_0              DEGRADED            disk_down
      rg_gssio1_hs_Data_8M_2p_1_gpfs0            HEALTHY             -
      rg_gssio1_hs_Data_8M_3p_1_gpfs1            HEALTHY             -
      rg_gssio1_hs_MetaData_1M_3W_1_gpfs0        HEALTHY             -
      rg_gssio1_hs_MetaData_1M_4W_1_gpfs1        HEALTHY             -
      rg_gssio2_hs_Basic1_data_0                 HEALTHY             -
      rg_gssio2_hs_Basic1_system_0               HEALTHY             -
      rg_gssio2_hs_Basic2_data_0                 HEALTHY             -
      rg_gssio2_hs_Basic2_system_0               HEALTHY             -
      rg_gssio2_hs_Custom1_data1_0               HEALTHY             -
      rg_gssio2_hs_Custom1_system_0              HEALTHY             -
      rg_gssio2_hs_Data_8M_2p_1_gpfs0            HEALTHY             -
      rg_gssio2_hs_Data_8M_3p_1_gpfs1            HEALTHY             -
      rg_gssio2_hs_MetaData_1M_3W_1_gpfs0        HEALTHY             -
      rg_gssio2_hs_MetaData_1M_4W_1_gpfs1        HEALTHY             -
    NATIVE_RAID                                  DEGRADED            gnr_pdisk_replaceable, gnr_rg_failed, enclosure_needsservice
      ARRAY                                      DEGRADED            -
        rg_gssio2-hs/DA1                         HEALTHY             -
        rg_gssio2-hs/DA2                         HEALTHY             -
        rg_gssio2-hs/NVR                         HEALTHY             -
        rg_gssio2-hs/SSD                         HEALTHY             -
      ENCLOSURE                                  DEGRADED            enclosure_needsservice
        SV52122944                               DEGRADED            enclosure_needsservice
        SV53058375                               HEALTHY             -
      PHYSICALDISK                               DEGRADED            gnr_pdisk_replaceable
        rg_gssio2-hs/e1d1s01                     FAILED              gnr_pdisk_replaceable
        rg_gssio2-hs/e1d1s07                     HEALTHY             -
        rg_gssio2-hs/e1d1s08                     HEALTHY             -
        rg_gssio2-hs/e1d1s09                     HEALTHY             -
        rg_gssio2-hs/e1d1s10                     HEALTHY             -
        rg_gssio2-hs/e1d1s11                     HEALTHY             -
        rg_gssio2-hs/e1d1s12                     HEALTHY             -
        rg_gssio2-hs/e1d2s07                     HEALTHY             -
        rg_gssio2-hs/e1d2s08                     HEALTHY             -
        rg_gssio2-hs/e1d2s09                     HEALTHY             -
        rg_gssio2-hs/e1d2s10                     HEALTHY             -
        rg_gssio2-hs/e1d2s11                     HEALTHY             -
        rg_gssio2-hs/e1d2s12                     HEALTHY             -
        rg_gssio2-hs/e1d3s07                     HEALTHY             -
        rg_gssio2-hs/e1d3s08                     HEALTHY             -
        rg_gssio2-hs/e1d3s09                     HEALTHY             -
        rg_gssio2-hs/e1d3s10                     HEALTHY             -
        rg_gssio2-hs/e1d3s11                     HEALTHY             -
        rg_gssio2-hs/e1d3s12                     HEALTHY             -
        rg_gssio2-hs/e1d4s07                     HEALTHY             -
        rg_gssio2-hs/e1d4s08                     HEALTHY             -
        rg_gssio2-hs/e1d4s09                     HEALTHY             -
        rg_gssio2-hs/e1d4s10                     HEALTHY             -
        rg_gssio2-hs/e1d4s11                     HEALTHY             -
        rg_gssio2-hs/e1d4s12                     HEALTHY             -
        rg_gssio2-hs/e1d5s07                     HEALTHY             -
        rg_gssio2-hs/e1d5s08                     HEALTHY             -
        rg_gssio2-hs/e1d5s09                     HEALTHY             -
        rg_gssio2-hs/e1d5s10                     HEALTHY             -
        rg_gssio2-hs/e1d5s11                     HEALTHY             -
        rg_gssio2-hs/e2d1s07                     HEALTHY             -
        rg_gssio2-hs/e2d1s08                     HEALTHY             -
        rg_gssio2-hs/e2d1s09                     HEALTHY             -
        rg_gssio2-hs/e2d1s10                     HEALTHY             -
        rg_gssio2-hs/e2d1s11                     HEALTHY             -
        rg_gssio2-hs/e2d1s12                     HEALTHY             -
        rg_gssio2-hs/e2d2s07                     HEALTHY             -
        rg_gssio2-hs/e2d2s08                     HEALTHY             -
        rg_gssio2-hs/e2d2s09                     HEALTHY             -
        rg_gssio2-hs/e2d2s10                     HEALTHY             -
        rg_gssio2-hs/e2d2s11                     HEALTHY             -
        rg_gssio2-hs/e2d2s12                     HEALTHY             -
        rg_gssio2-hs/e2d3s07                     HEALTHY             -
        rg_gssio2-hs/e2d3s08                     HEALTHY             -
        rg_gssio2-hs/e2d3s09                     HEALTHY             -
        rg_gssio2-hs/e2d3s10                     HEALTHY             -
        rg_gssio2-hs/e2d3s11                     HEALTHY             -
        rg_gssio2-hs/e2d3s12                     HEALTHY             -
        rg_gssio2-hs/e2d4s07                     HEALTHY             -
        rg_gssio2-hs/e2d4s08                     HEALTHY             -
        rg_gssio2-hs/e2d4s09                     HEALTHY             -
        rg_gssio2-hs/e2d4s10                     HEALTHY             -
        rg_gssio2-hs/e2d4s11                     HEALTHY             -
        rg_gssio2-hs/e2d4s12                     HEALTHY             -
        rg_gssio2-hs/e2d5s07                     HEALTHY             -
        rg_gssio2-hs/e2d5s08                     HEALTHY             -
        rg_gssio2-hs/e2d5s09                     HEALTHY             -
        rg_gssio2-hs/e2d5s10                     HEALTHY             -
        rg_gssio2-hs/e2d5s11                     HEALTHY             -
        rg_gssio2-hs/e2d5s12ssd                  HEALTHY             -
        rg_gssio2-hs/n1s02                       HEALTHY             -
        rg_gssio2-hs/n2s02                       HEALTHY             -
      RECOVERYGROUP                              DEGRADED            gnr_rg_failed
        rg_gssio1-hs                             FAILED              gnr_rg_failed
        rg_gssio2-hs                             HEALTHY             -
      VIRTUALDISK                                DEGRADED            -
        rg_gssio2_hs_Basic1_data_0               HEALTHY             -
        rg_gssio2_hs_Basic1_system_0             HEALTHY             -
        rg_gssio2_hs_Basic2_data_0               HEALTHY             -
        rg_gssio2_hs_Basic2_system_0             HEALTHY             -
        rg_gssio2_hs_Custom1_data1_0             HEALTHY             -
        rg_gssio2_hs_Custom1_system_0            HEALTHY             -
        rg_gssio2_hs_Data_8M_2p_1_gpfs0          HEALTHY             -
        rg_gssio2_hs_Data_8M_3p_1_gpfs1          HEALTHY             -
        rg_gssio2_hs_MetaData_1M_3W_1_gpfs0      HEALTHY             -
        rg_gssio2_hs_MetaData_1M_4W_1_gpfs1      HEALTHY             -
        rg_gssio2_hs_loghome                     HEALTHY             -
        rg_gssio2_hs_logtip                      HEALTHY             -
        rg_gssio2_hs_logtipbackup                HEALTHY             -
    PERFMON                                      HEALTHY             -		
  7. To view the eventlog history of the node for the last hour, issue this command:
    mmhealth node eventlog --hour
    The system displays output similar to this:
    Node name:      test-21.localnet.com
    Timestamp                             Event Name                Severity   Details
    2016-10-28 06:59:34.045980 CEST       monitor_started           INFO       The IBM Spectrum Scale monitoring service has been started
    2016-10-28 07:01:21.919943 CEST       fs_remount_mount          INFO       The filesystem objfs was mounted internal
    2016-10-28 07:01:32.434703 CEST       disk_found                INFO       The disk disk1 was found
    2016-10-28 07:01:32.669125 CEST       disk_found                INFO       The disk disk8 was found
    2016-10-28 07:01:36.975902 CEST       filesystem_found          INFO       Filesystem objfs was found
    2016-10-28 07:01:37.226157 CEST       unmounted_fs_check        WARNING    The filesystem objfs is probably needed, but not mounted
    2016-10-28 07:01:52.113691 CEST       mounted_fs_check          INFO       The filesystem objfs is mounted
    2016-10-28 07:01:52.283545 CEST       fs_remount_mount          INFO       The filesystem objfs was mounted normal
    2016-10-28 07:02:07.026093 CEST       mounted_fs_check          INFO       The filesystem objfs is mounted
    2016-10-28 07:14:58.498854 CEST       ces_network_ips_down      WARNING    No CES relevant NICs detected
    2016-10-28 07:15:07.702351 CEST       nodestatechange_info      INFO       A CES node state change: Node 1 add startup flag
    2016-10-28 07:15:37.322997 CEST       nodestatechange_info      INFO       A CES node state change: Node 1 remove startup flag
    2016-10-28 07:15:43.741149 CEST       ces_network_ips_up        INFO       CES-relevant IPs are served by found NICs
    2016-10-28 07:15:44.028031 CEST       ces_network_vanished      INFO       CES NIC eth0 has vanished
  8. To view the eventlog history of the node for the last hour, issue this command:
    mmhealth node eventlog --hour --verbose
    The system displays output similar to this:
    Node name:      test-21.localnet.com
    Timestamp                             Component     Event Name                Event ID Severity   Details
    2016-10-28 06:59:34.045980 CEST       gpfs          monitor_started           999726   INFO       The IBM Spectrum Scale monitoring service has been started
    2016-10-28 07:01:21.919943 CEST       filesystem    fs_remount_mount          999306   INFO       The filesystem objfs was mounted internal
    2016-10-28 07:01:32.434703 CEST       disk          disk_found                999424   INFO       The disk disk1 was found
    2016-10-28 07:01:32.669125 CEST       disk          disk_found                999424   INFO       The disk disk8 was found
    2016-10-28 07:01:36.975902 CEST       filesystem    filesystem_found          999299   INFO       Filesystem objfs was found
    2016-10-28 07:01:37.226157 CEST       filesystem    unmounted_fs_check        999298   WARNING    The filesystem objfs is probably needed, but not mounted
    2016-10-28 07:01:52.113691 CEST       filesystem    mounted_fs_check          999301   INFO       The filesystem objfs is mounted
    2016-10-28 07:01:52.283545 CEST       filesystem    fs_remount_mount          999306   INFO       The filesystem objfs was mounted normal
    2016-10-28 07:02:07.026093 CEST       filesystem    mounted_fs_check          999301   INFO       The filesystem objfs is mounted
    2016-10-28 07:14:58.498854 CEST       cesnetwork    ces_network_ips_down      999426   WARNING    No CES relevant NICs detected
    2016-10-28 07:15:07.702351 CEST       gpfs          nodestatechange_info      999220   INFO       A CES node state change: Node 1 add startup flag
    2016-10-28 07:15:37.322997 CEST       gpfs          nodestatechange_info      999220   INFO       A CES node state change: Node 1 remove startup flag
    2016-10-28 07:15:43.741149 CEST       cesnetwork    ces_network_ips_up        999427   INFO       CES-relevant IPs are served by found NICs
    2016-10-28 07:15:44.028031 CEST       cesnetwork    ces_network_vanished      999434   INFO       CES NIC eth0 has vanished
  9. To view the detailed description of an event, issue the mmhealth event show command. This is an example for quorum_down event:
    mmhealth event show quorum_down
    The system displays output similar to this:
    Event Name:              quorum_down
    Event ID:                999289
    Description:             Reasons could be network or hardware issues, or a shutdown of the cluster service.
                             The event does not necessarily indicate an issue with the cluster quorum state.
    Cause:                   The local node does not have quorum. The cluster service might not be running.
    User Action:             Check if the cluster quorum nodes are running and can be reached over the network. 
                             Check local firewall settings
    Severity:                ERROR
    State:                   DEGRADED  
  10. To view the list of hidden events, issue the mmhealth event list HIDDEN command:
    mmhealth event list HIDDEN
    The system displays output similar to this:
    Event                           scope
    --------------------------------------
    gpfs_pagepool_small             -
    nfsv4_acl_type_wrong            fs1
    nfsv4_acl_type_wrong            fs2  
  11. To view the detailed description of the cluster, issue the mmhealth cluster show command:
    mmhealth cluster show
    The system displays output similar to this:
    Component     Total   Failed   Degraded   Healthy   Other
    -----------------------------------------------------------------
    NODE             50        1          1        48       -
    GPFS             50        1          -        49       -
    NETWORK          50        -          -        50       -
    FILESYSTEM        3        -          -         3       -
    DISK             50        -          -        50       -
    CES               5        -          5         -       -
    CLOUDGATEWAY      2        -          -         2       -
    PERFMON          48        -          5        43       -
    THRESHOLD         4        -          -         4       -  
    
    Note: The cluster must have a minimum release level of 4.2.2.0 or higher to use the mmhealth cluster show command.
    Also, this command is not supported on the Windows operating system.
  12. To view more information of the cluster health status, issue this command:
    mmhealth cluster show --verbose
    The system displays output similar to this:
    Component     Total   Failed   Degraded   Healthy   Other
    -----------------------------------------------------------------
    NODE             50        1          1        48       -
    GPFS             50        1          -        49       -
    NETWORK          50        -          -        50       -
    FILESYSTEM
      FS1            15        -          -        15       -
      FS2             5        -          -         5       -
      FS3            20        -          -        20       -
    DISK             50        -          -        50       -
    CES               5        -          5         -       -
      AUTH            5        -          -         -       5
      AUTH_OBJ        5        5          -         -       -
      BLOCK           5        -          -         -       5
      CESNETWORK      5        -          -         5       -
      NFS             5        -          -         5       -
      OBJECT          5        -          -         5       -
      SMB             5        -          -         5       -
    CLOUDGATEWAY      2        -          -         2       -
    PERFMON          48        -          5        43       -
    THRESHOLD         4        -          -         4       - 
  13. To create a new threshold rule, issue this command:
    mmhealth thresholds add MetaDataPool_capUtil --errorlevel 90 --direction high 
    --groupby gpfs_fs_name,gpfs_diskpool_name
    The system displays output similar to this:
    New rule 'MetaDataPool_capUtil_custom' is created. The monitor process is activated
  14. To view the list of threshold rules defined for the system, issue this command:
    mmhealth thresholds list
    The system displays output similar to this:
    ### Threshold Rules ###
    rule_name                    metric                error  warn    direction  filterBy  groupBy                                            sensitivity  
    ---------------------------------------------------------------------------------------------------------------------------------------------------------
    InodeCapUtil_Rule            Fileset_inode         90.0   80.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name      300          
    MetaDataPool_capUtil_custom  MetaDataPool_capUtil  90     None    high                 gpfs_fs_name,gpfs_diskpool_name                    300          
    DataCapUtil_Rule             DataPool_capUtil      90.0   80.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name  300          
    MemFree_Rule                 mem_memfree           50000  100000  low                  node                                               300          
    MetaDataCapUtil_Rule         MetaDataPool_capUtil  90.0   80.0    high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name  300   
  15. To view the detailed health status of the filesystem component, issue this command:
    mmhealth node show filesystem -v
    The system displays output similar to this:
    Node name:      gpfsgui-12.novalocal
    
    Component           Status        Status Change            Reasons
    -------------------------------------------------------------------------------
    FILESYSTEM          DEGRADED      2016-09-29 15:22:48      pool-data_high_error
      fs1               FAILED        2016-09-29 15:22:48      pool-data_high_error
      fs2               HEALTHY       2016-09-29 15:22:33      -
      objfs             HEALTHY       2016-09-29 15:22:33      -
    
    
    Event                   Parameter   Severity  Active Since           Event Message
    -------------------------------------------------------------------------------------------------------------------------------------------------------------
    pool-data_high_error    fs1          ERROR     2016-09-29 15:22:47    The pool myPool of file system fs1 reached a nearly exhausted data level. 90.0
    inode_normal            fs1          INFO      2016-09-29 15:22:47    The inode usage of fileset root in file system fs1 reached a normal level.
    inode_normal            fs2          INFO      2016-09-29 15:22:47    The inode usage of fileset root in file system fs2 reached a normal level.
    inode_normal            objfs        INFO      2016-09-29 15:22:47    The inode usage of fileset root in file system objfs reached a normal level.
    inode_normal            objfs        INFO      2016-09-29 15:22:47    The inode usage of fileset Object_Fileset in file system objfs reached a normal level.
    mounted_fs_check        fs1          INFO      2016-09-29 15:22:33    The filesystem fs1 is mounted
    mounted_fs_check        fs2          INFO      2016-09-29 15:22:33    The filesystem fs2 is mounted
    mounted_fs_check        objfs        INFO      2016-09-29 15:22:33    The filesystem objfs is mounted
    pool-data_normal        fs1          INFO      2016-09-29 15:22:47    The pool system of file system fs1 reached a normal data level.
    pool-data_normal        fs2          INFO      2016-09-29 15:22:47    The pool system of file system fs2 reached a normal data level.
    pool-data_normal        objfs        INFO      2016-09-29 15:22:47    The pool data of file system objfs reached a normal data level.
    pool-data_normal        objfs        INFO      2016-09-29 15:22:47    The pool system of file system objfs reached a normal data level.
    pool-metadata_normal    fs1          INFO      2016-09-29 15:22:47    The pool system of file system fs1 reached a normal metadata level.
    pool-metadata_normal    fs1          INFO      2016-09-29 15:22:47    The pool myPool of file system fs1 reached a normal metadata level.
    pool-metadata_normal    fs2          INFO      2016-09-29 15:22:47    The pool system of file system fs2 reached a normal metadata level.
    pool-metadata_normal    objfs        INFO      2016-09-29 15:22:47    The pool system of file system objfs reached a normal metadata level.
    pool-metadata_normal    objfs        INFO      2016-09-29 15:22:47    The pool data of file system objfs reached a normal metadata level.
    
  16. To check the monitoring interval, issue the following command:
    # mmhealth config interval
    The system displays output similar to this:
    Monitor interval is DEFAULT.
  17. To set the monitoring interval to low, issue the following command:
    # mmhealth config interval LOW
    The system displays output similar to this:
    Monitor interval changed to LOW.
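  18. To hide a TIP event so that it no longer affects the displayed health state, and to reveal it again later, commands of the following form can be used (a sketch that reuses the gpfs_pagepool_small TIP event from example 10):
    mmhealth event hide gpfs_pagepool_small
    mmhealth event unhide gpfs_pagepool_small
    The currently hidden events can be listed with the mmhealth event list HIDDEN command, as shown in example 10.
  19. To delete the threshold rule that was created in example 13, issue a command of the following form:
    mmhealth thresholds delete MetaDataPool_capUtil_custom
    To remove all threshold rules instead, use mmhealth thresholds delete all.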

Location

/usr/lpp/mmfs/bin