mmhealth command
Monitors health status of nodes.
Synopsis
mmhealth node show [ GPFS | NETWORK [ UserDefinedSubComponent ]
      | FILESYSTEM [ UserDefinedSubComponent ] | DISK [ UserDefinedSubComponent ]
      | CES | AUTH | AUTH_OBJ | BLOCK | CESNETWORK | NFS | OBJECT | SMB
      | HADOOP | CLOUDGATEWAY | GUI | PERFMON | THRESHOLD
      | AFM [ UserDefinedSubComponent ] ]
      [-N {Node[,Node...] | NodeFile | NodeClass}]
      [-Y] [--verbose] [--unhealthy]
or
mmhealth node eventlog [[--hour | --day | --week | --month] | [--clear] | [--verbose]]
      [-N {Node[,Node...] | NodeFile | NodeClass}]
[-Y]
or
mmhealth event show [ EventName | EventID ] [-N {Node[,Node...] | NodeFile | NodeClass}]
or
mmhealth event hide [ EventName [Entity_Name]]
or
mmhealth event unhide [ EventName [Entity_Name]]
or
mmhealth event list HIDDEN
or
mmhealth cluster show [ NODE | GPFS | NETWORK [ UserDefinedSubComponent ]
      | FILESYSTEM [ UserDefinedSubComponent ] | DISK [ UserDefinedSubComponent ]
      | CES | AUTH | AUTH_OBJ | BLOCK | CESNETWORK | NFS | OBJECT | SMB
      | HADOOP | CLOUDGATEWAY | GUI | PERFMON | THRESHOLD
      | AFM [ UserDefinedSubComponent ] ]
      [-Y] [--verbose]
or
mmhealth thresholds list [--verbose]
or
mmhealth thresholds add { metric [:sum | avg | min | max | rate] | measurement }
      [--errorlevel {threshold error limit}] [--warnlevel {threshold warn limit}]
      [--direction {high | low}]
      [--sensitivity {bucketsize}] [--hysteresis {percentage}]
      [--filterby] [--groupby] [--name {ruleName}]
      [--errormsg {user defined action description}]
      [--warnmsg {user defined action description}]
or
mmhealth thresholds delete { ruleName | all }
or
mmhealth config interval [OFF | LOW | MEDIUM | DEFAULT | HIGH]
Availability
Available on all IBM Spectrum Scale™ editions.
Description
Use the mmhealth command to monitor the health of a node and of the services that are hosted on that node in IBM Spectrum Scale.
The mmhealth command shows the events that are responsible for the unhealthy status of the services that are hosted on a node. You can use this data to analyze why a node is unhealthy. In this way, the mmhealth command acts as a problem determination tool that identifies which services on a node are unhealthy and which events are responsible for that state.
The mmhealth command also monitors the state of all IBM Spectrum Scale RAID components, such as arrays, pdisks, vdisks, and enclosures, for the nodes that belong to a recovery group.
For more information about the system monitoring feature, see the Monitoring system health by using the mmhealth command section in the IBM Spectrum Scale: Problem Determination Guide.
The mmhealth command also shows the details of threshold rules, which help to avoid out-of-space errors for file systems. The space availability of the FILESYSTEM component depends on the occupancy level of the fileset inode spaces and on the capacity usage in each data or metadata pool. The violation of any single rule triggers the parent file system's capacity-issue events. The internal monitoring process frequently compares the capacity metrics with the rule boundaries. If any metric value exceeds its threshold limit, the system health daemon receives an event notification from the monitoring process and raises a RAS event for the file system's space issue. For the predefined capacity utilization rules, the warn level is set to 80% and the error level to 90%. For the memory utilization rule, the warn level is set to 100 MB and the error level to 50 MB. You can use the mmlsfileset and mmlspool commands to track the inode and pool space usage.
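For example, to review the rule boundaries that currently apply and to track the underlying usage, you can combine the following commands (a sketch; gpfs0 is a hypothetical file system name):
mmhealth thresholds list      # list each rule with its warn and error levels
mmlsfileset gpfs0 -i          # track per-fileset inode usage
mmlspool gpfs0 all -L         # track data and metadata pool usage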
Parameters
- node
- Displays the health status at the node level.
- show
- Displays the health status of the node or of the specified component:
- GPFS™ | NETWORK | FILESYSTEM | DISK | CES | AUTH | AUTH_OBJ | BLOCK | CESNETWORK | NFS | OBJECT | SMB | HADOOP | CLOUDGATEWAY | GUI | PERFMON | THRESHOLD | AFM
- Displays the detailed health status of the specified component.
- UserDefinedSubComponent
- Displays services that are named by the customer, categorized under one of the other hosted services. For example, a file system named gpfs0 is a subcomponent of the FILESYSTEM component.
- -N
- Allows the system to make remote calls to other nodes in the cluster to retrieve their health status:
- Node[,Node....]
- Specifies the node or list of nodes that must be monitored for the health status.
- NodeFile
- Specifies a file, containing a list of node descriptors, one per line, to be monitored for health status.
- NodeClass
- Specifies a node class that must be monitored for the health status.
- -Y
- Displays the command output in a parseable format with a colon (:) as a field delimiter. Each column is described by a header. Note: Fields that contain a colon (:) are encoded to prevent confusion. For the set of characters that might be encoded, see the command documentation of mmclidecode. Use the mmclidecode command to decode the field.
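For example (an illustrative sketch), the parseable output can be filtered with standard tools to show only the rows that report a degraded state:
mmhealth node show -Y | grep -i degraded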
- --verbose
- Shows the detailed health status of a node, including its sub-components.
- --unhealthy
- Displays the unhealthy components only.
- eventlog
- Shows the event history for a specified period of time. If no time period is specified, all events are displayed by default:
- [--hour | --day | --week | --month]
- Displays the event history for the specified time period.
- [--clear]
- Clears the event log's database. This action cannot be reversed. CAUTION: The events database is used by mmhealth node eventlog as well as by mmces events list. Clearing the database therefore also affects the output of mmces events list. Use the --clear option with caution.
- -Y
- Displays the command output in a parseable format with a colon (:) as a field delimiter. Each column is described by a header. Note: Fields that contain a colon (:) are encoded to prevent confusion. For the set of characters that might be encoded, see the command documentation of mmclidecode. Use the mmclidecode command to decode the field.
- [--verbose]
- Displays additional information about the event, such as the component name and event ID, in the event log.
- event
- Gives the details of various events:
- show
- Shows the detailed description of the specified event:
- EventName
- Displays the detailed description of the specified event name.
- EventID
- Displays the detailed description of the specified event ID.
- hide
- Hides the specified TIP events.
- unhide
- Reveals the TIP events that were previously hidden by using the hide option.
- list HIDDEN
- Shows all the TIP events that are added to the list of hidden events.
- cluster
- Displays the health status of all nodes and monitored node components in the cluster.
- show
- Displays the health status of the specified component for all nodes in the cluster:
- NODE | GPFS | NETWORK | FILESYSTEM | DISK | CES | AUTH | AUTH_OBJ | BLOCK | CESNETWORK | NFS | OBJECT | SMB | HADOOP | CLOUDGATEWAY | GUI | PERFMON | THRESHOLD | AFM
- Displays the detailed health status of the specified component.
- -Y
- Displays the command output in a parseable format with a colon (:) as a field delimiter. Each column is described by a header. Note: Fields that contain a colon (:) are encoded to prevent confusion. For the set of characters that might be encoded, see the command documentation of mmclidecode. Use the mmclidecode command to decode the field.
- --verbose
- Shows the detailed health status of the cluster, including sub-components.
- thresholds list
- Displays the list of the threshold rules defined for the system.
- thresholds add
- Creates a new threshold rule for the specified metric or measurement, and activates the monitoring process for this rule. Note: A measurement is a value that is calculated from more than one metric by using a predefined formula.
- metric [:SUM | AVG | MIN | MAX | RATE]
- Creates a threshold for the specified metric. Any metric that is supported by the performance monitoring tool can be used, either as a raw value or downsampled by one of the aggregators (sum, avg, min, max, rate). For a list of metrics supported by the performance monitoring tool, see List of performance metrics.
- measurement
- Creates a threshold for the specified measurement. The following measurements are supported:
- Fileset_inode
- Fileset inode capacity utilization. Calculated as:
(sum(gpfs_fset_allocInodes)-sum(gpfs_fset_freeInodes))/sum(gpfs_fset_maxInodes)
- DataPool_capUtil
- Data pool capacity utilization. Calculated as:
(sum(gpfs_pool_total_dataKB)-sum(gpfs_pool_free_dataKB))/sum(gpfs_pool_total_dataKB)
- MetaDataPool_capUtil
- Metadata pool capacity utilization. Calculated as:
(sum(gpfs_pool_total_metaKB)-sum(gpfs_pool_free_metaKB))/sum(gpfs_pool_total_metaKB)
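For example (illustrative numbers), a fileset with 1000000 allocated inodes, 400000 free inodes, and a maximum of 2000000 inodes has an inode utilization of (1000000 - 400000) / 2000000 = 0.30, that is 30%, which is below the predefined warn level of 80%.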
- --errorlevel
- Defines the threshold error limit. The threshold error limit can be a percentage or an integer, depending on the metric on which the threshold value is being set.
- --warnlevel
- Defines the threshold warn limit. The threshold warn limit can be a percentage or an integer, depending on the metric on which the threshold value is being set.
- --direction
- Defines the direction for the threshold limit. The allowed values are high or low.
- --groupby
- Groups the result based on the group key. The following values are allowed for the group key:
- gpfs_cluster_name
- gpfs_disk_name
- gpfs_diskpool_name
- gpfs_disk_usage_name
- gpfs_fset_name
- gpfs_fs_name
- mountPoint
- netdev_name
- node
- --filterby
- Filters the result based on the filter key. The following values are allowed for the filter key:
- gpfs_cluster_name
- gpfs_disk_name
- gpfs_diskpool_name
- gpfs_disk_usage_name
- gpfs_fset_name
- gpfs_fs_name
- mountPoint
- netdev_name
- node
- --sensitivity
- Defines the sample interval value in seconds. It is set to 300 by default. If a sensor is configured with an interval period greater than 300 seconds, the --sensitivity value is set to the sensor's period. The minimum value allowed is 120 seconds. If a sensor is configured with an interval period less than 120 seconds, the --sensitivity value is set to 120 seconds.
- --hysteresis
- Defines the percentage by which the observed value must fall below (or rise above) the current threshold level to switch back to the previous state. The default value is 0.0; the recommended value is 5.0.
- --name
- Defines the name of the rule. It can be an alphanumeric string of up to 30 characters. If no rule name is specified, a default name is set: the metric name followed by an underscore and the suffix "custom" (for example, MetaDataPool_capUtil_custom).
- --errormsg
- Specifies a user-defined description that is included in the error-level event. The message can be up to 256 bytes long and must be enclosed in double quotation marks (""); otherwise, the system returns an error.
- --warnmsg
- Specifies a user-defined description that is included in the warning-level event. The message can be up to 256 bytes long and must be enclosed in double quotation marks (""); otherwise, the system returns an error.
Important: The mathematical aggregations AVG, SUM, MAX, MIN, and RATE can be used to determine how the metric values are merged in the evaluation source. Aggregation operations are not supported for measurements.
- For each rule, you can configure up to two conditions that trigger an event state change: --errorlevel and --warnlevel. At least one level limit setting is required. For example, the thresholds add command must include one of the following option combinations:
mmhealth thresholds add { metric[:sum|avg|min|max|rate] | measurement } --errorlevel {threshold error limit} --direction {high|low}
mmhealth thresholds add { metric[:sum|avg|min|max|rate] | measurement } --errorlevel {threshold error limit} --warnlevel {threshold warn limit}
mmhealth thresholds add { metric[:sum|avg|min|max|rate] | measurement } --errorlevel {threshold error limit} --warnlevel {threshold warn limit} --direction {high|low}
You can also influence the measurement quantity and precision by specifying the --sensitivity, --groupby, --filterby, --hysteresis, or --name option.
For each condition level, you can provide a message text by using the --errormsg or --warnmsg option. The text is integrated into the state change event notification that is triggered when the condition is exceeded.
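For example, a complete rule that sets both levels, a direction, a hysteresis, a rule name, and both messages might look like this (a sketch; the rule name and message texts are illustrative, and the levels mirror the predefined memory rule):
mmhealth thresholds add mem_memfree --errorlevel 50000 --warnlevel 100000 --direction low --hysteresis 5.0 --name MemFree_custom --errormsg "Free memory is critically low" --warnmsg "Free memory is running low"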
- thresholds delete
- Deletes the threshold rules from the system.
- ruleName
- Deletes a specific threshold rule.
- all
- Deletes all the threshold rules.
Note: Using the mmhealth thresholds delete command to delete a rule accomplishes the following tasks:
- The rule is removed from the thresholds rules specification file and from the active monitoring process.
- All current health information that was created by this particular rule is removed as well.
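For example (a sketch that reuses the rule name created in the Examples section below):
mmhealth thresholds delete MetaDataPool_capUtil_custom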
- config interval
- Sets the monitoring interval for the whole cluster.
- off
- Monitoring is turned off for the whole cluster.
- low
- Monitoring runs every (default monitoring time * 10) seconds.
- medium
- Monitoring runs every (default monitoring time * 5) seconds.
- default
- Monitoring runs every 15 - 30 seconds, depending on the service being monitored.
- high
- Monitoring runs every (default monitoring time / 2) seconds.
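For example (a sketch), to reduce the monitoring overhead on a busy cluster by multiplying the default monitoring interval by 10:
mmhealth config interval LOW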
Exit status
- 0
- Successful completion.
- nonzero
- A failure has occurred.
Security
You must have root authority to run the mmhealth command.
The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. See the information about the requirements for administering a GPFS system in the IBM Spectrum Scale: Administration Guide.
Examples
- To show the health status of the current node, issue this command:
mmhealth node show
The system displays output similar to this:
Node name:      test_node
Node status:    HEALTHY
Status Change:  39 min. ago

Component      Status        Status Change     Reasons
-------------------------------------------------------------------
GPFS           HEALTHY       39 min. ago       -
NETWORK        HEALTHY       40 min. ago       -
FILESYSTEM     HEALTHY       39 min. ago       -
DISK           HEALTHY       39 min. ago       -
CES            HEALTHY       39 min. ago       -
PERFMON        HEALTHY       40 min. ago       -
THRESHOLD      HEALTHY       40 min. ago       -
- To view the health status of a specific node, issue this command:
mmhealth node show -N test_node2
The system displays output similar to this:
Node name:      test_node2
Node status:    CHECKING
Status Change:  Now

Component      Status        Status Change     Reasons
-------------------------------------------------------------------
GPFS           CHECKING      Now               -
NETWORK        HEALTHY       Now               -
FILESYSTEM     CHECKING      Now               -
DISK           CHECKING      Now               -
CES            CHECKING      Now               -
PERFMON        HEALTHY       Now               -
- To view the health status of all the nodes, issue this command:
mmhealth node show -N all
The system displays output similar to this:
Node name:      test_node
Node status:    DEGRADED

Component      Status        Status Change     Reasons
-------------------------------------------------------------
GPFS           HEALTHY       Now               -
CES            FAILED        Now               smbd_down
FileSystem     HEALTHY       Now               -

Node name:      test_node2
Node status:    HEALTHY

Component      Status        Status Change     Reasons
------------------------------------------------------------
GPFS           HEALTHY       Now               -
CES            HEALTHY       Now               -
FileSystem     HEALTHY       Now               -
- To view the detailed health status of a component and its sub-components, issue this command:
mmhealth node show ces
The system displays output similar to this:
Node name:      test_node

Component      Status        Status Change     Reasons
-------------------------------------------------------------------
CES            HEALTHY       2 min. ago        -
  AUTH         DISABLED      2 min. ago        -
  AUTH_OBJ     DISABLED      2 min. ago        -
  BLOCK        DISABLED      2 min. ago        -
  CESNETWORK   HEALTHY       2 min. ago        -
  NFS          HEALTHY       2 min. ago        -
  OBJECT       DISABLED      2 min. ago        -
  SMB          HEALTHY       2 min. ago        -
- To view the health status of only unhealthy components, issue this command:
mmhealth node show --unhealthy
The system displays output similar to this:
Node name:      test_node
Node status:    FAILED
Status Change:  1 min. ago

Component      Status        Status Change     Reasons
-------------------------------------------------------------------
GPFS           FAILED        1 min. ago        gpfs_down, quorum_down
FILESYSTEM     DEPEND        1 min. ago        unmounted_fs_check
CES            DEPEND        1 min. ago        ces_network_ips_down, nfs_in_grace
- To view the health status of the sub-components of a node's components, issue this command:
mmhealth node show --verbose
The system displays output similar to this:
Node name:      gssio1-hs.gpfs.net
Node status:    HEALTHY

Component                                Status        Reasons
-------------------------------------------------------------------
GPFS                                     DEGRADED      -
NETWORK                                  HEALTHY       -
  bond0                                  HEALTHY       -
  ib0                                    HEALTHY       -
  ib1                                    HEALTHY       -
FILESYSTEM                               DEGRADED      stale_mount, stale_mount, stale_mount
  Basic1                                 FAILED        stale_mount
  Basic2                                 FAILED        stale_mount
  Custom1                                HEALTHY       -
  gpfs0                                  FAILED        stale_mount
  gpfs1                                  FAILED        stale_mount
DISK                                     DEGRADED      disk_down
  rg_gssio1_hs_Basic1_data_0             HEALTHY       -
  rg_gssio1_hs_Basic1_system_0           HEALTHY       -
  rg_gssio1_hs_Basic2_data_0             HEALTHY       -
  rg_gssio1_hs_Basic2_system_0           HEALTHY       -
  rg_gssio1_hs_Custom1_data1_0           HEALTHY       -
  rg_gssio1_hs_Custom1_system_0          DEGRADED      disk_down
  rg_gssio1_hs_Data_8M_2p_1_gpfs0        HEALTHY       -
  rg_gssio1_hs_Data_8M_3p_1_gpfs1        HEALTHY       -
  rg_gssio1_hs_MetaData_1M_3W_1_gpfs0    HEALTHY       -
  rg_gssio1_hs_MetaData_1M_4W_1_gpfs1    HEALTHY       -
  rg_gssio2_hs_Basic1_data_0             HEALTHY       -
  rg_gssio2_hs_Basic1_system_0           HEALTHY       -
  rg_gssio2_hs_Basic2_data_0             HEALTHY       -
  rg_gssio2_hs_Basic2_system_0           HEALTHY       -
  rg_gssio2_hs_Custom1_data1_0           HEALTHY       -
  rg_gssio2_hs_Custom1_system_0          HEALTHY       -
  rg_gssio2_hs_Data_8M_2p_1_gpfs0        HEALTHY       -
  rg_gssio2_hs_Data_8M_3p_1_gpfs1        HEALTHY       -
  rg_gssio2_hs_MetaData_1M_3W_1_gpfs0    HEALTHY       -
  rg_gssio2_hs_MetaData_1M_4W_1_gpfs1    HEALTHY       -
NATIVE_RAID                              DEGRADED      gnr_pdisk_replaceable, gnr_rg_failed, enclosure_needsservice
  ARRAY                                  DEGRADED      -
    rg_gssio2-hs/DA1                     HEALTHY       -
    rg_gssio2-hs/DA2                     HEALTHY       -
    rg_gssio2-hs/NVR                     HEALTHY       -
    rg_gssio2-hs/SSD                     HEALTHY       -
  ENCLOSURE                              DEGRADED      enclosure_needsservice
    SV52122944                           DEGRADED      enclosure_needsservice
    SV53058375                           HEALTHY       -
  PHYSICALDISK                           DEGRADED      gnr_pdisk_replaceable
    rg_gssio2-hs/e1d1s01                 FAILED        gnr_pdisk_replaceable
    rg_gssio2-hs/e1d1s07                 HEALTHY       -
    rg_gssio2-hs/e1d1s08                 HEALTHY       -
    rg_gssio2-hs/e1d1s09                 HEALTHY       -
    rg_gssio2-hs/e1d1s10                 HEALTHY       -
    rg_gssio2-hs/e1d1s11                 HEALTHY       -
    rg_gssio2-hs/e1d1s12                 HEALTHY       -
    rg_gssio2-hs/e1d2s07                 HEALTHY       -
    rg_gssio2-hs/e1d2s08                 HEALTHY       -
    rg_gssio2-hs/e1d2s09                 HEALTHY       -
    rg_gssio2-hs/e1d2s10                 HEALTHY       -
    rg_gssio2-hs/e1d2s11                 HEALTHY       -
    rg_gssio2-hs/e1d2s12                 HEALTHY       -
    rg_gssio2-hs/e1d3s07                 HEALTHY       -
    rg_gssio2-hs/e1d3s08                 HEALTHY       -
    rg_gssio2-hs/e1d3s09                 HEALTHY       -
    rg_gssio2-hs/e1d3s10                 HEALTHY       -
    rg_gssio2-hs/e1d3s11                 HEALTHY       -
    rg_gssio2-hs/e1d3s12                 HEALTHY       -
    rg_gssio2-hs/e1d4s07                 HEALTHY       -
    rg_gssio2-hs/e1d4s08                 HEALTHY       -
    rg_gssio2-hs/e1d4s09                 HEALTHY       -
    rg_gssio2-hs/e1d4s10                 HEALTHY       -
    rg_gssio2-hs/e1d4s11                 HEALTHY       -
    rg_gssio2-hs/e1d4s12                 HEALTHY       -
    rg_gssio2-hs/e1d5s07                 HEALTHY       -
    rg_gssio2-hs/e1d5s08                 HEALTHY       -
    rg_gssio2-hs/e1d5s09                 HEALTHY       -
    rg_gssio2-hs/e1d5s10                 HEALTHY       -
    rg_gssio2-hs/e1d5s11                 HEALTHY       -
    rg_gssio2-hs/e2d1s07                 HEALTHY       -
    rg_gssio2-hs/e2d1s08                 HEALTHY       -
    rg_gssio2-hs/e2d1s09                 HEALTHY       -
    rg_gssio2-hs/e2d1s10                 HEALTHY       -
    rg_gssio2-hs/e2d1s11                 HEALTHY       -
    rg_gssio2-hs/e2d1s12                 HEALTHY       -
    rg_gssio2-hs/e2d2s07                 HEALTHY       -
    rg_gssio2-hs/e2d2s08                 HEALTHY       -
    rg_gssio2-hs/e2d2s09                 HEALTHY       -
    rg_gssio2-hs/e2d2s10                 HEALTHY       -
    rg_gssio2-hs/e2d2s11                 HEALTHY       -
    rg_gssio2-hs/e2d2s12                 HEALTHY       -
    rg_gssio2-hs/e2d3s07                 HEALTHY       -
    rg_gssio2-hs/e2d3s08                 HEALTHY       -
    rg_gssio2-hs/e2d3s09                 HEALTHY       -
    rg_gssio2-hs/e2d3s10                 HEALTHY       -
    rg_gssio2-hs/e2d3s11                 HEALTHY       -
    rg_gssio2-hs/e2d3s12                 HEALTHY       -
    rg_gssio2-hs/e2d4s07                 HEALTHY       -
    rg_gssio2-hs/e2d4s08                 HEALTHY       -
    rg_gssio2-hs/e2d4s09                 HEALTHY       -
    rg_gssio2-hs/e2d4s10                 HEALTHY       -
    rg_gssio2-hs/e2d4s11                 HEALTHY       -
    rg_gssio2-hs/e2d4s12                 HEALTHY       -
    rg_gssio2-hs/e2d5s07                 HEALTHY       -
    rg_gssio2-hs/e2d5s08                 HEALTHY       -
    rg_gssio2-hs/e2d5s09                 HEALTHY       -
    rg_gssio2-hs/e2d5s10                 HEALTHY       -
    rg_gssio2-hs/e2d5s11                 HEALTHY       -
    rg_gssio2-hs/e2d5s12ssd              HEALTHY       -
    rg_gssio2-hs/n1s02                   HEALTHY       -
    rg_gssio2-hs/n2s02                   HEALTHY       -
  RECOVERYGROUP                          DEGRADED      gnr_rg_failed
    rg_gssio1-hs                         FAILED        gnr_rg_failed
    rg_gssio2-hs                         HEALTHY       -
  VIRTUALDISK                            DEGRADED      -
    rg_gssio2_hs_Basic1_data_0           HEALTHY       -
    rg_gssio2_hs_Basic1_system_0         HEALTHY       -
    rg_gssio2_hs_Basic2_data_0           HEALTHY       -
    rg_gssio2_hs_Basic2_system_0         HEALTHY       -
    rg_gssio2_hs_Custom1_data1_0         HEALTHY       -
    rg_gssio2_hs_Custom1_system_0        HEALTHY       -
    rg_gssio2_hs_Data_8M_2p_1_gpfs0      HEALTHY       -
    rg_gssio2_hs_Data_8M_3p_1_gpfs1      HEALTHY       -
    rg_gssio2_hs_MetaData_1M_3W_1_gpfs0  HEALTHY       -
    rg_gssio2_hs_MetaData_1M_4W_1_gpfs1  HEALTHY       -
    rg_gssio2_hs_loghome                 HEALTHY       -
    rg_gssio2_hs_logtip                  HEALTHY       -
    rg_gssio2_hs_logtipbackup            HEALTHY       -
PERFMON                                  HEALTHY       -
- To view the eventlog history of the node for the last hour, issue this command:
mmhealth node eventlog --hour
The system displays output similar to this:
Node name:      test-21.localnet.com

Timestamp                          Event Name             Severity   Details
2016-10-28 06:59:34.045980 CEST    monitor_started        INFO       The IBM Spectrum Scale monitoring service has been started
2016-10-28 07:01:21.919943 CEST    fs_remount_mount       INFO       The filesystem objfs was mounted internal
2016-10-28 07:01:32.434703 CEST    disk_found             INFO       The disk disk1 was found
2016-10-28 07:01:32.669125 CEST    disk_found             INFO       The disk disk8 was found
2016-10-28 07:01:36.975902 CEST    filesystem_found       INFO       Filesystem objfs was found
2016-10-28 07:01:37.226157 CEST    unmounted_fs_check     WARNING    The filesystem objfs is probably needed, but not mounted
2016-10-28 07:01:52.113691 CEST    mounted_fs_check       INFO       The filesystem objfs is mounted
2016-10-28 07:01:52.283545 CEST    fs_remount_mount       INFO       The filesystem objfs was mounted normal
2016-10-28 07:02:07.026093 CEST    mounted_fs_check       INFO       The filesystem objfs is mounted
2016-10-28 07:14:58.498854 CEST    ces_network_ips_down   WARNING    No CES relevant NICs detected
2016-10-28 07:15:07.702351 CEST    nodestatechange_info   INFO       A CES node state change: Node 1 add startup flag
2016-10-28 07:15:37.322997 CEST    nodestatechange_info   INFO       A CES node state change: Node 1 remove startup flag
2016-10-28 07:15:43.741149 CEST    ces_network_ips_up     INFO       CES-relevant IPs are served by found NICs
2016-10-28 07:15:44.028031 CEST    ces_network_vanished   INFO       CES NIC eth0 has vanished
- To view the eventlog history of the node for the last hour with additional details, such as the component name and event ID, issue this command:
mmhealth node eventlog --hour --verbose
The system displays output similar to this:
Node name:      test-21.localnet.com

Timestamp                          Component    Event Name             Event ID   Severity   Details
2016-10-28 06:59:34.045980 CEST    gpfs         monitor_started        999726     INFO       The IBM Spectrum Scale monitoring service has been started
2016-10-28 07:01:21.919943 CEST    filesystem   fs_remount_mount       999306     INFO       The filesystem objfs was mounted internal
2016-10-28 07:01:32.434703 CEST    disk         disk_found             999424     INFO       The disk disk1 was found
2016-10-28 07:01:32.669125 CEST    disk         disk_found             999424     INFO       The disk disk8 was found
2016-10-28 07:01:36.975902 CEST    filesystem   filesystem_found       999299     INFO       Filesystem objfs was found
2016-10-28 07:01:37.226157 CEST    filesystem   unmounted_fs_check     999298     WARNING    The filesystem objfs is probably needed, but not mounted
2016-10-28 07:01:52.113691 CEST    filesystem   mounted_fs_check       999301     INFO       The filesystem objfs is mounted
2016-10-28 07:01:52.283545 CEST    filesystem   fs_remount_mount       999306     INFO       The filesystem objfs was mounted normal
2016-10-28 07:02:07.026093 CEST    filesystem   mounted_fs_check       999301     INFO       The filesystem objfs is mounted
2016-10-28 07:14:58.498854 CEST    cesnetwork   ces_network_ips_down   999426     WARNING    No CES relevant NICs detected
2016-10-28 07:15:07.702351 CEST    gpfs         nodestatechange_info   999220     INFO       A CES node state change: Node 1 add startup flag
2016-10-28 07:15:37.322997 CEST    gpfs         nodestatechange_info   999220     INFO       A CES node state change: Node 1 remove startup flag
2016-10-28 07:15:43.741149 CEST    cesnetwork   ces_network_ips_up     999427     INFO       CES-relevant IPs are served by found NICs
2016-10-28 07:15:44.028031 CEST    cesnetwork   ces_network_vanished   999434     INFO       CES NIC eth0 has vanished
- To view the detailed description of an event, issue the mmhealth event show command. This is an example for the quorum_down event:
mmhealth event show quorum_down
The system displays output similar to this:
Event Name:    quorum_down
Event ID:      999289
Description:   Reasons could be network or hardware issues, or a shutdown of the cluster service.
               The event does not necessarily indicate an issue with the cluster quorum state.
Cause:         The local node does not have quorum. The cluster service might not be running.
User Action:   Check if the cluster quorum nodes are running and can be reached over the network.
               Check local firewall settings.
Severity:      ERROR
State:         DEGRADED
- To view the list of hidden events, issue the mmhealth event list HIDDEN command:
mmhealth event list HIDDEN
The system displays output similar to this:
Event                   scope
--------------------------------------
gpfs_pagepool_small     -
nfsv4_acl_type_wrong    fs1
nfsv4_acl_type_wrong    fs2
- To view the detailed health status of the cluster, issue the mmhealth cluster show command:
mmhealth cluster show
The system displays output similar to this:
Component        Total    Failed   Degraded   Healthy   Other
-----------------------------------------------------------------
NODE             50       1        1          48        -
GPFS             50       1        -          49        -
NETWORK          50       -        -          50        -
FILESYSTEM       3        -        -          3         -
DISK             50       -        -          50        -
CES              5        -        5          -         -
CLOUDGATEWAY     2        -        -          2         -
PERFMON          48       -        5          43        -
THRESHOLD        4        -        -          4         -
Note: The cluster must have a minimum release level of 4.2.2.0 or later to use the mmhealth cluster show command. Also, this command is not supported on the Windows operating system.
- To view more information about the cluster health status, issue this command:
mmhealth cluster show --verbose
The system displays output similar to this:
Component        Total    Failed   Degraded   Healthy   Other
-----------------------------------------------------------------
NODE             50       1        1          48        -
GPFS             50       1        -          49        -
NETWORK          50       -        -          50        -
FILESYSTEM
  FS1            15       -        -          15        -
  FS2            5        -        -          5         -
  FS3            20       -        -          20        -
DISK             50       -        -          50        -
CES              5        -        5          -         -
  AUTH           5        -        -          -         5
  AUTH_OBJ       5        5        -          -         -
  BLOCK          5        -        -          -         5
  CESNETWORK     5        -        -          5         -
  NFS            5        -        -          5         -
  OBJECT         5        -        -          5         -
  SMB            5        -        -          5         -
CLOUDGATEWAY     2        -        -          2         -
PERFMON          48       -        5          43        -
THRESHOLD        4        -        -          4         -
- To create a new threshold rule, issue this command:
mmhealth thresholds add MetaDataPool_capUtil --errorlevel 90 --direction high --groupby gpfs_fs_name,gpfs_diskpool_name
The system displays output similar to this:
New rule 'MetaDataPool_capUtil_custom' is created. The monitor process is activated
- To view the list of threshold rules defined for the system, issue this command:
mmhealth thresholds list
The system displays output similar to this:
### Threshold Rules ###
rule_name                    metric                error   warn     direction   filterBy   groupBy                                             sensitivity
---------------------------------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule            Fileset_inode         90.0    80.0     high                   gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name       300
MetaDataPool_capUtil_custom  MetaDataPool_capUtil  90      None     high                   gpfs_fs_name,gpfs_diskpool_name                     300
DataCapUtil_Rule             DataPool_capUtil      90.0    80.0     high                   gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name   300
MemFree_Rule                 mem_memfree           50000   100000   low                    node                                                300
MetaDataCapUtil_Rule         MetaDataPool_capUtil  90.0    80.0     high                   gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name   300
- To view the detailed health status of the filesystem component, issue this command:
mmhealth node show filesystem -v
The system displays output similar to this:
Node name:      gpfsgui-12.novalocal

Component      Status      Status Change          Reasons
-------------------------------------------------------------------------------
FILESYSTEM     DEGRADED    2016-09-29 15:22:48    pool-data_high_error
  fs1          FAILED      2016-09-29 15:22:48    pool-data_high_error
  fs2          HEALTHY     2016-09-29 15:22:33    -
  objfs        HEALTHY     2016-09-29 15:22:33    -

Event                  Parameter   Severity   Active Since          Event Message
-------------------------------------------------------------------------------------------------------------------------------------------------------------
pool-data_high_error   fs1         ERROR      2016-09-29 15:22:47   The pool myPool of file system fs1 reached a nearly exhausted data level. 90.0
inode_normal           fs1         INFO       2016-09-29 15:22:47   The inode usage of fileset root in file system fs1 reached a normal level.
inode_normal           fs2         INFO       2016-09-29 15:22:47   The inode usage of fileset root in file system fs2 reached a normal level.
inode_normal           objfs       INFO       2016-09-29 15:22:47   The inode usage of fileset root in file system objfs reached a normal level.
inode_normal           objfs       INFO       2016-09-29 15:22:47   The inode usage of fileset Object_Fileset in file system objfs reached a normal level.
mounted_fs_check       fs1         INFO       2016-09-29 15:22:33   The filesystem fs1 is mounted
mounted_fs_check       fs2         INFO       2016-09-29 15:22:33   The filesystem fs2 is mounted
mounted_fs_check       objfs       INFO       2016-09-29 15:22:33   The filesystem objfs is mounted
pool-data_normal       fs1         INFO       2016-09-29 15:22:47   The pool system of file system fs1 reached a normal data level.
pool-data_normal       fs2         INFO       2016-09-29 15:22:47   The pool system of file system fs2 reached a normal data level.
pool-data_normal       objfs       INFO       2016-09-29 15:22:47   The pool data of file system objfs reached a normal data level.
pool-data_normal       objfs       INFO       2016-09-29 15:22:47   The pool system of file system objfs reached a normal data level.
pool-metadata_normal   fs1         INFO       2016-09-29 15:22:47   The pool system of file system fs1 reached a normal metadata level.
pool-metadata_normal   fs1         INFO       2016-09-29 15:22:47   The pool myPool of file system fs1 reached a normal metadata level.
pool-metadata_normal   fs2         INFO       2016-09-29 15:22:47   The pool system of file system fs2 reached a normal metadata level.
pool-metadata_normal   objfs       INFO       2016-09-29 15:22:47   The pool system of file system objfs reached a normal metadata level.
pool-metadata_normal   objfs       INFO       2016-09-29 15:22:47   The pool data of file system objfs reached a normal metadata level.
- To check the monitoring interval, issue the following command:
mmhealth config interval
The system displays output similar to this:
Monitor interval is DEFAULT.
- To set the monitoring interval to low, issue the following command:
mmhealth config interval LOW
The system displays output similar to this:
Monitor interval changed to LOW.
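- To hide a TIP event and reveal it again, you might issue commands like the following (a sketch; the event name nfsv4_acl_type_wrong and the entity fs1 are taken from the hidden-events example above):
mmhealth event hide nfsv4_acl_type_wrong fs1
mmhealth event unhide nfsv4_acl_type_wrong fs1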