mmhealth command
Monitors health status of nodes.
Synopsis
mmhealth node show [ GPFS | NETWORK [ UserDefinedSubComponent ]
      | FILESYSTEM [ UserDefinedSubComponent ] | DISK [ UserDefinedSubComponent ]
      | CES | AUTH | AUTH_OBJ | BLOCK | CESNETWORK | NFS | OBJECT | SMB
      | HADOOP | CLOUDGATEWAY | GUI | PERFMON | THRESHOLD
      | AFM [ UserDefinedSubComponent ] ]
      [-N {Node[,Node...] | NodeFile | NodeClass}]
      [-Y] [--verbose] [--unhealthy]
or
mmhealth node eventlog [[--hour | --day | --week | --month] | [--clear] | [--verbose]]
      [-N {Node[,Node...] | NodeFile | NodeClass}]
[-Y]
or
mmhealth event show [ EventName | EventID ] [-N {Node[,Node...] | NodeFile | NodeClass}]
or
mmhealth event hide [ EventName [Entity_Name]]
or
mmhealth event unhide [ EventName [Entity_Name]]
or
mmhealth event list HIDDEN
or
mmhealth cluster show [ NODE | GPFS | NETWORK [ UserDefinedSubComponent ]
      | FILESYSTEM [ UserDefinedSubComponent ] | DISK [ UserDefinedSubComponent ]
      | CES | AUTH | AUTH_OBJ | BLOCK | CESNETWORK | NFS | OBJECT | SMB
      | HADOOP | CLOUDGATEWAY | GUI | PERFMON | THRESHOLD
      | AFM [ UserDefinedSubComponent ] ]
      [-Y] [--verbose]
or
mmhealth thresholds list [--verbose]
or
mmhealth thresholds add { metric [:sum | avg | min | max | rate] | measurement }
      [--errorlevel {threshold error limit}] [--warnlevel {threshold warn limit}]
      [--direction {high | low}]
      [--sensitivity {bucketsize}] [--hysteresis {percentage}]
      [--filterby] [--groupby] [--name {ruleName}]
      [--errormsg {user defined action description}]
      [--warnmsg {user defined action description}]
or
mmhealth thresholds delete { ruleName | all }
or
mmhealth config interval [OFF | LOW | MEDIUM | DEFAULT | HIGH]
Availability
Available on all IBM Spectrum Scale™ editions.
Description
Use the mmhealth command to monitor the health of a node and of the services that are hosted on that node in IBM Spectrum Scale.
The mmhealth command shows the events that are responsible for the unhealthy status of the services that are hosted on a node. You can use this data to analyze why a node is unhealthy. In this way, the mmhealth command acts as a problem determination tool that identifies which services on a node are unhealthy and which events are responsible for that state.
The mmhealth command also monitors the state of all IBM Spectrum Scale RAID components, such as arrays, pdisks, vdisks, and enclosures, for the nodes that belong to a recovery group.
For more information about the system monitoring feature, see the Monitoring system health by using the mmhealth command section in the IBM Spectrum Scale: Problem Determination Guide.
The mmhealth command also shows the details of threshold rules, which help to avoid out-of-space errors for file systems. The space availability of the FILESYSTEM component depends on the occupancy level of the fileset inode spaces and on the capacity usage in each data or metadata pool. The violation of any single rule triggers the parent file system's capacity-issue events. The internal monitoring process frequently compares the capacity metrics with the rule boundaries. If any metric value exceeds its threshold limit, the system health daemon receives an event notification from the monitoring process and raises a RAS event for the file system's space issue. For the predefined capacity utilization rules, the warn level is set to 80% and the error level to 90%. For the memory utilization rule, the warn level is set to 100 MB and the error level to 50 MB. You can use the mmlsfileset and mmlspool commands to track the inode and pool space usage.
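For example, to review the rule boundaries that currently apply and to track the underlying usage, you can combine the following commands (a sketch; gpfs0 is a hypothetical file system name):
mmhealth thresholds list      # list each rule with its warn and error levels
mmlsfileset gpfs0 -i          # track per-fileset inode usage
mmlspool gpfs0 all -L         # track data and metadata pool usage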
Parameters
- node
- Displays the health status at the node level.
- show
- Displays the health status of the node or of the specified component:
- GPFS™ | NETWORK | FILESYSTEM | DISK | CES | AUTH | AUTH_OBJ | BLOCK | CESNETWORK | NFS | OBJECT | SMB | HADOOP | CLOUDGATEWAY | GUI | PERFMON | THRESHOLD | AFM
- Displays the detailed health status of the specified component.
- UserDefinedSubComponent
- Displays services that are named by the customer, categorized under one of the other hosted services. For example, a file system named gpfs0 is a subcomponent of the FILESYSTEM component.
- -N
- Allows the system to make remote calls to other nodes in the cluster to retrieve their health status:
- Node[,Node....]
- Specifies the node or list of nodes that must be monitored for the health status.
- NodeFile
- Specifies a file, containing a list of node descriptors, one per line, to be monitored for health status.
- NodeClass
- Specifies a node class that must be monitored for the health status.
- -Y
- Displays the command output in a parseable format with a colon (:) as a field delimiter. Each column is described by a header. Note: Fields that contain a colon (:) are encoded to prevent confusion. For the set of characters that might be encoded, see the command documentation of mmclidecode. Use the mmclidecode command to decode the field.
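For example (an illustrative sketch), the parseable output can be filtered with standard tools to show only the rows that report a degraded state:
mmhealth node show -Y | grep -i degraded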
- --verbose
- Shows the detailed health status of a node, including its sub-components.
- --unhealthy
- Displays the unhealthy components only.
- eventlog
- Shows the event history for a specified period of time. If no time period is specified, all events are displayed by default:
- [--hour | --day | --week | --month]
- Displays the event history for the specified time period.
- [--clear]
- Clears the event log's database. This action cannot be reversed. CAUTION: The events database is used by mmhealth node eventlog as well as by mmces events list. Clearing the database therefore also affects the output of mmces events list. Use the --clear option with caution.
- -Y
- Displays the command output in a parseable format with a colon (:) as a field delimiter. Each column is described by a header. Note: Fields that contain a colon (:) are encoded to prevent confusion. For the set of characters that might be encoded, see the command documentation of mmclidecode. Use the mmclidecode command to decode the field.
- [--verbose]
- Displays additional information about the event, such as the component name and event ID, in the event log.
- event
- Gives the details of various events:
- show
- Shows the detailed description of the specified event:
- EventName
- Displays the detailed description of the specified event name.
- EventID
- Displays the detailed description of the specified event ID.
- hide
- Hides the specified TIP events.
- unhide
- Reveals the TIP events that were previously hidden by using the hide option.
- list HIDDEN
- Shows all the TIP events that are added to the list of hidden events.
- cluster
- Displays the health status of all nodes and monitored node components in the cluster.
- show
- Displays the health status of the specified component for all nodes in the cluster:
- NODE | GPFS | NETWORK | FILESYSTEM | DISK | CES | AUTH | AUTH_OBJ | BLOCK | CESNETWORK | NFS | OBJECT | SMB | HADOOP | CLOUDGATEWAY | GUI | PERFMON | THRESHOLD | AFM
- Displays the detailed health status of the specified component.
- -Y
- Displays the command output in a parseable format with a colon (:) as a field delimiter. Each column is described by a header. Note: Fields that contain a colon (:) are encoded to prevent confusion. For the set of characters that might be encoded, see the command documentation of mmclidecode. Use the mmclidecode command to decode the field.
- --verbose
- Shows the detailed health status of the cluster, including sub-components.
- thresholds list
- Displays the list of the threshold rules defined for the system.
- thresholds add
- Creates a new threshold rule for the specified metric or measurement, and activates the monitoring process for this rule. Note: A measurement is a value that is calculated from more than one metric by using a predefined formula.
- metric [:SUM | AVG | MIN | MAX | RATE]
- Creates a threshold for the specified metric. Any metric that is supported by the performance monitoring tool can be used, either as a raw value or downsampled by one of the aggregators (sum, avg, min, max, rate). For a list of metrics supported by the performance monitoring tool, see List of performance metrics.
- measurement
- Creates a threshold for the specified measurement. The following measurements are supported:
- Fileset_inode
- Fileset inode capacity utilization. Calculated as:
(sum(gpfs_fset_allocInodes)-sum(gpfs_fset_freeInodes))/sum(gpfs_fset_maxInodes)
- DataPool_capUtil
- Data pool capacity utilization. Calculated as:
(sum(gpfs_pool_total_dataKB)-sum(gpfs_pool_free_dataKB))/sum(gpfs_pool_total_dataKB)
- MetaDataPool_capUtil
- Metadata pool capacity utilization. Calculated as:
(sum(gpfs_pool_total_metaKB)-sum(gpfs_pool_free_metaKB))/sum(gpfs_pool_total_metaKB)
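For example (illustrative numbers), a fileset with 1000000 allocated inodes, 400000 free inodes, and a maximum of 2000000 inodes has an inode utilization of (1000000 - 400000) / 2000000 = 0.30, that is 30%, which is below the predefined warn level of 80%.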
- --errorlevel
- Defines the threshold error limit. The threshold error limit can be a percentage or an integer, depending on the metric on which the threshold value is being set.
- --warnlevel
- Defines the threshold warn limit. The threshold warn limit can be a percentage or an integer, depending on the metric on which the threshold value is being set.
- --direction
- Defines the direction for the threshold limit. The allowed values are high or low.
- --groupby
- Groups the result based on the group key. The following values are allowed for the group key:
- gpfs_cluster_name
- gpfs_disk_name
- gpfs_diskpool_name
- gpfs_disk_usage_name
- gpfs_fset_name
- gpfs_fs_name
- mountPoint
- netdev_name
- node
- --filterby
- Filters the result based on the filter key. The following values are allowed for the filter key:
- gpfs_cluster_name
- gpfs_disk_name
- gpfs_diskpool_name
- gpfs_disk_usage_name
- gpfs_fset_name
- gpfs_fs_name
- mountPoint
- netdev_name
- node
- --sensitivity
- Defines the sample interval value in seconds. It is set to 300 by default. If a sensor is configured with an interval period greater than 300 seconds, the --sensitivity value is set to the sensor's period. The minimum value allowed is 120 seconds. If a sensor is configured with an interval period less than 120 seconds, the --sensitivity value is set to 120 seconds.
- --hysteresis
- Defines the percentage by which the observed value must fall below (or rise above) the current threshold level to switch back to the previous state. The default value is 0.0; the recommended value is 5.0.
- --name
- Defines the name of the rule. It can be an alphanumeric string of up to 30 characters. If no rule name is specified, a default name is set: the metric name followed by an underscore and the suffix "custom" (for example, MetaDataPool_capUtil_custom).
- --errormsg
- Specifies a user-defined description that is included in the error-level event. The message can be up to 256 bytes long and must be enclosed in double quotation marks (""); otherwise, the system returns an error.
- --warnmsg
- Specifies a user-defined description that is included in the warning-level event. The message can be up to 256 bytes long and must be enclosed in double quotation marks (""); otherwise, the system returns an error.
Important: The mathematical aggregations AVG, SUM, MAX, MIN, and RATE can be used to determine how the metric values are merged in the evaluation source. Aggregation operations are not supported for measurements.
- For each rule, you can configure up to two conditions that trigger an event state change: --errorlevel and --warnlevel. At least one level limit setting is required. For example, the thresholds add command must include one of the following option combinations:
mmhealth thresholds add { metric[:sum|avg|min|max|rate] | measurement } --errorlevel {threshold error limit} --direction {high|low}
mmhealth thresholds add { metric[:sum|avg|min|max|rate] | measurement } --errorlevel {threshold error limit} --warnlevel {threshold warn limit}
mmhealth thresholds add { metric[:sum|avg|min|max|rate] | measurement } --errorlevel {threshold error limit} --warnlevel {threshold warn limit} --direction {high|low}
You can also influence the measurement quantity and precision by specifying the --sensitivity, --groupby, --filterby, --hysteresis, or --name option.
For each condition level, you can provide a message text by using the --errormsg or --warnmsg option. The text is integrated into the state change event notification that is triggered when the condition is exceeded.
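For example, a complete rule that sets both levels, a direction, a hysteresis, a rule name, and both messages might look like this (a sketch; the rule name and message texts are illustrative, and the levels mirror the predefined memory rule):
mmhealth thresholds add mem_memfree --errorlevel 50000 --warnlevel 100000 --direction low --hysteresis 5.0 --name MemFree_custom --errormsg "Free memory is critically low" --warnmsg "Free memory is running low"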
- thresholds delete
- Deletes the threshold rules from the system.
- ruleName
- Deletes a specific threshold rule.
- all
- Deletes all the threshold rules.
Note: Using the mmhealth thresholds delete command to delete a rule accomplishes the following tasks:
- The rule is removed from the thresholds rules specification file and from the active monitoring process.
- All current health information that was created by this particular rule is removed as well.
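For example (a sketch that reuses the rule name created in the Examples section below):
mmhealth thresholds delete MetaDataPool_capUtil_custom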
- config interval
- Sets the monitoring interval for the whole cluster.
- off
- Monitoring is turned off for the whole cluster.
- low
- Monitoring runs every (default monitoring time * 10) seconds.
- medium
- Monitoring runs every (default monitoring time * 5) seconds.
- default
- Monitoring runs every 15 - 30 seconds, depending on the service being monitored.
- high
- Monitoring runs every (default monitoring time / 2) seconds.
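For example (a sketch), to reduce the monitoring overhead on a busy cluster by multiplying the default monitoring interval by 10:
mmhealth config interval LOW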
Exit status
- 0
- Successful completion.
- nonzero
- A failure has occurred.
Security
You must have root authority to run the mmhealth command.
The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. See the information about the requirements for administering a GPFS system in the IBM Spectrum Scale: Administration Guide.
Examples
- To show the health status of the current node, issue this command:
mmhealth node show
The system displays output similar to this:
Node name:      test_node
Node status:    HEALTHY
Status Change:  39 min. ago

Component      Status        Status Change     Reasons
-------------------------------------------------------------------
GPFS           HEALTHY       39 min. ago       -
NETWORK        HEALTHY       40 min. ago       -
FILESYSTEM     HEALTHY       39 min. ago       -
DISK           HEALTHY       39 min. ago       -
CES            HEALTHY       39 min. ago       -
PERFMON        HEALTHY       40 min. ago       -
THRESHOLD      HEALTHY       40 min. ago       -
- To view the health status of a specific node, issue this command:
mmhealth node show -N test_node2
The system displays output similar to this:
Node name:      test_node2
Node status:    CHECKING
Status Change:  Now

Component      Status        Status Change     Reasons
-------------------------------------------------------------------
GPFS           CHECKING      Now               -
NETWORK        HEALTHY       Now               -
FILESYSTEM     CHECKING      Now               -
DISK           CHECKING      Now               -
CES            CHECKING      Now               -
PERFMON        HEALTHY       Now               -
- To view the health status of all the nodes, issue this command:
mmhealth node show -N all
The system displays output similar to this:
Node name:      test_node
Node status:    DEGRADED

Component      Status        Status Change     Reasons
-------------------------------------------------------------
GPFS           HEALTHY       Now               -
CES            FAILED        Now               smbd_down
FileSystem     HEALTHY       Now               -

Node name:      test_node2
Node status:    HEALTHY

Component      Status        Status Change     Reasons
------------------------------------------------------------
GPFS           HEALTHY       Now               -
CES            HEALTHY       Now               -
FileSystem     HEALTHY       Now               -
- To view the detailed health status of a component and its sub-components, issue this command:
mmhealth node show ces
The system displays output similar to this:
Node name:      test_node

Component      Status        Status Change     Reasons
-------------------------------------------------------------------
CES            HEALTHY       2 min. ago        -
  AUTH         DISABLED      2 min. ago        -
  AUTH_OBJ     DISABLED      2 min. ago        -
  BLOCK        DISABLED      2 min. ago        -
  CESNETWORK   HEALTHY       2 min. ago        -
  NFS          HEALTHY       2 min. ago        -
  OBJECT       DISABLED      2 min. ago        -
  SMB          HEALTHY       2 min. ago        -
- To view the health status of only unhealthy components, issue this command:
mmhealth node show --unhealthy
The system displays output similar to this:
Node name:      test_node
Node status:    FAILED
Status Change:  1 min. ago

Component      Status        Status Change     Reasons
-------------------------------------------------------------------
GPFS           FAILED        1 min. ago        gpfs_down, quorum_down
FILESYSTEM     DEPEND        1 min. ago        unmounted_fs_check
CES            DEPEND        1 min. ago        ces_network_ips_down, nfs_in_grace
- To view the health status of the sub-components of a node's components, issue this command:
mmhealth node show --verbose
The system displays output similar to this:
Node name:      gssio1-hs.gpfs.net
Node status:    HEALTHY

Component                                Status        Reasons
-------------------------------------------------------------------
GPFS                                     DEGRADED      -
NETWORK                                  HEALTHY       -
  bond0                                  HEALTHY       -
  ib0                                    HEALTHY       -
  ib1                                    HEALTHY       -
FILESYSTEM                               DEGRADED      stale_mount, stale_mount, stale_mount
  Basic1                                 FAILED        stale_mount
  Basic2                                 FAILED        stale_mount
  Custom1                                HEALTHY       -
  gpfs0                                  FAILED        stale_mount
  gpfs1                                  FAILED        stale_mount
DISK                                     DEGRADED      disk_down
  rg_gssio1_hs_Basic1_data_0             HEALTHY       -
  rg_gssio1_hs_Basic1_system_0           HEALTHY       -
  rg_gssio1_hs_Basic2_data_0             HEALTHY       -
  rg_gssio1_hs_Basic2_system_0           HEALTHY       -
  rg_gssio1_hs_Custom1_data1_0           HEALTHY       -
  rg_gssio1_hs_Custom1_system_0          DEGRADED      disk_down
  rg_gssio1_hs_Data_8M_2p_1_gpfs0        HEALTHY       -
  rg_gssio1_hs_Data_8M_3p_1_gpfs1        HEALTHY       -
  rg_gssio1_hs_MetaData_1M_3W_1_gpfs0    HEALTHY       -
  rg_gssio1_hs_MetaData_1M_4W_1_gpfs1    HEALTHY       -
  rg_gssio2_hs_Basic1_data_0             HEALTHY       -
  rg_gssio2_hs_Basic1_system_0           HEALTHY       -
  rg_gssio2_hs_Basic2_data_0             HEALTHY       -
  rg_gssio2_hs_Basic2_system_0           HEALTHY       -
  rg_gssio2_hs_Custom1_data1_0           HEALTHY       -
  rg_gssio2_hs_Custom1_system_0          HEALTHY       -
  rg_gssio2_hs_Data_8M_2p_1_gpfs0        HEALTHY       -
  rg_gssio2_hs_Data_8M_3p_1_gpfs1        HEALTHY       -
  rg_gssio2_hs_MetaData_1M_3W_1_gpfs0    HEALTHY       -
  rg_gssio2_hs_MetaData_1M_4W_1_gpfs1    HEALTHY       -
NATIVE_RAID                              DEGRADED      gnr_pdisk_replaceable, gnr_rg_failed, enclosure_needsservice
  ARRAY                                  DEGRADED      -
    rg_gssio2-hs/DA1                     HEALTHY       -
    rg_gssio2-hs/DA2                     HEALTHY       -
    rg_gssio2-hs/NVR                     HEALTHY       -
    rg_gssio2-hs/SSD                     HEALTHY       -
  ENCLOSURE                              DEGRADED      enclosure_needsservice
    SV52122944                           DEGRADED      enclosure_needsservice
    SV53058375                           HEALTHY       -
  PHYSICALDISK                           DEGRADED      gnr_pdisk_replaceable
    rg_gssio2-hs/e1d1s01                 FAILED        gnr_pdisk_replaceable
    rg_gssio2-hs/e1d1s07                 HEALTHY       -
    rg_gssio2-hs/e1d1s08                 HEALTHY       -
    rg_gssio2-hs/e1d1s09                 HEALTHY       -
    rg_gssio2-hs/e1d1s10                 HEALTHY       -
    rg_gssio2-hs/e1d1s11                 HEALTHY       -
    rg_gssio2-hs/e1d1s12                 HEALTHY       -
    rg_gssio2-hs/e1d2s07                 HEALTHY       -
    rg_gssio2-hs/e1d2s08                 HEALTHY       -
    rg_gssio2-hs/e1d2s09                 HEALTHY       -
    rg_gssio2-hs/e1d2s10                 HEALTHY       -
    rg_gssio2-hs/e1d2s11                 HEALTHY       -
    rg_gssio2-hs/e1d2s12                 HEALTHY       -
    rg_gssio2-hs/e1d3s07                 HEALTHY       -
    rg_gssio2-hs/e1d3s08                 HEALTHY       -
    rg_gssio2-hs/e1d3s09                 HEALTHY       -
    rg_gssio2-hs/e1d3s10                 HEALTHY       -
    rg_gssio2-hs/e1d3s11                 HEALTHY       -
    rg_gssio2-hs/e1d3s12                 HEALTHY       -
    rg_gssio2-hs/e1d4s07                 HEALTHY       -
    rg_gssio2-hs/e1d4s08                 HEALTHY       -
    rg_gssio2-hs/e1d4s09                 HEALTHY       -
    rg_gssio2-hs/e1d4s10                 HEALTHY       -
    rg_gssio2-hs/e1d4s11                 HEALTHY       -
    rg_gssio2-hs/e1d4s12                 HEALTHY       -
    rg_gssio2-hs/e1d5s07                 HEALTHY       -
    rg_gssio2-hs/e1d5s08                 HEALTHY       -
    rg_gssio2-hs/e1d5s09                 HEALTHY       -
    rg_gssio2-hs/e1d5s10                 HEALTHY       -
    rg_gssio2-hs/e1d5s11                 HEALTHY       -
    rg_gssio2-hs/e2d1s07                 HEALTHY       -
    rg_gssio2-hs/e2d1s08                 HEALTHY       -
    rg_gssio2-hs/e2d1s09                 HEALTHY       -
    rg_gssio2-hs/e2d1s10                 HEALTHY       -
    rg_gssio2-hs/e2d1s11                 HEALTHY       -
    rg_gssio2-hs/e2d1s12                 HEALTHY       -
    rg_gssio2-hs/e2d2s07                 HEALTHY       -
    rg_gssio2-hs/e2d2s08                 HEALTHY       -
    rg_gssio2-hs/e2d2s09                 HEALTHY       -
    rg_gssio2-hs/e2d2s10                 HEALTHY       -
    rg_gssio2-hs/e2d2s11                 HEALTHY       -
    rg_gssio2-hs/e2d2s12                 HEALTHY       -
    rg_gssio2-hs/e2d3s07                 HEALTHY       -
    rg_gssio2-hs/e2d3s08                 HEALTHY       -
    rg_gssio2-hs/e2d3s09                 HEALTHY       -
    rg_gssio2-hs/e2d3s10                 HEALTHY       -
    rg_gssio2-hs/e2d3s11                 HEALTHY       -
    rg_gssio2-hs/e2d3s12                 HEALTHY       -
    rg_gssio2-hs/e2d4s07                 HEALTHY       -
    rg_gssio2-hs/e2d4s08                 HEALTHY       -
    rg_gssio2-hs/e2d4s09                 HEALTHY       -
    rg_gssio2-hs/e2d4s10                 HEALTHY       -
    rg_gssio2-hs/e2d4s11                 HEALTHY       -
    rg_gssio2-hs/e2d4s12                 HEALTHY       -
    rg_gssio2-hs/e2d5s07                 HEALTHY       -
    rg_gssio2-hs/e2d5s08                 HEALTHY       -
    rg_gssio2-hs/e2d5s09                 HEALTHY       -
    rg_gssio2-hs/e2d5s10                 HEALTHY       -
    rg_gssio2-hs/e2d5s11                 HEALTHY       -
    rg_gssio2-hs/e2d5s12ssd              HEALTHY       -
    rg_gssio2-hs/n1s02                   HEALTHY       -
    rg_gssio2-hs/n2s02                   HEALTHY       -
  RECOVERYGROUP                          DEGRADED      gnr_rg_failed
    rg_gssio1-hs                         FAILED        gnr_rg_failed
    rg_gssio2-hs                         HEALTHY       -
  VIRTUALDISK                            DEGRADED      -
    rg_gssio2_hs_Basic1_data_0           HEALTHY       -
    rg_gssio2_hs_Basic1_system_0         HEALTHY       -
    rg_gssio2_hs_Basic2_data_0           HEALTHY       -
    rg_gssio2_hs_Basic2_system_0         HEALTHY       -
    rg_gssio2_hs_Custom1_data1_0         HEALTHY       -
    rg_gssio2_hs_Custom1_system_0        HEALTHY       -
    rg_gssio2_hs_Data_8M_2p_1_gpfs0      HEALTHY       -
    rg_gssio2_hs_Data_8M_3p_1_gpfs1      HEALTHY       -
    rg_gssio2_hs_MetaData_1M_3W_1_gpfs0  HEALTHY       -
    rg_gssio2_hs_MetaData_1M_4W_1_gpfs1  HEALTHY       -
    rg_gssio2_hs_loghome                 HEALTHY       -
    rg_gssio2_hs_logtip                  HEALTHY       -
    rg_gssio2_hs_logtipbackup            HEALTHY       -
PERFMON                                  HEALTHY       -
- To view the eventlog history of the node for the last hour, issue this command:
mmhealth node eventlog --hour
The system displays output similar to this:
Node name:      test-21.localnet.com

Timestamp                          Event Name             Severity   Details
2016-10-28 06:59:34.045980 CEST    monitor_started        INFO       The IBM Spectrum Scale monitoring service has been started
2016-10-28 07:01:21.919943 CEST    fs_remount_mount       INFO       The filesystem objfs was mounted internal
2016-10-28 07:01:32.434703 CEST    disk_found             INFO       The disk disk1 was found
2016-10-28 07:01:32.669125 CEST    disk_found             INFO       The disk disk8 was found
2016-10-28 07:01:36.975902 CEST    filesystem_found       INFO       Filesystem objfs was found
2016-10-28 07:01:37.226157 CEST    unmounted_fs_check     WARNING    The filesystem objfs is probably needed, but not mounted
2016-10-28 07:01:52.113691 CEST    mounted_fs_check       INFO       The filesystem objfs is mounted
2016-10-28 07:01:52.283545 CEST    fs_remount_mount       INFO       The filesystem objfs was mounted normal
2016-10-28 07:02:07.026093 CEST    mounted_fs_check       INFO       The filesystem objfs is mounted
2016-10-28 07:14:58.498854 CEST    ces_network_ips_down   WARNING    No CES relevant NICs detected
2016-10-28 07:15:07.702351 CEST    nodestatechange_info   INFO       A CES node state change: Node 1 add startup flag
2016-10-28 07:15:37.322997 CEST    nodestatechange_info   INFO       A CES node state change: Node 1 remove startup flag
2016-10-28 07:15:43.741149 CEST    ces_network_ips_up     INFO       CES-relevant IPs are served by found NICs
2016-10-28 07:15:44.028031 CEST    ces_network_vanished   INFO       CES NIC eth0 has vanished
- To view the eventlog history of the node for the last hour with additional details, such as the component name and event ID, issue this command:
mmhealth node eventlog --hour --verbose
The system displays output similar to this:
Node name:      test-21.localnet.com

Timestamp                          Component    Event Name             Event ID   Severity   Details
2016-10-28 06:59:34.045980 CEST    gpfs         monitor_started        999726     INFO       The IBM Spectrum Scale monitoring service has been started
2016-10-28 07:01:21.919943 CEST    filesystem   fs_remount_mount       999306     INFO       The filesystem objfs was mounted internal
2016-10-28 07:01:32.434703 CEST    disk         disk_found             999424     INFO       The disk disk1 was found
2016-10-28 07:01:32.669125 CEST    disk         disk_found             999424     INFO       The disk disk8 was found
2016-10-28 07:01:36.975902 CEST    filesystem   filesystem_found       999299     INFO       Filesystem objfs was found
2016-10-28 07:01:37.226157 CEST    filesystem   unmounted_fs_check     999298     WARNING    The filesystem objfs is probably needed, but not mounted
2016-10-28 07:01:52.113691 CEST    filesystem   mounted_fs_check       999301     INFO       The filesystem objfs is mounted
2016-10-28 07:01:52.283545 CEST    filesystem   fs_remount_mount       999306     INFO       The filesystem objfs was mounted normal
2016-10-28 07:02:07.026093 CEST    filesystem   mounted_fs_check       999301     INFO       The filesystem objfs is mounted
2016-10-28 07:14:58.498854 CEST    cesnetwork   ces_network_ips_down   999426     WARNING    No CES relevant NICs detected
2016-10-28 07:15:07.702351 CEST    gpfs         nodestatechange_info   999220     INFO       A CES node state change: Node 1 add startup flag
2016-10-28 07:15:37.322997 CEST    gpfs         nodestatechange_info   999220     INFO       A CES node state change: Node 1 remove startup flag
2016-10-28 07:15:43.741149 CEST    cesnetwork   ces_network_ips_up     999427     INFO       CES-relevant IPs are served by found NICs
2016-10-28 07:15:44.028031 CEST    cesnetwork   ces_network_vanished   999434     INFO       CES NIC eth0 has vanished
- To view the detailed description of an event, issue the mmhealth event show command. This is an example for the quorum_down event:
mmhealth event show quorum_down
The system displays output similar to this:
Event Name:    quorum_down
Event ID:      999289
Description:   Reasons could be network or hardware issues, or a shutdown of the cluster service.
               The event does not necessarily indicate an issue with the cluster quorum state.
Cause:         The local node does not have quorum. The cluster service might not be running.
User Action:   Check if the cluster quorum nodes are running and can be reached over the network.
               Check local firewall settings.
Severity:      ERROR
State:         DEGRADED
- To view the list of hidden events, issue the mmhealth event list HIDDEN command:
mmhealth event list HIDDEN
The system displays output similar to this:
Event                   scope
--------------------------------------
gpfs_pagepool_small     -
nfsv4_acl_type_wrong    fs1
nfsv4_acl_type_wrong    fs2
- To view the detailed health status of the cluster, issue the mmhealth cluster show command:
mmhealth cluster show
The system displays output similar to this:
Component        Total    Failed   Degraded   Healthy   Other
-----------------------------------------------------------------
NODE             50       1        1          48        -
GPFS             50       1        -          49        -
NETWORK          50       -        -          50        -
FILESYSTEM       3        -        -          3         -
DISK             50       -        -          50        -
CES              5        -        5          -         -
CLOUDGATEWAY     2        -        -          2         -
PERFMON          48       -        5          43        -
THRESHOLD        4        -        -          4         -
Note: The cluster must have a minimum release level of 4.2.2.0 or later to use the mmhealth cluster show command. Also, this command is not supported on the Windows operating system.
- To view more information about the cluster health status, issue this command:
mmhealth cluster show --verbose
The system displays output similar to this:
Component        Total    Failed   Degraded   Healthy   Other
-----------------------------------------------------------------
NODE             50       1        1          48        -
GPFS             50       1        -          49        -
NETWORK          50       -        -          50        -
FILESYSTEM
  FS1            15       -        -          15        -
  FS2            5        -        -          5         -
  FS3            20       -        -          20        -
DISK             50       -        -          50        -
CES              5        -        5          -         -
  AUTH           5        -        -          -         5
  AUTH_OBJ       5        5        -          -         -
  BLOCK          5        -        -          -         5
  CESNETWORK     5        -        -          5         -
  NFS            5        -        -          5         -
  OBJECT         5        -        -          5         -
  SMB            5        -        -          5         -
CLOUDGATEWAY     2        -        -          2         -
PERFMON          48       -        5          43        -
THRESHOLD        4        -        -          4         -
- To create a new threshold rule, issue this command:
mmhealth thresholds add MetaDataPool_capUtil --errorlevel 90 --direction high --groupby gpfs_fs_name,gpfs_diskpool_name
The system displays output similar to this:
New rule 'MetaDataPool_capUtil_custom' is created. The monitor process is activated
- To view the list of threshold rules defined for the system, issue this command:
mmhealth thresholds list
The system displays output similar to this:
### Threshold Rules ###
rule_name                    metric                error   warn     direction   filterBy   groupBy                                             sensitivity
---------------------------------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule            Fileset_inode         90.0    80.0     high                   gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name       300
MetaDataPool_capUtil_custom  MetaDataPool_capUtil  90      None     high                   gpfs_fs_name,gpfs_diskpool_name                     300
DataCapUtil_Rule             DataPool_capUtil      90.0    80.0     high                   gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name   300
MemFree_Rule                 mem_memfree           50000   100000   low                    node                                                300
MetaDataCapUtil_Rule         MetaDataPool_capUtil  90.0    80.0     high                   gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name   300
- To view the detailed health status of the filesystem component, issue this command:
mmhealth node show filesystem -v
The system displays output similar to this:
Node name:      gpfsgui-12.novalocal

Component      Status      Status Change          Reasons
-------------------------------------------------------------------------------
FILESYSTEM     DEGRADED    2016-09-29 15:22:48    pool-data_high_error
  fs1          FAILED      2016-09-29 15:22:48    pool-data_high_error
  fs2          HEALTHY     2016-09-29 15:22:33    -
  objfs        HEALTHY     2016-09-29 15:22:33    -

Event                  Parameter   Severity   Active Since          Event Message
-------------------------------------------------------------------------------------------------------------------------------------------------------------
pool-data_high_error   fs1         ERROR      2016-09-29 15:22:47   The pool myPool of file system fs1 reached a nearly exhausted data level. 90.0
inode_normal           fs1         INFO       2016-09-29 15:22:47   The inode usage of fileset root in file system fs1 reached a normal level.
inode_normal           fs2         INFO       2016-09-29 15:22:47   The inode usage of fileset root in file system fs2 reached a normal level.
inode_normal           objfs       INFO       2016-09-29 15:22:47   The inode usage of fileset root in file system objfs reached a normal level.
inode_normal           objfs       INFO       2016-09-29 15:22:47   The inode usage of fileset Object_Fileset in file system objfs reached a normal level.
mounted_fs_check       fs1         INFO       2016-09-29 15:22:33   The filesystem fs1 is mounted
mounted_fs_check       fs2         INFO       2016-09-29 15:22:33   The filesystem fs2 is mounted
mounted_fs_check       objfs       INFO       2016-09-29 15:22:33   The filesystem objfs is mounted
pool-data_normal       fs1         INFO       2016-09-29 15:22:47   The pool system of file system fs1 reached a normal data level.
pool-data_normal       fs2         INFO       2016-09-29 15:22:47   The pool system of file system fs2 reached a normal data level.
pool-data_normal       objfs       INFO       2016-09-29 15:22:47   The pool data of file system objfs reached a normal data level.
pool-data_normal       objfs       INFO       2016-09-29 15:22:47   The pool system of file system objfs reached a normal data level.
pool-metadata_normal   fs1         INFO       2016-09-29 15:22:47   The pool system of file system fs1 reached a normal metadata level.
pool-metadata_normal   fs1         INFO       2016-09-29 15:22:47   The pool myPool of file system fs1 reached a normal metadata level.
pool-metadata_normal   fs2         INFO       2016-09-29 15:22:47   The pool system of file system fs2 reached a normal metadata level.
pool-metadata_normal   objfs       INFO       2016-09-29 15:22:47   The pool system of file system objfs reached a normal metadata level.
pool-metadata_normal   objfs       INFO       2016-09-29 15:22:47   The pool data of file system objfs reached a normal metadata level.
- To check the monitoring interval, issue the following command:
mmhealth config interval
The system displays output similar to this:
Monitor interval is DEFAULT.
- To set the monitoring interval to low, issue the following command:
mmhealth config interval LOW
The system displays output similar to this:
Monitor interval changed to LOW.
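- To hide a TIP event and reveal it again, you might issue commands like the following (a sketch; the event name nfsv4_acl_type_wrong and the entity fs1 are taken from the hidden-events example above):
mmhealth event hide nfsv4_acl_type_wrong fs1
mmhealth event unhide nfsv4_acl_type_wrong fs1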