Use case 3: Observe the health status changes for a particular component based on the specified threshold rules
This use case shows how to use the mmhealth command to observe health status changes for a particular node based on specified threshold rules.
Run the following command to view the threshold rules that are predefined and enabled automatically in a cluster:
[root@rhel77-11 ~]# mmhealth thresholds list
The system displays output similar to the following:
active_thresholds_monitor: RHEL77-11.novalocal
### Threshold Rules ###
rule_name             metric                   error  warn  direction  filterBy  groupBy                                            sensitivity
----------------------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule     Fileset_inode            90.0   80.0  high                 gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name      300
DataCapUtil_Rule      DataPool_capUtil         90.0   80.0  high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name  300
MemFree_Rule          MemoryAvailable_percent  None   5.0   low                  node                                               300-min
SMBConnPerNode_Rule   current_connections      3000   None  high                 node                                               300
SMBConnTotal_Rule     current_connections      20000  None  high                                                                    300
MetaDataCapUtil_Rule  MetaDataPool_capUtil     90.0   80.0  high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name  300
The default MemFree_Rule rule monitors the estimated available memory in relation to the total memory allocation on all cluster nodes. A WARNING event is sent for a node if the MemoryAvailable_percent value drops below 5% for that node. Run the following command to review the details of the rule settings:
[root@rhel77-11 ~]# mmhealth thresholds list -v -Y | grep MemFree_Rule
The system displays output similar to the
following:
mmhealth_thresholds:THRESHOLD_RULE:HEADER:version:reserved:MemFree_Rule:attribute:value:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:rule_name:MemFree_Rule:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:frequency:300:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:tags:thresholds:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:user_action_warn:The estimated available memory is less than 5%, calculated to the total RAM or 40 GB, whichever is lower.:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:user_action_error::
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:priority:2:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:downsamplOp:min:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:type:measurement:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:metric:MemoryAvailable_percent:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:metricOp:noOperation:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:bucket_size:1:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:computation:value:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:duration:n/a:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:filterBy::
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:groupBy:node:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:error:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:warn:5.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:direction:low:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:hysteresis:5.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:sensitivity:300-min:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:state:active:
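The colon-separated -Y format is convenient for scripting. The following is a minimal sketch that pulls one attribute value out of such a record; the sample line is copied from the listing above, and the field positions (field 6 = rule name, field 7 = attribute, field 8 = value) are an assumption based on that sample. On a live cluster you would pipe `mmhealth thresholds list -v -Y` into awk instead of using a stored sample.

```shell
#!/bin/sh
# Sketch: extract one attribute from the machine-readable -Y output.
# Sample record copied from the listing above; field layout assumed
# from that sample, not from a format specification.
sample='mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:warn:5.0:'

# Split on ':' and print the value (field 8) of the 'warn' attribute (field 7).
warn_level=$(printf '%s\n' "$sample" | awk -F: '$7 == "warn" { print $8 }')
echo "MemFree_Rule warn level: $warn_level"
# prints: MemFree_Rule warn level: 5.0
```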
Note: The MemFree_Rule rule has the same evaluation priority for all nodes.
Run the following command on a node to view the health state of all the threshold rules that are defined for that node:
[root@rhel77-11 ~]# mmhealth node show threshold -v
The system displays output similar to the
following:
Node name: RHEL77-11.novalocal
Component Status Status Change Reasons
------------------------------------------------------------------------
THRESHOLD HEALTHY 2020-04-27 12:07:07 -
MemFree_Rule HEALTHY 2020-04-27 12:07:22 -
active_thresh_monitor HEALTHY 2020-04-27 12:07:22 -
Event              Parameter     Severity  Active Since         Event Message
------------------------------------------------------------------------------------------------------------------------------
thresholds_normal  MemFree_Rule  INFO      2020-04-27 12:07:22  The value of MemoryAvailable_percent defined in MemFree_Rule for component MemFree_Rule/rhel77-11.novalocal reached a normal level.
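A report like this can also be checked from a script, for example to alert on anything that is not HEALTHY. The sketch below scans the component table; the sample rows are copied from the output above, and feeding it from a live `mmhealth node show threshold` run is an assumption, not part of the original text.

```shell
#!/bin/sh
# Sketch: report threshold components whose status is not HEALTHY.
# Sample rows copied from the component table above; column 2 is the status.
table='THRESHOLD             HEALTHY   2020-04-27 12:07:07
MemFree_Rule          HEALTHY   2020-04-27 12:07:22
active_thresh_monitor HEALTHY   2020-04-27 12:07:22'

degraded=$(printf '%s\n' "$table" | awk '$2 != "HEALTHY" { print $1 }')
if [ -z "$degraded" ]; then
    echo "all threshold components are HEALTHY"
else
    echo "needs attention: $degraded"
fi
# prints: all threshold components are HEALTHY
```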
In a production environment, the memory availability observation settings sometimes need to be defined separately for a particular host. Follow these steps to set the memory availability threshold for a particular node:
- Run the following command to create a new rule, node11_mem_available, to set the MemoryAvailable_percent threshold value for the node RHEL77-11.novalocal:
[root@rhel77-11 ~]# mmhealth thresholds add MemoryAvailable_percent --filterby node=rhel77-11.novalocal --errorlevel 5.0 --warnlevel 50.0 --name node11_mem_available
The system displays output similar to the following:
New rule 'node11_mem_available' is created.
- Run the following command to view all the defined rules on a cluster:
[root@rhel77-11 ~]# mmhealth thresholds list
The system displays output similar to the following:
active_thresholds_monitor: RHEL77-11.novalocal
### Threshold Rules ###
rule_name             metric                   error  warn  direction  filterBy                  groupBy                                            sensitivity
--------------------------------------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule     Fileset_inode            90.0   80.0  high                                 gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name      300
DataCapUtil_Rule      DataPool_capUtil         90.0   80.0  high                                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name  300
MemFree_Rule          MemoryAvailable_percent  None   5.0   low                                  node                                               300-min
SMBConnPerNode_Rule   current_connections      3000   None  high                                 node                                               300
node11_mem_available  MemoryAvailable_percent  5.0    50.0  None       node=rhel77-11.novalocal  node                                               300
SMBConnTotal_Rule     current_connections      20000  None  high                                                                                    300
MetaDataCapUtil_Rule  MetaDataPool_capUtil     90.0   80.0  high                                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name  300
Note: The node11_mem_available rule has priority one for the RHEL77-11.novalocal node:
[root@rhel77-11 ~]# mmhealth thresholds list -v -Y | grep node11_mem_available
mmhealth_thresholds:THRESHOLD_RULE:HEADER:version:reserved:node11_mem_available:attribute:value:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:rule_name:node11_mem_available:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:frequency:300:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:tags:thresholds:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:user_action_warn:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:user_action_error:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:priority:1:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:downsamplOp:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:type:measurement:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:metric:MemoryAvailable_percent:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:metricOp:noOperation:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:bucket_size:300:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:computation:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:duration:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:filterBy:node=rhel77-11.novalocal:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:groupBy:node:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:error:5.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:warn:50.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:direction:None:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:hysteresis:0.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:sensitivity:300:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:state:active:
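The priority attribute decides which rule applies when several rules watch the same metric for a node: the rule with the lower priority number wins, so node11_mem_available (priority 1) overrides MemFree_Rule (priority 2) here. As a small sketch, the winning rule can be determined from the priority records of the two -Y listings above; feeding live -Y output into the pipeline is an assumption on the reader's side.

```shell
#!/bin/sh
# Sketch: pick the rule with the lowest (winning) priority number from
# -Y priority records. Sample lines copied from the listings above.
records='mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:priority:2:
mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:priority:1:'

winner=$(printf '%s\n' "$records" |
    awk -F: '$7 == "priority" && (best == "" || $8 + 0 < best + 0) { best = $8; name = $6 }
             END { print name }')
echo "effective rule: $winner"
# prints: effective rule: node11_mem_available
```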
All the MemFree_Rule events are removed for RHEL77-11.novalocal, since the node11_mem_available rule has the higher priority for this node:
[root@rhel77-11 ~]# mmhealth node show threshold -v
Node name: RHEL77-11.novalocal
Component              Status    Status Change        Reasons
-------------------------------------------------------------------------------
THRESHOLD              HEALTHY   2020-04-27 12:07:07  -
active_thresh_monitor  HEALTHY   2020-04-27 12:07:22  -
node11_mem_available   HEALTHY   2020-05-13 10:10:16  -

Event               Parameter             Severity  Active Since         Event Message
--------------------------------------------------------------------------------------------------------------------------------------
thresholds_normal   node11_mem_available  INFO      2020-05-13 10:10:16  The value of MemoryAvailable_percent defined in node11_mem_available for component node11_mem_available/rhel77-11.novalocal reached a normal level.
thresholds_removed  MemFree_Rule          INFO      2020-05-13 10:06:15  The value of MemoryAvailable_percent defined for the component(s) MemFree_Rule/rhel77-11.novalocal defined in MemFree_Rule was removed.
Since the warning boundary set by the node11_mem_available rule is higher than that of the MemFree_Rule rule, the WARNING event might appear sooner than before for this node:
[root@rhel77-11 ~]# mmhealth node show threshold -v
Node name: RHEL77-11.novalocal
Component              Status    Status Change        Reasons
-------------------------------------------------------------------------------------------------------------------
THRESHOLD              DEGRADED  2020-05-13 12:50:24  thresholds_warn(node11_mem_available)
active_thresh_monitor  HEALTHY   2020-04-27 12:07:22  -
node11_mem_available   DEGRADED  2020-05-13 12:50:24  thresholds_warn(node11_mem_available)

Event               Parameter             Severity  Active Since         Event Message
------------------------------------------------------------------------------------------------------------------------------------------------
thresholds_warn     node11_mem_available  WARNING   2020-05-13 12:50:23  The value of MemoryAvailable_percent for the component(s) node11_mem_available/rhel77-11.novalocal exceeded threshold warning level 50.0 defined in node11_mem_available.
thresholds_removed  MemFree_Rule          INFO      2020-05-13 10:06:15  The value of MemoryAvailable_percent for the component(s) MemFree_Rule/rhel77-11.novalocal defined in MemFree_Rule was removed.
- Run the following command to get the WARNING event details:
[root@rhel77-11 ~]# mmhealth event show thresholds_warn
The system displays output similar to the following:
Event Name:   thresholds_warn
Description:  The thresholds value reached a warning level.
Cause:        The thresholds value reached a warning level.
User Action:  Run 'mmhealth thresholds list -v' command and review the user action recommendations for the corresponding thresholds rule.
Severity:     WARNING
State:        DEGRADED
You can also review the event history by viewing the whole event log as shown:
[root@rhel77-11 ~]# mmhealth node eventlog
Node name: RHEL77-11.novalocal
Timestamp                        Event Name                 Severity  Details
2020-04-27 11:59:06.532239 CEST  monitor_started            INFO      The IBM Storage Scale monitoring service has been started
2020-04-27 11:59:07.410614 CEST  service_running            INFO      The service clusterstate is running on node RHEL77-11.novalocal
2020-04-27 11:59:07.784565 CEST  service_running            INFO      The service network is running on node RHEL77-11.novalocal
2020-04-27 11:59:09.965934 CEST  gpfs_down                  ERROR     The Storage Scale service process not running on this node. Normal operation cannot be done
2020-04-27 11:59:10.102891 CEST  quorum_down                ERROR     The node is not able to reach enough quorum nodes/disks to work properly.
2020-04-27 11:59:10.329689 CEST  service_running            INFO      The service gpfs is running on node RHEL77-11.novalocal
2020-04-27 11:59:38.399120 CEST  gpfs_up                    INFO      The Storage Scale service process is running
2020-04-27 11:59:38.498718 CEST  callhome_not_enabled       TIP       Callhome is not installed, configured or enabled.
2020-04-27 11:59:38.511969 CEST  gpfs_pagepool_small        TIP       The GPFS pagepool is smaller than or equal to 1G.
2020-04-27 11:59:38.526075 CEST  csm_resync_forced          INFO      All events and state will be transferred to the cluster manager
2020-04-27 12:01:07.486549 CEST  quorum_up                  INFO      Quorum achieved
2020-04-27 12:01:41.906686 CEST  service_running            INFO      The service disk is running on node RHEL77-11.novalocal
2020-04-27 12:02:22.319159 CEST  fs_remount_mount           INFO      The filesystem gpfs01 was mounted internal
2020-04-27 12:02:22.322987 CEST  disk_found                 INFO      The disk nsd_1 was found
2020-04-27 12:02:22.337810 CEST  fs_remount_mount           INFO      The filesystem gpfs01 was mounted normal
2020-04-27 12:02:22.369814 CEST  mounted_fs_check           INFO      The filesystem gpfs01 is mounted
2020-04-27 12:02:22.443717 CEST  service_running            INFO      The service filesystem is running on node RHEL77-11.novalocal
2020-04-27 12:04:43.842571 CEST  service_running            INFO      The service threshold is running on node RHEL77-11.novalocal
2020-04-27 12:04:55.168176 CEST  service_running            INFO      The service perfmon is running on node RHEL77-11.novalocal
2020-04-27 12:07:07.657284 CEST  service_running            INFO      The service threshold is running on node RHEL77-11.novalocal
2020-04-27 12:07:22.609728 CEST  thresh_monitor_set_active  INFO      The thresholds monitoring process is running in ACTIVE state on the local node
2020-04-27 12:07:22.626369 CEST  thresholds_new_rule        INFO      Rule MemFree_Rule was added
2020-04-27 12:09:08.275073 CEST  local_fs_normal            INFO      The local file system with the mount point / used for /tmp/mmfs reached a normal level with more than 1000 MB free space.
2020-04-27 12:14:07.997867 CEST  singleton_sensor_on        INFO      The singleton sensors of pmsensors are turned on
2020-05-08 11:03:45.324399 CEST  local_fs_path_not_found    INFO      The configured dataStructureDump path /tmp/mmfs does not exists. Skipping monitoring.
2020-05-13 10:06:15.912457 CEST  thresholds_removed         INFO      The value of MemoryAvailable_percent for the component(s) MemFree_Rule/rhel77-11.novalocal defined in MemFree_Rule was removed.
2020-05-13 10:10:16.173478 CEST  thresholds_new_rule        INFO      Rule node11_mem_available was added
2020-05-13 12:50:23.955531 CEST  thresholds_warn            WARNING   The value of MemoryAvailable_percent for the component(s) node11_mem_available/rhel77-11.novalocal exceeded threshold warning level 50.0 defined in node11_mem_available.
2020-05-13 13:14:12.836070 CEST  out_of_memory              WARNING   Detected Out of memory killer conditions in system log
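For quick triage, the event log can be filtered by severity. The sketch below lists the names of WARNING events from saved eventlog lines; the sample entries are abbreviated copies of lines from the log above, and piping a live `mmhealth node eventlog` run into the filter is an assumption, not shown in the original text.

```shell
#!/bin/sh
# Sketch: list the names of WARNING events in eventlog output.
# Whitespace-separated fields per line: date, time, timezone,
# event name, severity, details (layout assumed from the log above).
log='2020-05-13 10:10:16.173478 CEST thresholds_new_rule INFO Rule node11_mem_available was added
2020-05-13 12:50:23.955531 CEST thresholds_warn WARNING MemoryAvailable_percent exceeded the warning level
2020-05-13 13:14:12.836070 CEST out_of_memory WARNING Detected Out of memory killer conditions in system log'

printf '%s\n' "$log" | awk '$5 == "WARNING" { print $4 }'
# prints:
# thresholds_warn
# out_of_memory
```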