Use case 3: Observe the health status changes for a particular component based on the specified threshold rules

This use case shows how to use the mmhealth command to observe the health status changes for a particular node based on the specified threshold rules.

Run the following command to view the threshold rules that are predefined and enabled automatically in a cluster:
[root@rhel77-11 ~]# mmhealth thresholds list
The system displays output similar to the following:
active_thresholds_monitor: RHEL77-11.novalocal
### Threshold Rules ###
rule_name             metric                   error  warn  direction  filterBy  groupBy                                            sensitivity
---------------------------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule     Fileset_inode            90.0   80.0  high                 gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name      300
DataCapUtil_Rule      DataPool_capUtil         90.0   80.0  high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name  300
MemFree_Rule          MemoryAvailable_percent  None   5.0   low                  node                                               300-min
SMBConnPerNode_Rule   current_connections      3000   None  high                 node                                               300
SMBConnTotal_Rule     current_connections      20000  None  high                                                                    300
MetaDataCapUtil_Rule  MetaDataPool_capUtil     90.0   80.0  high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name  300
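The rules in this listing can be read as a boundary check in a given direction: a rule with direction high raises an event when the metric rises above a boundary, and a rule with direction low raises an event when the metric falls below it. The following is an illustrative Python sketch of that boundary logic, not the actual monitor implementation:

```python
def evaluate(value, error, warn, direction):
    """Map a metric sample to a health state.

    direction 'high': unhealthy when the value rises above a boundary.
    direction 'low' : unhealthy when the value falls below a boundary.
    A boundary of None means that severity level is not configured.
    """
    def breached(boundary):
        if boundary is None:
            return False
        return value > boundary if direction == "high" else value < boundary

    if breached(error):
        return "ERROR"
    if breached(warn):
        return "WARNING"
    return "HEALTHY"

# MemFree_Rule: no error boundary, warn=5.0, direction 'low'
print(evaluate(3.2, None, 5.0, "low"))     # WARNING
print(evaluate(42.0, None, 5.0, "low"))    # HEALTHY
# DataCapUtil_Rule: error=90.0, warn=80.0, direction 'high'
print(evaluate(93.0, 90.0, 80.0, "high"))  # ERROR
```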
The default MemFree_Rule rule monitors the estimated available memory relative to the total memory allocation on all cluster nodes. A WARNING event is raised for a node if the MemoryAvailable_percent value for that node drops below 5%. Run the following command to review the details of the rule settings:
[root@rhel77-11 ~]# mmhealth thresholds list -v -Y | grep MemFree_Rule
The system displays output similar to the following:

mmhealth_thresholds:THRESHOLD_RULE:HEADER:version:reserved:MemFree_Rule:attribute:value:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:rule_name:MemFree_Rule:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:frequency:300:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:tags:thresholds:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:user_action_warn:The estimated available memory is less than 5%, calculated to the total RAM or 40 GB, whichever is lower.:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:user_action_error::
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:priority:2:                       			 
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:downsamplOp:min:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:type:measurement:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:metric:MemoryAvailable_percent:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:metricOp:noOperation:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:bucket_size:1:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:computation:value:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:duration:n/a:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:filterBy::									
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:groupBy:node:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:error:None:	
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:warn:5.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:direction:low:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:hysteresis:5.0:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:sensitivity:300-min:
mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:state:active: 
Note: The MemFree_Rule rule has the same evaluation priority for all nodes.
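The -Y output shown above is colon-delimited and intended for machine parsing. A minimal Python sketch of turning such rows into an attribute dictionary, assuming the field layout seen in the sample (attribute name and value in the seventh and eighth fields, HEADER rows skipped):

```python
def parse_y_output(lines):
    """Parse colon-delimited `mmhealth ... -Y` rows into {attribute: value}.

    Field layout is assumed from the sample output above; HEADER rows
    describe the columns and are skipped.
    """
    attrs = {}
    for line in lines:
        fields = line.strip().split(":")
        if len(fields) < 9 or fields[2] == "HEADER":
            continue
        attrs[fields[6]] = fields[7]
    return attrs

sample = [
    "mmhealth_thresholds:THRESHOLD_RULE:HEADER:version:reserved:MemFree_Rule:attribute:value:",
    "mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:rule_name:MemFree_Rule:",
    "mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:warn:5.0:",
    "mmhealth_thresholds:THRESHOLD_RULE:0:1::MemFree_Rule:direction:low:",
]
rule = parse_y_output(sample)
print(rule["warn"], rule["direction"])  # 5.0 low
```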
Run the following command on a node to view the health state of all the threshold rules that are defined for that node:
[root@rhel77-11 ~]# mmhealth node show threshold -v
The system displays output similar to the following:
Node name:      RHEL77-11.novalocal

Component                 Status        Status Change            Reasons
------------------------------------------------------------------------
THRESHOLD                 HEALTHY       2020-04-27 12:07:07      -
  MemFree_Rule            HEALTHY       2020-04-27 12:07:22      -
  active_thresh_monitor   HEALTHY       2020-04-27 12:07:22      -


Event                 Parameter        Severity    Active Since             Event Message
------------------------------------------------------------------------------------------------------------------------------
thresholds_normal     MemFree_Rule     INFO        2020-04-27 12:07:22      The value of MemoryAvailable_percent defined in 
                                                                            MemFree_Rule for component 
                                                                            MemFree_Rule/rhel77-11.novalocal
                                                                            reached a normal level.
In a production environment, you might need to define the memory availability observation settings separately for a particular host. Follow these steps to set the memory availability threshold for a particular node:
  1. Run the following command to create a new rule, node11_mem_available, to set the MemoryAvailable_percent threshold value for the node RHEL77-11.novalocal:
    [root@rhel77-11 ~]# mmhealth thresholds add MemoryAvailable_percent \
      --filterby node=rhel77-11.novalocal --errorlevel 5.0 \
      --warnlevel 50.0 --name node11_mem_available
    The system displays output similar to the following:
    New rule 'node11_mem_available' is created.
  2. Run the following command to view all the defined rules on a cluster:
    [root@rhel77-11 ~]# mmhealth thresholds list
    The system displays output similar to the following:
    active_thresholds_monitor: RHEL77-11.novalocal
    ### Threshold Rules ###
    rule_name             metric                   error  warn  direction  filterBy                  groupBy             sensitivity
    -----------------------------------------------------------------------------------------------------------------------------------
    InodeCapUtil_Rule     Fileset_inode            90.0   80.0  high                                 gpfs_cluster_name,
                                                                                                     gpfs_fs_name,
                                                                                                     gpfs_fset_name      300
    DataCapUtil_Rule      DataPool_capUtil         90.0   80.0  high                                 gpfs_cluster_name,
                                                                                                     gpfs_fs_name,
                                                                                                     gpfs_diskpool_name  300
    MemFree_Rule          MemoryAvailable_percent  None   5.0   low                                  node                300-min
    SMBConnPerNode_Rule   current_connections      3000   None  high                                 node                300
    node11_mem_available  MemoryAvailable_percent  5.0    50.0  None       node=rhel77-11.novalocal  node                300
    SMBConnTotal_Rule     current_connections      20000  None  high                                                     300
    MetaDataCapUtil_Rule  MetaDataPool_capUtil     90.0   80.0  high                                 gpfs_cluster_name,
                                                                                                     gpfs_fs_name,
                                                                                                     gpfs_diskpool_name  300
    
    
    Note: The node11_mem_available rule has priority 1 for the RHEL77-11.novalocal node:
    [root@rhel77-11 ~]# mmhealth thresholds list -v -Y | grep node11_mem_available
    mmhealth_thresholds:THRESHOLD_RULE:HEADER:version:reserved:node11_mem_available:attribute:value:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:rule_name:node11_mem_available:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:frequency:300:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:tags:thresholds:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:user_action_warn:None:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:user_action_error:None:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:priority:1:                             
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:downsamplOp:None:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:type:measurement:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:metric:MemoryAvailable_percent:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:metricOp:noOperation:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:bucket_size:300:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:computation:None:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:duration:None:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:filterBy:node=rhel77-11.novalocal:       
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:groupBy:node:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:error:5.0:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:warn:50.0:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:direction:None:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:hysteresis:0.0:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:sensitivity:300:
    mmhealth_thresholds:THRESHOLD_RULE:0:1::node11_mem_available:state:active:
    All MemFree_Rule events are removed for RHEL77-11.novalocal because the node11_mem_available rule has the higher priority (the lower priority number) for this node:
    [root@rhel77-11 ~]# mmhealth node show threshold -v
    
    Node name:      RHEL77-11.novalocal
    
    Component                 Status        Status Change            Reasons
    -------------------------------------------------------------------------------
    THRESHOLD                 HEALTHY       2020-04-27 12:07:07      -
      active_thresh_monitor   HEALTHY       2020-04-27 12:07:22      -
      node11_mem_available    HEALTHY       2020-05-13 10:10:16      -
    
    
    Event                  Parameter                Severity    Active Since             Event Message
    --------------------------------------------------------------------------------------------------------------------------------------
    thresholds_normal      node11_mem_available     INFO        2020-05-13 10:10:16      The value of MemoryAvailable_percent defined
                                                                                         in node11_mem_available for component
                                                                                         node11_mem_available/rhel77-11.novalocal 
                                                                                         reached a normal level.
    thresholds_removed     MemFree_Rule             INFO        2020-05-13 10:06:15      The value of MemoryAvailable_percent defined 
                                                                                         for the component(s) 
                                                                                         MemFree_Rule/rhel77-11.novalocal 
                                                                                         defined in MemFree_Rule was removed.
    
    Because the warning boundary of the node11_mem_available rule (50.0) is higher than that of the MemFree_Rule rule (5.0), a WARNING event might appear sooner than before for this node.
    [root@rhel77-11 ~]# mmhealth node show threshold -v
    
    Node name:      RHEL77-11.novalocal
    
    Component                 Status        Status Change            Reasons
    -------------------------------------------------------------------------------------------------------------------
    THRESHOLD                 DEGRADED      2020-05-13 12:50:24      thresholds_warn(node11_mem_available)
      active_thresh_monitor   HEALTHY       2020-04-27 12:07:22      -
      node11_mem_available    DEGRADED      2020-05-13 12:50:24      thresholds_warn(node11_mem_available)
    
    
    Event                  Parameter                Severity    Active Since             Event Message
    ------------------------------------------------------------------------------------------------------------------------------------------------
    thresholds_warn        node11_mem_available     WARNING     2020-05-13 12:50:23      The value of MemoryAvailable_percent for the component(s)
                                                                                         node11_mem_available/rhel77-11.novalocal exceeded 
                                                                                         threshold warning level 50.0 defined 
                                                                                         in node11_mem_available.
    thresholds_removed     MemFree_Rule             INFO        2020-05-13 10:06:15      The value of MemoryAvailable_percent for the component(s) 
                                                                                         MemFree_Rule/rhel77-11.novalocal defined 
                                                                                         in MemFree_Rule was removed.
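Step 2 illustrates the precedence mechanism: when several rules observe the same metric and match the same node, the rule with the lower priority number takes effect and the events of the other rules are removed for that node. A hedged Python sketch of that selection, using the priorities from the -Y listings above (the monitor's actual matching logic may differ):

```python
def select_rule(rules, node):
    """Pick the rule that governs `node`: among the rules whose filter
    matches the node, the lowest priority number (1 beats 2) wins."""
    matching = [r for r in rules
                if r["filterBy"] in (None, "node=" + node)]
    return min(matching, key=lambda r: r["priority"])

# Priorities taken from the -Y output shown earlier in this use case.
rules = [
    {"name": "MemFree_Rule", "priority": 2, "filterBy": None},
    {"name": "node11_mem_available", "priority": 1,
     "filterBy": "node=rhel77-11.novalocal"},
]
print(select_rule(rules, "rhel77-11.novalocal")["name"])  # node11_mem_available
print(select_rule(rules, "rhel77-12.novalocal")["name"])  # MemFree_Rule
```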
  3. Run the following command to get the WARNING event details:
    [root@rhel77-11 ~]# mmhealth event show thresholds_warn
    The system displays output similar to the following:
    Event Name:              thresholds_warn
    Description:             The thresholds value reached a warning level.
    Cause:                   The thresholds value reached a warning level.
    User Action:             Run the 'mmhealth thresholds list -v' command
                             and review the user action recommendations 
                             for the corresponding thresholds rule.
    Severity:                WARNING
    State:                   DEGRADED
    You can also review the event history by viewing the whole event log as shown:
    [root@rhel77-11 ~]# mmhealth node eventlog
    Node name:      RHEL77-11.novalocal
    Timestamp                             Event Name                Severity                Details
    2020-04-27 11:59:06.532239 CEST       monitor_started           INFO       The IBM Storage Scale monitoring service has been started
    2020-04-27 11:59:07.410614 CEST       service_running           INFO       The service clusterstate is running on node RHEL77-11.novalocal
    2020-04-27 11:59:07.784565 CEST       service_running           INFO       The service network is running on node RHEL77-11.novalocal
    2020-04-27 11:59:09.965934 CEST       gpfs_down                 ERROR      The Storage Scale service process not running on this node.
                                                                               Normal operation cannot be done
    2020-04-27 11:59:10.102891 CEST       quorum_down               ERROR      The node is not able to reach enough quorum nodes/disks to work properly.
    2020-04-27 11:59:10.329689 CEST       service_running           INFO       The service gpfs is running on node RHEL77-11.novalocal
    2020-04-27 11:59:38.399120 CEST       gpfs_up                   INFO       The Storage Scale service process is running
    2020-04-27 11:59:38.498718 CEST       callhome_not_enabled      TIP        Callhome is not installed, configured or enabled.
    2020-04-27 11:59:38.511969 CEST       gpfs_pagepool_small       TIP        The GPFS pagepool is smaller than or equal to 1G.
    2020-04-27 11:59:38.526075 CEST       csm_resync_forced         INFO       All events and state will be transferred to the cluster manager
    2020-04-27 12:01:07.486549 CEST       quorum_up                 INFO       Quorum achieved
    2020-04-27 12:01:41.906686 CEST       service_running           INFO       The service disk is running on node RHEL77-11.novalocal
    2020-04-27 12:02:22.319159 CEST       fs_remount_mount          INFO       The filesystem gpfs01 was mounted internal
    2020-04-27 12:02:22.322987 CEST       disk_found                INFO       The disk nsd_1 was found
    2020-04-27 12:02:22.337810 CEST       fs_remount_mount          INFO       The filesystem gpfs01 was mounted normal
    2020-04-27 12:02:22.369814 CEST       mounted_fs_check          INFO       The filesystem gpfs01 is mounted
    2020-04-27 12:02:22.443717 CEST       service_running           INFO       The service filesystem is running on node RHEL77-11.novalocal
    2020-04-27 12:04:43.842571 CEST       service_running           INFO       The service threshold is running on node RHEL77-11.novalocal
    2020-04-27 12:04:55.168176 CEST       service_running           INFO       The service perfmon is running on node RHEL77-11.novalocal
    2020-04-27 12:07:07.657284 CEST       service_running           INFO       The service threshold is running on node RHEL77-11.novalocal
    2020-04-27 12:07:22.609728 CEST       thresh_monitor_set_active INFO       The thresholds monitoring process is running in ACTIVE state 
                                                                               on the local node
    2020-04-27 12:07:22.626369 CEST       thresholds_new_rule       INFO       Rule MemFree_Rule was added
    2020-04-27 12:09:08.275073 CEST       local_fs_normal           INFO       The local file system with the mount point / used for /tmp/mmfs 
                                                                               reached a normal level with more than 1000 MB free space.
    2020-04-27 12:14:07.997867 CEST       singleton_sensor_on       INFO       The singleton sensors of pmsensors are turned on
    2020-05-08 11:03:45.324399 CEST       local_fs_path_not_found   INFO       The configured dataStructureDump path /tmp/mmfs does not exists. 
                                                                               Skipping monitoring.
    2020-05-13 10:06:15.912457 CEST       thresholds_removed        INFO       The value of MemoryAvailable_percent for the component(s) 
                                                                               MemFree_Rule/rhel77-11.novalocal defined in MemFree_Rule was removed.
    2020-05-13 10:10:16.173478 CEST       thresholds_new_rule       INFO       Rule node11_mem_available was added
    2020-05-13 12:50:23.955531 CEST       thresholds_warn           WARNING    The value of MemoryAvailable_percent for the component(s)
                                                                               node11_mem_available/rhel77-11.novalocal exceeded threshold 
                                                                               warning level 50.0 defined in node11_mem_available.
    2020-05-13 13:14:12.836070 CEST       out_of_memory             WARNING    Detected Out of memory killer conditions in system log
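In a long event log, it can help to isolate the threshold-related entries by event name. A small Python sketch using sample lines from the log above (on a live system you would read the output of mmhealth node eventlog instead):

```python
# Sample event-log lines, abbreviated from the eventlog output above.
log = [
    "2020-04-27 12:07:22 thresholds_new_rule INFO Rule MemFree_Rule was added",
    "2020-04-27 12:01:07 quorum_up INFO Quorum achieved",
    "2020-05-13 12:50:23 thresholds_warn WARNING exceeded warning level 50.0",
]

# Keep only entries whose event name starts with 'thresholds_'.
threshold_events = [line for line in log if " thresholds_" in line]
for entry in threshold_events:
    print(entry)
```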