Use case 1: Create a threshold rule and use the mmhealth commands to observe the change in the HEALTH status

This section describes a threshold use case: creating a threshold rule and using the mmhealth commands to observe the change in the HEALTH status.

  1. To monitor the free memory (the mem_memfree metric) on each node, create a new threshold rule with the following settings:
    # mmhealth thresholds add mem_memfree --errorlevel 1000000 --warnlevel 1500000 \
      --name myTest_memfree --groupby node
    The system displays output similar to the following:
    New rule 'myTest_memfree' is created. The monitor process is activated
  2. To view the list of all threshold rules defined for the system, run the following command:
    # mmhealth thresholds list
    The system displays output similar to the following:
    ### Threshold Rules ###
    rule_name             metric                error    warn    direction filterBy  groupBy                        sensitivity
    ------------------------------------------------------------------------------------------------------------------------------
    myTest_memfree        mem_memfree           1000000  1500000  None               node                             300
    InodeCapUtil_Rule     Fileset_inode         90.0     80.0     high               gpfs_cluster_name,
                                                                                     gpfs_fs_name,gpfs_fset_name      300
    DataCapUtil_Rule      DataPool_capUtil      90.0     80.0     high               gpfs_cluster_name,
                                                                                     gpfs_fs_name,gpfs_diskpool_name  300
    MemFree_Rule          mem_memfree           50000    100000   low                node                             300
    MetaDataCapUtil_Rule  MetaDataPool_capUtil  90.0     80.0     high               gpfs_cluster_name,
                                                                                     gpfs_fs_name,gpfs_diskpool_name  300
  3. To show the THRESHOLD status of the current node, run the following command:
    # mmhealth node show THRESHOLD
    The system displays output similar to the following:
    Component           Status        Status Change     Reasons
    -----------------------------------------------------------
    THRESHOLD           HEALTHY       13 hours ago      -
      MemFree_Rule      HEALTHY       13 hours ago      -
      myTest_memfree    HEALTHY       10 min   ago      -
  4. To view the event log history, run the following command on each node. The output on each node is similar to the following:
    # mmhealth node eventlog
    2017-03-17 11:52:33.063550 CET        thresholds_error          ERROR      The value of mem_memfree for the component(s) 
                                                                               myTest_memfree/gpfsgui-14.novalocal exceeded 
                                                                               threshold error level 1000000 defined in 
                                                                               myTest_memfree.
    # mmhealth node eventlog
    2017-03-17 11:52:32.594932 CET        thresholds_warn           WARNING    The value of mem_memfree for the component(s) 
                                                                               myTest_memfree/gpfsgui-11.novalocal exceeded 
                                                                               threshold warning level 1500000 defined in 
                                                                               myTest_memfree.
    2017-03-17 12:00:31.653163 CET        thresholds_normal         INFO       The value of mem_memfree defined in myTest_memfree 
                                                                               for component myTest_memfree/gpfsgui-11.novalocal 
                                                                               reached a normal level.
    # mmhealth node eventlog
    2017-03-17 11:52:35.389392 CET        thresholds_error          ERROR      The value of mem_memfree for the component(s) 
                                                                               myTest_memfree/gpfsgui-13.novalocal exceeded 
                                                                               threshold error level 1000000 defined in 
                                                                               myTest_memfree.
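
Because the event log wraps long messages over several lines, post-processing it requires joining the continuation lines back onto their event. The following Python sketch shows one way to do that for text in the layout shown above. The function name parse_eventlog and the dictionary keys are illustrative and not part of mmhealth, and the parser assumes that every event line begins with a timestamp in the form shown in the sample.

```python
import re

# An event line starts with a timestamp, a time zone, an event name and a severity.
EVENT_RE = re.compile(
    r"^\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)


def parse_eventlog(text):
    """Return a list of event dicts, joining wrapped message lines (hypothetical helper)."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line)
        if match:
            timestamp, zone, name, severity, message = match.groups()
            events.append({
                "timestamp": timestamp,
                "zone": zone,
                "event": name,
                "severity": severity,
                "message": message.strip(),
            })
        elif line.strip() and events and not line.lstrip().startswith("#"):
            # Any other non-empty line continues the previous event's message.
            events[-1]["message"] += " " + line.strip()
    return events


# Sample text copied from the gpfsgui-11 output above (indentation shortened).
SAMPLE = """\
2017-03-17 11:52:32.594932 CET   thresholds_warn     WARNING   The value of mem_memfree for the component(s)
                                                               myTest_memfree/gpfsgui-11.novalocal exceeded
                                                               threshold warning level 1500000 defined in
                                                               myTest_memfree.
2017-03-17 12:00:31.653163 CET   thresholds_normal   INFO      The value of mem_memfree defined in myTest_memfree
                                                               for component myTest_memfree/gpfsgui-11.novalocal
                                                               reached a normal level.
"""

for event in parse_eventlog(SAMPLE):
    print(event["timestamp"], event["severity"], event["event"])
```

Each returned entry carries the full, unwrapped message, so the events can be filtered by severity or event name without regard to how the terminal wrapped the text.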
    
  5. To view the actual metric values and compare them with the rule boundaries, run a metric query against the pmcollector node. The following example shows the mem_memfree query command and, in the output, the metric values for each node:
    # date; echo "get metrics mem_memfree -x -r last 10 bucket_size 300 " |
      /opt/IBM/zimon/zc gpfsgui-11
    The system displays output similar to the following:
    Fri Mar 17 12:09:00 CET 2017
    1:      gpfsgui-11.novalocal|Memory|mem_memfree
    2:      gpfsgui-12.novalocal|Memory|mem_memfree
    3:      gpfsgui-13.novalocal|Memory|mem_memfree
    4:      gpfsgui-14.novalocal|Memory|mem_memfree
    Row     Timestamp               mem_memfree     mem_memfree     mem_memfree     mem_memfree
    1       2017-03-17 11:20:00     1558888         1598442         717029          768610
    2       2017-03-17 11:25:00     1555256         1598596         717328          768207
    3       2017-03-17 11:30:00     1554707         1597399         715988          767737
    4       2017-03-17 11:35:00     1554945         1598114         715664          768056
    5       2017-03-17 11:40:00     1553744         1597234         715559          766245
    6       2017-03-17 11:45:00     1552876         1596891         715369          767282
    7       2017-03-17 11:50:00     1450204         1596364         714640          766594
    8       2017-03-17 11:55:00     1389649         1595493         714228          764839
    9       2017-03-17 12:00:00     1549598         1594154         713059          765411
    10      2017-03-17 12:05:00     1547029         1590308         706375          766655
    ...
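
The comparison between the queried values and the rule boundaries can also be scripted. The following Python sketch parses query output in the layout shown above and classifies each value against the warn and error levels of the myTest_memfree rule. It assumes that lower mem_memfree values are worse, which matches the events in this example but is not stated by the rule listing, and it treats a value equal to a level as having reached it. The names classify and parse_zc_output are illustrative, and the parser expects every value to be an integer.

```python
# Levels taken from the myTest_memfree rule created in step 1.
ERROR_LEVEL = 1000000
WARN_LEVEL = 1500000


def classify(value, warn=WARN_LEVEL, error=ERROR_LEVEL):
    """Classify a free-memory value. ASSUMPTION: lower values are worse."""
    if value <= error:
        return "ERROR"
    if value <= warn:
        return "WARNING"
    return "HEALTHY"


def parse_zc_output(text):
    """Return (nodes, rows): node names and (timestamp, values) tuples."""
    nodes, rows = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].endswith(":") and "|" in fields[-1]:
            # Legend line such as "1: gpfsgui-11.novalocal|Memory|mem_memfree".
            nodes.append(fields[-1].split("|")[0])
        elif fields[0].isdigit() and len(fields) >= 4:
            # Data line: row number, date, time, then one value per node.
            timestamp = fields[1] + " " + fields[2]
            rows.append((timestamp, [int(v) for v in fields[3:]]))
    return nodes, rows


# Legend and rows 6 to 9 copied from the output above.
SAMPLE = """\
1:      gpfsgui-11.novalocal|Memory|mem_memfree
2:      gpfsgui-12.novalocal|Memory|mem_memfree
3:      gpfsgui-13.novalocal|Memory|mem_memfree
4:      gpfsgui-14.novalocal|Memory|mem_memfree
Row     Timestamp               mem_memfree     mem_memfree     mem_memfree     mem_memfree
6       2017-03-17 11:45:00     1552876         1596891         715369          767282
7       2017-03-17 11:50:00     1450204         1596364         714640          766594
8       2017-03-17 11:55:00     1389649         1595493         714228          764839
9       2017-03-17 12:00:00     1549598         1594154         713059          765411
"""

nodes, rows = parse_zc_output(SAMPLE)
for timestamp, values in rows:
    states = [node.split(".")[0] + "=" + classify(v) for node, v in zip(nodes, values)]
    print(timestamp, " ".join(states))
```

Under these assumptions, gpfsgui-11 falls into the warning range for the 11:50 and 11:55 buckets and returns to normal at 12:00, while gpfsgui-13 and gpfsgui-14 stay below the error level throughout. This is consistent with the event log entries in step 4.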
  6. To view the THRESHOLD status of all the nodes, run the following command:
    # mmhealth cluster show THRESHOLD
    The system displays output similar to the following:
    Component                Node                     Status            Reasons
    ------------------------------------------------------------------------------------------
    THRESHOLD                gpfsgui-11.novalocal     HEALTHY           -
    THRESHOLD                gpfsgui-13.novalocal     FAILED            thresholds_error
    THRESHOLD                gpfsgui-12.novalocal     HEALTHY           -
    THRESHOLD                gpfsgui-14.novalocal     FAILED            thresholds_error
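
On larger clusters it can help to reduce this listing to the nodes that need attention. The following Python sketch filters text in the layout shown above and keeps only the rows whose status is not HEALTHY. The function name unhealthy_nodes is illustrative and not part of mmhealth, and the parser assumes the four whitespace-separated columns shown in the sample.

```python
def unhealthy_nodes(text, component="THRESHOLD"):
    """Return {node: (status, reasons)} for rows whose status is not HEALTHY."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have four columns: component, node, status, reasons.
        if len(fields) >= 4 and fields[0] == component and fields[2] != "HEALTHY":
            result[fields[1]] = (fields[2], " ".join(fields[3:]))
    return result


# Sample text copied from the output above.
SAMPLE = """\
Component                Node                     Status            Reasons
------------------------------------------------------------------------------------------
THRESHOLD                gpfsgui-11.novalocal     HEALTHY           -
THRESHOLD                gpfsgui-13.novalocal     FAILED            thresholds_error
THRESHOLD                gpfsgui-12.novalocal     HEALTHY           -
THRESHOLD                gpfsgui-14.novalocal     FAILED            thresholds_error
"""

for node, (status, reasons) in sorted(unhealthy_nodes(SAMPLE).items()):
    print(node, status, reasons)
```

The header and separator lines are skipped because their first field does not match the component name.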
  7. To view the details of the raised event, run the following command:
    # mmhealth event show thresholds_error
    The system displays output similar to the following:
    Event Name:              thresholds_error
    Description:             The thresholds value reached an error level.
    Cause:                   The thresholds value reached an error level.
    User Action:             N/A
    Severity:                ERROR
    State:                   FAILED
  8. To get an overview of the node that is reporting an unhealthy status, check the event log for this node by running the following command:
    # mmhealth node eventlog
    The system displays output similar to the following:
    ...
    2017-03-17 11:50:23.252419 CET        move_cesip_from           INFO       Address 192.168.0.158 was moved from this node to node 0
    2017-03-17 11:50:23.400872 CET        thresholds_warn           WARNING    The value of mem_memfree for the component(s) 
                                                                               myTest_memfree/gpfsgui-13.novalocal exceeded 
                                                                               threshold warning level 1500000 defined in myTest_memfree.
    2017-03-17 11:50:26.090570 CET        mounted_fs_check          INFO       The filesystem fs2 is mounted
    2017-03-17 11:50:26.304381 CET        mounted_fs_check          INFO       The filesystem gpfs0 is mounted
    2017-03-17 11:50:26.428079 CET        fs_remount_mount          INFO       The filesystem gpfs0 was mounted normal
    2017-03-17 11:50:27.449704 CET        quorum_up                 INFO       Quorum achieved
    2017-03-17 11:50:28.283431 CET        mounted_fs_check          INFO       The filesystem gpfs0 is mounted
    2017-03-17 11:52:32.591514 CET        mounted_fs_check          INFO       The filesystem objfs is mounted
    2017-03-17 11:52:32.685953 CET        fs_remount_mount          INFO       The filesystem objfs was mounted normal
    2017-03-17 11:52:32.870778 CET        fs_remount_mount          INFO       The filesystem fs1 was mounted normal
    2017-03-17 11:52:35.752707 CET        mounted_fs_check          INFO       The filesystem fs1 is mounted
    2017-03-17 11:52:35.931688 CET        mounted_fs_check          INFO       The filesystem objfs is mounted
    2017-03-17 12:00:36.390594 CET        service_disabled          INFO       The service auth is disabled
    2017-03-17 12:00:36.673544 CET        service_disabled          INFO       The service block is disabled
    2017-03-17 12:00:39.636839 CET        postgresql_failed         ERROR      postgresql-obj process should be started but is stopped
    
    2017-03-16 12:01:21.389392 CET        thresholds_error          ERROR      The value of mem_memfree for the component(s) 
                                                                               myTest_memfree/gpfsgui-13.novalocal exceeded 
                                                                               threshold error level 1000000 defined in myTest_memfree.
    
  9. To check the last THRESHOLD event update for this node, run the following command:
    # mmhealth node show THRESHOLD
    The system displays output similar to the following:
    Node name:      gpfsgui-13.novalocal
    
    Component          Status        Status Change     Reasons
    --------------------------------------------------------------------------------------------------------
    THRESHOLD          FAILED        15 minutes ago    thresholds_error(myTest_memfree/gpfsgui-13.novalocal)
      myTest_memfree   FAILED        15 minutes ago    thresholds_error
    
    
    Event                Parameter         Severity    Active Since     Event Message
    ----------------------------------------------------------------------------------------------------------------------------
    thresholds_error     myTest_memfree    ERROR       15 minutes ago   The value of mem_memfree for the component(s)
                                                                        myTest_memfree/gpfsgui-13.novalocal exceeded 
                                                                        threshold error level 1000000 defined in myTest_memfree.
  10. To review the status of all services for this node, run the following command:
    # mmhealth node show
    The system displays output similar to the following:
    Node name:      gpfsgui-13.novalocal
    Node status:    TIPS
    Status Change:  15 hours ago
    
    Component      Status        Status Change     Reasons
    ----------------------------------------------------------------------------------------------------------------------
    GPFS           TIPS          15 hours ago      gpfs_maxfilestocache_small, gpfs_maxstatcache_high, gpfs_pagepool_small
    NETWORK        HEALTHY       15 hours ago      -
    FILESYSTEM     HEALTHY       15 hours ago      -
    DISK           HEALTHY       15 hours ago      -
    CES            TIPS          15 hours ago      nfs_sensors_inactive
    PERFMON        HEALTHY       15 hours ago      -
    THRESHOLD      FAILED        15 minutes ago    thresholds_error(myTest_memfree/gpfsgui-13.novalocal)