Use case 1: Create a threshold rule and use the mmhealth command to observe the change in the HEALTH status
This section describes how to create a threshold rule and use the
mmhealth commands to observe the resulting change in the HEALTH
status.
- To monitor the free memory (the mem_memfree metric) on each node, create a new threshold
rule with the following
settings:
# mmhealth thresholds add mem_memfree --errorlevel 1000000 --warnlevel 1500000 --name myTest_memfree --groupby node
The system displays output similar to the following:
    New rule 'myTest_memfree' is created. The monitor process is activated
- To view the list of all threshold rules defined for the system, run the following
command:
# mmhealth thresholds list
The system displays output similar to the following:
    ### Threshold Rules ###
    rule_name             metric                error    warn     direction  filterBy  groupBy                                            sensitivity
    -------------------------------------------------------------------------------------------------------------------------------------------------
    myTest_memfree        mem_memfree           1000000  1500000  None                 node                                               300
    InodeCapUtil_Rule     Fileset_inode         90.0     80.0     high                 gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name      300
    DataCapUtil_Rule      DataPool_capUtil      90.0     80.0     high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name  300
    MemFree_Rule          mem_memfree           50000    100000   low                  node                                               300
    MetaDataCapUtil_Rule  MetaDataPool_capUtil  90.0     80.0     high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name  300
- To show the THRESHOLD status of the current node, run the following
command:
# mmhealth node show THRESHOLD
The system displays output similar to the following:
    Component           Status     Status Change     Reasons
    -----------------------------------------------------------
    THRESHOLD           HEALTHY    13 hours ago      -
      MemFree_Rule      HEALTHY    13 hours ago      -
      myTest_memfree    HEALTHY    10 min ago        -
- To view the event log history of the node, run the following command on each
node:
# mmhealth node eventlog
    2017-03-17 11:52:33.063550 CET  thresholds_error   ERROR    The value of mem_memfree for the component(s) myTest_memfree/gpfsgui-14.novalocal exceeded threshold error level 1000000 defined in myTest_memfree.
# mmhealth node eventlog
    2017-03-17 11:52:32.594932 CET  thresholds_warn    WARNING  The value of mem_memfree for the component(s) myTest_memfree/gpfsgui-11.novalocal exceeded threshold warning level 1500000 defined in myTest_memfree.
    2017-03-17 12:00:31.653163 CET  thresholds_normal  INFO     The value of mem_memfree defined in myTest_memfree for component myTest_memfree/gpfsgui-11.novalocal reached a normal level.
# mmhealth node eventlog
    2017-03-17 11:52:35.389392 CET  thresholds_error   ERROR    The value of mem_memfree for the component(s) myTest_memfree/gpfsgui-13.novalocal exceeded threshold error level 1000000 defined in myTest_memfree.
- You can view the actual metric values and compare them with the rule boundaries by running a metric
query against the pmcollector node. The following example shows the mem_memfree metric
query command and the metric values for each node in the
output:
# date; echo "get metrics mem_memfree -x -r last 10 bucket_size 300 " | /opt/IBM/zimon/zc gpfsgui-11
The system displays output similar to the following:
    Fri Mar 17 12:09:00 CET 2017
    1: gpfsgui-11.novalocal|Memory|mem_memfree
    2: gpfsgui-12.novalocal|Memory|mem_memfree
    3: gpfsgui-13.novalocal|Memory|mem_memfree
    4: gpfsgui-14.novalocal|Memory|mem_memfree
    Row           Timestamp  mem_memfree  mem_memfree  mem_memfree  mem_memfree
      1 2017-03-17 11:20:00      1558888      1598442       717029       768610
      2 2017-03-17 11:25:00      1555256      1598596       717328       768207
      3 2017-03-17 11:30:00      1554707      1597399       715988       767737
      4 2017-03-17 11:35:00      1554945      1598114       715664       768056
      5 2017-03-17 11:40:00      1553744      1597234       715559       766245
      6 2017-03-17 11:45:00      1552876      1596891       715369       767282
      7 2017-03-17 11:50:00      1450204      1596364       714640       766594
      8 2017-03-17 11:55:00      1389649      1595493       714228       764839
      9 2017-03-17 12:00:00      1549598      1594154       713059       765411
     10 2017-03-17 12:05:00      1547029      1590308       706375       766655
    ...
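The comparison between the queried values and the rule boundaries can be sketched as a small shell filter. This is only an illustration, not how mmhealth itself evaluates rules: the function name `classify_memfree` is hypothetical, the filter assumes the query output layout shown above, and it assumes that myTest_memfree acts as a lower bound (a value below the warn level raises a warning, a value below the error level raises an error), which is inferred from the sample event log. mmhealth may also aggregate values over the rule's sensitivity interval, so results for a single sample can differ.

```shell
# classify_memfree [WARN] [ERROR]
# Reads the metric query output on stdin and classifies the newest value of
# each node against the warn and error levels of a lower-bound rule.
# The defaults are the levels used for myTest_memfree in this use case.
classify_memfree() {
  awk -v warn="${1:-1500000}" -v err="${2:-1000000}" '
    # Legend lines such as "1: gpfsgui-11.novalocal|Memory|mem_memfree"
    $1 ~ /^[0-9]+:$/ {
      split($2, parts, "|")
      node[++n] = parts[1]
      next
    }
    # Data rows: row number, date, time, then one value per node
    $1 ~ /^[0-9]+$/ && NF == n + 3 {
      for (i = 1; i <= n; i++) last[i] = $(i + 3)
      seen = 1
    }
    END {
      if (!seen) exit 1
      for (i = 1; i <= n; i++) {
        state = "normal"
        if (last[i] + 0 < err + 0) state = "error"
        else if (last[i] + 0 < warn + 0) state = "warning"
        printf "%s %s %s\n", node[i], last[i], state
      }
    }
  '
}
```

For example, the query output could be piped into the filter by appending `| classify_memfree` to the query command. Only the last data row is evaluated, so the result reflects the most recent bucket.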
- To view the THRESHOLD status of all the nodes, run the following
command:
# mmhealth cluster show THRESHOLD
The system displays output similar to the following:
    Component     Node                    Status     Reasons
    ------------------------------------------------------------------------------------------
    THRESHOLD     gpfsgui-11.novalocal    HEALTHY    -
    THRESHOLD     gpfsgui-13.novalocal    FAILED     thresholds_error
    THRESHOLD     gpfsgui-12.novalocal    HEALTHY    -
    THRESHOLD     gpfsgui-14.novalocal    FAILED     thresholds_error
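On larger clusters it can help to reduce this listing to the nodes that need attention. The following is a minimal sketch, assuming the column order Component, Node, Status, Reasons shown above; the function name `failed_threshold_nodes` is hypothetical. Parsing human-readable output is fragile, so prefer a machine-readable output format if your release provides one.

```shell
# failed_threshold_nodes: reads "mmhealth cluster show THRESHOLD" output on
# stdin and prints the node name of every THRESHOLD row in FAILED state.
# Assumes the columns Component, Node, Status, Reasons in that order.
failed_threshold_nodes() {
  awk '$1 == "THRESHOLD" && $3 == "FAILED" { print $2 }'
}

# Possible usage on a cluster node:
#   mmhealth cluster show THRESHOLD | failed_threshold_nodes
```

With the sample output above, the filter would print gpfsgui-13.novalocal and gpfsgui-14.novalocal.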
- To view the details of the raised event, run the following command:
# mmhealth event show thresholds_error
The system displays output similar to the following:
    Event Name:    thresholds_error
    Description:   The thresholds value reached an error level.
    Cause:         The thresholds value reached an error level.
    User Action:   N/A
    Severity:      ERROR
    State:         FAILED
- To get an overview of what happened on a node that is reporting an unhealthy status, check the event log of
that node by running the following command:
# mmhealth node eventlog
The system displays output similar to the following:
    ...
    2017-03-17 11:50:23.252419 CET  move_cesip_from    INFO     Address 192.168.0.158 was moved from this node to node 0
    2017-03-17 11:50:23.400872 CET  thresholds_warn    WARNING  The value of mem_memfree for the component(s) myTest_memfree/gpfsgui-13.novalocal exceeded threshold warning level 1500000 defined in myTest_memfree.
    2017-03-17 11:50:26.090570 CET  mounted_fs_check   INFO     The filesystem fs2 is mounted
    2017-03-17 11:50:26.304381 CET  mounted_fs_check   INFO     The filesystem gpfs0 is mounted
    2017-03-17 11:50:26.428079 CET  fs_remount_mount   INFO     The filesystem gpfs0 was mounted normal
    2017-03-17 11:50:27.449704 CET  quorum_up          INFO     Quorum achieved
    2017-03-17 11:50:28.283431 CET  mounted_fs_check   INFO     The filesystem gpfs0 is mounted
    2017-03-17 11:52:32.591514 CET  mounted_fs_check   INFO     The filesystem objfs is mounted
    2017-03-17 11:52:32.685953 CET  fs_remount_mount   INFO     The filesystem objfs was mounted normal
    2017-03-17 11:52:32.870778 CET  fs_remount_mount   INFO     The filesystem fs1 was mounted normal
    2017-03-17 11:52:35.752707 CET  mounted_fs_check   INFO     The filesystem fs1 is mounted
    2017-03-17 11:52:35.931688 CET  mounted_fs_check   INFO     The filesystem objfs is mounted
    2017-03-17 12:00:36.390594 CET  service_disabled   INFO     The service auth is disabled
    2017-03-17 12:00:36.673544 CET  service_disabled   INFO     The service block is disabled
    2017-03-17 12:00:39.636839 CET  postgresql_failed  ERROR    postgresql-obj process should be started but is stopped
    2017-03-16 12:01:21.389392 CET  thresholds_error   ERROR    The value of mem_memfree for the component(s) myTest_memfree/gpfsgui-13.novalocal exceeded threshold error level 1000000 defined in myTest_memfree.
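Because the event log mixes threshold events with many unrelated entries, a simple filter can isolate the relevant lines. The sketch below assumes that each entry starts with the date, time, and time zone, followed by the event name as the fourth field, as in the sample output; the function name `threshold_events` is hypothetical.

```shell
# threshold_events: reads "mmhealth node eventlog" output on stdin and keeps
# only the entries whose event name (assumed to be field 4) starts with
# "thresholds_", such as thresholds_warn, thresholds_error, thresholds_normal.
threshold_events() {
  awk '$4 ~ /^thresholds_/'
}

# Possible usage on the affected node:
#   mmhealth node eventlog | threshold_events
```

If the log layout differs in your release, for example when the time zone column is absent, adjust the field number accordingly.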
- To check the last THRESHOLD event update for this node, run the
following command:
# mmhealth node show THRESHOLD
The system displays output similar to the following:
    Node name:      gpfsgui-13.novalocal

    Component           Status    Status Change     Reasons
    --------------------------------------------------------------------------------------------------------
    THRESHOLD           FAILED    15 minutes ago    thresholds_error(myTest_memfree/gpfsgui-13.novalocal)
      myTest_memfree    FAILED    15 minutes ago    thresholds_error

    Event               Parameter         Severity    Active Since      Event Message
    ----------------------------------------------------------------------------------------------------------------------------
    thresholds_error    myTest_memfree    ERROR       15 minutes ago    The value of mem_memfree for the component(s) myTest_memfree/gpfsgui-13.novalocal exceeded threshold error level 1000000 defined in myTest_memfree.
- To review the status of all services for this node, run the following
command:
# mmhealth node show
The system displays output similar to the following:
    Node name:      gpfsgui-13.novalocal
    Node status:    TIPS
    Status Change:  15 hours ago

    Component     Status     Status Change     Reasons
    ----------------------------------------------------------------------------------------------------------------------
    GPFS          TIPS       15 hours ago      gpfs_maxfilestocache_small, gpfs_maxstatcache_high, gpfs_pagepool_small
    NETWORK       HEALTHY    15 hours ago      -
    FILESYSTEM    HEALTHY    15 hours ago      -
    DISK          HEALTHY    15 hours ago      -
    CES           TIPS       15 hours ago      nfs_sensors_inactive
    PERFMON       HEALTHY    15 hours ago      -
    THRESHOLD     FAILED     15 minutes ago    thresholds_error(myTest_memfree/gpfsgui-13.novalocal)