GPFS events

The following table lists the events that are created for the GPFS component.

Table 1. Events for the GPFS component
Event | Event Type | Severity | Call Home | Details
bandwidth_low STATE_CHANGE
TIPS
TIP no Message: Network bandwidth is insufficient in relation to the pagepool size. A flush of the pagepool (for example, for taking a snapshot) might take longer and interfere with other operations.
Description: Network bandwidth is insufficient in relation to the pagepool size. A flush of the pagepool (for example, for taking a snapshot) might take longer and interfere with other operations.
Cause: Network bandwidth is insufficient in relation to the pagepool size. A flush of the pagepool (for example, for taking a snapshot) might take longer and interfere with other operations.
User Action: Check the size of the pagepool and consider whether it is acceptable for certain operations to take longer.
bandwidth_ok TIP INFO no Message: Sufficient network bandwidth in relation to the pagepool size.
Description: Sufficient network bandwidth in relation to the pagepool size.
Cause: N/A
User Action: N/A
callhome_enabled TIP INFO no Message: Call home is installed, configured, and enabled.
Description: By enabling the call home functionality, you provide useful information to the developers, which helps to improve the product.
Cause: Call home packages are installed. Call home is configured and enabled.
User Action: N/A
callhome_not_enabled TIP TIP no Message: Call home is not installed, configured, or enabled.
Description: Call home is a functionality that uploads cluster configuration and log files onto the IBM ECuRep servers. It provides helpful information that enables the developers to improve the product and the support team to help with PMR cases.
Cause: Call home packages are not installed, there is no call home configuration, there are no call home groups, or no call home group was enabled.
User Action: Use the mmcallhome command to set up call home.
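As an illustration of the user action above, a minimal setup sequence might look as follows; the customer details are placeholders, and any flag not named in this table should be verified against the mmcallhome man page:
  mmcallhome info change --customer-name "Example Corp" --customer-id "1234567" --email admin@example.com --country-code US
  mmcallhome capability enable
  mmcallhome group auto
  mmcallhome schedule add --task DAILY
  mmcallhome status list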
callhome_not_monitored TIP INFO no Message: Call home status is not monitored on the current node.
Description: Call home status is not monitored on the current node. The call home status was monitored when it was the cluster manager.
Cause: Previously, this node was a cluster manager, and call home monitoring was running on it.
User Action: N/A
callhome_without_schedule TIP TIP no Message: Call home is enabled, but neither a daily nor a weekly schedule is configured.
Description: Call home is enabled, but neither a daily nor a weekly schedule is configured. It is recommended to enable daily or weekly call home schedules.
Cause: Call home is enabled, but neither a daily nor a weekly schedule is configured.
User Action: Enable daily call home uploads by using the mmcallhome schedule add --task DAILY command.
ccr_auth_keys_disabled STATE_CHANGE
HEALTHY
INFO no Message: The security file that is used by GPFS CCR is not checked on this node.
Description: The check for the security file used by GPFS CCR is disabled on this node, since it is not a quorum node.
Cause: N/A
User Action: N/A
ccr_auth_keys_fail STATE_CHANGE
DEGRADED
ERROR FTDC upload Message: The security file that is used by GPFS CCR is corrupt. Item={0},ErrMsg={1},Failed={2}.
Description: The security file used by GPFS CCR is corrupt. For more information, see message.
Cause: The security file is either missing or corrupt.
User Action: Recover this degraded node from a still intact node by using the mmsdrrestore -p NODE command, where NODE specifies an intact node. For more information, see the mmsdrrestore command in the Command Reference Guide.
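For example, assuming quorum node node2 is still intact (the node name is a placeholder):
  mmsdrrestore -p node2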
ccr_auth_keys_ok STATE_CHANGE
HEALTHY
INFO no Message: The security file that is used by GPFS CCR is OK {0}.
Description: The security file used by GPFS CCR is OK.
Cause: N/A
User Action: N/A
ccr_client_init_disabled STATE_CHANGE
HEALTHY
INFO no Message: GPFS CCR client initialization is not checked on this node.
Description: The check for GPFS CCR client initialization is disabled on this node, since it is not a quorum node.
Cause: N/A
User Action: N/A
ccr_client_init_fail STATE_CHANGE
DEGRADED
ERROR no Message: GPFS CCR client initialization has failed. Item={0},ErrMsg={1},Failed={2}.
Description: The GPFS CCR client initialization has failed. For more information, see message.
Cause: The item specified in the message is either not available or corrupt.
User Action: Recover this degraded node from a still intact node by using the mmsdrrestore -p NODE command, where NODE specifies an intact node. For more information, see the mmsdrrestore command in the Command Reference Guide.
ccr_client_init_ok STATE_CHANGE
HEALTHY
INFO no Message: GPFS CCR client initialization is OK {0}.
Description: GPFS CCR client initialization is OK.
Cause: N/A
User Action: N/A
ccr_client_init_warn STATE_CHANGE
DEGRADED
WARNING no Message: GPFS CCR client initialization has failed. Item={0},ErrMsg={1},Failed={2}.
Description: The GPFS CCR client initialization has failed. For more information, see message.
Cause: The item specified in the message is either not available or corrupt.
User Action: Recover this degraded node from a still intact node by using the mmsdrrestore -p NODE command, where NODE specifies an intact node. For more information, see the mmsdrrestore command in the Command Reference Guide.
ccr_comm_dir_disabled STATE_CHANGE
HEALTHY
INFO no Message: The files that are committed to the GPFS CCR are not checked on this node.
Description: The check for the files that are committed to the GPFS CCR is disabled on this node, since it is not a quorum node.
Cause: N/A
User Action: N/A
ccr_comm_dir_fail STATE_CHANGE
DEGRADED
ERROR FTDC upload Message: The files committed to the GPFS CCR are not complete or corrupt. Item={0},ErrMsg={1},Failed={2}.
Description: The files committed to the GPFS CCR are not complete or corrupt. For more information, see message.
Cause: The local disk might be full.
User Action: Check the local disk space and remove unnecessary files. Recover this degraded node from a still intact node by using the mmsdrrestore -p NODE command, where NODE specifies an intact node. For more information, see the mmsdrrestore command in the Command Reference Guide.
ccr_comm_dir_ok STATE_CHANGE
HEALTHY
INFO no Message: The files committed to the GPFS CCR are complete and intact {0}.
Description: The files committed to the GPFS CCR are complete and intact.
Cause: N/A
User Action: N/A
ccr_comm_dir_warn STATE_CHANGE
DEGRADED
WARNING no Message: The files that are committed to the GPFS CCR are not complete or corrupt. Item={0},ErrMsg={1},Failed={2}.
Description: The files that are committed to the GPFS CCR are not complete or corrupt. For more information, see message.
Cause: The local disk might be full.
User Action: Check the local disk space and remove unnecessary files. Recover this degraded node from a still intact node by using the mmsdrrestore -p NODE command, where NODE specifies an intact node. For more information, see the mmsdrrestore command in the Command Reference Guide.
ccr_ip_lookup_disabled STATE_CHANGE
HEALTHY
INFO no Message: The IP address lookup for the GPFS CCR component is not checked on this node.
Description: The check for the IP address lookup for the GPFS CCR component is disabled on this node, since it is not a quorum node.
Cause: N/A
User Action: N/A
ccr_ip_lookup_ok STATE_CHANGE
HEALTHY
INFO no Message: The IP address lookup for the GPFS CCR component is OK {0}.
Description: The IP address lookup for the GPFS CCR component is OK.
Cause: N/A
User Action: N/A
ccr_ip_lookup_warn STATE_CHANGE
DEGRADED
WARNING no Message: The IP address lookup for the GPFS CCR component takes too long. Item={0},ErrMsg={1},Failed={2}.
Description: The IP address lookup for the GPFS CCR component takes too long, resulting in slow administration commands. For more information, see message.
Cause: The local network or the DNS is not configured correctly.
User Action: Check the local network and DNS configuration.
ccr_local_server_disabled STATE_CHANGE
HEALTHY
INFO no Message: The local GPFS CCR server is not checked on this node.
Description: The check for the local GPFS CCR server is disabled on this node, since it is not a quorum node.
Cause: N/A
User Action: N/A
ccr_local_server_ok STATE_CHANGE
HEALTHY
INFO no Message: The local GPFS CCR server is reachable {0}.
Description: The local GPFS CCR server is reachable.
Cause: N/A
User Action: N/A
ccr_local_server_warn STATE_CHANGE
DEGRADED
WARNING no Message: The local GPFS CCR server is not reachable. Item={0},ErrMsg={1},Failed={2}.
Description: The local GPFS CCR server is not reachable. For more information, see message.
Cause: Either the local network or the firewall is configured incorrectly, or the local GPFS daemon does not respond.
User Action: Check the network and firewall configuration with regard to the GPFS communication port that is used (default: 1191). Restart GPFS on this node.
ccr_paxos_12_disabled STATE_CHANGE
HEALTHY
INFO no Message: The stored GPFS CCR state is not checked on this node.
Description: The check for the stored GPFS CCR state is disabled on this node, since it is not a quorum node.
Cause: N/A
User Action: N/A
ccr_paxos_12_fail STATE_CHANGE
DEGRADED
ERROR FTDC upload Message: The stored GPFS CCR state is corrupt. Item={0},ErrMsg={1},Failed={2}.
Description: The stored GPFS CCR state is corrupt. For more information, see message.
Cause: The CCR on quorum nodes has inconsistent states. Use the mmccr check -e command to check the detailed status.
User Action: Recover this degraded node from a still intact node by using the mmsdrrestore -p NODE command, where NODE specifies an intact node. For more information, see the mmsdrrestore command in the Command Reference Guide.
ccr_paxos_12_ok STATE_CHANGE
HEALTHY
INFO no Message: The stored GPFS CCR state is OK {0}.
Description: The stored GPFS CCR state is OK.
Cause: N/A
User Action: N/A
ccr_paxos_12_warn STATE_CHANGE
DEGRADED
WARNING no Message: The stored GPFS CCR state is corrupt. Item={0},ErrMsg={1},Failed={2}.
Description: The stored GPFS CCR state is corrupt. For more information, see message.
Cause: One stored GPFS state file is missing or corrupt.
User Action: No user action necessary. GPFS repairs this automatically.
ccr_paxos_cached_disabled STATE_CHANGE
HEALTHY
INFO no Message: The stored GPFS CCR state is not checked on this node.
Description: The check for the stored GPFS CCR state is disabled on this node, since it is not a quorum node.
Cause: N/A
User Action: N/A
ccr_paxos_cached_fail STATE_CHANGE
DEGRADED
ERROR no Message: The stored GPFS CCR state is corrupt. Item={0},ErrMsg={1},Failed={2}.
Description: The stored GPFS CCR state is corrupt. For more information, see message.
Cause: The stored GPFS CCR state file is either corrupt or empty.
User Action: Recover this degraded node from a still intact node by using the mmsdrrestore -p NODE command, where NODE specifies an intact node. For more information, see the mmsdrrestore command in the Command Reference Guide.
ccr_paxos_cached_ok STATE_CHANGE
HEALTHY
INFO no Message: The stored GPFS CCR state is OK {0}.
Description: The stored GPFS CCR state is OK.
Cause: N/A
User Action: N/A
ccr_quorum_node_ok STATE_CHANGE
HEALTHY
INFO no Message: Quorum node {0} with IP {1} is reachable.
Description: Quorum node is reachable.
Cause: N/A
User Action: N/A
ccr_quorum_node_warn STATE_CHANGE
DEGRADED
WARNING no Message: Quorum node {0} with IP {1} is not reachable.
Description: A quorum node is not reachable or does not respond to CCR requests.
Cause: The quorum node is not reachable due to a network issue or firewall misconfiguration.
User Action: Check the network or firewall configuration (default port 1191 must not be blocked) of the unreachable quorum node.
ccr_quorum_nodes_disabled STATE_CHANGE
HEALTHY
INFO no Message: The reachability of the quorum nodes is not checked on this node.
Description: The check for the reachability of the quorum nodes is disabled on this node, since this node is not a quorum node.
Cause: N/A
User Action: N/A
ccr_quorum_nodes_fail STATE_CHANGE
DEGRADED
ERROR no Message: A majority of the quorum nodes are not reachable over the management network. Item={0},ErrMsg={1},Failed={2}.
Description: A majority of the quorum nodes are not reachable over the management network. GPFS declares quorum loss. For more information, see message.
Cause: The quorum nodes cannot communicate with each other because of a network or firewall misconfiguration.
User Action: Check the network or firewall configuration (default port 1191 must not be blocked) of the unreachable quorum nodes.
ccr_quorum_nodes_ok STATE_CHANGE
HEALTHY
INFO no Message: A majority of quorum nodes are reachable {0}.
Description: A majority of quorum nodes are reachable.
Cause: N/A
User Action: N/A
ccr_quorum_nodes_warn STATE_CHANGE
DEGRADED
WARNING no Message: At least one quorum node is not reachable. Item={0},ErrMsg={1},Failed={2}.
Description: At least one quorum node is not reachable. For more information, see message.
Cause: The quorum node is not reachable because of a network or firewall misconfiguration.
User Action: Check the network or firewall configuration (default port 1191 must not be blocked) of the unreachable quorum node.
ccr_tiebreaker_dsk_disabled STATE_CHANGE
HEALTHY
INFO no Message: The accessibility of the tiebreaker disks that are used by the GPFS CCR is not checked on this node.
Description: The accessibility check for the tiebreaker disks that are used by the GPFS CCR is disabled on this node, since it is not a quorum node.
Cause: N/A
User Action: N/A
ccr_tiebreaker_dsk_fail STATE_CHANGE
DEGRADED
ERROR no Message: Access to the tiebreaker disks has failed. Item={0},ErrMsg={1},Failed={2}.
Description: Access to all tiebreaker disks has failed. For more information, see message.
Cause: Corrupt disk.
User Action: Check whether the tiebreaker disks are available.
ccr_tiebreaker_dsk_ok STATE_CHANGE
HEALTHY
INFO no Message: All tiebreaker disks that are used by the GPFS CCR are accessible {0}.
Description: All tiebreaker disks that are used by the GPFS CCR are accessible.
Cause: N/A
User Action: N/A
ccr_tiebreaker_dsk_warn STATE_CHANGE
DEGRADED
WARNING no Message: At least one tiebreaker disk is not accessible. Item={0},ErrMsg={1},Failed={2}.
Description: At least one tiebreaker disk is not accessible. For more information, see message.
Cause: Corrupt disk.
User Action: Check whether the tiebreaker disk is accessible.
cert_expires_fail STATE_CHANGE
FAILED
ERROR no Message: The cluster certificate is going to expire in {0} days on {1}. The cluster might go down if not renewed.
Description: Cluster certificate expires very soon, leading to a cluster shutdown if not renewed.
Cause: The certificate expiration date is in the near future. Validate the certificate expiration using the mmauth show command.
User Action: Renew the cluster certificate before it expires. Run the mmauth genkey new command followed by the mmauth genkey commit command to renew the cluster certificate.
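As an illustration of the renewal described above (if remote clusters mount file systems from this cluster, coordinate the key exchange with them before committing):
  mmauth show                # check the current certificate expiration
  mmauth genkey new          # generate a new key and certificate
  mmauth genkey commit       # commit the new key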
cert_expires_not_monitored STATE_CHANGE
HEALTHY
INFO no Message: Certificate validation is not monitored on a non-cluster manager node.
Description: Certificate validation is not monitored on a non-cluster manager node.
Cause: N/A
User Action: N/A
cert_expires_ok STATE_CHANGE
HEALTHY
INFO no Message: The certificate is not close to expiring. Expiration date: {0}.
Description: The certificate is not close to expiring. No immediate action required.
Cause: N/A
User Action: N/A
cert_expires_warn STATE_CHANGE
DEGRADED
WARNING no Message: The cluster certificate is going to expire in {0} days on {1}. The cluster may go down if not renewed.
Description: Cluster certificate expires soon, leading to a cluster shutdown if not renewed.
Cause: The certificate expiration date is in the near future. Validate the certificate expiration using the mmauth show command.
User Action: The cluster certificate expires soon. Run the mmauth genkey new command followed by the mmauth genkey commit command to renew the cluster certificate.
cluster_connections_bad STATE_CHANGE
DEGRADED
WARNING no Message: Connection to cluster node {0} has {1} bad connection(s). (Maximum {2}).
Description: The cluster internal network to a node is in a bad state. Not all possible connections work.
Cause: The cluster internal network to a node is in a bad state. Not all possible connections work.
User Action: Check whether the cluster network is good. The event can be manually cleared by using the mmhealth event resolve cluster_connections_bad command.
cluster_connections_clear STATE_CHANGE
HEALTHY
INFO no Message: Cleared all cluster internal connection states.
Description: The cluster internal network is in a good state. All possible connections are working.
Cause: N/A
User Action: N/A
cluster_connections_down STATE_CHANGE
DEGRADED
WARNING no Message: Connection to cluster node {0} has all {1} connection(s) down. (Maximum {2}).
Description: The cluster internal network to a node is in a bad state. All possible connections are down.
Cause: The cluster internal network to a node is in a bad state. All possible connections are down.
User Action: Check whether the cluster network is good. The event can be manually cleared by using the mmhealth event resolve cluster_connections_down command.
cluster_connections_ok STATE_CHANGE
HEALTHY
INFO no Message: All connections are good for target IP {0}.
Description: The cluster internal network is in a good state. All possible connections work.
Cause: N/A
User Action: N/A
csm_resync_forced STATE_CHANGE_EXTERNAL
HEALTHY
INFO no Message: All events and state are transferred to the cluster manager.
Description: All events and state are transferred to the cluster manager.
Cause: The mmhealth node show --resync command was executed.
User Action: N/A
csm_resync_needed STATE_CHANGE_EXTERNAL
DEGRADED
WARNING no Message: Forwarding of an event to the cluster manager failed multiple times.
Description: Forwarding of an event to the cluster manager failed multiple times, which causes the mmhealth cluster show command to show stale data.
Cause: The cluster manager node cannot be reached.
User Action: Check state and connection of the cluster manager node. Then, run the mmhealth node show --resync command.
deadlock_detected STATE_CHANGE
DEGRADED
WARNING no Message: The cluster detected a file system deadlock in the IBM Storage Scale file system.
Description: The cluster detected a deadlock in the IBM Storage Scale file system.
Cause: High file system activity might cause this issue.
User Action: The problem might be temporary or persist. For more information, check the /var/adm/ras/mmfs.log.latest log file.
disk_call_home INFO ERROR service ticket Message: Disk requires replacement: event:{0}, eventName:{1}, rgName:{2}, daName:{3}, pdName:{4}, pdLocation:{5}, pdFru:{6}, rgErr:{7}, rgReason:{8}.
Description: Disk requires replacement.
Cause: Hardware monitoring callback reported a faulty disk.
User Action: Contact IBM support for further guidance.
disk_call_home2 INFO ERROR service ticket Message: Disk requires replacement: event:{0}, eventName:{1}, rgName:{2}, daName:{3}, pdName:{4}, pdLocation:{5}, pdFru:{6}, rgErr:{7}, rgReason:{8}.
Description: Disk requires replacement.
Cause: Hardware monitoring callback reported a faulty disk.
User Action: Contact IBM support for further guidance.
ess_ptf_update_available TIP TIP no Message: For the currently installed IBM Storage Scale System packages, the PTF update {0} PTF {1} is available.
Description: For the currently installed IBM Storage Scale System packages, a PTF update is available.
Cause: PTF updates are available for the currently installed gpfs.ess.utility.tools or gpfs.ess.tools package.
User Action: Visit IBM Fix Central to download and install the updates.
event_hidden INFO INFO no Message: The event {0} was hidden.
Description: An event used in the system health framework was hidden. It can still be seen with the '--verbose' flag in the mmhealth node show ComponentName command when it is active. However, it does not affect the component state anymore.
Cause: The mmhealth event hide command was used.
User Action: Use the mmhealth event list hidden command to see all hidden events. Use the mmhealth event unhide command to show the event again.
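For illustration, a typical hide and unhide sequence (gpfs_pagepool_small is only an example event name):
  mmhealth event hide gpfs_pagepool_small     # stop the TIP from affecting the node state
  mmhealth event list hidden                  # review all currently hidden events
  mmhealth event unhide gpfs_pagepool_small   # make the event visible again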
event_test_info INFO INFO no Message: Test info event that is received from GPFS daemon. Arg0:{0} Arg1:{1}.
Description: Test event that is raised by using the mmfsadm test raiserasevent command.
Cause: N/A
User Action: To raise this test event, run the mmfsadm test raiseRASEvent 0 arg1txt arg2txt command. The event shows up in the event log. For more information, see the mmhealth node eventlog command.
event_test_ok STATE_CHANGE
HEALTHY
INFO no Message: Test OK event that is received from GPFS daemon for entity: {id} Arg0:{0} Arg1:{1}.
Description: Test OK event that is raised by using the mmfsadm test raiserasevent command.
Cause: N/A
User Action: N/A
event_test_statechange STATE_CHANGE
DEGRADED
WARNING no Message: Test State-Change event that is received from GPFS daemon for entity: {id} Arg0:{0} Arg1:{1}.
Description: Test State-Change event that is raised by using the mmfsadm test raiserasevent command.
Cause: This event was raised by the user. It is a test event.
User Action: To raise this test event, run the mmfsadm test raiseRASEvent 1 id arg1txt arg2txt command. The event changes the GPFS state to DEGRADED. For more information, see the mmhealth node show command. Raise the event_test_ok event to change state back to HEALTHY.
event_unhidden INFO INFO no Message: The event {0} was unhidden.
Description: An event was unhidden, which means that the event affects its component's state when it is active. Furthermore, it is shown in the event table of the mmhealth node show ComponentName command without the '--verbose' flag.
Cause: The mmhealth event unhide command was used.
User Action: If this is an active TIP event, then fix it or hide it by using the mmhealth event hide command.
expellist_not_monitored TIP INFO no Message: The expel list is not monitored on a non-cluster manager node.
Description: The expel list is not monitored on a non-cluster manager node.
Cause: The expel list is monitored on the cluster manager only.
User Action: N/A
gpfs_cache_cfg_high TIP TIP no Message: The GPFS cache settings may be too high for the installed total memory.
Description: The cache settings for maxFilesToCache, maxStatCache, and pagepool are close to the amount of total memory.
Cause: The configured cache settings are close to the total memory. The settings for pagepool, maxStatCache, and maxFilesToCache, in total, exceed the recommended value, which is 90% by default.
User Action: For more information on these cache settings, see the 'Cache usage' section in the Administration Guide. Check whether there is enough memory available.
gpfs_cache_cfg_ok TIP INFO no Message: The GPFS cache memory configuration is OK.
Description: The GPFS cache memory configuration is OK. The values for maxFilesToCache, maxStatCache, and pagepool fit to the amount of total memory and configured services.
Cause: The GPFS cache memory configuration is OK.
User Action: N/A
gpfs_deadlock_detection_disabled TIP TIP no Message: The GPFS deadlockDetectionThreshold is set to 0.
Description: Automated deadlock detection monitors waiters. The deadlock detection relies on a configurable threshold to determine whether a deadlock is in progress.
Cause: Automated deadlock detection is disabled in IBM Storage Scale.
User Action: Set the deadlockDetectionThreshold parameter to a positive value by using the mmchconfig command.
gpfs_deadlock_detection_ok TIP INFO no Message: The GPFS deadlockDetectionThreshold is greater than zero.
Description: Automated deadlock detection monitors waiters. The deadlock detection relies on a configurable threshold to determine whether a deadlock is in progress.
Cause: Automated deadlock detection is enabled in IBM Storage Scale.
User Action: N/A
gpfs_down STATE_CHANGE
FAILED
ERROR no Message: The IBM Storage Scale service process is not running on this node. Normal operation cannot be done.
Description: The IBM Storage Scale service is not running. This can be an expected state when the IBM Storage Scale service is shut down.
Cause: The IBM Storage Scale service is not running.
User Action: Check the state of the IBM Storage Scale file system daemon, and check for the root cause in the /var/adm/ras/mmfs.log.latest log.
gpfs_ignoreprefetchluncount_off TIP TIP no Message: The ignorePrefetchLUNCount config option is disabled (recommendation is enabled).
Description: The ignorePrefetchLUNCount option is essential to achieve optimal performance. IBM Storage Scale can achieve higher prefetch IO performance when this option is enabled. This tip event is raised because the ignorePrefetchLUNCount option is disabled (Recommendation for Scale >= 5.2.0).
Cause: ignorePrefetchLUNCount option is disabled according to the mmdiag --config command.
User Action: For more information on the ignorePrefetchLUNCount option, see the 'configuration and tuning' section in the Administration Guide. Although the ignorePrefetchLUNCount option should be enabled, there are situations in which the administrator decides against it. In this case, or if the current setting fits your needs, hide the event by using the GUI or the mmhealth event hide command. The ignorePrefetchLUNCount option can be changed by using the mmchconfig command. The event automatically disappears as soon as the new value is active. Use the mmchconfig -i flag, or restart GPFS if required. For more information, see the mmchconfig command in the Command Reference Guide. Consider that the actively used configuration is monitored. You can list the actively used configuration by using the mmdiag --config command, which includes changes that are not yet activated.
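A possible way to apply this recommendation, using the commands named above (verify the active value afterward):
  mmchconfig ignorePrefetchLUNCount=yes -i      # -i activates the change immediately
  mmdiag --config | grep -i ignorePrefetchLUNCount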
gpfs_ignoreprefetchluncount_ok TIP INFO no Message: The ignorePrefetchLUNCount is enabled as recommended.
Description: The ignorePrefetchLUNCount config option is essential to achieve optimal performance. You can see the actively used configuration by using the mmdiag --config command.
Cause: N/A
User Action: N/A
gpfs_maxfilestocache_ok TIP INFO no Message: The GPFS maxFilesToCache is greater than 100,000.
Description: The GPFS maxFilesToCache is greater than 100,000. Consider that the actively used configuration is monitored. You can see the actively used configuration by using the mmdiag --config command.
Cause: The GPFS maxFilesToCache is greater than 100,000.
User Action: N/A
gpfs_maxfilestocache_small TIP TIP no Message: The GPFS maxFilesToCache is smaller than or equal to 100,000.
Description: The size of maxFilesToCache is essential to achieve optimal performance, especially on protocol nodes. With a larger maxFilesToCache size, IBM Storage Scale can handle more concurrently open files and is able to cache more recently used files, which makes IO operations more efficient. This event is raised because the maxFilesToCache value is configured less than or equal to 100,000 on a protocol node.
Cause: The size of maxFilesToCache is essential to achieve optimal performance, especially on protocol nodes. With a larger maxFilesToCache size, IBM Storage Scale can handle more concurrently open files and is able to cache more recently used files, which makes IO operations more efficient. This event is raised because the maxFilesToCache value is configured less than or equal to 100,000 on a protocol node.
User Action: For more information on the maxFilesToCache size, see the 'Cache usage' section in the Administration Guide. Although the maxFilesToCache size should be greater than 100,000, there are situations in which the administrator decides against it. In this case, or if the current setting fits your needs, hide the event by using the GUI or the mmhealth event hide command. The maxFilesToCache value can be changed by using the mmchconfig command. The gpfs_maxfilestocache_small event automatically disappears as soon as a new maxFilesToCache value greater than 100,000 is active. Restart the GPFS daemon, if required. Consider that the actively used configuration is monitored. You can list the actively used configuration by using the mmdiag --config command, which includes changes that are not yet activated.
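For example, assuming a target of 1M on the protocol nodes (the value and node class are placeholders; a daemon restart is required as described above):
  mmchconfig maxFilesToCache=1M -N cesNodes
  mmdiag --config | grep -i maxFilesToCache     # shows the configured value; a restart activates it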
gpfs_maxstatcache_high TIP TIP no Message: The GPFS maxStatCache is greater than 0 on a Linux system.
Description: The size of maxStatCache is useful to improve the performance of both the system and IBM Storage Scale stat() calls for applications with a working set that does not fit in the regular file cache. Nevertheless, the stat cache is not effective on Linux platform. Therefore, it is recommended to set the maxStatCache attribute to 0 on a Linux platform. This event is raised because the maxStatCache value is configured greater than 0 on a Linux system.
Cause: The size of maxStatCache is useful to improve the performance of both the system and IBM Storage Scale stat() calls for applications with a working set that does not fit in the regular file cache. Nevertheless, the stat cache is not effective on Linux platform. Therefore, it is recommended to set the maxStatCache attribute to 0 on a Linux platform. This event is raised because the maxStatCache value is configured greater than 0 on a Linux system.
User Action: For more information on the maxStatCache size, see the 'Cache usage' section in the Administration Guide. Although the maxStatCache size should be 0 on a Linux system, there are situations in which the administrator decides against a maxStatCache size of 0. In this case, or if the current setting fits your needs, hide the event either by using the GUI or the mmhealth event hide command. The maxStatCache value can be changed with the mmchconfig command. The gpfs_maxstatcache_high event automatically disappears as soon as the new maxStatCache value of 0 is active. Restart the GPFS daemon, if required. Consider that the actively used configuration is monitored. You can list the actively used configuration by using the mmdiag --config command, which includes changes that are not yet activated.
gpfs_maxstatcache_low TIP TIP no Message: The GPFS maxStatCache is smaller than the maxFilesToCache setting.
Description: The size of maxStatCache is useful to improve the performance of both the system and IBM Storage Scale stat() calls for applications with a working set that does not fit in the regular file cache.
Cause: The GPFS maxStatCache is smaller than the maxFilesToCache setting.
User Action: For more information on the maxStatCache size, see the 'Cache usage' section in the Administration Guide. If the current setting fits your needs, hide the event either by using the GUI or the mmhealth event hide command. The maxStatCache value can be changed by using the mmchconfig command. Consider that the actively used configuration is monitored. You can list the actively used configuration by using the mmdiag --config command, which includes changes that are not yet activated.
gpfs_maxstatcache_ok TIP INFO no Message: The GPFS maxStatCache is set to default or at least to the maxFilesToCache value.
Description: The GPFS maxStatCache value is set to 0 on a Linux system as default. Consider that the actively used configuration is monitored. You can see the actively used configuration by using the mmdiag --config command.
Cause: The GPFS maxStatCache is set to default or at least to the maxFilesToCache value.
User Action: N/A
gpfs_pagepool_ok TIP INFO no Message: The GPFS pagepool meets the recommended minimum size.
Description: The GPFS pagepool is greater or equal to the recommended minimum. Consider that the actively used configuration is monitored. You can see the actively used configuration by using the mmdiag --config command.
Cause: N/A
User Action: N/A
gpfs_pagepool_small TIP TIP no Message: The GPFS pagepool is less than 1GB.
Description: The size of the pagepool is essential to achieve optimal performance. With a larger pagepool, IBM Storage Scale can cache or prefetch more data, which makes IO operations more efficient. This tip event is raised because the pagepool is configured less than 1GB.
Cause: Pagepool option is configured less than 1GB according to the mmdiag --config command.
User Action: For more information on the pagepool size, see the 'Cache usage' section in the Administration Guide. Although the pagepool should be at least 1 GB, there are situations in which the administrator decides against it. In this case, or if the current setting fits your needs, hide the event by using the GUI or the mmhealth event hide command. The pagepool can be changed by using the mmchconfig command. The gpfs_pagepool_small event automatically disappears as soon as a new pagepool value greater than 1 GB is active. Use the mmchconfig -i flag, or restart GPFS if required. For more information, see the mmchconfig command in the Command Reference Guide. Consider that the actively used configuration is monitored. You can list the actively used configuration by using the mmdiag --config command, which includes changes that are not yet activated.
gpfs_pagepool_small_4g TIP TIP no Message: The GPFS pagepool is less than 4GB.
Description: The size of the pagepool is essential to achieve optimal performance. With a larger pagepool, IBM Storage Scale can cache or prefetch more data, which makes IO operations more efficient. This tip event is raised because the pagepool is configured less than 4GB (Recommendation for Scale >= 5.2.0).
Cause: Pagepool option is configured less than 4GB according to the mmdiag --config command.
User Action: For more information on the pagepool size, see the 'Cache usage' section in the Administration Guide. Although the pagepool should be at least 4 GB, there are situations in which the administrator decides against it. In this case, or if the current setting fits your needs, hide the event by using the GUI or the mmhealth event hide command. The pagepool can be changed by using the mmchconfig command. This event automatically disappears as soon as a new pagepool value greater than 4 GB is active. Use the mmchconfig -i flag, or restart GPFS if required. For more information, see the mmchconfig command in the Command Reference Guide. Consider that the actively used configuration is monitored. You can list the actively used configuration by using the mmdiag --config command, which includes changes that are not yet activated.
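For example, assuming 4G fits the available memory (the value is a placeholder; -i activates it without a restart, as noted above):
  mmchconfig pagepool=4G -i
  mmdiag --config | grep -i pagepool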
gpfs_unresponsive STATE_CHANGE
FAILED
ERROR no Message: The IBM Storage Scale service process is unresponsive on this node. Normal operation cannot be done.
Description: The IBM Storage Scale service is unresponsive. This can be an expected state when the IBM Storage Scale service is shut down.
Cause: The IBM Storage Scale service is unresponsive.
User Action: Check the state of the IBM Storage Scale file system daemon, and check for the root cause in the /var/adm/ras/mmfs.log.latest log.
gpfs_up STATE_CHANGE
HEALTHY
INFO no Message: The IBM Storage Scale service process is running.
Description: The IBM Storage Scale service is running.
Cause: N/A
User Action: N/A
gpfs_warn INFO WARNING no Message: The IBM Storage Scale process monitoring returned an unknown result. This can be a temporary issue.
Description: The check of the IBM Storage Scale file system daemon returned an unknown result. This can be a temporary issue, like a timeout during the check procedure.
Cause: The IBM Storage Scale file system daemon state cannot be determined due to a problem.
User Action: Find potential issues for this kind of failure in the /var/adm/ras/mmsysmonitor.log file.
gpfs_workerthreads_ok TIP INFO no Message: The configured number of worker threads meet or exceed the recommended minimum number ({0}).
Description: The workerThreads config option is essential to achieve optimal performance. You can see the actively used configuration by using the mmdiag --config command.
Cause: N/A
User Action: N/A
gpfs_workerthreads_small TIP TIP no Message: The workerThreads config option is lower than the recommended number ({0}).
Description: The workerThreads option is essential to achieve optimal performance. With more threads, IBM Storage Scale can achieve more parallelism, which makes IO operations more efficient. This tip event is raised because the workerThreads option is set to less than 256 (Recommendation for Scale >= 5.2.0).
Cause: The workerThreads option is set to less than 256 according to the mmdiag --config command.
User Action: For more information on the workerThreads option, see the 'configuration and tuning' section in the Administration Guide. Although the workerThreads value should be at least 256, there are situations in which the administrator decides against it. In this case, or if the current setting fits your needs, hide the event by using the GUI or the mmhealth event hide command. The workerThreads value can be changed by using the mmchconfig command. The event automatically disappears as soon as a new value of at least 256 is active. For more information, see the mmchconfig command in the Command Reference Guide. Consider that the actively used configuration is monitored. You can list the actively used configuration by using the mmdiag --config command, which includes changes that are not yet activated.
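For example, assuming 512 suits the workload (the value is a placeholder; the new value becomes active after a daemon restart):
  mmchconfig workerThreads=512
  mmdiag --config | grep -i workerThreads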
gpfsport_access_down STATE_CHANGE
FAILED
ERROR no Message: No access to IBM Storage Scale IP {0} port {1}. Check the firewall settings.
Description: The access check of the local IBM Storage Scale file system daemon port has failed.
Cause: The port is probably blocked by a firewall rule.
User Action: Check whether the IBM Storage Scale file system daemon is running and check the firewall for blocking rules on this port.
gpfsport_access_up STATE_CHANGE
HEALTHY
INFO no Message: Access to IBM Storage Scale IP {0} port {1} is OK.
Description: The TCP access check of the local IBM Storage Scale file system daemon port was successful.
Cause: N/A
User Action: N/A
gpfsport_access_warn INFO WARNING no Message: IBM Storage Scale access check IP {0} port {1} failed. Check for a valid IBM Storage Scale IP.
Description: The access check of the IBM Storage Scale file system daemon port has returned an unknown result.
Cause: The IBM Storage Scale file system daemon port access cannot be determined due to a problem.
User Action: Find potential issues for this kind of failure in the logs.
gpfsport_down STATE_CHANGE
FAILED
ERROR no Message: IBM Storage Scale port {0} is not active.
Description: The expected local IBM Storage Scale file system daemon port was not detected.
Cause: The IBM Storage Scale file system daemon is not running.
User Action: Check whether the IBM Storage Scale service is running.
gpfsport_up STATE_CHANGE
HEALTHY
INFO no Message: IBM Storage Scale port {0} is active.
Description: The expected local IBM Storage Scale file system daemon port was detected.
Cause: N/A
User Action: N/A
gpfsport_warn INFO WARNING no Message: IBM Storage Scale monitoring IP {0} port {1} has returned an unknown result.
Description: The check of the IBM Storage Scale file system daemon port has returned an unknown result.
Cause: The IBM Storage Scale file system daemon port cannot be determined due to a problem.
User Action: Find potential issues for this kind of failure in the logs.
info_on_duplicate_events INFO INFO no Message: The event {0} {id} was repeated {1} times.
Description: Multiple messages of the same type were de-duplicated to avoid log flooding.
Cause: Multiple events of the same type are processed.
User Action: N/A
kernel_io_hang_detected STATE_CHANGE
FAILED
ERROR no Message: A kernel IO hang has been detected on disk {0} affecting file system {1}.
Description: I/Os to the underlying storage system have been pending for more than the configured threshold time, which is 'ioHangDetectorTimeout'. When panicOnIOHang is enabled, this can force a kernel panic.
Cause: A diskIOHang callback with reason 'Block I/O' or empty reason was received.
User Action: Check the underlying storage system and reboot the node to resolve the current hang condition.
kernel_io_hang_resolved STATE_CHANGE
HEALTHY
INFO no Message: A kernel IO hang on disk {id} has been resolved.
Description: Pending I/Os to the underlying storage system have been resolved manually.
Cause: N/A
User Action: N/A
local_fs_filled STATE_CHANGE
DEGRADED
WARNING no Message: The local file system with the mount point {1} used for {0} reached a warning level with less than 1000 MB, but more than 100 MB, of free space.
Description: The monitored file system has less than 1000 MB, but more than 100 MB, of available space.
Cause: The local file system reached a warning level of under 1000 MB of free space.
User Action: Detect large files on the local file system by using the 'du -cks * |sort -rn |head -11' command, and delete or move data to free space.
local_fs_full STATE_CHANGE
FAILED
ERROR no Message: The local file system with the mount point {1} used for {0} reached a nearly exhausted level with less than 100 MB of free space.
Description: The monitored file system has less than 100 MB of available space.
Cause: The local file system has reached a nearly exhausted level with less than 100 MB of free space.
User Action: Detect large files on the local file system by using the 'du -cks * |sort -rn |head -11' command, and delete or move data to free space.
local_fs_normal STATE_CHANGE
HEALTHY
INFO no Message: The local file system with the mount point {1} used for {0} reached a normal level with more than 1000 MB free space.
Description: The monitored file system has an available space value of over 1000 MB.
Cause: N/A
User Action: N/A
local_fs_path_not_found STATE_CHANGE
HEALTHY
INFO no Message: The configured dataStructureDump path {0} does not exist. Monitoring is skipped.
Description: The configured dataStructureDump path does not exist yet; therefore, the disk capacity monitoring is skipped.
Cause: N/A
User Action: N/A
local_fs_unknown INFO WARNING no Message: The fill level of the local file systems is unknown because of unexpected output of the df command. Return Code: {0} Error: {1}.
Description: The df command returned a nonzero return code or unexpected output.
Cause: The fill states of the local file systems cannot be determined, which might be caused by a nonzero return code from the df command or an unexpected output format.
User Action: Check whether the df command exists on the node and whether it hangs or runs into a timeout.
longwaiters_found STATE_CHANGE
DEGRADED
ERROR no Message: Detected IBM Storage Scale longwaiter threads.
Description: Longwaiter threads are found in the IBM Storage Scale file system.
Cause: The mmdiag --deadlock command reports longwaiter threads, most likely due to a high IO load.
User Action: Check log files and the output of the mmdiag --waiters command to identify the root cause. This can also be due to a temporary issue.
longwaiters_warn INFO WARNING no Message: IBM Storage Scale longwaiters monitoring has returned an unknown result.
Description: The longwaiters check has returned an unknown result.
Cause: The IBM Storage Scale file system longwaiters check cannot be determined due to a problem.
User Action: Find potential issues for this kind of failure in the logs.
mmfsd_abort_clear STATE_CHANGE
HEALTHY
INFO no Message: Resolve event for IBM Storage Scale issue signal.
Description: Resolve event for IBM Storage Scale issue signal.
Cause: N/A
User Action: N/A
mmfsd_abort_warn STATE_CHANGE
DEGRADED
WARNING FTDC upload Message: IBM Storage Scale reported an issue {0}.
Description: The mmfsd daemon process may have terminated abnormally.
Cause: IBM Storage Scale signaled an issue. The mmfsd daemon process might have terminated abnormally.
User Action: Check the mmfs.log.latest and mmfs.log.previous files for crash and restart hints. Check for mmfsd daemon status. Run the mmhealth event resolve mmfsd_abort_warn command to remove this warning event from the mmhealth command.
monitor_started INFO INFO no Message: The IBM Storage Scale monitoring service has been started.
Description: The IBM Storage Scale monitoring service has been started and is actively monitoring the system components.
Cause: N/A
User Action: Use the mmhealth command to query the monitoring status.
no_longwaiters_found STATE_CHANGE
HEALTHY
INFO no Message: No IBM Storage Scale longwaiters are found.
Description: No longwaiter threads are found in the IBM Storage Scale file system.
Cause: N/A
User Action: N/A
no_rpc_waiters STATE_CHANGE
HEALTHY
INFO no Message: No pending RPC messages were found.
Description: No pending RPC messages were found.
Cause: N/A
User Action: N/A
node_call_home INFO ERROR service ticket Message: OPAL logs reported a problem: event:{0}, eventId:{1}, myNode:{2}.
Description: OPAL logs reported a problem via callhomemon.sh, which requires IBM support attention.
Cause: OPAL logs reported a problem via callhomemon.sh.
User Action: Contact IBM support for further guidance.
node_call_home2 INFO ERROR service ticket Message: OPAL logs reported a problem: event:{0}, eventId:{1}, myNode:{2}.
Description: OPAL logs reported a problem via callhomemon.sh, which requires IBM support attention.
Cause: OPAL logs reported a problem via callhomemon.sh.
User Action: Contact IBM support for further guidance.
node_expelled TIP TIP no Message: The cluster node {1} ({id}) is expelled from the cluster.
Description: A cluster node is expelled from the cluster, either because the mmexpelnode command was called manually or because the automatic expel on pending RPCs is enabled and it detected a sick node.
Cause: The mmexpelnode -l command shows that the node is expelled.
User Action: If the node was expelled because of pending RPCs, verify that the node is healthy and has enough resources (memory, CPU) to respond to RPCs. If the node was expelled manually, this event can be hidden.
node_unexpelled TIP INFO no Message: The cluster node {0} ({id}) is no longer expelled from the cluster.
Description: A cluster node is no longer expelled from the cluster.
Cause: The mmexpelnode -l command shows that the node is no longer expelled.
User Action: N/A.
nodeleave_info INFO INFO no Message: A CES node left the cluster: Node {0}.
Description: Informational. Shows the name of the node that is leaving the cluster. This event may be logged on a different node, and not necessarily on the node that is leaving the cluster.
Cause: A CES node left the cluster. The name of the node that is leaving the cluster is provided.
User Action: N/A
nodestatechange_info INFO INFO no Message: A CES node state change: Node {0} {1} {2} flag.
Description: Informational. Shows the modified node state, such as the node changed to suspended mode, network down, or others.
Cause: A node state change was detected. Details are shown in the message.
User Action: N/A
numaMemoryInterleave_not_set TIP INFO no Message: The numaMemoryInterleave parameter is set to no.
Description: The numaMemoryInterleave parameter is set to no. The numactl tool is not required.
Cause: The mmlsconfig command indicates that numaMemoryInterleave is disabled.
User Action: N/A
numaMemoryInterleave_wrong TIP TIP no Message: The numaMemoryInterleave parameter is set to no.
Description: The system has multiple NUMA nodes, but the numaMemoryInterleave parameter is set to no. With numaMemoryInterleave enabled, the mmfsd daemon is supposed to allocate memory from CPU-bound NUMA nodes, which can improve performance.
Cause: The mmlsconfig command indicates that numaMemoryInterleave is disabled.
User Action: It is recommended to enable numaMemoryInterleave on systems with multiple NUMA nodes. The numaMemoryInterleave option can be changed by using the mmchconfig command (daemon restart required). For more information, see the mmchconfig command in the Command Reference Guide. The event automatically disappears as soon as the new value is active. Consider that the actively used configuration is monitored. You can list the actively used configuration by using the mmdiag --config command, which includes changes that are not yet activated. If the current setting is desired, hide the event by using the GUI or the mmhealth event hide command.
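A possible way to enable the setting (node1 is a placeholder for the affected node; restarting via mmshutdown and mmstartup is one option, as a daemon restart is required):
  mmchconfig numaMemoryInterleave=yes
  mmshutdown -N node1 && mmstartup -N node1
  mmdiag --config | grep -i numaMemoryInterleave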
numa_not_present TIP INFO no Message: The hardware is not NUMA capable or NUMA is not enabled.
Description: The system does not have multiple NUMA nodes.
Cause: No NUMA node or only a single NUMA node is found in /sys/devices/system/node/.
User Action: N/A
numactl_installed TIP INFO no Message: The numactl tool is installed.
Description: To use the mmchconfig numaMemoryInterleave parameter, the numactl tool is required. It was detected that numactl is installed.
Cause: The required /usr/bin/numactl command is installed correctly.
User Action: N/A
numactl_not_installed TIP TIP no Message: The numactl tool is not found, but needs to be installed.
Description: When the configuration attribute numaMemoryInterleave is enabled, the mmfsd daemon is supposed to allocate memory from CPU bound NUMA nodes. However, when the numactl tool is missing, mmfsd allocates memory only from a single NUMA memory region. This action might impact the performance and lead to memory allocation issues even when other NUMA regions still have plenty of memory left.
Cause: The mmlsconfig command indicates that numaMemoryInterleave is enabled, but the required /usr/bin/numactl command is missing.
User Action: Install the required numactl tool. For example, run the yum install numactl command on RHEL. If numactl is not available for your operating system, disable the numaMemoryInterleave setting on this node.
operating_system_ok STATE_CHANGE
HEALTHY
INFO no Message: A supported operating system was detected.
Description: A supported operating system was detected.
Cause: N/A
User Action: N/A
out_of_memory STATE_CHANGE
DEGRADED
WARNING no Message: Detected out-of-memory killer conditions in system log.
Description: In an out-of-memory condition, the OOM killer terminates the process with the largest memory utilization score. This may affect the IBM Storage Scale processes and cause subsequent issues.
Cause: The dmesg command returned log entries, which are written by the OOM killer.
User Action: Check the memory usage on the node. Identify the reason for the out-of-memory condition, and check the system log to find out which processes were killed by the OOM killer. You might need to recover these processes manually or reboot the system to get to a clean state. Run the mmhealth event resolve out_of_memory command after you have recovered the system to remove this warning event from the mmhealth command.
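For illustration, a possible check-and-clear sequence on the affected node (the grep pattern is an assumption about the system log wording):
  dmesg -T | grep -i "out of memory"     # identify which processes the OOM killer terminated
  free -m                                # check current memory usage
  mmhealth event resolve out_of_memory   # clear the warning after the system is recovered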
out_of_memory_ok STATE_CHANGE
HEALTHY
INFO no Message: Out-of-memory issue is resolved.
Description: Resolve event for the out-of-memory degraded state.
Cause: N/A
User Action: N/A
passthrough_query_hang STATE_CHANGE
FAILED
ERROR no Message: A SCSI pass-through query request hang has been detected on disk {0} affecting file system {1}. Reason: {2}.
Description: A SCSI pass-through query request to the storage system has been pending for more than the configured threshold time, which is 'passthrough_query_hang_detected'. When panicOnIOHang is enabled, this can force a kernel panic.
Cause: A diskIOHang callback was received.
User Action: Check the underlying storage system and reboot the node to resolve the current hang condition.
perfmon_gpfsvfs_deprecated TIP TIP no Message: The deprecated GPFSVFS sensor is used. Configure GPFSVFSX sensor instead.
Description: The GPFSVFS sensor currently in use is deprecated; as of release 5.1.0, the new GPFSVFSX sensor needs to be added to the configuration or activated.
Cause: The configuration is using the deprecated GPFSVFS sensor. Starting from the 5.1.0 release, the new GPFSVFSX sensor must be added or activated.
User Action: Disable the GPFSVFS sensor using the mmperfmon config update GPFSVFS.period=0 command, followed by the mmperfmon config add --sensors /opt/IBM/zimon/defaults/ZIMonSensors_GPFSVFSX.cfg command to activate the GPFSVFSX sensor.
perfmon_gpfsvfs_not_monitored TIP INFO no Message: The GPFSVFS sensor status is not monitored on a non-cluster manager node.
Description: The GPFSVFS sensor status is not monitored on a non-cluster manager node.
Cause: N/A
User Action: N/A
perfmon_gpfsvfs_ok TIP INFO no Message: Deprecated GPFSVFS sensor is inactive.
Description: The deprecated GPFSVFS sensor is not active.
Cause: N/A
User Action: N/A
quorum_down STATE_CHANGE
DEGRADED
ERROR no Message: The node is not able to reach enough quorum nodes/disks to work properly.
Description: Reasons can be network or hardware issues, or a shutdown of the cluster service. The event does not necessarily indicate an issue with the cluster quorum state.
Cause: The node is trying to form a quorum with the other available nodes. The cluster service may not be running or the communication with other nodes is faulty.
User Action: Check whether the cluster service is running and other quorum nodes can be reached over the network. Check the local firewall settings.
quorum_even_nodes_no_tiebreaker STATE_CHANGE
TIPS
TIP no Message: No tiebreaker disk is defined with an even number of quorum nodes.
Description: No tiebreaker disk is defined.
Cause: You have not configured any tiebreaker disk.
User Action: Add 1 or 3 tiebreaker disks.
quorum_ok STATE_CHANGE
HEALTHY
INFO no Message: The quorum configuration corresponds to the best practices.
Description: The quorum configuration is as recommended.
Cause: N/A
User Action: N/A
quorum_too_little_nodes TIP TIP no Message: An odd number of at least 3 quorum nodes is recommended.
Description: Only one quorum node is defined.
Cause: A configuration with 3, 5, or 7 quorum nodes is recommended, but is not configured.
User Action: Add quorum nodes.
quorum_two_tiebreaker_count STATE_CHANGE
TIPS
TIP no Message: Change the number of tiebreaker disks to an odd number.
Description: The number of tiebreaker disks is two.
Cause: The number of tiebreaker disks is not as recommended.
User Action: Use an odd number of tiebreaker disks.
quorum_up STATE_CHANGE
HEALTHY
INFO no Message: Quorum is achieved.
Description: The monitor has detected a valid quorum.
Cause: N/A
User Action: N/A
quorum_warn INFO WARNING no Message: The IBM Storage Scale quorum monitor cannot be executed. This can be a timeout issue.
Description: The check of the quorum state returned an unknown result. This may be due to a temporary issue, like a timeout during the check procedure.
Cause: The quorum state cannot be determined due to a problem.
User Action: Find potential issues for this kind of failure in the /var/adm/ras/mmsysmonitor.log file.
quorumloss INFO WARNING no Message: The cluster has detected a quorum loss.
Description: The cluster may get into an inconsistent or split-brain state. The reason for this issue can be a network or hardware problem, or quorum nodes being removed from the cluster. The event is not necessarily logged on the node that causes the quorum loss.
Cause: The number of required quorum nodes does not match the minimum requirements. This can be an expected situation.
User Action: Ensure the required cluster quorum nodes are up and running.
quorumreached_detected INFO INFO no Message: Quorum reached event.
Description: The cluster has reached quorum.
Cause: The cluster has reached quorum.
User Action: N/A
reconnect_aborted STATE_CHANGE
HEALTHY
INFO no Message: Reconnect to {0} is aborted.
Description: Reconnect failed, which might be due to a network error. Check for a network error.
Cause: N/A
User Action: N/A
reconnect_done STATE_CHANGE
HEALTHY
INFO no Message: Reconnected to {0}.
Description: The TCP connection is reconnected.
Cause: N/A
User Action: N/A
reconnect_failed INFO ERROR no Message: Reconnect to {0} has failed.
Description: Reconnect failed, which might be due to a network error.
Cause: The network is in bad state.
User Action: Check whether the network is good.
reconnect_start STATE_CHANGE
DEGRADED
WARNING no Message: Attempting to reconnect to {0}.
Description: The TCP connection is in an abnormal state and tries to reconnect.
Cause: The TCP connection is in an abnormal state.
User Action: Check whether the network is good.
rpc_waiters STATE_CHANGE
DEGRADED
WARNING no Message: Pending RPC messages were found for the nodes: {0}.
Description: Nodes are taking too long to respond to pending RPC messages.
Cause: The mmdiag --network command returned pending RPC messages that took more time than the mmhealthPendingRPCWarningThreshold value.
User Action: If nodes do not respond to pending RPC messages, you might need to expel the nodes by using the mmexpelnode -N <ip> command.
rpc_waiters_expel INFO WARNING no Message: A request to expel the node {id} was sent to the cluster node {0} because of pending RPC messages.
Description: A node is expelled automatically because of pending RPC messages on the node.
Cause: The mmdiag --network command returned a pending RPC message that has been outstanding longer than the mmhealthPendingRPCExpelThreshold value.
User Action: Verify the logs on the expelled node to find the reason for the pending RPC messages. For example, node resources, such as memory, might be exhausted. Use the mmexpelnode -r -N <ip> command to allow the node to join the cluster again.
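For example, a node that does not respond can be expelled and later readmitted with the commands named above; the IP address is a placeholder:
    # Expel the node that does not respond to pending RPC messages
    mmexpelnode -N 192.0.2.10
    # After the node recovers, clear the expel state so that it can rejoin the cluster
    mmexpelnode -r -N 192.0.2.10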
rpm_files_state_ok TIP INFO no Message: The content of the gpfs.base package remains unchanged.
Description: No changes detected in gpfs.base package files.
Cause: N/A
User Action: N/A
rpm_files_state_warn TIP TIP no Message: gpfs.base package files were modified from the original state: {0}.
Description: Some shipped files from the gpfs.base package were modified after installation, which might cause unexpected behavior.
Cause: Changes to gpfs.base package files were detected by using the 'rpm -V gpfs.base' command.
User Action: Revert the file changes, for example, by reinstalling the gpfs.base package, unless IBM support instructed you to perform these changes.
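For example, the modified files can be listed and the package restored; this is a sketch that assumes the matching gpfs.base package file is available locally:
    # List gpfs.base files that differ from the shipped state
    rpm -V gpfs.base
    # Reinstall the package to restore the original files (package file name is illustrative)
    rpm -Uvh --replacepkgs --replacefiles gpfs.base-*.rpm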
scale_entitlement_failed STATE_CHANGE
DEGRADED
WARNING no Message: The features used on this node are not covered by your '{0}' entitlement license.
Description: Your Scale entitlement license on the current node does not include the currently used features.
Cause: Your entitlement license does not include the currently used features.
User Action: Either avoid using features that are beyond the scope of your Scale entitlement license, or switch to an extended license that supports all of the used features.
scale_entitlement_not_monitored STATE_CHANGE
HEALTHY
INFO no Message: Scale entitlement not monitored on a non-cluster manager.
Description: Scale entitlement is not monitored on nodes that are not the cluster manager.
Cause: N/A
User Action: N/A
scale_entitlement_ok STATE_CHANGE
HEALTHY
INFO no Message: The currently used features on this node are covered by your '{0}' entitlement license.
Description: Your entitlement license on the current node includes the currently used features.
Cause: N/A
User Action: N/A
scale_ptf_update_available TIP TIP no Message: For the currently installed IBM Storage Scale packages, the PTF update {0} PTF {1} is available.
Description: For the currently installed IBM Storage Scale packages, a PTF update is available.
Cause: PTF updates are available for the currently installed gpfs.base package.
User Action: Visit IBM Fix Central to download and install the updates.
scale_up_to_date STATE_CHANGE
HEALTHY
INFO no Message: The last software update check showed no available updates.
Description: The last software update check showed no available updates.
Cause: N/A
User Action: N/A
scale_updatecheck_disabled STATE_CHANGE
HEALTHY
INFO no Message: The IBM Storage Scale software update check feature is disabled.
Description: The IBM Storage Scale software update check feature is disabled. Enable call home by using the mmcallhome capability enable command, and set 'monitors_enabled = true' in the mmsysmonitor.conf file.
Cause: N/A
User Action: N/A
shared_root_acl_bad STATE_CHANGE
DEGRADED
WARNING no Message: Shared root ACLs not default.
Description: The ACLs of the CES shared root file system differ from the default in CCR. If these ACLs prohibit read access for rpc.statd, NFS does not work correctly.
Cause: The CES framework detects that the ACLs of the CES shared root file system are different from the default in CCR.
User Action: Verify that the user assigned to rpc.statd (such as rpcuser) has read access to the CES shared root file system.
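For example, read access for the rpc.statd user can be verified as follows; the mount point /gpfs/cesSharedRoot and the user name rpcuser are assumptions for illustration:
    # Inspect the ACL entries of the CES shared root directory
    mmgetacl /gpfs/cesSharedRoot
    # Confirm that the rpc.statd user can list the directory
    sudo -u rpcuser ls /gpfs/cesSharedRoot && echo "read access OK"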
shared_root_acl_good STATE_CHANGE
HEALTHY
INFO no Message: Default shared root ACLs.
Description: The CES shared root file system's ACLs are the default. These ACLs give read access to rpc.statd when default GPFS user settings are used.
Cause: N/A
User Action: N/A
shared_root_bad STATE_CHANGE
DEGRADED
WARNING no Message: Shared root is unavailable.
Description: The CES shared root file system is bad or not available. This file system is required to run the cluster because it stores cluster-wide information. This problem triggers a failover.
Cause: The CES framework detects the CES shared root file system to be unavailable on the node.
User Action: Check whether the CES shared root file system and other expected IBM Storage Scale file systems are mounted properly.
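For example, the configured shared root path and its mount state can be checked as follows; the file system name cesfs is a placeholder:
    # Show the configured CES shared root path
    mmlsconfig cesSharedRoot
    # Check on which nodes the containing file system is mounted
    mmlsmount cesfs -L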
shared_root_ok STATE_CHANGE
HEALTHY
INFO no Message: Shared root is available.
Description: The CES shared root file system is available. This file system is required to run the cluster because it stores cluster-wide information.
Cause: N/A
User Action: N/A
swapped_ok TIP INFO no Message: Swap memory usage is within the expected threshold.
Description: Swap memory usage is within the expected threshold.
Cause: N/A
User Action: N/A
swapped_warn TIP TIP no Message: Swap memory usage exceeds the expected threshold value {0}.
Description: Swap memory usage exceeds the expected threshold, which can significantly impact system performance.
Cause: Swap memory usage exceeds the expected threshold. Check swap usage by running the 'free' command.
User Action: Try to reduce memory usage, and check the /var/adm/ras/top_data_mem_*.log files to find the top memory-consuming processes. Also, consider lowering the vm.swappiness value by using the sysctl vm.swappiness=<value> command to reduce aggressive swapping.
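For example, swap usage can be inspected and swappiness lowered as follows; the value 10 and the sysctl.d file name are assumptions, not product recommendations:
    # Show current memory and swap usage
    free -m
    # Lower swappiness for the running system (value is an example)
    sysctl vm.swappiness=10
    # Persist the setting across reboots
    echo 'vm.swappiness=10' > /etc/sysctl.d/99-swappiness.conf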
test_call_home INFO ERROR service ticket Message: A test call home ticket is created.
Description: A test call home ticket is created.
Cause: ESS tooling triggered a test call home to verify that tickets can be created from this system.
User Action: No action is required if a service ticket is successfully created. Otherwise, check connectivity, entitlement, and related configuration.
tiebreaker_disks_ok TIP INFO no Message: The number of tiebreaker disks is as recommended.
Description: The number of tiebreaker disks is correct.
Cause: The number of tiebreaker disks is as recommended.
User Action: N/A
total_memory_ok TIP INFO no Message: The total memory configuration is OK.
Description: The total memory configuration is within the recommended range for CES nodes running protocol services.
Cause: The total memory configuration is OK.
User Action: N/A
total_memory_small TIP TIP no Message: The total memory is less than the recommended value.
Description: The total memory is less than the recommended value when CES protocol services are enabled.
Cause: The total memory is less than the recommendation for the currently enabled services, which is 128 GB if SMB is enabled, or 64 GB each for NFS and Object.
User Action: For more information on CES memory recommendations, see the 'Planning for protocols' topic in the Concepts, Planning, and Installation Guide.
unexpected_operating_system TIP TIP no Message: An unexpected operating system was detected.
Description: An unexpected operating system was detected. A 'clone' OS may affect the support you get from IBM.
Cause: An unexpected OS was detected.
User Action: For a list of supported operating systems and versions, see the documentation.
vm_max_map_count_low STATE_CHANGE
DEGRADED
WARNING no Message: The sysctl settings for vm.max_map_count is configured too low (should be >= {0}).
Description: The sysctl setting for vm.max_map_count is configured too low. When the dynamic pagepool feature is enabled, the optimal value for vm.max_map_count is >= total_memory/256 KiB.
Cause: The sysctl -n vm.max_map_count command returns a value that is lower than recommended. When the dynamic pagepool is used, the mmfsd process needs a high number of memory mappings. The Linux kernel imposes a configurable limit on memory mappings per process; this limit must be increased to avoid running out of available mappings.
User Action: Increase the sysctl vm.max_map_count value according to the recommendation: vm.max_map_count >= total_memory/256 KiB.
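For example, a recommended value can be derived from the installed memory and applied as follows; this sketch assumes the MemTotal value from /proc/meminfo (reported in KiB), and the sysctl.d file name is an example:
    # Compute total_memory/256 KiB from MemTotal (reported in KiB)
    total_kib=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
    recommended=$(( total_kib / 256 ))
    # Apply the value immediately and persist it across reboots
    sysctl -w vm.max_map_count=$recommended
    echo "vm.max_map_count=$recommended" > /etc/sysctl.d/99-mmfs-maxmap.conf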
vm_max_map_count_ok STATE_CHANGE
HEALTHY
INFO no Message: The sysctl value for vm.max_map_count is configured correctly (>= {0}).
Description: The sysctl value for vm.max_map_count is configured correctly. When the dynamic pagepool feature is enabled, the optimal value for vm.max_map_count is >= total_memory/256 KiB.
Cause: N/A
User Action: N/A
waitfor_verbsport INFO INFO no Message: Waiting for verbs ports to become active.
Description: verbsPortsWaitTimeout is enabled; the daemon waits for the verbs ports to become active.
Cause: N/A
User Action: N/A
waitfor_verbsport_done INFO INFO no Message: Waiting for verbs ports is done {0}.
Description: Waiting for verbs ports is done.
Cause: N/A
User Action: N/A
waitfor_verbsport_failed INFO ERROR no Message: Fail to startup because some IB ports or Ethernet devices, which in verbsPorts are inactive: {0}.
Description: verbsRdmaFailBackTCPIfNotAvailable is disabled, and some IB ports or Ethernet devices that are listed in verbsPorts are inactive.
Cause: IB ports or Ethernet devices are inactive.
User Action: Check the IB ports and Ethernet devices that are listed in the verbsPorts configuration. Increase verbsPortsWaitTimeout, or enable the verbsRdmaFailBackTCPIfNotAvailable configuration option.
waitfor_verbsport_ibstat_failed INFO ERROR no Message: verbsRdmaFailBackTCPIfNotAvailable is disabled but /usr/sbin/ibstat is not found.
Description: verbsRdmaFailBackTCPIfNotAvailable is disabled but /usr/sbin/ibstat is not found.
Cause: verbsRdmaFailBackTCPIfNotAvailable is disabled but /usr/sbin/ibstat is not found.
User Action: Install /usr/sbin/ibstat, or enable the verbsRdmaFailBackTCPIfNotAvailable configuration option.
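For example, the configuration options named in the two preceding events can be adjusted with mmchconfig, and ibstat is typically provided by the infiniband-diags package; the timeout value and the package manager are assumptions:
    # Fall back to TCP/IP when the RDMA ports are not available at startup
    mmchconfig verbsRdmaFailBackTCPIfNotAvailable=yes
    # Alternatively, wait longer for the verbs ports to become active (value is an example)
    mmchconfig verbsPortsWaitTimeout=120
    # Install ibstat if it is missing (RHEL-based systems)
    yum install infiniband-diags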