Predefined and user-defined thresholds

Threshold monitoring consists of the following two types of thresholds:

  • Predefined thresholds
  • User-defined thresholds

You can use the mmhealth thresholds list command to review all the threshold rules that are defined for a cluster. Both predefined and user-defined threshold rules can be removed by using the mmhealth thresholds delete command.

Predefined thresholds

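In a cluster, the following three types of thresholds are predefined and enabled automatically:

  • Thresholds monitoring the file system capacity usage
  • Thresholds monitoring the memory usage
  • Thresholds monitoring the number of SMB connections

A predefined threshold rule is deactivated in the following cases: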
  1. If no metric keys exist for the defined threshold rule in the performance monitoring tool metadata.
  2. If the sensor corresponding to the defined threshold rule is not enabled.

Thresholds monitoring the file system capacity usage

The capacity available for the file systems depends on the fullness of the file system's fileset-inode spaces and the capacity usage of each data or metadata pool. Therefore, the predefined capacity threshold limit for a file system is broken down into the following threshold rules:

  • Fileset-inode spaces that use the InodeCapUtil_Rule rule.
  • Data pool capacity that uses the DataCapUtil_Rule rule.
  • Metadata pool capacity that uses the MetaDataCapUtil_Rule rule.

The violation of any of these rules results in the parent file system receiving a capacity issue notification. The outcome of the file system capacity rules evaluation is included in the health status report of the FILESYSTEM component and can be reviewed by using the mmhealth node show filesystem command. For capacity usage rules, the default warn level is set to 80%, and the error level to 90%.
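The default levels can be illustrated with a minimal Python sketch. It is not the product's implementation: the function names, the level labels (OK, WARNING, ERROR), and the assumption that a level counts as violated once it is reached are all hypothetical. Only the 80% and 90% values and the three rule names come from the text above.

```python
CAPACITY_WARN_PERCENT = 80.0   # default warn level stated in the text
CAPACITY_ERROR_PERCENT = 90.0  # default error level stated in the text

_SEVERITY = {"OK": 0, "WARNING": 1, "ERROR": 2}  # illustrative labels only


def capacity_level(used: float, total: float) -> str:
    """Classify one capacity measurement (assumes a level is hit when reached)."""
    if total <= 0:
        raise ValueError("total must be positive")
    percent = used * 100.0 / total
    if percent >= CAPACITY_ERROR_PERCENT:
        return "ERROR"
    if percent >= CAPACITY_WARN_PERCENT:
        return "WARNING"
    return "OK"


def filesystem_level(rule_levels: dict[str, str]) -> str:
    """The worst level among the capacity rules is reported for the file system."""
    return max(rule_levels.values(), key=_SEVERITY.__getitem__, default="OK")


levels = {
    "InodeCapUtil_Rule": capacity_level(50, 100),
    "DataCapUtil_Rule": capacity_level(85, 100),
    "MetaDataCapUtil_Rule": capacity_level(10, 100),
}
print(filesystem_level(levels))
```

In this sketch, a single pool at 85% usage is enough to mark the parent file system, which mirrors the statement that a violation of any of the three rules results in a capacity issue notification.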

Since the file system capacity related thresholds are not node-specific, they are displayed on the current active threshold monitor node. For more information, see Use case 2: Observe the filesystem capacity usage by using default threshold rules.

Thresholds monitoring the memory usage

MemFree_Rule

The MemFree_Rule is a predefined threshold rule that monitors the free memory on each cluster node. The rule helps prevent a node from becoming unresponsive when memory is no longer available.

The memory-free usage rule is evaluated for each node in the cluster. The evaluation status is included in the node health status of each particular node. For the memory usage rule, the warn level is set to 100 MB, and the error level to 50 MB.

By default, the MemFree_Rule evaluates the estimated available memory in relation to the total memory allocation. For more information, see the MemoryAvailable_percent measurement definition in the mmhealth command section. For the new MemFree_Rule, only a WARNING threshold level is defined. The node is tagged with a WARNING status if the Memfree_util value drops below 5%.

For nodes that have 40 GB or more of total memory allocation, the available memory percentage is evaluated against a fixed value of 40 GB. Because 5% of 40 GB is 2 GB, this evaluation prevents nodes that have more than 2 GB of free memory from sending warning messages.
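The effect of the 40 GB cap can be checked with a short Python sketch. The formula is an assumption that is inferred from the description above and is not the exact MemoryAvailable_percent definition. The function names are hypothetical, and GB is treated as a decimal unit because the text does not specify it.

```python
GB = 1000 ** 3                 # assumption: decimal gigabytes
MEMFREE_CAP_BYTES = 40 * GB    # fixed reference value for large nodes (from the text)
MEMFREE_WARN_PERCENT = 5.0     # WARNING level (from the text)


def mem_available_percent(available_bytes: int, total_bytes: int) -> float:
    """Available memory as a percentage of total memory, capped at 40 GB."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    reference = min(total_bytes, MEMFREE_CAP_BYTES)
    return available_bytes * 100.0 / reference


def memfree_warning(available_bytes: int, total_bytes: int) -> bool:
    """True if the node would be tagged WARNING under the assumed formula."""
    return mem_available_percent(available_bytes, total_bytes) < MEMFREE_WARN_PERCENT


# A node with 256 GB of total memory and 3 GB available: 3 / 40 = 7.5%
print(mem_available_percent(3 * GB, 256 * GB), memfree_warning(3 * GB, 256 * GB))
# The same node with 1 GB available: 1 / 40 = 2.5%
print(mem_available_percent(1 * GB, 256 * GB), memfree_warning(1 * GB, 256 * GB))
```

Without the cap, the 256 GB node would fall below 5% as soon as less than 12.8 GB is available. With the cap, the warning appears only when less than 2 GB is available.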

Note: For IBM Storage Scale 5.0.4, the default MemFree_Rule is replaced automatically. The customer-created rules remain unchanged.

AFMInQueue_Rule

The AFMInQueue_Rule is a predefined threshold rule that monitors the AFM gateway in-queue memory usage. The AFMInQueue_Rule value must be set to 40-50% of the available memory on the gateway node, which is considered to be a dedicated gateway node. If the value of the AFMInQueue_Rule rule is not defined, then its default value is set to 8 GiB.

The AFMInQueue_Rule sets the warning level at 80% of the assigned memory and the error level at 90% of the assigned memory. When either level is reached or exceeded, an mmhealth event is raised. The mmhealth event can be viewed in the IBM Storage Scale GUI or on the CLI by using the mmhealth command.
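The sizing guidance and the two levels reduce to simple arithmetic, which the following Python sketch works through. The function names are hypothetical, and the sketch assumes that the caller supplies the available memory of a dedicated gateway node in bytes. Only the 40-50% range, the 80% and 90% levels, and the 8 GiB default come from the text.

```python
GIB = 1024 ** 3
DEFAULT_ASSIGNED_BYTES = 8 * GIB  # default when no value is defined (from the text)


def recommended_assignment_range(gateway_available_bytes: int) -> tuple[int, int]:
    """Return the 40% and 50% marks of the gateway node's available memory."""
    return (gateway_available_bytes * 40 // 100,
            gateway_available_bytes * 50 // 100)


def inqueue_levels(assigned_bytes: int = DEFAULT_ASSIGNED_BYTES) -> tuple[int, int]:
    """Return the (warning, error) levels as 80% and 90% of the assigned memory."""
    return (assigned_bytes * 80 // 100, assigned_bytes * 90 // 100)


low, high = recommended_assignment_range(100 * GIB)
print(low // GIB, high // GIB)           # range for a gateway with 100 GiB available
warn, error = inqueue_levels(high)
print(warn // GIB, error // GIB)         # levels when the upper end of the range is assigned
```

Integer arithmetic is used so that byte values stay exact. With the 8 GiB default, the same calculation places the warning level at roughly 6.4 GiB and the error level at roughly 7.2 GiB.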

If mmhealth events are raised, take the following steps to resolve the issue:

  • Check and fix any issues with the network connectivity and bandwidth throughput.
  • Adjust the AFM afmHardMemThreshold configuration to be within 40-50% of the available memory on the gateway node.

    Note: Ensure that the page pool setting on the AFM gateway is low to prevent a potential out-of-memory state.
  • Check whether the network throughput from all AFM gateway nodes is properly balanced.
  • Add more AFM gateway nodes to help handle the workload.

For more information, see the General recommendations for AFM gateway node configuration section.

Thresholds monitoring the number of SMB connections

IBM Storage Scale can host a maximum of 3,000 SMB connections per protocol node and not more than 20,000 SMB connections across all protocol nodes. This threshold monitors the following information:

  • Number of SMB connections on each protocol node by using the SMBConnPerNode_Rule rule
  • Number of SMB connections across all protocol nodes by using the SMBConnTotal_Rule rule

SMBConnPerNode_Rule

The rule compares the count of SMB connections on each protocol node with the allowed maximum and is evaluated for each protocol node in the cluster. The evaluation status is included in the node health status of each protocol node entity.

SMBConnTotal_Rule

The rule monitors the sum of all SMB connections in the cluster and checks that it does not exceed 20,000. The evaluation status is reported to the node that has the ACTIVE THRESHOLD MONITOR role.

For more information about SMB connection limitations, see Planning for SMB and SMB limitations.
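The two SMB checks can be sketched in Python as follows. This is an illustration under stated assumptions and not the product's evaluation logic: the function name and the input format (a mapping from protocol node name to connection count) are hypothetical. The limits are the ones stated at the start of this section, 3,000 connections per protocol node and 20,000 across all protocol nodes.

```python
MAX_SMB_PER_NODE = 3000   # per protocol node limit (from the text)
MAX_SMB_TOTAL = 20000     # limit across all protocol nodes (from the text)


def check_smb_connections(per_node: dict[str, int]) -> tuple[list[str], bool]:
    """Return the nodes over the per-node limit and whether the total is exceeded."""
    nodes_over_limit = sorted(
        node for node, count in per_node.items() if count > MAX_SMB_PER_NODE
    )
    total_exceeded = sum(per_node.values()) > MAX_SMB_TOTAL
    return nodes_over_limit, total_exceeded


# Hypothetical node names and counts
print(check_smb_connections({"prot1": 3000, "prot2": 3001}))
```

The two checks are independent. Seven nodes with 2,900 connections each stay under the per-node limit but exceed the cluster-wide limit, which is why the per-node result belongs to each protocol node while the total is reported once for the cluster.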

User-defined thresholds

You can create individual thresholds for all metrics that are collected through the performance monitoring sensors. You can use the mmhealth thresholds add command to create a new threshold rule.

If multiple threshold rules have overlapping entities for the same metrics, then only one of the concurrent rules becomes active. Every rule gets a priority rank number, and the highest possible rank is 1. The rank is based on the metric's maximum number of filtering levels and the filter granularity that is specified in the rule. As a result, a rule that monitors a specific entity or set of entities gets a high priority and performs the threshold evaluation and status update for that entity or set of entities. A less specific rule, such as one that is valid for all entities, is disabled for that entity or set of entities. For example, a threshold rule that is applicable to a single file system takes precedence over a rule that is applicable to several or all the file systems. For more information, see Use case 4: Create threshold rules for specific filesets.
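The precedence of specific rules over general ones can be modeled with a small Python sketch. It is a simplified model and not the actual ranking algorithm, which the text does not spell out. The number of filters in a rule stands in for filter granularity, and the Rule class, the active_rule function, the metric name, and the filter key are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rule:
    name: str
    metric: str
    filters: dict = field(default_factory=dict)  # empty dict: valid for all entities


def matches(rule: Rule, metric: str, entity: dict) -> bool:
    """A rule applies if the metric matches and every filter matches the entity."""
    return rule.metric == metric and all(
        entity.get(key) == value for key, value in rule.filters.items()
    )


def active_rule(rules: list[Rule], metric: str, entity: dict) -> Optional[Rule]:
    """Pick the most specific applicable rule; less specific ones are not used."""
    candidates = [rule for rule in rules if matches(rule, metric, entity)]
    if not candidates:
        return None
    # Assumption: more filters means finer granularity and a higher priority.
    return max(candidates, key=lambda rule: len(rule.filters))


rules = [
    Rule("all_filesystems", "capacity_util"),
    Rule("only_fs1", "capacity_util", {"filesystem": "fs1"}),
]
print(active_rule(rules, "capacity_util", {"filesystem": "fs1"}).name)
print(active_rule(rules, "capacity_util", {"filesystem": "fs2"}).name)
```

In this model, the rule for a single file system evaluates fs1, while the general rule continues to evaluate every other file system.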