Monitoring thresholds by using GUI

You can configure IBM Storage Scale to raise events when certain thresholds are reached. Use the Monitoring > Thresholds page to define or modify thresholds for the data that is collected through the performance monitoring sensors.

You can set the following two types of threshold levels for data that is collected through performance monitoring sensors:
Warning level
When the data that is being monitored reaches the warning level, the system raises an event with severity Warning. When the observed value returns within the threshold level, the system removes the warning.
Error level
When the data that is being monitored reaches the error level, the system raises an event with severity Error. When the observed value returns within the threshold level, the system removes the error state.
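The two severity levels behave like a simple state classification. The following sketch is illustrative only (the function and its names are not part of the product API); it assumes a metric that is monitored in the "high" direction, with the mirrored "low" direction shown for completeness:

```python
def evaluate(observed, warn_level, error_level, direction="high"):
    """Classify an observed metric value against warning and error levels.

    direction="high": an event is raised when the value reaches a level
    from below. direction="low" is the mirror case (for example, free
    capacity falling to a level).
    """
    if direction == "high":
        if observed >= error_level:
            return "error"
        if observed >= warn_level:
            return "warning"
    else:  # direction == "low"
        if observed <= error_level:
            return "error"
        if observed <= warn_level:
            return "warning"
    return "healthy"

# A utilization-style metric monitored in the "high" direction:
print(evaluate(0.85, warn_level=0.80, error_level=0.90))  # warning
```

Because the error level is checked first, a value that crosses both levels reports only the more severe Error state.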
Certain types of thresholds are predefined in the system. The following predefined thresholds are available:
  • Inode utilization at the fileset level
  • Datapool capacity utilization
  • Metadata pool capacity utilization
  • Free memory utilization

Apart from the predefined thresholds, you can also create user-defined thresholds for the data that is collected through the performance monitoring sensors.

You can use the Monitoring > Thresholds page in the GUI and the mmhealth command to manage both predefined and user-defined thresholds.

Defining thresholds

Use the Create Thresholds option to define user-defined thresholds or to modify the predefined thresholds. You can use the Use as Template option that is available in the Actions menu to use an already defined threshold as the template to create a threshold. You can specify the following details in a threshold rule:
  • Metric category: Lists all performance monitoring sensors that are enabled in the system, and thresholds that are derived by performing calculations on performance metrics. These derived thresholds are referred to as measurements. The measurements category provides the flexibility to edit certain predefined threshold rules. The following measurements are available for selection:
    DataPool_capUtil
    Datapool capacity utilization, which is calculated as:

    (sum(gpfs_pool_total_dataKB)-sum(gpfs_pool_free_dataKB))/sum(gpfs_pool_total_dataKB)

    DiskIoLatency_read
    Average time in milliseconds spent for a read operation on the physical disk. Calculated as:

    disk_read_time/disk_read_ios

    DiskIoLatency_write
    Average time in milliseconds spent for a write operation on the physical disk. Calculated as:

    disk_write_time/disk_write_ios

    Fileset_inode
    Inode capacity utilization at the fileset level. This is calculated as:

    (sum(gpfs_fset_allocInodes)-sum(gpfs_fset_freeInodes))/sum(gpfs_fset_maxInodes)

    FsLatency_diskWaitRd
    File system latency for read operations; the average disk wait time per read operation on the IBM Storage Scale client. Calculated as:

    sum(gpfs_fs_tot_disk_wait_rd)/sum(gpfs_fs_read_ops)

    FsLatency_diskWaitWr
    File system latency for write operations; the average disk wait time per write operation on the IBM Storage Scale client. Calculated as:

    sum(gpfs_fs_tot_disk_wait_wr)/sum(gpfs_fs_write_ops)

    MemoryAvailable_percent
    Estimated available memory percentage. Calculated as:
    • For the nodes that have less than 40 GB total memory allocation:

      (mem_memfree+mem_buffers+mem_cached)/mem_memtotal

    • For the nodes that have equal to or greater than 40 GB memory allocation:

      (mem_memfree+mem_buffers+mem_cached)/40000000

    MetaDataPool_capUtil
    Metadata pool capacity utilization. This is calculated as:

    (sum(gpfs_pool_total_metaKB)-sum(gpfs_pool_free_metaKB))/sum(gpfs_pool_total_metaKB)

    NFSNodeLatency_read
    NFS read latency at the node level.

    sum(nfs_read_lat)/sum(nfs_read_ops)

    NFSNodeLatency_write
    NFS write latency at the node level.

    sum(nfs_write_lat)/sum(nfs_write_ops)

    SMBNodeLatency_read
    SMB read latency at the node level.

    avg(op_time)/avg(op_count)

    SMBNodeLatency_write
    SMB write latency at the node level.

    avg(op_time)/avg(op_count)

  • Metric name: The list of performance metrics that are available under the selected performance monitoring sensor or the measurement.
  • Name: User-defined name of the threshold rule.
  • Filter by: Defines the filter criteria for the threshold rule.
  • Group by: Groups the threshold values by the selected grouping criteria. If you select a value in this field, you must select an aggregator criteria in the Aggregator field. By default, there is no grouping, which means that the thresholds are evaluated based on the finest available key.
  • Warning level: Defines the threshold level for warning events to be raised for the selected metric. When the warning level is reached, the system raises an event with severity Warning. You can customize the warning message to specify the user action that is required to fix the issue.
  • Error level: Defines the threshold level for error events to be raised for the selected metric. When the error level is reached, the system raises an event with severity Error. You can customize the error message to specify the user action that is required to fix the issue.
  • Aggregator: When grouping is selected in the Group by field, an aggregator must be chosen to define the aggregation function. When the Rate aggregator is set, the grouping is automatically set to the finest available grouping.
  • Downsampling: Defines the operation to be performed on the samples over the selected monitoring interval.
  • Sensitivity: Defines the sample interval value. If a sensor is configured with an interval period greater than 5 minutes, the sensitivity is set to the same value as the sensor period. The minimum value that is allowed is 120 seconds. If a sensor is configured with an interval period less than 120 seconds, the sensitivity is set to 120 seconds.
  • Hysteresis: Defines the percentage of the observed value that must be under or over the current threshold level to switch back to the previous state. The default value is 0%. Hysteresis is used to avoid frequent state changes when the values are close to the threshold. The level needs to be set according to the volatility of the metric.
  • Direction: Defines whether the events and messages are sent when the value that is being monitored exceeds or falls below the threshold level.
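The interplay of the Hysteresis and Direction fields can be sketched as follows (illustrative code under assumed semantics, not the product's implementation): an active warning or error state is cleared only when the observed value moves back past the threshold level by at least the hysteresis percentage.

```python
def should_clear(observed, level, hysteresis_pct=0.0, direction="high"):
    """Return True when an active warning/error state can be cleared.

    With direction="high" the state was raised because the value reached
    `level`; it clears only once the value drops below `level` minus the
    hysteresis margin. direction="low" is the mirror case.
    """
    margin = level * hysteresis_pct / 100.0
    if direction == "high":
        return observed < level - margin
    return observed > level + margin

# With 5% hysteresis on a level of 100, a "high"-direction state
# clears only when the value drops below 95:
print(should_clear(97, level=100, hysteresis_pct=5))  # False
print(should_clear(94, level=100, hysteresis_pct=5))  # True
```

With the default hysteresis of 0%, the state clears as soon as the value crosses back over the threshold level, which can cause frequent state changes for volatile metrics.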

You can also edit and delete a threshold rule.
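As a concrete example of a measurement, the MemoryAvailable_percent calculation described earlier can be sketched as follows. The helper function and its 40 GB cap interpretation are illustrative; metric values are assumed to be in KB, as reported by the performance monitoring sensors:

```python
CAP_KB = 40_000_000  # the 40 GB cap applied on large-memory nodes

def memory_available_percent(mem_memfree, mem_buffers, mem_cached, mem_memtotal):
    """Estimated available memory, as a percentage of total RAM or of
    the 40 GB cap, whichever is lower. All inputs are in KB."""
    available = mem_memfree + mem_buffers + mem_cached
    denominator = min(mem_memtotal, CAP_KB)
    return 100.0 * available / denominator

# A node with 16 GB total memory and about 6 GB free/buffered/cached:
print(memory_available_percent(4_000_000, 1_000_000, 1_000_000, 16_000_000))  # 37.5
```

On a node with more than 40 GB of RAM, the same amount of available memory yields a higher percentage than a simple fraction of total RAM would, because the denominator is capped.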

Threshold configuration - A scenario

The user wants to configure a threshold rule to monitor the maximum disk capacity usage. The following table shows the values against each field of the Create Threshold dialog and their respective functionality.

Table 1. Threshold rule configuration - A sample scenario
GUI fields Value and Function
Metric Category GPFSDiskCap

Specifies that the threshold rule is going to be defined for the metrics that belong to the GPFSDiskCap sensor.

Metric name Available capacity in full blocks

The threshold rule is going to be defined to monitor the threshold levels of available capacity.

Name Total capacity threshold

By default, the performance monitoring metric name is used as the threshold rule name. Here, the default value is overwritten with "Total capacity threshold".

Filter by Cluster

The values are filtered at the cluster level.

Group by File system

Groups the selected metric by file system.

Aggregator Minimum
When the minimum available capacity reaches the threshold level, the system raises an event. If one of the following values is selected instead, the nature of the threshold rule changes:
  • Sum: When the sum of the metric values exceeds the threshold levels, the system raises the events.
  • Average: When the average of the metric values exceeds the threshold levels, the system raises the events.
  • Maximum: When the maximum value exceeds the threshold levels, the system raises the events.
  • Minimum: When the minimum value exceeds or falls below the threshold levels, the system raises the events.
  • Rate: When the rate of change exceeds the threshold value, the system raises the events. Rate is supported only for the finest Group by clause; a rate for a partial key is not supported. That is, when Rate is selected, the system automatically selects the finest possible values in the grouping field.
Downsampling None

Specifies how the tested value is computed from all the available samples in the selected monitoring interval, if the monitoring interval is greater than the sensor period:

  • None: The values are averaged.
  • Sum: The sum of all values is computed.
  • Minimum: The minimum value is selected.
  • Maximum: The maximum value is selected.
Warning level 10 GiB

The system raises an event with severity Warning when the available capacity drops to 10 GiB.

Error level 9 GiB

The system raises an event with severity Error when the available capacity drops to 9 GiB.

Sensitivity 24 hours

The threshold value is monitored once a day.

Hysteresis 0

With 0% hysteresis, the warning state is removed as soon as the available capacity rises back above 10 GiB. Similarly, the error state is removed as soon as the available capacity rises back above 9 GiB.

Direction Low

When the value that is being monitored falls below the threshold limit, the system raises an event.
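The scenario above can be checked with a short sketch (illustrative only; the function and the sample file systems are hypothetical): with Direction set to Low, a warning level of 10 GiB, and an error level of 9 GiB, the daily minimum available capacity per file system maps to a state as follows.

```python
GIB = 1024 ** 3

def capacity_state(available_bytes, warn=10 * GIB, error=9 * GIB):
    """Direction 'Low': an event is raised when the value falls to a level."""
    if available_bytes <= error:
        return "error"
    if available_bytes <= warn:
        return "warning"
    return "healthy"

# Daily minimum available capacity per file system (Group by: File system,
# Aggregator: Minimum, Sensitivity: 24 hours); sample values are invented:
for fs, avail in {"fs1": 50 * GIB, "fs2": 9.5 * GIB, "fs3": 8 * GIB}.items():
    print(fs, capacity_state(avail))
# fs1 healthy, fs2 warning, fs3 error
```

Because the rule groups by file system, each file system is evaluated and reported independently; fs2 raises only a Warning event while fs3 raises an Error event.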