Monitoring thresholds by using the GUI
You can configure IBM Storage Scale to raise events when certain thresholds are reached. Use the Thresholds page in the GUI to define or modify thresholds for the data that is collected through the performance monitoring sensors.
- Warning level: When the data that is being monitored reaches the warning level, the system raises an event with severity Warning. When the observed value recovers past the current threshold level, the system removes the warning.
- Error level: When the data that is being monitored reaches the error level, the system raises an event with severity Error. When the observed value recovers past the current threshold level, the system removes the error state.

The following thresholds are predefined in the system:
- Inode utilization at the fileset level
- Datapool capacity utilization
- Metadata pool capacity utilization
- Free memory utilization
Apart from the predefined thresholds, you can also create user-defined thresholds for the data that is collected through the performance monitoring sensors.
You can use the Thresholds page in the GUI and the mmhealth command to manage both predefined and user-defined thresholds.
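For example, threshold rules can be inspected and removed from the command line. This is a minimal sketch; option details can vary by release, so verify them against the mmhealth man page:

```
# List all active threshold rules, both predefined and user-defined.
mmhealth thresholds list

# Delete a user-defined rule by its rule name (hypothetical name shown).
mmhealth thresholds delete myCapacity_rule
```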
Defining thresholds

You can specify the following details when you create or edit a threshold rule:

- Metric category: Lists all performance monitoring sensors that are enabled in the system, as well as thresholds that are derived by performing certain calculations on certain performance metrics. These derived thresholds are referred to as measurements. The measurements category provides the flexibility to edit certain predefined threshold rules. The following measurements are available for selection (a worked example follows the list):
- DataPool_capUtil
- Datapool capacity utilization, which is calculated as:
(sum(gpfs_pool_total_dataKB)-sum(gpfs_pool_free_dataKB))/sum(gpfs_pool_total_dataKB)
- DiskIoLatency_read
- Average time in milliseconds spent for a read operation on the physical disk. Calculated
as:
disk_read_time/disk_read_ios
- DiskIoLatency_write
- Average time in milliseconds spent for a write operation on the physical disk. Calculated
as:
disk_write_time/disk_write_ios
- Fileset_inode
- Inode capacity utilization at the fileset level. This is calculated
as:
(sum(gpfs_fset_allocInodes)-sum(gpfs_fset_freeInodes))/sum(gpfs_fset_maxInodes)
- FsLatency_diskWaitRd
- File system latency for the read operations. Average disk wait time per read operation on the
IBM Storage Scale
client.
sum(gpfs_fs_tot_disk_wait_rd)/sum(gpfs_fs_read_ops)
- FsLatency_diskWaitWr
- File system latency for the write operations. Average disk wait time per write operation on the
IBM Storage Scale client.
sum(gpfs_fs_tot_disk_wait_wr)/sum(gpfs_fs_write_ops)
- MemoryAvailable_percent
- Estimated available memory percentage, relative to the total RAM or 40 GB, whichever is lower. Calculated as:
- For the nodes that have less than 40 GB total memory
allocation:
(mem_memfree+mem_buffers+mem_cached)/mem_memtotal
- For the nodes that have equal to or greater than 40 GB memory
allocation:
(mem_memfree+mem_buffers+mem_cached)/40000000
- MetaDataPool_capUtil
- Metadata pool capacity utilization. This is calculated
as:
(sum(gpfs_pool_total_metaKB)-sum(gpfs_pool_free_metaKB))/sum(gpfs_pool_total_metaKB)
- NFSNodeLatency_read
- NFS read latency at the node level.
sum(nfs_read_lat)/sum(nfs_read_ops)
- NFSNodeLatency_write
- NFS write latency at the node level.
sum(nfs_write_lat)/sum(nfs_write_ops)
- SMBNodeLatency_read
- SMB read latency at the node level.
avg(op_time)/avg(op_count)
- SMBNodeLatency_write
- SMB write latency at the node level.
avg(op_time)/avg(op_count)
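As a worked example of how these measurements evaluate, consider the MemoryAvailable_percent rule on a hypothetical node with 128 GB of RAM that reports, at sampling time, 2000000 KB free, 1000000 KB in buffers, and 5000000 KB cached (illustrative values, not from a real system). Because the total memory is 40 GB or more, the denominator is capped at 40000000:

(2000000+1000000+5000000)/40000000 = 0.20, that is, 20% available memory

On a node with 16 GB of RAM, the same sample values would instead be divided by mem_memtotal: 8000000/16000000 = 0.50, or 50% available memory.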
- Metric name: Lists the performance metrics that are available under the selected performance monitoring sensor or measurement.
- Name: User-defined name of the threshold rule.
- Filter by: Defines the filter criteria for the threshold rule.
- Group by: Groups the threshold values by the selected grouping criteria. If you select a value in this field, you must select an aggregator criteria in the Aggregator field. By default, there is no grouping, which means that the thresholds are evaluated based on the finest available key.
- Warning level: Defines the threshold level for warning events to be raised for the selected metric. When the warning level is reached, the system raises an event with severity Warning. You can customize the warning message to specify the user action that is required to fix the issue.
- Error level: Defines the threshold level for error events to be raised for the selected metric. When the error level is reached, the system raises an event with severity Error. You can customize the error message to specify the user action that is required to fix the issue.
- Aggregator: When grouping is selected in the Group by field, an aggregator must be chosen to define the aggregation function. When the Rate aggregator is set, the grouping is automatically set to the finest available grouping.
- Downsampling: Defines the operation to be performed on the samples over the selected monitoring interval.
- Sensitivity: Defines the sample interval value. If a sensor is configured with an interval period greater than 5 minutes, the sensitivity is set to the same value as the sensor's period. The minimum value that is allowed is 120 seconds. If a sensor is configured with an interval period less than 120 seconds, the sensitivity is set to 120 seconds.
- Hysteresis: Defines the percentage by which the observed value must be under or over the current threshold level to switch back to the previous state. The default value is 0%. Hysteresis is used to avoid frequent state changes when the values are close to the threshold. For example, with a 5% hysteresis on a rule whose warning level is 100, the warning is removed only after the observed value recovers past 95 or 105, depending on the rule direction. The level needs to be set according to the volatility of the metric. A command-line sketch that maps these fields to mmhealth options follows this list.
- Direction: Defines whether the events and messages are sent when the value that is being monitored exceeds the threshold level or falls below it.
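The following is a minimal sketch of how a user-defined rule might be created with these options from the command line. The metric, rule name, and values are illustrative assumptions; check the mmhealth man page of your release for the exact option spellings:

```
# Watch free memory per node: warn at 1000000 KB, error at 500000 KB.
# Events fire when the value falls below the levels (direction low);
# 5% hysteresis avoids flapping when values hover near the thresholds.
mmhealth thresholds add mem_memfree \
  --name myFreeMemory_rule \
  --warnlevel 1000000 --errorlevel 500000 \
  --groupby node --sensitivity 300 \
  --hysteresis 5.0 --direction low
```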
You can also edit and delete a threshold rule.
Threshold configuration - A scenario
The user wants to configure a threshold rule to monitor the maximum disk capacity usage. The following table shows the value that is set in each field of the Create Threshold dialog and its respective function.
| GUI field | Value and function |
|---|---|
| Metric category | GPFSDiskCap. Specifies that the threshold rule is defined for the metrics that belong to the GPFSDiskCap sensor. |
| Metric name | Available capacity in full blocks. The threshold rule monitors the threshold levels of the available capacity. |
| Name | Total capacity threshold. By default, the performance monitoring metric name is used as the threshold rule name. Here, the default value is overwritten with "Total capacity threshold". |
| Filter by | Cluster. The values are filtered at the cluster level. |
| Group by | File system. Groups the selected metric by file system. |
| Aggregator | Minimum. When the available capacity reaches the minimum threshold level, the system raises an event. Selecting a different aggregator changes the nature of the threshold rule. |
| Downsampling | None. Specifies how the tested value is computed from all the available samples in the selected monitoring interval when the monitoring interval is greater than the sensor period. Here, no downsampling operation is applied. |
| Warning level | 10 GiB. The system raises an event with severity Warning when the available capacity reaches 10 GiB. |
| Error level | 9 GiB. The system raises an event with severity Error when the available capacity reaches 9 GiB. |
| Sensitivity | 24 hours. The threshold value is monitored once a day. |
| Hysteresis | 0. With no hysteresis, the warning state is removed as soon as the available capacity exceeds 10 GiB again, and the error state is removed as soon as it exceeds 9 GiB. |
| Direction | Low. When the value that is being monitored falls below the threshold limit, the system raises an event. |
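A rough command-line equivalent of this scenario is sketched below. The metric name gpfs_disk_free_fullkb, the group-by key, and the unit conversions are assumptions for illustration; consult the mmhealth documentation for the exact names in your release:

```
# 10 GiB = 10485760 KB warning, 9 GiB = 9437184 KB error,
# grouped per file system, evaluated once a day (86400 seconds).
mmhealth thresholds add gpfs_disk_free_fullkb:min \
  --name total_capacity_threshold \
  --warnlevel 10485760 --errorlevel 9437184 \
  --groupby gpfs_fs_name \
  --sensitivity 86400 --hysteresis 0.0 --direction low
```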