Performance metrics

IBM Spectrum Control can collect many different performance metrics, which indicate the particular performance characteristics of monitored resources.

About this task

Two important metrics for storage systems are throughput, in I/O operations per second, and response time, in milliseconds. Throughput is measured and reported at several levels:
  • The entire storage system
  • Each cluster
  • Each controller (Example: DS8000)
  • Each I/O Group (Example: storage systems that run IBM Storage Virtualize)
Throughput is also measured at the following points:
  • Each volume (or LUN)
  • The Fibre Channel interfaces (ports) on some storage systems
  • Fibre Channel switches
  • The RAID array, after cache hits are filtered out

For storage systems, the performance statistics are separated into front-end I/O metrics and back-end I/O metrics. Front-end I/O metrics measure the traffic between the servers and the storage systems. Back-end I/O metrics measure all traffic between the storage system cache and the disks in the RAID arrays. Most storage systems provide metrics for both kinds of I/O operations. It is important to know whether a throughput or response time value is measured at the front end (close to the system-level response time as measured from a server) or at the back end (between the cache and the disks).

The main front-end throughput metrics are:
  • Total I/O rate (overall)
  • Read I/O rate (overall)
  • Write I/O rate (overall)
The corresponding front-end response time metrics are:
  • Overall response time
  • Read response time
  • Write response time
The main back-end throughput metrics are:
  • Total back-end I/O rate (overall)
  • Back-end read I/O rate (overall)
  • Back-end write I/O rate (overall)
The corresponding back-end response time metrics are:
  • Overall back-end response time
  • Back-end read response time
  • Back-end write response time

For planning purposes, it is important to track any growth or change in the rates and response times. I/O rates often grow over time, and response times tend to increase as I/O rates increase. This relationship is the basis of capacity planning: by tracking both trends, you can project when additional storage performance, as well as capacity, is required.
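The projection described here can be sketched as a simple linear trend fit. This is only an illustration, assuming that I/O rate samples have already been exported as (day, value) pairs and that growth is roughly linear; the function name `project_limit_day` and the sample values are hypothetical and are not part of IBM Spectrum Control.

```python
def project_limit_day(samples, limit):
    """Fit a least-squares line to (day, value) samples and return the day
    on which the fitted line reaches `limit`.

    Returns None when the fitted trend is flat or decreasing.
    Assumes roughly linear growth, which is a simplification.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are required")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must cover more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    return (limit - intercept) / slope


# Hypothetical total I/O rate (ops/s), sampled every 30 days.
io_rate_samples = [(0, 1000), (30, 1300), (60, 1600), (90, 1900)]
print(project_limit_day(io_rate_samples, limit=3000))  # prints 200.0
```

The same fit can be applied to response times. A real workload is rarely perfectly linear, so treat the projected day as a rough planning estimate.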

Depending on the particular storage environment, throughput or response times might change drastically from hour to hour or day to day. There might be periods when the values fall outside the expected range. In that case, other performance metrics can be used to understand what is happening. Here are some additional metrics that can be used to make sense of throughput and response times:
  • Total cache hit percentage
  • Read cache hit percentage
  • Write-cache delay percentage (previously known as NVS full percentage)
  • Read transfer size (KB/operation)
  • Write transfer size (KB/operation)

Low cache hit percentages can drive up response times, because a cache miss requires access to back-end storage. Low hit percentages also tend to increase the utilization percentage of the back-end storage, which might adversely affect the back-end throughput and response times. A high write-cache delay percentage can drive up the write response times. High transfer sizes typically indicate more of a batch workload, in which case the overall data rates are more important than the I/O rates and the response times.
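To make these relationships concrete, the following sketch uses a simplified weighted-average model of read response time and derives transfer size from a data rate and an I/O rate. The model, the function names, and the latency figures are assumptions chosen for illustration; they do not describe how any storage system or IBM Spectrum Control calculates its metrics.

```python
def estimated_read_response_ms(hit_pct, cache_ms, backend_ms):
    """Weighted average of cache-hit and cache-miss service times.

    Simplified model: every hit costs `cache_ms` and every miss costs
    `backend_ms`. Queuing effects are ignored.
    """
    if not 0 <= hit_pct <= 100:
        raise ValueError("hit_pct must be between 0 and 100")
    hit = hit_pct / 100
    return hit * cache_ms + (1 - hit) * backend_ms


def transfer_size_kb(data_rate_kb_s, io_rate_ops_s):
    """Average transfer size in KB per operation."""
    if io_rate_ops_s <= 0:
        raise ValueError("io_rate_ops_s must be positive")
    return data_rate_kb_s / io_rate_ops_s


# Hypothetical latencies: 0.2 ms for a cache hit, 6 ms for a back-end read.
for hit_pct in (90, 50):
    estimate = estimated_read_response_ms(hit_pct, cache_ms=0.2, backend_ms=6.0)
    print(hit_pct, round(estimate, 2))

# 102400 KB/s at 400 ops/s averages 256 KB per operation,
# which suggests a batch-style workload.
print(transfer_size_kb(102400, 400))
```

With these assumed latencies, dropping from a 90% to a 50% hit percentage roughly quadruples the estimated read response time, which is the effect the paragraph above describes.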

All these metrics can be monitored through lists, charts, and reports in IBM Spectrum Control. Thresholds can also be defined for many of these metrics. Some examples of supported thresholds are:
  • Total I/O rate and total data rate thresholds
  • Total back-end I/O rate and total back-end data rate thresholds
  • Read back-end response time and write back-end response time thresholds
  • Total port I/O rate (packet rate) and data rate thresholds
  • Overall port response time threshold
  • Port send utilization percentage and port receive utilization percentage thresholds
  • Port send bandwidth percentage and port receive bandwidth percentage thresholds

For Fibre Channel switches, the important metrics are total port packet rate and total port data rate, which show the traffic pattern over a particular switch port. Port bandwidth percentage metrics are also important because they indicate bandwidth usage relative to port speed. When frames are lost between the host and the switch port, or between the switch port and a storage device, the dumped frame rate on the port can be monitored.
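As a rough illustration of a bandwidth percentage, the sketch below divides a port data rate by the nominal throughput for the port speed. The nominal per-direction figures in the table are commonly quoted Fibre Channel values and are treated here as assumptions; the function name is hypothetical, and the exact calculation that IBM Spectrum Control uses might differ.

```python
# Assumed nominal per-direction throughput in MB/s, keyed by port speed in Gbps.
NOMINAL_MB_PER_S = {4: 400, 8: 800, 16: 1600, 32: 3200}


def port_bandwidth_percent(data_rate_mb_s, port_speed_gbps):
    """Approximate port bandwidth usage as a percentage of nominal throughput."""
    if port_speed_gbps not in NOMINAL_MB_PER_S:
        raise ValueError(f"no nominal throughput known for {port_speed_gbps} Gbps")
    return 100.0 * data_rate_mb_s / NOMINAL_MB_PER_S[port_speed_gbps]


# A port that sends 600 MB/s on an 8 Gbps link.
print(port_bandwidth_percent(600, 8))  # prints 75.0
```

Send and receive directions would be evaluated separately, because each direction has its own data rate.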

The key practices are:
  • Monitor the throughput and response time patterns over time for your environment
  • Develop an understanding of expected behaviors
  • Investigate deviations from normal patterns as early warning signs of problems
  • Track the trend of workload changes
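
One way to turn an understanding of expected behavior into a check is to compare recent samples against a baseline built from history. The sketch below flags values that lie more than a chosen number of standard deviations from the historical mean. The function name, the default of three standard deviations, and the sample values are assumptions for illustration only.

```python
from statistics import mean, stdev


def find_deviations(history, recent, k=3.0):
    """Return (index, value) pairs from `recent` that differ from the mean
    of `history` by more than `k` sample standard deviations.

    `history` needs at least two values so that a standard deviation exists.
    """
    baseline = mean(history)
    spread = stdev(history)
    return [
        (index, value)
        for index, value in enumerate(recent)
        if abs(value - baseline) > k * spread
    ]


# Hypothetical overall response times in milliseconds.
history_ms = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0]
recent_ms = [5.1, 5.3, 9.7, 4.9]
print(find_deviations(history_ms, recent_ms))  # prints [(2, 9.7)]
```

Because workloads often follow daily or weekly cycles, a separate baseline for each hour of the day or day of the week would usually fit better than a single global one.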