Measuring I/O response time

The key to finding and fixing I/O-related performance problems is DASD response time: the length of time it takes to complete an I/O operation. Response time can have a dramatic effect on performance, particularly with online and interactive subsystems, such as CMS.

The following figure illustrates how DASD response time is defined.

Figure 1. DASD response-time components

DASD response time is the elapsed time from the DIAGNOSE instruction that requests the I/O, which precedes the start subchannel (SSCH) instruction, to the completion of the data transfer, which is indicated by a channel end/device end (CE/DE) interrupt. It includes any queue time plus the actual I/O operation. Service time is the elapsed time from the successful SSCH instruction to the completion of the data transfer. It includes seek time, any rotational delays, and data transfer time. Service time plus queue time equals response time.
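The relationship between these times can be sketched in a few lines of Python. This is only an illustration: the `DasdTimes` class and the sample values are hypothetical, and the sketch assumes that service time is the sum of the pending, disconnect, and connect components described in the list that follows.

```python
from dataclasses import dataclass


@dataclass
class DasdTimes:
    """Per-I/O time components in milliseconds (hypothetical container)."""
    queue_ms: float
    pending_ms: float
    disconnect_ms: float
    connect_ms: float

    @property
    def service_ms(self) -> float:
        # Assumption: service time = pending + disconnect + connect.
        return self.pending_ms + self.disconnect_ms + self.connect_ms

    @property
    def response_ms(self) -> float:
        # From the text: service time plus queue time equals response time.
        return self.queue_ms + self.service_ms


# Made-up sample values, chosen only to show the arithmetic.
sample = DasdTimes(queue_ms=2.0, pending_ms=0.5, disconnect_ms=8.0, connect_ms=3.5)
print(f"service={sample.service_ms:.1f} ms, response={sample.response_ms:.1f} ms")
```

With these sample values, service time is 12.0 ms and response time is 14.0 ms.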

The above figure shows these DASD response-time components:

Queue wait time
This is the internal VM queueing of the I/O operations that are waiting for a previous I/O to the device to complete. Queue time represents the time spent waiting on a device. Delays in the other service components may cause the queue component to increase, or the queue time may be a function of skewed arrival of I/Os from a particular application.
Pending time
This is the time from the start of the I/O until the DASD receives it. Pending time reflects channel path usage: a high pending time indicates that the channel or logical control unit is busy. It can be caused by busy channels and controllers, or by a device that is busy with I/O from another system.

If a device is behind a cache controller (non-3990), pending time can also be caused by cache staging (the device is busy during the staging operation). When using nonenhanced dual copy (3990), the device is busy while writing the data to the primary volume and to the duplex volume if fast copy was not selected.

Disconnect time
Disconnect time includes:
  • The time for a seek operation.
  • Latency, always assumed to be half a revolution of the device.
  • Rotational position sensing (RPS) reconnect delay, the time for the set sector operation to reconnect to the channel path.

    This time depends on internal path busy, control unit busy, and channel path busy. If any element in the path is busy, a delay of one revolution of the device is experienced.

  • For a 3990 Model 3 cache controller, if the record is not in cache, the time spent waiting until staging completes for the previous I/O to the device, or until one of the four lower interfaces, busy transferring, staging, or destaging data for other devices, becomes available.

When a device cannot reconnect to the host to transfer data because all paths are busy, it must wait for another revolution.
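The rotational part of disconnect time can be estimated with simple arithmetic. The following is a minimal sketch, not a measurement method: the function names are hypothetical, and the rotation speed of 3600 RPM is an assumed value used only for illustration. It applies the two rules stated above: latency is taken as half a revolution, and each failed reconnect costs one more full revolution.

```python
def revolution_ms(rpm: float) -> float:
    """Time for one full revolution of the device, in milliseconds."""
    return 60_000.0 / rpm


def rotational_delay_ms(rpm: float, rps_misses: int = 0) -> float:
    """Latency (half a revolution) plus one revolution per RPS miss."""
    rev = revolution_ms(rpm)
    return rev / 2 + rps_misses * rev


RPM = 3600  # assumed rotation speed, for illustration only
for misses in (0, 1, 2):
    print(f"{misses} RPS miss(es): {rotational_delay_ms(RPM, misses):.2f} ms")
```

Under this assumption, a single missed reconnect triples the rotational delay (from half a revolution to one and a half), which is why busy paths show up so clearly in disconnect time.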

Using cache control units reduces or eliminates disconnect time. Disconnect time is used as a measurement of cache effectiveness.

Connect time
This is the time actually spent transferring data between the channel and the DASD or between the channel and the cache. Connect time can also include the time during which a search operation takes place between the channel and the DASD or the channel and the cache (usually a directory search to find the location of a program module so that it can be loaded into storage).

A high connect time indicates that you are using large blocks to transfer the data. This can be a problem if you mix small blocks and large blocks. The small blocks may have to wait on the larger ones to complete, thus causing a delay.
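As a rough aid for reading these components together, the sketch below picks the largest component of a response time and prints the matching hint from the descriptions above. The `HINTS` table, the `dominant_component` function, and the sample values are hypothetical. The hints are paraphrases of this section, not output from any monitoring tool.

```python
# Hints paraphrased from the component descriptions in this section.
HINTS = {
    "queue": "I/Os are waiting on the device; check the other components or skewed arrival",
    "pending": "channel or logical control unit busy, or device busy from another system",
    "disconnect": "seek, latency, or RPS reconnect delay; cache may not be effective",
    "connect": "large block transfers or search operations",
}


def dominant_component(times_ms: dict[str, float]) -> tuple[str, float]:
    """Return the largest component and its share of total response time."""
    total = sum(times_ms.values())
    if total <= 0:
        raise ValueError("response time must be positive")
    name = max(times_ms, key=times_ms.get)
    return name, times_ms[name] / total


# Made-up sample values in milliseconds.
sample = {"queue": 1.0, "pending": 0.5, "disconnect": 6.0, "connect": 2.5}
name, share = dominant_component(sample)
print(f"{name} is {share:.0%} of response time: {HINTS[name]}")
```

The largest component is only a starting point. As noted under queue wait time, delays in the other components can themselves cause queue time to grow.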