Why Low I/O Rates Can Result in High Response Times for Reads and Writes
As IBM Storage Insights and Storage Insights Pro become more widely adopted, many companies that weren't previously doing performance monitoring can now see the performance of their managed storage systems. With the alerting features in Storage Insights Pro, companies are much more aware of performance problems within their storage networks. One common question is why a volume with low I/O rates can have very high response times. Often these high response times are present even with no obvious performance impact at the application layer.
These response time spikes are generally measured in the tens or hundreds of milliseconds, but can reach a second or more. At the same time, the I/O rates are low - perhaps 10 I/Os per second or less. This can occur on either read or write I/Os. As an example, this picture shows a typical pattern of generally low I/O rates with a high response time. The volume in question is used for backups, so it is generally only written to during backups. The blue line is the I/O rate - in this case the write I/O rate, but the same situation can happen with reads from idle volumes. The orange line is the response time. You can see that I/O rates to the volume are generally low and that the write response time spikes when the I/O rate goes up. It is easy to see why, if you were using Storage Insights Pro to alert on response time, you might be concerned about a response time greater than 35 ms.
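To make the pattern concrete, here is a minimal sketch that labels a performance sample as a high response time at a low I/O rate. The `Sample` structure, the `classify` function, and the labels are hypothetical and are not part of Storage Insights; the 35 ms and 10 I/Os per second defaults are taken from the example figures above and are assumptions, not recommended thresholds.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One performance interval for a volume (hypothetical structure)."""
    io_rate: float       # I/Os per second
    response_ms: float   # average response time in milliseconds


def classify(sample: Sample,
             max_response_ms: float = 35.0,
             low_io_rate: float = 10.0) -> str:
    """Label a sample; both thresholds are illustrative assumptions."""
    if sample.response_ms <= max_response_ms:
        return "ok"
    if sample.io_rate <= low_io_rate:
        # High response time on an idle or nearly idle volume.
        return "low-io-spike"
    # High response time while the volume is busy.
    return "investigate"
```

For example, `classify(Sample(io_rate=4, response_ms=120))` returns `"low-io-spike"`, while the same response time at 2000 I/Os per second returns `"investigate"`.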
This situation happens because of the way storage systems manage internal cache. This is generally true for all (IBM or non-IBM) storage subsystems and storage virtualization engines (VE). If a volume has low I/O rates, then the volume is idle or nearly idle for extended periods of time, often a minute or more. The storage device or VE will then flush the cache for that volume to free up cache space for other volumes that are actively reading or writing. The first I/O that arrives after an idle period requires the cache for the volume to be re-initialized. For storage systems and VEs with redundant controllers or nodes, the cache must also be synchronized across those nodes or controllers. All of this takes time. These processes are often expensive in performance terms, so the first I/O after an idle period can see significant delays.
Additionally, for write I/O, the volume may operate in write-through mode until the cache has been fully synchronized. In write-through mode the data is written to cache and disk at the same time. This can cause further slowdowns because each write is reported as complete only after the update has been written to the back-end disk(s). After the cache is synchronized, each write is reported as complete as soon as the update has been written to cache, which is a much faster process.
You can see how, depending on the caching scheme of the storage subsystem, idle or almost idle volumes can show extremely high response times. Because so few I/Os occur in each interval, a single slow I/O dominates the reported average. Unless you are seeing applications impacted, this is generally not a concern.
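The mechanism described above can be sketched as a toy model. This is not how any particular storage system is implemented: the class name, the idle timeout, and every latency figure below are assumptions chosen only to show the shape of the effect.

```python
class ToyVolumeCache:
    """Toy model of per-volume cache behavior.

    All timings are illustrative assumptions, not measurements.
    """

    def __init__(self, idle_timeout_s=60.0, reinit_ms=40.0,
                 disk_write_ms=8.0, cache_write_ms=0.5, sync_writes=3):
        self.idle_timeout_s = idle_timeout_s  # idle time before the cache is flushed
        self.reinit_ms = reinit_ms            # cost to re-initialize the cache
        self.disk_write_ms = disk_write_ms    # write-through: wait for back-end disk
        self.cache_write_ms = cache_write_ms  # normal: complete once in cache
        self.sync_writes = sync_writes        # writes handled before cache is synchronized
        self.last_io_s = None
        self.writes_until_synced = 0

    def write(self, now_s):
        """Return the modeled latency in ms of a write arriving at time now_s."""
        latency_ms = 0.0
        was_idle = (self.last_io_s is None
                    or now_s - self.last_io_s >= self.idle_timeout_s)
        if was_idle:
            # First I/O after an idle period pays for cache re-initialization.
            latency_ms += self.reinit_ms
            self.writes_until_synced = self.sync_writes
        if self.writes_until_synced > 0:
            # Write-through mode: complete only after the disk write.
            latency_ms += self.disk_write_ms
            self.writes_until_synced -= 1
        else:
            # Cache synchronized: complete once the update is in cache.
            latency_ms += self.cache_write_ms
        self.last_io_s = now_s
        return latency_ms
```

With the default (assumed) values, the first write to a new `ToyVolumeCache` pays both the re-initialization and the write-through cost, the next two writes pay only the write-through cost, and later writes complete at cache speed until the volume goes idle again.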
Response time spikes can also occur with large transfer sizes. This picture shows response time for the same volume as in the previous picture, this time plotted against transfer size, in this case the size of each write. As in the previous picture, the orange line is the response time and the blue line is the transfer size. You can see that the transfer size is large - almost 500 KB per I/O. The volume for this performance data is not compressed. If it were compressed, there could be additional delays depending on the compression engine used in the storage system. Barry Whyte gives an excellent writeup of Data Reduction Pools that details how DRP gives better performance than IBM RACE or other compression technologies.
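Returning to transfer size: a low I/O rate does not necessarily mean that little data is moving. As a rough illustration, the data rate is the I/O rate multiplied by the transfer size. The helper name below is hypothetical, and the inputs in the usage note are the approximate figures quoted in this article, not measured values.

```python
def data_rate_kb_per_s(io_rate: float, transfer_kb: float) -> float:
    """Data rate in KB/s from an I/O rate (I/Os per second) and a transfer size (KB per I/O)."""
    return io_rate * transfer_kb
```

At 10 I/Os per second and roughly 500 KB per I/O, `data_rate_kb_per_s(10, 500)` gives 5000 KB/s, so each of those few I/Os carries far more data than a small write would.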