fs_io_s and io_s output - how to aggregate and analyze the results

The fs_io_s and io_s requests can be used to determine a number of GPFS™ I/O parameters and their implications for overall performance.

The output from the fs_io_s and io_s requests can be used to determine:
  1. The I/O service rate of a node, from the application point of view. The io_s request presents this as a sum for the entire node, while fs_io_s presents the data per file system. A rate can be approximated by taking the difference of the _br_ (bytes read) or _bw_ (bytes written) values from two successive invocations of fs_io_s (or io_s) and dividing by the difference of the corresponding timestamps, where each timestamp is the _t_ value (seconds) plus the _tu_ value (microseconds) converted to seconds.

    This must be done over a number of samples, with a reasonably small interval between them, in order to get a reasonably accurate rate. Because the information is sampled at a fixed interval, the calculated rate can be inaccurate if the I/O load is not smooth over the sampling period.

    For example, here is a set of samples taken approximately one second apart, when it was known that continuous I/O activity was occurring. Each fs_io_s record is a single line of output; the records are wrapped here for readability:
    _fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862476 _tu_ 634939 _cl_ cluster1.xxx.com
    _fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 3737124864 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 3570 _dir_ 0 _iu_ 5
    _fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862477 _tu_ 645988 _cl_ cluster1.xxx.com
    _fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 3869245440 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 3696 _dir_ 0 _iu_ 5
    _fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862478 _tu_ 647477 _cl_ cluster1.xxx.com
    _fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 4120903680 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 3936 _dir_ 0 _iu_ 5
    _fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862479 _tu_ 649363 _cl_ cluster1.xxx.com
    _fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 4309647360 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 4116 _dir_ 0 _iu_ 5
    _fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862480 _tu_ 650795 _cl_ cluster1.xxx.com
    _fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 4542431232 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 4338 _dir_ 0 _iu_ 5
    _fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862481 _tu_ 652515 _cl_ cluster1.xxx.com
    _fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 4743757824 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 4530 _dir_ 0 _iu_ 5
    _fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862482 _tu_ 654025 _cl_ cluster1.xxx.com
    _fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 4963958784 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 4740 _dir_ 0 _iu_ 5
    _fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862483 _tu_ 655782 _cl_ cluster1.xxx.com
    _fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 5177868288 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 4944 _dir_ 0 _iu_ 5
    _fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862484 _tu_ 657523 _cl_ cluster1.xxx.com
    _fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 5391777792 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 5148 _dir_ 0 _iu_ 5
    _fs_io_s_ _n_ 199.18.1.3 _nn_ node1 _rc_ 0 _t_ 1095862485 _tu_ 665909 _cl_ cluster1.xxx.com
    _fs_ gpfs1m _d_ 3 _br_ 0 _bw_ 5599395840 _oc_ 4 _cc_ 3 _rdc_ 0 _wc_ 5346 _dir_ 0 _iu_ 5
    This simple awk script performs a basic rate calculation:
    # Field positions 9, 11, 19, and 21 of each fs_io_s line hold the
    # _t_, _tu_, _br_, and _bw_ values, respectively.
    BEGIN {
      count=0;
      prior_t=0;
      prior_tu=0;
      prior_br=0;
      prior_bw=0;
    }

    {
      count++;

      t = $9;
      tu = $11;
      br = $19;
      bw = $21;

      # From the second sample onward, compute the change in bytes read and
      # written, divide by the elapsed time, and print the rates in MB/sec.
      if(count > 1)
      {
        delta_t = t-prior_t;
        delta_tu = tu-prior_tu;
        delta_br = br-prior_br;
        delta_bw = bw-prior_bw;
        dt = delta_t + (delta_tu / 1000000.0);
        if(dt > 0) {
          rrate = (delta_br / dt) / 1000000.0;
          wrate = (delta_bw / dt) / 1000000.0;

          printf("%5.1f MB/sec read %5.1f MB/sec write\n",rrate,wrate);
        }
      }

      prior_t=t;
      prior_tu=tu;
      prior_br=br;
      prior_bw=bw;
    }
    The calculated service rates for each adjacent pair of samples are:
      0.0 MB/sec read    130.7 MB/sec write
      0.0 MB/sec read    251.3 MB/sec write
      0.0 MB/sec read    188.4 MB/sec write
      0.0 MB/sec read    232.5 MB/sec write
      0.0 MB/sec read    201.0 MB/sec write
      0.0 MB/sec read    219.9 MB/sec write
      0.0 MB/sec read    213.5 MB/sec write
      0.0 MB/sec read    213.5 MB/sec write
      0.0 MB/sec read    205.9 MB/sec write

    Since these are discrete samples, the individual results can vary. For example, there may be other activity on the node or the interconnection fabric, and I/O size, file system block size, and buffering also affect the results, so adjacent values can differ for many reasons. This must be taken into account when building analysis tools that read mmpmon output and when interpreting the results. (A keyword-based variant of the script above appears after this list.)

    For example, suppose a file is read for the first time and gives results like this:
      0.0 MB/sec read      0.0 MB/sec write
      0.0 MB/sec read      0.0 MB/sec write
     92.1 MB/sec read      0.0 MB/sec write
     89.0 MB/sec read      0.0 MB/sec write
     92.1 MB/sec read      0.0 MB/sec write
     90.0 MB/sec read      0.0 MB/sec write
     96.3 MB/sec read      0.0 MB/sec write
      0.0 MB/sec read      0.0 MB/sec write
      0.0 MB/sec read      0.0 MB/sec write
    If most or all of the file remains in the GPFS cache, the second read may give quite different rates:
      0.0 MB/sec read      0.0 MB/sec write
      0.0 MB/sec read      0.0 MB/sec write
    235.5 MB/sec read      0.0 MB/sec write
    287.8 MB/sec read      0.0 MB/sec write
      0.0 MB/sec read      0.0 MB/sec write
      0.0 MB/sec read      0.0 MB/sec write

    Considerations such as these need to be taken into account when looking at application I/O service rates calculated from sampling mmpmon data.

  2. Usage patterns, by sampling at set times of the day (perhaps every half hour) and noticing when the largest changes in I/O volume occur. This does not necessarily give a rate (since there are too few samples) but it can be used to detect peak usage periods.
  3. Whether some nodes service significantly more I/O volume than others over a given time span.
  4. How well the I/O activity of a parallel application is distributed, when the application is split across several nodes and is the only significant activity on those nodes.
  5. The total I/O demand that applications are placing on the cluster. This is determined by obtaining fs_io_s and io_s results in aggregate from all nodes in the cluster (see the aggregation sketch after this list).
  6. The rate data may appear to be erratic. Consider this example:
     0.0 MB/sec read    0.0 MB/sec write
     6.1 MB/sec read    0.0 MB/sec write
    92.1 MB/sec read    0.0 MB/sec write
    89.0 MB/sec read    0.0 MB/sec write
    12.6 MB/sec read    0.0 MB/sec write
     0.0 MB/sec read    0.0 MB/sec write
     0.0 MB/sec read    0.0 MB/sec write
     8.9 MB/sec read    0.0 MB/sec write
    92.1 MB/sec read    0.0 MB/sec write
    90.0 MB/sec read    0.0 MB/sec write
    96.3 MB/sec read    0.0 MB/sec write
     4.8 MB/sec read    0.0 MB/sec write
     0.0 MB/sec read    0.0 MB/sec write

    The low rates that appear before and after each group of higher rates can occur because the I/O requests started late in the leading sampling period and ended early in the trailing sampling period. This gives an apparently low rate for those sampling periods.

    The zero rates in the middle of the example could be caused by several factors: no I/O requests reached GPFS during that period (either the application issued none, or the requests were satisfied by buffered data at a layer above GPFS), the node became busy with other work so that the application was not dispatched, or some other cause.
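
As a sketch of a slightly more robust approach for such analysis tools, the following awk fragment locates the _t_, _tu_, _br_, and _bw_ values by keyword name rather than by field position, so it is less sensitive to the exact record layout. Like the script shown earlier, it assumes that each fs_io_s record arrives as a single line:
  # Assumes each fs_io_s record is a single input line.
  # Scan the record for the keywords of interest and remember the value
  # that follows each keyword.
  {
    for (i = 1; i < NF; i++) {
      if ($i == "_t_")  t  = $(i+1);
      if ($i == "_tu_") tu = $(i+1);
      if ($i == "_br_") br = $(i+1);
      if ($i == "_bw_") bw = $(i+1);
    }
    count++;
    # From the second sample onward, convert the counter deltas into MB/sec.
    if (count > 1) {
      dt = (t - prior_t) + (tu - prior_tu) / 1000000.0;
      if (dt > 0)
        printf("%5.1f MB/sec read %5.1f MB/sec write\n",
               ((br - prior_br) / dt) / 1000000.0,
               ((bw - prior_bw) / dt) / 1000000.0);
    }
    prior_t = t; prior_tu = tu; prior_br = br; prior_bw = bw;
  }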
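
To estimate the total I/O demand on the cluster, fs_io_s output collected from every node can be combined. The following sketch keys the counters by the node name (_nn_) and file system name (_fs_) carried in each record, then reports per-node and cluster-wide byte totals for the sampled period; it assumes that the counters are not reset while the samples are being gathered. The per-node breakdown also shows whether some nodes are servicing more I/O than others:
  # Assumes the byte counters are not reset during the sampling window.
  # Remember the first and most recent _br_ and _bw_ values seen for each
  # node and file system pair.
  {
    for (i = 1; i < NF; i++) {
      if ($i == "_nn_") nn = $(i+1);
      if ($i == "_fs_") fs = $(i+1);
      if ($i == "_br_") br = $(i+1);
      if ($i == "_bw_") bw = $(i+1);
    }
    key = nn "|" fs;
    if (!(key in first_br)) { first_br[key] = br; first_bw[key] = bw; }
    last_br[key] = br;
    last_bw[key] = bw;
  }
  END {
    # Sum the per-node deltas to obtain the cluster-wide totals.
    for (key in last_br) {
      node_br = last_br[key] - first_br[key];
      node_bw = last_bw[key] - first_bw[key];
      printf("%-24s %10.1f MB read %10.1f MB written\n",
             key, node_br / 1000000.0, node_bw / 1000000.0);
      total_br += node_br;
      total_bw += node_bw;
    }
    printf("%-24s %10.1f MB read %10.1f MB written\n",
           "cluster total", total_br / 1000000.0, total_bw / 1000000.0);
  }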

For information on interpreting mmpmon output results, see Other information about mmpmon output.