disk oversubscription + long RPC waits on IO

2012-11-26T22:22:47Z
Hi all,

We've been running some benchmarks on new storage with GPFS v3.5, and I think we might have set some of our prefetching parameters (which also govern the write-behind buffers) a bit too high. We are seeing a rampdown near the end of our tests: IO runs at full capacity for most of the test, but for roughly the last 10% of the run we only reach 75% of aggregate storage bandwidth. This obviously hurts the aggregate bandwidth numbers.

(Note: if someone could tell me how to use the proper post formatting in this forum so the post looks better, that would be great.)

Some relevant client-side settings (this is over RDMA):

verbsRdmasPerConnection 128
verbsRdmasPerNode 128
worker1Threads 256
prefetchThreads 1024

During the rampdown periods (client-side waiters):

0xFFA04001C70 waiting 20.091231892 seconds, PrefetchWorkerThread: on ThCond 0xFF81C003C58 (0xFF81C003C58) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node <c2n143>
0xFFB4C002F70 waiting 7.005385961 seconds, PrefetchWorkerThread: on ThCond 0xFF928002038 (0xFF928002038) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node <c2n41>
0xFF968001C70 waiting 5.293804283 seconds, PrefetchWorkerThread: on ThCond 0xFFD3800B208 (0xFFD3800B208) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node <c2n55>
0xFF878000910 waiting 26.265957631 seconds, PrefetchWorkerThread: on ThCond 0x180050797D0 (0x80000000050797D0) (LkObjCondvar), reason 'waiting for WW lock'
0xFF880000910 waiting 17.258222869 seconds, PrefetchWorkerThread: on ThCond 0xFF6E00035B8 (0xFF6E00035B8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node <c2n143>
0xFF88C000910 waiting 48.113863758 seconds, PrefetchWorkerThread: on ThCond 0xFFA78002268 (0xFFA78002268) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node <c2n136>
0xFF888000910 waiting 6.173766017 seconds, PrefetchWorkerThread: on ThCond 0xFFDF4006158 (0xFFDF4006158) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node <c2n41>

I've checked the NSD servers during these timeframes, and we don't lose access to our LUNs, nor do we have any failing disks. Oftentimes, many of our LUNs (in random groups) have IO waiters of 5-6 seconds.
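
To make the client-side waiter output above easier to compare across runs, something like the following could group the waiters by reason and NSD node. This is only a rough sketch: the regular expression is inferred from the sample lines in this post, not from any GPFS documentation, and `summarize_waiters` and `SAMPLE_WAITERS` are names I made up for illustration.

```python
import re
from collections import defaultdict

# Pattern inferred from the waiter lines in this post only (an assumption,
# not a documented GPFS format). The node part is optional because lock
# waiters do not name a node.
WAITER_RE = re.compile(
    r"waiting (?P<secs>\d+\.\d+) seconds, (?P<thread>\w+):"
    r".*?reason '(?P<reason>[^']+)'"
    r"(?: for NSD I/O completion on node <(?P<node>[^>]+)>)?"
)

# A few of the waiter lines from this post, used as sample input.
SAMPLE_WAITERS = """\
0xFFA04001C70 waiting 20.091231892 seconds, PrefetchWorkerThread: on ThCond 0xFF81C003C58 (0xFF81C003C58) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node <c2n143>
0xFFB4C002F70 waiting 7.005385961 seconds, PrefetchWorkerThread: on ThCond 0xFF928002038 (0xFF928002038) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node <c2n41>
0xFF878000910 waiting 26.265957631 seconds, PrefetchWorkerThread: on ThCond 0x180050797D0 (0x80000000050797D0) (LkObjCondvar), reason 'waiting for WW lock'
0xFF880000910 waiting 17.258222869 seconds, PrefetchWorkerThread: on ThCond 0xFF6E00035B8 (0xFF6E00035B8) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node <c2n143>
""".splitlines()


def summarize_waiters(lines):
    """Group waiters by (reason, node); track the count and the longest wait."""
    summary = defaultdict(lambda: {"count": 0, "max_secs": 0.0})
    for line in lines:
        match = WAITER_RE.search(line)
        if not match:
            continue  # not a waiter line in the assumed format
        key = (match.group("reason"), match.group("node"))
        entry = summary[key]
        entry["count"] += 1
        entry["max_secs"] = max(entry["max_secs"], float(match.group("secs")))
    return dict(summary)


if __name__ == "__main__":
    for (reason, node), stats in sorted(
        summarize_waiters(SAMPLE_WAITERS).items(),
        key=lambda item: -item[1]["max_secs"],
    ):
        print(f"{reason!r:24} node={node} count={stats['count']} "
              f"max={stats['max_secs']:.1f}s")
```

If most of the long waits pile up on the same few nodes, that would point at specific NSD servers or LUNs rather than at the client settings in general.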

Also, while running these tests, I have iostat running in the background on all of our NSD servers, and we seem to have very high queue lengths at the beginning of the runs (an average of 85 per LUN/NSD) and at certain other points.

What I'm suspecting is that we've not only set our client-side prefetch (and worker1Threads) too high, but we've also set the NSD worker thread count on our NSD servers too high. So it seems like GPFS is happily sending read/write requests to disk, but our disks cannot keep up, so we're often hitting the max nr_requests limit (/sys/block/*/queue/nr_requests) on each LUN. That means requests submitted to each LUN queue wait longer as the number of queued requests grows. The average wait time when load is heavy is around 1.1 seconds per IO request (due to sitting in a long queue). Hitting, or almost hitting, a performance target is great for C-level people, but I'd like my storage system to be well balanced, and looking at the various charts and graphs, it seems to be very up and down at times.
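
For reference, the per-LUN limit mentioned above can be collected on each NSD server with a few lines of Python. This is a minimal sketch that just reads the sysfs files named in this post; `read_nr_requests` is a made-up helper name, and the `sys_block` parameter exists only so the function can be pointed at a test directory.

```python
import glob
import os


def read_nr_requests(sys_block="/sys/block"):
    """Return {device: nr_requests} for every block device under sys_block."""
    limits = {}
    pattern = os.path.join(sys_block, "*", "queue", "nr_requests")
    for path in glob.glob(pattern):
        # Path looks like <sys_block>/<device>/queue/nr_requests.
        device = path.split(os.sep)[-3]
        try:
            with open(path) as handle:
                limits[device] = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # device disappeared or value was unreadable; skip it
    return limits


if __name__ == "__main__":
    for device, limit in sorted(read_nr_requests().items()):
        print(f"{device}: nr_requests={limit}")
```

Comparing these limits with the queue lengths iostat reports for the same devices would show whether the LUN queues really are running at or near the cap.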

Around the industry, many people seem to say that more than 2-3 outstanding requests per disk is too high. If you multiply 3 requests per disk by the 8 data disks in a 10-disk RAID6 array, you get 24 per LUN. Now, I'm expecting bursts, and certain times where the queue length on an entire LUN gets high because a disk is going bad, but the average we are seeing just seems high to me.
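
Putting rough numbers on this, using only the figures quoted in this post: the sketch below compares the 24-request rule of thumb with the observed average of 85 and applies Little's law (queue length = throughput x wait time) to estimate what each LUN is actually completing. Pairing the 85-deep queue with the 1.1 second wait is my assumption, since the two were not necessarily measured at the same moment, and the function names are made up.

```python
def per_lun_queue_target(requests_per_disk, data_disks):
    """Rule-of-thumb queue depth for one LUN."""
    return requests_per_disk * data_disks


def implied_iops(avg_queue_len, avg_wait_secs):
    """Little's law: L = lambda * W, so lambda = L / W."""
    return avg_queue_len / avg_wait_secs


DATA_DISKS = 8            # 10-disk RAID6 array, 8 data disks
OBSERVED_QUEUE_LEN = 85   # average per LUN/NSD from iostat
OBSERVED_WAIT_SECS = 1.1  # average wait per IO under heavy load

target = per_lun_queue_target(3, DATA_DISKS)                # 3 * 8 = 24
oversubscription = OBSERVED_QUEUE_LEN / target              # about 3.5x the target
per_disk_depth = OBSERVED_QUEUE_LEN / DATA_DISKS            # about 10.6 per data disk
iops = implied_iops(OBSERVED_QUEUE_LEN, OBSERVED_WAIT_SECS) # about 77 IO/s per LUN
# Assumption: if the LUN kept completing IO at that same rate but the queue
# were held at the target depth, the average wait would shrink in proportion.
wait_at_target = target / iops                              # about 0.31 s

if __name__ == "__main__":
    print(f"target depth per LUN:   {target}")
    print(f"observed / target:      {oversubscription:.2f}x")
    print(f"observed per data disk: {per_disk_depth:.1f}")
    print(f"implied IO/s per LUN:   {iops:.1f}")
    print(f"wait at target depth:   {wait_at_target:.2f} s")
```

If those assumptions hold, the deep queue mostly adds latency without adding throughput.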

Does this sound like a case where we are oversubscribing our storage hardware because we are being too aggressive with GPFS?

If so, is there a way I can dump all of the relevant GPFS client/server information (via mmfsadm dump all) and find some relevant figures? My thinking was to dump the NSD stats from GPFS and figure out whether the number of NSD worker threads at a given point corresponds to the number of IOs in the queue, etc.