(updated 08.10.: dump thread all output, some more comments)
On a cluster of 3 RHEL 6.4 blades I see some strange behaviour.
The setup is not very fortunate for GPFS:
2 LUNs as virtIO devices, from a DS8000 (probably the same physical disks).
GPFS 126.96.36.199, default config. 1 FS with 128 KiB block size, no replication (neither data nor metadata).
dd read from the GPFS FS is reasonably fast (400..500MiB/s), dd write as well (about the same throughput).
When running one dd read and one dd write on the same cluster node, however, writes show much lower performance, and reads sometimes show a slight decrease.
dd reports about 20..30MiB/s (1GiB in about 40s), but the dd command itself takes up to 5 or 6 minutes. That might mean that dd cannot even open the file for some time.
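To put numbers on that gap, here is a small arithmetic sketch. It assumes that dd's own timer covers only the copy phase, and it takes 5.5 minutes as an assumed midpoint of the "5 or 6 minutes" wall-clock time; both are assumptions for illustration, not measurements.

```python
# Rough arithmetic on the dd figures quoted above.
# Assumption: dd's reported time covers only the copy itself, so the
# remainder of the wall-clock time is spent outside the copy (e.g. in open()).

MIB_PER_GIB = 1024

size_mib = 1 * MIB_PER_GIB      # 1 GiB written
dd_seconds = 40.0               # time reported by dd (from the post)
wall_seconds = 5.5 * 60         # assumed midpoint of "5 or 6 minutes"

dd_rate = size_mib / dd_seconds            # throughput as dd reports it
effective_rate = size_mib / wall_seconds   # throughput seen by the caller
unaccounted = wall_seconds - dd_seconds    # time not covered by dd's timer

print(f"dd-reported rate : {dd_rate:.1f} MiB/s")         # 25.6 MiB/s
print(f"effective rate   : {effective_rate:.1f} MiB/s")  # 3.1 MiB/s
print(f"unaccounted time : {unaccounted:.0f} s")         # 290 s
```

Under these assumptions, roughly 290 of the 330 seconds are spent outside the copy, which would be consistent with dd being stuck before it starts transferring data.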
When running the dd read and the dd write on two different nodes of the cluster, both are fast.
When running dd read and dd write on the same node, but one against an ext3 FS carved from the same disk pool as the GPFS FS, performance is also good. The same holds when both read and write are done against that ext3 FS on the same node.
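The single-node test cases described above can be timed with a sketch like the following. It starts several commands at once and records the wall-clock time of each one separately from whatever dd prints on stderr. The helper names (`timed_run`, `run_concurrently`, `dd_read`, `dd_write`) and the file paths in the usage comment are hypothetical, and the 1M block size is just an example value.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def timed_run(cmd):
    """Run one command; return (returncode, wall-clock seconds, stderr text)."""
    start = time.monotonic()
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, text=True)
    return proc.returncode, time.monotonic() - start, proc.stderr


def run_concurrently(cmds):
    """Start all commands at (nearly) the same time and wait for all of them."""
    with ThreadPoolExecutor(max_workers=len(cmds)) as pool:
        return list(pool.map(timed_run, cmds))


def dd_read(path, bs="1M"):
    """Build a dd command that reads a file and discards the data."""
    return ["dd", f"if={path}", "of=/dev/null", f"bs={bs}"]


def dd_write(path, count, bs="1M"):
    """Build a dd command that writes `count` blocks of zeros to a file."""
    return ["dd", "if=/dev/zero", f"of={path}", f"bs={bs}", f"count={count}"]


# Hypothetical usage (paths are placeholders, not taken from the post):
#   results = run_concurrently([
#       dd_read("/gpfs/fs1/readfile"),
#       dd_write("/gpfs/fs1/writefile", count=1024),
#   ])
#   for rc, wall, err in results:
#       print(rc, f"{wall:.1f}s wall", err.strip().splitlines()[-1:])
```

Comparing the wall-clock time against the time in dd's own summary line shows how much of the run is spent outside the copy phase.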
The IO times from the GPFS IO trace in the problem config, i.e. reading and writing on GPFS from the same node, are very high (dump disk):
full-block write operations <= 1 sec: 582
avg time: 159.137 ms
full-block write operations <= 1 sec: 572
avg time: 140.532 ms
But the storage is not that slow on its own, as the distributed read and write from two different GPFS nodes, the read/write on ext3, and the ext3-GPFS combination show.
I have not yet done a trace beyond class io, but dump all does not show an exhaustion of threads:
Dump of thread system: TH at 0x146BC80
Running thread 0x1566280 (MMFSADMDumpCmdThread)
nThreads 143 sequenceNum 175094320
Thread pool utilization (current/highest/maximum):
MultiThreadWork helpers: wanted 0, running 0, limit 332
Hence it looks like the long IO times during concurrent reads and writes are indeed causing the problem. But again: why do we not see the problem when using two different FS on the same storage pool (this shares everything but the LUNs themselves), and why not when doing GPFS reads and GPFS writes from two different nodes (this shares at least the LUNs)?
Waiters (when concurrently writing and reading in GPFS from one node) are mostly IO waiters matching the high IO times, with sometimes additional waiters like:
linx16-116: 0x165D1D0 waiting 10.339519000 seconds, SGExceptionLogBufferFullThread: for I/O completion on disk vde
linx16-116: 0x16604B0 waiting 10.340031584 seconds, DeallocHelperThread: on ThCond 0x15E03D0 (0x15E03D0) (LogFileBufferDescriptorCondvar), reason 'force wait for buffer write to complete'
linx16-116: 0x16591D0 waiting 10.340165802 seconds, OpenHandlerThread: on ThCond 0x15E03D0 (0x15E03D0) (LogFileBufferDescriptorCondvar), reason 'force wait on force active buffer write'
linx16-116: 0x14EF1B0 waiting 19.497356491 seconds, SyncHandlerThread: on ThCond 0x1800205EE90 (0xFFFFC9000205EE90) (LkObjCondvar), reason 'waiting for RO lock'
linx16-116: 0x165D1D0 waiting 1.451253000 seconds, SGExceptionLogBufferFullThread: for I/O completion on disk vde
linx16-116: 0x16604B0 waiting 16.814353584 seconds, DeallocHelperThread: on ThCond 0x15E03D0 (0x15E03D0) (LogFileBufferDescriptorCondvar), reason 'force wait for buffer write to complete'
linx16-116: 0x16591D0 waiting 16.814487802 seconds, OpenHandlerThread: on ThCond 0x15E03D0 (0x15E03D0) (LogFileBufferDescriptorCondvar), reason 'force wait on force active buffer write'
linx16-116: 0x14EF1B0 waiting 25.971678491 seconds, SyncHandlerThread: on ThCond 0x1800205EE90 (0xFFFFC9000205EE90) (LkObjCondvar), reason 'waiting for RO lock'
It seems that there is always a combination of DeallocHelperThread, OpenHandlerThread, and SyncHandlerThread waiters. Those can grow to a few hundred seconds old. That triplet is always accompanied by particularly long-waiting SGExceptionLogBufferFullThreads waiting for IO completion; however, that accompanying long waiter seems to be "renewed" during the SyncHandlerThread waiting period (see the example above: the SGExceptionLogBufferFullThread in the second readout is a newer one than in the first readout).
When these SGExceptionLogBufferFullThread waiters occur, the dd write seems to be blocked from opening the file; at that moment, the read runs at full speed. Read throughput decreases slightly (to about 350..400MiB/s) when the write is running (i.e. when this SGExceptionLogBufferFullThread is not seen).
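The "renewed" observation can be checked mechanically. The sketch below parses waiter lines in the format shown above and flags any waiter whose wait time is smaller in the second readout than in the first. The regular expression is only an assumption derived from the lines pasted here, not an official format description, and the function names are hypothetical. For brevity it uses two of the four waiters from each readout.

```python
import re

# Pattern inferred from the waiter lines in this post (an assumption,
# not a documented format).
WAITER_RE = re.compile(
    r"^(?P<node>\S+):\s+(?P<addr>0x[0-9A-Fa-f]+) waiting "
    r"(?P<secs>\d+\.\d+) seconds, (?P<thread>\w+): (?P<what>.*)$"
)


def parse_waiters(text):
    """Map (thread address, thread name) -> wait time in seconds."""
    waiters = {}
    for line in text.splitlines():
        m = WAITER_RE.match(line.strip())
        if m:
            waiters[(m["addr"], m["thread"])] = float(m["secs"])
    return waiters


def renewed_waiters(first, second):
    """Names of waiters whose wait time shrank between two readouts."""
    return sorted(
        thread
        for (addr, thread), secs in second.items()
        if (addr, thread) in first and secs < first[(addr, thread)]
    )


FIRST = """\
linx16-116: 0x165D1D0 waiting 10.339519000 seconds, SGExceptionLogBufferFullThread: for I/O completion on disk vde
linx16-116: 0x14EF1B0 waiting 19.497356491 seconds, SyncHandlerThread: on ThCond 0x1800205EE90 (0xFFFFC9000205EE90) (LkObjCondvar), reason 'waiting for RO lock'
"""

SECOND = """\
linx16-116: 0x165D1D0 waiting 1.451253000 seconds, SGExceptionLogBufferFullThread: for I/O completion on disk vde
linx16-116: 0x14EF1B0 waiting 25.971678491 seconds, SyncHandlerThread: on ThCond 0x1800205EE90 (0xFFFFC9000205EE90) (LkObjCondvar), reason 'waiting for RO lock'
"""

first = parse_waiters(FIRST)
second = parse_waiters(SECOND)
renewed = renewed_waiters(first, second)
print(renewed)
```

With the two readouts above, only SGExceptionLogBufferFullThread is flagged: its wait time drops from about 10.3 s to about 1.5 s, while the SyncHandlerThread wait keeps growing.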
Is there some known and quick explanation for this?