Topic
  • 3 replies
  • Latest Post - ‏2013-10-09T14:41:18Z by ufa
ufa
133 Posts

Pinned topic slow write with concurrent reads on same node

‏2013-10-08T01:25:47Z |

Hi,

(updated 08.10.: added the thread dump output and some more comments)


on a cluster of 3 RHEL 6.4 blades I see some strange behaviour.

The setup is not very favourable for GPFS:

2 LUNs as virtIO devices, from a DS8000 (probably the same physical disks).

GPFS 3.5.0.11, default config. 1 FS with 128kiB BS, no replication (neither data nor metadata).

 

A dd read from the GPFS FS is reasonably fast (400..500MiB/s), and a dd write achieves about the same throughput.

When running one dd read and one dd write concurrently on the same cluster node, however, the write shows severely degraded performance, and the read sometimes a slight decrease.

dd reports about 20..30MiB/s (1GiB in about 40s), but the dd command itself takes up to 5 or 6 minutes; that suggests dd cannot even open the file for some time.

When running the dd read and the dd write on two different nodes of the cluster, both are fast.

When running the dd read and the dd write on the same node, but with one of them against an ext3 FS carved from the same disk pool as the GPFS FS, performance is also good. The same holds when both the read and the write go against that ext3 FS on the same node.
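
For reference, the tests are plain dd runs along the following lines (paths, sizes and block sizes here are illustrative, not the exact commands):

    # sequential write of 1 GiB into the GPFS file system
    dd if=/dev/zero of=/gpfs01/file1 bs=1M count=1024
    # concurrent sequential read of a large, previously written file, started on the same node
    dd if=/gpfs01/file2 of=/dev/null bs=1M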

The IO times from the GPFS IO trace in the problem config, i.e. reading and writing on GPFS from the same node, are very high (from "dump disk"):

   full-block write operations <= 1 sec: 582
      avg time: 159.137 ms
   full-block write operations <= 1 sec: 572
      avg time: 140.532 ms

But the storage is not that slow on its own, as the distributed read and write from two different GPFS nodes, the pure ext3 read/write, and the ext3/GPFS combination show.

I have not yet done a trace beyond class io, but "dump all" does not show an exhaustion of threads:

Dump of thread system: TH at 0x146BC80
  Running thread 0x1566280 (MMFSADMDumpCmdThread)
  nThreads 143 sequenceNum 175094320
  Thread pool utilization (current/highest/maximum):
    ThPoolNonCritical: 22/36/512
    ThPoolDaemonStartup: 13/16/36
    ThPoolMBHandler: 55/55/66
    ThPoolRcvWorkers: 3/3/384
    ThPoolRevokeWorkers: 0/1/35
    ThPoolRangeRevokeWorkers: 1/1/20
    ThPoolRecLockRevokeWorkers: 0/0/10
    ThPoolPrefetchWorkers: 29/29/72
    ThPoolSGException: 0/5/18
    ThNsdIoWorkers: 0/0/0
    ThPoolSdrServWorkers: 1/1/10
    ThPoolReceivers: 2/2/2
    ThPoolPcache: 1/2/15
  MultiThreadWork helpers: wanted 0, running 0, limit 332
 

Hence it looks like the long IO times during concurrent reads and writes are indeed causing the problem. But again: why do we not see the problem when using two different file systems on the same storage pool (this shares everything but the LUNs themselves), and why not when doing the GPFS reads and GPFS writes from two different nodes (this shares at least the LUNs)?

The waiters (when concurrently writing and reading in GPFS from one node) are mostly IO waiters matching the high IO times, sometimes with additional waiters like:

linx16-116:  0x165D1D0 waiting 10.339519000 seconds, SGExceptionLogBufferFullThread: for I/O completion on disk vde
linx16-116:  0x16604B0 waiting 10.340031584 seconds, DeallocHelperThread: on ThCond 0x15E03D0 (0x15E03D0) (LogFileBufferDescriptorCondvar), reason 'force wait for buffer write to complete'
linx16-116:  0x16591D0 waiting 10.340165802 seconds, OpenHandlerThread: on ThCond 0x15E03D0 (0x15E03D0) (LogFileBufferDescriptorCondvar), reason 'force wait on force active buffer write'
linx16-116:  0x14EF1B0 waiting 19.497356491 seconds, SyncHandlerThread: on ThCond 0x1800205EE90 (0xFFFFC9000205EE90) (LkObjCondvar), reason 'waiting for RO lock'

####
linx16-116:  0x165D1D0 waiting 1.451253000 seconds, SGExceptionLogBufferFullThread: for I/O completion on disk vde
linx16-116:  0x16604B0 waiting 16.814353584 seconds, DeallocHelperThread: on ThCond 0x15E03D0 (0x15E03D0) (LogFileBufferDescriptorCondvar), reason 'force wait for buffer write to complete'
linx16-116:  0x16591D0 waiting 16.814487802 seconds, OpenHandlerThread: on ThCond 0x15E03D0 (0x15E03D0) (LogFileBufferDescriptorCondvar), reason 'force wait on force active buffer write'
linx16-116:  0x14EF1B0 waiting 25.971678491 seconds, SyncHandlerThread: on ThCond 0x1800205EE90 (0xFFFFC9000205EE90) (LkObjCondvar), reason 'waiting for RO lock'

###
It seems that there is always a combination of the DeallocHelperThread, OpenHandlerThread, and SyncHandlerThread; those can grow a few hundred seconds old. That triplet is always accompanied by a particularly long-waiting SGExceptionLogBufferFullThread waiting for IO completion; however, that accompanying long waiter seems to be "renewed" during the SyncHandlerThread waiting period (see the example above: the SGExceptionLogBufferFullThread in the second readout is a newer one than in the first).

When these SGExceptionLogBufferFullThread waiters occur, the dd write appears to be blocked from opening the file, and in that moment the read runs at full speed. Read throughput decreases only slightly (to about 350..400MiB/s) while the write is actually running (i.e. while this SGExceptionLogBufferFullThread is not seen).
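
Waiter listings like those above can be sampled repeatedly with something along these lines (sketch only; the interval is arbitrary):

    while true; do
        date
        /usr/lpp/mmfs/bin/mmfsadm dump waiters
        sleep 5
    done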

Is there some known and quick explanation for this?

ufa

 

Updated on 2013-10-08T07:22:52Z by ufa
  • dlmcnabb
    1012 Posts

    Re: slow write with concurrent reads on same node

    ‏2013-10-08T15:25:05Z  

    Anytime you see a "waiting for I/O completion" it means the disk subsystem is the hangup. You may be getting a good IO average, but there may be some IOs that are exceedingly slow. If such an IO is at a critical synchronization point (some internal logging can be), then many other threads get held up waiting for locks to be cleared behind that IO.

    It may be that the disk subsystem has problems with IOs smaller than 4K, which log records can be. All your data blocks are most likely multiples of 4K, and only the writes with odd sector counts are slow. Look at "mmfsadm dump iohist" to see the last 512 IOs and check whether some stand out as slow. Maybe it's only a few disks or IO sizes that are slow.
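
    A rough way to pull that history and pick out the slow entries (sketch only; the column that holds the IO time in milliseconds varies, so check the header of the actual output before trusting the awk field number):

        /usr/lpp/mmfs/bin/mmfsadm dump iohist > /tmp/iohist.out
        # show entries slower than 100 ms, assuming the time-in-ms value is the last column
        awk '$NF+0 > 100' /tmp/iohist.out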

  • db808
    86 Posts

    Re: slow write with concurrent reads on same node

    ‏2013-10-08T18:46:08Z  

    I agree with the value of checking GPFS io history via "mmfsadm dump iohist".  I also wonder what the IO latencies are.

    A couple other thoughts.  Please confirm that storage array write caching on the DS8000 is properly enabled for the 2 LUNs being used. 

    Writes are very resource-intensive on RAID5 or RAID6 disk topologies, especially if write caching is not enabled on the LUN.  A single RAID6 write could involve as many as 3 physical reads and 3 physical writes, for a total of 6 IOs ... if the write is not a "full stripe" write which is typically optimized to only require 1 IO per disk in the stripe, amortized across a larger IO.

    128 KiB is a very small GPFS block size, and is ill-suited to high-performance sequential IO.  The 128 KiB block size is also probably too small to be the full-stripe size on the underlying DS8000 LUN.  Therefore, 6 physical IOs are needed for each host write under RAID6.  This is also true if the small writes to update GPFS metadata are going to a RAID6 LUN.

    What are your atime and mtime settings?  File writes have the extra burden of updating mtime as well.  GPFS will buffer these metadata updates, especially in the single-node case, but will flush them to disk periodically and on file close.  Each of these metadata updates would generate the 6 IOs on RAID6.
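
    If I remember the mmlsfs flags correctly, something like this shows the relevant attributes (the file system name is a guess based on the /gpfs01 mount point mentioned later in the thread):

        mmlsfs gpfs01 -S     # is atime updating suppressed?
        mmlsfs gpfs01 -E     # is exact mtime tracking enabled?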

    So for the "normal" case, you would expect a 6-to-1 difference in performance between GPFS reads and GPFS writes, due to the 6-IOs-per-write RAID6 penalty.  Add an additional metadata write every couple of blocks for allocating and writing the index blocks, mtime updates, metadata journaling ... and an observed 15-to-20-to-1 difference is explainable, simply if you are doing non-full-stripe writes under RAID6.

    This behavior is the justification behind the GPFS recommendation that the GPFS block size be integer multiples of the underlying hardware RAID stripe size.  This results in the full GPFS block writes taking advantage of full-stripe write optimizations that are often available in storage arrays.

    I don't know what your specific DS8000 LUN configuration is, and I have no DS8000 experience.  A quick Google search finds that:

         Each stripe on a DS8000 is made up of 4 logical tracks.
         Each logical track is 64KiB, so a stripe is 256KiB in size.

    So, if this is typical of your LUN configuration, then your 128kb GPFS block size is ill-suited for the 256kb hardware stripe.  If you are using RAID5, each host write generates 4 back-end disk IOs, for a 4-to-1 write amplification ratio.  Not as bad as RAID6, but still significant.

    BTW:  Remember that when doing IO into the same LUNs, you are in a zero-sum game.  Every host write you do generates 4 (RAID5) or 6 (RAID6) back-end IOs, which steals IO capability from concurrent host reads.  What you might think is a 50/50 read/write split from the host perspective ends up as a 1/(3+3) split in back-end IO activity under RAID6.  The back-end IO statistics can be misleading because a host WRITE generates 3 back-end READS.  In the 50/50 host read/write scenario, you get (4 reads)/(3 writes) ... which does not look too far off ... but three of the reads were generated by the host write ... misleading.

    So ... if you are using a hardware full stripe of 256kb, then your GPFS block size should be 256kb or LARGER.

    Then the single large GPFS host IO will be translated into a full-stripe hardware write ... no read/modify/write.  For (4+1) RAID5, the full-stripe write amplification is an effective 1.25 ... much better than 4-to-1 for less-than-full-stripe writes.
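
    As an illustration only (the device name and NSD stanza file are made up), lining the GPFS block size up with a 256 KiB hardware full stripe would look something like:

        # create a file system whose block size equals the 256 KiB full stripe
        mmcrfs gpfsnew -F nsd.stanzas -B 256K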

    Other thoughts, but less likely:

    If the 2 LUNs that are being used are separate data and metadata, you can use the normal Linux "iostat" to monitor disk performance.  You have to deal with multiple paths to the same LUN, but it is not too bad with a small number of LUNs.
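
    A sketch of what I would watch; "vde" shows up in the waiter output earlier in the thread, but the full set of virtio device names on the node may differ:

        # extended per-device statistics (await/svctm columns) every 2 seconds, virtio disks only
        iostat -xm 2 | egrep --line-buffered 'Device|^vd'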

    You mentioned that the simple dd ran as fast as 500 MiB/sec ... and with a 128kb block size, that equates to 4,000 IOs per second.  That would normally be associated with 40+ disks doing 100 IOs per second each.  Is there that much disk power configured behind the two LUNs?  If not, there is some type of storage-array-level read-ahead going on, which is possible if the GPFS allocation mode is "cluster", not "scatter".

    With only 2 LUNs visible to GPFS, GPFS would traditionally be limited to 4 threads per LUN, with 3 of the threads available for read-ahead/write-behind.  If you truly have 40+ disks behind these LUNs, then these default GPFS settings are not appropriate.  If you are getting 500 MiB/sec from 2 LUNs or 4000 IOs per second, that is 2,000 IOs/sec per LUN.  With a maximum of 4 threads per LUN, the individual IOs would need to be about 1/500'th of a second ... or 2 milliseconds with 4 IOs outstanding.  The only way this occurs is via reading from storage array cache.
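
    Spelled out, the back-of-the-envelope arithmetic is:

        echo $(( 500 * 1024 / 128 ))   # 500 MiB/s at 128 KiB per IO  -> 4000 IOs/sec
        echo $(( 4000 / 2 ))           # spread over 2 LUNs           -> 2000 IOs/sec per LUN
        # with at most 4 IOs in flight per LUN, each IO must average
        # 4 / 2000 s = 2 ms -- hardly achievable unless served from array cache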

    As soon as the IO pattern changes, perhaps with the addition of further streams (or metadata updates), the read-ahead/write-behind caching on the storage array will be affected.  It won't go away, but the hit ratio will go down from 90+% to a lower value.  This results in the average IO latency increasing ... but GPFS is not allowed to have more than 4 IOs queued per LUN.

    So while it is good that the read-ahead/write-behind techniques within the high-end DS8000 storage are helping, your current collection of configuration settings may be too dependent on storage-array acceleration of anticipated access patterns to scale well when running multiple IO streams, especially when constrained to only 4 threads per LUN.

    Hope this helps.

    Dave B

     

     

    Updated on 2013-10-08T18:48:17Z by db808
  • ufa
    133 Posts

    Re: slow write with concurrent reads on same node

    ‏2013-10-09T14:41:18Z  

    Thanks to both of you,

    I am with you that the setup is not suited for high streaming performance. However, I was pointing to the fact that concurrent reads and writes on the same GPFS node see very bad write performance, with file opens delayed by tens to hundreds of seconds, and that this does not occur when we just write.

    That bad write behaviour is also not seen when reading on one node and writing on the other, into the same FS.

    It is also not seen when we define two GPFS file systems and do the read and the write on one and the same node, but target one FS with the read and the other with the write.

    That at least tells me that the storage is not the only point we need to look at.

    As I said: the disk write times are awfully high when we see the issue, but they are considerably lower (and acceptable) when writing in a "fast" config (i.e. read and write on two different nodes, or on one node into two different GPFS file systems).

    For illustration I attach two diagrams (IO times as retrieved by trsum.awk from GPFS traces). 18.08.46 was slow (read and write into one FS), 18.21.09 was fast (into two different FSs). The observed dd performances were about 25MiB/s vs 380MiB/s. The file systems used here were built on just one NSD each; however, the picture is very consistent with the situation of one FS spanning both NSDs: there we got 400..550MiB/s for both read and write when reading on one node and writing on another. So the fact that we use two different FSs, with one disk taking only read requests while the other does mostly writes, is no explanation for the different IO latencies. (Mind: we see the same when reading and writing into one and the same FS but from one node (slow) or two different nodes (fast), and the long latencies we measured in the slow case on the two-NSD FS (see attached trsum.*131007.15.33.05*.png) would never allow the good write performance of the fast case.)
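
    For reference, such a trace can be taken and summarized roughly along these lines (command options and the trsum.awk path are from memory, so please check your installation):

        mmtracectl --start --trace=io -N linx15-115      # IO-class trace on the node under test
        # ... run the concurrent dd read and dd write ...
        mmtracectl --stop                                 # trcrpt.* files end up in the dump directory (/tmp/mmfs by default)
        awk -f /usr/lpp/mmfs/samples/debugtools/trsum.awk trcrpt.131008.18.08.46.* > trsum.out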

    In addition, mind that the write (like the read) was done in a loop: in the fast trace data you can see that these writes are really consecutive, while in the slow write data you see just one write, with the mentioned "dead time" before each file open actually occurs.

    Please accept my apologies for not having been clearer -- we do not need the ultimate streaming performance, but we need to understand why and how the storage latencies depend on the way GPFS is used.

    Summary:

    node1 writes to /gpfs01/file1, node1 reads from /gpfs01/file2 :    bad write performance, file open for write  delayed (see trsum.trcrpt.131007.15.33.05.linx15-115.write_iot.png and 04642.073.724.trsum.trcrpt.131008.18.08.46.linx15-115.write_iot.png)

    node1 writes to /gpfs01/file1, node2 reads from /gpfs01/file2 :    good write performance, file open for write  not delayed

    node2 writes to /gpfs01/file1, node1 reads from /gpfs01/file2 :    good write performance, file open for write  not delayed

    node2 writes to /gpfs01/file1, node2 reads from /gpfs01/file2 :    bad write performance, file open for write  delayed

    node1 writes to /gpfs01/file1, node1 reads from /gpfs02/file2 :    good write performance, file open for write  not delayed (see 04642.073.724.trsum.trcrpt.131008.18.21.09.linx15-115.write_iot.png)

     

    ufa