Topic
  • 10 replies
  • Latest Post - ‏2013-07-10T16:03:58Z by db808
mrkfact
mrkfact
7 Posts

Pinned topic Decreased Average IO size after 3.4.0.7 to 3.4.0.15 upgrade

‏2013-06-11T19:58:25Z |

After upgrading our GPFS cluster from 3.4.0.7 to 3.4.0.15, the average IO size on our NSDs to the LUNs dropped by about 30%, with an associated increase in requests/sec, queue sizes, and average wait times.  We believe this is the cause of a significant increase in the run time of many of our processes after the upgrade.  We did not make any OS or GPFS configuration changes, and upgrading to 3.4.0.21 did not resolve the issue.  Has anyone else experienced this issue?

We do not use NFS; our cluster is composed of dozens of GPFS clients connected to three NSD servers.

  • dlmcnabb
    dlmcnabb
    1012 Posts

    Re: Decreased Average IO size after 3.4.0.7 to 3.4.0.15 upgrade

    ‏2013-06-11T20:41:16Z  

    Some changes in 3.4.0.11 for cache management adversely affected performance of some workloads. Several changes in the later maintenance releases attempted to alleviate these, but it was not completely fixed until 3.4.0.18. So please upgrade to at least 3.4.0.18 (preferably higher than this) and try your testing again.

  • mrkfact
    mrkfact
    7 Posts

    Re: Decreased Average IO size after 3.4.0.7 to 3.4.0.15 upgrade

    ‏2013-06-11T22:15:42Z  
    • dlmcnabb
    • ‏2013-06-11T20:41:16Z

    Some changes in 3.4.0.11 for cache management adversely affected performance of some workloads. Several changes in the later maintenance releases attempted to alleviate these, but it was not completely fixed until 3.4.0.18. So please upgrade to at least 3.4.0.18 (preferably higher than this) and try your testing again.

    We did try upgrading one of our three clusters to 3.4.0.21, but there was no improvement compared to 3.4.0.15.  

    Thank you for your help.

  • mrkfact
    mrkfact
    7 Posts

    Re: Decreased Average IO size after 3.4.0.7 to 3.4.0.15 upgrade

    ‏2013-06-12T15:43:46Z  

    I've been testing more versions of GPFS, and it's now clear that the large decrease in IO sizes, compared to 3.4.0.7, happened in 3.4.0.8.  There is another small drop in 3.4.0.9, which matches 3.4.0.10, 3.4.0.15, and 3.4.0.21.

  • mrkfact
    mrkfact
    7 Posts

    Re: Decreased Average IO size after 3.4.0.7 to 3.4.0.15 upgrade

    ‏2013-06-12T22:28:31Z  
    • dlmcnabb
    • ‏2013-06-11T20:41:16Z

    Some changes in 3.4.0.11 for cache management adversely affected performance of some workloads. Several changes in the later maintenance releases attempted to alleviate these, but it was not completely fixed until 3.4.0.18. So please upgrade to at least 3.4.0.18 (preferably higher than this) and try your testing again.

    You were correct, dlmcnabb; it seems 3.4.0.11 is when performance degraded significantly for us.

    I ran some more tests, and it seems that small random write performance is degraded when paired with streaming reads and writes on the same GPFS client.  My tests that simulate some of our random write workload show a 2-2.5x increase in runtime, starting with 3.4.0.11.  The performance does not improve with 3.4.0.21.   I've tried "yes" and "no" for stealFromDoneList, but it seems to have little impact.  Do you know of any other options that could be tuned to improve performance?

  • dlmcnabb
    dlmcnabb
    1012 Posts

    Re: Decreased Average IO size after 3.4.0.7 to 3.4.0.15 upgrade

    ‏2013-06-13T04:54:23Z  
    • mrkfact
    • ‏2013-06-12T22:28:31Z

    You were correct, dlmcnabb; it seems 3.4.0.11 is when performance degraded significantly for us.

    I ran some more tests, and it seems that small random write performance is degraded when paired with streaming reads and writes on the same GPFS client.  My tests that simulate some of our random write workload show a 2-2.5x increase in runtime, starting with 3.4.0.11.  The performance does not improve with 3.4.0.21.   I've tried "yes" and "no" for stealFromDoneList, but it seems to have little impact.  Do you know of any other options that could be tuned to improve performance?

    StealFromDoneList was a temporary change and is no longer connected to anything.

     

    Your workload must be something different from anything we have seen. Get a trace.

    Start the test, and in the middle of it do:

      mmtrace trace=io; sleep 20; mmtrace stop

      mmfsadm dump all > dump.all
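
    For example, something along these lines from the node running the test (a rough sketch; the workload name and timings are placeholders):

      # Start the workload, let it get past ramp-up, then capture a 20-second
      # IO trace plus a dump of GPFS internal state on the same node.
      ./random_write_test &           # placeholder for your workload
      TESTPID=$!

      sleep 120                       # wait until the test is mid-run (adjust)

      mmtrace trace=io                # start GPFS IO tracing
      sleep 20
      mmtrace stop                    # stop tracing

      mmfsadm dump all > dump.all     # snapshot of internal GPFS state

      wait $TESTPID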

  • mrkfact
    mrkfact
    7 Posts

    Re: Decreased Average IO size after 3.4.0.7 to 3.4.0.15 upgrade

    ‏2013-06-13T12:47:44Z  
    • dlmcnabb
    • ‏2013-06-13T04:54:23Z

    StealFromDoneList was a temporary change and is no longer connected to anything.

     

    Your workload must be something different from anything we have seen. Get a trace.

    Start the test, and in the middle of it do:

      mmtrace trace=io; sleep 20; mmtrace stop

      mmfsadm dump all > dump.all

    In my test, I'm running, on a GPFS client, a streaming read/write (while true; do time cp -f 5gbfile 5gbfilea; done) alongside a custom binary that replicates a process we run regularly.  It makes many small pwrite() calls to semi-random locations within a 20gb file.  On 3.4.0.7, the random write run consistently finishes in about 5 minutes, but on 3.4.0.21 it takes about 12 minutes.
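
    For anyone who wants to approximate the random writer without our binary, a crude dd-based stand-in (file name, write size, and iteration count are arbitrary, and our binary's exact flags differ) would be:

      # Crude approximation of the custom pwrite() workload: many small writes
      # to semi-random offsets inside an existing 20gb file.
      FILE=20gbfile                        # pre-existing target file
      FILESIZE_KB=$((20 * 1024 * 1024))    # 20 GB expressed in KB

      for i in $(seq 1 100000); do
          # pick a random 4 KB-aligned offset inside the file
          OFFSET_KB=$(( (RANDOM * 32768 + RANDOM) % FILESIZE_KB ))
          dd if=/dev/zero of="$FILE" bs=4k count=1 seek=$(( OFFSET_KB / 4 )) \
             conv=notrunc 2>/dev/null
      done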


  • mrkfact
    mrkfact
    7 Posts

    Re: Decreased Average IO size after 3.4.0.7 to 3.4.0.15 upgrade

    ‏2013-06-15T02:03:26Z  
    • dlmcnabb
    • ‏2013-06-13T04:54:23Z

    StealFromDoneList was a temporary change and is no longer connected to anything.

     

    You workload must be something different from anything we have seen. Get a trace.

     Start test and in the middle do:

      mmtrace trace=io; sleep 20; mmtrace stop

      mmfsadm dump all > dump.all

    I was able to reduce the noise in the trace by changing the streaming read+write to just a streaming read (while true; do time cp  5gbfile /dev/null; done).  This caused the run time of the random write script to be 12m47s on 3.4.0.7 and 34m12s on 3.4.0.21.


  • db808
    db808
    87 Posts

    Re: Decreased Average IO size after 3.4.0.7 to 3.4.0.15 upgrade

    ‏2013-07-08T23:13:24Z  

    Hi mrkfact,

    Do you use separate data and metadata NSDs?  Can you tell if the average IO size to the DATA NSD decreased, or did the metadata activity (typically .5kb to 32kb) increase such that it brought the average IO size (data+metadata) down?

    If metadata activity went up, then you have a big clue.  You should be able to use mmfsadm dump disks to dump out the detailed disk statistics and see where all the IO activity is going, by disk object type.

    It would be interesting to perform a "mmfsadm dump disks" before and after your test and compute the delta.  This would represent all the IO that was needed to accomplish the test.  How much of this IO is "constructive"?
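
    For example (a minimal sketch; a plain diff is crude, but it shows which counters moved):

      # Capture per-disk GPFS statistics before and after the test run.
      mmfsadm dump disks > disks.before
      ./run_test                        # placeholder for the workload under test
      mmfsadm dump disks > disks.after

      # A real script would parse and subtract the per-disk counters; a diff at
      # least shows where the IO activity went.
      diff disks.before disks.after | less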

    Have you optimized your Linux disk IO stack?  For each Linux disk device, you should be using the noop scheduler, with ZERO read ahead, and "max_sectors_kb" set to your GPFS block size.

    These settings are found at:

    /sys/block/{disk device}/queue/scheduler

    /sys/block/{disk device}/queue/max_sectors_kb

    /sys/block/{disk device}/queue/read_ahead_kb

    where {disk device} is sdxx.  With request-based multipath (RHEL 6 and newer), the dm-xx multipath disk names also need to be handled.
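
    A sketch of applying this at run time (device names and the 4096kb value are only examples; use your own LUN list and GPFS block size, remember these settings do not persist across reboot, and note that max_sectors_kb is capped by the device's max_hw_sectors_kb):

      # noop scheduler, no Linux read ahead, and a max IO size matching the
      # GPFS block size (example: 4 MB) on every GPFS LUN.
      GPFS_BLOCK_KB=4096                      # set to your file system block size

      for dev in sdb sdc sdd dm-2 dm-3; do    # example names; substitute your LUNs
          q=/sys/block/$dev/queue
          echo noop           > $q/scheduler   2>/dev/null  # some dm devices have no scheduler
          echo 0              > $q/read_ahead_kb
          echo $GPFS_BLOCK_KB > $q/max_sectors_kb
      done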

    The Linux default IO scheduler of "cfq", or completely (un)fair queuing, is a timesharing-like scheduler that is inappropriate for GPFS.  Also, the default Linux read ahead attempts to perform a physical read ahead, assuming the data is contiguous on disk.  GPFS performs a much more intelligent read ahead, and you don't need Linux second-guessing GPFS.  The CFQ scheduler also burns more CPU than NOOP, keeping IO statistics to support its timesharing-like use cases.  It would not take too many incorrect read-ahead guesses by Linux to increase the IO activity.

    If you are running the CFQ scheduler and GPFS "appears" more compute bound, you could have delayed or re-ordered IO.  The CFQ scheduler will allow an IO request to languish a bit to see if it can be coalesced with a future IO.

    There still could be some issues with GPFS buffer management increasing IO activity as you have suggested, but if you are running on a sloppy Linux IO stack, those GPFS differences could be amplified.

    Dave B

     

  • mrkfact
    mrkfact
    7 Posts

    Re: Decreased Average IO size after 3.4.0.7 to 3.4.0.15 upgrade

    ‏2013-07-10T13:52:28Z  
    • db808
    • ‏2013-07-08T23:13:24Z

    Hi mrkfact,

    Do you use separate data and metadata NSDs?  Can you tell if the average IO size to the DATA NSD decreased, or did the metadata activity (typically .5kb to 32kb) increase such that it brought the average IO size (data+metadata) down?

    If metadata activity went up, then you have a big clue.  You should be able to use mmfsadm dump disks to dump out the detailed disk statistics and see where all the IO activity is going, by disk object type.

    It would be interesting to perform a "mmfsadm dump disks" before and after your test and compute the delta.  This would represent all the IO that was needed to accomplish the test.  How much of this IO is "constructive"?

    Have you optimized your Linux disk IO stack?  For each Linux disk device, you should be using the noop scheduler, with ZERO read ahead, and "max_sectors_kb" set to your GPFS block size.

    These settings are found at:

    /sys/block/{disk device}/queue/scheduler

    /sys/block/{disk device}/queue/max_sectors_kb

    /sys/block/{disk device}/queue/read_ahead_kb

    where {disk device} is sdxx.  With request-based multipath (RHEL 6 and newer), the dm-xx multipath disk names also need to be handled.

    The Linux default IO scheduler of "cfq", or completely (un)fair queuing, is a timesharing-like scheduler that is inappropriate for GPFS.  Also, the default Linux read ahead attempts to perform a physical read ahead, assuming the data is contiguous on disk.  GPFS performs a much more intelligent read ahead, and you don't need Linux second-guessing GPFS.  The CFQ scheduler also burns more CPU than NOOP, keeping IO statistics to support its timesharing-like use cases.  It would not take too many incorrect read-ahead guesses by Linux to increase the IO activity.

    If you are running the CFQ scheduler and GPFS "appears" more compute bound, you could have delayed or re-ordered IO.  The CFQ scheduler will allow an IO request to languish a bit to see if it can be coalesced with a future IO.

    There still could be some issues with GPFS buffer management increasing IO activity as you have suggested, but if you are running on a sloppy Linux IO stack, those GPFS differences could be amplified.

    Dave B

     

    Hey Dave,

    We do not use separate data and metadata NSDs.  Currently we're undergoing a migration from EVA storage to Violin, and based on my understanding of GPFS, I don't think separating them will provide much benefit on Violin.  We migrated our 3.4.0.7 cluster to Violin already, and should have the 3.4.0.15 one done in a couple weeks.  It'll be interesting to see how the performance compares between the two versions after the storage change.

    I opened a PMR with IBM, and support noted that we have many "waiting for exclusive use of connection" messages in an mmtrace, which they claim is due to network congestion.  We've been told that, on the NSDs, ACKs for writes are stalled behind read packets and we've been trying various sysctl socket buffer and interface tuning options, but none have helped thus far.  I have also tried to identify the cause of the network congestion (tcp/ip stack on our boxes or our networking gear), but I haven't been able to find anything.

    We have not optimized the Linux disk IO stack yet (we are using RHEL6), but we have planned to make the same changes you're recommending in our dev cluster sometime next week.  Tonight I'm going to test pinning all IRQs for the NICs to one CPU (currently they're evenly balanced across all CPUs), and pinning mmfsd to the same CPU.  In our lab it showed improvement on 3.4.0.21, but that's in a quiet environment using 1GbE.  Our dev cluster has a lot of real world workload and uses mostly 10GbE, so I'm not sure what to expect yet.
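
    For reference, the pinning I plan to test looks roughly like this (interface name and CPU choice are from our lab boxes, and irqbalance has to be stopped first or it will rewrite the affinity masks):

      # Pin all IRQs belonging to the NIC (eth2 as an example) to CPU0, then
      # move the running mmfsd onto the same CPU.
      service irqbalance stop

      for irq in $(grep eth2 /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
          echo 1 > /proc/irq/$irq/smp_affinity    # bitmask 0x1 = CPU0
      done

      # Note: this only re-affines the existing process; see the caveats about
      # GPFS child threads in the reply below.
      taskset -pc 0 $(pidof mmfsd)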

    Thanks for your help,

    Mike

     

  • db808
    db808
    87 Posts

    Re: Decreased Average IO size after 3.4.0.7 to 3.4.0.15 upgrade

    ‏2013-07-10T16:03:58Z  
    • mrkfact
    • ‏2013-07-10T13:52:28Z

    Hey Dave,

    We do not use separate data and metadata NSDs.  Currently we're undergoing a migration from EVA storage to Violin, and based on my understanding of GPFS, I don't think separating them will provide much benefit on Violin.  We migrated our 3.4.0.7 cluster to Violin already, and should have the 3.4.0.15 one done in a couple weeks.  It'll be interesting to see how the performance compares between the two versions after the storage change.

    I opened a PMR with IBM, and support noted that we have many "waiting for exclusive use of connection" messages in an mmtrace, which they claim is due to network congestion.  We've been told that, on the NSDs, ACKs for writes are stalled behind read packets and we've been trying various sysctl socket buffer and interface tuning options, but none have helped thus far.  I have also tried to identify the cause of the network congestion (tcp/ip stack on our boxes or our networking gear), but I haven't been able to find anything.

    We have not optimized the Linux disk IO stack yet (we are using RHEL6), but we have planned to make the same changes you're recommending in our dev cluster sometime next week.  Tonight I'm going to test pinning all IRQs for the NICs to one CPU (currently they're evenly balanced across all CPUs), and pinning mmfsd to the same CPU.  In our lab it showed improvement on 3.4.0.21, but that's in a quiet environment using 1GbE.  Our dev cluster has a lot of real world workload and uses mostly 10GbE, so I'm not sure what to expect yet.

    Thanks for your help,

    Mike

     

    Hi Mike,

    Thanks for the information.  I would HIGHLY recommend using separate data and metadata.  Initially, it may not "improve" performance, but it can increase visibility as to what is going on, and it provides the foundation for further optimization that may require or benefit from separate data and metadata.  We've been living with separate data and metadata for over 3 years now, and from an administration and monitoring viewpoint, I would not go back.

    It is more tedious because you have more LUNs and NSDs to deal with, but in large GPFS systems, dealing with LUNs and NSDs is often scripted because of the sheer number of them, so modifying a script that handles hundreds of LUNs/NSDs to also handle metadata NSDs is not that bad.
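
    For anyone following along, the data/metadata split is just the DiskUsage field in the NSD descriptors. A hedged example (device, server, NSD, and file system names are invented, and the exact descriptor format varies by release, so check the mmcrnsd man page for yours):

      # Example mmcrnsd descriptor file: two data-only LUNs, one metadata-only LUN.
      # Assumed field order: DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool
      {
        echo "sdb:nsdserver1:nsdserver2:dataOnly:1:data_nsd_01:system"
        echo "sdc:nsdserver2:nsdserver1:dataOnly:2:data_nsd_02:system"
        echo "sdd:nsdserver1:nsdserver2:metadataOnly:1:meta_nsd_01:system"
      } > /tmp/nsd.desc

      mmcrnsd -F /tmp/nsd.desc
      mmlsdisk gpfs1     # once added to the file system, the "holds metadata" /
                         # "holds data" columns confirm the split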

    A few examples.  With separate metadata LUNs, you can use standard Linux disk performance monitoring tools (post-processed by a script) to monitor metadata activity and data activity independently.  We often do a "sanity check" on the amount of IO activity that occurs for a given workload.  Most often, the "data" activity for a workload is easily explainable: you read X files of average size Y, and it should generate an aggregate IO of Z.

    However, when you try to explain the metadata activity for the workload, you are often off by factors of 10 or more.  This behavior is masked when you combine data and metadata, unless you use some invasive GPFS monitoring.  (BTW:  I would love to have a script that monitored the statistics available in "mmfsadm dump disks", but I have not built it yet.)
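
    As a simple example of the kind of independent monitoring I mean (device lists are site-specific):

      # Watch the data LUNs and metadata LUNs separately with plain iostat.
      DATA_DEVS="sdb sdc"     # example data LUNs
      META_DEVS="sdd"         # example metadata LUNs

      # avgrq-sz plus r/s and w/s make the large-streaming vs small-IOP split obvious
      iostat -xk $DATA_DEVS $META_DEVS 5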

    In our case, we often find significant amounts of unneeded and/or unconstructive work occurring, and it greatly amplifies the amount of metadata activity, especially when compared to the "constructive" data-centric work.  Improper use of "searchlist"-like functionality can generate tons of metadata activity.  atime and mtime handling can be extremely significant.  Sometimes we create a small separate GPFS file system for the cases where atime and mtime must be accurate cluster-wide, rather than suffer the overhead penalties across the bulk of the file system.

    We have gone so far as to co-mingle a small metadata LUN with its corresponding data LUN on the same RAID group, which might be heresy from a "best practice" standpoint.  It may not be the ideal configuration, but we have found it more productive and lower cost.  We would all like to have high speed storage for metadata, but that increases the cost of the solution.  From an initial performance standpoint, co-mingling the data and metadata LUNs on the same RAID group should be no worse than a single combined data+metadata LUN.  However, GPFS now has twice the amount of internal IO queuing resources available, and the data LUN and metadata LUN can be configured differently.

    If you are using GPFS for an IOP-centric workload with small GPFS block sizes, the advantage of independently tailoring the data and metadata is small and may not be worth the effort.  However, if you primarily have "large" IO files, with average data IO sizes of 1+ MB, the advantage of metadata tailoring can be significant.  In our case, we currently use a 4 MB GPFS block size, and are migrating to 8 MB.

    The disk space used by a directory is one GPFS fragment, or 1/32 of the block size.  With the GPFS 1MB default block size, the fragment size is 32kb.  This results in the smallest directory being 32kb ... even for a directory with just a few files.  If you have a "tall" directory structure, where there are a lot of directories with just a few files, allocating 32kb per directory can be very wasteful.  A 32kb directory can probably handle 500+ files, depending on file name lengths.  In our case, with a 4MB GPFS block size, our minimum directory size would consume 128kb.

    If you were going to use high speed storage for metadata, this increases the amount of high speed storage needed.

    If you have an independent metadata LUN/NSD, you can now specify the metadata block size, which determines the metadata fragment size and the minimum directory size.  If the metadata block size were 256kb, the fragment and minimum directory size would be only 8kb.
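
    The arithmetic is just the block size divided by 32, e.g.:

      # Fragment (and therefore minimum directory) size = block size / 32
      for bs_kb in 256 1024 4096 8192; do
          echo "block size ${bs_kb}kb -> minimum directory size $(( bs_kb / 32 ))kb"
      done
      # 256kb -> 8kb, 1024kb -> 32kb, 4096kb -> 128kb, 8192kb -> 256kb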

    As to your network congestion remediation experiments, I would suggest caution.  You can easily make things worse.

    I would first recommend "undoing" any network optimizations beyond what is listed elsewhere in GPFS best practices (primarily kernel network memory parameters).  DO NOT mess with the socket buffer size options.  If you specify a socket buffer size, this completely disables the very competent Linux auto-size management of network buffers.  In fact, messing with the socket buffer sizes (and disabling Linux auto-management) can LEAD to congestion and a condition known as "buffer bloat".
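
    A quick way to confirm the auto-tuning is still in effect (stock RHEL 6 sysctls; the values themselves will vary by system):

      # 1 means the kernel is auto-sizing receive buffers per connection.
      sysctl net.ipv4.tcp_moderate_rcvbuf

      # min / default / max ranges the auto-tuner works within, plus the
      # per-socket ceilings.
      sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
      sysctl net.core.rmem_max net.core.wmem_max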

    On the network congestion issue ... is it occurring on the NSD server side or the NSD client side?

    Are you running with Ethernet hardware flow control (PAUSE) enabled on both the NIC and the switch port?  Some people still suggest staying away from hardware flow control because of the fragility, interoperability issues, and poor implementations when the standards first came out over 10 years ago, but today, with contemporary NICs and modern switches, it is very reliable.
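
    On the NIC side, checking and enabling PAUSE is a one-liner per interface (eth2 is an example name; the switch port must be configured separately):

      ethtool -a eth2               # show current rx/tx pause settings
      ethtool -A eth2 rx on tx on   # enable rx and tx pause frames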

    Please note ... InfiniBand and Fibre Channel have hardware flow control and lossless operation built into the standard.  These channels are often used for storage transports, and the storage IO stack is empirically optimized based on the assumption that the transport is flow-controlled and lossless.  IO errors are very, very rare, and the remediation code, although critical, is not executed very often.  I've run 8gbit FC wire-speed many-to-one tests across multiple ports for days without errors.  Contrast this with being able to force a network error at will on a 2-to-1 fan-in network test.

    So, prophylactically, the more you can make Ethernet operate with flow control and near-lossless operation, the less hassle you will have further up the IO stack.  Ethernet hardware flow control (PAUSE) is one important element of this.  It is not a silver bullet, but it helps substantially in a "congested" environment, allowing you to better identify the root cause of the misbehaving network activity.  The major gotcha is that you must enable the PAUSE functionality on the switch port, along with enabling it on the NIC.

    If the network congestion is happening on the GPFS client, you could be experiencing a condition called "incast", where a client sends out multiple asynchronous requests to different servers and multiple servers respond concurrently, overwhelming the single connection on the client.  The simplest way to manage incast on the GPFS client is via the maxMBpS parameter.  It should be 1-to-2 times the bandwidth of the client-side interconnect.  For safety on an Ethernet interconnect, you could try 90% of the interconnect speed.  For a 10 gig link, this would be 0.9 x 1250 = 1125.
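
    Trying that is a single configuration change, e.g. (the client node name is an example):

      # Cap the client's in-flight prefetch/write-behind bandwidth to ~90% of 10GbE.
      mmchconfig maxMBpS=1125 -N client01
      # Restart GPFS on that node (mmshutdown/mmstartup) if the change does not
      # take effect immediately on your release.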

    If you try this, and the problem goes away, you had classic incast.  Then you can decide how much time you want to spend to help improve the network to improve many-to-one conditions, and further increase maxMBpS.

    Note: avoiding bufferbloat (by letting Linux manage the socket buffers) and enabling Ethernet hardware flow control from the GPFS clients to the network switch, and from the GPFS servers to the network switch, will also reduce incast issues due to the improved flow control.

    On your attempt to "pin" GPFS processes ...... be extremely careful.  It probably won't react like you expect.  GPFS is heavily multi-threaded, and it operates with both kernel-based threads and user-space threads.  Many of the GPFS daemons just enter the kernel and stay there.  Others spend significant time in user space.  Affining the parent thread after its child threads have been created will NOT impact the child threads.  Also, affining a thread is only in effect for the lifespan of that thread.  If the thread is destroyed and later recreated, the new thread will not be affined.

    The other major issue with GPFS threads is that many are multi-function.  One thread can perform different functions at different times.  When managing CPU affinity and CPU priorities, you often want to identify the producer/consumer relationship, with the "producer" running with higher priority.  This helps avoid priority inversion issues.  Unfortunately, the multi-function nature of the GPFS threads makes this type of technique difficult.

    Please also check that you have enabled the various off-load functionality in the NICs that you are using.  Be careful, some NIC parameter changes may not persist across network up/down or across a reboot.
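
    ethtool shows and sets the offloads per interface (example interface name; the exact offload names depend on the driver):

      ethtool -k eth2                                    # list checksum/TSO/GSO/GRO/LRO state
      ethtool -K eth2 rx on tx on tso on gso on gro on   # enable the ones the driver supports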

    Somewhere else on the GPFS wiki is a posting about the NSD network perf tool, that will generate NSD-like network traffic, without GPFS running.  I have been told that it can be a useful tool to stress and validate a network for GPFS use ... without GPFS.  The tests are also easily reproducible.

    Lastly, I would have to believe that if GPFS is detecting any form of congestion, there must be some visible artifact lower down the IO stack.  From my experience, the "problem" is likely NOT visible in the Ethernet interface-level statistics, but is more likely visible in the IP- and TCP-level statistics shown by netstat.  You have to look at the delta statistics across two netstat samples.  A challenge with TCP-level statistics is that they are system-wide, and not NIC-specific.
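
    A crude way to get the delta:

      # Two system-wide TCP/IP snapshots a minute apart; diff the counters and
      # look for retransmits, pruned/collapsed segments, and listen drops.
      netstat -s > netstat.1
      sleep 60
      netstat -s > netstat.2
      diff netstat.1 netstat.2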

    RHEL 6.x also introduced the "dropwatch" tool, which monitors where in the Linux network stack packets are being dropped.  There are also useful macros/examples of using the systemtap tool for network-centric troubleshooting.

    Don't be alarmed if the dropwatch tool immediately starts logging dropped packets.  These are probably legitimate messages being sent to "services" that are not running on the system.  Many of these can be broadcast packets that don't apply to the system.  You will likely need to post-process the dropwatch output to filter out such drops.

    Hope this helps.

     

    Dave B