23 replies Latest Post - ‏2012-10-15T16:57:14Z by db808
122 Posts

Pinned topic GPFS Blocksize, Storage segment size, IO size, Performance

‏2012-09-21T11:46:05Z |
Dear all,

I am about to check out the GPFS setup for an HPC system.
The design follows the suggestion to have storage building blocks comprising 2 NSD servers (x3650, SLES11SP2, kernel 3.0.38, LSI SAS2116 ), directly linked by 6 SAS links to 3 DCS3700 (no expansions, 60 3TB SAS NL disks in 6 RAID6 ).

I am just testing one such block. Currently, the IB uplinks of the NSD servers are not functional, so I can just test local access from the NSD servers themselves.

The current setup has GPFS blocksize of 4M and RAID segment size of 512KB -- according to documentation this should allow for full stripe width writes and reads omitting the parity RAID-specific read/modify/write penalty on full block IOs.

However, the DCS3700 only receives IO requests of 512KB size (as seen by relating the IO throughput (MB/s) to the IO rate (IO/s)).
This is regardless of setting max_sectors_kb to 4096 for the LUNs (it is at 1024 anyway) and also regardless of using RDAC or DMMP/MPIO (AKA multipath). BTW, if I set max_sectors_kb to 4096 when using multipath, the IO system seems to block (IOs hang forever), and trying to mmshutdown freezes the OS. max_hw_sectors_kb is 4096. max_segments is at 128; I have learned that this makes for a BIO size of 512KB, but the system should be able to concatenate several BIOs into one IO request to send down to the storage.
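A minimal sketch of how the average request size can be derived from the two counters mentioned above (throughput and IO rate). The function name and the sample numbers are illustrative assumptions, not values taken from the DCS3700.

```python
def avg_io_size_kib(throughput_mib_s, io_per_s):
    """Average request size in KiB from throughput (MiB/s) and IO rate (IO/s)."""
    if io_per_s <= 0:
        raise ValueError("io_per_s must be positive")
    return throughput_mib_s * 1024.0 / io_per_s


# Hypothetical sample values, not measurements from the system above.
print(avg_io_size_kib(1000.0, 2000.0))  # 512.0
```

Because both counters are averaged over a sampling interval, the result is only an average; it cannot show whether individual requests were larger or smaller.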

I suspect that the SAS driver cuts everything down to 512KB, as iostat reports IO sizes around 1024KB for the LUNs. Or could it be something else? I ask about another cause because, when running dd to the LUNs (multipath devices, e.g. the /dev/dm-* in DMMP), the storage system does get IO requests of 1024KB. Does that have to do with unaligned IO? Why can GPFS not write its data aligned to disk?

Nevertheless, if our IOs are 512KB (or even 1024KB; in any case smaller than the GPFS blocksize of 4MB), it does IMHO not make sense to follow the books by setting segment_size=GPFS_BS/stripe_width. Instead, my first approach was segment_size=IO_size/stripe_width.
Now with our stripe width of 8 this would yield a segment size of only 64KB.
For 2GB/s from one DCS3700, we'd need about 43MB/s from each disk, and the above segment size of 64KB would then relate to 672IO/s. That is somewhat beyond the capabilities of the disks, i.e. unrealistic. But I think choosing 256KB as segment size should be better (OK, 160 IO/s is still a lot for those disks).
But the best would be to get the IO size extended.
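The per-disk estimate above can be reproduced with a short calculation. This is a rough sketch under stated assumptions: 48 data disks (6 RAID6 arrays with 8 data disks each), "2GB/s" read as 2048 MiB/s, and one disk IO per RAID segment. The results land near the rounded figures in the post rather than exactly on them.

```python
def per_disk_load(target_mib_s, data_disks, segment_kib):
    """Return (MiB/s per data disk, IO/s per data disk).

    Assumes every disk IO moves exactly one RAID segment.
    """
    per_disk_mib_s = target_mib_s / data_disks
    ios_per_s = per_disk_mib_s * 1024.0 / segment_kib
    return per_disk_mib_s, ios_per_s


DATA_DISKS = 6 * 8        # 6 RAID6 arrays with 8 data disks each
TARGET_MIB_S = 2 * 1024   # "2GB/s", read here as 2048 MiB/s

for segment_kib in (64, 256, 512):
    mib_s, iops = per_disk_load(TARGET_MIB_S, DATA_DISKS, segment_kib)
    print(f"segment {segment_kib:3d} KiB: {mib_s:.1f} MiB/s, {iops:.0f} IO/s per disk")
```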

Another issue I have seen now in two GPFS setups, the one described above and one using DDN SFA storage and IB instead of SAS: In many situations, write bandwidth exceeds read bandwidth considerably. In the DDN SFA setup, I suppose that might be due to slow disks which can be masked by write cache, but not when reading (this view is supported by the fact that removing the slow LUNs and NSD clients appears to get read equal to write). However, in the DCS/SAS setup, I did not see any particularly slow disks, and given that our effective IO size is the same as the RAID segment size, we should rather see some penalty on writes.
Running gpfsperf (I used some wrapper scripts I got from B.H.) on the two NSD servers yields about 2.2GB/s on each of them for write, but only 0.9..1.5GB/s on each of them for read. Consistent behaviour is seen when just running parallel dd commands (accessing the GPFS filesystem) on the two GPFS nodes, except that the higher read rates could not be reached by any number of parallel dds, but only seen running gpfsperf with high numbers of threads (>=32, maximum at 64 threads). We'd like to see >2GB/s per node for read. The average rate per node depended on the number of threads as follows:
TH=2 ### RATE/CLI=720MiB/s
TH=4 ### RATE/CLI=949MiB/s
TH=8 ### RATE/CLI=1029MiB/s
TH=16 ### RATE/CLI=1117MiB/s
TH=18 ### RATE/CLI=1129MiB/s
TH=24 ### RATE/CLI=1228MiB/s
TH=32 ### RATE/CLI=1302MiB/s
TH=36 ### RATE/CLI=1310MiB/s
TH=48 ### RATE/CLI=1445MiB/s
TH=64 ### RATE/CLI=1500MiB/s
TH=72 ### RATE/CLI=1414MiB/s
TH=96 ### RATE/CLI=1414MiB/s

Write rates hovered around 2100..2200MiB/s per node without prominent dependency on thread numbers, but still the highest values are seen for high thread numbers:
TH=2 ### RATE/CLI=2184MiB/s
TH=4 ### RATE/CLI=2184MiB/s
TH=8 ### RATE/CLI=2137MiB/s
TH=16 ### RATE/CLI=2160MiB/s
TH=18 ### RATE/CLI=2114MiB/s
TH=24 ### RATE/CLI=2160MiB/s
TH=32 ### RATE/CLI=2091MiB/s
TH=36 ### RATE/CLI=2137MiB/s
TH=48 ### RATE/CLI=2091MiB/s
TH=64 ### RATE/CLI=2234MiB/s
TH=72 ### RATE/CLI=2234MiB/s
TH=96 ### RATE/CLI=2286MiB/s

One more, maybe interesting, point, just to let you know:
I first ran with OS kernel 3.0.13, where I saw clearly imbalanced IO rates between the two nodes: one got just half the rate of the other. That could not be overcome by reversing the start order of jobs. However, that phenomenon disappeared with the update to the 3.0.38 kernel.
So, as a summary of my concerns:

Does anybody know a way to get larger IO requests down to the storage subsystem in our equipment?

If we have to live with 512KB IOs, would you agree that 256KB is a better choice for segment size than 512KB, regardless of GPFS BS?

How could the worse read performance compared to write be explained in our setup?

Thanks in advance for any thoughts, suggestions, tips.

Updated on 2012-10-15T16:57:14Z at 2012-10-15T16:57:14Z by db808
  • HajoEhlers
    251 Posts

    Re: GPFS Blocksize, Storage segment size, IO size, Performance

    ‏2012-09-21T15:52:51Z  in response to ufa
    Just a few thoughts
    0) Check what the maximum transfer size for the given array is.
    1) Check what the maximum transfer size for the given adapter is.
    2) Check what the maximum transfer size for the given OS is.
    3) Check that the OS is configured for the lowest number from points 0, 1 and 2.

    Even with a block size of 4 MB and a transfer size of 512K it does NOT mean that your disk will receive chunks in 512K sizes.
    It means that your OS has to split the 4 MB chunk into 8 * 512K. So the number of commands is going to increase.

    Since at least current disks and arrays should coalesce write streams, at the end we are back to one 4MB write.

    We will write 4 MB to an array with 4+2 disks and a 4MB stripe size.

    On a GPFS with 4MB bs , 256k transfer size from the adapter we have:
    4MB write -> 1 GPFS block ( 1 LUN used ) -> 16 Transfers -> 1 Write to 1 Lun ( ONE IO / 4 IOs total )*

    On a GPFS with 1MB bs , 256k transfer size from the adapter we have:
    4MB write -> 4 GPFS blocks ( 4 LUN used ) -> 16 Transfers -> 4 Writes to 4 Luns ( FOUR IOs / 16 IOs total)*
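The two cases above can be checked with a small calculation. This sketch only counts GPFS blocks and adapter transfers for one aligned write; the function name is an illustrative assumption, and the RAID-level IOs are not modelled.

```python
def split_write(write_kib, gpfs_block_kib, transfer_kib):
    """Return (GPFS blocks touched, adapter transfers) for one aligned write."""
    blocks = -(-write_kib // gpfs_block_kib)    # ceiling division
    transfers = -(-write_kib // transfer_kib)   # ceiling division
    return blocks, transfers


print(split_write(4096, 4096, 256))  # (1, 16)
print(split_write(4096, 1024, 256))  # (4, 16)
```

The number of adapter transfers is the same in both cases; what changes with the GPFS block size is over how many blocks (and therefore LUNs) the write is spread.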

    So for large streams use a large RAID block size (in streaming mode even "slow" disks are fast) and have a GPFS bs equal to it or a multiple of it.
    But do not make the block size too large, since then the LUN is blocked for others.
    In your setup

    > The current setup has GPFS blocksize of 4M and RAID segment size of 512KB

    I would try/test a RAID segment size of 4MB ( 1MB/disk in a 4+2 raid 6) since, simply speaking, a 4MB write would result in only 4 IOs on the array.
    > If we have to live with 512KB IOs, would you agree that 256KB is a better choice for segment size than 512KB, regardless of GPFS BS?
    As I said, for streaming data I would go even much larger.

    > How could the worse read performance compared to write be explained in our setup?
    Because of the way you read and the way you set up/configured your array?

    Keep in mind that on any disk the "seek time" is in ms. Thus you must prevent steady seeks on the disk/array. That's the reason for large block sizes.

    But if the block size is too large (applies to shared LUNs), one node gets its data fast but others will suffer since they have to wait until the transfer has finished.
    If the block size is too small, the seek time will determine the overall transfer speed and then you are down to a few megabytes/sec.

    But as said - Just a few thoughts
    Even if you cannot have a large RAID segment size, you could check whether the array supports large prefetches up to the GPFS block size.

    * Not counting any RAID-type-related IOs
    • ufa
      122 Posts

      Re: GPFS Blocksize, Storage segment size, IO size, Performance

      ‏2012-09-24T11:55:49Z  in response to HajoEhlers
      Hi HaJo,
      I guess there is some misunderstanding WRT RAID segment size and stripe width (what you call stripe size).

      stripe_width= N*segment size
      with N being the number of data disks in a stripe (e.g. in a 4+2 RAID6, N would be 4, in an 8+1 RAID5 N would be 8)
      If I say "segment size of 512KB", that means that in our 8+2 RAID6, IOs are split so each physical disk gets a share of 512KB, and the full stripe size / width is 4MB.
      I suppose that's what you recommend.
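As a quick check of the relation stripe_width = N * segment size, here is a minimal sketch. The helper names are illustrative assumptions, and the full-stripe test only looks at the IO size, not at its start offset.

```python
def stripe_width_kib(data_disks, segment_kib):
    """Stripe width in KiB: number of data disks times segment size."""
    return data_disks * segment_kib


def is_full_stripe_multiple(io_kib, data_disks, segment_kib):
    """True if the IO size is a whole multiple of the stripe width."""
    return io_kib % stripe_width_kib(data_disks, segment_kib) == 0


print(stripe_width_kib(8, 512))               # 4096
print(is_full_stripe_multiple(4096, 8, 512))  # True
print(is_full_stripe_multiple(512, 8, 512))   # False
```

With 8 data disks and 512KB segments, only IOs of 4MB (or multiples) can cover a full stripe; a 512KB request touches a single segment.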

      You wrote that the storage adapters would coalesce smaller IO requests on their own - how could I check that?

      • ufa
        122 Posts

        Re: GPFS Blocksize, Storage segment size, IO size, Performance

        ‏2012-09-24T12:03:19Z  in response to ufa
        In my previous post, please read storage controllers for storage adapters.

      • HajoEhlers
        251 Posts

        Re: GPFS Blocksize, Storage segment size, IO size, Performance

        ‏2012-09-24T14:01:47Z  in response to ufa
        > You wrote that the storage adapters would coalesce smaller IO requests on their own - how could I check that?

        I would check the IOs per disk on a given LUN on your DCS3700 during a test run. Thus checking how many IOs are getting really to the final disk(s).

        Coalesce writes.
        As I said, the idea is to store a certain amount of data in the storage array's LUN cache before it is written to disk,
        if the storage array is able to do so and is configured to do so.

        Otherwise you optimize the upper stacks for 4MB block but then the storage array caches only 1 MB before it starts to write to disk ..... ;-(
        Coalesce reads
        And of course have a max prefetch equal to the GPFS block size. ( If we have to read from slow disks its cheaper to read more data into the cache and coalesce this way read requests . )
        But be aware that I am not familiar with the DCS3700 - my experience comes from an EMC Clariion.

        So ymmv
        Little bit old but nice to read ( Page 8 & 9)
  • db808
    86 Posts

    Re: GPFS Blocksize, Storage segment size, IO size, Performance

    ‏2012-09-21T17:57:19Z  in response to ufa
    Hello ufa,

    Warning .... long answer.

    I have significant experience with the DCS3700 in the GPFS environment and am running 1600 MB/sec per 60-disk shelf, across multiple shelves. Our largest current cluster tops out at ~10,400 MB/sec. We can scale further if we wanted to.
    One of my previous postings:

    Therefore in summary ... it can be done. Now, let's figure out what are the constrictions.

    First, realize that with only 2 x 6GBit SAS connections per DCS3700, you will never go faster than 1200 MB/sec read or 1200 MB/sec write from a single NSD node. You might be able to get to 1600 MB/sec in mixed read/write full-duplex mode. Only when both NSD nodes are concurrently accessing the same DCS3700 will you consistently get more than 1200 MB/sec from a single DCS3700.

    Focusing on the 1200 MB/sec ceiling for a single NSD server, this equates to an average of 200 MB/sec for 6 LUNs. In this case, you are constrained by the limited SAS connectivity (to a single node).

    You correctly identified that the most-critical performance "knob" that you have to work with is the GPFS block size. It is critical to do 4 MB host IO to get stellar performance.

    I like to use two analogies.

    The first is that the Linux IO stack is like an automotive drive train. There are many sub-components, and the overall end-to-end "gear ratio" is an arithmetic combination of the intermediate "gear ratios". The individual sub-components' "gear ratios" need to be cross-coordinated, and there are multiple combinations that can yield the same end-to-end result. Therefore, there can be multiple "right" answers.

    For RHEL Linux (and probably the same for SUSE), the default "gear ratios" are designed for good overall performance for a mid-size sedan. As you would expect, the "gear ratios" for a tractor-trailer truck would be much different. Similarly, if you tried to drop a truck engine into a car-centric drive train, the results would be sub-optimal. Using 4MB IO and running in bandwidth-centric mode is like putting a truck engine in a car drive train. I can step you through cross-coordinating the respective gear ratios.

    The second analogy is that the Linux IO stack is like a Japanese Pachinko pinball machine. A large IO comes in the top of the machine and flows down to the bottom, being deflected by pegs in the way. As you discovered, the standard Linux IO stack has many pegs in the way to disturb the flow, and there are some pegs with minimal spacing equivalent to 512kb. We need to either remove these pegs and/or use a method to work around them. The challenge is that there are multiple pegs in the path at various heights, and any one of them can disrupt the large IO. Fixing one is not enough. You have to create a clear channel all the way through.

    I apologize; I don't have hands-on experience with SUSE or the SAS controllers. I can identify what the issues are (the peg or gear ratio needed), and hopefully you can translate it to the proper syntax.

    First, I assume that you are running 64-bit Linux. Trying to do 4MB IO in 32-bit Linux is possible, but often not worth the effort. Too many "pegs" to remove.

    OK. Let's start from the top and work down.

    mmdiag --iohist will give you a list of the last 512 IOs from GPFS. If GPFS is properly configured for a 4 MB block size, you should see 8192 sector "data" IO being logged when you do a simple file copy.

    Based on you having only 2 SAS paths per DCS3700 for a given server, you have only one active / one passive path per node. With only one active path per node, you can use iostat to monitor disk performance without needing to aggregate multiple active paths' worth of statistics.

    If you have a data-only NSD (and corresponding LUN) you can monitor your average IO size using 'iostat'.

    When you run your copy test, it appears that you see an average IO size of 512kb from iostat. Is this from the "sdxx" path, or the "dm-xx" disk entry? The statistics of the "dm-xx" multipath pseudo disks are known to be inaccurate. They need to be ignored.

    There are two major versions of DM-Multipath out there. The original version was "BIO" based, and had challenges doing IO larger than 512kb (but it could be done). The newer version is called the "request-based" DM-Multipath. With the request-based Multipath, it is much more straightforward (but still tricky) to do 4 MB IO.

    On RHEL, RHEL 5.x has BIO-based multipath, and RHEL 6.x offers request based. I don't know which SUSE revs have request-based multipath.

    BIO-based multipath is version 0.4.7.x. Request-based is 0.4.9.x. The command "multipath -h 2>&1 | head -n 1" will show you the version of multipath-tools, which corresponds to the version of multipath.

    If you are not on a version of Linux that supports request-based multipath, I would highly recommend that you look into upgrading. Doing large IO is so much more straightforward and efficient.

    OK ... two cases, BIO-based and request-based.

    I did a quick check on the SUSE web site, and SLES 11 SP2 is the most recent version, so I am going to assume that it includes the better request-based multipath (like RHEL 6.x does).

    With request-based multipath, the dm-xx multipath pseudo disks are fully functional "block" devices, and have block-layer parameters whose defaults are NOT appropriate for GPFS usage. In general, we want the Linux block layer to "get out of the way", and let GPFS manage the IO queues and read ahead.

    For GPFS LUNs I would recommend:

    /sys/block/dm-xx/queue/scheduler -> recommend "noop"
    /sys/block/dm-xx/queue/max_sectors_kb -> 4096 or greater
    /sys/block/dm-xx/queue/read_ahead_kb -> 0 ... Let GPFS do the read ahead

    You probably are familiar with making these changes for the "sdxx" disk entries. With request-based multipath, you need to ALSO perform these optimizations at the "dm-xx" level.

    Note, these settings are NOT persistent, and the dm-xx entries are re-created each time you start/stop multipathd, or reload the multipath.conf file. Therefore, put these settings in a script, and run the script whenever you start or change multipathd.
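A minimal sketch of such a script, assuming the three queue parameters listed above and a device name passed in by the caller. The function name and the sysfs_root parameter are illustrative assumptions (the latter only makes the function testable against a scratch directory); writing to the real /sys/block requires root.

```python
from pathlib import Path

# Values recommended in the post for GPFS LUNs.
GPFS_QUEUE_SETTINGS = {
    "scheduler": "noop",
    "max_sectors_kb": "4096",
    "read_ahead_kb": "0",
}


def apply_queue_settings(device, settings=GPFS_QUEUE_SETTINGS, sysfs_root="/sys/block"):
    """Write each setting to <sysfs_root>/<device>/queue/<name>.

    Returns a dict of the values written. Raises OSError if a write fails,
    e.g. when not running as root or when the device does not exist.
    """
    queue_dir = Path(sysfs_root) / device / "queue"
    written = {}
    for name, value in settings.items():
        (queue_dir / name).write_text(value)
        written[name] = value
    return written


# Example (hypothetical device names; run as root, for dm-xx AND sdxx entries):
#   for dev in ("dm-3", "sdc"):
#       apply_queue_settings(dev)
```

Since the dm-xx entries are re-created on every multipathd restart or reload, a script like this would have to be re-run each time.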

    The next layer of parameters is within multipath itself. With only a single active path, the values of rr_min_io (or rr_min_io_rq), and rr_weight don't matter.

    If you have 2 or more active paths, you should set rr_weight to "uniform", and rr_min_io (or rr_min_io_rq if supported) to 1. This will round-robin the 4MB IOs every single IO. The defaults are grossly sub-optimal for 2 or more active paths. The typical default DCS3700 multipath.conf entry has rr_min_io set to 100, and the rr_weight set to "priority". This sets the round robin interval to (rr_min_io * effective path priority). The rdac path checker module assigns "6" to the active paths, and "1" to the passive paths. The result of the defaults is that the round robin interval is (6 * 100) = 600. With a 4 MB IO taking about 16.6 msec, 10-seconds' worth of IO will be directed down a single path before switching to the second (active) path. For a single LUN, GPFS would need to queue 600-deep to get two paths operating concurrently. Won't happen. Using rr_min_io set to 1 and rr_weight set to uniform, the round robin interval is 1, and GPFS only needs to read ahead 1-deep to use 2 paths.
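The round-robin arithmetic above can be written out as a short sketch. The function name is an illustrative assumption; the inputs (rr_min_io of 100, path priority of 6, about 16.6 msec per 4 MB IO) are the figures quoted in the post, and the "uniform" case is modelled simply as a priority weight of 1.

```python
def round_robin_interval(rr_min_io, path_priority, io_time_ms):
    """Return (IOs sent down one path before switching, seconds that takes)."""
    ios_per_path = rr_min_io * path_priority
    seconds_per_path = ios_per_path * io_time_ms / 1000.0
    return ios_per_path, seconds_per_path


# Defaults described above: rr_min_io=100, rr_weight "priority" with weight 6.
ios, secs = round_robin_interval(100, 6, 16.6)
print(ios, round(secs, 1))   # 600 IOs, roughly 10 seconds

# Suggested settings: rr_min_io=1, rr_weight "uniform" (weight taken as 1).
ios, secs = round_robin_interval(1, 1, 16.6)
print(ios, secs)             # 1 IO per path before switching
```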

    Below multipath, you have the block layer (again) with the sdxx disk entries.

    The block parameters in /sys/block/sdxx/queue need to be set just like the dm-xx entries.

    Now ... you have removed all the "pegs" in the middle of the center of the Pachinko machine ... so far. Now you are at the top of the SCSI stack and the driver, the SAS driver in your case.

    Before we go further, you should check the
    /sys/block/dm-xx/queue/max_segments -> typically 128
    /sys/block/dm-xx/queue/max_segment_size -> typically 65536

    This says that multipath can handle an IO that is 128 X 64kb = 8 MB in size

    /sys/block/sdxx/queue/max_segments -> typically 256
    /sys/block/sdxx/queue/max_segment_size -> typically 65536

    This says that the low-level "sd" block device can handle an IO that is 256 x 64kb = 16 MB in size.
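The same product can be read straight from sysfs. This is a sketch only: it assumes the two files live under /sys/block/<device>/queue, the function name and the sysfs_root parameter are illustrative, and the result is an upper bound that holds only when every segment is filled to max_segment_size.

```python
from pathlib import Path


def max_io_bytes(device, sysfs_root="/sys/block"):
    """Upper bound on one IO: max_segments * max_segment_size, in bytes."""
    queue_dir = Path(sysfs_root) / device / "queue"
    max_segments = int((queue_dir / "max_segments").read_text())
    max_segment_size = int((queue_dir / "max_segment_size").read_text())
    return max_segments * max_segment_size


# Example (hypothetical device name):
#   print(max_io_bytes("dm-3") // (1024 * 1024), "MiB")
```

With the typical values quoted above, 128 segments of 65536 bytes give 8 MiB and 256 segments give 16 MiB.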

    Oh ... as soon as you find a constricting value, correct it and re-run your copy test, monitoring it with iostat. Did the average IO size increase to 4MB? If yes, you are done. If no, you need to continue to a lower level.

    The next layer is the top of the SCSI stack and SAS or FC driver.

    Go to /sys/class/scsi_host/{hostn}

    You have 6 active scsi_host entries for the 6 SAS controllers, and probably another for the controller used for the boot disk.

    Find the scsi_host entry for the SAS controller(s).

    What is the value of sg_tablesize? It needs to be 256 or greater to reliably do 4MB IO. sg_tablesize is the driver-level scatter/gather table size. This is likely a read-only entry. If it can be changed, there will be a driver-specific parameter that is reflected as sg_tablesize when the controller is initialized. To change this requires a modprobe.conf entry, rebuilding the kernel, and a reboot.

    For Emulex FC controllers, the parameter is lpfc_sg_seg_cnt. It defaults to 128, and must be increased to 256 or greater to enable 4MB IO. This is done via a modprobe.conf entry specific to the lpfc driver.

    For Qlogic FC controllers, recent drivers have a default sg_tablesize of 1024, so it should not be a problem ...I am told. I have no direct Qlogic experience.

    If sg_tablesize is less than 256, you will need to get into the LSI SAS driver specifics to determine what LSI driver parameter controls sg_tablesize. I do not know the specifics, but 4MB SAS IO is possible. IBM has published GPFS benchmark results using SAS controllers and 4 MB block sizes.
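To see all sg_tablesize values at once, something like the following sketch could be used. It assumes the usual /sys/class/scsi_host/hostN/sg_tablesize layout; the function names, the scsi_host_root parameter, and the 256 threshold (taken from the text above) are illustrative choices.

```python
from pathlib import Path


def read_sg_tablesizes(scsi_host_root="/sys/class/scsi_host"):
    """Return {host name: sg_tablesize} for every SCSI host that exposes it."""
    sizes = {}
    for entry in sorted(Path(scsi_host_root).glob("host*/sg_tablesize")):
        sizes[entry.parent.name] = int(entry.read_text())
    return sizes


def hosts_below(sizes, minimum=256):
    """Names of hosts whose sg_tablesize is smaller than the given minimum."""
    return [host for host, value in sizes.items() if value < minimum]


# Example:
#   sizes = read_sg_tablesizes()
#   print(sizes, hosts_below(sizes))
```

This only reports the values; which driver parameter (if any) raises them depends on the HBA driver in use.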

    I need to read the other questions you asked in more detail before I can respond.

    Dave B
    • chr78
      132 Posts

      Re: GPFS Blocksize, Storage segment size, IO size, Performance

      ‏2012-09-22T16:21:17Z  in response to db808
      nice post, Dave !

      I'd only disagree with

      First, realize that with only 2 x 6GBit SAS connections per DCS3700, you will never go faster than 1200 MB/sec read or 1200 MB/sec write from a single NSD node You might be able to get to 1600 MB/sec in mixed read/write full-duplex mode. Only when both NSD nodes are concurrently accessing the same DCS3700 will you consistently get more than 1200 MB/sec from a single DCS3700.

      At least in my DS35xx/DCS3700 environments I see numbers twice as high. IMHO, this is simply due to the fact that all these 6Gb SAS ports are implemented with two lanes - i.e.
      each 6 Gb port actually peaks at 12Gb (the DS/DCS internally uses 4 lanes, thus getting a theoretical peak perf of more than 4000GB/s)

      • chr78
        132 Posts

        Re: GPFS Blocksize, Storage segment size, IO size, Performance

        ‏2012-09-22T16:22:48Z  in response to chr78
        4000MB/s ...
    • ufa
      122 Posts

      Re: GPFS Blocksize, Storage segment size, IO size, Performance

      ‏2012-09-24T11:35:33Z  in response to db808
      Hello, db808,
      I am grateful for your elaborate description. That was about what I was looking for. So far it was clear to me that the data are passed down layer to layer to the final disk, but especially in the Linux part I was always unsure how many stages are indeed involved (and hence, what bottleneck in terms of IO size might be there preventing other knobs from becoming effective).
      So, one of the keywords in your text for me was "ALSO" :-) - I had set the queue parameters for dm-* before, but hadn't thought of setting them for the "real" sd* devices. That of course explains the crashes I've seen; now that I set them all to 4096, things are running smoothly. But not so WRT the IO size seen by the storage.
      The IO size arriving at the storage I derive from the monitoring output produced by SMcli ("save storageSubsystem performanceStats"). I always get a ratio of IO throughput (MB/s) to IOps pointing to 512KB/IO. Interestingly, sometimes that ratio appears to exceed 512KB slightly (10% or so). That puzzles me - if there is a hard limitation upstream, there should not be the slightest excess -- or can that be due to time discretisation of the measurement?

      As you said, we have only one active path for each LUN on each server, hence path rotation is not an issue here (rr_*).

      max_segments is at 128 and max_segment_size at 64K for all devices, dm-* and sd*. This confuses me a bit, as I had understood from other explanations that max_segments determines, together with the memory page size (4KB), what goes into a BIO (i.e., 128x4KB=512KB), and that BIOs in turn might be put together into one IO (size restricted by max_sectors_kb). Your words sound slightly different and at that point I am lost as to what is going on.

      Going further down the pipe (not to say the drain :-), the sg_tablesize is a mere 128 and the sysfs file is ro. Looking into the mpt2sas sources I see
      static int max_sgl_entries = -1;
      module_param(max_sgl_entries, int, 0);
      MODULE_PARM_DESC(max_sgl_entries, " max sg entries ");
      so it looks like max_sgl_entries could be our candidate - but I do not see any default setting.

      Just finding
      if (max_sgl_entries != -1)
          sg_tablesize = max_sgl_entries;
      else
          sg_tablesize = MPT2SAS_SG_DEPTH;

      if (sg_tablesize < MPT2SAS_MIN_PHYS_SEGMENTS)
          sg_tablesize = MPT2SAS_MIN_PHYS_SEGMENTS;
      else if (sg_tablesize > MPT2SAS_MAX_PHYS_SEGMENTS)
          sg_tablesize = MPT2SAS_MAX_PHYS_SEGMENTS;
      ioc->shost->sg_tablesize = sg_tablesize;

      Setting max_sgl_entries=256 in /etc/modprobe.d/mpt2sas.conf
      and rebooting did not change the sg_tablesize for the two SAS cards.
      You wrote I need to rebuild the kernel but why should submitting a supported parameter in modules.conf require that?

      Changes I did so far :
      increase max_sectors_kb to 4096 for the dm-* and the sd* devices (where I just did it for dm-* before).
      enter a module parameter for mpt2sas of max_sgl_entries=256 but that might be done wrong, at least it has no effect on sg_tablesize so far.

      And, you're right, the nominal speed of 6Gbps of SAS translates to 0.75 GB/s per SAS link. however, each of the DCS controllers has 4 links (two to each NSD server), the aggregated maximum follows to be 3GB/s per Box, just from the nominal SAS speed.
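The nominal figures above follow from a simple conversion. This sketch divides the line rate by 8 bits per byte and multiplies by the number of links; the function name is an illustrative assumption, and line-encoding and protocol overhead are deliberately ignored, so real payload rates will be lower.

```python
def nominal_gbyte_per_s(link_gbit_s, links=1):
    """Nominal bandwidth in GB/s: line rate / 8 bits per byte * link count."""
    return link_gbit_s / 8.0 * links


print(nominal_gbyte_per_s(6))     # 0.75
print(nominal_gbyte_per_s(6, 4))  # 3.0
```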

      The high-level GPFS blocksize is maintained, mmdiag shows 8192-sector IOs

      iostat still tends to show IO sizes of 1024KB or just a bit more (for dm-* and sd* devices), rarely one sees substantially bigger IOs (up to 2MB) but this is just for one readout period (used 3 secs).

      I will go on trying to find out about the mpt2sas settings and details - if anybody happens to know them, please shed some light.

      • chr78
        132 Posts

        Re: GPFS Blocksize, Storage segment size, IO size, Performance

        ‏2012-09-24T15:04:27Z  in response to ufa
        Setting max_sgl_entries=256 in /etc/modprobe.d/mpt2sas.conf and rebooting did not change the sg_tablesize for the two SAS cards.

        mpt2sas might be included in the initrd - please make sure to run mkinitrd after changing things in
        modprobe.d - and, be careful, mpt2sas might not load with wrong (too high?) values and your system might not
        boot anymore (mpt2sas is often used for internal disks as well): keep a valid initrd around ...

      • bhartner
        58 Posts

        Re: GPFS Blocksize, Storage segment size, IO size, Performance

        ‏2012-09-24T17:05:20Z  in response to ufa
        May be limited to 128:

        #if CONFIG_SCSI_MPT2SAS_MAX_SGE < 16
        #define MPT2SAS_SG_DEPTH 16
        #elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 128
        #define MPT2SAS_SG_DEPTH 128
        #define MPT2SAS_SG_DEPTH 128 /* MAX_HW_SEGMENTS */

        Did you disable preReadRedundancyCheck on the DCS3700? That may be impacting read performance.
        • ufa
          122 Posts

          Re: GPFS Blocksize, Storage segment size, IO size, Performance

          ‏2012-09-24T18:19:10Z  in response to bhartner
          pre-Read Redundancy Check is and has been off.

          When grepping through the sources I got of the mpt2sas, e.g. for "MPT2SAS_SG_DEPTH", I do not find any settings and upper limits. What I see instead is, in Kconfig:

          config SCSI_MPT2SAS_MAX_SGE
          int "LSI MPT Fusion Max number of SG Entries (16 - 256)"
          depends on PCI && SCSI && SCSI_MPT2SAS
          default "128"
          range 16 256
          This option allows you to specify the maximum number of scatter-
          gather entries per I/O. The driver default is 128, which matches
          MAX_PHYS_SEGMENTS in most kernels. However in SuSE kernels this
          can be 256. However, it may decreased down to 16. Decreasing this
          parameter will reduce memory requirements on a per controller instance.

          and this setting is digested as (no upper limit though), see mpt2sas_base.h :
          * Set MPT2SAS_SG_DEPTH value based on user input.
          #if (LINUX_VERSION_CODE < KERNEL_VERSION(2,6,25))
          #define MPT2SAS_MIN_PHYS_SEGMENTS 16
          #ifdef CONFIG_SCSI_MPT2SAS_MAX_SGE

          So it looks the module at least needs to be built anew ...
          But: I do not know whether I'd need to build with amended settings for MPT2SAS_SG_DEPTH (and whether setting this to 256 would be valid). And: as we run kernel 3.0.x, the above clause would not match.
          As the driver load is reported in dmesg, I think I do not need to build a new initrd:
          io1r01s25-nsd:~/ufa/mpt2sas/lsi-mpt2sas- # dmesg | grep mpt2
          [    1.927022] mpt2sas version loaded
          [    1.927363] mpt2sas 0000:1b:00.0: PCI INT A -> GSI 40 (level, low) -> IRQ 40
          .. but it doesn't care about modprobe.conf:
          [    2.399663] mpt2sas0: Scatter Gather Elements per IO(128)

          So, still puzzled.

          • ufa
            122 Posts

            Re: GPFS Blocksize, Storage segment size, IO size, Performance

            ‏2012-09-24T18:29:00Z  in response to ufa
            ah, sorry, "else" clause applies, so we have

            #ifdef CONFIG_SCSI_MPT2SAS_MAX_SGE

            The latter applies if we define CONFIG_SCSI_MPT2SAS_MAX_SGE, and we might leave MPT2SAS_MAX_PHYS_SEGMENTS / SCSI_MAX_SG_SEGMENTS alone, probably.

            Was blind before.

            • db808
              86 Posts
              ACCEPTED ANSWER

              Re: GPFS Blocksize, Storage segment size, IO size, Performance

              ‏2012-09-24T19:57:34Z  in response to ufa
              OK. I'm back. Sorry for the delay.

              Thanks to everyone who has contributed. As I mentioned in my posting, I have no direct experience with the SAS-specific host-side IO stack, and the SAS-specific operation of the DCS3700.

              chr78 was able to provide some useful SAS-specific knowledge that a single SAS cable (the fat Infiniband-like cables) provides ~ 12Gbit's worth of bandwidth in her configuration. She suggests that the SAS link is operating as a 2-lane 6-gbit SAS channel. It could also be operating as a quad-lane 3-gbit SAS channel, but at this point it does not matter.

              ufa ... I think you are going down the wrong path with mpt2sas driver changes ... at this time. There were some red flags I saw in your comments higher up the IO stack. I want to back track, and also correct some driver-level suppositions that are likely to be incorrect (due to the way that the DCS3700 operates).

              Also, although it is poor forum etiquette, I offer to ufa to contact me directly by email to exchange phone numbers and continue with an interactive conversation (time zones notwithstanding). I pledge that I will press ufa to post a summary of the resulting troubleshooting session, and the final result. If we should get collectively stuck, ufa will update the posting with a summary of where he is, asking for additional insight. When a solution is found, it will be posted to make the topic discoverable to the forum community.

              There is one top issue, two clarifications, and one piece of useful "hidden" information that I would like to discuss.

              I'm putting together the information now... but I wanted you to know.

              1) ufa said that the average IO size as measured by iostat on the sdxx device entries was inconsistent, and NOT 4MB. This is a problem. Stop. If you can't measure large IO at the sdxx device level, the lower layers will just make it worse.

              to be continued .... shortly.

              Dave B
              • db808
                86 Posts
                ACCEPTED ANSWER

                Re: GPFS Blocksize, Storage segment size, IO size, Performance

                ‏2012-09-24T23:41:06Z  in response to db808
                There is one top issue, two clarifications, and one piece of useful "hidden" information that I would like to discuss.

                1) Top issue - ufa said that the average IO size as measured by iostat on the sdxx device entries was inconsistent, and NOT 4MB. This is a problem.

                Stop. If you can't measure large IO at the sdxx device level, the lower layers will just make it worse.

                Question for the community using SAS and a GPFS blocksize of 4MB .... can you see 4MB IO to the "sdxx" disk device when running iostat? If so, what is the sg_tablesize value?

                For an Emulex fiber channel HBA connected to a DCS3700 with FC ports, the 4MB IO can be seen.
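If you want to cross-check what iostat reports, the average request size can also be derived from the kernel's per-device counters. Below is a minimal Python sketch, not a polished tool. It assumes the usual /sys/block/<dev>/stat field order (read IOs, read merges, sectors read, read ticks, write IOs, write merges, sectors written, ...) and that sectors are counted in 512-byte units. The function names are my own, and the device name you pass in is whatever your sdxx entry is.

```python
SECTOR_BYTES = 512  # the stat file counts sectors in 512-byte units


def parse_stat(line):
    """Return (read_ios, read_sectors, write_ios, write_sectors) from one stat line."""
    fields = [int(x) for x in line.split()]
    return fields[0], fields[2], fields[4], fields[6]


def avg_io_kib(before, after):
    """Average read and write request size in KiB between two samples."""
    r_ios = after[0] - before[0]
    r_sec = after[1] - before[1]
    w_ios = after[2] - before[2]
    w_sec = after[3] - before[3]
    read_kib = r_sec * SECTOR_BYTES / 1024 / r_ios if r_ios else 0.0
    write_kib = w_sec * SECTOR_BYTES / 1024 / w_ios if w_ios else 0.0
    return read_kib, write_kib


def sample(dev):
    """Read the current counters for a block device such as 'sdb' (placeholder name)."""
    with open(f"/sys/block/{dev}/stat") as fh:
        return parse_stat(fh.read())
```

Take one sample(), wait a few seconds under load, take a second one and pass both to avg_io_kib(). A result near 4096 KiB means 4MB IO is reaching the sdxx device; a result near 512 KiB means it is being cut up above that level.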

                I do think we need to properly increase the max_sgl_entries in /etc/modprobe.d/mpt2sas.conf. More on this in part 3.

                2) Clarification. Scatter/gather list terminology is INCONSISTENT up and down the IO stack within Linux, and also has an analogous restriction on the storage side.

                Unfortunately we have to "live with it". This is part of the Japanese Pachinko pinball machine analogy. Not all layers in the IO stack are equally competent and/or capable of handling a scatter/gather list with the maximum number of entries ... and "leaf" entries that are of the maximum size.

                For lots of historical and efficiency reasons, IO on contemporary systems is NOT completely formatted and laid out in a physically contiguous area of memory. To assemble the IO, and add all the frame headers and trailers, checksums, etc. in-line would require too many memory-to-memory copies. The other issue is that in a virtual-memory, paging system, memory buffers that are logically contiguous in user-space, may not be contiguous in physical memory. There may be IO-space memory mapping address translation hardware (IOMMU), and in 64-bit systems with fully-capable 64-bit IO interfaces, direct addressing of main memory by the IO interface may be possible and still maintain appropriate security.

                So, in summary, the logical IO is presented as a list of "items", that are on-the-fly concatenated when the IO is performed. This list is often called a scatter/gather list.

                Depending on where you are in the IO stack, what is considered an "item" in the scatter/gather list, and how the scatter/gather list is internally structured ARE DIFFERENT ... but the terms used in documentation (assuming you can find some) are often similar or even the same ... resulting in confusion. At the higher levels, scatter/gather lists contain only the data "payload". Further down the list, the scatter/gather lists may also contain framing headers and trailers, and blank spaces where some lower level offload facility may deposit checksums, lengths, flags, etc.

                So when talking about scatter/gather lists, we need to try to clarify when possible.

                Also, we need to recognize that the Linux IO stack itself is abstracted as a list of function pointers working on structures (or objects). These objects or structures have maximum limits, but an individual function may not accept or allow the same dimensionality.

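                For example, a 4 MB IO could be represented as a 1-entry scatter/gather list with a "leaf" item of 4MB of contiguous memory. This is doable if the buffer is physically contiguous, for example when it sits inside a large enough Linux "huge page" (note that the common x86_64 huge page size is 2MB, which would give 2 leaf items of 2MB rather than 1).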

                The problem with this is that when you pass such a structure to the next layer in the stack, you are assuming that this layer can handle a 4MB leaf item. "Old" routines may only be able to handle "leaf" items that are 4kbytes in size; they decompose the 1 x 4MB into a list of 1024 x 4kbytes ... and now have a problem with a list of 1024 entries. Without going into a lot of history, the base level of functionality assumes a list of 128 x 4kb = 512kb. If the IO request is larger than this, the IO is serialized into multiple 512kb (128 x 4kb) sub-IOs. Other than the overhead involved, this is not a problem as long as the bottom of the IO stack can coalesce the IO back together ... and is motivated to do so.

                The next generation of legacy baselines was based on 256 x 4kb = 1MB lists. If you see a 1MB ceiling somewhere it is probably touching one of these code paths.

                The parameters at the /sys/block/sdxx/queue/max_segments -like level are negotiable between the two layers that are interacting.

                The next level down is the top of the driver.

                In the fibre channel world, GPFS can issue a kernel-level IO (with kernel-allocated buffers with less physical fragmentation) of 4MB in size and have it work with a FC-layer driver with a sg_tablesize of 256. When you do the math, this comes to an average of 16 kbytes per leaf item * 256 = 4MB. If you set the sg_tablesize to 255, you can't do 4MB IO ... on fibre channel. In Linux kernel space, the default slab memory allocator will allocate memory in chunks that are 32kb contiguous at a minimum.
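To make the arithmetic above easy to replay, here is a small Python sketch. It only multiplies list entries by leaf size; the 4 KiB and 16 KiB leaf sizes are the figures quoted above, and the function names are made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB


def max_io_bytes(sg_entries, leaf_bytes):
    """Largest single IO that a list of sg_entries leaf items can describe."""
    return sg_entries * leaf_bytes


def sub_ios(request_bytes, sg_entries, leaf_bytes):
    """Number of serialized sub-IOs needed for one request (ceiling division)."""
    limit = max_io_bytes(sg_entries, leaf_bytes)
    return -(-request_bytes // limit)


# Legacy baselines with 4 KiB leaf items
print(max_io_bytes(128, 4 * KIB) // KIB, "KiB")   # 128 x 4 KiB
print(max_io_bytes(256, 4 * KIB) // KIB, "KiB")   # 256 x 4 KiB
# FC case with an average 16 KiB leaf item
print(max_io_bytes(256, 16 * KIB) // MIB, "MiB")  # 256 x 16 KiB
print(sub_ios(4 * MIB, 128, 16 * KIB), "sub-IOs for 4 MiB at sg_tablesize 128")
```

With 255 entries at 16 KiB the limit falls just short of 4 MiB, which is why a single-entry difference in sg_tablesize matters.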

                3) Clarification ... you are NOT modifying the mpt2sas parameter properly. The modprobe.conf -type entries need to be read from the RAMDISK image during the boot process, so you need to rebuild the kernel image (which includes the ramdisk file system) whenever you want to use module parameters that need to take effect early in the boot process. For RHEL 6.x, the procedure uses the "dracut" command. In past versions, an "mkinitrd" command was used. I don't know what the SUSE syntax is.

                Without posting the /etc/modprobe.conf -style changes into the boot ram disk, the boot process can not "see" the changes, and they don't take effect.

                As a test .... DECREASE the mpt2sas driver segment count parameter, and see if the /sys/class/scsi_host/{hostn}/sg_tablesize parameter decreases. If so, then you are using the proper procedures (modify modprobe.conf, rebuild ramdisk image, reboot).

                Once you can properly DECREASE the mpt2sas driver segment count parameter (and the corresponding sg_tablesize), then attempt to increase it. You can do a binary search -style exploration to home in on what the maximum is.

                Try 1024 .... does it work. If not, there is probably an error message in the system log or dmesg. You may need to increase the logging verbosity for the mpt2sas driver.

                If 1024 does not work, try 512 and so forth. Don't be surprised if one value works, and the value one higher does not. You have just identified some internal threshold.

                If you can get a value of 256 to work, you "should" be able to do 4MB IO ... using FC as an example. If the mpt2sas driver wants to be competitive, it should allow a setting of 1024 ... as the Emulex and Qlogic drivers do for their FC and FCOE drivers.
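After each reboot you need to confirm what the driver actually accepted. A minimal Python sketch for that check follows. It reads the /sys/class/scsi_host/{hostn}/sg_tablesize attribute mentioned above for every SCSI host; the root parameter exists only so the function can be pointed at a test directory, and the function name is my own.

```python
import glob
import os


def read_sg_tablesize(root="/sys/class/scsi_host"):
    """Map each SCSI host name (e.g. 'host0') to its sg_tablesize value."""
    result = {}
    for path in sorted(glob.glob(os.path.join(root, "host*", "sg_tablesize"))):
        host = os.path.basename(os.path.dirname(path))
        with open(path) as fh:
            result[host] = int(fh.read().strip())
    return result
```

Record the output before changing the module parameter and again after the ramdisk rebuild and reboot. If the values do not move when you decrease the parameter, the new setting is not being picked up.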

                Using a value that is larger than needed does slightly increase the amount of driver memory allocated for the IO control blocks. Many tiny systems use SAS and SATA interfaces from LSI, and LSI allows those users to decrease the segment count parameter all the way down to 16 to conserve memory on a 16-bit or 32-bit system. LSI's SAS driver legacy (and expected gear ratios for the drive train) are on the smaller side.

                4) Hidden information.

                On the fiber channel side, the DCS3700 can not actually accept a monolithic fibre channel transfer of 4 MB. The DCS3700's "SCSI burst size" is only 2MB, and is negotiated during the initial SCSI session establishment handshake. The 4MB host request in the FC driver gets sent as 2 x 2MB SCSI-over-FC transfers, and re-assembled in the DCS3700 controller. I don't know if a similar thing happens on the SAS side, but I would not be surprised. As an aside, the IBM DCS9900 / DDN S2A9900 advertises the willingness to accept a 128 MB SCSI burst size on the FC interfaces.

                Even though the FC cable transfer is 2 x 2MB, the DCS3700 properly handles the request as a 4MB logical request and performs what is expected ... mostly. The front-end performance statistics are slightly distorted. The 2*2MB FC transfers are metered as 2, 2MB transfers, the second of which is always 100% found in cache ... even if the 4MB IO was "random". This is a distortion. If you run a 100% random IO benchmark using 4MB GPFS blocksize, with "scatter" allocation and a FC connection, the DCS3700 statistics will say there are twice as many 2MB transfers, and a 50% cache hit rate. It is unfortunate, but that is the way it is.
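The distortion is easy to reproduce on paper. The Python sketch below is only a model of the accounting described above, not of the controller firmware: it assumes every piece of a host IO after the first is counted as a cache hit and the first piece always misses (100% random IO). The function name and defaults are illustrative.

```python
def front_end_view(host_ios, host_io_mb=4, burst_mb=2):
    """Return (metered_transfers, apparent_cache_hit_rate) for random host IO."""
    pieces = -(-host_io_mb // burst_mb)  # ceiling division: transfers per host IO
    transfers = host_ios * pieces
    hits = host_ios * (pieces - 1)       # all pieces after the first count as hits
    rate = hits / transfers if transfers else 0.0
    return transfers, rate


print(front_end_view(1000))  # 1000 random 4MB host IOs sent as 2MB bursts
```

So when you read the DCS3700 front-end counters, halve the transfer count and ignore the reported hit rate before comparing them with iostat on the host.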

                ufa ... please let us know how it goes ...

                Dave B.
                • db808
                  86 Posts
                  ACCEPTED ANSWER

                  Re: GPFS Blocksize, Storage segment size, IO size, Performance

                  ‏2012-09-24T23:48:53Z  in response to db808
                  One more quick thing.

                  The DCS3700 can NOT do a 4MB host IO, resulting in 8 x 512kb segments if the cache memory page size is small ... like 4kb. This says that the DCS3700-level scatter/gather list can not handle 128 * 4kb buffers to perform a 512kb IO to the segment.

                  512kb IO to the segment DOES work if the cache memory page size is 32kb. This implies that a scatter/gather list of 16 x 32kb is OK. I don't know what the actual capability is, I only tested the two extremes.

                  So, since you want large IO, set the DCS3700 cache page size to 32kb.
          • chr78
            132 Posts

            Re: GPFS Blocksize, Storage segment size, IO size, Performance

            ‏2012-09-24T20:13:18Z  in response to ufa
            As the driver load is reported in dmesg, I think I do not need to build a new initrd

            you'll have to - according to the timestamps you provided, mpt2sas is part of your initrd.
            (it shows up ~2s after kernel launch)

            modprobe.conf settings have to be read from the initrd as well (there's no access to real root at that stage)

            • ufa
              122 Posts
              ACCEPTED ANSWER

              Re: GPFS Blocksize, Storage segment size, IO size, Performance

              ‏2012-09-26T16:24:11Z  in response to chr78
              Hi, just an intermediate update:
              We found another issue today which had slipped our mind before:
              It appeared that the DCS controller attached to the first SAS HBA was always slower than the other one (verified by reassigning disks and switching cables).
              Reading with dd using a 1MB blocksize from the LUNs yielded about 370..400MB/s per LUN for the fast and about 200..250MB/s per LUN for the slow controller/HBA.

              From previous experience with Sandy Bridge architectures we knew that there might be some adverse effect on the PCI behaviour from C state transitions. I implemented the following BIOS/UEFI settings :

              set Processors.ProcessorPerformanceStates Disable
              set Processors.C-States Disable
              set Processors.C1EnhancedMode Disable

              This brought both HBA/controller chains to perform at the 370..400 MB/s/LUN level.

              By setting max_sgl_entries in modprobe.conf, I could reduce sg_tablesize to any value below 128, but could not get it above 128. It might be possible to reach that by rebuilding the module, but chr78 told me he tried that and didn't succeed.

              • db808
                86 Posts
                ACCEPTED ANSWER

                Re: GPFS Blocksize, Storage segment size, IO size, Performance

                ‏2012-09-26T19:23:54Z  in response to ufa
                Thanks for the update. Seems like you had some hardware issues. Good catches.

                On the mpt2sas driver's max_sgl_entries parameter not going above 128: I see that as a non-trivial issue that will restrict 4MB host IO (and 8MB if you want to go even larger).

                No other community member has posted that they can validate 4MB IO via iostat and are running SAS.

                I would suggest double checking if there are any newer mpt2sas driver versions that indicate that "max_sgl_entries" can increase to 256. I would also explore the LSI SAS controller BIOS menus to see if there are any BIOS-level SAS firmware settings related to the maximum IO size. One of the source files that you referred to earlier in the posting seems to indicate that the range of values was 16 - 256. What version of the driver was that, and are you running that version or newer?

                When you try to increase the max_sgl_entries beyond 128, you should be getting an error message sent to the syslog file or dmesg. Are you seeing such a message?

                You may want to selectively increase the verbosity level of logging from the mpt2sas driver to get a more descriptive reason why the parameter can not be increased beyond 128. There may be some other indirect resource (like memory for increase-sized control blocks) that is preventing the increase of the value.

                Since you are IBM staff .... perhaps you should reach out to Ray Paden, the author of many of the GPFS "best practices" documents, training, and tutorials. In the DCS3700-specific document that I have from Ray, he documents the configuration and many of the settings used to obtain the reference performance levels. Unfortunately, the document that I have does not mention the max_sgl_entries parameter, nor the validation of the IO size using iostat. However, based on his performance numbers, he appears to be doing 4 MB IO.

                Personally, I have found Ray Paden's presentations very useful as a baseline and a sanity check. From the baseline that Ray illustrates, you may be able to further improve performance or efficiency by addressing additional areas contributing to congestion or overhead specific to your topology.

                I would like to step back for a second, and make a comment on PRRC. I understand the usefulness of disabling it as you are troubleshooting, but please be aware that without PRRC enabled, the DCS3700 is susceptible to "silent data corruption".

                You also mentioned that you were using some DDN SFA 12K series storage, often found in HPC environments. IBM even previously resold the older DDN S2A9900 as the DCS9900. These DDN systems are well known for their prowess in straightforwardly addressing "silent data corruption" with minimal impact to their stellar streaming performance.

                The DCS3700 by default does NOT detect silent data corruption, and as such, your data is "at risk", even when running in RAID6 mode. The only method to avoid silent data corruption is to enable PRRC, Pre-Read-Redundancy-Check. This forces the DCS3700 to read all 10 disks in the RAID6 group, and compute and validate parity, on EVERY read. At a minimum, this slows performance about 20% (for 60 disks) to ~ 1600 MB/sec, and kills small IO performance (for metadata). The extra parity computation on every read also causes the controller pair to bottleneck at ~ 1800 MB/sec.

                The recently announced "Performance Controllers" for the DCS3700, along with the new firmware, support T10 PI (520-byte sectors), which stores an additional checksum with the data and checks that checksum on each read. The new "Performance Controllers" have the additional parity generation performance to check parity on every read without significant performance penalties.

                When/if the 4MB IO enablement issue on the Linux IO stack is resolved, you still have questions about the physical layout of the DCS3700 storage. Ray Paden's best practice document for the DCS3700 is a good starting baseline.

                You mentioned that you are using a 8x512kb segment = 4MB hardware stripe ... the maximum allowed. This is good for 4 MB IO, but poor for smaller IO. Decreasing the stripe size to 8x256kb = 2MB hardware stripe results in the DCS3700 decomposing the 4MB host IO into 2x2MB, and doing back-to-back IO on each disk (1 or 2 x 256kb IOs), that run without a rotational latency penalty. So, you get the same 4MB performance, and about 8% better 2MB performance. We have tested and measured this.

                I have been told (but not tested) that 8x128kb segments can also be used for 4MB host IO in a similar fashion. I will be testing in a few weeks. I have also been told that 8 x 128kb =1MB hardware stripe will also allow a 8MB host IO without an additional penalty. 8 x 128kb +P +Q is important ... it is also the building block of disk pools. Using disk pools with more than 10 disks allows the possibility of creating a larger single LUN with (multi-threaded) performance greater than 240-300 MB/sec. With a 30-disk disk pool, you could theoretically do 3 x (240 - 300 MB/sec) for a single LUN and single NSD, rather than using 3 LUNs and 3 NSDs.
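The segment-size options discussed in the last two paragraphs can be laid out with a few lines of Python. This is a sketch of the stripe arithmetic only, assuming an 8+P+Q group and host IO that is stripe-aligned; it says nothing about whether the controller really issues the per-disk IOs back to back. The function name is illustrative.

```python
def stripe_layout(host_io_kb, segment_kb, data_disks=8):
    """Return (hardware_stripe_kb, ios_per_data_disk) for one aligned host IO."""
    stripe_kb = segment_kb * data_disks
    if host_io_kb % stripe_kb:
        raise ValueError("host IO is not a whole number of hardware stripes")
    return stripe_kb, host_io_kb // stripe_kb


for host_io_kb, segment_kb in ((4096, 512), (4096, 256), (4096, 128), (8192, 128)):
    stripe_kb, per_disk = stripe_layout(host_io_kb, segment_kb)
    print(f"{host_io_kb} KB host IO, {segment_kb} KB segments: "
          f"{stripe_kb} KB stripe, {per_disk} IO(s) per data disk")
```

The smaller the segment, the more back-to-back segment IOs each disk has to complete per host IO, so the "no rotational latency penalty" behavior matters more as you go from 512kb to 128kb segments.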

                I'd suggest asking Ray Paden about 4MB IO across the mpt2sas driver....

                Good luck.

                Dave B
              • ufa
                122 Posts
                ACCEPTED ANSWER

                Re: GPFS Blocksize, Storage segment size, IO size, Performance

                ‏2012-09-27T08:58:13Z  in response to ufa
                Using the new BIOS settings (see previous posting) I saw the following gpfsperf read outcome per server (compare to my initial posting in this thread). TH is number of threads in gpfsperf.

                TH=2 ### RATE/CLI=1086MB/s
                TH=4 ### RATE/CLI=1092MB/s
                TH=8 ### RATE/CLI=1236MB/s
                TH=16 ### RATE/CLI=1985MB/s
                TH=32 ### RATE/CLI=3171MB/s
                TH=64 ### RATE/CLI=2808MB/s
                TH=96 ### RATE/CLI=3120MB/s
                TH=128 ### RATE/CLI=2586MB/s

                The tests run on the two NSD servers (one gpfsperf on each) which have the filesystem NSDs directly attached, file size is 192GB. While this improvement is not related to any changes in the Linux IO, it looks substantial. I am just not sure whether we see the effect of some caching here. However, the maximum is reached at 32 threads. If gpfsperf does equally spaced reads, we would need to hold the entire 2x192GB in cache somewhere for 100% hits which exceeds the system capabilities.
                However, if gpfsperf starts all threads reading sequentially from file start, we'd see a lot of cache hits.
                My suspicion that we see a cache issue comes from the write rates, which, for the same thread numbers look per client like:
                TH=2 ### RATE/CLI=2340MB/s
                TH=4 ### RATE/CLI=2259MB/s
                TH=8 ### RATE/CLI=2160MB/s
                TH=16 ### RATE/CLI=2069MB/s
                TH=32 ### RATE/CLI=2091MB/s
                TH=64 ### RATE/CLI=2259MB/s
                TH=96 ### RATE/CLI=2209MB/s
                TH=128 ### RATE/CLI=2209MB/s
                But that may be due to the fact that the write cache of the DCS buffers all writes from the start, while the low-thread reads are not cache-buffered and do, due to our as-seen 512KB IO size and 512KB segment size, involve only a small number of spindles. When going to more threads, reads involve more spindles and thus get faster. We'll see how the system behaves when using more client systems.

                • db808
                  86 Posts
                  ACCEPTED ANSWER

                  Re: GPFS Blocksize, Storage segment size, IO size, Performance

                  ‏2012-09-27T16:59:00Z  in response to ufa
                  Hi ufa,

                  IMO, you have leaped forward too far, are using a difficult-to-understand and control testing tool without documented baselines for comparison, potentially distorted the test due to the "small" file (0.2 % of the usable disk space), AND are not measuring back-end performance.

                  You are questioning why the results are so erratic. I ask what you are trying to measure or validate, and what have you done to "control", assuming the test environment has sufficient degrees of control?

                  I suggest that you need to continue validating expected behavior in the low-mid layers of the IO stack ... before you put a complex IO exerciser on top of GPFS several layers above.

                  chr78 clarified that the SAS configuration can do 1200 MB/sec on a single path, apparently implemented as a dual-lane 6gbit channel.

                  With one active path per storage processor, each handling 3 x 10-disk RAID6 LUNs, there is a peak potential of 2400 MB/sec from a single node ... we don't need to do 2-node testing (yet) to identify the storage-centric performance envelope.

                  The last I understood was that there was a restriction at the SAS mpt2sas driver level resulting in a sg_tablesize of only 128 ... which should allow GPFS to do atomic 2 MB IO. OK. So be it. Let's see if we can live with the behavior ... realizing that it might taint some areas and distort some statistics.

                  If a 4MB IO comes down the IO stack, and is presented to the top of the driver level, and internally the driver can only handle 2MB due to scatter/gather limitations, the 4MB IO will be serialized and decomposed into 2 x 2MB IOs back to back, on the same SAS path.

                  Based on the performance numbers (greater than 1000 MB/sec per 60 disks) it appears that the serialized 2 x 2MB host IOs are being properly handled on the storage processors, and are resulting in back-to-back IOs being sent to the individual stripe segments in such a manner that they do NOT incur a rotational latency penalty. Given this observation, other than a 2x higher setup/takedown overhead and a falsely 2-deep back-end disk queue ... the overall IO performance is comparable to a 4 MB host IO ... which is good.

                  You can also reverse-engineer that the storage is NOT behaving like it had (only) 2 MB IO. Performing random 2MB IO would result in 60-disk performance at 1000 MB/sec or less.

                  Before we go further, have you estimated how fast you "should" go ... based on the disk mechanics? This theoretical estimate is useful to understand, and also a flag if performance levels above this estimate are observed.

                  At the mechanical level, seek performance is different for reads and writes. This results in different mechanical performance for reads and writes. For the near-line SAS disks, the average seek for reads is 8.5 msec, and 9.5 msec for writes. I usually don't worry about this ... at this point ... and use an average (9.0 msec) for both reads and writes. Later ... if I think it might be tainting the results, I can subtract .5 msec for reads and add .5 msec for writes ... from the "average" results.

                  If you go through the math, using the blended seek time you get 2,000 MB/sec for 60 disks doing 512kb IO ... assuming no controller bottlenecks, and perfect balancing, and no congestion.

                  For a RAID6 read without PRRC, subject to silent data corruption, you only need to "touch" 8 disks ... the parity segments are skipped over ... and there is no parity generation or comparison CPU overhead. You will approach 2000 MB/sec for 60 disks, or 333 MB/sec per 10 disk LUN. This equates to an average of 12 msec for each 4 MB IO. The IOs to the segments are not really happening at 12 msec, that is the effective average.

                  A 10-disk RAID6 group receiving a 4MB IO, ends up performing a 512kb IO across 8 disks (or 2x256kb) ... with 2 disks idle. If there is a second 4MB IO queued, it can start reading the idle disks. On average you can perform 1.25 8-disk reads on a 10-disk raid group.

                  If you are running in PRRC mode, all 10 disks are read on EVERY read, 8 data segments, and the P parity and Q parity segments. This results in getting 8 data payload "units of work" per 10 physical disks ... or 80% of the 10-disk read bandwidth. 2,000 MB/sec for 60-disk * .8 = 1600 MB/sec for 60 disks in PRRC mode ... assuming the controllers can handle all the parity work.

                  On a 4-MB write, you also engage all 10 disks. There will be 8 writes of data and 2 writes of parity. So similar to the PRRC read case, you get 8 disks's worth of data payload per a 10 disk raid group. This results in 1600 MB/sec for writes for 60 disks.

                  Based on the understanding that read seek time is .5 msec faster than the blended value we are using, and writes are .5 msec slower ... you could bump the read estimate up by about 4%, and the write estimate down by about 4%.

                  So the numbers I like to use as a ceiling when doing 4 MB IO are:
                  1600 MB/sec per 60 disks, 4 MB PRRC read
                  2000 MB/sec per 60 disks, 4 MB read susceptible to silent data corruption
                  1600 MB/sec per 60 disks, 4 MB write
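For anyone who wants to re-run these numbers for a different disk count, here is a short Python sketch. It does not derive the 2000 MB/sec figure from the disk mechanics; it takes that figure as an input, exactly as quoted above, and only applies the 8-of-10 payload ratio and the per-LUN division. All names are illustrative.

```python
def ceilings(raw_read_mb_s=2000.0, disks=60, group_size=10, data_disks=8,
             host_io_mb=4.0):
    """Derive the rule-of-thumb ceilings from the non-PRRC read estimate."""
    luns = disks // group_size
    # PRRC reads and full-stripe writes touch all 10 disks for 8 disks of payload
    payload_limited = raw_read_mb_s * data_disks / group_size
    per_lun = raw_read_mb_s / luns
    return {
        "read_no_prrc_mb_s": raw_read_mb_s,
        "read_prrc_mb_s": payload_limited,
        "write_mb_s": payload_limited,
        "per_lun_read_mb_s": per_lun,
        "effective_ms_per_host_io": host_io_mb / per_lun * 1000.0,
    }


for name, value in ceilings().items():
    print(f"{name}: {value:.1f}")
```

The per-LUN and per-IO values are effective averages, as noted earlier; the individual segment IOs do not each take that long.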

                  The DCS3700 controller's parity processing capability seems to max out around 1800-1900 MB/sec, so if you add an expansion chassis of 60 disks, bringing the total to 120 disks, performance does not scale as expected. The PRRC reads and writes max out at about 1800 MB/sec. You can get close to 4000 MB/sec doing 4 MB reads susceptible to silent data corruption on 120 disks .... and you are at the advertised controller max.

                  If you create a single file system across all 60 disks, and use GPFS "scatter" allocation, you should be able to easily achieve those numbers ... given that we already addressed ensuring 4 MB IO from GPFS to the SAS driver. The GPFS buffer pool should be gigabytes in size, in my opinion. Don't forget to increase maxMBps.

                  It is useful to take the expected IO rate across all storage for a given node, and divide it into the shared buffer pool size to get a gross estimate of the data velocity. Can you keep a few seconds of data in the buffer pool when running at full speed?
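As a quick Python sketch of that division: the 4096 MB pool size below is only an example value I picked, not a recommendation, and the 1600 MB/sec rate is the ceiling discussed above.

```python
def buffer_residency_seconds(buffer_pool_mb, io_rate_mb_s):
    """Seconds of data the buffer pool can hold at the given aggregate IO rate."""
    if io_rate_mb_s <= 0:
        raise ValueError("io_rate_mb_s must be positive")
    return buffer_pool_mb / io_rate_mb_s


# Example: a hypothetical 4096 MB buffer pool fed at 1600 MB/sec
print(f"{buffer_residency_seconds(4096, 1600):.2f} seconds of data in the pool")
```

If the answer comes out well under a second, the pool turns over too quickly to absorb bursts at full speed.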

                  You can write a simple program to just read a file, and throw away the data. Like a dd output to /dev/null ... without the overhead of doing the writes. Similarly, you can write a simple program to just write a data pattern to a file, given a length. This is like a dd from /dev/zero to a file ... without the read overhead.
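A minimal Python sketch of such a pair of programs is below. It is only meant to show the idea: fixed 4 MiB requests, no data processing, and a timing around the loop. The function names and the fill byte are arbitrary, and Python adds some overhead of its own, so treat the rates as a lower bound compared with a C version.

```python
import os
import time

BLOCK = 4 * 1024 * 1024  # 4 MiB requests, matching the 4MB blocksize discussed here


def write_pattern(path, length, block_size=BLOCK, fill=b"\xa5"):
    """Write `length` bytes of a repeating pattern; return elapsed seconds."""
    block = fill * block_size
    start = time.perf_counter()
    with open(path, "wb") as fh:
        remaining = length
        while remaining > 0:
            chunk = block if remaining >= block_size else block[:remaining]
            fh.write(chunk)
            remaining -= len(chunk)
        fh.flush()
        os.fsync(fh.fileno())  # include the flush to storage in the timing
    return time.perf_counter() - start


def read_and_discard(path, block_size=BLOCK):
    """Read the whole file in block_size requests; return (bytes_read, seconds)."""
    total = 0
    buf = bytearray(block_size)  # reused buffer, contents are thrown away
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as fh:
        while True:
            n = fh.readinto(buf)
            if not n:
                break
            total += n
    return total, time.perf_counter() - start
```

Divide the byte count by the elapsed seconds to get the application-level rate, and compare it with what iostat shows for the same interval.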

                  Even using "dd", you should be able to create a series of large files. The files should be larger than the buffer pool size, so you know they won't be cached.

                  For simple testing, I like to use separate, disjoint files, so there are no file locking issues. This is the best case.

                  dd if=testfile1 of=/dev/null bs=4M .... should run close to the maximum read speed ... 1600 MB/sec for PRRC. The GPFS read ahead threads will quickly ramp up and keep all 6 LUNs busy. You can run iostat and monitor the performance ... and you should see 4MB average IO size on fiber channel, and 2 MB average IO size on SAS. The SAS statistics for latency and queue depth will be distorted due to the 2x2MB serialization.

                  If you want to fine tune the test, when you create the NSDs, order the NSDs such that they alternate between storage processor A, then storage processor B, and so forth. That will help avoid micro-clumping on the SAS path.

                  If you run significantly faster than this ... what are you measuring? If it is at the application level ... you are measuring what is going into GPFS. GPFS caching can distort this. Compare the application-level measured rate with the IO statistics from iostat (to the disk), and the DCS3700 statistics (which meter IO to/from the host, not the back end disks).

                  You may also observe that write performance appears faster. This is most often caused by buffer mismanagement, resulting in starvation for read buffers and read parallelism.

                  If you are only testing a single DCS3700, you need to be aware that the host can easily over-drive the storage and drive the storage controllers into non-linear congestion behaviors.

                  On the write performance being better issue ... here is what is likely happening.
                  A read buffer (in both GPFS and the storage) is easily stolen because it has not been modified, and does not need to be flushed to disk. Write buffers contain modified data, that must be flushed to disk before it can be stolen.

                  If GPFS or storage gets a burst of write activity, "free" buffers are used to handle the incoming data. These modified buffers are marked for flushing, but writing to disk takes time (12-18 msec). If more incoming writes are received, what buffers are used? A buffer that is most easily steal-able ... a read buffer. If this condition continues, the backup of write buffers waiting to be flushed gets larger, as more and more read buffers are stolen, eventually limiting the number of parallel reads outstanding ... to a point that reads end up being "flow-controlled" before the writes.

                  GPFS uses much more sophisticated buffer management algorithms, and is dynamically more aggressive flushing modified buffers .. and will perform flow control on incoming writes.

                  The buffer management algorithms in the DCS3700 are much less sophisticated ... and the default cache assignment (80% write, 20% read) amplifies this behavior.

                  In general, in a large-file IO case, with primarily random IO (indicating that buffers will not normally be reused) ... you want to cajole the DCS3700 to flush write data SOONER, rather than let it languish much. A backlog of unflushed buffers will displace reads, and give the appearance of less read performance ... because the writes are effectively monopolizing the disks, leaving less bandwidth for the reads, comparatively. I described this behavior in another posting, along with a workaround that involves not over-driving the storage from the host perspective. Do you really want a default of 72 prefetch threads (also used for write-behind) queued to only 6 LUNs (12 x 4 MB IOs per LUN)? Perhaps 2 to 3 per LUN is more reasonable.
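The thread-count arithmetic behind that suggestion, as a tiny Python sketch. The 72-thread default and the 2 to 3 per LUN target are the figures from the paragraph above; the function names are illustrative and nothing here touches GPFS itself.

```python
def queued_per_lun(threads, luns):
    """Average number of outstanding IOs per LUN if every thread has one IO queued."""
    return threads / luns


def threads_for_depth(luns, per_lun_depth):
    """Thread count that gives the wanted average queue depth per LUN."""
    return luns * per_lun_depth


print(queued_per_lun(72, 6))                              # the default case above
print(threads_for_depth(6, 2), threads_for_depth(6, 3))   # 2 to 3 per LUN
print(72 * 4, "MB in flight with 72 x 4 MB IOs")
```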

                  But also on the storage side, you need to force flushing to disk sooner. Even if you set the write flush time constant to 1 second (the minimum), you could have 250-350 MB of modified data languishing. You may need to decrease your write cache buffer allocation to be only 1-2%, equivalent to only a few 4MB buffers per LUN.

                  If you are seeing consistently higher performance numbers (based on iostat), then you most likely have a data set that is not seeking a typical distance. If you use the "scatter" allocation method, with file systems that span all the storage LUNs, with files larger than 32 GB, you should see consistent results ... for multiple independent sequential streams of reading or writing, using different files.

                  Beyond those sequential-stream centric tests, you begin to get much higher levels of metadata activity, journaling, token management, and cross-node coordination. Analyzing this much more complex activity is much more difficult, and a topic for another discussion.

                  Dave B
                  • ufa
                    122 Posts
                    ACCEPTED ANSWER

                    Re: GPFS Blocksize, Storage segment size, IO size, Performance

                    ‏2012-10-14T21:34:48Z  in response to db808
                    Hi, db808 ,
                    I am overwhelmed by what you wrote.
                    Stirred as well for the PRRC issue.

                    I had, very unfortunately, not had time to read this earlier, as both projects, the one with the SFA12k as well as the one with the DCS3700 storage have been eating all of my time recently.
                    I had, once I'd seen the numbers shown before, created the final filesystems on the storage. I know from what you wrote that this was probably too early WRT getting the maximum out, but people were screaming here.

                    However, the numbers I got with the gpfsperf-framework used before were not looking so bad. As discussed with the customer, there is one small filesystem, spanning 6 DCS3700 with 4 NSD servers attached, and a big one which comprises the other 30 DCS3700 with 20 NSD servers attached plus 2 NSD servers linked to 2 DS3512 with 6 RAID1 each out of 550GB SAS disks.

                    I saw, when running that gpfsperf on the NSD servers, a maximum aggregate throughput of 47GB/s write and 57GB/s read into / out of the big fs. File size was 192GB, beyond all cache sizes, with one file per client (here the NSD servers).
                    As we need to show 55GB/s from the whole setup, and used just 5/6 of it, and given the pressure to get things ready, I decided to move on. Would be very sweet to get such a system for playing, pity I can't afford it privately :-(

                    Is there any rule of thumb how often those PRRC-targeted bit flips do occur? As we have a scratch fs (the big one), and a home fs (the small one) it might be ok to switch on PRRC on the home fs storage ...

                    You rightfully ask for the lower-level baselines to evaluate further measurements of higher-level behaviour, but as said before: the time pressure made me press on, once we had seen that the small-fs throughput was in the range we needed per storage unit.
                    One more thing: as the customer was first insisting on a 2MB IO blocksize, I tested small filesystems (from the 18 LUNs of 3 DCS3700) with 2MB and 4MB BS on both 256KB and 512KB segment size. I found that (as expected from the SAS-restricted IO size) the SS does not have a significant effect on the streaming read/write perf (a point you also made, if I got you correctly), but that 4MB BS is significantly faster than 2MB BS (about 20% at maximum throughput). Thus I set the SS to 256KB but could convince the customer to accept 4MB BS.

                    Currently we chase the IB performance between clients and NSD servers.

                    BTW, I've sent mails to Ray before starting this thread but got no reply so far ... Maybe I am too ignorant ...
                    All the more I am grateful to you for taking the time to write such elaborate postings. I admit I need a bit more time to digest all the information and sort it into my picture, but it is definitely worth reading a few times.

                    • ufa
                      122 Posts
                      ACCEPTED ANSWER

                      Re: GPFS Blocksize, Storage segment size, IO size, Performance

                      ‏2012-10-14T21:39:37Z  in response to ufa
                      Please read "2MB GPFS BS" instead of "2MB IO blocksize" in my last posting!
                      • db808
                        86 Posts
                        ACCEPTED ANSWER

                        Re: GPFS Blocksize, Storage segment size, IO size, Performance

                        ‏2012-10-15T16:57:14Z  in response to ufa
                        Hi ufa,

                        Thank you for the update. It is good to see you making progress, and I understand the dilemma of working with customers, but ultimately, since they are paying the bill, the "customer is always right". If the customer chooses not to follow recommendations/suggestions ... then they are ultimately responsible for the outcome. In some cases, it can be a loss of efficiency, in other cases it can be an improvement in efficiency. As a consultant, you just try to ensure that the customer is making an "informed" decision. In our case, we started from the IBM "best practices" and pushed further ... but we have a much simpler, non-HPC style setup. Efficiency is also "in the eye of the beholder". For many large HPC customers that have millions invested in the compute complex, and millions more in the storage, using a few fewer NSD servers is roundoff error.

                        If you send me a personal email (follow the db808 link to my profile) with your email, I can send you a copy of the IBM GPFS best practice for the DCS3700, authored by Ray Payden. My copy is about a year old.

                        Your overall configuration is very "rich" in NSD servers. The average pair of NSD servers only manages 3 DCS3700's or 2 DS3512's. If you assume ~ 2,000 MB/sec per 60-disk DCS3700 without PRRC ... that equates to (3 x 2000 MB/s) / NSD pair ... 6000 MB/sec / 2 NSD servers = 3000 MB/sec per NSD server. That is a very respectable sizing ... and very modular and building-block style. It is probably also fairly straightforward to manage.

                        Your performance numbers look very good. As I said before, I don't know exactly what gpfsperf is measuring, but I understand that it is an HPC-centric workload and simulates many clients, which is appropriate for HPC.

                        The 60-disk DCS3700 will service about 2,000 MB/sec of read requests without PRRC. For reads with PRRC or writes, you will get about 1,600 MB/sec.

                        For your large file system, made of 30 x DCS3700:
                        30 x 2000 MB/sec (read w/o PRRC) = 60,000 MB/sec (ideal) .... 57,000 MB/sec via gpfsperf
                        30 x 1600 MB/sec (write) = 48,000 MB/sec (ideal) .... 47,000 MB/sec via gpfsperf

                        So as you said, for the large file system, you are running very well ... assuming the DCS3700s have only 60 disks each.
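
                        The comparison above can be reproduced with a short Python sketch. The per-array rates and the gpfsperf results are the figures quoted in this thread; the names are made up for the example.

```python
# Sketch: ideal aggregate throughput vs. the gpfsperf results from this thread.
ARRAYS = 30                                        # DCS3700s in the large file system
PER_ARRAY_MB_S = {"read": 2000, "write": 1600}     # per 60-disk DCS3700, read w/o PRRC
MEASURED_MB_S = {"read": 57000, "write": 47000}    # gpfsperf numbers reported by ufa

def efficiency(kind: str) -> tuple[int, float]:
    """Return (ideal MB/s, measured / ideal) for 'read' or 'write'."""
    ideal = ARRAYS * PER_ARRAY_MB_S[kind]
    return ideal, MEASURED_MB_S[kind] / ideal

for kind in ("read", "write"):
    ideal, frac = efficiency(kind)
    print(f"{kind}: ideal {ideal} MB/s, measured {MEASURED_MB_S[kind]} MB/s, {frac:.1%}")
# read: ideal 60000 MB/s, measured 57000 MB/s, 95.0%
# write: ideal 48000 MB/s, measured 47000 MB/s, 97.9%
```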

                        I suspect that there may be some momentary clumping due to nsd-to-nsd sequencing, but this potential clumping only impacts cases where you have limited read-ahead (reducing the effective IO threading). I do not believe that gpfsperf exercises those conditions.

                        The NSD-to-NSD ordering may be having a minor impact on the IB performance issues that you say you are chasing. Look at the GPFS IO history (using mmdiag or mmfsadm), and see how well the IO requests are dispersed across the 30 NSD servers. What you don't want to see is 9 IO requests to NSD server1, then 9 IO requests to NSD server2, and so forth. Based on your topology, an NSD server will handle ~ 3,000 MB/sec ... so during that 9 IO "clump", you will be limited to the performance of the NSD server you are talking to. That is "only" 3,000 MB/sec or about 37.5 Gbit of IB bandwidth ... not the 40 or 56 Gbit IB bandwidth you might be expecting.

                        You don't get a second NSD server contributing to the performance (in this example) until you get the 10th IO queued. This behavior can be corrected by manipulating the order of the NSDs when you are creating the file system. Unfortunately, you can't change it after the file system is created. I can tell you how to better order the NSDs ... send me an email directly.
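
                        One way to spot such clumps is to look at run lengths in the sequence of target NSD servers. The Python sketch below assumes you have already extracted that sequence from the IO history yourself; it does not parse mmdiag or mmfsadm output, and the server names are hypothetical.

```python
from itertools import groupby

def run_lengths(servers):
    """Collapse a sequence of target NSD servers into (server, consecutive_count) runs."""
    return [(name, sum(1 for _ in group)) for name, group in groupby(servers)]

def max_run(servers):
    """Longest stretch of consecutive IOs sent to the same NSD server."""
    return max((count for _, count in run_lengths(servers)), default=0)

# Hypothetical sequences: a clumped ordering and an interleaved one.
clumped = ["nsd1"] * 9 + ["nsd2"] * 9
interleaved = ["nsd1", "nsd2"] * 9

print(run_lengths(clumped))   # [('nsd1', 9), ('nsd2', 9)]
print(max_run(interleaved))   # 1
```

                        A long maximum run means that only one NSD server is busy until the queue depth exceeds the run length.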

                        The best thing for the customer was them agreeing to use a 4 MB GPFS block size. I've attached a chart of large file throughput vs. blocksize. If the customer uses 2 MB blocksize, they will need many more disks than when using 4 MB. BTW ... we're moving to an 8 MB blocksize, now that it is officially supported with GPFS. Of course, sooner or later you hit a bottleneck in the DCS3700 controller.

                        Your segment size decision of 8 x 256KB segment = 2MB hardware stripe is fine. The DCS3700 takes the 4MB host IO and issues back-to-back IOs to the disk(s), without an additional rotational latency penalty. As I said in my previous post, you usually want to use the SMALLEST hardware stripe size (and corresponding segment size) that still yields the large IO performance without the extra rotational latency. Your gpfsperf benchmark will probably not be able to show the benefit ... but with smaller hardware stripes ... IOs that are smaller than 4 MB are more efficiently handled (still in parallel mode). If your files are really all ultra-large, GPFS will always use full blocks, and never read/write individual GPFS sub-blocks (which are 1/32 of the block size).
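
                        The stripe geometry can be checked with a small Python sketch. The 8 data segments per stripe and the 1/32 sub-block ratio are taken from the paragraph above; the function name and dictionary keys are made up for the example.

```python
def stripe_geometry(segment_kb, data_disks, block_kb, subblocks_per_block=32):
    """Relate RAID segment size to the GPFS block size (all sizes in KB)."""
    stripe_kb = segment_kb * data_disks          # data portion of one hardware stripe
    return {
        "stripe_kb": stripe_kb,
        "stripes_per_block": block_kb / stripe_kb,
        "subblock_kb": block_kb / subblocks_per_block,
    }

# 256KB segments, 8 data disks, 4MB GPFS block size (the configuration chosen)
print(stripe_geometry(256, 8, 4096))
# {'stripe_kb': 2048, 'stripes_per_block': 2.0, 'subblock_kb': 128.0}

# 512KB segments for comparison: one full stripe per GPFS block
print(stripe_geometry(512, 8, 4096))
# {'stripe_kb': 4096, 'stripes_per_block': 1.0, 'subblock_kb': 128.0}
```

                        With 256KB segments a 4MB block covers exactly two full stripes, and a 2MB IO still covers one full stripe.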

                        As to the PRRC topic, just do a Google search on disk "silent data corruption" ... you will get over 60K hits. The two big studies were:

                        University of Wisconsin-Madison, University of Toronto, and NetApp
                        1.53 million disks in thousands of systems
                        Data collected over 4 year period

                        CERN (European Organization for Nuclear Research)
                        3000 independent nodes, ~500 with attached JBOD, aggregate of 1.5PB
                        Data collected over 2 month period

                        Summary from the University study:
                          Checksum mismatches
                            Possible causes are data bit corruption, torn writes
                            Nearline: 0.66%, Enterprise: 0.06%
                          Parity inconsistencies
                            Lost writes, misdirected writes, incorrect parity calculation
                            Nearline: 0.147%, Enterprise: 0.017%
                          Identity discrepancies
                            Lost writes, misdirected writes
                            Nearline: 0.042%, Enterprise: 0.006%

                        From the CERN study
                        Write, then read a ‘special’ 2GB file to and from 3,000 nodes every 2 hours
                        After 5 weeks, they had 500 errors on 100 nodes
                        33,700 files compared against a known checksum
                        22 mismatches found
                        Translates to 1 in 1,500 files being unreadable
                        10,000 compressed files comparison
                        99.8% chance that a single unrecoverable bit error would lead to a corrupted file

                        Pre-Read-Redundancy-Check is one of the poorest ways to handle silent data corruption, but it is all that is available. The best solution is now called T10 Protection Information (PI). T10 is the SCSI standards committee. You will also see it called the T10 Data Integrity Field (DIF) or T10 Data Integrity Extension (DIX), or 520-byte sectors.

                        The new "performance controller" option for the DCS3700 supports 520-byte sectors (T10 PI) on both the back end and front-end to the host. Most users will just enable the back-end 520-byte sectors, and keep the host connection the standard 512 byte sectors. With AIX, you can also enable T10 PI support all the way to the host. This is most often used with GPFS native RAID on the Power 775 super computer, where the NSD server acts as a deconstructed RAID controller.

                        Many "enterprise" disk storage vendors include T10 PI support by default, and do not even offer a mechanism to disable it. Unfortunately, they also don't market this capability very well. These enterprise disk vendors have been doing 520-byte sectors for decades, and many customers know of nothing else. There are many mid-range storage vendors that "cut corners" (in my opinion) and put the data at risk. There are billions of dollars of storage being sold each year by these (often large) mid-range vendors that are susceptible to silent data corruption. It is the industry's "dirty little secret", but the customers are letting them get away with it.

                        I remember reading about a researcher who ran a readily available IO testing tool that would create specially formatted (and internally checksummed) data files, and then read the files back ... checking the checksum on every read. If the IO test was run as fast as possible, typically requiring multiple clients, checksum errors could be detected within THREE DAYS of testing.
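
                        The idea behind such a tool can be sketched in Python. This is not the tool the researcher used (it is not named here); the record layout, block size, and function names are assumptions made for the illustration.

```python
import hashlib
import os
import tempfile

BLOCK = 1024 * 1024                      # payload bytes per record (arbitrary choice)
DIGEST = hashlib.sha256().digest_size    # 32 bytes of SHA-256 stored before each payload

def write_checked(path, n_blocks):
    """Write n_blocks records, each laid out as: digest(payload) + payload."""
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            payload = os.urandom(BLOCK)
            f.write(hashlib.sha256(payload).digest() + payload)

def verify(path):
    """Return the indices of records whose stored digest does not match the payload."""
    bad = []
    with open(path, "rb") as f:
        index = 0
        while True:
            record = f.read(DIGEST + BLOCK)
            if not record:
                break
            digest, payload = record[:DIGEST], record[DIGEST:]
            if hashlib.sha256(payload).digest() != digest:
                bad.append(index)
            index += 1
    return bad

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "probe.dat")
        write_checked(path, 4)
        print(verify(path))   # an empty list means no mismatches were found
```

                        Note that a file this small is read back from the host page cache, so it never exercises the disks. A meaningful test needs files larger than all caches in the path, repeated reads, and usually several clients.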

                        Part of the need for double-parity RAID6 was due to getting silent data corruption errors during a rebuild after a single disk failed. When you go through the error statistics math, the probability of an error is about once in every ~ 14 TB of IO. Once disks became larger than 1TB, the 5-disk RAID 5 groups did 4-5 TB of IO during a rebuild, and there was a ~30% chance that the rebuild would fail. The industry cleverly migrated to double-parity RAID6 to mask the problem during rebuilds ... but it still exists if you are not explicitly handling it.
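
                        The ~30% figure can be approximated with a simple model. The Python sketch below assumes that errors arrive independently at an average of one per 14 TB read (a Poisson model); that assumption is added for the illustration and is not stated above.

```python
import math

def rebuild_failure_probability(rebuild_tb, tb_per_error=14.0):
    """P(at least one unrecoverable error) while reading rebuild_tb terabytes.

    Assumes independent errors at an average of one per tb_per_error TB.
    """
    return 1.0 - math.exp(-rebuild_tb / tb_per_error)

for tb in (4, 5):
    print(f"{tb} TB rebuild: {rebuild_failure_probability(tb):.0%}")
# 4 TB rebuild: 25%
# 5 TB rebuild: 30%
```

                        Under this model a 4-5 TB rebuild fails roughly 25-30% of the time, in line with the estimate above.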

                        Periodic data scanning does catch some of the errors, but NOT all of them.