Topic
  • 11 replies
  • Latest Post - ‏2014-05-22T19:00:14Z by yuri
heb
8 Posts

Pinned topic change order of round-robin nsd striped writes

‏2014-05-19T10:26:03Z |

Hello,

In a previous post, yuri explained how GPFS determines the order in which it stripes writes to disks. I wonder whether I can change this order after the initial storage pool creation. I want to do this for two reasons:

  • I did the initial ordering wrong: I didn't alternate controllers as yuri recommended.
  • I want to add more controllers and LUNs to an existing storage pool, so I want to reorder the stripe sequence; otherwise all the new LUNs will be appended as one block.

If I need to shut down all nodes and, say, edit the mmsdrfs file, that would be OK. I run GPFS 3.5.0-13 on Linux.

Citation from previous post by yuri:

The NSD stripe ordering is controlled entirely by the order of NSDs, something that's specified by the disk descriptor file given to mmcrfs/mmadddisk, and can be queried via mmlsdisk.  If the latter lists, say "nsd1; nsd2; nsd3; nsd4", that's going to be the disk order for striping. 

In order to spread the IO across multiple arrays, one has to be judicious about the order in which disks are listed in the original disk descriptor at mmcrfs/mmadddisk time.  Instead of doing "disk1-from-controller1; disk2-from-controller1; ...; disk1-from-controller2; disk2-from-controller2" one should choose "disk1-from-controller1; disk1-from-controller2; ...; disk2-from-controller1; disk2-from-controller2".

yuri
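
For illustration, the interleaved ordering might look like this in a hypothetical GPFS 3.5 stanza file (placeholder NSD names of the form c<controller>d<disk>; the older colon-separated descriptors work the same way, the key point being only the line order - check the mmcrfs/mmadddisk man pages for the exact syntax on your release):

  nsd.order (one stanza per line, in the order you want the slots assigned):

    %nsd: nsd=c1d1 usage=dataAndMetadata
    %nsd: nsd=c2d1 usage=dataAndMetadata
    %nsd: nsd=c1d2 usage=dataAndMetadata
    %nsd: nsd=c2d2 usage=dataAndMetadata

  mmcrfs fs1 -F nsd.order -B 4M -T /gpfs/fs1
  mmlsdisk fs1 -L        # verify the resulting slot order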

Updated on 2014-05-19T10:26:45Z by heb

  • esj
    104 Posts

    Re: change order of round-robin nsd striped writes

    ‏2014-05-19T15:06:36Z  

    Whenever I see or hear the words "edit the mmsdrfs file" I cringe ... But we'll leave
    this aside for the moment.  The command that you need is mmchnsd.  It can be used
    to replace the current NSD server declarations with whatever you specify.

    Take a look at "Changing your NSD configuration" in chapter 4 of the Admin guide
    and at the mmchnsd man page.  Note that when changing NSD servers, the affected
    file system cannot be mounted.  So, if you can shut down the entire cluster, you can
    change all of the NSD server definitions with a single mmchnsd command.
    Your other alternative is to create separate input files for each of the file systems
    and then process the file systems one at a time (mmunmount - mmchnsd - mmmount).
    In either case, avoid running separate mmchnsd commands for each disk separately.
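
    As a rough sketch of that per-file-system flow (hypothetical file system and server names; check the mmchnsd man page for the exact descriptor syntax on your release):

      mmumount fs1 -a              # the affected file system must be unmounted everywhere
      mmchnsd -F newservers.txt    # one input file covering all affected NSDs,
                                   # with entries like "nsd1:serverA,serverB"
      mmmount fs1 -a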

    Bypassing the mmchnsd command and editing the mmsdrfs file directly is not a good idea.
    Practically all changes to the mmsdrfs file have subtle side effects that may break
    things down the road and it is very hard to figure out what went wrong.

    Eugene

  • yuri
    277 Posts
    ACCEPTED ANSWER

    Re: change order of round-robin nsd striped writes

    ‏2014-05-19T16:23:25Z  

    Once the striping order is established, it can be changed, but not easily.  As a rule, whenever one comes up with a solution to a problem that involves "edit mmsdrfs", it's almost a guarantee that the solution is wrong.  Yes, mmsdrfs is a plain text file, and hacking it would in fact be a simple task, and thus a temptation, but this is a case of "easy" being the opposite of "useful".  In this case, mmsdrfs has no relevance to the problem at hand.

    The disk striping order is controlled by the ordering of the disks in a given storage pool, and that ordering is recorded in the file system descriptor on disk.  Furthermore, the list of disks is a fundamental part of disk addressing.  Disk addresses refer to disk slots, so one can't simply reorder disks without scrambling the file system content.  The only way to change the disk ordering is to use mmdeldisk and mmadddisk, very carefully, to remove the disks that are ill-placed in the striping order and then add them back in the desired order.  "mmlsdisk fsname -L" will show the current order, including disk slot numbers.

    mmadddisk uses a simple "first fit" algorithm for finding an empty slot for a newly added disk, so the placement of disks is deterministic.  If one, say, deletes diskA in slot 2 and diskB in slot 4, and then adds them back in the "diskB;diskA" order, then diskB will end up in slot 2 and diskA in slot 4 (provided slots 1 and 3 are not empty).  So disk reordering in an existing file system is certainly doable, but it's a fairly time-intensive task, not a simple configuration tweak.
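
    As a minimal sketch of that delete-and-re-add shuffle, using the hypothetical diskA/diskB example above and a file system called gpfs0 (descriptors trimmed to just the NSD names; verify against the man pages):

      mmlsdisk gpfs0 -L               # note the current slot numbers
      mmdeldisk gpfs0 "diskA;diskB"   # migrates their data off and frees slots 2 and 4
      mmadddisk gpfs0 "diskB;diskA"   # first fit: diskB lands in slot 2, diskA in slot 4
      mmlsdisk gpfs0 -L               # confirm the new order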

    yuri

  • dlmcnabb
    1012 Posts
    ACCEPTED ANSWER

    Re: change order of round-robin nsd striped writes

    ‏2014-05-19T16:49:27Z  

    The round-robin order is defined by the failure groups (FGs) and by the order in which the disks were added to each FG, which you can see with mmlsdisk $fsname -i. The actual FG values are not used in the following example, only the order in which they are found.

    The RR order for data blocks of a file is (this does not apply to metadata blocks):

    FG1 disk1, FG2 disk1, FG3 disk1, ...

    FG1 disk2, FG2 disk2, FG3 disk2, ...

    ...

    As FGs run out of disks, they are skipped until the entire list of disks has been used up, and then the next RR pass starts over at the beginning.

    Each file picks a hashed value based on the inode number as the starting point in the RR order for its first block, and then cycles through that order.

    You can use mmchnsd to reassign FGs to each disk, but you cannot reorder the disks in each FG. Then run mmrestripefs $fsname -b to rebalance every block of every file to the new RR order. This is IO intensive, so run it at a low usage time.

     

  • dlmcnabb
    1012 Posts

    Re: change order of round-robin nsd striped writes

    ‏2014-05-19T16:51:20Z  

    Oops in the last paragraph of my previous post: that should read "You can use mmchdisk ...", not mmchnsd.

  • db808
    87 Posts

    Re: change order of round-robin nsd striped writes

    ‏2014-05-19T20:37:26Z  

    Hello all...

    I would like to chime in here, especially given all the relatively "negative" responses to this question.

    Ok ... we have established that it will not be "easy" or trivial to do.

    So the next question is: how might it be possible to "churn" the file system, using a combination of mmdeldisk/mmadddisk-like commands at the disk/LUN level and/or the NSD level, and how can the amount of IO involved be bounded or estimated?

    Given that it will likely take a substantial amount of IO, what might be the benefit for the work done?  If there are already bottlenecks preventing full exploitation of the back-end parallelism, correcting the NSD ordering may not result in a substantial improvement.  So I suggest that we need to quantify how well (or not) the cluster is performing IO, and whether other items in the IO stack might need to be addressed first.

    At our installation, we are using a non-HPC-like GPFS topology, with "fat" NSD servers that are SAN-attached to all LUNs.  We also run a significant workload on the NSD servers themselves, with the non-directly-attached GPFS clients performing a relative minority of the IO.  I understand that this is NOT the typical GPFS topology, and our scalability pain points are different from those of the typical HPC user, who has a subset of LUNs co-owned by a pair of NSD servers, with half primary on one NSD server and half using the second NSD server as primary.

    However, I would suggest that re-ordering can be done, if somewhat painfully, and it is up to the user to decide whether it is worth the effort.

    So .. there are a few sub-topics here. 

    1. Where do you think the bottleneck is and why?
    2. The NSD-server ordering as visible to the GPFS client within its subnet.
    3. The NSD ordering when the file system was created, along with the size differences across the LUNs, if any.
    4. The storage processor ordering within a storage array. (sometimes changing the storage processor "owner" for a LUN can work wonders)
    5. The host port and storage port balance if multiple active paths are used.

    Some of these items are relatively easy to change, perhaps with a filesystem shutdown, change, and restart.

    Also significant are the ratios of the bandwidths of the sub-components.  If you are using an 8 MB GPFS block size and (8+2) RAID6 on nearline SAS disks with scatter-mode allocation, you can achieve ~ 610 MB/sec per LUN.  Your concern about balancing across NSD servers is much different if the NSD server is 10GbE connected (for 1250 MB/sec throughput) than if you are 56 Gbit IB connected, with 4,000 MB/sec throughput.  At 610 MB/sec per LUN ... a significant issue could be the multi-path IO balance and ordering from the NSD server to the storage array ... which CAN be changed transparently, without GPFS's knowledge.

    If all the other bottlenecks were removed, and the sources of micro-clumping and micro-starvation were addressed, at some point the NSD ordering within the GPFS file system could become significant.

    If so, how can the ordering be improved to allow better parallel operation?

    I understand that GPFS uses a construct called NSD "slots" to describe the ordering of NSDs.  So we also need to understand how an NSD is assigned to a "slot", what happens to the "slot" if a disk or NSD is deleted/destroyed/replaced, and what determines when/if a "slot" is reused.

    For example ... assume that the file system is NOT full, and there is enough free space to delete a few NSDs' worth of space.  If this is not true now, it will be in the future, when the storage is expanded and additional LUNs are about to be introduced.

    Let's say I took the last NSD, did an mmdeldisk on the disk, and an mmdelnsd on the NSD.  This will take some time to relocate all the data.  It would also require some metadata to be updated, and some of this metadata may need to be journalled.

    Assuming we have the output of the "mmdf" command, which shows the amount of space in use on the disk, can we quantify the amount of IO needed to evict all the data, the amount of IO needed to write the evicted data elsewhere (and its replica), the amount of IO needed to update the metadata, and whether such metadata updates are easily cached?  Also, what metadata journalling, if any, is needed?  Don't forget the extra writes if the metadata is replicated.

    Now we have a free disk.  Let us use mmcrnsd to create a new NSD, named "Z" in this example.

    I posted a question similar to this in the past, also answered by Yuri:

    https://www.ibm.com/developerworks/community/forums/html/topic?id=f655d80d-30f4-465e-8692-7dbaed08a55a&ps=25

    If you had an original sequence of NSDs "A B C", and then did a mmrpldisk of "B", with "Z", the resulting order would be "A Z C" with disk B now being free.

    You have just changed the effective ordering of the NSDs.

    Lather, rinse, repeat.  You need to determine the sequence of "replace disks" that you need to perform to end up with the final ordering.  If you are adding storage, the new disks are already free, so no initial eviction of disk data is needed.  When you finally get done, you need to do a final mmrestripefs to balance the space usage.
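
    A rough sketch of that replace-disk shuffle, using the hypothetical NSDs B and Z above and a file system called gpfs0 (exact mmrpldisk arguments per the man page):

      mmdf gpfs0                  # estimate how much data will have to move
      mmcrnsd -F z.stanza         # create the spare NSD "Z" on a free LUN
      mmrpldisk gpfs0 B Z         # Z takes over B's slot; B becomes a free NSD
      # ... repeat mmrpldisk (or mmdeldisk/mmadddisk) until the slot order is right ...
      mmlsdisk gpfs0 -L
      mmrestripefs gpfs0 -b       # final rebalance of the space usage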

    The big question is the order of magnitude of the needed intermediate metadata updates, metadata journalling, and metadata replication, in addition to the data movement, and what storage performance levels are available for the metadata updates and metadata journalling.  If your metadata is on SSDs, this might not be a big deal ... other than moving the data.  If the metadata is on mechanical disks, then there could be several metadata updates per data block or fragment moved.  Even so ... it might be worth the effort.

    We attempted using mmrpldisk to change the NSD ordering on one GPFS file system, but unfortunately had a very performance-challenged metadata implementation.  Because of this, it became impractical to perform the re-ordering that we desired when we were adding 60% more storage.

    We still think that this is practical, once you can bound the amount of IO activity needed for all the metadata work.

    In your case, your file system sounds like it is fairly young.  If it is not that full yet, and it is on contemporary storage with write caching, it should be doable.

    Dave B.

     

  • heb
    8 Posts

    Re: change order of round-robin nsd striped writes

    ‏2014-05-20T09:29:02Z  

    Hello all,

    thank you for all the replies. So I understand that there are two levels of ordering that determine the round-robin disk striping for data in a storage pool:

    1. the sequence of failure groups
    2. the sequence of NSDs within a failure group (the sequence of disk slots occupied by the NSDs)

    Changing the failure group is rather easy (mmchdisk) and can be done on a mounted file system. Of course it will not move any data of existing files but will affect all files written in the future.

    Changing the sequence of NSDs in a failure group is cumbersome and definitely needs to move data off the NSDs and back as you need to delete/recreate NSDs.

    Now in my case I don't care much about existing data; I care most about writes to newly created files - the GPFS file system is a cache, and files get deleted after a few weeks' time. Write performance is most critical. I will add more storage, so I do have some space to move data around. Currently I have two failure groups, each with 12 NSDs. My plan to get a nice round-robin ordering is:

    1. add a third failure group with 12 new additional NSDs ordered nicely from the beginning
    2. suspend all NSDs in one existing failure group
    3. wait some time until the suspended NSDs get almost empty - we regularly delete data, so this should happen
    4. run mmrestripefs -m or mmrestripefs -r to empty the suspended NSDs completely
    5. delete all suspended NSDs and add them again to the failure group, now in a nice order
    6. repeat the steps above for the second failure group

    This will take some time, and I won't see the performance increase of the additional NSDs until I've finished, but it doesn't require a lot of data movement and the file system can stay mounted the whole time.
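
    For reference, a rough command sketch of the plan (hypothetical NSD names fg1nsd1 ... fg1nsd12 and file system gpfs0; not tested, verify against the man pages):

      mmadddisk gpfs0 -F thirdfg.stanza                    # step 1: new failure group, nicely ordered
      mmchdisk gpfs0 suspend -d "fg1nsd1;fg1nsd2;fg1nsd3"  # step 2: list all 12 NSDs of the first FG
      # step 3: wait for the regular deletions to drain the suspended NSDs
      mmrestripefs gpfs0 -m                                # step 4: move the remaining data off
      mmdeldisk gpfs0 "fg1nsd1;fg1nsd2;fg1nsd3"            # step 5: remove them from the file system
      mmadddisk gpfs0 -F fg1.reordered.stanza              #         and re-add them in the desired order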

    I do see the point in Dave's reply that I may not see any measurable benefit if other devices/systems/configurations are the bottleneck. But the file system is rather new and we want the best possible performance; I don't want a non-optimal configuration right from the beginning, so I'll change it now while I still can. The NSD servers have dual FDR InfiniBand links and each server handles 12 NSDs as primary server.

    kind regards,

    Heiner B.

  • ufa
    145 Posts

    Re: change order of round-robin nsd striped writes

    ‏2014-05-20T20:24:08Z  

    One more remark: whether you see an effect or not (regardless of other bottlenecks) will also depend on your workload. Due to the inode-to-disk hashing, a massively parallel write pattern (many concurrent writes at any or most times) would distribute your load fairly well to all NSDs and storage units already now, and I'd not expect any improvement then. However, if you have single (non-concurrent) writes of large files which need to be fast, the reordering of NSDs could help.

    ufa

  • db808
    87 Posts

    Re: change order of round-robin nsd striped writes

    ‏2014-05-21T20:06:25Z  

    Ufa brings up a good point.

    If your workload (access patterns, file sizes, number of concurrent jobs, rate at which data can be "consumed" or "generated"), combined with the intelligent, stride-aware read-ahead and write-behind algorithms that amplify application-level parallelism, can sustain a large enough number of parallel IOs to keep all the disks busy ... then NSD ordering is noise.  Typically the number of concurrent IOs needed to achieve this goal is 1.3 times the number of IO devices on the low end, perhaps up to 3x on the high end.  I use the term "IO devices" to capture the idea that the virtual LUN itself may be made up of multiple sub-components.  A 10-disk RAID6 LUN may only need 1.3 IOs to stay busy in full-stripe mode, but could easily need 10+ IOs if running in individual-disk mode.  The new GPFS 3.5 IO scheduler with small and large queues allows you to handle this.

    Whatever the number of concurrent IOs ... the big question is whether this activity level can be sustained.  If yes ... NSD ordering will have little impact.  This is often true in HPC use cases.

    WARNING, WARNING "Your Performance will vary" ...

    With that big caveat, I wanted to illustrate how NSD ordering can make a difference.  It may not be significant for many configurations, but for our typical GPFS workloads, and the non-traditional manner we configure the GPFS cluster, it made a measurable difference.

    In our case, we are using GPFS as a file server for media files, running simple batch jobs that perform input-process-output workflows.  Individual workflows are not well threaded, and are often single threaded.  These serial-oriented workflows greatly benefit from GPFS read-ahead and write behind.

    But what happens if your file is only a few GPFS block sizes in length?  The GPFS read-ahead would be limited, because the file is medium-sized, and the impact of a momentary micro-clump or micro-starvation would be more pronounced.

    I will offer the GPFS forum readers the actual measured performance profiles running sequential workloads ... on one of our GPFS cluster configurations.  The GPFS file system is striped across 6 x DCS3700 (base controllers), each with 120 x 2TB disks.  The disks are configured as (8+2) RAID6 with a 512 KB stripe segment size, resulting in a 4 MB hardware stripe.  The GPFS block size is also 4 MB.

    The GPFS NSD servers are SAN connected to ALL LUNs, using 4 FC host controllers per server.  This results in 4 active and 4 passive paths per LUN using Linux dm-multipath.  We also performed techniques to optimize large IO efficiency for quad controllers on the host, beyond the scope of this email. Many customers may not be able to scale 98+% across 4 FC controllers on a node.

    What is significant in this example is the relationship between the IO performance of a single LUN and the throughput of the storage array processor.  We rank storage processors as "weak", where they can handle only a few LUNs' worth of IO, or "strong", where they can handle a larger number.  In this example, the base DCS3700 storage processor is good for 800-900 MB/sec each, and the LUN performance @ 4MB random IO is ~ 260 MB/sec, so the SP can handle about 2.5 IOs before "clumping".  Since the SP is responsible for managing 6 RAID groups in this configuration, there is potential for clumping.  If I had one SP per 3 RAID groups @ 4MB, the issue would be moot.  If you were using an 8 MB GPFS block size, the per-LUN performance for random 8MB IO would increase to ~ 610 MB/sec, and clumping would be more prevalent.

    The "LUN Order Optimization Chart.pdf" is attached.

    "Your Performance will vary" ...

        ...but what is important is the shape of the curves.  In this example, there is no difference after 18 parallel IOs kick in ... based on this configuration ... and the performance ratios of the sub-components.  Also note that in our case there is no difference in the single-node case ... because there are 4 FC controllers, optimized to scale at 98+%, which is not typical.  For more typical configurations the single-node baseline with unordered NSDs would be lower.

    In this example, if you have 18 IO threads, it makes no difference, and GPFS can easily generate that much read-ahead/write-behind.  So who cares?  In this case, 18 threads require a file size of 18 x 4 MB = 72 MB or larger ... if you are running a sequential job.

    What if the file is smaller than 72 MB in this example?  What about a file that is 6 blocks long?  Will the default ordering avoid sending logically adjacent IOs to the same storage processor?  Somewhat ... but the probability of micro-clumping and micro-starvation is greater.

    There are many ways to generate the list of LUNs (and NSDs) to be presented to GPFS when making the file system.  On a given NSD server, the resulting lists are often related to the order in which the LUNs were discovered by Linux at startup.  This Linux discovery process results in a "depth-first-mostly" ordering.  The discovery process is multi-threaded, with the individual threads scanning depth-first.

    Depth-first results in all the LUNs on the first storage processor being discovered together (causing a clump), and then all the LUNs on the next storage array's storage processor, and so forth.

    The second attachment "Example LUN Ordering Diagram.pdf" shows what we actually discovered on our 6 array, 72 LUN configuration. The default ordering we might have used was the listing generated from "ls /dev/dm-*".  It is not as bad as the theoretical worst case, but it is not that good.

    In the attached ordering diagram on page 2, the left-most column is the worst case.  The middle column is the actual ordering based on "ls /dev/dm-*", and the rightmost column is the crafted ordering that we used.  We have an automated script that can generate the topology-aware crafted ordering.
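
    As a generic illustration of such a topology-aware interleave (not the actual script; hypothetical per-storage-processor list files sp1.list ... sp6.list with one LUN or NSD name per line):

      # paste with a newline delimiter emits line 1 of every file, then line 2, and so on,
      # so consecutive entries never come from the same storage processor
      paste -d '\n' sp1.list sp2.list sp3.list sp4.list sp5.list sp6.list > luns.interleaved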

    "Your Performance will vary" ...

     

     

    Updated on 2014-05-21T20:08:30Z by db808
  • ufa
    145 Posts

    Re: change order of round-robin nsd striped writes

    ‏2014-05-22T08:13:01Z  

    One remark though: the poor "clumping" NSD order in the example you attached could be mitigated by assigning the NSDs of each SP to a different FG, as suggested by dlmcnabb earlier. However, as that is not the original purpose of FGs, this might not be possible for other reasons.

    ufa

     

  • db808
    87 Posts

    Re: change order of round-robin nsd striped writes

    ‏2014-05-22T14:37:18Z  

    Hi Uwe,

    Thank you for the additional comment. 

    We explored using failure groups for disk-order management, but we also need to use failure groups for high-availability purposes.  We mirror all metadata, and in some high-value GPFS clusters we also mirror the data, either locally or to a second data center about 1/2 mile away across company-owned dark fiber.  Using FGs for ordering seemed too complex to manage, and we would have needed dozens of failure groups on our largest configurations.  Of course ... not that our NSD-ordering technique is easy ... but we had already built the automation supporting the NSD ordering for another purpose.

    We would value the capability to use the topology vector (xx:yy:zz) form of the failure domain with non-FPO configurations.  It would help on some of the "ordering" issues, but also allow us to better balance replica IO.

    Does anyone know if it is possible?  We already have the tools/scripts/techniques and automation to generate the topology vector, and we have a 100% mirrored configuration being installed in the next month that we could test with.  We would also be very interested in using the storage pool "group factor" in a non-FPO environment (another discussion).

    We have been using our own "topology vector" concept for over four years.  We built the tools to create a cluster-wide persistent LUN naming system that is topology-aware, to help manage, configure, and troubleshoot individual paths of a multipath device.  In doing so, we discovered the micro-clumping and micro-starvation issues between active paths to a single LUN.  The topology-aware crafted ordering of paths within a multipath group not only micro-balanced the ports on the NSD servers, but also improved the IO balance on the storage arrays' host ports.  We found the same concept also applied at the LUN level, when creating a list of LUNs to be used for "striping", be it in GPFS or another stripe-capable file system or LVM.  The LUN-ordering impact was not as significant as the path-order-within-multipath impact, but it was measurable and effectively "free" because we already had the automation and tools.  For us, generating an optimized NSD ordering is a simple right-to-left sort of the internal topology vector generated by our persistent LUN naming tool.
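
    As a generic illustration of such a right-to-left sort (not the actual tool; assuming hypothetical lines of the form array:sp:lun:device, e.g. "2:1:7:dm-42", in luns.vectors):

      # primary key is the rightmost numeric field, so consecutive entries
      # come from different arrays and cycle through the storage processors
      sort -t : -k 3,3n -k 2,2n -k 1,1n luns.vectors > nsd.order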

    BTW: We have offered to share our persistent LUN naming tool, the topology-aware multipath ordering, and the NSD ordering with IBM/Storage and IBM/GPFS, with little interest.  We installed the scripts and demonstrated their use during a visit to the IBM Gaithersburg labs several years ago.  Using the tools, we created a multi-PB GPFS file system, with both optimized multipath and NSD ordering, across multiple storage arrays in under two hours, not including disk format times.

     

  • yuri
    277 Posts

    Re: change order of round-robin nsd striped writes

    ‏2014-05-22T19:00:14Z  

    We would value the capability to use the topology vector (xx:yy:zz) form of the failure domain with non-FPO configurations. 

    No, this form of FG specification isn't really applicable outside of FPO.  The entire concept is based around placing one replica on disks local to the node, one replica on a disk inside the same rack, and one replica elsewhere.  This requires using FPO-style allocation maps.

    We would also be very interested in using the storage pool "group factor" in a non-FPO environment (another discussion).

    Again, this is not easy.  In principle, the BGF (block group factor) concept could be viewed as applicable outside of FPO.  In reality, it's tied to the use of FPO-style allocation maps, and to a fundamental assumption that each disk is seen by exactly one node.  The current code won't work with twin-tailed disks or SANs.

    yuri