Topic
2 replies. Latest post: 2013-07-09T16:59:05Z by db808
botemout

Pinned topic: Metadata performance with RAID 10 NSDs

2013-05-01T19:24:31Z

Greetings all,

I'm planning to move our metadata disks from a collection of JBODs to RAID 10 sets. Presently the filesystem looks like this:

12 metadata drives (each a 600 GB 15K SAS disk), serving metadata for 12 dataOnly LUNs (about 250 TB total). The 12 metadata drives live on 4 physical RAID chassis.

I've always been very underwhelmed by our metadata performance, and I'm wondering whether, instead of 12 JBODs, I should make 4 RAID 10 arrays. I'd lose some space, but I'm only using about 13% of the total metadata disk, so I have plenty to spare.

Of course, I'd also like the fact that the filesystem could no longer be damaged by the simultaneous failure of just 2 drives, one from each of our two failure groups.

As for performance, can anyone say what I might see when moving from 12 JBOD NSDs to 4 RAID 10 NSDs?

Thanks much,

JR

  • dichung

    Re: Metadata performance with RAID 10 NSDs

    2013-05-02T14:45:38Z, in response to botemout

    Just to be sure we're on the same topic: RAID 10 is a stripe across mirrored pairs, so every write has to be committed to two drives. Compared with writing an unmirrored I/O to a dedicated JBOD, the mirrored write is clearly more "expensive", and depending on the other choices in the disk subsystem there may be additional overhead before the write is acknowledged. Beware of choosing options in the disk controller that allow an I/O to complete before both mirror copies are written; while that can improve performance, if a controller fails before the second copy is updated it can leave your file system inconsistent. Having said that, I do think this is worth pursuing. As you mention in the question, data availability (or, more to the point, the loss thereof) will likely be a more critical factor for this file system than absolute performance, so I'd encourage it.

    Another observation concerns your "underwhelmed" view of metadata performance. You'll be moving from 12 metadata NSDs to 4; that won't allow the same degree of parallelism, but it is probably still worth pursuing for the sake of availability.

    You might want to compare the random-I/O performance of the two device types and get some real-world numbers for your setup before committing yourself.
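
    As a rough, hypothetical sketch of such a comparison (the device paths, I/O size and run time below are placeholders, and it measures only queue-depth-1 random reads), something like this Python could be run against a spare JBOD disk and a test RAID 10 LUN:

    # Compare single-threaded random-read rates of two block devices.
    # Requires Linux, Python 3.7+, and root access to the raw devices.
    # O_DIRECT bypasses the page cache so the numbers reflect the array itself.
    import mmap, os, random, time

    IO_SIZE = 4096                       # small I/O, as a stand-in for metadata reads
    RUNTIME = 30                         # seconds per device
    DEVICES = ["/dev/sdX", "/dev/sdY"]   # hypothetical JBOD disk vs. RAID 10 LUN

    def random_read_rate(path):
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
        try:
            dev_size = os.lseek(fd, 0, os.SEEK_END)
            blocks = dev_size // IO_SIZE
            buf = mmap.mmap(-1, IO_SIZE)        # page-aligned buffer, required for O_DIRECT
            done = 0
            deadline = time.time() + RUNTIME
            while time.time() < deadline:
                os.preadv(fd, [buf], random.randrange(blocks) * IO_SIZE)
                done += 1
            return done / RUNTIME
        finally:
            os.close(fd)

    for dev in DEVICES:
        print(f"{dev}: {random_read_rate(dev):,.0f} random {IO_SIZE}-byte reads/sec")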

  • db808

    Re: Metadata performance with RAID 10 NSDs

    2013-07-09T16:59:05Z, in response to botemout

    Hi John,

    If you are interested in high metadata performance under GPFS, you might want to consider simple RAID 1 hardware mirrors versus the larger RAID 10 topology. Under GPFS there is an implicit assumption that the performance characteristics of ALL the LUNs are roughly the same: it expects to use the same number of IO threads for every LUN, and it does not distinguish much between the LUNs used for data and those used for metadata. Homogeneity is the overall scheme.

    As GPFS evolved over the years, additional features were added that allow different and/or specialized handling of IO for data and metadata, but in general all data LUNs are assumed equal, and all metadata LUNs are assumed equal. With GPFS 3.5.x, for example, you can have independent block sizes for data and metadata.

    Also underlying GPFS is the concept of a maximum number of IOs per LUN, and of how many of those IOs can be used for "background" operations. There is a new GPFS parameter (ignorePrefetchLUNCount) to disable this per-LUN IO limitation.

    So, in general, it can be difficult to fully exploit a multi-disk LUN (or SSD) without the risk of over-driving a LUN that has fewer resources. For example, a 10-disk RAID6 LUN could service one full-stripe read or as many as 10 individual small reads. The GPFS default of 4 IOs per LUN under-utilizes it in small-IO mode, but could stress it in large-IO mode. This effect is amplified when you consider that the LUN could be receiving IO from multiple GPFS NSD servers concurrently.
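
    To put rough numbers on that (these are just the illustrative figures from the paragraph above, not values to rely on), the mismatch looks like this:

    # Back-of-envelope sketch of the per-LUN queue-depth mismatch described above.
    disks_per_lun = 10       # a 10-disk RAID6 group, as in the example
    ios_per_lun   = 4        # assumed per-LUN IO limit discussed above

    # Small-IO mode: each disk could service its own small read concurrently.
    print(f"small-IO mode: {ios_per_lun}/{disks_per_lun} disks kept busy "
          f"({ios_per_lun / disks_per_lun:.0%} utilization)")

    # Large-IO mode: every full-stripe read touches all 10 disks at once.
    print(f"large-IO mode: {ios_per_lun} full-stripe reads queue "
          f"{ios_per_lun * disks_per_lun} disk-level IOs on the same LUN")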

    Many storage systems allow very wide RAID10 groups, with 10-30 disks representing a stripe of 5 to 15 mirrored pairs. While a 10-disk R10 group is easier to manage than 5 mirrored R1 pairs, you have to be certain you can enqueue enough IO to the single, larger R10 LUN to keep the disks busy. With GPFS's excellent striping capability, you can gain some flexibility by using multiple R1 mirrors and letting GPFS stripe across them.

    The other thing to remember is that even though the new GPFS parameter ignorePrefetchLUNCount may stop GPFS from throttling prefetches, there may be other such restrictions lower in the Linux IO stack and/or the storage array that prevent the level of concurrent IO you need to fully use a multi-disk RAID group.

    Let me use a simple example. You have a 4-disk RAID10 SSD array, implemented as a 2-way stripe of mirrors, on a storage system with dual active/passive controllers, each with an 8 Gbit FC interface.

    With a single LUN, the LUN will be "owned" by a single storage controller, and the maximum bandwidth will be limited to that single 8 Gbit FC connection. If the SSDs are 6 Gbit SAS SSDs with read performance of ~550 MB/sec, a single 2-way full-stripe read could run at 1100 MB/sec from the SSDs, but only at about 800 MB/sec to the host.

    If you broke the 4-disk R10 LUN into two separate 2-disk R1 mirrors, what would happen? You could assign "ownership" of each LUN to a different storage controller and use a different FC bus to the host server. Then you could issue 2 host IOs, each running at 550 MB/sec, for an aggregate of 1100 MB/sec. The resulting IO activity is better balanced across the storage processors and FC channels.

    This is a contrived example, but it illustrates the point.
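
    The arithmetic behind those numbers is simply the smaller of SSD bandwidth and FC link bandwidth per LUN (the ~800 MB/sec and ~550 MB/sec figures are the assumptions from the example above):

    # Worked numbers for the contrived SSD example above.
    fc_link_mb_s  = 800    # assumed usable bandwidth of one 8 Gbit FC link
    ssd_read_mb_s = 550    # assumed read bandwidth of one 6 Gbit SAS SSD

    # Case 1: one 4-disk RAID10 LUN, owned by one controller / one FC link.
    one_lun = min(2 * ssd_read_mb_s, fc_link_mb_s)

    # Case 2: two 2-disk RAID1 LUNs, each owned by a different controller/link.
    two_luns = 2 * min(ssd_read_mb_s, fc_link_mb_s)

    print(f"single RAID10 LUN: ~{one_lun} MB/sec to the host")     # ~800 MB/sec
    print(f"two RAID1 LUNs   : ~{two_luns} MB/sec aggregate")      # ~1100 MB/sec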

    I also like to have many individual GPFS metadata LUNs rather than a smaller number of large metadata LUNs ... to a degree. GPFS does a very good job of distributing IO across multiple metadata LUNs, but hot spots can and do exist, often on the metadata journal file (there is a new GPFS 3.5 feature to stripe the metadata journal file). If you find a persistent metadata hot spot on a specific metadata LUN, you have the flexibility to move it on-line to faster storage (such as an SSD) using the GPFS mmrpldisk command.

    I have a multi-PB GPFS system with 120 x 60 GB metadata LUNs on mechanical disks. That is over 7 TB of total metadata space, but just 8 metadata hot spots account for over 50% of the total metadata activity. These metadata LUNs contain the replicated metadata journal files for 4 GPFS servers. The total size of these hot spots is only 8 x 60 GB = 480 GB, and they could easily be moved to SSD storage at a much lower cost than the full 7.2 TB of metadata space.
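
    The proportions in that example are worth spelling out (the counts and sizes are just the ones quoted above):

    # Proportions from the multi-PB example above.
    lun_count, lun_gb = 120, 60
    hot_luns = 8

    total_gb = lun_count * lun_gb            # 7200 GB, i.e. ~7.2 TB of metadata space
    hot_gb   = hot_luns * lun_gb             # 480 GB of hot-spot LUNs

    print(f"hot spots: {hot_gb} GB = {hot_gb / total_gb:.1%} of metadata space, "
          f"yet over 50% of the metadata activity")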

    I would suggest using simple 2-disk hardware R1 mirrors and having multiple metadata LUNs. I would enable durable write caching on the storage array, and DISABLE read-ahead on the storage array and in the Linux block IO layer for these LUNs. GPFS implements a much more intelligent read-ahead.
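
    For the Linux side of that, a minimal sketch (the device names are placeholders for your metadata LUNs; run as root, and persist the setting with a udev rule or boot script) might look like:

    # Turn off Linux block-layer read-ahead for the metadata LUNs,
    # since GPFS does its own prefetching.
    from pathlib import Path

    METADATA_DEVICES = ["sdX", "sdY"]        # hypothetical metadata LUN device names

    for dev in METADATA_DEVICES:
        attr = Path(f"/sys/block/{dev}/queue/read_ahead_kb")
        attr.write_text("0\n")               # 0 KiB of kernel read-ahead
        print(dev, "read_ahead_kb =", attr.read_text().strip())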

    If you are on GPFS 3.5+, I would also investigate the other metadata advancements, such as independent metadata block size, metadata logfile striping, and larger inodes (especially if you have a lot of small directories).  These features can either reduce the amount of metadata activity or better balance the IO activity.

    Also ... beware of potential storage-array scaling issues if you use abnormally large RAID groups. Many disk arrays can easily handle distributing host IO to a 2- to 10-disk RAID group without hitting internal IO queuing limits; RAID groups of that size are "normal", and the code paths are mature and often well optimized. The same arrays could have difficulty handling a 30- to 40-disk RAID group because of internal queuing limitations. In those cases, using multiple smaller RAID groups yields better performance, at the cost of some additional complexity.

    Hope this helps.

    Dave B