Topic
  • 2 replies
  • Latest Post - 2018-06-21T10:34:49Z by Igor Minkovskiy
J. Estrada
1 Post

Pinned topic DRAID Layout & Resiliency

2016-01-30T22:54:21Z | draid storwize

Can anyone comment on the logical layout of the data stripes within a DRAID array? I have seen the illustration documented online (http://www-01.ibm.com/support/knowledgecenter/ST3FR7_7.6.0/com.ibm.storwize.v7000.760.doc/svc_distributedRAID.html?lang=en). It's clear how the rebuild areas are distributed; however, it's not exactly clear how the data stripes are laid out across the remainder of the drive packs.

 

Given the following command:

 

mkdistributedarray -level raid6 -driveclass 1 -drivecount 12 -stripewidth 5 -rebuildareas 2 Mdiskgrp1

 

This would result in two logical array sets, A and B, with each drive pack consisting of 5x10 array strips plus 5x2 rebuild strips (60 strips per pack).
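
To make the arithmetic concrete, here is a rough sketch in Python (variable names are mine, purely illustrative) of the per-pack strip counts those parameters imply:

drive_count   = 12   # -drivecount
stripe_width  = 5    # -stripewidth (for RAID-6: 3 data strips + P + Q)
rebuild_areas = 2    # -rebuildareas

# Per drive pack: one stripe width's worth of strips on every drive.
strips_per_pack = stripe_width * drive_count                      # 5 x 12 = 60
rebuild_strips  = stripe_width * rebuild_areas                    # 5 x 2  = 10
array_strips    = stripe_width * (drive_count - rebuild_areas)    # 5 x 10 = 50

print(strips_per_pack, rebuild_strips, array_strips)              # 60 10 50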

 

For simplicity, this example assumes the stripes are aligned and there is no internal wrapping of the stripes within the drive packs. Each column is a drive in the array, each row is a stripe, and the hashes represent the rebuild strips. Internally, the first drive pack would look something like this:

 

1 2 3 4 5 6 7 8 9 10 11 12
A1 A2 A3 Ap Aq B1 B2 B3 Bp Bq # #
A1 A2 Ap Aq A3 B1 B2 Bp Bq B3 # #
A1 Ap Aq A2 A3 B1 Bp Bq B2 B3 # #
Ap Aq A1 A2 A3 Bp Bq B1 B2 B3 # #
Aq A1 A2 A3 Ap Bq B1 B2 B3 Bp # #
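
To generate that simplified pack programmatically, here is a small Python sketch (again, purely an illustration of my simplifying assumptions, not the actual product layout):

def pack_rows(stripe_width=5, data_strips=3):
    # One array set's strips for a single pack, using the simplified
    # left-rotating parity placement assumed above (illustrative only).
    rows = []
    for r in range(stripe_width):
        row = [None] * stripe_width
        row[(data_strips - r) % stripe_width] = "p"       # P parity strip
        row[(data_strips + 1 - r) % stripe_width] = "q"   # Q parity strip
        data = iter(str(n) for n in range(1, data_strips + 1))
        for c in range(stripe_width):
            if row[c] is None:
                row[c] = next(data)                       # data strips 1..3
        rows.append(row)
    return rows

# Two array sets (A and B) side by side, plus the two rebuild strips per row.
for a, b in zip(pack_rows(), pack_rows()):
    print(" ".join("A" + s for s in a), " ".join("B" + s for s in b), "# #")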

 

Now, with each row representing a drive pack, which of the following would be the resulting layout of the DRAID?

 

A) Array strips are "block-striped" across a subset of drives:

 

1 2 3 4 5 6 7 8 9 10 11 12
A A A A A B B B B B # #
A A A A A B B B B # # B
A A A A A B B B # # B B
A A A A A B B # # B B B
A A A A A B # # B B B B
A A A A A # # B B B B B
A A A A # # A B B B B B
A A A # # A A B B B B B
A A # # A A A B B B B B
A # # A A A A B B B B B
# # A A A A A B B B B B
# A A A A A B B B B B #

 

or B) Array strips are "circular-striped" across all available drives:

 

1 2 3 4 5 6 7 8 9 10 11 12
A A A A A B B B B B # #
A A A A B B B B B # # A
A A A B B B B B # # A A
A A B B B B B # # A A A
A B B B B B # # A A A A
B B B B B # # A A A A A
B B B B # # A A A A A B
B B B # # A A A A A B B
B B # # A A A A A B B B
B # # A A A A A B B B B
# # A A A A A B B B B B
# A A A A A B B B B B #

 

Or is it something different altogether?

 

The former would be more akin to having two TRAID-6 mdisks wide-striped within a storage pool; the array could, at most, survive four simultaneous drive failures before data is rebuilt. The latter would have the stripes distributed in the same manner as the rebuild areas, making the entire DRAID act as a single large-scale RAID-6, since each disk would contain stripes from all array sets; the array could, at most, survive two simultaneous drive failures before data is rebuilt. Both contain the same amount of parity but have different concurrent drive-failure resiliency. The difference could be an important factor in deciding whether to implement TRAID or DRAID, beyond rebuild times alone.
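
For anyone who wants to experiment, here is a rough Python sketch (illustrative only; neither option is claimed to be the real implementation) that prints the two hypothetical layouts above:

def hypothetical_layout(drives=12, stripe_width=5, rebuild_areas=2,
                        circular=False):
    # One row per drive pack. circular=False gives option A
    # ("block-striped"), circular=True gives option B ("circular-striped").
    sets = (drives - rebuild_areas) // stripe_width       # 2 array sets: A, B
    labels = [chr(ord("A") + s) for s in range(sets)]
    base = [l for l in labels for _ in range(stripe_width)] + ["#"] * rebuild_areas
    rows = []
    for r in range(drives):
        if circular:
            # Option B: the whole base row rotates left by one per pack.
            rows.append(base[r:] + base[:r])
        else:
            # Option A: only the rebuild areas rotate; the array sets fill
            # the remaining columns in ascending column order.
            spare = {(drives - rebuild_areas + k - r) % drives
                     for k in range(rebuild_areas)}
            row, filled = [], 0
            for c in range(drives):
                if c in spare:
                    row.append("#")
                else:
                    row.append(labels[filled // stripe_width])
                    filled += 1
            rows.append(row)
    return rows

for row in hypothetical_layout(circular=True):    # circular=False for option A
    print(" ".join(row))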

 

  • IanBoden
    4 Posts
    ACCEPTED ANSWER

    Re: DRAID Layout & Resiliency

    2016-02-01T16:54:29Z

    Hi,

     

    The answer is that it's something different altogether. The layouts you describe I've seen called stripe grouping, but I'm not sure there is a formal term for it. The problem is that it doesn't give you the rebuild times required. If we take a very simple example of circular striping across all available drives, with a stripe width of 3 and 20 drives, then assume drive 9 fails and look at which drives have to be read from to perform the rebuild, we have:

    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
    A A A B B B C C C D D D E E E F F F # #
    A A B B B C C C D D D E E E F F F # # A
    A B B B C C C D D D E E E F F F # # A A
    B B B C C C D D D E E E F F F # # A A A
    B B C C C D D D E E E F F F # # A A A B
    B C C C D D D E E E F F F # # A A A B B
    C C C D D D E E E F F F # # A A A B B B

     

    You can see that both neighboring drives are read from the most, and the next ones out are read slightly less. So you get the benefit of distributed sparing, meaning you are writing to more drives, but the reads are very badly distributed. It also limits your flexibility, as the number of drives used in the set minus the rebuild areas must be a whole number of stripes.
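
    To make that skew concrete, here is a rough Python sketch (using the same toy circular-striping layout as the table above, not the real DRAID layout) that fails drive 9 and counts how many strips each surviving drive has to be read from over one full rotation of the pattern:

    from collections import Counter

    drives, stripe_width, rebuild_areas, failed = 20, 3, 2, 9

    # Base row: six 3-strip stripes A..F followed by two rebuild strips.
    stripes = (drives - rebuild_areas) // stripe_width
    base = [chr(ord("A") + s) for s in range(stripes)
            for _ in range(stripe_width)] + ["#"] * rebuild_areas

    reads = Counter()
    for r in range(drives):                  # one full rotation of the pattern
        row = base[r:] + base[:r]            # rotate left by one per row
        if row[failed] == "#":
            continue                         # failed drive held a rebuild strip here
        for d, strip in enumerate(row):
            if d != failed and strip == row[failed]:
                reads[d] += 1                # surviving strip of the same stripe

    for d in sorted(reads):
        print(f"drive {d:2}: read {reads[d]} strips")
    # Only drives 7, 8, 10 and 11 are ever read, with 8 and 10 hit hardest.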

     

    So the layouts we use are a little more complicated: they try to ensure as even a distribution as possible for both reads and writes in the case of a failure. The exact pattern changes completely based on the stripe width and the number of drives. What we do is slice the array vertically into packs; each pack consists of a stripe width's worth of strips on each drive. Within a pack the rebuild areas don't rotate, so assuming you have 2 rebuild areas, each pack has 2 drives that are used solely for spare data. The strips are then spread with the following conditions needing to hold:

    • Each strip of a stride must be on a different drive.
    • Moving down a drive, each strip must belong to a later stride (i.e. if a drive holds strips for strides 2, 5 and 9, they must appear in that order as you read down the drive).
    • Each drive has each type of strip, so if your stride has 3 data strips, a P parity and a Q parity, then each drive will contain D0, D1, D2, P and Q strips, though they do not need to appear in that order (this excludes the rebuild areas, which don't have any defined strips on them when not in use).

    We then try to ensure that when a drive fails the reads are as evenly distributed as possible without violating any of the above rules. In some cases an array will use a small number of layouts and rotate between them; in other cases it will use a large number of layouts; and there are even some cases where we can use a single layout that gives us perfect distribution (the trivial example of that would be a stripe width of 5 over 6 drives with 1 rebuild area).
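
    As a toy illustration, here is a rough Python sketch that checks a candidate pack against those three conditions; the pack it builds is just a hand-rolled rotation for the trivial stripe-width-5, 6-drive, 1-rebuild-area case, not our real layout:

    STRIP_TYPES = ["D0", "D1", "D2", "P", "Q"]     # RAID-6, stripe width 5

    def toy_pack(stripe_width=5, drives=6, rebuild_areas=1):
        # grid[row][drive] = (stride_id, strip_type), or None for a spare strip.
        # Hand-built rotation; NOT the real layout, it merely satisfies the rules.
        spare = set(range(drives - rebuild_areas, drives))
        return [[None if d in spare else (r, STRIP_TYPES[(d - r) % stripe_width])
                 for d in range(drives)]
                for r in range(stripe_width)]

    def check_rules(grid):
        drives = len(grid[0])
        strides = {}
        for row in grid:
            for d, cell in enumerate(row):
                if cell:
                    strides.setdefault(cell[0], []).append(d)
        # Rule 1: each strip of a stride sits on a different drive.
        rule1 = all(len(set(ds)) == len(ds) for ds in strides.values())
        # Rule 2: reading down a drive, stride ids only ever increase.
        rule2 = all(ids == sorted(set(ids))
                    for ids in ([row[d][0] for row in grid if row[d]]
                                for d in range(drives)))
        # Rule 3: every drive that holds strips holds every strip type.
        rule3 = all({row[d][1] for row in grid if row[d]} == set(STRIP_TYPES)
                    for d in range(drives)
                    if any(row[d] for row in grid))
        return rule1, rule2, rule3

    print(check_rules(toy_pack()))             # expected: (True, True, True)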

    All of this can be worked out by doing LBA lookups on the CLI, so it is safe for me to tell you; going into any more detail might involve upsetting lawyers, so I had better not. If anyone is ever in Hursley for an event under NDA and wants me to bore them with all the intricate details, I'm sure I could talk about it for a few hours.

     

    Our implementation of distributed RAID 6 does not allow for any extra simultaneous failures; it is designed to rebuild as quickly as possible to allow for further failures.

  • Igor Minkovskiy
    1 Post

    Re: DRAID Layout & Resiliency

    2018-06-21T10:34:49Z
    • IanBoden
    • 2016-02-01T16:54:29Z

    Hello!

    Can you clarify how stripe length in DRAID affects performance (if any)?