Replacing failed disks in a Power 775 Disk Enclosure recovery group: a sample scenario

The scenario presented here shows how to detect and replace failed disks in a recovery group built on a Power® 775 Disk Enclosure.

Detecting failed disks in your enclosure

Assume a fully populated Power 775 Disk Enclosure (serial number 000DE37) on which the following two recovery groups are defined:
  • 000DE37TOP containing the disks in the top set of carriers
  • 000DE37BOT containing the disks in the bottom set of carriers
Each recovery group contains the following:
  • one log declustered array (LOG)
  • four data declustered arrays (DA1, DA2, DA3, DA4)
The data declustered arrays are defined according to Power 775 Disk Enclosure best practice as follows:
  • 47 pdisks per data declustered array
  • each member pdisk of a given array drawn from the same slot position across the carriers
  • default disk replacement threshold value set to 2

The replacement threshold of 2 means that GNR requires disk replacement only when two or more disks have failed in the declustered array; otherwise, data on affected disks is supplied by rebuilding onto spare space or by reconstruction from redundancy.

This configuration can be seen in the output of mmlsrecoverygroup for the recovery groups, shown here for 000DE37TOP:
# mmlsrecoverygroup 000DE37TOP -L

                    declustered
 recovery group       arrays     vdisks  pdisks
 -----------------  -----------  ------  ------
 000DE37TOP                   5       9     192

 declustered   needs                            replace                scrub       background activity
    array     service  vdisks  pdisks  spares  threshold  free space  duration  task   progress  priority
 -----------  -------  ------  ------  ------  ---------  ----------  --------  -------------------------
 DA1          no            2      47       2          2    3072 MiB   14 days  scrub       63%  low
 DA2          no            2      47       2          2    3072 MiB   14 days  scrub       19%  low
 DA3          yes           2      47       2          2       0   B   14 days  rebuild-2r  48%  low
 DA4          no            2      47       2          2    3072 MiB   14 days  scrub       33%  low
 LOG          no            1       4       1          1     546 GiB   14 days  scrub       87%  low

                                         declustered
 vdisk               RAID code              array     vdisk size  remarks
 ------------------  ------------------  -----------  ----------  -------
 000DE37TOPLOG       3WayReplication     LOG            4144 MiB  log
 000DE37TOPDA1META   4WayReplication     DA1             250 GiB
 000DE37TOPDA1DATA   8+3p                DA1              17 TiB
 000DE37TOPDA2META   4WayReplication     DA2             250 GiB
 000DE37TOPDA2DATA   8+3p                DA2              17 TiB
 000DE37TOPDA3META   4WayReplication     DA3             250 GiB
 000DE37TOPDA3DATA   8+3p                DA3              17 TiB
 000DE37TOPDA4META   4WayReplication     DA4             250 GiB
 000DE37TOPDA4DATA   8+3p                DA4              17 TiB

 active recovery group server                     servers
 -----------------------------------------------  -------
 server1                                          server1,server2

The indication that disk replacement is called for in this recovery group is the value of yes in the needs service column for declustered array DA3.

The fact that DA3 (the declustered array on the disks in carrier slot 3) is undergoing rebuild of its RAID tracks that can tolerate two strip failures is by itself not an indication that disk replacement is required; it merely indicates that data from a failed disk is being rebuilt onto spare space. Only if the replacement threshold has been met will disks be marked for replacement and the declustered array marked as needing service.

GNR provides several indications that disk replacement is required:
  • entries in the AIX® error report or the Linux® syslog
  • the pdReplacePdisk callback, which can be configured to run an administrator-supplied script at the moment a pdisk is marked for replacement (a registration sketch follows this list)
  • the POWER7 cluster event notification TEAL agent, which can be configured to send disk replacement notices when they occur to the POWER7 cluster EMS
  • the output from the following commands, which may be issued from the command line on any GPFS cluster node (see the examples that follow):
    1. mmlsrecoverygroup with the -L flag shows yes in the needs service column
    2. mmlsrecoverygroup with the -L and --pdisk flags; this shows the states of all pdisks, which may be examined for the replace pdisk state
    3. mmlspdisk with the --replace flag, which lists only those pdisks that are marked for replacement
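
For the callback method, registration is done with the mmaddcallback command. What follows is a minimal sketch, not a definitive recipe: the script path and the --parms variables are illustrative assumptions, so verify against the mmaddcallback documentation which parameters the pdReplacePdisk event supplies at your GPFS level.

# mmaddcallback pdReplaceNotify --command /usr/local/bin/notify-replace.sh \
      --event pdReplacePdisk --parms "%myNode %rgName %daName %pdName"

where /usr/local/bin/notify-replace.sh could be a hypothetical script as simple as:

#!/bin/sh
# Hypothetical notification script; arguments arrive in --parms order.
node=$1; rg=$2; da=$3; pdisk=$4
echo "pdisk $pdisk in declustered array $da of recovery group $rg (reported by $node) is marked for replacement" |
    mail -s "GNR disk replacement required" admin@example.com
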
Note: Because the output of mmlsrecoverygroup -L --pdisk for a fully populated disk enclosure is very long, this example shows only some of the pdisks (but includes those marked for replacement).

# mmlsrecoverygroup 000DE37TOP -L --pdisk

                    declustered
 recovery group       arrays     vdisks  pdisks
 -----------------  -----------  ------  ------
 000DE37TOP                   5       9     192

 declustered   needs                            replace                scrub       background activity
    array     service  vdisks  pdisks  spares  threshold  free space  duration  task   progress  priority
 -----------  -------  ------  ------  ------  ---------  ----------  --------  -------------------------
 DA1          no            2      47       2          2    3072 MiB   14 days  scrub       63%  low
 DA2          no            2      47       2          2    3072 MiB   14 days  scrub       19%  low
 DA3          yes           2      47       2          2       0   B   14 days  rebuild-2r  68%  low
 DA4          no            2      47       2          2    3072 MiB   14 days  scrub       34%  low
 LOG          no            1       4       1          1     546 GiB   14 days  scrub       87%  low

                    n. active,   declustered                 user     state,
pdisk               total paths     array     free space   condition  remarks
-----------------   -----------  -----------  ----------  ----------- -------
[...]
c014d1                2,  4      DA1              62 GiB  normal      ok
c014d2                2,  4      DA2             279 GiB  normal      ok
c014d3                0,  0      DA3             279 GiB  replaceable dead/systemDrain/noRGD/noVCD/replace
c014d4                2,  4      DA4              12 GiB  normal      ok
[...]
c018d1                2,  4      DA1              24 GiB  normal      ok
c018d2                2,  4      DA2              24 GiB  normal      ok
c018d3                2,  4      DA3             558 GiB  replaceable dead/systemDrain/noRGD/noVCD/noData/replace
c018d4                2,  4      DA4              12 GiB  normal      ok
[...]
The preceding output shows that the following pdisks are marked for replacement:
  • c014d3 in DA3
  • c018d3 in DA3
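
Because the full mmlsrecoverygroup -L --pdisk listing is long, it can be convenient to filter it down to just the replace-marked entries. The following one-liner is an ordinary text filter, not a GNR feature; it matches the /replace suffix in the state column:

# mmlsrecoverygroup 000DE37TOP -L --pdisk | grep '/replace'
c014d3                0,  0      DA3             279 GiB  replaceable dead/systemDrain/noRGD/noVCD/replace
c018d3                2,  4      DA3             558 GiB  replaceable dead/systemDrain/noRGD/noVCD/noData/replace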

The naming convention used during recovery group creation indicates that these are the disks in slot 3 of carriers 14 and 18. To confirm the physical locations of the failed disks, use the mmlspdisk command to list information about those pdisks in declustered array DA3 of recovery group 000DE37TOP that are marked for replacement:


# mmlspdisk 000DE37TOP --declustered-array DA3 --replace
pdisk:
   replacementPriority = 1.00
   name = "c014d3"
   device = "/dev/rhdisk158,/dev/rhdisk62"
   recoveryGroup = "000DE37TOP"
   declusteredArray = "DA3"
   state = "dead/systemDrain/noRGD/noVCD/replace"
   .
   .
   .

pdisk:
   replacementPriority = 1.00
   name = "c018d3"
   device = "/dev/rhdisk630,/dev/rhdisk726"
   recoveryGroup = "000DE37TOP"
   declusteredArray = "DA3"
   state = "dead/systemDrain/noRGD/noVCD/noData/replace"
   .
   .
   .
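
The elided portions of these stanzas include further attributes, among them the disk's location code and its FRU. If only the fields needed to plan the physical swap are of interest, a simple text filter over the same command will do; the attribute names location and fru are assumed here to appear in the stanza format shown above for name and state:

# mmlspdisk 000DE37TOP --declustered-array DA3 --replace | grep -E ' (name|location|fru) ='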

The location code attributes confirm the pdisk naming convention:

 pdisk    location code              interpretation
 -------  -------------------------  -------------------------------------------
 c014d3   78AD.001.000DE37-C14-D3    disk 3 in carrier 14 of the disk enclosure
                                     with enclosure type 78AD.001 and serial
                                     number 000DE37
 c018d3   78AD.001.000DE37-C18-D3    disk 3 in carrier 18 of the same enclosure

Replacing the failed disks in a Power 775 Disk Enclosure recovery group

Note: In this example, it is assumed that two new disks with the appropriate Field Replaceable Unit (FRU) code, as indicated by the fru attribute (74Y4936 in this case), have been obtained as replacements for the failed pdisks c014d3 and c018d3.
Replacing each disk is a three-step process:
  1. Using the mmchcarrier command with the --release flag to suspend use of the other disks in the carrier and to release the carrier.
  2. Removing the carrier and replacing the failed disk within with a new one.
  3. Using the mmchcarrier command with the --replace flag to resume use of the suspended disks and to begin use of the new disk.
GNR assigns a priority to pdisk replacement. Disks with smaller values for the replacementPriority attribute should be replaced first. In this example, the only failed disks are in DA3 and both have the same replacementPriority.
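
When more pdisks are marked than can be handled at once, the replacement order can be read out of the mmlspdisk output shown earlier. The following one-liner is a convenience sketch; the awk logic is illustrative and assumes, as in the sample stanzas above, that replacementPriority precedes name within each stanza:

# mmlspdisk 000DE37TOP --replace | awk \
      '/replacementPriority/ {pri=$3} /^ *name/ {gsub(/"/,"",$3); print pri, $3}' | sort -n
1.00 c014d3
1.00 c018d3
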
Disk c014d3 is chosen to be replaced first.
  1. To release carrier 14 in disk enclosure 000DE37:
    # mmchcarrier 000DE37TOP --release --pdisk c014d3
      [I] Suspending pdisk c014d1 of RG 000DE37TOP in location 78AD.001.000DE37-C14-D1.
      [I] Suspending pdisk c014d2 of RG 000DE37TOP in location 78AD.001.000DE37-C14-D2.
      [I] Suspending pdisk c014d3 of RG 000DE37TOP in location 78AD.001.000DE37-C14-D3.
      [I] Suspending pdisk c014d4 of RG 000DE37TOP in location 78AD.001.000DE37-C14-D4.
      [I] Carrier released.
    
        - Remove carrier.
        - Replace disk in location 78AD.001.000DE37-C14-D3 with FRU 74Y4936.
        - Reinsert carrier.
        - Issue the following command:
    
            mmchcarrier 000DE37TOP --replace --pdisk 'c014d3'
    
          Repair timer is running.  Perform the above within 5 minutes
          to avoid pdisks being reported as missing.
    
    
    

    GNR issues instructions as to the physical actions that must be taken. Note that disks may be suspended for only a limited time before they are declared missing; therefore the physical disk replacement must be completed promptly.

    Use of the other three disks in carrier 14 has been suspended, and carrier 14 is unlocked. The identify lights for carrier 14 and for disk 3 are on.

  2. Unlatch and remove carrier 14. Remove the failed disk 3, as indicated by the internal identify light, and insert the new disk with FRU 74Y4936 in its place. Reinsert carrier 14 and close the latch.
  3. To finish the replacement of pdisk c014d3:
    # mmchcarrier 000DE37TOP --replace --pdisk c014d3    
    [I] The following pdisks will be formatted on node server1:
        /dev/rhdisk354   
    [I] Pdisk c014d3 of RG 000DE37TOP successfully replaced.   
    [I] Resuming pdisk c014d1 of RG 000DE37TOP.   
    [I] Resuming pdisk c014d2 of RG 000DE37TOP.   
    [I] Resuming pdisk c014d3#162 of RG 000DE37TOP.   
    [I] Resuming pdisk c014d4 of RG 000DE37TOP.   
    [I] Carrier resumed. 

When the mmchcarrier --replace command returns successfully, GNR has resumed use of the other three disks. The failed pdisk may remain in a temporary form (indicated here by the name c014d3#162) until all data from it has been rebuilt, at which point it is finally deleted. The new replacement disk, which has assumed the name c014d3, will have RAID tracks rebuilt and rebalanced onto it. Notice that only one block device name is reported as being formatted as a pdisk; the second path will be discovered in the background.
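
The background path discovery can be watched by querying the new pdisk directly; once the second path has been found, the device attribute lists both block devices, and the path counts in the mmlsrecoverygroup output return to 2, 4. The --pdisk form of mmlspdisk is used here on the assumption that it is available at your GPFS level:

# mmlspdisk 000DE37TOP --pdisk c014d3 | grep -E ' (name|device|state) ='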

The progress of rebuilding and rebalancing onto the new disk can be confirmed with mmlsrecoverygroup -L --pdisk:


# mmlsrecoverygroup 000DE37TOP -L --pdisk

                    declustered
 recovery group       arrays     vdisks  pdisks
 -----------------  -----------  ------  ------
 000DE37TOP                   5       9     193

 declustered   needs                            replace                scrub       background activity
    array     service  vdisks  pdisks  spares  threshold  free space  duration  task   progress  priority
 -----------  -------  ------  ------  ------  ---------  ----------  --------  -------------------------
 DA1          no            2      47       2          2    3072 MiB   14 days  scrub       63%  low
 DA2          no            2      47       2          2    3072 MiB   14 days  scrub       19%  low
 DA3          yes           2      48       2          2       0   B   14 days  rebuild-2r  89%  low
 DA4          no            2      47       2          2    3072 MiB   14 days  scrub       34%  low
 LOG          no            1       4       1          1     546 GiB   14 days  scrub       87%  low

                    n. active,   declustered                 user     state,
pdisk               total paths     array     free space   condition  remarks
-----------------   -----------  -----------  ----------  ----------- -------
[...]
c014d1                2,  4      DA1              23 GiB  normal      ok
c014d2                2,  4      DA2              23 GiB  normal      ok
c014d3                2,  4      DA3             550 GiB  normal      ok
c014d3#162            0,  0      DA3             543 GiB  replaceable dead/adminDrain/noRGD/noVCD/noPath
c014d4                2,  4      DA4              23 GiB  normal      ok
[...]
c018d1                2,  4      DA1              24 GiB  normal      ok
c018d2                2,  4      DA2              24 GiB  normal      ok
c018d3                0,  0      DA3             558 GiB  replaceable dead/systemDrain/noRGD/noVCD/noData/replace
c018d4                2,  4      DA4              23 GiB  normal      ok
[...]

Notice that the temporary pdisk c014d3#162 is counted in the total number of pdisks in declustered array DA3 and in the recovery group, until it is finally drained and deleted.

Notice also that pdisk c018d3 is still marked for replacement, and that DA3 still needs service. This is because GNR replacement policy expects all failed disks in the declustered array to be replaced once the replacement threshold is reached. The replace state on a pdisk is not removed when the total number of failed disks falls below the threshold.

Pdisk c018d3 is replaced following the same process.
  1. Release carrier 18 in disk enclosure 000DE37:
    # mmchcarrier 000DE37TOP --release --pdisk c018d3
      [I] Suspending pdisk c018d1 of RG 000DE37TOP in location 78AD.001.000DE37-C18-D1.
      [I] Suspending pdisk c018d2 of RG 000DE37TOP in location 78AD.001.000DE37-C18-D2.
      [I] Suspending pdisk c018d3 of RG 000DE37TOP in location 78AD.001.000DE37-C18-D3.
      [I] Suspending pdisk c018d4 of RG 000DE37TOP in location 78AD.001.000DE37-C18-D4.
      [I] Carrier released.
    
        - Remove carrier.
        - Replace disk in location 78AD.001.000DE37-C18-D3 with FRU 74Y4936.
        - Reinsert carrier.
        - Issue the following command:
    
            mmchcarrier 000DE37TOP --replace --pdisk 'c018d3'
    
          Repair timer is running.  Perform the above within 5 minutes
          to avoid pdisks being reported as missing.
  2. Unlatch and remove carrier 18, remove and replace failed disk 3, reinsert carrier 18, and close the latch.
  3. To finish the replacement of pdisk c018d3:
    # mmchcarrier 000DE37TOP --replace --pdisk c018d3
    
      [I] The following pdisks will be formatted on node server1:
          /dev/rhdisk674
      [I] Pdisk c018d3 of RG 000DE37TOP successfully replaced.
      [I] Resuming pdisk c018d1 of RG 000DE37TOP.
      [I] Resuming pdisk c018d2 of RG 000DE37TOP.
      [I] Resuming pdisk c018d3#166 of RG 000DE37TOP.
      [I] Resuming pdisk c018d4 of RG 000DE37TOP.
      [I] Carrier resumed.
    
    
Running mmlsrecoverygroup again will confirm the second replacement:

# mmlsrecoverygroup 000DE37TOP -L --pdisk

                    declustered
 recovery group       arrays     vdisks  pdisks
 -----------------  -----------  ------  ------
 000DE37TOP                   5       9     192

 declustered   needs                            replace                scrub       background activity
    array     service  vdisks  pdisks  spares  threshold  free space  duration  task   progress  priority
 -----------  -------  ------  ------  ------  ---------  ----------  --------  -------------------------
 DA1          no            2      47       2          2    3072 MiB   14 days  scrub       64%  low
 DA2          no            2      47       2          2    3072 MiB   14 days  scrub       22%  low
 DA3          no            2      47       2          2    2048 MiB   14 days  rebalance   12%  low
 DA4          no            2      47       2          2    3072 MiB   14 days  scrub       36%  low
 LOG          no            1       4       1          1     546 GiB   14 days  scrub       89%  low

                    n. active,   declustered                 user     state,
pdisk               total paths     array     free space   condition  remarks
-----------------   -----------  -----------  ----------  ----------- -------
 [...]
 c014d1               2,  4       DA1              23 GiB  normal      ok
 c014d2               2,  4       DA2              23 GiB  normal      ok
 c014d3               2,  4       DA3             271 GiB  normal      ok
 c014d4               2,  4       DA4              23 GiB  normal      ok
 [...]
 c018d1               2,  4       DA1              24 GiB  normal      ok
 c018d2               2,  4       DA2              24 GiB  normal      ok
 c018d3               2,  4       DA3             542 GiB  normal      ok
 c018d4               2,  4       DA4              23 GiB  normal      ok
 [...]

Notice that both temporary pdisks have been deleted. This is because c014d3#162 has finished draining, and because pdisk c018d3#166 had, before it was replaced, already been completely drained (as evidenced by the noData flag). Declustered array DA3 no longer needs service and once again contains 47 pdisks, and the recovery group once again contains 192 pdisks.