IBM Support

Recovery procedures for failed drives - Servers

Troubleshooting


Problem

RETAIN tip H201156: The Netfinity Fibre Channel RAID Controller has at least one failed drive.

Resolving The Problem

Source
RETAIN tip H201156

Symptom
The Netfinity Fibre Channel RAID Controller has at least one failed drive.
 
Affected configurations
The system is any of the following IBM Netfinity servers:  

  • a Netfinity 8500R server, Type 8681, any Model.
  • a Netfinity 7000-M10 server, Type 8680, any Model.
  • a Netfinity 7000 server, Type 8651, any Model.
  • a Netfinity 5500-M20 server, Type 8662, any Model.
  • a Netfinity 5500-M10 server, Type 8661, any Model.
  • a Netfinity 5500 server, Type 8660, any Model.
  • a Netfinity 5000 server, Type 8659, any Model.
  • a Netfinity 4000R server, Type 8652, any Model.
  • a Netfinity 3500-M10 server, Type 8655, any Model.
  • a Netfinity 3500 server, Type 8644, any Model.
  • a Netfinity 3000 server, Type 8476, any Model.
  • a Netfinity 1000 server, Type 8477, any Model.
The system is configured with the following option(s):
  • Fibre Channel RAID Controller, Type 3526, any Model.

The system has the described symptom.

Solution

NOTE: The following information applies only to drives that are part of the same LUN. With Fibre Channel, it is not always obvious which drives belong to which LUN, and the difficulty is compounded when more than one drive has failed on the system. If there is any question whatsoever, contact your local IBM Support Center before taking any corrective action.

Important Guidelines

Drive replacement (rebuilding a defunct drive):  

  • Physically replace ANY single failed drive.
  • Do not power off either the fibre enclosures or the drive enclosures when multiple drives have failed. If any of the units are powered down, potentially valuable information about the order in which the drives failed will be lost.
  • Ensure proper power up and power down sequences at all times. Failure to do so may result in offline drives, an invalid configuration or even lost data. The fibre controller stores its configuration to the drives, so if they are not available at any time the controller is powered on, errors may occur. The proper sequences are:
Power Up: Power on the drive enclosures first, then the fibre controller, then the host system.
Power Down: Power down the host system first, then the fibre controller, then the drive enclosures.  
  • Determine exactly what the drive and RAID configuration is.
  • Do not revive or reconstruct any drives until you are absolutely certain that you are doing it in the correct order.
  • Get help if you are not certain.
  • Never revive the last drive on any RAID-1 or RAID-5 array; always reconstruct it.
  • When replacing a drive, always wait at least 20 seconds after removing the old drive before installing the new one.
NOTE: The controller scans every 10 seconds to see whether the drive is there, so waiting 20 seconds ensures that the controller does not miss the change.
  • Always follow proper drive handling procedures.
  • When a hard disk drive goes defunct (failed), a rebuild operation is required to reconstruct the data for the device in its respective disk array. The Netfinity Fibre controllers can reconstruct RAID level-1 and RAID level-5 logical drives, but they cannot reconstruct data stored in RAID level-0 logical drives.
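
As an illustration only, the power sequences and the 20-second replacement wait from the guidelines above can be written down as a small Python checklist. The names used here (POWER_UP_ORDER, power_sequence, safe_to_insert) are hypothetical and are not part of any IBM utility; the sketch encodes only the ordering and timing stated in this tip.

```python
# Hypothetical checklist helper; not part of any IBM utility.
# The order and the timings are taken from the guidelines above.

POWER_UP_ORDER = ["drive enclosures", "fibre controller", "host system"]

SCAN_INTERVAL_S = 10                   # controller checks for drives every 10 seconds
MIN_SWAP_WAIT_S = 2 * SCAN_INTERVAL_S  # a 20-second wait spans at least one full scan


def power_sequence(action):
    """Return the order in which to power the units for 'up' or 'down'."""
    if action == "up":
        return list(POWER_UP_ORDER)
    if action == "down":
        # Power down is the exact reverse of power up.
        return list(reversed(POWER_UP_ORDER))
    raise ValueError("action must be 'up' or 'down'")


def safe_to_insert(seconds_since_removal):
    """True once the controller has had time to notice that the drive was removed."""
    return seconds_since_removal >= MIN_SWAP_WAIT_S


if __name__ == "__main__":
    print("Power up:  ", " -> ".join(power_sequence("up")))
    print("Power down:", " -> ".join(power_sequence("down")))
    print("Insert new drive after 5 s? ", safe_to_insert(5))
    print("Insert new drive after 20 s?", safe_to_insert(20))
```

Deriving the power-down order by reversing the power-up list keeps the two sequences from drifting apart if the checklist is ever edited.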
NOTE: Before you rebuild a drive, review the following guidelines and general information.
 
Guidelines for the rebuild operation
Remember, the replacement hard disk drive must have a capacity equal to or greater than the failed drive.

General information about the rebuild operation

A physical hard disk drive can enter the rebuild state if:
  • You physically replace a failed drive that is part of a critical logical drive.
In that case, the Netfinity Fibre controller rebuilds the data on the new physical drive before it changes the logical drive state back to Optimal.
Recovering from a multiple drive failure
The Netfinity Fibre controller will rebuild a defunct drive automatically when either of the following conditions exists:
  • A hot-spare drive with a capacity equal to or greater than the capacity of the defunct drive is available the moment the drive fails.
  • A drive of equal or greater capacity is installed in the same location as the failed drive, and the LUN is still in a state (degraded) from which the data can be rebuilt.
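
The conditions above can be sketched as a small decision function. This is a minimal illustration, not the controller's actual logic: it assumes the two conditions are alternative triggers (a hot spare takes over, or a replacement is installed in the failed drive's location), and the Drive class, the field names, and the "degraded" state string are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Drive:
    slot: str          # hypothetical location label, e.g. "enclosure 1, slot 3"
    capacity_mb: int


def auto_rebuild_expected(
    defunct: Drive,
    lun_state: str,
    hot_spare: Optional[Drive] = None,    # spare available at the moment of failure
    replacement: Optional[Drive] = None,  # drive installed after the failure
) -> bool:
    """Return True if an automatic rebuild would be expected under the stated assumptions."""
    # A hot spare qualifies only if it is at least as large as the defunct drive.
    if hot_spare is not None and hot_spare.capacity_mb >= defunct.capacity_mb:
        return True
    # A replacement must sit in the same location, be large enough,
    # and the LUN must still be degraded (not dead) so the data can be rebuilt.
    if (
        replacement is not None
        and replacement.slot == defunct.slot
        and replacement.capacity_mb >= defunct.capacity_mb
        and lun_state == "degraded"
    ):
        return True
    return False


if __name__ == "__main__":
    failed = Drive(slot="slot 3", capacity_mb=9100)
    print(auto_rebuild_expected(failed, "degraded", hot_spare=Drive("slot 9", 18200)))
    print(auto_rebuild_expected(failed, "degraded", replacement=Drive("slot 4", 9100)))
```

A function like this is useful only as a pre-check before touching hardware; the controller's own status display remains the authority on whether a rebuild has started.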

When the Netfinity Fibre controller communicates with a hard disk drive and receives an unexpected response, the controller marks the drive failed in order to avoid any potential data loss. For example, this could occur after a power loss to any of the components in the SCSI Netfinity Fibre subsystem. In this case, the Netfinity Fibre controller errs on the side of safety and no longer writes to that drive, although the drive may not be defective in any way.

When multiple drives fail in a RAID 1 or RAID 5 array, data is lost. Data recovery may be attempted by bringing all but the first drive that was marked failed back to the Optimal state.

Please contact the IBM Support Center for help in determining the failure order of the drives.

1. Undefine any hot-spare (HSP) drives. An automatic rebuild may destroy the data if the drives are revived in the wrong order.

2. Revive all of the failed drives in the array except the one that failed first. The LUN should then be in a degraded state. If the failure order cannot be determined, please contact the IBM Support Center for assistance.

3. From the operating system, perform a CHKDSK in read-only mode. This will confirm that the drives have been revived in the correct order and that the data is intact.
 
NOTE: If the CHKDSK does not complete successfully, the drives may have been revived in the wrong order. Do NOT remove power from the controller or the drives. Contact the IBM Support Center immediately for assistance in attempting to recover the data.

4. Perform a backup of the now known-good data. This will ensure that if the problem happens again before the root cause of the failure is found, only a minimal amount of data will be lost.
 
5. Once the backup of data has been completed, reconstruct the last drive in the LUN. This should return the LUN to an Optimal status, and will better protect the data until the root cause can be found.

6. Once the rebuild completes, identify the cause of the drive failures, such as a bad cable or backplane. If the reconstruction fails, the last drive may be the source of the error.

NOTE: If you do not know the order in which the drives went offline, do not revive any drives until you have established that order. Do NOT remove power from the controller or the drives.

7. If no cause can be found, contact the IBM Support Center for assistance in determining the root cause.
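
The ordering rule in steps 2 and 5 can be sketched in Python as follows. This is an illustration under stated assumptions, not an IBM tool: plan_recovery and the drive labels are hypothetical, the failure times must come from the controller logs or from the IBM Support Center, and the sketch reads "the last drive in the LUN" in step 5 as the one drive left unrevived, which is the drive that failed first.

```python
def plan_recovery(failed_at):
    """Plan which drives to revive and which to reconstruct.

    failed_at maps a drive label to the time it was marked failed
    (any sortable value, e.g. seconds or an ISO timestamp string).
    """
    if len(failed_at) < 2:
        raise ValueError("this procedure applies only to multiple failed drives")
    times = list(failed_at.values())
    if len(set(times)) != len(times):
        # Identical times mean the order is not established: do not guess.
        raise ValueError("failure order is ambiguous; contact the IBM Support Center")

    ordered = sorted(failed_at, key=failed_at.get)  # earliest failure first
    return {
        # Revive every failed drive except the first one that failed.
        # The tip does not specify an order within this list.
        "revive": ordered[1:],
        # After CHKDSK and a backup, reconstruct (never revive) the remaining drive.
        "reconstruct": ordered[0],
    }


if __name__ == "__main__":
    # Hypothetical labels and failure times, for illustration only.
    plan = plan_recovery({"drive A": 300, "drive B": 100, "drive C": 200})
    print("Revive:     ", plan["revive"])
    print("Reconstruct:", plan["reconstruct"])
```

Raising an error on tied or missing timestamps mirrors the guidance above: when the failure order is not known for certain, no drive should be revived.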

Document Location

Worldwide

Operating System

Older System x:Operating system independent / None


Document Information

Modified date:
28 January 2019

UID

ibm1MIGR-4F4SC8