Lost access to logical drive

Troubleshooting

Problem

SCSI ServeRAID controllers can become unstable or hang when a hard disk drive behaves in a specific way. This may cause one or more of the following symptoms to occur:

- The SCSI ServeRAID controller becomes unresponsive during operation.
- After a warm system restart, the ServeRAID POST banner does not appear while the system is starting. If the ServeRAID POST banner does appear, a counter is shown that counts down from six minutes to zero and then fails to locate a boot device.
- IPSSEND commands directed at the controller always fail with errors.
- Logical drives associated with the controller are unavailable/offline.
- The server becomes sluggish and unresponsive, resulting in an operating system crash.
- Following a drive failure, a rebuild appears to start but cannot complete.
- There is a low probability that after a failure, bad stripes may increment on the logical drive.

Resolving The Problem

Source

RETAIN tip: H191768

Symptom

SCSI ServeRAID controllers can become unstable or hang when a hard disk drive behaves in a specific way. This may cause one or more of the following symptoms to occur:

The SCSI ServeRAID controller becomes unresponsive during operation.
After a warm system restart, the ServeRAID POST banner does not appear while the system is starting. If the ServeRAID POST banner does appear, a counter is shown that counts down from six minutes to zero and then fails to locate a boot device.
IPSSEND commands directed at the controller always fail with errors.
Logical drives associated with the controller are unavailable/offline.
The server becomes sluggish and unresponsive resulting in an operating system crash.
Following a drive failure, a rebuild appears to start but cannot complete.
There is a low probability that after a failure, bad stripes may increment on the logical drive.

Affected configurations

The system is configured with one or more of the following IBM Options:

ServeRAID-4H Ultra160 SCSI Controller - Cache Daughter Card replacement part number 37L6902
ServeRAID-4H Ultra160 SCSI Controller, Option 37L6889, replacement part number 37L6892
ServeRAID-4L Ultra160 SCSI Controller, Option 37L6091, replacement part number 09N9540
ServeRAID-4Lx Ultra160 SCSI Controller, Option 06P5740, replacement part number 06P5741
ServeRAID-4M Ultra160 SCSI Controller (Japan), Option 19K0565, replacement part number 00N9543
ServeRAID-4M Ultra160 SCSI Controller, Option 37L6080, replacement part number 37L7258
ServeRAID-4Mx Ultra160 SCSI Controller, Option 06P5736, replacement part number 06P5737
ServeRAID-5i Controller, Option 25P3492, replacement part number 02R0970 replaces 32P0016
ServeRAID-6M Controller (128MB Cache), Option 32P0033, replacement part number 02R0985
ServeRAID-6M Controller (256MB Cache), Option 13N2197, replacement part number 02R0998
ServeRAID-6i Controller, Option 71P8595, replacement part number 71P8627
ServeRAID-6i+ Controller, Option 13N2190, replacement part number 13N2195
ServeRAID-7k Controller, Option 39R8800 replaces 71P8642, replacement part number 71P8644

This tip is not software specific.

Solution

Use the methods described in the Details section to isolate the failure mode.

SCSI ServeRAID BIOS and Firmware version 7.12.13 is planned to be released in Fourth Quarter 2007. In the interim, please contact the RTP Support Group.

Contact the appropriate Support Center for your geography. In the United States, contact 1-800-IBM-SERV (1-800-426-7378).

The "IBM Directory of Worldwide Contacts" is available from the following URL:

http://www.ibm.com/planetwide/

The referenced file will be available from the "Servers - ServeRAID Software Matrix".

Workaround

To recover the server to a stable condition after experiencing one or more of the symptoms above, shut down the server and power the system off. If the SCSI hard drives are in an external enclosure, such as an EXP400, power off the external enclosure after the server. Wait approximately 60 seconds, then power on the external enclosure first, followed by the server.

This process is necessary to force a power reset of the disk drives, since SCSI hard disk drives do not always reset during a warm power cycle. If the hard drives are not reset in this manner, the system may not detect the drives properly, resulting in the "No boot devices found" symptom described above. After a power reset, the system should resume operating normally for a period of time. Use this time to perform problem determination and isolate the failure mode as described in the Details section.

Additional information

A very small number of SCSI hard drives have been found to fail in an unusual manner: they become "very slow" to process commands from the controller. At the drive level, the disk accepts commands, but the commands time out at the controller. While in this state, the drive may not post any errors or "check conditions" that the controller would normally use to take specific error recovery actions to remedy the issue.

By design, SCSI ServeRAID controllers do not mark physical drives defunct for command time-outs. If a drive fails to respond to a command, the controller stops taking on new workload and prepares to reset the SCSI bus. Prior to the reset, all incomplete transactions must be accounted for and aborted.

These aborted commands are logged as "Misc" events in the device log.

The symptoms noted above occur when the ServeRAID controller repeatedly deals with a drive that is "very slow" to respond and either runs out of resources to perform error recovery and hangs, or cannot get access to the SCSI bus because another device on the channel is hung.

To troubleshoot this issue properly, the ServeRAID firmware must be at 7.12.13 or higher. IBM has confirmed that code levels 7.10.18 to 7.12.12 inadvertently did not count miscellaneous events properly in the ServeRAID device log. These counts are a critical component in identifying the drive condition within the subsystem. ServeRAID ParserLite access is not required to read device logs, but it is very helpful to use parsed information from the soft logs to corroborate data in the device log. The device log can be obtained by using the following IPSSEND command:

IPSSEND GETEVENT controller# DEVICE
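
The firmware levels stated above can be checked mechanically before the device log is trusted. The following is a minimal Python sketch, assuming the firmware level is available as a plain three-part dotted string such as "7.12.13"; how that string is obtained is not covered here, and the function names are illustrative only.

```python
# Firmware levels taken from the text above.
FIRST_MISCOUNTING = (7, 10, 18)   # first level known to miscount Misc events
REQUIRED_LEVEL = (7, 12, 13)      # level required for troubleshooting


def parse_version(text):
    """Convert a dotted version string such as '7.12.13' to a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_troubleshooting_level(version_text):
    """True if the firmware is at 7.12.13 or higher."""
    return parse_version(version_text) >= REQUIRED_LEVEL


def in_miscounting_range(version_text):
    """True if the firmware is in the 7.10.18 to 7.12.12 range."""
    version = parse_version(version_text)
    return FIRST_MISCOUNTING <= version < REQUIRED_LEVEL


if __name__ == "__main__":
    for level in ("7.10.17", "7.10.18", "7.12.12", "7.12.13"):
        print(level, meets_troubleshooting_level(level), in_miscounting_range(level))
```

Comparing tuples of integers avoids the string-comparison pitfall where "7.9.0" would sort after "7.12.13".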

To collect all the ServeRAID logs typically used to diagnose this issue, capture a ServeRAID Support Archive using ServeRAID Manager or run the following ServeRAID CLI commands within the operating system and output them into a text file:

Command	Purpose
IPSSEND GETVERSION	Collects code versions
IPSSEND GETCONFIG controller# AL	Collects configuration data
IPSSEND GETEVENT controller# ALL	Collects event logs
IPSSEND GETBST controller#	Collects bad stripe table
IPSSEND GETSTATUS controller#	Collects current status
IPSSEND GETSUBSTAT controller#	Collects data rate info
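
Running the commands above one by one and redirecting each into a text file can be scripted. The following Python sketch is one possible way to do that. It assumes the IPSSEND executable is reachable on the PATH under the name "ipssend" and that each command runs without prompting; both are assumptions, and the function names and output layout are illustrative only.

```python
import subprocess

# Subcommands and arguments as listed above; "{c}" stands for the controller number.
COMMANDS = [
    ("GETVERSION", []),
    ("GETCONFIG", ["{c}", "AL"]),
    ("GETEVENT", ["{c}", "ALL"]),
    ("GETBST", ["{c}"]),
    ("GETSTATUS", ["{c}"]),
    ("GETSUBSTAT", ["{c}"]),
]


def build_commands(controller, executable="ipssend"):
    """Return one argument list per command for the given controller number."""
    result = []
    for name, args in COMMANDS:
        filled = [arg.format(c=controller) for arg in args]
        result.append([executable, name] + filled)
    return result


def collect_logs(controller, out_path, executable="ipssend"):
    """Run each command and write its output, with a header line, to out_path."""
    with open(out_path, "w") as out:
        for argv in build_commands(controller, executable):
            out.write("===== " + " ".join(argv) + " =====\n")
            try:
                done = subprocess.run(
                    argv,
                    stdout=subprocess.PIPE,
                    stderr=subprocess.STDOUT,
                    text=True,
                )
                out.write(done.stdout + "\n")
            except OSError as err:
                # Executable missing or not runnable (assumed name may be wrong).
                out.write("could not run command: {}\n".format(err))


if __name__ == "__main__":
    collect_logs(1, "serveraid_logs.txt")
```

If the controller is already hung, these commands are expected to fail, as noted in the symptoms; in that case the file records the error output instead.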

The following device log shows what can appear when a physical drive becomes "very slow", resulting in command time-outs and resets on the SCSI bus.

Sample Device event table:

Channel	SCSI ID	Parity	Soft	Hard	PFA	Misc
1	0	0	0	0	No	0
1	1	0	0	0	No	0
1	2	0	0	0	No	0
1	3	0	0	0	No	0
1	4	0	0	0	No	0
1	5	0	0	0	No	0
1	6	0	0	0	No	0
1	7	0	0	0	No	0
1	8	0	0	0	No	88
1	9	0	0	0	No	0
1	10	0	0	0	No	0
1	11	0	0	0	No	0
1	12	0	0	0	No	0
1	13	0	0	0	No	0
1	14	0	0	0	No	0
1	15	0	0	0	No	0

In this example, an EXP400 was used with 14 hard drives attached to channel 1. The drive at channel 1 SCSI ID 8 has a miscellaneous count of 88, indicating that this drive has gone through several cycles of aborting commands and resets. If the soft log is parsed, the parsed log should corroborate repeating SCSI bus resets among other possible contributing errors.

If miscellaneous events are spread across multiple drives, or if other errors or conditions differ significantly from the device log sample above, the final solution may be something other than a hard drive issue. Please see "Analyzing SCSI ServeRAID Device Logs" below for more information.

A "very slow" hard disk drive diagnosis is supported only when the device log clearly shows a significantly greater number of miscellaneous events for one drive than for the other drives in the system, and the SCSI ServeRAID controller and server are also experiencing one or more of the symptoms above.
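
The comparison described above, one drive with far more miscellaneous events than the rest, can be sketched in Python as follows. This sketch assumes the device log has been saved as whitespace- or tab-separated rows laid out like the sample device event table, with Channel first, SCSI ID second and Misc last; actual IPSSEND output may be formatted differently. The thresholds `min_misc` and `ratio` are illustrative assumptions, not IBM-defined values.

```python
# Excerpt of the sample device event table above (SCSI IDs 7 to 9).
SAMPLE = """Channel\tSCSI ID\tParity\tSoft\tHard\tPFA\tMisc
1\t7\t0\t0\t0\tNo\t0
1\t8\t0\t0\t0\tNo\t88
1\t9\t0\t0\t0\tNo\t0
"""


def parse_device_table(text):
    """Parse data rows into dicts, skipping the header row and blank lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    rows = []
    for line in lines[1:]:
        fields = line.split()
        rows.append({
            "channel": int(fields[0]),
            "scsi_id": int(fields[1]),
            "misc": int(fields[-1]),
        })
    return rows


def suspect_drives(rows, min_misc=10, ratio=5):
    """Return (channel, scsi_id, misc) for drives whose Misc count stands out.

    A drive is flagged when its Misc count is at least min_misc and at least
    ratio times the highest Misc count among all other drives. Both
    thresholds are assumptions made for this sketch.
    """
    flagged = []
    for row in rows:
        others = [r["misc"] for r in rows if r is not row]
        highest_other = max(others) if others else 0
        if row["misc"] >= min_misc and row["misc"] >= ratio * highest_other:
            flagged.append((row["channel"], row["scsi_id"], row["misc"]))
    return flagged


if __name__ == "__main__":
    print(suspect_drives(parse_device_table(SAMPLE)))  # prints [(1, 8, 88)]
```

When miscellaneous events are spread across several drives with similar counts, the function returns an empty list, which matches the guidance that such a pattern may point to something other than a single hard drive. A flagged drive is only a candidate; the symptoms and the parsed soft log should still corroborate it.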

After concluding that a drive is contributing to failures in this manner, the drive should be manually marked defunct and replaced under warranty terms and conditions as they apply.

See the following URLs for additional information:

SCSI ServeRAID Playbook:

https://w3-03.ibm.com/services/technology/datacase/primary/docs/scadocs/pss/xSeries/ServeRAID/index.html

Analyzing SCSI ServeRAID Device Logs:

https://w3-03.ibm.com/services/technology/datacase/primary/docs/scadocs/pss/xSeries/ServeRAID/Ref_Log_Device.html

Predefined Action Plan for bad stripe table entries:

https://w3-03.ibm.com/services/technology/datacase/primary/docs/scadocs/pss/xSeries/ServeRAID/PD_Action-BST.html

Using Dumplog to collect SCSI ServeRAID logs:

http://www.ibm.com/systems/support/supportsite.wss/docdisplay?brandind=5000008&lndocid=MIGR-4UD223

Document Location

Worldwide

Operating System

System x Hardware Options:Operating system independent / None

