Troubleshooting
Problem
SCSI ServeRAID controllers can become unstable or hang when a hard disk drive behaves in a specific way. This may cause one or more of the following symptoms to occur:
- The SCSI ServeRAID controller becomes unresponsive during operation.
- After a warm system restart, the ServeRAID POST banner does not appear while the system is starting. If the ServeRAID POST banner does appear, a counter counts down from six minutes to zero, and the system then fails to locate a boot device.
- IPSSEND commands directed at the controller always fail with errors.
- Logical drives associated with the controller are unavailable/offline.
- The server becomes sluggish and unresponsive, resulting in an operating system crash.
- Following a drive failure, a rebuild appears to start but cannot complete.
- There is a low probability that after a failure, bad stripes may increment on the logical drive.
Resolving The Problem
Source
RETAIN tip: H191768
Symptom
SCSI ServeRAID controllers can become unstable or hang when a hard disk drive behaves in a specific way. This may cause one or more of the following symptoms to occur:
- The SCSI ServeRAID controller becomes unresponsive during operation.
- After a warm system restart, the ServeRAID POST banner does not appear while the system is starting. If the ServeRAID POST banner does appear, a counter counts down from six minutes to zero, and the system then fails to locate a boot device.
- IPSSEND commands directed at the controller always fail with errors.
- Logical drives associated with the controller are unavailable/offline.
- The server becomes sluggish and unresponsive, resulting in an operating system crash.
- Following a drive failure, a rebuild appears to start but cannot complete.
- There is a low probability that after a failure, bad stripes may increment on the logical drive.
Affected configurations
The system is configured with one or more of the following IBM Options:
- ServeRAID-4H Ultra160 SCSI Controller - Cache Daughter Card replacement part number 37L6902
- ServeRAID-4H Ultra160 SCSI Controller, Option 37L6889, replacement part number 37L6892
- ServeRAID-4L Ultra160 SCSI Controller, Option 37L6091, replacement part number 09N9540
- ServeRAID-4Lx Ultra160 SCSI Controller, Option 06P5740, replacement part number 06P5741
- ServeRAID-4M Ultra160 SCSI Controller (Japan), Option 19K0565, replacement part number 00N9543
- ServeRAID-4M Ultra160 SCSI Controller, Option 37L6080, replacement part number 37L7258
- ServeRAID-4Mx Ultra160 SCSI Controller, Option 06P5736, replacement part number 06P5737
- ServeRAID-5i Controller, Option 25P3492, replacement part number 02R0970 replaces 32P0016
- ServeRAID-6M Controller (128MB Cache), Option 32P0033, replacement part number 02R0985
- ServeRAID-6M Controller (256MB Cache), Option 13N2197, replacement part number 02R0998
- ServeRAID-6i Controller, Option 71P8595, replacement part number 71P8627
- ServeRAID-6i+ Controller, Option 13N2190, replacement part number 13N2195
- ServeRAID-7k Controller, Option 39R8800 replaces 71P8642, replacement part number 71P8644
This tip is not software specific.
Solution
Use the methods described in the Details section to isolate the failure mode.
SCSI ServeRAID BIOS and Firmware version 7.12.13 is planned for release in the fourth quarter of 2007. In the interim, please contact the RTP Support Group.
Contact the appropriate Support Center for your geography. In the United States, contact 1-800-IBM-SERV (1-800-426-7378).
The "IBM Directory of Worldwide Contacts" is available from the following URL:
http://www.ibm.com/planetwide/
The referenced file will be available from the "Servers - ServeRAID Software Matrix".
Workaround
To recover the server into a stable condition after experiencing one or more of the symptoms above, shut down the server and power the system off. If the SCSI hard drives are in an external enclosure unit, such as an EXP400, power off the external enclosure after the server. Wait approximately 60 seconds, then power on the external enclosure first, followed by the server.
This process is necessary to force a power reset to the disk drives, since SCSI hard disk drives do not always reset during a warm power cycle. If the hard drives are not reset in this manner, the system may not detect the drive properly, resulting in the "No boot devices found" symptom described above. After a power reset, the system should resume operating normally for a period of time. This time should be used to do problem determination to isolate the failure mode as described in the Details section.
Additional information
A very small number of SCSI hard drives have been found to fail in an unusual manner in that they become "very slow" to process commands from the controller. At the drive level, the disk accepts commands, but the commands time out at the controller. While in this state, the drive may not post any errors or "check conditions" that the controller would normally use to take specific error recovery actions to remedy the issue.
By design, SCSI ServeRAID controllers do not mark physical drives defunct for command time-outs. If a drive fails to respond to a command, the controller stops taking on new workload and prepares to reset the SCSI bus. Prior to the reset, all incomplete transactions must be accounted for and aborted.
These aborted commands are logged as "Misc" events in the device log.
The symptoms noted above occur when the ServeRAID controller repeatedly deals with a drive that is "very slow" to respond and either runs out of resources to perform error recovery and hangs, or the controller cannot get access to the SCSI bus because another device on the channel is hung.
To properly troubleshoot this issue, the ServeRAID firmware must be at 7.12.13 or higher. IBM has confirmed that code levels 7.10.18 to 7.12.12 inadvertently did not properly count miscellaneous events in the ServeRAID device log. This count is a critical component in identifying the drive condition within the subsystem. ServeRAID ParserLite access is not required to read device logs, but it is very helpful to use parsed information from the soft logs to corroborate data in the device log. The device log can be obtained by using the following IPSSEND command:
IPSSEND GETEVENT controller# DEVICE
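The firmware requirement stated above can be expressed as a simple version comparison. The following is a minimal sketch, assuming the level is available as a dotted string such as "7.12.13" (for example, copied by hand from IPSSEND GETVERSION output). The function names are hypothetical and are not part of any ServeRAID tool.

```python
# Hypothetical helpers: compare a ServeRAID firmware level with the levels
# quoted in this tip. Assumes the "major.minor.build" format, e.g. "7.12.13".

def parse_level(level):
    """Turn '7.12.13' into (7, 12, 13) so that levels compare numerically."""
    return tuple(int(part) for part in level.strip().split("."))

REQUIRED_LEVEL = parse_level("7.12.13")    # minimum level for troubleshooting
FIRST_MISCOUNT = parse_level("7.10.18")    # first level that miscounts Misc events

def meets_required_level(level):
    """True if the level is 7.12.13 or higher."""
    return parse_level(level) >= REQUIRED_LEVEL

def in_miscount_range(level):
    """True for levels 7.10.18 through 7.12.12, which miscount Misc events."""
    return FIRST_MISCOUNT <= parse_level(level) < REQUIRED_LEVEL
```

For example, `meets_required_level("7.12.12")` returns False and `in_miscount_range("7.12.12")` returns True, so the Misc column of a device log taken at that level should not be relied on.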
The sample device event table below shows what can appear when a physical drive becomes "very slow", resulting in command timeouts and resets on the SCSI bus.
To collect all the ServeRAID logs typically used to diagnose this issue, capture a ServeRAID Support Archive using ServeRAID Manager, or run the following ServeRAID CLI commands within the operating system and redirect their output to a text file:
- IPSSEND GETVERSION: collects code versions
- IPSSEND GETCONFIG controller# AL: collects configuration data
- IPSSEND GETEVENT controller# ALL: collects event logs
- IPSSEND GETBST controller#: collects bad stripe table
- IPSSEND GETSTATUS controller#: collects current status
- IPSSEND GETSUBSTAT controller#: collects data rate info
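The commands listed above can be run in sequence and their output gathered into one text file. The following is a minimal sketch under stated assumptions: the executable name `ipssend` and its presence on the PATH are assumptions that vary by operating system, the commands are assumed to run without prompting for input, and the function names are hypothetical. Only the command names and arguments listed in this tip are used.

```python
import subprocess

# Commands listed in this tip; "{n}" stands in for the controller# placeholder.
LOG_COMMANDS = [
    "GETVERSION",
    "GETCONFIG {n} AL",
    "GETEVENT {n} ALL",
    "GETBST {n}",
    "GETSTATUS {n}",
    "GETSUBSTAT {n}",
]

def build_commands(controller, executable="ipssend"):
    """Return one argument list per command for the given controller number.

    The default executable name is an assumption; pass the real path if needed.
    """
    return [[executable] + cmd.format(n=controller).split()
            for cmd in LOG_COMMANDS]

def run_command(argv):
    """Run one command and return its combined stdout and stderr as text."""
    result = subprocess.run(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    return result.stdout

def collect_logs(controller, out_path, runner=run_command, executable="ipssend"):
    """Write the output of every command to out_path, one labeled section each."""
    with open(out_path, "w") as out:
        for argv in build_commands(controller, executable):
            out.write("===== " + " ".join(argv) + " =====\n")
            out.write(runner(argv))
            out.write("\n")
    return out_path
```

A call such as `collect_logs(1, "serveraid_logs.txt")` would gather the logs for controller 1. The `runner` parameter exists so that the command execution can be swapped out, for example when testing on a system without the CLI installed.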
Sample Device event table:
Channel  SCSI ID  Parity  Soft  Hard  PFA  Misc
1        0        0       0     0     No   0
1        1        0       0     0     No   0
1        2        0       0     0     No   0
1        3        0       0     0     No   0
1        4        0       0     0     No   0
1        5        0       0     0     No   0
1        6        0       0     0     No   0
1        7        0       0     0     No   0
1        8        0       0     0     No   88
1        9        0       0     0     No   0
1        10       0       0     0     No   0
1        11       0       0     0     No   0
1        12       0       0     0     No   0
1        13       0       0     0     No   0
1        14       0       0     0     No   0
1        15       0       0     0     No   0
In this example, an EXP400 was used with 14 hard drives attached to channel 1. The drive at channel 1 SCSI ID 8 has a miscellaneous count of 88, indicating that this drive has gone through several cycles of aborting commands and resets. If the soft log is parsed, the parsed log should corroborate repeating SCSI bus resets among other possible contributing errors.
If miscellaneous events are spread across multiple drives, or if other errors or conditions differ significantly from the above device log sample, the final solution may be something other than a hard drive issue. Please see "Analyzing SCSI ServeRAID Device Logs" below for more information.
For a "very slow" hard disk drive to be diagnosed, the device log must clearly show a significantly greater number of miscellaneous events for that drive than for the other drives in the system. The SCSI ServeRAID controller and server will also experience one or more of the symptoms above.
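The comparison of miscellaneous counts described above can be sketched in code. This is a minimal illustration, assuming the device log rows follow the seven-column layout of the sample table (Channel, SCSI ID, Parity, Soft, Hard, PFA, Misc); real IPSSEND output may be laid out differently. The function names and the `min_count` and `ratio` thresholds are hypothetical values chosen for illustration and are not IBM guidance.

```python
# Excerpt of the sample device event table, used here as test input.
SAMPLE_LOG = """\
Channel SCSI ID Parity Soft Hard PFA Misc
1 7 0 0 0 No 0
1 8 0 0 0 No 88
1 9 0 0 0 No 0
"""

def parse_device_table(text):
    """Return a list of (channel, scsi_id, misc) tuples from device log text."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly seven fields and begin with a numeric channel;
        # the header row and any other text are skipped.
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        rows.append((int(fields[0]), int(fields[1]), int(fields[6])))
    return rows

def suspect_drives(rows, min_count=10, ratio=5):
    """Flag drives whose Misc count far exceeds that of every other drive.

    min_count and ratio are illustrative assumptions, not documented limits.
    """
    suspects = []
    for index, (channel, scsi_id, misc) in enumerate(rows):
        others = [m for i, (_, _, m) in enumerate(rows) if i != index]
        highest_other = max(others, default=0)
        if misc >= min_count and misc >= ratio * max(1, highest_other):
            suspects.append((channel, scsi_id, misc))
    return suspects

print(suspect_drives(parse_device_table(SAMPLE_LOG)))  # [(1, 8, 88)]
```

When miscellaneous events are spread across several drives with similar counts, this sketch flags nothing, which matches the guidance that such a pattern may point to something other than a single hard drive.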
After concluding that a drive is contributing to failures in this manner, the drive should be manually marked defunct and replaced under warranty terms and conditions as they apply.
See the following URLs for additional information:
SCSI ServeRAID Playbook:
https://w3-03.ibm.com/services/technology/datacase/primary/docs/scadocs/pss/xSeries/ServeRAID/index.html
Analyzing SCSI ServeRAID Device Logs:
https://w3-03.ibm.com/services/technology/datacase/primary/docs/scadocs/pss/xSeries/ServeRAID/Ref_Log_Device.html
Predefined Action Plan for bad stripe table entries:
https://w3-03.ibm.com/services/technology/datacase/primary/docs/scadocs/pss/xSeries/ServeRAID/PD_Action-BST.html
Using Dumplog to collect SCSI ServeRAID logs:
http://www.ibm.com/systems/support/supportsite.wss/docdisplay?brandind=5000008&lndocid=MIGR-4UD223
Document Location
Worldwide
Document Information
Modified date:
29 January 2019
UID
ibm1MIGR-5072562