Use the following to perform SAS fabric problem isolation
for a PCI-X or PCIe controller.
Considerations:
- Remove power from the system before connecting and disconnecting
cables or devices, as appropriate, to prevent hardware damage or erroneous
diagnostic results.
- Some systems have SAS and PCI-X or PCIe bus interface logic integrated
onto the system boards and use a pluggable RAID enablement card (a
non-PCI form factor card) for these integrated-logic buses. See the
feature comparison tables for PCIe and PCI-X cards.
For these configurations, replacement of the RAID enablement card
is unlikely to solve a SAS-related problem because the SAS interface
logic is on the system board.
- Some systems have the disk enclosure or removable media enclosure
integrated in the system with no cables. For these configurations,
the SAS connections are integrated onto the system boards and a failed
connection can be the result of a failed system board or integrated
device enclosure.
- Some systems have SAS RAID adapters integrated onto the system
boards and use a Cache RAID - Dual IOA Enablement Card (for example,
FC5662) to enable storage adapter Write Cache and Dual Storage IOA
(HA RAID mode). For these configurations, replacement of the Cache
RAID - Dual IOA Enablement Card is unlikely to solve a SAS-related
problem because the SAS interface logic is on the system board. Additionally,
appropriate service procedures must be followed when replacing the
Cache RAID - Dual IOA Enablement Card because removal of this card
can cause data loss if incorrectly performed and can also result in
a non-Dual Storage IOA (non-HA) mode of operation.
- Some adapters, known as RAID and SSD adapters, contain SSDs that
are integrated on the adapter. See the feature comparison
tables for PCIe cards.
For these configurations, FRU replacement to solve SAS-related problems
is limited to replacing either the adapter or the integrated SSDs
because the entire SAS interface logic is contained on the adapter.
Attention: When SAS fabric problems exist, obtain
assistance from your hardware service provider before performing any
of the following actions:
- Obtain assistance before you replace a RAID adapter because the
adapter might contain nonvolatile write cache data and configuration
data for the attached disk arrays, and additional problems might be
created by replacing an adapter.
- Obtain assistance before you remove functioning disks in a disk
array because the disk array might become degraded or might fail,
and additional problems might be created if functioning disks are
removed from a disk array.
Attention: Removing functioning disks in a
disk array is not recommended without assistance from your hardware
service support organization. A disk array might become degraded or
might fail if functioning disks are removed, and additional problems
might be created.
Step 3150-2
The possible causes for SRN nnnn-3020 are:
- More devices are connected to the adapter than the adapter supports.
Change the configuration to reduce the number of devices below what
is supported by the adapter.
- A SAS device has been improperly moved from one location to another.
Either return the device to its original location or move the device
while the adapter is powered off or unconfigured.
- A SAS device has been improperly replaced by a SATA device. A
SAS device must be used to replace a SAS device.
The possible causes for SRN nnnn-FFFE are:
- One or more SAS devices were moved from a PCIe2 controller to
a PCI-X or PCIe controller. If the device was moved from a PCIe2
controller to a PCI-X or PCIe controller, the Detail Data section
of the hardware error log contains a reason for failure of Payload
CRC Error. For this case, the error can be ignored and the
problem is resolved if the devices are moved back to a PCIe2 controller
or if the devices are formatted on the PCI-X or PCIe controller.
- For all other causes, go to Step 3150-3.
When the problem is resolved, see the removal and replacement
procedures topic for the system unit on which you are working and
do the "Verifying the repair" procedure.
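The causes listed in this step can be kept in a small lookup table keyed by the SRN suffix. The following Python sketch is illustrative only: the names SRN_CAUSES, possible_causes, and is_moved_device_case are hypothetical, and the way the Payload CRC Error text appears in the Detail Data section is an assumption, so adjust the match to what your hardware error log actually shows.

```python
from typing import List

# Possible causes keyed by the SRN suffix (the part after "nnnn-").
# The wording paraphrases the lists in this step.
SRN_CAUSES = {
    "3020": [
        "More devices are connected to the adapter than the adapter supports",
        "A SAS device was improperly moved from one location to another",
        "A SAS device was improperly replaced by a SATA device",
    ],
    "FFFE": [
        "SAS devices were moved from a PCIe2 controller to a PCI-X or PCIe controller",
        "Other causes (continue with Step 3150-3)",
    ],
}


def possible_causes(srn: str) -> List[str]:
    """Return the causes listed for an SRN such as 'nnnn-3020'."""
    _, sep, suffix = srn.strip().upper().partition("-")
    if not sep:
        raise ValueError("expected an SRN of the form nnnn-xxxx")
    # SRNs that this step does not cover yield an empty list.
    return SRN_CAUSES.get(suffix, [])


def is_moved_device_case(detail_data: str) -> bool:
    """True if the Detail Data text mentions a Payload CRC Error.

    Assumption: the reason for failure appears as plain text in the log.
    """
    return "payload crc error" in detail_data.lower()


for cause in possible_causes("nnnn-3020"):
    print(cause)
```

If is_moved_device_case returns True for an nnnn-FFFE error, the moved-device case described above applies; otherwise continue with Step 3150-3.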
Step 3150-3
Determine if any of the disk arrays on the adapter are in a Degraded state
as follows:
- Start the IBM® SAS Disk Array Manager.
- Start Diagnostics and select Task Selection on
the Function Selection display.
- Select .
- Select .
- Select the identified in the hardware error log.
Does any disk array have a state of Degraded?
- No
- Go to Step 3150-5.
- Yes
- Go to Step 3150-4.
Step 3150-4
Other errors related to the disk array being in a Degraded state
should have occurred. Take action on these errors to replace the
failed disk and restore the disk array to an Optimal state.
When the problem is resolved, see the removal and replacement
procedures topic for the system unit on which you are working and
do the "Verifying the repair" procedure.
Step 3150-5
Have other errors occurred at the same time as this error?
- No
- Go to Step 3150-7.
- Yes
- Go to Step 3150-6.
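The yes/no branching in Steps 3150-3 and 3150-5 can be summarized as a transition table. The following Python sketch is a reading aid under stated assumptions, not part of the service procedure; TRANSITIONS and next_step are hypothetical names, and only the two decisions above are modeled.

```python
# (current step, answer to the step's yes/no question) -> next step
TRANSITIONS = {
    ("3150-3", False): "3150-5",  # no disk array is Degraded
    ("3150-3", True): "3150-4",   # at least one disk array is Degraded
    ("3150-5", False): "3150-7",  # no other errors at the same time
    ("3150-5", True): "3150-6",   # other errors occurred at the same time
}


def next_step(step: str, answer: bool) -> str:
    """Return the step to go to after answering the question in `step`."""
    try:
        return TRANSITIONS[(step, answer)]
    except KeyError:
        raise ValueError(f"no decision is modeled for step {step}") from None


print(next_step("3150-3", True))
print(next_step("3150-5", False))
```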
Step 3150-6
Take action on the other errors that have occurred at the same time as this error.
When the problem is resolved, see the removal and replacement
procedures topic for the system unit on which you are working and
do the "Verifying the repair" procedure.
Step 3150-8
Ensure that device, device enclosure, and adapter microcode levels are up to date.
Did you update to newer microcode levels?
- No
- Go to Step 3150-10.
- Yes
- Go to Step 3150-9.
Step 3150-9
When the problem is resolved, see the removal and replacement
procedures topic for the system unit on which you are working and
do the "Verifying the repair" procedure.
Step 3150-10
Identify the adapter SAS port associated with the problem by examining
the hardware error log. To view the hardware error log, complete the
following steps:
- Follow the steps in Examining the hardware error log and return here.
- Select the hardware error log to view. Under the Disk Information
heading of the log, the Resource field identifies the controller port
with which the error is associated.
Note: If you do not see the Disk Information heading in the error log,
obtain the Resource field from the Detail Data / PROBLEM DATA section
as illustrated in the following example:
Detail Data
PROBLEM DATA
0000 0800 0004 FFFF 0000 0000 0000 0000 0000 0000 1910 00F0 0408 0100 0101 0000
          ^
          |
          Resource is 0004FFFF
Go to Step 3150-11.
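The extraction shown in the example can be sketched in Python as follows. This is a minimal illustration that assumes the Resource field always occupies the third and fourth 4-digit groups of the PROBLEM DATA line, as it does in the example; the function name resource_from_problem_data is hypothetical.

```python
def resource_from_problem_data(problem_data: str) -> str:
    """Return the 8-hex-digit Resource field from a PROBLEM DATA line.

    Assumption: the Resource is the third and fourth 4-digit groups,
    matching the single example in this step.
    """
    words = problem_data.split()
    if len(words) < 4:
        raise ValueError("PROBLEM DATA line is too short")
    resource = (words[2] + words[3]).upper()
    if len(resource) != 8:
        raise ValueError("unexpected group width in PROBLEM DATA")
    int(resource, 16)  # raises ValueError if the text is not hexadecimal
    return resource


line = ("0000 0800 0004 FFFF 0000 0000 0000 0000 "
        "0000 0000 1910 00F0 0408 0100 0101 0000")
print(resource_from_problem_data(line))  # prints 0004FFFF
```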
Step 3150-11
Using the resource found in the previous step, see SAS resource
locations to understand how to identify the controller port to which
the device or device enclosure is attached.
For example, if the resource is 0004FFFF, port 04 on the adapter is
used to attach the device or device enclosure that is experiencing the
problem.
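The example mapping from resource to port can be sketched in Python. This illustration assumes that the port is the second byte (hex digits 3 and 4) of the resource, which matches the single example given here; the SAS resource locations topic remains the authoritative description, and the name port_from_resource is hypothetical.

```python
def port_from_resource(resource: str) -> str:
    """Return the two-hex-digit adapter port encoded in a resource value.

    Assumption: the port is the second byte of the 8-digit resource,
    as in the example where 0004FFFF maps to port 04.
    """
    resource = resource.strip().upper()
    if len(resource) != 8:
        raise ValueError("expected an 8-hex-digit resource")
    int(resource, 16)  # raises ValueError if the text is not hexadecimal
    return resource[2:4]


print(port_from_resource("0004FFFF"))  # prints 04
```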
The resource found in the previous step can also be used to identify
the device. To identify the device, try to match the resource with one
shown on the display that appears after you complete the following steps.
- Start the IBM SAS Disk Array Manager:
- Start the diagnostics program and select Task Selection from
the Function Selection display.
- Select .
- Select .
Step 3150-12
Because the problem persists, some corrective action is needed to
resolve the problem. Using the port or device information found in the
previous step, proceed by doing the following steps.
- Power off the system or logical partition.
- Perform only one of the following corrective actions, which are
listed in the order of preference. If one of the corrective actions
has been attempted, then proceed to the next action in the list.
Note: Prior to replacing parts, consider using a complete power down
of the entire system, including any external device enclosures, to
reset all possible failing components. This might correct the problem
without replacing parts.
- Power on the system or logical partition.
Note: In some situations,
it might be acceptable to unconfigure and reconfigure the adapter
instead of powering off and powering on the system or logical partition.
Step 3150-13
Does the problem still occur after performing the corrective action?
- No
- Go to Step 3150-14.
- Yes
- Go to Step 3150-12.
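The loop between Steps 3150-12 and 3150-13 (perform one corrective action, check whether the problem still occurs, then move to the next action in order of preference) can be sketched as follows. This Python illustration is hypothetical throughout: run_corrective_actions, the action names, and the problem_persists check are placeholders, because the actual corrective actions are defined by the service procedure and not by this sketch.

```python
from typing import Callable, Iterable, Optional, Tuple

Action = Tuple[str, Callable[[], None]]


def run_corrective_actions(
    actions: Iterable[Action],
    problem_persists: Callable[[], bool],
) -> Optional[str]:
    """Try actions in order of preference, one at a time.

    Returns the name of the first action after which the problem no
    longer occurs, or None if every action was tried without success.
    """
    for name, apply in actions:
        apply()  # stands in for: power off, perform the action, power on
        if not problem_persists():
            return name  # Step 3150-13 answer is No: go to Step 3150-14
    return None  # every listed action has been attempted


# Placeholder actions that only record that they ran.
attempts = []
actions = [
    ("action A (placeholder)", lambda: attempts.append("A")),
    ("action B (placeholder)", lambda: attempts.append("B")),
]
# Simulated check: the problem clears after the second action.
result = run_corrective_actions(actions, lambda: len(attempts) < 2)
print(result)
```

Trying exactly one action per pass keeps it clear which action resolved the problem.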
Step 3150-14
When the problem is resolved, see the removal and replacement
procedures topic for the system unit on which you are working and
do the "Verifying the repair" procedure.