Troubleshooting
Problem
When a Spectrum Virtualize Storage Product (such as SVC, V7K, FS9100) node is rebooted, half the paths to the disks should be available via a redundant SVC node/controller. However, for large configurations with many AIX LPARs or clustered configurations with extremely high I/O loads, there are times when all paths to the storage may become inaccessible. This is true whether the node reboot was planned (as when the storage firmware is updated) or unplanned (as in a node assert on the storage). In rare cases, any interruption on one or more paths to the storage (such as a VIO reboot or some issue at the switch which causes path(s) to fail) may produce the issue.
The problem has only been seen with the following configurations:
AIX (or VIOS) 4 Gb or 8 Gb or 16 Gb fibre adapters --> Broadcom fibre channel switches --> Spectrum Virtualize Product Storage Arrays
This is true for:
1. AIX (all versions)
2. VIOS/PowerVM (all versions)
3. Spectrum Virtualize Storage Products (all versions of SVC/V7K/FS9100 firmware)
4. All 4 Gb, 8 Gb, 16 Gb physical adapters
5. All AIX NPIV (virtual) adapters
6. All 8 Gb and 16 Gb Broadcom fibre channel switches
Symptom
1. Multiple hosts (not necessarily ALL hosts) impacted at the same time. If these symptoms occur on only a single host, the issue is unlikely to be related to the Broadcom defect (detailed below).
2. Loss of all paths to LUNs when, in fact, some paths 'should have' been
available. *
3. Extremely slow performance
4. Applications or LPARs may hang
5. Possible application or database crashes
6. Some AIX crashes, where the stack trace will indicate a crash due to
database or cluster DMS timeouts
7. Large numbers of disk/path/FC-fscsi errors in the AIX error logs -
indicating that the hosts are in extended error recovery on these paths.
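One way to get a quick view of symptom 7 is to tally error-log entries per resource. The following is a minimal sketch, not an IBM tool. It assumes the default errpt summary layout (IDENTIFIER, TIMESTAMP, T, C, RESOURCE_NAME, DESCRIPTION) has been saved to text; the function name count_errors_by_resource and the prefix list are hypothetical choices made for this illustration.

```python
from collections import Counter

# Resource-name prefixes of interest (disks, FC adapters, FC protocol devices).
# This list is an assumption for illustration; adjust it for your environment.
PREFIXES = ("hdisk", "fcs", "fscsi")


def count_errors_by_resource(errpt_text, prefixes=PREFIXES):
    """Count errpt summary lines per resource name for the given prefixes."""
    counts = Counter()
    for line in errpt_text.splitlines():
        fields = line.split(None, 5)
        # Expected columns: IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
        if len(fields) < 5 or fields[0] == "IDENTIFIER":
            continue  # skip the header row and anything too short to parse
        resource = fields[4]
        if resource.startswith(prefixes):
            counts[resource] += 1
    return counts
```

Running the same tally on several hosts and comparing the results can help show whether multiple hosts were impacted at the same time (symptom 1). The counts alone do not confirm the switch defect.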
On the storage side, SVC/V7K/FS9100 support will see a number of SCSI check conditions and ACA active conditions where the ACAs remain active/uncleared for an extended period of time (10 seconds or longer).
* By design, AIX MPIO will never mark the last path as "Failed." However, if
you see something like 7 of 8 paths failed, AND the disk is not accessible
via the last remaining path, it's safe to say the last path has also failed.
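The "7 of 8 paths failed" check in the note above can be sketched as a small parser. This is an illustration under stated assumptions: it expects the default three-column lspath output (status, device name, parent adapter) saved to text, and the function name suspect_disks is hypothetical. It only flags candidates; it cannot tell whether the disk is actually accessible via the remaining path.

```python
from collections import defaultdict


def suspect_disks(lspath_text):
    """Return (disk, failed_count, total_paths) for disks with one non-Failed path.

    AIX MPIO never marks the last path as Failed, so a disk with at least one
    Failed path and exactly one path in any other state is worth checking by hand.
    """
    states = defaultdict(list)
    for line in lspath_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # not a "status name parent" line
        status, disk = fields[0], fields[1]
        states[disk].append(status)

    suspects = []
    for disk, statuses in sorted(states.items()):
        failed = sum(1 for s in statuses if s == "Failed")
        if failed >= 1 and len(statuses) - failed == 1:
            suspects.append((disk, failed, len(statuses)))
    return suspects
```

Any disk returned by this sketch still needs a manual accessibility test before concluding that the last path has also failed.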
Cause
The problem arises due to a Broadcom (formerly Brocade) switch firmware
defect. The defect was discovered in the June/July 2019 timeframe. Details
of the defect are:
The defect applies only to Broadcom Gen 4 (8 Gb) and Gen 5 (16 Gb) switches.
Broadcom has stated that Gen 6 (32 Gb) switches are not subject to this defect.
It applies to ALL switch FOS (firmware) versions on the 8 Gb and 16 Gb
switches.
A high-level description of the defect is:
A trap is enabled when there is a zone miss for certain commands such as
PLOGI (Port Login). The trap is designed to trap these commands and forward them to the switch's CPU for processing.
This trap should only be active for a few milliseconds and trap a relatively
small number of commands. During the time the trap is active all task
management** commands within the scope of the trap will be forwarded to the CPU.
The defect occurs when a port comes online while a trap is active. Some of
these ports that are coming online can end up getting added to the scope of
the trap setup. Once they are added they remain within the trap setup. Over
time additional ports can get added to the trap setup adding to the scope of
the trap each time the trap becomes active. The trap setup only gets reset
when the ASICs are reinitialized which requires a reboot of the switch. ***
The net effect of this defect is that, when the trap becomes active for a zone miss, the number of frames that are trapped and sent to the switch CPU rises beyond the original design point, because a large number (most) of those frames should -not- have been trapped.
This results in overrunning the switch throttling mechanism which causes
fibre channel frames to be discarded. That, in turn, requires host error
recovery to recover the condition. **
** It's critical to note that AIX and VIO servers use SCSI task management
commands for error recovery on failed paths.
*** IBM found that sometimes it's possible to reboot the LPARs in succession to recover/clear the condition on the switches. IBM also found that if all LPARs were shut down, the condition in the switches would be cleared.
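The mechanism described in this section can be illustrated with a toy model. This is not FOS logic. The class name ToySwitch, the throttle limit, and the frame counts are all invented for illustration, and the model simplifies the description by adding every port that comes online during a trap to the scope (the defect description says only some are added).

```python
class ToySwitch:
    """Toy model of trap-scope growth; all names and numbers are hypothetical."""

    def __init__(self, cpu_frame_limit):
        self.cpu_frame_limit = cpu_frame_limit  # invented throttle limit per trap
        self.leaked_ports = set()               # ports wrongly stuck in the trap scope

    def trap_event(self, intended_ports, ports_coming_online, frames_per_port):
        """Simulate one trap activation; return (frames_to_cpu, frames_discarded)."""
        # Defect: ports that come online while the trap is active join its
        # scope and stay there for every later activation.
        self.leaked_ports |= set(ports_coming_online)
        scope = set(intended_ports) | self.leaked_ports
        trapped = len(scope) * frames_per_port
        # Frames beyond the throttle limit are discarded, which forces the
        # hosts into error recovery.
        discarded = max(0, trapped - self.cpu_frame_limit)
        return trapped - discarded, discarded

    def reboot(self):
        """Reinitializing the ASICs (a switch reboot) clears the accumulated scope."""
        self.leaked_ports.clear()
```

In this model the number of discarded frames grows with each trap activation during which ports come online, and returns to zero only after reboot() is called, which mirrors the accumulation over time described above.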
Environment
AIX or VIOS hosts (all AIX levels - all VIOS levels)
Physical OR virtual adapters (16 Gb or slower)
Broadcom (formerly Brocade) FC switches (4 Gb or 8 Gb or 16 Gb)
Spectrum Virtualize Products Storage (all firmware versions)
Diagnosing The Problem
1. Symptoms seen as described above.
2. There is exactly one way for Broadcom to know whether a customer has
encountered this defect: a tracedump and supportsave data MUST be collected on the switches WHILE THE PROBLEM IS HAPPENING and BEFORE any corrective actions are taken. It is therefore critical to collect switch data during the problem.
From the command line on the switches:
tracedump -n
supportsave
NOTE: Output is LONG. Use a connection (such as PuTTY) that allows full capture of all data scrolling on the screen!
Resolving The Problem
The fix for this issue is in a switch firmware release from Broadcom:
FOS v8.2.2b
Customers below that level may request a CVR from Broadcom. HOWEVER, they (Broadcom) will only provide the CVR fix if/when they have been able to confirm that a customer actually encountered this defect.
After supportsave data is collected and service to the hosts has been restored, the supportsave data can be provided to Broadcom. This is done by opening a hardware case against the switches with the appropriate switch vendor (i.e., if the customer purchased their switches from IBM, open a hardware case with IBM; if the customer purchased their switches through a third party, open a hardware case with that third party).
Request the switch vendor to send the supportsave data to Broadcom for verification.
CVR details are:
The CVR is v8.1.2g_cvr_812883_01 and the defect number is 812883.
Again, the permanent fix is in FOS v8.2.2b. Updating switches to this firmware version fixes the problem.
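Checking whether a switch is below the fixed level can be sketched as a version comparison. This is a minimal sketch that assumes FOS version strings of the form v<major>.<minor>.<patch><letter> and that versions sort component by component, so any later release is assumed to contain the fix. The function names parse_fos and needs_fix are hypothetical. The sketch ignores anything after the letter, and it does not check whether a given switch model can run the fixed release.

```python
import re

FIXED = "v8.2.2b"  # fixed level named in this document

_PATTERN = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)([a-z]?)")


def parse_fos(version):
    """Split a FOS version string such as 'v8.2.2b' into a sortable tuple."""
    match = _PATTERN.match(version.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized FOS version: {version!r}")
    major, minor, patch, letter = match.groups()
    return int(major), int(minor), int(patch), letter


def needs_fix(version, fixed=FIXED):
    """Return True if the version sorts below the fixed level."""
    if "_cvr_812883" in version.lower():
        return False  # CVR build for defect 812883, as named in this document
    return parse_fos(version) < parse_fos(fixed)
```

For example, needs_fix("v8.1.2g") returns True, while needs_fix("v8.1.2g_cvr_812883_01") and needs_fix("v8.2.2b") return False.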
Related Information
Document Location
Worldwide
Product Synonym
AIX; SVC; V7K, FS9100, Spectrum Virtualize
Document Information
Modified date:
03 May 2021
UID
ibm11098987