How To
Summary
Certain PCIe events that occur between the P8 or P9 system processor and the PCIe switch located in the fanout module result in that PCIe link being reset and recovered. The event is logged as platform event SRC B7006A22. When this event occurs, all I/O adapters in the affected fanout module perform EEH recovery separately. This document describes best practices for improving EEH detection and recovery effectiveness, and describes tunable values that can be used to affect recovery criteria. This document expects the server to be configured with resource redundancy.
This document is applicable to both Virtual I/O Server (VIOS) and AIX. For implementation within VIOS, oem_setup_env must be used.
Objective
- First, we need to understand the behavior of the system during a fanout module reset.
- Next, we need to understand the tuning values that can affect recovery of the fanout module slots, and their priorities.
- Certain legacy fibre channel HBAs are slow to recognize an EEH event. We need to understand those adapters and what can be done to improve the situation for those adapters.
- Finally, we need to understand how AIX performs recovery of disk I/O operations that are in queue or in progress once the fanout module reset is completed or the EEH condition has been cleared.
Steps
Behavior during fanout module reset
Tuning for best EEH recovery (pcibus_eeh_perm_timeout)
- AIX will mark the hardware in the fanout module as failed once the pcibus_eeh_perm_timeout expires. The hardware will not be usable again until DLPAR, repair, or re-IPL of the LPAR.
- Adequate redundancy must exist to maintain the performance needed by the applications until a recovery or repair action can be taken. Further degradation or operating system failure can occur if a subsequent event also exceeds the timeout value before the recovery or repair action for the first event has been performed.
- IBM recommends repairing and recovering a failed fanout module as soon as possible after a fanout module reset event.
- IBM recommends AIX 7.2 TL4 to attain the maximum benefit from improvements in operating system resiliency. Values lower than 30s for pcibus_eeh_perm_timeout should not be attempted on AIX levels below AIX 7.1 TL4.
- Because hardware recovery takes time even under optimal circumstances, pcibus_eeh_perm_timeout should not be set below 10s.
- IBM recommends performing any configuration changes in a test environment before applying in production. Risk of unexpected system outages increases as the pcibus_eeh_perm_timeout is set below recommended values.
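The limits in the list above can be captured in a small pre-check before a value is proposed for testing. The following is a minimal sketch only: check_eeh_timeout is a hypothetical helper name, not an AIX command, and it classifies a proposed value without reading or changing any system setting.

```shell
#!/bin/sh
# check_eeh_timeout is a hypothetical helper (an assumption for illustration).
# It classifies a proposed pcibus_eeh_perm_timeout value, in seconds, against
# the two limits stated in this document: 10s minimum, and 30s as the floor
# for AIX levels below AIX 7.1 TL4. It does not modify the system.
check_eeh_timeout() {
    # Accept only a non-empty string of digits.
    case $1 in
        ''|*[!0-9]*)
            echo "invalid: expected a whole number of seconds"
            return 1
            ;;
    esac

    if [ "$1" -lt 10 ]; then
        echo "reject: below the 10s minimum"
    elif [ "$1" -lt 30 ]; then
        echo "caution: below 30s, not for AIX levels below AIX 7.1 TL4"
    else
        echo "ok: 30s or higher"
    fi
}

check_eeh_timeout 20   # prints "caution: below 30s, not for AIX levels below AIX 7.1 TL4"
```

A value that passes this check should still be exercised in a test environment before it is applied in production.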
Legacy HBA EEH Detection
The AIX device driver for the legacy 8Gbps and slower FC HBAs detects an EEH event in a limited number of locations within the driver. For the best possible good path performance, the driver only attempts to detect an EEH freeze in certain error paths. When using an older FC adapter, there can be a delay between the actual EEH event and the time when the EEH event is detected by the AIX device driver. EEH detection is improved for these adapters in AIX 7.2 TL4.
The device driver for the newer 16Gbps or faster FC HBAs includes additional EEH detection code to allow it to detect EEH events more quickly. An EEH event is normally detected by this device driver within 5 seconds.
The next section contains more information about SCSI error recovery and EEH detection and includes a possible strategy for causing an EEH error to be detected more quickly for legacy FC HBAs on older levels of AIX.
AIX I/O operation recovery after EEH completion
Any time that a SCSI command times out, the AIX FC device driver must perform certain error recovery steps before returning the command to the SCSI disk MPIO code to retry the command on an alternate path. The driver attempts three different recovery steps until one of them works or until they all fail. If all three recovery actions fail, the device driver performs a reset on the adapter. With the legacy 8Gbps and slower adapters, an EEH event is only detected once the device driver attempts the adapter reset. When there is an EEH freeze, the SCSI command times out and all three of the recovery actions time out as well. The SCSI command uses the rw_timeout attribute for its timer value; the recovery commands use an 8 second timer. So when rw_timeout is 30, EEH detection could have a 54 second delay (30 seconds rw_timeout plus 3 times 8 recovery action timeouts).
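The worst-case detection delay described above is the rw_timeout value plus three 8-second recovery timeouts. A minimal sketch of that arithmetic follows; eeh_detect_delay is a hypothetical helper name, not an AIX command, and the constants are taken from the description above.

```shell
#!/bin/sh
# Values from the text: three recovery steps, each with an 8 second timer.
RECOVERY_STEPS=3
RECOVERY_TIMEOUT=8

# eeh_detect_delay is a hypothetical helper (an assumption for illustration).
# $1 = rw_timeout of the disk, in seconds.
# Prints the approximate worst-case EEH detection delay for a legacy FC HBA.
eeh_detect_delay() {
    echo $(( $1 + RECOVERY_STEPS * RECOVERY_TIMEOUT ))
}

eeh_detect_delay 30   # prints 54
```

On AIX, the rw_timeout value of a disk can typically be read with lsattr, for example lsattr -El hdisk0 -a rw_timeout, where hdisk0 is a placeholder device name.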
After the EEH event is detected, the AIX FC device driver begins the adapter EEH recovery process. Only after the adapter is recovered (or after EEH recovery fails), does the FC device driver return the SCSI commands to the SCSI disk MPIO driver. At that point the MPIO driver can retry the SCSI commands on an alternate path.
A method for making the legacy driver detect EEH events in a manner similar to the newer adapters is described here. With this method, EEH detection can be shortened from approximately 54 seconds to the same 5-second delay seen with the newer adapters. Any outstanding I/Os are still held by the AIX FC device driver until the EEH recovery completes successfully or fails. The strategy for improving EEH detection is to run a program that periodically queries the adapter using an FC device driver I/O control operation that accesses the adapter hardware. That hardware access can detect an EEH freeze.
One such program, called fcstat2, is available in the PERFPMR set of tools. This package of tools can be downloaded from ftp://ftp.software.ibm.com/aix/tools/perftools/perfpmr .
It is recommended to run the tool with a 5-second interval option as follows:
nohup ./fcstat2 -A -m 5 &
This will issue a very low overhead I/O control operation to one active port of each adapter. Running the tool in this manner adds no noticeable overhead.
Document Location
Worldwide
Document Information
Modified date:
01 August 2023
UID
ibm10959295