IBM Support

Improving EEH error handling for fiber adapters during link recovery of EMX0 fanout module

How To


Summary

Certain PCIe events that occur between the P8 or P9 system processor and the PCIe switch located in the fanout module result in that PCIe link being reset and recovered. Such an event is logged as platform event SRC B7006A22. When this event occurs, all I/O adapters in the affected fanout module perform EEH recovery separately. This document describes best practices for improving EEH detection and recovery effectiveness, and describes tunable values that can be used to affect recovery criteria. This document assumes that the server is configured with resource redundancy.

This document is applicable to both Virtual I/O Server (VIOS) and AIX. For implementation within VIOS, oem_setup_env must be used.

Objective

The goal of this discussion is to understand the options available to clients who use features such as EMX0 and want to maintain application availability when a fanout module reset happens.
  • First, we need to understand the behavior of the system during a fanout module reset.
  • Next, we need to understand the tuning values that can affect recovery of the fanout module slots, and their priorities.
  • Certain legacy fibre channel HBAs are slow to recognize an EEH event.  We need to understand those adapters and what can be done to improve the situation for those adapters.
  • Finally, we need to understand how AIX performs recovery of disk I/O operations that are in queue or in progress once the fanout module reset is completed or the EEH condition has been cleared.

Steps

Behavior during fanout module reset

An EMX0 fanout module reset typically logs an SRC of B7006A22. This indicates that the PCIe switch in one of the two fanout modules of the drawer was reset and reinitialized by the server. As a result of the reset, the fanout module slots also undergo a reset and reinitialization by the driver as part of AIX Enhanced Error Handling (EEH). During the reset, all I/O to the adapters affected by the reset is "frozen" or made inactive. When the fanout module reset completes, the I/O adapters are unfrozen and the I/O adapter driver begins reinitialization. Recovery then begins on outstanding and queued I/O operations. While a fanout module reset typically takes only a few seconds to complete, on rare occasions the time can be longer, affecting recovery times for the installed I/O adapters. The freeze begins at the hardware level when the fanout module begins resetting. EEH recovery starts separately for each affected slot because it is a function controlled by the driver for each adapter.
AIX begins EEH recovery for a given slot when the I/O driver detects that the slot either is going through or has gone through a reset.  Certain I/O adapter drivers detect an EEH event more quickly than others.  AIX exits EEH recovery when the I/O adapter driver has reinitialized the card after the EEH event.  The amount of time seen by AIX recovery is longer than the time taken to reset just the fanout module and I/O adapters.

Tuning for best EEH recovery (pcibus_eeh_perm_timeout)

If EEH recovery during a fanout module reset is taking too long, AIX has options that can affect the behavior of EEH recovery. The AIX tunable pcibus_eeh_perm_timeout sets the time, in seconds, that AIX allows for the recovery of an I/O adapter going through EEH. If the affected I/O adapter has not recovered by the time the pcibus_eeh_perm_timeout value is reached, AIX marks the I/O adapter as failed. The hardware can continue to undergo recovery internally, but AIX will not be able to use the recovered functions without a reboot of the partition, a DLPAR operation to remove and then re-add the affected I/O adapter, or a repair action affecting the adapter slot. In the case where a fanout module fails and the hardware does not recover, only a repair action can be used to recover the failed adapter.
The tunable pcibus_eeh_perm_timeout has a default value of 300 seconds or 5 minutes.  This value is intended to allow adequate time for a failed PCIe bus domain to undergo recovery and resume normal operations before failing the I/O adapters in the fanout module.  Typically there is no reason to increase this value, but some clients have expressed desire to reduce the value to better meet business objectives related to recovery scenarios.
Most fanout module resets take less than 10 seconds to complete. In this scenario, the setting for pcibus_eeh_perm_timeout is irrelevant because the adapters recover in a short period of time. If the actual recovery time increases from the short duration normally experienced to a longer duration, some I/O operations are delayed, and users can experience an impact. In certain circumstances, application-based cluster time limits could expire before the hardware has recovered, and the cluster could take an abnormal action, such as expulsion of the node, to recover at the cluster level. For this scenario, it might be necessary to restrict the time allowed for recovery of the PCI bus within AIX so that I/O adapter drivers can begin recovery of outstanding and queued operations earlier.
IBM has reviewed default settings for pcibus_eeh_perm_timeout.  As discussed, there can be scenarios where the value might be set too large for a given environment.  Although there is no lower limit defined for the setting, IBM does not normally recommend this value be set below 30 seconds.  Consider the following if values below 30 seconds are needed.
  1. AIX will mark the hardware in the fanout module as failed once the pcibus_eeh_perm_timeout expires.  The hardware will not be usable again until DLPAR, repair, or re-IPL of the LPAR.
  2. Adequate redundancy must exist to maintain the performance needed by the applications until a recovery or repair action can be taken.  Further degradation or operating system failure can occur if a subsequent event also exceeds the timeout value before the recovery or repair action for the first event has been performed.
  3. IBM recommends repairing and recovering a failed fanout module as soon as possible after a fanout module reset event.
  4. IBM recommends AIX 7.2 TL4 as the AIX level that provides the maximum benefit from improvements in operating system resiliency.  The use of values lower than 30s for pcibus_eeh_perm_timeout should not be attempted on AIX levels below AIX 7.1 TL4.
  5. Due to the time for hardware recovery under optimal circumstances, the pcibus_eeh_perm_timeout should not be set below 10s.
  6. IBM recommends performing any configuration changes in a test environment before applying in production.  Risk of unexpected system outages increases as the pcibus_eeh_perm_timeout is set below recommended values.
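
The guidance above can be summarized as a small decision sketch. This is only an illustration of the thresholds stated in this document (default 300 seconds, normally not below 30 seconds, never below 10 seconds). The helper names `assess_timeout` and `adapter_outcome` are hypothetical and are not part of AIX, and the treatment of a recovery that ends exactly at the timeout is an assumption.

```python
DEFAULT_TIMEOUT_S = 300  # default pcibus_eeh_perm_timeout stated in this document
ADVISED_MIN_S = 30       # IBM does not normally recommend values below this
HARD_FLOOR_S = 10        # this document advises never setting the value below 10s


def assess_timeout(timeout_s):
    """Classify a proposed pcibus_eeh_perm_timeout value (hypothetical helper)."""
    if timeout_s < HARD_FLOOR_S:
        return "too low"
    if timeout_s < ADVISED_MIN_S:
        return "below normal recommendation"
    return "within normal recommendation"


def adapter_outcome(recovery_s, timeout_s=DEFAULT_TIMEOUT_S):
    """Model whether AIX would still be able to use the adapter after an EEH event."""
    # Assumption: a recovery that ends exactly when the timeout is reached
    # is treated as not recovered in time.
    return "recovered" if recovery_s < timeout_s else "failed"


if __name__ == "__main__":
    for value in (5, 20, 30, 300):
        print(value, assess_timeout(value))
    # A typical reset of under 10 seconds recovers with the default timeout.
    print(adapter_outcome(8))
    # A 45-second recovery with a 30-second timeout leaves the adapter failed
    # until a DLPAR operation, repair action, or re-IPL of the LPAR.
    print(adapter_outcome(45, timeout_s=30))
```

The sketch does not read or change any system setting; it only restates the numeric guidance so that a proposed value can be checked before it is tried in a test environment.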
The pcibus_eeh_perm_timeout tunable can be displayed or set using the AIX ioo command, which is described in the IBM AIX documentation.

Legacy HBA EEH Detection

The AIX device driver for the legacy 8Gbps and slower FC HBAs detects an EEH event in only a limited number of locations within the driver. For the best possible good-path performance, the driver attempts to detect an EEH freeze only in certain error paths. When using an older FC adapter, there can be a delay between the actual EEH event and the time when the EEH event is detected by the AIX device driver. EEH detection is improved for these adapters in AIX 7.2 TL4.

The device driver for the newer 16Gbps or faster FC HBAs includes additional EEH detection code to allow it to detect EEH events more quickly.  An EEH event is normally detected by this device driver within 5 seconds.

The next section contains more information about SCSI error recovery and EEH detection and includes a possible strategy for causing an EEH error to be detected more quickly for legacy FC HBAs on older levels of AIX.

AIX I/O operation recovery after EEH completion

Any time that a SCSI command times out, the AIX FC device driver must perform certain error recovery steps before returning the command to the SCSI disk MPIO code to retry the command on an alternate path. The driver attempts three different recovery steps until one of them works or until they all fail. If all three recovery actions fail, the device driver performs a reset on the adapter. With the legacy 8Gbps and slower adapters, an EEH event is detected only when the device driver attempts the adapter reset. When there is an EEH freeze, the SCSI command times out and all three of the recovery actions time out as well. The SCSI command uses the rw_timeout attribute for its timer value; the recovery commands use an 8-second timer. So when rw_timeout is 30, EEH detection could be delayed by 54 seconds (30 seconds for rw_timeout plus three recovery action timeouts of 8 seconds each).
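
The delay arithmetic can be written out as a short worked example. This is a sketch of the calculation described in the paragraph above, not driver code; the function name `legacy_eeh_detection_delay` is hypothetical.

```python
RECOVERY_STEPS = 3           # recovery actions attempted before the adapter reset
RECOVERY_STEP_TIMEOUT_S = 8  # timer used by each recovery command, in seconds


def legacy_eeh_detection_delay(rw_timeout_s):
    """Worst-case seconds before a legacy FC HBA driver detects an EEH freeze.

    The SCSI command first times out after rw_timeout_s, then each of the
    three recovery actions times out in turn before the adapter reset is tried.
    """
    return rw_timeout_s + RECOVERY_STEPS * RECOVERY_STEP_TIMEOUT_S


if __name__ == "__main__":
    print(legacy_eeh_detection_delay(30))  # 30 + 3 * 8 = 54
```

Because the rw_timeout term is part of the sum, a larger rw_timeout value lengthens the detection delay by the same amount under this model.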

After the EEH event is detected, the AIX FC device driver begins the adapter EEH recovery process.   Only after the adapter is recovered (or after EEH recovery fails), does the FC device driver return the SCSI commands to the SCSI disk MPIO driver.  At that point the MPIO driver can retry the SCSI commands on an alternate path.

A method for making the legacy driver detect EEH events in a manner similar to the newer adapters is described here. With this method, the EEH detection delay can be shortened from approximately 54 seconds to the same 5-second delay seen with the newer adapters. Any outstanding I/Os are still held by the AIX FC device driver until the EEH recovery completes successfully or fails. The strategy for improving EEH detection is to run a program that periodically queries the adapter using an FC device driver I/O control operation that accesses the adapter hardware. That hardware access can detect an EEH freeze.

One such program, called fcstat2, is available in the PERFPMR set of tools. This package of tools can be downloaded from ftp://ftp.software.ibm.com/aix/tools/perftools/perfpmr .

It is recommended to run the tool with a 5-second interval option as follows:

nohup ./fcstat2 -A -m 5 &
This issues a very low-overhead I/O control operation to one active port of each adapter. Running the tool in this manner adds no noticeable overhead.
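
The general idea of periodic probing can be illustrated with a minimal sketch. This is not how fcstat2 is implemented, and it does not touch any adapter. The `probe` callable is a hypothetical stand-in for a hardware access that reports a freeze, and the 5-second default interval simply mirrors the interval recommended above.

```python
import time


def poll_until_freeze(probe, interval_s=5.0, max_polls=None, sleep=time.sleep):
    """Call probe() once per interval and return the poll number that saw a freeze.

    probe is a hypothetical callable that returns True when its hardware
    access detects a freeze. Returns None if max_polls polls complete
    without a detection.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if probe():
            return polls
        sleep(interval_s)
    return None


if __name__ == "__main__":
    # Simulated probe: the freeze becomes visible on the third poll.
    outcomes = iter([False, False, True])
    # A no-op sleep keeps the demonstration instant.
    detected_on = poll_until_freeze(lambda: next(outcomes), sleep=lambda s: None)
    print(detected_on)
```

With a fixed polling interval, a freeze is noticed at the first poll that follows it, which is why a 5-second interval bounds the detection delay at roughly 5 seconds in this model.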

Document Location

Worldwide

Document Information

Modified date:
01 August 2023

UID

ibm10959295