IBM Support

FW920.20 and Newer Workaround for SRCB7006A8D and SRCB7006A8E EEH Errors

Troubleshooting


Problem

Clients with PCIe Gen3 I/O Expansion Drawer feature EMX0 can experience Enhanced Error Handling (EEH) errors or reset reloads on the I/O adapters in the installed fanout modules.  Internally, an informational system reference code (SRC) B7006A8D or B7006A8E is logged, indicating that a reset of the Field Programmable Gate Array (FPGA) occurred.

Symptom

SRC B7006A8D is logged when the FPGA in the PCIe3 cable adapter is reset.
SRC B7006A8E is logged when the FPGA in the PCIe3 6-slot fanout module is reset.
Most adapters in the PCIe3 6-slot fanout module recover automatically.  However, some adapters do not.  For more information, see I/O adapters without Enhanced Error Handling Support.
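
As an illustration of the SRC-to-component mapping above, the following minimal Python sketch scans a piece of log text for the two codes.  The function name classify_src and the sample text are hypothetical; only the two SRC values and their meanings are taken from this document.

```python
import re

# SRC meanings as stated in the Symptom section of this document.
FPGA_RESET_SRCS = {
    "B7006A8D": "FPGA reset in the PCIe3 cable adapter",
    "B7006A8E": "FPGA reset in the PCIe3 6-slot fanout module",
}

# Match either SRC as a whole token, case-insensitively.
_SRC_PATTERN = re.compile(r"\b(B7006A8[DE])\b", re.IGNORECASE)


def classify_src(text):
    """Return (src, meaning) pairs for each FPGA-reset SRC found in text."""
    found = []
    for match in _SRC_PATTERN.finditer(text):
        src = match.group(1).upper()
        found.append((src, FPGA_RESET_SRCS[src]))
    return found


# Hypothetical log line, for illustration only.
sample = "informational event: SRC B7006A8E logged"
hits = classify_src(sample)
```

Any other SRC is ignored by this sketch, so it is not a substitute for reviewing the full error log.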

Cause

Both of these SRCs create a condition where the PCIe3 switch located in the fanout module must be reset in order to recover from the error.  The I/O adapters in the fanout module are then reset to recover.

An FPGA reset is performed when the checksum of the SRAM contained in the FPGA does not match that of the source logic program contained in flash memory.  The mismatch is due to a transient soft-error in the chip.  A soft-error is fully recoverable and is not a hardware error in the chip.  Therefore, no hardware is replaced for this condition.
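
The check described above can be pictured with a short Python sketch.  It is only a conceptual model: this document does not state which checksum the FPGA uses, so CRC-32 (zlib.crc32) is an assumption, and needs_reset, golden_image, and sram_contents are hypothetical names.

```python
import zlib


def needs_reset(golden_image: bytes, sram_contents: bytes) -> bool:
    """Return True when the SRAM checksum differs from the flash image's."""
    return zlib.crc32(sram_contents) != zlib.crc32(golden_image)


# Stand-in for the source logic program held in flash memory.
golden_image = bytes(range(256))

# Simulate a transient soft-error: flip one bit in the SRAM copy.
sram_contents = bytearray(golden_image)
sram_contents[10] ^= 0x01
upset_detected = needs_reset(golden_image, bytes(sram_contents))

# Reloading SRAM from flash clears the mismatch, which is why the
# condition is recoverable without replacing hardware.
sram_contents = bytearray(golden_image)
recovered = not needs_reset(golden_image, bytes(sram_contents))
```

In this model the reload restores a matching checksum, which mirrors the statement that a soft-error is fully recoverable.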

SRAM contains the programming for the FPGA chip.  For the PCIe3 cable adapter and for the EMXG PCIe 6-slot fanout module, the FPGA is used only for monitoring and control functions and for carrying the PCIe reference clock needed by the fanout module.  No customer data is handled or affected by the function of the FPGA.

Environment

This issue can affect POWER8 and POWER9 servers having feature EMX0 PCIe3 module expansion drawer with PCIe3 6-slot fanout module features EMXF or EMXG.  Design enhancements have been made in feature EMXH and compatible PCIe3 cable adapters for POWER9 to prevent impact from soft-errors in the FPGA. 
The procedure documented here can be used only on POWER9 servers running server firmware release FW920, at level FW920.20 (Vx920_075) or newer.
Do not use this document for any other systems or firmware levels.
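
The firmware restriction above can be expressed as a small check.  This is a minimal Python sketch under stated assumptions: the level string is assumed to have the form shown in this document (Vx920_075, where x is a single letter), fix levels within a release are assumed to increase numerically, and the function name workaround_applies is hypothetical.

```python
import re

# Assumed format: "V", one letter, three-digit release, "_", three-digit level.
_LEVEL_PATTERN = re.compile(r"^V[A-Z](\d{3})_(\d{3})$")


def workaround_applies(level: str) -> bool:
    """Return True for release 920 at fix level 075 or newer."""
    match = _LEVEL_PATTERN.match(level.strip().upper())
    if match is None:
        raise ValueError("unrecognized firmware level: %r" % level)
    release = int(match.group(1))
    fix_level = int(match.group(2))
    # This document covers release FW920 only, at FW920.20 (075) or newer.
    return release == 920 and fix_level >= 75
```

For example, workaround_applies("VL920_075") is true, while a level from another release or an older FW920 level is rejected.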

Resolving The Problem

A temporary workaround to prevent the FPGA reset is available.  This workaround must be performed after each server platform IPL, and also after concurrent maintenance where a component of the EMX0 expansion drawer is replaced.  These sections describe how to use the Hardware Management Console (HMC) Enhanced User Interface, the Advanced System Management Interface (ASMI), and the restricted shell command line to enable the workaround.

This procedure turns off the ability of the FPGA to reset.  It does not prevent the soft-errors that trigger the reset.  There is a slight risk that a soft-error causes a hardware failure to be reported.  If that happens, the failure cannot be traced back to the soft-error.
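
The reapplication rule stated at the start of this section can be modeled as a simple state check.  The following Python sketch is an interpretation, not a description of the firmware: it assumes that the workaround is no longer in effect after a platform IPL or after EMX0 concurrent maintenance, and the event names and the function workaround_active are hypothetical.

```python
# Hypothetical event names; only the rule itself comes from this document.
APPLIED = "workaround_applied"
RESET_EVENTS = {"platform_ipl", "emx0_concurrent_maintenance"}


def workaround_active(events):
    """Return True if the workaround is still in effect after the events."""
    active = False
    for event in events:
        if event == APPLIED:
            active = True
        elif event in RESET_EVENTS:
            # The procedure must be performed again after these events.
            active = False
    return active
```

A history that ends with a platform IPL or an EMX0 component replacement returns False, which signals that the procedure needs to be repeated.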

FW920.20 and Newer Workaround by using the HMC Enhanced User Interface

  1. Under Resources -> All Systems, select the server to enable the workaround on.
  2. In the navigator, find the serviceability section and click Serviceability.
  3. In the Serviceability menu, under View and Collect, select Manage Dumps.
  4. On the Manage Dumps dialog, verify the server at the top, then select Action -> Initiate Resource Dump.  Do not attempt to use any other options from this menu.
  5. Enter "xmsvc -DISABLECCSER" in the resource selector field.
    • Figure: Resource Dump Example
  6. Click the OK button to disable FPGA resets for the server.  If the system indicates that the dump request was successfully initiated, no further action is needed.
Successful completion of the request indicates that FPGA resets are disabled.  Repeating the procedure does not have any further effect.  The dump does not need to be looked at to confirm success of the procedure.
***** THIS COMPLETES THE PROCEDURE FOR THE HMC INTERFACE *****
 


FW920.20 and Newer Workaround by using the Advanced System Management Interface (ASMI)

  1. Log in to ASMI with admin or greater authority.
  2. In the navigator, expand System Service Aids, then select Resource Dump.
  3. Enter "xmsvc -DISABLECCSER" in the resource selector field.
  4. Click the Initiate Resource Dump button.  If the request is successful, a confirmation message is displayed.

***** THIS COMPLETES THE PROCEDURE FOR THE ASMI INTERFACE *****

FW920.20 and Newer Workaround by using HMC Restricted Shell

  1. Log in to HMC Restricted Shell via SSH, PuTTY, or from Restricted Shell on the HMC GUI with admin level authority.
  2. Use the command "lssyscfg -r sys -Fname,serial_num" to identify the system name in the first field of output.
  3. Run the "startdump" command as follows, replacing {managed server} with the server name identified previously:
       startdump -m {managed server} -t resource -r "xmsvc -DISABLECCSER"
  4. If the command completes successfully, no further action is required.  FPGA resets are now disabled for the selected server.

***** THIS COMPLETES THE PROCEDURE FOR THE HMC RESTRICTED SHELL INTERFACE *****
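
When several servers are managed by one HMC, the restricted shell steps above can be prepared from a workstation.  The following Python sketch only builds the command text; it does not contact an HMC.  The helper names, the sample system names, and the sample output format are hypothetical, and system names are assumed not to contain commas.  The lssyscfg and startdump invocations are the ones given in this document.

```python
import shlex

# Resource selector given in this document.
DISABLE_SELECTOR = "xmsvc -DISABLECCSER"


def parse_system_names(lssyscfg_output):
    """Extract names from 'lssyscfg -r sys -Fname,serial_num' output."""
    names = []
    for line in lssyscfg_output.splitlines():
        line = line.strip()
        if line:
            # The system name is the first comma-separated field.
            names.append(line.split(",", 1)[0])
    return names


def build_startdump_command(managed_system):
    """Return the startdump command line for one managed system."""
    return "startdump -m {} -t resource -r {}".format(
        shlex.quote(managed_system), shlex.quote(DISABLE_SELECTOR)
    )


# Hypothetical output with two managed systems.
sample_output = "sysA,1234567\nsysB,7654321\n"
commands = [build_startdump_command(n) for n in parse_system_names(sample_output)]
```

Each resulting string can be pasted into the restricted shell session.  shlex.quote wraps the selector in single quotes, which the shell treats the same way as the double quotes shown in step 3.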

Document Location

Worldwide


Document Information

Modified date:
07 December 2021

UID

ibm16114076