IBM Support

CPU x VRD fault with C-States enabled - IBM System x and BladeCenter

Troubleshooting


Resolving The Problem

Source

RETAIN tip: H207008

Symptom

The server begins to power on and then suddenly powers off, powers off unexpectedly, or does not power on, with one (1) or more of the following observed conditions (product dependent):

LightPath display panel:

  • Checkpoint alternates between FA and XX (where "XX" is another set of characters).
  • BRD LightPath LED may be illuminated.
  • Fault Light Emitting Diode (LED) may be illuminated.

Integrated Management Module (IMM) event:

  • Sensor 'CPU x VRD' has transitioned to non-recoverable (where x can be 1, 2, 3, or 4).
  • Sensor 'CPU CACHE VRD' has transitioned to non-recoverable.
  • Sensor 'system board Fault' has transitioned to critical from a less severe state.
  • Sensor 'CPU x y VIO' has transitioned to non-recoverable (where x, y can be 2, 3 or 3, 4, as seen on the IBM System x3850 X5 and x3950 X5).

Advanced Management Module (AMM) event:

  • system board voltage fault

*** IMPORTANT (when symptom is encountered) ***

Do not cycle the power, or reseat blades and attempt to power on, without Product Engineering approval.

Affected Configurations

The system can be any of the following IBM servers:

  • BladeCenter HS22V, Type 1949, any model
  • BladeCenter HS22V, Type 7871, any model
  • BladeCenter HX5, Type 1909, any model
  • BladeCenter HX5, Type 1910, any model
  • BladeCenter HX5, Type 7872, any model
  • BladeCenter HX5, Type 7873, any model
  • System x3550 M3, Type 4254, any model
  • System x3550 M3, Type 7944, any model
  • System x3650 M3, Type 4255, any model
  • System x3650 M3, Type 7945, any model
  • System x3850 X5, Type 7143, any model
  • System x3850 X5, Type 7145, any model
  • System x3850 X5, Type 7146, any model
  • System x3850 X5, Type 7191, any model
  • System x3950 X5, Type 7143, any model
  • System x3950 X5, Type 7145, any model

This tip is not software specific.

This tip is not option specific.

The system has the symptom described above.

Solution

If the server fails to power on and continues to exhibit the documented symptoms, replace the micro-processor board or system board.

Contact the IBM Service Provider or the appropriate Support Center for the corresponding geography:

For instance, in the U.S., contact 800-IBM-SERV at 800-426-7378.

Note: The IBM Directory of Worldwide Contacts is available from the following URL:

http://www.ibm.com/planetwide/

Workaround

To help prevent the symptoms from occurring:

  1. When possible, reduce the number of AC and DC power cycles. For the x3850 X5 and HX5, also reduce the number of restarts when possible.
  2. Prevent the 'intel_idle' driver from loading if one of the operating systems described in the following note is, or will be, in use:

    Some versions of Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES) distributions have a built-in driver ('intel_idle') which by default ignores any C-state limits set in the Basic Input/Output System (BIOS) or Unified Extensible Firmware Interface (UEFI).

    Add the following kernel parameter to the bootloader configuration file, or edit the existing entry, to prevent the 'intel_idle' driver from loading and to use the UEFI setting for the C-state limit:

    intel_idle.max_cstate=0

    For more details, refer to RETAIN Tip H207000 (MIGR-5091901) at the following URL:
    http://www.ibm.com/support/entry/portal/docdisplay?lndocid=MIGR-5091901

  3. Update the Unified Extensible Firmware Interface (UEFI) per the following product list:

    Note: Updating UEFI firmware by itself will not change the Advanced Configuration and Power Interface (ACPI) C-state limit needed to resolve this issue (see the Additional Information section). To ensure that the ACPI C-state limit is set correctly after updating UEFI to the new level, either load the defaults or set the proper ACPI C-state limit by continuing with the workaround actions.

    • IBM BladeCenter HS22V (1949, 7871): Version 1.19 Build ID: P9E158A
    • IBM BladeCenter HX5 (1909, 1910, 7872, 7873): Version 1.77 Build ID: HIE177A
    • IBM System x3850 X5 (7143, 7145, 7146, 7191): Version 1.77 Build ID: G0E177A
    • IBM System x3550 M3 (4254, 7944): Version 1.17 Build ID: D6E159A
    • IBM System x3650 M3 (4255, 7945): Version 1.17 Build ID: D6E159A

      The file is available by selecting the appropriate Product Group, type of System, Product name, Product machine type, and Operating system on IBM Support's Fix Central web page, at the following URL: http://www.ibm.com/support/fixcentral/

  4. Set the ACPI C-state limit to C2 using one (1) of the following two (2) methods:

    Method 1 - F1 Setup:
    1. Enter UEFI Setup by pressing F1 after the IBM System x Server Firmware logo screen appears when the system is powered on or restarted.
    2. Select System Settings --> Operating Modes.
    3. Change the Operating Mode to Custom Mode.
    4. Select System Settings --> Processors.
    5. Set the ACPI C-state Limit to ACPI C2.
    6. Press Escape (Esc) three (3) times, press 'Y' to save the settings and restart the server.

      Note: Servers that do not have the 'ACPI C-state Limit' menu selection effectively have the ACPI C-state limited to ACPI C2 by default. Although no action is needed in F1 Setup, the 'intel_idle' driver should be prevented from running if applicable as described previously.

    Method 2 - Advanced Settings Utility (ASU):

    1. Install IBM ASU locally (alternatively, run ASU remotely).
      http://www.ibm.com/support/entry/portal/docdisplay?lndocid=tool-asu

    2. Execute the following ASU commands from a command prompt:
      1. asu64 set UEFI.OperatingModes "Custom Mode"
        Some systems may use an alternate ASU command:
        • asu64 set OperatingModes.ChooseOperatingMode "Custom Mode"

      2. asu64 set UEFI.PackageCState "ACPI C2"
        Some systems may use an alternate ASU command:
        • asu64 set Processors.PackageACPIC-StateLimit "ACPI C2"

      Note: If the Workaround steps are performed on a UEFI version earlier than those listed in Step 3 and the UEFI default settings are ever reloaded, the Workaround steps will have to be repeated.
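
The two 'asu64 set' commands in Method 2, together with their alternate setting names, could be wrapped in a small script along the following lines. This is a sketch under assumptions, not an IBM-provided tool: it assumes that asu64 is on the PATH and that it returns a non-zero exit status when a setting name is not recognized, which should be confirmed on the target system. The ASU variable and the function names asu_set_with_fallback and apply_c2_limit are hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumes asu64 exits non-zero when a setting name is not
# recognized on this system; verify that before relying on the fallback.
# The ASU variable and the function names are hypothetical helpers.
ASU="${ASU:-asu64}"

# Usage: asu_set_with_fallback <primary_name> <alternate_name> <value>
# Tries the primary setting name first, then the alternate one.
asu_set_with_fallback() {
    if "$ASU" set "$1" "$3"; then
        return 0
    fi
    echo "Setting '$1' was not accepted; trying '$2'" >&2
    "$ASU" set "$2" "$3"
}

# Apply both settings from Method 2, stopping at the first failure.
apply_c2_limit() {
    asu_set_with_fallback UEFI.OperatingModes \
        OperatingModes.ChooseOperatingMode "Custom Mode" || return 1
    asu_set_with_fallback UEFI.PackageCState \
        Processors.PackageACPIC-StateLimit "ACPI C2"
}
```

The script only defines the functions. To apply the settings, source it and run apply_c2_limit from a command prompt on the server where ASU is installed.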

Additional Information

In rare cases, processor Voltage Regulator Device (VRD) faults have been observed when a processor transitions between C-state 0 (full power) and deep C-states.

CPU VRD faults that occur due to state transitions can be greatly reduced or eliminated by having UEFI limit how deep a C-state is allowed.

The listed UEFI versions change the default ACPI C-state limit to C2 for processors that support this setting.
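
As a quick reference, the ACPI-to-Intel mapping given in note 2 of this section can be written as a small lookup. This is only an illustrative sketch: the function name intel_cstate_for and the family labels e7 and 6500_7500 are hypothetical and are not part of any IBM tool.

```shell
#!/bin/sh
# Illustrative lookup only; the function name and family labels are
# hypothetical, and the mapping is copied from note 2 of this tip.

# Usage: intel_cstate_for <family> <acpi_state>
#   family:     e7 | 6500_7500
#   acpi_state: C1 | C2 | C3
# Prints the Intel idle state and returns 0, or returns 1 when the
# combination is not available.
intel_cstate_for() {
    case "$1:$2" in
        e7:C1)        echo "C1" ;;
        e7:C2)        echo "C3" ;;
        e7:C3)        echo "C6" ;;
        6500_7500:C1) echo "C1" ;;
        6500_7500:C3) echo "C3" ;;
        *)            return 1 ;;   # e.g. ACPI C2 on 6500/7500 processors
    esac
}

# Example: with the limit set to ACPI C2 on an E7 system, the deepest
# Intel idle state requested through ACPI is C3.
intel_cstate_for e7 C2
```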

Notes:

1. IBM servers are designed to perform optimally in a steady-state power environment. Excessive AC and DC power cycles can stress system components, which may lead to premature failure of the VRDs.

IBM System x3850 X5 and HX5 Blades perform DC power cycles during any system restart or warm boot. For example, a DC power cycle occurs following a 'Ctrl+Alt+Del' or operating system (OS) restart. In addition to AC and DC power cycles, system restarts (warm boots) should be avoided where possible on IBM System x3850 X5 and HX5 Blades.

2. Enabling C-states in UEFI Setup maps operating system ACPI C-state requests to Intel idle states to reduce idle processor power consumption. On IBM X5 family servers with Intel E7 processors, ACPI C1, C2, and C3 map to Intel C1, C3, and C6 states, respectively. On IBM X5 family servers with Intel 6500/7500 processors, ACPI C1 and C3 map to Intel C1 and C3 states, respectively, and ACPI C2 is not available. OS software may override the UEFI ACPI mapping; for example, the intel_idle driver in Linux kernels invokes Intel idle states directly.

3. Flashing the UEFI to the new level will not, by itself, change the ACPI C-state limit, because UEFI settings are preserved between flash updates by design and new defaults are not loaded automatically after a UEFI update. To ensure that the ACPI C-state limit is set correctly after updating UEFI to the new level, either load the defaults or use the Workaround procedure to set the proper ACPI C-state limit.

4. Some newer versions of Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES) distributions have a built-in driver ('intel_idle') which ignores any C-state limits imposed by the Basic Input/Output System (BIOS) or Unified Extensible Firmware Interface (UEFI).

Add the kernel parameter shown below to the bootloader configuration file to prevent the 'intel_idle' driver from loading and to use the UEFI setting for the C-state limit:

  intel_idle.max_cstate=0

For more details, refer to RETAIN Tip H207000 (MIGR-5091901) at the following URL:

http://www.ibm.com/support/entry/portal/docdisplay?lndocid=MIGR-5091901
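
After the parameter has been added and the server has been restarted, one way to check the result on a running Linux system is sketched below. The function names are hypothetical helpers. The script reads /proc/cmdline and, where present, the cpuidle sysfs file /sys/devices/system/cpu/cpuidle/current_driver. Whether that sysfs file exists depends on the kernel version, so it is treated as optional.

```shell
#!/bin/sh
# Sketch only; function names are hypothetical helpers, not IBM tools.

# Returns 0 if the given kernel command line contains the exact
# parameter intel_idle.max_cstate=0, and 1 otherwise.
cmdline_disables_intel_idle() {
    case " $1 " in
        *" intel_idle.max_cstate=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Reports the state of the running system using standard Linux
# procfs/sysfs files; each file is skipped when it is not present.
report_intel_idle_state() {
    if [ -r /proc/cmdline ]; then
        if cmdline_disables_intel_idle "$(cat /proc/cmdline)"; then
            echo "intel_idle.max_cstate=0 is on the kernel command line"
        else
            echo "intel_idle.max_cstate=0 is NOT on the kernel command line"
        fi
    fi
    drv=/sys/devices/system/cpu/cpuidle/current_driver
    if [ -r "$drv" ]; then
        echo "Active cpuidle driver: $(cat "$drv")"
    fi
}

report_intel_idle_state
```

With the parameter in effect, the reported cpuidle driver would be expected to be something other than intel_idle (typically acpi_idle), although the exact name may vary by kernel version.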

 

Document Location

Worldwide

Operating System

BladeCenter:Operating system independent / None

System x:Operating system independent / None


Document Information

Modified date:
30 January 2019

UID

ibm1MIGR-5091926