IBM Support

CPU faults, I-Errors problem determination - IBM BladeCenter HS40

Troubleshooting


Problem

IBM BladeCenter HS40 has been experiencing CPU faults as indicated by the Management Module logs. These CPU faults are mainly caused by I-Errors from the CPU.

Resolving The Problem

Source

RETAIN tip: H183824

Affected configuration

The system may be any of the following IBM eServers:

  • BladeCenter HS40, type 8839, any model

The BIOS levels affected are:

  • IBM BladeCenter HS40 (8839) - Flash BIOS update, version 1.32 or below

The Firmware levels affected are:

  • IBM BladeCenter HS40 (8839) - Blade Server Integrated System Management Processor Firmware update, version 1.18 or below

Solution

Ensure that the system is running BIOS version 1.44 or above and BMC version 1.24 or above.
 
If the system has completed POST and the message "8510 Processor P01 IERR occurred" appears during normal operation or while running diagnostics, replace the processor.
 
If the message "8510 Processor P01 IERR occurred due to platform" appears, follow the problem determination steps below for an I/O timeout.
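
The distinction between the two 8510 messages can be sketched in a few lines of Python. This is only an illustration, assuming the message text has already been copied out of the Management Module log; the function name and the returned labels are hypothetical and are not part of any IBM tool. The result is meaningful only when the BIOS and BMC code are at the levels given above.

```python
def classify_ierr(message: str) -> str:
    """Suggest which problem determination path a log message points to.

    Hypothetical helper: the labels "cpu_fault", "io_timeout", and
    "unknown" are illustrative names, not IBM terminology.
    """
    # Collapse line wraps and extra spaces, and ignore case.
    text = " ".join(message.split()).lower()
    if "8510" in text and "ierr occurred" in text:
        if "due to platform" in text:
            return "io_timeout"   # follow the I/O timeout steps
        return "cpu_fault"        # replace the processor
    return "unknown"              # for example, the 8110 "detected IERR" line
```

A message that wraps across two lines in the log, such as "IERR occurred due" followed by "to platform", is still recognized because whitespace is normalized first.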

Additional information

IBM has discovered and fixed the causes of certain I-Errors. These fixes are contained in the following code levels:

  • 8839 IBM eServer BladeCenter HS40 - Flash BIOS update, version 1.44 or later
  • 8839 IBM eServer BladeCenter HS40 - Blade Server Integrated System Management Processor Firmware update, version 1.24 or later

All referenced files are available from IBM's support web site.

Recommended problem determination steps for CPU faults
  1. Determine the BIOS and ISMP code levels. If they are below the levels indicated above, the logged fault will not correctly identify the true cause of the failure, which can be either a CPU fault or an I/O timeout. It is therefore recommended that you update the BIOS and ISMP code to the levels above. The system will then correctly identify the cause of the error as either a CPU fault or an I/O timeout.
  2. If the BIOS and ISMP code levels are at or above the levels indicated above, the system will correctly identify a CPU fault. If a CPU fault occurs and is logged, perform a normal FRU replacement.
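
The code level check in step 1 can be expressed as a small Python sketch. It assumes the installed levels have already been read from the system and are plain dotted numbers such as "1.32"; the function and constant names are hypothetical. The minimum levels are the ones stated in this tip (BIOS 1.44, ISMP 1.24).

```python
from typing import Tuple

MIN_BIOS = "1.44"  # minimum BIOS level from this tip
MIN_ISMP = "1.24"  # minimum ISMP firmware level from this tip


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn "1.44" into (1, 44) so levels compare numerically."""
    return tuple(int(part) for part in version.strip().split("."))


def levels_ok(bios: str, ismp: str) -> bool:
    """Return True when both levels are at or above the minimums."""
    return (parse_version(bios) >= parse_version(MIN_BIOS)
            and parse_version(ismp) >= parse_version(MIN_ISMP))
```

If `levels_ok` returns False, the logged fault cannot be trusted to distinguish a CPU fault from an I/O timeout, so the code should be updated before any further diagnosis.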

The following is an example of the CPU fault format:
 
Internal Error CPU 1 Fault(s)
 
Index Sev Source Date Time
----- ---- -------- -------- --------
1 ERR BLADE_01 06/27/04 12:09:12
Text
---------------------
(McCarran 0317) POSTBIOS: 8110 Processor P01 detected IERR
 
2 ERR BLADE_01 06/27/04 12:08:49
Text
---------------------
(McCarran 0317) POSTBIOS: 8510 Processor P01 IERR occurred
 
3 ERR BLADE_01 06/27/04 12:08:40
Text
---------------------
(McCarran 0317) Internal Error CPU 1 Fault(s)
 
If the above error occurs, replace the flagged CPU.

Recommended problem determination steps for I/O timeouts

Note: The "P01 IERR occurred due to platform" message indicates only that an I/O timeout occurred. It does not isolate the failing device that caused the event.

  1. Check that the onboard Ethernet, Ethernet daughter cards, and Fibre daughter cards are at the current level of firmware and are using the latest device drivers. Pay particular attention to Linux device drivers, ensuring that they are tested, supported versions at the latest level.
  2. Reseat any installed daughter cards and the Blade Storage Expansion (BSE) unit, if installed.
  3. If a BSE unit is not installed, ensure that the BSE terminator that ships from the factory is installed.
  4. Check the operating system logs for hard disk drive errors. If there are hard disk drive errors at the time of the I/O timeout, compare the timestamp in the operating system log with the timestamp in the Management Module log. If they match, it is possible that the fault was caused by the hard disk drive.
  5. Run Diagnostics.
  6. Observe what the system is doing at the time of the failure. If the I/O timeout occurs before POST completes, no "IERR occurred due to platform" message is logged.
  7. If you have completed the previous steps and still see an I/O timeout, as indicated by the MachineCheck in the Management Module log, open a case with the support center. Document what the system was doing at the time of the failure, for example, high network traffic, database indexing, or idle.
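
The timestamp comparison in step 4 can be sketched as follows. This is a minimal illustration with several assumptions: the Management Module timestamp uses the MM/DD/YY HH:MM:SS layout seen in the log examples in this tip, the operating system log time has already been converted to a Python `datetime`, both clocks are in the same time zone and roughly synchronized, and a 60-second window is an arbitrary choice. The function name is hypothetical.

```python
from datetime import datetime, timedelta

# Layout of Management Module timestamps such as "06/27/04 12:08:49".
MM_FORMAT = "%m/%d/%y %H:%M:%S"


def near_in_time(mm_stamp: str, os_time: datetime,
                 window_seconds: int = 60) -> bool:
    """Return True when the two events fall within the given window."""
    mm_time = datetime.strptime(mm_stamp, MM_FORMAT)
    return abs(mm_time - os_time) <= timedelta(seconds=window_seconds)
```

A True result only suggests that the hard disk drive error and the I/O timeout may be related; it does not prove that the drive caused the fault.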

The following is an example of I/O timeout format:

Index Sev Source Date Time
----- ---- -------- -------- --------
1 ERR BLADE_01 06/27/04 12:09:12
Text
--------------------
(McCarran 0317) POSTBIOS: 8110 Processor P01 detected IERR
 
2 ERR BLADE_01 06/27/04 12:08:49
Text
--------------------
(McCarran 0317) POSTBIOS: 8510 Processor P01 IERR occurred due to platform
 
3 ERR BLADE_01 06/27/04 12:08:40
Text
--------------------
(McCarran 0317) Internal Error CPU 1 Fault(s)
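
When many entries need to be reviewed, a log saved in the layout shown in these examples can be split into records with a short Python script. This is a sketch only: it assumes the log has been saved as plain text in exactly this layout (an index, severity, source, date, and time line, followed by a "Text" label, a dashed rule, and the message), and the function name and dictionary keys are hypothetical.

```python
import re

# Entry header such as "2 ERR BLADE_01 06/27/04 12:08:49".
HEADER = re.compile(
    r"^(\d+)\s+(\S+)\s+(\S+)\s+(\d{2}/\d{2}/\d{2})\s+(\d{2}:\d{2}:\d{2})"
)


def parse_entries(log: str) -> list:
    """Split a saved Management Module log into a list of dictionaries."""
    entries = []
    current = None
    for raw in log.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            index, sev, source, date, time = match.groups()
            current = {"index": int(index), "sev": sev, "source": source,
                       "date": date, "time": time, "text": ""}
            entries.append(current)
        elif current is None or not line:
            continue  # column headings before the first entry, blank lines
        elif line == "Text" or set(line) <= {"-", " "}:
            continue  # the "Text" label and dashed rules
        else:
            # Message text; wrapped lines are joined with a single space.
            current["text"] = (current["text"] + " " + line).strip()
    return entries
```

Each returned dictionary holds one log entry, so the message text can then be searched for "IERR occurred due to platform" to separate I/O timeouts from CPU faults.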

Summary

It has been observed that most I-Errors are not caused by a CPU fault (defective CPU). Therefore, we have changed the BIOS and ISMP for the IBM eServer BladeCenter HS40 (Type 8839) to first identify whether the problem is caused by a CPU fault. If the fault was caused by the CPU, the CPU is flagged as needing to be replaced.
 
If the condition was not caused by a CPU fault, the system logs the error and reboots, and no CPUs are disabled. This allows problem determination to be done at a scheduled time.
 
Note: CPU faults are mainly caused by I-Errors from the CPU.
 
An I-Error is a generic error and can be caused by the following (in no particular order):

  • CPU failure
  • PCI timeouts, which can be caused by firmware, device drivers, or options such as a Fibre or Ethernet daughter card that is not seated correctly
  • Hard disk drive failure
  • Power supply
  • BSE terminator not installed when no BSE is connected to the blade. (The BSE terminator ships standard with the blade and should only be removed when adding a Blade Storage Expansion kit.)
  • The system is booting the EFI Shell and is attempting to PXE boot from the onboard Intel NIC.

Document Location

Worldwide

Operating System

BladeCenter:Operating system independent / None

Document Information

Modified date:
29 January 2019

UID

ibm1MIGR-59449