Troubleshooting
Problem
IBM BladeCenter HS40 has been experiencing CPU faults as indicated by the Management Module logs. These CPU faults are mainly caused by I-Errors from the CPU.
Resolving The Problem
| Source |
|---|
RETAIN tip: H183824
| Symptom |
|---|
IBM BladeCenter HS40 has been experiencing CPU faults as indicated by the Management Module logs. These CPU faults are mainly caused by I-Errors from the CPU.
| Affected configuration |
|---|
The system may be any of the following IBM eServers:
- BladeCenter HS40, type 8839, any model
The BIOS levels affected are:
- IBM BladeCenter HS40 (8839) - Flash BIOS update, version 1.32 or below
The Firmware levels affected are:
- IBM BladeCenter HS40 (8839) - Blade Server Integrated System Management Processor Firmware update, version 1.18 or below
| Solution |
|---|
Ensure the system is configured with BIOS version 1.44 and BMC
version 1.24 or above.
If the system has completed POST and an "8510 Processor P01 IERR
occurred" message appears during normal operation or while running
diagnostics, replace the processor.
If an "8510 Processor P01 IERR occurred due to platform" message
appears, follow the problem determination steps below for an I/O
timeout.
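The decision described above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_ierr` and the returned labels are hypothetical, and the result is meaningful only when the blade is already at BIOS version 1.44 and BMC version 1.24 or above and the message was logged after POST completed.

```python
def classify_ierr(message: str) -> str:
    """Map an 8510 log message to the action described in this tip.

    The labels returned here are illustrative, not IBM terminology.
    """
    # Collapse line breaks and repeated spaces so wrapped log text still matches.
    text = " ".join(message.split()).lower()
    if "8510" not in text or "ierr occurred" not in text:
        return "not-an-8510-ierr"
    if "due to platform" in text:
        # I/O timeout: follow the I/O timeout problem determination steps.
        return "io-timeout"
    # Plain "IERR occurred": replace the processor.
    return "cpu-fault"
```

A message classified as `io-timeout` does not identify the failing device; it only selects which set of problem determination steps to follow.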
| Additional information |
|---|
IBM has discovered and fixed the causes of certain I-Errors. These fixes are contained in the following code levels:
- 8839 IBM eServer BladeCenter HS40 - Flash BIOS update, version 1.44 or later
- 8839 IBM eServer BladeCenter HS40 - Blade Server Integrated System Management Processor Firmware update, version 1.24 or later
All referenced files are available from IBM's support web site.
| Recommended problem determination steps for CPU faults |
|---|
- Determine the BIOS and ISMP code levels. If they are below the
levels indicated above, the logged fault will not correctly
identify the true cause of the failure, which can be either a CPU
fault or an I/O timeout. Update the BIOS and ISMP code to the
levels above. The system will then correctly identify the cause of
the error as either a CPU fault or an I/O timeout.
- If BIOS and ISMP code levels are at or above the code levels indicated above, the system will correctly identify a CPU fault. If a CPU fault occurs and is logged, then a normal FRU replacement should be done.
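The level check in the first step can be sketched as follows. The minimum levels (BIOS 1.44, ISMP/BMC firmware 1.24) come from this tip; the function names are hypothetical, and the sketch assumes the levels are reported as plain dotted numbers such as "1.32". How the levels are read from the blade is not covered here.

```python
from typing import Tuple

# Minimum levels named in this tip.
MIN_BIOS = (1, 44)
MIN_ISMP = (1, 24)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '1.44' into (1, 44)."""
    return tuple(int(part) for part in text.strip().split("."))


def levels_are_current(bios: str, ismp: str) -> bool:
    """Return True only if both code levels meet the minimums above."""
    return parse_version(bios) >= MIN_BIOS and parse_version(ismp) >= MIN_ISMP
```

For example, `levels_are_current("1.32", "1.18")` returns `False`, which means the logged fault cannot yet be trusted to distinguish a CPU fault from an I/O timeout.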
The following is an example of the CPU fault format:
Internal Error CPU 1 Fault(s)
Index Sev Source Date Time
----- ---- -------- -------- --------
1 ERR BLADE_01 06/27/04 12:09:12
Text
---------------------
(McCarran 0317) POSTBIOS: 8110 Processor P01 detected IERR
2 ERR BLADE_01 06/27/04 12:08:49
Text
---------------------
(McCarran 0317) POSTBIOS: 8510 Processor P01 IERR occurred
3 ERR BLADE_01 06/27/04 12:08:40
Text
---------------------
(McCarran 0317) Internal Error CPU 1 Fault(s)
If the above error occurs, replace the flagged CPU.
| Recommended problem determination steps for I/O timeouts |
|---|
Note: The "P01 IERR occurred due to platform" message only indicates that an I/O timeout occurred. It does not isolate the failing device that caused the event.
- Check that the onboard Ethernet, Ethernet daughter cards, and
Fibre daughter cards are at the current firmware level and use the
latest device drivers. Pay particular attention to Linux device
drivers, ensuring that they are tested, supported versions at the
latest level.
- Reseat any installed daughter cards and the Blade Storage
Expansion (BSE) unit, if one is installed.
- If a BSE unit is not installed, ensure that the BSE terminator
that ships from the factory is installed.
- Check the operating system logs for hard disk drive errors. If
there are hard disk drive errors around the time of the I/O
timeout, compare the timestamp in the operating system log with
the timestamp in the Management Module log. If they match, the
fault may have been caused by the hard disk drive.
- Run Diagnostics.
- Observe what the system is doing at the time of the failure. If
the IERR is due to an I/O timeout that occurs before POST
completes, no "IERR occurred due to platform" message will be
logged.
- If you have completed the previous steps and are still seeing
an I/O timeout as indicated by the MachineCheck in the Management
Module log, open a case with the support center. Document what
the system was doing at the time of the failure, for example,
high network traffic, database indexing, or sitting idle.
The following is an example of the I/O timeout format:
Index Sev Source Date Time
----- ---- -------- -------- --------
1 ERR BLADE_01 06/27/04 12:09:12
Text
--------------------
(McCarran 0317) POSTBIOS: 8110 Processor P01 detected IERR
2 ERR BLADE_01 06/27/04 12:08:49
Text
--------------------
(McCarran 0317) POSTBIOS: 8510 Processor P01 IERR occurred due to platform
3 ERR BLADE_01 06/27/04 12:08:40
Text
--------------------
(McCarran 0317) Internal Error CPU 1 Fault(s)
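The timestamp comparison from the hard disk drive step above can be sketched as follows. The date and time layout (`06/27/04 12:08:49`) is taken from the log example in this tip; the function names and the two-minute window are assumptions made for illustration, and the operating system log timestamps are assumed to have been parsed into `datetime` values already.

```python
from datetime import datetime, timedelta

# Date and time layout as shown in the Management Module log example.
MM_LOG_FORMAT = "%m/%d/%y %H:%M:%S"


def parse_mm_timestamp(date_text, time_text):
    """Parse the Date and Time columns of a Management Module log entry."""
    return datetime.strptime(f"{date_text} {time_text}", MM_LOG_FORMAT)


def disk_errors_near(mm_time, disk_error_times, window=timedelta(minutes=2)):
    """Return the OS-log disk error times within `window` of the IERR entry.

    The two-minute default is an arbitrary illustrative tolerance.
    """
    return [t for t in disk_error_times if abs(t - mm_time) <= window]
```

A non-empty result only suggests that the hard disk drive may be involved; it does not prove the drive caused the I/O timeout, and the clocks of the blade and the Management Module may not be synchronized.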
| Summary |
|---|
IBM has observed that most I-Errors are not caused by a CPU fault
(a defective CPU). The BIOS and ISMP code for the IBM eServer
BladeCenter HS40 (Type 8839) has therefore been changed to first
determine whether the problem is caused by a CPU fault. If it is,
the CPU is flagged as needing to be replaced.
If the condition was not caused by a CPU fault, the system logs
the error and reboots, and no CPUs are disabled. This allows
problem determination to be done at a scheduled time.
Note: CPU faults are mainly caused by I-Errors from the CPU.
An I-Error is a generic error and can be caused by the following
(in no particular order):
- CPU failure
- PCI timeouts, which can be caused by firmware, device drivers, or options such as Fibre or Ethernet daughter cards that are not seated correctly
- Hard disk drive failure
- Power supply
- BSE terminator not being installed if no BSE is connected to
the blade. (The BSE terminator ships standard with the blade and
should only be removed if adding a Blade Storage Expansion kit.)
- The system is booting the EFI Shell and is attempting to PXE boot from the onboard Intel NIC.
Document Location
Worldwide
Document Information
Modified date:
29 January 2019
UID
ibm1MIGR-59449