Troubleshooting
Problem
Troubleshooting "SP COMM" or "KERNEL MODE" errors on the BladeCenter system.
Resolving The Problem
|
Source
|
|---|
RETAIN tip: H19321
|
Symptom
|
|---|
Troubleshooting "SP COMM" or "KERNEL MODE" errors on the BladeCenter system.
|
Affected configurations
|
|---|
The system may be any of the following IBM servers:
- BladeCenter HS20, Type 1883, any model
- BladeCenter HS20, Type 1884, any model
- BladeCenter HS20, Type 7981, any model
- BladeCenter HS20, Type 8678, any model
- BladeCenter HS20, Type 8832, any model
- BladeCenter HS20, Type 8843, any model
- BladeCenter HS21, Type 8853, any model
- BladeCenter HS40, Type 8839, any model
- BladeCenter LS20, Type 8850, any model
- BladeCenter LS21, Type 7971, any model
- BladeCenter LS41, Type 7972, any model
- BladeCenter JS21, Type 8844, any model
- BladeCenter JS21, Type 7988, any model
This tip is not software specific.
This tip is not option specific.
|
Solution
|
|---|
No fix. This tip is information only.
|
Workaround
|
|---|
None.
|
Additional information
|
|---|
This document uses the abbreviation "MM" to refer to the Management Module and Advanced Management Module. The abbreviation "SP" is used to denote any embedded Service Processor on any blade. This document also uses the MM reset procedure explained in "Management Module Connectivity Issues." Please be familiar with it before proceeding.
There are two RS485 buses on the midplane. One connects the MM in slot 1 to all the blades, and a redundant one connects the MM in slot 2 to all the blades. Those buses are separate on the midplane and do not share any components. The primary MM initiates a conversation with the SP when a blade is inserted, and periodically initiates conversations with every blade in the chassis. If the MM cannot complete a conversation with an SP, either during the initial insertion of the blade or during any later conversation, the MM logs an "SP comm" error in the MM Event Log. An "SP comm" error means "Service Processor communication failure" and almost always indicates a firmware or configuration problem on the MM or the blade SP. Faulty MM or SP hardware can also cause SP comm failures, but this is quite rare.
Because packet collisions are possible, occasional "SP comm" errors will occur when there is a large amount of traffic on the RS485 bus. SP comm errors that appear now and then and resolve without user intervention are to be expected and do not indicate a problem. This document is meant for troubleshooting persistent SP comm errors.
When a blade is inserted into a BladeCenter chassis, the MM initiates a conversation with the Service Processor on the blade. In that conversation, the MM finds out the VPD of the blade, which slot the blade is in, and other information about that blade. This initial conversation corresponds to the fast blinking of the LED of the blade, lasting up to 30 seconds. The rate slows down after the successful completion of the initial conversation to about once per second. If the LED continues to blink quickly or turns off, the initial SP/MM conversation has ended catastrophically, and the blade will not function normally.
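The LED behavior described above can be summarized as a small decision rule. The Python sketch below is illustrative only: the function name, the state labels ("fast", "slow", "off"), and the returned strings are assumptions made for this example and do not correspond to any IBM tool. The 30-second limit is taken from the paragraph above.

```python
# Hypothetical helper: interpret the blade LED after insertion.
# The initial SP/MM conversation lasts up to 30 seconds (from the text above).
HANDSHAKE_LIMIT_SECONDS = 30


def interpret_blade_led(blink_state: str, seconds_since_insert: float) -> str:
    """Classify the initial SP/MM conversation from the observed LED state.

    blink_state is one of "fast", "slow" (about once per second), or "off".
    These labels are assumptions made for this sketch.
    """
    if blink_state == "slow":
        # The blink rate slows after the initial conversation completes.
        return "ok"
    if blink_state == "fast" and seconds_since_insert <= HANDSHAKE_LIMIT_SECONDS:
        # Fast blinking within the limit means the conversation is still running.
        return "in progress"
    if blink_state in ("fast", "off"):
        # Still blinking fast past the limit, or LED off: the conversation failed.
        return "failed"
    raise ValueError(f"unknown blink state: {blink_state!r}")
```

For example, `interpret_blade_led("fast", 45)` reports a failed conversation, because the LED is still blinking quickly after the 30-second window.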
|
MM reports "SP comm" errors on initial blade insertion into a chassis that previously had no errors:
|
|---|
- Verify whether there are working blades with the same four-digit machine type (e.g. 8850) and SP firmware level in this chassis. If there are, then one can be confident that the MM firmware and SP firmware are at supported levels for this machine type. If there are no working blades of the same machine type and SP firmware level in the same chassis, go to the IBM support website and look at the change history for the current version of MM and SP firmware to verify that they support the machine type of the blade. If they do not, the MM and SP firmware will need to be flashed to a version that supports the machine type being inserted.
- Remove the newly inserted blade and examine the female connectors for damage. If damage is evident, replace the planar and examine the midplane for bent pins. Bent pins can usually be seen by shining a flashlight in the empty slot.
- Insert the blade into a known good slot in this chassis or, preferably, in another chassis. If it works in the other slot or chassis, the SP hardware and firmware are functional, and you should move to the section on "SP comm" errors on multiple blades in a chassis that has been up and running (section 3), which steps through MM and chassis troubleshooting.
- If the blade still does not work in the known good slot/chassis, flash the SP of this blade to a known working firmware level, using either the version on other blades in this chassis or the current version on the IBM website. If a failure occurs during the flash process, or the problem persists after flashing the SP, check the IBM website to see if updated MM or SP firmware addresses the error.
- If the failure continues, contact IBM support to have the planar replaced.
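The checks in the list above can be read as an ordered decision flow. The following Python sketch is one way to write that flow down. The function name, the parameter names, and the returned strings are hypothetical; each boolean stands for an observation made by hand while following the steps.

```python
def next_action_for_new_blade(
    firmware_supports_machine_type: bool,
    connector_damaged: bool,
    works_in_known_good_slot: bool,
    works_after_sp_flash: bool,
) -> str:
    """Return the next action for a newly inserted blade with SP comm errors.

    The checks run in the same order as the steps in the list above.
    All names and strings here are assumptions made for illustration.
    """
    if not firmware_supports_machine_type:
        return "flash MM and SP firmware to a level that supports this machine type"
    if connector_damaged:
        return "replace the planar and examine the midplane for bent pins"
    if works_in_known_good_slot:
        # The blade SP is fine, so the fault lies with the MM or chassis.
        return "troubleshoot the MM and chassis"
    if works_after_sp_flash:
        return "resolved"
    return "check for updated MM or SP firmware, then contact IBM support"
```

The ordering matters: the later checks are only meaningful once the earlier ones have been ruled out, which is why the function returns at the first condition that applies.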
|
"kernel mode" error and possible "SP comm" errors on multiple blades
|
|---|
A failure related to "SP comm" is a blade reporting "kernel mode" on the system status screen of the MM. "Kernel mode" means that the SP has booted but has encountered some type of firmware corruption. A single blade in "kernel mode" is often accompanied by multiple blades having "SP comm" errors. Once a "kernel mode" error is seen on the chassis, do not flash SP firmware on any other blade. Do the following to resolve these errors:
- Remove all blades reporting either error. Examine the female connectors on each removed blade to ensure they have not been damaged. Check the event logs for any other errors and resolve them before proceeding.
- Insert one blade back into the chassis and flash the SP firmware to the desired version using the MM web interface. You cannot use a boot diskette when the blade is in this state. Do not use scripts or other tools to flash the SP firmware, because the tool might have caused the problem in the first place. Consider updating to a newer version of SP firmware, as the errors may be resolved in later releases. Take the following steps to flash the SP code via the MM:
- Download the DOS image(s) for the blade having errors and extract the image(s) to diskette(s).
- Log in to the MM and go to Blade Tasks --> Firmware Update. Select the target blade from the drop-down menu. Click the "browse" button and point to the *.pkt file on the diskette that was created in the previous step.
- Click the "update" button, then follow the prompts to complete the firmware flash. Check the MM event log to verify a successful update.
- If the blade works after this flash, repeat the above process for all the blades that had "SP comm" or "kernel mode" errors. Remember to only insert one bad blade at a time. If this process does not clear up all errors, contact IBM support.
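The one-blade-at-a-time rule above can be sketched as a simple loop. In the Python sketch below, `flash_and_verify` is a hypothetical callback standing in for the manual procedure (insert the blade, flash the SP through the MM web interface, check the MM event log). It is not an IBM API, and the blade labels are made up for the example.

```python
from typing import Callable, Iterable, List, Tuple


def recover_blades_one_at_a_time(
    bad_blades: Iterable[str],
    flash_and_verify: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Process affected blades strictly one after another.

    flash_and_verify(blade) is a hypothetical stand-in for the manual steps
    and returns True if the blade works after the flash.
    Returns (recovered, still_failing).
    """
    recovered: List[str] = []
    still_failing: List[str] = []
    for blade in bad_blades:
        # Only one bad blade is handled at a time.
        if flash_and_verify(blade):
            recovered.append(blade)
        else:
            still_failing.append(blade)
    return recovered, still_failing
```

If the second list is not empty after every blade has been processed, the procedure above says to contact IBM support.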
|
"SP comm" errors on multiple blades in a chassis that has been up and running.
|
|---|
Tracking down the source of multiple "SP comm" errors on a BladeCenter chassis usually requires simultaneous access to the MM Event Log and the physical chassis.
If two MMs are installed in the chassis:
- Fail over to the redundant MM. If the errors resolve, move to the next step. If the errors continue after the failover, save the MM configuration, then restore the MM to its defaults. If the errors persist, restore the saved MM configuration and follow the steps in the "kernel mode" section (section 2) to isolate the problem blades. If resetting the MM to defaults resolves the errors, recreate the MM configuration manually.
- If the errors resolve after failing over to the redundant MM, fail back to the primary. If the problems occur again, remove the primary MM from the chassis, mark it "suspect" and move to the next step. If they do not come back, continue monitoring the chassis for the next few days.
- Test the remaining MM in both slots. If the remaining MM works in both slots, contact IBM support for possible replacement of the suspect MM. If the remaining MM fails in one slot but not the other, move to the next step.
- Reset the MM to defaults and try the MM in both slots again. If the failure occurs again with the MM in one particular slot, contact IBM support. If not, put the "suspect" MM back in the chassis. If any errors occur then, contact IBM support for possible replacement of the MM.
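The first step of the two-MM procedure above has three possible outcomes, which the following Python sketch makes explicit. The function name, parameters, and returned strings are assumptions made for illustration. The booleans represent what the operator observes in the MM Event Log after each action.

```python
def after_failover(resolved_after_failover: bool, resolved_at_defaults: bool) -> str:
    """Outcome of step 1 of the two-MM procedure (illustrative sketch only).

    resolved_at_defaults is only consulted when the failover did not help,
    because the MM is restored to defaults only in that case.
    """
    if resolved_after_failover:
        # Continue with the next step: fail back and watch for recurrence.
        return "fail back to the primary MM"
    if resolved_at_defaults:
        return "recreate the MM configuration manually"
    return "restore the saved MM configuration and isolate the problem blades"
```

Here `after_failover(False, False)` points to the blades rather than the MM, since neither the failover nor the reset to defaults cleared the errors.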
If only one MM is installed in the chassis:
- Restart the MM. If the errors continue, move to the next step. If they resolve, monitor the chassis for the next few days.
- Move the MM to the other slot. If the errors resolve, move the MM back to the original slot. If they return, move to the next step.
- If the errors continue with the MM in both slots, save the MM config, then restore the MM config to defaults. If that resolves the errors, recreate the configuration manually. If the errors continue with the MM at defaults, contact IBM support.
Document Location
Worldwide
Document Information
Modified date:
18 April 2023
UID
ibm1MIGR-5070123