Troubleshooting
Problem
How to troubleshoot a problem with the BladeCenter (Type 8677) chassis.
Resolving The Problem
| Source |
|---|
RETAIN tip: H19889
| Issue |
|---|
How to troubleshoot a problem with the IBM BladeCenter (Type 8677) chassis.
| Affected configurations |
|---|
The system may be any of the following IBM servers:
- IBM BladeCenter Chassis, type 8677, any model
This tip is not option specific.
This tip is not software specific.
| Additional information |
|---|
This procedure is intended to supplement and follow other BladeCenter (Type 8677) component debug procedures. The procedure may require downtime for multiple chassis components and should only be used after all other less invasive problem determination techniques have failed to isolate the problem.
| References |
|---|
- Hardware Maintenance Manual and Troubleshooting Guide - IBM BladeCenter (Type 8677)
- Planning and Installation Guide - IBM BladeCenter (Type 8677)
- IBM BladeCenter Deployment Guide, IBM Doc ID WP100564
| Technical overview |
|---|
The BladeCenter (Type 8677) chassis consists of an outer chassis assembly, an inner chassis (called the SPC), at least one Management Module (MM), a Media Tray, two Blowers, and a Midplane. The Midplane routes all electrical signals between the BladeCenter blade slots and module bays. Because the Midplane is a single part that is integral to all chassis functions, it is divided into a top and bottom half, and all power and bus signals are duplicated on the two halves.
Look at the back of an assembled chassis and you will see that all pluggable modules (blowers, MMs, I/O modules and power supplies) are organized into top and bottom redundant pairs.
The bay assignments on the back of the chassis are (from right to left) MM 1 top, redundant MM 2 bottom, power supply 1 top, redundant power supply 2 bottom, blower 1 top, redundant blower 2 bottom, power supply 3 top, redundant power supply 4 bottom, I/O module 1 top, redundant I/O module 2 bottom, I/O module 3 top and finally I/O module 4 bottom. The server blades, which plug into slots through the front of the chassis, each have duplicate top and bottom connectors that allow them to connect to each of the primary (top) and redundant (bottom) modules in the back of the chassis through the chassis midplane. Although the principal reason for this chassis design is to enhance server fault tolerance, it is also helpful to know how primary and redundant modules work together when debugging chassis problems. The bottom modules have been identified as redundant for clarity; however, the top and bottom connections are completely symmetrical. Nothing about the chassis prevents a configuration in which only a bottom module is installed without a top module, or vice versa.
Every blade server comes with two Ethernet ports (integrated on the planar) that connect through the midplane to two I/O switch modules in chassis Bays 1 and 2 (top and bottom). Although it is possible to insert Ethernet switch modules from two different vendors into Bays 1 and 2, most BladeCenter configurations have identical switches installed into those two bays to simplify setup. I/O module Bays 3 and 4 are optional and require optional expansion adapters in the blades to work.
These two modules, when used, are typically identical. If a second MM is installed in the bottom MM slot, it must be identical to the MM in the top slot. Power Supply 2 must match Power Supply 1, if installed, and the same goes for Power Supplies 3 and 4. Power Supplies 1 and 3 or 2 and 4 do not have to match since they feed separate power domains in the chassis, although they usually do match. The top and bottom blowers are also identical.
The reason for the two power domains is to limit the load on any one chassis power supply. The first power domain is fed by Power Supplies 1 and 2 and supplies power to the blowers, MMs, I/O modules, media tray and blades in slots 1 through 6. Power Supplies 3 and 4 feed the second power domain, which powers only the blades in slots 7 through 14. The second power domain can power more blades because it does not have to power any of the rear modules in the chassis. As a result, at least one power supply must be installed in power supply Bay 1 or 2 to bring up basic chassis functions including management, cooling and external connectivity.
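The power-domain layout described above can be written down as a small lookup, which may help when deciding which supplies matter for a given blade slot. This is a minimal sketch: the function and variable names are hypothetical, and "working" supplies are whatever the technician has already verified, not something the code detects.

```python
# Power-domain layout as described in the text; all names are hypothetical.
POWER_DOMAINS = {
    # Domain 1 also feeds the blowers, MMs, I/O modules and media tray.
    1: {"supplies": (1, 2), "blade_slots": range(1, 7)},
    # Domain 2 feeds blades only.
    2: {"supplies": (3, 4), "blade_slots": range(7, 15)},
}


def power_domain_for_slot(slot):
    """Return the power domain (1 or 2) that feeds blade slot 1-14."""
    for domain, info in POWER_DOMAINS.items():
        if slot in info["blade_slots"]:
            return domain
    raise ValueError(f"blade slot must be 1-14, got {slot}")


def slot_can_be_powered(slot, working_supplies):
    """True if at least one working supply feeds the slot's power domain."""
    supplies = POWER_DOMAINS[power_domain_for_slot(slot)]["supplies"]
    return any(bay in working_supplies for bay in supplies)


def basic_chassis_functions_powered(working_supplies):
    """Management, cooling and I/O need a working supply in Bay 1 or 2."""
    return any(bay in working_supplies for bay in POWER_DOMAINS[1]["supplies"])
```

For example, `slot_can_be_powered(10, {1, 2})` is false because slot 10 belongs to the second power domain, which is fed only by Power Supplies 3 and 4.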
If a BladeCenter problem symptom affects more than one blade or I/O module, look at the parts they share, such as the MM, switch modules, power supplies and media tray. For example, there is a common notification bus, called the I2C bus, that runs between the MMs and all of the other modules, blowers, power supplies and the media tray in the chassis. The bus is actually divided into five segments. A fault on one of these segments can cause multiple errors, and even spurious errors, to show up in the MM log. Some chassis components can force a signal fault on an I2C segment due to a faulty I2C interface. Spurious faults and hang conditions can also be caused by faults on the shared RS485 management information interface and the shared USB media device interfaces. Often the only way to isolate the cause of these types of faults is to start removing chassis components until the fault goes away. Removing the media tray is usually a good place to start since it connects to the I2C and USB shared interfaces and is usually not needed for normal blade operation (when the blade server is not in the middle of booting or accessing data on a removable media device).
| Chassis checkout procedure |
|---|
- Verify AC power is good to the chassis, all power supplies are good and all DC outputs are good. If you can log in to the MM then you know that at least one power supply in power domain 1 is good. Open the Fuel Gauge page in the MM and verify that neither power domain 1 nor power domain 2 is oversubscribed. Verify the blade power numbers look good (see the Fuel Gauge section of the MM User's Guide and RETAIN tip H067860). Refer to the document Troubleshooting power issues - IBM BladeCenter (8677) if you think you have a power problem.
- Verify the chassis temperature is below the warning threshold and both blowers are working. The MM will prevent blades from powering up if it detects a critical over-temperature condition. Check the front panel LEDs to see if the amber chassis over-temperature LED is lit. Log in to the MM if possible and check the system status page for chassis or blade temperature warnings. Check the chassis blower speeds to make sure both blowers are running and both are below 100%. Check the MM event log for overtemp messages.
- If one blower is running but the other blower is not running, replace the blower that is not running. Using a flashlight, inspect the corresponding connectors on the midplane behind the blowers for damage.
- Make sure all chassis slots and bays are filled with components or filler blades or modules.
- If only one blade is overtemp and the chassis temperature looks nominal, then suspect the blade CPU heatsink or the system board.
- If multiple blades are getting overtemp messages and the chassis temperature and blower speeds look nominal then there may be an MM and service processor communication issue. Refer to the troubleshooting documents Troubleshooting BladeCenter SP comm, kernel mode errors and Troubleshooting I2C errors - IBM BladeCenter (Type 8677) for help.
- If the chassis temperature is above the warning limit and both blowers are operating at or near 100% then suspect the environment temperature outside the chassis.
- If the temperature outside the chassis is nominal then the chassis might have a sensor problem. Refer to the document Troubleshooting media tray problems - IBM BladeCenter (Type 8677) to resolve it. If the chassis temperature is above the warning limit and the blowers have not ramped up then suspect a communication path problem between the MM, the Media Tray and the Blowers. Refer to the document Troubleshooting I2C errors - IBM BladeCenter (Type 8677) bus 4, for more debug steps.
- Verify the MM is not preventing the I/O modules or blades from powering up. Check the MM system status page for "Unknown module" or "Incompatible module" errors. If there are I/O modules in Bay 3 or Bay 4 and there is a daughter card plugged into a blade in the same chassis that is not compatible with the I/O module type then the MM will not allow either the I/O module or the blade to power up until the "Incompatible module" condition is corrected. For example, the blade fibre expansion card is not compatible with any Ethernet switch modules and the InfiniBand HCA daughter card is not compatible with the OPM.
- If a single blade or module appears dead, pull the blade or other suspected component from the chassis and inspect the back of it for connector damage. A blade or module chassis connector must be inspected before the component is moved to other slots or chassis for debug. If blade or module connector damage is found then immediately inspect the chassis midplane using a flashlight to look for midplane connector damage.
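The temperature and blower checks in this procedure form a small decision tree, sketched below. This is an illustration only: the function name, the argument names and the 95% cut-off used for "at or near 100%" are assumptions, and "chassis nominal" is simplified to "chassis temperature below the warning threshold". Symptom combinations that the procedure does not describe fall through to the nearest branch.

```python
def triage_overtemp(chassis_over_warning, blowers_running, blower_speeds_pct,
                    overtemp_blade_count, ambient_nominal, near_full_pct=95):
    """Map observed thermal symptoms to the next action from the checklist.

    blowers_running: pair of booleans for blower 1 and blower 2.
    blower_speeds_pct: pair of speeds (0-100) as reported by the MM.
    near_full_pct: assumed cut-off for "at or near 100%"; not from the source.
    """
    # Exactly one blower stopped: replace it and check the midplane.
    if sum(bool(r) for r in blowers_running) == 1:
        return "replace stopped blower; inspect midplane connectors behind it"
    if not chassis_over_warning:
        if overtemp_blade_count == 1:
            return "suspect blade CPU heatsink or system board"
        if overtemp_blade_count > 1:
            return "suspect MM / service processor communication"
        return "no thermal fault indicated"
    # Chassis is above the warning limit from here on.
    if all(speed >= near_full_pct for speed in blower_speeds_pct):
        if ambient_nominal:
            return "suspect chassis temperature sensor (media tray)"
        return "suspect ambient temperature outside the chassis"
    # Hot chassis but the blowers have not ramped up.
    return "suspect MM / media tray / blower communication path (I2C bus 4)"
```

The returned strings only name the branch of the procedure to follow; the referenced troubleshooting documents still contain the actual debug steps.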
| Dead blade issue |
|---|
- Start here if you cannot get one or more blades to power up or boot. If the blades boot but cannot communicate through one or more I/O modules, go to the section in this document titled "I/O Module Communication Issue."
- If the blade will not power up, refer to the document Troubleshooting blades that will not power on for more help.
- Try to use a local KVM session instead of a remote terminal to monitor the blade. If you are not seeing any video display while the blade is booting then refer to the document Troubleshooting video issues - IBM BladeCenter for more help. If the blade fails with a POST error message or checkpoint code then follow the instructions for that error indication (see the Hardware Maintenance Manual or Serviceability and Troubleshooting Guide for the blade type). If the blade boots up but the keyboard or mouse does not appear to work then try a different blade.
- If keyboard or mouse does not work for multiple blades, suspect the MM. Verify the MM part number per tip H185739. Check the service processor firmware levels on the blades. Replace the MM.
- If keyboard or mouse only fails for one blade then suspect the blade or the slot. Try another known working blade in the same slot.
- If keyboard or mouse only fails for one blade and is not slot dependent then suspect the blade. Update or re-flash the blade service processor. Replace the blade system board.
- Pull out the media tray and try to boot the blade. If the blade now works then suspect a bad connection or component in the media tray. Use the document Troubleshooting media tray problems - IBM BladeCenter (Type 8677) to determine which part is bad.
- Boot the blade to on-board diagnostics by pressing F2 during POST and run diagnostics. If any errors are returned then follow blade troubleshooting procedures as identified in the Hardware Maintenance Manual (HMM) or the Serviceability and Troubleshooting Guide for the blade type.
- Is this a remote boot configuration where one or all the blades are failing to boot from an external NAS or SAN attached device? If so then the integrity of the connection to the remote NAS or SAN needs to be verified before suspecting the BladeCenter chassis. If the system is using a fibre attached SAN then refer to the document Troubleshooting Fibre connectivity - IBM BladeCenter (Type 8677) for help.
- At this point we know the blade is passing POST and Diagnostics but is failing somewhere else in the boot process. This is most likely not a chassis problem. The problem could be caused by a faulty boot image on the primary boot device or, if the blade is booting into the operating system and then hanging, a bad driver, service module or application. Do not use this document to continue problem determination.
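The keyboard and mouse checks in this section can be summarized as a short lookup. This is a sketch with hypothetical names. The "slot" outcome is an inference from the instruction to try a known working blade in the same slot; the procedure does not state that conclusion explicitly.

```python
def triage_kvm_input(failing_blade_count, known_good_blade_fails_in_slot=None):
    """Suggest the suspect component when keyboard or mouse input fails.

    known_good_blade_fails_in_slot: result of trying a known working blade in
    the failing blade's slot (None if that test has not been run yet).
    """
    if failing_blade_count < 1:
        raise ValueError("at least one failing blade is required")
    if failing_blade_count > 1:
        # Input fails on multiple blades: the shared MM is the suspect.
        return "MM"
    if known_good_blade_fails_in_slot is None:
        return "blade or slot: try a known working blade in the same slot"
    if known_good_blade_fails_in_slot:
        # Assumption: a known good blade failing here points at the slot.
        return "slot"
    # Failure is not slot dependent: re-flash the SP, then the system board.
    return "blade"
```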
| I/O Module communication issue |
|---|
- Start here if the blade boots up okay but cannot communicate with the external network. First, verify "External Ports" are enabled in the MM for the I/O module. If this is an I/O connectivity issue affecting one or more blades then refer to the appropriate I/O Module troubleshooting guide for more help.
- If possible, swap I/O modules within the same channel domain, i.e. Bays 1 and 2 or Bays 3 and 4. This works best if you have two pass-through type I/O modules in the chassis, i.e. two OPMs or two CPMs. If the two modules are active switches then you must verify that the switches are configured the same before swapping them. Swap the I/O modules and see if the connectivity problem follows the module or stays with the slot. Do not move the external cables with the I/O modules. For example, if you are swapping the I/O modules in Bays 1 and 2, the same set of cables should be plugged in the Bay 1 module after the swap as before the swap.
- If the problem stays with the bay or slot then verify the cables and upstream ports connected to the module in that bay are good. If you know the cables and upstream ports are good then move on to the Extended chassis checkout procedure.
- If the problem follows the module and you know the upstream cable connection and port configuration is good, then suspect the I/O module. Refer to the "I/O Module Troubleshooting guide" for more help.
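The swap test above reduces to comparing the failing bay before and after the swap. The sketch below uses hypothetical names and assumes the external cables stayed with the bays, as the procedure requires.

```python
# Bays that may be swapped with each other, per the procedure.
CHANNEL_DOMAINS = (frozenset({1, 2}), frozenset({3, 4}))


def in_same_channel_domain(bay_a, bay_b):
    """True if the two distinct bays form one swappable pair."""
    return frozenset({bay_a, bay_b}) in CHANNEL_DOMAINS


def interpret_swap(failing_bay_before, failing_bay_after):
    """Interpret where the connectivity fault shows up after swapping modules."""
    for bay in (failing_bay_before, failing_bay_after):
        if bay not in (1, 2, 3, 4):
            raise ValueError(f"I/O module bay must be 1-4, got {bay}")
    if failing_bay_before == failing_bay_after:
        # Fault did not move with the module.
        return "stays with bay: check cables and upstream ports, then extended checkout"
    if not in_same_channel_domain(failing_bay_before, failing_bay_after):
        raise ValueError("modules may only be swapped between Bays 1/2 or Bays 3/4")
    # Fault moved to the partner bay along with the module.
    return "follows module: suspect the I/O module"
```

For example, if Bay 1 failed before the swap and Bay 2 fails afterwards, `interpret_swap(1, 2)` reports that the fault follows the module.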
| Extended chassis checkout procedure |
|---|
- This part of the troubleshooting procedure has been reached most likely because of a critical management communication bus failure in the chassis, i.e. I2C or SP COMM. This debug procedure may require a complete teardown of the chassis to isolate the root cause of the failure. Be prepared to power down all blade servers and I/O modules to create a minimum configuration. The following steps refer to only the original components in the chassis. Do not use new parts that may have been shipped to the customer site until explicitly told to do so.
- If the problem has already been isolated to a single module or blade then use a flashlight to inspect the suspect module bay or blade slot connector in the chassis for bent pins. If you see a bent pin, call IBM BladeCenter support to dispatch the appropriate part. Inspect the rear connector of any I/O module that was plugged into this bay or any blade that was plugged into this slot for connector damage. Check connectors for any component that is removed from the chassis throughout the remaining steps of this procedure. If a socket on an I/O module or blade connector looks damaged then call IBM BladeCenter support. If you see any damaged pins or connectors in the chassis, call IBM BladeCenter support.
- It is now time to strip down the chassis to a minimum configuration that works. Unless otherwise instructed, do not completely remove a module or blade from the chassis so that we can keep track of which component was plugged in to which slot. Power down all blades and unplug them from the chassis about one inch.
- Make sure there is a working power supply in slot 1 then unplug power supplies 2, 3, and 4 about one inch. Unplug all of the I/O modules one inch. If there is a redundant MM in the chassis, unplug it. Unplug the media tray and one of the two blowers. Note that the blower(s) will ramp up to full speed whenever the media tray, a power supply module or a blower is removed. If the remaining MM and the blower are not receiving power then plug in power module 2, verify its AC/DC check LEDs look good and unplug power module 1.
- Verify the MM is working. Follow the steps in Troubleshooting Management Module connectivity issues to debug an MM problem if necessary. Log in to the MM and check system status and then check the MM event log for new errors. Note: any messages indicating non-redundant modules are okay at this point since components have been unplugged from the chassis. Any non-recovered I2C or SP COMM error messages in the log are significant. Verify the status for all components including the MM, power supply and blower looks good. If one of the components is showing a warning or error message then start swapping components until you get a minimum configuration that works per the following steps. Always wait at least two minutes after performing an action to give the MM time to scan and react to the change.
- If the problem persists, plug in the second blower and remove the first one.
- If the problem persists, plug in power supply 2, verify both AC and DC LEDs are lit, verify the power supply shows up on the fuel gauge page in the MM browser then remove power supply 1.
- Unplug the MM in slot 1 and plug the redundant MM into slot 2 if there is one, otherwise move the single MM from slot 1 to slot 2. In either case allow five minutes for the MM to complete POST and connect to the network before trying to log in.
- Remove AC power from all power modules in the chassis. Remove all power modules and blowers from the rear of the chassis and remember what bays they were in. Remove the MM from the rear of the chassis and remember what bay it was in. Pull out the SPC chassis and disconnect the Rear Customer Interface Card. Make sure the SPC cables are not going to get pinched and then plug the SPC Chassis back in to the frame. Plug the Power Module, Blower and MM back in. Apply AC to the Power Module. Allow five minutes for the MM to complete POST and then check system status again.
- If the above steps do not result in a minimum configuration that works, call IBM BladeCenter Support.
- Repeat the preceding step that removes the SPC chassis, this time reconnecting the Rear Customer Interface Card, before continuing.
- At this point the chassis is working but the media tray is not installed and there are no blades plugged in. Plug in the media tray and re-check system status in the MM. If everything still looks good continue. If new errors have appeared then refer to the "Media Tray Troubleshooting guide", connectivity section for more help.
- The next step is to bring up a functioning blade. If the blade requires a remote NAS or SAN drive to boot, then a module in Bay 3 will have to be brought up first before booting the blade. If necessary plug the module in Bay 3 back in and verify the module completes POST with no new errors in MM system status or the event log.
- Plug a single blade into chassis slot 1. Power it up and use the local BladeCenter KVM connection to watch it complete POST and boot to the operating system. If the blade will not power up, refer to the document Troubleshooting blades that will not power on for more help. Verify no new errors in MM system status or the MM event log.
- Plug the Ethernet module into I/O module Bay 1 and connect it to the upstream network. Check MM system status to verify the I/O module completes POST with no error messages. Verify no new error messages in the MM event log.
- At this point we should have one MM, one blade, one power supply, one blower, at least one I/O module and the media tray plugged into the chassis with no new errors in MM system status and the MM event log. Now start plugging components back in, one at a time, until you see the failure symptom again. Start with the blower, the power supplies, then the other I/O modules, the redundant MM and then the blades. Note the blowers should ramp back down as soon as both blowers and all power supplies are plugged back in.
- If the failure symptom comes back after replacing a module or blade, call IBM BladeCenter support for the next steps. IBM support personnel may ask you to swap redundant components or try new components to further isolate the problem.
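The final reassembly phase is a one-at-a-time search: plug a component back in, give the MM time to react, check for new errors, and stop at the first component that brings the symptom back. The sketch below shows the loop under stated assumptions. `plug_in` and `is_healthy` are hypothetical callbacks standing in for the technician's actions (reseating the part, then checking MM system status and the event log), and the default 120-second wait reflects the "at least two minutes" guidance in this procedure.

```python
import time


def find_offending_component(components, plug_in, is_healthy,
                             settle_seconds=120.0, sleep=time.sleep):
    """Reinsert components in order; return the first one that breaks the chassis.

    components: names in reinsertion order (blower, power supplies, other
        I/O modules, redundant MM, then blades, per the procedure).
    plug_in: hypothetical callback invoked with each component name.
    is_healthy: hypothetical callback returning True while MM system status
        and the MM event log show no new errors.
    Returns None if every component was reinserted without the fault returning.
    """
    for component in components:
        plug_in(component)
        # Give the MM time to scan and react to the change.
        sleep(settle_seconds)
        if not is_healthy():
            return component
    return None
```

A dry run with stand-in callbacks might look like this; the component names are illustrative only:

```python
plugged = []
suspect = find_offending_component(
    ["blower 2", "power supply 2", "I/O module 2", "blade 2"],
    plug_in=plugged.append,
    is_healthy=lambda: "I/O module 2" not in plugged,
    sleep=lambda seconds: None,  # skip the real wait in a dry run
)
```

Stopping at the first failing component keeps the remaining parts unplugged, which matches the instruction to call support once the symptom reappears.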
Document Location
Worldwide
Document Information
Modified date:
29 January 2019
UID
ibm1MIGR-5071185