IBM Support

Troubleshooting I2C errors - IBM BladeCenter (Type 8677)

Troubleshooting


Problem

How to troubleshoot I2C errors on the BladeCenter Chassis (Type 8677).

Resolving The Problem

Source

RETAIN tip: H19858

Symptom

How to troubleshoot I2C errors on the BladeCenter Chassis (Type 8677).

Affected configurations

The system may be any of the following IBM servers:

  • BladeCenter Chassis, Type 8677, any model

- This Tip is not software specific.

- This Tip is not option specific.

Solution

In the blade chassis, the Management Module (MM) interacts with every component. This interaction happens over the RS485 bus as well as the I2C bus. Each component identifies itself by responding to messages over the RS485 or I2C bus. These messages access nonvolatile storage that contains vital product data (VPD) fields. The MM checks the components within the chassis to make sure that their requirements (for example, power consumption) allow the chassis to work properly.

Redundant I2C buses are used to communicate with the other I2C devices in the chassis: the media tray, blowers, power modules, and I/O switch modules.

Taking advantage of the redundancy in the blade chassis hardware will allow for better problem determination.

If an I2C bus error occurs in a chassis with two MMs installed, chassis control automatically switches over to the second MM to try to correct the problem. Three messages may be posted to the MM event log in this case: a failover from MM 1 to MM 2 message, an I2C bus error recovery message, and an I2C bus error identified in MM 1 message. This sequence does not indicate that MM 1 is bad. To debug this situation, switch chassis control back to the MM that was active when the failure occurred, and remove the other MM from the chassis before continuing. Below is a list of error messages that might appear in the MM event log and the steps to take if they are seen.
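The failure messages listed in this tip share a common wording, so the bus number can be extracted from an event log line mechanically. The following is a minimal sketch in Python, assuming the log has been exported as plain text lines that contain the message text exactly as quoted in this tip. The names `failing_bus` and `failing_buses` are hypothetical helpers for illustration and are not part of any MM interface.

```python
import re
from typing import Iterable, Optional, Set

# Pattern built only from the message texts quoted in this tip. The optional
# "s" covers both "Check devices on bus 1" and "Check device on bus 4".
I2C_FAILURE = re.compile(
    r"Failure reading I2C device\.\s+Check devices? on bus (\d+)\."
)


def failing_bus(log_line: str) -> Optional[int]:
    """Return the bus number named in an I2C failure message, else None."""
    match = I2C_FAILURE.search(log_line)
    return int(match.group(1)) if match else None


def failing_buses(log_lines: Iterable[str]) -> Set[int]:
    """Collect the distinct bus numbers reported across several log lines."""
    buses = set()
    for line in log_lines:
        bus = failing_bus(line)
        if bus is not None:
            buses.add(bus)
    return buses
```

If `failing_buses` returns a single bus number, follow the actions for that bus. If it returns more than one, the case of messages for multiple buses at the end of this tip applies.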

MM Event Log Message:

Failure reading I2C device. Check devices on bus 1. (Local Bus)

Actions:

  1. Wait two minutes to allow the MM to reset the I2C bus and check for shorts. If an I2C bus error recovery message is not posted to the log after two minutes, restart the MM. Wait five minutes, then check the MM log again. If the error still has not recovered after the restart, move the MM to the redundant MM slot.
  2. If the failure is corrected by moving the MM to the redundant slot, move the MM back to slot 1. If the failure appears again, allow the MM to reset the I2C bus and check for shorts. At this point the MM may take up to five minutes to reset the I2C bus, depending on how many times the failure has occurred.
  3. If the failure is still present and there is a second MM in slot 2, the MM in slot 2 is now active. Remove MM 1 and install MM 2 into slot 1. If the failure is corrected, replace the MM that was originally installed in slot 1.
  4. If the failure has not recovered after installing the second MM in slot 1, install the original MM 1 into MM slot 2. If the failure is corrected, call the technical support center for further assistance.

MM Event Log Message:

Failure reading I2C device. Check devices on bus 2.

Actions:

  1. Wait two minutes to allow the MM to reset the I2C bus and check for shorts. If the failure has not recovered after two minutes, restart the MM. Wait five minutes before checking the MM log again. If the failure is still present after the restart, move the MM to the second MM slot. If the failure is corrected, move the MM back to slot 1.
  2. If the failure appears again, allow the MM to reset the I2C bus and check for shorts. At this point the MM may take up to five minutes to reset the I2C bus, depending on how many times the failure has occurred, so wait the full five minutes.
  3. If the failure still has not recovered after five minutes, refer to the document "Troubleshooting the 8677 chassis" for the next steps to take.
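The "wait two minutes, then check the MM log" pattern used throughout these actions can be sketched as a polling loop with a deadline. This is only an illustration: `fetch_log` and `is_recovery` are hypothetical callables supplied by the caller, because this tip does not give the exact text of the recovery message or describe a way to read the log programmatically. `fetch_log` is assumed to return only entries posted since the wait began, so that an old recovery entry is not mistaken for a new one.

```python
import time
from typing import Callable, Iterable


def wait_for_recovery(
    fetch_log: Callable[[], Iterable[str]],   # hypothetical: returns new log lines
    is_recovery: Callable[[str], bool],       # hypothetical: spots a recovery entry
    timeout_s: float = 120.0,                 # two minutes, as in the first step
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll the log until a recovery entry appears or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if any(is_recovery(line) for line in fetch_log()):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

For the wait after an MM restart, pass `timeout_s=300` to match the five minutes given in the actions. The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.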

MM Event Log Message:

Failure reading I2C device. Check devices on bus 3.

Actions:

  1. Wait two minutes to allow the MM to reset the I2C bus and check for shorts. If an I2C bus error recovery message is not posted after two minutes, restart the MM. Wait five minutes, then check the MM log again. If the failure has not recovered, check the fuel gauge status to make sure that the power domains are not oversubscribed. WARNING: If a power domain is oversubscribed (more power required than can be supplied by one power module), then, depending on the configuration of the MM and the chassis, pulling a power module could cause one or more blades to throttle or lose power.
  2. Check the power module LEDs for both power modules in the power domain. Remove power module 2 only. Wait two minutes, then check the MM log for a new I2C error recovery message for bus 3.
  3. If the I2C error has recovered, reinsert power module 2. Wait two minutes, then check the MM log. If the failure comes back, replace power module 2. Verify that the AC/DC LEDs are on for power module 2, then remove power module 1 only. Wait two minutes and check the MM event log. If a recovery message has been posted for I2C bus 3, reinsert power module 1. Wait two minutes, then check the MM log. If the failure comes back, replace power module 1.
  4. If the failure is still present, repeat the above steps for power modules 3 and 4 in the second power domain.

If each power module has been removed from the chassis in turn but the I2C error has not recovered, more disruptive problem determination is required to isolate the failing component. Schedule downtime for all of the blades and then refer to the Chassis Checkout Troubleshooting guide to continue.
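The bus 3 procedure above is a one-at-a-time isolation: pull a single power module, see whether the bus recovers, reinsert it, and see whether the failure returns. The sketch below restates that logic in Python under stated assumptions. `remove`, `reinsert`, and `bus_recovered` are hypothetical callables standing in for the manual steps (`bus_recovered` is assumed to include the two-minute wait and the log check), and the wattage figures in the usage note are made-up illustration values, not chassis specifications.

```python
from typing import Callable, Optional, Sequence, Tuple


def is_oversubscribed(demand_w: float, single_module_capacity_w: float) -> bool:
    """A domain is oversubscribed when it needs more than one module can supply."""
    return demand_w > single_module_capacity_w


def isolate_power_module(
    order: Sequence[int],                 # e.g. (2, 1) for the first power domain
    remove: Callable[[int], None],        # hypothetical: pull the module
    reinsert: Callable[[int], None],      # hypothetical: put the module back
    bus_recovered: Callable[[], bool],    # hypothetical: wait, then check the log
) -> Tuple[str, Optional[int]]:
    """Walk the modules in the given order, one at a time."""
    for module in order:
        remove(module)
        recovered = bus_recovered()
        reinsert(module)
        if not recovered:
            continue  # pulling this module changed nothing; try the next one
        if bus_recovered():
            # Assumption: the tip gives no action for this outcome, so the
            # sketch stops here and treats the reseat as having cleared it.
            return ("cleared_by_reseat", module)
        return ("replace", module)  # failure came back with the module installed
    return ("not_isolated", None)
```

As a usage example, `is_oversubscribed(1500, 1200)` is true, which is the situation the warning in step 1 describes, so no module in that domain should be pulled without planning for blades to throttle or lose power. A result of `("not_isolated", None)` for both power domains corresponds to the paragraph above on scheduling downtime.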

MM Event Log Message:

Failure reading I2C device. Check device on bus 4.

Actions:

  1. Restart the MM and allow two minutes for it to complete the I2C short test. If the failure is still present, reseat the MM and allow it five minutes to complete the I2C short test. If the failure has not recovered, remove the media tray. Open a hardware VPD screen in the MM and verify that chassis hardware VPD is working. If it is working, suspect the media tray. Refer to the document "Troubleshooting the 8677 media tray" for the next steps.
  2. If the failure has not recovered, move the MM in slot 1 to slot 2. Wait five minutes to allow the MM to retest the I2C buses for shorts. If the I2C error has recovered, the problem is with MM 1, MM slot 1, or the redundant bus connected to MM slot 1.
  3. If chassis control was switched from the MM in slot 1 to the second MM in slot 2, continue the isolation process by moving MM 2 into MM slot 1. Wait five minutes. If the I2C failure recovers, suspect the MM that was originally in MM slot 1. If the failure is still present, refer to the document "Troubleshooting the 8677 chassis" for the next steps.
  4. If the MM was moved from slot 1 to slot 2 and the failure recovered, inspect the slot connector for MM slot 1 for possible bent pins. At the same time, inspect the connector on the MM to make sure that it is not damaged. Lastly, call the IBM technical support center for further assistance.
  5. At this point the I2C bus failure has been corrected but the media tray is still out of the chassis. Install the media tray. If the I2C bus 4 error returns, refer to the document "Troubleshooting the 8677 media tray" for the next steps.

MM Event Log Message:

Failure reading I2C device. Check device on bus 5.

Actions:

  1. The following debug procedure requires a chassis maintenance window, since blade connectivity to external Ethernet or Fibre may be brought down. If you have configured your chassis and blades to support redundant network connections between Ethernet modules 1 and 2 and have tested the redundancy, the following procedure can be followed with minimal impact to the blade servers. The same warning applies to the I/O modules in bays 3 and 4, whatever the switch type.
  2. Pull the Ethernet module from chassis bay 2. Wait two minutes, then check the MM event log for an I2C bus 5 recovery message. If the bus error has recovered, suspect Ethernet switch module 2. Reinsert Ethernet module 2 and wait two minutes. If the I2C failure has returned, replace module 2.
  3. Verify that Ethernet module 2 has completed POST and is back online, then remove the Ethernet module from bay 1. Wait two minutes, then check the MM log to see if the I2C error has recovered. If the error recovered, suspect Ethernet module 1. Reinsert Ethernet module 1 and wait two minutes. If the I2C failure has returned, replace module 1. If the I2C bus 5 error still occurs, examine the modules in bays 3 and 4. If there are no modules in bays 3 and 4, call the IBM technical support center for further assistance. If the server connections through the modules in bays 3 and 4 are being used for remote boot (Fibre or iSCSI), the two SAN paths through modules 3 and 4 must be configured for redundancy and must have been tested for redundancy before continuing. Otherwise all of the blade servers in the chassis must be shut down gracefully before unplugging a module.
  4. Pull the I/O module from bay 4, wait two minutes, and then check the MM event log for an I2C bus 5 recovery message. If the I2C bus 5 error recovers, suspect I/O module 4. Reinsert I/O module 4 and wait two minutes. If the I2C bus error has returned, replace I/O module 4.
  5. Unplug I/O module 3 and wait two minutes. If the I2C bus 5 error recovers, suspect I/O module 3. Reinsert I/O module 3 and wait two minutes. If the I2C bus error has returned, replace I/O module 3.

If I2C bus 5 errors continue to be posted in the MM event log after all the I/O modules in the chassis have been removed, call the IBM technical support center for further assistance.
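Before any I/O module is pulled, the bus 5 actions above set conditions on redundancy and remote boot. The following Python sketch condenses those conditions into one check per pair of bays. `BayPair` and `precondition_for_pull` are hypothetical names used only for this illustration, and the returned strings are paraphrases of the tip, not MM output.

```python
from dataclasses import dataclass

# Order in which the actions above pull the modules, by bay number.
PULL_ORDER = (2, 1, 4, 3)


@dataclass
class BayPair:
    """State of one pair of I/O bays (bays 1 and 2, or bays 3 and 4)."""
    redundancy_configured: bool
    redundancy_tested: bool
    used_for_remote_boot: bool = False  # Fibre or iSCSI boot through the pair


def precondition_for_pull(pair: BayPair) -> str:
    """Say what has to be true before one module of the pair is unplugged."""
    if pair.redundancy_configured and pair.redundancy_tested:
        return "proceed: minimal impact expected"
    if pair.used_for_remote_boot:
        return "shut down all blade servers gracefully first"
    return "schedule a maintenance window: blade connectivity may drop"
```

For example, `precondition_for_pull(BayPair(True, False, used_for_remote_boot=True))` returns the shutdown message, because redundancy that has not been tested does not satisfy the condition in step 3.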

MM Event Log Message:

Failure reading I2C device. More than one message for multiple buses.

Actions:

  1. Troubleshoot the failure using the preceding steps for a single I2C bus error.
  2. Refer to the document "Troubleshooting the 8677 chassis" for more isolation steps.
  3. Contact the appropriate Support Center for your geography.

In the United States, contact 1-800-IBM-SERV.

The "IBM Directory of Worldwide Contacts" is available from the following URL:

http://www.ibm.com/planetwide/

Workaround

None.

Additional Information

None.

Document Location

Worldwide

Operating System

BladeCenter:Operating system independent / None


Document Information

Modified date:
29 January 2019

UID

ibm1MIGR-5071089