
Problem analysis for 7874-024, 7874-040, 7874-120, and 7874-240 switches

You can use problem analysis to gather information that helps you determine the nature of a problem encountered on your system.

Use the following table to begin problem analysis and to start service.

In the following table, find the first failure indication that you observed, and then follow the action specified in the right column. After you have completed the actions in that row, that problem should be repaired. If not, continue with the next failure indication.

Table 1. Switch failure analysis and action
Failure indication Description and action
1. Serviceable event in Service Focal Point™ on the Hardware Management Console (HMC).

Description: A hardware system unit, I/O drawer, or frame power problem requires parts or service procedures to correct the failure.

Action: Follow normal service procedures for the failed part. Depending on effects of the serviceable event, this might also fix problems on the InfiniBand™ switch fabric.

2. InfiniBand switch light emitting diodes (LEDs) are all off

Description: There is no power to the switch, or there is a power supply failure, or a fan failure.

Action:
  1. Check the power cables at the switch, and determine if the power is active. If you find a problem, replace the power cable or work with the customer to fix the power problem.
  2. If there is no problem with input power, the problem is the switch. Replace the power supplies one at a time until the problem is fixed.

Go to Download the QLogic remove and replace procedures for the 7874-040, 7874-120, and 7874-240 switches.

Go to Download the QLogic remove and replace procedures for the 7874-024 switch.

3. InfiniBand switch has a red LED that is lit. Some examples are the following items:
  • Chassis Status LED on managed spine
  • Status LED on leaf module
  • Red LED on power or fan module

Description: The red LED indicates a hardware failure.

A red chassis LED indicates one of the following conditions:
  • The system ambient temperature exceeds 60 degrees C.
  • No functional fan trays are present.
  • No functional spines are present.
  • No functional leaves are present.

Action:
  • If the red LED is on a managed spine or leaf module:
    1. Reseat this managed spine or leaf module.
    2. If the LED is still red, insert this managed spine or leaf module into another slot.
    3. If the LED is still red, replace this managed spine or leaf module.
  • If the red LED is on a power supply or fan module, replace the power or fan module.

Go to Download the QLogic remove and replace procedures for the 7874-040, 7874-120, and 7874-240 switches.

Go to Download the QLogic remove and replace procedures for the 7874-024 switch.

4. InfiniBand switch has an amber Attention LED that is lit. Some examples are the following items:
  • Attention LED on managed spine
  • Attention LED on leaf module

Description: An amber Attention LED indicates a possible hardware failure. Data needs to be collected for analysis.

An amber chassis LED indicates one of the following conditions:
  • The system ambient temperature exceeds 52 degrees C, but is less than 60 degrees C.
  • There is a fan problem.
  • A power supply AC OK LED is off.
  • A power supply DC OK LED is off.
  • Any spine module Attention LED is on, or any spine is not functioning (even if unable to light the LED).
  • Any leaf module Attention LED is on, or any leaf is not functioning (even if unable to light the LED).

Action: Collect data. Go to Collecting data for InfiniBand switch errors for 7874-024, 7874-040, 7874-120, and 7874-240 switches and perform that procedure.
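
The red and amber chassis LED conditions listed in rows 3 and 4 can be summarized as a small decision function. The following Python sketch is illustrative only: the `ChassisStatus` type and its field names are hypothetical and are not part of any switch interface. It also assumes that a red condition takes precedence over an amber one, and it treats a reading of exactly 60 degrees C as amber, because the table specifies only "exceeds 60" for red and "less than 60" for amber.

```python
from dataclasses import dataclass

RED_TEMP_C = 60    # red when the ambient temperature exceeds this value
AMBER_TEMP_C = 52  # amber when the ambient temperature exceeds this value


@dataclass
class ChassisStatus:
    """Hypothetical snapshot of chassis state; field names are illustrative."""
    ambient_temp_c: float
    functional_fan_trays: int
    functional_spines: int
    functional_leaves: int
    fan_problem: bool = False
    psu_ac_ok: bool = True
    psu_dc_ok: bool = True
    spine_attention: bool = False  # also set when a spine is not functioning
    leaf_attention: bool = False   # also set when a leaf is not functioning


def expected_chassis_led(s: ChassisStatus) -> str:
    """Return 'red', 'amber', or 'none' for the listed chassis conditions."""
    # Red conditions (row 3). Assumption: red is checked before amber.
    if (s.ambient_temp_c > RED_TEMP_C
            or s.functional_fan_trays == 0
            or s.functional_spines == 0
            or s.functional_leaves == 0):
        return "red"
    # Amber conditions (row 4). Assumption: exactly 60 degrees C lands here.
    if (s.ambient_temp_c > AMBER_TEMP_C
            or s.fan_problem
            or not s.psu_ac_ok
            or not s.psu_dc_ok
            or s.spine_attention
            or s.leaf_attention):
        return "amber"
    return "none"
```

For example, `expected_chassis_led(ChassisStatus(55, 1, 1, 1))` evaluates to `"amber"`, because 55 degrees C exceeds the amber threshold but not the red one.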

5. InfiniBand switch port link has a blue LED that is not lit.

Description: A blue link LED on the switch indicates a good physical connection between the switch port and the device at the other end of the cable. If the LED is not lit, there is a problem with the port, the cable, or the InfiniBand host channel adapter.

Action: When the blue link LED is lit at the switch port, the link is physically connected; however, the link might still be experiencing intermittent errors. The customer can monitor and check for intermittent errors on the link. In most cases, intermittent errors result from a bad cable or connection.

6. One of the following logs indicates a loss of InfiniBand switch communication with a server or logical partition:
  • Subnet manager log from fabric management server (or from InfiniBand switch)
  • Switch log (switch chassis)
  • Fast Fabric Health Check result from fabric management server
  • Fast Fabric Report from fabric management server (Iba_report)

Description: The loss of InfiniBand switch connections can result from different failures, including server, logical partition, host channel adapter, cable, InfiniBand switch failures, partitioning configuration errors, or operating system configuration problems.

Isolation:
  1. Collect data. Go to Collecting data for InfiniBand switch errors for 7874-024, 7874-040, 7874-120, and 7874-240 switches and perform that procedure.
  2. If multiple link errors are reported, look for patterns to the failures that might help to isolate the failing part, such as the following situations:
    1. All links are connected to a single server.
    2. All links are connected to a single logical partition.
    3. All links are connected to a single host channel adapter (that is, InfiniBand host channel adapter).
    4. All links are connected to a single InfiniBand switch.
    5. All links are connected to a single InfiniBand switch leaf.
      Note: If the InfiniBand switch fabric has more than one independent failure, you might treat them separately.
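
The pattern search in step 2 amounts to asking which component, if any, every failing link has in common. The following Python sketch shows one possible way to do that, assuming you have already transcribed the failing links into records by hand. The record field names (`hca`, `lpar`, `server`, `leaf`, `switch`) and the sample values are hypothetical; they are not the output format of any fabric tool.

```python
from typing import Dict, List

# Hypothetical field names for a failing-link record, ordered from the
# narrowest component to the broadest.
COMPONENT_FIELDS = ["hca", "lpar", "server", "leaf", "switch"]


def common_components(failed_links: List[Dict[str, str]]) -> Dict[str, str]:
    """Return each component type that every failing link shares, with its name."""
    shared: Dict[str, str] = {}
    if len(failed_links) < 2:
        return shared  # a pattern needs at least two failing links
    for field in COMPONENT_FIELDS:
        values = {link.get(field) for link in failed_links}
        # A component is common only if every record names the same one.
        if len(values) == 1 and None not in values:
            shared[field] = values.pop()
    return shared


if __name__ == "__main__":
    # Illustrative records only; names are made up for this example.
    links = [
        {"server": "node01", "lpar": "node01-lp1", "hca": "node01-hca0",
         "switch": "sw1", "leaf": "L03"},
        {"server": "node02", "lpar": "node02-lp1", "hca": "node02-hca0",
         "switch": "sw1", "leaf": "L03"},
    ]
    print(common_components(links))  # {'leaf': 'L03', 'switch': 'sw1'}
```

In this example both failing links share a switch and a switch leaf, which points to the leaf as the narrowest common component. If the function returns an empty result, the failures might be independent and, as the note above says, might be treated separately.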

6. Logs indicate loss of InfiniBand switch communication with a server or logical partition (continued)

Action:
  1. If all links connected to a single server or logical partition are not functioning, complete the following steps:
    1. Check for obvious down or hung conditions on the server or logical partition. If found, the customer should recover the server or the logical partition, or contact an IBM® service representative, as necessary. The IBM service representative will then use normal server procedures to fix the problem.
    2. Have the customer check for an InfiniBand switch adapter configuration problem. This could be a host-channel-adapter partitioning problem or an InfiniBand switch interface error in the operating system. If found, the customer corrects the problem.
    3. If the links are from a single host channel adapter, skip to step 4.
  2. If all links connected to a single InfiniBand switch are down, complete the following steps:
    1. Check for a switch power problem and fix, as necessary.
    2. If no power problem is found, collect data as indicated under Isolation and send the data to IBM for analysis.
  3. If all links that are connected to a single InfiniBand switch leaf are down, replace the switch leaf.

    Go to Download the QLogic remove and replace procedures for the 7874-040, 7874-120, and 7874-240 switches.

    Go to Download the QLogic remove and replace procedures for the 7874-024 switch.

  4. If all links that are connected to a single host channel adapter are down, complete the following steps:
    1. Have the customer check for an InfiniBand host channel adapter configuration problem. This problem could be a host-channel-adapter partitioning problem or an InfiniBand switch-interface error in the operating system. If found, the customer must correct the problem.
    2. If no other problems are found, replace the host channel adapter.
  5. If no other problems are found with the server or the logical partition, then the problem might be isolated to InfiniBand switch links. Go to Isolating InfiniBand switch link errors for 7874-024, 7874-040, 7874-120, and 7874-240 switches to continue problem determination.

7. Subnet manager log
  • If using host-based subnet manager, the subnet manager log is found on the fabric management server under /var/log/messages.
  • If using the embedded subnet manager, the subnet manager log is found on the switch.

The subnet manager monitors the fabric and manages recovery operations.

Errors should also be logged on the cluster systems management (CSM) server under /var/log/csm/errorlog/CSM MS hostname.

Action: Go to Collecting data from the fabric management server for 7874-024, 7874-040, 7874-120, and 7874-240 switches and perform that procedure.
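
When the host-based subnet manager is in use, its messages share /var/log/messages with other system messages, so it can help to filter the file before you review it. The following Python sketch is one possible way to do that and does not replace the data collection procedure. The search pattern is an assumption that you supply; this topic does not define the text that the subnet manager writes, so no specific message format is implied.

```python
import re
from pathlib import Path
from typing import List

# Location given in this topic for the host-based subnet manager log.
DEFAULT_LOG = Path("/var/log/messages")


def matching_lines(pattern: str, log_path: Path = DEFAULT_LOG) -> List[str]:
    """Return the lines of log_path that match pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    # errors="replace" keeps the scan going if the log holds stray bytes.
    with open(log_path, "r", errors="replace") as fh:
        return [line.rstrip("\n") for line in fh if regex.search(line)]
```

For example, `matching_lines(r"error|warn")` returns every line that contains either word. Reading /var/log/messages typically requires root authority on the fabric management server.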

8. Switch log

Some examples are the following items:
  • Switch (through logShow)
  • Also errors on CSM server in file /var/log/csm/errorlog/CSM MS hostname

The switch log reflects problems within the switch chassis.

9. Fast fabric health check result

Some examples are the following items:
  • Fabric management server in files:
    • /var/opt/iba/analysis/latest/chassis*.diff
    • /var/opt/iba/analysis/latest/chassis*.errors

The Fast Fabric Health Check is used during install, repair, and monitoring of the fabric to find errors and configuration changes that might cause problems in the fabric.

Action: Go to Collecting data for Fast Fabric Health Check for 7874-024, 7874-040, 7874-120, and 7874-240 switches and perform that procedure.
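
Before you start the collection procedure, it can be useful to see which chassis result files are present. The following Python sketch lists the files that match the two name patterns given in this row, along with their sizes. It is a convenience only, under the assumption that you run it on the fabric management server; it does not interpret the file contents.

```python
from pathlib import Path
from typing import List, Tuple

# Directory and file name patterns as given in this topic.
ANALYSIS_DIR = Path("/var/opt/iba/analysis/latest")
PATTERNS = ("chassis*.diff", "chassis*.errors")


def health_check_files(directory: Path = ANALYSIS_DIR) -> List[Tuple[str, int]]:
    """Return (file name, size in bytes) for each matching file, sorted by name."""
    found = []
    for pattern in PATTERNS:
        for path in directory.glob(pattern):
            if path.is_file():
                found.append((path.name, path.stat().st_size))
    return sorted(found)


if __name__ == "__main__":
    for name, size in health_check_files():
        print(f"{size:>10}  {name}")
```

The `directory` parameter lets you point the function at a copy of the analysis directory, for example one that was saved from an earlier health check run.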

10. Fast Fabric Report

Some examples are the following items:
  • Fabric management server in file /var/opt/iba/analysis/latest/*.stderr

See the Fast Fabric Report

Action:
  1. Collect all health check history data. Go to Collecting data for Fast Fabric Health Check for 7874-024, 7874-040, 7874-120, and 7874-240 switches and perform that procedure.
  2. From the fabric management server, collect the data from the file /var/log/messages.

11. Other error indicators or reporting methods

This problem includes other ways that you might hear about an error, such as a user complaint. Review this table for other failure indications.

For more information about cluster fabric that incorporates InfiniBand switches, see the IBM System p® HPC Clusters Fabric Guide at the IBM clusters with the InfiniBand switch Web site.



Last updated: Fri, Oct 30, 2009