You can use problem analysis to gather information that helps you determine the nature of a problem encountered on your system.
Use the following table to begin problem analysis and to start service.
In the following table, find the first failure indication that you observed, and then follow the action specified in the right column. After you complete the actions in that row, the problem should be resolved. If it is not, continue with the next failure indication.
| Failure indication | Description and action |
|---|---|
| 1. Serviceable event in Service Focal Point™ on the Hardware Management Console (HMC). | Description: A hardware system unit, I/O drawer, or frame power problem requires parts or service procedures to correct the failure. Action: Follow normal service procedures for the failed part. Depending on the effects of the serviceable event, this might also fix problems on the InfiniBand™ switch fabric. |
| 2. InfiniBand switch light emitting diodes (LEDs) are all off. | Description: There is no power to the switch, or there is a power supply failure or a fan failure. Action: Go to Download the QLogic remove and replace procedures for the 7874-040, 7874-120, and 7874-240 switches. Go to Download the QLogic remove and replace procedures for the 7874-024 switch. |
| 3. InfiniBand switch has a red LED that is lit. Some examples are the following items: | Description: The red LED indicates a hardware failure. A red chassis LED indicates one of the following conditions: Action: Go to Download the QLogic remove and replace procedures for the 7874-040, 7874-120, and 7874-240 switches. Go to Download the QLogic remove and replace procedures for the 7874-024 switch. |
| 4. InfiniBand switch has an amber Attention LED that is lit. Some examples are the following items: | Description: An amber Attention LED indicates a possible hardware failure. Data needs to be collected for analysis. An amber chassis LED indicates one of the following conditions: Action: Collect data. Go to Collecting data for InfiniBand switch errors for 7874-024, 7874-040, 7874-120, and 7874-240 switches and perform that procedure. |
| 5. InfiniBand switch port link has a blue LED that is not lit. | Description: A blue link LED on the switch indicates a good physical connection between the switch port and the device at the other end of the cable. If the LED is not lit, there is a problem with the port, the cable, or the InfiniBand host channel adapter. Action: |
| 6. One of the following logs indicates a loss of InfiniBand switch communication with a server or logical partition: | Description: The loss of InfiniBand switch connections can result from different failures, including server, logical partition, host channel adapter, cable, or InfiniBand switch failures, partitioning configuration errors, or operating system configuration problems. Isolation: |
| 6. Logs indicate loss of InfiniBand switch communication with a server or logical partition (continued) | Action: |
| 7. Subnet manager log | Description: The subnet manager monitors the fabric and manages recovery operations. Errors should also be logged on the cluster systems management (CSM) server under /var/log/csm/errorlog/CSM MS hostname. Action: Go to Collecting data from the fabric management server for 7874-024, 7874-040, 7874-120, and 7874-240 switches and perform that procedure. |
| 8. Switch log. Some examples are the following items: | Description: The switch log reflects problems within the switch chassis. Action: |
| 9. Fast Fabric Health Check result. Some examples are the following items: | Description: The Fast Fabric Health Check is used during installation, repair, and monitoring of the fabric to find errors and configuration changes that might cause problems in the fabric. Action: Go to Collecting data for Fast Fabric Health Check for 7874-024, 7874-040, 7874-120, and 7874-240 switches and perform that procedure. |
| 10. Fast Fabric Report. Some examples are the following items: | See the Fast Fabric Report. Action: |
| 11. Other error indicators or reporting methods | Description: This failure indication covers other ways that you might learn about an error, such as a user complaint. Action: Review this table for other failure indications. |
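
The way the table is meant to be read (start at the first failure indication you observed, then move to the next one if the problem remains) can be sketched as a small lookup. This is only an illustration: the `FAILURE_INDICATIONS` list and the `next_indication` function are hypothetical names that are not part of any IBM or QLogic tool, and the labels are abbreviated from the table rows.

```python
from typing import Iterable, Optional, Tuple

# Failure indications in table order (row number, abbreviated label).
# Hypothetical structure, used here only to illustrate the reading order.
FAILURE_INDICATIONS = [
    (1, "Serviceable event in Service Focal Point on the HMC"),
    (2, "InfiniBand switch LEDs are all off"),
    (3, "InfiniBand switch has a red LED that is lit"),
    (4, "InfiniBand switch has an amber Attention LED that is lit"),
    (5, "InfiniBand switch port link has a blue LED that is not lit"),
    (6, "Logs indicate loss of switch communication with a server or logical partition"),
    (7, "Subnet manager log"),
    (8, "Switch log"),
    (9, "Fast Fabric Health Check result"),
    (10, "Fast Fabric Report"),
    (11, "Other error indicators or reporting methods"),
]


def next_indication(
    observed: Iterable[int], completed: Iterable[int] = ()
) -> Optional[Tuple[int, str]]:
    """Return the first observed row, in table order, whose actions are not yet completed."""
    observed_rows = set(observed)
    completed_rows = set(completed)
    for row, label in FAILURE_INDICATIONS:
        if row in observed_rows and row not in completed_rows:
            return row, label
    return None  # nothing left to work through
```

For example, if rows 5 and 2 were both observed, `next_indication([5, 2])` selects row 2 first. After the actions for row 2 are completed and the problem remains, `next_indication([5, 2], completed=[2])` selects row 5.
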
For more information about cluster fabric that incorporates InfiniBand switches, see the IBM System p® HPC Clusters Fabric Guide at the IBM clusters with the InfiniBand switch Web site.