IBM Support

Troubleshooting Topspin InfiniBand issues

Problem

This document provides troubleshooting information for the Topspin InfiniBand Switch Module.

Resolving The Problem

Note: This document applies to the Topspin InfiniBand Switch Module (26K6454) for BladeCenter.

  1. To begin troubleshooting, check the following top issues. If your issue is listed, select its link; otherwise, proceed to step 2.

    Can't communicate with switch via management network
    InfiniBand HCA does not support Blade Storage Expansion
    Unable to bring up interfaces ib2 or ib3 when using two InfiniBand Host Channel Adapters (HCAs)

    The Topspin InfiniBand switch module provides InfiniBand fabric connectivity to the BladeCenter. One internal port connects to each blade server, and two external ports provide up to four 4X connectors to the fabric. This switch module is identical in operation to other Topspin InfiniBand switches, with one exception: the login. For the IBM InfiniBand switch module, the user ID is USERID (all uppercase) and the password is Passw0rd (with a zero in place of the letter o).

    This switch is managed by the BladeCenter Management Module. The Topspin management software (the CLI, Element Manager, and Chassis Manager) also works with this switch module. The InfiniBand switch module for IBM eServer BladeCenter can be integrated into an existing Topspin InfiniBand fabric and interoperates with standard Topspin switches through its external connectors.

    TopSpin IB Switch Module
    • One external 4X InfiniBand connector
    • One external 12X InfiniBand connector
    • 14 internal 1X connectors
    • Fully non-blocking switching for all 18 ports


    Note: InfiniBand speeds are designated in multiples of 2.5 Gbps; for example, a 4X connection is rated at 10 Gbps.
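    The arithmetic in the note above can be sketched as a small shell helper. The function name is illustrative only and is not part of any Topspin tool; the sketch simply multiplies the link width by the 2.5 Gbps figure given in the note.

```shell
#!/bin/sh
# Nominal InfiniBand rate for a given link width, using the
# 2.5 Gbps multiple stated in the note above.
# Usage: ib_link_rate_gbps WIDTH   (for example 1, 4, or 12)
ib_link_rate_gbps() {
    # awk handles the fractional multiplication that sh arithmetic cannot.
    awk -v width="$1" 'BEGIN { printf "%g\n", width * 2.5 }'
}

# Print the rate for each connector width found on this switch module.
for width in 1 4 12; do
    echo "${width}X = $(ib_link_rate_gbps "$width") Gbps"
done
```

    For the widths on this module, this prints 2.5, 10, and 30 Gbps for 1X, 4X, and 12X respectively.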

  2. If the switch sees IP or FC devices, troubleshoot the InfiniBand host. If the switch sees no IP or FC devices, troubleshoot the switch's access to the network or SAN.

    Several tools are available for troubleshooting the Topspin InfiniBand switch module. Some require the switch to be rebooted, so downtime must be scheduled, and in some cases it may not be possible to bring the switch down. Others can be used while the switch module is up and running. The available troubleshooting tools are:
    • Software log: ts_log (can be acquired while the switch is up and running) - stores software events
    • Hardware log: hwif_log (can be acquired while the switch is up and running) - contains the POST messages
    • POST (switch must be rebooted) - the Power-On Self-Test (POST) runs when components initialize and tests their integrity
    • CLI “show” and “diagnostic” commands
    • Host-side utilities - vstat, lspci, and the self-test to verify the integrity of the HCA
    • Topology view


    Perform the following steps in all cases:

  3. Verify that the Topspin drivers are loaded. To verify that the drivers are loaded in Linux, run the following command (sample output shown):

    # lsmod | grep ts
    ts_sdp 151768 0 (autoclean) (unused)
    ts_udapl 37000 0 (autoclean) (unused)
    ts_ipoib 63964 2 (autoclean) [ts_udapl ts_ip2pr]
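
    The driver check can also be scripted. The following is a minimal sketch, not a Topspin-supplied tool: the function name is hypothetical, and the list of modules to look for is an assumption taken from the sample output in this step. Adjust it to the modules your installation actually uses.

```shell
#!/bin/sh
# Report which expected modules are absent from lsmod-style output.
# Reads the listing on stdin; module names are passed as arguments.
# Prints one missing module per line and returns non-zero if any are missing.
missing_modules() {
    listing=$(cat)
    status=0
    for module in "$@"; do
        # The module name is the first column of each lsmod line.
        if ! printf '%s\n' "$listing" |
            awk -v m="$module" '$1 == m { found = 1 } END { exit !found }'; then
            echo "$module"
            status=1
        fi
    done
    return $status
}

# Example against the live system (module list is an assumption):
#   lsmod | missing_modules ts_sdp ts_udapl ts_ipoib
```

    Matching on the first column avoids false positives from the dependency list in the last column, where a module name can appear even though the module itself is not loaded.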


    To verify that the correct versions of the drivers are installed in Linux for the switch module, run the following command (sample output shown):

    # rpm -qa | grep topspin
    topspin-ib-mod-rhel3-2.4.21-4.ELsmp-2.5.0-264
    topspin-ib-rhel3-2.5.0-264
    topspin-ib-mpi-rhel3-2.5.0-264
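
    One way to spot mismatched driver packages is to compare the version suffixes of the installed packages. The sketch below assumes, based only on the sample output above, that each package name ends in a version-release pair such as 2.5.0-264. The function name is illustrative.

```shell
#!/bin/sh
# Print the distinct version-release suffixes found in a list of
# package names read from stdin (one name per line).
# ASSUMPTION: the last two hyphen-separated fields are version and release,
# as in the sample names shown in this step.
topspin_versions() {
    awk -F- 'NF >= 2 { print $(NF-1) "-" $NF }' | sort -u
}

# Example against the live system:
#   rpm -qa | grep topspin | topspin_versions
# More than one line of output suggests the packages are not all
# at the same level.
```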

  4. Download the latest Topspin firmware and software code updates for your system:

    Download Topspin InfiniBand firmware and code
    Download the latest BladeCenter support software


  5. Verify active InfiniBand ports. To check whether the InfiniBand ports are active, use the vstat utility to display the HCA information. In the example, two HCA ports are connected to the InfiniBand fabric, port 1 is assigned to the ib0 network, and the port_state is PORT_ACTIVE. If the port state is PORT_INITIALIZE, wait a few seconds and try again. Also note the hw_ver and fw_ver fields, and check that they are at current version levels.
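
    The "wait a few seconds and try again" advice can be sketched as a retry loop. This is an illustration under stated assumptions rather than a documented procedure: it assumes vstat reports each port on a line of the form port_state=PORT_ACTIVE, and the function names, retry count, and delay are all hypothetical.

```shell
#!/bin/sh
# Extract port_state values from vstat-style output on stdin.
# ASSUMPTION: each port appears as a line like "port_state=PORT_ACTIVE".
port_states() {
    sed -n 's/^[[:space:]]*port_state=\([A-Z_]*\).*/\1/p'
}

# Re-run a status command until no port reports PORT_INITIALIZE,
# or until the retry budget is spent.
# Usage: wait_for_ports RETRIES DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 when settled, 1 if still initializing, 2 if no state was found.
wait_for_ports() {
    retries=$1
    delay=$2
    shift 2
    states=""
    while [ "$retries" -gt 0 ]; do
        states=$("$@" 2>/dev/null | port_states)
        case "$states" in
            "") echo "no port_state lines found" >&2; return 2 ;;
            *PORT_INITIALIZE*) sleep "$delay" ;;
            *) printf '%s\n' "$states"; return 0 ;;
        esac
        retries=$((retries - 1))
    done
    printf '%s\n' "$states"
    return 1
}

# Example: poll up to 5 times, 2 seconds apart.
#   wait_for_ports 5 2 vstat
```

    A return code of 0 means only that no port is still initializing; inspect the printed states to confirm that each is PORT_ACTIVE.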

  6. Check the status LEDs. The LEDs indicate the following operating states:

    TopSpin LEDs

    1. Status LEDs:
      • Both LEDs off - No system power, or LED malfunction
      • Yellow solid, green off - Module error detected; operator intervention required
      • Green solid, yellow off - Module running with no errors detected
    2. Physical Port Status LED:
      • LED Off - No physical link
      • LED solid Green - Successful physical link
    3. Logical Link Status LED:
      • LED Off - No logical link
      • LED blinking Green - Port is passing traffic


  7. Verify that the host recognizes the Host Channel Adapter (HCA). Several utilities are available for HCA problem determination: vstat, lspci, and the HCA self-test. The syntax and use of these commands are detailed in the Topspin InfiniBand Host Channel Adapter Expansion Card for IBM BladeCenter User Guide.

    • In Linux, to verify that the host recognizes the HCA, run the following command:

      [root@qa-bc1-blade2 root]# lspci
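
    Because lspci lists every PCI device, it can help to filter the listing. The sketch below assumes that the HCA's lspci description contains the word "InfiniBand"; the exact device string depends on the adapter and is not given in this document. The function name is illustrative.

```shell
#!/bin/sh
# Print the lines of an lspci-style listing (read from stdin) that
# mention InfiniBand. Exits non-zero when no such line is present.
# ASSUMPTION: the HCA description includes the word "InfiniBand".
find_hca() {
    grep -i 'infiniband'
}

# Example against the live system:
#   lspci | find_hca || echo "No InfiniBand device listed by lspci"
```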

  8. Check for host-side or server switch-side problems. If the server switch module sees IP or FC devices then troubleshoot the InfiniBand host. If the server switch module does not see IP or FC devices then troubleshoot the server switch module’s access to the network or SAN.

  9. Check for cabling problems. The external connectors provide up to four 4X connections to the InfiniBand fabric and can be configured either as four 4X ports or as one 4X port and one 12X port. Configured as four 4X ports, they are numbered 15, 16, 17, and 18. Configured as one 4X connector and one 12X connector, they are numbered 15 and 16, respectively. Two IBM InfiniBand cables can be used with this switch, depending on how you want the ports configured. To configure the connectors as four 4X ports, use the InfiniBand “octopus” cable shown in the following graphic.

    • 4x connector is a standard 4x connector at both ends
    • 12x connector uses an "octopus" cable with three legs at 4x speed combined into a single 12x connector

      Cable connectors
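
    The external port numbering described in this step can be summarized as a lookup. The configuration labels used here ("4x4" and "4x-12x") are invented for this sketch and are not switch CLI keywords.

```shell
#!/bin/sh
# Print the external port numbers for a connector configuration.
# The mode labels are hypothetical names used only in this sketch.
external_ports() {
    case "$1" in
        4x4)    echo "15 16 17 18" ;;   # four 4X ports
        4x-12x) echo "15 16" ;;         # port 15 is 4X, port 16 is 12X
        *)      echo "unknown configuration: $1" >&2; return 1 ;;
    esac
}

external_ports 4x4
external_ports 4x-12x
```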


    To troubleshoot bad switch-to-switch cables, run the ‘show interface ib all statistics’ command from the CLI and look for any nonzero ‘in-unknown-protos’ values. These typically indicate bad edge-switch-to-core-switch cables.

    To troubleshoot bad switch-to-host cables, first make sure the driver RPM is installed. If it is, check ‘/proc/topspin/core/ca1/port1/counters’ for any nonzero ‘Symbol error counter’ values. These typically indicate bad node-to-edge-switch cables.
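
    Scanning for nonzero counters can be scripted. The following sketch makes an assumption about the layout that this document does not confirm: that each counter is printed on its own line with the numeric value as the last whitespace-separated field. The function name is illustrative.

```shell
#!/bin/sh
# Print counter lines (read from stdin) whose text contains NAME and whose
# value is a nonzero integer. Like grep, exits 0 if any line matched and
# 1 otherwise.
# ASSUMPTION: the value is the last whitespace-separated field on the line.
nonzero_counters() {
    awk -v name="$1" '
        index(tolower($0), tolower(name)) && $NF ~ /^[0-9]+$/ && $NF + 0 > 0 {
            print
            hits++
        }
        END { exit (hits ? 0 : 1) }
    '
}

# Host side:
#   nonzero_counters "Symbol error counter" < /proc/topspin/core/ca1/port1/counters
# Switch side, using CLI output saved to a file of your choosing:
#   nonzero_counters "in-unknown-protos" < saved_statistics.txt
```

    If the counters on your system use a different layout, such as name=value, adjust the field handling accordingly.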

  10. Verify network startup scripts for persistent InfiniBand port configuration. When the Server Switch Module (SSM) is first installed, configure its static IP address, subnet mask, and gateway IP address. This IP address configuration can be done through the Management Module interface.

  11. Consult the available product documentation:

    Topspin InfiniBand switch module product overview
    BladeCenter switch module sales and marketing
    Topspin Support Web site
    Evaluating the Value Proposition of BladeCenter with InfiniBand


  12. If these steps have not solved your problem:
    Refer to your system's Hardware Maintenance Manual, or refer to "Need more help?"

Need more help?
Please select one of the following options for further assistance:

Support forums
Submit a technical question
Before you call IBM Service

 

Document Location

Worldwide

Operating System

System x Hardware Options:Operating system independent / None

Document Information

Modified date:
29 January 2019

UID

ibm1MIGR-59632