IBM Support

Troubleshooting InfiniBand Network issues in Platform Cluster Manager

Troubleshooting


Problem

You suspect a problem with the InfiniBand network in a Platform Cluster Manager (PCM) cluster and need basic steps to diagnose it.

Resolving The Problem

The InfiniBand (IB) fabric is a critical component of many Platform HPC/PCM clusters. As an administrator, you need basic troubleshooting skills to investigate issues that you suspect are related to the IB fabric in your cluster. This article provides basic troubleshooting examples that you can follow when solving such issues.

Check the Subnet Manager status

The most common problem with an IB fabric is that the Subnet Manager (SM) is not running, or is running but reporting errors. The SM, which runs as the opensmd service, is a critical service that must run on at least one node in your cluster. That node must have a host channel adapter (HCA) installed for opensmd to run on it. Typically, the SM runs on your installer node. There are two cases where this may not be true:
  • The Subnet Manager runs on the IB switch itself.
  • The Subnet Manager runs on a compute node. This can be the case if your installer node does not have an HCA.
If your SM is running on the installer node, then you can check the status of SM as below:
# service opensmd status
opensm (pid 11311) is running...
You can see that the SM is running with a PID of 11311.
 
You can also check the SM log, typically found in the file /var/log/opensmd.log. If you see errors in the SM log file, save the file and send it to PCM Support for further investigation.
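
As a rough aid for the log check above, the following sketch counts the lines in an SM log that look like error entries. The function name count_sm_log_errors is hypothetical, the default path is the one quoted in this article, and the assumption that error entries contain the token "ERR" should be verified against your own log before you rely on the count.

```shell
#!/bin/sh
# Sketch only: count lines in an OpenSM log that look like error entries.
# Assumptions: error entries contain the token "ERR"; the default log path
# is the one named in this article and may differ on your system.
count_sm_log_errors() {
    logfile="${1:-/var/log/opensmd.log}"
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 2
    fi
    # grep -c prints the number of matching lines. It exits with status 1
    # when that number is 0, which is not a failure for our purposes.
    grep -c 'ERR' "$logfile" || [ $? -eq 1 ]
}
```

On the node that runs the SM, you might run count_sm_log_errors after service opensmd status. A non-zero count is only a hint that the log deserves a closer look.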

List loaded kernel modules for InfiniBand

Listing the loaded InfiniBand kernel modules shows which IB drivers are currently in use.

[root@compute-00-00 ~]# lsmod | grep -i ib
ib_sdp                147176  0
rdma_cm                68500  2 rdma_ucm,ib_sdp
ib_addr                41992  1 rdma_cm
ib_ipoib              113240  0
ipoib_helper           35728  2 ib_ipoib
ib_cm                  73000  3 qlgc_vnic,rdma_cm,ib_ipoib
ib_sa                  75016  4 qlgc_vnic,rdma_cm,ib_ipoib,ib_cm
ipv6                  424609  37 ib_ipoib
ib_uverbs              75824  1 rdma_ucm
ib_umad                50472  4
ib_ipath              355456  0
mlx4_ib                99260  0
ib_mthca              157988  0
ib_mad                 70948  5 ib_cm,ib_sa,ib_umad,mlx4_ib,ib_mthca
ib_core               108544  15 rdma_ucm,qlgc_vnic,ib_sdp,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_sa,ib_uverbs,ib_umad,iw_cxgb3,ib_ipath,mlx4_ib,ib_mthca,ib_mad
mlx4_core             136036  1 mlx4_ib
libata                208721  1 ata_piix
scsi_mod              196569  12 scsi_dh,sg,mptfc,scsi_transport_fc,mptctl,mptspi,scsi_transport_spi,libata,mptsas,mptscsih,scsi_transport_sas,sd_mod

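You can now use the modinfo command to get more details about a loaded module. For example, modinfo ib_core shows the exact path of the file the module is loaded from. Note that grep -i ib also matches unrelated modules whose lines contain "ib", such as libata and scsi_mod in the output above.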
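
To avoid running modinfo by hand for each module, the two steps can be combined. The sketch below is one way to do it. The function names are hypothetical, and the list of name prefixes (ib_, rdma_, mlx, iw_) is an assumption based on the sample output in this article, so adjust it to the drivers in your cluster.

```shell
#!/bin/sh
# Sketch only: list IB-related modules and the file each is loaded from.

# Read lsmod-format text on stdin and print the names of modules whose
# name starts with one of the assumed IB-related prefixes.
ib_module_names() {
    awk '$1 ~ /^(ib_|rdma_|mlx|iw_)/ { print $1 }'
}

# Print "module: path" for every IB-related module that is loaded now.
# modinfo -F filename prints only the "filename" field of a module.
ib_module_paths() {
    lsmod | ib_module_names | while read -r mod; do
        printf '%s: %s\n' "$mod" "$(modinfo -F filename "$mod")"
    done
}
```

Because ib_module_names matches on the module name only, it skips entries such as libata that a plain grep -i ib would include.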

ibstat

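This command shows the status of the host channel adapters (HCAs) on the local node. Look at the State of each port in the output: Down means the port has no usable link, and Active means the port is OK and ready for use. A port can also report Initializing or Armed, which usually means the link is up but the Subnet Manager has not finished configuring it. You must see at least one Active port on every IB-enabled node.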

[root@compute-00-00 ~]# ibstat
CA 'mlx4_0'
    CA type: MT25408
    Number of ports: 2
    Firmware version: 2.3.0
    Hardware version: a0
    Node GUID: 0x0002c9030002847c
    System image GUID: 0x0002c9030002847f
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510868
        Port GUID: 0x0002c9030002847d
    Port 2:
        State: Active
        Physical state: LinkUp
        Rate: 10
        Base lid: 2
        LMC: 0
        SM lid: 2
        Capability mask: 0x0251086a
        Port GUID: 0x0002c9030002847e
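
If you need to repeat this check on many nodes, counting the Active ports in the ibstat output is easier than reading it by eye. The following is a minimal sketch, assuming the output format shown above. The function name count_active_ports is hypothetical.

```shell
#!/bin/sh
# Sketch only: read ibstat output on stdin and print the number of ports
# whose logical state is Active.
count_active_ports() {
    # The pattern is anchored so that it matches "State: Active" lines but
    # not "Physical state: ..." lines. grep -c exits with status 1 when the
    # count is 0, which is not treated as a failure here.
    grep -c '^[[:space:]]*State:[[:space:]]*Active' || [ $? -eq 1 ]
}
```

For example, ibstat | count_active_ports prints 1 for the node shown above. A result of 0 means the node has no usable IB port.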

ibnetdiscover

This command scans your IB network topology and reports all IB devices it discovers. You should see all your IB devices reported here, including both the switches and the HCAs in your nodes.

[root@compute-00-00 ~]# ibnetdiscover
#
# Topology file: generated on Wed May  6 15:14:42 2009
#
# Max of 2 hops discovered
# Initiated from node 0002c9030002847c port 0002c9030002847e

vendid=0x2c9
devid=0xa87c
switchguid=0xb8cffff0053ee(b8cffff0053ee)
Switch  8 "S-000b8cffff0053ee"          # "MT43132 Mellanox Technologies" base port 0 lid 3 lmc 0
[1]     "H-0002c902002789ac"[1](2c902002789ad)          # "compute-00-01 HCA-1" lid 1 4xSDR
[2]     "H-0002c9030002847c"[2](2c9030002847e)          # "compute-00-00 HCA-1" lid 2 4xSDR

vendid=0x2c9
devid=0x5a44
sysimgguid=0x2c902002789af
caguid=0x2c902002789ac
Ca      2 "H-0002c902002789ac"          # "compute-00-01 HCA-1"
[1](2c902002789ad)      "S-000b8cffff0053ee"[1]         # lid 1 lmc 0 "MT43132 Mellanox Technologies" lid 3 4xSDR

vendid=0x2c9
devid=0x6340
sysimgguid=0x2c9030002847f
caguid=0x2c9030002847c
Ca      2 "H-0002c9030002847c"          # "compute-00-00 HCA-1"
[2](2c9030002847e)      "S-000b8cffff0053ee"[2]         # lid 2 lmc 0 "MT43132 Mellanox Technologies" lid 3 4xSDR

ibchecknet

This command checks the ports and nodes on your IB network and reports errors. The summary should show 0 bad nodes and 0 bad ports. If ports have errors beyond the threshold, run ibclearerrors (described below), wait five minutes, and run ibchecknet again. If errors are still reported, you most likely have a problem in your IB fabric hardware, possibly a bad cable connection or a bad switch port.

[root@compute-00-00 ~]# ibchecknet

# Checking Ca: nodeguid 0x0002c902002789ac

# Checking Ca: nodeguid 0x0002c9030002847c

## Summary: 3 nodes checked, 0 bad nodes found
## 4 ports checked, 0 bad ports found
## 0 ports have errors beyond threshold

ibnodes

This command scans the IB network topology and displays all hosts and switches. The output is a more concise form of the ibnetdiscover output (above).

[root@compute-00-00 ~]# ibnodes
Ca      : 0x0002c902002789ac ports 2 "compute-00-01 HCA-1"
Ca      : 0x0002c9030002847c ports 2 "compute-00-00 HCA-1"
Switch  : 0x000b8cffff0053ee ports 8 "MT43132 Mellanox Technologies" base port 0 lid 3 lmc 0
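
On a larger cluster, a quick way to confirm that every device was discovered is to count the entries and compare the totals with what you expect. The sketch below assumes the ibnodes output format shown above, where each line starts with Ca or Switch. The function name summarize_ibnodes is hypothetical.

```shell
#!/bin/sh
# Sketch only: read ibnodes output on stdin and print how many channel
# adapters (Ca) and switches were discovered.
summarize_ibnodes() {
    awk '
        $1 == "Ca"     { ca++ }
        $1 == "Switch" { sw++ }
        END { printf "cas=%d switches=%d\n", ca, sw }
    '
}
```

For the two-node fabric in this article, ibnodes | summarize_ibnodes would print cas=2 switches=1. A lower Ca count than the number of IB-enabled nodes suggests that some HCAs are not visible on the fabric.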

ibcheckstate

This command checks the logical state and the physical state of every port on your IB fabric. You should not see any ports with a bad state. If you see any, restart your SM and check again. If the errors persist, contact Platform Support.

[root@compute-00-00 ~]# ibcheckstate

## Summary: 3 nodes checked, 0 bad nodes found
##          4 ports checked, 0 ports with bad state found

ibcheckerrors

This command checks the error counters on your IB fabric. It is useful for finding ports whose error counters are beyond the indicated thresholds.

[root@compute-00-00 ~]# ibcheckerrors

## Summary: 3 nodes checked, 0 bad nodes found
##          4 ports checked, 0 ports have errors beyond threshold

ibclearerrors

This command clears all error counters on your IB fabric. Use it to determine whether errors are accumulating over time: after you clear the counters, wait five minutes and then run ibcheckerrors (above) to see if the error counters have increased. If they have, this indicates a problem with your IB fabric.

[root@compute-00-00 ~]# ibclearerrors

## Summary: 3 nodes cleared 0 errors
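
The clear, wait, and re-check sequence can be scripted. The following is a rough sketch, assuming the summary line format shown in the ibcheckerrors and ibchecknet examples in this article. The function names and the return codes are hypothetical choices, and the default wait of 300 seconds is the five minutes suggested above.

```shell
#!/bin/sh
# Sketch only: automate the clear / wait / re-check sequence.

# Read ibcheckerrors or ibchecknet output on stdin and print the number N
# from the "N ports have errors beyond threshold" summary line.
# Prints nothing if no such line is present.
ports_beyond_threshold() {
    sed -n 's/.*[^0-9]\([0-9][0-9]*\) ports have errors beyond threshold.*/\1/p'
}

# Clear the error counters, wait, and check them again.
# Returns 0 if ports are beyond threshold again, 1 if none are,
# and 2 if the summary line could not be found.
errors_accumulating() {
    wait_seconds="${1:-300}"
    ibclearerrors >/dev/null
    sleep "$wait_seconds"
    bad=$(ibcheckerrors | ports_beyond_threshold)
    [ -n "$bad" ] || return 2
    [ "$bad" -gt 0 ]
}
```

A call such as errors_accumulating 300 && echo "errors are accumulating" gives a simple yes or no answer. You still need the full ibcheckerrors output to see which ports are affected.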

ibclearcounters

This command clears all port counters on your IB fabric. Use it in the same way as ibclearerrors (above) to determine whether the counters are accumulating over time.

[root@compute-00-00 ~]# ibclearcounters

## Summary: 3 nodes cleared 0 errors

Document Information

Modified date:
16 September 2018

UID

isg3T1016169