Troubleshooting
Problem
Troubleshooting InfiniBand Network issues in Platform Cluster Manager
Resolving The Problem
Check the Subnet Manager status
The subnet manager (SM) can run in one of two places:
- On the IB switch itself.
- On a compute node. This can be the case if your installer node does not have an HCA. In that case, check the opensmd service on that node:
# service opensmd status
opensm (pid 11311) is running...
List loaded kernel modules for InfiniBand
Listing the loaded InfiniBand kernel modules shows which IB drivers are currently in use.
[root@compute-00-00 ~]# lsmod |grep -i ib
ib_sdp 147176 0
rdma_cm 68500 2 rdma_ucm,ib_sdp
ib_addr 41992 1 rdma_cm
ib_ipoib 113240 0
ipoib_helper 35728 2 ib_ipoib
ib_cm 73000 3 qlgc_vnic,rdma_cm,ib_ipoib
ib_sa 75016 4 qlgc_vnic,rdma_cm,ib_ipoib,ib_cm
ipv6 424609 37 ib_ipoib
ib_uverbs 75824 1 rdma_ucm
ib_umad 50472 4
ib_ipath 355456 0
mlx4_ib 99260 0
ib_mthca 157988 0
ib_mad 70948 5 ib_cm,ib_sa,ib_umad,mlx4_ib,ib_mthca
ib_core 108544 15 rdma_ucm,qlgc_vnic,ib_sdp,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_sa,ib_uverbs,ib_umad,iw_cxgb3,ib_ipath,mlx4_ib,ib_mthca,ib_mad
mlx4_core 136036 1 mlx4_ib
libata 208721 1 ata_piix
scsi_mod 196569 12 scsi_dh,sg,mptfc,scsi_transport_fc,mptctl,mptspi,scsi_transport_spi,libata,mptsas,mptscsih,scsi_transport_sas,sd_mod
You can now use the modinfo command to get more details about the loaded modules. For example, modinfo ib_core shows the exact path the module is loaded from.
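To run modinfo over every IB driver rather than one module at a time, the module names can be pulled out of the lsmod listing first. The following is a minimal sketch, not part of Platform Cluster Manager: the function name ib_module_names is hypothetical, and it assumes that IB driver names either start with "ib_" or end with "_ib", as they do in the sample output above (modules such as rdma_cm or mlx4_core are therefore not selected).

```shell
#!/bin/sh
# ib_module_names: read lsmod output on stdin and print the names of
# modules that look like InfiniBand drivers.
# Assumption: driver names start with "ib_" or end with "_ib".
ib_module_names() {
    awk '$1 ~ /^ib_|_ib$/ { print $1 }'
}

# Usage on a node (not executed here); -F filename asks modinfo for the
# path of the module file only:
#   lsmod | ib_module_names | while read -r mod; do
#       printf '%s: %s\n' "$mod" "$(modinfo -F filename "$mod")"
#   done
```

Filtering on the first column only avoids picking up unrelated modules such as libata, which match a plain grep for "ib".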
ibstat
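This command shows the host adapter status. Look at the port State in the output. It is normally either Down or Active (a port can also report Initializing or Armed when the link is physically up but the subnet manager has not yet activated it). Active means that the port is OK and ready for use. You must see at least one active port on all your IB enabled nodes.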
[root@compute-00-00 ~]# ibstat
CA 'mlx4_0'
    CA type: MT25408
    Number of ports: 2
    Firmware version: 2.3.0
    Hardware version: a0
    Node GUID: 0x0002c9030002847c
    System image GUID: 0x0002c9030002847f
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510868
        Port GUID: 0x0002c9030002847d
    Port 2:
        State: Active
        Physical state: LinkUp
        Rate: 10
        Base lid: 2
        LMC: 0
        SM lid: 2
        Capability mask: 0x0251086a
        Port GUID: 0x0002c9030002847e
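The "at least one active port" check can be scripted so that it is easy to repeat on every node. This is a minimal sketch under stated assumptions: the function name count_active_ports is hypothetical, and it relies on each port having a "State: <value>" line as in the sample output above.

```shell
#!/bin/sh
# count_active_ports: read ibstat output on stdin and print how many
# ports report the logical state Active.
# Assumption: every port has a line of the form "State: <value>".
# "Physical state:" lines are ignored because their first word is "Physical".
count_active_ports() {
    awk '$1 == "State:" && $2 == "Active" { n++ } END { print n + 0 }'
}

# Usage on a node with an HCA (not executed here):
#   ibstat | count_active_ports
```

For the sample output above this would print 1, because only Port 2 is Active. A result of 0 on an IB enabled node means that no port is ready for use.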
ibnetdiscover
This command scans your IB network topology and reports all IB devices it discovers. You should see all your IB devices reported here, including both the switches and the HCAs in your nodes.
[root@compute-00-00 ~]# ibnetdiscover
#
# Topology file: generated on Wed May 6 15:14:42 2009
#
# Max of 2 hops discovered
# Initiated from node 0002c9030002847c port 0002c9030002847e
vendid=0x2c9
devid=0xa87c
switchguid=0xb8cffff0053ee(b8cffff0053ee)
Switch 8 "S-000b8cffff0053ee" # "MT43132 Mellanox Technologies" base port 0 lid 3 lmc 0
[1] "H-0002c902002789ac"[1](2c902002789ad) # "compute-00-01 HCA-1" lid 1 4xSDR
[2] "H-0002c9030002847c"[2](2c9030002847e) # "compute-00-00 HCA-1" lid 2 4xSDR
vendid=0x2c9
devid=0x5a44
sysimgguid=0x2c902002789af
caguid=0x2c902002789ac
Ca 2 "H-0002c902002789ac" # "compute-00-01 HCA-1"
[1](2c902002789ad) "S-000b8cffff0053ee"[1] # lid 1 lmc 0 "MT43132 Mellanox Technologies" lid 3 4xSDR
vendid=0x2c9
devid=0x6340
sysimgguid=0x2c9030002847f
caguid=0x2c9030002847c
Ca 2 "H-0002c9030002847c" # "compute-00-00 HCA-1"
[2](2c9030002847e) "S-000b8cffff0053ee"[2] # lid 2 lmc 0 "MT43132 Mellanox Technologies" lid 3 4xSDR
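On a larger fabric it is easier to compare device counts than to read the whole topology. The sketch below counts the switch and channel adapter records in the ibnetdiscover output so the totals can be compared with the number of switches and nodes you expect. The function name count_fabric_devices is hypothetical, and the sketch assumes that each device record starts with "Switch" or "Ca" at the beginning of a line, as in the sample output above.

```shell
#!/bin/sh
# count_fabric_devices: read ibnetdiscover output on stdin and print the
# number of switch and channel adapter (Ca) records it contains.
# Assumption: each record begins with "Switch" or "Ca" in column one.
# Lines such as "caguid=..." or "switchguid=..." do not match.
count_fabric_devices() {
    awk '
        /^Switch[ \t]/ { switches++ }
        /^Ca[ \t]/     { cas++ }
        END { printf "switches=%d cas=%d\n", switches, cas }
    '
}

# Usage (not executed here):
#   ibnetdiscover | count_fabric_devices
```

For the two-node sample above this would report one switch and two channel adapters. A lower count than expected points to a node or switch that is not visible on the fabric.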
ibchecknet
This command performs port, node, and error checks on your IB network. The summary should report no bad nodes and no bad ports. If you have ports with errors beyond threshold, use ibclearerrors (below) and run this command again after five minutes. If errors are still reported, you most likely have a problem in your IB fabric hardware, possibly a bad cable connection or a bad switch port.
[root@compute-00-00 ~]# ibchecknet
# Checking Ca: nodeguid 0x0002c902002789ac
# Checking Ca: nodeguid 0x0002c9030002847c
## Summary: 3 nodes checked, 0 bad nodes found
## 4 ports checked, 0 bad ports found
## 0 ports have errors beyond threshold
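When the check is run from a script, the summary lines can be evaluated automatically. This is a minimal sketch, not a supported tool: the function name summary_is_clean is hypothetical, and it assumes the "##" summary wording shown in this document ("N bad nodes found", "N bad ports found", "N ports have errors beyond threshold", "N ports with bad state found"). It reads the text of the summary rather than relying on the exit status of the check command.

```shell
#!/bin/sh
# summary_is_clean: read the output of ibchecknet, ibcheckstate, or
# ibcheckerrors on stdin. Return 0 if summary lines were seen and every
# bad/error count in them is zero; return 1 otherwise.
# Assumption: summary lines start with "##" and use the wording shown above.
summary_is_clean() {
    awk '
        /^##/ {
            seen = 1
            for (i = 1; i < NF; i++) {
                if ($i ~ /^[0-9]+$/ && $i > 0) {
                    # "<n> bad nodes found" or "<n> bad ports found"
                    if ($(i + 1) == "bad") bad = 1
                    # "<n> ports have errors ..." or "<n> ports with bad state ..."
                    if ($(i + 1) == "ports" && ($(i + 2) == "have" || $(i + 2) == "with")) bad = 1
                }
            }
        }
        END {
            if (seen && !bad) exit 0
            exit 1
        }
    '
}

# Usage (not executed here):
#   if ibchecknet | summary_is_clean; then echo "fabric OK"; else echo "fabric has problems"; fi
```

Counts such as "3 nodes checked" are ignored because they are not followed by one of the problem phrases. Output with no summary lines at all is treated as a failure.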
ibnodes
This command scans the IB network topology and displays all hosts and switches. The output is a more concise form of ibnetdiscover (above).
[root@compute-00-00 ~]# ibnodes
Ca : 0x0002c902002789ac ports 2 "compute-00-01 HCA-1"
Ca : 0x0002c9030002847c ports 2 "compute-00-00 HCA-1"
Switch : 0x000b8cffff0053ee ports 8 "MT43132 Mellanox Technologies" base port 0 lid 3 lmc 0
ibcheckstate
This command checks the port state and the physical port state of every port on your IB fabric. You should not see any ports with bad state. If you see any, restart your SM and check again. If the errors persist, contact Platform Support.
[root@compute-00-00 ~]# ibcheckstate
## Summary: 3 nodes checked, 0 bad nodes found
## 4 ports checked, 0 ports with bad state found
ibcheckerrors
This command performs an error check on your IB fabric. It is useful for finding ports with error counters beyond the indicated thresholds.
[root@compute-00-00 ~]# ibcheckerrors
## Summary: 3 nodes checked, 0 bad nodes found
## 4 ports checked, 0 ports have errors beyond threshold
ibclearerrors
This command clears all error counters on your IB fabric. Use it to determine whether errors are accumulating over time. After you clear the errors, run ibcheckerrors (above) after five minutes to see if the error counters have increased. If they have, this indicates a problem with your IB fabric.
[root@compute-00-00 ~]# ibclearerrors
## Summary: 3 nodes cleared 0 errors
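The clear, wait, and recheck sequence described above can be wrapped in a small helper. This is a minimal sketch under stated assumptions: the function name recheck_after_clear and its optional wait argument are hypothetical, and the default of 300 seconds simply encodes the five minute interval recommended above.

```shell
#!/bin/sh
# recheck_after_clear [wait_seconds]
# Clear the fabric error counters, wait, then run the error check again
# so that any errors reported have accumulated during the wait.
# Assumption: ibclearerrors and ibcheckerrors are on the PATH of the node
# this runs on. The default wait is 300 seconds (five minutes).
recheck_after_clear() {
    wait_seconds=${1:-300}
    ibclearerrors || return 1   # stop if the counters could not be cleared
    sleep "$wait_seconds"
    ibcheckerrors
}

# Usage (not executed here):
#   recheck_after_clear        # wait the default five minutes
#   recheck_after_clear 600    # wait ten minutes instead
```

If the second summary still reports ports with errors beyond threshold, the errors are new since the counters were cleared, which points to a hardware problem in the fabric.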
ibclearcounters
This command clears all port counters on your IB fabric. You should use this to determine if the counters are accumulating over time, as described above.
[root@compute-00-00 ~]# ibclearcounters
## Summary: 3 nodes cleared 0 errors
Document Information
Modified date: 16 September 2018
UID: isg3T1016169