Verifying RDMA configurations for connectivity issues on AIX®

RDMA connectivity issues are most commonly caused by misconfiguration. You can verify RDMA configurations to ensure that the members can communicate with the CF.

Before you begin

You must ensure that all of the required RDMA packages are installed on all hosts, and that all required software meets the minimum supported levels. You can check the installed packages and versions by running the db2prereqcheck command with the -hl (host list) and -nm (netname) options.

Procedure

Use the following steps to verify your RDMA configurations:

  1. Verify that the RoCE ports are functional on all hosts, and that the logical and physical port states are Active.
    To find the names of the port entries, run the lsdev | grep -i roce command. Ensure that the Logical Port State and Physical Port State are both Active. In the following example, the status of port 1 of HCA iba0 is displayed:
    -------------------------------------------------------------------------------
     IB PORT 1 INFORMATION (iba0)
    -------------------------------------------------------------------------------
    Global ID Prefix:                       fe.80.00.00.00.00.00.00
    Local ID (LID):                         0011
    Local Mask Control (LMC):               0000
    Logical Port State:                     Active
    Physical Port State:                    Active
    Physical Port Physical State:           Link Up
    Physical Port Speed:                    5.0G
    Physical Port Width:                    4X
    Maximum Transmission Unit Capacity:     2048
    Current Number of Partition Keys:       1
    Partition Key List:
      P_Key[0]:                             ffff
    Current Number of GUID's:               1
    Globally Unique ID List:
      GUID[0]:                              00.02.55.00.80.2d.dd.00
    The output of the lsdev | grep RoCE | grep ent command is as follows:
    Available 00-00-01 RoCE Converged Network Adapter.
    entstat -d ent1 | grep -i "Link Status"
    Physical Port Link Status: Up
    Logical Port Link Status: Up
  2. On the CF hosts, verify that the IP addresses associated with the RDMA or RoCE ports match the IP addresses used for the net names of the CF entry in the db2nodes.cfg file.
    1. View the IP address that is associated with the RDMA ports on the CF host.
      To view the IP address that is associated with the RDMA port, run the ifconfig -a command. The IP address can be found by looking at the address that is associated with the inet field as shown:
      coralpib23:/coralpib23/home/lpham> ifconfig -a
      ib0: flags=e3a0063<UP,BROADCAST,NOTRAILERS,RUNNING,ALLCAST,MULTICAST,LINK0,LINK1,GROUPRT,64BIT>
              inet 10.1.1.23 netmask 0xffffff00 broadcast 10.1.1.255
                  tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1
      In the output, ib0 is the interface name. The status is UP, and the IP address is 10.1.1.23. It is important to ensure that the interface status is up.
    2. Ensure that the network names for the CF in the db2nodes.cfg file resolve to the IP addresses of the RDMA ports that are intended for the CF.
      You must also ensure that each name can be pinged and is reachable from all hosts in the cluster.

      From each member host, run a ping command against each network name that is associated with the CF entry in the db2nodes.cfg file. Observe the IP address that is returned. It must match the IP address that is associated with the RDMA port configuration at the CF host, as shown in the ifconfig -a output.

      Note: Pings to an IP address on a different subnet are unsuccessful. This situation can occur when multiple interfaces are defined for the CF and the interfaces use different subnet masks. In this case, from the member, ping the target IP address on the CF host that has the same subnet mask as the interface on the member host.
  3. Ensure that the port value specified on the client connect request matches the port value that the CF listens on.
    You must ensure that the CF port values are the same in the /etc/services files for all hosts in the cluster.
    1. To determine the port value that is used for the CF, look in the CF diagnostic log file.
      In the cfdiag_<timestamp>.<id>.log file, look for the value that is associated with the CA Port[0] field as part of the prolog information at the beginning of the log file. In the following example, the port value for the CF is 37761.
      Cluster Accelerator Object Information
          AIX 64-bit
          Compiler: xlC VRM (900)
          SVN Revision:  4232013
          Build mode: PERFORMANCE
          Built on Apr 23 2013 at 12:52:24
          Executable generated with symbols.
          Executable generated without asserts.
          Model Components Loaded: CACHE, LIST, LOCK
          Transport: uDAPL
          Number of HCAs: 1
          Device[0]: hca2
          CA Port[0]: 37761
          Total Workers: 1
          Conn/Worker: 128
          Notify conns: 1024
          Processor Speed: 5000.000000 Mhz.
          Allocatable Structure memory: 494 MB
    2. To determine the port value that is used by the member on the connect request, look for the PsOpen event in the Db2® member diagnostic log (db2diag.log) file.
      Look for the value of the caport field. In the following example, the port value for the target CF is also 37761.
      2013-04-29-16.00.56.371442-240 I80540A583           LEVEL: Event
      PID     : 10354874             TID : 772            PROC : db2sysc 0
      INSTANCE: lpham                NODE : 000
      HOSTNAME: coralpib23
      EDUID   : 772                  EDUNAME: db2castructevent 0
      FUNCTION: Db2, Shared Data Structure Abstraction Layer for CF, SQLE_SINGLE_CA_HANDLE::sqleSingleCfOpenAndConnect, probe:1264
      DATA #1 : <preformatted>
      PsOpen SUCCESS: hostname:coralpib23-ib0 (member#: 128, cfIndex: 1) ; device:hca2 ; caport:37761 ; transport: UDAPL
      Connection pool target size = 9 conn (seq #: 3 node #: 1)
  4. Perform an RDMA ping across the cluster by running the following command:
    db2cluster -verify -req -rdma_ping
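
The link status check in step 1 can be scripted when there are many adapters or hosts to inspect. The following is a minimal sketch, not a supported tool: the function name link_status_ok is hypothetical, and it assumes that the entstat output contains "Link Status" lines in the same "name: value" form as the example in step 1.

```shell
# Reads "entstat -d <adapter>" output on standard input.
# Succeeds (exit status 0) only if at least one "Link Status" line is present
# and every such line reports "Up". The line format is an assumption taken
# from the example output in step 1.
link_status_ok() {
    awk -F': *' '
        /Link Status/ { seen++; if ($2 !~ /^Up[ \t]*$/) bad++ }
        END { exit ((seen > 0 && bad == 0) ? 0 : 1) }
    '
}
```

For example, entstat -d ent1 | link_status_ok && echo "ent1 link is up" prints a message only when both the physical and logical link status lines report Up. The adapter name ent1 is taken from the example in step 1 and differs from host to host.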
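
The address lookup in step 2 can also be scripted. The following sketch extracts the inet address of one interface from ifconfig -a output so that it can be compared with the address that a ping of the CF net name returns. The function name iface_inet_addr is hypothetical, and the parsing assumes the layout shown in the step 2 example: an unindented "name: flags=..." header line followed by indented inet lines.

```shell
# Reads "ifconfig -a" output on standard input and prints the first IPv4
# address of the interface named in $1. Prints nothing if the interface is
# not found. The output layout is an assumption based on the step 2 example.
iface_inet_addr() {
    awk -v ifname="$1" '
        # Unindented header line, for example "ib0: flags=...": remember the name.
        /^[^ \t]/ { current = $1; sub(/:$/, "", current) }
        # Indented "inet <address> ..." line that belongs to the wanted interface.
        $1 == "inet" && current == ifname { print $2; exit }
    '
}
```

For example, on the CF host, ifconfig -a | iface_inet_addr ib0 would print 10.1.1.23 for the output shown in step 2. That value can then be compared with the address reported when a member host pings the CF net name.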
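
The port comparison in step 3 can be sketched in the same way. The function names below are hypothetical. The patterns assume the "CA Port[0]: <port>" and "caport:<port>" forms shown in the step 3 examples, and ports_match assumes that all PsOpen events in the member log target a CF that listens on the same port.

```shell
# Prints the first "CA Port[0]" value found in a CF diagnostic log on stdin.
cf_listen_port() {
    sed -n 's/^.*CA Port\[0]:[[:space:]]*\([0-9][0-9]*\).*$/\1/p' | head -n 1
}

# Prints the "caport" value of every PsOpen event in a db2diag.log on stdin.
member_caports() {
    sed -n '/PsOpen/ s/^.*caport:\([0-9][0-9]*\).*$/\1/p'
}

# $1 = CF diagnostic log file, $2 = member db2diag.log file.
# Succeeds only if the CF port is found and it is the only distinct caport
# value in the member log. Fails if the member log has no PsOpen events.
ports_match() {
    cf_port=$(cf_listen_port < "$1")
    [ -n "$cf_port" ] || return 1
    member_ports=$(member_caports < "$2" | sort -u)
    [ "$member_ports" = "$cf_port" ]
}
```

For the example logs in step 3, cf_listen_port and member_caports both print 37761, so ports_match succeeds. A failure only indicates that the two logs need a closer manual look. It does not replace the check of the /etc/services files on each host.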