Verifying RDMA configurations for connectivity issues on AIX®
RDMA connectivity issues are most commonly caused by misconfiguration. You can verify RDMA configurations to ensure that the members can communicate with the CF.
Before you begin
Run the lslpp -l udapl.rte command to confirm that the udapl.rte fileset is installed.
Procedure
Use the following steps to verify your RDMA configurations:
- Verify that the InfiniBand (IB) ports are functional on all hosts, and that the physical port states are ACTIVE. To verify, issue the ibstat -v command. Ensure that the Logical Port State and Physical Port State are Active. You must also ensure that the Physical Port Physical State is Link Up. In the following example, the status of port 1 of HCA iba0 is displayed.

    -------------------------------------------------------------------------------
     IB PORT 1 INFORMATION (iba0)
    -------------------------------------------------------------------------------
    Global ID Prefix:                       fe.80.00.00.00.00.00.00
    Local ID (LID):                         0011
    Local Mask Control (LMC):               0000
    Logical Port State:                     Active
    Physical Port State:                    Active
    Physical Port Physical State:           Link Up
    Physical Port Speed:                    5.0G
    Physical Port Width:                    4X
    Maximum Transmission Unit Capacity:     2048
    Current Number of Partition Keys:       1
    Partition Key List:
      P_Key[0]:                             ffff
    Current Number of GUID's:               1
    Globally Unique ID List:
      GUID[0]:                              00.02.55.00.80.2d.dd.00

- On the CF hosts, verify that the IP address associated with the IB or RoCE ports matches the IP addresses used for the net names for the CF entry in the db2nodes.cfg file.
  - View the IP address that is associated with the IB ports on the CF host by running the ifconfig -a command. The IP address is the value of the inet field, as shown:

        coralpib23:/coralpib23/home/lpham> ifconfig -a
        ib0: flags=e3a0063<UP,BROADCAST,NOTRAILERS,RUNNING,ALLCAST,MULTICAST,LINK0,LINK1,GROUPRT,64BIT>
                inet 10.1.1.23 netmask 0xffffff00 broadcast 10.1.1.255
                tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1

    In the output, ib0 is the interface name. The status is UP, and the IP address is 10.1.1.23. It is important to ensure that the interface status is UP.
  - Ensure that the network names for the CF in the db2nodes.cfg file match the IP addresses of the IB port that is intended for the CF. You must also ensure that each name can be pinged and is reachable from all hosts in the cluster.
    From each member host, run a ping command against the network names that are associated with the CF entry in the db2nodes.cfg file. Observe the IP address returned. The IP address must match the IP address that is associated with the IB port configuration at the CF host, as shown in the ifconfig -a output.
    Note: Pings to an IP address on a different subnet are unsuccessful. This can occur when multiple interfaces are defined for the CF and the interfaces use different subnet masks. In this case, from the member, ping the target IP address on the CF host that has the same subnet mask as the interface on the member host.
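The interface check in this step can be sketched in Python as follows. This is a minimal illustration, not part of the product: the helper names parse_ifconfig and interface_ok are hypothetical, and the sample text is the ifconfig -a output shown in this step. It assumes that each interface header has the form name: flags=...<...> and that the IPv4 address follows the inet keyword.

```python
import re

# Sample taken from the ifconfig -a output shown in this step.
SAMPLE_IFCONFIG = """\
coralpib23:/coralpib23/home/lpham> ifconfig -a
ib0: flags=e3a0063<UP,BROADCAST,NOTRAILERS,RUNNING,ALLCAST,MULTICAST,LINK0,LINK1,GROUPRT,64BIT>
        inet 10.1.1.23 netmask 0xffffff00 broadcast 10.1.1.255
        tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1
"""


def parse_ifconfig(output):
    """Map each interface name to its flag set and list of IPv4 addresses."""
    interfaces = {}
    current = None
    for line in output.splitlines():
        header = re.match(r"^(\w+): flags=\w+<([^>]*)>", line)
        if header:
            current = header.group(1)
            interfaces[current] = {
                "flags": set(header.group(2).split(",")),
                "inet": [],
            }
            continue
        # "inet " (with a trailing space) deliberately skips inet6 lines.
        inet = re.match(r"^\s*inet (\d+\.\d+\.\d+\.\d+)", line)
        if inet and current is not None:
            interfaces[current]["inet"].append(inet.group(1))
    return interfaces


def interface_ok(interfaces, name, expected_ip):
    """Return True if the interface exists, is UP, and carries expected_ip."""
    info = interfaces.get(name)
    return bool(info) and "UP" in info["flags"] and expected_ip in info["inet"]


if __name__ == "__main__":
    parsed = parse_ifconfig(SAMPLE_IFCONFIG)
    print(interface_ok(parsed, "ib0", "10.1.1.23"))
```

On a member host, the expected address could instead be obtained by resolving the CF net name, for example with socket.gethostbyname, and then compared with the address reported on the CF host.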
- Verify that the uDAPL interface is configured in the /etc/dat.conf file on all hosts, and that the right adapter port value is used. Because Db2® pureScale® uses uDAPL 2.0, look for the first entry that has u2.0 in the second column with the matching interface name and port number. The following entry might look similar to the entry in your /etc/dat.conf file:

    hca2 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/iba0 1 ib0" " "

  In the example, hca2 is the unique transport device name for the uDAPL interface. The u2.0 indicates that the entry is for a uDAPL 2.0 application. You must ensure that the /usr/lib/libdapl/libdapl2.a file exists because it is the uDAPL shared library. The /dev/iba0 1 ib0 string is the uDAPL provider-specific instance data. In this case, the adapter is iba0, the port is 1, and the interface name is ib0.
  If the CF is configured with multiple interfaces by using multiple netnames in the db2nodes.cfg file, you must ensure that all the interfaces are defined in the dat.conf file.
  Note: The /etc/dat.conf file must contain entries only for the adapters that are in the local host. The sample /etc/dat.conf file that is installed by default typically contains irrelevant entries. To avoid unnecessary processing of the file, make the following changes:
  - Move all the Db2 pureScale cluster-related adapter entries to the top of the file.
  - Comment out the irrelevant entries or remove them from the file.
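The lookup described in this step can be sketched in Python as follows. This is a hedged illustration only: the function name find_udapl2_entry is hypothetical, the field positions are inferred from the single example entry shown in this step, and the commented-out entry in the sample is invented purely to show that commented lines are skipped.

```python
import shlex

SAMPLE_DAT_CONF = '''\
# hypothetical commented-out entry, for illustration only
# hca9 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/iba9 1 ib9" " "
hca2 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/iba0 1 ib0" " "
'''


def find_udapl2_entry(dat_conf_text, adapter, port, interface):
    """Return the first uDAPL 2.0 entry for the given adapter, port, and interface.

    Assumed layout (from the example entry): field 0 is the device name,
    field 1 is the uDAPL version, field 4 is the library path, and field 6
    is the quoted instance data "/dev/<adapter> <port> <interface>".
    """
    wanted = [f"/dev/{adapter}", str(port), interface]
    for line in dat_conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out entries
        fields = shlex.split(line)  # keeps the quoted instance data together
        if len(fields) < 7 or fields[1] != "u2.0":
            continue
        if fields[6].split() == wanted:
            return {"device": fields[0], "library": fields[4], "instance": wanted}
    return None


if __name__ == "__main__":
    print(find_udapl2_entry(SAMPLE_DAT_CONF, "iba0", 1, "ib0"))
```

A result of None for an interface that the CF uses would suggest that the corresponding entry is missing from, or commented out in, the dat.conf file.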
- Ensure that the port value specified on the client connect request matches the port value that the CF listens on. You must ensure that the CF port values are the same in the /etc/services files for all hosts in the cluster.
  - To determine the port value that is used for the CF, look in the CF diagnostic log file. In the cfdiag_<timestamp>.<id>.log file, look for the value that is associated with the CA Port[0] field as part of the prolog information at the beginning of the log file. In the following example, the port value for the CF is 37761.

        Cluster Accelerator Object Information
        AIX 64-bit
        Compiler: xlC VRM (900)
        SVN Revision: 4232013
        Build mode: PERFORMANCE
        Built on Apr 23 2013 at 12:52:24
        Executable generated with symbols.
        Executable generated without asserts.
        Model Components Loaded: CACHE, LIST, LOCK
        Transport: uDAPL
        Number of HCAs: 1
        Device[0]: hca2
        CA Port[0]: 37761
        Total Workers: 1
        Conn/Worker: 128
        Notify conns: 1024
        Processor Speed: 5000.000000 Mhz.
        Allocatable Structure memory: 494 MB

  - To determine the port value that is used by the member on the connect request, look for the PsOpen event in the Db2 member diagnostic log (db2diag.log) file. Look for the value of the caport field. In the following example, the port value for the target CF is also 37761.

        2013-04-29-16.00.56.371442-240 I80540A583           LEVEL: Event
        PID     : 10354874             TID : 772            PROC : db2sysc 0
        INSTANCE: lpham                NODE : 000
        HOSTNAME: coralpib23
        EDUID   : 772                  EDUNAME: db2castructevent 0
        FUNCTION: Db2, Shared Data Structure Abstraction Layer for CF, SQLE_SINGLE_CA_HANDLE::sqleSingleCfOpenAndConnect, probe:1264
        DATA #1 : <preformatted>
        PsOpen SUCCESS: hostname:coralpib23-ib0 (member#: 128, cfIndex: 1) ; device:hca2 ; caport:37761 ; transport: UDAPL
        Connection pool target size = 9 conn (seq #: 3 node #: 1)
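The port comparison in this step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the sample text is abbreviated from the two log excerpts in this step, and the patterns assume that the CF log uses the CA Port[n]: label and that the member log records caport: on the PsOpen line.

```python
import re

# Abbreviated from the cfdiag and db2diag.log excerpts shown in this step.
SAMPLE_CFDIAG = """\
Transport: uDAPL
Number of HCAs: 1
Device[0]: hca2
CA Port[0]: 37761
Total Workers: 1
"""

SAMPLE_DB2DIAG = """\
DATA #1 : <preformatted>
PsOpen SUCCESS: hostname:coralpib23-ib0 (member#: 128, cfIndex: 1) ; device:hca2 ; caport:37761 ; transport: UDAPL
Connection pool target size = 9 conn (seq #: 3 node #: 1)
"""


def cf_listen_ports(cfdiag_text):
    """Collect every CA Port[n] value from the CF diagnostic log prolog."""
    return [int(p) for p in re.findall(r"CA Port\[\d+\]:\s*(\d+)", cfdiag_text)]


def member_connect_ports(db2diag_text):
    """Collect the caport value from every PsOpen line in the member log."""
    ports = []
    for line in db2diag_text.splitlines():
        if "PsOpen" in line:
            ports += [int(p) for p in re.findall(r"caport:\s*(\d+)", line)]
    return ports


def ports_match(cfdiag_text, db2diag_text):
    """Return True if the member used at least one port and all are CF ports."""
    listen = set(cf_listen_ports(cfdiag_text))
    used = member_connect_ports(db2diag_text)
    return bool(used) and all(port in listen for port in used)


if __name__ == "__main__":
    print(ports_match(SAMPLE_CFDIAG, SAMPLE_DB2DIAG))
```

A False result would suggest comparing the CF port entries in the /etc/services files across the hosts.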
- Perform an RDMA ping across the cluster by running the following command:

    db2cluster -verify -req -rdma_ping