Configuring the network settings of hosts in a Db2 pureScale environment on an RoCE network without IP support (AIX)

A remote direct memory access (RDMA) over Converged Ethernet (RoCE) network without IP support is characterized by an RoCE special device file, and the absence of a network interface, on hosts that can transmit and receive only RDMA data. To configure the network settings, you must install the required uDAPL software, configure the InfiniBand Communication Manager (ICM), associate interconnect netnames with pseudo IP addresses, and add the required entries to the Direct Access Transport (DAT) configuration file.

Before you begin

The steps in this topic configure the network settings of hosts on an RoCE network whose network interface cards do not have IP support. This topic is specific to configurations with these adapters: EC26, EC27, EC28, EC29, EC30. If you are configuring the network settings of hosts on an RoCE network with IP support, see the topic Configuring network settings on an RoCE network with IP support.

Ensure that you complete the following tasks:

About this task

You must perform these steps on each host, or LPAR, you want to participate in the Db2 pureScale instance.

Cluster caching facilities (CFs) and members support multiple communication adapter ports to help Db2 pureScale environments scale and to help with high availability.

Only one communication adapter port is required for each CF or member, but using more adapter ports is recommended to increase bandwidth, add redundancy, and allow the use of multiple switches. This topic guides you through the installation and setup of User Direct Access Programming Library (uDAPL) on AIX® hosts and the configuration of IP addresses.

Procedure

  1. Log in as root.
  2. Ensure that all AIX fixes listed in the installation prerequisites are installed.
  3. If the /etc/dat.conf file was previously set up with the desired values, save a copy of the existing file.
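    Saving the copy, and the later comparison against it, can be sketched with two small shell helpers. The function names and the .saved suffix are hypothetical and are not part of any Db2 or AIX tooling.

```shell
#!/bin/sh
# Sketch only: helper names and the ".saved" suffix are assumptions.

# Copy a file to "<file>.saved", preserving mode and timestamps.
# Succeeds without doing anything if the file does not exist.
save_copy() {
    src=$1
    [ -f "$src" ] || return 0
    cp -p "$src" "$src.saved"
}

# Succeed only if the file is byte-for-byte identical to its saved copy.
same_as_saved() {
    cmp -s "$1" "$1.saved"
}
```

    As root, save_copy /etc/dat.conf would create the copy now. After the file sets are installed, same_as_saved /etc/dat.conf reports whether the installed file still matches the saved copy.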
  4. Verify that your system has the correct uDAPL and RoCE network file sets.
    To verify uDAPL is installed correctly, run the following command, shown with sample output:
    $  lslpp -l bos.mp64 devices.chrp.IBM.lhca.rte devices.common.IBM.ib.rte devices.pciex.b3154a63.rte devices.pciex.b315506714101604.rte udapl.rte
    
      Fileset                      Level  State      Description
      ----------------------------------------------------------------------------
    Path: /usr/lib/objrepos
      bos.mp64                   7.1.5.32  APPLIED    Base Operating System 64-bit
                                                      Multiprocessor Runtime
      devices.chrp.IBM.lhca.rte  7.1.5.30  APPLIED    Infiniband Logical HCA Runtime
                                                      Environment
      devices.common.IBM.ib.rte  7.1.5.30  APPLIED    Infiniband Common Runtime
                                                      Environment
      devices.pciex.b3154a63.rte
                                 7.1.5.30  APPLIED    4X PCI-E DDR Infiniband Device
                                                      Driver
      devices.pciex.b315506714101604.rte
                                 7.1.4.30  COMMITTED  RoCE Host Bus Adapter 
                                                      (b315506714101604)
      udapl.rte                  7.1.5.0   APPLIED    uDAPL
    
    Path: /etc/objrepos
      bos.mp64                   7.1.5.32  APPLIED    Base Operating System 64-bit
                                                      Multiprocessor Runtime
      devices.chrp.IBM.lhca.rte  7.1.4.30  COMMITTED  Infiniband Logical HCA Runtime
                                                      Environment
      devices.common.IBM.ib.rte  7.1.5.30  APPLIED    Infiniband Common Runtime
                                                      Environment
      devices.pciex.b3154a63.rte
                                 7.1.5.30  APPLIED    4X PCI-E DDR Infiniband Device
                                                      Driver
      devices.pciex.b315506714101604.rte 
                                 7.1.4.30  COMMITTED  RoCE Host Bus Adapter
                                                      (b315506714101604)
      udapl.rte                  7.1.5.0   APPLIED    uDAPL
    The command output varies depending on version, technology level, and service pack level.
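    Scanning the listing by eye is error-prone because long file set names wrap onto their own line. The following sketch reads lslpp -l output on standard input and prints any required file set that does not appear in it. The function name is hypothetical, and the sketch assumes the listing layout matches the sample output above.

```shell
#!/bin/sh
# Sketch only: assumes each file set name is the first field on its line,
# as in the sample lslpp -l output.

# Print each required file set (given as arguments) that is absent from
# the lslpp -l listing read on stdin. Returns non-zero if any is missing.
missing_filesets() {
    listing=$(cat)
    rc=0
    for name in "$@"; do
        if ! printf '%s\n' "$listing" |
            awk -v name="$name" '$1 == name { found = 1 } END { exit !found }'
        then
            printf '%s\n' "$name"
            rc=1
        fi
    done
    return $rc
}
```

    For example, the output of the lslpp -l command shown in this step could be piped into missing_filesets bos.mp64 udapl.rte devices.pciex.b315506714101604.rte. No output means that every named file set appears in the listing.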
  5. If any of the filesets in the previous step were newly installed or updated, reboot the system by running the following command:
     shutdown -Fr
  6. Configure the RoCE subsystem and set IP addresses:
    1. Configure the RoCE network subsystem in this substep only if an RoCE network was never set up before on the host. Run the smitty icm command:
      1. Select Add an InfiniBand Communication Manager
      2. Press Enter and wait for the command to complete
      3. Press Esc+0 to exit
      For example,
      Infiniband Communication Manager Device Name        icm
      Minimum Request Retries                            [1]
      Maximum Request Retries                            [7]
      Minimum Response Time (msec)                       [100]
      Maximum Response Time (msec)                       [4300]
      Maximum Number of HCA's                            [256]
      Maximum Number of Users                            [65000]
      Maximum Number of Work Requests                    [65000]
      Maximum Number of Service ID's                     [1000]
      Maximum Number of Connections                      [65000]
      Maximum Number of Records Per Request              [64]
      Maximum Queued Exception Notifications Per User    [1000]
      Number of MAD buffers per HCA                      [64]
  7. Reboot the systems by running the following command on each host:
     shutdown -Fr
  8. Associate each interconnect netname for a member or CF that will be selected during installation with an IPv4 pseudo IP address in /etc/hosts. Each interconnect netname is associated with an RoCE communication adapter port through the Direct Access Transport (DAT) configuration file in the next step. The pseudo IP address is used only for resolving the netname and for uDAPL purposes; it is not pingable. Each pseudo IP address must be unique.
    Update the /etc/hosts file on each of the hosts so that for each host in the planned Db2 pureScale environment, the file includes all the pseudo IP addresses of interconnect netnames in the planned environment. The /etc/hosts file must have this format: <IP_Address> <fully_qualified_name> <short_name>. All hosts in the cluster must have the same /etc/hosts format. For example, in a planned Db2 pureScale environment with multiple communication adapter ports on the CFs and four members, the /etc/hosts configuration file might resemble the following file:
    10.222.1.1       cf1-en1.example.com cf1-en1
    10.222.2.1       cf1-en2.example.com cf1-en2
    10.222.3.1       cf1-en3.example.com cf1-en3
    10.222.4.1       cf1-en4.example.com cf1-en4
    
    10.222.1.2       cf2-en1.example.com cf2-en1
    10.222.2.2       cf2-en2.example.com cf2-en2
    10.222.3.2       cf2-en3.example.com cf2-en3
    10.222.4.2       cf2-en4.example.com cf2-en4
    
    10.222.1.101     member1-en1.example.com member1-en1
    10.222.2.101     member1-en2.example.com member1-en2
    10.222.1.102     member2-en1.example.com member2-en1
    10.222.2.102     member2-en2.example.com member2-en2
    
    10.222.1.103     member3-en1.example.com member3-en1
    10.222.2.103     member3-en2.example.com member3-en2
    10.222.1.104     member4-en1.example.com member4-en1
    10.222.2.104     member4-en2.example.com member4-en2
    
    Note: The pseudo IP addresses of each netname for the CF and member must have a different third octet. All pseudo IP addresses of members must have the same third octet, which is the same as the third octet for the pseudo IP address associated with the first communication adapter port of each of the CFs and members. In the previous example, the third octet is 1.
    None of the host names in the previous example is associated with a regular Ethernet adapter. These host names are set up only for resolving the netnames and for uDAPL purposes. They are not pingable.
    In a four member environment that uses only one communication adapter port for each CF and member, the file would look similar to the previous example, but contain only the first pseudo IP address of each of the CFs in the previous example. Here is an example of this:
    10.222.1.1       cf1-en1.example.com cf1-en1
    
    10.222.1.2       cf2-en1.example.com cf2-en1
    
    10.222.1.101     member1-en1.example.com member1-en1
    10.222.1.102     member2-en1.example.com member2-en1
    10.222.1.103     member3-en1.example.com member3-en1
    10.222.1.104     member4-en1.example.com member4-en1
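    The format and uniqueness requirements for the pseudo IP entries can be checked mechanically. The following sketch reads a hosts file on standard input and examines only the entries whose address begins with a given prefix. The function name and the prefix argument are hypothetical. The check that the fully qualified name begins with the short name is an assumption drawn from the examples above, not a documented requirement.

```shell
#!/bin/sh
# Sketch only: the function name and prefix argument are assumptions.

# Check pseudo IP entries in a hosts file read from stdin.
# $1 is an address prefix (for example "10.222.") that selects the entries.
# Prints one line per problem; returns non-zero if any problem was found.
check_pseudo_hosts() {
    awk -v prefix="$1" '
        BEGIN { bad = 0 }
        /^[ \t]*#/ || NF == 0 { next }      # skip comments and blank lines
        index($1, prefix) != 1 { next }     # not a pseudo IP entry
        NF != 3 {
            printf "line %d: expected <IP_Address> <fully_qualified_name> <short_name>\n", NR
            bad = 1
            next
        }
        # Assumption: the fully qualified name starts with "<short_name>."
        index($2, ($3 ".")) != 1 {
            printf "line %d: %s does not begin with short name %s\n", NR, $2, $3
            bad = 1
        }
        # Each pseudo IP address must be unique.
        seen[$1]++ {
            printf "line %d: duplicate pseudo IP address %s\n", NR, $1
            bad = 1
        }
        END { exit bad }
    '
}
```

    With the addresses used in this step, the check could be run as check_pseudo_hosts 10.222. < /etc/hosts on each host. The line numbers in the messages refer to lines of the hosts file.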
  9. If the Direct Access Transport (DAT) configuration file /etc/dat.conf was previously saved, verify that its contents are still equivalent to the saved copy. If they are not, replace the current dat.conf with the saved copy. If the dat.conf file was not previously set up, edit the dat.conf file on each host to add a line that associates each interconnect netname with a uDAPL device and an RoCE adapter port.
    The /etc/dat.conf file must contain only entries for the adapters that are in the local host. The sample /etc/dat.conf file that is installed by default typically contains irrelevant entries. To avoid unnecessary processing of the file, make the following changes:
    • Move all the Db2 pureScale cluster-related adapter entries to the top of the file.
    • Comment out the irrelevant entries or remove them from the file.
    The following is an example:
    <interface adapter name> u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce0 1 hostname-en1" " "
    • The <interface adapter name> string cannot be more than 19 characters long.
    • The name within quotes ("/dev/roce0 1 hostname-en1") is the platform-specific string. This string consists of:
      • Adapter special file ( /dev/roce0 )
      • port number ( 1 or 2 )
      • The interconnect netname for the member or CF that will run on this host.
    The following format is also supported:
    hca0 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce0 1 10.10.11.131" " "
    Where 10.10.11.131 is the pseudo IP address corresponding to the netname.
    Note: If you are receiving a communication error between the member and CF, it is likely that the system attempted to communicate with an adapter interface that is not set up correctly in the Direct Access Transport (DAT) configuration file for the adapter port.
    In the case of a CF or member that uses two communication adapters, each communication adapter having 2 ports, the /etc/dat.conf would resemble the following example:
    hca0 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce0 1 cf1-en1" " "
    hca1 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce0 2 cf1-en2" " "
    hca2 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce1 1 cf1-en3" " "
    hca3 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce1 2 cf1-en4" " "
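    The two rules that are easiest to get wrong in a dat.conf entry are the 19-character limit on the interface adapter name and the three-part platform-specific string. The following sketch checks both for every uncommented line read on standard input. The function name is hypothetical, and the sketch assumes that the platform-specific string is the first double-quoted field on the line, as in the examples above.

```shell
#!/bin/sh
# Sketch only: the function name is an assumption; the rules checked are
# the name length limit and the platform-specific string layout.

# Check dat.conf entries read from stdin.
# Prints one line per problem; returns non-zero if any problem was found.
check_dat_conf() {
    awk '
        BEGIN { bad = 0 }
        /^[ \t]*#/ || NF == 0 { next }      # skip comments and blank lines
        {
            if (length($1) > 19) {
                printf "line %d: interface adapter name %s is longer than 19 characters\n", NR, $1
                bad = 1
            }
            # The platform-specific string is the first double-quoted field.
            nq = split($0, q, "\"")
            n = (nq >= 3) ? split(q[2], part, " ") : 0
            if (n != 3 || part[1] !~ /^\/dev\// || part[2] !~ /^[12]$/) {
                printf "line %d: expected \"<special file> <port 1 or 2> <netname>\"\n", NR
                bad = 1
            }
        }
        END { exit bad }
    '
}
```

    Running check_dat_conf < /etc/dat.conf on each host prints nothing when every active entry passes both checks. The sketch does not verify that the netname resolves to a pseudo IP address in /etc/hosts.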
  10. Verify the RoCE network subsystem. Verify the RoCE network components are in the Available State:
    For example, the system output of the following command, run on a host, verifies that all devices are available:
     # lsdev -C | grep -E "Infiniband|PCIE RDMA"
    icm        Available             Infiniband Communication Manager
    roce0      Available 02-00       PCIE RDMA over Converged Ethernet RoCE Adapter 
                                     (b315506714101604)
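    The same check can be scripted. The following sketch reads lsdev -C output on standard input and prints the name of any InfiniBand or RoCE device whose state is not Available. The function name is hypothetical, and the sketch assumes the column layout of the sample output above.

```shell
#!/bin/sh
# Sketch only: assumes the device name is the first field and the state
# is the second field, as in the sample lsdev -C output.

# Print the name of each matching device that is not in the Available state.
# Returns non-zero if any such device exists or if no matching device is found.
devices_not_available() {
    awk '
        /Infiniband|PCIE RDMA/ {
            seen++
            if ($2 != "Available") {
                print $1
                bad = 1
            }
        }
        END { exit ((bad || !seen) ? 1 : 0) }
    '
}
```

    For example, lsdev -C | devices_not_available prints nothing and succeeds when icm and every RoCE adapter are Available.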
    To check the state, use the ibstat -v command. Verify that the ports are active and the links are up. This check applies only for the port and interface that were previously identified in /etc/dat.conf (by default port 1 on roce0):
    -------------------------------------------------------------------------------
    ETHERNET PORT 1 INFORMATION (roce0)
    -------------------------------------------------------------------------------
     Link State: UP
     Link Speed: 10G XFI
     Link MTU: 9600
     Hardware Address: 00:02:c9:4b:97:b8
     GIDS (up to 3 GIDs):
     GID0 :00:00:00:00:00:00:00:00:00:00:00:02:c9:4b:97:b8
     GID1 :00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
     GID2 :00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
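    When a host has several adapters and ports, the link state of one specific port can be extracted from the ibstat -v output. The following sketch succeeds only if the named port on the named adapter reports a link state of UP. The function name is hypothetical, and the sketch assumes that every port section starts with a header in the form shown in the sample output above.

```shell
#!/bin/sh
# Sketch only: assumes section headers of the form
# "ETHERNET PORT <n> INFORMATION (<adapter>)" as in the sample output.

# Read ibstat -v output on stdin.
# $1 is the adapter (for example roce0), $2 is the port number.
# Succeeds only if that port reports "Link State: UP".
port_link_up() {
    awk -v dev="$1" -v port="$2" '
        BEGIN { up = 0; in_port = 0 }
        /PORT [0-9]+ INFORMATION/ {
            # Track whether this section describes the requested port.
            in_port = (($0 ~ ("PORT " port " INFORMATION")) && (index($0, "(" dev ")") > 0))
        }
        in_port && /Link State:/ {
            sub(/^.*Link State:[ \t]*/, "")
            sub(/[ \t\r]+$/, "")
            if ($0 == "UP") up = 1
        }
        END { exit !up }
    '
}
```

    For example, ibstat -v | port_link_up roce0 1 checks the default port that is named in /etc/dat.conf. The check should be repeated for every adapter and port pair listed in that file.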
  11. Ensure that Global Pause (IEEE 802.3x) is enabled on the switches that are connected to the adapters. For details, see Switch configuration on an RoCE network (AIX).