A remote direct memory access (RDMA) over Converged Ethernet (RoCE) network without IP support is
characterized by an RoCE special device file (and the absence of a network interface) on hosts that
can transmit and receive only RDMA data. To configure the network settings, you must install the
required uDAPL software, configure ICM, associate interconnect netnames with pseudo IP addresses,
and add the required entries to the Direct Access Transport (DAT) configuration file.
Before you begin
The steps in this topic configure the network settings of hosts on an RoCE network that does not
have network interface card IP support. This topic is specific to configurations with these
adapters: EC26, EC27, EC28, EC29, EC30. If you are configuring the network settings of hosts on an
RoCE network with IP support, see the topic Configuring network settings on an RoCE network with IP support.
Ensure that you complete the following
tasks:
About this task
You must perform these steps on each host, or LPAR, that you want to participate in the Db2 pureScale instance.
Cluster caching facilities (CFs) and members support multiple communication adapter ports to help Db2 pureScale environments scale and to help with high availability.
Only one communication adapter port for each CF or member is required, but using more adapter ports is recommended to increase bandwidth, add redundancy, and allow the use of multiple switches.
This topic guides you through installing and setting up User Direct Access Programming Library (uDAPL) on AIX® hosts and configuring IP addresses.
Procedure
- Log in as root.
- Ensure that any AIX fixes listed in the installation prerequisites are installed at this time.
- If the /etc/dat.conf file was previously set up with the desired values, save the existing copy of dat.conf.
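The save-and-compare workflow can be sketched as follows. This is a minimal sketch only: the helper names and the .saved suffix are illustrative assumptions, not part of the documented procedure. The second helper is meant for the later step that asks you to verify that the contents are still equivalent.

```shell
# Illustrative helpers; the names and the ".saved" suffix are assumptions.

# Save a copy of a file next to the original, preserving mode and timestamps.
save_copy() {
    cp -p "$1" "$1.saved"
}

# Succeed (exit 0) only if the file still matches its saved copy.
matches_saved_copy() {
    cmp -s "$1" "$1.saved"
}

# Example (run as root on the host):
#   save_copy /etc/dat.conf
#   matches_saved_copy /etc/dat.conf || cp -p /etc/dat.conf.saved /etc/dat.conf
```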
- Verify that your system has the correct uDAPL and RoCE network file sets.
To verify that uDAPL is installed correctly, run the following command, shown with sample output:
$ lslpp -l bos.mp64 devices.chrp.IBM.lhca.rte devices.common.IBM.ib.rte devices.pciex.b3154a63.rte devices.pciex.b315506714101604.rte udapl.rte
  Fileset                    Level     State      Description
  ----------------------------------------------------------------------------
Path: /usr/lib/objrepos
  bos.mp64                   7.1.5.32  APPLIED    Base Operating System 64-bit
                                                  Multiprocessor Runtime
  devices.chrp.IBM.lhca.rte  7.1.5.30  APPLIED    Infiniband Logical HCA Runtime
                                                  Environment
  devices.common.IBM.ib.rte  7.1.5.30  APPLIED    Infiniband Common Runtime
                                                  Environment
  devices.pciex.b3154a63.rte
                             7.1.5.30  APPLIED    4X PCI-E DDR Infiniband Device
                                                  Driver
  devices.pciex.b315506714101604.rte
                             7.1.4.30  COMMITTED  RoCE Host Bus Adapter
                                                  (b315506714101604)
  udapl.rte                  7.1.5.0   APPLIED    uDAPL
Path: /etc/objrepos
  bos.mp64                   7.1.5.32  APPLIED    Base Operating System 64-bit
                                                  Multiprocessor Runtime
  devices.chrp.IBM.lhca.rte  7.1.4.30  COMMITTED  Infiniband Logical HCA Runtime
                                                  Environment
  devices.common.IBM.ib.rte  7.1.5.30  APPLIED    Infiniband Common Runtime
                                                  Environment
  devices.pciex.b3154a63.rte
                             7.1.5.30  APPLIED    4X PCI-E DDR Infiniband Device
                                                  Driver
  devices.pciex.b315506714101604.rte
                             7.1.4.30  COMMITTED  RoCE Host Bus Adapter
                                                  (b315506714101604)
  udapl.rte                  7.1.5.0   APPLIED    uDAPL
The command output varies depending on version, technology level, and service pack level.
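Because the output varies, it can be easier to check each fileset individually than to read the table. The following is a minimal sketch; the function name is hypothetical, and it assumes that lslpp -l exits with a nonzero status when a fileset is not installed, which you should confirm on your AIX level.

```shell
# Print each fileset that lslpp does not report as installed.
# Assumption: "lslpp -l <fileset>" exits nonzero for a missing fileset.
missing_filesets() {
    for fileset in "$@"; do
        lslpp -l "$fileset" >/dev/null 2>&1 || echo "$fileset"
    done
}

# Example, using the fileset names from the command above:
#   missing_filesets bos.mp64 devices.chrp.IBM.lhca.rte devices.common.IBM.ib.rte \
#       devices.pciex.b3154a63.rte devices.pciex.b315506714101604.rte udapl.rte
```

No output means that every listed fileset was found.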
- If any of the filesets in the previous step were newly installed or updated, reboot the system by running the following command:
- Configure the RoCE subsystem and set IP addresses:
- Configure the RoCE network subsystem in this substep only if an RoCE network was never set up before on the host. Run the smitty icm command:
- Select Add an InfiniBand Communication Manager
- Press Enter and wait for the command to complete
- Exit by pressing Esc+0
For example,
Infiniband Communication Manager Device Name icm
Minimum Request Retries [1]
Maximum Request Retries [7]
Minimum Response Time (msec) [100]
Maximum Response Time (msec) [4300]
Maximum Number of HCA's [256]
Maximum Number of Users [65000]
Maximum Number of Work Requests [65000]
Maximum Number of Service ID's [1000]
Maximum Number of Connections [65000]
Maximum Number of Records Per Request [64]
Maximum Queued Exception Notifications Per User [1000]
Number of MAD buffers per HCA [64]
- Reboot the systems by running the following command on each host:
- You must associate each interconnect netname for a member or CF that will be selected during installation with an IPv4 pseudo IP address in /etc/hosts. Each interconnect netname is associated with an RoCE communication adapter port through the Direct Access Transport (DAT) configuration file in the next step. This pseudo IP address is used only for resolving the netname and for uDAPL purposes; it is not pingable. Each pseudo IP address must be unique.
Update the /etc/hosts file on each host in the planned Db2 pureScale environment so that the file includes all the pseudo IP addresses of interconnect netnames in the planned environment.
The /etc/hosts file must have this format: <IP_Address> <fully_qualified_name> <short_name>. All hosts in the cluster must have the same /etc/hosts format. For example, in a planned Db2 pureScale environment with multiple communication adapter ports on the CFs and four members, the /etc/hosts configuration file might resemble the following file:
10.222.1.1 cf1-en1.example.com cf1-en1
10.222.2.1 cf1-en2.example.com cf1-en2
10.222.3.1 cf1-en3.example.com cf1-en3
10.222.4.1 cf1-en4.example.com cf1-en4
10.222.1.2 cf2-en1.example.com cf2-en1
10.222.2.2 cf2-en2.example.com cf2-en2
10.222.3.2 cf2-en3.example.com cf2-en3
10.222.4.2 cf2-en4.example.com cf2-en4
10.222.1.101 member1-en1.example.com member1-en1
10.222.2.101 member1-en2.example.com member1-en2
10.222.1.102 member2-en1.example.com member2-en1
10.222.2.102 member2-en2.example.com member2-en2
10.222.1.103 member3-en1.example.com member3-en1
10.222.2.103 member3-en2.example.com member3-en2
10.222.1.104 member4-en1.example.com member4-en1
10.222.2.104 member4-en2.example.com member4-en2
Note: The pseudo IP addresses of the netnames for a given CF or member must each have a different third octet. All pseudo IP addresses of members must have the same third octet, which is the same as the third octet for the pseudo IP address associated with the first communication adapter port of each of the CFs and members. In the previous example, the third octet is 1.
None of the host names in the previous example are associated with regular Ethernet adapters. These host names are set up only for resolving the netnames and for uDAPL purposes. They are not pingable.
In a four-member environment that uses only one communication adapter port for each CF and member, the file would look similar to the previous example, but contain only the first pseudo IP address of each of the CFs and members in the previous example. Here is an example of this:
10.222.1.1 cf1-en1.example.com cf1-en1
10.222.1.2 cf2-en1.example.com cf2-en1
10.222.1.101 member1-en1.example.com member1-en1
10.222.1.102 member2-en1.example.com member2-en1
10.222.1.103 member3-en1.example.com member3-en1
10.222.1.104 member4-en1.example.com member4-en1
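The format and uniqueness requirements for these entries can be checked mechanically. The following is a minimal sketch, assuming the planned entries are kept in a file of their own or that you treat messages about unrelated /etc/hosts lines as informational. The function name is hypothetical. It does not check the third-octet rule from the note.

```shell
# Illustrative check; the function name is an assumption.
# Reports lines that do not have exactly three fields
# (<IP_Address> <fully_qualified_name> <short_name>) and repeated addresses.
check_netname_entries() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }    # skip blank lines and comments
        NF != 3 {
            printf "line %d: expected 3 fields, found %d\n", NR, NF
            bad = 1
        }
        seen[$1]++ {                     # true from the second occurrence on
            printf "line %d: duplicate address %s\n", NR, $1
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example: check_netname_entries /etc/hosts
```

The function prints nothing and exits with status 0 when every entry passes both checks.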
- If the Direct Access Transport (DAT) configuration file /etc/dat.conf was previously saved, verify that the contents are still equivalent. If the contents are no longer equivalent, replace the current dat.conf with the saved copy. If the dat.conf file was not previously set up, edit the dat.conf file on each host to add a line to associate each interconnect netname with a uDAPL device and an RoCE adapter port.
The /etc/dat.conf file must contain entries only for the adapters that are in the local host. The sample /etc/dat.conf file that is installed by default typically contains irrelevant entries. To avoid unnecessary processing of the file, make the following changes:
- Move all the Db2 pureScale cluster-related adapter entries to the top of the file.
- Comment out the irrelevant entries or remove them from the file.
The following is an example:
<interface adapter name> u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce0 1 hostname-en1" " "
- The <interface adapter name> string cannot be more than 19 characters long.
- The name within quotes ("/dev/roce0 1 hostname-en1") is the platform-specific string. This string consists of:
- Adapter special file (/dev/roce0)
- Port number (1 or 2)
- The interconnect netname for the member or CF that will run on this host.
The following format is also supported:
hca0 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce0 1 10.10.11.131" " "
Where 10.10.11.131 is the pseudo IP address corresponding to the netname.
Note: If you receive a communication error between the member and CF, it is likely that the system attempted to communicate with an adapter interface that is not set up correctly in the Direct Access Transport (DAT) configuration file for the adapter port.
In the case of a CF or member that uses two communication adapters, each with two ports, the /etc/dat.conf file would resemble the following example:
hca0 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce0 1 cf1-en1" " "
hca1 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce0 2 cf1-en2" " "
hca2 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce1 1 cf1-en3" " "
hca3 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce1 2 cf1-en4" " "
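The two constraints stated above, an interface adapter name of at most 19 characters and a three-part platform-specific string whose port is 1 or 2, can be checked with a short script. The following is a minimal sketch; the function name is hypothetical, and it does not verify that the special file exists or that the netname resolves.

```shell
# Illustrative check of dat.conf-style entries; the function name is an assumption.
# Reports adapter names longer than 19 characters and platform-specific strings
# that are not of the form "<special file> <port 1 or 2> <netname or address>".
check_dat_conf() {
    awk -F'"' '
        $0 ~ /^[ \t]*(#|$)/ { next }        # skip comments and blank lines
        {
            split($1, head, " ")            # head[1] is the interface adapter name
            n = split($2, plat, " ")        # special file, port, netname
            if (length(head[1]) > 19) {
                printf "line %d: adapter name longer than 19 characters\n", NR
                bad = 1
            }
            if (n != 3 || (plat[2] != "1" && plat[2] != "2")) {
                printf "line %d: unexpected platform string \"%s\"\n", NR, $2
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example: check_dat_conf /etc/dat.conf
```

Splitting on the double quote isolates the platform-specific string as the second field, so its three parts can be examined without parsing the rest of the line.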
- Verify the RoCE network subsystem. Verify that the RoCE network components are in the Available state.
For example, the system output of the following command, run on a host, verifies that all devices are available:
# lsdev -C | grep -E "Infiniband|PCIE RDMA"
icm Available Infiniband Communication Manager
roce0 Available 02-00 PCIE RDMA over Converged Ethernet RoCE Adapter
(b315506714101604)
To check the state, use the ibstat -v command. Verify that the ports are active and the links are up. This check applies only to the port and interface that were previously identified in /etc/dat.conf (by default, port 1 on roce0):
-------------------------------------------------------------------------------
ETHERNET PORT 1 INFORMATION (roce0)
-------------------------------------------------------------------------------
Link State: UP
Link Speed: 10G XFI
Link MTU: 9600
Hardware Address: 00:02:c9:4b:97:b8
GIDS (up to 3 GIDs):
GID0 :00:00:00:00:00:00:00:00:00:00:00:02:c9:4b:97:b8
GID1 :00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
GID2 :00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
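When a host has several adapters or ports, the link state lines can be pulled out of this output with a small filter. The following is a minimal sketch that assumes the output layout matches the sample above; the function name is hypothetical.

```shell
# Illustrative filter; the function name is an assumption. It reads ibstat -v
# style output on standard input and prints one line per port, for example
# "roce0 port 1 UP". It assumes the layout of the sample output above.
summarize_link_states() {
    awk '
        /ETHERNET PORT [0-9]+ INFORMATION/ {
            port = $3
            dev = $5
            gsub(/[()]/, "", dev)      # "(roce0)" -> "roce0"
        }
        /Link State:/ { print dev, "port", port, $3 }
    '
}

# Example: ibstat -v | summarize_link_states
```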
- Ensure that Global Pause (IEEE 802.3x) is enabled on the switches connected to the adapters. For details, see Switch configuration on an RoCE network (AIX).