A remote direct memory access (RDMA) over Converged Ethernet (RoCE)
network without IP support is characterized by a RoCE special device
file (and the absence of a network interface) on hosts that can only
transmit and receive RDMA data. To configure the network settings,
you must install the required uDAPL software, configure the InfiniBand
Communication Manager (ICM), associate interconnect netnames with pseudo
IP addresses, and add the required entries to the Direct Access
Transport (DAT) configuration file.
Before you begin
The steps in this topic configure the network settings of hosts
on a RoCE network that does not have network interface card IP support.
This topic is specific to configurations with these adapters: EC26,
EC27, EC28, EC29, EC30. If you are configuring the network settings
of hosts on a RoCE network with IP support, see the topic Configuring
network settings on a RoCE network with IP support.
Ensure that you complete the following
tasks:
About this task
You must perform these steps on each host, or LPAR, that you
want to participate in the Db2 pureScale instance.
Cluster caching facilities (CFs) and members support multiple
communication adapter ports to help Db2 pureScale environments
scale and to help with high availability.
Only one communication adapter port is required for each CF or member,
though using more adapter ports is recommended to increase bandwidth,
add redundancy, and allow the use of multiple switches. This topic
guides you through installing and setting up User Direct Access
Programming Library (uDAPL) on AIX® hosts and configuring IP addresses.
Procedure
- Log in as root.
- Ensure that all AIX fixes listed in the installation prerequisites
are installed.
- If the /etc/dat.conf file was previously set up with the desired
values, save a copy of the existing dat.conf file.
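The save, and the equivalence check that the later dat.conf step asks for, can be sketched with the standard cp and cmp commands. This is a minimal sketch only: the function names and the /etc/dat.conf.saved path are hypothetical and are not part of the documented procedure.

```shell
# Hypothetical helpers; neither name comes from the Db2 or AIX documentation.

# Save a copy of a dat.conf file, preserving permissions and timestamps.
save_dat_conf() {
    cp -p "$1" "$2"
}

# Succeed only if the current file is byte-identical to the saved copy.
dat_conf_unchanged() {
    cmp -s "$1" "$2"
}

# Possible usage as root (the backup path is an assumption):
#   save_dat_conf /etc/dat.conf /etc/dat.conf.saved
#   dat_conf_unchanged /etc/dat.conf /etc/dat.conf.saved || echo "dat.conf changed"
```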
- Verify that your system has the correct uDAPL and RoCE
network file sets.
To
verify uDAPL is installed correctly, run the following command, shown
with sample output:
$ lslpp -l bos.mp64 devices.chrp.IBM.lhca.rte devices.common.IBM.ib.rte devices.pciex.b3154a63.rte devices.pciex.b315506714101604.rte udapl.rte
  Fileset                   Level     State      Description
  ----------------------------------------------------------------------------
Path: /usr/lib/objrepos
  bos.mp64                  7.1.3.45  APPLIED    Base Operating System 64-bit
                                                 Multiprocessor Runtime
  devices.chrp.IBM.lhca.rte 7.1.3.45  APPLIED    Infiniband Logical HCA Runtime
                                                 Environment
  devices.common.IBM.ib.rte 7.1.3.45  APPLIED    Infiniband Common Runtime
                                                 Environment
  devices.pciex.b3154a63.rte
                            7.1.3.45  APPLIED    4X PCI-E DDR Infiniband Device
                                                 Driver
  devices.pciex.b315506714101604.rte
                            7.1.3.30  APPLIED    Dual Port 10 Gigabit RDMA
                                                 Converged Ethernet Adapter
                                                 (RoCE)
  udapl.rte                 7.1.3.30  APPLIED    uDAPL
Path: /etc/objrepos
  bos.mp64                  7.1.3.45  APPLIED    Base Operating System 64-bit
                                                 Multiprocessor Runtime
  devices.chrp.IBM.lhca.rte 7.1.3.45  APPLIED    Infiniband Logical HCA Runtime
                                                 Environment
  devices.common.IBM.ib.rte 7.1.3.45  APPLIED    Infiniband Common Runtime
                                                 Environment
  devices.pciex.b3154a63.rte
                            7.1.3.45  APPLIED    4X PCI-E DDR Infiniband Device
                                                 Driver
  devices.pciex.b315506714101604.rte
                            7.1.3.30  COMMITTED  RoCE Host Bus Adapter
                                                 (b315506714101604)
  udapl.rte                 7.1.3.30  APPLIED    uDAPL
The
command output varies depending on version, technology level, and
service pack level.
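Because the listing is long, it can help to check mechanically that every required fileset appears in it. The following is a minimal sketch: the missing_filesets function and the required variable are hypothetical names, and the check only tests that each fileset name occurs in the lslpp listing. It does not validate levels or states.

```shell
# Fileset names taken from the lslpp command shown above.
required="bos.mp64 devices.chrp.IBM.lhca.rte devices.common.IBM.ib.rte devices.pciex.b3154a63.rte devices.pciex.b315506714101604.rte udapl.rte"

# Hypothetical helper: read an lslpp -l listing on standard input and
# print each required fileset whose name does not appear in it.
missing_filesets() {
    listing=$(cat)
    for fs in $required; do
        printf '%s\n' "$listing" | grep -q -F -- "$fs" || echo "$fs"
    done
}

# Possible usage on an AIX host:
#   lslpp -l $required 2>/dev/null | missing_filesets
```

No output means that every name was found in the listing.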
- If any of the filesets in the previous step were newly installed or
updated, reboot the system by running the following command:
shutdown -Fr
- Configure the RoCE subsystem and set IP addresses:
- Configure the RoCE network subsystem in this substep
only if a RoCE network was never set up before on the host.
Run the smitty icm command:
- Select Add an InfiniBand Communication Manager.
- Press Enter and wait for the command to complete.
- Press Esc+0 to exit.
For example,
Infiniband Communication Manager Device Name icm
Minimum Request Retries [1]
Maximum Request Retries [7]
Minimum Response Time (msec) [100]
Maximum Response Time (msec) [4300]
Maximum Number of HCA's [256]
Maximum Number of Users [65000]
Maximum Number of Work Requests [65000]
Maximum Number of Service ID's [1000]
Maximum Number of Connections [65000]
Maximum Number of Records Per Request [64]
Maximum Queued Exception Notifications Per User [1000]
Number of MAD buffers per HCA [64]
- Reboot the systems by running the following command on
each host:
shutdown -Fr
-
You must associate each interconnect netname for a member or
CF that will be selected during installation with an IPv4 pseudo IP
address in /etc/hosts. Each interconnect netname is associated
with a RoCE communication adapter port through the Direct Access
Transport (DAT) configuration file in the next step. This pseudo IP
address is used only for resolving the netname and for uDAPL purposes;
it is not pingable. Each pseudo IP address must be unique.
Update the /etc/hosts file on each host so that it includes the
pseudo IP addresses of all interconnect netnames in the planned
Db2 pureScale environment.
The /etc/hosts file must have this format:
<IP_Address> <fully_qualified_name> <short_name>.
All hosts in the cluster must have the same /etc/hosts format. For
example, in a planned Db2 pureScale environment
with multiple communication adapter ports on the CFs and four members,
the /etc/hosts configuration file might resemble
the following file:
10.222.1.1 cf1-en1.example.com cf1-en1
10.222.2.1 cf1-en2.example.com cf1-en2
10.222.3.1 cf1-en3.example.com cf1-en3
10.222.4.1 cf1-en4.example.com cf1-en4
10.222.1.2 cf2-en1.example.com cf2-en1
10.222.2.2 cf2-en2.example.com cf2-en2
10.222.3.2 cf2-en3.example.com cf2-en3
10.222.4.2 cf2-en4.example.com cf2-en4
10.222.1.101 member1-en1.example.com member1-en1
10.222.2.101 member1-en2.example.com member1-en2
10.222.1.102 member2-en1.example.com member2-en1
10.222.2.102 member2-en2.example.com member2-en2
10.222.1.103 member3-en1.example.com member3-en1
10.222.2.103 member3-en2.example.com member3-en2
10.222.1.104 member4-en1.example.com member4-en1
10.222.2.104 member4-en2.example.com member4-en2
Note: The pseudo IP addresses of each netname for the
CF and member must have a different
third octet. All pseudo IP addresses of members must have the same third
octet, which is the same as the third octet for the pseudo IP address
associated with the first communication adapter port of each of the
CFs and members. In the previous example,
the third octet is 1.
None of the host names in the previous example
is associated with a regular Ethernet adapter. These host names
are set up only for resolving the netnames and for uDAPL purposes.
They are not pingable.
In a four-member environment that uses only
one communication adapter port for each CF and member,
the file would look similar to the previous example
but contain only the first pseudo IP address of each of the CFs
and members. For example:
10.222.1.1 cf1-en1.example.com cf1-en1
10.222.1.2 cf2-en1.example.com cf2-en1
10.222.1.101 member1-en1.example.com member1-en1
10.222.1.102 member2-en1.example.com member2-en1
10.222.1.103 member3-en1.example.com member3-en1
10.222.1.104 member4-en1.example.com member4-en1
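The format and uniqueness rules above lend themselves to a quick mechanical check before the file is copied to every host. The following is a minimal sketch under stated assumptions: check_netnames is a hypothetical helper, it expects a file that holds only the interconnect netname entries (not a complete /etc/hosts file), and the test that the fully qualified name begins with the short name followed by a period is inferred from the examples rather than stated as a requirement.

```shell
# Hypothetical helper: validate a file of netname entries in the form
#   <IP_Address> <fully_qualified_name> <short_name>
# Exit status is nonzero if any entry fails a check.
check_netnames() {
    awk '
        /^[ \t]*#/ { next }          # skip comment lines
        NF == 0    { next }          # skip blank lines
        NF != 3 {
            printf "line %d: expected 3 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        seen[$1]++ {                 # each pseudo IP address must be unique
            printf "line %d: address %s is not unique\n", NR, $1
            bad = 1
        }
        index($2, $3 ".") != 1 {     # assumed convention: FQDN = short name + "." + domain
            printf "line %d: %s does not begin with %s.\n", NR, $2, $3
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Possible usage (the 10.222 prefix matches the example addresses only):
#   grep '^10\.222\.' /etc/hosts > /tmp/netnames && check_netnames /tmp/netnames
```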
-
If the Direct Access Transport (DAT) configuration file /etc/dat.conf
was previously saved, verify that the contents are still equivalent.
If they are not, replace the current dat.conf file with the saved copy.
If the dat.conf file was not previously set up, edit the
dat.conf file on each host to add a line to associate each interconnect
netname with a uDAPL device and a RoCE adapter port.
The /etc/dat.conf file must contain only entries for the adapters
that are in the local host. The sample /etc/dat.conf file that is
installed by default typically contains irrelevant entries. To avoid
unnecessary processing of the file, make the following changes:
- Move all the Db2 pureScale cluster-related adapter entries to the top of the file.
- Comment out the irrelevant entries or remove them from the file.
The following is an example:
<interface adapter name> u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce0 1 hostname-en1" " "
- The <interface adapter name> string cannot be more than 19 characters
long.
- The name within quotes ("/dev/roce0 1 hostname-en1") is the platform-specific string. This
string consists of:
- Adapter special file ( /dev/roce0 )
- port number ( 1 or 2 )
- The interconnect netname for the member or CF that will run on this host.
The following format is also supported:
hca0 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce0 1 10.10.11.131" " "
Where
10.10.11.131 is the pseudo IP address corresponding to the netname.
Note: If you receive a communication error between the member and CF,
it is likely that the system attempted to communicate with an adapter
interface that is not set up correctly in the Direct Access Transport
(DAT) configuration file for the adapter port.
In the case of a CF or member that uses two communication adapters,
each with two ports, the /etc/dat.conf file would resemble the
following example:
hca0 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce0 1 cf1-en1" " "
hca1 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce0 2 cf1-en2" " "
hca2 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce1 1 cf1-en3" " "
hca3 u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "/dev/roce1 2 cf1-en4" " "
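Because every entry shares the same fixed fields and differs only in the interface adapter name, device file, port, and netname, the lines can be generated rather than typed by hand. The following is a minimal sketch: dat_entry is a hypothetical helper, and the fixed fields are copied from the examples in this topic. It also enforces the 19-character limit on the interface adapter name.

```shell
# Hypothetical helper: print one dat.conf line.
# Arguments: <interface adapter name> <adapter special file> <port> <netname or pseudo IP>
dat_entry() {
    name=$1
    dev=$2
    port=$3
    netname=$4
    # The interface adapter name cannot be more than 19 characters long.
    if [ "${#name}" -gt 19 ]; then
        echo "interface adapter name '$name' is longer than 19 characters" >&2
        return 1
    fi
    printf '%s u2.0 nonthreadsafe default /usr/lib/libdapl/libdapl2.a(shr_64.o) IBM.1.1 "%s %s %s" " "\n' \
        "$name" "$dev" "$port" "$netname"
}

# Possible usage, reproducing the first two lines of the example above:
#   dat_entry hca0 /dev/roce0 1 cf1-en1
#   dat_entry hca1 /dev/roce0 2 cf1-en2
```

Review the generated lines before you add them to /etc/dat.conf; the helper does not check that the device file or netname exists.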
- Verify the RoCE network subsystem. Verify that the RoCE network
components are in the Available state.
For example, the output of the following command, run on a host,
verifies that all devices are available:
# lsdev -C | grep -E "Infiniband|PCIE RDMA"
icm Available Infiniband Communication Manager
roce0 Available 02-00 PCIE RDMA over Converged Ethernet RoCE Adapter
(b315506714101604)
To check the link state, use the ibstat -v command. Verify
that the ports are active and the links are up. This check applies
only to the port and interface that were previously identified in
/etc/dat.conf (by default, port 1 on roce0):
-------------------------------------------------------------------------------
ETHERNET PORT 1 INFORMATION (roce0)
-------------------------------------------------------------------------------
Link State: UP
Link Speed: 10G XFI
Link MTU: 9600
Hardware Address: 00:02:c9:4b:97:b8
GIDS (up to 3 GIDs):
GID0 :00:00:00:00:00:00:00:00:00:00:00:02:c9:4b:97:b8
GID1 :00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
GID2 :00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
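On hosts with several adapters and ports, the link states can be summarized from this output. The following is a minimal sketch: link_states is a hypothetical helper, and its field positions assume that the ibstat -v output has the same layout as the sample above, which might differ by AIX level.

```shell
# Hypothetical helper: read ibstat -v output on standard input and print
# one "<device> port <n> <state>" line for each port section.
link_states() {
    awk '
        /ETHERNET PORT/ {
            port = $3                 # for example: 1
            dev = $5                  # for example: (roce0)
            gsub(/[()]/, "", dev)     # strip the parentheses
        }
        /Link State:/ { print dev, "port", port, $3 }
    '
}

# Possible usage on an AIX host:
#   ibstat -v | link_states
```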
-
Ensure that Global Pause (IEEE 802.3x) is enabled on the switches
connected to the adapters. For details, see Configuring
switch failover for a DB2 pureScale environment on a RoCE network
(AIX).