Remote Direct Memory Access (RDMA) is a mechanism that enables computers to access memory locations on other computers in a way that bypasses the operating system, that is, the kernel and the TCP stack. Compared with a traditional hardware-software architecture based on TCP, RDMA offers several advantages. Kernel bypass means shorter paths between two applications and reduced CPU utilization. The protocol overhead specific to TCP is eliminated by transferring the data directly between the network adapter and the application memory (user space) without the need to copy and buffer data into the kernel space.
To take advantage of all the benefits offered by RDMA, applications need to be written using RDMA semantics or upper layer protocols like User Direct Access Programming Library (uDAPL) or Message Passing Interface (MPI). However, because rewriting a TCP application for RDMA can be very expensive, an alternative solution was developed for such cases. This method is called Sockets Direct Protocol (SDP) and does not require any recoding of applications.
SDP is a wire protocol used between RDMA-capable adapters and sockets. For that reason, SDP is transparent to applications, and the standard streams socket implementation need not be replaced with another API. DB2 applications and DB2 servers can run unmodified over SDP or TCP. The user can simply select the protocol to be used prior to executing the application by pre-loading the SDP shared library. All settings pertaining to TCP, such as host names, IP addresses, and ports require no modifications either.
For example, a Java application that uses TCP to connect to a database
server can also work over SDP using the same JDBC URL. The SDP library,
once pre-loaded, will decide which protocol must be enabled based on a set
of rules defined in /etc/libsdp.conf, and the
protocol that the server accepts. The default rules specify SDP as the
first option, and in case the connection fails, the SDP library will fall
back to TCP.
An application can use SDP only, TCP only, or both at the same time. For instance, an application can be configured to use SDP for the DB2 database connections and TCP for LDAP connections. The database and the LDAP servers can be running on different physical machines, or on the same machine but listening on different interfaces. The different scenarios and how rules apply will be discussed later in this article.
Hardware and software infrastructure
RDMA requires special hardware and software infrastructure. This article will describe SDP on the Linux x86 platform.
- Host adapters
RDMA is supported on two types of host adapters: InfiniBand adapters and RoCE adapters. The former requires InfiniBand switches, and the latter requires Ethernet switches.
Mellanox RoCE adapters are used for this example. All instructions provided in this article apply to InfiniBand adapters with little or no modification. RoCE (RDMA over Converged Ethernet) is an implementation of the InfiniBand protocol over Ethernet. Converged Ethernet networks allow different types of protocols to share the same medium. Regular LAN Ethernet frames encapsulating IP packets, InfiniBand over Ethernet, and Fibre Channel over Ethernet can all coexist on a single Ethernet wire.
- Device drivers
There are several options regarding the device drivers and tools needed to operate a Mellanox adapter. All available device drivers and tools are based on the OFED stack maintained by the OpenFabrics Alliance. See the Resources section for a link to the OFED website.
- Operating system
Red Hat Enterprise Linux 6.2, kernel version 2.6.32-220.4.1.el6.x86_64.
- Ethernet switches
10Gbps RackSwitch G8264, Operating System version 6.8.1.0.
Installing Mellanox drivers and tools
- The Mellanox software is packaged as an ISO image. Download the appropriate image for your operating system from the link provided in the Resources section. For Red Hat 6.2, the image is called MLNX_OFED_LINUX-1.5.3-3.0.0-rhel6.2-x86_64.iso.
Unless specified, all of the following commands are executed as root.
Mount the ISO image on each computer and start the installer, as shown in Listing 1.
Listing 1. ISO image and installer
bash# mount -o loop MLNX_OFED_LINUX-1.5.3-3.0.0-rhel6.2-x86_64.iso /mnt/iso
bash# cd /mnt/iso
bash# ./mlnxofedinstall
This program will install the MLNX_OFED_LINUX package on your machine.
Note that all other Mellanox, OEM, OFED, or Distribution IB packages will be removed.
The installer installs all drivers and tools and attempts to upgrade the firmware in each Mellanox card whose firmware is at a lower level than the one included in the ISO image.
- Reboot the computer in order to load all kernel modules.
- SDP requires the Ethernet interfaces to be configured with an IP address. For InfiniBand interfaces, IPoIB must be enabled, which means configuring IP addresses on the ib* interfaces and setting IPOIB_LOAD=yes in /etc/infiniband/openib.conf.
First, you need to identify the new devices. The ibv_devinfo command lists all Mellanox network adapters and their current settings, including the link layer type (Ethernet in this case), firmware version, and link status, as shown in Listing 2.
Listing 2. Mellanox network adapters
bash# ibv_devinfo
hca_id: mlx4_0
    transport:          InfiniBand (0)
    fw_ver:             2.10.300
    node_guid:          0002:c903:0005:6aa8
    sys_image_guid:     0002:c903:0005:6aab
    vendor_id:          0x02c9
    vendor_part_id:     4099
    hw_ver:             0x0
    board_id:           IBM1020110023
    phys_port_cnt:      2
        port: 1
            state:      PORT_ACTIVE (4)
            max_mtu:    2048 (4)
            active_mtu: 1024 (3)
            sm_lid:     0
            port_lid:   0
            port_lmc:   0x00
            link_layer: Ethernet
        port: 2
            state:      PORT_DOWN (1)
            max_mtu:    2048 (4)
            active_mtu: 1024 (3)
            sm_lid:     0
            port_lid:   0
            port_lmc:   0x00
            link_layer: Ethernet
- To identify which port is associated with an eth* interface, execute the following command and assign IP addresses to the appropriate interface, as shown in Listing 3.
Listing 3. Identify which port
bash# ibdev2netdev
mlx4_0 port 1 ==> eth8 (Up)
mlx4_0 port 2 ==> eth9 (Down)
bash# ifconfig eth8 10.7.7.1 netmask 255.255.255.0
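When several adapters or ports are present, picking the active interface by eye becomes error-prone. The following is a minimal sketch, assuming the ibdev2netdev output format shown in Listing 3; the function name up_interfaces is hypothetical and not part of the Mellanox tools.

```shell
# Print the network interface names whose link is reported as Up.
# Reads ibdev2netdev-style lines on stdin, for example:
#   mlx4_0 port 1 ==> eth8 (Up)
up_interfaces() {
    awk '$2 == "port" && $4 == "==>" && $6 == "(Up)" { print $5 }'
}
```

It would typically be used as ibdev2netdev | up_interfaces, which for the output in Listing 3 prints eth8.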
Enabling SDP in kernel and user space
SDP is not enabled by default. In addition to the
libsdp.so user library mentioned previously,
part of SDP is implemented as a kernel module which needs to be
loaded.
- To load the SDP kernel module ib_sdp at boot, edit /etc/infiniband/openib.conf and set the SDP_LOAD parameter to yes, as shown in Listing 4.
Listing 4. Parameter set to yes
SDP_LOAD=yes
- Reboot the system or load the module manually. Then verify the kernel
module is loaded, as shown in Listing 5.
Listing 5. Load the module
bash# modprobe ib_sdp
bash# lsmod | grep ib_sdp
ib_sdp                130827  0
rdma_cm                35175  2 ib_sdp,rdma_ucm
ipv6                  322029  89 ib_sdp,ib_addr,ib_ipoib
ib_core                69947  14
- Configure applications to load the dynamic library /usr/lib64/libsdp.so. There are two methods for loading the SDP library. You can preload the library for each process you create, or preload the library for the whole system. These methods apply to any application that uses glibc socket calls, including Java applications that use JDBC to connect to database servers.
- Preload the library for each process you create, as shown in Listing 6.
Listing 6. Preload library
bash# LD_PRELOAD=/usr/lib64/libsdp.so myapplication
- Preload the library for the whole system. Every process in the system will pre-load this library. Add the following line to /etc/ld.so.preload, as shown in Listing 7.
Listing 7. Preload SDP with /etc/ld.so.preload
/usr/lib64/libsdp.so
- Check if an application will be loading the SDP library at runtime, as shown in Listing 8.
Listing 8. SDP library at runtime
bash# ldd my_app | grep sdp
/usr/lib64/libsdp.so (0x00007f217170d000)
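The ldd check in Listing 8 reports what a program would load when started. For a process that is already running, one option on Linux is to look at its memory map in /proc. The sketch below assumes that a preloaded library appears in /proc/<pid>/maps under a path containing libsdp.so; the function names are hypothetical.

```shell
# Succeed if the given maps file (normally /proc/<pid>/maps) lists a
# mapping whose path contains "libsdp.so".
maps_has_libsdp() {
    grep -q 'libsdp\.so' "$1"
}

# Convenience wrapper that takes a process ID instead of a file name.
pid_has_libsdp() {
    maps_has_libsdp "/proc/$1/maps"
}
```

For example, pid_has_libsdp 10090 && echo "SDP library is mapped" reports on the process with PID 10090. Reading another user's maps file may require root.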
Before starting the DB2 server and clients, you can use the echo server and telnet to test that SDP is functioning properly. Enable the echo server in /etc/xinetd.d/echo-stream by setting disable = no.
- Restart the xinetd daemon, as shown in Listing 9.
Listing 9. xinetd daemon
bash# service xinetd restart
Stopping xinetd: [ OK ]
Starting xinetd: [ OK ]
- Verify the xinetd process is linked to the SDP library, as shown in Listing 10.
Listing 10. xinetd process
bash# lsof -p `pidof xinetd` | grep sdp
xinetd 10090 root mem REG 8,5 69128 9339172 /usr/lib64/libsdp.so.1.0.0
- You may now connect to the echo server on port 7 from another computer that is configured with SDP, as shown in Listing 11.
Listing 11. Echo server connect
bash# LD_PRELOAD=/usr/lib64/libsdp.so telnet 10.7.7.1 7
Trying 10.7.7.1...
Connected to 10.7.7.1.
Escape character is '^]'.
test sdp
test sdp
- At this point the telnet client is connected to the echo server, and any line you type into telnet will be echoed back by the echo server. The active SDP connections can be listed with the sdpnetstat command, which is similar to the standard Linux netstat, as shown in Listing 12.
Listing 12. Active SDP connections command
bash# sdpnetstat -Sn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
sdp        0      0 10.7.7.2:41771          10.7.7.1:7              ESTABLISHED
Note that the SDP connections cannot be displayed by the standard netstat command.
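Because the library can silently fall back to TCP, a scripted check that a connection really is using SDP can be handy. The following sketch assumes the sdpnetstat -Sn column layout shown in Listing 12; the function name sdp_established_to is hypothetical.

```shell
# Succeed if stdin (sdpnetstat -Sn style output) contains an ESTABLISHED
# SDP connection whose foreign address equals $1, for example 10.7.7.1:7.
sdp_established_to() {
    awk -v peer="$1" '
        $1 == "sdp" && $5 == peer && $6 == "ESTABLISHED" { found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

On the client, it could be run as sdpnetstat -Sn | sdp_established_to 10.7.7.1:7 && echo "echo test is using SDP".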
The steps required to enable SDP for DB2 are similar to those in the echo server example. Make sure the SDP library is listed in /etc/ld.so.preload, and restart the DB2 instance. The DB2 server cannot be switched to SDP by setting or exporting the LD_PRELOAD environment variable, because DB2 consists of many forked processes, some of which run as root or setuid root. Therefore, the variable will not be propagated to all child processes.
- Log in as the instance owner and restart the instance, as shown in Listing 13.
Listing 13. Restart the DB2 instance
db2inst1$ db2stop
db2inst1$ db2start
- To test that DB2 is using SDP, use db2batch, a simple command that is included in DB2. On the DB2 client machine, log in as the client instance owner and create a file that contains the SQL statements to be used in the benchmark. A sample file, update.sql, is provided in Listing 14.
Listing 14. Sample update.sql file
--#SET PERF_DETAIL 0
create table demo (c1 bigint, c2 double, c3 varchar(8));
--#BGBLK 4000
insert into demo values (-9223372036854775808, -0.000000000000005, 'demo');
--#EOBLK
--#SET ROWS_OUT 0
select * from demo;
drop table demo;
There are four statements in this example. The first one creates a table that you will drop at the end of the test in statement four. The second statement is an INSERT, which will be executed 4000 times. Statement three will select all rows from the table (4000 rows) but won't display them to stdout.
- To start the test, run the command as the client instance owner on the client machine, as shown in Listing 15.
Listing 15. Client instance
db2inst1$ LD_PRELOAD=/usr/lib64/libsdp.so db2batch -d sample -f update.sql \
    -a db2inst1/password -r result.txt,summary.txt
The db2batch command generates two results files. result.txt contains detailed metrics for each individual statement, and summary.txt contains totals and average time for each block.
Even though this is a very simple test, you can see the effect of SDP on performance. Inserts average about 230 microseconds over SDP compared to about 360 microseconds over TCP. A SELECT statement that returns 4000 records takes about 4 milliseconds over SDP and 6 milliseconds over TCP.
These tests were performed using the default SDP settings for sdp_zcopy_thresh (64 KB) and recv_poll (700). See the Performance tuning section for details on how to tune these parameters.
- The summary for db2batch over SDP is shown in Listing 16.
Listing 16. Summary for db2batch over SDP
Type      Number      Repetitions Total Time (s) Min Time (s)   Max Time (s)
--------- ----------- ----------- -------------- -------------- --------------
Statement           1           1       0.115556       0.115556       0.115556
Block               1        4000       0.924646       0.000200       0.001328
Statement           3           1       0.004281       0.004281       0.004281
Statement           4           1       0.133243       0.133243       0.133243

Type      Number      Repetitions Arithmetic Mean Geometric Mean
--------- ----------- ----------- --------------- --------------
Statement           1           1        0.115556       0.115556
Block               1        4000        0.000231       0.000229
Statement           3           1        0.004281       0.004281
Statement           4           1        0.133243       0.133243
- The summary for db2batch over TCP is shown in Listing 17.
Listing 17. Summary for db2batch over TCP
Type      Number      Repetitions Total Time (s) Min Time (s)   Max Time (s)
--------- ----------- ----------- -------------- -------------- --------------
Statement           1           1       0.115521       0.115521       0.115521
Block               1        4000       1.445863       0.000326       0.001034
Statement           3           1       0.006005       0.006005       0.006005
Statement           4           1       0.137436       0.137436       0.137436

Type      Number      Repetitions Arithmetic Mean Geometric Mean
--------- ----------- ----------- --------------- --------------
Statement           1           1        0.115521       0.115521
Block               1        4000        0.000361       0.000360
Statement           3           1        0.006005       0.006005
Statement           4           1        0.137436       0.137436
- The comparison between SDP and TCP is shown graphically in Figure 1 and Figure 2.
Figure 1. Average time per INSERT
Figure 2. Total time for a 4000 INSERT block
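To put the numbers from Listing 16 and Listing 17 side by side, the relative reduction in elapsed time can be computed directly from the two summaries. This is a small arithmetic sketch; reduction_pct is a hypothetical helper name, and the inputs are the values reported by db2batch above.

```shell
# Print the percentage by which $1 (time over SDP) is lower than
# $2 (time over TCP). Both arguments are times in seconds.
reduction_pct() {
    awk -v sdp="$1" -v tcp="$2" 'BEGIN { printf "%.1f\n", (tcp - sdp) / tcp * 100 }'
}

reduction_pct 0.000231 0.000361   # average INSERT time, prints 36.0
reduction_pct 0.924646 1.445863   # total for the 4000-INSERT block, prints 36.0
reduction_pct 0.004281 0.006005   # SELECT of 4000 rows, prints 28.7
```

In this single run, the inserts took roughly 36% less time over SDP and the SELECT roughly 29% less. The CREATE and DROP statements show no meaningful difference.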
By default, the SDP library is configured to use SDP first, and if the SDP connection fails, it attempts to use TCP. Additional rules can be defined in /etc/libsdp.conf. The format of each rule, as used in the listings that follow, is:
use <address-family> <role> <program name> <address|*>:<port range|*>
The default values are defined in Listing 18.
Listing 18. Default values
use both server * *:*
use both client * *:*
- The both keyword means try SDP; if it fails, try TCP.
- server rules apply to applications that listen on a socket.
- client rules are for applications that initiate a connection.
- <program name> specifies the process name this rule applies to.
- For a server rule, <address:port> matches the local IP and port the server is listening on.
- For a client rule, <address:port> matches the remote IP and port the client is attempting to connect to.
The first rule that matches your application will be applied, and the remaining rules will be ignored.
For example, these rules will switch Java processes to SDP when connecting to a DB2 server and use TCP for connections to an LDAP server, as shown in Listing 19.
Listing 19. SDP connections for DB2. TCP connections for LDAP
use sdp client java 192.168.100.10:50000
use tcp client java *:389
These rules configure a DB2 server to accept SDP connections on one interface only, and TCP connections on another, as shown in Listing 20.
Listing 20. Server accepting SDP and TCP connections
use sdp server db2* 192.168.100.10:50000
use tcp server db2* 192.168.200.10:50000
The rule shown in Listing 21 configures the SDP library to accept both SDP and TCP connections, SDP being the preferred one.
Listing 21. Rule to configure SDP library
use both server * 192.168.100.10:50000
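The first-match behavior described above can be illustrated with a short shell function that walks a rule list in order. This is only a sketch of the matching logic, not the library's implementation. It assumes that program names are matched as shell-style patterns (as db2* in Listing 20 suggests), that a port range is written as low-high, and that addresses are either exact or *. The function name first_matching_rule is hypothetical.

```shell
# Print the address family (sdp, tcp, or both) of the first rule on stdin
# that matches the given role, program name, address, and port.
# Prints nothing and returns 1 when no rule matches.
first_matching_rule() {
    want_role=$1 prog=$2 addr=$3 port=$4
    while read -r kw family role pattern endpoint; do
        [ "$kw" = "use" ] || continue        # skip comments and other lines
        [ "$role" = "$want_role" ] || continue
        case $prog in
            $pattern) ;;                     # program pattern matched
            *) continue ;;
        esac
        rule_addr=${endpoint%:*}             # text before the last colon
        rule_port=${endpoint##*:}            # text after the last colon
        [ "$rule_addr" = "*" ] || [ "$rule_addr" = "$addr" ] || continue
        case $rule_port in
            "*") ;;                          # any port
            *-*) [ "$port" -ge "${rule_port%-*}" ] &&
                 [ "$port" -le "${rule_port#*-}" ] || continue ;;
            *)   [ "$port" -eq "$rule_port" ] || continue ;;
        esac
        echo "$family"
        return 0                             # remaining rules are ignored
    done
    return 1
}

first_matching_rule client java 192.168.100.10 50000 <<'EOF'
use sdp client java 192.168.100.10:50000
use tcp client java *:389
use both client * *:*
EOF
# prints: sdp
```

With the same rules, a Java client connecting to port 389 on any host yields tcp, and any other client program falls through to the final both rule.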
The sdp_zcopy_thresh and recv_poll parameters are important for tuning SDP's performance and must be set according to each specific workload and traffic pattern.
SDP can transfer data in two modes: zero-copy, or Zcopy (the kernel is bypassed, and the user's buffer is transmitted directly from user memory), and buffer-copy, or Bcopy (the user buffer is copied into kernel space first).
Zcopy is more efficient for larger messages, and Bcopy for messages that are a few KB in size.
The sdp_zcopy_thresh parameter specifies which method is used. By default, messages smaller than 64 KB are transferred with Bcopy, and those above 64 KB with Zcopy, as shown in Listing 22.
Listing 22. sdp_zcopy_thresh parameter
bash# cat /sys/module/ib_sdp/parameters/sdp_zcopy_thresh
65536
To modify this threshold, write a different value to the parameter file in the sysfs file system, as shown in Listing 23.
Listing 23. sys file system
bash# echo 32768 > /sys/module/ib_sdp/parameters/sdp_zcopy_thresh
Use a value of zero to disable Zcopy completely, forcing Bcopy to be used for all message sizes.
You can also set this parameter in modprobe.conf, as shown in Listing 24.
Listing 24. modprobe.conf
options ib_sdp sdp_zcopy_thresh=32768
Apart from zero, which disables Zcopy, the minimum value of the threshold is equal to the memory page size. Use getconf PAGE_SIZE to display the page size of your system.
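The constraints on sdp_zcopy_thresh can be checked before a value is written to sysfs. The sketch below is illustrative only: both function names are hypothetical, and copy_mode assumes that a message exactly equal to the threshold is sent with Bcopy, which the text above does not specify.

```shell
# Succeed if $1 is an acceptable sdp_zcopy_thresh value on this system:
# either 0 (Zcopy disabled) or at least the memory page size.
valid_zcopy_thresh() {
    page_size=$(getconf PAGE_SIZE)
    [ "$1" -eq 0 ] || [ "$1" -ge "$page_size" ]
}

# Print the copy mode a message of $1 bytes would use with threshold $2.
# Assumption: sizes above the threshold use Zcopy, everything else Bcopy.
copy_mode() {
    if [ "$2" -ne 0 ] && [ "$1" -gt "$2" ]; then
        echo zcopy
    else
        echo bcopy
    fi
}
```

For example, valid_zcopy_thresh 32768 && echo 32768 > /sys/module/ib_sdp/parameters/sdp_zcopy_thresh writes the new value only when it passes the check, and copy_mode 131072 65536 prints zcopy.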
The recv_poll parameter specifies the time a receiver polls for incoming data. The default value is 700 microseconds, as shown in Listing 25.
Listing 25. Receive poll timer
bash# cat /sys/module/ib_sdp/parameters/recv_poll
700
The poll can be disabled by setting the sysfs parameter recv_poll to zero, as shown in Listing 26. This is less CPU intensive than a non-zero value. However, message latencies are higher with recv_poll=0.
Listing 26. Using recv_poll=0
bash# echo 0 > /sys/module/ib_sdp/parameters/recv_poll
Add this setting to modprobe.conf to make it persistent across reboots, as shown in Listing 27.
Listing 27. modprobe.conf
options ib_sdp recv_poll=0
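After experimenting with values through sysfs, the settings that worked best still have to be copied into modprobe.conf by hand. A possible shortcut, sketched here under the assumption that both parameters are readable in the directory used in the listings above, is to generate the options line from the running values. The function name sdp_options_line is hypothetical.

```shell
# Print a modprobe "options" line that captures the current values of
# sdp_zcopy_thresh and recv_poll found in directory $1
# (default: /sys/module/ib_sdp/parameters).
sdp_options_line() {
    dir=${1:-/sys/module/ib_sdp/parameters}
    thresh=$(cat "$dir/sdp_zcopy_thresh") || return 1
    poll=$(cat "$dir/recv_poll") || return 1
    echo "options ib_sdp sdp_zcopy_thresh=$thresh recv_poll=$poll"
}
```

Running sdp_options_line with no argument on a tuned system prints a single line that combines both parameters and can be reviewed before it is added to modprobe.conf.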
- Use sdpnetstat to list the SDP connections, as shown in Listing 28. It is similar to netstat, but netstat does not display the SDP connections.
Listing 28. Using sdpnetstat
bash# sdpnetstat -Sn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
sdp        0      0 192.168.100.10:4432     192.168.100.11:7        ESTABLISHED
sdp        0      0 192.168.100.11:7        192.168.100.10:4432     ESTABLISHED
- SDP has two components: the kernel module and the user library. Both of them can be set in debug mode.
- User-space debug mode is controlled by libsdp.conf, as shown in Listing 29.
Listing 29. The levels of debug are described in libsdp.conf
log min-level 9 destination file libsdp.log
- Kernel-mode debugging can be enabled in sysfs, as shown in Listing 30.
Listing 30. Kernel-mode debug
bash# echo 1 > /sys/module/ib_sdp/debug_level
Kernel-mode debugging can also be enabled in modprobe.conf, as shown in Listing 31.
Listing 31. Kernel-mode debug in modprobe.conf
options ib_sdp sdp_debug_level=1
This article showed how the Remote Direct Memory Access (RDMA) protocol can be more efficient than TCP, and how SDP provides a cost-effective method for enabling RDMA capabilities in DB2 client-server environments without recoding or recompiling existing applications. It also discussed how to install Mellanox drivers and tools, how to enable SDP in kernel and user space, and how to tune and debug SDP.
Learn
- Learn more about Mellanox and RDMA.
- Stay current with the latest developments from OpenFabrics Alliance.
- Explore uDAPL at DAT Collaborative and MPI at Ohio State University.
- Learn more about DB2 best practices.
Get products and technologies
- Download Mellanox OpenFabrics Enterprise Distribution for Linux
(MLNX_OFED).