Enable RDMA capabilities for DB2 with Socket Direct Protocol

LAN networking devices, such as host adapters and Ethernet switches, have evolved over the last few years, making it possible to achieve throughput as high as 40 Gbps and network latency below one microsecond. Computing power and LAN devices are no longer the bottleneck. Instead, protocols like TCP have become the real bottleneck in this new world of high-performance hardware. To overcome the limitations of TCP, new protocols like Remote Direct Memory Access (RDMA) have been developed. This article explains a cost-effective method for enabling RDMA capabilities in DB2 client-server environments without recoding and recompiling existing applications.

Sorin Iszlai (siszlai@ca.ibm.com), Systems Engineer, IBM

Sorin Iszlai is a systems and network engineer with over 17 years of experience in the IT industry. Sorin is currently working at the IBM Software Lab, where he is researching new technology in the field of hardware acceleration and appliance optimization.



19 July 2012


Introduction

RDMA is a mechanism that enables computers to access memory locations on other computers in a way that bypasses the operating system, that is, the kernel and the TCP stack. Compared with a traditional hardware-software architecture based on TCP, RDMA offers several advantages. Kernel bypass means shorter paths between two applications and reduced CPU utilization. The protocol overhead specific to TCP is eliminated by transferring the data directly between the network adapter and the application memory (user space) without the need to copy and buffer data into the kernel space.

In order to take advantage of all the benefits offered by RDMA, applications need to be written using RDMA semantics or upper layer protocols like User Direct Access Programming Library (uDAPL) or Message Passing Interface (MPI). However, because rewriting a TCP application for RDMA can be very expensive, an alternative was developed for such cases. This method is called Sockets Direct Protocol (SDP) and does not require any recoding of applications.

SDP is a wire protocol used between RDMA-capable adapters and sockets. For that reason, SDP is transparent to applications, and the standard streams socket implementation need not be replaced with another API. DB2 applications and DB2 servers can run unmodified over SDP or TCP. The user can simply select the protocol to be used prior to executing the application by pre-loading the SDP shared library. All settings pertaining to TCP, such as host names, IP addresses, and ports require no modifications either.

For example, a Java application that uses TCP to connect to a database server can also work over SDP using the same JDBC URL. The SDP library, once pre-loaded, will decide which protocol must be enabled based on a set of rules defined in /etc/libsdp.conf, and the protocol that the server accepts. The default rules specify SDP as the first option, and in case the connection fails, the SDP library will fall back to TCP.

An application can use SDP only, TCP only, or both at the same time. For instance, an application can be configured to use SDP for the DB2 database connections and TCP for LDAP connections. The database and the LDAP servers can be running on different physical machines, or on the same machine but listening on different interfaces. The different scenarios and how rules apply will be discussed later in this article.

Hardware and software infrastructure

RDMA requires special hardware and software infrastructure. This article will describe SDP on the Linux x86 platform.

  • Host adapters

    RDMA is supported on two types of host adapters: InfiniBand adapters and RoCE adapters. The former require InfiniBand switches, and the latter require Ethernet switches.

    Mellanox RoCE adapters are used for this example. All instructions provided in this article apply to InfiniBand adapters with little or no modification. RoCE (RDMA over Converged Ethernet) is an implementation of the InfiniBand protocol over Ethernet. Converged Ethernet networks allow different types of protocols to share the same medium. Regular LAN Ethernet frames encapsulating IP packets, InfiniBand over Ethernet, and Fibre Channel over Ethernet can all coexist on a single Ethernet wire.

  • Device drivers

    There are several options regarding the device drivers and tools needed to operate a Mellanox adapter. All available device drivers and tools are based on the OFED stack maintained by the OpenFabrics Alliance. See the Resources section for a link to the OFED website.

  • Operating system

    Red Hat Enterprise Linux 6.2, kernel version 2.6.32-220.4.1.el6.x86_64.

  • Ethernet switches

    10Gbps RackSwitch G8264, Operating System version 6.8.1.0.


Installing Mellanox drivers and tools

  1. The Mellanox software is packaged as an ISO image. Download the appropriate image for your operating system from the link provided in the Resources section. For Red Hat 6.2, the image is called MLNX_OFED_LINUX-1.5.3-3.0.0-rhel6.2-x86_64.iso.

    Unless otherwise specified, all of the following commands are executed as root.

    Mount the ISO image on each computer and start the installer, as shown in Listing 1.

    Listing 1. ISO image and installer
    bash# mount -o loop MLNX_OFED_LINUX-1.5.3-3.0.0-rhel6.2-x86_64.iso /mnt/iso
                            
    bash# cd /mnt/iso
                            
    bash# ./mlnxofedinstall
    This program will install the MLNX_OFED_LINUX package on your machine.
    Note that all other Mellanox, OEM, OFED, or Distribution IB packages will be removed.

    The installer installs all drivers and tools and attempts to upgrade the firmware on each Mellanox card whose firmware is at a lower level than the one included in the ISO image.

  2. Reboot the computer in order to load all kernel modules.
  3. SDP requires the Ethernet interfaces to be configured with an IP address. For InfiniBand interfaces, IPoIB must be enabled, which means configuring IP addresses on the ib* interfaces and setting IPOIB_LOAD=yes in /etc/infiniband/openib.conf.

    First, identify the new devices. The ibv_devinfo command lists all Mellanox network adapters and their current settings, including the link layer type (Ethernet in this case), firmware version, and link state, as shown in Listing 2.

    Listing 2. Mellanox network adapters
    bash# ibv_devinfo
    hca_id: mlx4_0
            transport:                      InfiniBand (0)
            fw_ver:                         2.10.300
            node_guid:                      0002:c903:0005:6aa8
            sys_image_guid:                 0002:c903:0005:6aab
            vendor_id:                      0x02c9
            vendor_part_id:                 4099
            hw_ver:                         0x0
            board_id:                       IBM1020110023
            phys_port_cnt:                  2
                    port:   1
                            state:                  PORT_ACTIVE (4)
                            max_mtu:                2048 (4)
                            active_mtu:             1024 (3)
                            sm_lid:                 0
                            port_lid:               0
                            port_lmc:               0x00
                            link_layer:             Ethernet
                            
                    port:   2
                            state:                  PORT_DOWN (1)
                            max_mtu:                2048 (4)
                            active_mtu:             1024 (3)
                            sm_lid:                 0
                            port_lid:               0
                            port_lmc:               0x00
                            link_layer:             Ethernet
  4. To identify which port is associated with an eth* interface, execute the following command and assign IP addresses to the appropriate interface, as shown in Listing 3.
    Listing 3. Identify which port
    bash# ibdev2netdev 
    mlx4_0 port 1 ==> eth8 (Up)
    mlx4_0 port 2 ==> eth9 (Down)
                            
    bash# ifconfig eth8 10.7.7.1 netmask 255.255.255.0
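
The port-to-interface mapping above can also be filtered in a script. The following is a minimal sketch that prints only the interfaces whose link is up. The function name up_interfaces is hypothetical (it is not part of OFED), and the parsing assumes the "device port N ==> ethX (State)" layout shown in Listing 3.

```shell
# Print the network interface names whose link state is "Up".
# Reads ibdev2netdev-style lines on stdin, for example:
#   mlx4_0 port 1 ==> eth8 (Up)
# up_interfaces is a hypothetical helper name, not part of OFED.
up_interfaces() {
    awk '$4 == "==>" && $6 == "(Up)" { print $5 }'
}

# Demo on the sample output from Listing 3. On a real host, use:
#   ibdev2netdev | up_interfaces
printf '%s\n' \
    'mlx4_0 port 1 ==> eth8 (Up)' \
    'mlx4_0 port 2 ==> eth9 (Down)' | up_interfaces
```

The interfaces it prints are the candidates for the IP address assignment in step 4.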

Enabling SDP in kernel and user space

SDP is not enabled by default. In addition to the libsdp.so user library mentioned previously, part of SDP is implemented as a kernel module which needs to be loaded.

  1. To have the SDP kernel module ib_sdp loaded at startup, edit /etc/infiniband/openib.conf and set the SDP_LOAD parameter to yes, as shown in Listing 4.
    Listing 4. Parameter set to yes
    SDP_LOAD=yes
  2. Reboot the system or load the module manually. Then verify the kernel module is loaded, as shown in Listing 5.
    Listing 5. Load the module
    bash# modprobe ib_sdp
                            
    bash# lsmod | grep ib_sdp
    ib_sdp 130827 0
    rdma_cm 35175 2 ib_sdp,rdma_ucm
    ipv6 322029 89 ib_sdp,ib_addr,ib_ipoib
    ib_core 69947 14
  3. Configure applications to load the dynamic library /usr/lib64/libsdp.so. There are two methods for loading the SDP library. You can preload the library for each process you create, or preload the library for the whole system. These methods apply to any application that uses glibc socket calls, including Java applications that use JDBC to connect to database servers.
    1. Preload the library for each process you create, as shown in Listing 6.
      Listing 6. Preload library
      bash# LD_PRELOAD=/usr/lib64/libsdp.so myapplication
    2. Preload the library for the whole system. Every process in the system will pre-load this library. Add the following line in /etc/ld.so.preload as shown in Listing 7.
      Listing 7. Preload SDP with /etc/ld.so.preload
      /usr/lib64/libsdp.so
    3. Check whether an application will load the SDP library at run time, as shown in Listing 8.
      Listing 8. SDP library at runtime
      bash# ldd my_app | grep sdp
      /usr/lib64/libsdp.so (0x00007f217170d000)
  4. Before starting the DB2 server and clients, you can use echo server and telnet to test that SDP is functioning properly. Enable the echo server in /etc/xinetd.d/echo-stream by setting disable = no.

    1. Restart the xinetd daemon as shown in Listing 9.
      Listing 9. xinetd daemon
      bash# service xinetd restart
      Stopping xinetd:                                           [  OK  ]
      Starting xinetd:                                           [  OK  ]
    2. Verify the xinetd process is linked to the SDP library, as shown in Listing 10.
      Listing 10. xinetd process
      bash# lsof -p `pidof xinetd` | grep sdp
      xinetd  10090 root  mem  REG    8,5    69128 9339172  /usr/lib64/libsdp.so.1.0.0
    3. You may now connect to the echo server on port 7 from another computer that is configured with SDP as shown in Listing 11.
      Listing 11. Echo server connect
       bash# LD_PRELOAD=/usr/lib64/libsdp.so telnet 10.7.7.1 7
       Trying 10.7.7.1...
       Connected to 10.7.7.1.
       Escape character is '^]'.
       test sdp
       test sdp
    4. At this point the telnet client is connected to the echo server and any line you type into telnet will be echoed back by the echo server. The active SDP connections can be listed with the sdpnetstat command which is similar to the standard Linux netstat, as shown in Listing 12.
      Listing 12. Active SDP connections command
       bash# sdpnetstat -Sn
       Active Internet connections (w/o servers)
       Proto Recv-Q Send-Q Local Address           Foreign Address        State      
       sdp        0      0 10.7.7.2:41771          10.7.7.1:7             ESTABLISHED

      Note that the SDP connections cannot be displayed by the standard netstat command.
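
Before moving on to DB2, the configuration steps in this section can be checked with a short script. The sketch below only reads files and changes nothing. The helper names (sdp_load_enabled, sdp_preloaded, sdp_preflight) are hypothetical. The default paths are the ones used in this article, and they can be overridden with arguments.

```shell
# Hypothetical pre-flight helpers; the file locations default to the
# paths used in this article and can be overridden for testing.

# Succeeds if SDP_LOAD=yes is set (uncommented) in the given openib.conf.
sdp_load_enabled() {
    conf=${1:-/etc/infiniband/openib.conf}
    grep -Eq '^[[:space:]]*SDP_LOAD=yes[[:space:]]*$' "$conf" 2>/dev/null
}

# Succeeds if the given preload file lists a libsdp library.
sdp_preloaded() {
    preload=${1:-/etc/ld.so.preload}
    grep -Eq '^[^#]*libsdp\.so' "$preload" 2>/dev/null
}

# Print a short report without changing anything on the system.
sdp_preflight() {
    if sdp_load_enabled "$1"; then echo "SDP_LOAD: enabled"
    else echo "SDP_LOAD: not enabled"; fi
    if sdp_preloaded "$2"; then echo "preload: libsdp listed"
    else echo "preload: libsdp not listed"; fi
}

sdp_preflight
```

A "not listed" result for the preload check is expected when the library is loaded per process with LD_PRELOAD instead of system-wide.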


DB2 client-server over SDP

The steps required to enable SDP for DB2 are similar to those in the echo server example. Make sure the SDP library is listed in /etc/ld.so.preload, and restart the DB2 instance. Setting or exporting the LD_PRELOAD environment variable is not sufficient for the DB2 server, because DB2 forks many processes, and some of them run as root or setuid root. The variable is therefore not propagated to all child processes.

  1. Log in as the instance owner and restart the instance, as shown in Listing 13.
    Listing 13. Restart the DB2 instance
    db2inst1$ db2stop 
    db2inst1$ db2start
  2. To test that DB2 is using SDP, use db2batch, a benchmarking command that is included with DB2. On the DB2 client machine, log in as the client instance owner and create a file that contains the SQL statements to be used in the benchmark. A sample file update.sql is provided in Listing 14.
    Listing 14. Sample update.sql file
     --#SET PERF_DETAIL 0
     create table demo (c1 bigint, c2 double, c3 varchar(8));
                    
     --#BGBLK 4000
     insert into demo values (-9223372036854775808, -0.000000000000005, 'demo');
     --#EOBLK
                    
     --#SET ROWS_OUT 0
     select * from demo;
                    
     drop table demo;

    There are four statements in this example. The first one creates a table that you will drop at the end of the test in statement four. The second statement is INSERT, which will be executed 4000 times. Statement three will select all rows from the table (4000 rows) but won't display them to stdout.
  3. To start the test, run the command as the client instance owner on the client machine, as shown in Listing 15.
    Listing 15. Client instance
     db2inst1$ LD_PRELOAD=/usr/lib64/libsdp.so db2batch -d sample -f update.sql \
                    -a db2inst1/password -r result.txt,summary.txt

    The db2batch command generates two results files. result.txt contains detailed metrics for each individual statement, and summary.txt contains totals and average time for each block.
    Even though this is a very simple test, you can see the effect of SDP on performance. Inserts average 230 microseconds over SDP compared to 360 microseconds over TCP. A SELECT statement that returns 4000 records takes 4 milliseconds over SDP and 6 milliseconds over TCP.
    These tests were performed using the default SDP settings for sdp_zcopy_thresh (64 KB) and recv_poll (700). See the Performance tuning section for details on how to tune these parameters.
  4. The summary for db2batch over SDP is shown in Listing 16.
    Listing 16. Summary for db2batch over SDP
    Type      Number      Repetitions Total Time (s) Min Time (s)   Max Time (s)
    --------- ----------- ----------- -------------- -------------- --------------
    Statement           1           1       0.115556       0.115556       0.115556
    Block               1        4000       0.924646       0.000200       0.001328
    Statement           3           1       0.004281       0.004281       0.004281
    Statement           4           1       0.133243       0.133243       0.133243
                    
    Type      Number      Repetitions Arithmetic Mean Geometric Mean
    --------- ----------- ----------- --------------- --------------
    Statement           1           1        0.115556       0.115556
    Block               1        4000        0.000231       0.000229
    Statement           3           1        0.004281       0.004281
    Statement           4           1        0.133243       0.133243
  5. The summary for db2batch over TCP is shown in Listing 17.
    Listing 17. Summary for db2batch over TCP
    Type      Number      Repetitions Total Time (s) Min Time (s)   Max Time (s)   
    --------- ----------- ----------- -------------- -------------- --------------
    Statement           1           1       0.115521       0.115521       0.115521
    Block               1        4000       1.445863       0.000326       0.001034
    Statement           3           1       0.006005       0.006005       0.006005
    Statement           4           1       0.137436       0.137436       0.137436
                    
    Type      Number      Repetitions Arithmetic Mean Geometric Mean
    --------- ----------- ----------- --------------- --------------
    Statement           1           1        0.115521       0.115521
    Block               1        4000        0.000361       0.000360
    Statement           3           1        0.006005       0.006005
    Statement           4           1        0.137436       0.137436
  6. The comparison between SDP and TCP is graphically shown in Figure 1 and Figure 2.
    Figure 1. Average time per INSERT
    Average time for INSERT statements over SDP vs. TCP
    Figure 2. Total time for a 4000 INSERT block
    Test duration for 4000 INSERT statements over SDP vs. TCP
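
The ratios behind these comparisons can be recomputed from the figures in Listings 16 and 17. The sketch below uses a hypothetical helper named speedup that divides the TCP timing by the SDP timing. It only repeats the arithmetic on the published numbers and is not a new measurement.

```shell
# Ratio of two timings, printed with two decimal places.
# speedup is a hypothetical helper; both arguments are in seconds.
speedup() {
    awk -v tcp="$1" -v sdp="$2" 'BEGIN { printf "%.2f\n", tcp / sdp }'
}

# Arithmetic mean per INSERT (Block 1), TCP then SDP.
speedup 0.000361 0.000231
# Total time for the 4000-INSERT block, TCP then SDP.
speedup 1.445863 0.924646
# SELECT of 4000 rows (Statement 3), TCP then SDP.
speedup 0.006005 0.004281
```

For this test, both INSERT ratios come out at about 1.56 and the SELECT ratio at about 1.40.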

Advanced SDP configuration

By default, the SDP library is configured to use SDP first, and if the SDP connection fails then it will attempt to use TCP. Additional rules can be defined in /etc/libsdp.conf.
The format of each rule is: use <address-family> <role> <program name> <address|*>:<port range|*>

The default values are defined in Listing 18.

Listing 18. Default values
use both server * *:* 
use both client * *:*
  • The both keyword means try SDP; if it fails try TCP.
  • server rules apply to applications that listen on a socket.
  • client rules are for applications that initiate a connection.
  • <program name> specifies the process name this rule applies to.
  • For a server rule, <address:port> matches the local IP address and port the server is listening on.
  • For a client rule, <address:port> matches the remote IP address and port the client is attempting to connect to.

The first rule that matches your application will be applied, and the remaining rules will be ignored.

For example, the rules shown in Listing 19 switch Java processes to SDP when connecting to a DB2 server and use TCP for connections to an LDAP server.

Listing 19. SDP connections for DB2. TCP connections for LDAP
use sdp client java 192.168.100.10:50000
use tcp client java *:389

These rules configure a DB2 server to accept SDP connections on one interface only, and TCP connections on another, as shown in Listing 20.

Listing 20. Server accepting SDP and TCP connections
use sdp server db2* 192.168.100.10:50000
use tcp server db2* 192.168.200.10:50000

The rule shown in Listing 21 configures the SDP library to accept both SDP and TCP connections, SDP being the preferred one.

Listing 21. Rule to configure SDP library
 use both server * 192.168.100.10:50000
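
To make the first-match behavior concrete, the following sketch evaluates a rule file the way the text above describes. It is a simplified model and not the libsdp implementation. The helper name match_rule is hypothetical, addresses must be "*" or an exact IP address, ports must be "*" or a single port, and program names are treated as shell patterns such as db2*.

```shell
# Print the address family ("sdp", "tcp" or "both") chosen by the first
# matching rule, or return 1 if no rule matches.
# match_rule is a hypothetical helper, not part of libsdp.
# Usage: match_rule <rules file> <role> <program> <address> <port>
match_rule() {
    rules_file=$1 want_role=$2 want_prog=$3 want_addr=$4 want_port=$5
    while read -r kw family role prog target; do
        [ "$kw" = "use" ] || continue        # skip blanks, comments, log lines
        [ "$role" = "$want_role" ] || continue
        case $want_prog in $prog) ;; *) continue ;; esac
        addr=${target%:*}
        port=${target##*:}
        [ "$addr" = "*" ] || [ "$addr" = "$want_addr" ] || continue
        [ "$port" = "*" ] || [ "$port" = "$want_port" ] || continue
        echo "$family"
        return 0
    done < "$rules_file"
    return 1
}

# Demo: the rules from Listing 19 followed by the default client rule.
rules=$(mktemp)
cat > "$rules" <<'EOF'
use sdp client java 192.168.100.10:50000
use tcp client java *:389
use both client * *:*
EOF
match_rule "$rules" client java 192.168.100.10 50000
match_rule "$rules" client java 10.0.0.5 389
match_rule "$rules" client telnet 10.7.7.1 7
rm -f "$rules"
```

Because evaluation stops at the first match, the specific Java rules must appear before the catch-all rule. If the catch-all rule came first, every connection would select both.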

Performance tuning

The sdp_zcopy_thresh and recv_poll parameters are important for tuning SDP performance, and must be set according to each specific workload and traffic pattern.

Zero copy threshold

SDP can transmit data in two modes: zero-copy, or Zcopy (the kernel is bypassed, and the user's buffer is transmitted directly from user memory), and buffer-copy, or Bcopy (the user buffer is copied into kernel space first).

Zcopy is more efficient for larger messages, and Bcopy for messages that are a few kilobytes in size.

The sdp_zcopy_thresh parameter determines which method is used. By default, messages smaller than 64 KB are transferred with Bcopy, and those above 64 KB with Zcopy, as shown in Listing 22.

Listing 22. sdp_zcopy_thresh parameter
bash# cat /sys/module/ib_sdp/parameters/sdp_zcopy_thresh 
65536

To modify this threshold, write a different value to the parameter file in the sysfs file system, as shown in Listing 23.

Listing 23. sys file system
bash# echo 32768 > /sys/module/ib_sdp/parameters/sdp_zcopy_thresh

Use a value of zero to disable Zcopy completely, forcing Bcopy to be used for all message sizes.

You can also set this parameter in modprobe.conf, as shown in Listing 24.

Listing 24. modprobe.conf
 options ib_sdp sdp_zcopy_thresh=32768

The minimum value of the threshold is equal to the memory page size. Use getconf PAGE_SIZE to display the page size of your system.
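
A candidate threshold can be checked against these two constraints before it is written to sysfs. The sketch below uses a hypothetical helper named valid_zcopy_thresh. It accepts zero (Zcopy disabled) or any value of at least one memory page, and it takes the page size from getconf PAGE_SIZE unless a second argument is given.

```shell
# Succeeds if the proposed sdp_zcopy_thresh value is usable: either 0
# (Zcopy disabled) or at least one memory page. valid_zcopy_thresh is a
# hypothetical helper; the page size defaults to getconf PAGE_SIZE.
valid_zcopy_thresh() {
    value=$1
    page=${2:-$(getconf PAGE_SIZE)}
    case $value in ''|*[!0-9]*) return 1 ;; esac   # reject non-numeric input
    [ "$value" -eq 0 ] || [ "$value" -ge "$page" ]
}

# Report on a few candidate values using this system's page size.
for v in 0 1024 32768 65536; do
    if valid_zcopy_thresh "$v"; then echo "$v: ok"; else echo "$v: too small"; fi
done
```

The output for small values such as 1024 depends on the page size of the system where the loop runs.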

Receive poll timer

The recv_poll parameter specifies the time a receiver polls for incoming data. The default value is 700 microseconds, as shown in Listing 25.

Listing 25. Receive poll timer
 bash# cat /sys/module/ib_sdp/parameters/recv_poll
 700

Polling can be disabled by setting the sysfs parameter recv_poll to zero, as shown in Listing 26. This uses less CPU than a non-zero value, but message latencies are higher with recv_poll=0.

Listing 26. Using recv_poll=0
 bash# echo 0 > /sys/module/ib_sdp/parameters/recv_poll

To make this setting persist across reboots, add it to modprobe.conf, as shown in Listing 27.

Listing 27. modprobe.conf
 options ib_sdp recv_poll=0
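
Once suitable values have been found by experimenting through sysfs, they can be captured as a single options line. The sketch below uses a hypothetical helper named sdp_options_line. It reads the two tuning parameters discussed in this section from a directory, which defaults to the sysfs path used above, and prints a line in the same form as Listings 24 and 27.

```shell
# Print an "options ib_sdp ..." line from the current values of the two
# tuning parameters. sdp_options_line is a hypothetical helper; the
# directory argument defaults to the sysfs path used in this article.
sdp_options_line() {
    dir=${1:-/sys/module/ib_sdp/parameters}
    line="options ib_sdp"
    for p in sdp_zcopy_thresh recv_poll; do
        [ -r "$dir/$p" ] || { echo "cannot read $dir/$p" >&2; return 1; }
        line="$line $p=$(cat "$dir/$p")"
    done
    echo "$line"
}

# Prints the line for the running system, or a hint if the module
# parameters are not available.
sdp_options_line || echo "ib_sdp does not appear to be loaded" >&2
```

The helper only prints the line. Copying it into modprobe.conf is left as a manual step so that the change can be reviewed first.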

SDP debugging

  • Use sdpnetstat to list the SDP connections, as shown in Listing 28. It works like netstat, which does not display SDP connections.
    Listing 28. Using sdpnetstat
    bash# sdpnetstat -Sn
    Active Internet connections (w/o servers)
    Proto Recv-Q Send-Q Local Address        Foreign Address      State
    sdp     0       0   192.168.100.10:4432  192.168.100.11:7     ESTABLISHED
    sdp     0       0   192.168.100.11:7     192.168.100.10:4432  ESTABLISHED
  • SDP has two components: the kernel module and the user library. Both can be put into debug mode.
    • User-space debug mode is controlled by libsdp.conf, as shown in Listing 29.
      Listing 29. The levels of debug are described in libsdp.conf
      log min-level 9 destination file libsdp.log
    • Kernel debug mode can be enabled through sysfs, as shown in Listing 30.
      Listing 30. Kernel-mode debug
      bash# echo 1 > /sys/module/ib_sdp/debug_level

      Kernel-mode debug mode can also be enabled in modprobe.conf, as shown in Listing 31.

      Listing 31. Kernel-mode debug in modprobe.conf
      options ib_sdp sdp_debug_level=1
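
When debugging, it is often enough to confirm that a particular connection is really using SDP. The sketch below counts established SDP connections to one remote endpoint. The helper name count_sdp is hypothetical, and the parsing assumes the column layout shown in Listing 28.

```shell
# Count ESTABLISHED SDP connections to a given remote address:port.
# Reads `sdpnetstat -Sn` style output on stdin; count_sdp is a
# hypothetical helper name.
count_sdp() {
    awk -v remote="$1" '
        $1 == "sdp" && $5 == remote && $6 == "ESTABLISHED" { n++ }
        END { print n + 0 }'
}

# Demo on the sample output from Listing 28. On a real host, use:
#   sdpnetstat -Sn | count_sdp 192.168.100.11:7
count_sdp 192.168.100.11:7 <<'EOF'
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address        Foreign Address      State
sdp     0       0   192.168.100.10:4432  192.168.100.11:7     ESTABLISHED
sdp     0       0   192.168.100.11:7     192.168.100.10:4432  ESTABLISHED
EOF
```

A count of zero while the application is connected suggests that the connection fell back to TCP, which is a cue to review the rules in /etc/libsdp.conf.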

Conclusion

This article showed how Remote Direct Memory Access (RDMA) can be more efficient than TCP, and described a cost-effective method for enabling RDMA capabilities in DB2 client-server environments without recoding or recompiling existing applications. It also covered how to install the Mellanox drivers and tools, enable SDP in kernel and user space, tune performance, and debug SDP.

Resources

Learn

Get products and technologies
