Reliable Datagram Sockets over InfiniBand and RoCE

Reliable Datagram Sockets (RDS) is a connectionless, record-oriented protocol that provides in-order delivery without duplicates over InfiniBand and RDMA over Converged Ethernet (RoCE). RDS exposes the User Datagram Protocol (UDP) subset of the socket API.

RDS is part of the AF_BYPASS domain, which is used for protocols that bypass the kernel TCP/IP stack.

The AIX® operating system provides two versions of RDS: RDSv2 and RDSv3. RDSv3 is the latest version and includes support for Remote Direct Memory Access (RDMA). RDSv3 on AIX 7.2 and later supports RoCE that is based on the OpenFabrics Enterprise Distribution (OFED).

Creating an RDS socket

To create an RDS socket, invoke the socket() system call by adding the following lines to the application program:
#include <sys/bypass.h>
#include <net/rds_rdma.h>           /* for RDSv3 only */
sock = socket(AF_BYPASS, SOCK_SEQPACKET, BYPASSPROTO_RDS);
If the BYPASSPROTO_RDS protocol is the only reliable datagram protocol that is supported in the AF_BYPASS family, you can also call the socket() system call as follows:
sock = socket(AF_BYPASS, SOCK_SEQPACKET, 0);

System calls

RDS supports the following system calls:
  • bind()
  • close()
  • getsockopt()
  • recvfrom()
  • recvmsg()
  • sendmsg()
  • sendto()
  • setsockopt()
In addition, RDSv3 supports the following system calls:
  • connect()
  • read()
  • recv()
  • send()
  • write()
Note: Although RDS sockets are connectionless, the connect() system call is supported by RDSv3. However, in this case, connect() does not create a socket-level connection entity between two RDS endpoints. It merely associates a default destination endpoint with the socket. For this reason, the listen(), accept(), and shutdown() system calls are not supported for RDS sockets.
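
The effect of connect() that the note describes is analogous to what connect() does on an ordinary UDP socket. The following sketch deliberately uses AF_INET and SOCK_DGRAM over the loopback interface instead of AF_BYPASS, so that it can run on any POSIX system without RDS loaded. It is an illustration of the "default destination" semantics only, not RDS code, and the function name demo_default_destination is hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* UDP stand-in (NOT an RDS socket): shows that connect() on a datagram
 * socket only records a default destination, after which send() can be
 * used without an address. Returns the number of bytes received, or -1. */
int demo_default_destination(void)
{
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    struct timeval tmo = { 2, 0 };   /* do not block forever in recv() */
    char buf[16];
    int n = -1;

    if (rx < 0 || tx < 0)
        goto out;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;               /* let the system pick a free port */

    if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    if (getsockname(rx, (struct sockaddr *)&addr, &len) < 0)
        goto out;
    if (setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tmo, sizeof(tmo)) < 0)
        goto out;

    /* No listen() or accept() on the receiver: connect() on the sender
     * merely stores the destination address in the socket. */
    if (connect(tx, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    /* send() needs no address because the default destination is set. */
    if (send(tx, "ping", 4, 0) != 4)
        goto out;

    n = (int)recv(rx, buf, sizeof(buf), 0);
out:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return n;
}
```

In an RDSv3 application, the socket would instead be created with AF_BYPASS as shown in the preceding section, and send() or write() would then deliver each record to the endpoint that was passed to connect().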

The rdsctrl utility for RDSv2

Use the rdsctrl utility (/usr/sbin/rdsctrl) to change RDS tunables and to view RDS statistics for diagnostics. For RDSv2, the utility can be used after RDS is loaded (bypassctrl load rds). For more information about this utility, run the rdsctrl command with no arguments.

Statistics

To display various RDS statistics, run the # rdsctrl stats command.

To reset the statistics, run the # rdsctrl stats reset command.

Tuning parameters

The following RDS parameters can be tuned after RDS is loaded, but before an RDS application is run:

rds_sendspace
Specifies the high-water mark of the per-flow send buffer. Each socket might have multiple flows. The default value is 524288 bytes (512 KB). The value is set by using the following command: # rdsctrl set rds_sendspace= <value in bytes>.
rds_recvspace
Specifies the per-flow high-water mark of the per-socket receive buffer. For every additional flow to this socket, the receive high-water mark is increased by this value. The default value is 524288 bytes (512 KB). The value is set by using the following command: # rdsctrl set rds_recvspace= <value in bytes>.
Note: For increased RDS streaming performance, the values of the rds_sendspace parameter and the rds_recvspace parameter must be at least four times the size of the largest message that is sent with the RDS sendmsg() system call. RDS sends an ACK for each set of four messages that are received. If the rds_recvspace value is less than four times the message size, the throughput is very low.
rds_mclustsize
Specifies the size of the individual memory cluster, which is also the message fragment size. The default size is 16384 bytes (16 KB). The value must be a multiple of 4096 and is set by using the following command: # rdsctrl set rds_mclustsize= <multiple of 4096, in bytes>.
Attention: The rds_mclustsize value must be the same on all systems (nodes) in the cluster. Changing this value also has performance implications.
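
The sizing rules for these three parameters can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of any AIX header: the function and macro names are hypothetical, and the fragment count assumes that each fragment carries up to rds_mclustsize bytes of payload with protocol headers ignored.

```c
#include <stddef.h>

/* From the text: RDS sends one ACK per set of four received messages. */
#define RDS_EXAMPLE_ACK_WINDOW   4
/* From the text: rds_mclustsize is a multiple of 4096 bytes. */
#define RDS_EXAMPLE_MCLUST_ALIGN 4096

/* Smallest rds_sendspace / rds_recvspace value (in bytes) that the note
 * recommends for the given largest sendmsg() size. */
size_t rds_example_min_space(size_t largest_msg_bytes)
{
    return largest_msg_bytes * RDS_EXAMPLE_ACK_WINDOW;
}

/* Returns 1 if the value is a nonzero multiple of 4096, otherwise 0. */
int rds_example_valid_mclustsize(size_t mclustsize)
{
    return mclustsize != 0 && mclustsize % RDS_EXAMPLE_MCLUST_ALIGN == 0;
}

/* ASSUMPTION: a message is split into ceil(msg_bytes / mclustsize)
 * fragments. Returns 0 if mclustsize is not a valid value. */
size_t rds_example_fragment_count(size_t msg_bytes, size_t mclustsize)
{
    if (!rds_example_valid_mclustsize(mclustsize))
        return 0;
    return (msg_bytes + mclustsize - 1) / mclustsize;
}
```

For example, an application whose largest message is 262144 bytes (256 KB) would need rds_sendspace and rds_recvspace values of at least 1048576 bytes, which is twice the default of 524288 bytes.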

The current values for the preceding parameters can be retrieved by using the # rdsctrl get <parameter> command.

To get the list of all tunables and their values, run the # rdsctrl get command.

The rdsctrl utility for RDSv3

For RDSv3, the rdsctrl command supports the following options:

help [<tunable name>]
The help option displays a descriptive message for the specified RDSv3 tunable. If no tunable is specified, this option lists all the tunables that are supported for RDSv3, along with a description of each tunable.

set [-p] {<tunable name> = <value>}
The set option sets the value of the specified RDSv3 tunable. It verifies that the user has the required privileges, which prevents unauthorized users from changing the RDS tunables. It also validates the range of the new tunable value. The -p flag makes the assignment permanent across reboot operations.

get [<tunable name>]
The get option returns the current value of the specified tunable. If no tunable name is specified, it returns the current values of all the available RDS tunables.

default [-p] [<tunable name>]
The default option resets a tunable to its default value. If a tunable name is specified, only that tunable is reset. If no tunable name is specified, all the tunables are reset to their default values. The -p flag makes the change permanent across reboot operations.

load [ ofed | aixib ]
The load option loads the RDSv3 kernel extension (if it is not already loaded). The ofed argument loads the kernel extension for RDSv3 over OFED verbs in RoCE mode. The aixib argument loads the kernel extension for RDSv3 in InfiniBand mode. The argument is optional. If it is not specified, the load option defaults to the aixib argument, so InfiniBand mode is used unless ofed is specified at the command line.

unload
The unload option unloads the RDSv3 kernel extension.

ras [-p] <minimal | normal | detail | maximal>
The ras option sets the AIX operating system RAS tracing and error checking settings for RDSv3 to the specified level. Internally, this command calls the errctrl and ctctrl AIX operating system commands. The -p flag makes the settings persistent across reboot operations.

ras extract
The ras extract option dumps the contents of the RAS error and non-error trace buffers for RDS to standard output.

info [<flags>]
The info option is an alias for the rds-info command.

ping [<IPv4 address>]
The ping option is an alias for the rds-ping command.

conn <restart | kill> <source IP address> <destination IP address>
The conn option restarts the specified RDS connection (restart suboption) or permanently ends the specified RDS connection (kill suboption). The RDS connection is identified by the IP addresses of the local and remote nodes for the connection. Restarting a connection drops the underlying InfiniBand connection and attempts to establish the connection again. In contrast, ending a connection (kill suboption) drops the underlying InfiniBand connection and deallocates all resources that are associated with the corresponding RDS connection.

trace start <trace file path> <maximum data captured per RDS fragment>
The trace start option initiates a tracing session to capture over-the-wire traffic for the RDSv3 protocol. RDSv3 messages are transmitted in fragments. Each RDS fragment that is transmitted or received is captured as a trace packet in the specified trace file. For each RDS fragment, the payload is captured up to <maximum data captured per RDS fragment> bytes. Only privileged users can trace RDS traffic, and only one tracing session can be active at a time.

trace stop
The trace stop option ends a tracing session that was previously initiated by a trace start command. It closes the trace file that is associated with the tracing session. After this command completes, the trace report option can be used to generate a text report from the trace file.

trace report <trace file path>
The trace report option prints a text report to standard output from a previously captured RDS protocol trace file.

version
The version option prints the RDS protocol version that is currently loaded in the system.

RDSv3 tunables

To see the list of tunables that are supported for RDSv3, run the rdsctrl help command with no arguments.

RDMA API (RDSv3 only)

The programming model for working on RDMA with RDS sockets is based on the client/server model. The RDMA client is the application that initiates an RDMA read or write operation from a specified RDMA server. The RDMA server is the application that processes the RDMA data transfer. An RDMA read operation is a data transfer from the client's address space to the server's address space, whereas an RDMA write operation is a data transfer from the server's address space to the client's address space. In either case, data is transferred directly between user-space memory on both sides, without being copied to kernel-space memory on either side.

An RDMA client application can initiate an RDMA read or write operation by sending an application-level request, along with an RDMA cookie, to an RDMA server application. The application-level request must specify whether the operation is an RDMA read or write operation as well as the address and length of the area of the client's memory to be remotely read or written by the RDMA server.

There are two methods for sending an RDMA request from the RDMA client to the RDMA server.

The first method is to send an RDS_CMSG_RDMA_MAP control message (carrying an rds_get_mr_args structure) along with the application-level RDMA request by using the sendmsg() system call on an RDS socket. The AIX operating system kernel at the client side processes the RDS_CMSG_RDMA_MAP control message by mapping the specified area of local memory (from the client application's address space) for DMA access and by generating an RDMA cookie. Then, the application-level request is sent to the server along with the RDMA cookie.

The second method consists of two steps. The first step is to call the setsockopt() system call with the RDS_GET_MR socket option, passing an rds_get_mr_args structure. This call maps the specified area of local memory for DMA access, and returns an RDMA cookie. The second step is to send an RDS_CMSG_RDMA_DEST control message (carrying the RDMA cookie that is obtained from the first step) along with the application-level RDMA request by using the sendmsg() system call.

The first method, which requires one system call, is preferred over the second method, which requires two system calls.
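
The first method can be sketched as follows. This example only builds the ancillary data that sendmsg() would carry. It is written under loud assumptions: the EX_ constants and the ex_ structures are stand-ins that mirror the Linux RDS header, and the real AIX definitions in <sys/bypass.h> and <net/rds_rdma.h> might use different values or layouts. The helper name ex_attach_rdma_map is hypothetical.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* ASSUMPTION: stand-in values and layouts that mirror the Linux RDS
 * header. On AIX, use the real SOL_RDS, RDS_CMSG_RDMA_MAP,
 * RDS_RDMA_USE_ONCE, and struct rds_get_mr_args definitions instead. */
#define EX_SOL_RDS           276
#define EX_RDS_CMSG_RDMA_MAP 3
#define EX_RDS_RDMA_USE_ONCE 0x0008

struct ex_rds_iovec {
    uint64_t addr;         /* start of the local memory area */
    uint64_t bytes;        /* length of the area in bytes */
};

struct ex_rds_get_mr_args {
    struct ex_rds_iovec vec;
    uint64_t cookie_addr;  /* where the RDMA cookie is to be stored */
    uint64_t flags;
};

/* Attaches one RDMA_MAP control message to msg, using ctl as the
 * (suitably aligned) control buffer. Returns 0 on success, or -1 if
 * the control buffer is too small. */
int ex_attach_rdma_map(struct msghdr *msg, void *ctl, size_t ctl_len,
                       void *area, size_t area_len,
                       uint64_t *cookie, int use_once)
{
    struct ex_rds_get_mr_args args;
    struct cmsghdr *cm;

    if (ctl_len < CMSG_SPACE(sizeof(args)))
        return -1;

    memset(ctl, 0, ctl_len);
    msg->msg_control = ctl;
    msg->msg_controllen = CMSG_SPACE(sizeof(args));

    cm = CMSG_FIRSTHDR(msg);
    cm->cmsg_level = EX_SOL_RDS;
    cm->cmsg_type = EX_RDS_CMSG_RDMA_MAP;
    cm->cmsg_len = CMSG_LEN(sizeof(args));

    memset(&args, 0, sizeof(args));
    args.vec.addr = (uint64_t)(uintptr_t)area;
    args.vec.bytes = area_len;
    args.cookie_addr = (uint64_t)(uintptr_t)cookie;
    args.flags = use_once ? EX_RDS_RDMA_USE_ONCE : 0;

    /* Copy instead of casting CMSG_DATA() to avoid alignment problems. */
    memcpy(CMSG_DATA(cm), &args, sizeof(args));
    return 0;
}
```

The caller would place the application-level request in msg_iov, set the destination in msg_name, and pass the message to sendmsg() on the RDS socket, so that the mapping and the request travel in a single system call.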

When the RDMA server application receives the application-level RDMA read request from the client, it also receives an RDS_CMSG_RDMA_DEST control message (carrying the RDMA cookie from the client). Then, the server initiates the RDMA read operation by sending an application-level reply to the client along with an RDS_CMSG_RDMA_ARGS control message (carrying an rds_rdma_args structure). The AIX operating system kernel at the server side processes the RDS_CMSG_RDMA_ARGS control message by mapping the specified area of local memory (from the server application's address space) for DMA access and by physically starting the RDMA read operation. The RDMA read operation is performed by the server-side InfiniBand adapter, which interacts with the client-side InfiniBand adapter to transfer the data directly from the client application's memory to the server application's memory, without further software intervention. After the RDMA read operation is completed, the server-side adapter sends the application-level reply to the client. The arrival of this reply tells the client application that its RDMA read operation has been completed.

Note: When the client requests an RDMA operation by using an RDS_CMSG_RDMA_MAP control message in which the RDS_RDMA_USE_ONCE flag is set, the area of memory that is mapped for DMA in the client's address space is automatically unmapped when the client receives the application-level reply from the server.

Although this implicit DMA mapping and unmapping mechanism makes it simpler to write RDMA applications, developers must be aware that registering memory for DMA on the AIX operating system is an expensive operation. Thus, if the same area of memory is going to be accessed by using RDMA multiple times, it is more efficient to do the DMA registration only the first time. To do so, a client application needs to use an RDS_CMSG_RDMA_MAP control message without the RDS_RDMA_USE_ONCE flag set when sending the RDMA request to the server. Then, subsequent RDMA transfers to the same area of the client's memory can be initiated by the RDMA server application without the client needing to send another request to the server. When the transfers are finished, the client application must explicitly unmap the DMA-mapped memory by using the setsockopt() system call with the RDS_FREE_MR socket option.

RDS-specific socket options are specified by using SOL_RDS as the level parameter for the setsockopt() or getsockopt() system call.