Reliable Datagram Sockets over InfiniBand and RoCE
Reliable Datagram Sockets (RDS) is a connectionless, record-oriented protocol that provides in-order delivery without duplicates over InfiniBand and RDMA over Converged Ethernet (RoCE). RDS exposes the User Datagram Protocol (UDP) subset of the socket API.
RDS is part of the AF_BYPASS domain, which is used for protocols that bypass the kernel TCP/IP stack.
The AIX® operating system provides two versions of RDS: RDSv2 and RDSv3. RDSv3 is the latest version and includes support for Remote Direct Memory Access (RDMA). RDSv3 on AIX 7.2 and later supports Open Fabrics Enterprise Distribution (OFED) based RoCE.
Creating an RDS socket
#include <sys/bypass.h>
#include <net/rds_rdma.h> /* for RDSv3 only */
sock = socket (AF_BYPASS, SOCK_SEQPACKET,BYPASSPROTO_RDS);
sock = socket (AF_BYPASS, SOCK_SEQPACKET,0);
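The calls above can be wrapped with basic error handling. The following is a minimal sketch, not a definitive implementation: the helper names open_rds_socket and check_rds_available are hypothetical, and the use of the _AIX compiler macro to guard the AIX-only header is an assumption that lets the file also compile on other systems, where the helper simply reports that the address family is unsupported.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifdef _AIX
#include <sys/bypass.h>   /* AF_BYPASS, BYPASSPROTO_RDS */
#endif

/* Hypothetical helper: returns an RDS socket descriptor, or -1 with errno set. */
int open_rds_socket(void)
{
#ifdef _AIX
    return socket(AF_BYPASS, SOCK_SEQPACKET, BYPASSPROTO_RDS);
#else
    /* AF_BYPASS is AIX-specific; elsewhere, report "address family not supported". */
    errno = EAFNOSUPPORT;
    return -1;
#endif
}

/* Hypothetical helper: opens an RDS socket, reports any failure, and closes it.
 * Returns 0 if the socket could be created, -1 otherwise. */
int check_rds_available(void)
{
    int sock = open_rds_socket();

    if (sock < 0) {
        fprintf(stderr, "RDS socket creation failed: %s\n", strerror(errno));
        return -1;
    }
    close(sock);
    return 0;
}
```

A failure here is worth checking first when an RDS application does not start, because the socket call presumably cannot succeed until the RDS kernel extension is loaded.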
System calls
- bind()
- close()
- getsockopt()
- recvfrom()
- recvmsg()
- sendmsg()
- sendto()
- setsockopt()
- connect()
- read()
- recv()
- send()
- write()
The rdsctrl utility for RDSv2
Use the rdsctrl utility (/usr/sbin/rdsctrl) to change the RDS tunables and to display diagnostic statistics for RDS. For RDSv2, the utility can be used after RDS is loaded (bypassctrl load rds). For more information about this utility, run the rdsctrl command with no arguments.
Statistics
To display various RDS statistics, run the following command:
# rdsctrl stats
To reset the statistics, run the following command:
# rdsctrl stats reset
Tuning Parameters
The following RDS parameters can be tuned after RDS is loaded, but before an RDS application is run:
- rds_sendspace
- Specifies the high-water mark of the per-flow send buffer. Each socket might have multiple flows. The default value is 524288 bytes (512 KB). The value is set by using the following command:
# rdsctrl set rds_sendspace= <value in bytes>
- rds_recvspace
- Specifies the per-flow high-water mark of the per-socket receive buffer. For every additional flow to this socket, the receive high-water mark is increased by this value. The default value is 524288 bytes (512 KB). The value is set by using the following command:
# rdsctrl set rds_recvspace= <value in bytes>
Note: For increased RDS streaming performance, the values of the rds_sendspace parameter and the rds_recvspace parameter must be at least four times the size of the largest RDS sendmsg() message. RDS sends an ACK for each set of four messages that are received. If the rds_recvspace value is less than four times the message size, throughput is very low.
- rds_mclustsize
- Specifies the size of the individual memory cluster, which is also the message fragment size. The default size is 16384 bytes (16 KB). The value, always a multiple of 4096, is set by using the following command:
# rdsctrl set rds_mclustsize= <multiple of 4096, in bytes>
Attention: The rds_mclustsize value must be the same on all systems (nodes) in the cluster. Changing this value also has performance implications.
The current values for the preceding parameters can be retrieved by using the following command:
# rdsctrl get <parameter>
To get the list of all tunables and their values, run the following command:
# rdsctrl get
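The sizing rules for these tunables can be checked with a few lines of arithmetic before the values are set. The following sketch uses only the numbers stated above (the 512 KB defaults, the four-messages-per-ACK rule, and the 4096-byte granularity); the helper names are hypothetical and are not part of any RDS API.

```c
#include <stddef.h>

#define RDS_DEFAULT_SPACE   524288u  /* default rds_sendspace and rds_recvspace */
#define RDS_MSGS_PER_ACK    4u       /* RDS acknowledges every four messages */
#define RDS_MCLUST_GRANULE  4096u    /* rds_mclustsize must be a multiple of this */

/* Smallest rds_sendspace/rds_recvspace value, in bytes, that satisfies the
 * "four times the largest sendmsg() size" rule. */
size_t min_rds_space(size_t largest_msg)
{
    return largest_msg * RDS_MSGS_PER_ACK;
}

/* Returns 1 if a configured buffer size is large enough for the given
 * largest message size, 0 otherwise. */
int space_is_sufficient(size_t space, size_t largest_msg)
{
    return space >= min_rds_space(largest_msg);
}

/* Returns 1 if the value is a nonzero multiple of 4096, 0 otherwise. */
int mclustsize_is_valid(size_t mclustsize)
{
    return mclustsize != 0 && mclustsize % RDS_MCLUST_GRANULE == 0;
}
```

With the default value of 524288 bytes, the rule is satisfied only for messages of up to 131072 bytes (128 KB). An application that sends 256 KB messages would need rds_sendspace and rds_recvspace values of at least 1048576 bytes.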
The rdsctrl utility for RDSv3
For RDSv3, the rdsctrl command supports the following options:
Item | Description |
---|---|
help [<tunable name>] | The help option displays a descriptive message of the specified RDSv3 tunable. If no tunable is specified, this option displays the list of all the tunables that are supported for RDSv3, along with the description of each tunable. |
set [-p] {<tunable name> = <value>} | The set option sets the value of the specified RDSv3 tunable. It verifies that the user has the required privileges, which prevents unauthorized users from changing the RDS tunables. It also validates the range of the new tunable value. The -p flag makes the assignment permanent across reboot operations. |
get [<tunable name>] | The get option gets the current value of the queried tunable. When no name field is specified to this command, it returns the current value of all the available RDS tunables. |
default [-p] [<tunable name>] | The default option resets a tunable to its default value. When the name field is specified, only that tunable is reset. If no name field is specified, this command resets all the tunables to their default values. The -p flag makes the change permanent across reboot operations. |
load [ofed | aixib] | The load option loads the RDSv3 kernel extension (if it is not already loaded). By default, the rdsctrl utility loads the InfiniBand device unless the new attribute ( |
unload | The unload option is used to unload the RDSv3 kernel extension. |
ras [-p] <minimal | normal | detail | maximal> | The ras option sets the AIX operating system RAS tracing and error checking settings for RDSv3 to the specified level. Internally, this command calls the errctrl and ctctrl AIX operating system commands. The -p flag makes the settings persistent across reboot operations. |
ras extract | The ras extract option dumps the contents of the RAS error and non-error trace buffers for RDS to standard output. |
info [<flags>] | The info option is an alias for the rds-info command. |
ping [<IPv4 address>] | The ping option is an alias for the rds-ping command. |
conn <restart | kill> <source IP address> <destination IP address> | The conn option restarts the specified RDS connection (restart suboption) or permanently ends the specified RDS connection (kill suboption). The RDS connection to be restarted or ended is specified by giving the IP addresses of the local and remote nodes for the connection. Restarting a connection drops the underlying InfiniBand connection and attempts to establish the connection again. In contrast, ending a connection (kill suboption) drops the underlying InfiniBand connection and deallocates all resources that are associated with the corresponding RDS connection. |
trace start <trace file path> <maximum data captured per RDS fragment> | The trace start option initiates a tracing session to capture over-the-wire traffic for the RDSv3 protocol. The RDSv3 messages are transmitted in fragments. Each RDS fragment that is transmitted or received is captured as a trace packet in the specified trace file. For each RDS fragment, its payload is captured up to <maximum data captured per RDS fragment> bytes. Only privileged users can trace RDS traffic and only one tracing session can be active at a time. |
trace stop | The trace stop option ends a tracing session that was previously initiated by a trace start command. It closes the trace file that is associated with the tracing session. After this command, the trace report command can be used to generate a text report of the trace file. |
trace report <trace file path> | The trace report option prints a text report to standard output, from a previously captured RDS protocol trace file. |
version | The version option prints the RDS protocol version that is currently loaded in the system. |
RDSv3 tunables
To see the list of tunables that are supported for RDSv3, run the command rdsctrl help with no arguments.
RDMA API (RDSv3 only)
The programming model for working on RDMA with RDS sockets is based on the client/server model. The RDMA client is the application that initiates an RDMA read or write operation from a specified RDMA server. The RDMA server is the application that processes the RDMA data transfer. An RDMA read operation is a data transfer from the client's address space to the server's address space, whereas an RDMA write operation is a data transfer from the server's address space to the client's address space. In either case, data is transferred directly between user-space memory on both sides, without being copied to kernel-space memory on either side.
An RDMA client application can initiate an RDMA read or write operation by sending an application-level request, along with an RDMA cookie, to an RDMA server application. The application-level request must specify whether the operation is an RDMA read or write operation as well as the address and length of the area of the client's memory to be remotely read or written by the RDMA server.
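The layout of the application-level request is left to the application. As an illustration only, the following sketch defines one possible request carrying the three items named above (operation type, address, and length) and serializes it in network byte order. The structure, constants, and function names are hypothetical and are not defined by RDS or by AIX.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical application-defined operation codes. */
enum app_rdma_op { APP_RDMA_READ = 1, APP_RDMA_WRITE = 2 };

/* Hypothetical application-level RDMA request. */
struct app_rdma_request {
    uint32_t op;    /* APP_RDMA_READ or APP_RDMA_WRITE */
    uint64_t addr;  /* address of the client memory area */
    uint64_t len;   /* length of the client memory area, in bytes */
};

#define APP_RDMA_REQUEST_WIRE_SIZE 20u  /* 4 + 8 + 8 bytes */

/* Store the low n bytes of v at p, most significant byte first. */
static void put_be(uint8_t *p, uint64_t v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = (uint8_t)(v >> (8 * (n - 1 - i)));
}

/* Read an n-byte big-endian integer from p. */
static uint64_t get_be(const uint8_t *p, size_t n)
{
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Serialize req into out (at least 20 bytes). Returns the bytes written. */
size_t app_rdma_request_encode(const struct app_rdma_request *req, uint8_t *out)
{
    put_be(out, req->op, 4);
    put_be(out + 4, req->addr, 8);
    put_be(out + 12, req->len, 8);
    return APP_RDMA_REQUEST_WIRE_SIZE;
}

/* Parse a request from in. Returns 0 on success, -1 if the buffer is too
 * short or the operation code is unknown. */
int app_rdma_request_decode(const uint8_t *in, size_t n,
                            struct app_rdma_request *req)
{
    if (n < APP_RDMA_REQUEST_WIRE_SIZE)
        return -1;
    req->op = (uint32_t)get_be(in, 4);
    req->addr = get_be(in + 4, 8);
    req->len = get_be(in + 12, 8);
    if (req->op != APP_RDMA_READ && req->op != APP_RDMA_WRITE)
        return -1;
    return 0;
}
```

A fixed byte order keeps the request readable when the client and server run on hosts with different endianness. The encoded buffer would be sent as the ordinary message payload, with the RDMA cookie travelling separately as a control message.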
There are two methods for sending an RDMA request from the RDMA client to the RDMA server.
The first method is to send an RDS_CMSG_RDMA_MAP control message (carrying an rds_get_mr_args structure) along with the application-level RDMA request by using the sendmsg() system call on an RDS socket. The AIX operating system kernel at the client side processes the RDS_CMSG_RDMA_MAP control message by mapping the specified area of local memory (from the client application's address space), for DMA access, and generating an RDMA cookie. Then, the application-level request is sent to the server along with the RDMA cookie.
The second method consists of two steps. The first step is to call the setsockopt() system call with the RDS_GET_MR socket option, passing an rds_get_mr_args structure. This call maps the specified area of local memory for DMA access, and returns an RDMA cookie. The second step is to send an RDS_CMSG_RDMA_DEST control message (carrying the RDMA cookie that is obtained from the first step) along with the application-level RDMA request by using the sendmsg() system call.
The first method, which requires one system call, is preferred over the second method, which requires two system calls.
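Both methods attach a control message to the sendmsg() call. The following sketch shows the generic mechanics of doing so with the standard CMSG macros. The helper name attach_cmsg is hypothetical, and the level, type, and payload are passed in as parameters so that the sketch makes no assumptions about the contents of the RDS structures.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: attach a single control message to msg, using
 * ctrl_buf as the ancillary data storage. ctrl_buf must be suitably aligned
 * for struct cmsghdr and must stay valid until sendmsg() returns.
 * Returns 0 on success, -1 if ctrl_buf is too small for the payload. */
int attach_cmsg(struct msghdr *msg, void *ctrl_buf, size_t ctrl_buf_len,
                int level, int type, const void *payload, size_t payload_len)
{
    struct cmsghdr *cmsg;

    if (ctrl_buf_len < CMSG_SPACE(payload_len))
        return -1;

    memset(ctrl_buf, 0, ctrl_buf_len);
    msg->msg_control = ctrl_buf;
    msg->msg_controllen = CMSG_SPACE(payload_len);

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = level;
    cmsg->cmsg_type = type;
    cmsg->cmsg_len = CMSG_LEN(payload_len);
    memcpy(CMSG_DATA(cmsg), payload, payload_len);
    return 0;
}
```

For the first method, the caller would pass SOL_RDS as the level, RDS_CMSG_RDMA_MAP as the type, and an rds_get_mr_args structure as the payload. For the second method, the type would be RDS_CMSG_RDMA_DEST and the payload would be the RDMA cookie. The msg_iov field of the same msghdr structure carries the application-level request.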
When the RDMA server application receives the application-level RDMA read request from the client, it also receives an RDS_CMSG_RDMA_DEST control message (carrying the RDMA cookie from the client). The server then initiates the RDMA read operation by sending an application-level reply to the client along with an RDS_CMSG_RDMA_ARGS control message (carrying an rds_rdma_args structure). The AIX operating system kernel at the server side processes the RDS_CMSG_RDMA_ARGS control message by mapping the specified area of local memory (from the server application's address space) for DMA access and by physically starting the RDMA read operation. The RDMA read operation is performed by the server-side InfiniBand adapter, which interacts with the client-side InfiniBand adapter to transfer the data directly from the client application's memory to the server application's memory, without further software intervention. After the RDMA read operation is completed, the server-side adapter sends the application-level reply to the client. The arrival of this reply tells the client application that its RDMA read operation has been completed.
Although this implicit DMA mapping and unmapping mechanism makes it simpler to write RDMA applications, developers must be aware that registering memory for DMA on the AIX operating system is an expensive operation. Thus, if the same area of memory is going to be accessed by using RDMA multiple times, it is more efficient to do the DMA registration only the first time. To do so, a client application needs to use an RDS_CMSG_RDMA_MAP control message without the RDS_RDMA_USE_ONCE flag set when sending the RDMA request to the server. Subsequent RDMA transfers to the same area of the client's memory can then be initiated by the RDMA server application without the client needing to send another request to the server. When the transfers are finished, the client application must explicitly unmap the DMA-mapped memory by using the setsockopt() system call with the RDS_FREE_MR socket option.
RDS-specific socket options are specified by using SOL_RDS as the level parameter for the setsockopt() or getsockopt() system call.