Common JSOR problems (Linux only)

Use Java™ Sockets over Remote Direct Memory Access (JSOR) to take advantage of high performance networking infrastructures such as InfiniBand. To use JSOR you must set up, configure, and tune various resources. If not done correctly, issues can occur.

Note: The RDMA implementation is deprecated and will likely be removed in a future release of IBM® SDK, Java Technology Edition, Version 8. A possible alternative is the open source Libfabric library.

RDMA socket or thread creation failed

A JSOR problem can occur where a thread or Remote Direct Memory Access (RDMA) socket cannot be created. This problem can be caused by running concurrent connections over RDMA transport.

Possible causes

  • RDMA socket buffers are by default pinned, or memory locked. A restricted memlock setting in your environment can result in a failure to create or register new RDMA sockets.
  • When you are running concurrent connections, each RDMA socket implicitly uses a file descriptor for event tracking. If the maximum user open files limit is too low, socket creation can fail.
  • When you are running concurrent connections, thread creation failure can be caused by a maximum user process limit that is too low. For more information, see the following technote: java.lang.OutOfMemoryError while creating new threads.

Mitigation

  • To avoid socket creation failures, check your ulimit -l setting and change your memlock setting to an appropriate value based on the usage of the socket buffers.
  • To avoid socket creation failures when you are running concurrent connections, check your ulimit -n setting and change your nofile setting to an appropriate value based on the scalability requirements of the application.
  • To avoid thread creation failures when you are running concurrent connections, check your ulimit -u setting and change your nproc setting to an appropriate value based on the scalability requirements of the application.
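For example, you can inspect the current limits for the user that runs the JVM and raise them persistently in /etc/security/limits.conf. The user name and values that follow are illustrative only; size them to the socket buffer usage and scalability requirements of your application:

  ulimit -l   # current memlock limit
  ulimit -n   # current nofile limit
  ulimit -u   # current nproc limit

  # Illustrative /etc/security/limits.conf entries for a hypothetical user:
  jsoruser  soft  memlock  unlimited
  jsoruser  hard  memlock  unlimited
  jsoruser  soft  nofile   8192
  jsoruser  hard  nofile   8192
  jsoruser  soft  nproc    4096
  jsoruser  hard  nproc    4096

The new limits take effect at the next login session for that user.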

RDMA network provider initialization failure

A JSOR problem can occur where the Remote Direct Memory Access (RDMA) network provider initialization fails on a 64-bit Linux operating system when you are running a 32-bit JVM.

During the RDMA network initialization stage, the JSOR runtime environment checks for the availability of compatible OFED runtime libraries. If the runtime environment cannot locate and load the librdmacm.so and libibverbs.so 32-bit libraries, you might see this problem. To avoid the problem, install the 32-bit OFED runtime libraries alongside the usual 64-bit libraries on a 64-bit Linux® machine.
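To check whether 32-bit copies of these libraries are visible to the dynamic linker, you can run a command such as the following (output formats vary by distribution):

  ldconfig -p | grep -E 'librdmacm|libibverbs'

On x86-64 systems, 64-bit entries are typically marked (libc6,x86-64) and 32-bit entries (libc6); both should be present.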

RDMA connection failed

A JSOR problem can occur where a Remote Direct Memory Access (RDMA) client fails to connect to an RDMA server.

Client and server on the same host

If the client and server are on the same host, this behavior is expected because there is currently no support for RDMA loopback. For a successful connection, the client and server must be on different hosts, connected by an InfiniBand switch through the RDMA network interface adapters.

If the RDMA NIC has two physical ports, with two InfiniBand interface addresses, you can use an external hardware loopback plug (purchased separately) to connect them. Data from one port can then be fed back to the other port on the same host. With this arrangement, the RDMA client and server can run on the same host, each picking up a different InfiniBand address.
Note: This configuration has not been tested in the Java environment.

Client and server on different subnets

The RDMA client and server should be on the same network, connected by a common InfiniBand switch and managed by a single subnet manager. If your RDMA client and server must be on different subnets, ensure that inter-network switching and packet forwarding are enabled at the hardware and software levels.

Client and server on the same subnet

If the client and server are on the same subnet, a connection failure could be caused by incorrect client or server configuration files, or an incorrect InfiniBand setup on one or both hosts.

Ensure that the rule entries in your configuration files are defined correctly, as described in -Dcom.ibm.net.rdma.conf (Linux only).

Follow these steps to check your InfiniBand setup; sample commands for steps 4 to 6 follow the procedure:
  1. Ensure that each host that is involved in the communication has an appropriate InfiniBand host channel adapter or RDMA network interface card with valid InfiniBand addresses (interfaces that begin with the prefix ib).
  2. Ensure that each InfiniBand port is active and that the maximum transfer unit is properly set. To check the maximum transfer unit, run one of the following OFED runtime commands: ibstat or ibv_devinfo.
  3. Ensure that the ifconfig command lists all the InfiniBand interfaces, and that each interface has a valid IP address.
  4. Choose two valid InfiniBand addresses that are registered with the subnet manager for framing JSOR configuration rules, then verify that basic RDMA communication is possible between the host and client machines by running the rping command with your chosen InfiniBand addresses.
  5. Similarly, run the ibv_rc_pingpong command.
  6. Similarly, run the ib_read_bw and ib_write_bw commands.
If all these steps are successful, there is no issue with basic RDMA communication between the machines. For further problem determination, read the configuration section of the README.txt file that is part of the OFED source distribution: https://www.openfabrics.org/ofed-for-linux/.
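For example, if the InfiniBand interface of the server host has the address 192.168.0.1 (all addresses and counts here are illustrative), steps 4 to 6 might look like this:

  On the server host:
    rping -s -a 192.168.0.1 -C 10 -v
    ibv_rc_pingpong
    ib_read_bw
    ib_write_bw

  On the client host:
    rping -c -a 192.168.0.1 -C 10 -v
    ibv_rc_pingpong 192.168.0.1
    ib_read_bw 192.168.0.1
    ib_write_bw 192.168.0.1

In each pair, start the server-side command first; every command must complete without errors.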

RDMA connection reset exceptions

Concurrent Remote Direct Memory Access (RDMA) clients that try to send small chunks of data millions of times to a single RDMA server can throw connection reset exceptions.

Java Sockets over Remote Direct Memory Access (JSOR) employs the R-Sockets protocol as the basis for implementing socket-level APIs on top of RDMA. The R-Sockets protocol uses the send and receive queue sizes as a basis for implementing data flow and event control between sender and receiver. When several parallel clients try to send small amounts of data millions of times, they might experience connection reset exceptions due to insufficient queue sizes, because the queue sizes dictate the amount of work that can be queued up on either side.

Because the default queue sizes are large (see JSOR environment settings (Linux only)), the tuning of queue sizes is necessary only in rare cases. You should determine the queue sizes based on the workload characteristics of your application. The maximum number and frequency of send and receive operations is particularly important. There is no general formula for determining optimal queue sizes.

RDMA communication appears to hang

The Remote Direct Memory Access (RDMA) communication between client and server appears to hang when you are running RPC-based workloads with unpredictable message sizes.

Java Sockets over Remote Direct Memory Access (JSOR) employs the R-Sockets protocol as the basis for implementing socket-level APIs on top of RDMA. To transfer data properly, the R-Sockets protocol requires both the sender and receiver to be coordinated. The receiver must be ready with a receive buffer available for the sender to put data in. This behavior differs from TCP/IP where buffers are allocated dynamically as required. RDMA receive operations fail if sufficient receive buffers are not available in advance. For more information, see the flow control section of the IETF draft of Remote Direct Memory Access Transport for Remote Procedure Call.

The JSOR implementation by default provides small send and receive buffers, which are less than 50 KB in size. When an RDMA client or server tries to send a large payload, for example 2 MB or 4 MB, in chunks of, say, 1 KB, in one direction without synchronized data flow between end points, the receive buffers can be exhausted, resulting in a hang. The R-Sockets protocol tries to recycle the receive buffers, but if the rate of replenishment is less than the data send rate, progress is impossible. These effects are more pronounced when hundreds of parallel clients try to do the same operations on the same RDMA transport, because the clients compete for the same set of physical network resources. The R-Sockets protocol takes a long time to recover from this situation because it relies on retries and receiver-not-ready negative acknowledgements to make progress. In the worst case, this behavior can result in a deadlock between end points.

Similarly, the size of the send buffer should be sufficient to transfer the data to the corresponding receive buffer.
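To illustrate the kind of coordination that the protocol expects, one option is to acknowledge each chunk at the application level before the next one is sent, so the receiver never has to buffer more than one outstanding chunk. The following sketch shows this pattern with plain java.net sockets, which JSOR can run over RDMA when the configuration rules select them; the host name, port, chunk size, and class names are illustrative assumptions, not part of the JSOR API:

  import java.io.*;
  import java.net.*;

  // Sender: transmit a large payload in fixed-size chunks, waiting for a
  // one-byte acknowledgement after each chunk.
  public class ChunkedSender {
      public static void main(String[] args) throws IOException {
          byte[] payload = new byte[2 * 1024 * 1024];      // 2 MB payload
          int chunk = 1024;                                // 1 KB chunks
          try (Socket s = new Socket("server.example.com", 4000)) {
              DataOutputStream out = new DataOutputStream(s.getOutputStream());
              DataInputStream in = new DataInputStream(s.getInputStream());
              for (int off = 0; off < payload.length; off += chunk) {
                  int len = Math.min(chunk, payload.length - off);
                  out.writeInt(len);                       // announce the chunk size
                  out.write(payload, off, len);
                  out.flush();
                  in.readByte();                           // block until the receiver acknowledges
              }
          }
      }
  }

  // Receiver: read each chunk completely, then acknowledge it.
  class ChunkedReceiver {
      public static void main(String[] args) throws IOException {
          try (ServerSocket ss = new ServerSocket(4000);
               Socket s = ss.accept()) {
              DataInputStream in = new DataInputStream(s.getInputStream());
              DataOutputStream out = new DataOutputStream(s.getOutputStream());
              try {
                  while (true) {
                      byte[] buf = new byte[in.readInt()]; // chunk size announced by the sender
                      in.readFully(buf);
                      // ... process the chunk ...
                      out.writeByte(0);                    // acknowledge the chunk
                      out.flush();
                  }
              } catch (EOFException e) {
                  // sender closed the connection; transfer complete
              }
          }
      }
  }

Because the handshake bounds the unacknowledged data to one chunk, the receive buffers cannot be exhausted, at the cost of one round trip per chunk; larger chunks, or a small window of unacknowledged chunks, reduce that cost.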

Mitigation

For a Java RPC application, tune the buffer sizes before you deploy the application in a production environment. Set the buffer sizes according to the workload characteristics and maximum payload size of the application. There is no general formula for determining optimal buffer sizes. For more information, see JSOR environment settings (Linux only).

Enable application or runtime data transfer timeouts that allow the client to cancel and try the data transfer again, with increased buffer sizes if necessary. See the following APAR for an example: PM52124: OutOfMemoryError errors on eXtreme Scale clients can cause the grid to fail. In this example, a lack of memory caused the server thread to get stuck in a socketWrite() method. The suggested resolution is to set the com.ibm.CORBA.SocketWriteTimeout property.
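At the application level, a plain-socket client can approximate this behavior with a read timeout and a retry loop. The following sketch is illustrative only; the method and parameter names are assumptions, and the one-byte acknowledgement is an application convention, not part of JSOR:

  import java.io.IOException;
  import java.net.InetSocketAddress;
  import java.net.Socket;
  import java.net.SocketTimeoutException;

  // Retry a transfer with a bounded wait on each attempt. Abandoning a
  // stalled connection also releases the RDMA resources tied to it.
  public class RetryingTransfer {
      static void transferWithRetry(String host, int port, byte[] data,
                                    int attempts, int timeoutMillis) throws IOException {
          for (int i = 1; i <= attempts; i++) {
              try (Socket s = new Socket()) {
                  s.connect(new InetSocketAddress(host, port), timeoutMillis);
                  s.setSoTimeout(timeoutMillis);       // bound the wait for the server's reply
                  s.getOutputStream().write(data);
                  s.getOutputStream().flush();
                  if (s.getInputStream().read() != -1) {
                      return;                          // server acknowledged; done
                  }
              } catch (SocketTimeoutException e) {
                  // this attempt stalled; the socket is closed, so retry
              }
          }
          throw new IOException("transfer failed after " + attempts + " attempts");
      }
  }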

Problems encountered with the zero copy function

Java applications hang when the zero copy function is enabled

Due to the internal synchronization that is required between the data source and the data sink when you use the zero copy function, a client or server application might hang if you enable the zero copy function for only one endpoint.

To avoid this problem, enable the zero copy function for both the client and the server endpoints, or disable it for both; do not run with the function enabled on only one side of the connection.

Java applications are not using the zero copy function

Java applications might not use the zero copy function even after you specify the -Dcom.ibm.net.rdma.zeroCopy=true parameter.

The zero copy function is used only when both the following statements are true:
  • You specified the -Dcom.ibm.net.rdma.zeroCopy=true parameter.
  • The buffer sizes that are passed inside Java read and write calls exceed the value that is specified by the -Dcom.ibm.net.rdma.zeroCopyThreshold parameter. For more information about this parameter, see -Dcom.ibm.net.rdma.zeroCopyThreshold (Linux only).
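For example, a client and server pair might both be started with settings such as the following. The class names, configuration file names, and threshold value are illustrative only; see the linked topics for the actual defaults and units:

  java -Dcom.ibm.net.rdma.conf=client.conf -Dcom.ibm.net.rdma.zeroCopy=true -Dcom.ibm.net.rdma.zeroCopyThreshold=1048576 ClientApp
  java -Dcom.ibm.net.rdma.conf=server.conf -Dcom.ibm.net.rdma.zeroCopy=true -Dcom.ibm.net.rdma.zeroCopyThreshold=1048576 ServerApp

Remember that enabling the function on only one endpoint can cause the hang that is described earlier in this section.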
You can use the standard JSOR tracing mechanism to find out whether the zero copy function is being used by looking for the following functions in the trace:
  • socketRead0Direct
  • socketWrite0Direct
  • RDMA_ReadDirect
  • RDMA_SendDirect

Java applications do not scale

The zero copy function is designed for large data transfers, a few at a time. Because of internal synchronization and the overheads of resource allocation and management, scalability is restricted by the size of the data that is transferred: many small transfers do not scale well.

When you use zero copy mode, ensure that the usage scenario is close to a file transfer (FTP) style of data transfer: large payloads, moved a few at a time.

Problems encountered with fork compatibility mode

Several problems can be associated with operating in fork compatibility mode between Java clients and native forked servers.

Native server error message librdmacm: Fatal: unable to open RDMA device

This message relates to a known Open Fabrics Enterprise Distribution (OFED) problem that occurs when around 100 client threads repeatedly open a connection and transfer data. This issue is seen on the following system configuration:
  • POWER® PC systems with a Mellanox RDMA over Converged Ethernet (RoCE) MT26448 adapter
  • Red Hat Enterprise Linux (RHEL) v6.4
  • MLNX_OFED_LINUX-2.0-3.0.0
This problem is not seen on systems with a later Mellanox adapter, MT4099, running RHEL v7 beta and OFED v3.5.

If you encounter this issue, upgrade to the latest version of the operating system and OFED software. If the problem persists, consider upgrading to the latest version of the Mellanox RoCE adapter.

Java clients hang

In fork compatibility mode, Java clients can hang when they connect to native forked servers. This problem is associated with the RSockets preloading library. Internally, the library creates a named semaphore, /rsocket_fork, when processing fork() support. However, when processing is complete, the RSockets library does not remove the semaphore, which persists until the system is rebooted. Any stale link or value for this named semaphore from a previous invocation blocks the native server from accepting remote client connections.

To work around this problem, use the rm command to unlink the /rsocket_fork named semaphore before fork() preloading begins. On Red Hat Enterprise Linux (RHEL), named semaphores appear in the /dev/shm directory as files with the sem. prefix.
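For example, on a RHEL system the stale semaphore can be removed with:

  rm /dev/shm/sem.rsocket_fork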

Java clients do not scale

The RSockets protocol currently offers fork preloading support only for simple applications that run under ideal conditions. While preloading a forked process, the RSockets library uses blocking semantics to migrate a connection to RDMA.

The current support for the fork() method in the RSockets library is therefore inherently non-scalable. Java multithreaded clients that try to connect to native forked servers by using the native interoperability function might experience a large number of failed connections. To mitigate this problem, increase the client connection retry count to more than one, as in the sketch that follows.
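A minimal sketch of such a retry loop, assuming that the application controls its own connection logic (the retry count and backoff policy are illustrative choices):

  import java.io.IOException;
  import java.net.Socket;

  // Attempt a connection several times before giving up, to ride out
  // transient failures while the forked server migrates sockets to RDMA.
  public class ConnectWithRetry {
      static Socket connect(String host, int port, int attempts)
              throws IOException, InterruptedException {
          IOException last = null;
          for (int i = 1; i <= attempts; i++) {
              try {
                  return new Socket(host, port);
              } catch (IOException e) {
                  last = e;
                  Thread.sleep(100L * i);   // simple linear backoff between attempts
              }
          }
          throw last;
      }
  }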