Taking Advantage of 64-bit DMA capability on PowerLinux
Aug 2014.
On new IBM® Power Systems™ servers running Linux, a set of the PCIe slots support a unique feature called 64-bit Direct Memory Access (DMA), which improves I/O operations. Adapters and PCIe slots enabled for this feature allow I/O traffic to take place with significantly less system overhead, improving latency and throughput. This article describes how to recognize whether an IBM Power System server supports the feature, and provides examples of the performance benefits when the feature is enabled.
This section explains the technical terms used throughout the article.
PCIe stands for Peripheral Component Interconnect Express and it is a computer bus for attaching extension hardware devices to a motherboard. PCIe is one of the main buses used to attach peripheral devices to an IBM Power Systems server. Refer to the Introduction to PCI Express IBM Redbooks® article for more details about PCIe.
DMA stands for Direct Memory Access. DMA allows an I/O adapter to access a limited amount of memory directly, without involving the CPU for memory transfers. Both the device driver for the adapter and the operating system must recognize and support this.
RDMA stands for Remote Direct Memory Access. RDMA enables an application to write directly to physical memory on a remote system, without operating system overhead, because data moves between the network adapter and the application memory area without intermediate copies through the network stack. By eliminating operating system involvement, RDMA provides high-throughput, low-latency communication, and it is often used in High Performance Computing (HPC).
IOMMU stands for I/O Memory Management Unit and is responsible for managing the I/O memory addresses, as well as enabling the connection between DMA-capable I/O buses and the main memory. Refer to this article for more details about IOMMU.
DMA window is a range of addresses the adapter is allowed to access. A typical DMA window is relatively small, around 2GB. The DMA window address is mapped to the physical memory using a Translation Control Entry (TCE) table on the IOMMU system, as shown in Figure 1. In the normal mode of using DMA, device drivers must request mappings from the operating system for every I/O operation, and subsequently remove those mappings after they are used. Some I/O operations allow mappings to be cached and reused by the driver. The performance advantage of using IOMMU is that data is delivered directly to, or read directly from, memory that is a part of the application space. This typically eliminates extra memory copies for the I/O.
Figure 1: Memory mapping not exploiting 64-bit DMA feature
64-bit DMA is a Power System PCIe slot capability that enables a DMA window to be wider, possibly allowing all the partition memory to be mapped for DMA. This avoids overhead when DMA mappings are requested by the driver, since all the system memory is already mapped. Consequently, this feature enables a faster data transfer between the I/O card placed in this slot and the system memory. This capability is also known as Huge Dynamic DMA Window in some kernel patches and discussions.
If the card or the device driver does not support the 64-bit DMA feature, the PCIe slot works in a standard way, not being differentiated from the other slots.
Advantages of 64-bit DMA
With a wider DMA window, it is possible to map the entire partition memory. This gives a direct translation between the I/O address space and the memory address space, without per-operation manipulation of the hypervisor's translation control entries (TCEs), as shown in Figure 2. If the DMA window cannot cover the whole memory, as in Figure 1, each I/O operation requires a translation process that converts one address space to the other, and this operation consumes CPU cycles.
Moreover, the 64-bit DMA support allows RDMA-aware applications and libraries to transfer data directly to any location in a remote system's memory (with appropriate access restrictions), as shown in Figure 2. This, in turn, results in a complete offload of all protocol processing to the RDMA adapter. The result is lower-latency and higher-bandwidth communications, as well as significant reduction of the number of CPU cycles needed to move the data from one system to another.
Figure 2: Memory mapping with 64-bit DMA feature enabled
In order to properly realize this benefit, as well as to function correctly, RDMA devices such as InfiniBand HCAs and RoCE or iWARP RNICs must be installed in 64-bit DMA-capable PCIe slots. This is because applications that can potentially use RDMA services can pin arbitrary amounts of memory that will be involved in RDMA transfers. The only way to ensure that there will not be over-consumption of mappable DMA by the adapter is to place it into a slot that provides 64-bit DMA capability (i.e., full DMA-space addressing).
IBM Power system requirements
In order to take advantage of 64-bit DMA, the Power Systems server should be upgraded to a firmware level of at least version 740. To check the firmware level on Linux, inspect the file /proc/device-tree/openprom/ibm,fw-vernum_encoded using the lsprop command provided by the powerpc-utils package, or dump it with the cat command. The file contains a value of the form ALXXX_YYY, where XXX is the firmware release level and YYY is the service pack level.
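The check above can be scripted. The sketch below parses a firmware string of the ALXXX_YYY form described in the article; the value used here is a made-up example, since the real value comes from reading the device-tree file on a live Power system:

```shell
# On a real system, the encoded firmware version is read with:
#   cat /proc/device-tree/openprom/ibm,fw-vernum_encoded
#
# Parse an ALXXX_YYY firmware string into its release and service pack levels.
fw="AL740_100"                 # hypothetical example value, not from a live system
release="${fw:2:3}"            # XXX: firmware release level
service_pack="${fw#*_}"        # YYY: service pack level
echo "release=${release} service_pack=${service_pack}"
```

A release level of 740 or later indicates the minimum firmware for 64-bit DMA support.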
If your machine does not have the proper firmware, it can be upgraded. The IBM Power System firmware can be downloaded from IBM Fix Central and upgraded according to the system documentation.
The following systems are known to support the 64-bit DMA feature:
IBM Flex System compute nodes:
- IBM Flex System p24L
- IBM Flex System p260
- IBM Flex System p270
- IBM Flex System p460
IBM Power Systems servers:
- IBM PowerLinux 7R1 server
- IBM PowerLinux 7R2 server
- IBM Power 710 Express Server
- IBM Power 720 Express Server
- IBM Power 730 Express Server
- IBM Power 740 Express Server
- IBM Power 750 Express Server
- IBM Power 760 Server
- IBM Power 770 Server
- IBM Power 780 Server
If your system has an RDMA or other high-performance adapter, such as a 10 Gb/s network adapter, it is recommended that you place the adapter in a slot that supports 64-bit DMA, and that you have the latest Linux distribution version installed.
Linux distribution versions
The 64-bit DMA feature is relatively new, so you need a recent Linux version for full support of 64-bit DMA on Power systems, as shown below:
Red Hat Enterprise Linux
| Version | 64-bit DMA Support |
|---------|--------------------|
| Red Hat Enterprise Linux version 5 | No |
| Red Hat Enterprise Linux version 6.0 | No |
| Red Hat Enterprise Linux version 6.1 and later | Yes |
SUSE Linux Enterprise Server
| Version | 64-bit DMA Support |
|---------|--------------------|
| SUSE Linux Enterprise Server 10 | No |
| SUSE Linux Enterprise Server 11 GA and SP1 | No |
| SUSE Linux Enterprise Server 11 SP2 and later | Yes |
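Before consulting the tables above, you can confirm which kernel and distribution a system is running. The sketch below prints both; the distribution release file names vary (/etc/os-release exists only on newer releases, while older Red Hat and SUSE systems use their own files), so several candidates are tried:

```shell
# Print the running kernel release.
kernel_release=$(uname -r)
echo "kernel: ${kernel_release}"

# Print whichever distribution release file exists (candidates vary by distro).
cat /etc/os-release /etc/redhat-release /etc/SuSE-release 2>/dev/null | head -n 5
```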
64-bit DMA capable PCIe slots per system
The following table shows the IBM systems that support the feature, and the PCIe slots that have the capability.
64-bit DMA capable slots
Note: Slots C4 and C6 support 64-bit DMA only if the system firmware is 760 or later.
PCIe slots on IBM Power Systems
The following table shows the rear of the IBM Power machines and the slot numbers for each model. Some machines share the same rear arrangement; in that case, the slot order and the picture are valid for all the models under the picture.
For updated information about the machine PCIe slot placement, see System overview for POWER7® processor-based systems, select the machine model, then Installing and configuring the system -> PCI adapters -> Installing, removing, and replacing PCI adapters -> PDF file for Installing PCI adapters -> Installing PCI adapters PDF, and save the PDF file.
LPAR memory configuration
Despite the name, 64-bit DMA supports a DMA window size of up to 2 TB. Depending on the maximum memory size of the logical partition (LPAR), the window may not be wide enough to map the whole partition memory. In that case, there is no benefit to placing the card in a 64-bit DMA capable PCIe slot, as shown below.
| Partition maximum memory size | 64-bit DMA benefit |
|-------------------------------|--------------------|
| Higher than 2 TB | No |
| Lower than 2 TB | Yes |
Thus, to take advantage of 64-bit DMA, ensure that the LPAR's Maximum memory is not greater than 2 TB; if it is, the 64-bit DMA feature will not be enabled for that LPAR. To configure the maximum memory, complete these steps:
- In the navigation area, expand Systems Management > Servers and select the server on which the logical partition is located.
- In the contents area, select the logical partition.
- Click Tasks and select Configuration > Manage Profiles. The Managed Profiles page is displayed.
- Select the profile that you want to use, click Action, and select Edit.... The Logical Partition Profile Properties page is displayed.
- Select Memory and change the Maximum memory to be equal to the Desired memory.
- Click OK.
Figure 3: Configuring the maximum memory
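The 2 TB limit can be sanity-checked from inside the partition. Note that /proc/meminfo reports the current, not the maximum, partition memory, so this is only a rough check, not a substitute for verifying the profile's Maximum memory on the HMC:

```shell
# Compare the partition's current memory against the 2 TB DMA window limit.
limit_kb=$((2 * 1024 * 1024 * 1024))    # 2 TB expressed in kB
mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
if [ "$mem_kb" -lt "$limit_kb" ]; then
    echo "current memory fits within a 2 TB DMA window"
else
    echo "current memory exceeds a 2 TB DMA window"
fi
```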
Checking for 64-bit DMA enabled adapters
Because not all PowerLinux PCIe slots and device drivers are 64-bit DMA capable, you should verify that the device driver is using this configuration by checking the device driver module messages on Linux. To check, do the following:
- Place the PCIe adapter in the specified slots, as listed in the IBM Power system requirements section.
- Load the adapter's device drivers.
- Check the log messages, looking for the message "Using 64-bit direct DMA at offset".
If this message appears in the logs, then the adapter is placed in the correct PCIe slot, and the device driver has the feature support, as follows:
# dmesg | grep 64-bit
mlx4_core 0004:01:00.0: Using 64-bit direct DMA at offset 10000000000000
mlx4_core 0004:01:00.0: Using 64-bit direct DMA at offset 10000000000000
be2net 0007:01:00.0: Using 64-bit direct DMA at offset 10000000000000
In the case above, both mlx4_core and be2net device drivers are using the 64-bit DMA feature. Hence, the adapters at PCI slots 0004:01:00.0 and 0007:01:00.0 are taking advantage of the 64-bit DMA feature. If the adapter does not support 64-bit DMA, the adapter can use only the default 2GB window.
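This check is easy to script. The sketch below extracts the PCI addresses from the article's example log lines, stored in a variable so it runs without a live system; in practice you would pipe real `dmesg` output through the same awk filter:

```shell
# Example dmesg lines taken from the article's output above.
log='mlx4_core 0004:01:00.0: Using 64-bit direct DMA at offset 10000000000000
be2net 0007:01:00.0: Using 64-bit direct DMA at offset 10000000000000'

# List the PCI addresses of adapters that enabled 64-bit direct DMA
# (field 2 is the PCI address; strip its trailing colon).
addrs=$(printf '%s\n' "$log" | awk '/Using 64-bit direct DMA/ { sub(/:$/, "", $2); print $2 }')
printf '%s\n' "$addrs"
```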
Performance experiment and comparison
To show the benefit of the 64-bit DMA feature, IBM engineers ran a set of network tests comparing this feature against a traditional slot with a standard PCIe adapter. The experiments were executed using a pair of IBM PowerLinux 7R2 servers, each with 16 3.55 GHz cores and 128 GB of memory. Two adapters were used, one in each system. The adapter model used in these tests was a PCIe2 2-Port 10GbE RoCE SR adapter (EC30). This is a 2-port 10 Gb network adapter using the mlx4_en driver, which supports 64-bit DMA. The tests were executed using an out-of-box Red Hat Enterprise Linux version 6.4 distribution. The cards were plugged into a 10 Gb/s switch, model IBM G8124E.
The tests were executed using the uperf (unified performance) tool. uperf is a network performance measurement tool that supports execution of workload profiles. The uperf tool collects quite a wide variety of statistics, such as packet latency and throughput. The CPU utilization was also captured during the tests in order to demonstrate the benefits of the 64-bit DMA capable PCIe slots on the server and client machines.
The test scenario was based on two similar runs using the same uperf profilers. The only difference between the two run instances was the slot that the network interface cards were plugged into. In the first run, the card was plugged into a standard PCIe slot, and, then, it was moved to a 64-bit DMA capable PCIe slot. Nothing else was changed or tuned.
To do this comparison, two sets of tests were run. The first focused on latency; thus, it ran over the UDP protocol using a request-and-response flow. This test is widely used to measure the latency of transactions on an adapter, operating system, switch, and so forth. Request/response performance is represented by transactions per second for a given request and response size. A transaction is a unit of work, with either an iteration or a duration associated with it, and is generally defined as the exchange of a single request and a single response. From the transaction rate, we can infer the average latency. To simulate mixed workloads, in a scenario where several applications use the network at the same time, multiple instances were used in the tests, each transferring its own set of messages.
The second test focused on throughput; hence, the TCP protocol was used. In this case, a set of TCP connections was opened between the server and the client, and the client sent as many messages as it could. The message size is specified as a uperf profile value. During both tests, different message sizes were tested in order to cover different application scenarios.
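To illustrate how such a run is described, the sketch below writes a minimal TCP stream profile modeled on the sample profiles shipped with uperf. The element names, options, and thread count here are a hypothetical template, not the exact profile used in these tests, so compare it against the examples in your own uperf installation before relying on it:

```shell
# Write a minimal uperf TCP stream profile (hypothetical template).
cat > /tmp/tcp_stream.xml <<'EOF'
<?xml version="1.0"?>
<profile name="tcp_stream">
  <group nthreads="2">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost=$h protocol=tcp"/>
    </transaction>
    <transaction duration="60s">
      <flowop type="write" options="size=64k"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
</profile>
EOF
echo "wrote $(wc -l < /tmp/tcp_stream.xml) lines"

# A run would then look like (requires a uperf slave on the remote host):
#   h=<server-ip> uperf -m /tmp/tcp_stream.xml
```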
When discussing I/O performance, a number alone is not enough to represent the performance of an adapter. This is because there are a lot of variables involved in a run, such as the number of instances, MTU size, protocol, traffic type, message size, CPU utilization on the server (RX) side, CPU utilization on the client (TX) side, and buffer configurations. The results that follow show an average of what is seen on the most common workloads with different message sizes.
As is usual with performance metrics, the following measurements are simply an example of what was seen in a very particular setup. Our goal is to provide you with the information needed to reproduce these tests yourself, in your environment, on your systems. If you have questions on your results, we encourage you to ask a question on the PowerLinux Community message board.
The first test was aimed at reaching the best latency from the adapter, on both system setups, using 50 concurrent flow instances and an MTU of 1500 bytes.
Figure 4: Latency comparison using different message size
Figure 4 shows the comparison of latency when the adapter is plugged into a standard PCIe slot compared to the same card plugged into a 64-bit DMA capable PCIe slot. The test was run using different message sizes, and in almost all cases the latency is improved when the card is placed in the 64-bit DMA capable PCIe slot.
A broader test, which also considered a smaller number of threads, was run with different message sizes and instance counts; it found that latency is on average 21% better when the card is placed in the proper slot.
As described above, a second test was run focusing on extracting the best throughput from the set of cards. In this case, two TCP stream sessions were created between the client and the server systems, and the data with a fixed size was sent throughout these sessions.
Figure 5: Throughput comparison using different message size
Figure 5 compares the throughput in both cases, encompassing different message sizes and using two instances. Running the tests more broadly, using different instance counts (2, 4, and 8), shows that an adapter in a 64-bit DMA capable PCIe slot improves throughput, on average, by 11% compared to a setup where the card is placed in a standard PCIe slot.
This article presented the concepts and benefits of the 64-bit DMA feature on IBM Power Systems servers, along with the steps required to exploit it.
The article also included two performance charts showing the benefit of using an adapter in a 64-bit DMA capable PCIe slot, mainly when the workload is sensitive to network packet latency. Latency, on average, improves by 21% when this feature is enabled, while throughput, on average, improves by around 11% over standard PCIe slots.