Achieving high performance on IBM AIX using Coherent Accelerator Processor Interface (CAPI)

Accelerating I/O operations

What is CAPI?

IBM® POWER8® and later architectures provide support for the Coherent Accelerator Processor Interface (CAPI), which is available on certain PCIe slots in IBM Power® systems. CAPI can be thought of as a special tunneling protocol over PCIe that allows a PCIe adapter to appear as a special-purpose co-processor or accelerator that can read and write an application's memory and generate page faults. As a result, the host interface to an adapter running in CAPI mode requires neither the data buffers to be direct memory access (DMA) mapped through Translation Control Entries (TCEs) nor the memory to be pinned.

Any application entity (a user or kernel thread) that wants to take advantage of CAPI acceleration performs a context attach on the special-purpose CAPI co-processor, which reserves scheduling bandwidth and resources to handle page faults. When the thread is done with its work, it performs a context detach, which frees up the resources for use by other threads. Attaching a process means providing access to its address space so that the co-processor has direct read/write access to the application's command/response buffers, context save/restore areas, and data buffers without kernel involvement except for page faults. At that point, there is a direct path, or super pipe, between the co-processor and the application through which data flows in both directions.
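
On AIX with a CAPI Flash adapter, this attach/detach lifecycle is exposed through the CAPI Flash block library described later in this article. The following minimal sketch illustrates the shape of the lifecycle; the disk name /dev/hdisk1 is an assumption for illustration.

/* Minimal sketch: attach a CAPI context, do direct I/O, detach.
 * The device name below is an assumption.
 */
#include <fcntl.h>
#include <sys/capiblock.h>

void lifecycle_sketch(void)
{
     char buf[4096];
     chunk_id_t id;

     cblk_init(NULL, 0);                 /* Initialize the block library */

     /* Opening a chunk attaches a context on the accelerator */
     id = cblk_open("/dev/hdisk1", 0, O_RDWR, 0, CBLK_OPN_VIRT_LUN);
     if (id != NULL_CHUNK_ID)
     {
         cblk_set_size(id, 1, 0);        /* One 4 KB block on the virtual LUN */
         cblk_read(id, buf, 0, 1, 0);    /* Direct I/O through the super pipe */
         cblk_close(id, 0);              /* Detach the context, freeing resources */
     }
     cblk_term(NULL, 0);                 /* Tear down the library */
}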

Why use CAPI?

For applications that need superior performance with a much smaller programming investment, CAPI is a compelling alternative. It provides a truly shared memory address space with the accelerator, allowing the application to process data quickly. On systems where processor resources are at a premium, with many workloads competing for them, CAPI accelerates performance by transparently offloading processor-intensive processing to the accelerator.

Using CAPI for I/O to flash storage

I/O operations to storage devices such as flash and solid-state drives (SSDs) involve transferring data to or from the processor or memory, and therefore consume a considerable number of processor cycles. Because CAPI is coherent with the processor, CAPI I/O acceleration technology can be used to offload these I/O operations from the processor. This reduces processor usage for I/O-intensive workloads, freeing processor cycles for other operations.

You can use a CAPI Flash adapter in two modes:

  • Traditional (legacy) file system mode
  • Superpipe I/O mode, where applications issue I/O requests directly to the adapter, bypassing the operating system stack

Using CAPI in the superpipe I/O mode requires knowledge of the CAPI Flash adapter, its registers, and its command structures. IBM AIX® provides a library, the CAPI Flash block library, to ease application development in the superpipe I/O mode. Refer to the "Related topics" section for the APIs provided by the CAPI Flash block library. An example superpipe I/O program is given in the "Sample superpipe I/O application" section.

CAPI support in AIX

The initial CAPI Flash support, which uses CAPI adapters connected to flash storage in a direct-attached configuration, was added in AIX releases 7200-00 and 7200-01. In this stack, a generic SCSI-3 disk driver is interfaced with the CAPI Flash adapter driver to use CAPI Flash disks (IBM FlashSystem 900 flash disks connected to a CAPI Flash adapter).

The stack was further optimized for better legacy I/O performance in AIX release 7200-02-01. In this optimization, the generic SCSI-3 driver was replaced with a monolithic CAPI driver that handles both the CAPI Flash disks and the CAPI Flash adapter. Refer to Figure 1 for a high-level view of the optimized stack.

Figure 1. Optimized CAPI stack in AIX

CAPI Flash devices are created in the /dev file system. Run the following command to list the CAPI Flash adapters:

# lsdev -Cs capi
cflash0 Available 00-48000000 CAPI Flash Adapter (1410f0041410f004)
cflash1 Available 01-48000001 CAPI Flash Adapter (1410f0041410f004)

Run the following command to list the CAPI Flash disks:

# lsdev -Cs capidev
hdisk0 Available 00-48000000 MPIO CAPI Flash Disk
hdisk1 Available 00-48000000 MPIO CAPI Flash Disk
hdisk2 Available 00-48000000 MPIO CAPI Flash Disk
hdisk3 Available 00-48000000 MPIO CAPI Flash Disk
hdisk4 Available 01-48000001 MPIO CAPI Flash Disk
hdisk5 Available 01-48000001 MPIO CAPI Flash Disk

Refer to the following example of the lspath output. Here, the connection field has three values, which represent the adapter port number, the target worldwide port name (WWPN), and the logical unit number (LUN) identifier.

# lspath -l hdisk0 -F "name parent connection status"
hdisk0 cflash0 0,500507605e8397a3,0 Enabled
hdisk0 cflash0 1,500507605e839784,0 Enabled

Hardware configuration

The CAPI Flash adapter requires support from the IBM POWER® processor and can be placed only in third-generation (Gen3) PCIe x16 slots. CAPI adapters are supported on POWER8 processor-based systems in the following slots.

Table 1. CAPI supported slots in POWER8

Machine type                                      Supported slots
8286-41A, 8286-42A (1 processor)                  P1-C6, P1-C7
8286-42A, 8284-22A                                P1-C3, P1-C5, P1-C6, P1-C7
E870 (9119-MME), E880 (9119-MHE)                  P1-C2, P1-C4, P1-C6, P1-C8
IBM Power 850 (8408-E8E), Power 860 (9109-RME)    P1-C1, P1-C3, P1-C7, P1-C9

Performance results

Performance tests for various workloads, such as read, write, and mixed read/write, were conducted with the flexible I/O tester (fio) on CAPI (optimized legacy mode) and on Fibre Channel (FC). In both cases the link speed is achieved, but the processor utilization is lower with CAPI. Our performance results show an average improvement of 135% for read operations. The following section provides the performance results and system configuration details.

System configuration details:

  • IBM Power System 8286-42A server with 32 processors at a frequency of 3.325 GHz
  • IBM FlashSystem 900 flash storage with two storage ports
  • CAPI Flash adapter with two ports
  • Eight disks with a size of 50 GB each
  • AIX configuration: AIX 7.2-TL2-SP1 (Using an optimized stack, as shown in Figure 1)

Thousand input/output operations per second per CPU (KIOPS/CPU) is the metric used for performance analysis, where KIOPS is the throughput achieved and CPU is the processor utilization percentage in the kernel. Though the IOPS driven reached close to the maximum link bandwidth for both CAPI and FC, the processor utilization with CAPI is far lower than with FC. The comparison is shown in Figure 2.
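
To make the metric concrete, consider a hypothetical worked example (the raw IOPS and processor figures behind Figure 2 and Table 2 are not published in this article):

KIOPS/CPU = (IOPS achieved / 1000) / kernel processor utilization (%)

For example, a run driving 500,000 IOPS at 10% kernel processor utilization would score (500000 / 1000) / 10 = 50 KIOPS/CPU.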

KIOPS/CPU values are calculated for the following workloads using raw I/O directly to the CAPI disks (a representative fio job file is sketched after the list):

  • 100% read operations
  • 70% read and 30% write operations
  • 100% write operations
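
The exact fio parameters used for these tests are not given in this article; the following job file is an illustrative sketch of the 100% read case (the device name, queue depth, and run time are assumptions):

; capi-read.fio -- illustrative job file; all parameter values are assumptions
[global]
direct=1            ; raw I/O to the disk, bypassing the file system cache
rw=randread         ; 100% read (for the 70/30 case, use rw=randrw with rwmixread=70)
bs=4k               ; 4 KB block size
iodepth=16
numjobs=8
runtime=120
time_based

[capi-disk]
filename=/dev/rhdisk0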

In all three cases, CAPI outperforms FC, and the maximum gain is obtained for 100% read operations (as shown in Table 2).

Figure 2. KIOPS/CPU comparison – CAPI versus FC

The exact percentage gain for each of the above cases is shown in the following table:

Table 2. Performance gain – CAPI versus FC

Workload            KIOPS/CPU (CAPI)    KIOPS/CPU (FC driver)    Gain
Read                88.9                37.77                    135.37%
70/30 Read/Write    81.17               40.5                     100.41%
Write               72.5                42.9                     68.99%

CAPI use cases

With the advent of real-time analytics and cognitive applications, the demand for in-memory databases has increased manifold. The size of in-memory databases is growing rapidly, while the growth in system dynamic random access memory (DRAM) size is limited. Hence, the demand for low-latency storage that can extend memory is increasing.

The latency of CAPI Flash used in the superpipe I/O mode is low compared to traditional I/O. In this light, CAPI Flash can be seen as slow memory (its latency cannot match that of DRAM). CAPI Flash can therefore extend the system's effective memory size manifold, because much larger flash storage can be connected to a CAPI Flash adapter.

In-memory database vendors can use CAPI Flash to extend memory. Refer to Linux Redis Labs NoSQL Database Exploitation of POWER8 CAPI for more details.

In the future, CAPI adapters (OpenCAPI) might be connected to POWER processors over a high-speed link, further decreasing latencies and increasing throughput.

Sample superpipe I/O application

Refer to the following example program, which uses a CAPI adapter in the superpipe I/O mode through the CAPI Flash block library.

/**********   cblk_test.c   *************/

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>         /* read(), lseek(), close() */
#include <strings.h>        /* bcmp() */
#include <sys/capiblock.h>

int
main(int argc, char *argv[])
{
     int fd = -1, rc = -1, offset = 0, cnt = 0;
     chunk_id_t chnk_id;
     char buf[4096], file_buf[4096];

     if (argc < 3)
     {
         fprintf(stderr, "Usage: %s <input_file> <CAPI_flash_disk>\n",
                 argv[0]);
         return -1;
     }

     /* Initialize the CAPI Flash block library */
     rc = cblk_init(NULL, 0);
     if (rc != 0)
     {
         return(rc);
     }

     /* Open a virtual LUN on the given CAPI Flash disk (argv[2]) */
     chnk_id = cblk_open(argv[2], 0, O_RDWR, 0, CBLK_OPN_VIRT_LUN);

     if (chnk_id == NULL_CHUNK_ID)
     {
         return (-1);
     }

     /* Assign 256 MB to the virtual LUN: (256 * 1024) KB / 4 KB per block = 65536 blocks */
     rc = cblk_set_size(chnk_id, 65536, 0);
     if (rc)
     {
         return -2;
     }

     /* Open the given file (argv[1]) on a file system */
     fd = open(argv[1], O_RDONLY);
     if (fd < 0)
     {
         return -3;
     }

     /* Copy the file contents to the virtual LUN, one 4 KB block at a
      * time. cblk_write() returns the number of blocks written.
      */
     while ( (cnt = read(fd, buf, 4096)) > 0)
     {
          rc = cblk_write(chnk_id, buf, offset, 1, 0);
          if (rc != 1)
          {
              perror("write:");
              return -4;
          }

          offset++;
     }

     /* Now read the data back from the virtual LUN and compare it
      * with the file contents.
      */
     offset = 0;
     rc = lseek(fd, 0, SEEK_SET);
     if (rc == -1)
     {
          perror("lseek:");
          return -5;
     }

     /* cblk_read() returns the number of blocks read */
     while ((rc = cblk_read(chnk_id, buf, offset, 1, 0)) > 0)
     {
          cnt = read(fd, file_buf, 4096);
          if (cnt < 0)
          {
              perror("File read:");
              return -6;
          }

          if (cnt == 0)      /* End of file: the comparison is complete */
          {
              break;
          }

          /* Compare the two buffers */
          rc = bcmp(buf, file_buf, cnt);
          if (rc != 0)
          {
              printf("Data Mismatch\n");
              return -7;
          }

          offset++;
     }

     close(fd);
     cblk_close(chnk_id, 0);     /* Detach the CAPI context */
     cblk_term(NULL, 0);         /* Tear down the block library */

     return 0;
}

Use the following commands to compile the sample superpipe application.

  • Using the GNU gcc compiler:
	# gcc cblk_test.c -o cblk_test -lcflsh_block
  • Using the xlc compiler:
	# cc cblk_test.c -o cblk_test -lcflsh_block
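
Assuming the binary was built as shown above, a hypothetical invocation copies a file to a virtual LUN on a CAPI Flash disk and verifies the copy (the file and disk names are examples; pick a disk reported by lsdev -Cs capidev):

# ./cblk_test /etc/motd /dev/hdisk1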

CAPI limitations in AIX

CAPI implementation on AIX includes the following limitations:

  • Fibre Channel topologies other than direct-attached are not supported.
  • No boot support with disks attached through CAPI Flash adapter.
  • Virtualization of the CAPI Flash disk devices through VIOS is not supported.
  • Live Partition Mobility (LPM) is not supported with CAPI disks.
  • 512-byte block size disks are not supported.

Conclusion

With the increasing adoption of real-time analytics, the demand for larger memory is growing. However, system memory sizes are not keeping pace with this demand.

CAPI Flash, with its superior low latency in the superpipe I/O mode, can be treated as slow memory and used to extend the memory footprint of in-memory databases, which are extensively used in analytics and other real-time applications.

Related topics

