Accelerated data compression that uses zlibNX in PASE for i

The zlibNX library is an enhanced version of the zlib compression library that supports hardware-accelerated data compression and decompression on IBM® POWER® processor-based servers.

Note: In IBM i 7.6, the zlibNX support is delivered in the base operating system plus PTF SJ05335.

Computer files often contain redundant or predictable data, such as repeated byte strings and bytes that are much more or less common than others. Operating on files with a large amount of redundant data consumes unnecessary input/output resources, such as storage and network. Lossless data compression techniques store data with much of the redundancy removed while still allowing the original data to be reconstructed exactly. One of the most popular industry formats for lossless data compression is DEFLATE, which is used in files with the suffixes .zip and .gz (the latter often created by a program called gzip); the most popular C library for managing DEFLATE data is zlib. The IBM i Open Source Repository contains a program called pigz, which runs within PASE and uses zlib to create and operate on .gz files while making use of multiple processor threads.

zlibNX is a newer, enhanced version of zlib that supports hardware-accelerated data compression and decompression on IBM POWER processor-based servers. Existing programs that use zlib can, without changes, automatically use zlibNX to get faster compression when the prerequisite conditions are satisfied. These conditions, which are described in more detail elsewhere, include:
  • The partition is running on Power11 (or later) processors.
  • The partition processor configuration is in Power11 (or later) compatibility mode.
  • Accelerator resources are available to the partition. For more information about accelerator resources, see the IBM i on Power - Performance FAQ.

Configuring and verifying the environment

The directory /QOpenSys/usr/zlibNX/lib contains a file named libz.so.1 that provides the interface to use the zlibNX library.

To use pigz with zlibNX, install version 2.8.0 or later of the pigz RPM from the IBM i Open Source Repository, which has been updated to use the library in /QOpenSys/usr/zlibNX/lib if it is present.
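The two conditions above (the library file is present and the pigz RPM is at version 2.8.0 or later) can be checked from a PASE shell. The following is a minimal sketch, not an official procedure. It assumes that rpm is on the PATH and that the package is named pigz; version_at_least and check_pigz_ready are hypothetical helper names introduced here for illustration.

```shell
# Return success (0) when dotted version $1 is at least dotted version $2.
# Only numeric components are compared; missing components count as 0.
version_at_least() {
    have=$1
    want=$2
    while [ -n "$have" ] || [ -n "$want" ]; do
        h=${have%%.*}
        w=${want%%.*}
        [ -n "$h" ] || h=0
        [ -n "$w" ] || w=0
        if [ "$h" -gt "$w" ]; then return 0; fi
        if [ "$h" -lt "$w" ]; then return 1; fi
        case $have in *.*) have=${have#*.} ;; *) have= ;; esac
        case $want in *.*) want=${want#*.} ;; *) want= ;; esac
    done
    return 0
}

# Hypothetical helper: report whether the zlibNX library file exists and
# whether the installed pigz RPM is new enough to use it.
check_pigz_ready() {
    lib=/QOpenSys/usr/zlibNX/lib/libz.so.1
    [ -f "$lib" ] || { echo "missing $lib" >&2; return 1; }
    # Assumption: rpm is on the PATH and the package is named pigz.
    installed=$(rpm -q --queryformat '%{VERSION}' pigz) ||
        { echo "pigz RPM is not installed" >&2; return 1; }
    version_at_least "$installed" 2.8.0 ||
        { echo "pigz $installed is older than 2.8.0" >&2; return 1; }
    echo "pigz $installed and zlibNX library found"
}
```

Calling check_pigz_ready reports the first condition that is not met. It does not confirm that acceleration is actually occurring.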

To verify that an environment is ready for acceleration, create a test file of at least 4096 bytes and use the ZLIB_VERBOSE environment variable to determine whether acceleration is occurring:

dd if=/dev/zero of=testfile bs=4096 count=1
ZLIB_VERBOSE=1 pigz testfile 
If the output contains the text sanitize_accel: NX CC: 64 and deflate_nx() returning bstate: 3, acceleration support is active. If the output does not indicate that acceleration support is active, or if no output is displayed, perform the following steps:
  1. Ensure that any required PTFs are loaded and applied, and install the latest version of the pigz RPM.
  2. If the output contains nx_config_query() failed, a configuration prerequisite is not complete. For example, the partition might be running in an unsupported processor compatibility mode. Ensure that all prerequisites are satisfied.
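The checks above can be scripted. The following is a minimal sketch that looks only for the three text fragments quoted in this section; classify_zlib_verbose and its result labels are hypothetical names, and the exact surrounding format of the verbose output is not assumed.

```shell
# Classify text captured from a ZLIB_VERBOSE=1 run.
# Prints one of: accelerated, prerequisite-missing, not-accelerated.
classify_zlib_verbose() {
    log=$1
    case $log in
        *"nx_config_query() failed"*)
            echo "prerequisite-missing"
            return 0 ;;
    esac
    case $log in
        *"sanitize_accel: NX CC: 64"*)
            case $log in
                *"deflate_nx() returning bstate: 3"*)
                    echo "accelerated"
                    return 0 ;;
            esac ;;
    esac
    echo "not-accelerated"
}

# Example use (the verbose messages are written to standard error):
#   dd if=/dev/zero of=testfile bs=4096 count=1
#   out=$(ZLIB_VERBOSE=1 pigz testfile 2>&1)
#   classify_zlib_verbose "$out"
```

A result of not-accelerated corresponds to step 1 above, and prerequisite-missing corresponds to step 2.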
To take advantage of zlibNX from other programs within PASE, use the LDR_PRELOAD64 environment variable:
LDR_PRELOAD64='/QOpenSys/usr/zlibNX/lib/libz.so.1(shr_64.o)'
export LDR_PRELOAD64
Any 64-bit program that runs with that environment variable setting and uses a zlib routine automatically uses the zlibNX version of the routine instead. As described previously, ZLIB_VERBOSE=1 can be used to determine whether zlibNX is in use; if it is, some output is generated on standard error as soon as the first deflate or inflate operation is performed.
Note: Instead of using LDR_PRELOAD64, the environment variables LIBPATH or LD_LIBRARY_PATH can be used to accomplish the same task, but those mechanisms are somewhat less efficient and more prone to causing unexpected behavior with other programs running in PASE.
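Exporting LDR_PRELOAD64 affects every 64-bit program started from that shell afterward. Where that is broader than intended, the setting can be limited to a single command. The following is a minimal sketch under that assumption; with_zlibnx and ZLIBNX_LIB are hypothetical names, and the fallback when the library file is absent is a design choice of this sketch, not documented behavior.

```shell
# Path of the zlibNX library as given in this section.
ZLIBNX_LIB=/QOpenSys/usr/zlibNX/lib/libz.so.1

# Run one command with zlibNX preloaded. The variable is set inside a
# subshell, so the calling shell's environment is left unchanged.
with_zlibnx() {
    if [ -f "$ZLIBNX_LIB" ]; then
        (
            LDR_PRELOAD64="${ZLIBNX_LIB}(shr_64.o)"
            export LDR_PRELOAD64
            "$@"
        )
    else
        echo "zlibNX library not found at $ZLIBNX_LIB; running without preload" >&2
        "$@"
    fi
}

# Example use, where myprogram stands for any 64-bit program that uses zlib:
#   ZLIB_VERBOSE=1 with_zlibnx myprogram input.dat
```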
To disable acceleration for some operations when using zlibNX, use the following environment variable setting:
ZLIB_COMPRESS_ACCEL=0
export ZLIB_COMPRESS_ACCEL

Performance considerations when using zlibNX

When using zlibNX, a block of data is compressed by sending a request to one of the system's accelerators and waiting for a response; nearly all of the CPU and wall-clock time is spent waiting, not doing calculations, and the accelerator does the compression faster than the CPU could do it. This leads to much faster wall-clock time and greatly reduced CPU time when compared to doing the work on a single CPU thread. (When CPU is constrained, this advantage is even larger.) This typically means a reduction of CPU utilization of more than 90% (that is, more than 10x as much data processed per unit of CPU); workload wall-clock duration improvements vary significantly depending on other conditions, including input/output configuration.

The accelerator does not search as deeply for compression opportunities as some of the software implementations of zlib, so compressed sizes when using acceleration are typically larger than when running without acceleration, often on the order of 20%, depending on the nature of the source data. If a particular workload needs to maximize the amount of redundancy that is removed even at the cost of greatly increased time and CPU utilization, disable acceleration as previously described.

Each accelerator can process one compression request at a time, with the rest of the requests queued, and requests can come from any logical partition in the system. Thus, wait time (and CPU time spent waiting) for a request increases if other workloads in the system are making heavy use of accelerated compression (whether in this partition or other partitions). Also, since each accelerator can process only one compression request at a time, there are significant limits to the throughput improvement that can be obtained by submitting requests from multiple processor threads at the same time. In addition, running accelerated with more than one pigz thread imposes significant increases to CPU utilization and impairs the compression ratio because of repeated initialization of zlib streams; the impact to compression ratio of running with multiple pigz threads is larger for input data that is biased toward specific byte values without containing many repeated byte strings, and in the worst case, it can lead to compressed output that is larger than the original uncompressed data.

Both accelerated and non-accelerated compression perform better on large files and when invoked on multiple files with one call, because each invocation of the compression program has relatively significant overhead for program startup and for initialization of key data structures. Because the cost of initialization is somewhat higher with acceleration and the cost of compressing each block is so much lower, the relative improvement from operating on larger data is more significant for accelerated compression.

Table 1 shows sample lab results obtained when compressing the publicly available Silesia corpus on an IBM i partition with dedicated Power11 processors and no concurrent users of the accelerator.
Note: Other variables, especially storage I/O configuration, can greatly affect wall-clock time. The Silesia corpus is small enough that it does not maximize the advantage of the accelerator. This test is intended to show general tendencies, not to represent expected absolute results in other environments.
Table 1. Sample lab performance results that use compression acceleration
            Without acceleration                     With acceleration
Command     Byte reduction  Wall time  CPU time      Byte reduction  Wall time  CPU time
pigz -p1    67.9%           10.74      3.15          60.7%           0.76       0.14
pigz -p2    67.9%           5.44       3.25          58.4%           0.55       0.25
pigz -p8    67.9%           1.68       3.08          58.4%           0.55       0.29
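The improvement factors implied by Table 1 can be worked out directly from its values. The following sketch does that arithmetic with awk; ratio and relative_size are hypothetical helper names, and the numbers are copied from the table rather than measured again.

```shell
# ratio A B: print A divided by B, rounded to one decimal place.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# relative_size X Y: given two byte-reduction percentages, print the size of
# the output with reduction X relative to the output with reduction Y.
relative_size() {
    awk -v x="$1" -v y="$2" 'BEGIN { printf "%.2f\n", (100 - x) / (100 - y) }'
}

# pigz -p1, without acceleration compared to with acceleration:
ratio 10.74 0.76          # wall time factor, prints 14.1
ratio 3.15 0.14           # CPU time factor, prints 22.5

# pigz -p8 without acceleration compared to pigz -p1 with acceleration:
ratio 1.68 0.76           # wall time factor, prints 2.2
ratio 3.08 0.14           # CPU time factor, prints 22.0

# Accelerated pigz -p1 output size relative to non-accelerated output:
relative_size 60.7 67.9   # prints 1.22, that is, about 22% larger
```

In this sample, the accelerated single-thread run uses roughly one twenty-second of the CPU time of either non-accelerated run, and its output is about 22% larger.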
Given all of that information, IBM recommends three different ways of using pigz in PASE, depending on acceleration availability and workload goals.
  • Most users who are using Power11 and have reliable access to accelerator resources should use one pigz thread (pigz -p1) for the absolute best CPU performance, excellent wall-clock time, and good compression.
  • Users without access to acceleration (and those who choose to disable acceleration in order to maximize the compression ratio) should use eight pigz threads (pigz -p8), or possibly more depending on CPU configuration, for the best non-accelerated throughput, at the cost of more than 10 times the CPU that would have been used compressing the same data with acceleration enabled.
  • Users with access to acceleration with no appreciable contention on the accelerator and who can tolerate impaired compression as described previously can consider using two pigz threads (pigz -p2) to maximize use of the accelerator and maximize throughput at the cost of roughly doubling the CPU utilization. (If operating on multiple files, one can get similar throughput to pigz -p2 while preserving the pigz -p1 CPU utilization and compression ratio by using two parallel instances of pigz -p1, each covering about half of the data.)
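The last option, two parallel instances of pigz -p1, can be sketched as follows. This is an illustration under stated assumptions, not a recommended script: it assumes the files have already been divided into two directories of roughly equal total size, and compress_two_batches and PIGZ_CMD are hypothetical names.

```shell
# Command run for each instance. One pigz thread per instance, as in the text.
PIGZ_CMD="${PIGZ_CMD:-pigz -p1}"

# compress_two_batches DIR_A DIR_B
# Compress the files in each directory with its own pigz instance, run both
# instances at the same time, and succeed only if both succeed.
compress_two_batches() {
    # $PIGZ_CMD is intentionally unquoted so that "pigz -p1" splits into words.
    $PIGZ_CMD "$1"/* &
    pid_a=$!
    $PIGZ_CMD "$2"/* &
    pid_b=$!
    wait "$pid_a"
    rc_a=$?
    wait "$pid_b"
    rc_b=$?
    [ "$rc_a" -eq 0 ] && [ "$rc_b" -eq 0 ]
}

# Example use, with two hypothetical directories:
#   compress_two_batches /home/user/batch_a /home/user/batch_b
```

Because each instance keeps its single-thread configuration, each file is compressed exactly as it would be by pigz -p1 alone; only the scheduling changes.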

For more detailed performance results and analysis, see the IBM i on Power - Performance FAQ.