Accelerated data compression that uses zlibNX in PASE for i
The zlibNX library is an enhanced version of the zlib compression library that supports hardware-accelerated data compression and decompression on IBM® POWER® processor-based servers.
Computer files often contain redundant or predictable data, like repeated byte strings and bytes that are much more or less common than others; operating on files with a large amount of redundant data consumes unnecessary input/output resources, like storage and network. A family of lossless data compression techniques allows storing data with much of the redundancy removed while still being able to exactly reconstruct the original data. One of the most popular industry formats for lossless data compression is called DEFLATE, which is used in files with the suffixes .gz (often created by a program called gzip) and .zip; the most popular C library for managing DEFLATE data is called zlib. The IBM i Open Source Repository contains a program called pigz that uses zlib to create and operate on .gz files while making use of multiple processor threads, running within PASE.
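As a small illustration of the lossless property described above, the following sketch compresses a block of highly redundant data and checks that decompression restores it byte for byte. It is only a sketch: it assumes that gzip, cmp, wc, dd, and mktemp are on the PATH, and roundtrip_demo is a hypothetical helper name.

```shell
# Sketch: compress redundant data, restore it, and compare with the original.
# Assumes gzip, cmp, wc, dd, and mktemp are available on the PATH.
roundtrip_demo() {
    workdir=$(mktemp -d) || return 1
    # 4096 zero bytes: about as redundant as input data can be.
    dd if=/dev/zero of="$workdir/original" bs=4096 count=1 2>/dev/null
    gzip -c "$workdir/original" > "$workdir/original.gz"
    gzip -dc "$workdir/original.gz" > "$workdir/restored"
    original_size=$(wc -c < "$workdir/original")
    compressed_size=$(wc -c < "$workdir/original.gz")
    if cmp -s "$workdir/original" "$workdir/restored"; then
        result=identical
    else
        result=different
    fi
    # $((...)) normalizes any padding that wc adds around the number.
    echo "original=$((original_size)) compressed=$((compressed_size)) restored=$result"
    rm -rf "$workdir"
}
```

Calling roundtrip_demo prints the original size, the much smaller compressed size, and whether the restored file matches the original.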
Hardware acceleration through zlibNX requires that the following prerequisites are met:
- The partition is running on Power® (or later) processors.
- The partition processor configuration is in Power11 (or later) compatibility mode.
- Accelerator resources are available to the partition. For more information about accelerator resources, see the IBM i on Power - Performance FAQ.
Configuring and verifying the environment
The directory /QOpenSys/usr/zlibNX/lib contains a file named
libz.so.1 that provides the interface to use the zlibNX library.
To use pigz with zlibNX, install version 2.8.0 or later of the pigz RPM from the IBM i Open Source Repository, which has been updated to use the library in /QOpenSys/usr/zlibNX/lib if it is present.
Use the ZLIB_VERBOSE environment variable to determine whether acceleration is occurring:
dd if=/dev/zero of=testfile bs=4096 count=1
ZLIB_VERBOSE=1 pigz testfile
If output such as sanitize_accel: NX CC: 64 and deflate_nx() returning bstate: 3 is returned, acceleration support is active. If the output does not indicate that acceleration support is active, or if no visible output is displayed, perform the following steps:
- Ensure that any required PTFs are loaded and applied, and load the latest version of the pigz RPM.
- If the visible output contains nx_config_query() failed, a configuration prerequisite is not complete. For example, the partition might be running in an unsupported processor compatibility mode. Ensure that all prerequisites are satisfied.
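The check described above can be scripted. The following is a minimal sketch that classifies captured ZLIB_VERBOSE output; it assumes that the messages appear exactly as quoted in this section (real output may contain additional text or different values), and classify_zlibnx_output is a hypothetical helper name.

```shell
# Sketch: classify ZLIB_VERBOSE output that is read from standard input.
# The matched strings are the ones quoted in this section; treating them as
# exact markers is an assumption.
classify_zlibnx_output() {
    captured=$(cat)
    case "$captured" in
        *"nx_config_query() failed"*)
            echo "prerequisite-missing" ;;
        *"sanitize_accel: NX CC: 64"*)
            case "$captured" in
                *"deflate_nx() returning bstate: 3"*) echo "accelerated" ;;
                *) echo "unknown" ;;
            esac ;;
        "")
            echo "no-output" ;;
        *)
            echo "unknown" ;;
    esac
}

# Usage sketch (on a partition with pigz installed). The verbose messages go
# to standard error, so it is redirected into the pipe:
#   dd if=/dev/zero of=testfile bs=4096 count=1
#   ZLIB_VERBOSE=1 pigz testfile 2>&1 | classify_zlibnx_output
```

A result of no-output or unknown corresponds to the first troubleshooting step, and prerequisite-missing corresponds to the second.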
To use zlibNX with other programs that use zlib, set the LDR_PRELOAD64 environment variable:
LDR_PRELOAD64='/QOpenSys/usr/zlibNX/lib/libz.so.1(shr_64.o)'
export LDR_PRELOAD64
ZLIB_VERBOSE=1 can be used to determine whether zlibNX is in use; if it is, some output is generated on standard error as soon as the first deflate or inflate operation is performed.
Instead of LDR_PRELOAD64, the environment variables LIBPATH or LD_LIBRARY_PATH can be used to accomplish the same task, but those mechanisms are somewhat less efficient and more prone to causing unexpected behavior with other programs running in PASE.
To disable acceleration, set the ZLIB_COMPRESS_ACCEL environment variable to 0:
ZLIB_COMPRESS_ACCEL=0
export ZLIB_COMPRESS_ACCEL
Performance considerations when using zlibNX
When using zlibNX, a block of data is compressed by sending a request to one of the system's accelerators and waiting for a response; nearly all of the CPU and wall-clock time is spent waiting, not doing calculations, and the accelerator does the compression faster than the CPU could do it. This leads to much faster wall-clock time and greatly reduced CPU time when compared to doing the work on a single CPU thread. (When CPU is constrained, this advantage is even larger.) This typically means a reduction of CPU utilization of more than 90% (that is, more than 10x as much data processed per unit of CPU); workload wall-clock duration improvements vary significantly depending on other conditions, including input/output configuration.
The accelerator does not search as deeply for compression opportunities as some of the software implementations of zlib, so compressed sizes when using acceleration are typically larger than when running without acceleration, often on the order of 20%, depending on the nature of the source data. If a particular workload needs to maximize the amount of redundancy that is removed even at the cost of greatly increased time and CPU utilization, disable acceleration as previously described.
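For a workload that prefers maximum compression, acceleration can be turned off for a single command rather than for the whole session. The following is a minimal sketch, assuming that ZLIB_COMPRESS_ACCEL=0 (the setting shown earlier) is what disables acceleration; run_without_accel is a hypothetical helper name.

```shell
# Sketch: run one command with acceleration disabled, leaving the
# environment of the calling shell unchanged.
# Assumption: ZLIB_COMPRESS_ACCEL=0 disables zlibNX acceleration.
run_without_accel() {
    # env starts an external program, so "$@" must not be a shell function.
    env ZLIB_COMPRESS_ACCEL=0 "$@"
}

# Usage sketch:
#   run_without_accel pigz -p8 bigfile
```

Because the variable is set only for the started command, later commands in the same session still use acceleration.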
Each accelerator can process one compression request at a time, with the rest of the requests
queued, and requests can come from any logical partition in the system. Thus, wait time (and CPU
time spent waiting) for a request increases if other workloads in the system are making heavy use of
accelerated compression (whether in this partition or other partitions). Also, since each
accelerator can process only one compression request at a time, there are significant limits to the
throughput improvement that can be obtained by submitting requests from multiple processor threads
at the same time. In addition, running accelerated with more than one pigz thread imposes
significant increases to CPU utilization and impairs the compression ratio because of repeated
initialization of zlib streams; the impact to compression ratio of running with multiple pigz
threads is larger for input data that is biased toward specific byte values without containing many
repeated byte strings, and in the worst case, it can lead to compressed
output that is larger
than the original uncompressed data.
Both accelerated and non-accelerated compression perform better on large files and when invoked on multiple files with one call (because each invocation of the compression program has relatively significant overhead for program startup and for initialization of key data structures). Since the cost of initialization is somewhat higher with acceleration and the cost of compressing each block is so much lower, the relative improvement for operating on larger data is more significant for accelerated compressions.
| Command | Byte reduction (without acceleration) | Wall time (without acceleration) | CPU time (without acceleration) | Byte reduction (with acceleration) | Wall time (with acceleration) | CPU time (with acceleration) |
|---|---|---|---|---|---|---|
| pigz -p1 | 67.9% | 10.74 | 3.15 | 60.7% | 0.76 | 0.14 |
| pigz -p2 | 67.9% | 5.44 | 3.25 | 58.4% | 0.55 | 0.25 |
| pigz -p8 | 67.9% | 1.68 | 3.08 | 58.4% | 0.55 | 0.29 |
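The ratios quoted in the text can be checked against the pigz -p1 row of the table. This sketch assumes that "byte reduction" is the percentage by which the output is smaller than the input, so that the compressed size is 100 minus that value; summarize_row is a hypothetical helper name, and awk is assumed to be available.

```shell
# Sketch: derive CPU and size ratios from one row of the table.
# Arguments: byte reduction (%) and CPU time without acceleration,
# then byte reduction (%) and CPU time with acceleration.
summarize_row() {
    awk -v red_sw="$1" -v cpu_sw="$2" -v red_nx="$3" -v cpu_nx="$4" 'BEGIN {
        cpu_saving  = (1 - cpu_nx / cpu_sw) * 100    # percent less CPU time
        cpu_factor  = cpu_sw / cpu_nx                # data per unit of CPU
        # Assumption: compressed size is (100 - byte reduction) percent.
        size_growth = ((100 - red_nx) / (100 - red_sw) - 1) * 100
        printf "cpu_saving=%.1f%% cpu_factor=%.1fx size_growth=%.1f%%\n",
               cpu_saving, cpu_factor, size_growth
    }'
}

summarize_row 67.9 3.15 60.7 0.14
```

For the pigz -p1 row this prints cpu_saving=95.6% cpu_factor=22.5x size_growth=22.4%, which is consistent with the "more than 90%" CPU reduction and the "on the order of 20%" larger output described earlier.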
- Most users who are using Power11 and have reliable access to accelerator resources should use one pigz thread (pigz -p1) for the absolute best CPU performance, excellent wall-clock time, and good compression.
- Users without access to acceleration (and those who choose to disable acceleration in order to maximize the compression ratio) should use eight pigz threads (pigz -p8), or maybe even a higher number depending on CPU configuration, for the best non-accelerated throughput, at the cost of more than 10 times the CPU that would have been used compressing the same data with acceleration enabled.
- Users with access to acceleration with no appreciable contention on the accelerator and who can tolerate impaired compression as described previously can consider using two pigz threads (pigz -p2) to maximize use of the accelerator and maximize throughput at the cost of roughly doubling the CPU utilization. (If operating on multiple files, one can get similar throughput to pigz -p2 while preserving the pigz -p1 CPU utilization and compression ratio by using two parallel instances of pigz -p1, each covering about half of the data.)
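The last suggestion, two parallel instances of pigz -p1, can be sketched as follows. This is an illustration only: the files are split by position in the argument list, which is a simplification of "about half of the data" (balancing by file size would be closer), and run_worker, compress_in_two_halves, and the COMPRESS_CMD override are hypothetical names.

```shell
# Sketch: compress a list of files with two concurrent single-threaded workers.
# COMPRESS_CMD is a hypothetical override hook; it defaults to "pigz -p1".

# run_worker PARITY FILE... : pass every second file, starting at position
# PARITY (0 or 1), to a single invocation of the compression command.
run_worker() {
    parity=$1
    shift
    count=$#
    index=0
    for file in "$@"; do
        if [ $((index % 2)) -eq "$parity" ]; then
            set -- "$@" "$file"      # append the selected file
        fi
        index=$((index + 1))
    done
    shift "$count"                   # drop the original list, keep the selection
    if [ $# -gt 0 ]; then
        ${COMPRESS_CMD:-pigz -p1} "$@"
    fi
}

compress_in_two_halves() {
    run_worker 0 "$@" &
    pid_even=$!
    run_worker 1 "$@" &
    pid_odd=$!
    wait "$pid_even"
    status_even=$?
    wait "$pid_odd"
    status_odd=$?
    [ "$status_even" -eq 0 ] && [ "$status_odd" -eq 0 ]
}

# Usage sketch:
#   compress_in_two_halves file1 file2 file3 file4
```

Each worker starts the compression program once for its whole share of the files, which avoids paying the per-invocation startup overhead described earlier for every file.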
For more detailed performance results and analysis, see the IBM i on Power - Performance FAQ.