Note: This article was written by Anna Thomas, a former IBMer. It was originally published on developerWorks in October 2015 and unpublished in 2018, so we are republishing it in this community.
Recent technological advancements have focused on enabling higher performance for scientific and analytical workloads, which are inherently computation intensive. Single Instruction Multiple Data (SIMD) processing is one such enhancement for increased parallelism, and it requires both hardware and compiler support. The SIMD processing unit added to the new z13 processor provides the hardware support required for executing SIMD code. The compiler support is provided through vector programming (built-in functions), which was added in the z/OS XL C/C++ V2R1M1 compiler.
The AutoSIMD compiler optimization automatically transforms scalar code into SIMD code. It was first implemented in the z/OS V2R2 XL C/C++ compiler. This article focuses on three advantages of the AutoSIMD feature:
- The effort involved in efficient code generation for the SIMD hardware is transferred from the application developer to the compiler. You need not rewrite your applications using Vector Programming to exploit the SIMD instruction set.
- This feature is equipped with knowledge of the z/Architecture and takes advantage of the SIMD instructions where they are best suited. It generates efficient vector code in conjunction with scalar code.
- The AutoSIMD optimization is strategically placed among other compiler optimizations to maximize synergy between transformations and deliver the best possible transformation sequence for the application code.
Simdization transforms code from a scalar form (a single operation taking a single set of operands) to a vector form, that is, a single operation taking multiple sets of operands. In other words, simdization follows the Single Instruction Multiple Data (SIMD) model.
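As a concrete illustration, here is a minimal sketch in plain C (not compiler output; the function names are invented for this example) of the difference between the scalar and SIMD forms:

```c
/* Scalar form: one operation consumes one set of operands. */
int add_one(int a, int b) { return a + b; }

/* SIMD model, simulated in plain C: one conceptual operation consumes
 * four sets of operands at once. On z13, a single 128-bit vector
 * instruction would process all four 32-bit elements together. */
void add_four(const int a[4], const int b[4], int out[4]) {
    for (int lane = 0; lane < 4; lane++)
        out[lane] = a[lane] + b[lane];
}
```

In the scalar form the loop body executes once per element; in the vector form one instruction covers all four lanes.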
The AutoSIMD optimization performs automatic simdization for loops and code blocks. It is run after other optimizations that can potentially expose new opportunities for simdization. The AutoSIMD optimization contains three major phases:
- The safety analysis phase to identify if the transformation is safe to perform
- The profitability analysis phase to evaluate whether the SIMD code to be generated is better than the original scalar code
- The SIMD code generation phase to generate the vector code in place of the scalar code
The first phase is the safety analysis phase. It studies various code properties such as data types and loop properties, and decides if the loops or code blocks are viable candidates for simdization.
The second phase, the profitability analysis phase, weighs the cost versus the benefit of generating equivalent vector code for a given piece of scalar code. It takes into account various factors such as the z/Architecture and the operations needed for setting up the vector code. Not all SIMD transformations are beneficial compared to the equivalent scalar code; you'll see one such scenario in the next section.
The third phase is the code generation phase. It uses information from the profitability analysis and generates the SIMD code sequence instead of the scalar code sequence. The final outcome produces the exact same result as the scalar case.
The AutoSIMD optimization is turned on by default when the HOT option is in effect, FLOAT(AFP(NOVOLATILE)) is set, TARGET(zOSV2R2) is specified, and the application is compiled with the z/OS V2R2 XL C/C++ compiler at ARCH(11). The optimization can be controlled with the AUTOSIMD/NOAUTOSIMD suboption of the VECTOR option. For more details on the AUTOSIMD suboption, refer to the z/OS V2R2 XL C/C++ User's Guide.
This article provides a few examples with source code and a snippet of the final pseudo-assembly generated when the AutoSIMD optimization is in effect. The pseudo-assembly contains only the vector instructions relevant to the specific source code; use the LIST option to see the complete listing. The vector programming equivalent of the source code is also included for all examples. Vector programming built-ins are explained in detail in the z/OS V2R2 XL C/C++ Programming Guide.
Note: All the loop examples in this article use an upper bound that is not known at compile time. Each loop is unrolled to pack multiple statements into a single vector statement, and the end result is a 2x or 4x reduction in the number of iterations.
Example 1. AutoSIMD on a simple loop
In this loop example, each iteration scales an element of array b by factor x, adds the corresponding element of array a, and stores the result into array a.
Listing 1. Source code
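The original listing is not reproduced here; a minimal sketch consistent with the description (the array names a and b and the factor x come from the text, while the int element type and the function name are assumptions, the element type chosen to match the 4x packing mentioned below) might look like:

```c
/* Sketch of Listing 1: each iteration scales b[i] by x, adds a[i],
 * and stores the result back into a[i]. The trip count n is not
 * known at compile time. */
void scale_and_add(int *a, const int *b, int x, int n) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i] * x;
}
```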
Listing 2. Vector programming equivalent for Listing 1
Table 1 shows the pseudo-assembly of the source code, when compiled with AutoSIMD versus without AutoSIMD.
Table 1. Pseudo-Assembly of Listing 1 with versus without AutoSIMD
Although the pseudo-assembly with AutoSIMD contains more instructions statically, the amount of data processed at each iteration is 4x that of the scalar version.
Example 2. AutoSIMD on loops with dependencies
The example in Listing 3 is a loop with dependencies. This example shows how the placement of the AutoSIMD optimization facilitates the simdization of the loop.
This loop updates arrays 'a' and 'd' only when the value being computed is the maximum of the current array value versus a computed value.
Listing 3. Source code
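The original listing is not reproduced here; the following is a hypothetical sketch that matches the description (the array names a and d come from the text, while b, x, the element type, and the function name are assumptions), showing one statement with no loop-carried dependence and one that depends on the previous iteration:

```c
/* Sketch of Listing 3: a[i] is updated independently, while each d[i]
 * depends on d[i-1], the value computed in the previous iteration. */
void max_update(int *a, int *d, const int *b, int x, int n) {
    for (int i = 1; i < n; i++) {
        int t = b[i] * x;
        if (t > a[i])
            a[i] = t;            /* no loop-carried dependence       */
        if (a[i] > d[i - 1])
            d[i] = a[i];         /* loop-carried dependence on d[i-1] */
        else
            d[i] = d[i - 1];
    }
}
```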
Array 'd' has a loop-carried dependence on itself. During the safety analysis phase in AutoSIMD, the loop is deemed an unsafe candidate for simdization unless this dependency is removed. However, the loop distribution optimization, which runs before the AutoSIMD optimization, distributes the statements calculating array 'd' into a separate loop. This allows AutoSIMD to simdize the loop that calculates array 'a'.
The source code in Listing 3 can be separated into two loops, so that the first loop, which has no loop-carried dependencies, can be rewritten using the vector built-ins. Note that the single source loop is split into two loops for clarity.
Listing 4. Vector programming equivalent for Listing 3
Loop distribution splits the source loop into two loops, and AutoSIMD simdizes loop 1, which updates array 'a'. For brevity, only the pseudo-assembly statements for loop 1 are shown in Table 2. The complete listing produced with LIST shows the unrolled loop 1 with vector instructions scheduled apart to avoid dependency stalls. Note that other compiler optimizations optimize loop 2, which updates array 'd'.
Table 2. Pseudo-assembly for Listing 3 with versus without AutoSIMD
Example 3. AutoSIMD on code blocks outside loops
The modification of an array of doubles is shown in this example. The vector facility for z/Architecture provides support for Binary Floating Point (BFP) operations.
Listing 5. Source code
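The original listing is not reproduced here; a hypothetical sketch of a straight-line (non-loop) code block on an array of doubles, consistent with the multiply-add pattern discussed below (all names are assumptions), might look like:

```c
/* Sketch of Listing 5: straight-line code updating a pair of doubles
 * with a multiply-add pattern. One vector fused multiply-add (VFMA)
 * can cover both elements, where scalar code needs two MADB
 * instructions. */
void update_pair(double a[2], const double b[2], double x) {
    a[0] = a[0] * x + b[0];
    a[1] = a[1] * x + b[1];
}
```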
Listing 6. Vector programming equivalent for Listing 5
AutoSIMD generates vector instructions for the source code block, using the vector fused multiply-add instruction VFMA. Without AutoSIMD, as Table 3 shows, the corresponding scalar code is generated twice, once per double, where MADB is the scalar version of VFMA.
Table 3. Pseudo-assembly for Listing 5 with versus without AutoSIMD
Example 4. Non-profitable SIMD situations are identified and not simdized
Listing 7 is an example where simdization is not beneficial, due to a better performing scalar hardware instruction. The AutoSIMD profitability analysis phase identifies this situation and avoids simdization of this loop.
Listing 7. Source code
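The original listing is not reproduced here; based on the discussion that follows, a hypothetical sketch (names and types are assumptions) would be a plain byte-copy loop:

```c
/* Sketch of Listing 7: a byte-copy loop. On z/Architecture the scalar
 * form compiles to MVC, which moves up to 256 bytes in a single
 * instruction -- better than a sequence of 16-byte vector loads and
 * stores. */
void copy_bytes(char *b, const char *a, int n) {
    for (int i = 0; i < n; i++)
        b[i] = a[i];
}
```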
Listing 8. Pseudo-assembly when AutoSIMD in effect for Listing 7
When AutoSIMD is in effect, this code is left in its scalar form, which uses the MVC instruction to move 256 bytes from source array a to destination array b. If the code were simdized using vector programming, the generated code would use 16-byte vector loads and stores, compared to 256 bytes with MVC. When the number of iterations is large, this becomes a long sequence of dependent vector loads and stores that can cause a performance degradation. Hence, AutoSIMD avoids generating the code sequence shown in Listing 9.
Listing 9. Vector programming equivalent for Listing 7
Listing 10. Pseudo-assembly of vector programming equivalent in Listing 9
With the advent of the SIMD unit in the new z13 processor, increased data parallelism is available for existing analytic applications. This article introduced the AutoSIMD compiler optimization in the z/OS V2R2 XL C/C++ compiler to automatically leverage SIMD opportunities in existing applications. The optimization safely transforms scalar code to vector code after considering the profitability of this transformation. The AutoSIMD optimization in combination with other compiler optimizations tries to generate efficient object code for improved execution time.