Achieving high performance for Advanced Encryption Standard (AES) applications
How to use the IBM POWER8 built-in functions of IBM XL C/C++ and Fortran compilers
AES is a US government standard for
encryption, and is widely used by the US government and industries around the world. It
protects classified information and sensitive data. The new IBM XL compilers for C/C++
and Fortran provide support for AES built-in functions on both little endian and big
endian systems. Using the IBM POWER8 built-in functions, AES applications can achieve
high performance on IBM POWER8 processors.
Generally available in 2014, IBM XL compilers deliver support for POWER8
functionality, unleashing the power of the latest POWER8 processors. Among the new
features that directly invoke the architecture instructions at the application level
are the POWER8 cryptography built-in functions for AES. These built-in functions map
each of the major AES operations, one-to-one, to the corresponding instruction at
the hardware level on POWER8 processors. This article explores the new built-in
cryptography functions for AES encryption and decryption supported by IBM XL
compilers.
Versions of IBM XL compilers that support AES built-in functions
AES built-in functions are supported in the following IBM XL compilers:
- C/C++ version 13 and higher
- Fortran version 15 and higher
Depending on the platform, the versions of the compilers supporting POWER8 features and enhancements are shown in Table 1.
Table 1. Versions of XL compilers with POWER8 built-in functions
Language | Big endian compiler | Big endian OS | Little endian compiler | Little endian OS |
---|---|---|---|---|
C/C++ | XL C/C++ V13.1 | AIX®, Linux® | XL C/C++ V13.1.1 | Linux |
Fortran | XL Fortran V15.1 | AIX, Linux | XL Fortran V15.1.1 | Linux |
A primer of AES
In May 2002, AES became the US government standard for encryption. It was the first publicly accessible and open cipher approved by the US National Security Agency (NSA) to protect classified information up to the top secret level when used in an NSA approved cryptographic module. AES is a block cipher algorithm that uses a combination of both substitution and permutation on blocks of text and a key in order to produce the output blocks. Each block, also called a state, is a 4x4 column-major matrix of bytes (128-bit value).
Note: The column-major matrix of a state is stipulated by AES and is independent of the language implementation of native storage orders in the memory.
Now, let's take a look at how AES works and how the architecture maps to specific POWER8 built-in functions.
An example of the state representation of an input text stream is shown in Figure 1.
Figure 1. State representation of a text stream
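The column-major layout can be sketched in portable C. This is only an illustration of the mapping described above, not code from the XL compilers; the function names `bytes_to_state` and `state_to_bytes` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Fill a 4x4 state from 16 input bytes, column by column:
 * state[row][col] = in[row + 4 * col], as stipulated by FIPS-197. */
void bytes_to_state(const uint8_t in[16], uint8_t state[4][4])
{
    assert(in && state);
    for (int col = 0; col < 4; col++)
        for (int row = 0; row < 4; row++)
            state[row][col] = in[row + 4 * col];
}

/* Inverse mapping: flatten the state back into 16 bytes. */
void state_to_bytes(uint8_t state[4][4], uint8_t out[16])
{
    assert(state && out);
    for (int col = 0; col < 4; col++)
        for (int row = 0; row < 4; row++)
            out[row + 4 * col] = state[row][col];
}
```

With this mapping, the first four bytes of the text stream form the first column of the state, the next four bytes form the second column, and so on.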

AES is a symmetric key algorithm that uses the same key for both encryption and decryption processes. The legal values for the length of an AES key are 16, 24, and 32 in bytes or equivalently 128, 192 and 256 in bits. These values correspond to the key length (Nk) values 4, 6 and 8, specified by the standard. The key length used by an AES application dictates the number of transformation rounds to convert plain text to cipher text, and from cipher text to plain text. The number of rounds (Nr) computation is as follows:
Number of rounds (Nr) = 10 + (Key length in bytes - 16) / 4
The key length, block size, and the number of rounds for 128-bit, 192-bit, and 256-bit AES are displayed in Table 2.
Table 2. Legal values for 128-bit, 192-bit, and 256-bit AES
Cipher | Key length (in bits) | Key length (in bytes) | Key length (Nk words) | Block size (Nb words) | Number of rounds (Nr) |
---|---|---|---|---|---|
128-bit AES | 128 | 16 | 4 | 4 | 10 |
192-bit AES | 192 | 24 | 6 | 4 | 12 |
256-bit AES | 256 | 32 | 8 | 4 | 14 |
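
The formula for the number of rounds can be checked against Table 2 with a few lines of portable C. This is a minimal sketch; the names `aes_rounds` and `aes_key_words` are hypothetical helpers, not part of any compiler interface.

```c
#include <assert.h>
#include <stddef.h>

/* Number of AES rounds (Nr) for a key length given in bytes,
 * or 0 if the length is not one of the legal values 16, 24, or 32. */
int aes_rounds(size_t key_bytes)
{
    if (key_bytes != 16 && key_bytes != 24 && key_bytes != 32)
        return 0;
    int nr = 10 + (int)(key_bytes - 16) / 4;
    assert(nr == 10 || nr == 12 || nr == 14);  /* matches Table 2 */
    return nr;
}

/* Key length in 32-bit words (Nk), or 0 for an illegal key length. */
int aes_key_words(size_t key_bytes)
{
    return aes_rounds(key_bytes) ? (int)(key_bytes / 4) : 0;
}
```

For the three legal key lengths, the result is the same as Nr = Nk + 6.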
The key for each round is derived from the given cipher key using the Rijndael key schedule (Rijndael is the algorithm selected by the US National Institute of Standards and Technology (NIST) as the basis for AES). As shown in Figure 2, regardless of the length of the key in use, each round key is a 4x4 column-major matrix of bytes, which is 128 bits in length. Note that IBM XL compilers do not provide a built-in function for key expansion.
Figure 2. Relationship between key expansion and rounds in AES encryption

The initial round of AES encryption consists of only the AddRoundKey() transformation, in which a round key is combined with the state by an XOR operation. All other rounds, with the exception of the last round, consist of four transformations: SubBytes(), ShiftRows(), MixColumns(), and AddRoundKey(). The final round is slightly different: it has only SubBytes(), ShiftRows(), and AddRoundKey().
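The AddRoundKey() step is simple enough to sketch in portable C. The function name `add_round_key` and the array names are hypothetical; the sample values are the plain text input and the 128-bit cipher key from Appendix B of FIPS-197. For 128-bit AES, the round key of the initial round is the cipher key itself.

```c
#include <assert.h>
#include <stdint.h>

/* Plain text input and 128-bit cipher key from FIPS-197, Appendix B. */
const uint8_t fips197_input[16] = {
    0x32, 0x43, 0xF6, 0xA8, 0x88, 0x5A, 0x30, 0x8D,
    0x31, 0x31, 0x98, 0xA2, 0xE0, 0x37, 0x07, 0x34 };
const uint8_t fips197_cipher_key[16] = {
    0x2B, 0x7E, 0x15, 0x16, 0x28, 0xAE, 0xD2, 0xA6,
    0xAB, 0xF7, 0x15, 0x88, 0x09, 0xCF, 0x4F, 0x3C };

/* AddRoundKey(): XOR every byte of the state with the round key byte
 * at the same position. Applying it twice restores the state. */
void add_round_key(uint8_t state[16], const uint8_t round_key[16])
{
    assert(state && round_key);
    for (int i = 0; i < 16; i++)
        state[i] ^= round_key[i];
}
```

Applying `add_round_key` to the Appendix B input with the cipher key gives the state that begins with 0x19, 0x3D, 0xE3, 0xBE, which is the round 1 input used in the test programs later in this article.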
Although it uses the same cipher key as AES encryption, AES decryption is slightly different. Its initial round consists of only the AddRoundKey( ) transformation. All intermediate rounds consist of InverseShiftRows( ), InverseSubBytes( ), AddRoundKey( ), and InverseMixColumns( ). The final round has only InverseShiftRows( ), InverseSubBytes( ), and AddRoundKey( ).
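To illustrate how a decryption transformation undoes its encryption counterpart, the following portable C sketch implements ShiftRows( ) and InverseShiftRows( ) on a 16-byte column-major state. The function names are hypothetical, and the code is an illustration of the FIPS-197 definitions rather than what the hardware does internally.

```c
#include <assert.h>
#include <stdint.h>

/* ShiftRows(): rotate row r of the state left by r positions.
 * The state is stored column-major: element (row, col) is at row + 4 * col. */
void shift_rows(const uint8_t in[16], uint8_t out[16])
{
    assert(in && out && in != out);
    for (int col = 0; col < 4; col++)
        for (int row = 0; row < 4; row++)
            out[row + 4 * col] = in[row + 4 * ((col + row) % 4)];
}

/* InverseShiftRows(): rotate row r of the state right by r positions. */
void inverse_shift_rows(const uint8_t in[16], uint8_t out[16])
{
    assert(in && out && in != out);
    for (int col = 0; col < 4; col++)
        for (int row = 0; row < 4; row++)
            out[row + 4 * ((col + row) % 4)] = in[row + 4 * col];
}
```

Calling `inverse_shift_rows` on the output of `shift_rows` returns the original state, which is the kind of pairing that the decryption rounds rely on.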
The symmetry between the corresponding operations in AES encryption and AES decryption is displayed in Figure 3.
Figure 3. Symmetry between AES encryption and AES decryption

IBM XL compiler built-in functions for AES
IBM POWER8 architectures provide hardware level support for the most critical functions used by AES. IBM XL compilers provide a set of cryptography built-in functions, mapping one-to-one to the hardware instructions. They are alternatives to managing hardware registers through assembly language. They provide access to these optimized instruction sets for software engineers. The IBM XL compiler built-in functions for AES include:
- vcipher (state_array, round_key)
- vcipherlast (state_array, round_key)
- vncipher (state_array, round_key)
- vncipherlast (state_array, round_key)
- vsbox (state_array)
vcipher (state_array, round_key)
This built-in function performs one round of the AES cipher operation on intermediate state_array using a given round_key. This is used for all rounds during the encryption process, except the initial and the final rounds. The interfaces of the vcipher built-in function in C/C++, Fortran and assembly languages are listed in this section.
- C/C++
vector unsigned char __vcipher (vector unsigned char state_array, vector unsigned char round_key)
- Fortran
VCIPHER (STATE_ARRAY, ROUND_KEY) where both arguments and the result are unsigned vectors of kind 1.
- Assembly
vcipher VRT,VRA,VRB
  - VRT holds the result, representing the new intermediate state of the cipher operation.
  - VRA holds the state, representing the intermediate state array during AES cipher operation.
  - VRB holds the round key.
vcipherlast (state_array, round_key)
This built-in function performs a final round of the AES cipher operation on intermediate state_array using a given round_key. This is used for the final round of the encryption process. Its output is the ciphertext. The interfaces of the vcipherlast built-in function in C/C++, Fortran and assembly languages are listed in this section.
- C/C++
vector unsigned char __vcipherlast (vector unsigned char state_array, vector unsigned char round_key)
- Fortran
VCIPHERLAST (STATE_ARRAY, ROUND_KEY) where both arguments and the result are unsigned vectors of kind 1.
- Assembly
vcipherlast VRT,VRA,VRB
  - VRT holds the result, representing the final state of the cipher operation.
  - VRA holds the state, representing the intermediate state array during AES cipher operation.
  - VRB holds the round key.
The pseudo codes of AES encryption when using user-defined functions and using POWER8 built-in functions are shown in Figure 4.
Figure 4. Relationship between AES operations and AES built-in functions in AES encryption

vncipher (state_array, round_key)
This built-in function performs one round of the AES decipher operation on intermediate state_array using a given round_key. This is used for all rounds during the decryption process, except the initial and the final rounds. The interfaces of the vncipher built-in function in C/C++, Fortran and assembly languages are listed in this section.
- C/C++
vector unsigned char __vncipher (vector unsigned char state_array, vector unsigned char round_key)
- Fortran
VNCIPHER (STATE_ARRAY, ROUND_KEY) where both arguments and the result are unsigned vectors of kind 1.
- Assembly
vncipher VRT,VRA,VRB
  - VRT holds the result, representing the new intermediate state of the inverse cipher operation.
  - VRA holds the state, representing the intermediate state array during AES inverse cipher operation.
  - VRB holds the round key.
vncipherlast (state_array, round_key)
This built-in function performs a final round of the AES decipher operation on intermediate state_array using a given round_key. This is used for the final round of the decryption process. Its output is the original plain text. The interfaces of the vncipherlast built-in function in C/C++, Fortran, and assembly languages are listed in this section.
- C/C++
vector unsigned char __vncipherlast(vector unsigned char state_array, vector unsigned char round_key)
- Fortran
VNCIPHERLAST (STATE_ARRAY, ROUND_KEY) where both arguments and the result are unsigned vectors of kind 1.
- Assembly
vncipherlast VRT,VRA,VRB
  - VRT holds the result, representing the final state of the inverse cipher operation.
  - VRA holds the state, representing the intermediate state array during AES inverse cipher operation.
  - VRB holds the round key.
The pseudo codes of AES decryption when using user-defined functions and using POWER8 built-in functions are presented in Figure 5.
Figure 5. Relationship between AES operations and AES built-in functions in AES decryption

The relationship of the POWER8 built-in functions to AES encryption and AES decryption is illustrated in Figure 6.
Figure 6. AES built-in functions in AES encryption and decryption

In addition, there is a built-in function to perform the SubBytes( ) operation, as defined in FIPS-197, on state_array.
vsbox (state_array)
This built-in function performs the SubBytes( ) operation, a nonlinear byte substitution operating independently on each byte of the state using a pre-defined substitution table (S-box). The interfaces of the vsbox built-in function in C/C++, Fortran, and assembly languages are listed in this section.
- C/C++
vector unsigned char __vsbox(vector unsigned char state_array)
- Fortran
VSBOX(STATE_ARRAY) where the argument and result are unsigned vectors of kind 1.
- Assembly
vsbox VRT,VRA
  - VRT holds the result of applying the SubBytes() transformation to the state, as defined in FIPS-197.
  - VRA holds the state, representing the intermediate state array during AES cipher operation.
Figure 7. S-box lookup operation

For example, if the input is {68}, the high-order hexadecimal digit (6) selects the row and the low-order digit (8) selects the column of the S-box. The value at the intersection of row 6 and column 8 is the substitution value, so the output is {45}.
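The S-box does not have to be stored as a table; FIPS-197 defines each entry as a multiplicative inverse in GF(2^8) followed by an affine transformation. The following portable C sketch computes an entry from that definition and also extracts the row and column indices used in the table lookup. All function names are hypothetical, and the exhaustive search for the inverse is chosen for clarity, not speed.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t product = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            product ^= a;
        uint8_t carry = a & 0x80;
        a = (uint8_t)(a << 1);
        if (carry)
            a ^= 0x1B;          /* reduce by the AES polynomial */
        b >>= 1;
    }
    return product;
}

/* Multiplicative inverse by exhaustive search; 0 is mapped to 0. */
static uint8_t gf_inv(uint8_t a)
{
    for (int b = 1; b < 256; b++)
        if (gf_mul(a, (uint8_t)b) == 1)
            return (uint8_t)b;
    return 0;
}

static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* S-box entry: inverse in GF(2^8), then the FIPS-197 affine transformation. */
uint8_t aes_sbox(uint8_t in)
{
    uint8_t x = gf_inv(in);
    assert(in == 0 || gf_mul(in, x) == 1);
    return (uint8_t)(x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3)
                       ^ rotl8(x, 4) ^ 0x63);
}

/* Row and column of the entry in the 16x16 S-box table. */
unsigned sbox_row(uint8_t in) { return in >> 4; }
unsigned sbox_col(uint8_t in) { return in & 0x0F; }
```

For the input {68}, `sbox_row` returns 6, `sbox_col` returns 8, and `aes_sbox` returns 0x45, in agreement with the table lookup described above.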
Test programs using cryptography built-in functions of IBM XL C/C++ V13.1 compiler
The following snippets use the XL C/C++ V13.1 cryptography built-in functions on POWER8 architecture.
- An intermediate round of AES encryption with __vcipher
- A SubBytes( ) operation, that is, an S-box lookup, with __vsbox
Example of __vcipher
The input is the intermediate block and the round key listed in Appendix B of Federal Information Processing Standards Publication 197 (FIPS-197).
Figure 8. Transformation example based on Appendix B of FIPS-197

At the beginning of round 1, the input is
{ 0x19, 0x3d, 0xE3, 0xBE, 0xA0, 0xF4, 0xE2, 0x2B, 0x9A, 0xC6, 0x8D, 0x2A, 0xE9, 0xF8, 0x48, 0x08 }
The round key corresponding to the round is
{ 0xA0, 0xFA, 0xFE, 0x17, 0x88, 0x54, 0x2C, 0xB1, 0x23, 0xA3, 0x39, 0x39, 0x2A, 0x6C, 0x76, 0x05 }
After going through SubBytes( ), ShiftRows( ), MixColumns( ), and AddRoundKey( ) in round 1, the output will be
{ 0xA4, 0x9C, 0x7F, 0xF2, 0x68, 0x9F, 0x35, 0x2B, 0x6B, 0x5B, 0xEA, 0x43, 0x02, 0x6A, 0x50, 0x49 }
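For comparison, the same round can be written out in portable C without any POWER8 support. This is a minimal reference sketch of what a single intermediate round computes according to FIPS-197; it is not the compiler's or the hardware's implementation, and the names `aes_round`, `sbox`, `gf_mul`, and `xtime` are hypothetical. The S-box is computed from its definition to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t product = 0;
    for (; b; b >>= 1, a = xtime(a))
        if (b & 1)
            product ^= a;
    return product;
}

/* S-box entry: inverse in GF(2^8) (0 maps to 0), then affine transform. */
static uint8_t sbox(uint8_t in)
{
    uint8_t x = 0;
    for (int b = 1; b < 256 && in; b++)
        if (gf_mul(in, (uint8_t)b) == 1) { x = (uint8_t)b; break; }
    uint8_t y = x;
    for (int n = 1; n <= 4; n++)
        y ^= (uint8_t)((x << n) | (x >> (8 - n)));
    return (uint8_t)(y ^ 0x63);
}

/* One intermediate encryption round on a column-major 16-byte state. */
void aes_round(uint8_t state[16], const uint8_t round_key[16])
{
    uint8_t t[16];
    assert(state && round_key);

    /* SubBytes() and ShiftRows(): substitute while rotating row r left by r. */
    for (int col = 0; col < 4; col++)
        for (int row = 0; row < 4; row++)
            t[row + 4 * col] = sbox(state[row + 4 * ((col + row) % 4)]);

    /* MixColumns() followed by AddRoundKey(), one column at a time. */
    for (int col = 0; col < 4; col++) {
        const uint8_t *a = &t[4 * col];
        const uint8_t *k = &round_key[4 * col];
        uint8_t *s = &state[4 * col];
        s[0] = (uint8_t)(xtime(a[0]) ^ xtime(a[1]) ^ a[1] ^ a[2] ^ a[3] ^ k[0]);
        s[1] = (uint8_t)(a[0] ^ xtime(a[1]) ^ xtime(a[2]) ^ a[2] ^ a[3] ^ k[1]);
        s[2] = (uint8_t)(a[0] ^ a[1] ^ xtime(a[2]) ^ xtime(a[3]) ^ a[3] ^ k[2]);
        s[3] = (uint8_t)(xtime(a[0]) ^ a[0] ^ a[1] ^ a[2] ^ xtime(a[3]) ^ k[3]);
    }
}
```

Given the round 1 input and round key shown above, `aes_round` produces the round 1 output shown above. On POWER8, the whole function corresponds to one __vcipher call.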
Listing 1: Contents of the test__vcipher.c file
    int main(){
        /* Appendix B p.33 Round 1: state at the start of the round */
        vector unsigned char state = {
            0x19, 0x3d, 0xE3, 0xBE, 0xA0, 0xF4, 0xE2, 0x2B,
            0x9A, 0xC6, 0x8D, 0x2A, 0xE9, 0xF8, 0x48, 0x08 };
        /* Round key for round 1 */
        vector unsigned char roundKey = {
            0xA0, 0xFA, 0xFE, 0x17, 0x88, 0x54, 0x2C, 0xB1,
            0x23, 0xA3, 0x39, 0x39, 0x2A, 0x6C, 0x76, 0x05 };
        /* Expected state at the end of round 1 */
        vector unsigned char expect = {
            0xA4, 0x9C, 0x7F, 0xF2, 0x68, 0x9F, 0x35, 0x2B,
            0x6B, 0x5B, 0xEA, 0x43, 0x02, 0x6A, 0x50, 0x49 };
        /* One intermediate encryption round */
        vector unsigned char answer = __vcipher(state, roundKey);
        /* Exit status is 0 when the result matches the expected state */
        return answer == expect ? 0 : 1;
    }
Example of __vsbox
Again, the input is the intermediate block listed in Appendix B of Federal Information Processing Standards Publication 197 (FIPS-197).
Figure 9. SubByte operation example based on Appendix B of FIPS-197

At the beginning of round 1, the input is:
{ 0x19, 0x3d, 0xE3, 0xBE, 0xA0, 0xF4, 0xE2, 0x2B, 0x9A, 0xC6, 0x8D, 0x2A, 0xE9, 0xF8, 0x48, 0x08 }
After going through SubBytes( ), the output, which is the result of the pre-defined S-box lookup, will be:
{ 0xD4, 0x27, 0x11, 0xAE, 0xE0, 0xBF, 0x98, 0xF1, 0xB8, 0xB4, 0x5D, 0xE5, 0x1E, 0x41, 0x52, 0x30 }
Listing 2: Contents of the test__vsbox.c file
    int main(){
        /* Appendix B p.33 Round 1: state at the start of the round */
        vector unsigned char state = {
            0x19, 0x3d, 0xE3, 0xBE, 0xA0, 0xF4, 0xE2, 0x2B,
            0x9A, 0xC6, 0x8D, 0x2A, 0xE9, 0xF8, 0x48, 0x08 };
        /* Expected state after SubBytes( ) */
        vector unsigned char expect = {
            0xD4, 0x27, 0x11, 0xAE, 0xE0, 0xBF, 0x98, 0xF1,
            0xB8, 0xB4, 0x5D, 0xE5, 0x1E, 0x41, 0x52, 0x30 };
        /* S-box lookup on all 16 bytes of the state */
        vector unsigned char answer = __vsbox(state);
        /* Exit status is 0 when the result matches the expected state */
        return answer == expect ? 0 : 1;
    }
The code in Listing 1 and Listing 2 can be compiled with a command such as the following, shown here for Listing 2:
xlc -o test_vsbox -qarch=pwr8 -qaltivec ./test__vsbox.c
When compiling at higher optimization levels, the additional option -qtune=pwr8 is recommended. The POWER8-specific options -qarch=pwr8 and -qtune=pwr8 control the code generated by the compiler: they tell the compiler to adjust instruction selection, scheduling, and other optimizations to achieve the best performance on POWER8 processors.
Code generated by the compiler
The following are two similar test programs, one using the POWER8 cryptography built-in function and one not.
Listing 3: Contents of the test__vcipher.c file, using built-in function
    int main(){
        /* Appendix B p.33 Round 1: state at the start of the round */
        vector unsigned char state = {
            0x19, 0x3d, 0xE3, 0xBE, 0xA0, 0xF4, 0xE2, 0x2B,
            0x9A, 0xC6, 0x8D, 0x2A, 0xE9, 0xF8, 0x48, 0x08 };
        /* Round key for round 1 */
        vector unsigned char roundKey = {
            0xA0, 0xFA, 0xFE, 0x17, 0x88, 0x54, 0x2C, 0xB1,
            0x23, 0xA3, 0x39, 0x39, 0x2A, 0x6C, 0x76, 0x05 };
        /* Expected state at the end of round 1 */
        vector unsigned char expect = {
            0xA4, 0x9C, 0x7F, 0xF2, 0x68, 0x9F, 0x35, 0x2B,
            0x6B, 0x5B, 0xEA, 0x43, 0x02, 0x6A, 0x50, 0x49 };
        /* One intermediate encryption round */
        vector unsigned char answer = __vcipher(state, roundKey);
        /* Exit status is 0 when the result matches the expected state */
        return answer == expect ? 0 : 1;
    }
Both programs can be compiled with -c and -qlist to generate the listing file (or alternatively, with -c and -S to generate the assembly files).
Figure 10. Listing file when using built-in function

The listing file of test__vcipher.c, the program with POWER8 built-in function, reveals that the program quickly terminates after the POWER8 instruction vcipher is called. The instruction count is 35.
On the other hand, the test program not using the POWER8 built-in functions is shown in Listing 4.
Listing 4: Contents of file test_no__vsbox.c, without built-in function
    /* User-defined AES transformations, implemented elsewhere */
    extern vector unsigned char mySubBytes(vector unsigned char);
    extern vector unsigned char myShiftRows(vector unsigned char);
    extern vector unsigned char myMixColumns(vector unsigned char);
    extern vector unsigned char myAddRoundKey(vector unsigned char, vector unsigned char);

    int main(){
        /* Appendix B p.33 Round 1: state at the start of the round */
        vector unsigned char state = {
            0x19, 0x3d, 0xE3, 0xBE, 0xA0, 0xF4, 0xE2, 0x2B,
            0x9A, 0xC6, 0x8D, 0x2A, 0xE9, 0xF8, 0x48, 0x08 };
        /* Round key for round 1 */
        vector unsigned char roundKey = {
            0xA0, 0xFA, 0xFE, 0x17, 0x88, 0x54, 0x2C, 0xB1,
            0x23, 0xA3, 0x39, 0x39, 0x2A, 0x6C, 0x76, 0x05 };
        /* Expected state at the end of round 1 */
        vector unsigned char expect = {
            0xA4, 0x9C, 0x7F, 0xF2, 0x68, 0x9F, 0x35, 0x2B,
            0x6B, 0x5B, 0xEA, 0x43, 0x02, 0x6A, 0x50, 0x49 };
        /* The four transformations of an intermediate round, one call each */
        vector unsigned char answer = mySubBytes(state);
        answer = myShiftRows(answer);
        answer = myMixColumns(answer);
        answer = myAddRoundKey(answer, roundKey);
        /* Exit status is 0 when the result matches the expected state */
        return answer == expect ? 0 : 1;
    }
Figure 11. Listing file when not using built-in function

The listing file shows that the program makes four calls to the supporting functions. Function calls, together with the associated work of setting up and managing the call stack, are expensive in terms of performance. Even if the optimizer decides to inline those function calls to improve performance, the inlined code will have hundreds, if not thousands, of instructions. Compared to the built-in functions, the user-defined functions require more time to run and produce a larger binary.
Another practical reason that makes using AES built-in functions critical is the mode of operation implemented by the program. If the selected mode of operation is a chaining mode (for example, CBC or PCBC) or a feedback mode (for example, CFB or OFB), then encryption, decryption, or both cannot be parallelized across blocks, because each block depends on the result of the previous one. The sequential operations then become the performance bottleneck, so the time spent on each block matters. This is what makes using built-in functions essential for high performance applications.
Resources
- Visit the XL C/C++ for Linux and XL C/C++ for AIX product pages for more information.
- Get the free trial download for XL C/C++ for Linux.
- Join the Rational C/C++ Cafe community and get connected.
References
- Announcing the Advanced Encryption Standard (AES): Federal Information Processing Standards Publication 197. United States National Institute of Standards and Technology (NIST). November 26, 2001. Retrieved September 1, 2014.
- Compiler references, Compiler built-in functions, Cryptography built-in functions. IBM Corporation. 2014. Retrieved September 1, 2014.
- Block cipher mode of operation. Retrieved September 1, 2014.