This article will explain the parallel implementation of the AES algorithm discussed in Part 1 of this series using the CUDA framework. The first focus in on implementing all AES transformations and their corresponding inverse transformations in CUDA. Then you look at the bottlenecks in the basic implementation and apply GPU specific optimizations to mitigate their effect. At the end of the article, you review the benchmarks that signify the acceleration achieved using CUDA-enabled GPUs against standard CPUs.
Before you examine how AES is implemented on CUDA, let's begin with an overview of cryptography on GPUs.
The use of general purpose graphics hardware to accelerate cryptographic solutions has quite a history. Gershon Kedem and Yuriko Ishihara published the first paper dealing with cryptography on graphics hardware in August 1999 (see Resources). The authors reported successfully cracking a UNIX® password cipher using the PixelFlow graphics engine running at 100 MHz.
Early commodity graphics cards, however, were not well suited for cryptographic applications due to the lack of a flexible programming environment as well as hardware deficiencies like missing integer support. This has changed in recent years with the latest graphics cards from manufacturers like NVIDIA, which offer more flexibility for general purpose computing applications. The inception of CUDA and the open parallel programming standard, OpenCLTM, have made GPUs even more attractive platforms for parallel computing applications. A number of recent papers have dealt specifically with implementing AES on NVIDIA's GPU using CUDA (see Resources). In all of these papers the core encryption algorithm is implemented on the GPU, leaving tasks like key expansion on the CPU, which simplifies the implementation. We have, however, also investigated the key scheduling on the GPU using shared memory available on the latest hardware, without any significant performance overhead.
The CUDA implementation of AES
Now let's discuss the parallel implementation and optimization of AES using the CUDA framework. Begin with a basic implementation of AES transformations and the required indexing scheme followed by a discussion on the performance limiting factors. Then look at the possible optimization strategies to mitigate these bottlenecks and develop an efficient encryption kernel that fully benefits from the parallel computing power of GPUs.
For the AES implementation, consider the simplest case of a
single CUDA thread operating on a 128-bit state block. The first step is
to allocate memory on the GPU device for input, output, expanded key, and
AES tables using the cudaMalloc API call. These
buffers are created in the global memory or D-RAM of the GPU. Input data,
expanded key, and the AES tables are then copied from the host to the device
using the cudaMemcpy API call.
Now, you can launch the AES kernel and analyze the steps undertaken by your CUDA
thread for the encryption and decryption process.
At the beginning of the Forward or the Inverse Cipher, the input array is
copied to the state array according to the following convention:
S [ r , c ] = In [ r + 4c ] for 0 <= r < 4; and 0 <= c Nb,
where Nb = 4 for the example case. Figure 1 depicts this column major access
pattern. (View a text version of the diagram in Figure 1.)
Figure 1. Copying input array to
state array
Generalizing this indexing scheme to accommodate more threads requires some
mechanism of identifying which thread is being executed. A new variable
named idx is introduced, which queries the CUDA
runtime for a Global Id of each thread. The net offset for each thread is
16*idx, as each thread handles 16 elements
(bytes) of the input array. The same holds true for writing the data to
the output array after completion of the encryption or decryption process.
All AES transformations are implemented as device functions to optimize the resource usage and improve the readability of the code. Two blocks of state size are created in the register files, since all of the AES transformations can't be performed in place. We will explain how the transformations are implemented in the order they are performed while encrypting input data.
-
AddRoundKeyThis transformation is applied by combining each byte of the state with a corresponding byte from the expanded key array using a bit-wise XOR operation. The expanded key is available on the GPU in a Global Memory buffer. This transformation is applied during each round using a different portion of the expanded key depending upon the round number. The
AddRoundKeytransformation is applied according to the equation in Figure 2.
Figure 2.AddRoundKeytransformation is applied according to this equation
Each thread requires 16 global memory accesses from random locations resulting in un-coalesced memory transfers. Global memory is the least bandwidth memory on the GPU and is not backed by a cache. This introduces long latency per access and effectively blocks the whole warp from executing until all memory requests are complete.
-
SubBytesThis transformation substitutes each byte of the state array with an element from AES S-Box located at an index equal to the byte value itself. For example, if S0,0 = {64}, then the substitution value from the S-Box would be determined by the intersection of the row with index
6and the column with index4. The S-Box is a fixed table that resides in the Global Memory of the GPU. Here, the situation is similar to theAddRoundKeytransformation. Global memory accesses are so far the performance bottleneck in your basic implementation. -
ShiftRowsIn a
ShiftRowstransformation, each byte of the second, third, and fourth row is shifted left by an offset of one, two, and three bytes, respectively. The shift rows transformation uses bothn_stateand thestatearrays as it is not an in-place transform, as in Figure 3.
Figure 3. Applying aShiftRowstransformation
The
ShiftRowstransformation does not include any global memory access. Both the state and the n_state array reside in the registers and therefore can be accessed at the highest available bandwidth without any performance issues.The initial round applies only to the
AddRoundKeytransformation. The middle rounds start with SubBytes transformation and respectively applyShiftRows,MixColumnsand theAddRoundKeyfor the specific number of rounds calculated from the key-length. The final round performs theSubBytesandShiftRowstransformation followed byAddRoundKey. The encrypted data is then copied back from state to the global output buffer as described next.
Copy state to the output buffer
At the end of the Forward or the Inverse Cipher, the
state array is copied back to the output array
in global memory as in Figure 4. The convention
followed for this pattern can be defined as
Out [ r + 4c ] = S [ r , c ] for 0 <= r < 4; and 0 <= c <Nb. (View a text version of the diagram in Figure 4.)
Figure 4. Copying state to output array
In this section, you will look at the inverse AES transformations and the order in which they are applied for the decryption process. The Inverse Cipher incorporates minor changes in the transformations, the order of execution, and the required AES-Tables. Figure 5 depicts the order in which transformations are applied for the decryption process.
Figure 5. Inverse AES flow graph
The InvSubBytes transform, which is the inverse
of the SubBytes transform, requires an Inverse
S-box table instead of the S-box table for byte substitution. The
substitution logic remains the same as in the
SubBytes transform.
The InvShiftRows transformation shifts each byte
of the second, third and fourth row respectively by one, two and three
bytes to the right instead as compared to the left shift in the
ShiftRows transformation. Figure 6 depicts how the InvShiftRows
transformation is applied.
Figure 6.
InvShiftRows transformation
The other two transformations, MixColumns and
AddRoundKey, remain unchanged.
Optimizing AES for the CUDA architecture
In the basic implementation discussed so far, you used the Global memory (D-RAM) available on the GPU and the Register memory available to each thread. Remember, Global memory has the least memory bandwidth compared to other memory spaces available on the GPU like Constant and Shared memory. Your kernel performance is limited by global memory bandwidth at one end. The main disadvantage of low memory bandwidth is long latency access. A quick optimization is to reduce the number of global memory accesses required in each transformation. The second factor that degrades kernel performance is a poor device occupancy level, resulting from excessive register usage per thread. In compute capability 2.0, the hardware maximum allowed registers per Symmetric Multiprocessor (SM) are limited to 32768. Higher register usage reduces the number of in-flight thread-blocks on each SM and results in under-utilization of the computing resources. To achieve maximum device occupancy level, you need to bring the register usage per thread within limits, thus allowing more thread-blocks to be processed per SM. To achieve this, offload the state and n_state arrays to the shared memory.
The first performance degrading factor in this AES implementation is the frequent global memory accesses. The data being accessed from global memory includes the fixed AES tables and the expanded key. For faster, cached access, move this read-only data to the constant memory of GPUs.
Constant memory is a high bandwidth read-only memory available on the GPUs. This is an ideal memory space to store constant data for read-only access as it is visible to all threads from all thread-blocks within a kernel. The host processor is responsible for allocating and initializing the memory objects that reside in constant memory space. The constant memory is cache baked, so successive reads from the constant memory do not suffer a memory latency penalty as do the reads from global memory. Once the data is written to constant memory, all the threads can access this data for read operations with minimal latency. The size of constant cache is limited to 64 KB on current hardware. In AES, you have four constant tables that are only read within the kernel. These are the logt, ilogt, sbox, and the isbox. Each table is 256 Bytes in size, making 1 KB for 4 tables, so you can place all the tables in constant cache. Another read-only data for your kernel is the expanded key, which takes up 240 more bytes of the constant cache with ease.
The kernel code does not require any modifications, as constant memory is allocated and initialized by the host processor. Moving AES tables and the expanded key to constant memory results in a significant performance increase, since it drastically reduces the number of Global memory accesses per thread. The global memory is now accessed only for copying data from the input buffer to state and back to the output buffer after completing all the transformations.
Improving device occupancy and the AES kernel using shared memory
The second step in AES optimization is to improve device occupancy, which is currently limited by an excessive register usage per thread. Each thread maintains two arrays of state size (16 Bytes) in the register memory to efficiently compute the AES transformations without returning back to global memory. To save register memory without incurring significant overheads, offload these arrays to an appropriate memory space that offers comparable performance to registers. Shared memory is here to save the day for you.
Shared memory is a high-bandwidth memory on GPUs used for data-sharing among threads within the same thread-block. NVIDIA GPUs with compute capability 2.x have a maximum 48KB of shared memory per Symmetric Multiprocessor (SM). Shared memory offers very high bandwidth, roughly an order of magnitude higher than the global memory. Another advantage of shared memory is that it does not require coalescing; once the data is loaded into local memory, it can be accessed in any pattern without performance degradation, making it an ideal substitute of register memory.
Place thestate and
n_state buffers in the shared memory instead of
the registers. Now all threads in a thread-block share
state and n_state
buffers. Each thread will load 16 elements from the global input buffer to
the state buffer in shared memory at a defined offset. Two buffers in
shared memory are still required, as all the AES transformations are not
in place. This is not a caveat though, as you have 48 KB of local memory
per SM at your disposal, and each thread-block at max use 8 KB at largest
work-group size of 256, thus allowing six thread-blocks to be processed
per SM.
The use of shared memory requires some modification in the indexing scheme. Now you need to track the local-id of each thread, as the local state buffer is now being cooperatively filled by all the threads within a thread-block. A new variable named tid is introduced, which queries the local-id of each thread.
Each thread will write 16 elements to the
state array with an offset of
16*tid. The same holds for the indexing scheme
of output array. The use of shared memory for state buffers reduces the
register file usage per thread and improves the device occupancy. This
allows more warps to run concurrently in each SM improving the overall
kernel performance.
GPUs are Single Instruction Multiple Data (SIMD) processors. The better your
computation matches the SIMD model, the higher throughput you can achieve
from a GPU. In AES implementation, each thread is dealing with one state
block (16 Bytes) independently. Therefore, all AES transformations use a
for loop to operate on all 16 bytes of the
state, as in the sample codes for various steps. The GPU
implementation can take advantage of loop unrolling as it removes all of
the overhead associated with index calculations at run-time.
You can manually unroll each loop or simply use the
#pragma unroll directive before a loop and the
compiler will do the same for you. Loop unrolling increases the Register
usage per thread but you are still within the allowed size of 32768
registers per block. Performance is better from unrolled loops as the
instruction overhead is completely removed and the kernel better matches
the SIMD processing model.
The basic AES implementation expands the key on Host processor and copies
the expanded key to constant memory on the GPU. Now you will investigate
key scheduling on the GPU itself using high bandwidth shared memory. You
need to expand the key for each thread-block individually as the shared
memory cannot be accessed across multiple thread-blocks. The unexpanded
cipher key is copied from a host to a constant memory buffer on the GPU.
For each thread-block, an array equal in size to the expanded key is
created in shared memory. Expanded key size depends on the key-length
being used and is 240 Bytes for 256 Bit key. The first thread
(tidx=0) of each thread-block copies the
cipher-key into this shared array and calls the key expansion function.
Figure 7 below depicts the path followed by each
thread-block. A barrier is placed in the kernel after the key expansion
function. This is required to make sure that no thread proceeds with the
AES transformations before the key expansion is complete.
Figure 7. On-device key scheduling
Key scheduling on GPU increases the shared memory usage per thread block, by 240 Bytes, making it 8432 Bytes. The device occupancy is now limited by the shared memory to 5 thread-blocks per SM. The Register usage per thread-block is no more the bottleneck as it still allows for 6 thread-blocks per SM. The computation overhead incurred per thread block is not significant as the key expansion is not a compute intensive function.
The overall kernel performance has decreased as a result of lower device occupancy. The performance degradation is not very significant as the lower device occupancy is, to some extent, being compensated by relatively faster access to the expanded key in shared memory.
The inherent parallelism in the AES algorithm shows better performance gains for large data sizes, hence it is well suited for bulk encryption. In the benchmarks, you will validate this through performance results taken on various input sizes. Figure 8 shows a performance comparison of Tesla C2050 against a core i7 CPU @2.8 GHz and indicates a sharp rise of encryption time on Intel for bigger files with a minimal time rise on the graphics processor. The graph has been plotted with input size on the horizontal axis (Mega Bytes) and the kernel execution time (milliseconds) for 256-bit AES on the vertical axis.
Figure 8. AES performance on NVIDIA TESLA C2050
Figure 9 shows the performance comparison of various hardware for a fully optimized AES kernel. The benchmarks for Core i7 CPU are obtained using 8 threads in OpenCLTM. The results indicate low values on the Intel and 25+ times greater values on NVIDIA.
Figure 9. AES performance comparison
We have described a CUDA implementation of AES and demonstrated a step-by-step optimization procedure for this implementation. The results prove the enormous performance improvement through GPUs and shows considerable speedups compared to the current generation Intel processor or commodity graphics cards. These optimization techniques are employed in the gKrypt application and engine to accelerate computation-intensive encryption jobs. The techniques make gKrypt easy to use and the wide platform support makes it safe to deploy with consumer products.
Learn
- Protect your data at the speed of light with gKrypt, Part 1 (Jawad masood, developerWorks, April 2012) gets you started with the gKrypt engine and AES algorithm.
- Dig into AES Encryption Implementation and Analysis on Commodity Graphics Processing Units (Owen Harrison and John Waldron, 2007) and its novel approaches to the implementation of the AES block cipher encryption algorithm on GPUs.
- Read Brute Force Attack on UNIX Passwords with SIMD Computer (Gershon Kedem and Yuriko Ishihara, Proceedings of the 8th USENIX Security Symposium, August 1999) for evaluations of the security of the UNIX password scheme.
- Review CUDA compatible GPU as an efficient hardware accelerator for AES cryptography (Svetlin A. Manavski, IEEE International Conference on Signal Processing and Communications, 2007), a study of the efficiency in applying Modern Graphics Processing Units in symmetric key cryptographic solutions.
- Peruse A performance prediction model for the CUDA GPGPU platform (K.Kishore, Rishabh Mukherjee, Suhail Rehman, P. J. Narayanan, and K. Srinathan; Technical Report IIIT/TR/2009/82, IIIT Hyderabad, 2009) for a performance prediction model for the CUDA GPGPU platform.
- Read the FIPS 197: Advanced Encryption Standard (AES), 2001 from National Institute of Standards and Technology (NIST) which specifies a FIPs-approved cryptographic algorithm that can be used to protect electronic data.
- Review FIPS 46-3: Data Encryption Standard (DES) [National Institute of Standards and Technology (NIST), 1999] which specifies two cryptographic algorithms, the Data Encryption Standard (DES) and the Triple Data Encryption Algorithm (TDEA).
- Learn more about Programming Massively Parallel Processors: A Hands-on Approach (David B. Kirk and Wen-mei W. Hwu, Elsevier, 2010) in these basic concepts of parallel programming and GPU architecture.
- In the developerWorks Linux zone, find hundreds of how-to articles and tutorials, as well as downloads, discussion forums, and a wealth of other resources for Linux developers and administrators.
- The Open Source developerWorks zone provides a wealth of information on open source tools and using open source technologies.
- Stay current with developerWorks technical events and webcasts focused on a variety of IBM products and IT industry topics.
- Attend a free developerWorks Live! briefing to get up-to-speed quickly on IBM products and tools, as well as IT industry trends.
- Watch developerWorks on-demand demos ranging from product installation and setup demos for beginners, to advanced functionality for experienced developers.
- Follow developerWorks on Twitter, or subscribe to a feed of Linux tweets on developerWorks.
Get products and technologies
- Get the latest CUDA capable drivers from NVIDIA.
- Download the CUDA Occupancy Calculator, which allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel.
- Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.
Discuss
- Check out developerWorks blogs and get involved in the developerWorks community.
- Get involved in the developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.




