Protect your data at the speed of light with gKrypt, Part 2

Accelerate your application's encryption with the GPU-enabled gKrypt engine

Meet the gKrypt engine, the world's first package to employ general-purpose graphics processing units (GPGPUs) for data encryption. It uses the Advanced Encryption Standard (AES) block cipher with 256-bit keys. This is the second article in a two-part series on AES encryption and the gKrypt engine. Part 1 introduced gKrypt and explained the AES algorithm in detail, its parallel breakdown, and how to map it onto a massively parallel GPU architecture using the Compute Unified Device Architecture (CUDA). Part 2 looks at how AES is implemented on CUDA.


Jawad Masood, Lead Engineer, gKrypt Data Security Solutions

Jawad Masood is the lead developer at gKrypt Data Security Solutions, a startup focused on providing cost-effective data security solutions using a mix of multi-core and many-core processors to deliver accelerated bulk data encryption and compression performance.

01 May 2012



This article explains the parallel implementation of the AES algorithm discussed in Part 1 of this series using the CUDA framework. The first focus is on implementing all AES transformations and their corresponding inverse transformations in CUDA. Then you look at the bottlenecks in the basic implementation and apply GPU-specific optimizations to mitigate their effects. At the end of the article, you review benchmarks that quantify the acceleration achieved using CUDA-enabled GPUs against standard CPUs.

Before you examine how AES is implemented on CUDA, let's begin with an overview of cryptography on GPUs.

Cryptography on GPUs

The use of general purpose graphics hardware to accelerate cryptographic solutions has quite a history. Gershon Kedem and Yuriko Ishihara published the first paper dealing with cryptography on graphics hardware in August 1999 (see Resources). The authors reported successfully cracking a UNIX® password cipher using the PixelFlow graphics engine running at 100 MHz.

Early commodity graphics cards, however, were not well suited for cryptographic applications due to the lack of a flexible programming environment as well as hardware deficiencies like missing integer support. This has changed in recent years with the latest graphics cards from manufacturers like NVIDIA, which offer more flexibility for general purpose computing applications. The inception of CUDA and the open parallel programming standard, OpenCL™, have made GPUs even more attractive platforms for parallel computing applications. A number of recent papers have dealt specifically with implementing AES on NVIDIA GPUs using CUDA (see Resources). In all of these papers the core encryption algorithm is implemented on the GPU, leaving tasks like key expansion on the CPU, which simplifies the implementation. We have, however, also investigated key scheduling on the GPU using the shared memory available on the latest hardware, without any significant performance overhead.

The CUDA implementation of AES

Now let's discuss the parallel implementation and optimization of AES using the CUDA framework. Begin with a basic implementation of AES transformations and the required indexing scheme followed by a discussion on the performance limiting factors. Then look at the possible optimization strategies to mitigate these bottlenecks and develop an efficient encryption kernel that fully benefits from the parallel computing power of GPUs.

For the AES implementation, consider the simplest case of a single CUDA thread operating on a 128-bit state block. The first step is to allocate memory on the GPU device for the input, output, expanded key, and AES tables using the cudaMalloc API call. These buffers are created in the global memory, or DRAM, of the GPU. The input data, expanded key, and AES tables are then copied from the host to the device using the cudaMemcpy API call. Now you can launch the AES kernel and analyze the steps undertaken by your CUDA thread for the encryption and decryption process.

Copy input to state

At the beginning of the Forward or the Inverse Cipher, the input array is copied to the state array according to the following convention: S[r, c] = In[r + 4c] for 0 <= r < 4 and 0 <= c < Nb, where Nb = 4 for the example case. Figure 1 depicts this column-major access pattern.

Figure 1. Copying input array to state array
Diagram showing how the input array of single values translates to the state array with two values

Generalizing this indexing scheme to accommodate more threads requires some mechanism for identifying which thread is being executed. A new variable named idx is introduced, computed from the CUDA built-in variables (blockIdx.x * blockDim.x + threadIdx.x) to give the global ID of each thread. The net offset for each thread is 16*idx, as each thread handles 16 elements (bytes) of the input array. The same holds true for writing the data to the output array after completion of the encryption or decryption process.
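As a CPU-side sketch of this indexing scheme (hypothetical helper names; the `idx` parameter stands in for the CUDA global thread ID), the copy conventions in both directions look like this:

```c
#include <assert.h>
#include <stdint.h>

#define NB 4  /* number of state columns; fixed at 4 for AES */

/* Column-major copy: S[r][c] = In[r + 4c]. Each thread's 16-byte
   block starts at offset 16*idx within the input array. */
void copy_input_to_state(const uint8_t *in, int idx, uint8_t state[4][NB]) {
    const uint8_t *block = in + 16 * idx;   /* per-thread offset */
    for (int c = 0; c < NB; c++)
        for (int r = 0; r < 4; r++)
            state[r][c] = block[r + 4 * c];
}

/* Reverse convention: Out[r + 4c] = S[r][c], at the same per-thread offset. */
void copy_state_to_output(const uint8_t state[4][NB], int idx, uint8_t *out) {
    uint8_t *block = out + 16 * idx;
    for (int c = 0; c < NB; c++)
        for (int r = 0; r < 4; r++)
            block[r + 4 * c] = state[r][c];
}
```

Copying in and back out should round-trip each 16-byte block unchanged, which makes the indexing easy to verify on the host before porting it to the kernel.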

AES transformations

All AES transformations are implemented as device functions to optimize the resource usage and improve the readability of the code. Two blocks of state size are created in the register files, since not all of the AES transformations can be performed in place. We will explain how the transformations are implemented in the order they are performed while encrypting input data.

  • AddRoundKey

    This transformation is applied by combining each byte of the state with a corresponding byte from the expanded key array using a bit-wise XOR operation. The expanded key is available on the GPU in a Global Memory buffer. This transformation is applied during each round using a different portion of the expanded key depending upon the round number. The AddRoundKey transformation is applied according to the equation in Figure 2.

    Figure 2. AddRoundKey transformation is applied according to this equation
    Complex mathematical equation for transformation

    Each thread requires 16 global memory accesses from random locations, resulting in un-coalesced memory transfers. Global memory has the lowest bandwidth of the GPU memory spaces and is not backed by a cache. This introduces long latency per access and effectively blocks the whole warp from executing until all memory requests are complete.

  • SubBytes

    This transformation substitutes each byte of the state array with an element from the AES S-Box located at an index equal to the byte value itself. For example, if S0,0 = {64}, then the substitution value from the S-Box would be determined by the intersection of the row with index 6 and the column with index 4. The S-Box is a fixed table that resides in the Global Memory of the GPU. Here, the situation is similar to the AddRoundKey transformation: global memory accesses remain the performance bottleneck in the basic implementation.

  • ShiftRows

    In a ShiftRows transformation, each byte of the second, third, and fourth row is shifted left by an offset of one, two, and three bytes, respectively. The ShiftRows transformation uses both the n_state and the state arrays, as it is not an in-place transform, as in Figure 3.

    Figure 3. Applying a ShiftRows transformation
    Diagram showing how the bits are shifted during transformation

    The ShiftRows transformation does not include any global memory access. Both the state and the n_state array reside in the registers and therefore can be accessed at the highest available bandwidth without any performance issues.

  • MixColumns

    This transformation operates on the state column by column, treating each column as a four-term polynomial over GF(2^8) and multiplying it by a fixed polynomial. The byte multiplications are performed through the logt and ilogt lookup tables, which in the basic implementation reside in the Global Memory of the GPU and therefore incur the same access latency as the other table lookups.

    The initial round applies only the AddRoundKey transformation. The middle rounds start with the SubBytes transformation and then apply ShiftRows, MixColumns, and AddRoundKey, repeating for the number of rounds calculated from the key length. The final round performs the SubBytes and ShiftRows transformations followed by AddRoundKey. The encrypted data is then copied back from state to the global output buffer as described next.
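A minimal CPU sketch of these forward transformations follows (hypothetical names; the substitution table is a placeholder rather than the real AES S-Box, and MixColumns is omitted for brevity):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Placeholder substitution table -- NOT the real AES S-Box; any
   invertible byte mapping illustrates the lookup pattern. */
static uint8_t sbox[256];
void init_sbox(void) { for (int i = 0; i < 256; i++) sbox[i] = (uint8_t)(255 - i); }

/* AddRoundKey: byte-wise XOR of the state with one round's worth of key. */
void add_round_key(uint8_t state[4][4], const uint8_t *round_key) {
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            state[r][c] ^= round_key[r + 4 * c];
}

/* SubBytes: substitute each byte by a table lookup at the byte's value. */
void sub_bytes(uint8_t state[4][4]) {
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            state[r][c] = sbox[state[r][c]];
}

/* ShiftRows: row r rotates left by r bytes; a second buffer (n_state)
   is needed because the transform is not in place. */
void shift_rows(uint8_t state[4][4]) {
    uint8_t n_state[4][4];
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            n_state[r][c] = state[r][(c + r) % 4];
    memcpy(state, n_state, sizeof n_state);
}
```

In the kernel these become `__device__` functions operating on the per-thread state block; the structure of each loop is the same.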

Copy state to the output buffer

At the end of the Forward or the Inverse Cipher, the state array is copied back to the output array in global memory as in Figure 4. The convention followed for this pattern can be defined as Out[r + 4c] = S[r, c] for 0 <= r < 4 and 0 <= c < Nb.

Figure 4. Copying state to output array
Diagram showing how the state array with two values is copied into the output array with single values

Inverse AES transformations

In this section, you will look at the inverse AES transformations and the order in which they are applied for the decryption process. The Inverse Cipher incorporates minor changes in the transformations, the order of execution, and the required AES-Tables. Figure 5 depicts the order in which transformations are applied for the decryption process.

Figure 5. Inverse AES flow graph
Diagram showing how the cipher text passes through the various functions to be turned into plain text

The InvSubBytes transform, which is the inverse of the SubBytes transform, requires an Inverse S-box table instead of the S-box table for byte substitution. The substitution logic remains the same as in the SubBytes transform.

The InvShiftRows transformation shifts each byte of the second, third, and fourth row to the right by one, two, and three bytes, respectively, as compared to the left shift in the ShiftRows transformation. Figure 6 depicts how the InvShiftRows transformation is applied.

Figure 6. InvShiftRows transformation
Diagram showing the bit shifting pattern for transformation

The other two transformations, MixColumns and AddRoundKey, remain unchanged.
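A small CPU sketch can check that the two shift directions are indeed inverses of each other (hypothetical helper names, not the gKrypt kernel code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Forward ShiftRows: row r rotates left by r bytes. */
void shift_rows_fwd(uint8_t s[4][4]) {
    uint8_t n[4][4];
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            n[r][c] = s[r][(c + r) % 4];
    memcpy(s, n, sizeof n);
}

/* InvShiftRows: row r rotates RIGHT by r bytes, undoing the forward shift. */
void inv_shift_rows(uint8_t s[4][4]) {
    uint8_t n[4][4];
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            n[r][(c + r) % 4] = s[r][c];
    memcpy(s, n, sizeof n);
}
```

Applying the forward shift and then the inverse shift must return every state block to its original contents, which is a handy unit test for the decryption path.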

Optimizing AES for the CUDA architecture

In the basic implementation discussed so far, you used the Global memory (DRAM) available on the GPU and the register memory available to each thread. Remember, Global memory has the lowest bandwidth of the memory spaces available on the GPU, such as Constant and Shared memory, and its main disadvantage is long access latency, so your kernel performance is limited by global memory bandwidth on one front. A quick optimization is to reduce the number of global memory accesses required in each transformation. The second factor that degrades kernel performance is a poor device occupancy level, resulting from excessive register usage per thread. On compute capability 2.0 hardware, the registers per Streaming Multiprocessor (SM) are limited to 32768. Higher register usage reduces the number of in-flight thread-blocks on each SM and results in under-utilization of the computing resources. To achieve the maximum device occupancy level, you need to bring the register usage per thread within limits, thus allowing more thread-blocks to be processed per SM. To achieve this, offload the state and n_state arrays to shared memory.

Reducing global memory access

The first performance degrading factor in this AES implementation is the frequent global memory accesses. The data being accessed from global memory includes the fixed AES tables and the expanded key. For faster, cached access, move this read-only data to the constant memory of GPUs.

Constant memory is a high-bandwidth, read-only memory available on GPUs. It is an ideal memory space for constant data with read-only access, as it is visible to all threads from all thread-blocks within a kernel. The host processor is responsible for allocating and initializing the memory objects that reside in constant memory space. Constant memory is cache backed, so successive reads from constant memory do not suffer the memory latency penalty that reads from global memory do. Once the data is written to constant memory, all threads can access it for read operations with minimal latency. The size of the constant cache is limited to 64 KB on current hardware. In AES, you have four constant tables that are only read within the kernel: logt, ilogt, sbox, and isbox. Each table is 256 bytes in size, making 1 KB for the four tables, so you can place all the tables in the constant cache. The other read-only data for your kernel is the expanded key, which takes up 240 more bytes of the constant cache with ease.
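The footprint arithmetic can be sketched as follows (a simple check, not kernel code):

```c
#include <assert.h>

/* Constant-memory footprint: four 256-byte AES tables (logt, ilogt,
   sbox, isbox) plus the 240-byte expanded key for a 256-bit key. */
int const_mem_bytes(void) {
    return 4 * 256 + 240;   /* tables + expanded key */
}
```

At well under 2 KB total, everything the kernel reads repeatedly fits comfortably inside the 64 KB constant cache.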

The kernel code does not require any modifications, as constant memory is allocated and initialized by the host processor. Moving AES tables and the expanded key to constant memory results in a significant performance increase, since it drastically reduces the number of Global memory accesses per thread. The global memory is now accessed only for copying data from the input buffer to state and back to the output buffer after completing all the transformations.

Improving device occupancy and the AES kernel using shared memory

The second step in AES optimization is to improve device occupancy, which is currently limited by an excessive register usage per thread. Each thread maintains two arrays of state size (16 Bytes) in the register memory to efficiently compute the AES transformations without returning back to global memory. To save register memory without incurring significant overheads, offload these arrays to an appropriate memory space that offers comparable performance to registers. Shared memory is here to save the day for you.

Shared memory is a high-bandwidth memory on GPUs used for data sharing among threads within the same thread-block. NVIDIA GPUs with compute capability 2.x have a maximum of 48 KB of shared memory per Streaming Multiprocessor (SM). Shared memory offers very high bandwidth, roughly an order of magnitude higher than global memory. Another advantage of shared memory is that it does not require coalescing; once the data is loaded into shared memory, it can be accessed in any pattern without performance degradation, making it an ideal substitute for register memory.

Place the state and n_state buffers in shared memory instead of the registers. Now all threads in a thread-block share the state and n_state buffers. Each thread will load 16 elements from the global input buffer to the state buffer in shared memory at a defined offset. Two buffers in shared memory are still required, as not all of the AES transformations are in place. This is not a caveat though, as you have 48 KB of shared memory per SM at your disposal, and each thread-block uses at most 8 KB at the largest work-group size of 256 threads, thus allowing six thread-blocks to be processed per SM.
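The shared-memory budget behind the six-blocks-per-SM figure works out as follows (an arithmetic sketch, not kernel code):

```c
#include <assert.h>

/* Two 16-byte state buffers (state and n_state) per thread. */
int shared_bytes_per_block(int threads_per_block) {
    return threads_per_block * 2 * 16;
}

/* Integer division: only whole thread-blocks can be resident on an SM. */
int blocks_per_sm(int shared_per_sm, int shared_per_block) {
    return shared_per_sm / shared_per_block;
}
```

At the largest work-group size of 256 threads, each block consumes 8 KB, so a 48 KB SM can keep six blocks in flight.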

The use of shared memory requires some modification of the indexing scheme. You now need to track the local ID of each thread, as the local state buffer is cooperatively filled by all the threads within a thread-block. A new variable named tid is introduced, which holds the local ID of each thread.

Each thread will write 16 elements to the state array with an offset of 16*tid. The same holds for the indexing scheme of the output array. The use of shared memory for the state buffers reduces the register file usage per thread and improves the device occupancy. This allows more warps to run concurrently on each SM, improving the overall kernel performance.

Loop unrolling

GPUs are Single Instruction Multiple Data (SIMD) processors. The better your computation matches the SIMD model, the higher throughput you can achieve from a GPU. In the AES implementation, each thread deals with one state block (16 bytes) independently. Therefore, all AES transformations use a for loop to operate on all 16 bytes of the state, as in the sample code for the various steps. The GPU implementation can take advantage of loop unrolling, as it removes all of the overhead associated with index calculations at run time.

You can manually unroll each loop or simply use the #pragma unroll directive before a loop and the compiler will do the same for you. Loop unrolling increases the register usage per thread, but you are still within the allowed limit of 32768 registers per SM. Performance is better with unrolled loops, as the instruction overhead is completely removed and the kernel better matches the SIMD processing model.
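The effect of unrolling the 16-byte loop can be sketched on the CPU as follows (in the CUDA kernel, the #pragma unroll directive produces the unrolled form automatically):

```c
#include <assert.h>
#include <stdint.h>

/* Loop version, as in the basic kernel: one branch and one index
   update per byte. In CUDA, prefixing this loop with `#pragma unroll`
   asks the compiler to generate the unrolled form. */
void xor_loop(uint8_t *s, const uint8_t *k) {
    for (int i = 0; i < 16; i++) s[i] ^= k[i];
}

/* Manually unrolled equivalent: no loop counter or branch at run time. */
void xor_unrolled(uint8_t *s, const uint8_t *k) {
    s[0]  ^= k[0];  s[1]  ^= k[1];  s[2]  ^= k[2];  s[3]  ^= k[3];
    s[4]  ^= k[4];  s[5]  ^= k[5];  s[6]  ^= k[6];  s[7]  ^= k[7];
    s[8]  ^= k[8];  s[9]  ^= k[9];  s[10] ^= k[10]; s[11] ^= k[11];
    s[12] ^= k[12]; s[13] ^= k[13]; s[14] ^= k[14]; s[15] ^= k[15];
}
```

Both functions compute exactly the same result; the unrolled form simply trades a few extra instructions of code size for the removal of all loop overhead.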

Key scheduling on GPU

The basic AES implementation expands the key on the host processor and copies the expanded key to constant memory on the GPU. Now you will investigate key scheduling on the GPU itself using high-bandwidth shared memory. You need to expand the key for each thread-block individually, as shared memory cannot be accessed across multiple thread-blocks. The unexpanded cipher key is copied from the host to a constant memory buffer on the GPU. For each thread-block, an array equal in size to the expanded key is created in shared memory. The expanded key size depends on the key length being used and is 240 bytes for a 256-bit key. The first thread (tidx=0) of each thread-block copies the cipher key into this shared array and calls the key expansion function. Figure 7 depicts the path followed by each thread-block. A barrier is placed in the kernel after the key expansion function. This is required to make sure that no thread proceeds with the AES transformations before the key expansion is complete.

Figure 7. On-device key scheduling
Flow diagram showing how key expansion is bypassed if there is no tidx

Key scheduling on the GPU increases the shared memory usage per thread-block by 240 bytes, making it 8432 bytes. The device occupancy is now limited by shared memory to five thread-blocks per SM. Register usage per thread-block is no longer the bottleneck, as it still allows for six thread-blocks per SM. The computation overhead incurred per thread-block is not significant, as key expansion is not a compute-intensive function.

The overall kernel performance decreases as a result of the lower device occupancy. The degradation is not very significant, though, as the lower device occupancy is, to some extent, compensated by the relatively faster access to the expanded key in shared memory.
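The revised occupancy figure works out as follows (arithmetic sketch only):

```c
#include <assert.h>

/* Shared memory per thread-block after on-device key scheduling:
   two 16-byte state buffers per thread plus the 240-byte expanded
   key (256-bit key) shared by the whole block. */
int shared_with_key(int threads_per_block, int expanded_key_bytes) {
    return threads_per_block * 2 * 16 + expanded_key_bytes;
}
```

The extra 240 bytes push each 256-thread block from 8192 to 8432 bytes of shared memory, which drops the shared-memory-limited occupancy from six resident blocks per SM to five.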


Benchmarks

The inherent parallelism in the AES algorithm shows better performance gains for large data sizes; hence, it is well suited for bulk encryption. In the benchmarks, you will validate this through performance results taken on various input sizes. Figure 8 shows a performance comparison of a Tesla C2050 against a Core i7 CPU at 2.8 GHz and indicates a sharp rise in encryption time on the Intel processor for bigger files, with a minimal rise on the graphics processor. The graph is plotted with input size (in megabytes) on the horizontal axis and the kernel execution time (in milliseconds) for 256-bit AES on the vertical axis.

Figure 8. AES performance on NVIDIA TESLA C2050
Line chart of sharp rise of encryption time on Intel with bigger files with minimal time rise on the graphics processor

Figure 9 shows the performance comparison of various hardware for a fully optimized AES kernel. The benchmarks for the Core i7 CPU are obtained using 8 threads in OpenCL™. The results show the NVIDIA GPU outperforming the Intel CPU by more than 25 times.

Figure 9. AES performance comparison
Bar chart of comparative performance with low values on the Intel and 25+ times greater values on NVIDIA


Conclusion

We have described a CUDA implementation of AES and demonstrated a step-by-step optimization procedure for this implementation. The results show an enormous performance improvement through GPUs, with considerable speedups compared to a current-generation Intel processor. These optimization techniques are employed in the gKrypt application and engine to accelerate computation-intensive encryption jobs. The techniques make gKrypt easy to use, and the wide platform support makes it safe to deploy with consumer products.



Get products and technologies

  • Get the latest CUDA capable drivers from NVIDIA.
  • Download the CUDA Occupancy Calculator, which allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel.



