Protect your data at the speed of light with gKrypt, Part 1

Accelerate your application's encryption with the GPU-enabled gKrypt engine


In today's world, people and institutions store almost every type of information electronically, process it, and transmit it frequently across local networks or the Internet. As a result, the risk of unauthorized access to this information has increased dramatically, and the challenges for information security continue to grow. Unauthorized access to intercepted transmissions can compromise the sensitive personal and financial information of individuals and organizations, resulting in a serious security breach. For businesses, a breach usually entails huge financial penalties, expensive lawsuits, and a loss of reputation and business. For individuals, a breach can lead to identity theft and damage to financial history or a credit rating. Recovering from such breaches can take years, and the costs are huge. Encryption is an effective way to protect the sensitive and vital information of individuals and organizations against such breaches, and it has become an essential element of data security systems.

Bulk encryption technology provides a method to encrypt large amounts of data during transmission or storage. But in recent years, the growth in the amount of sensitive information that must be encrypted has outpaced the growth in CPU computing power. Currently, the processing-power requirements for bulk encryption are met by specialized hardware in the form of cryptographic accelerators (see Related topics). Most encryption algorithms, such as the Data Encryption Standard (DES) and the Advanced Encryption Standard (AES), have an inherently high level of data parallelism and computational density, making them well suited to parallel implementations on GPUs that can effectively replace dedicated hardware cryptographic solutions (see Related topics).

gKrypt engine

The gKrypt engine is an off-the-shelf crypto module that provides a simple API for the Rijndael Advanced Encryption Standard on GPUs (see Related topics). It takes advantage of the underlying parallel hardware, in the form of graphics cards and compute processors, to offload the compute-intensive encryption workload. gKrypt's simple C/C++ interface, with bindings for Java™ and C#, is a great advantage for developers who want to deploy a GPU solution with minimal knowledge of parallel languages such as CUDA and OpenCL.

The framework supports both AMD- and NVIDIA-based GPUs, as well as multi-core CPUs, making it easier to reap the benefits of parallel processors. The next section explains the underlying approach with a parallel breakdown of AES.

Advanced Encryption Standard (AES) algorithm

The Advanced Encryption Standard (AES) is the National Institute of Standards and Technology (NIST)-approved encryption standard, based on the Rijndael cipher (see Related topics). Rijndael is a symmetric-key block cipher proposed by Belgian cryptographers Joan Daemen and Vincent Rijmen for the NIST-initiated AES selection process (see Related topics). NIST approved Rijndael as the Advanced Encryption Standard (AES) in November 2001, and it effectively replaced the aging Data Encryption Standard (DES). The AES standard comprises three block ciphers with key sizes of 128, 192, and 256 bits, each operating on the same 128-bit data blocks; hence the names AES-128, AES-192, and AES-256, respectively.

The AES block-cipher operates on a two-dimensional array of bytes called the state, denoted by s. The size of the state array is 16 bytes (128 bits), arranged as four rows and four columns. This is represented by Nb = 4, which is the number of columns in a state array. The supported key sizes for AES are 128, 192, and 256 bits, represented by Nk = 4, 6, or 8, respectively. AES applies byte-oriented transformations (commonly referred to as AES transformations) to encrypt or decrypt the state array. These transformations are applied for a specific number of rounds Nr, which depends upon the key size: for 128-, 192-, and 256-bit keys, Nr = 10, 12, and 14, respectively.
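The dependence of the round count on the key size follows a simple rule from FIPS-197: Nr = Nk + 6. This minimal C sketch (the function name is my own) encodes that mapping:

```c
/* AES parameters from FIPS-197: the state always has Nb = 4 columns;
 * Nk is the key length in 32-bit words (4, 6, or 8); the round count
 * follows directly as Nr = Nk + 6. */
static int rounds_for_key_words(int nk)
{
    return nk + 6;  /* Nk = 4 -> 10 rounds, 6 -> 12, 8 -> 14 */
}
```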

Figure 1 below shows various phases of the AES algorithm.

Figure 1. AES algorithm phases
Diagram of AES algorithm phases

Before starting with AES transformations, the user-provided cipher-key is expanded using the Rijndael Key-Schedule. The Key Expansion step, shown in Figure 1, implements the Key-Schedule and provides the expanded key for all rounds of the subsequent stages. The expanded key size is calculated as Nb(Nr + 1), which is the number of 32-bit words (columns). A different portion of the expanded key is used in the AddRoundKey step of each round.
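The expanded-key size formula can be checked with a one-line computation; here is a sketch with names of my own choosing:

```c
/* Expanded key length in 32-bit words: Nb * (Nr + 1) -- one Nb-word
 * round key for the initial AddRoundKey plus one for each of the
 * Nr subsequent rounds. */
static int expanded_key_words(int nb, int nr)
{
    return nb * (nr + 1);  /* AES-128: 4 * (10 + 1) = 44 words = 176 bytes */
}
```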

The first step of the AES encryption process is the AddRoundKey transformation, where each byte of the state is combined with the round-key using bit-wise Exclusive OR (xor) operations as defined by the equation in Figure 2.

Figure 2. First step of AES encryption process
Formula showing the first step of AES encryption process

This is depicted in Figure 3 below. The State, New State, and Round Key are all represented as 4x4 arrays of bytes (128 bits), since Nb = 4. The byte data is written as hexadecimal words.

Figure 3. AddRoundKey transformation
Diagram of the AddRoundKey transformation
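Because AddRoundKey is a position-wise XOR, it reduces to a few lines of C. This sketch (function and parameter names are my own) operates on the 16 state bytes in whatever storage order is used, since XOR pairs each state byte with the round-key byte at the same position:

```c
#include <stdint.h>

/* AddRoundKey: XOR each of the 16 state bytes with the corresponding
 * round-key byte. FIPS-197 stores the state column-major, but because
 * the XOR is position-wise the layout does not matter here. */
static void add_round_key(uint8_t state[16], const uint8_t round_key[16])
{
    for (int i = 0; i < 16; i++)
        state[i] ^= round_key[i];
}
```

Applying the same round key twice restores the original state, which is why AES decryption reuses this step unchanged.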

In the next phase of AES, all four transformations are applied repeatedly, once per round for rounds 1 through Nr - 1, in the order shown in Figure 1. (The final round, Nr, omits the MixColumns transformation.) I will explain each of these transformations in the order they are applied.

The SubBytes transformation is a non-linear byte-substitution step in which each byte of the state is replaced according to a lookup table called the S-Box. Each byte of the state is replaced by the S-Box entry whose row and column indices are the high and low hexadecimal digits of the byte's value. Figure 4 illustrates the process.

Figure 4. SubBytes transformation
Diagram of the non-linear byte-substitution step

Here, the first byte of the state has the hexadecimal value {19}. Its substitution value is determined by the intersection of the row with index 1 and the column with index 9. Similarly, the byte {be} is substituted by the S-Box entry with row index b and column index e.
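Indexing the 16x16 S-Box by (row, column) = (high nibble, low nibble) is the same as indexing a flat 256-byte table by the byte value itself. The sketch below includes only the first two rows of the table (the full 256-entry S-Box is listed in FIPS-197); the array name is my own:

```c
#include <stdint.h>

/* First 32 entries (rows 0 and 1) of the AES S-Box from FIPS-197;
 * the full table has 256 entries. Row = high nibble, column = low nibble. */
static const uint8_t sbox_partial[32] = {
    0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5,
    0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76,
    0xca, 0x82, 0xc9, 0x7d, 0xfa, 0x59, 0x47, 0xf0,
    0xad, 0xd4, 0xa2, 0xaf, 0x9c, 0xa4, 0x72, 0xc0,
};

static uint8_t sub_byte(uint8_t b)
{
    /* row * 16 + column is simply the byte value itself */
    return sbox_partial[b];  /* valid here only for b < 0x20 */
}
```

For the byte {19} from the figure, row 1 and column 9 select the entry {d4}.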

In the ShiftRows transformation, each row of the state array is cyclically shifted by a different offset, according to the equation in Figure 5.

Figure 5. Each row of the state array shifts cyclically through a different offset
Diagram showing each row of the state array as it shifts cyclically through a different offset

The shift value shift(r, Nb) depends on the row number r, as in the equation in Figure 6.

Figure 6. The shift value depends on the row number
diagram of equation showing the shift value depends on the row number

Solving the above equations for each row reveals that the first row is not shifted, whereas the bytes of the second, third, and fourth rows are shifted to the left by offsets of one, two, and three bytes, respectively. Figure 7 below shows how each row is shifted, along with the resulting new state array.

Figure 7. ShiftRows transformation
diagram of the ShiftRows transformation

Think of the shift pattern as moving bytes to lower positions in the row, and wrapping the lowest bytes back into the top of the row.
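The row rotations can be written compactly in C; this sketch (my own naming, with the state indexed as [row][column] for readability) rotates row r left by r positions:

```c
#include <stdint.h>
#include <string.h>

/* ShiftRows: rotate row r of the 4x4 state left by r bytes.
 * Row 0 is untouched; rows 1-3 rotate by 1, 2, and 3 positions. */
static void shift_rows(uint8_t s[4][4])
{
    for (int r = 1; r < 4; r++) {
        uint8_t tmp[4];
        for (int c = 0; c < 4; c++)
            tmp[c] = s[r][(c + r) % 4];  /* byte that wraps into column c */
        memcpy(s[r], tmp, 4);
    }
}
```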

The MixColumns transformation operates on the columns of the state, treating each column as a four-term polynomial over GF(2^8). Each column is then multiplied modulo x^4 + 1 by the coefficient polynomial in Figure 8.

Figure 8. The coefficient polynomial
Diagram of the coefficient polynomial

This is equivalent to replacing each column of state with the matrix multiplication of that column with the coefficient matrix, as in Figure 9.

Figure 9. MixColumns transformation
Diagram of the MixColumns transformation
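The matrix multiplication only ever needs multiplication by {01}, {02}, and {03} in GF(2^8), so it is usually built on the standard xtime (multiply-by-{02}) helper. A sketch for one column, with names of my own choosing:

```c
#include <stdint.h>

/* Multiply by x (i.e., {02}) in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* MixColumns applied to a single 4-byte column; the coefficient matrix
 * rows are ({02} {03} {01} {01}) and its cyclic rotations. Note that
 * {03}*a = xtime(a) ^ a, and GF(2^8) addition is XOR. */
static void mix_column(uint8_t c[4])
{
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    c[0] = xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3;
    c[1] = a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3;
    c[2] = a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3);
    c[3] = (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3);
}
```

The well-known test vector {db 13 53 45} maps to {8e 4d a1 bc}, which is a handy sanity check for any MixColumns implementation.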

Parallelizing AES algorithm for CUDA architecture

Now that you have some understanding of the AES encryption algorithm, you are ready to start an implementation using the CUDA framework. In the following sections, I briefly discuss the CUDA framework, followed by a detailed discussion of the design approach and implementation mechanism. This is prep work so you can run an example in Part 2 of this series.

The CUDA architecture

CUDA, introduced in 2006, is a parallel computing architecture developed by NVIDIA that accelerates computing performance by leveraging the power of NVIDIA GPUs (see Related topics). More than a hundred million installed GPUs support CUDA, so applications that run on CUDA already have a strong potential user base. Since its release, developers have achieved dramatic acceleration in numerous applications, especially those related to image processing and HD video playback (see Figure 10).

Figure 10. CUDA thread model
Diagram of the CUDA thread model with two serial code CPUs and two GPU parallel kernels

The CUDA programming model generates thousands of threads that execute in a Single Program Multiple Data (SPMD) manner, each operating on a small chunk of data in parallel. Because GPUs have many more cores than CPUs (although a GPU core is much weaker than a CPU core), they are ideal for executing thousands of such threads in parallel. These threads are organized into blocks and grids, where a block is a collection of threads and a grid is a collection of blocks. The CUDA architecture, with all of its features and memory hierarchy, needs much more explanation, but that is outside the scope of this article.
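Inside a kernel, each thread derives a unique global index from its position in the block/grid hierarchy; for a one-dimensional launch this is blockIdx.x * blockDim.x + threadIdx.x. A host-side C sketch of that computation (the function name is my own):

```c
/* Mirror of the per-thread index computation in a 1-D CUDA kernel:
 * global id = blockIdx.x * blockDim.x + threadIdx.x. Each thread uses
 * this index to select the chunk of data it operates on. */
static int global_thread_id(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}
```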

Design strategy

As mentioned, AES is a block cipher that operates on 128-bit data blocks (states) independently. Because the AES cipher performs the same operations on every state block, with no dependencies between blocks (in ECB mode), it provides a high level of data parallelism. I implemented a fork-and-join strategy to exploit this data-level parallelism in the AES block cipher. In this approach, multiple state blocks are encrypted simultaneously, one state block per thread, resulting in higher throughput. As the data size increases, so does the level of parallelism, making GPUs increasingly efficient for bulk encryption.

Figure 11 shows the Host-Device work division and describes the nature of the various AES constituents. As you can see, the only serial operation in AES is the key-scheduling that provides the round keys for the subsequent AES rounds. All AES transformations are highly parallel, since they are applied to each 128-bit state block independently. You can therefore divide the work between the CPU Host and the GPU Device to get maximum throughput from the implementation. Thus, the key-scheduling is performed once on the CPU Host and the expanded key is passed to the GPU Device as a kernel argument. The core AES algorithm, that is, the AES transformations, is implemented using CUDA to fully utilize the parallel computing resources of the GPU.

Figure 11. Host-device work division
Diagram of the host-device work division

In the implementation covered in Part 2 of this series, each CUDA thread takes one state block (16 bytes) as input and converts it to cipher-text by applying the byte-oriented AES transformations for the specified number of rounds. The encryption of one 128-bit state thus remains serial; however, you can use GPU-specific optimization techniques such as loop unrolling to optimize the code. Each CUDA thread is offset by a net 16 bytes (one state block) from its neighbor when accessing the input data and writing back encrypted data. Figure 12 shows the work division among threads and the thread-assignment model that this implementation follows.
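The one-thread-per-state-block mapping also fixes the launch geometry: thread i works at byte offset i * 16, and the grid size is the block count rounded up. A sketch of that arithmetic, assuming a hypothetical block size of 256 threads (the names and the block size are my own, not gKrypt's):

```c
/* Each CUDA thread encrypts one 16-byte state block, so thread i reads
 * and writes at byte offset i * 16. The grid size is the number of
 * thread blocks needed to cover all state blocks, rounded up. */
enum { STATE_BYTES = 16, THREADS_PER_BLOCK = 256 };

static long byte_offset_for_thread(long tid)
{
    return tid * STATE_BYTES;
}

static long grid_blocks_for_data(long data_bytes)
{
    long threads = (data_bytes + STATE_BYTES - 1) / STATE_BYTES;
    return (threads + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
}
```

For example, 1 MB of input yields 65,536 state blocks and a grid of 256 thread blocks at this block size.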

Figure 12. CUDA thread assignment
Diagram of the CUDA thread assignment

Thread execution on the CUDA device

Theoretically, you might think the time required to encrypt 100 MB of data would equal the time required to encrypt 128 bits, since all threads operate simultaneously, but this is not the case. You already know that threads are bundled into blocks that execute on the CUDA device. In the current generation of the CUDA architecture, the execution resources (cores) are divided among Streaming Multiprocessors (SMs), and each SM can execute up to 8 blocks at a time (see Figure 13). If the device has 14 SMs with 32 cores each, as in the Tesla C2050, then only 112 blocks will execute at a time, provided there is no other limiting factor. If there are more than 112 blocks, the remaining blocks must wait their turn. This is the most simplistic picture of thread execution on the device; many other factors come into play, including the number of warps per SM, register usage per thread, shared memory per block, and so on. For more information on how CUDA assigns thread blocks to SMs, see the CUDA occupancy calculator (see Related topics).

Figure 13. Thread blocks assignment to streaming multiprocessors (SMs)
Diagram of thread blocks assignment to streaming multiprocessors (SMs)

Once a thread block is assigned to a streaming multiprocessor, it is further broken down into warps, each an assembly of 32 threads in the current generation of the ever-evolving CUDA architecture. These warps might not all execute in parallel either, but they do operate in a highly efficient fashion: if one warp is waiting on a memory transfer, another warp executes on the same cores during that time, and if more than one warp is ready for execution, a priority mechanism determines which runs next. Although the number of cores might be less than the number of threads launched in a kernel, the CUDA architecture and driver make sure that all of the hardware resources are used to deliver maximum performance.

Configuring CUDA on Linux

The CUDA framework requires an NVIDIA-based graphics processing unit (GPU) to take advantage of the underlying parallel hardware and deliver performance. The CUDA runtime comes with an NVIDIA developer video driver. To use your CUDA-capable graphics card, the first step is to remove any existing NVIDIA driver packages. Open a terminal and execute the following command: $ sudo apt-get --purge remove nvidia*

The next step is to blacklist the default nouveau driver, which conflicts with the CUDA developer driver. Create a new file in /etc/modprobe.d and insert the contents of Listing 1.

Listing 1. Blacklisting the default nouveau driver that conflicts with the CUDA developer driver
# /etc/modprobe.d/blacklist-nouveau.conf 
blacklist nvidiafb 
blacklist nouveau 
blacklist rivafb 
blacklist rivatv 
blacklist vga16fb 
options nouveau modeset=0

To update the initial RAM filesystem, run this command: $ sudo update-initramfs -u

Then reboot the system to let the modification take effect. Once the system restarts, you are ready to install the CUDA development driver.

Installing the CUDA driver

You can download the official release of the latest CUDA-capable driver directly from NVIDIA (see Related topics). You must stop the graphics display manager before the installation can proceed. To switch to console mode, press CTRL+ALT+F2. Then start the automated installation by running the downloaded driver file from the console: $ sudo sh <driver-file>.run (where <driver-file> is the name of the installer you downloaded).

Once the driver installation is complete, simply reboot your system and verify from the Administration menu that the X Server is running and shows the currently installed GPUs.

You are now ready to run the optimization techniques in Part 2 of this series.


The gKrypt engine comes bundled with platform binaries as well as code snippets that guide integration with C/C++ based projects. Making the right use of system resources is vital; the idea is to use optimized plug-ins that instantly boost performance without sacrificing compatibility or ease of use. In Part 2, get the details on optimization techniques, along with integrations that demonstrate a running example.


Related topics

