# POWER8 in-core cryptography

An introduction to using AES instructions

POWER8 is a family of super-scalar symmetric multiprocessors based on the POWER architecture. The POWER8 series introduced in-core cryptographic enhancements: vector instructions that accelerate the Advanced Encryption Standard (AES) symmetric key cryptography standard.

The POWER8 AES instruction set provides five vector instructions to process AES block cipher encryption/decryption. POWER8 also provides instructions for multiplication in Galois Field, used to implement the Galois Counter Mode (GCM) and GHASH algorithms [1].

This article introduces these cryptographic instructions and shows simple examples to demonstrate how you can use them to implement AES or AES Modes in your application or driver.

## What is AES?

The Advanced Encryption Standard is also known as Rijndael. It was established as a standard for the encryption of electronic data by the U.S. National Institute of Standards and Technology (NIST) in 2001. It's a symmetric key algorithm that processes data blocks of 16 bytes (128 bits). In other words, it is a block cipher algorithm. The 128-bit block fits in a VMX/VSX 128-bit register. Keys for this algorithm can be 128, 192, or 256 bits long. The POWER8 architecture provides five instructions that run the critical steps of the AES algorithm in-core, especially key expansion and the encryption/decryption rounds.

## Vector Multimedia eXtension (VMX)

One of the POWER8 enhancements is an integrated multi-pipeline vector SIMD unit, which supports 32 128-bit VMX vector registers. Vector data can be represented in different ways, as shown in the following table.

| Element type | Size | Elements per 128-bit vector |
| --- | --- | --- |
| qword | 16 bytes | 1 |
| dword | 8 bytes | 2 |
| word | 4 bytes | 4 |
| hword | 2 bytes | 8 |
| byte | 1 byte | 16 (offsets 0x00 to 0x0f) |

For the purpose of using AES, consider using the full 16-byte vector, which can hold a complete round key or the state/cipher text during the encryption/decryption steps.
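As a rough illustration of the element views in the table above, the following Python sketch reinterprets one 16-byte value at each width. Big-endian element order is an assumption made here for readability, and the variable names are illustrative only.

```python
import struct

# A 16-byte vector holding the byte values 0x00..0x0f.
vector = bytes(range(16))

# Reinterpret the same 16 bytes at each element width.
# Big-endian order is assumed for this illustration.
hwords = struct.unpack(">8H", vector)   # eight 2-byte halfwords
words = struct.unpack(">4I", vector)    # four 4-byte words
dwords = struct.unpack(">2Q", vector)   # two 8-byte doublewords
qword = int.from_bytes(vector, "big")   # one 16-byte quadword

print([f"{w:08x}" for w in words])
```

AES works on the vector as a whole, but key expansion is easiest to follow in the four-word view.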

## AES Algorithm

The AES algorithm can be split into the following steps:

- KeyExpansion/Generate Round keys
  - RotWord
  - SubBytes
  - Rcon Xor
- InitialRound
  - AddRoundKey
- Rounds
  - SubBytes
  - ShiftRows
  - MixColumns
  - AddRoundKey
- Final Round (no MixColumns)
  - SubBytes
  - ShiftRows
  - AddRoundKey

Figure 1 shows an overview of the algorithm. Each of the steps is described in the following sections.

##### Figure 1. AES flowchart

### Key Expansion/Generate Round Keys

The Key Expansion/Generate Round Keys step starts with a given key and expands it to multiple keys. 128-bit keys are expanded to 11 keys. 192-bit keys are expanded to 13 keys. 256-bit keys are expanded to 15 keys.

The first expanded key is generated from the last word of the original key. That word is processed in three steps (RotWord, SubBytes, and Rcon Xor) to produce a word, which is then used to generate all 4 words of an expanded key. The next round uses the last word from the key generated in the previous round. This process repeats until all keys are generated. Keep in mind that regardless of the size of the initial key, the round keys that AES uses internally are always 16 bytes.
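The key counts given above follow from the number of AES rounds, which is the key length in 32-bit words plus 6. A minimal sketch, with `round_key_count` as an illustrative name:

```python
def round_key_count(key_bits: int) -> int:
    """Return how many 16-byte round keys AES derives from a key of this size."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES keys are 128, 192, or 256 bits long")
    rounds = key_bits // 32 + 6   # 10, 12, or 14 rounds
    return rounds + 1             # one extra key for the initial AddRoundKey


for bits in (128, 192, 256):
    print(bits, round_key_count(bits))
```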

#### RotWord Step

The Rotate Word (RotWord) step processes a word and rotates its bytes as follows:

Bytes 0 1 2 3 -> 1 2 3 0

Example:

Given a word:

`79 d2 85 46`

The RotWord step would result in:

`d2 85 46 79`
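The rotation can be sketched in a few lines of Python. The helper name `rot_word` is illustrative and not taken from any library.

```python
def rot_word(word: bytes) -> bytes:
    """Rotate a 4-byte word left by one byte: bytes 0 1 2 3 -> 1 2 3 0."""
    if len(word) != 4:
        raise ValueError("RotWord operates on exactly 4 bytes")
    return word[1:] + word[:1]


print(rot_word(bytes.fromhex("79d28546")).hex(" "))  # prints "d2 85 46 79"
```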

#### SubBytes Step

The SubBytes step uses a Substitution box (S-Box) that replaces each byte in a word. The replacement value is derived from the byte's multiplicative inverse in the Galois Field $GF(2^8) = GF(2)[x]/(x^8 + x^4 + x^3 + x + 1)$ [2], followed by a fixed affine transformation [3]. For decryption, it uses an Inverse S-Box.

##### Figure 2. S-Box

For example, using the S-Box, byte `0x9a` is replaced by `0xb8`. This step is done internally using the `vcipher` and `vcipherlast` instructions. Inverse S-Box operations are done internally with the `vncipher` and `vncipherlast` instructions.

For example, given the word:

`af 7f 67 98`

The SubBytes(word) operation would yield:

`79 d2 85 46`
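To make the substitution concrete, the sketch below builds the S-Box in software instead of using the hardware instructions. It computes each byte's multiplicative inverse in GF(2^8) and then applies the affine transformation defined in the AES standard [3]. The function names are illustrative, and the code favors clarity over speed.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11b)."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def gf_inverse(a: int) -> int:
    """Return a^254, which is the inverse of a nonzero byte; 0 maps to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0


def rotl8(value: int, count: int) -> int:
    """Rotate an 8-bit value left by count bits."""
    return ((value << count) | (value >> (8 - count))) & 0xFF


def sbox_entry(byte: int) -> int:
    """Inverse in GF(2^8) followed by the AES affine transformation."""
    inv = gf_inverse(byte)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


SBOX = bytes(sbox_entry(b) for b in range(256))


def sub_word(word: bytes) -> bytes:
    """Apply the S-Box to every byte of a word."""
    return bytes(SBOX[b] for b in word)


print(f"{SBOX[0x9a]:02x}")                              # prints "b8"
print(sub_word(bytes.fromhex("af7f6798")).hex(" "))     # prints "79 d2 85 46"
```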

#### Rcon Xor Step

Rcon, the round constant, is 2 raised to the power (i - 1) in GF(2^8), where i is a user-specified value [3]. In AES key expansion, i is the round number, so rcon(1) is 0x01.

The number of AES rounds needed depends on the size of the key. For 128-bit keys, AES requires up to rcon(10); for 192-bit keys up to rcon(8); and for 256-bit keys up to rcon(7). Thus, for all AES possibilities, we need to have rcon 1 to 10 saved:

Rcon(1-10) =

```
0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36
```
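The ten constants do not have to be memorized. Each one is the previous value doubled in GF(2^8), which is why 0x80 is followed by 0x1b rather than 0x100. A short sketch, with `xtime` and `rcon_table` as illustrative names:

```python
def xtime(value: int) -> int:
    """Multiply a byte by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    value <<= 1
    if value & 0x100:
        value ^= 0x11B   # reduce when the result overflows 8 bits
    return value


def rcon_table(count: int = 10) -> list:
    """Return rcon(1) through rcon(count) as integers."""
    table, value = [], 0x01
    for _ in range(count):
        table.append(value)
        value = xtime(value)
    return table


print(", ".join(f"0x{v:02x}" for v in rcon_table()))
```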

To generate the first expanded key, we xor the key word with rcon(1), that is, key-word xor `01 00 00 00`.

Here's a full example of how the key expansion works:

    Key: 0xac 0x2b 0x3c 0xdd 0xee 0x04 0x11 0x44 0xa1 0x4b 0x5c 0xd1 0x6a 0xb9 0x1c 0xdd

    Splitting key into words for the first key in key expand buffer:
    W0 = ac 2b 3c dd
    W1 = ee 04 11 44
    W2 = a1 4b 5c d1
    W3 = 6a b9 1c dd

    Key expand algorithm always uses the last word to execute steps, in our case W3:
    X1 <- RotWord(W3)
    X1 = b9 1c dd 6a
    Y1 <- SubBytes(X1)
    Y1 = 56 9c c1 02
    Rcon(1) = 01 00 00 00
    Z1 <- Y1 xor 01 00 00 00
    Z1 = 57 9c c1 02
    ------------------------------
    Second key in key expand buffer:
    W4 = W0 xor Z1   = fb b7 fd df
    W5 = (W4 xor W1) = 15 b3 ec 9b
    W6 = (W5 xor W2) = b4 f8 b0 4a
    W7 = (W6 xor W3) = de 41 ac 97
    ------------------------------
    X2 <- RotWord(W7)
    X2 = 41 ac 97 de
    Y2 <- SubBytes(X2)
    Y2 = 83 91 88 1d
    Rcon(2) = 02 00 00 00
    Z2 <- Y2 xor 02 00 00 00
    Z2 = 81 91 88 1d
    ------------------------------
    Third key in key expand buffer:
    W8  = W4 xor Z2    = 7a 26 75 c2
    W9  = (W8 xor W5)  = 6f 95 99 59
    W10 = (W9 xor W6)  = db 6d 29 13
    W11 = (W10 xor W7) = 05 2c 85 84
    (...)
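The same walk-through can be reproduced in software. The following Python sketch expands a 128-bit key with the RotWord, SubBytes, and Rcon Xor steps described above, and prints W4 through W11 for the example key. It is a reference model under the assumption of a 128-bit key only; all function names are illustrative.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def sbox_entry(byte: int) -> int:
    """Multiplicative inverse (byte^254) followed by the AES affine transform."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, byte)
    rot = lambda n: ((inv << n) | (inv >> (8 - n))) & 0xFF
    return inv ^ rot(1) ^ rot(2) ^ rot(3) ^ rot(4) ^ 0x63


SBOX = bytes(sbox_entry(b) for b in range(256))


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def expand_key_128(key: bytes) -> list:
    """Return the 44 four-byte words W0..W43 derived from a 16-byte key."""
    if len(key) != 16:
        raise ValueError("this sketch handles 128-bit keys only")
    words = [key[i:i + 4] for i in range(0, 16, 4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = words[i - 1]
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]                  # RotWord
            temp = bytes(SBOX[b] for b in temp)         # SubBytes
            temp = bytes([temp[0] ^ rcon]) + temp[1:]   # Rcon Xor
            rcon = gf_mul(rcon, 2)                      # next round constant
        words.append(xor_bytes(words[i - 4], temp))
    return words


words = expand_key_128(bytes.fromhex("ac2b3cddee041144a14b5cd16ab91cdd"))
for i in range(4, 12):
    print(f"W{i} = {words[i].hex(' ')}")
```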

Now let's look at creating the expanded keys by using the POWER8 vector and AES instructions.

First, we need to look at the AES instructions. As previously described, POWER8 comes with five AES instructions: `vcipher`, `vcipherlast`, `vncipher`, `vncipherlast`, and `vsbox`. Let's focus on the first two:

    vcipher VRT,VRA,VRB

    State    ← VR[VRA]
    RoundKey ← VR[VRB]
    vtemp1   ← SubBytes(State)
    vtemp2   ← ShiftRows(vtemp1)
    vtemp3   ← MixColumns(vtemp2)
    VR[VRT]  ← vtemp3 ^ RoundKey

`State` is the cipher text in the current round or even the plain text in the first step of AES. Both values are 16 bytes, the AES block size. `RoundKey`, as the name suggests, is the key in the current round. Because VMX vectors are 16 bytes, they can hold full-size round keys and cipher text.

    vcipherlast VRT,VRA,VRB

    State    ← VR[VRA]
    RoundKey ← VR[VRB]
    vtemp1   ← SubBytes(State)
    vtemp2   ← ShiftRows(vtemp1)
    VR[VRT]  ← vtemp2 ^ RoundKey

`vcipherlast` is the same as `vcipher`, except that it has one step less: `MixColumns`.

`vncipher` and `vncipherlast` work the same way as `vcipher` and `vcipherlast`, except that they use the inverse steps and are intended for decryption.

POWER8 does not have a specific instruction for Key Expand. But `vcipherlast`, with some additional steps, can be used to perform the Key Expand operations.

The following steps show an example of how to use `vcipherlast` to perform an Expand Key operation:

In this example, the rcon is already loaded into a vector register. You may want to look in the vmx-crypto driver for more information [5]. Also note that in PowerPC assembly code, registers are referenced by number, not by name. For example, `vperm 3,1,1,5` writes its result to vr3 and uses vr1 and vr5 as inputs. See [4] for more details.

    /**
     * vr0 is assumed to be all zeros (implied by the result of line 2)
     * vr1 is the first key =
     *     0xac 0x2b 0x3c 0xdd 0xee 0x04 0x11 0x44
     *     0xa1 0x4b 0x5c 0xd1 0x6a 0xb9 0x1c 0xdd
     * vr5 is a mask that rotates the last word of the key and
     *     copies it into all four words
     * vr5 = 0x0d0e0f0c 0d0e0f0c 0d0e0f0c 0d0e0f0c
     * vr3 is the destination for the key word in use
     * vr4 is the first rcon loaded:
     *     01 00 00 00 01 00 00 00 01 00 00 00 01 00 00 00
     **/
    Loop128:
    1   vperm       3,1,1,5
    2   vsldoi      6,0,1,12
    3   vcipherlast 3,3,4
    4   vxor        1,1,6
    5   vsldoi      6,0,6,12
    6   vxor        1,1,6
    7   vsldoi      6,0,6,12
    8   vxor        1,1,6
    9   vadduwm     4,4,4
    10  vxor        1,1,3
    11  bdnz        Loop128

Line 1 applies a mask against the key. In this case, after the vperm instruction vr3 will be:

0xb91cdd6a 0xb91cdd6a 0xb91cdd6a 0xb91cdd6a

Line 2 results in:

0x00000000 0xac2b3cdd 0xee041144 0xa14b5cd1

Line 3 calls `vcipherlast` to execute SubBytes, ShiftRows, and an xor with Rcon(n). ShiftRows, as defined in Power ISA 2.07B [4], has no effect when applied to this vector: all four words of vr3 are identical, so every row of the AES state holds four equal bytes and rotating a row changes nothing. In this particular scenario, the instruction therefore performs only SubBytes and xor Rcon(n). In other words, it generates the word Z1. Thus, after `vcipherlast` we have:

Z1: 0x579cc102 579cc102 579cc102 579cc102

Lines 4 to 8 perform the math behind the key words generation in the key expansion algorithm.

    W4 = (W0 xor Z1)
    W5 = (W1 xor W4)
    W6 = (W2 xor W5)
    W7 = (W3 xor W6)

This can be rewritten as:

    W4' = W0
    W5' = W1 xor W0
    W6' = W2 xor W1 xor W0
    W7' = W3 xor W2 xor W1 xor W0

Where `'` means a temporary word before Z1 is applied.

The detailed operations of lines 4 to 8 are:

         W0          W1          W2          W3
    vr1  0xac2b3cdd  0xee041144  0xa14b5cd1  0x6ab91cdd
                     W0          W1          W2
    vr6  0x00000000  0xac2b3cdd  0xee041144  0xa14b5cd1
    ---------------------------------------------------
         W4'         W5'         temp-W6'    temp-W7'
    vr1  0xac2b3cdd  0x422f2d99  0x4f4f4d95  0xcbf2400c
                                 W0          W1
    vr6  0x00000000  0x00000000  0xac2b3cdd  0xee041144
    ---------------------------------------------------
         W4'         W5'         W6'         temp-W7'
    vr1  0xac2b3cdd  0x422f2d99  0xe3647148  0x25f65148
                                             W0
    vr6  0x00000000  0x00000000  0x00000000  0xac2b3cdd
    ---------------------------------------------------
         W4'         W5'         W6'         W7'
    vr1  0xac2b3cdd  0x422f2d99  0xe3647148  0x89dd6d95

Line 9 computes the rcon for the next round by adding vr4 to itself, which doubles each word:

vr4 02 00 00 00 02 00 00 00 02 00 00 00 02 00 00 00

Finally, Line 10 applies Z1 words to generate the first key that is expanded.

         W4'         W5'         W6'         W7'
    vr1  0xac2b3cdd  0x422f2d99  0xe3647148  0x89dd6d95
         Z1          Z1          Z1          Z1
    vr3  0x579cc102  0x579cc102  0x579cc102  0x579cc102
    ---------------------------------------------------
         W4          W5          W6          W7
    vr1  0xfbb7fddf  0x15b3ec9b  0xb4f8b04a  0xde41ac97

Line 11 decrements the count register and, while it is not zero, jumps back to the beginning of the loop. The previous steps repeat once for each round key that still has to be generated.
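To check the register values in the walk-through, one loop iteration can be modeled in Python. The sketch below is a software approximation of `vperm`, `vsldoi`, `vxor`, `vadduwm`, and `vcipherlast` as described in this article, using big-endian byte numbering. It is not generated from the ISA definition, and the Python function names simply mirror the mnemonics.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def sbox_entry(byte: int) -> int:
    """Multiplicative inverse (byte^254) followed by the AES affine transform."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, byte)
    rot = lambda n: ((inv << n) | (inv >> (8 - n))) & 0xFF
    return inv ^ rot(1) ^ rot(2) ^ rot(3) ^ rot(4) ^ 0x63


SBOX = bytes(sbox_entry(b) for b in range(256))


def vperm(a: bytes, b: bytes, mask: bytes) -> bytes:
    """Pick bytes from the 32-byte concatenation a||b using 5-bit indexes."""
    source = a + b
    return bytes(source[m & 0x1F] for m in mask)


def vsldoi(a: bytes, b: bytes, shift: int) -> bytes:
    """Shift a||b left by `shift` bytes and keep the leftmost 16 bytes."""
    return (a + b)[shift:shift + 16]


def vxor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def vadduwm(a: bytes, b: bytes) -> bytes:
    """Add the four 32-bit words of a and b, modulo 2^32."""
    out = b""
    for i in range(0, 16, 4):
        total = int.from_bytes(a[i:i + 4], "big") + int.from_bytes(b[i:i + 4], "big")
        out += (total & 0xFFFFFFFF).to_bytes(4, "big")
    return out


def vcipherlast(state: bytes, round_key: bytes) -> bytes:
    """SubBytes, ShiftRows, then xor with the round key (no MixColumns)."""
    subbed = bytes(SBOX[b] for b in state)
    # Row r of the column-major state rotates left by r columns.
    shifted = bytes(subbed[(i + 4 * (i % 4)) % 16] for i in range(16))
    return vxor(shifted, round_key)


def key_expand_iteration(vr1: bytes, vr4: bytes):
    """Model lines 1 to 10 of Loop128; return the new key and the next rcon."""
    vr0 = bytes(16)                            # all zeros
    vr5 = bytes.fromhex("0d0e0f0c") * 4        # rotate-and-splat mask
    vr3 = vperm(vr1, vr1, vr5)                 # line 1
    vr6 = vsldoi(vr0, vr1, 12)                 # line 2
    vr3 = vcipherlast(vr3, vr4)                # line 3
    vr1 = vxor(vr1, vr6)                       # line 4
    vr6 = vsldoi(vr0, vr6, 12)                 # line 5
    vr1 = vxor(vr1, vr6)                       # line 6
    vr6 = vsldoi(vr0, vr6, 12)                 # line 7
    vr1 = vxor(vr1, vr6)                       # line 8
    vr4 = vadduwm(vr4, vr4)                    # line 9
    vr1 = vxor(vr1, vr3)                       # line 10
    return vr1, vr4


key0 = bytes.fromhex("ac2b3cddee041144a14b5cd16ab91cdd")
rcon1 = bytes.fromhex("01000000") * 4
key1, rcon2 = key_expand_iteration(key0, rcon1)
print(key1.hex())   # prints "fbb7fddf15b3ec9bb4f8b04ade41ac97"
```

Because `vadduwm` is an ordinary addition, doubling this way yields the rcon sequence only up to 0x80. The later constants 0x1b and 0x36 need separate handling, which this sketch leaves out.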

In comparison with Key Expand, AES rounds are simple because they require only the expanded keys and the data to be encrypted or decrypted.

Following is a simple example that shows how to use the in-core instructions. For a more complete code example, see Appendix A.

    /**
     * vr0 is our state or the vector register where our
     *     plaintext/point address resides.
     * vr1 is the key0 provided by the user or first key
     * vr2 is the second generated by expand key
     * vr3 is the third and so on till vr11
     **/
    1   vxor        0,0,1
    2   vcipher     0,0,2
    3   vcipher     0,0,3
    4   vcipher     0,0,4
    5   vcipher     0,0,5
    ...
    11  vcipherlast 0,0,11

Line 1 adds key0 to the initial state of AES (the initial AddRoundKey step).

Line 2 is the first round of AES with key1.

Line 3 is the second round of AES, and so on.

Line 11 is the last round of AES with the last key10.
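The round sequence above can be modeled in software to see how the pieces fit together. The sketch below implements `vcipher` and `vcipherlast` as plain Python functions following the pseudocode in this article, expands a 128-bit key, and encrypts one block with the same vxor, nine `vcipher`, one `vcipherlast` pattern. It is a reference model for study, not a replacement for the hardware instructions, and every function name is illustrative.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def sbox_entry(byte: int) -> int:
    """Multiplicative inverse (byte^254) followed by the AES affine transform."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, byte)
    rot = lambda n: ((inv << n) | (inv >> (8 - n))) & 0xFF
    return inv ^ rot(1) ^ rot(2) ^ rot(3) ^ rot(4) ^ 0x63


SBOX = bytes(sbox_entry(b) for b in range(256))


def vxor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def sub_shift(state: bytes) -> bytes:
    """SubBytes followed by ShiftRows on a column-major 16-byte state."""
    return bytes(SBOX[state[(i + 4 * (i % 4)) % 16]] for i in range(16))


def mix_columns(state: bytes) -> bytes:
    """Multiply each 4-byte column by the fixed AES MixColumns matrix."""
    out = bytearray(16)
    for c in range(0, 16, 4):
        a0, a1, a2, a3 = state[c:c + 4]
        out[c] = gf_mul(a0, 2) ^ gf_mul(a1, 3) ^ a2 ^ a3
        out[c + 1] = a0 ^ gf_mul(a1, 2) ^ gf_mul(a2, 3) ^ a3
        out[c + 2] = a0 ^ a1 ^ gf_mul(a2, 2) ^ gf_mul(a3, 3)
        out[c + 3] = gf_mul(a0, 3) ^ a1 ^ a2 ^ gf_mul(a3, 2)
    return bytes(out)


def vcipher(state: bytes, round_key: bytes) -> bytes:
    return vxor(mix_columns(sub_shift(state)), round_key)


def vcipherlast(state: bytes, round_key: bytes) -> bytes:
    return vxor(sub_shift(state), round_key)


def expand_key_128(key: bytes) -> list:
    """Return the 11 round keys (16 bytes each) for a 128-bit key."""
    words = [key[i:i + 4] for i in range(0, 16, 4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = words[i - 1]
        if i % 4 == 0:
            temp = bytes(SBOX[b] for b in temp[1:] + temp[:1])
            temp = bytes([temp[0] ^ rcon]) + temp[1:]
            rcon = gf_mul(rcon, 2)
        words.append(vxor(words[i - 4], temp))
    return [b"".join(words[i:i + 4]) for i in range(0, 44, 4)]


def encrypt_block(block: bytes, round_keys: list) -> bytes:
    state = vxor(block, round_keys[0])           # line 1: add key0
    for round_key in round_keys[1:-1]:           # lines 2 to 10: vcipher
        state = vcipher(state, round_key)
    return vcipherlast(state, round_keys[-1])    # line 11: vcipherlast


# Known-answer test vector from FIPS-197, Appendix C.1.
keys = expand_key_128(bytes.fromhex("000102030405060708090a0b0c0d0e0f"))
cipher_text = encrypt_block(bytes.fromhex("00112233445566778899aabbccddeeff"), keys)
print(cipher_text.hex())   # prints "69c4e0d86a7b0430d8cdb78070b4c55a"
```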

## Kernel Driver that uses POWER8 in-core instructions

vmx-crypto is the Kernel driver that supports AES in-core for POWER8. Initially, the driver supports AES in CBC and CTR modes. It also supports the GHASH algorithm. It is available in Kernel 4.1 and later. It's both little- and big-endian capable.

To verify if your kernel is using vmx-crypto, you can run `lsmod | grep vmx`. If your machine is not using it already, you can run `modprobe vmx-crypto` and then verify again with `lsmod` or even `cat /proc/crypto | less` and look for the p8 prefix. The algorithms/modes supported by the driver are:

    name   : ghash
    driver : p8_ghash
    module : vmx_crypto

    name   : aes
    driver : p8_aes
    module : vmx_crypto

    name   : cbc(aes)
    driver : p8_aes_cbc
    module : vmx_crypto

    name   : ctr(aes)
    driver : p8_aes_ctr
    module : vmx_crypto
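If you prefer to check for the driver from a script, the listing above can be filtered programmatically. The following sketch parses text in the `/proc/crypto` layout (entries separated by blank lines, one `key : value` pair per line) and reports the algorithms backed by a `p8_` driver. The function name `p8_drivers` and the sample text are illustrative assumptions. On a real system you would pass in the contents of `/proc/crypto` instead.

```python
def p8_drivers(proc_crypto_text: str) -> dict:
    """Map algorithm name to driver for entries whose driver starts with 'p8_'."""
    found = {}
    for entry in proc_crypto_text.strip().split("\n\n"):
        fields = {}
        for line in entry.splitlines():
            key, separator, value = line.partition(":")
            if separator:
                fields[key.strip()] = value.strip()
        driver = fields.get("driver", "")
        if driver.startswith("p8_"):
            found[fields.get("name", "?")] = driver
    return found


# Illustrative sample; real output contains more fields per entry.
sample = """\
name         : ghash
driver       : p8_ghash
module       : vmx_crypto

name         : cbc(aes)
driver       : p8_aes_cbc
module       : vmx_crypto

name         : sha256
driver       : sha256-generic
module       : kernel
"""

print(p8_drivers(sample))
```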

## POWER8 in-core instructions in user space

Many projects use OpenSSL as their crypto provider. Starting with Version 1.0.2, OpenSSL implements its AES code by using in-core POWER8 instructions. If enabled on the running system, this version of OpenSSL (and later) provides better performance by using the VMX POWER8 assembly code and hardware optimization. Because so many applications use OpenSSL for their cryptography, this enhanced OpenSSL enables a wide variety of applications to take advantage of the POWER8 AES instructions.

## Conclusion

In-core instructions on POWER8 systems give you the ability to implement a cryptography stack that uses the power of POWER8 hardware directly, which helps your code to perform well on cryptography benchmark tests.

## References

[1] Brian Hall; Ryan Arnold; Peter Bergner; Wainer dos Santos Moschetta; Robert Enenkel; Pat Haugen; Michael R. Meissner; Alex Mericas; Philipp Oehler; Berni Shiefer; Brian F. Veale; Suresh Warrier; Daniel Zabawa; Adhemerval Zanella. Performance Optimization and Tuning Techniques for IBM Processors, including IBM POWER8. An IBM Redbooks publication, 2014.

[2] G. David Forney, Principles of Digital Communication II - Spring 2005. Introduction to Finite Fields, 2005.

[3] Federal Information Processing Standards Publication – Announcing the Advanced Encryption Standard – AES, 2001.

[4] IBM. Power ISA 2.07B. Vector Facilities, 2015. p.217.

[5] Cerri, M. VMX-crypto driver. Available in: http://lxr.free-electrons.com/source/drivers/crypto/vmx/.

## Appendix A

    /**
     * vmx-crypto AES encrypt/OpenSSL aes encryption
     * At kernel: /drivers/crypto/vmx/aesp8-ppc.S
     * At OpenSSL: /crypto/aes/aesp8-ppc.s
     **/
    .aes_p8_encrypt:
        lwz         6,240(5)
        lis         0,0xfc00
        mfspr       12,256
        li          7,15
        mtspr       256,0

        lvx         0,0,3
        neg         11,4
        lvx         1,7,3
        lvsl        2,0,3
        lvsl        3,0,11
        li          7,16
        vperm       0,0,1,2
        lvx         1,0,5
        lvsl        5,0,5
        srwi        6,6,1
        lvx         2,7,5
        addi        7,7,16
        subi        6,6,1
        vperm       1,1,2,5

        vxor        0,0,1
        lvx         1,7,5
        addi        7,7,16
        mtctr       6

    .Loop_enc:
        vperm       2,2,1,5
        vcipher     0,0,2
        lvx         2,7,5
        addi        7,7,16
        vperm       1,1,2,5
        vcipher     0,0,1
        lvx         1,7,5
        addi        7,7,16
        bdnz        .Loop_enc

        vperm       2,2,1,5
        vcipher     0,0,2
        lvx         2,7,5
        vperm       1,1,2,5
        vcipherlast 0,0,1

        vspltisb    2,-1
        vxor        1,1,1
        li          7,15
        vperm       2,1,2,3
        lvx         1,0,4
        vperm       0,0,0,3
        vsel        1,1,0,2
        lvx         4,7,4
        stvx        1,0,4
        vsel        0,0,4,2
        stvx        0,7,4

        mtspr       256,12
        blr

    /**
     * vmx-crypto AES decrypt
     * At kernel: /drivers/crypto/vmx/aesp8-ppc.S
     * At OpenSSL: /crypto/aes/aesp8-ppc.s
     **/
    .aes_p8_decrypt:
        lwz         6,240(5)
        lis         0,0xfc00
        mfspr       12,256
        li          7,15
        mtspr       256,0

        lvx         0,0,3
        neg         11,4
        lvx         1,7,3
        lvsl        2,0,3
        lvsl        3,0,11
        li          7,16
        vperm       0,0,1,2
        lvx         1,0,5
        lvsl        5,0,5
        srwi        6,6,1
        lvx         2,7,5
        addi        7,7,16
        subi        6,6,1
        vperm       1,1,2,5

        vxor        0,0,1
        lvx         1,7,5
        addi        7,7,16
        mtctr       6

    .Loop_dec:
        vperm       2,2,1,5
        vncipher    0,0,2
        lvx         2,7,5
        addi        7,7,16
        vperm       1,1,2,5
        vncipher    0,0,1
        lvx         1,7,5
        addi        7,7,16
        bdnz        .Loop_dec

        vperm       2,2,1,5
        vncipher    0,0,2
        lvx         2,7,5
        vperm       1,1,2,5
        vncipherlast 0,0,1

        vspltisb    2,-1
        vxor        1,1,1
        li          7,15
        vperm       2,1,2,3
        lvx         1,0,4
        vperm       0,0,0,3
        vsel        1,1,0,2
        lvx         4,7,4
        stvx        1,0,4
        vsel        0,0,4,2
        stvx        0,7,4

        mtspr       256,12
        blr