POWER8 in-core cryptography

An introduction to using AES instructions

POWER8 is a family of superscalar symmetric multiprocessors based on the POWER architecture. The POWER8 series introduced in-core cryptographic enhancements that accelerate the Advanced Encryption Standard (AES), a symmetric-key cryptography standard.

The POWER8 AES instruction set provides five vector instructions to process AES block cipher encryption/decryption. POWER8 also provides instructions for multiplication in Galois Field, used to implement the Galois Counter Mode (GCM) and GHASH algorithms [1].

This article introduces these cryptographic instructions and shows simple examples to demonstrate how you can use them to implement AES or AES Modes in your application or driver.

What is AES?

The Advanced Encryption Standard is based on the Rijndael cipher. It was established as a standard for the encryption of electronic data by the U.S. National Institute of Standards and Technology (NIST) in 2001. It's a symmetric-key algorithm that processes data in blocks of 16 bytes (128 bits). In other words, it is a block cipher algorithm. The 128-bit block fits in a VMX/VSX 128-bit register. Keys for this algorithm can be 128, 192, or 256 bits long. The POWER8 architecture provides five instructions that run the critical steps of the AES algorithm in-core, especially the key expansion and the encryption/decryption rounds.

Vector Multimedia eXtension (VMX)

One of the POWER8 enhancements is an integrated multi-pipeline vector SIMD instruction set that operates on 32 VMX vector registers of 128 bits each. Vector data can be represented in different ways, as shown in the following table.

qword
dword dword
word word word word
hword hword hword hword hword hword hword hword
0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x0a 0x0b 0x0c 0x0d 0x0e 0x0f

* hword = 2 bytes, word = 4 bytes, dword = 8 bytes, qword = 16 bytes.

For AES, use the full 16-byte vector, which holds one 16-byte round key or the state (cipher text) during the encryption and decryption steps.
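The different views of one 128-bit vector can be sketched in Python. This is only an illustration of the table above, not POWER8 code; it assumes big-endian element order, and the variable names are made up for the example.

```python
import struct

# The 16 bytes from the table: 0x00 0x01 ... 0x0f
vec = bytes(range(16))

# Reinterpret the same 16 bytes at each element width (big-endian assumed).
hwords = struct.unpack(">8H", vec)    # eight 2-byte halfwords
words = struct.unpack(">4I", vec)     # four 4-byte words
dwords = struct.unpack(">2Q", vec)    # two 8-byte doublewords
qword = int.from_bytes(vec, "big")    # one 16-byte quadword

print([f"{h:04x}" for h in hwords])
print([f"{w:08x}" for w in words])
print([f"{d:016x}" for d in dwords])
print(f"{qword:032x}")
```

An AES round key or state occupies the whole quadword, while the key expansion described later works on the four words individually.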

AES Algorithm

The AES algorithm can be split into the following steps:

  • KeyExpansion/Generate Round keys
    • RotWord
    • SubBytes
    • Rcon Xor
  • InitialRound
    • AddRoundKey
  • Rounds
    • SubBytes
    • ShiftRows
    • MixColumns
    • AddRoundKey
  • Final Round (no MixColumns)
    • SubBytes
    • ShiftRows
    • AddRoundKey

Figure 1 shows an overview of the algorithm. Each of the steps is described in the following sections.

Figure 1. AES flowchart

Key Expansion/Generate Round Keys

The Key Expansion/Generate Round Keys step starts with a given key and expands it to multiple keys. 128-bit keys are expanded to 11 keys. 192-bit keys are expanded to 13 keys. 256-bit keys are expanded to 15 keys.
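These counts follow from the AES round count: a key of Nk 32-bit words needs Nk + 6 rounds, plus one extra key for the initial round. A minimal Python sketch of that arithmetic (the function name is hypothetical):

```python
def round_key_count(key_bits: int) -> int:
    """Return how many 16-byte round keys AES needs for a given key size."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES keys are 128, 192, or 256 bits long")
    nk = key_bits // 32   # key length in 32-bit words
    rounds = nk + 6       # 10, 12, or 14 rounds
    return rounds + 1     # one additional key for the initial round


for bits in (128, 192, 256):
    print(bits, round_key_count(bits))
```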

The first expanded key is generated from the last word of the original key. That word is processed in three steps (RotWord, SubBytes, and Rcon xor) to produce a new word, which is then used to generate all four words of the expanded key. The next round uses the last word of the key generated in the previous round. This process repeats until all keys are generated. Keep in mind that regardless of the size of the initial key, every round key that AES uses internally is 16 bytes long.

RotWord Step

The Rotate Word (RotWord) step processes a word and rotates its bytes as follows:

Bytes 0 1 2 3 -> 1 2 3 0

Example:

Given a word:

79 d2 85 46

The RotWord step would result in:

d2 85 46 79
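In Python, the same rotation is a one-line slice. This is a sketch for illustration; rot_word is a name chosen here, not an instruction or library call.

```python
def rot_word(word: bytes) -> bytes:
    """Rotate a 4-byte word left by one byte: bytes 0 1 2 3 -> 1 2 3 0."""
    return word[1:] + word[:1]


print(rot_word(bytes.fromhex("79d28546")).hex())
```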

SubBytes Step

The SubBytes step uses a substitution box (S-Box) that replaces each byte of a word with its multiplicative inverse in the Galois field GF(2^8) = GF(2)[x]/(x^8 + x^4 + x^3 + x + 1) [2], followed by a fixed affine transformation. For decryption, it uses an inverse S-Box.

Figure 2. S-Box

For example, using the S-Box, byte 0x9a is replaced by 0xb8. This step is done internally using the vcipher and vcipherlast instructions. Inverse S-Box operations are done internally with the vncipher and vncipherlast instructions.

For example, given the word:

af 7f 67 98

The SubBytes(word) operation would yield:

79 d2 85 46
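The S-Box does not have to be typed in as a table; it can be computed. The following Python sketch assumes the standard AES construction (multiplicative inverse in GF(2^8), then the affine transformation with the constant 0x63). The function names are illustrative, and the code favors clarity over speed.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def sbox_byte(x: int) -> int:
    """Compute one S-Box entry."""
    inverse = 1
    for _ in range(254):              # x^254 is the inverse of x; 0 maps to 0
        inverse = gf_mul(inverse, x)
    out = inverse
    for shift in range(1, 5):         # affine step: xor with four rotations
        out ^= ((inverse << shift) | (inverse >> (8 - shift))) & 0xFF
    return out ^ 0x63


def sub_bytes(word: bytes) -> bytes:
    """Apply the S-Box to every byte of the input."""
    return bytes(sbox_byte(b) for b in word)


print(hex(sbox_byte(0x9A)))
print(sub_bytes(bytes.fromhex("af7f6798")).hex())
```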

Rcon Xor Step

The Rcon (round constant) step xors the word with a constant. The constant for round i is 2 raised to the power i - 1, computed in GF(2^8) [3].

The number of round constants needed depends on the size of the key. For 128-bit keys, AES requires up to rcon(10); for 192-bit keys up to rcon(8); and for 256-bit keys up to rcon(7). Thus, for all AES possibilities, we need to have rcon 1 to 10 saved:

Rcon(1-10) = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36
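The ten constants can be regenerated by repeated doubling in GF(2^8). The Python sketch below uses hypothetical helper names; note how the value after 0x80 is reduced by the field polynomial instead of overflowing.

```python
def xtime(b: int) -> int:
    """Multiply a byte by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


def rcon_values(count: int) -> list:
    """Return the first `count` round constants, starting at rcon(1) = 0x01."""
    values = []
    current = 0x01
    for _ in range(count):
        values.append(current)
        current = xtime(current)
    return values


print([f"0x{v:02x}" for v in rcon_values(10)])
```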

To generate the first expanded key, the Rcon step xors the key word with rcon(1), that is, key-word xor 01 00 00 00.

Here's a full example of how the key expansion works:

	Key: 0xac 0x2b 0x3c 0xdd 0xee 0x04 0x11 0x44 0xa1 0x4b 0x5c 0xd1 0x6a 0xb9 0x1c 0xdd

	Splitting key into words for the first key in key expand buffer:
	W0 = ac 2b 3c dd
	W1 = ee 04 11 44
	W2 = a1 4b 5c d1
	W3 = 6a b9 1c dd

	Key expand algorithm always uses the last word to execute steps, in our case W3:

	X1 <- RotWord(W3)
	X1 = b9 1c dd 6a
	Y1 <- SubBytes(X1)
	Y1 = 56 9c c1 02
	Rcon(1) = 01 00 00 00
	Z1 <- Y1 xor 01 00 00 00
	Z1 = 57 9c c1 02
	--------------------
	Second key in key expand buffer:
	W4 =  W0 xor Z1  = fb b7 fd df
	W5 = (W4 xor W1) = 15 b3 ec 9b
	W6 = (W5 xor W2) = b4 f8 b0 4a
	W7 = (W6 xor W3) = de 41 ac 97
	______________________________
	X2 <- RotWord(W7)
	X2 = 41 ac 97 de
	Y2 <- SubBytes(X2)
	Y2 = 83 91 88 1d
	Rcon(2) = 02 00 00 00
	Z2 <- Y2 xor 02 00 00 00
	Z2 = 81 91 88 1d
	------------------------------
	Third key in key expand buffer:
	W8  =   W4 xor Z2  = 7a 26 75 c2
	W9  =  (W8 xor W5) = 6f 95 99 59
	W10 =  (W9 xor W6) = db 6d 29 13
	W11 = (W10 xor W7) = 05 2c 85 84
			(...)
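The walkthrough above can be checked with a short software model. The Python sketch below implements the 128-bit key expansion (44 words) and computes words W4 through W11 for the example key. It is a plain reference model with illustrative names, not the POWER8 implementation, and it computes the S-Box on the fly instead of using a table.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def sbox_byte(x: int) -> int:
    """S-Box entry: inverse in GF(2^8), then the affine transformation."""
    inverse = 1
    for _ in range(254):
        inverse = gf_mul(inverse, x)
    out = inverse
    for shift in range(1, 5):
        out ^= ((inverse << shift) | (inverse >> (8 - shift))) & 0xFF
    return out ^ 0x63


SBOX = [sbox_byte(x) for x in range(256)]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def expand_key_128(key: bytes) -> list:
    """Expand a 16-byte key into 44 four-byte words (11 round keys)."""
    words = [key[i:i + 4] for i in range(0, 16, 4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = words[i - 1]
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]                  # RotWord
            temp = bytes(SBOX[b] for b in temp)         # SubBytes
            temp = bytes([temp[0] ^ rcon]) + temp[1:]   # Rcon xor
            rcon = gf_mul(rcon, 2)                      # next round constant
        words.append(xor(words[i - 4], temp))
    return words


example_key = bytes.fromhex("ac2b3cddee041144a14b5cd16ab91cdd")
for index, word in enumerate(expand_key_128(example_key)[4:12], start=4):
    print(f"W{index} = {word.hex()}")
```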

Now let's look at creating the expand-keys by using the POWER8 vector and AES instructions.

First, we need to look at the AES instructions. As previously described, P8 comes with five AES instructions: vcipher, vcipherlast, vncipher, vncipherlast, and vsbox. Let's focus on the first two:

	vcipher VRT,VRA,VRB
		State ← VR[VRA]
		RoundKey ← VR[VRB]
		vtemp1 ← SubBytes(State)
		vtemp2 ← ShiftRows(vtemp1)
		vtemp3 ← MixColumns(vtemp2)
		VR[VRT] ← vtemp3 ^ RoundKey

State is the 16-byte block being processed: the intermediate cipher text in the current round, which starts out as the plain text combined with the first key. RoundKey, as the name suggests, is the key for the current round. Both values are 16 bytes, the AES block size, so each fits in a single VMX vector register.

	vcipherlast VRT,VRA,VRB
		State ← VR[VRA]
		RoundKey ← VR[VRB]
		vtemp1 ← SubBytes(State)
		vtemp2 ← ShiftRows(vtemp1)
		VR[VRT] ← vtemp2 ^ RoundKey

vcipherlast is the same as vcipher, except that it omits the MixColumns step.

vncipher and vncipherlast mirror vcipher and vcipherlast, except that they use the inverse steps and are intended for decryption.

POWER8 does not have a specific instruction for key expansion. However, vcipherlast, with some additional steps, can perform the key expansion operations.

The following steps show an example of how to use vcipherlast to perform an Expand Key operation:

In this example, the rcon values are already loaded into a vector register; see the vmx-crypto driver for more information [5]. Also note that in PowerPC assembly code, registers are referenced by number, not by name. For example, vperm 3,1,1,5 writes its result to vr3 and uses vr1 (twice) and vr5 as inputs. See [4] for more details.

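/**
* vr0 is assumed to be all zeros (for example, cleared with vxor 0,0,0)
* vr1 is the first key =  0xac 0x2b 0x3c 0xdd 0xee 0x04 0x11 0x44 0xa1 0x4b 0x5c 0xd1 0x6a 0xb9 0x1c 0xdd
* vr5 is a permute mask that rotates the last word of the key and copies it into all four words.
* vr5 = 0x0d0e0f0c 0d0e0f0c 0d0e0f0c 0d0e0f0c
* vr3 receives the rotated word and then holds the Z word
* vr4 is the first rcon loaded: 01 00 00 00 01 00 00 00 01 00 00 00 01 00 00 00
* vr6 is a scratch register for the shifted key
* The count register is assumed to hold the number of loop iterations
**/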
Loop128:
1	vperm 3,1,1,5   
2	vsldoi 6,0,1,12
3	vcipherlast 3,3,4	
4	vxor 1,1,6
5	vsldoi 6,0,6,12
6	vxor 1,1,6
7	vsldoi 6,0,6,12
8	vxor 1,1,6
9	vadduwm 4,4,4
10	vxor 1,1,3
11	bdnz Loop128

Line 1 applies a mask against the key. In this case, after the vperm instruction vr3 will be:

		0xb91cdd6a  0xb91cdd6a 0xb91cdd6a 0xb91cdd6a

Line 2 results in:

		0x00000000 0xac2b3cdd 0xee041144 0xa14b5cd1
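The effect of lines 1 and 2 can be reproduced with a small Python model of the two instructions. The model assumes the big-endian byte numbering used in the Power ISA description, assumes that vr0 is all zeros, and uses function names chosen for this sketch.

```python
def vperm(a: bytes, b: bytes, c: bytes) -> bytes:
    """Model of vperm: each byte of c selects one byte from the 32-byte a || b."""
    source = a + b
    return bytes(source[index & 0x1F] for index in c)


def vsldoi(a: bytes, b: bytes, shift: int) -> bytes:
    """Model of vsldoi: shift a || b left by `shift` bytes, keep 16 bytes."""
    return (a + b)[shift:shift + 16]


vr0 = bytes(16)                                        # assumed to be zero
vr1 = bytes.fromhex("ac2b3cddee041144a14b5cd16ab91cdd")
vr5 = bytes.fromhex("0d0e0f0c") * 4                    # rotate-and-splat mask

vr3 = vperm(vr1, vr1, vr5)     # line 1: RotWord(W3) copied into all four words
vr6 = vsldoi(vr0, vr1, 12)     # line 2: key shifted right by one word

print(vr3.hex())
print(vr6.hex())
```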

Line 3 uses vcipherlast to execute SubBytes, ShiftRows, and an xor with Rcon(n). ShiftRows, as defined in Power ISA 2.07B [4], only moves bytes from one word to another while keeping each byte's position within its word. Because all four words of vr3 are identical here, ShiftRows has no effect. In this particular scenario, the instruction performs only SubBytes and xor Rcon(n). In other words, it generates the Z word, which is Z1 in this round. Thus, after vcipherlast we have:

Z1: 0x579cc102 579cc102 579cc102 579cc102
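The claim that ShiftRows is a no-op on a vector of four identical words can be checked with a software model of vcipherlast. This Python sketch follows the pseudocode shown earlier (SubBytes, ShiftRows, xor with the round key); the helper names are illustrative and the S-Box is computed rather than tabulated.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def sbox_byte(x: int) -> int:
    inverse = 1
    for _ in range(254):
        inverse = gf_mul(inverse, x)
    out = inverse
    for shift in range(1, 5):
        out ^= ((inverse << shift) | (inverse >> (8 - shift))) & 0xFF
    return out ^ 0x63


SBOX = [sbox_byte(x) for x in range(256)]


def shift_rows(state: bytes) -> bytes:
    """Byte r of word c is taken from byte r of word (c + r) mod 4."""
    return bytes(state[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4))


def vcipherlast(state: bytes, round_key: bytes) -> bytes:
    """Model of vcipherlast: SubBytes, ShiftRows, then xor with the round key."""
    substituted = bytes(SBOX[b] for b in state)
    return bytes(x ^ y for x, y in zip(shift_rows(substituted), round_key))


vr3 = bytes.fromhex("b91cdd6a") * 4     # result of line 1
vr4 = bytes.fromhex("01000000") * 4     # rcon(1) in every word

print(shift_rows(vr3) == vr3)           # ShiftRows leaves the vector unchanged
print(vcipherlast(vr3, vr4).hex())
```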

Lines 4 to 8 perform the chain of xor operations that generates the key words in the key expansion algorithm.

     W4 = (W0 xor Z1)
     W5 = (W1 xor W4)
     W6 = (W2 xor W5)
     W7 = (W3 xor W6)

This can be rewritten as:

	W4' = W0
	W5' = W1 xor W0
	W6' = W2 xor W1 xor W0
	W7' = W3 xor W2 xor W1 xor W0

Where ' means a temporary word before Z1 is applied.

The detailed operations of lines 4 to 8 are:

	    W0         W1         W2         W3
vr1	0xac2b3cdd 0xee041144 0xa14b5cd1 0x6ab91cdd
	   zero        W0         W1         W2
vr6	0x00000000 0xac2b3cdd 0xee041144 0xa14b5cd1
	-------------------------------------------
	    W4'        W5'      temp-W6'   temp-W7'
vr1	0xac2b3cdd 0x422f2d99 0x4f4f4d95 0xcbf2400c
	                          W0         W1
vr6	0x00000000 0x00000000 0xac2b3cdd 0xee041144
	-------------------------------------------
	    W4'        W5'        W6'      temp-W7'
vr1	0xac2b3cdd 0x422f2d99 0xe3647148 0x25f65148
	                                     W0
vr6	0x00000000 0x00000000 0x00000000 0xac2b3cdd
	-------------------------------------------
	    W4'        W5'        W6'        W7'
vr1	0xac2b3cdd 0x422f2d99 0xe3647148 0x89dd6d95

Line 9 doubles rcon by adding vr4 to itself, which gives the constant for the next round:

vr4	02 00 00 00 02 00 00 00 02 00 00 00 02 00 00 00
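One caveat about this doubling: vadduwm is ordinary modulo 2^32 word addition, not multiplication in GF(2^8). The Python model below (an illustration with a hypothetical function name) shows that the leading byte follows the Rcon sequence up to 0x80 and then wraps to zero, so the last two constants, 0x1b and 0x36, cannot be produced this way and have to be loaded separately.

```python
def vadduwm(a: bytes, b: bytes) -> bytes:
    """Model of vadduwm: add four 32-bit words independently, modulo 2^32."""
    out = bytearray()
    for i in range(0, 16, 4):
        total = int.from_bytes(a[i:i + 4], "big") + int.from_bytes(b[i:i + 4], "big")
        out += (total & 0xFFFFFFFF).to_bytes(4, "big")
    return bytes(out)


vr4 = bytes.fromhex("01000000") * 4   # rcon(1) in every word
leading_bytes = []
for _ in range(9):
    leading_bytes.append(vr4[0])      # the byte that carries the constant
    vr4 = vadduwm(vr4, vr4)           # line 9 of the loop

print([f"0x{b:02x}" for b in leading_bytes])
```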

Finally, Line 10 xors the Z1 word into all four temporary words to produce the first expanded key.

	      W4'       W5'        W6'        W7'
vr1	0xac2b3cdd 0x422f2d99 0xe3647148 0x89dd6d95
	      Z1        Z1         Z1         Z1
vr3	0x579cc102 0x579cc102 0x579cc102 0x579cc102
	-------------------------------------------
	      W4        W5	    W6		 W7
vr1	0xfbb7fddf 0x15b3ec9b 0xb4f8b04a 0xde41ac97
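Lines 2, 4 to 8, and 10 of the loop can be traced with the same kind of software model. In the Python sketch below, vr0 is assumed to be zero, Z1 is taken from the result of line 3, and the register names are plain variables.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Model of vxor on two 16-byte vectors."""
    return bytes(x ^ y for x, y in zip(a, b))


def vsldoi(a: bytes, b: bytes, shift: int) -> bytes:
    """Model of vsldoi: shift a || b left by `shift` bytes, keep 16 bytes."""
    return (a + b)[shift:shift + 16]


vr0 = bytes(16)                                           # assumed to be zero
vr1 = bytes.fromhex("ac2b3cdd ee041144 a14b5cd1 6ab91cdd")
vr3 = bytes.fromhex("579cc102") * 4                       # Z1 from line 3

vr6 = vsldoi(vr0, vr1, 12)    # line 2:  0   W0  W1  W2
vr1 = xor(vr1, vr6)           # line 4
vr6 = vsldoi(vr0, vr6, 12)    # line 5:  0   0   W0  W1
vr1 = xor(vr1, vr6)           # line 6
vr6 = vsldoi(vr0, vr6, 12)    # line 7:  0   0   0   W0
vr1 = xor(vr1, vr6)           # line 8
temporary_words = vr1         # W4' W5' W6' W7'
round_key = xor(vr1, vr3)     # line 10: W4 W5 W6 W7

print(temporary_words.hex())
print(round_key.hex())
```

Three shift-and-xor pairs are enough because each one folds in the words that lie one position further to the left.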

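Line 11 decrements the count register and, while it is not zero, branches back to the beginning of the loop to repeat the previous steps for the next round key. Because the doubling in Line 9 is plain integer addition, the loop produces valid round constants only up to 0x80; the last two constants for a 128-bit key (0x1b and 0x36) have to be loaded separately.

In comparison with Key Expand, AES rounds are simple because they require only the expanded keys and the data to be encrypted or decrypted.

Following is a simple example that shows how to use the in-core instructions. For a more complete code example, see Appendix A.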

/**
* vr0 is our state: the vector register that holds the
* plaintext block.
* vr1 is key0, the key provided by the user (the first key)
* vr2 is key1, the first key generated by expand key
* vr3 is key2, and so on up to vr11 (key10)
**/
1	vxor 0,0,1
2	vcipher 0,0,2
3	vcipher 0,0,3
4	vcipher 0,0,4
5	vcipher 0,0,5
     ...
11	vcipherlast 0,0,11

Line 1 adds key0 to the initial state of AES (the AddRoundKey step of the initial round).

Line 2 is the first round of AES with key1.

Line 3 is the second round of AES, and so on.

Line 11 is the last round of AES, which uses the last key, key10.
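The whole sequence can be modeled in software to see how the pieces fit together. The Python sketch below implements vcipher and vcipherlast from their pseudocode and runs the 11 steps for a 128-bit key. It is a slow reference model with illustrative names, not the in-core implementation; the usage lines encrypt the example block from Appendix C.1 of the AES standard [3].

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def sbox_byte(x):
    inverse = 1
    for _ in range(254):
        inverse = gf_mul(inverse, x)
    out = inverse
    for shift in range(1, 5):
        out ^= ((inverse << shift) | (inverse >> (8 - shift))) & 0xFF
    return out ^ 0x63

SBOX = [sbox_byte(x) for x in range(256)]

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def sub_bytes(data):
    return bytes(SBOX[b] for b in data)

def shift_rows(s):
    return bytes(s[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4))

def mix_columns(s):
    out = bytearray()
    for c in range(4):
        col = s[4 * c:4 * c + 4]
        for r in range(4):
            out.append(gf_mul(2, col[r]) ^ gf_mul(3, col[(r + 1) % 4])
                       ^ col[(r + 2) % 4] ^ col[(r + 3) % 4])
    return bytes(out)

def vcipher(state, round_key):
    return xor(mix_columns(shift_rows(sub_bytes(state))), round_key)

def vcipherlast(state, round_key):
    return xor(shift_rows(sub_bytes(state)), round_key)

def expand_key_128(key):
    """Return the 11 round keys (16 bytes each) for a 128-bit key."""
    w = [key[i:i + 4] for i in range(0, 16, 4)]
    rcon = 0x01
    for i in range(4, 44):
        t = w[i - 1]
        if i % 4 == 0:
            t = sub_bytes(t[1:] + t[:1])
            t = bytes([t[0] ^ rcon]) + t[1:]
            rcon = gf_mul(rcon, 2)
        w.append(xor(w[i - 4], t))
    return [b"".join(w[i:i + 4]) for i in range(0, 44, 4)]

def encrypt_block(block, key):
    keys = expand_key_128(key)
    state = xor(block, keys[0])            # line 1: vxor with key0
    for round_key in keys[1:10]:           # lines 2 to 10: vcipher with key1..key9
        state = vcipher(state, round_key)
    return vcipherlast(state, keys[10])    # line 11: vcipherlast with key10

key = bytes.fromhex("000102030405060708090a0b0c0d0e0f")
plaintext = bytes.fromhex("00112233445566778899aabbccddeeff")
print(encrypt_block(plaintext, key).hex())
```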

Kernel Driver that uses POWER8 in-core instructions

vmx-crypto is the Linux kernel driver that supports in-core AES on POWER8. Initially, the driver supported AES in CBC and CTR modes, as well as the GHASH algorithm. It is available in kernel 4.1 and later and works on both little- and big-endian systems.

To check whether your kernel is using vmx-crypto, run lsmod | grep vmx. If the module is not loaded, run modprobe vmx-crypto and then check again with lsmod, or run cat /proc/crypto | less and look for drivers with the p8 prefix. The algorithms and modes supported by the driver are:

  • name : ghash

    driver : p8_ghash

    module : vmx_crypto

  • name : aes

    driver : p8_aes

    module : vmx_crypto

  • name : cbc(aes)

    driver : p8_aes_cbc

    module : vmx_crypto

  • name : ctr(aes)

    driver : p8_aes_ctr

    module : vmx_crypto
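The same check can be scripted. The Python sketch below parses text in the /proc/crypto layout (blank-line separated entries of "key : value" lines) and lists the entries whose driver name starts with p8_. The function name and the sample text are assumptions made for this example; on a real system you would pass in the contents of /proc/crypto.

```python
def p8_algorithms(proc_crypto_text: str) -> list:
    """Return (name, driver) pairs for entries whose driver starts with 'p8_'."""
    found = []
    entry = {}
    # The extra empty line flushes the last entry.
    for line in proc_crypto_text.splitlines() + [""]:
        if not line.strip():
            if entry.get("driver", "").startswith("p8_"):
                found.append((entry.get("name"), entry["driver"]))
            entry = {}
            continue
        key, _, value = line.partition(":")
        entry[key.strip()] = value.strip()
    return found


# Hypothetical sample in the /proc/crypto layout, used only for illustration.
sample = """name         : cbc(aes)
driver       : p8_aes_cbc
module       : vmx_crypto

name         : aes
driver       : aes-generic
module       : kernel
"""

print(p8_algorithms(sample))
```

To inspect a live system, the call would look like p8_algorithms(open("/proc/crypto").read()).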

POWER8 in-core instructions in user space

Many projects use OpenSSL as their crypto provider. Starting with Version 1.0.2, OpenSSL implements AES by using the in-core POWER8 instructions. When the instructions are available on the running system, this version of OpenSSL (and later) provides better performance by using the POWER8 VMX assembly code and hardware optimization. Because so many applications use OpenSSL for their cryptography, this enhanced OpenSSL enables a wide variety of applications to take advantage of the POWER8 AES instructions.

Conclusion

In-core instructions on POWER8 systems give you the ability to implement a cryptography stack that uses the power of POWER8 hardware directly, which helps your code to perform well on cryptography benchmark tests.

References

[1] Brian Hall; Ryan Arnold; Peter Bergner; Wainer dos Santos Moschetta; Robert Enenkel; Pat Haugen; Michael R. Meissner; Alex Mericas; Philipp Oehler; Berni Shiefer; Brian F. Veale; Suresh Warrier; Daniel Zabawa; Adhemerval Zanella. Performance Optimization and Tuning Techniques for IBM Processors, including IBM POWER8 An IBM Redbooks publication, 2014.

[2] G. David Forney, Principles of Digital Communication II - Spring 2005. Introduction to Finite Fields, 2005.

[3] Federal Information Processing Standards Publication – Announcing the Advanced Encryption Standard – AES, 2001.

[4] IBM. Power ISA 2.07B. Vector Facilities, 2015. p.217.

[5] Cerri, M. VMX-crypto driver. Available in: http://lxr.free-electrons.com/source/drivers/crypto/vmx/.

Appendix A

/**
* vmx-crypto AES encrypt/OpenSSL aes encryption
* At kernel: /drivers/crypto/vmx/aesp8-ppc.S
* At OpenSSL: /crypto/aes/aesp8-ppc.s
**/
.aes_p8_encrypt:
	lwz	6,240(5)
	lis	0,0xfc00
	mfspr	12,256
	li	7,15
	mtspr	256,0

	lvx	0,0,3
	neg	11,4
	lvx	1,7,3
	lvsl	2,0,3

	lvsl	3,0,11

	li	7,16
	vperm	0,0,1,2
	lvx	1,0,5
	lvsl	5,0,5
	srwi	6,6,1
	lvx	2,7,5
	addi	7,7,16
	subi	6,6,1
	vperm	1,1,2,5

	vxor	0,0,1
	lvx	1,7,5
	addi	7,7,16
	mtctr	6

.Loop_enc:
	vperm	2,2,1,5
	vcipher 0,0,2
	lvx	2,7,5
	addi	7,7,16
	vperm	1,1,2,5
	vcipher 0,0,1
	lvx	1,7,5
	addi	7,7,16
	bdnz	.Loop_enc

	vperm	2,2,1,5
	vcipher 0,0,2
	lvx	2,7,5
	vperm	1,1,2,5
	vcipherlast 0,0,1

	vspltisb	2,-1
	vxor	1,1,1
	li	7,15
	vperm	2,1,2,3

	lvx	1,0,4
	vperm	0,0,0,3
	vsel	1,1,0,2
	lvx	4,7,4
	stvx	1,0,4
	vsel	0,0,4,2
	stvx	0,7,4

	mtspr	256,12
	blr	

/**
* vmx-crypto AES decrypt
* At kernel: /drivers/crypto/vmx/aesp8-ppc.S
* At OpenSSL: /crypto/aes/aesp8-ppc.s
**/
.aes_p8_decrypt:
	lwz	6,240(5)
	lis	0,0xfc00
	mfspr	12,256
	li	7,15
	mtspr	256,0

	lvx	0,0,3
	neg	11,4
	lvx	1,7,3
	lvsl	2,0,3

	lvsl	3,0,11

	li	7,16
	vperm	0,0,1,2
	lvx	1,0,5
	lvsl	5,0,5
	srwi	6,6,1
	lvx	2,7,5
	addi	7,7,16
	subi	6,6,1
	vperm	1,1,2,5

	vxor	0,0,1
	lvx	1,7,5
	addi	7,7,16
	mtctr	6

.Loop_dec:
	vperm	2,2,1,5
	vncipher 0,0,2
	lvx	2,7,5
	addi	7,7,16
	vperm	1,1,2,5
	vncipher 0,0,1
	lvx	1,7,5
	addi	7,7,16
	bdnz	.Loop_dec

	vperm	2,2,1,5
	vncipher 0,0,2
	lvx	2,7,5
	vperm	1,1,2,5
	vncipherlast 0,0,1

	vspltisb	2,-1
	vxor	1,1,1
	li	7,15
	vperm	2,1,2,3

	lvx	1,0,4
	vperm	0,0,0,3
	vsel	1,1,0,2
	lvx	4,7,4
	stvx	1,0,4
	vsel	0,0,4,2
	stvx	0,7,4

	mtspr	256,12
	blr
