Implementation of the AES algorithm on Deeply Pipelined DSP/RISC Processor
A more efficient implementation of the Advanced Encryption Standard algorithm on a deeply pipelined RISC/DSP engine reduces overall pipeline stalls during its execution.
By Yosi Stein and Hazarathaiah Malepati
Embedded.com
(08/19/08, 02:18:00 PM EDT)
In the modern digital computer world, cryptography algorithms play an important role in securing important data. In early days, cryptography was used to protect classified information in government and military applications. Now, with Internet access is a common commodity, government agencies, various industries ranging from Hollywood to corporate, financial institutions and universities make widespread use of cryptography algorithms in everything from storage management to web based online transaction processing, etc.

Various cryptography algorithms are used in practice, depending on the application type and the required level of security. The block diagram of a general secured communication system is shown in Figure 1 below.

Figure 1: Communication security modules (a) encryption (b) decryption

For communications security, plan text is encrypted using a secret (or shared) key at the transmitter side and then decrypting the ciphered text at the receiver side using the same key. In this secured communication system, tapping the data by adversaries is almost impossible unless he/she gets the secret key pattern. In the present day, an embedded processor is commonly used to implement the physical layer of such a secured communication system.

Advanced Encryption Standard (AES) is the latest data security standard adopted worldwide in most of the public and private sector for secure data communications and data storage purposes. The AES algorithm is a symmetric key algorithm, standardized by the National Institute of Science and Technology (NIST).

The AES standard [1] specifies the Rijndael algorithm that can process data blocks of 128 bits at a time using the key sizes of 128, 192 or 256 bits length. AES encryption and decryption are based on four different transformations that are applied repeatedly in a certain sequence on input data. The AES standard also specifies a key expansion module to supply keys for multiple iterations of AES algorithm. Depending on input key length, the number of iterations (or complexity) of the AES algorithm will vary.

Since its standardization in 2001, much research has been done to efficiently implement the AES algorithm both in software and hardware. In [2] and [3], various techniques are proposed to implement AES on embedded processors with limited resources.

In [4], an efficient implementation of AES on 32-bit platforms is discussed. In this paper, we discuss a more efficient implementation of AES for use on deeply pipelined embedded processors, keeping in mind the possibility of processor pipeline stalls and stringent memory requirements.

A review of the current AES algorithm
The data flow diagram of the AES algorithm is shown in Figure 2 below. The main transformations in the AES Rijndael's algorithm are:

1) AddRoundKey (AR),
2) SubstiteBytes (SB),
3) ShiftRows (SR) and
4) MixColumns (MC).

All these transformations work on a matrix called 'State' that is formed with byte elements (or 8-bit data units) using the input data of 128-bits. AES State is updated in multiple iterations using the above transformations. The key expansion (KE) module expands the given key for supplying the keys to all iterations of the AES engine.

The number of times the State is iterated in a loop of the AES algorithm depends on the chosen key length (Nk). For example, if we choose the key length of 128 bits (i.e. Nk=4 32-bit words), then we iterate the data (Nr-1) times, where Nr = Nb + Nk + 2 and Nb= 4.

In the AES algorithm, the parameter Nb(= 4) corresponds to the number of rows of State. In the AES cipher (or encryption) process, the AR transformation is applied before the start of the loop. The transformations present within the AES cipher loop are SB, SR, MC and AR.

Also, the transformations SB, SR and AR are applied after the loop, but before outputting the cipher text. Since the AES algorithm works on byte elements, we form the AES State of 4x4 bytes size by filling the 4x4 matrix column wise using the input data of 128 bits. The key expansion module (KE) expands the input Nk words to (Nr+1)*Nb words. (See [1] for more details of AES key expansion process.)

In the Add Round key transformation (AR), we add 16 bytes (or 4 words) of the expanded key to 16 bytes of AES State whenever AR is called. In the Substitute Bytes (SB) transformation, we replace each byte of AES State with the substitute byte look-up table S-Box elements using State bytes as an offset to the look-up table. Refer to [1] for substitute byte look-up table values.

In the Shift Rows (SR) transformation, we rotate 4x4 State rows (except the first row) to the left by the row index number of bytes (i.e. rotate left first row by 1 byte, second row by 2 bytes and third row by 3 bytes). In the Mix Columns (MC) transformation, we multiply the State with the MC matrix in Galois field GF(28).

Figure 2: Flow diagram of AES Encryption engine

In the inverse cipher (or decryption) process, we take cipher text as input and perform all inverse operations (i.e. Inverse Substitute Bytes (ISB), Inverse Shift Rows (ISR), Inverse Add Round key (IAR) and Inverse Mix Columns (IMC) transformations) of the cipher transformation to get the plain text back.

We use different look-up table values for implementing SB and ISB (see [1]). The way we add the round keys to State is the same except that in AR we get keys indexing from start of the buffer whereas in IAR we get keys from the end of expanded key buffer.

We rotate the State row elements to the left in SR whereas in ISR we rotate the State rows to the right. We use the following matrices A and B to multiply the State elements with MC and IMC transformations.

The matrices A and B contain the same four basic elements in each column or in each row. With respect to first column or row the other columns or rows elements are relatively rotated.

A more efficient AES Implementation
In the original AES proposal [2], two approaches are suggested to implement AES, with and without using pre-computed look-up tables. In [3], a method to efficiently implement AES using pre-computed intermediate look-up tables for MC Galois field multiplications is discussed.

The data flow of AES rounds transformations using the approaches discussed in [2] and [3] is shown in Figure 3 below. These methods are suitable for implementing AES on 8-bit non-pipelined platforms. In these approaches, we apply the transformations SB, SR, MC and AR sequentially one after the other and repeat the loop Nr times.

Figure 3: Flow of AES transformations in the main loop

In Figure 3, transformation inputs and outputs crossing shows the dependency of current transformation inputs on previous transformation outputs. If we use the above approach to implement AES on 32-bit pipelined processors, then the cycle count of AES on a 32-bit pipelined processor may be larger when compared to 8-bit non pipelined processors. Due to data dependencies, the code generates many stalls in the instruction pipeline.

When using the look-up table based approach presented in [2] to efficiently implement AES on 32-bit pipelined processors, this method uses 1 kB of memory to store Galois field multiplication values {0x02, 0x03, 0x01, 0x01}-- {S-Box[0x00 to 0xff]} in case of cipher and another 1 kB of memory to store Galois field multiplication values {0x0e, 0x0b, 0x0d, 0x09},{Inv-S-Box[0x00 to 0xff]} in case of inverse cipher.

Therefore, we need 2.5 kB of data memory to store S-Box table values and to store pre-computed look-up table values for both cipher and inverse cipher.

In our proposed method, we combine several of the AES transformations into one. This changes the flow, eliminating intermediate input-output dependency stalls and reducing the overall number of operations. The pre-computed look-up tables are generated based on the following formula.

T = AR(MC(SB(SR(S))))

where S: input State, T: output State of AES loop.

Figure 4 below shows the efficient implementation of above equation in the right way.

Figure 4: Flow of AES encryption using the look-up table method

Analytical background of the proposed method
Let M be the Mix Column matrix elements, with S being the input vector and S' being the output of the Mix Columns transformation.

We pre-compute Li for (Galois field multiplication "." of si with the first column of M) as below and store in memory.

To compute the Mix Column transformation for one column of state s'i we load Li for from memory corresponding to si. Next, we get from Li by rotating Li to right by i bytes, followed by XORing all Li's as shown below.

Where

Finally, we get the output for one iteration of the AES loop by XORing the Mix Columns output with Round Key T= S '(+) K (here to reduce the number of XORs for AR, we transpose AES round keywords in the key expansion module). To compute one column of state matrix, we require four extracts (to get individual State elements after SR), four loads (of Li for both SB and MC), three rotations (for MC) and four XOR's (for both MC and AR).

From Figure 4 above, we can see that the outputs T00 to T33 do not depend on any intermediate results, which allow us to interleave accesses and avoid all the stalls present with the memory (or look-up table) accesses.

In MC, it is more efficient to hold the elements of one in one 32-bit register. For this, we transpose the State matrix before entering the loop and transpose back the AES State matrix after the loop to work with last three transformations outside the loop.

A sample c-code to simulate one row of butterflies using this approach is shown in Figure 5 below. Here, we interleave four butterflies at a time to avoid the pipeline stalls due to memory accesses or due to any ALU operations.

Figure 5: Look-up table based efficient AES encryption implementation

AES implementation on BF5xx
The. Blackfin BF5xx core is a deeply pipelined RISC-DSP processor with 2 MAC units and 2 DAG units. The optimized AES code for this device is shown in Figure 6 below.

On Blackfin, the Shift Row operation is carried out by using the extract instruction and Rotate operation is carried out by using pack and align instructions. We interleaved the code to avoid the pipeline stalls result in accessing of look-up table values with arbitrary offsets.

Figure 6: Efficient implementation of AES encryption transformations on BF5xx core

Here's an explatation of the assembly code flow, shown in Figure 6 above.

Assume at the beginning of code that the AES State 16 bytes are present in four 32-bit registers r0, r1, r2, r3.

We extract four bytes or one column of AES State after SR transformation by using the extract with the location information that is loaded to r4 register in advance. Since we can perform ALU and DAG operations in parallel, we load (DAG operation) the next extract location information to r4 in parallel with the current byte extract (ALU operation) from State.

Then we use the extracted bytes as the offset to look-up table sbmc[] whose address is stored in the p4 pointer register. We move look-up table offsets to DAG registers and this introduces pipeline latency. To avoid pipeline stalls, we interleave the code and use the DAG register after four cycles to compute absolute address for getting a 32-bit data Li (which contains SB transformation and MC intermediate values).

Then, we rotate Li using instructions align24 (for one byte left rotation), pack (for two bytes left rotation) and align8 (for three bytes left rotation). Next, we XOR all four outputs to get the MC transformation output for one column of AES State and also XOR with the key data in r0 to perform AR. We concurrently load data to the ALU registers for processing the next column vector.

With the program code shown in Figure 6 above, we consume 21 cycles for one column and 84 cycles per iteration of the AES loop. We consume a total of 856 (=9*84 + overhead cycles for transformations before and after the loop) cycles for encryption or decryption (using the equivalent inverse cipher flow given in [1]) of 128-bit data using 128-bit secret key.

Hazarathaiah Malepati joined Analog Devices in 2003, and is currently working on embedded algorithm software development for the Blackfin family of processors. From 2000 to 2003, he worked as a Research Engineer in HIRP (HFCL-IISc research program), Bangalore, India. He received his Masters degree in Industrial Electronics from KREC, Surathkal, in 2000. His research interests include data, signal, image and video processing applications for telecommunications.

Yosi Stein serves as DSP Principal System Architect/Advanced Technologies Manager, working in the Digital Media Technology Center on the development of broadband communication and image compression enhanced instruction set for Analog Devices fixed point DSP family. Yosi holds a B.S.c in Electrical Engineering from Technion - Israel Institute of Technology.

References
[1]. FIPS PUB-197, "Advanced Encryption Standard", Nov. 2001.
[2]. Joan Daemen and Vincent Rijmen, "AES Proposal: Rijndael"
[3]. B.Gladman, "A Specification for The AES Algorithm", Sept 2003
[4]. Guido Bertoni, et al. "Efficient Software Implementation of AES on 32-bit Platforms", 4th international workshop on CHES, p.159-171, August, 2002.