# Implementation of the AES algorithm on Deeply Pipelined DSP/RISC Processor

In the modern digital computer world, cryptographyalgorithms play an important role in securing importantdata. In early days, cryptography was used to protect classifiedinformation in government and military applications. Now, with Internetaccess is a common commodity,government agencies, various industries ranging from Hollywood tocorporate, financial institutions and universities make widespread useof cryptography algorithms in everything from storage management to webbased online transaction processing, etc.

Various cryptography algorithms are used in practice, depending onthe application type and the required level of security. The blockdiagram of a general secured communication system is shown in Figure 1 below.

Figure1: Communication security modules (a) encryption (b) decryption |

For communications security, plan text is encrypted using asecret (or shared) key at the transmitter side and then decrypting thecipheredtext at the receiver side using the same key. In this securedcommunication system, tapping the data by adversaries is almostimpossible unless he/she gets the secret key pattern. In the presentday, an embedded processor is commonly used to implement the physicallayer of such a secured communication system.

AdvancedEncryption Standard (AES) is the latest data securitystandard adopted worldwide in most of the public and private sector forsecure data communications and data storage purposes. The AES algorithmis a symmetric key algorithm, standardized by the NationalInstitute of Science and Technology (NIST).

The AES standard [1] specifies the Rijndael algorithm that can process data blocks of 128bits at a time using the key sizes of 128, 192 or 256 bits length. AESencryption and decryption are based on four different transformationsthat are applied repeatedly in a certain sequence on input data. TheAES standard also specifies a key expansion module to supply keys formultiple iterations of AES algorithm. Depending on input key length,the number of iterations (or complexity) of the AES algorithm willvary.

Since its standardization in 2001, much research has been done toefficiently implement the AES algorithm both in software and hardware.In [2] and [3] , various techniques areproposed to implement AES on embedded processors with limitedresources.

In [4], an efficientimplementation of AES on 32-bit platforms is discussed. In this paper,we discuss a more efficient implementation of AES for use on deeplypipelinedembedded processors, keeping in mind the possibility of processorpipeline stalls andstringent memory requirements.

A review of the current AESalgorithm

The data flow diagram of the AES algorithm is shown in Figure 2 below . The maintransformations in the AES Rijndael's algorithm are:

1) AddRoundKey (AR),

2) SubstiteBytes(SB),

3) ShiftRows(SR) and

4) MixColumns(MC).

All these transformations work on a matrix called 'State' that isformed with byte elements (or 8-bit data units) using the input data of128-bits. AES State is updated in multiple iterations using the abovetransformations. The key expansion (KE) module expands the given keyfor supplying the keys to all iterations of the AES engine.

The number of times the State is iterated in a loop of the AESalgorithm depends on the chosen key length (N_{k} ). Forexample, if wechoose the key length of 128 bits (i.e. N_{k} =4 32-bit words),then weiterate the data (N_{r-1} ) times, where N_{r} = N_{b} + N_{k} + 2 and N_{b} = 4.

In the AES algorithm, the parameter N_{b} (= 4) correspondsto thenumber of rows of State . Inthe AES cipher (or encryption) process, theAR transformation is applied before the start of the loop. Thetransformations present within the AES cipher loop are SB, SR, MC andAR.

Also, the transformations SB, SR and AR are applied after the loop,but before outputting the cipher text. Since the AES algorithm works onbyte elements, we form the AES State of 4×4 bytes size by filling the4x4 matrix column wise using the input data of 128 bits. The keyexpansion module (KE) expands the input N_{k} words to (N_{r+1} )*N_{b} words. (See [1] for more details of AES key expansionprocess .)

In the Add Round key transformation (AR), we add 16 bytes (or 4words) of the expanded key to 16 bytes of AES State whenever AR iscalled. In the Substitute Bytes (SB) transformation, we replace eachbyte of AES State with thesubstitute byte look-up table S-Box elementsusing State bytes as anoffset to the look-up table. Refer to[1] for substitute byte look-uptable values.

In the Shift Rows (SR) transformation, we rotate 4×4 State rows(except the first row) to the left by the row index number of bytes(i.e. rotate left first row by 1 byte, second row by 2 bytes and thirdrow by 3 bytes). In the Mix Columns (MC) transformation, we multiplythe State with the MC matrix in Galois field GF(2^{8} ).

Figure2: Flow diagram of AES Encryption engine |

In the inverse cipher (or decryption) process, we take cipher textas input and perform all inverse operations (i.e. Inverse SubstituteBytes (ISB), Inverse Shift Rows (ISR), Inverse Add Round key (IAR) andInverse Mix Columns (IMC) transformations) of the cipher transformationto get the plain text back.

We use different look-up table values for implementing SB and ISB (see [1] ). The way we add the roundkeys to State is the same except that in AR we get keys indexing fromstart of the buffer whereas in IAR we get keys from the end of expandedkey buffer.

We rotate the State rowelements to the left in SR whereas in ISR werotate the State rows to theright. We use the following matrices A andB to multiply the State elements with MC and IMC transformations.

The matrices A and B contain the same four basic elements in eachcolumn or in each row. With respect to first column or row the othercolumns or rows elements are relatively rotated.

A more efficient AES Implementation

In the original AES proposal [2] ,two approaches are suggested toimplement AES, with and without using pre-computed look-up tables. In[3] , a method to efficientlyimplement AES using pre-computedintermediate look-up tables for MC Galois field multiplications isdiscussed.

The data flow of AES rounds transformations using the approachesdiscussed in [2] and [3] is shown in Figure 3 below . These methods aresuitable for implementing AES on 8-bit non-pipelined platforms. Inthese approaches, we apply the transformations SB, SR, MC and ARsequentially one after the other and repeat the loop N_{r} times.

Figure3: Flow of AES transformations in the main loop |

In Figure 3, transformation inputs and outputs crossing shows thedependency of current transformation inputs on previous transformationoutputs. If we use the above approach to implement AES on 32-bitpipelined processors, then the cycle count of AES on a 32-bit pipelinedprocessor may be larger when compared to 8-bit non pipelinedprocessors. Due to data dependencies, the code generates many stalls inthe instruction pipeline.

When using the look-up table based approach presented in [2] to efficiently implement AES on32-bit pipelined processors, this method uses 1 kB of memory to storeGalois field multiplication values {0x02, 0x03, 0x01, 0x01}–{S-Box[0x00 to 0xff]} in case of cipher and another 1 kB of memory tostore Galois field multiplication values {0x0e, 0x0b, 0x0d,0x09},{Inv-S-Box[0x00 to 0xff]} in case of inverse cipher.

Therefore, we need 2.5 kB of data memory to store S-Box table valuesand to store pre-computed look-up table values for both cipher andinverse cipher.

In our proposed method, we combine several of the AEStransformations into one. This changes the flow, eliminatingintermediate input-output dependency stalls and reducing the overallnumber of operations. The pre-computed look-up tables are generatedbased on the following formula.

T = AR(MC(SB(SR(S))) )

where S: input State , T:output State of AES loop.

Figure 4 below shows theefficient implementation of above equation in the right way.

Figure4: Flow of AES encryption using the look-up table method |

Analytical background of theproposed method

Let M be the Mix Column matrix elements, with S being the input vectorand S' being the output of the Mix Columns transformation.

We pre-compute Li for (Galois field multiplication “. ” of s_{i} with thefirst column of M) as below and store in memory.

To compute the Mix Column transformation for one column of state s'_{i} we load L_{i} for from memory corresponding to s_{i} .Next, we get from L_{i} byrotating L_{i } to right by i bytes, followed by XORing all L_{i} 'sas shownbelow.

Where

Finally, we get the output for one iteration of the AES loop byXORing the Mix Columns output with Round Key T= S '(+) K (here to reduce the numberof XORs for AR, we transpose AES round keywords in the key expansionmodule ). To compute one column of state matrix, we require fourextracts (to get individual Stateelements after SR ), four loads (ofL _{i} for both SB and MC ), three rotations (for MC ) and four XOR's (for bothMC and AR ).

From Figure 4 above , we cansee that the outputs T_{00} to T_{33} do not depend onany intermediateresults, which allow us to interleave accesses and avoid all the stallspresent with the memory (or look-up table) accesses.

In MC, it is more efficient to hold the elements of one in one32-bit register. For this, we transpose the State matrix beforeentering the loop and transpose back the AES State matrix after theloop to work with last three transformations outside the loop.

A sample c-code to simulate one row of butterflies using thisapproach is shown in Figure 5 below. Here, we interleave four butterflies at a time to avoid the pipelinestalls due to memory accesses or due to any ALU operations.

Figure5: Look-up table based efficient AES encryption implementation |

AES implementation on BF5xx

The. Blackfin BF5xx core is a deeplypipelined RISC-DSP processor with 2 MACunits and 2 DAG units. The optimized AES code for this device is shownin Figure 6 below.

On Blackfin, the Shift Row operation is carried out by using theextract instruction and Rotate operation is carried out by using packand align instructions. We interleaved the code to avoid the pipelinestalls result in accessing of look-up table values with arbitraryoffsets.

Figure6: Efficient implementation of AES encryption transformations on BF5xxcore |

Here's an explatation of the assembly code flow, shown in Figure 6above .

Assume at the beginning of code that the AES State 16 bytes arepresent in four 32-bit registers r0, r1, r2, r3.

We extract four bytes or one column of AES State after SRtransformation by using the extract with the location information thatis loaded to r4 register in advance. Since we can perform ALU and DAGoperations in parallel, we load (DAG operation) the next extractlocation information to r4 in parallel with the current byte extract(ALU operation) from State .

Then we use the extracted bytes as the offset to look-up tablesbmc[] whose address is storedin the p4 pointer register. We movelook-up table offsets to DAG registers and this introduces pipelinelatency. To avoid pipeline stalls, we interleave the code and use theDAG register after four cycles to compute absolute address for gettinga 32-bit data L_{i } (which contains SB transformation and MCintermediatevalues).

Then, we rotate L_{i} using instructions align24 (for one byte leftrotation), pack (for two bytes left rotation) and align8 (for threebytes left rotation). Next, we XOR all four outputs to get the MCtransformation output for one column of AES State and also XOR with thekey data in r0 to perform AR. We concurrently load data to the ALUregisters for processing the next column vector.

With the program code shown in Figure6 above , we consume 21 cycles for one column and 84 cycles periteration of the AES loop. We consume a total of 856 (=9*84 + overheadcycles for transformations before and after the loop ) cycles forencryption or decryption (using the equivalent inverse cipher flowgiven in [1] ) of 128-bit datausing 128-bit secret key.

**Hazarathaiah Malepati joined Analog Devices in 2003, and iscurrently working on embedded algorithm software development for theBlackfin family of processors. From 2000 to 2003, he worked as aResearch Engineer in HIRP (HFCL-IISc research program), Bangalore,India. He received his Masters degree in Industrial Electronics fromKREC, Surathkal, in 2000. His research interests include data, signal,image and video processing applications for telecommunications. **

Yosi Stein serves as DSP PrincipalSystem Architect/Advanced Technologies Manager, working in the DigitalMedia Technology Center on the development of broadband communicationand image compression enhanced instruction set for Analog Devices fixedpoint DSP family. Yosi holds a B.S.c in Electrical Engineering fromTechnion – Israel Institute of Technology.

** **

References

[1]. FIPS PUB-197, “Advanced Encryption Standard “ ,Nov. 2001.

[2]. Joan Daemen and VincentRijmen, “AES Proposal:Rijndael“

[3]. B.Gladman, “ASpecification for The AES Algorithm”, Sept 2003

[4]. Guido Bertoni, et al. “Efficient SoftwareImplementation of AES on 32-bit Platforms“, 4th internationalworkshop on CHES, p.159-171, August, 2002.