CMP EMBEDDED.COM


Implementation of the AES algorithm on Deeply Pipelined DSP/RISC Processor
A more efficient implementation of the Advanced Encryption Standard algorithm on a deeply pipelined RISC/DSP engine reduces overall pipeline stalls during its execution.



Embedded.com

AES implementation on BF5xx
The Blackfin BF5xx core is a deeply pipelined RISC/DSP processor with two multiply-accumulate (MAC) units and two data address generator (DAG) units. The optimized AES code for this device is shown in Figure 6 below.

On Blackfin, the Shift Row operation is carried out using the extract instruction, and the Rotate operation is carried out using the pack and align instructions. We interleaved the code to avoid the pipeline stalls that result from accessing look-up table values at arbitrary offsets.

Figure 6: Efficient implementation of AES encryption transformations on BF5xx core

Here is an explanation of the assembly code flow shown in Figure 6 above.

Assume that, at the beginning of the code, the 16 bytes of the AES State are held in four 32-bit registers: r0, r1, r2 and r3.

We extract four bytes (one column of the AES State after the SR transformation) using the extract instruction, with the location information loaded into register r4 in advance. Because ALU and DAG operations can execute in parallel, we load the next extract location into r4 (a DAG operation) in parallel with the current byte extract from the State (an ALU operation).
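The idea of reading a post-SR column directly out of the State, without first moving any bytes, can be sketched in Python. This is only an illustration under stated assumptions: the list `r` stands in for registers r0 to r3, each word is assumed to hold one State column with row 0 in the least significant byte (the article does not give the byte order), and `get_byte` and `shifted_column` are hypothetical helper names, not Blackfin instructions.

```python
# Assumed layout (not stated in the article): r[c] holds column c of the
# AES State, with row 0 in the least significant byte of the 32-bit word.

def get_byte(word, row):
    """Return the byte for the given row (0..3) of a 32-bit column word."""
    return (word >> (8 * row)) & 0xFF

def shifted_column(r, c):
    """Return the four bytes of column c as they appear after ShiftRows.

    ShiftRows rotates row i left by i positions, so row i of output
    column c comes from input column (c + i) mod 4.
    """
    return [get_byte(r[(c + row) % 4], row) for row in range(4)]

# State filled with bytes 0..15 in column-major order, as in FIPS-197.
state = [int.from_bytes(bytes(range(4 * c, 4 * c + 4)), "little") for c in range(4)]

for c in range(4):
    print(c, shifted_column(state, c))
```

Each output column picks one byte from each of the four words, which is why a byte-extract operation driven by precomputed location information is enough to perform SR on the fly.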

We then use the extracted bytes as offsets into the look-up table sbmc[], whose address is stored in pointer register p4. Moving the look-up table offsets to DAG registers introduces pipeline latency. To avoid pipeline stalls, we interleave the code and use each DAG register only after four cycles to compute the absolute address of the 32-bit data Li (which contains the SB transformation and MC intermediate values).
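As a rough illustration of what a combined SB/MC table can hold, the sketch below builds a 256-entry table of 32-bit words in Python. The entry layout is an assumption: each word packs (2·S[x], S[x], S[x], 3·S[x]) with the first value in the least significant byte, in the style of the table-driven implementations described in [2] and [4]. The actual contents of the article's sbmc[] are not shown, so this should be read as one plausible construction rather than the authors' table. The helper names are hypothetical.

```python
def xtime(a):
    """Multiply by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF

def gf_mul(a, b):
    """Multiply two field elements with shift-and-add."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a = xtime(a)
        b >>= 1
    return product

def gf_inv(a):
    """Multiplicative inverse computed as a^254; 0 maps to 0 as in FIPS-197."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0

def rotl8(b, n):
    """Rotate an 8-bit value left by n bits."""
    return ((b << n) | (b >> (8 - n))) & 0xFF

def sub_byte(x):
    """AES S-box: field inverse followed by the affine transformation."""
    b = gf_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63

sbox = [sub_byte(x) for x in range(256)]

def sbmc_entry(x):
    """Pack (2s, s, s, 3s) for s = S[x], first value in the low byte (assumed layout)."""
    s = sbox[x]
    s2 = xtime(s)
    s3 = s2 ^ s
    return s2 | (s << 8) | (s << 16) | (s3 << 24)

sbmc = [sbmc_entry(x) for x in range(256)]
print(hex(sbmc[0x00]), hex(sbmc[0x01]))
```

With a table of this kind, one load replaces the S-box look-up and the three field multiplications that MC would otherwise need for that byte.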

Next, we rotate Li using the instructions align24 (one-byte left rotation), pack (two-byte left rotation) and align8 (three-byte left rotation). We then XOR all four outputs to obtain the MC transformation output for one column of the AES State, and XOR the result with the key data in r0 to perform AR. Concurrently, we load data into the ALU registers for processing the next column vector.
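The rotate-and-XOR step can be sketched as follows. To keep the example short and checkable, the table here deliberately leaves out the S-box (an identity substitution), so the result can be compared against the widely quoted MixColumns example column db 13 53 45, which maps to 8e 4d a1 bc. The word layout (row 0 in the least significant byte) is an assumption, and `rotl_bytes` is only a stand-in for the effect the article attributes to align24, pack and align8; it does not model the Blackfin instruction semantics.

```python
MASK32 = 0xFFFFFFFF

def xtime(a):
    """Multiply by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF

def rotl_bytes(word, n):
    """Rotate a 32-bit word left by n bytes (n = 0..3)."""
    bits = 8 * n
    return ((word << bits) | (word >> (32 - bits))) & MASK32

def mc_word(s):
    """Pack (2s, s, s, 3s), first value in the low byte (assumed layout)."""
    s2 = xtime(s)
    return s2 | (s << 8) | (s << 16) | ((s2 ^ s) << 24)

# Stand-in table WITHOUT the S-box, so the output is plain MixColumns.
# A real table would index mc_word by S[x] instead of x.
table = [mc_word(x) for x in range(256)]

def column_out(table, col_bytes, round_key_word):
    """Combine four table words: rotate the word for row i left by i bytes,
    XOR the four results together, then XOR in the round key word (AR)."""
    acc = round_key_word
    for row, b in enumerate(col_bytes):
        acc ^= rotl_bytes(table[b], row)
    return acc

result = column_out(table, [0xDB, 0x13, 0x53, 0x45], 0)
print(result.to_bytes(4, "little").hex())
```

Because the three rotated words are just byte rotations of the same table entry format, a single 1 KB table is enough; the alternative of four separate pre-rotated tables trades memory for the rotate instructions.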

With the program code shown in Figure 6 above, we consume 21 cycles for one column and therefore 84 cycles per iteration of the AES loop. We consume a total of 856 cycles (9 × 84 = 756 cycles in the loop, plus overhead cycles for the transformations before and after the loop) for encryption or decryption (using the equivalent inverse cipher flow given in [1]) of 128-bit data using a 128-bit secret key.
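The cycle budget can be broken down with a few lines of arithmetic. The figures for cycles per column, loop iterations and total cycles are taken from the text; the overhead and the cycles-per-byte figure are simply derived from them, and the variable names are illustrative.

```python
CYCLES_PER_COLUMN = 21   # from the article
COLUMNS_PER_ROUND = 4    # a 128-bit State has four 32-bit columns
LOOP_ITERATIONS = 9      # loop iterations quoted in the article
TOTAL_CYCLES = 856       # total quoted in the article
BLOCK_BYTES = 16         # 128-bit data block

cycles_per_round = CYCLES_PER_COLUMN * COLUMNS_PER_ROUND
loop_cycles = LOOP_ITERATIONS * cycles_per_round
overhead_cycles = TOTAL_CYCLES - loop_cycles   # work before and after the loop
cycles_per_byte = TOTAL_CYCLES / BLOCK_BYTES

print(cycles_per_round, loop_cycles, overhead_cycles, cycles_per_byte)
```

On these numbers, the work outside the loop accounts for 100 of the 856 cycles, and the whole block costs 53.5 cycles per byte.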

Hazarathaiah Malepati joined Analog Devices in 2003 and is currently working on embedded algorithm software development for the Blackfin family of processors. From 2000 to 2003, he worked as a Research Engineer in HIRP (HFCL-IISc research program), Bangalore, India. He received his Master's degree in Industrial Electronics from KREC, Surathkal, in 2000. His research interests include data, signal, image and video processing applications for telecommunications.

Yosi Stein serves as DSP Principal System Architect/Advanced Technologies Manager, working in the Digital Media Technology Center on the development of broadband communication and image compression enhanced instruction sets for the Analog Devices fixed-point DSP family. Yosi holds a B.Sc. in Electrical Engineering from Technion - Israel Institute of Technology.

References
[1] FIPS PUB 197, "Advanced Encryption Standard", Nov. 2001.
[2] J. Daemen and V. Rijmen, "AES Proposal: Rijndael".
[3] B. Gladman, "A Specification for The AES Algorithm", Sept. 2003.
[4] G. Bertoni, et al., "Efficient Software Implementation of AES on 32-bit Platforms", 4th International Workshop on CHES, pp. 159-171, Aug. 2002.

