A more efficient implementation of the Advanced Encryption Standard algorithm on a deeply pipelined RISC/DSP engine reduces overall pipeline stalls during its execution.
AES implementation on BF5xx
The Blackfin BF5xx core is a deeply pipelined RISC-DSP processor with
two MAC units and two DAG units. The optimized AES code for this device
is shown in Figure 6 below.
On Blackfin, the Shift Row operation is carried out with the extract
instruction, and the Rotate operation with the pack and align
instructions. We interleaved the code to avoid the pipeline stalls that
result from accessing look-up table values at arbitrary offsets.
Figure 6: Efficient implementation of AES encryption transformations on
the BF5xx core
Here's an explanation of the assembly code flow shown in Figure 6
above.
Assume that, at the beginning of the code, the 16 bytes of the AES
State are present in the four 32-bit registers r0, r1, r2 and r3.
We extract four bytes (one column of the AES State after the SR
transformation) using the extract instruction with location information
that is loaded into the r4 register in advance. Since ALU and DAG
operations can be performed in parallel, we load (DAG operation) the
next extract location information into r4 in parallel with the current
byte extract (ALU operation) from the State.
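The column extraction can be illustrated in portable C. This is only a
minimal sketch, not the Blackfin code of Figure 6: it assumes that each
register holds one State column with row r in bits 8r to 8r+7, and the
names extract_byte and shifted_column are hypothetical. The bit
positions passed to extract_byte play the role of the location
information held in r4.

```c
#include <stdint.h>

/* Return the byte of `word` that starts at bit position `bit_pos`
   (a plain-C stand-in for a byte-extract instruction). */
static uint8_t extract_byte(uint32_t word, unsigned bit_pos)
{
    return (uint8_t)((word >> bit_pos) & 0xFFu);
}

/* Assumed layout: col[c] holds State column c, row r in bits 8r..8r+7.
   ShiftRows rotates row r left by r positions, so output column c takes
   its row-r byte from input column (c + r) mod 4. */
static void shifted_column(const uint32_t col[4], unsigned c, uint8_t out[4])
{
    for (unsigned r = 0; r < 4; r++) {
        out[r] = extract_byte(col[(c + r) & 3u], 8u * r);
    }
}
```

With the State bytes numbered 0x00 to 0x0F in column order, calling
shifted_column for column 0 yields the bytes 00 05 0a 0f.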
Then we use the extracted bytes as offsets into the look-up table
sbmc[], whose address is stored in the p4 pointer register. Moving the
look-up table offsets to DAG registers introduces pipeline latency. To
avoid pipeline stalls, we interleave the code and use each DAG register
only after four cycles to compute the absolute address for fetching a
32-bit word Li (which contains the SB transformation and MC
intermediate values).
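A combined SB/MC look-up table of this kind can be sketched in C. The
layout of sbmc[] is not given in the text, so the packing used below
(2*S(x), S(x), S(x), 3*S(x), least significant byte first) is an
assumption, as are all helper names. The S-box is generated from its
definition in [1] rather than copied, to keep the example
self-contained.

```c
#include <stdint.h>

/* Multiply by 2 in GF(2^8) modulo the AES polynomial 0x11B. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80u) ? 0x1Bu : 0x00u));
}

/* General GF(2^8) multiplication by shift-and-add. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1u) p ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return p;
}

/* Rotate an 8-bit value left by n bits (1 <= n <= 7). */
static uint8_t rotl8(uint8_t v, unsigned n)
{
    return (uint8_t)((v << n) | (v >> (8u - n)));
}

/* AES S-box: multiplicative inverse (0 maps to 0), then the affine
   transform. The brute-force inverse search is slow but easy to check. */
static uint8_t sbox(uint8_t x)
{
    uint8_t inv = 0;
    for (unsigned y = 1; y < 256 && x != 0; y++) {
        if (gf_mul(x, (uint8_t)y) == 1u) { inv = (uint8_t)y; break; }
    }
    return (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                     rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63u);
}

/* Hypothetical builder for a table in the spirit of sbmc[]: each entry
   packs the MixColumns multiples of S(x) into one 32-bit word
   (assumed byte order: 2s in bits 0..7, s, s, 3s in bits 24..31). */
static void build_sbmc(uint32_t table[256])
{
    for (unsigned x = 0; x < 256; x++) {
        uint8_t s  = sbox((uint8_t)x);
        uint8_t s2 = xtime(s);
        uint8_t s3 = (uint8_t)(s2 ^ s);
        table[x] = (uint32_t)s2 | ((uint32_t)s << 8) |
                   ((uint32_t)s << 16) | ((uint32_t)s3 << 24);
    }
}
```

One table read per State byte then replaces the separate SB
substitution and the MC field multiplications, which is why the
latency of forming the table address matters so much in this loop.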
Then we rotate Li using the instructions align24 (one-byte left
rotation), pack (two-byte left rotation) and align8 (three-byte left
rotation). Next, we XOR all four outputs to get the MC transformation
output for one column of the AES State, and also XOR with the key data
in r0 to perform AR. We concurrently load data into the ALU registers
for processing the next column vector.
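The rotate-and-XOR step can be sketched in C as follows. This is an
illustration under stated assumptions, not the code of Figure 6: the
word layout (2x, x, x, 3x, least significant byte first) is assumed,
generic 32-bit rotations stand in for align24, pack and align8, and
mc_word uses the input byte directly instead of its S-box value so that
the result can be checked against a plain MixColumns computation. All
function names are hypothetical.

```c
#include <stdint.h>

/* Multiply by 2 in GF(2^8) modulo the AES polynomial 0x11B. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80u) ? 0x1Bu : 0x00u));
}

/* Stand-in for one table word Li: the MixColumns multiples of x packed
   as (2x, x, x, 3x), least significant byte first. A real combined
   table would store these multiples of S-box(x) instead of x. */
static uint32_t mc_word(uint8_t x)
{
    uint8_t x2 = xtime(x);
    uint8_t x3 = (uint8_t)(x2 ^ x);
    return (uint32_t)x2 | ((uint32_t)x << 8) |
           ((uint32_t)x << 16) | ((uint32_t)x3 << 24);
}

/* Rotate a 32-bit word left by 8, 16 or 24 bits. */
static uint32_t rotl32(uint32_t w, unsigned bits)
{
    return (w << bits) | (w >> (32u - bits));
}

/* Combine the four words of one column: the word for row i is rotated
   left by i bytes, all four are XORed (MC), and the round-key word is
   XORed in last (AR). */
static uint32_t mix_column_add_key(const uint8_t a[4], uint32_t round_key)
{
    uint32_t w = mc_word(a[0]);
    w ^= rotl32(mc_word(a[1]), 8);
    w ^= rotl32(mc_word(a[2]), 16);
    w ^= rotl32(mc_word(a[3]), 24);
    return w ^ round_key;
}
```

For the column db 13 53 45 and an all-zero key word, the function
returns 0xBCA14D8E, that is, the bytes 8e 4d a1 bc from least to most
significant.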
With the program code shown in Figure 6 above, we consume 21 cycles for
one column and 84 cycles (4 x 21) per iteration of the AES loop. We
consume a total of 856 cycles (9 x 84 = 756 loop cycles, plus 100
overhead cycles for the transformations before and after the loop) for
encryption or decryption (using the equivalent inverse cipher flow
given in [1]) of 128-bit data using a 128-bit secret key.
Hazarathaiah Malepati joined Analog Devices in 2003, and is
currently working on embedded algorithm software development for the
Blackfin family of processors. From 2000 to 2003, he worked as a
Research Engineer in HIRP (HFCL-IISc research program), Bangalore,
India. He received his Master's degree in Industrial Electronics from
KREC, Surathkal, in 2000. His research interests include data, signal,
image and video processing applications for telecommunications.
Yosi Stein serves as DSP Principal System Architect/Advanced
Technologies Manager, working in the Digital Media Technology Center on
the development of broadband communication and image compression
enhanced instruction sets for the Analog Devices fixed-point DSP
family. Yosi holds a B.Sc. in Electrical Engineering from Technion -
Israel Institute of Technology.
References
[1] FIPS PUB 197, "Advanced Encryption Standard", Nov. 2001.
[2] Joan Daemen and Vincent Rijmen, "AES Proposal: Rijndael".
[3] B. Gladman, "A Specification for the AES Algorithm", Sept. 2003.
[4] Guido Bertoni, et al., "Efficient Software Implementation of AES on
32-bit Platforms", 4th International Workshop on CHES, pp. 159-171,
August 2002.