A more efficient implementation of the Advanced Encryption Standard algorithm on a deeply pipelined RISC/DSP engine reduces overall pipeline stalls during its execution.
In the modern digital computer world,
cryptography
algorithms play an important role in securing important
data. In early days, cryptography was used to protect classified
information in government and military applications. Now, with Internet
access is a common commodity,
government agencies, various industries ranging from Hollywood to
corporate, financial institutions and universities make widespread use
of cryptography algorithms in everything from storage management to web
based online transaction processing, etc.
Various cryptography algorithms are used in practice, depending on
the application type and the required level of security. The block
diagram of a general secured communication system is shown in Figure 1 below.
 |
| Figure
1: Communication security modules (a) encryption (b) decryption |
For communications security, plan text is encrypted using a
secret (or shared) key at the transmitter side and then decrypting the
ciphered
text at the receiver side using the same key. In this secured
communication system, tapping the data by adversaries is almost
impossible unless he/she gets the secret key pattern. In the present
day, an embedded processor is commonly used to implement the physical
layer of such a secured communication system.
Advanced
Encryption Standard (AES) is the latest data security
standard adopted worldwide in most of the public and private sector for
secure data communications and data storage purposes. The AES algorithm
is a symmetric key algorithm, standardized by the National
Institute of Science and Technology (NIST).
The AES standard [1]
specifies the Rijndael algorithm that can process data blocks of 128
bits at a time using the key sizes of 128, 192 or 256 bits length. AES
encryption and decryption are based on four different transformations
that are applied repeatedly in a certain sequence on input data. The
AES standard also specifies a key expansion module to supply keys for
multiple iterations of AES algorithm. Depending on input key length,
the number of iterations (or complexity) of the AES algorithm will
vary.
Since its standardization in 2001, much research has been done to
efficiently implement the AES algorithm both in software and hardware.
In [2] and [3], various techniques are
proposed to implement AES on embedded processors with limited
resources.
In [4], an efficient
implementation of AES on 32-bit platforms is discussed. In this paper,
we discuss a more efficient implementation of AES for use on deeply
pipelined
embedded processors, keeping in mind the possibility of processor
pipeline stalls and
stringent memory requirements.
A review of the current AES
algorithm
The data flow diagram of the AES algorithm is shown in Figure 2 below. The main
transformations in the AES Rijndael's algorithm are:
1) AddRoundKey (AR),
2) SubstiteBytes
(SB),
3) ShiftRows
(SR) and
4) MixColumns
(MC).
All these transformations work on a matrix called 'State' that is
formed with byte elements (or 8-bit data units) using the input data of
128-bits. AES State is updated in multiple iterations using the above
transformations. The key expansion (KE) module expands the given key
for supplying the keys to all iterations of the AES engine.
The number of times the State is iterated in a loop of the AES
algorithm depends on the chosen key length (Nk). For
example, if we
choose the key length of 128 bits (i.e. Nk=4 32-bit words),
then we
iterate the data (Nr-1) times, where Nr = Nb
+ Nk + 2 and Nb= 4.
In the AES algorithm, the parameter Nb(= 4) corresponds
to the
number of rows of State. In
the AES cipher (or encryption) process, the
AR transformation is applied before the start of the loop. The
transformations present within the AES cipher loop are SB, SR, MC and
AR.
Also, the transformations SB, SR and AR are applied after the loop,
but before outputting the cipher text. Since the AES algorithm works on
byte elements, we form the AES State
of 4x4 bytes size by filling the
4x4 matrix column wise using the input data of 128 bits. The key
expansion module (KE) expands the input Nk words to (Nr+1)*Nb
words. (See [1] for more details of AES key expansion
process.)
In the Add Round key transformation (AR), we add 16 bytes (or 4
words) of the expanded key to 16 bytes of AES State whenever AR is
called. In the Substitute Bytes (SB) transformation, we replace each
byte of AES State with the
substitute byte look-up table S-Box elements
using State bytes as an
offset to the look-up table. Refer to
[1] for substitute byte look-up
table values.
In the Shift Rows (SR) transformation, we rotate 4x4 State rows
(except the first row) to the left by the row index number of bytes
(i.e. rotate left first row by 1 byte, second row by 2 bytes and third
row by 3 bytes). In the Mix Columns (MC) transformation, we multiply
the State with the MC matrix in Galois field GF(28).
 |
| Figure
2: Flow diagram of AES Encryption engine |
In the inverse cipher (or decryption) process, we take cipher text
as input and perform all inverse operations (i.e. Inverse Substitute
Bytes (ISB), Inverse Shift Rows (ISR), Inverse Add Round key (IAR) and
Inverse Mix Columns (IMC) transformations) of the cipher transformation
to get the plain text back.
We use different look-up table values for implementing SB and ISB (see [1]). The way we add the round
keys to State is the same except that in AR we get keys indexing from
start of the buffer whereas in IAR we get keys from the end of expanded
key buffer.
We rotate the State row
elements to the left in SR whereas in ISR we
rotate the State rows to the
right. We use the following matrices A and
B to multiply the State elements with MC and IMC transformations.

The matrices A and B contain the same four basic elements in each
column or in each row. With respect to first column or row the other
columns or rows elements are relatively rotated.