A more efficient implementation of the Advanced Encryption Standard algorithm on a deeply pipelined RISC/DSP engine reduces overall pipeline stalls during its execution.
A more efficient AES Implementation
In the original AES proposal [2],
two approaches are suggested to
implement AES, with and without using pre-computed look-up tables. In
[3], a method to efficiently
implement AES using pre-computed
intermediate look-up tables for MC Galois field multiplications is
discussed.
The data flow of AES rounds transformations using the approaches
discussed in [2] and [3] is shown in Figure 3 below. These methods are
suitable for implementing AES on 8-bit non-pipelined platforms. In
these approaches, we apply the transformations SB, SR, MC and AR
sequentially one after the other and repeat the loop Nr
times.
 |
| Figure
3: Flow of AES transformations in the main loop |
In Figure 3, transformation inputs and outputs crossing shows the
dependency of current transformation inputs on previous transformation
outputs. If we use the above approach to implement AES on 32-bit
pipelined processors, then the cycle count of AES on a 32-bit pipelined
processor may be larger when compared to 8-bit non pipelined
processors. Due to data dependencies, the code generates many stalls in
the instruction pipeline.
When using the look-up table based approach presented in [2] to efficiently implement AES on
32-bit pipelined processors, this method uses 1 kB of memory to store
Galois field multiplication values {0x02, 0x03, 0x01, 0x01}--
{S-Box[0x00 to 0xff]} in case of cipher and another 1 kB of memory to
store Galois field multiplication values {0x0e, 0x0b, 0x0d,
0x09},{Inv-S-Box[0x00 to 0xff]} in case of inverse cipher.
Therefore, we need 2.5 kB of data memory to store S-Box table values
and to store pre-computed look-up table values for both cipher and
inverse cipher.
In our proposed method, we combine several of the AES
transformations into one. This changes the flow, eliminating
intermediate input-output dependency stalls and reducing the overall
number of operations. The pre-computed look-up tables are generated
based on the following formula.
T = AR(MC(SB(SR(S))))
where S: input State, T:
output State of AES loop.
Figure 4 below shows the
efficient implementation of above equation in the right way.
 |
| Figure
4: Flow of AES encryption using the look-up table method |
Analytical background of the
proposed method
Let M be the Mix Column matrix elements, with S being the input vector
and S' being the output of the Mix Columns transformation.

We pre-compute Li for (Galois field multiplication "." of si
with the
first column of M) as below and store in memory.

To compute the Mix Column transformation for one column of state s'i
we load Li for from memory corresponding to si.
Next, we get from Li by
rotating Li to right by i bytes, followed by XORing all Li's
as shown
below.

Where

Finally, we get the output for one iteration of the AES loop by
XORing the Mix Columns output with Round Key T= S '(+) K (here to reduce the number
of XORs for AR, we transpose AES round keywords in the key expansion
module). To compute one column of state matrix, we require four
extracts (to get individual State
elements after SR), four loads (of
Li
for both SB and MC), three rotations (for MC) and four XOR's (for both
MC and AR).
From Figure 4 above, we can
see that the outputs T00 to T33 do not depend on
any intermediate
results, which allow us to interleave accesses and avoid all the stalls
present with the memory (or look-up table) accesses.
In MC, it is more efficient to hold the elements of one in one
32-bit register. For this, we transpose the State matrix before
entering the loop and transpose back the AES State matrix after the
loop to work with last three transformations outside the loop.
A sample c-code to simulate one row of butterflies using this
approach is shown in Figure 5 below.
Here, we interleave four butterflies at a time to avoid the pipeline
stalls due to memory accesses or due to any ALU operations.
 |
| Figure
5: Look-up table based efficient AES encryption implementation |