CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

Implementation of the AES algorithm on Deeply Pipelined DSP/RISC Processor
A more efficient implementation of the Advanced Encryption Standard algorithm on a deeply pipelined RISC/DSP engine reduces overall pipeline stalls during its execution.



Embedded.com

A more efficient AES Implementation
In the original AES proposal [2], two approaches are suggested to implement AES, with and without using pre-computed look-up tables. In [3], a method to efficiently implement AES using pre-computed intermediate look-up tables for MC Galois field multiplications is discussed.

The data flow of AES rounds transformations using the approaches discussed in [2] and [3] is shown in Figure 3 below. These methods are suitable for implementing AES on 8-bit non-pipelined platforms. In these approaches, we apply the transformations SB, SR, MC and AR sequentially one after the other and repeat the loop Nr times.

Figure 3: Flow of AES transformations in the main loop

In Figure 3, transformation inputs and outputs crossing shows the dependency of current transformation inputs on previous transformation outputs. If we use the above approach to implement AES on 32-bit pipelined processors, then the cycle count of AES on a 32-bit pipelined processor may be larger when compared to 8-bit non pipelined processors. Due to data dependencies, the code generates many stalls in the instruction pipeline.

When using the look-up table based approach presented in [2] to efficiently implement AES on 32-bit pipelined processors, this method uses 1 kB of memory to store Galois field multiplication values {0x02, 0x03, 0x01, 0x01}-- {S-Box[0x00 to 0xff]} in case of cipher and another 1 kB of memory to store Galois field multiplication values {0x0e, 0x0b, 0x0d, 0x09},{Inv-S-Box[0x00 to 0xff]} in case of inverse cipher.

Therefore, we need 2.5 kB of data memory to store S-Box table values and to store pre-computed look-up table values for both cipher and inverse cipher.

In our proposed method, we combine several of the AES transformations into one. This changes the flow, eliminating intermediate input-output dependency stalls and reducing the overall number of operations. The pre-computed look-up tables are generated based on the following formula.

T = AR(MC(SB(SR(S))))

where S: input State, T: output State of AES loop.

Figure 4 below shows the efficient implementation of above equation in the right way.

Figure 4: Flow of AES encryption using the look-up table method

Analytical background of the proposed method
Let M be the Mix Column matrix elements, with S being the input vector and S' being the output of the Mix Columns transformation.

We pre-compute Li for (Galois field multiplication "." of si with the first column of M) as below and store in memory.

To compute the Mix Column transformation for one column of state s'i we load Li for from memory corresponding to si. Next, we get from Li by rotating Li to right by i bytes, followed by XORing all Li's as shown below.

Where

Finally, we get the output for one iteration of the AES loop by XORing the Mix Columns output with Round Key T= S '(+) K (here to reduce the number of XORs for AR, we transpose AES round keywords in the key expansion module). To compute one column of state matrix, we require four extracts (to get individual State elements after SR), four loads (of Li for both SB and MC), three rotations (for MC) and four XOR's (for both MC and AR).

From Figure 4 above, we can see that the outputs T00 to T33 do not depend on any intermediate results, which allow us to interleave accesses and avoid all the stalls present with the memory (or look-up table) accesses.

In MC, it is more efficient to hold the elements of one in one 32-bit register. For this, we transpose the State matrix before entering the loop and transpose back the AES State matrix after the loop to work with last three transformations outside the loop.

A sample c-code to simulate one row of butterflies using this approach is shown in Figure 5 below. Here, we interleave four butterflies at a time to avoid the pipeline stalls due to memory accesses or due to any ALU operations.

Figure 5: Look-up table based efficient AES encryption implementation

1 | 2 | 3

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :