Enhancing system efficiency of embedded encryption applications

To ensure the safe and secure transfer of data from source to destination, encryption has become a mandatory technology for secure applications. The most commonly used encryption techniques employ a deterministic algorithm with an unvarying transformation operating on fixed-length blocks of data. Examples of such techniques include the Advanced Encryption Standard (AES), Data Encryption Standard (DES), International Data Encryption Algorithm (IDEA), and Rivest Cipher (RC5).
Such a “block cipher” approach, however, puts a constraint on the throughput, data processing, and buffering capacities of the hardware since encryption has to be performed before the next chunk of data arrives. A good number of industrial encryption systems support data rates higher than 200 Mbps, but the hardware to achieve this, generally an ASIC, is very expensive as compared to a simple microcontroller (MCU).
Though it is possible to implement encryption on a simple 8-bit MCU with external memory, such as an 8051, the time taken to perform encryption is an order of magnitude greater than that required by an ASIC. This article explains how an SoC with programmable logic can use the MCU core along with additional hardware features like Universal Digital Blocks (UDBs) and Direct Memory Access (DMA) to efficiently implement encryption and improve the overall timing of the system.
AES is one of the most commonly used block cipher techniques to implement symmetric key cryptography. We will use AES-128 as an example, which operates on a chunk of 16 bytes (128 bits) of data processed with a 128-bit cipher key, to demonstrate the requirements of an encryption application and potential implementation options. With AES-128, the input bytes are arranged in the form of a block before the processing begins, as illustrated in Figure 1. In the figure, in0 is the first byte and in15 is the 16th and last byte of the input block.
Figure 1: Input bytes
Byte substitution is the first operation. At this stage, each byte of the input block is replaced with a byte selected from a known substitution table (S-box). The selected value sits at the table location indexed by the two nibbles of the input byte: the high nibble selects the row and the low nibble selects the column, as shown in Figure 2. The substitution of the byte in any row and column can be expressed as:
Figure 2: Byte substitution
The substitution table is generally hard-coded in the device (Flash, SRAM, etc). When the CPU is assigned the task of byte substitution, it will fetch the input byte from the program memory and pass it on as an address to the SRAM. The SRAM will then return the byte present at that location. This procedure will take a lot of time before substitution for the whole block is complete.
To offload the CPU from these operations, substitution can instead take place concurrently with the help of DMA, freeing the CPU for other tasks. Only the source and destination addresses in memory have to be assigned to the DMA, and it will take care of the data transfers. Moreover, instead of passing these values to specific memory locations, DMA can transfer the data directly to the UDBs for further processing without any CPU intervention.
The next stage in AES is row shifting. At this stage, each row of the byte-substituted input block is rotated left; each byte shifted out takes the place of the right-most byte. For the first row, no shifting takes place. The second row is shifted once, the third row twice, and the fourth row three times. This procedure is illustrated in Figure 3.
Figure 3: Row shifting
The CPU can only perform 8-bit operations and cannot operate on the whole block at once. Row shifting, however, is simply moving a byte from one location to another: after row shifting, byte S1,0 takes the place of S1,3. Thus, DMA can prove much more efficient at picking a byte from one address and transferring it to another.
After row shifting, column mixing is the next step. AES column mixing transforms the data block such that each output byte is generated from a whole column (4 bytes). The transformation is effectively multiplication in GF(2^8) with the polynomial p(x) = x^8 + x^4 + x^3 + x + 1. The matrix representation of column mixing is shown in Figure 4.
Figure 4: Column mixing
Mathematically, a byte A is generated from the column bytes a, b, c, and d according to the following equation:

A = ({02} · a) ⊕ ({03} · b) ⊕ c ⊕ d
Implementing multiplication in hardware has always been a challenging task, which is why this equation is generally not implemented in this form. According to the book Cryptography and Network Security, multiplication of a value by x (i.e., by {02}) can be implemented as a 1-bit left shift followed by a conditional bitwise XOR with 0x1B (0001 1011) if the left-most bit of the original value (before the shift) is 1. By this rule, and using {03} · b = ({02} · b) ⊕ b, the above equation simplifies to

A = (a << 1) ⊕ check_msb(a) ⊕ (b << 1) ⊕ check_msb(b) ⊕ b ⊕ c ⊕ d
where “check_msb” returns 0x1B if the MSB of the byte is 1 and returns 0x0 if the MSB is 0. This simple manipulation can greatly reduce the hardware resource consumption for column mixing.
An SoC with a programmable architecture can implement this process efficiently in hardware. For example, with the PSoC architecture from Cypress, Universal Digital Blocks (UDBs) serve as an ideal candidate for implementing the column mixing operation. Figure 5 shows the UDBs architecture from the PSoC Technical Reference Manual (TRM):
Figure 5: PSoC’s Universal Digital Blocks (UDBs)
It can be seen that all of the above-mentioned byte-wide operations (one-bit shifts, XORing) can be performed in a Data Path in a single clock cycle. Before moving on to the actual implementation on UDBs, it is important to understand the internal structure of the Data Path.