To ensure the safe and secure transfer of data from source to destination, encryption has become a mandatory technology for secure applications. The most commonly used encryption techniques utilize a deterministic algorithm with an unvarying transformation operating on fixed-length blocks of data. Examples of such techniques include Advanced Encryption Standard (AES), Data Encryption Standard (DES), International Data Encryption Algorithm (IDEA) and Rivest Cipher (RC5) .
Such a “block cipher” approach, however, puts a constraint on the throughput, data processing, and buffering capacities of the hardware since encryption has to be performed before the next chunk of data arrives. A good number of industrial encryption systems support data rates higher than 200 Mbps, but the hardware to achieve this, generally an ASIC, is very expensive as compared to a simple microcontroller (MCU).
Though it is possible to implement encryption on a simple 8-bit MCU with external memory, such as an 8051, the time taken to perform encryption is on an order of magnitude of the time required by an ASIC. This article explains how an SoC with programmable logic can make use of the MCU core along with additional hardware features like Universal Digital Blocks (UDBs) and Direct Memory Access (DMA) to efficiently implement encryption and improve the overall timing of the system.
AES is one of the most commonly used block cipher techniques to implement symmetric key cryptography. We will use AES-128 as an example, which operates on a chunk of 16 bytes (128 bits) of data processed with a 128-bit cipher key, to demonstrate the requirements of an encryption application and potential implementation options. With AES-128, the input bytes are arranged in the form of a block before the processing begins, as illustrated in Figure 1 . In the figure, in0 is the first byte and in15 is the 16th and last byte of the input block.
Byte substitution is the first operation. At this stage, each byte of the input block is replaced with a byte selected from the already known substitution table. The selected value is present at the location at the table pointed to by the two nibbles of the input byte as shown in Figure 2 . The substitution of any byte in row and column can be expressed as:
The substitution table is generally hard-coded in the device (Flash, SRAM, etc). When the CPU is assigned the task of byte substitution, it will fetch the input byte from the program memory and pass it on as an address to the SRAM. The SRAM will then return the byte present at that location. This procedure will take a lot of time before substitution for the whole block is complete.
In order to offload the CPU from doing all these operations, substitution can take place concurrently with the help of DMAs to free the CPU for other tasks. Only the source and the destination address of the memory has to be assigned to the DMA, and it will take care of the data transfers. Moreover, instead of passing on these values to some specific memory locations, DMA can directly transfer the data to the UDBs for further processing without any CPU intervention.
The next stage in AES is Row Shifting. At this stage, each row of the byte substituted input block is shifted left by one byte. The byte shifted out then takes place of the right-most byte. For the first row, no shifting takes place. For the second row, byte shifting takes place once, twice for the third row and three times for the fourth row. This procedure is illustrated in Figure 3 .
The CPU can only perform 8-bit operations and cannot keep the whole block in its view. Effectively, row shifting is essentially changing the location of a byte. For example, after row shifting, byte S1,0 takes the place of S1,3 . Thus, DMA can prove to be much more efficient in picking a byte from one address and transferring it to another.
After row shifting, column mixing is the next step. AES column mixing involves transformation of the data block such that a whole column (4 bytes) is processed to generate a byte. The transformation is effectively multiplication in GF(28 ) with the polynomial p(x)=x8 + x4 + x3 + x + 1 . The matrix representation of column mixing is shown in Figure 4 .
Mathematically, a byte A is generated from a,b,c, and d according to the following equation:
Implementation of multiplication in hardware has always been a challenging task, which is the reason why this equation is generally not implemented in this form. According to the book, Cryptography and Network Security , multiplication of a value by x (i.e. by 02) can be implemented as a 1-bit left shift followed by a conditional bitwise XOR with 0x1B (00011011), if the left-most bit of the original value (before the shift) is 1. By this rule, the above equation simplifies to
where “check_msb” returns 0x1B if the MSB of the byte is 1 and returns 0x0 if the MSB is 0. This simple manipulation can greatly reduce the hardware resource consumption for column mixing.
An SoC with a programmable architecture can implement this process efficiently in hardware. For example, with the PSoC architecture from Cypress, Universal Digital Blocks (UDBs) serve as an ideal candidate for implementing the column mixing operation. Figure 5 shows the UDBs architecture from the PSoC Technical Reference Manual (TRM):
It can be seen that all of the above mentioned byte-wide operations (shift by one bit, XORing) can be performed in a Data Path in a single clock cycle. Before moving on to the actual implementation on UDBs, it is important to understand the internal structure of the Data Path.A data path in UDB is comprised of two 4-byte deep FIFOs, two dataregisters, two accumulator registers and an 8-bit wide ALU. Thesehardware resources can be made to operate with the help of a statemachine. These 8 present states can be configured with the help of DataPath Configuration tool. Figure 6 from PSoC’s TRM shows the Data Path:
Figure 7 shows a state machine for implementation of Equation iii (Column Mixing operation) using UDBs:
It can be seen from Equation iii that a, b, c and d bytes are required to generate byte A . Here, 4-byte deep FIFOs can be used. The Data Path remains in the Check FIFO state until all 4 bytes have been received and the input FIFO is full. Then the Data Path moves to the Load state where it fetches a byte from the FIFO and transfers it to anaccumulator for further processing. A counter can be implemented usingadditional hardware (PLDs) to keep track of each byte, since each byteis treated differently. Moreover, as byte a needs to be multiplied by 2 (check_msb(a<<1 ), the decision state passes it on to the Shift state where it is shifted left by one bit. The shifted out bit, (so inFigure 7), determines whether or not it needs to be XORed with 0x1B .Similarly, the counter increments and the state machine operates oneach byte according to Equation-iii. When the counter hits 5, meaningall the bytes have been loaded and the input FIFO is now empty, theresult can be loaded in the output FIFO. The CPU can fetch the mixedbyte from the output FIFO when the (output FIFO not empty) interrupt isgenerated. Thus, offloading Column Mixing to UDBs can offloadsignificant processing from the CPU.
A whole column (4 bytes) can be generated using 3 more similar Data Paths. The only difference will be the counter checks.
Key Expansion and Round Key Addition
Thelast steps for AES are Key Expansion and Round Key Addition. In KeyExpansion for AES-128, the 128-bit key is processed to generate eleven128-bit Round Key (RK) blocks. Each block is XORed, byte by byte, withthe data block. The key expansion process is shown in Figure 8 . Each RK is generated using the previous RK. The first RK is generated with the help of the actual 128-bit key. If RK(n-1) is the previous RK and RK(n) is the current RK, then the first column of RK(n) is generated by the first byte substituting the last column of R4K(n-1) and vertically circular shifting it upward. This column is then byte-by-byte XORed with the first column of RK(n-1) to give the first column of RK(n) . The second column of RK(n) is the result of a byte-by-byte XOR of the first column of RK(n) and the second column of RK(n-1) . Similarly, the third column of RK(n ) is the result of a byte-by-byte XOR of the second column of RK(n) and the third column of RK(n-1) . The fourth column of RK(n) is generated by a byte-by-byte XOR of the third column of RK(n) and the fourth column of RK(n-1) .
KeyExpansion and Round Key Addition require memory to store the previousand the current 128-bit keys, as well as all of the intermediateresults. It also requires byte-level XOR operations. DMA can be used tofetch bytes from the S-Box pointed to by the 4th column of the 128-bitkey pre-stored in the memory and input one byte at a time to the UDBs.With the help of a simple Data Path state machine, the UDB can shiftthis column vertically. The output can be read either by the CPU or bythe DMA. This column can be input again to a Data Path FIFO along withthe first column for a byte-by-byte XOR.
Figure 10 shows the resource consumption of the optimized column mixing design(polling based) as described in the above state machine in Figure 7. Bymaking use of the additional hardware resources of the PSoC, this designuses approximately 34% fewer machine cycles than a conventional CPUimplementation. Introducing interrupts will further reduce the load onthe CPU, and by offloading all other processes partially or fully to theexternal hardware resources will result in an even faster and efficientAES implementation.
Ahmed Majeed Khan an engineeringrole at Cypress Semiconductor Corporation where among other projects heparticipated in the development of a secure magnetic card readersolution His work experience in mainly in consumer electronics withparticular focus on embedded systems and wireless multimediacommunication. He holds an MS in Electrical Engineering from MichiganState University and has over 8 years of experience working withmicrocontrollers and embedded applications.
Asma Afzal iscurrently pursuing her Masters degree in Electrical Engineering fromNational University of Sciences and Technology (NUST), Pakistan, and shehas been working on embedded systems development for a couple of years.Currently her work involves the optimization of various types ofencryption schemes on PSoC.
Khawar Khurshid is theDirector of NUST-CYPRESS Joint Research Center in Islamabad, Pakistan.He has a doctorate degree from Michigan State University in BiomedicalImaging Systems and currently he is heading the Department of ElectricalEngineering at School of Electrical Engineering and Computer Science(SEECS), National University of Sciences and Technology (NUST),Pakistan. He specializes in Medical Imaging, Computer Vision, PatternRecognition, Image and Signal Processing with research interests in theareas of Positron Emission Tomography, Computed Tomography, MagneticResonance Imaging, Ultrasound Imaging, Thermal Imaging and X-Rays.