# Acceleration of Symmetric-Key Algorithms in Software

Symmetric-key ciphers can be found in a range of security protocols.They are primarily used for encrypting communication, data, media, anda host of other content types.

Many of these ciphers rely on bit-manipulating operations in orderto transform back and forth between plaintext and ciphertext. Whileeasily implemented in dedicated hardware blocks, some of thesebit-manipulating operations are computationally expensive whenperformed in software.

In systems that do a fair amount of encryption/decryption,implementing the symmetric-key ciphers inhardware is the preferred way to accelerate their execution. There are,however, systems in which hardware cryptographic acceleration isunnecessary or inappropriate.

Examples may include systems that perform only a moderate amount ofencryption/decryption, systems that demand more cryptographicprocessing flexibility than that achieved by hardware acceleration,and/or systems that cannot afford the cost of cryptographicacceleration.

In cases where acceleration is not the answer, intelligent softwareimplementations of cryptographic algorithms in general and ofsymmetric-key ciphers in particular may prove essential.

This article discusses software implementations of symmetric-keyciphers, taking a look at several algorithm-level optimizations as wellas low-level instruction selection approaches. Results obtained fromrunning un-optimized and optimized symmetric-key ciphers on our targetplatform, Analog Devices' Blackfin processor, prove that significantacceleration of symmetric-key ciphers can be attained in software.

A review of Symmetric-Key Cipherbasics

Symmetric-key ciphers (Figure 1, below )are so named because both ends to a secure communication require accessto the same key. This key is used both for encryption, transforming theclear plain text into obfuscated cipher text, and for decryption,recovering the plain text from the obfuscated cipher text.

Figure1 – Symmetric-Key Ciphers |

Both parties to a secure communication should maintain the secrecyof the symmetric-key. If an unauthorized party gained access to thesymmetric-key, such a party would be able to decrypt the ciphertext andconsequently break the confidentiality of the communication. Because ofthe secrecy requirement on the key used in symmetric-key ciphers, suchciphers are also referred to as secret-key, or private-key, ciphers.

Symmetric-key ciphers can be found in a range of security protocols.InternetProtocol security (IPSec) ,which is a framework of open standards for protecting communicationsover IP networks, uses symmetric-key ciphers to implement, among otherthings, data confidentiality.

Every time an employee establishes a VirtualPrivate Network (VPN) connection to remotely access companycomputing resources, the symmetric-key ciphers operating within theIPSec framework provide the confidentiality of the communication link.

Symmetric-key ciphers are by no means only limited to servers andPCs, but extend to embedded systems. The widespread use of smart phonesand PDAs to access emails and to carry out e-commerce transactionsnecessitates the execution of symmetric-key ciphers on these devices.Other embedded systems that handle confidential data, such as ATMs,slot machines, POS terminals, and some medical devices also rely onsymmetric-key ciphers for data confidentiality.

The most common symmetric-key ciphers in use today includeDES/Triple DES and AES. The DataEncryption Standard (DES) became effective in July 1977.The secret key used in DES has a length of 56 bits. This key length wassufficient to provide adequate security in the past.

However, for many applications, a 56-bit key is too short to besecure against today's cryptanalysts. The DES has now been withdrawn. Amore secure variant of DES is Triple DES, or officially Triple Data Encryption Algorithm (TDEA). Triple DES isessentially a serial combination of three DES blocks. Therefore, froman implementation point of view, DES and Triple DES are almostidentical.

The AdvancedEncryption Standard (AES) superseded DES and becameeffective in 2002. It supports secret-key lengths of 128, 192, and 256bits. AES is enjoying widespread adoption and use worldwide. In many ofthe security protocols today, AES and Triple DES are the predominantsymmetric-key ciphers used.

Data Encryption Standard (DES andTriple DES)

The structures of DES and Triple DES are similar. At the heart of bothis the function depicted in Figure 2below. The E transformation expands the 32-bit input Rproducing a 48-bit value E(R).

Figure2 – The Critical Function in DES and Triple DES |

Figure 3 below shows thebit-selection table, according to which, E expands the bits of R intoE(R). If, for example, the input R was equal to 0x00001000, then E(R)would be equal to 0x0000000A0000. (Notice that the bit-selection tablerefers to the leftmost bit in the input as bit 1 and refers to therightmost input bit as bit 32).

The 48-bit value E(R) is confused through an XOR operation with a48-bit round-key K. The result, which is 48-bits long, is subdividedinto eight 6-bit long elements. Each of these eight elements is used toindex one of eight lookup tables, producing eight 4-bit long elements.These eight 4-bit long elements are concatenated to form a 32-bitvalue, which is then permuted to produce the 32-bit output.

Figure3 – E Bit-Selection Table |

A trivial way of implementing the E transformation may be throughbit manipulations (AND, shift/rotate, and OR instructions). Figure 4 below shows such animplementation.

Figure4 – A Trivial Implementation of the E Transformation |

The implementation of E shown in Figure 4 is simple. For the commoncase, it takes 3 Blackfin instructions (3 cycles) to produce 6 bits ofE(R). However, a better implementation may be one that makes use of theBlackfin's EXTRACT instruction. The second argument of the EXTRACTinstruction is a 16-bit value.

The least significant byte of that value is the number of bits toextract from the target register. The most significant byte of thatvalue is the offset within the target register to extract from.

For example, if the second argument to the EXTRACT instruction was0x1A06, then the operation will extract 6 bits starting from offset0x1A (i.e. the most significant 6 bits of a 32-bit register). Animplementation of E based on Blackfin's EXTRACT instruction is shown inFigure 5 below .

Figure5 – A Trivial Implementation of the E Transformation |

The last two lines in Figure 5 demonstrate that only twoinstructions are necessary to produce 6 bits of E(R). Furthermore, thetwo instructions may be issued in parallel, as shown in Figure 6 below, bringing the numberof cycles necessary to produce 6 bits of E(R) to 1.

Figure6 – Parallel Execution of Instructions in the Implementation of E(R) |

Figure 2 earlier, which illustrates the critical function at theheart of DES and Triple DES, shows that the 4-bit elements at theoutput of the eight substitution boxes (S1 ” S8) are concatenatedtogether to produce a 32-bit value that is then permuted to produce theoutput.

Typically, the bit manipulations involved in concatenation andpermutation operations are not efficiently implemented in generalpurpose embedded processors and DSPs. Concatenations are typicallyperformed using a shift and an OR. Permutations may either be performedone bit at a time using shift and OR instructions or several bits at atime using lookup tables.

A much more efficient implementation of DES may be one where thesubstitution tables (S1 ” S8) are replaced with (P(S1) ” P(S8)). Inother words, the output of each lookup table P(Sx) produces the 32-bitvalue that corresponds to the permutation of the 4-bit output of theoriginal lookup table Sx.

Figure7 – Combine the Permutation and the Substitution in a Single Step |

In addition to this method's advantage of performing both thesubstitution and the permutation in a single step, this methodeliminates the need to concatenate the eight 4-bit outputs of thesubstitution tables (S1 ” S8).

Instead, the final result is obtained through addition of the eight32-bit outputs of the modified tables (P(S1) ” P(S8)), which combinethe substitution and permutation operations together. Pictorially, thisis shows in Figure 7 above.

Advanced Encryption Standard (AES)

AES is a block cipher that operates on input blocks of size 128 bits. Ablock is stored in a 4×4 state array of 8-bit elements, and isprocessed iteratively a specified number of rounds Nr. The pseudo codein Figure 8 below shows thecipher stage that processes the state array.

Figure8 – The AES Cipher Stage |

SubBytes(state) is an operation in which the elements of the statearray are replaced according to a substitution table as shown in Figure 9 below.

Figure9 – The AES SubBytes() Operation |

SubBytes() is typically implemented as a lookup from a 256-entrytable. A sample implementation is shown in Figure 10 below .

Figure10 – A Trivial Implementation of AES' SubBytes() |

ShiftRows(state) is shown in Figure11 below . If every four elements in a row were thought of as a32-bit word, ShiftRows(state) can be implemented simply on 32-bitarchitectures as a word rotate instruction.

Figure11 – The AES ShiftRows() Operation |

In MixColumns(state), which is shown in Figure 12 below, every state columnis transformed through a matrix multiplication. The operations(multiplications and additions) are performed in a finite field.Addition in the Galois field 28 is achieved through an XOR of theterms.

In AES' MixColumns(), only multiplications by {02} and by {03} takeplace. In GF(28), multiplication of Sxx by {02}, denoted as ({02}.Sxx),is achieved by a left shift of Sxx followed by a conditional XOR.Multiplication in GF(28) by {03}, denoted as ({03}.Sxx), is achieved as({02}.Sxx) + Sxx. Hence, multiplication by {03} is achieved through aleft shift of Sxx, a conditional XOR, and an XOR.

Figure12 – The AES MixColumns() Operation |

Instead of applying the described sequence of shifts and XORs inorder to achieve the finite field multiplications, a better solutionmay be to statically generate a table that contains all the possiblemultiplication outcomes (no multiplication, multiplication by {02}, andmultiplication by {03}). Hence the multiplications in GF(28) areachieved through table lookups.

Implementing the finite-field multiplications through table lookupsenables further optimization. Instead of having two separate tablelookups for SubBytes() and MixColumns(), one is sufficient.

A statically generated 1024-byte table, in which the four elements({02}.SubBytes(Sxx)), SubBytes(Sxx), SubBytes(Sxx),({03}.SubBytes(Sxx)) are indexed by each possible Sxx, may be used tocombine the SubBytes() and MixColumns() operations into one.

Figure13 – A Lookup Table Based Implementation of AES' SubBytes(),ShiftRows(), and MixColumns() |

Figure 13 above shows apossible implementation based on table lookups. To make table indexingsimpler, the table used in Figure 13 contains 4 entries for eachpossible byte in the state array. Those entries are({02}.SubBytes(Sxx)), SubBytes(Sxx), SubBytes(Sxx), and({03}.SubBytes(Sxx)).

Hence, the table used is 1024-bytes long and combines the SubBytes()with the finite-field multiplications of MixColumns(). Furthermore, theuse of index with a small shift addressing mode enables the selectionof the appropriate elements in the state array to implicitly performthe ShiftRows() operation.

Figure 14 below attempts toillustrate how the implementation presented in Figure 13 combinesSubBytes(), ShiftRows(), and MixColumns() intelligently to achieverapid execution.

Figure14 – Combining SubBytes(), ShiftRows() and MixColumns() into a SingleOperation |

Summary

Software implementations of the AES and Triple-DES symmetric-keyciphers were developed for the Blackfin embedded processor.Implementations that took advantage of the above describedoptimizations and instruction selection techniques achieved, onaverage, 200% improvement in run-time.

Therefore, while hardware-acceleration of cryptographic algorithmshas its applications, intelligent software “acceleration” can attainexcellent results and has its place in the development portfolio.

WassimBassalee is a seniorapplications engineer at Analog Devices Inc . Hisinterests include security, cryptography, and their applications inembedded systems. Yosef Stein serves as DSP principal systemarchitect/advanced technologies manager in the Digital Media TechnologyCenter working on the development of broadband communication and imagecompression enhanced instructions for Analog Devices' fixed point DSPs.