Many of these ciphers rely on bit-manipulating operations in order to transform back and forth between plaintext and ciphertext. While easily implemented in dedicated hardware blocks, some of these bit-manipulating operations are computationally expensive when performed in software.
In systems that do a fair amount of encryption/decryption, implementing the symmetric-key ciphers in hardware is the preferred way to accelerate their execution. There are, however, systems in which hardware cryptographic acceleration is unnecessary or inappropriate.
Examples may include systems that perform only a moderate amount of encryption/decryption, systems that demand more cryptographic processing flexibility than that achieved by hardware acceleration, and/or systems that cannot afford the cost of cryptographic acceleration.
In cases where acceleration is not the answer, intelligent software implementations of cryptographic algorithms in general and of symmetric-key ciphers in particular may prove essential.
This article discusses software implementations of symmetric-key ciphers, taking a look at several algorithm-level optimizations as well as low-level instruction selection approaches. Results obtained from running un-optimized and optimized symmetric-key ciphers on our target platform, Analog Devices' Blackfin processor, prove that significant acceleration of symmetric-key ciphers can be attained in software.
A review of Symmetric-Key Cipher
basics
Symmetric-key ciphers (Figure 1, below)
are so named because both ends to a secure communication require access
to the same key. This key is used both for encryption, transforming the
clear plain text into obfuscated cipher text, and for decryption,
recovering the plain text from the obfuscated cipher text.
![]() |
| Figure 1 - Symmetric-Key Ciphers |
Both parties to a secure communication should maintain the secrecy of the symmetric-key. If an unauthorized party gained access to the symmetric-key, such a party would be able to decrypt the ciphertext and consequently break the confidentiality of the communication. Because of the secrecy requirement on the key used in symmetric-key ciphers, such ciphers are also referred to as secret-key, or private-key, ciphers.
Symmetric-key ciphers can be found in a range of security protocols. Internet Protocol security (IPSec), which is a framework of open standards for protecting communications over IP networks, uses symmetric-key ciphers to implement, among other things, data confidentiality.
Every time an employee establishes a Virtual Private Network (VPN) connection to remotely access company computing resources, the symmetric-key ciphers operating within the IPSec framework provide the confidentiality of the communication link.
Symmetric-key ciphers are by no means only limited to servers and PCs, but extend to embedded systems. The widespread use of smart phones and PDAs to access emails and to carry out e-commerce transactions necessitates the execution of symmetric-key ciphers on these devices. Other embedded systems that handle confidential data, such as ATMs, slot machines, POS terminals, and some medical devices also rely on symmetric-key ciphers for data confidentiality.
The most common symmetric-key ciphers in use today include DES/Triple DES and AES. The Data Encryption Standard (DES) became effective in July 1977. The secret key used in DES has a length of 56 bits. This key length was sufficient to provide adequate security in the past.
However, for many applications, a 56-bit key is too short to be secure against today's cryptanalysts. The DES has now been withdrawn. A more secure variant of DES is Triple DES, or officially Triple Data Encryption Algorithm (TDEA). Triple DES is essentially a serial combination of three DES blocks. Therefore, from an implementation point of view, DES and Triple DES are almost identical.
The Advanced Encryption Standard (AES) superseded DES and became effective in 2002. It supports secret-key lengths of 128, 192, and 256 bits. AES is enjoying widespread adoption and use worldwide. In many of the security protocols today, AES and Triple DES are the predominant symmetric-key ciphers used.
Data Encryption Standard (DES and
Triple DES)
The structures of DES and Triple DES are similar. At the heart of both
is the function depicted in Figure 2
below. The E transformation expands the 32-bit input R
producing a 48-bit value E(R).
![]() |
| Figure 2 - The Critical Function in DES and Triple DES |
Figure 3 below shows the bit-selection table, according to which, E expands the bits of R into E(R). If, for example, the input R was equal to 0x00001000, then E(R) would be equal to 0x0000000A0000. (Notice that the bit-selection table refers to the leftmost bit in the input as bit 1 and refers to the rightmost input bit as bit 32).
The 48-bit value E(R) is confused through an XOR operation with a 48-bit round-key K. The result, which is 48-bits long, is subdivided into eight 6-bit long elements. Each of these eight elements is used to index one of eight lookup tables, producing eight 4-bit long elements. These eight 4-bit long elements are concatenated to form a 32-bit value, which is then permuted to produce the 32-bit output.
![]() |
| Figure 3 - E Bit-Selection Table |
A trivial way of implementing the E transformation may be through bit manipulations (AND, shift/rotate, and OR instructions). Figure 4 below shows such an implementation.
![]() |
| Figure 4 - A Trivial Implementation of the E Transformation |
The implementation of E shown in Figure 4 is simple. For the common case, it takes 3 Blackfin instructions (3 cycles) to produce 6 bits of E(R). However, a better implementation may be one that makes use of the Blackfin's EXTRACT instruction. The second argument of the EXTRACT instruction is a 16-bit value.
The least significant byte of that value is the number of bits to extract from the target register. The most significant byte of that value is the offset within the target register to extract from.
For example, if the second argument to the EXTRACT instruction was 0x1A06, then the operation will extract 6 bits starting from offset 0x1A (i.e. the most significant 6 bits of a 32-bit register). An implementation of E based on Blackfin's EXTRACT instruction is shown in Figure 5 below.
![]() |
| Figure 5 - A Trivial Implementation of the E Transformation |
The last two lines in Figure 5 demonstrate that only two instructions are necessary to produce 6 bits of E(R). Furthermore, the two instructions may be issued in parallel, as shown in Figure 6 below, bringing the number of cycles necessary to produce 6 bits of E(R) to 1.
![]() |
| Figure 6 - Parallel Execution of Instructions in the Implementation of E(R) |
Figure 2 earlier, which illustrates the critical function at the heart of DES and Triple DES, shows that the 4-bit elements at the output of the eight substitution boxes (S1 " S8) are concatenated together to produce a 32-bit value that is then permuted to produce the output.
Typically, the bit manipulations involved in concatenation and permutation operations are not efficiently implemented in general purpose embedded processors and DSPs. Concatenations are typically performed using a shift and an OR. Permutations may either be performed one bit at a time using shift and OR instructions or several bits at a time using lookup tables.
A much more efficient implementation of DES may be one where the substitution tables (S1 " S8) are replaced with (P(S1) " P(S8)). In other words, the output of each lookup table P(Sx) produces the 32-bit value that corresponds to the permutation of the 4-bit output of the original lookup table Sx.
![]() |
| Figure 7 - Combine the Permutation and the Substitution in a Single Step |
In addition to this method's advantage of performing both the substitution and the permutation in a single step, this method eliminates the need to concatenate the eight 4-bit outputs of the substitution tables (S1 " S8).
Instead, the final result is obtained through addition of the eight 32-bit outputs of the modified tables (P(S1) " P(S8)), which combine the substitution and permutation operations together. Pictorially, this is shows in Figure 7 above.
Advanced Encryption Standard (AES)
AES is a block cipher that operates on input blocks of size 128 bits. A
block is stored in a 4x4 state array of 8-bit elements, and is
processed iteratively a specified number of rounds Nr. The pseudo code
in Figure 8 below shows the
cipher stage that processes the state array.
![]() |
| Figure 8 - The AES Cipher Stage |
SubBytes(state) is an operation in which the elements of the state array are replaced according to a substitution table as shown in Figure 9 below.
![]() |
| Figure 9 - The AES SubBytes() Operation |
SubBytes() is typically implemented as a lookup from a 256-entry table. A sample implementation is shown in Figure 10 below.
![]() |
| Figure 10 - A Trivial Implementation of AES' SubBytes() |
ShiftRows(state) is shown in Figure 11 below. If every four elements in a row were thought of as a 32-bit word, ShiftRows(state) can be implemented simply on 32-bit architectures as a word rotate instruction.
![]() |
| Figure 11 - The AES ShiftRows() Operation |
In MixColumns(state), which is shown in Figure 12 below, every state column is transformed through a matrix multiplication. The operations (multiplications and additions) are performed in a finite field. Addition in the Galois field 28 is achieved through an XOR of the terms.
In AES' MixColumns(), only multiplications by {02} and by {03} take place. In GF(28), multiplication of Sxx by {02}, denoted as ({02}.Sxx), is achieved by a left shift of Sxx followed by a conditional XOR. Multiplication in GF(28) by {03}, denoted as ({03}.Sxx), is achieved as ({02}.Sxx) + Sxx. Hence, multiplication by {03} is achieved through a left shift of Sxx, a conditional XOR, and an XOR.
![]() |
| Figure 12 - The AES MixColumns() Operation |
Instead of applying the described sequence of shifts and XORs in order to achieve the finite field multiplications, a better solution may be to statically generate a table that contains all the possible multiplication outcomes (no multiplication, multiplication by {02}, and multiplication by {03}). Hence the multiplications in GF(28) are achieved through table lookups.
Implementing the finite-field multiplications through table lookups enables further optimization. Instead of having two separate table lookups for SubBytes() and MixColumns(), one is sufficient.
A statically generated 1024-byte table, in which the four elements
({02}.SubBytes(Sxx)), SubBytes(Sxx), SubBytes(Sxx),
({03}.SubBytes(Sxx)) are indexed by each possible Sxx, may be used to
combine the SubBytes() and MixColumns() operations into one.
![]() |
| Figure 13 - A Lookup Table Based Implementation of AES' SubBytes(), ShiftRows(), and MixColumns() |
Figure 13 above shows a possible implementation based on table lookups. To make table indexing simpler, the table used in Figure 13 contains 4 entries for each possible byte in the state array. Those entries are ({02}.SubBytes(Sxx)), SubBytes(Sxx), SubBytes(Sxx), and ({03}.SubBytes(Sxx)).
Hence, the table used is 1024-bytes long and combines the SubBytes() with the finite-field multiplications of MixColumns(). Furthermore, the use of index with a small shift addressing mode enables the selection of the appropriate elements in the state array to implicitly perform the ShiftRows() operation.
Figure 14 below attempts to illustrate how the implementation presented in Figure 13 combines SubBytes(), ShiftRows(), and MixColumns() intelligently to achieve rapid execution.
![]() |
| Figure 14 - Combining SubBytes(), ShiftRows() and MixColumns() into a Single Operation |
Summary
Software implementations of the AES and Triple-DES symmetric-key
ciphers were developed for the Blackfin embedded processor.
Implementations that took advantage of the above described
optimizations and instruction selection techniques achieved, on
average, 200% improvement in run-time.
Therefore, while hardware-acceleration of cryptographic algorithms has its applications, intelligent software "acceleration" can attain excellent results and has its place in the development portfolio.
Wassim
Bassalee is a senior
applications engineer at Analog Devices Inc. His
interests include security, cryptography, and their applications in
embedded systems. Yosef Stein
serves as DSP principal system
architect/advanced technologies manager in the Digital Media Technology
Center working on the development of broadband communication and image
compression enhanced instructions for Analog Devices' fixed point DSPs.