CMP EMBEDDED.COM


Understanding Crypto Performance in Embedded Systems: Part 1
Introduction to Cryptography in Embedded Systems



Embedded.com
Crypto Raw Performance
The raw performance of an algorithm accelerator has the potential to be one of the larger variables in system-level performance, particularly at larger data sizes where software overheads are less of a factor.

Despite apparently significant differences in operating frequencies, ASIC-style individual algorithm accelerators manufactured with the same process technology node have similar raw performance.

The operating frequency of an algorithm accelerator is not a good indicator of its raw performance, because the number of internal permutations performed per clock cycle is merely an implementation trade-off between more or fewer pipeline stages.

Most accelerator vendors have a good understanding of their algorithms' raw performance, and this value typically varies by algorithm and key length. AES accelerators tend to be faster than 3DES accelerators, and it is typically faster to encrypt with a 128-bit key than with a 256-bit key.

To reach higher levels of raw performance, a cryptographic accelerator may implement multiple copies of an individual algorithm accelerator (e.g., four AES accelerators to achieve a total of 10 Gbps). This aggregate raw performance doesn't mean that a single piece of data can be encrypted at 10 Gbps, but rather that four pieces of data can be encrypted at 2.5 Gbps each.
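The difference between aggregate and per-stream throughput can be made concrete with a short calculation. This is a minimal sketch: the function name and the model (identical engines, each stream served by one engine at a time, streams spread evenly) are illustrative assumptions, not a description of any particular accelerator.

```python
def per_stream_gbps(aggregate_gbps: float, engines: int, streams: int) -> float:
    """Best-case throughput seen by each stream under a simple, assumed model.

    Assumes identical engines, each stream bound to one engine at a time,
    and streams spread evenly over the engines.
    """
    if engines < 1 or streams < 1:
        raise ValueError("engines and streams must be >= 1")
    per_engine = aggregate_gbps / engines
    # A single stream can never exceed one engine's rate; streams beyond
    # the engine count have to share engines.
    busy_engines = min(engines, streams)
    return per_engine * busy_engines / streams


print(per_stream_gbps(10.0, 4, 1))  # one stream: limited to a single engine
print(per_stream_gbps(10.0, 4, 4))  # four streams: one engine each
print(per_stream_gbps(10.0, 4, 8))  # eight streams: two streams per engine
```

Under these assumptions a single stream on a four-engine, 10 Gbps device is capped at one engine's rate, which is the point the aggregate figure hides.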

The variability of software overheads described earlier in this article should make clear the difficulty accelerator vendors face when trying to provide an accurate system level performance estimate.

Rather than trying, some vendors simply publish the raw performance of their algorithm accelerators. So long as the performance is identified as "raw," "peak," "theoretical," etc. and the algorithm is identified, this is a legitimate marketing simplification. Beware of vendors publishing raw performance numbers and implying they are achievable system-level performance results.

As previously mentioned, ASIC implementations of algorithm accelerators in a given technology node are likely to have similar raw performance, but if the implementation does not fully implement the algorithm in hardware, raw performance can be considerably lower. Some of the first look-aside crypto accelerators were implemented with DSPs.

These had better performance than software implementations on general purpose processors, but far lower performance than ASIC implementations. Partially (or fully) microcoded/assembly coded implementations also have much lower raw performance than full ASIC implementations.

Very low-level accelerators are particularly prone to being implemented as hardware/assembly code hybrids, where certain algorithms are accelerated via a crypto APU, while others are implemented as assembly code in the accelerator library.

The accelerator may only implement the innermost loop of the algorithm, and rely on assembly code to modify processing context (such as counters or initialization vectors) between processing rounds in the accelerator.
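The split described above can be sketched as a counter-mode-style loop in which software rebuilds the counter block before every call into the "accelerator". Everything here is hypothetical: `accel_block` is a stand-in for the hardware inner loop and uses a truncated SHA-256 purely for illustration (it is not AES and not a vetted cipher), and the 8-byte nonce plus 8-byte counter layout is an assumption.

```python
import hashlib

BLOCK = 16  # bytes per "accelerator" invocation (assumed block size)


def accel_block(key: bytes, counter_block: bytes) -> bytes:
    """Hypothetical stand-in for the hardware inner loop. NOT AES."""
    return hashlib.sha256(key + counter_block).digest()[:BLOCK]


def ctr_process(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Counter-mode-style processing with software-managed context.

    The same call encrypts and decrypts, because the keystream is XORed
    with the data in both directions.
    """
    if len(nonce) != 8:
        raise ValueError("this sketch assumes an 8-byte nonce")
    out = bytearray()
    counter = 0
    for offset in range(0, len(data), BLOCK):
        # Software overhead: rebuild the counter block between rounds.
        counter_block = nonce + counter.to_bytes(8, "big")
        keystream = accel_block(key, counter_block)  # "hardware" step
        chunk = data[offset:offset + BLOCK]
        out.extend(c ^ k for c, k in zip(chunk, keystream))
        counter += 1
    return bytes(out)


secret = ctr_process(b"k" * 16, b"n" * 8, b"payload crossing several blocks!!")
print(ctr_process(b"k" * 16, b"n" * 8, secret))
```

Each pass through the loop costs one software context update per block, which is the overhead that a full hardware implementation of the mode avoids.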

Similar hybridization in very low-level accelerators may also be used to perform HMACs, where the accelerator supports the base hash and assembly code loads the key, I-Pad and O-Pad at the appropriate time in the HMAC generation.
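The HMAC split can be illustrated in a few lines, with the hash calls standing in for the accelerator's base hash and the surrounding code playing the role of the software that prepares the key and pads. This is a sketch of the standard HMAC construction using SHA-256 as an assumed base hash; the function name is illustrative, and Python's `hmac` module is used only to cross-check the result.

```python
import hashlib
import hmac  # used only to cross-check the manual construction


def hmac_sha256_manual(key: bytes, message: bytes) -> bytes:
    """HMAC built from a bare hash, as software around a hash engine would."""
    block_size = 64  # SHA-256 input block size in bytes
    # Software step: keys longer than one block are hashed first,
    # then the key is zero-padded to the block size.
    if len(key) > block_size:
        key = hashlib.sha256(key).digest()
    key = key.ljust(block_size, b"\x00")
    # Software step: derive the inner and outer padded keys.
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    # "Accelerator" steps: two passes through the base hash.
    inner = hashlib.sha256(ipad + message).digest()
    return hashlib.sha256(opad + inner).digest()


tag = hmac_sha256_manual(b"secret-key", b"message to authenticate")
reference = hmac.new(b"secret-key", b"message to authenticate", hashlib.sha256).digest()
print(hmac.compare_digest(tag, reference))
```

The two hash passes are the only work the base-hash engine does; the key preparation, pad derivation, and sequencing between passes all fall to software in a hybrid design.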

DSP-based and hybrid low-level hardware/assembly code implementations offer a level of adaptability to new algorithms and modes, but the user must determine whether this adaptability is worth a substantial reduction in raw performance (and higher power consumption) compared to ASIC-style accelerators.

Although academia constantly publishes new algorithms and modes, the historical trend is that one new crypto algorithm achieves widespread commercial adoption per decade.

The need for compatibility with existing equipment, and the long cycle of confidence-building required before an algorithm is certified by the U.S. National Institute of Standards and Technology (NIST) and peer agencies worldwide, lead to a rather slow roll-out of new algorithms.

Bus Bandwidth
The amount of memory bandwidth consumed during look-aside crypto processing can significantly exceed the bandwidth used in plaintext operations. Whether the data to be cryptographically processed is a large file or a small packet, the data must originally be moved from a network or peripheral interface to system memory, from system memory to the accelerator and back, and finally from system memory to a network or peripheral interface.

In addition to the data movement, loading a security context (at a minimum the crypto key) also consumes bus bandwidth.

Table 1 below provides a comparison of the bandwidth consumption between plaintext IPv4 forwarding and IPsec ESP Tunnel mode. Note that this comparison only considers the extra bandwidth consumed by security protocol headers and trailers, and the movement of data and keys to and from the accelerator (assuming a single-pass high-level accelerator).

Table 1. Bandwidth Utilization

Increased memory bandwidth consumption associated with additional security protocol table look-ups, instruction fetches, or architecturally specific data structures (such as Ethernet and accelerator descriptors) is not included in the table, although these can consume significant additional bandwidth per operation.
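The accounting described above can be approximated with a small model. It does not reproduce the article's table; it is a sketch under stated assumptions: AES-CBC with HMAC-SHA-1-96 (20-byte outer IPv4 header, 8-byte ESP header, 16-byte IV, 2-byte ESP trailer, 12-byte ICV), a hypothetical 36-byte security context (16-byte cipher key plus 20-byte HMAC key), packet sizes measured at the IP layer, and no fragmentation. All function and constant names are illustrative.

```python
OUTER_IP = 20    # outer IPv4 header added in tunnel mode
ESP_HEADER = 8   # SPI + sequence number
IV = 16          # AES-CBC initialization vector (assumed cipher)
ESP_TRAILER = 2  # pad length + next header fields
ICV = 12         # HMAC-SHA-1-96 integrity check value (assumed)
CIPHER_BLOCK = 16
CONTEXT = 36     # assumed: 16-byte AES key + 20-byte HMAC key


def esp_tunnel_length(ip_len: int) -> int:
    """Length of the ESP tunnel-mode packet built from an ip_len-byte packet."""
    to_pad = ip_len + ESP_TRAILER
    padded = -(-to_pad // CIPHER_BLOCK) * CIPHER_BLOCK  # round up to a block
    return OUTER_IP + ESP_HEADER + IV + padded + ICV


def plaintext_bus_bytes(ip_len: int) -> int:
    """Receive into memory, then transmit out of memory."""
    return 2 * ip_len


def ipsec_bus_bytes(ip_len: int) -> int:
    """Receive, accelerator read (data + context), accelerator write, transmit."""
    out_len = esp_tunnel_length(ip_len)
    return ip_len + (ip_len + CONTEXT) + out_len + out_len


def percent_increase(ip_len: int) -> float:
    base = plaintext_bus_bytes(ip_len)
    return 100.0 * (ipsec_bus_bytes(ip_len) - base) / base


for size in (64, 256, 1024, 1400):
    print(size, esp_tunnel_length(size), round(percent_increase(size), 1))
```

Because the header, trailer, and context costs are roughly fixed per packet, the model shows the relative penalty shrinking as packets grow, which is the qualitative trend the article describes.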

The percentage of additional memory bandwidth consumed depends on the specific security operation and the size of the data being processed. Table 2 below shows, across packet sizes, the percentage increase in bandwidth consumed by IPsec ESP Tunnel mode compared to simple IPv4 forwarding.

Note that a 1518-byte input packet shows a higher percentage increase than other large packet sizes because it already fills the maximum Ethernet frame size used on most networks (a 1500-byte Maximum Transmission Unit (MTU) plus 18 bytes of Ethernet header and frame check sequence). The extra header bytes added to an IPsec packet push it past the MTU and lead to fragmentation, which consumes additional memory bandwidth.

Table 2. Bus bandwidth percentage increase by packet size

Crypto performance can be constrained if the accelerator isn't properly integrated into the embedded processor's bus structure, or the embedded processor itself has internal or external memory bus bottlenecks.

The PowerQUICC SEC is always connected to the main system bus to minimize the potential for bus bandwidth to limit system level cryptographic performance.

Next in Part 2: Standards & Industry Practices for Measuring Cryptographic Performance

Author Geoff Waters, Senior Systems Engineer, joined Freescale Semiconductor in 1997 to work in the company's Networking and Computing Systems Group in Austin, Texas. Initially focused on security acceleration technologies, Geoff has served as a senior systems engineer for Freescale's Digital Systems Division for the last 5 years. Prior to working at Freescale, Geoff was a contractor to the US Defense Threat Reduction Agency [formerly known as the Defense Nuclear Agency (DNA)]. He is a graduate of the University of Houston Honors Program.

Editor Kurt Stammberger, CISSP, is VP of Development for Mocana, which focuses on the security of non-PC devices. He has over 19 years of experience in the security industry. He joined cryptography startup RSA Security as employee #7, where he led their marketing organization for eight years, helped launch spin-off company VeriSign, and created the brand for the technology that now protects virtually every electronic commerce transaction on the planet. Kurt founded and still serves on the Program Committee of the annual RSA Conference. He holds a BS in Mechanical Engineering from Stanford University, and an MS in Management from the Stanford Graduate School of Business, where he was an Alfred P. Sloan Fellow.
