Understanding Crypto Performance in Embedded Systems: Part 1 - Embedded.com


This article is the first in a two-part series covering the hardware and software variables that impact cryptographic performance. It introduces the reader to the basics of cryptography in embedded systems.

Part 2 covers standardized methodologies for measuring system-level security protocol performance, using specific measurements from the Freescale PowerQUICC embedded processors running Mocana NanoSec.

The basics of crypto on embedded systems
Cryptography is the art and science of manipulating data so that outside parties cannot undo or mimic the manipulation without knowledge of a secret. It enables high-level functions such as:

Confidentiality of information during storage and transmission
Authentication of users
Integrity of received/retrieved information
Non-repudiation of transactions
Availability of data and resources
Controlled access to information and resources

Network security protocols such as IPsec and SSL, and key negotiation and management applications such as IKE (Internet Key Exchange), use a variety of cryptographic algorithms to achieve these high-level goals.
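As a small illustration of the integrity and authentication goals listed above, the sketch below appends and verifies an HMAC-SHA-256 tag using Python's standard library. It is purely illustrative; real protocols such as IPsec and SSL define their own framing, key management, and algorithm negotiation.

```python
import hashlib
import hmac

TAG_LEN = 32  # SHA-256 digest size in bytes

def protect(key: bytes, payload: bytes) -> bytes:
    """Append an HMAC-SHA-256 tag so the receiver can detect tampering."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + tag

def verify(key: bytes, message: bytes) -> bytes:
    """Recompute the tag over the payload; raise if it doesn't match."""
    payload, tag = message[:-TAG_LEN], message[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed")
    return payload

key = b"shared-secret"
assert verify(key, protect(key, b"hello")) == b"hello"
```

The `compare_digest` call performs a constant-time comparison, which avoids leaking timing information during verification.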

Variables Impacting Cryptographic Performance
While there are many benefits to the security protocols enabled by cryptography, there is a downside. Cryptography is very computationally intensive. The extra processing steps involved in security protocols (vs. their non-secure analogs) create a heavy CPU utilization tax on systems that use cryptographic security frequently.

To address this issue, semiconductor vendors such as Freescale integrate cryptographic accelerators into their processors for embedded networking and communications systems.

While this might seem like the end of the story, it is actually just the beginning. The presence of a cryptographic accelerator in an embedded processor doesn't automatically improve security protocol performance.

There can be vast differences between the theoretical cryptographic performance of a system (or embedded processor) and its performance in a given application.

The objectives of this article are to identify and explain the variables that affect system-level security performance and to demonstrate how these variables manifest themselves in measured throughputs, using Freescale PowerQUICC integrated communications processors as examples.

Acceleration Architecture
There are many accelerator implementations, but only a few basic architectures. Two basic architectures are flow-through and look-aside.

Flow-Through Accelerators
A flow-through accelerator performs cryptographic operations on data as it is “flowing” from one location to another. In a storage system, this flow could be from system memory to a hard drive; in a networking system, the flow could be between a network interface and system memory.

A defining characteristic of flow-through security processors is a level of autonomy from the embedded processor's CPU. Networking examples of flow-through security processors are generally capable of terminating IPsec.

From the perspective of software running on the embedded processor's CPU, IPsec doesn't exist, and all packet and payload processing is performed on cleartext data. Termination of IPsec means the flow-through security accelerator is capable of classifying packets, determining whether the packet requires IPsec processing and, if so, which tunnel or security association it belongs to.

The flow-through accelerator must also be capable of performing all the IPsec header and trailer processing, and maintaining security session state. The flow-through accelerator, or a flow-through network processing block in front of the flow-through accelerator, must be able to handle Layer 2 and Layer 3 headers and conditions such as IP fragmentation.

Treating normal lower-layer options as exceptions creates a split programming model (most packets go through the accelerator, some go up to software on the CPU), which can be a significant complication in session state management.

Flow-through accelerator implementations are typically ASIC or network processor-like, meaning adaptability to security protocol changes (or the protocols below the security protocol) can be limited or non-existent.

The nature of the implementations tends to push flow-through crypto accelerators to opposite ends of the usage spectrum. On one end are high-performance (10 Gbps), high-cost (>$150), discrete security processors.

The more NPU-like devices can support multiple security protocols (although generally only one at a time) through microcode updates. These devices are used in high-end systems because these systems can absorb the cost of redundant classification capability and the likely cost of redundant memory buses for the accelerator.

On the other end of the spectrum are flow-through application-specific crypto accelerators. These accelerators are generally integrated into SoCs that are themselves ASSPs.

A chipset for a cable modem may have a DOCSIS MAC/PHY with DES decryption acceleration, while a SATA controller may have integrated AES for disk sector encryption. These implementations may have some configurability, but no programmability.

Although flow-through accelerators can achieve a very high percentage of theoretical performance, they are rarely integrated into general-purpose embedded processors because of their lack of programmability, their costly redundant silicon area, or a combination of the two.

Look-Aside Accelerators
In contrast to a flow-through accelerator, look-aside accelerators have little or no autonomy. This architecture is defined by the presence of a software-driven entity such as a CPU or NPU performing packet classification as a prerequisite to security processing. The CPU also executes OS functions (buffer/memory management) and network protocol processing.

Network security protocols such as IPsec are complex, stateful, and rife with options. IPsec requires consultation of a security policy database and security association database on a per-packet basis.

This consultation identifies the algorithms that will protect the data and the encryption keys for these algorithms. Key lifetime must be monitored and key refreshes initiated.

Various modes of IPsec call for different encapsulations of the original IP packets, and all require packet defragmentation prior to cryptographic processing. The CPU runs a device driver for the accelerator to offload crypto algorithm processing.
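The per-packet SPD/SAD consultation described above can be sketched as a pair of table look-ups. The structures and field names below are hypothetical simplifications, not any real stack's schema:

```python
from dataclasses import dataclass

@dataclass
class SecurityAssociation:
    """Hypothetical, heavily simplified SAD entry."""
    spi: int
    cipher: str
    auth: str
    key: bytes

# Security policy database: (selector, protocol) -> action
SPD = {("10.0.0.0/8", "esp-tunnel"): "protect"}

# Security association database, keyed here by destination prefix
SAD = {"10.0.0.0/8": SecurityAssociation(
    spi=0x1001, cipher="3des-cbc", auth="hmac-sha1", key=b"\x00" * 24)}

def classify(dst_prefix: str):
    """Per-packet consultation: policy decision first, then the matching SA."""
    if SPD.get((dst_prefix, "esp-tunnel")) != "protect":
        return None  # bypass (or discard) -- no IPsec processing needed
    return SAD[dst_prefix]

assert classify("10.0.0.0/8").spi == 0x1001
```

In a real stack both look-ups use richer selectors (address ranges, ports, protocol), but the software cost of these consultations on every packet is exactly the classification overhead the article describes.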

The first widely available crypto accelerators were external look-aside devices, such as the HiFN 7901 and the Motorola (now Freescale) MPC180. These external devices connected via both proprietary and standard buses (such as PCI), and it was a natural evolution for these accelerators to be integrated into embedded communications processors. There are two major sub-categories of look-aside accelerator: low-level and high-level.

Low-Level Accelerators
There is no standard definition for a low-level accelerator. However, any accelerator that cannot read and write data (lack of DMA capability) could be called a low-level accelerator without too much debate. If the accelerator cannot fetch its own data, software on the embedded processor's CPU must program an external DMA (possibly two: one for input, one for output) to transfer data to the accelerator's FIFOs.

If these FIFOs do not support external DMA handshaking signals (DREQ, DACK), the CPU will probably find it more efficient to directly write data to the accelerator's FIFOs and read the output. While a low-level accelerator can be operated asynchronously, switching to other tasks isn't practical.

Unless the accelerator has large FIFOs and the data to be processed is small, the CPU will have to run in a loop, alternating between writing data to the input FIFOs and polling/reading data from the output FIFOs.

Some look-aside accelerators are extremely low-level, and are implemented as an auxiliary processing unit (APU) to the CPU. This tight coupling of accelerator to processor has the advantage of very low set-up overheads (discussed later in this series).

The downside is that crypto APUs require constant CPU intervention, and effectively make this architecture synchronous and blocking to other operations. Because many security protocol operations require both encryption and authentication to be performed on the data (such as 3DES-HMAC-SHA-1 for IPsec), this style of architecture becomes serial, synchronous, and blocking, where serial refers to 3DES followed by HMAC-SHA-1.

An accelerator with DMA capability could still be considered low-level if the accelerator's DMA capability were tied to a single function at a time. Single function means the DMA descriptor has the required fields to support requests such as, “Get key from location 1, get data from location 2, perform 3DES encryption and write to location 3.”

At first glance, this may seem adequate. However, most security protocol operations require both encryption and authentication to be performed on the same data in a defined order. If performing IPsec with 3DES-HMAC-SHA-1, the processor is required to create two simple descriptors (one for 3DES, the other for the HMAC-SHA-1).

The accelerator treats these descriptors as two independent operations. Because these operations are handled separately, the data to be operated upon will be read from and written to memory twice: once for 3DES encryption and a second time for the HMAC-SHA-1 integrity check.

Whether this simple DMA capability is enough to vault such an accelerator out of the low-level category is up to the reader. This “dual-pass” DMA architecture might qualify as the lowest grade of a high-level accelerator. While offering a level of asynchronicity to enable task switching, a dual-pass accelerator with DMA capability is likely to have lower performance than a dual-pass crypto APU because the dual-pass DMA accelerator cannot cache data between the two independent descriptors.
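The memory-traffic penalty of the dual-pass approach can be put in rough numbers. Assuming the cipher pass reads and writes the payload and the authentication pass reads the ciphertext again (ignoring the short tag write), a sketch:

```python
def dual_pass_bytes(n: int) -> int:
    """3DES pass reads and writes the payload; the HMAC pass then reads
    the ciphertext a second time (the short tag write is ignored)."""
    return (n + n) + n

def single_pass_bytes(n: int) -> int:
    """One descriptor: payload read once, ciphertext written once, with
    the HMAC computed on the fly from the same data stream."""
    return n + n

# A 1500-byte payload moves 50% more data under the dual-pass scheme
assert dual_pass_bytes(1500) == 4500
assert single_pass_bytes(1500) == 3000
```

The 3:2 traffic ratio is independent of packet size in this simple model; the real gap also includes the second descriptor fetch and completion notification.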

High-Level Accelerators
If low-level accelerators are defined by primitive or non-existent DMA capabilities, high-level accelerators are defined by sophisticated DMA capabilities, including pipelined reads and writes, scatter/gather capability and single-pass encryption and message authentication.

High-level look-aside accelerator architectures evolved as external co-processors on peripheral buses such as PCI, where memory latencies were high and bandwidths were low. In order to have value, these look-aside accelerators had to do as much work as possible with the least amount of CPU overhead and memory bus bandwidth.

To achieve these goals, the accelerators became highly asynchronous to the processing flow, so that the CPU could task-switch and do significant work before checking on the accelerator's progress.

High-level accelerators always support single-pass encryption and authentication. Some even support additional levels of protocol processing offload such as adding security protocol headers and trailers.

Accelerator Data Flow
Whether internal or external, low-level or high-level, the instruction and data flows for look-aside crypto accelerators are similar even if their performance and efficiency is not.

Figure 1 below shows a high-level block diagram of a PowerQUICC III processor to illustrate the position of a look-aside accelerator within an integrated communication processor.

Figure 1: PowerQUICC Processor with Look-Aside Crypto Accelerator

Figure 2 and Figure 3 below illustrate a typical data flow for a high-level crypto accelerator such as the Freescale SEC. The processing steps are described below.

Step #1. A packet arrives at the Ethernet interface and is placed in a buffer in main memory. PowerQUICC-specific optimizations to this step include Ethernet interrupt coalescing and packet header stashing to the L2 cache.

Step #2. Upon notification (or discovery via polling) that a packet is available for processing, the CPU reads the packet header to perform classification. Classification involves software checking the header fields against various tables.

In this specific example, IPsec classification involves look-ups in two databases: a security policy database to determine whether the packet needs to be IPsec protected, and a security association database to determine the specific IPsec tunnel and parameters to use when encapsulating the packet.

Step #3. The CPU creates a descriptor for the security engine (SEC) that includes configuration information and pointers to the keys, context and data required for the cryptographic operation.

The amount of pre-processing the CPU performs on the packet before sending it to the crypto accelerator depends on the capabilities of the accelerator. Some accelerators perform crypto operations only. Other accelerators perform a level of protocol processing such as adding IPsec headers.

Step #4. The CPU writes a pointer to the descriptor to a SEC crypto-channel (DMA).

Figure 2: Look-Aside Security Architecture Steps 1-4

Step #5. The SEC fetches the descriptor from main memory.

Step #6. The SEC configures itself for single-pass processing per the descriptor and begins fetching keys, context and data from main memory. It writes decrypted data back to memory as it processes.

Step #7. The SEC notifies the CPU when the operation is complete (notification via interrupt or polled status bits is configurable).

Step #8. The core performs touch-up formatting on the packet.

Step #9. The core creates a Tx buffer descriptor for the Ethernet interface.

Step #10. The Ethernet interface forwards the decrypted packet.

Figure 3. Look-Aside Security Architecture Steps 5-10
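The ten steps above can be condensed into a driver-level sequence. The sketch below only records the ordering of operations; every name is illustrative shorthand, not the actual SEC driver API:

```python
def lookaside_flow(log: list) -> list:
    """Record the step ordering of the look-aside data flow (steps 1-10)."""
    log.append("rx_dma")        # 1: Ethernet DMAs the packet to main memory
    log.append("classify")      # 2: CPU reads the header, SPD/SAD look-ups
    log.append("build_desc")    # 3: descriptor with key/context/data pointers
    log.append("kick_channel")  # 4: descriptor pointer written to a crypto-channel
    log.append("sec_fetch")     # 5: SEC fetches the descriptor
    log.append("sec_process")   # 6: single-pass fetch, process, write-back
    log.append("notify")        # 7: completion via interrupt or polled bits
    log.append("fixup")         # 8: CPU touch-up formatting on the packet
    log.append("tx_desc")       # 9: Tx buffer descriptor for the Ethernet interface
    log.append("tx")            # 10: Ethernet forwards the processed packet
    return log
```

Note that between steps 4 and 7 the CPU is free to work on other packets; that asynchronicity is the defining benefit of a high-level look-aside accelerator.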

Look-aside architectures have become fairly prevalent in embedded processors for the following reasons:

They can be implemented cost-effectively because they leverage existing SoC platform resources including memory, classification resources and protocol state maintenance resources.

Software's ability to pre- and post-process the data, and to provide a wider range of crypto processing instructions, provides the flexibility required to support a variety of application and protocol use cases.

Although generally lower performance than a flow-through architecture, a look-aside accelerator provides sufficient performance for a broad range of applications. (All current PowerQUICC communications processors implement a look-aside architecture.)

Despite the co-location of security accelerators in the network interface modules of other communications processors, there are no unambiguous examples of flow-through processing in integrated communications processors. Consequently the remainder of this article will focus on variables impacting the performance of look-aside architectures.

Non-Secure vs. Secure Protocol Stack Differences
Application and protocol stack overheads are the instructions a processor executes to determine what needs to be done to a given quantity of data, and how to do it.

These instructions are distinct from accelerator device driver overheads, as these application/protocol stack overheads exist whether acceleration is available or not. The overheads of secure applications and secure protocol stacks are often shocking to users experienced with only the non-secure analogs.

Figure 4 below shows the throughput (Mbps) and CPU utilization of a cleartext forwarding scenario (IPv4 forwarding using the Linux 2.6 TCP/IP stack) and an IPsec forwarding scenario using the Linux 2.6 TCP/IP stack with OpenSwan IPsec.

To highlight the overheads of security protocol processing, IPsec is run in Encapsulating Security Protocol (ESP) mode, with both encryption and authentication disabled (null/null).

This is a non-compliant mode of IPsec (disabling authentication is not allowed) but it demonstrates the substantial performance degradation associated with security protocol processing.

Figure 4. Performance Comparison of IPv4 and IPsec ESP Tunnel (Null/Null)

The measurements shown in Figure 4 were taken on a Freescale MPC8548E device with a Power Architecture e500v2 core running at 1.33 GHz. The network test equipment used in this measurement had a throughput limit of 2 Gbps (2 x 1-Gbps Ethernet links); otherwise the IPv4 performance would vastly exceed the ESP Null/Null performance at all packet sizes.

In fact, dividing the CPU frequency by the number of packets forwarded shows that IPsec ESP Null/Null forwarding consumes about 3.2x more CPU cycles per packet than plain IPv4 forwarding.
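That cycles-per-packet figure is simple division. The packet rates below are hypothetical placeholders chosen only to reproduce the ~3.2x ratio, not the measured values behind Figure 4:

```python
def cycles_per_packet(cpu_hz: float, packets_per_sec: float) -> float:
    """CPU cycles consumed per forwarded packet."""
    return cpu_hz / packets_per_sec

CPU_HZ = 1.33e9  # e500v2 core frequency from the measurements above

# Hypothetical packet rates chosen only to illustrate the ratio:
ipv4_cycles = cycles_per_packet(CPU_HZ, 1_000_000)  # plain IPv4 forwarding
esp_cycles = cycles_per_packet(CPU_HZ, 312_500)     # ESP Null/Null forwarding

assert round(esp_cycles / ipv4_cycles, 1) == 3.2
```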

At large packet sizes, where IPv4 and ESP Null/Null throughput converges because of test equipment limits, IPv4 has more than 80 percent idle CPU to perform other tasks while ESP Null/Null is still consuming the majority of the CPU's cycles.

Why is a security protocol so much “heavier” than the non-secure equivalent? The specifics vary from protocol to protocol, but in almost every case, the security protocol requires more complex classification to determine whether the packet needs to be protected (and if so, using which security parameters).

Security protocols are also stateful, as the cryptographic keys that are used to encrypt and/or authenticate the data have lifetimes. These lifetimes may be measured in number of bytes encrypted, number of seconds since the key was first used, number of seconds since the key was last used (idle time-out), or all of the above (first to occur).
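A first-to-occur lifetime check like the one just described can be sketched as follows; the field names and limits are illustrative, not taken from any particular SAD implementation:

```python
import time

def key_expired(sa: dict, now: float = None) -> bool:
    """First-to-occur lifetime check: byte count, absolute age, idle time."""
    now = time.time() if now is None else now
    return (sa["bytes_encrypted"] >= sa["max_bytes"]
            or now - sa["first_use"] >= sa["max_age_s"]
            or now - sa["last_use"] >= sa["max_idle_s"])

sa = {"bytes_encrypted": 0, "max_bytes": 2**30,
      "first_use": 1000.0, "last_use": 1000.0,
      "max_age_s": 3600.0, "max_idle_s": 300.0}

assert not key_expired(sa, now=1100.0)   # fresh key, recently used
assert key_expired(sa, now=1500.0)       # idle for 500 s: trigger a rekey
```

A stack must run a check like this on every packet (or on a timer) and initiate a key refresh through IKE before the limit is reached, which is part of the statefulness overhead.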

Finally, security protocols add header fields to packets, allowing the packet to be forwarded to the other end of the security tunnel without revealing information about the communicating parties.

These headers are often shimmed between existing headers, or they cause memory copies or memory buffer manipulations which aren't necessary in simple forwarding.

ESP Null/Null performance represents the theoretical best-case performance for IPsec ESP with look-aside crypto acceleration. A “perfect” look-aside engine, with zero driver overhead, zero processing latency and ample raw performance could equal ESP Null/Null performance, but not exceed it.

All Security Protocol Stacks Aren't Equal
The performance difference between IPv4 forwarding and IPsec ESP Null/Null forwarding is quite large, but there are also significant differences between IPsec stacks and the operating systems these stacks are running on.

Different Linux IPsec stacks running on the same device can produce very different results. For open source stacks (StrongSwan, OpenSwan, and Netkey), we found a 2x difference between the best and worst stacks, measured at small packet sizes where protocol stack overheads dominate.

While this is smaller than the 3.2x difference between IPv4 and IPsec ESP Null/Null, it is still a very large difference. Proprietary IPsec stacks can widen these differences still further.

The Mocana NanoSec IPsec stack running on Linux 2.6 performed 1.75x better than the fastest open source IPsec stack tested, although some of this advantage comes from Mocana's crypto acceleration API (described in the next section).

Freescale doesn't have enough data to fully quantify the impact of running the same IPsec stack on the same PowerQUICC device but with different operating systems.

The limited comparison data (throughput and CPU utilization) that we do have indicates that the differences between OSs are smaller than the differences between IPsec stacks, assuming the network drivers are equally optimized for all OSs used in the comparison.

Crypto API and Device Driver Overheads
The previous analysis of application and protocol stack differences used ESP Null/Null or software encryption to eliminate hardware, crypto API and driver overheads in order to compare the stacks themselves.

This section focuses on the interface to the crypto hardware as a significant variable in look-aside crypto accelerator performance.

One of the ironies of the Linux IPsec stack comparisons we performed is that OpenSwan, the stack with the lowest throughput (and consequently the highest overheads), currently offers the best open source crypto API: the OpenBSD Crypto Framework (OCF).

The most efficient open source stack, StrongSwan, does not have any specific API, meaning that crypto device drivers are generally ported directly to the stack as direct replacements for a software encryption function call.

Netkey is supported by what is called the Linux 2.6 Native Crypto API, which has a very poor interface to modern look-aside crypto accelerators. The newer Scatterlist Crypto API addresses many of the Native Crypto API's deficiencies.

Some commercial IPsec security stacks such as Mocana's NanoSec pay a great deal of attention to efficient interaction with look-aside crypto accelerators, and therefore outperform open source stacks in both non-accelerated and accelerated use cases.

But other proprietary stacks treat crypto acceleration APIs as a necessary evil and create low-performance, lowest common denominator APIs, resulting in generally poor performance.

The “Comparison of HW Accelerated IPsec” diagram shown in Figure 5 below provides a throughput comparison of several IPsec stacks, including crypto API and driver overheads. All stacks ran on Linux 2.6, performing IPsec ESP Tunnel with 3DES-HMAC-SHA-1.

The device used to take these measurements was the Freescale MPC8548E, with the e500v2 CPU at 1.33GHz, and the SEC 2.1 look-aside crypto accelerator running at 266 MHz.

The IPv4 and ESP Null/Null measurements in Figure 4 earlier were taken under the same conditions, and can be accurately compared to Figure 5 below. The specific stacks and APIs shown in Figure 5 are as follows:

Mocana NanoSec IPsec with Mocana's Acceleration Harness API and the Mocana version of Freescale's SEC driver. Highly asynchronous, supports SEC single-pass descriptors.

This commercial solution from Mocana is optimized specifically for the PowerQUICC architecture. It is fully featured and documented, exhaustively tested (VPNC Certified), production-quality and fully supported. (Free source code trials of Mocana's software can be downloaded from their site.)

OpenSwan IPsec (open source) with OpenBSD Crypto Framework (OCF) API, with Freescale version of SEC driver. Highly asynchronous, supports SEC single-pass descriptors.

Freescale developed a subset of the SEC driver to work within OCF, and has made this available to the Linux community. OpenSwan, OCF, and the SEC OCF driver are also included in the BSPs of several newer Freescale PowerQUICC SDKs as reference code.

The OpenSwan IPsec driver integration is reference code, not a Freescale-supported software product, and subject to the evolution of OpenSwan and OCF.

StrongSwan IPsec (open source) with a direct port of a subset of the SEC's driver to replace StrongSwan software crypto routines. Supports SEC single-pass descriptors, but operates synchronously.

The IPsec stack polls for SEC completion before continuing processing on the packet being encrypted, and does not work on any other packets while waiting. Freescale has released reference implementations of StrongSwan IPsec in the past, but has not done any recent development with StrongSwan. This code is not supported by Freescale.

Netkey IPsec (open source) with Linux 2.6 native crypto API, with SEC general purpose driver. This API does not support high-level crypto accelerators (no support for single-pass encryption and authentication) or asynchronous operations.

Freescale ported the SEC driver to this stack/API for evaluation purposes only. Freescale has recently released a reference implementation of the SEC driver working under the Scatterlist API, with performance similar to OpenSwan/OCF.

Figure 5. Comparison of HW Accelerated IPsec

A comparison of the throughputs in Figure 4 earlier and Figure 5 above shows that SEC driver overhead is relatively small. The performance of ESP Null/Null and ESP with HW accelerated 3DES-HMAC-SHA-1 is similar at small packet sizes, where differences in software overheads are magnified by the large number of packets processed.

As packet size increases, all software overheads (stack, API, driver) become less important, and the raw performance of the look-aside accelerator becomes more critical.

API/Driver Differences Between Low-Level/High-Level Accelerators
As discussed, many (but not all) security protocol stacks have crypto APIs supporting high-level accelerator features, specifically single-pass processing and highly asynchronous operations. Every crypto API that supports high-level accelerator features also supports low-level accelerator features.

Some low-level accelerators are so low-level that they do not require a driver to be ported to an API or protocol stack. The accelerator's functionality is exposed via a library which the user compiles with their operating system as a direct replacement for software crypto routines.

The disadvantage of having an architecture that operates as a direct replacement for software is that software encryption operates synchronously (no other processes run while crypto is running), and when the security protocol requires both encryption and authentication, those operations must operate serially.

The “API and Architecture Impacts on Processing Steps” diagram shown in Figure 6 below illustrates the difference between single-pass asynchronous processing and dual-pass synchronous processing.

Figure 6. API and Architecture Impacts on Processing Steps

Low-level accelerators have lower software overheads to begin processing. A library call causes software to directly write keys to key registers and data to input FIFOs. This is in contrast to the high-level API which must create an intermediate data structure related to the request, then call the high-level accelerator's device driver to create a descriptor.

Descriptors represent a level of indirection (and consequently overhead) compared to direct writes to a low-level accelerator. This means that given equivalent performance processors executing the protocol stack and API/driver/library, a low-level accelerator will likely have better small-packet performance.

Figure 7 below illustrates how a low-level accelerator's quicker start time allows it to have better small-packet performance, while high-level accelerators with single-pass capability have better large-packet performance.

Figure 7. Relative Performance of Low-Level and High-Level Accelerators

Assuming crypto accelerator raw performance and IPsec stack overheads are identical, the exact performance crossover point depends on the performance of the processor executing the software (stack, API and driver/library).

If the processors are equivalent (frequency, instructions per clock), the crossover point is somewhere in the 64- to 256-byte packet size range.

If the high-level accelerator is integrated in a SoC with a higher-frequency, higher-IPC processor (such as the PowerQUICC Power Architecture CPU), there may not be a crossover point, and the device with a high-level accelerator can perform better at all packet sizes.
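A toy cost model makes the crossover concrete: give each accelerator a fixed per-packet setup cost and a per-byte processing rate. All the constants below are hypothetical, chosen only to land the crossover in the 64- to 256-byte range discussed above:

```python
def throughput_bps(pkt_bytes: int, setup_cycles: float,
                   cycles_per_byte: float, cpu_hz: float) -> float:
    """Toy model: fixed per-packet setup cost plus a per-byte processing rate."""
    cycles = setup_cycles + cycles_per_byte * pkt_bytes
    return pkt_bytes * 8 * cpu_hz / cycles

CPU_HZ = 1.33e9
# Hypothetical costs: the low-level APU starts quickly but pays a dual-pass
# per-byte cost; the high-level engine pays descriptor overhead up front.
low = lambda n: throughput_bps(n, setup_cycles=500, cycles_per_byte=6, cpu_hz=CPU_HZ)
high = lambda n: throughput_bps(n, setup_cycles=1200, cycles_per_byte=3, cpu_hz=CPU_HZ)

assert low(64) > high(64)       # small packets favor the quick-start accelerator
assert high(1500) > low(1500)   # large packets favor single-pass streaming
```

With these constants the curves cross near (1200 - 500) / (6 - 3), about 233 bytes; raising the CPU frequency for the high-level case shifts the crossover toward zero, matching the observation that a faster core may eliminate it entirely.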

Crypto Raw Performance
The raw performance of an algorithm accelerator has the potential to be one of the larger variables in system-level performance, particularly at larger data sizes where software overheads are less of a factor.

Despite apparently significant differences in operating frequencies, ASIC-style individual algorithm accelerators manufactured with the same process technology node have similar raw performance.

The operating frequency of an algorithm accelerator is not a good indicator of its raw performance, because the number of internal permutations per clock cycle is merely an implementation trade off between more or fewer pipeline stages.

Most accelerator vendors have a good understanding of their algorithms' raw performance, and this value likely varies by algorithm and key length. AES accelerators tend to be faster than 3DES accelerators, and it is typically faster to encrypt with a 128-bit key than with a 256-bit key.

To reach higher levels of raw performance, a cryptographic accelerator may implement multiple copies of an individual algorithm accelerator (e.g., four AES accelerators to achieve a total of 10 Gbps). This aggregate raw performance doesn't mean that a single piece of data can be encrypted at 10 Gbps, but rather that four pieces of data can be encrypted at 2.5 Gbps each.
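The distinction between aggregate and single-flow throughput reduces to a min(): one flow can use at most one engine. A sketch:

```python
def achievable_gbps(n_engines: int, per_engine_gbps: float, n_flows: int) -> float:
    """A single flow is bounded by one engine; aggregate throughput scales
    with the number of independent flows, up to the engine count."""
    return min(n_flows, n_engines) * per_engine_gbps

assert achievable_gbps(4, 2.5, 1) == 2.5    # one stream sees one engine's rate
assert achievable_gbps(4, 2.5, 8) == 10.0   # parallel flows saturate all four
```

This matters when evaluating datasheet numbers: a "10 Gbps" device may deliver only a quarter of that to a single large transfer or tunnel.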

The variability of software overheads described earlier in this article should make clear the difficulty accelerator vendors face when trying to provide an accurate system level performance estimate.

Rather than trying, some vendors simply publish the raw performance of their algorithm accelerators. So long as the performance is identified as “raw,” “peak,” “theoretical,” etc. and the algorithm is identified, this is a legitimate marketing simplification. Beware of vendors publishing raw performance numbers and implying they are achievable system-level performance results.

As previously mentioned, ASIC implementations of algorithm accelerators in a given technology node are likely to have similar raw performance, but if the implementation does not fully implement the algorithm in hardware, raw performance can be considerably lower. Some of the first look-aside crypto accelerators were implemented with DSPs.

These had better performance than software implementations on general purpose processors, but far lower performance than ASIC implementations. Partially (or fully) microcoded/assembly coded implementations also have much lower raw performance than full ASIC implementations.

Very low-level accelerators are particularly prone to being implemented as hardware/assembly code hybrids, where certain algorithms are accelerated via a crypto APU, while others are implemented as assembly code in the accelerator library.

The accelerator may only implement the innermost loop of the algorithm, and rely on assembly code to modify processing context (such as counters or initialization vectors) between processing rounds in the accelerator.

Similar hybridization in very low-level accelerators may also be used to perform HMACs, where the accelerator supports the base hash and assembly code loads the key, I-Pad and O-Pad at the appropriate time in the HMAC generation.
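The ipad/opad hybridization described above follows the standard HMAC construction (RFC 2104): the hardware would compute the two base-hash passes while library code handles key padding and the XOR masks. A pure-software sketch of that split:

```python
import hashlib
import hmac as stdlib_hmac

def hmac_sha1(key: bytes, data: bytes) -> bytes:
    """HMAC-SHA-1 per RFC 2104, partitioned the way a hybrid accelerator
    would be: the two sha1() calls are what the hardware block computes,
    while key padding and the ipad/opad XOR masks live in library code."""
    block = 64                                   # SHA-1 block size in bytes
    if len(key) > block:
        key = hashlib.sha1(key).digest()
    key = key.ljust(block, b"\x00")
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha1(ipad + data).digest()   # first (inner) hash pass
    return hashlib.sha1(opad + inner).digest()   # second (outer) hash pass

# Cross-check against the stdlib implementation
assert hmac_sha1(b"k", b"msg") == stdlib_hmac.new(b"k", b"msg", hashlib.sha1).digest()
```

The two-pass structure also shows why a fully hardware HMAC engine saves a round trip: a hybrid must return to software between the inner and outer hashes.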

DSP-based and hybrid low-level hardware/assembly code implementations offer a level of adaptability to new algorithms and modes, but the user must determine whether this adaptability is worth a substantial reduction in raw performance (and higher power consumption) compared to ASIC-style accelerators.

Although academia constantly publishes new algorithms and modes, the historical trend is that one new crypto algorithm achieves widespread commercial adoption per decade.

The need for compatibility with existing equipment, and the long cycle of confidence-building required before an algorithm is certified by the U.S. National Institute of Standards and Technology (NIST) and peer agencies worldwide, lead to a rather slow roll-out of new algorithms.

Bus Bandwidth
The amount of memory bandwidth consumed during look-aside crypto processing can significantly exceed the bandwidth used in plaintext operations. Whether the data to be cryptographically processed is a large file or a small packet, the data must originally be moved from a network or peripheral interface to system memory, from system memory to the accelerator and back, and finally from system memory to a network or peripheral interface.

In addition to the data movement, loading a security context (at a minimum, the crypto key) also consumes bus bandwidth.

Table 1 below provides a comparison of the bandwidth consumption between plaintext IPv4 forwarding and IPsec ESP Tunnel mode. Note that this comparison only considers the extra bandwidth consumed by security protocol headers and trailers, and the movement of data and keys to and from the accelerator (assuming a single-pass high-level accelerator).

Table 1. Bandwidth Utilization

Increased memory bandwidth consumption associated with additional security protocol table look-ups, instruction fetches, or architecturally specific data structures (such as Ethernet and accelerator descriptors) is not included in the table, although these can consume significant additional bandwidth per operation.
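The data-movement accounting can be approximated with simple arithmetic. The sketch below counts bytes crossing the bus for one packet; the key/context and ESP header/trailer sizes are illustrative round numbers, not Table 1's exact figures:

```python
def ipv4_bus_bytes(n: int) -> int:
    """Plaintext forwarding: wire -> memory, then memory -> wire."""
    return 2 * n

def ipsec_bus_bytes(n: int, key_ctx: int = 64, esp_overhead: int = 56) -> int:
    """Look-aside ESP with a single-pass accelerator: wire -> memory,
    memory -> SEC (plus key/context load), SEC -> memory (packet grows by
    the ESP header/trailer), memory -> wire. Overhead sizes are illustrative."""
    return n + (n + key_ctx) + (n + esp_overhead) + (n + esp_overhead)

# For a 1500-byte packet, roughly 2x the plaintext bus traffic:
assert round(ipsec_bus_bytes(1500) / ipv4_bus_bytes(1500), 2) == 2.06
```

The ratio worsens at small packet sizes, where the fixed key/context and header/trailer bytes are large relative to the payload.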

The percentage of additional memory bandwidth consumed depends on the specific security operation and the size of the data being processed. Table 2 below shows, across packet size, the increased percentage of bandwidth consumed by IPsec ESP mode compared to simple IPv4 forwarding.

Note that a 1518-byte input packet size has a higher percentage increase than other large packet sizes because most networks have a Maximum Transmission Unit (MTU) of this size. The extra header bytes found in an IPsec packet lead to fragmentation, which consumes additional memory bandwidth.

Table 2. Bus bandwidth percentage increase by packet size

Crypto performance can be constrained if the accelerator isn't properly integrated into the embedded processor's bus structure, or the embedded processor itself has internal or external memory bus bottlenecks.

The PowerQUICC SEC is always connected to the main system bus to minimize the potential for bus bandwidth to limit system level cryptographic performance.

Next in Part 2: Standards & Industry Practices for Measuring Cryptographic Performance

Author: Geoff Waters, Senior Systems Engineer, joined Freescale Semiconductor in 1997 to work in the company's Networking and Computing Systems Group in Austin, Texas. Initially focused on security acceleration technologies, Geoff has served as a senior systems engineer for Freescale's Digital Systems Division for the last 5 years. Prior to working at Freescale, Geoff was a contractor to the US Defense Threat Reduction Agency [formerly known as the Defense Nuclear Agency (DNA)]. He is a graduate of the University of Houston Honors Program.

Editor: Kurt Stammberger, CISSP, is VP of Development for Mocana, which focuses on the security of non-PC devices. He has over 19 years of experience in the security industry. He joined cryptography startup RSA Security as employee #7, where he led their marketing organization for eight years, helped launch spin-off company VeriSign, and created the brand for the technology that now protects virtually every electronic commerce transaction on the planet. Kurt founded and still serves on the Program Committee of the annual RSA Conference. He holds a BS in Mechanical Engineering from Stanford University, and an MS in Management from the Stanford Graduate School of Business, where he was an Alfred P. Sloan Fellow.
