This article is the first in a two-part series covering both hardware and software variables impacting cryptographic performance. It introduces the reader to the basics of cryptography in embedded systems.
Part 2 covers standardized methodologies for measuring system-level security protocol performance, using specific measurements from the Freescale PowerQUICC embedded processors running Mocana NanoSec.
The basics of crypto on embedded systems
Cryptography is the art and science of manipulating data so that outside parties cannot undo or mimic the manipulation without knowledge of a secret. It enables high-level functions such as:
Confidentiality of information during storage and transmission
Authentication of users
Integrity of received/retrieved information
Non-repudiation of transactions
Availability of data and resources
Controlled access to information and resources
Network security protocols such as IPsec and SSL, and key negotiation and management applications such as IKE (Internet Key Exchange), use a variety of cryptographic algorithms to achieve these high-level goals.
Variables Impacting Cryptographic Performance
While there are many benefits to the security protocols enabled by cryptography, there is a downside. Cryptography is very computationally intensive. The extra processing steps involved in security protocols (vs. their non-secure analogs) create a heavy CPU utilization tax on systems that use cryptographic security frequently.
To address this issue, semiconductor vendors such as Freescale integrate cryptographic accelerators into their processors for embedded networking and communications systems.
While this might seem like the end of the story, it is actually just the beginning. The presence of a cryptographic accelerator in an embedded processor doesn't automatically improve security protocol performance.
There can be vast differences between the theoretical cryptographic performance of a system (or embedded processor) and its performance in a given application.
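One way to see how such a gap can arise is a simple cost model. The sketch below is illustrative only: it assumes that every packet pays a fixed setup cost (descriptor handling, interrupts, and similar) on top of the time the crypto engine needs for the payload. The function name, the 1 Gbps raw rate and the 5 microsecond overhead are hypothetical values chosen for illustration, not measurements from any device.

```python
def effective_throughput_bps(raw_rate_bps: float,
                             packet_bytes: int,
                             per_packet_overhead_s: float) -> float:
    """Payload bits delivered per second under the assumed cost model.

    Assumption: total time per packet = fixed overhead + payload / raw rate.
    """
    payload_bits = packet_bytes * 8
    crypto_time_s = payload_bits / raw_rate_bps
    return payload_bits / (per_packet_overhead_s + crypto_time_s)


if __name__ == "__main__":
    RAW_RATE_BPS = 1e9   # hypothetical theoretical engine rate
    OVERHEAD_S = 5e-6    # hypothetical fixed cost per packet
    for size in (64, 256, 1500):
        rate = effective_throughput_bps(RAW_RATE_BPS, size, OVERHEAD_S)
        print(f"{size:5d} bytes: {rate / 1e6:8.1f} Mbps")
```

Under this model the fixed cost dominates for small packets, so the achieved rate falls well below the theoretical rate even though the engine itself is unchanged.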
The objectives of this article are to identify and explain the variables that affect system-level security performance and to demonstrate how these variables manifest themselves in measured throughputs, using Freescale PowerQUICC integrated communications processors as examples.
Acceleration Architecture
There are many accelerator implementations, but only a few basic architectures. Two basic architectures are flow-through and look-aside.
Flow-Through Accelerators
A flow-through accelerator performs cryptographic operations on data as it is "flowing" from one location to another. In a storage system, this flow could be from system memory to a hard drive; in a networking system, the flow could be between a network interface and system memory.
A defining characteristic of flow-through security processors is a level of autonomy from the embedded processor's CPU. In networking, flow-through security processors are generally capable of terminating IPsec.
From the perspective of software running on the embedded processor's CPU, IPsec doesn't exist, and all packet and payload processing is performed on cleartext data. Termination of IPsec means the flow-through security accelerator is capable of classifying packets, determining whether the packet requires IPsec processing and, if so, which tunnel or security association it belongs to.
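The classification step described above can be sketched in a few lines. This is a minimal illustration, not how any particular accelerator implements it: it assumes inbound ESP traffic, where the IP protocol number is 50 and the first 32 bits of the ESP header carry the Security Parameters Index (SPI). The table contents, the `SecurityAssociation` class and the `classify` function are hypothetical names introduced for the example.

```python
import struct
from dataclasses import dataclass
from typing import Optional

ESP_PROTOCOL = 50  # IP protocol number assigned to ESP


@dataclass(frozen=True)
class SecurityAssociation:
    name: str      # hypothetical label for the tunnel
    spi: int       # Security Parameters Index
    dst_addr: str  # tunnel endpoint address


# Hypothetical inbound SA table keyed by (destination address, SPI).
SA_TABLE = {
    ("192.0.2.1", 0x1001): SecurityAssociation("tunnel-a", 0x1001, "192.0.2.1"),
    ("192.0.2.1", 0x1002): SecurityAssociation("tunnel-b", 0x1002, "192.0.2.1"),
}


def classify(ip_protocol: int, dst_addr: str,
             ip_payload: bytes) -> Optional[SecurityAssociation]:
    """Return the matching SA, or None if no IPsec processing applies."""
    if ip_protocol != ESP_PROTOCOL or len(ip_payload) < 4:
        return None
    # The SPI occupies the first 32 bits of the ESP header, network byte order.
    (spi,) = struct.unpack("!I", ip_payload[:4])
    return SA_TABLE.get((dst_addr, spi))


if __name__ == "__main__":
    esp = struct.pack("!II", 0x1001, 1) + b"ciphertext"
    print(classify(ESP_PROTOCOL, "192.0.2.1", esp))
```

A real flow-through device would perform this lookup in hardware and would also handle outbound policy matching, which this sketch leaves out.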
The flow-through accelerator must also be capable of performing all the IPsec header and trailer processing, and maintaining security session state. The flow-through accelerator, or a flow-through network processing block in front of the flow-through accelerator, must be able to handle Layer 2 and Layer 3 headers and conditions such as IP fragmentation.
Treating normal lower-layer options as exceptions creates a split programming model (most packets go through the accelerator, some go up to software on the CPU), which can be a significant complication in session state management.
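The split programming model can be illustrated with a short sketch. It assumes a hypothetical accelerator that handles only unfragmented IPv4 packets without IP options and hands everything else to software on the CPU; the exception criteria and the function names are assumptions made for the example, not the behavior of a specific device.

```python
import struct


def needs_cpu_exception(ipv4_header: bytes) -> bool:
    """True if the hypothetical accelerator would punt this packet to software."""
    ihl_words = ipv4_header[0] & 0x0F              # header length in 32-bit words
    (flags_frag,) = struct.unpack("!H", ipv4_header[6:8])
    more_fragments = bool(flags_frag & 0x2000)     # MF flag
    fragment_offset = flags_frag & 0x1FFF          # low 13 bits
    has_options = ihl_words > 5                    # more than the 20-byte base header
    return has_options or more_fragments or fragment_offset != 0


def make_header(ihl: int = 5, flags_frag: int = 0) -> bytes:
    """Build a minimal 20-byte IPv4 header for demonstration (checksum left zero)."""
    return struct.pack("!BBHHHBBH4s4s",
                       (4 << 4) | ihl, 0, 20, 0, flags_frag,
                       64, 50, 0, bytes(4), bytes(4))


if __name__ == "__main__":
    for label, hdr in [("plain", make_header()),
                       ("fragment", make_header(flags_frag=0x2000)),
                       ("options", make_header(ihl=6))]:
        path = "CPU software" if needs_cpu_exception(hdr) else "accelerator"
        print(f"{label:8s} -> {path}")
```

Once some packets of a session take the software path, the session state (sequence numbers, anti-replay windows) has to stay consistent between the two paths, which is the complication the text refers to.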
Flow-through accelerator implementations are typically ASIC-like or network-processor-like, so their adaptability to changes in the security protocol (or in the protocols beneath it) can be limited or non-existent.
The nature of the implementations tends to push flow-through crypto accelerators to opposite ends of the usage spectrum. On one end are high-performance (10 Gbps), high-cost (>$150), discrete security processors.
The more NPU-like devices can support multiple security protocols (although generally only one at a time) through microcode updates. These devices are used in high-end systems because these systems can absorb the cost of redundant classification capability and the likely cost of redundant memory buses for the accelerator.
On the other end of the spectrum are flow-through application-specific crypto accelerators. These accelerators are generally integrated into SoCs that are themselves ASSPs.
A chipset for a cable modem may have a DOCSIS MAC/PHY with DES decryption acceleration, while a SATA controller may have integrated AES for disk sector encryption. These implementations may have some configurability, but no programmability.
Although flow-through accelerators can achieve a very high percentage of theoretical performance, they are rarely integrated into general-purpose embedded processors because of their lack of programmability, their costly redundant silicon area, or a combination of the two.