Crypto API and Device Driver Overheads
The previous analysis of application and protocol stack differences used ESP Null/Null or software encryption to eliminate hardware, crypto API and driver overheads in order to compare the stacks themselves.
This section focuses on the interface to the crypto hardware as a significant variable in look-aside crypto accelerator performance.
One of the ironies of the Linux IPsec stack comparisons we performed is that OpenSwan, the stack with the lowest throughput (and consequently the highest overheads), currently offers the best open source crypto API: the OpenBSD Crypto Framework (OCF).
The most efficient open source stack, StrongSwan, does not have any specific API, meaning that crypto device drivers are generally ported directly to the stack as direct replacements for a software encryption function call.
Netkey is supported by what is called the Linux 2.6 Native Crypto API, which has a very poor interface to modern look-aside crypto accelerators. The newer Scatterlist Crypto API addresses many of the Native Crypto API's deficiencies.
Some commercial IPsec security stacks such as Mocana's NanoSec pay a great deal of attention to efficient interaction with look-aside crypto accelerators, and therefore outperform open source stacks in both non-accelerated and accelerated use cases.
But other proprietary stacks treat crypto acceleration APIs as a necessary evil and create low-performance, lowest common denominator APIs, resulting in generally poor performance.
The "Comparison of HW Accelerated IPsec" diagram shown in Figure 5 below provides a throughput comparison of several IPsec stacks, including crypto API and driver overheads. All stacks ran on Linux 2.6, performing IPsec ESP Tunnel with 3DES-HMAC-SHA-1.
The device used to take these measurements was the Freescale MPC8548E, with the e500v2 CPU at 1.33 GHz and the SEC 2.1 look-aside crypto accelerator running at 266 MHz.
The IPv4 and ESP Null/Null measurements in Figure 4 earlier were taken under the same conditions, and can be directly compared to Figure 5 below. The specific stacks and APIs shown in Figure 5 are as follows:
Mocana NanoSec IPsec with Mocana's Acceleration Harness API and the Mocana version of Freescale's SEC driver. Highly asynchronous, supports SEC single-pass descriptors.
This commercial solution, available from Mocana, is optimized specifically for the PowerQUICC architecture. It is fully featured and documented, exhaustively tested (VPNC Certified), production-quality and fully supported. (Free source code trials of Mocana's software can be downloaded from their site.)
OpenSwan IPsec (open source) with OpenBSD Crypto Framework (OCF) API, with Freescale version of SEC driver. Highly asynchronous, supports SEC single-pass descriptors.
Freescale developed a subset of the SEC driver to work within OCF, and has made this available to the Linux community. OpenSwan, OCF, and the SEC OCF driver are also included in the BSPs of several newer Freescale PowerQUICC SDKs as reference code.
The OpenSwan IPsec driver integration is reference code, not a Freescale-supported software product, and subject to the evolution of OpenSwan and OCF.
StrongSwan IPsec (open source) with a direct port of a subset of the SEC driver to replace StrongSwan software crypto routines. Supports SEC single-pass descriptors, but operates synchronously.
The IPsec stack polls for SEC completion before continuing processing on the packet being encrypted, and does not work on any other packets while waiting. Freescale has released reference implementations of StrongSwan IPsec in the past, but has not done any recent development with StrongSwan. This code is not supported by Freescale.
Netkey IPsec (open source) with Linux 2.6 native crypto API, with SEC general purpose driver. This API does not support high-level crypto accelerators (no support for single-pass encryption and authentication) or asynchronous operations.
Freescale ported the SEC driver to this stack/API for evaluation purposes only. Freescale has recently released a reference implementation of the SEC driver working under the Scatterlist API, with performance similar to OpenSwan/OCF.
Figure 5. Comparison of HW Accelerated IPsec
A comparison of the throughputs in Figure 4 earlier and Figure 5 above shows that SEC driver overhead is relatively small. The performance of ESP Null/Null and ESP with HW accelerated 3DES-HMAC-SHA-1 is similar at small packet sizes, where differences in software overheads are magnified by the large number of packets processed.
As packet size increases, all software overheads (stack, API, driver) become less important, and the raw performance of the look-aside accelerator becomes more critical.
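The effect described above can be sketched with a simple per-packet cost model: every packet pays a fixed software cost (stack, API, driver) plus a time in the accelerator that grows with packet size. The function name and all numbers below are hypothetical, chosen for illustration only; they are not measurements from the MPC8548E.

```python
def throughput_mbps(packet_bytes, overhead_us, accel_mbps):
    """Effective throughput when each packet pays a fixed software
    overhead plus a size-dependent time in the accelerator."""
    bits = packet_bytes * 8
    accel_us = bits / accel_mbps          # 1 Mbps equals 1 bit per microsecond
    return bits / (overhead_us + accel_us)


# Hypothetical figures: 10 us of software overhead per packet and an
# accelerator with 1000 Mbps of raw throughput.
for size in (64, 256, 1024, 1456):
    print(size, round(throughput_mbps(size, overhead_us=10.0, accel_mbps=1000.0), 1))
```

Under this model the fixed overhead dominates at 64 bytes, while at large packet sizes the result moves toward the accelerator's raw rate, which is the trend the figures show.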
API/Driver Differences Between Low-Level/High-Level Accelerators
As discussed, many (but not all) security protocol stacks have crypto APIs supporting high-level accelerator features, specifically single-pass processing and highly asynchronous operations. Every crypto API that supports high-level accelerator features also supports low-level accelerator features.
Some low-level accelerators are so low-level that they do not require a driver to be ported to an API or protocol stack. The accelerator's functionality is exposed via a library which the user compiles with their operating system as a direct replacement for software crypto routines.
The disadvantage of having an architecture that operates as a direct replacement for software is that software encryption operates synchronously (no other processes run while crypto is running), and when the security protocol requires both encryption and authentication, those operations must operate serially.
The "API and Architecture Impacts on Processing Steps" diagram shown in Figure 6 below illustrates the difference between single-pass asynchronous processing and dual-pass synchronous processing.
Figure 6. API and Architecture Impacts on Processing Steps
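The difference between the two processing styles can be approximated with a small timing model. This is a sketch under simplifying assumptions: the synchronous case blocks the CPU for both crypto passes on every packet, while the asynchronous case is treated as a two-stage pipeline in which the CPU prepares the next packet while the accelerator works on the current one. The function names and timings are hypothetical.

```python
def sync_dual_pass_us(n_packets, cpu_us, encrypt_us, auth_us):
    """Synchronous, dual-pass: the CPU prepares a packet, waits for the
    encryption pass, then waits for the authentication pass."""
    return n_packets * (cpu_us + encrypt_us + auth_us)


def async_single_pass_us(n_packets, cpu_us, single_pass_us):
    """Asynchronous, single-pass: CPU work and accelerator work overlap,
    so the slower stage sets the pace and the faster stage is paid once
    to fill or drain the pipeline."""
    slow = max(cpu_us, single_pass_us)
    fast = min(cpu_us, single_pass_us)
    return n_packets * slow + fast


# Hypothetical timings: 5 us of CPU work per packet, 4 us per crypto pass.
print(sync_dual_pass_us(1000, 5.0, 4.0, 4.0))
print(async_single_pass_us(1000, 5.0, 4.0))
```

The model ignores interrupt and completion-handling costs, so it overstates the asynchronous advantage somewhat. It is meant only to show where the overlap comes from.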
Low-level accelerators have lower software overheads to begin processing. A library call causes software to directly write keys to key registers and data to input FIFOs. This is in contrast to the high-level API which must create an intermediate data structure related to the request, then call the high-level accelerator's device driver to create a descriptor.
Descriptors represent a level of indirection (and consequently overhead) compared to direct writes to a low-level accelerator. This means that given equivalent performance processors executing the protocol stack and API/driver/library, a low-level accelerator will likely have better small-packet performance.
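The two interface styles can be contrasted with a toy model. Everything here is hypothetical: the class names, the descriptor fields, and the XOR placeholder that stands in for the accelerator's work. Real descriptors and register maps are specific to each device, and the placeholder is not cryptography.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


def toy_transform(key: bytes, data: bytes) -> bytes:
    """Placeholder for the accelerator's work (XOR, not real cryptography)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


class LowLevelAccelerator:
    """Library-style interface: write the key register, push data, wait."""

    def process(self, key: bytes, data: bytes) -> bytes:
        self.key_register = key            # direct write to a key register
        return toy_transform(key, data)    # caller is blocked until this returns


@dataclass
class Descriptor:
    """Intermediate request record built by the API and driver."""
    key: bytes
    data: bytes
    on_done: Callable[[bytes], None]       # completion callback


class HighLevelAccelerator:
    """Driver-style interface: queue a descriptor now, get the result later."""

    def __init__(self) -> None:
        self.queue: deque = deque()

    def submit(self, desc: Descriptor) -> None:
        self.queue.append(desc)            # returns immediately; the CPU is free

    def run_pending(self) -> None:
        # Stands in for the hardware draining its queue and signalling completion.
        while self.queue:
            desc = self.queue.popleft()
            desc.on_done(toy_transform(desc.key, desc.data))


results = []
low = LowLevelAccelerator()
high = HighLevelAccelerator()
direct = low.process(b"k", b"packet-1")
high.submit(Descriptor(b"k", b"packet-1", results.append))
high.submit(Descriptor(b"k", b"packet-2", results.append))
high.run_pending()
```

The low-level path has fewer steps before work starts, but the caller cannot do anything else until `process` returns. The descriptor path adds the cost of building and queuing a request, in exchange for letting several requests be outstanding at once.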
Figure 7 below illustrates how a low-level accelerator's quicker start time allows it to have better small-packet performance, while high-level accelerators with single-pass capability have better large-packet performance.
Figure 7. Relative Performance of Low-Level and High-Level Accelerators
Assuming crypto accelerator raw performance and IPsec stack overheads are identical, the exact performance crossover point depends on the performance of the processor executing the software (stack, API and driver/library).
If the processors are equivalent (frequency, instructions per clock), the crossover point is somewhere in the 64- to 256-byte packet size range.
If the high-level accelerator is integrated in a SoC with a higher-frequency, higher-IPC processor (such as the PowerQUICC Power Architecture CPU), there may not be a crossover point, and the device with a high-level accelerator can perform better at all packet sizes.
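The crossover argument can be worked through with a simple model. Assume, purely for illustration, that the low-level accelerator has a short start time but needs two passes over the data, while the high-level accelerator has a longer start time but needs only one pass at the same raw rate. The per-packet times are then start_low + 2 * size / rate and start_high + size / rate, which are equal when size = (start_high - start_low) * rate. The function name and the numbers are hypothetical.

```python
def crossover_bytes(start_low_us, start_high_us, bytes_per_us):
    """Packet size above which the single-pass, high-level accelerator wins.

    Returns None when the high-level path starts at least as quickly as the
    low-level path, in which case it is not slower at any packet size.
    """
    extra_start_us = start_high_us - start_low_us
    if extra_start_us <= 0:
        return None
    return extra_start_us * bytes_per_us


# Hypothetical: 2 us versus 3 us start time, 125 bytes/us (1 Gbps) raw rate.
print(crossover_bytes(2.0, 3.0, 125.0))

# A faster host CPU shrinks the high-level start time; here it removes the crossover.
print(crossover_bytes(2.0, 1.5, 125.0))
```

The start times were picked so the arithmetic is easy to follow; they are not measured values. The second call corresponds to the case where a faster processor lets the high-level accelerator perform better at all packet sizes.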