Look-Aside Accelerators
In contrast to a flow-through accelerator, a look-aside accelerator has little or no autonomy. This architecture is defined by the presence of a software-driven entity, such as a CPU or NPU, that performs packet classification as a prerequisite to security processing. The CPU also executes OS functions (such as buffer and memory management) and network protocol processing.
Network security protocols such as IPsec are complex, stateful, and rife with options. IPsec requires consultation of a security policy database and security association database on a per-packet basis.
This consultation identifies the algorithms that will protect the data and the encryption keys for these algorithms. Key lifetime must be monitored and key refreshes initiated.
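The per-packet consultation described above can be sketched as follows. This is a minimal illustration, not an IPsec implementation: the class names, the `lookup` function, and the destination-only selector are assumptions made for brevity (real policy selectors also cover source address, protocol, and ports), and the key lifetime is modeled as a simple byte count.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class SecurityAssociation:
    """Hypothetical SAD entry: algorithms, keys, and a byte-count lifetime."""
    spi: int
    cipher: str
    auth: str
    enc_key: bytes
    auth_key: bytes
    lifetime_bytes: int
    bytes_used: int = 0

    def account(self, nbytes: int) -> bool:
        """Charge nbytes against the key lifetime; True means a refresh is due."""
        self.bytes_used += nbytes
        return self.bytes_used >= self.lifetime_bytes


@dataclass
class PolicyEntry:
    """Hypothetical SPD entry keyed on destination network only."""
    dst_net: ipaddress.IPv4Network
    action: str                 # "protect", "bypass", or "discard"
    spi: Optional[int] = None   # which SA to use when action is "protect"


def lookup(spd, sad, dst_ip, length):
    """Return (action, sa, needs_rekey) for one packet; first matching policy wins."""
    addr = ipaddress.ip_address(dst_ip)
    for entry in spd:
        if addr in entry.dst_net:
            if entry.action != "protect":
                return entry.action, None, False
            sa = sad[entry.spi]
            return "protect", sa, sa.account(length)
    return "discard", None, False
```

In a look-aside design this lookup runs on the CPU for every packet, before any descriptor or FIFO write reaches the accelerator.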
Various modes of IPsec call for different encapsulations of the original IP packets, and all require packet defragmentation prior to cryptographic processing. The CPU runs a device driver for the accelerator to offload crypto algorithm processing.
The first widely available crypto accelerators were external look-aside devices, such as the HiFN 7901 and the Motorola (now Freescale) MPC180. These external devices connected via both proprietary and standard buses (such as PCI), and it was a natural evolution for these accelerators to be integrated into embedded communications processors. There are two major sub-categories of look-aside accelerator, low-level and high-level.
Low-Level Accelerators. There is no standard definition for a low-level accelerator. However, any accelerator that cannot read and write data on its own (that is, one that lacks DMA capability) could be called a low-level accelerator without too much debate. If the accelerator cannot fetch its own data, software on the embedded processor's CPU must program an external DMA (possibly two, one for input, one for output) to transfer data to the accelerator's FIFOs.
If these FIFOs do not support external DMA handshaking signals (DREQ, DACK), the CPU will probably find it more efficient to write data directly to the accelerator's FIFOs and read the output itself. While a low-level accelerator can in principle be operated asynchronously, having the CPU switch to other tasks is rarely practical.
Unless the accelerator has large FIFOs and the data to be processed is small, the CPU will have to run in a loop, alternating between writing data to the input FIFOs and polling/reading data from the output FIFOs.
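The write/poll loop described above can be modeled in a few lines. This is a toy simulation under stated assumptions: `FifoAccelerator`, its FIFO depth of four words, the one-word-per-step processing rate, and the XOR stand-in for the crypto transform are all hypothetical and chosen only to show why the CPU stays occupied.

```python
from collections import deque


class FifoAccelerator:
    """Toy model of a DMA-less accelerator with small input and output FIFOs."""

    def __init__(self, depth=4):
        self.depth = depth
        self.inp = deque()
        self.out = deque()

    def input_space(self):
        return self.depth - len(self.inp)

    def write(self, word):
        if self.input_space() <= 0:
            raise RuntimeError("input FIFO overflow")
        self.inp.append(word)

    def step(self):
        # One word is processed per step, provided the output FIFO has room.
        if self.inp and len(self.out) < self.depth:
            self.out.append(self.inp.popleft() ^ 0xFF)  # stand-in, not a cipher

    def output_ready(self):
        return len(self.out)

    def read(self):
        return self.out.popleft()


def pio_process(acc, data):
    """CPU-driven loop: fill the input FIFO, let the device work, drain the output."""
    result = []
    idx = 0
    loops = 0
    while len(result) < len(data):
        loops += 1
        while idx < len(data) and acc.input_space() > 0:
            acc.write(data[idx])
            idx += 1
        acc.step()
        while acc.output_ready():
            result.append(acc.read())
    return result, loops
```

In this model the CPU makes one trip around the loop for every word processed, which illustrates why switching to another task between FIFO accesses is impractical.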
Some look-aside accelerators are extremely low-level, and are implemented as an auxiliary processing unit (APU) to the CPU. This tight coupling of accelerator to processor has the advantage of very low set-up overheads (discussed later in this series).
The downside is that crypto APUs require constant CPU intervention, and effectively make this architecture synchronous and blocking to other operations. Because many security protocol operations require both encryption and authentication to be performed on the data (such as 3DES-HMAC-SHA-1 for IPsec), this style of architecture becomes serial, synchronous, and blocking, where serial refers to 3DES followed by HMAC-SHA-1.
An accelerator with DMA capability could still be considered low-level if the accelerator's DMA capability were tied to a single function at a time. Single function means the DMA descriptor has the required fields to support requests such as, "Get key from location 1, get data from location 2, perform 3DES encryption and write to location 3."
At first glance, this may seem adequate. However, most security protocol operations require both encryption and authentication to be performed on the same data in a defined order. If performing IPsec with 3DES-HMAC-SHA-1, the processor is required to create two simple descriptors (one for 3DES, the other for the HMAC-SHA-1).
The accelerator treats these descriptors as two independent operations. Because these operations are handled separately, the data to be operated upon will be read from and written to memory twice: once for 3DES encryption and a second time for the HMAC-SHA-1 integrity check.
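The memory traffic cost of two independent descriptors can be made concrete with a small model. Everything here is an illustrative assumption: `CountingMemory`, the dictionary-based descriptor layout, and the address names are hypothetical, and `toy_cipher` is an XOR stand-in for 3DES (Python's standard library has no 3DES), so only the byte counts are meaningful.

```python
import hashlib
import hmac


class CountingMemory:
    """Toy system memory that tallies bytes moved across the bus."""

    def __init__(self, cells):
        self.cells = dict(cells)
        self.bytes_read = 0
        self.bytes_written = 0

    def read(self, addr):
        data = self.cells[addr]
        self.bytes_read += len(data)
        return data

    def write(self, addr, data):
        self.cells[addr] = data
        self.bytes_written += len(data)


def toy_cipher(key, data):
    # Stand-in for 3DES: XOR with a repeating key. NOT secure.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def run_descriptor(mem, desc):
    """Execute one single-function descriptor: fetch key, fetch data, write result."""
    key = mem.read(desc["key"])
    data = mem.read(desc["src"])
    if desc["op"] == "encrypt":
        out = toy_cipher(key, data)
    elif desc["op"] == "hmac-sha1":
        out = hmac.new(key, data, hashlib.sha1).digest()
    else:
        raise ValueError("unsupported operation: " + desc["op"])
    mem.write(desc["dst"], out)


def dual_pass(mem):
    """Two independent descriptors: the payload crosses the bus once per pass."""
    run_descriptor(mem, {"op": "encrypt", "key": "enc_key",
                         "src": "plaintext", "dst": "ciphertext"})
    run_descriptor(mem, {"op": "hmac-sha1", "key": "auth_key",
                         "src": "ciphertext", "dst": "icv"})
```

For a 1000-byte payload, this model reads 2000 payload bytes (plaintext for the first descriptor, ciphertext for the second) in addition to the keys, which is the double read the text describes.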
Whether this simple DMA capability is enough to vault such an accelerator out of the low-level category is up to the reader. This "dual-pass" DMA architecture might qualify as the lowest grade of a high-level accelerator. While it offers enough asynchronicity to enable task switching, a dual-pass accelerator with DMA capability is likely to have lower performance than a dual-pass crypto APU, because the DMA accelerator cannot cache data between the two independent descriptors.
High-Level Accelerators. If low-level accelerators are defined by primitive or non-existent DMA capabilities, high-level accelerators are defined by sophisticated DMA capabilities, including pipelined reads and writes, scatter/gather capability, and single-pass encryption and message authentication.
High-level look-aside accelerator architectures evolved as external co-processors on peripheral buses such as PCI, where memory latencies were high and bandwidths were low. In order to have value, these look-aside accelerators had to do as much work as possible with the least amount of CPU overhead and memory bus bandwidth.
To achieve these goals, the accelerators became highly asynchronous to the processing flow, so that the CPU could task-switch and do significant work before checking on the accelerator's progress.
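The submit-then-check pattern can be sketched with a worker thread standing in for the accelerator. This is only an analogy under stated assumptions: `accelerator_job` and `process_async` are hypothetical names, the single-worker pool models one device, and a byte count stands in for the other work the CPU would do in the meantime.

```python
import hashlib
import hmac
from concurrent.futures import ThreadPoolExecutor


def accelerator_job(key, packet):
    """Stand-in for work done by the device (here, only an HMAC-SHA-1)."""
    return hmac.new(key, packet, hashlib.sha1).digest()


def process_async(packets, key):
    """Queue all jobs, do unrelated CPU work, then collect the results."""
    other_work_done = 0
    with ThreadPoolExecutor(max_workers=1) as accel:
        # Submitting returns immediately; the CPU is not blocked.
        futures = [accel.submit(accelerator_job, key, p) for p in packets]
        # The CPU is free for other tasks, modeled here as counting bytes.
        for p in packets:
            other_work_done += len(p)
        # Only now does the CPU check on the accelerator's progress.
        results = [f.result() for f in futures]
    return results, other_work_done
```

A real driver would typically learn of completion through an interrupt or a status ring rather than a blocking wait, but the ordering of submit, other work, and collection is the same.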
High-level accelerators always support single-pass encryption and authentication. Some even support additional levels of protocol processing offload such as adding security protocol headers and trailers.
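A single-pass descriptor with scatter/gather lists might look like the following sketch. The descriptor fields (`gather`, `scatter`, `icv`), the dictionary used as memory, and the function names are hypothetical, and `toy_cipher` is again an XOR stand-in for 3DES rather than a real cipher. The point is that each payload fragment is fetched once, encrypted, and fed to the HMAC before the next fragment is read.

```python
import hashlib
import hmac


def toy_cipher(key, data, offset=0):
    # Stand-in for 3DES: XOR with a repeating key. NOT secure.
    return bytes(b ^ key[(offset + i) % len(key)] for i, b in enumerate(data))


def single_pass(memory, desc):
    """Encrypt and authenticate in one pass; returns the payload bytes read."""
    enc_key = memory[desc["enc_key"]]
    mac = hmac.new(memory[desc["auth_key"]], digestmod=hashlib.sha1)
    offset = 0
    payload_read = 0
    # Gather: input fragments may be non-contiguous. Scatter: so may the outputs.
    for src, dst in zip(desc["gather"], desc["scatter"]):
        chunk = memory[src]            # each fragment is fetched exactly once
        payload_read += len(chunk)
        ciphertext = toy_cipher(enc_key, chunk, offset)
        mac.update(ciphertext)         # authentication runs on data already on chip
        memory[dst] = ciphertext
        offset += len(chunk)
    memory[desc["icv"]] = mac.digest()
    return payload_read
```

One descriptor replaces the two simple descriptors of the dual-pass case, and the payload crosses the memory bus once on input instead of twice.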