Understanding Crypto Performance in Embedded Systems: Part 2

Part 1 of this series discussed hardware and software variables impacting system-level cryptographic performance. Now in Part 2 we will focus on two methodologies for measuring the performance of a high-level look-aside accelerator: 1) driver level accelerator testing to identify accelerator or SoC memory bandwidth constraints, and 2) application/protocol stack level testing which includes a full packet ingress to egress path.

Accelerator Measurement Methodology
The most common method of measuring a look-aside accelerator's performance is a driver level benchmark. In this test, a set of test data is loaded into the main memory of the SoC hosting the accelerator.

Software running on the SoC's CPU creates descriptors or otherwise causes the accelerator to perform cryptographic operations on the test data. Using a timer, the driver level test calculates the accelerator's throughput by dividing the number of bytes processed by the time taken.
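
For concreteness, here is a minimal sketch of such a test in C, anticipating the iteration and caching points discussed below. The accel_encrypt() call is a hypothetical stand-in for whatever descriptor build/dispatch/wait mechanism a vendor's driver actually exposes (a plain memcpy() here, so the sketch compiles and runs anywhere); the measurement scaffolding around it is the point.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define ITERATIONS 50000   /* large n amortizes timer resolution (see below) */
#define CHUNK_SIZE 64      /* bytes per cryptographic operation */

/* Hypothetical stand-in for the vendor driver call that builds a
 * descriptor, dispatches it to the accelerator and waits for completion. */
static void accel_encrypt(const uint8_t *src, uint8_t *dst, size_t len)
{
    memcpy(dst, src, len);
}

int main(void)
{
    /* n unique source buffers but a single key/context, mimicking the
     * accelerator's behavior in a single-tunnel packet stream. */
    uint8_t *src = malloc((size_t)ITERATIONS * CHUNK_SIZE);
    uint8_t *dst = malloc((size_t)ITERATIONS * CHUNK_SIZE);
    if (!src || !dst)
        return 1;
    memset(src, 0xA5, (size_t)ITERATIONS * CHUNK_SIZE);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);     /* start before iteration 1 */
    for (long i = 0; i < ITERATIONS; i++)
        accel_encrypt(src + i * CHUNK_SIZE, dst + i * CHUNK_SIZE, CHUNK_SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);     /* stop after iteration n */

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mbps = (double)ITERATIONS * CHUNK_SIZE * 8.0 / (secs * 1e6);
    printf("%d ops x %d B in %.6f s = %.1f Mbit/s\n",
           ITERATIONS, CHUNK_SIZE, secs, mbps);
    free(src);
    free(dst);
    return 0;
}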

While this seems simple enough, many of the variables described in Part 1 of this article can still influence the results. When evaluating (or creating) a driver level benchmark, the evaluator needs to consider the following:

Data size. Is the test encrypting a small or large chunk of data? The smaller the data size, the more the results will be influenced by the accelerator's memory latency, the descriptor size and other “non-data” context it must fetch to perform the operation, and the timer accuracy.

Understanding memory latency and DMA overheads to read context and data is important, but timer resolution can be the dominant variable when measuring the performance of a single operation on a small chunk of data.
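
To put rough, purely illustrative numbers on this: an accelerator sustaining 500 Mbytes per second completes a single 64-byte operation in about 128 ns. A timer with 1 microsecond resolution cannot meaningfully time that one operation, but timing 50,000 of them back to back (about 6.4 ms total) shrinks the timer's contribution to well under 0.1 percent of the measurement.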

Iteration. A good way to include accelerator DMA overheads and memory latency in a small data size measurement while reducing timer resolution as a variable is to construct the test so that the timer starts before iteration 1 and stops after the nth iteration, where n is a fairly large number.

Freescale typically measures driver level performance using tests with 50,000 iterations. Iterations can introduce additional variables such as caching, pre-work, interrupts and checking.

1. Caching. Does the test repetitively encrypt the same data n times, using the same keys and context? Or does it encrypt n unique chunks of data with n unique keys? Assuming the accelerator operates within the SoC's cache coherency scheme (have fun if it doesn't!), repetitively encrypting the same data can lead to the data being read from on-chip cache memory rather than main memory. Freescale's driver benchmarks repetitively use the same keys on 50K unique chunks of data, mimicking the behavior of the accelerator in a single tunnel packet processing scenario.

2. Pre-work and interrupts. If a high-level accelerator iteratively encrypts 50K unique chunks of data, that implies software builds and launches 50K unique descriptors. Two major variables, particularly at small packet sizes, are: When does software build the descriptors, and when does it dispatch them to the accelerator?

The optimized benchmarking scenario is for software to build all these descriptors at the same time the test data is created, followed by software dispatching all the descriptors to the accelerator as soon as the timer is started.

Software doesn't monitor the completion of individual descriptors via polling or interrupts. It waits for an interrupt from the final descriptor and stops the timer in the interrupt service routine.

A second approach, “build and dispatch descriptor n+1 after descriptor n completes,” is more representative of real-world packet processing scenarios. The “dispatch all at once” approach provides small data size performance that is approximately four times better than the “build and dispatch descriptor n+1 after descriptor n completes” approach.

Freescale's driver performance tests, which are provided along with the reference driver, demonstrate the latter approach. This is more representative of integrating the driver within a packet processing application. (A sketch contrasting the two approaches follows item 3 below.)

3. Checking. The driver level test might check the output of each descriptor, or it may assume the output is always good. If the test checks the outputs, it may do those checks while the timer is running, or it may wait until the final job completes, then go back and check for expected results. The more checking done within the driver test, the less the results reflect the accelerator's raw performance.
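
The sketch below contrasts the shapes of the two approaches in the same stubbed style as the earlier listing, so it compiles and runs anywhere. Because the accel_* calls are hypothetical stand-ins that do the work inline, it illustrates the timing structure only; it will not reproduce the four-fold difference that real descriptor queues and completion interrupts produce.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N    50000
#define SIZE 64

struct desc { uint8_t *src, *dst; size_t len; };

/* Hypothetical stand-ins: submit and wait collapse into inline work here
 * so the comparison compiles and runs anywhere. */
static void accel_submit(struct desc *d)   { memcpy(d->dst, d->src, d->len); }
static void accel_wait_all(void)           { /* final-descriptor ISR fires */ }
static void accel_wait_one(struct desc *d) { (void)d; /* poll/IRQ per job */ }

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    uint8_t *src = malloc((size_t)N * SIZE), *dst = malloc((size_t)N * SIZE);
    struct desc *d = malloc(N * sizeof(*d));
    if (!src || !dst || !d)
        return 1;

    /* Approach A: pre-build every descriptor, then time dispatch only. */
    for (long i = 0; i < N; i++)
        d[i] = (struct desc){ src + i * SIZE, dst + i * SIZE, SIZE };
    double t0 = now();
    for (long i = 0; i < N; i++)
        accel_submit(&d[i]);            /* fire-and-forget, back to back */
    accel_wait_all();
    double ta = now() - t0;

    /* Approach B: build, dispatch and wait per descriptor inside the
     * timed region, as a real packet path would. */
    t0 = now();
    for (long i = 0; i < N; i++) {
        d[i] = (struct desc){ src + i * SIZE, dst + i * SIZE, SIZE };
        accel_submit(&d[i]);
        accel_wait_one(&d[i]);
    }
    double tb = now() - t0;

    /* Output checking is deferred until both timers have stopped. */
    printf("A: %.1f Mbit/s, B: %.1f Mbit/s\n",
           N * SIZE * 8.0 / (ta * 1e6), N * SIZE * 8.0 / (tb * 1e6));
    free(src); free(dst); free(d);
    return 0;
}

The only structural differences are where the descriptors are built and whether software waits on each one; on real hardware, at small data sizes, those differences dominate the measured result.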

Algorithm or ciphersuite. Data size and method of iteration are the dominant variables, but the more the driver benchmark focuses on true hardware performance, the more dominant algorithms or ciphersuites become in measured performance.

Some algorithms, and even modes of algorithms, have higher raw performance than others. Single-pass decryption + message integrity checking may be faster than single-pass encryption + message integrity generation.

Figure 1 below provides a comparison of ciphersuites using the “build and dispatch descriptor n+1 after descriptor n completes” approach, along with a single AES-HMAC-SHA-1 benchmark using the “dispatch all at once” approach.

Figure 1: Driver benchmarks

Figure 1 shows that at small packet sizes, all the “build and dispatch descriptor n+1 after descriptor n completes” results are nearly identical, because the software overheads of descriptor building, dispatching and monitoring descriptor completions overwhelm algorithm-specific performance differences of the accelerator.

Only at larger data sizes does the hardware algorithm performance difference become an observable variable. The single AES-HMAC-SHA-1 test in which all the descriptors are built in advance, and dispatched at the maximum rate the accelerator can accept them, shows approximately four times greater performance at small data sizes. However, by 1KB data size, the raw performance of the hardware is the dominant variable.

It is important for users of vendor-supplied benchmarking code to understand what exactly the benchmarking code does before comparing the results to another vendor's benchmarking code.

Different vendors may have different philosophies with regard to showing nearly raw accelerator performance vs. accelerator performance in a more realistic use case.

SoC Measurement Methodology
Accelerator benchmarks emphasize the performance of the accelerator and its path to system memory. SoC benchmarks emphasize the performance of all elements in the processing path, including the network interfaces, CPU(s), accelerator and memory subsystem. For networking-oriented SoCs such as those in the PowerQUICC family, an important cryptographic benchmark is IPsec packet processing.

The differences between encryption algorithms and a full security protocol such as IPsec were described in Part 1 of this article, along with software variables such as IPsec protocol stack efficiency and the quality of the API for exploiting hardware acceleration.

This section describes a methodology for creating normalized comparisons of different IPsec stacks on the same device, and comparisons of the same IPsec stack on different devices.

Network Test Environment
RFC2544 “Benchmarking Methodology for Network Interconnect Devices” describes a range of methods for the standardized testing of networking systems. Although the RFC doesn't specifically discuss IPsec, it introduces a number of important testing concepts.

The most logical way to test the IPsec performance of a system (the Device Under Test, or DUT) would be to connect it to a network tester as shown in Figure 2 below. Coincidentally, this scenario is also Figure 2 in RFC2544.

Figure 2: IPsec Testing Option 1

Using this option, the tester generates IPsec packets and sends them to the DUT, which then decrypts them and sends them back. The process can also be run in reverse so that the DUT's performance during encryption is also measured.

While this might be the most logical way to measure a DUT's IPsec performance, this setup is rarely used due to the cost of networking testers that support high IPsec data rates.

While this situation may change in the future, advanced networking SoCs with cryptographic acceleration often have more IPsec performance than the network test equipment that would purport to measure them. Figure 3 below (also Figure 3 in RFC2544 ) shows a better option for testing IPsec.

Figure 3: IPsec Testing Option 2

In option 2, IPsec is tested in a gateway-to-gateway configuration.

1. A network tester transmits clear text IP packets from port 1 to DUT 1 (labeled Home).
2. DUT 1 IPsec encrypts the packets and sends them to DUT 2 (labeled Office).
3. DUT 2 decrypts the packets.
4. DUT 2 forwards the clear text packets back to the network tester's port 2, allowing the tester to measure the received packet rate.

This process can be run bi-directionally so that each of the network tester's ports sends and receives clear text packets, as shown in Figure 2 earlier. Many network testers are able to generate high rates of clear text IP traffic, allowing a broader set of users to recreate benchmarks measured using this method.

Because all packets are both encrypted and decrypted, IPsec performance is reported as the slower of the two operations. Also note that performance is measured in terms of clear text packets.

Depending on packet size, the Mbps measured inside the encrypted tunnel between the two DUTs could be nearly two times the Mbps reported by the network tester. The increase in packet size inside the tunnel also causes the 1 Gbps Ethernet link between the DUTs to saturate at approximately 950 Mbps, rather than approximately 990 Mbps for a non-encrypted Ethernet link.
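
To see why, consider approximate numbers for ESP tunnel mode with 3DES-HMAC-SHA-1: a 64-byte clear text packet gains a new 20-byte outer IP header, an 8-byte ESP header, an 8-byte IV, padding to the 8-byte 3DES block size, a 2-byte ESP trailer and a 12-byte truncated HMAC-SHA-1 ICV, putting roughly 120 bytes on the wire inside the tunnel, nearly double the clear text size. Because this per-packet overhead is fixed, its relative impact shrinks as packet size grows.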

Sources of variation in IPsec testing
It is beyond the scope of this two-part series to cover all the potential variables impacting IPsec benchmarks using measurement option 2. However, two related testing variables are worth mentioning: injection rate and acceptable loss rate.

Injection rate is the rate at which the network tester transmits clear text packets to the DUTs. The tester can be configured to either flood the DUT (transmit at 100 percent line rate for all packet sizes), or start slowly and gradually increase the injection rate until the DUT starts dropping packets.

Flood testing generally produces better IPsec benchmarks than zero-loss testing, but flood benchmarks are less useful to users because most systems are designed to achieve a given performance target without dropping large numbers of packets. Are those your zero-loss packets on the floor?

RFC2544 doesn't have a concept of acceptable loss rate. Zero loss testing means performance is defined by the rate at which the DUT can forward packets for some time interval (typically 30-60 seconds) without losing any packets. This part of RFC2544 is generally ignored by networking equipment vendors when testing their end systems.

Independent testing labs, such as the Tolly Group, have been known to measure IPsec “zero loss” performance using an acceptable packet loss rate of <0.001 percent. Both equipment and embedded processor vendors have been seen to report loss rates as high as 0.1 percent as zero loss.

Acceptable loss rate has a significant impact on performance. Freescale has reported zero-loss IPsec results using both absolute zero loss and <0.001 percent, with the <0.001 percent results typically 25 percent better than the absolute zero-loss results.

Higher acceptable loss rates improve the results still further. Because there is no universally agreed-upon definition of acceptable loss rate for a benchmark reported as zero-loss, this is an area where a prospective buyer needs to do some extra homework before comparing two vendor-supplied benchmarks.
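
In practice the search for the zero-loss rate is automated. The C sketch below shows the usual binary search; run_trial() is a toy stand-in for a tester trial (offer traffic at a given rate for the test interval, return the fraction of packets lost), with a made-up loss profile chosen purely to show how the acceptable-loss threshold changes the answer.

#include <stdio.h>

/* Toy DUT model (purely illustrative): clean forwarding up to 600 Mbps,
 * a 0.0005 percent trickle of drops up to 750 Mbps, heavy loss beyond. */
static double run_trial(double rate_mbps)
{
    if (rate_mbps <= 600.0) return 0.0;
    if (rate_mbps <= 750.0) return 0.000005;
    return 0.05;
}

/* Binary search for the highest injection rate whose loss fraction stays
 * at or below the chosen threshold. */
static double find_rate(double line_rate, double acceptable_loss)
{
    double lo = 0.0, hi = line_rate;
    while (hi - lo > line_rate * 0.001) {   /* stop at 0.1% resolution */
        double mid = (lo + hi) / 2.0;
        if (run_trial(mid) <= acceptable_loss)
            lo = mid;                       /* passed: push the rate up */
        else
            hi = mid;                       /* failed: back the rate off */
    }
    return lo;
}

int main(void)
{
    printf("absolute zero loss: %.0f Mbps\n", find_rate(1000.0, 0.0));
    printf("<0.001%% loss:      %.0f Mbps\n", find_rate(1000.0, 0.00001));
    return 0;
}

With this toy profile, the absolute zero-loss search settles at roughly 600 Mbps while the <0.001 percent search settles at roughly 750 Mbps: the same device, 25 percent apart, depending solely on the loss definition.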

Specific Results
Having defined hardware and software sources of variation, along with testing methodologies to help produce normalized results, this article now looks at specific IPsec measurements on Freescale PowerQUICC processors and how these results illustrate some of the concepts discussed previously.

As the PowerQUICC performance graphs in this part of the series show, system performance is limited by the CPU for small packets and by the Freescale Integrated Security Engine (SEC) for large packets, with memory bus performance equally affecting throughput for all packet sizes.
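
A rough first-order model (an approximation, not a measured formula) captures this behavior: aggregate throughput at clear text packet size S tracks min(Rcpu × S, Bsec), where Rcpu is the roughly size-independent packet rate the CPU and software stack can sustain, and Bsec is the byte rate the accelerator and its memory path can sustain. At small S the CPU term is smaller, so throughput climbs linearly with packet size; once Rcpu × S reaches Bsec, the accelerator and memory subsystem flatten the curve.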

Publicly available performance curves from competing products demonstrate that the same fundamentals are at play in their devices. The superior IPsec performance of PowerQUICC (especially when paired with optimized IPsec stacks such as Mocana's NanoSec) results from a combination of higher performance CPUs, more efficient single-pass crypto acceleration cores, and wider, faster, effectively pipelined buses.

All performance graphs were measured with a Smartbits SMB600 as both packet generator and packet counter. The Smartbits Terametrics module generates clear IPv4 packets at maximum rate and transmits them to one of the Ethernet ports of PowerQUICC board 1.

PowerQUICC board 1 classifies the packet as belonging to an IPsec session and performs ESP tunneling encapsulation using 3DES-HMAC-SHA-1 before forwarding it to one of the Ethernet ports of PowerQUICC board 2.

PowerQUICC board 2 classifies the packet as belonging to an IPsec ESP session to be terminated on board 2 and decapsulates/decrypts the packet before forwarding a clear IPv4 packet back to the Smartbits machine.

Unless otherwise noted, all performance numbers shown reflect the aggregate bi-directional IPsec packet forwarding rate of each PowerQUICC device. 3DES-HMAC-SHA-1 was selected as the ciphersuite for this measurement because it is still the most commonly used IPsec ciphersuite.

It is also the worst-case algorithm combination for the SEC and probably for other crypto-accelerators. The system-level performance difference between 3DES-HMAC-SHA-1 and AES-HMAC-SHA-1 is negligible in PowerQUICC for all but the largest packets.

Benchmark measurements on PowerQUICC II Pro platforms
The PowerQUICC II Pro MPC83xx series of integrated communications processors represents the low end of the PowerQUICC product line. These devices use the e300 Power Architecture processor core at frequencies up to 667 MHz.

Some members of the MPC83xx series use reduced-feature versions of the SEC core (accelerating 3DES, AES, HMAC MD5 and SHA-1); others have full-featured SECs, which additionally accelerate public key operations, ARC-4 and random number generation.

Measurement Configuration #1. Shown in Figure 4 below is the first configuration, containing the MPC8313E PowerQUICC II Pro integrated communications processor with the 32-bit e300 Power Architecture core and SEC 2.2, and the following configuration parameters:

MPC8313E RDB
e300 core at 333 MHz, DDR at 333 MHz data rate, and SEC at 166 MHz
OS: Linux 2.6.21
IPsec stacks: StrongSwan, OpenSwan, Mocana NanoSec, all running 3DES-HMAC-SHA-1

The chart shows the Mocana NanoSec IPsec stack (http://mocana.com/NanoSec.html) as having higher throughput at all packet sizes. The Mocana performance advantage over OpenSwan is relatively constant at 1.7x, while the advantage over StrongSwan starts at 1.6x but grows to 2.2x as packet size increases.

Both OpenSwan and Mocana operate asynchronously, while StrongSwan processes packets one at a time, waiting for the SEC to complete processing before continuing. At small packet sizes, StrongSwan's greater efficiency and its avoidance of SEC interrupts (it polls for completions) allow it to slightly outperform OpenSwan, but OpenSwan overtakes StrongSwan at medium packet sizes.

Figure 4. MPC8313E IPsec Performance

Measurement configuration #2. Figure 5 below shows the security performance of a configuration containing the MPC8323E, based on the 32-bit e300 Power Architecture core and SEC 2.2, with the following parameters:

MPC8323E RDB
e300 core at 333 MHz, DDR at 266 MHz data rate, and SEC at 133 MHz
OS: Linux 2.6.20.6
IPsec stacks: NetKey, StrongSwan, OpenSwan, Mocana NanoSec, all running 3DES-HMAC-SHA-1

Figure 5. MPC8323E IPsec Performance

The chart provides a complete comparison of the most popular open source IPsec stacks and the Mocana stack running on the same Linux kernel version. NetKey performance with both hardware and software encryption is included in the comparison.

As on the MPC8313E in Configuration #1, the Mocana NanoSec IPsec stack has the highest throughput at all packet sizes. Also as before, StrongSwan slightly outperforms OpenSwan at the smallest packet sizes, but then falls behind as the synchronous API to the SEC blocks the processor from doing other work.

The Mocana advantage over OpenSwan is approximately 1.9x at small-medium packet sizes. However, at larger packet sizes, Mocana hits the 200 Mbps link limit (bidirectional testing with two 10/100 fast Ethernet interfaces), allowing OpenSwan to appear to close the gap.

NetKey is slightly more efficient than OpenSwan when performing software encryption. However, the native Linux Crypto API's dual-pass, synchronous interface to the SEC creates such high overheads that NetKey with hardware acceleration is slower than the other stacks, and even slower than NetKey with software encryption at the smaller packet sizes.

Measurement Configuration #3. Shown in Figure 6 below is the security performance of a configuration containing the MPC8349EA, with the 32-bit e300 Power Architecture core and SEC 2.4, and the following parameters:

MPC8349EA MDS
e300 core at 666 MHz, DDR at 333 MHz data rate, and SEC at 166 MHz
OS: Linux 2.6.11
IPsec stacks: StrongSwan, OpenSwan, Mocana, all running 3DES-HMAC-SHA-1

The chart compares StrongSwan and OpenSwan with Mocana. Mocana provides the highest throughput at all packet sizes. The Mocana performance advantage vs. the second-best implementation (OpenSwan in this case) is approximately 1.3x.

Figure 6. MPC8349EA IPsec Performance

Measurement configuration #4. Shown in Figure 7 below is the security performance for the MPC8360E, which contains the 32-bit e300 Power Architecture core and SEC 2.4. These are the parameters:

MPC8360EA MDS
e300 core at 666 MHz, DDR at 333 MHz data rate, QUICC Engine at 500 MHz, and SEC at 166 MHz
OS: Linux 2.6.22
IPsec stacks: StrongSwan (uni-directional), Mocana (bi-directional), both running 3DES-HMAC-SHA-1

The chart compares StrongSwan with Mocana, with CPU utilization information. For this particular device, StrongSwan slightly outperforms Mocana at the smallest packet sizes. This may be due to a measurement difference (uni-directional vs bi-directional testing).

For IPv4 forwarding, uni-directional performance at 64 bytes is 142 Mbps, versus 114 Mbps bi-directionally. This suggests that if the StrongSwan IPsec performance had been measured bi-directionally, Mocana would have outperformed StrongSwan, which is consistent with measurements on other devices.

Another unusual data point is the Mocana CPU utilization at 1456 bytes: CPU utilization should have continued to drop. CPU utilization is a calculated number, and it is possible that CRC or other Ethernet errors exceeded 0.001 percent, depressing the measured throughput. Because the Ethernet frames were dropped after IPsec processing, the CPU appears to have done more work than it actually did.
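
One common way to derive such a number (an assumption here, not confirmed for these tests) is idle-loop calibration: utilization is reported as one minus the ratio of idle-loop iterations counted under load to those counted on an unloaded system. Under that scheme, frames dropped after IPsec processing still consume busy cycles while adding nothing to the counted throughput, so utilization looks inflated relative to the delivered Mbps.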

Figure 7. MPC8360E IPsec Performance

Measurement configuration #5. Shown in Figure 8 below is the security performance for the MPC8379E PowerQUICC II Pro integrated communications processor, which contains the 32-bit e300 Power Architecture core and SEC 3.0. These are the parameters:

MPC8379E RDB
e300 core at 666 MHz, DDR at 333 MHz data rate, and SEC at 110 MHz
OS: Linux 2.6.23
IPsec stacks: OpenSwan, Mocana, both running 3DES-HMAC-SHA-1

The chart shows the Mocana NanoSec IPsec stack as having higher throughput at all packet sizes. The Mocana performance advantage over OpenSwan is approximately 1.4x at 64B. This delta is relatively constant until the SEC begins to become the performance limiter, at which point OCF closes the gap.

Figure 8. MPC8379E IPsec Performance

Note that in order to perform a true “apples to apples” comparison of OpenSwan and Mocana, the MPC8379RDB board, which currently restricts SEC frequency to 110 MHz (compared to 166 MHz on most other 83xx products), was used.

OpenSwan data measured on a different board without this SEC frequency limitation shows the MPC8379E IPsec performance to be approximately 13% better across all packet sizes than the results shown in the figure, reaching approximately 560 Mbps at large packet sizes. It is reasonable to assume that the Mocana results would be 13 percent better at all packet sizes as well if the SEC were running at 166 MHz.

Measurements on the PowerQUICC III MPC85xx series
The PowerQUICC III MPC85xx series of integrated communications processors represents the mid-to-high end of the PowerQUICC product line. These devices use the e500 Power Architecture processor core, with front-side L2 caches. The e500 core operates at frequencies up to 1.5 GHz. All members of the MPC85xx series use full-featured SECs.

Measurement configuration #1. Shown in Figure 9 below is the security performance for the MPC8544E and MPC8533E PowerQUICC III integrated communications processors, which contain the 32-bit e500 Power Architecture core and SEC 2.1. These are the parameters:

MPC8544E CDS
e500 core at 800 MHz, DDR at 400 MHz data rate, and SEC at 133 MHz
OS: Linux 2.6.23
IPsec stacks: Mocana, running 3DES-HMAC-SHA-1

The only IPsec results available for the MPC8544E are from Mocana, for their commercial NanoSec IPsec implementation. Note that the SEC runs at a lower frequency (relative to the CPU) in the MPC8544E than in other 85xx devices, so the reduction in CPU utilization due to SEC saturation begins earlier than in other devices.

Figure 9. MPC8544E IPsec Performance

Measurement configuration #2. Figure 10 below shows the security performance for the MPC8555E and MPC8541E PowerQUICC III integrated communications processors, which contain the 32-bit e500 Power Architecture core and SEC 2.0. These are the parameters:

MPC8555E CDS
e500 core at 833 MHz, DDR at 333 MHz data rate, and SEC at 166 MHz
OS: Linux 2.4 (StrongSwan), Linux 2.6.11 (Mocana)
IPsec stacks: StrongSwan, Mocana, both running 3DES-HMAC-SHA-1

The chart compares StrongSwan with Mocana, with CPU utilization information. For this particular device, Mocana NanoSec outperforms StrongSwan at all packet sizes. Mocana CPU utilization steadily declines as packet size increases, while StrongSwan consumes 100% of the CPU at all packet sizes due to its synchronous, polling mode of operation.

Figure 10. MPC8555E IPsec Performance

Measurement Configuration #3. The security performance for the MPC8548E PowerQUICC III integrated communications processor, which contains the 32-bit e500 Power Architecture core and SEC 2.1, is shown in Figure 11 below. These are the parameters:

MPC8548E CDS
e500 core at 1.33 GHz, DDR at 533 MHz data rate, and SEC 2.1 at 266 MHz
OS: Linux 2.6.11
IPsec stacks: OpenSwan, Mocana, both running 3DES-HMAC-SHA-1

The chart compares OpenSwan with Mocana, and adds information about CPU utilization. Mocana NanoSec provides the highest throughput at all packet sizes and begins to saturate the SEC at 390 bytes.

CPU utilization continues to drop as packet size increases, and at 1456 bytes, Mocana achieves 1057 Mbps with 38 percent CPU utilization. OpenSwan has a similar profile; however, it does not saturate the SEC until 1024 bytes, and consumes 69 percent of the CPU while achieving 1057 Mbps at 1456 bytes.

Figure 11. MPC8548E IPsec Performance

Conclusion
As the example configurations described in this article illustrate, the security performance of embedded communications processors is strongly determined by the performance of the CPU for small/medium packet sizes and is limited more by the crypto-hardware/memory bus for larger packet sizes.

While this point was illustrated here with the performance curves of various PowerQUICC processor family members, it is true for all embedded communications processors using general purpose CPUs and look-aside crypto acceleration.

It is possible to achieve higher performance using more NPU-like processing engines; however, doing so generally results in less flexibility, both in absolute programmability and in tool chain support.

Author Geoff Waters, Senior Systems Engineer, joined Freescale Semiconductor in 1997 to work in the company's Networking and Computing Systems Group in Austin, Texas. Initially focused on security acceleration technologies, Geoff has served as a senior systems engineer for Freescale's Digital Systems Division for the last 5 years. Prior to working at Freescale, Geoff was a contractor to the US Defense Threat Reduction Agency [formerly known as the Defense Nuclear Agency (DNA)]. He is a graduate of the University of Houston Honors Program.

Editor Kurt Stammberger, CISSP, is VP of Development for Mocana, which focuses on the security of non-PC devices. He has over 19 years of experience in the security industry. He joined cryptography startup RSA Security as employee #7, where he led their marketing organization for eight years, helped launch spin-off company VeriSign, and created the brand for the technology that now protects virtually every electronic commerce transaction on the planet. Kurt founded and still serves on the Program Committee of the annual RSA Conference. He holds a BS in Mechanical Engineering from Stanford University, and an MS in Management from the Stanford Graduate School of Business, where he was an Alfred P. Sloan Fellow.

Readers may download free source-code trial packages of the Mocana software mentioned in this article by going to the Mocana website.
