Protecting multicore designs without compromising performance

Networking silicon design teams face huge leaps in demand for faster security protocol throughput, a demand driven by increasing data transfer speeds, as well as market forces (increasing attacks from hackers and malware) and technology trends. To meet the throughput demands, designers are turning to SoCs with multiple processor cores as well as multiple dedicated blocks of intelligent packet processing engines, all working in parallel to deliver throughputs of 40+ Gbps.

Cisco estimates that IP traffic is expanding at a compound annual growth rate (CAGR) of 25 percent – roughly a doubling every three years (1.25³ ≈ 1.95). Alongside this continual bandwidth expansion, security threats to data in transit are multiplying. Threats include address spoofing, passive monitoring (or ‘eavesdropping’), data integrity attacks, and sophisticated man-in-the-middle attacks. These threats are driving the industry to encrypt an ever-increasing percentage of communications using security protocols such as MACsec, IPsec and SSL/TLS.

Adding to the pressure from these market forces are technology trends that put increasing demands on packet processing. Virtual private network (VPN) communications must be protected by encryption with a security protocol, and the increasing use of mobile data offload to WiFi is also driving a rise in encrypted packet traffic. For example, LTE to WiFi offload using the Evolved Packet Data Gateway (ePDG) architecture (Figure 1) relies on the IPsec security protocol to protect otherwise exposed communications.

Figure 1: Mobile Data Offload with ePDG

The result is that specifications for security protocol throughput have moved from 5 Gbps to 10 Gbps and now to 40+ Gbps for IPsec and SSL and to over 100 Gbps for MACsec in just a few years. Silicon designs that implement protocol processing are challenged to keep pace.

Three design approaches for security protocol processing
Networking silicon design teams have three possible approaches to address requirements for security protocol processing:

  • A software-only approach, executing on a networking processor’s primary CPU
  • Cryptographic-specific processing in hardware IP
  • Full security protocol processing in hardware IP

With the software-only approach, all security protocol processing executes on the primary CPU. Software “stacks” for protocol processing can be integrated into system software without affecting the hardware design. However, these stacks are resource intensive, executing complex mathematical algorithms for encryption and decryption of data, as well as implementing extensive data movement routines for each packet payload (Figure 2). A software-only approach runs into a bottleneck because the compute-intensive and data-movement operations quickly overload the CPU. Multicore CPUs can push the bottleneck out, but even then throughput is typically limited to less than 2 Gbps, well below today’s networking requirements.

Figure 2: This flowchart shows security protocol processing operations in a software-only implementation. The software stack, running on the CPU, classifies a packet and then executes the cryptographic processing for the encryption and hash steps before routing the packet to its destination.
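
To get a rough feel for why the software-only path saturates, it helps to time just the encryption step of that flow on a single core. The sketch below is not from the article; it is a hypothetical micro-benchmark built on OpenSSL’s EVP API, and it measures only AES-GCM encryption of MTU-sized payloads, ignoring classification, hashing and routing entirely:

    /* Hypothetical single-core benchmark: AES-256-GCM over MTU-sized payloads.
     * Build: gcc sw_crypto_bench.c -O2 -lcrypto
     * Measures only the encryption step of the software-only flow. */
    #include <openssl/evp.h>
    #include <openssl/rand.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        unsigned char key[32], iv[12];
        unsigned char payload[1500];            /* one MTU-sized packet payload */
        unsigned char out[sizeof payload + 16]; /* ciphertext buffer + headroom */
        RAND_bytes(key, sizeof key);
        RAND_bytes(iv, sizeof iv);
        memset(payload, 0xAB, sizeof payload);

        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        const long npkts = 1000000;
        int outl;

        clock_t t0 = clock();
        for (long i = 0; i < npkts; i++) {
            /* Per packet: re-key, encrypt, finalise. Reusing one IV is fine for
             * a benchmark but would be insecure in a real protocol stack. */
            EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv);
            EVP_EncryptUpdate(ctx, out, &outl, payload, sizeof payload);
            EVP_EncryptFinal_ex(ctx, out + outl, &outl);
        }
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        double gbps = (double)npkts * sizeof payload * 8.0 / secs / 1e9;

        printf("Single-core AES-256-GCM: %.2f Gbps\n", gbps);
        EVP_CIPHER_CTX_free(ctx);
        return 0;
    }

The figure it prints varies widely with the CPU and whether AES hardware instructions are available; the point is that classification, hashing and per-packet data movement in a full software stack only subtract from whatever number a single core can reach here.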

A second option, found in many networking silicon designs, uses discrete cryptographic algorithms embedded in hardware IP (intellectual property), integrated as part of a dedicated adjunct processor. The IP offloads encryption and decryption from the CPU, performing the math-intensive operations efficiently (Figure 3). But the rest of the protocol processing workload remains in software and is still handled by the CPU.

Figure 3: Here is the processing flow using crypto-specific IP. While math-intensive operations are offloaded, the CPU is still responsible for multiple steps and complex data movements.

The drawback to this crypto-specific IP approach is that, although it raises throughput up to a point, it can’t scale as workloads grow. A general approach to performance acceleration is to use more copies of a processing unit, operating in parallel. But if multiple copies of crypto-processing IP are implemented in parallel, they require sophisticated synchronization and cache coherency management to operate together efficiently. The CPU, already handling data movement and other tasks, is overwhelmed by the added burden of keeping multiple secure data flows synchronized. Before throughput reaches 10 Gbps, protocol processing overloads the CPU; the bottleneck reappears and the requirements are not met.
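
The synchronization burden lands on the host because packets of the same secure flow must keep their order and share one sequence counter no matter which engine copy handles them. The sketch below is purely illustrative and assumes a hypothetical driver call, post_to_engine(); it shows the kind of per-flow bookkeeping the CPU must execute for every packet on top of the data movement it already performs:

    /* Illustrative only: per-flow state the host CPU must serialise around when
     * fanning packets of one secure flow out to parallel crypto engines.
     * post_to_engine() is a hypothetical placeholder, not a real driver API.
     * Build: gcc flow_dispatch.c -O2 -pthread */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_ENGINES 4

    struct sec_flow {
        pthread_mutex_t lock;   /* serialises same-flow dispatch               */
        uint64_t next_seq;      /* per-flow (e.g. anti-replay) sequence number */
        int last_engine;        /* round-robin position across engine copies   */
    };

    /* Hypothetical stand-in for posting a descriptor to one crypto engine copy. */
    static void post_to_engine(int engine, uint64_t seq, const void *pkt, int len)
    {
        (void)pkt;
        printf("engine %d <- seq %llu (%d bytes)\n",
               engine, (unsigned long long)seq, len);
    }

    /* Work the CPU must do per packet, in addition to classification and data
     * moves: take the flow lock, stamp the sequence number, pick an engine. */
    static void dispatch(struct sec_flow *f, const void *pkt, int len)
    {
        pthread_mutex_lock(&f->lock);
        uint64_t seq = f->next_seq++;
        int engine = f->last_engine = (f->last_engine + 1) % NUM_ENGINES;
        pthread_mutex_unlock(&f->lock);

        post_to_engine(engine, seq, pkt, len);
    }

    int main(void)
    {
        struct sec_flow flow = { .next_seq = 0, .last_engine = 0 };
        pthread_mutex_init(&flow.lock, NULL);

        char pkt[1500] = {0};
        for (int i = 0; i < 8; i++)
            dispatch(&flow, pkt, sizeof pkt);
        return 0;
    }

Completions arriving from different engine copies then have to be re-ordered by sequence number before the packets re-enter the flow, which is yet more host-side work.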

The remaining option is full security protocol processing in hardware IP (Figure 4). Software stacks initiate data flows and interface with application software. This approach delivers the efficiency of hardware-based execution for all protocol processing steps and offloads the CPU, removing the throughput bottleneck, with performance scaling through parallelization. Throughputs of 40+ Gbps are achievable.

Figure 4: A throughput-optimized design implementing full security protocol processing in hardware IP

Diving a little deeper, the IP must have the protocol knowledge to effectively manipulate a packet. For protocol encryption the IP must be able to inspect and classify a packet within the data flow, move the packet payload to the proper crypto algorithm, execute the algorithm, and then move the encrypted payload back into a packet in the data flow. The IP must also be able to manage multiple data flows, synchronizing data movements and maintaining cache coherence and security registrations for multiple packets.
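
For contrast with the crypto-only case, the host-side picture with full protocol offload reduces to flow setup plus occasional event handling. The calls in the sketch below (hw_flow_create, hw_poll_events) are hypothetical placeholders rather than any vendor’s actual driver interface; what matters is what no longer appears: no per-packet classification, no payload copies, and no sequence-number bookkeeping on the CPU.

    /* Illustrative only: the host CPU's role when the hardware IP owns the whole
     * protocol. All hw_* calls are hypothetical placeholders, not a real API. */
    #include <stdint.h>
    #include <stdio.h>

    struct flow_config {
        uint32_t spi;        /* e.g. IPsec Security Parameter Index */
        uint8_t  key[32];    /* per-flow key material               */
        uint8_t  cipher;     /* selected algorithm, e.g. AES-GCM    */
    };

    /* One-time: hand the flow description to the packet engine. After this the
     * engine classifies, encrypts/decrypts and re-injects packets on its own. */
    static int hw_flow_create(const struct flow_config *cfg)
    {
        printf("flow 0x%08x programmed into packet engine\n", (unsigned)cfg->spi);
        return 0;
    }

    /* Occasional: poll (or take an interrupt) for statistics and exceptions. */
    static void hw_poll_events(void)
    {
        printf("no exceptions; packets are flowing in hardware\n");
    }

    int main(void)
    {
        struct flow_config cfg = { .spi = 0x1001, .cipher = 1 };
        hw_flow_create(&cfg);  /* software initiates the data flow...      */
        hw_poll_events();      /* ...then stays out of the per-packet path */
        return 0;
    }

Because the per-packet path never touches the CPU, adding engine copies scales throughput without adding host-side synchronization, which is what makes the 40+ Gbps targets reachable.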

INSIDE Secure’s approach – Intelligent Packet Engines
One example of IP that meets these requirements is the family of Intelligent Packet Engines from INSIDE Secure. The modules are described as ‘intelligent’ because they contain complete protocol knowledge and don’t need software executing on the CPU to intervene. The engines apply protocol knowledge to manipulate packets, manage multiple data flows, and execute a range of other functions (Figure 5 ). These engines can be employed in several system architectural variations – look-aside, in-line, and hybrid look-aside. Most critically, Intelligent Packet Engines can scale up to meet new demands for security protocol throughputs of 40+ Gbps by managing multiple data flows in parallel.

Figure 5: Example of an Intelligent Packet Engine. Developed by INSIDE Secure and packaged as synthesizable Verilog RTL source code, it provides multi-CPU support. This version implements a hybrid look-aside internal architecture.

Summary
As data rates climb and malicious software attacks escalate, the need for high-speed security protocol processing is rapidly increasing, driving throughput requirements to new levels. Common approaches, using software or crypto-only IP, cannot keep pace.

A third approach, full security protocol processing in hardware IP, can scale to meet those requirements. Implemented as Intelligent Packet Engines, these modules deliver throughputs of 40+ Gbps.

A more detailed discussion of this topic will be presented by Steve Singer at the Multicore Developers Conference, May 8-9, in his presentation Meeting New Demands for Networking Security (ME1106).

Steve Singer is vice president of worldwide field application engineering at INSIDE Secure, responsible for the company’s embedded hardware and software security products. Previously he worked at SafeNet as a Staff ASIC Design Engineer and at Oki Semiconductor as a Senior ASIC Design Engineer. He has a BS/MS in Computer Engineering from Northeastern University in Boston, Massachusetts.
