To demonstrate the capabilities of TIPIarchitectures, we construct a novel processing element for networkprocessing. The ClickPE architecture takes into account theprocess-level, data-level, and datatype-level concurrency expressed inClick applications. We will show that hardware support for theapplication's concurrency can be implemented at low cost. By shrinkingthe concurrency implementation gap, we can also achieve highperformance.
Designing a PE for Click
How does an architect approach the problem of designing a new PE? Thereare three categories of knowledge that guide the design process:
Category#1: Application Knowledge. First, we look at Clickmodels like the router in Fig. 13.15 and analyze the computations theyperform. The Click model divides packets into headers and bodies in theingress side. Headers are passed through elements like Paint,CheckPaint, CheckIPChecksum, DecIPTTL, CheckIPTTL, and LookupIPRoute.Headers and bodies are reassembled on the egress side.
The ClickPE should be able to support all of theseoperations with good performance. In this chapter, we focus on theelements that perform computations on packet headers. We do notconsider the problem of queuing packets or packet header/bodyseparation.
Category #2: Architectural Knowledge. Second, welook at the targetimplementation platform and see what features are available toconstruct the ClickPE. We wish to implement our PE on a Xilinx Virtex 2Pro FPGA. Since this is a configurable platform, the ClickPE should bea customizable element.
One important customization is extensibility. Itshould be possible to extend the ClickPE to add support for additionalClick elements. As a complement to this, the ClickPE should also bereducible. It should be possible to trim it down to support just one ortwo Click elements with greater performance and smaller arearequirements.
Category #3: Prior Experience. Third, we look atother custom PEsfor network processing to see what is possible and what the potentialpitfalls are: Longest prefix match computation: A common observationis that the LookupIPRoute element is a bottleneck. Severalarchitectures have been proposed to speed up this operation. Henrikssonand Verbauwhede propose a pipelined architecture based on a 2-bit trie. There, the key to performance is to saturate accesses to therouting table stored in external SRAM. This has the potential to do oneroute lookup every clock.
Taylor et al.  use eight PEs to saturate accessto an external routing table. The PEs perform a more complicatedrouting algorithm that saves significant memory over the trie approach.
The Lulea algorithm  is compelling for theClickPE because it offers several ways to exploit datatype-levelconcurrency. One operation in this algorithm is a reduction add, whichcounts the number of set bits in a 16-bit word. Also, the Luleaalgorithm can be modified to take advantage of the 18-bit wide BRAMblocks on the Xilinx FPGA, thus permitting a compressed routing tableto be stored on-chip.
- Manipulating header fields: A secondobservation is that it is good to be able to quickly modify irregularlysized bit fields in packet headers. Sourdis et al.  build PEs withheader field extraction and modification engines to do this efficiently.
- Custom control flow: The Linkopingarchitecture  recognizes that standard sequential control flow isnot the best match for packet processing. This PE offers anapplication-specific data-driven control scheme. In one cycle, the PEcan perform a field matching operation and compute the next controlword for itself.
To guide the integration of all of these ideas, werely on architectural design patterns  and try to follow aminimalist approach.
The resulting ClickPE architecture is shown in Figure. 13.16 below.This schematicis the top level of a hierarchical design that matches the hierarchicalarrangement of fields in an Ethernet packet header.
At the top level, the datapath is 344 bits wide.This allows the PE to process a complete packet header in parallel. The344 bit number is the sum of a 20-byte IPv4 header, a 14-byte Ethernetheader, and a 9-byte Click header for annotations such as Paint valuesand timestamps. There is a 344-bit register file at this level, andnetwork-on-chip I/O ports for sending and receiving headers as the dataportion of signalwith- data transfers.
The operations at this level of the ClickPE are tobreak up the header into IPv4, Ethernet, and Click sub-headers and passthese components to SubPEs for further processing. Additionaloperations provide for reassembling modified packet fields and storingthe result in the packet register file.
|Figure13.16. ClickPE architecture schematic.|
The program factory SubPE is used to calculatecontrol words for transmission to other PEs as a transfer. Control canbe generated from an immediate value in the instruction word, by alookup within a small memory, or by a hardware if-then-else operation.Each of the other Sub- PEs can perform comparison operations to drivethe if-then-else operation. These connections are not shown in thediagram for clarity.
The SubPEs do the bulk of the header processing. TheIPv4 SubPE is diagrammed in Figure13.17 below where the remaining SubPEs are similar in structure.There is a primary register file for storing entire IPv4 packetheaders. The SubPE has operations to break out the fields into smallerregister files and reassemble them. This hierarchy of register filesallows computation to occur on different parts of the header inparallel.
The IPv4 SubPE also has functional units for doingchecksum calculations, testing the values of fields, decrementingfields, and incrementally adjusting the checksum. All of theseoperations are inspired by the elements in the Click router model. Thiscombination of functional units and register files of various widthssupports the datatype-level and datalevel concurrency in theapplication.
|Figure13.17. Example of a TIPI PE in a multiprocessor context|
ClickPE Control Logic
The ClickPE is also interesting for what it omits. There is no programcounter or instruction fetch and decode logic. Instead, the controltemplate shown in Figure 13.12 (from Part 3 in this series) is used.The ClickPE can calculate transfers for itself or it can receivetransfers from another PE in the platform architecture.
By omitting the standard RISC control logic, we aresimultaneously simplifying the architecture and realizing a model ofprocess-level concurrency that is a better match for the requirementsof the application. And still, the ClickPE is a programmablearchitecture. The programmability lies in how one defines the macrooperations that are performed when transfers are received.
The LuleaPE is a trimmed-down version of the ClickPE designed toperform the Lulea longest prefix match algorithm. It is shown in Figure13.18 below.
This architecture trades off reusability (since itsupports fewer Click elements) for higher performance and lesserresource utilization. This PE is meant to perform one stage of thethree-stage Lulea algorithm. Later we will implement a pipeline ofthree LuleaPEs to do the entire algorithm.
The Lulea approach involves performing several tablelookups using parts of the IPv4 destination address as indices. In, the sizes of these tables are constrained by the byte boundariesof a traditional processor's memory space. The FPGA offers embeddedmemories that are 18 bits wide; therefore we modify the Lulea algorithmto take advantage of this. Several of the lookup tables can be madesmaller as a result, allowing us to store a larger routing table withfewer BRAM blocks. These lookup tables are implemented as functionalunits directly in the PE datapath. They can be accessed concurrently,thereby improving data-level concurrency.
Another modification from the original algorithm isin the reduction add operation. This was originally done using a lookuptable. The LuleaPE provides a hardware operation for this, eliminatingextra memory references and better matching datatype-level concurrency.
Several of the ClickPE's wide register files areabsent in the LuleaPE. This is because the LuleaPE only requires readaccess to the packet header. No temporary storage for modified headersis required. The LuleaPE reads the header data directly from thepacketIn port, where it arrives as the data component of an incomingtransfer. This saves extra cycles that the ClickPE needs to move datato SubPEs.
|Figure13.18. Lulea architecture schematic.|
To measure the performance and implementation costs of the ClickPE andLuleaPE architectures, we use TIPI's simulator generation and RTLgeneration tools. A cycle-accurate simulator is used to measure howfast the PEs can run Click application tasks. The generated Verilogcode is synthesized and implemented on a Xilinx Virtex 2 Pro FPGA todetermine a feasible clock rate and hardware resource requirements.
ClickPE Performance . To test the ClickPE, weimplementan architecture based on the template shown in Figure 13.12 in Part 3.Thisrepresents a single ClickPE as it would appear in a largermultiprocessor configuration.
The external incoming and outgoing transfer sourcesare mapped to FPGA I/O pins. A roundrobin arbitrator provides fairaccess to a 16-deep transfer queue that is shared by the externaltransfer source and the internal feedback path.
The ClickPE is programmed using the example Clickapplication shown in Figure 13.15in Part 3 in this series. This application is first modeled using theCairn design tools. The Click model transform is applied, and theresulting actors are assigned to the PE's mapping model.
A straightforward deployment strategy for amultiprocessor router is to use one ClickPE for packet ingress andegress on a single network port. Therefore we assign theCheckIPChecksum, CheckIPTTL, Paint, CheckPaint, and DecIPTTL actors tothe PE.
|Table13.1. Cycle counts for mapped Click elements|
The Cairn code generation tool converts these mappedactors into individual programs that the PE can run. Table 13.1 aboveshows the resultsof the code generation process. These values are the optimal cyclecounts for executing each mapped actor's functionality. For thisexperiment, we included additional hardware in the IPv4 SubPE forperforming the L1 stage of the Lulea lookup algorithm. This is notshown in Figure 13.17 earlierfor clarity. The lookup tables were sized for a 10,000-entry routingtable.
FPGA resource usage is given in Table 13.2 below.For comparison, amedium-sized FPGA in the Virtex 2 Pro family (2VP50) has 23,616 slicesand 232 block RAMs.
|Table13.2. FPGA implementation results|
An incoming packet is processed by the Paint,CheckIPChecksum, and CheckIPTTL actors for a total of 22 cycles ofcomputation. An outgoing packet requires 14 cycles for the CheckPaintand DecIPTTL operations.
At 110 MHz, the ClickPE can therefore process morethan 3 million packets/sec. Assuming minimum-sized 64-byte Ethernetpackets with a 64-bit preamble and a 96-bit interframe gap, thiscorresponds to a line rate of 2 Gbps. This is ideal for a full-duplexGigabit Ethernet port.
To test the performance of the LuleaPE, we constructed a pipeline ofthree PEs to match the three stages of the Lulea lookup algorithm. Thisis shown in Fig. 13.19. The L1 stage matches the first 16 bits of theIPv4 destination address. The L2 stage matches the next 8 bits, and theL3 stage the last 8 bits.
After the Click model transform is applied, eachstage is mapped to the corresponding LuleaPE. If a longest prefix matchis found in the L1 or L2 PEs, the remaining PEs simply push this resultthrough to the end of the pipeline instead of doing further computation.
The routing tables across the three LuleaPEstogether contain 100,000 entries, compared to the 10,000 entries forthe ClickPE by itself. Of these 100,000 entries, 10,000 are 16 bits orless, 89,000 are distributed between 17 and 24 bits, and the remaining10,000 are longer than 24 bits.
This distribution mimics that found in a typical corenetwork router, and is much larger than the tables used in NPClick(10,000 entries) and Cliff (a small 12-entry table). The routing tablesuse 435 KB of on-chip BRAM memory.
Results of the Cairn code generation process areshown in Table 13.3. The L2 and L3 PEs require larger lookup tablesthan the L1 PE. The extra two cycles these PEs use to process a packetare due to extra pipelining in these large RAMs.
For packets that match prefixes of 16 bits or less,the three-PE pipeline can process up to 10.5 Gbps of traffic. A randommix of incoming packets will result in a processing rate of between 8.4and 10.5 Gpbs of traffic.
FPGA implementation results for the LuleaPE pipelineare included in Table 13.2 above.As expected, the LuleaPE is both smaller and faster than the ClickPE.Three LuleaPEs together occupy about the same number of slices as asingle ClickPE. The BRAMs are used exclusively for storing the routingtables. Although the FPGA offers dual-ported BRAMs, we are only usingone read port. The second port remains available for making updates tothe routing tables.
|Figure13.19. Three LuleaPE pipeline for route lookup.|
Comparison of these performance numbers to other methodologies iscomplicated by the differences in the target architectures.Furthermore, we have not constructed a complete stand-alone router withEthernet interfaces and proper handling of packet bodies. This workfocuses on the header processing path.
We believe that the same techniques demonstrated herecan be applied to the remaining facets of the IPv4 forwardingapplication to create a full implementation with good performanceresults. Supporting additional Click elements on the ClickPE willrequire making extensions to the architecture.
SMP-Click achieves 493K packets/sec on adual-processor Xeon Linux workstation. Synchronization and cachecoherency between PEs are major bottlenecks in this implementation.With the LuleaPE pipeline multiprocessor architecture, we are able tobetter match process-level concurrency and greatly exceed this packetrate.
NPClick implements a full 16-port Fast Ethernetrouter at rates between 880″1360 Mbps on the IXP1200. The ClickPEperforms better than the IXP microengines, but the IXP architecturealso interfaces with actual Ethernet MACs. The NPClick code spendsabout 50% of its time moving packet data on and off the IXP.
A full Cairn design should surpass this limitation byusing more PEs for packet I/O. The raw lookup performance of ourLuleaPE pipeline currently matches the lookup rate of a hand-codeddesign for the IXP2800 , so we expect to be competitive with thisarchitecture but with higher designer productivity.
CRACC's DSLAM application requires a much lowerpacket rate than a core router. It achieves 3 Mbps downstream and 512Kbps upstream with a single embedded microprocessor.
StepNP can achieve a 10-Gpbs throughput, but thisrequires 48 ARM processors, each with 16 hardware threads and a 500-MHzclock rate. These hardware requirements are significant, andperformance is dependent on network-on-chip latency.
|Table13.3. Performance results for LuleaPE|
The SoftMP approach achieves 2.4 Gbps on an FPGA with14 Micro- Blaze cores at 100 MHz. This design requires 11,000 slices.
Finally, the Cliff router is able to handle 2 Gbpsof traffic in about 4,000 slices. This includes the logic to interfaceto two Gigabit Ethernet MACs, but not the MACs themselves. Thecomparable Cairn design would have two ClickPEs and the pipeline ofthree LuleaPEs, requiring 8,200 slices plus additional logic for MACinterfaces. For the extra area, we gain programmability and lookupperformance headroom to scale to several more Ethernet ports.
Potentials for Improvement
In both the ClickPE and the LuleaPE, transfers are processed one at atime. This leads to relatively poor utilization of the datapathresources since the lookup tables are only accessed once in each 8- or10-cycle program schedule. Better performance can be achieved if the PEprocesses transfers in a pipelined fashion.
The Cairn code generation tool can accept aninitiation interval constraint to produce schedules that can bepipelined. If we apply the datapath duplication design pattern andinclude four dual-port register files, four adders, and four reductionadders, we can find a schedule with an initiation interval of only onecycle.
The lookup tables are accessed every cycle, asrecommended in . This provides 8 times the improvement inthroughput (to 125M packets/sec or 84 Gbps) at a small extra cost inhardware and latency.
If the LuleaPE is implemented in an ASIC processinstead of on an FPGA, we can expect a further improvement from anincreased clock rate. A 1.3-Mb SRAM (the largest lookup table among thethree Lulea PEs) should run at 500 MHz in a 90-nm technology.
By saturating accesses to this memory, throughputwould improve by another 4 times (to 500M packets/sec). Packet data I/Owill be the bottleneck at this point. The extra processing capabilitycan be applied to more sophisticated packetmatching techniques.
The concurrency implementation gap is a major impediment to deployingprogrammable platforms. Architectures that provide poor support forapplication concurrency requirements make this gap wide; ad hocmethodologies for programming concurrent applications make the gapdifficult to cross.
Designers must address both of these issues toachieve good results for several reasons.
First, it is important to choose a basic blockprocessing element that can support application-specific process-level,data-level, and datatypelevel concurrency. TIPI sub-RISC PEs are acompelling choice because they provide the right balance betweenprogrammability and application specificity. This comes at a lower costthan typical processor architectures in terms of hardware resources anddesigner hours.
Second, designers must take a disciplined approachto deploying concurrent applications. Cairn provides multipleabstractions for the different facets of the design problem. This makesit easy to cross the implementation gap. Designers can experiment withchanges to the application, the architecture, and the mappingindividually. Effective design space exploration will lead to highperformance.
The performance numbers for our FPGA test caseimplementation show the advantages of sub-RISC PEs. Implemented in anASIC, a multiprocessor of ClickPE and LuleaPE elements can easilysurpass the IXP2800 in raw packet destination lookup performance.Clearly, these PEs can make excellent building blocks for futureprogrammable platforms.
The TIPI and Cairn frameworks are funded, in part, by theMicroelectronics Advanced Research Consortium (MARCO) and InfineonTechnologies, and are part of the efforts of the Gigascale SystemsResearchCenter (GSRC).
Toread Part 1, go to Concurrentarchitectures, concurrent applications
To read Part 2: go to Generatingthe architecture from theinstruction set
To read Part 3, go to DeployingApplications with Cairn
Usedwith the permission of the publisher, Newnes/Elsevier,this series offour articles is based on copyrighted material from “Sub-RiscProcessors,” by Andrew Mihal, Scott Weber and Kurt Keutzer inCustomizableEmbedded Processors, edited byPaolo Ienne and RainerLerupers. The book can be purchased on line.
Paolo Ienne is aprofessor at the Ecole Polytechnique Federale de Luasanne (EPFL) andRainer Leupers is professor for software for systems in silicont atRWTH Aachen University. Kurt Keutzer is professor of electricalengineering and computer science at the University of California,Berkeley, where Andrew Mihal was a Ph.D. candidate and Scott Weberreceived a PhD., related to the design and simulation of sub-RISCprogrammable processing elements.