The now-common phrase “the processor is the NANDgate of the future” begs the questions: “What kind ofprocessor?” and”How to program them?” When this is discussed, the focus is usuallyplaced onRISC-based processors augmentedwith instruction extensions as thenatural buildingblock. The presumption is that programming will be done in the C language .
Challenging this viewpoint, our opinion is than evena RISC processor is too coarse-grained for typical embeddedapplications, and that C is insufficient for programming amultiprocessor architecture. As an alternative, we explore the designand deployment of tiny sub-RISC processors as the “NAND gate of thefuture.”
With Tiny Instruction-set Processors and Interconnect(TIPI) processing elements, wecan achieve better performance than RISCelements with fewer resource requirements. Also, we can deployconcurrent applications with a programmingmethodology more productive than C.
Concurrent Architectures, ConcurrentApplications
Programmable processors are a compelling basic block for embeddedsystems. They are a coarser-grained component than RTL statemachinesthat can be used to build large systems with higher designerproductivity. Their software programmability allows them to be usedacross application generations, avoiding the riskof rapid obsolescence that ASICs face.
In the network processing application domain, thereis already a wide variety of networkprocessor architectures that useprogrammable processors as basic blocks . These machines aretypically a heterogeneous mix of RISC-like processing elements (PEs).This style of architecture is often called a programmable platform.Despite the diversity of these architectures, one thing they all havein common is that they are notoriouslydifficult to program.
What makes programming so hard? One proposal is thatthe programming difficulties are the result of poor methodologies formodeling and implementing application concurrency. Network processingapplicationshave multiple types of concurrency on several levels of granularity.
For example, there is traditional process-levelconcurrency, such as the ability to process independent streams ofpackets in parallel. Another level of granularity is concurrency withina task, or data-level concurrency.
Tasks frequently perform computations on severaldifferent fields within a single packet header, and these can often bedone in parallel. At the lowest level of granularity, there isconcurrency on the level of individual bits.
Network processing uses custom mathematicaloperations (such as checksumcalculations) on custom data types (suchas irregular packet header fields). This is datatype-level concurrency.To meet performance goals, all of these flavors of concurrency must beconsideredin the implementation.
The common programming paradigm for networkprocessors is that programmers must write code for each PE in thearchitecture individually. Typically this is done using assemblylanguage, or a C language dialect with compiler intrinsics.
C does not provide good abstractions for applicationconcurrency, especially process-level concurrency. Programmers areforced to manually translate high-level application conceptsinto a set of interacting C programs.
In an application domain where concurrency is aprimary concern, this ad hoc treatment leads to numerous troubles.Producing even one correct implementation is slow and error-prone, andeffective design space exploration is out of the question due toincreasing time-to-market pressure. Even minor modifications to theapplication require lengthysoftware rewrites.
The solution that has been proposed for this problemis to raise the level of the programming abstraction through the use ofa domain specific language(DSL). DSLs provide formalisms forconcurrency thatimprove designer productivity and make it easier to build correctapplications.
For network processing, Click is a popular choice. Projects such as SMP-Click , NPClick , CRACC , StepNP, and Soft- MP  start with a programming abstraction derived fromClick andtarget implementation on platform architectures.
These techniques are successful at improvingdesigner productivity, but they also expose a more fundamental problem.After programmers create formal models of concurrent applications, itbecomes clear that there is an inherent mismatch between theconcurrency requirements of the application and the capabilities of thearchitecture. This problemis the concurrency implementation gap, shown in Fig. 13.1 below .
Like network processing applications, networkprocessors also exhibit multiple types of concurrency on several levelsof granularity. However, these flavors of concurrency are differentfrom those found in the applications.
At the process level, multiple PEs run programs inparallel, but network-on-chip communication protocols do not match thepush and pull communication semantics in Click. RISC-inspired PEs havegeneralpurpose instruction sets and work on standard byte- andword-sized data.
Unfortunately, common bit-level header fieldmanipulations are not a part of standard RISC instruction sets.Irregular packet header fields do not match 32-bit-wide datapaths andALUs. A significant part of the deployment effort is spent workingaround differences such as these while tryingto maintain a high level of performance.
|Figure13.1. The concurrency implementation gap.|
Clearly, programmable platforms face two challenges.One issue is how to model and deploy concurrent applications. The otherissue is that the architectures built to date are not a good match forthe applications.
Although we use network processing and networkprocessors as examples here, these issues apply to embeddedapplications and programmable platforms in general. If processors areto find success as the “basic blocks of the future,” we must attackboth of theseproblems.
This motivates us to rethink the way we designprogrammable platforms in addition to the way we program them.
In this series of articles, we present a novel classof processing element architectures called Tiny Instructions setProcessors and Interconnect (TIPI). TIPI PEs are a finer-grainedcomponent than RISC processors, allowing architects to buildprogrammable platforms that provide customized support for anapplication's processlevel, data-level, and datatype-level concurrencyat low cost. This works to make the implementation gap smaller.
To program these PEs, we provide a deploymentmethodology called Cairn. Cairn provides formal abstractions forcapturing application concurrency and a disciplined mapping methodologyfor implementation on TIPI multiprocessor platforms. This makes theimplementation gap easier to cross. Together, these two techniques makeit possible to realize all of the benefits that programmableplatforms are supposed to have over ASIC systems.
In this first part in a series of articles, weexplore alternative PE architectures and motivate sub-RISC PEs as abasic block. Part 2 willdescribethe design methodology and toolset forconstructing TIPI PEs. In Part 3, the Cairn deployment methodology isintroduced. Next, these techniques are used to build and program anovel architecture for packet processing applications. The ClickPE,described later in Part 4, provides support for the process-level,data-level, and datatype-level concurrency found in a basic Click IPv4forwarding application. This PE can be programmed using Click as ahigh-level input language, without requiring manual coding in C orassembler. Finally, in Part 4, we implement the ClickPE ona Xilinx FPGA and conclude with performance results.
Motivating Sub-RISC PEs
In choosing programmable processors as a replacement basic block forstate machine-based RTL design methodologies, there are severalcriteriawe are trying to fulfill:
- Improvingdesigner productivity: RTL methodologies give architects finecontrol over the details of a design. This is good for buildinghigh-performance systems, but it does not scale. The next architecturalabstraction will accept coarser granularity in exchange for a moreproductive design process.
- Customizability: The basic block should be customizable to meet the needs of aparticular application or application domain. Therefore it should notbe a static entity, but rather a template that defines a family ofpossible blocks. We desire features that support the application'sprocess-level, data-level, and datatype-level concurrency.
- Degrees ofreusability: In some cases, architects are willing to make adesign more application-specific in exchange for higher performance. Inother cases, the ability to work across a broad application domain ismore important than raw performance. The next basic block should permitarchitects to find the sweet spot along this spectrum by exploringtradeoffs between application-specific features and general-purposefeatures.
- Architecturalsimplicity: The basic block should be easy to design, verify, andcompose with other basic blocks. Complexity should only be introduced(and paid for) when the application requires it. If the basic block isitself complicated, then even simple problems will lead to complexsolutions. For a customizable basic block, this means being able toomit unnecessary features as well as to addapplication-specific features.
- Exportability: We must be able to communicate the capabilities of the basic block toprogrammers so that they can deploy applications on it. Developersrequire a programming abstraction that finds the right balance ofvisibility into architectural features that are necessary forperformance and opacity toward less important details. If programmerscannot figure out how to use the architecture, then the benefits of thenew basic block are lost. RISC processors are familiar to both hardwaredesigners and software developers, so they appear to be an obviouschoice. However, they are only one of many different kinds ofprogrammable elements. Fig. 13.2 below arranges the important candidates along a spectrum according to levelof granularity. Existing RTL design methodologies are included as apoint of reference on the left. We are interested in a basic block witha coarser granularity than state machines.
|Figure13.2. Spectrum of basic block candidates.|
At the opposite end of the spectrum, there are largegeneral-purpose processors such as the Pentium. A basic block such asthis is very expensive in terms of area and power, and it does notinclude any applicationspecificfeatures.
The ideal basic block will lie somewhere in betweenthese two extremes. In this section, we will explore each candidate indetail and arguethe merits and shortcomings of each relative to the goals given earlier.
The trouble with using a RISC PE as a basic block can be summarized inone seemingly contradictory statement: they are simultaneously toosimple and too complex for the applications we are interested in. The”too simple” argument follows from the customizability criterion givenearlier. A traditional RISC processor provides features that work forallapplications but are exceptional for no application.
There are few, ifany, opportunities to customize the architecture or to exploretradeoffsbetween application-specific features and general-purpose features.
The “too complex” argument begins at thearchitectural simplicity criterion.In trying to be applicable to any application, RISC PEs will inevitablyinclude features that will never be used by certain applications.These arguments apply to each of the three levels of concurrency:process-level, data-level, and datatype-level.
As an example, considerthe difference between deploying a network processing application ona RISC PE versus using an approach like Cliff . Cliff is astructuralsynthesis approach that builds a custom FPGA design starting from aClick application model.
A RISC PE, on the other hand, must make dowith the available instruction set. This comparison will reveal thatthereis a large design space gap between what is possible with anapplicationspecificdatapath versus what a RISC PE has to do.
Datatype-level concurrency refers to the way data is represented as abit vector and the arithmetic and logical computations that are carriedout on the data. In a Click application, the fundamental data type is apacket header, and elements describe bit-level computations on theseheaders.
The DecIPTTL element is common in IP forwardingapplications and will serve as an example. This element decrements theIP time-to-live header field (an 8-bit value) and incrementally adjuststhe checksum tomatch using 1's complement arithmetic.
Cliff can produce an FPGA design that exactlymatches the bit widths, and the 1's complement arithmetic expressed inthe DecIPTTL element, as shown in Fig.13.3(a) below. A RISC PE, on the other hand, lacks instructionsfor 1's complement arithmetic.
This datatype-level concurrency must be emulatedusing logical shifts, bit masks, and 2's complement arithmetic as shownin Fig. 13.3(b) below. TheRISC PE is sufficient for implementing DecIPTTL, but it is neither thecheapest nor the fastest way toimplement it.
|Figure13.3. Implementing datatype-level concurrency.|
The CheckIPHeader element demonstrates a mismatch in data-levelconcurrency. This element treats an IPv4 header as an array of ten 16-bit fields and performs a 16-bit 1's complement checksum operation. Inthe Cliff design shown in Fig.13.4(a) below , the datapath is wide enough for the entireheader, and all ten fields can be immediately passed into an adder tree.
A typical RISC PE will not have enough paralleladders or enough memory ports to start processing 160 bits of dataright away. The C pseudocode shown in Fig.13.4(b) below specifies the checksum calculation sequentially.
If the PE is a VLIW-type machine, it is up to thecompiler to try to reverse-engineer the data-level concurrency. A PEwith dynamic data-level concurrency (e.g., a superscalar machine) is aneven more expensive solution: it has extra hardware to calculatedynamically what we know a priori.
|Figure13.4. Implementing data-level concurrency.|
Click's push and pull connections are an abstraction for acontrolfollows- data style of process-level concurrency. In pushcommunications, the upstream element performs computation on a packetheader and then passes the flow of control to the downstream element.
This is different from a function call because theupstream element does not wait for a return value and it does notcontinue. In pull communications, the downstream element initiates thetransfer of control. Control is passed to the upstream element, and thedownstream element blocks until the upstream element provides a packetheader.
Cliff can build control structures at themultiprocessor level that mirror these semantics exactly. Each Clickelement in the application is implemented as a miniature datapath, asdescribed earlier. These datapaths are controlled by finite statemachines that communicate using a three-way handshake.
This is shown in Fig.13.5(a) below . Push and pull style communications events areimplemented using these interelement control signals. This represents atight coupling between the control logic of elements at themultiprocessor level.
A RISC PE has a more self-sufficient model ofcontrol, where the normal behavior is to forge ahead through asequential program. Aside from interrupts, there is no mechanism formultiple PEs to directly influence each other's flow of control.
A possible RISC implementation is shown in Fig. 13.5(b) below . Here, there aretwo RISC PEs that are connected to an on-chip network. The Clickcomputations are implemented as processes, and an operating systemprovides access to the communication hardware resources through amessage-passing interface.
The processes use blocking and nonblocking send andreceive functions to emulate Click's push and pull semantics insoftware. Since the PEs are loosely coupled in their control logic, itis more difficult and expensive to implement certain styles ofprocess-level concurrency.
There is quite a large design gap between RISC PEsand what is possible with application-specific datapaths. In each ofthe three categories of concurrency, this implementation gap has anoticeable effect on performance.
Tensilica reports that there is significantperformance to be gained by adding support for the application'sdatatype-level concurrency to a RISC PE's instruction set . To workaround the mismatch in data-level concurrency, Sourdis et al. outfit PEs with header field extraction engines to efficiently movedata into register files.
For process level concurrency, SMP-Click finds thatthe synchronization and communication between PEs is a majorperformance bottleneck . The mismatch between the application andthe architecture is so strong in this case that the authors resort to amapping strategy that minimizes inter-PE communication.
|Figure13.5. Implementing process-level concurrency.|
For these reasons, we maintain that a finer-grainedprogrammable component is a better choice than RISC PEs. We desire morepotential for application-specific concurrency, especiallyprocess-level concurrency. Whenever possible, we want to removegeneral-purpose features that are not used by our applications. Asub-RISC PE will be able to provide higher performance at a lower cost.
The next finer-grained PE candidate is a configurable datapath such asthe Tensilica Xtensa or the Stretch architecture. They permitarchitects to add custom instructions to a RISC core to implementapplication-specific arithmetic and logic computations. This providesopportunities to match datatype-level concurrency. However, the RISCcore itself is mostly fixed, and it is not possible to trim awaycertain unwanted general-purpose functions.
These PEs are also inadequate in terms ofprocess-level and data-level concurrency. Traditional sequentialcontrol logic consisting of a program counter and an instruction memoryis assumed. The number of register files, the number of register fileread ports, and the number of memory ports are only customizable to alimited extent.
The main difficulty with synthesis approaches such as Cliff is thatthey do not create a programmable system. These approaches overshootthe goal in terms of customization, sacrificing reusability.
A second problem is that synthesis provides limitedopportunities for design space exploration. To go from a high-levelinput language to an RTL implementation, the tool has to make a largenumber of design decisions, most of which are not visible orcontrollable by the architect. If the tool's design cannot meetperformance goals, architects are left with undesirable choices, suchas editing the generated RTL by hand.
Architecture Description Languages
Fortunately, there is a good basic block candidate that givesarchitects more programmability than synthesis approaches and providesmore opportunities for customization than RISC PEs. These are theprocessing elements produced by architecture description language (ADL)design methodologies. In a nutshell, an ADL PE is a datapath whosecapabilities can be exported as an instruction set.
Like the customizable datapaths discussedpreviously, ADLs permit architects to build custom instructions forapplication-specific arithmetic and logic operations. However, thecustomizability does not end at the level of making extensions to acore architecture. Architects can design the entire architecture andcan therefore decide what features to leave out as well as what toinclude.
One can design a simple architecture with a fewinstructionsor a complex architecture with hundreds of instructions. Theinstructions can be tuned for a particular application, or for a domainof applications, or they can be general purpose. Traditionalarchitectural styles, such as five-stage pipelines with a singleregister file and memory port, can be easily broken in favor of schemesthat support application-specific data-level concurrency. This isattractive because it leads to PEs that are “simpler” and “morecomplex” than RISC PEs. ADLs are also a clear winner in terms ofimproving designer productivity.
To design a programmable element using an RTLlanguage, architects must specify both the datapath and the controllogic manually. The hardware implementation must be kept consistentwith the description of the instruction set and the softwaredevelopment tools. This covers instruction encodings, semantics,integration with an assembler, a simulator, and many other things.Unfortunately, RTL languages do not have an abstraction for describingthese details. In practice, instruction sets are usuallyEnglish-language specifications. The problem of maintaining consistencyis a verification nightmare. This impedes the ability to explorearchitectural options.
ADLs solve this problem by using a three-partapproach. First, they provide formalisms for specifying instructionsets. This counters the lack of precision and analysis difficulties ofEnglish-language specifications. Second, they enforce particulararchitectural design patterns.
Architects can no longer freely makearbitrary collections of finite state machines. Instead, design is atthe level of components like register files, multiplexors, andfunctional units. The focus is on designing the instruction-levelbehavior of the PE. Third, there is a mechanism for ensuring theconsistency between the hardware structure and the instruction set.ADLs can be roughly grouped into two categories based on how theyattack this last problem.
Next, in Part 2: Generatingthe architecture from the instruction set.
Usedwith the permission of the publisher, Newnes/Elsevier , this series offour articles is based on copyrighted material from “Sub-RiscProcessors,” by Andrew Mihal, Scott Weber and Kurt Keutzer inCustomizableEmbedded Processors, edited by Paolo Ienne and RainerLerupers. The book can be purchased on line.
Paolo Ienne is aprofessor at the Ecole Polytechnique Federale de Luasanne (EPFL) andRainer Leupers is professor for software for systems in silicont atRWTH Aachen University. Kurt Keutzer is professor of electricalengineering and computer science at the University of California,Berkeley, where Andrew Mihal was a Ph.D. candidate and Scott Weberreceived a PhD., related to the design and simulation of sub-RISCprogrammable processing elements.
[1 ]N. Shah. Understanding Network Processors. Master's thesis,University of California, Berkeley, California, 2001.
 E. Kohler, R. Morris, B. Chen, J.Jannotti, and M.F. Kaashoek. TheClick modular router. ACM Transactions on Computer Systems (TOCS),18(3):263-297, Aug. 2000.
 B. Chen and R. Morris. Flexiblecontrol of parallelism in a multiprocessor PC router. InProceedings of the 2002 USENIX Annual Technical Conference, 2002, pp.333-346.
 N. Shah, W. Plishker, K. Ravindran, and K.Keutzer. NP-Click:A productive software development approach for network processors.IEEE Micro, 24(5):45-54, Sep.-Oct. 2004.
 C. Sauer, M. Gries, and S. Sonntag. Modulardomain-specific implementation and exploration framework for embeddedsoftware platforms. In Proceedings of the Design AutomationConference, 2005, pp. 254-259.
 P. Paulin, C. Pilkington, E. Bensoudane, M.Langevin, and D. Lyonnard. Application of amulti-processor SoC platform to high-speed packet forwarding. InProceedings of the Design, Automation and Test in Europe Conference andExhibition (Designer Forum), 2004, vol. 3, pp. 58-63.
 K. Ravindran, N. Satish, Y Jin, and K. Keutzer. AnFPGA-based soft multiprocessor for IPv4 packet forwarding. InProceedings of the 15th International Conference on Field ProgrammableLogic and Applications, 2005, pp. 487-492.
 C. Kulkarni, G. Brebner, and G. Schelle. Mappinga domain specific language to a platform FPGA. In Proceedings ofthe 41st Design Automation Conference, 2004, pp. 924-927.
 Tensilica, Inc. TensilicaXtensa LX processor tops EEMBC networking 2.0 benchmarks, May 2005.
 1. Sourdis, D. Pnevmatikatos, and K. Vlachos. Anefficient and low-cost input/output subsystem for network processors.In Proceedings of the FirstWorkshop on Applic