In Architecture Description Languages (ADLs) such as ISDL and nML, designers model an instruction set, and then a correct-by-construction synthesis tool generates a matching architecture and the appropriate software development tools. This approach has drawbacks similar to the synthesis approaches. The tool makes decisions about the structure of the architecture that are not under the control of the architect.
To remedy this, ADLs such as LISA and EXPRESSION were developed. Here, designers use a single unified description for the ISA, the datapath structure, and a mapping between the two. This gives architects more control over the details of the implementation. However, designers must specify an instruction set and an encoding by hand. Another downside is that the architect is expected to conform to a traditional architectural style.
These ADLs are similar to the customizable datapaths in that they assume a sequential control scheme with a program counter and instruction memory. To construct application-specific process-level concurrency schemes, we want more flexibility in this area.
Extracting the Instruction Set from the Architecture
The opposite approach is to have the designer build the architecture, and then automatically extract the instruction set and the software development tools. MIMOLA is an example of an ADL in this category. The theory is that for unusual architectures with complex forwarding paths, multiple memories, multiple register files, and specialized functional units, it can be more difficult to describe the instruction set than it is to just build a structural model of the architecture. This method is clearly advantageous for the sub-RISC PEs we wish to design.
However, there are still issues with application-specific process-level concurrency. MIMOLA expects a traditional control scheme in order to produce a compiler. Designers must manually identify key architectural features such as the program counter and the instruction memory. Thus the deployment methodology is only applicable to a subset of the architectures that can be designed.
TIPI Processing Elements
To build sub-RISC PEs with application-specific control logic, we must explore an area of the design space that is not adequately covered by existing ADLs. This space is shown in Figure 13.6 below. Many of the basic blocks we have considered, from the hardwired datapaths created by synthesis approaches to VLIW- or RISC-like customizable datapaths, can be called ASIPs. For some of the finest grained hardwired designs on the left in the figure, RTL languages offer sufficient designer productivity.
However, as we start to add programmability to the architectures, designers quickly become overwhelmed in the specification of details of the control logic for the architecture. Register transfer languages (RTLs) are no longer sufficient. ADLs make it easier to describe programmable machines, but their focus is on architectures that are amenable to traditional compilation techniques (e.g., sequential C code).
Figure 13.6. Spectrum of ASIP basic blocks.
When we consider application-specific process-level concurrency, and multiprocessor machines with tightly coupled control logic between PEs, it is clear that there is a gap in the design space. We have already identified that C is not a good language for modeling some of the concurrency in embedded applications. Therefore it is perfectly natural to consider architectures that are different from traditional machines that run C programs.
These are architectures that are too complex to design with RTL languages in terms of productivity, but are also not targeted by existing ADLs. This motivates the design of a new ADL. We call our approach TIPI. It builds upon the ADLs that extract the instruction set from the architecture. To this we add a mechanism for expressing the programmability of architectures that lack traditional control logic.
This is accomplished by using a horizontally microcoded, statically scheduled machine as the core abstraction. Designers build structural models of datapaths, but leave control signals (such as register file address inputs and multiplexor select signals) disconnected. These signals are automatically driven by a horizontal microcode unit that decodes an instruction stream that is provided by a source external to the PE. This is shown in Figure 13.7 below.
This abstraction applies to the whole range of architectures that can be designed. At the multiprocessor level, a TIPI PE is an element that can both receive control from and send control to other PEs. Designers build compositions where the instruction input for each PE is connected to the outputs of other PEs. The possibility of tightly coupling control at the multiprocessor level enables the construction of hardware support for application-specific process-level concurrency.
Figure 13.7. TIPI control abstraction.
Programming such a machine is a matter of arranging for the proper instruction streams to arrive at each PE. This is quite different from writing individual C programs for each PE. The concept of a multiprocessor machine where the PEs pass control messages among themselves forms the basis for a new application deployment methodology, which will be explored later in this article.
With these pieces put together we can build programmable basic blocks that meet all of the goals given at the beginning of this series:
Improved productivity: Instead of making compositions of state machines with RTL languages, architects work at the level of structural datapaths composed of register files, memories, multiplexors, and functional units. Complex control logic may be omitted. An automatic analysis process extracts the capabilities of the datapath in the form of an instruction set, preventing the need for consistency verification. A correct-by-construction code generation approach produces synthesizable RTL implementations and simulators from these high-level architectural models. These techniques are central to the MESCAL methodology. This disciplined approach simplifies the design space exploration problem.
Customizability: TIPI PEs can be customized to support all three levels of concurrency. Architects can add function units for application-specific arithmetic and logic computations. Beyond this, architects can create unusual pipelines with varying numbers of register files and memory ports for better data-level concurrency. Most importantly, TIPI PEs permit the construction of novel multiprocessor control schemes for process-level concurrency.
Varying degrees of reusability: Architects can explore a range of architectural alternatives, from datapaths that provide a large number of general-purpose features to datapaths that focus on application-specific performance.
Architectural simplicity: Deciding what to leave out of an architecture is just as important as deciding what to add. A minimal TIPI PE can have as few as two or three instructions. Architects only pay for complexity when the application demands complexity.
Exportability: The automatically extracted instruction set describes the possible cycle-by-cycle behaviors of the design. This exposes the PE's capabilities for data-level and datatype-level concurrency.
Designing TIPI Processing Elements
In this section, we cover the design of TIPI processing elements and the construction of multiprocessor architectures. This part of the system design flow is shown in Figure 13.8 below. The TIPI architecture design flow is an iterative methodology. Designers begin on the left side of the diagram by building structural models of PE datapaths. The capabilities of each datapath are characterized by running an automatic operation extraction algorithm.
At this point, architects can iterate over the design and improve the architecture until it implements the desired functions, as shown by the dotted line feedback path. TIPI also produces a cycle-accurate, bit-true simulator for a PE, so timing simulation can also be included in the design loop. Once one or more PE architectures have been designed, architects can build programmable platforms by making multiprocessor compositions of PEs.
A transaction-level simulator can be automatically generated from a multiprocessor model. TIPI can also produce a synthesizable Verilog model that can be passed to third-party FPGA or ASIC design tools, or simulated using a third-party Verilog simulator. Later, we will extend this design flow to include application deployment. The following sections describe each of the architectural design steps in detail.
Figure 13.8. TIPI design flow.
Building Datapath Models
TIPI PEs are constructed by assembling atoms from a library. This library contains common elements such as register files, multiplexors, memories, pipeline registers, and arithmetic and logical functional units. Most TIPI atoms are type polymorphic in terms of bit width, which simplifies the design process. Atoms are written in a feed-forward dataflow language based on Verilog.
This allows architects to specify arbitrary datatype-level arithmetic and logic computations. Figure 13.9 below is an example of a simple architecture. This datapath has two register files. One is connected to a decrement functional unit, and the other is connected to an equality comparison unit and a checksum updating unit.
A TIPI datapath is a nondeterministic description of computations that are atomic with respect to architectural state and outputs. One example of this nondeterminism is that many of the ports in the design are left unconnected. The read and write address ports on both register files, one input to the comparison functional unit, and the select signal to the multiplexor are all unspecified in this design.
The TIPI abstraction states that these signals are implicitly connected to a horizontal microcode control unit. Their values will be controlled by statically scheduled software on a cycle-by-cycle basis. The register file read and write addresses and multiplexor select signals become part of the instruction word. Unspecified data inputs, such as the input to the comparison functional unit, become immediate fields in the instruction word.
Also missing are traditional control logic elements such as a program counter and instruction memory. Instead, this design receives instruction words from an external source as in Figure 13.7. Obviously there are restrictions on what combinations of control bits represent valid and useful settings for the datapath. The next step in the design process is to analyze the datapath and determine what these are.
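As a rough illustration of this abstraction, the instruction word for a datapath like the one in Figure 13.9 can be pictured as a struct with one field per unconnected control port, plus immediate fields for unspecified data inputs. The field names, widths, and the compare helper below are our own invention for illustration; TIPI's actual encoding is produced by the tool.

```cpp
#include <cstdint>

// Hypothetical horizontal microcode word for a datapath in the style of
// Figure 13.9. Every unconnected control port becomes a field, and the
// unspecified comparison input becomes an immediate. Illustrative only.
struct MicrocodeWord {
    uint8_t  rf0_raddr;   // register file 0 read address
    uint8_t  rf0_waddr;   // register file 0 write address
    uint8_t  rf1_raddr;   // register file 1 read address
    uint8_t  rf1_waddr;   // register file 1 write address
    uint8_t  mux_sel;     // multiplexor select signal
    uint16_t cmp_imm;     // immediate for the comparison unit's open input
    bool     rf0_wen;     // write enables gate architectural state updates
    bool     rf1_wen;
};

// One compare-immediate "cycle": the word is decoded combinationally.
// There is no program counter inside the PE; the instruction stream
// arrives from an external source.
uint16_t apply_compare_imm(const MicrocodeWord& w, const uint16_t rf1[16]) {
    return rf1[w.rf1_raddr] == w.cmp_imm ? 1 : 0;
}
```

Note that the word drives every control signal for one cycle at once; scheduling across cycles is the job of the software tools, not the PE.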
Figure 13.9. Example TIPI architecture model.
TIPI provides an operation extraction algorithm that analyzes a datapath design and finds the set of valid execution paths that can be configured statically by software. The resulting operations are the state-to-state behaviors of the datapath. A complete description of the operation extraction process is given in .
Fundamentally, operation extraction works by converting a structural datapath model into a system of constraints, and then finding satisfying solutions to the constraint problem. Each atom has firing rule constraints that specify the valid ways to use the component in a cycle.
An algorithm based on iterative Boolean satisfiability is used to find configurations where all of the atoms' rules are satisfied. To represent the programmability of the PE compactly, TIPI defines an operation to be a minimal solution to this constraint system. A minimal solution is a datapath configuration that is not simply a combination of smaller solutions.
Continuing with the example of Figure 13.9 above, Figure 13.10 below shows the set of extracted operations. There are three operations for this datapath: decrement, compare immediate, and checksum update. A nop operation also exists in which the datapath remains idle for a cycle, but this is not shown in the figure. The valid datapath configuration shown in Figure 13.10(d) is a solution, but it is not an operation because it is simply the union of the decrement and checksum update operations.
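The minimality criterion can be sketched in a few lines. In this toy model, each valid configuration found by the satisfiability step is represented as a bitmask over atom firing rules, and an extracted operation is any valid configuration that is not exactly the union of two smaller valid ones. The representation and the fixed solution list stand in for TIPI's actual constraint solver.

```cpp
#include <cstdint>
#include <vector>

// A configuration is a bitmask over atom firing rules (illustrative).
using Config = uint32_t;

// s is minimal if it cannot be written as the union of two valid
// solutions that are each a proper, nonempty part of s.
bool is_minimal(Config s, const std::vector<Config>& solutions) {
    for (Config a : solutions) {
        if (a == 0 || a == s || (a & ~s)) continue;  // not a proper part of s
        for (Config b : solutions) {
            if (b == 0 || b == s || (b & ~s)) continue;
            if ((a | b) == s) return false;          // s decomposes into a and b
        }
    }
    return true;
}

// "Operation extraction" over a precomputed list of valid solutions:
// keep only the minimal ones.
std::vector<Config> extract_operations(const std::vector<Config>& solutions) {
    std::vector<Config> ops;
    for (Config s : solutions)
        if (s != 0 && is_minimal(s, solutions)) ops.push_back(s);
    return ops;
}
```

With solutions {decrement, checksum, decrement+checksum, compare}, only the three single operations survive; the combined configuration is filtered out, mirroring Figure 13.10(d).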
Figure 13.10. Extracted operations.
TIPI captures these nonminimal solutions by creating a conflict table for the PE. This is given in Figure 13.11 below. The conflict table shows us that it is possible for software to issue the decrement and checksum update operations simultaneously in one cycle without violating any of the PE's constraints.
On the other hand, it is not possible to issue the compare and checksum update operations simultaneously. Together, the operation set and the conflict table are a compact way of describing the valid ways in which software can use the datapath.
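A conflict check in this spirit can be modeled by giving each extracted operation the set of datapath resources it claims; two operations may co-issue in a cycle only if their resource sets are disjoint. The resource assignments below are our guess at why the example PE behaves as Figure 13.11 describes (compare and checksum both need register file 1), not TIPI's actual tables.

```cpp
#include <cstdint>

// Illustrative resource sets for the Figure 13.9 datapath.
enum Resource : uint32_t {
    RF0_PORT = 1u << 0,   // register file 0 read/write port
    RF1_PORT = 1u << 1,   // register file 1 read/write port
    DEC_FU   = 1u << 2,   // decrement functional unit
    CMP_FU   = 1u << 3,   // equality comparison unit
    CSUM_FU  = 1u << 4,   // checksum updating unit
};

// Each operation claims the resources it uses in a cycle.
constexpr uint32_t OP_DECREMENT = RF0_PORT | DEC_FU;
constexpr uint32_t OP_COMPARE   = RF1_PORT | CMP_FU;
constexpr uint32_t OP_CHECKSUM  = RF1_PORT | CSUM_FU;

// Two operations are conflict-free (may be issued together in one
// cycle) exactly when they share no resource.
constexpr bool conflict_free(uint32_t a, uint32_t b) {
    return (a & b) == 0;
}
```

Under this model, decrement and checksum update co-issue, while compare and checksum update collide on register file 1, matching the text's description of the conflict table.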
Single PE Simulator Generation
After operation extraction, TIPI can produce a cycle-accurate, bit-true simulator for the PE. This is a C++ program that compiles and runs on the host machine. The input to the simulator is a list of data and instructions that are to be fed to the PE, along with timing information that says on which cycles the inputs appear. The simulator models the cycle-by-cycle functional behavior of the datapath. The GNU GMP library is used to model the arbitrary bit widths in the datapath.
Figure 13.11. Conflict table.
Performance results for our C++ simulator are given in . On average, the simulator is 10 times faster than an equivalent Verilog simulation using a commercial simulator. If the program trace to be simulated is known a priori, TIPI also provides the option to create a statically scheduled simulator. With this technique we can obtain a speedup of 100 times over the Verilog simulation. Thus, simulation is not a bottleneck in the iterative design approach even when large programs are used.
At the multiprocessor level, a TIPI PE is modeled as an opaque block as shown in Figure 13.7. It is an element that can send and receive data and control to and from other PEs. Communication between PEs in a multiprocessor is governed by a signal-with-data abstraction.
A communication event between PEs is an atomic message consisting of a control component and a data component. The control component of the message can be either a single operation or a temporal and spatial combination of operations.
The latter form is called a macro operation. This consists of an atomic sequence of instructions (a temporal combination), where each instruction is a conflict-free combination of operations to be executed in one cycle (a spatial combination). There are no jumps or conditionals within a macro operation, and it takes a finite number of cycles to complete.
For implementation efficiency, macro operations are stored in a PE's horizontal microcode control unit, and a control message simply provides a pointer to the desired microcode. A RAM is used so that the PE can be reprogrammed. A PE remains idle until it receives a message. When this event occurs, the control part of the message causes the PE to compute for one or more cycles. The data portion of the message is made available on the PE's input ports for the entire period that the computation occurs.
While it is executing, the PE may compute values and write them to its output ports. When the computation is complete, these output values are bundled together into new atomic signal-with-data messages and are transmitted to destination PEs. The PE returns to the idle state and computation continues at the destinations of the newly created messages. We call these messages transfers because they represent the transfer of control from one PE to another.
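The transfer lifecycle described above can be sketched as a small event-driven PE model: a transfer carries a pointer into the destination's microcode RAM plus a data payload, and an idle PE does nothing until one arrives. The struct layout, field names, and the stand-in computation are our own; they are not TIPI's actual interfaces.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// A transfer: atomic control-plus-data message between PEs.
struct Transfer {
    int dest_pe;                 // destination PE id
    int macro_op;                // pointer into the PE's microcode RAM
    std::vector<uint32_t> data;  // held on the input ports while executing
};

struct PE {
    std::deque<Transfer> inbox;  // the PE idles while this is empty

    // Consume one transfer, run its macro operation atomically (a fixed
    // number of cycles, no jumps or conditionals), and emit outgoing
    // transfers, moving control to the destination PEs.
    std::vector<Transfer> step() {
        if (inbox.empty()) return {};       // idle: no control has arrived
        Transfer t = inbox.front();
        inbox.pop_front();
        // Stand-in computation: forward incremented data to PE 1.
        std::vector<uint32_t> out;
        for (uint32_t v : t.data) out.push_back(v + 1);
        return { Transfer{1, t.macro_op, out} };
    }
};
```

In a full multiprocessor model, a scheduler would route each emitted transfer to `inbox` of the PE named by `dest_pe`.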
The mechanism by which transfers are communicated between PEs is outside the scope of the TIPI project. Architects can use best-practices network-on-chip techniques to build a suitable communication infrastructure. In this work, we use point-to-point connections for simplicity.
The signal-with-data abstraction is a building block that can be used to construct more complicated models of process-level concurrency. For example, one can easily create a pipeline of PEs to implement the control-follows-data style of control found in Click applications. In Figure 13.5(a) earlier in this series, Cliff uses handshaking signals to pass control between basic blocks. In a TIPI multiprocessor, we can achieve the same effect by sending push and pull transfers between PEs.
It is also possible to duplicate the behavior of a sequential von Neumann machine using transfers. At the multiprocessor level, one makes a feedback loop between a PE's outputs and inputs so that the PE may send transfers to itself. The PE computes its own next control word every cycle.
This is exactly how a RISC processor behaves. In every cycle, in addition to doing an ALU or memory access operation, the processor also updates its program counter. This value is passed through an instruction fetch and decode unit to obtain the control bits that are used in the next cycle. Figure 13.12 below shows a typical configuration of a TIPI PE in a multiprocessor context. This PE can send transfers to itself as well as to other PEs. An arbitrator is included to merge incoming transfers from multiple sources.
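The self-transfer feedback loop can be sketched as follows: each "instruction" does some datapath work and names its own successor, so the control word circulates around the loop exactly as a program counter would. The entry format and terminating convention below are invented for illustration.

```cpp
#include <cstdint>
#include <vector>

// The control-plus-data message a PE sends to itself around the
// feedback loop (illustrative).
struct SelfTransfer {
    int next_op;   // control part: which microcode entry runs next
    int32_t acc;   // data part carried around the loop
};

// A tiny "program" standing in for microcode RAM contents: each entry
// adds a constant and names its successor (-1 terminates).
struct Entry { int32_t addend; int next; };

int32_t run(const std::vector<Entry>& prog) {
    SelfTransfer t{0, 0};            // start at entry 0
    while (t.next_op >= 0) {         // the loop is the feedback path
        const Entry& e = prog[t.next_op];
        t.acc += e.addend;           // datapath work this "cycle"
        t.next_op = e.next;          // PE computes its own next control word
    }
    return t.acc;
}
```

Replacing the fixed `next` field with a computed value would give branches, which is precisely the program-counter-update behavior of a sequential machine.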
The signal-with-data abstraction does not specify priorities or any specific arbitration scheme. The only requirement is that arbitration be fair and that transfers are never dropped. Deadlock is possible in multiprocessor compositions with feedback loops. A queue may be provided to buffer transfers. This is consistent with a globally asynchronous, locally synchronous (GALS) multiprocessor abstraction.
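One scheme that satisfies these requirements is a queued round-robin arbiter: every source gets its own buffer, nothing is dropped, and the rotating priority pointer prevents starvation. The policy choice is ours; the abstraction only demands fairness and losslessness.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// A fair, lossless arbiter in the spirit of Figure 13.12: incoming
// transfers from several sources are queued and granted round-robin.
template <typename T>
class RoundRobinArb {
    std::vector<std::deque<T>> queues_;  // one buffer per source; no drops
    size_t next_ = 0;                    // rotating priority pointer
public:
    explicit RoundRobinArb(size_t sources) : queues_(sources) {}

    void push(size_t src, const T& t) { queues_[src].push_back(t); }

    // Grant one transfer per call, scanning from the rotating pointer
    // so no source is starved. Returns false when all queues are empty.
    bool grant(T& out) {
        for (size_t i = 0; i < queues_.size(); ++i) {
            size_t s = (next_ + i) % queues_.size();
            if (!queues_[s].empty()) {
                out = queues_[s].front();
                queues_[s].pop_front();
                next_ = (s + 1) % queues_.size();  // rotate past the winner
                return true;
            }
        }
        return false;
    }
};
```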
Figure 13.12. Example of a TIPI PE in a multiprocessor context.
Multiprocessor Simulation and RTL Code Generation
TIPI produces transaction-level simulators for multiprocessor models. This type of simulator captures the bit-true functionality of PEs and the creation and consumption of transfers. The timing details of sending transfers between PEs are not modeled. As in the single-PE case, these simulators are C++ programs that compile and run on the host workstation.
TIPI can also produce a synthesizable Verilog implementation of the multiprocessor system. If a bit-true, cycle-accurate simulation of the entire system is desired, the Verilog can be simulated using third-party tools. Otherwise, the RTL code can be sent directly to a standard ASIC or FPGA flow.
Next in Part 2: Deploying applications with Cairn
To read Part 1, go to Concurrent architectures, concurrent applications.
Used with the permission of the publisher, Newnes/Elsevier, this series of four articles is based on copyrighted material from "Sub-RISC Processors," by Andrew Mihal, Scott Weber and Kurt Keutzer in Customizable Embedded Processors, edited by Paolo Ienne and Rainer Leupers. The book can be purchased online.
Paolo Ienne is a professor at the Ecole Polytechnique Federale de Lausanne (EPFL) and Rainer Leupers is professor for software for systems in silicon at RWTH Aachen University. Kurt Keutzer is professor of electrical engineering and computer science at the University of California, Berkeley, where Andrew Mihal was a Ph.D. candidate and Scott Weber received a Ph.D. related to the design and simulation of sub-RISC programmable processing elements.
[1] N. Shah. Understanding Network Processors. Master's thesis, University of California, Berkeley, California, 2001.
[2] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek. The Click modular router. ACM Transactions on Computer Systems (TOCS), 18(3):263-297, Aug. 2000.
[3] B. Chen and R. Morris. Flexible control of parallelism in a multiprocessor PC router. In Proceedings of the 2002 USENIX Annual Technical Conference, 2002, pp. 333-346.
[4] N. Shah, W. Plishker, K. Ravindran, and K. Keutzer. NP-Click: A productive software development approach for network processors. IEEE Micro, 24(5):45-54, Sep.-Oct. 2004.
[5] C. Sauer, M. Gries, and S. Sonntag. Modular domain-specific implementation and exploration framework for embedded software platforms. In Proceedings of the Design Automation Conference, 2005, pp. 254-259.
[6] P. Paulin, C. Pilkington, E. Bensoudane, M. Langevin, and D. Lyonnard. Application of a multi-processor SoC platform to high-speed packet forwarding. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (Designer Forum), 2004, vol. 3, pp. 58-63.
[7] K. Ravindran, N. Satish, Y. Jin, and K. Keutzer. An FPGA-based soft multiprocessor for IPv4 packet forwarding. In Proceedings of the 15th International Conference on Field Programmable Logic and Applications, 2005, pp. 487-492.
[8] C. Kulkarni, G. Brebner, and G. Schelle. Mapping a domain specific language to a platform FPGA. In Proceedings of the 41st Design Automation Conference, 2004, pp. 924-927.
[9] Tensilica, Inc. Tensilica Xtensa LX processor tops EEMBC networking 2.0 benchmarks, May 2005.
[10] I. Sourdis, D. Pnevmatikatos, and K. Vlachos. An efficient and low-cost input/output subsystem for network processors. In Proceedings of the First Workshop on Application Specific Processors, 2002, pp. 56-64.
[11] G. Hadjiyiannis, S. Hanono, and S. Devadas. ISDL: An instruction set description language for retargetability. In Proceedings of the 34th Design Automation Conference, 1997, pp. 299-302.