Early performance analysis and architecture exploration ensure that you select the right FPGA platform and achieve optimal partitioning of the application onto the fabric and software. This early exploration is referred to as rapid virtual prototyping.
A virtual prototype could simulate the FPGA and board using models that are developed quickly from pre-built, parameterized modeling libraries in a graphical environment. These models generally do not require implementation-level information such as the application software, pin-level connectivity, and detailed signal data.
Virtual prototypes can be used during product definition and specification for platform selection, bottleneck identification, hardware-software partitioning, and functional correctness.
FPGAs can be simulated at varying levels of detail at the IC and board levels. The level of detail is determined by the exploration required, the stage in the design cycle, and the amount of RTL and executable code available. At the specification phase, only a rough architecture specification is available.
The virtual prototype must be flexible enough to be constructed from this limited information while still maintaining a high degree of accuracy. This article presents an approach that enables designers to describe the FPGA architecture using preconfigured components, without resorting to extensive programming. The FPGA used in the examples is the Xilinx Virtex platform.
Constructing a virtual system model
To construct a virtual system model, the modeling environment would require a model of the FPGA platform and a minimum set of associated IP and cores. For the Xilinx Virtex family, the elements include the PowerPC, MicroBlaze, and PicoBlaze processors; CoreConnect; DMA; interrupt controllers; DDR; and block RAM.
In addition, there must be modeling components that describe application behavior, traffic, user activity, and board-level devices. The models must be preset to generate statistics for use in exploration decisions, including latency, utilization, throughput, hit ratio, state activity, context switching, power consumption, and processor stalls.
The key considerations in selecting an early performance exploration solution are the model construction time and the availability of statistics generators for rapid trade-offs.
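The role of such statistics generators can be sketched with a toy collector. The class, metric names, and numbers below are invented for illustration and are not taken from any particular tool's API:

```python
# Hypothetical statistics collector illustrating the kind of metrics
# (latency, utilization, throughput) a prototyping tool generates.
class StatsCollector:
    def __init__(self):
        self.latencies = []      # per-transaction end-to-end latency (s)
        self.busy_time = 0.0     # time the modeled resource was occupied (s)

    def record(self, start, end):
        self.latencies.append(end - start)
        self.busy_time += end - start

    def report(self, sim_time):
        n = len(self.latencies)
        return {
            "throughput": n / sim_time,               # transactions per second
            "avg_latency": sum(self.latencies) / n,   # seconds
            "utilization": self.busy_time / sim_time, # 0..1
        }

stats = StatsCollector()
for t in range(10):                  # ten back-to-back 1 ms transactions
    stats.record(t * 0.001, t * 0.001 + 0.001)
print(stats.report(sim_time=0.020))  # metrics over a 20 ms window
```

In a real environment these counters are pre-built into each component model, so the designer only reads the reports rather than writing the instrumentation.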
The advantages of early architecture exploration can be illustrated with the example of a streaming media processor implemented on a Virtex-4 device. The design would need to process 1080 lines to achieve the required performance and not drop more than three video frames in any given time period.
Early exploration will determine the performance of the proposed specification before any development has been scheduled. The analysis will identify bottlenecks that limit performance. In addition, the efficiency of the proposed hardware-software partitioning, the topology quality, and other architecture decisions can be validated.
The system model will contain details of the peripherals and the FPGA. The analysis must verify the correctness of the functional flow for the video frames, compare clock synchronization against the audio frames, measure the utilization of shared resources, and calculate the end-to-end latency of each sub-task or DMA operation.
Addressing these bottlenecks and partitioning issues before implementation can eliminate product quality issues. Architecture errors and system bottlenecks typically cannot be resolved through clever manufacturing and instead require expensive workarounds during implementation. Both have the potential to delay a project for months, possibly years.
If detected during the specification phase, bottlenecks can be resolved by exploring a wide variety of alternatives. Alternatives can be selected by optimizing over a variety of simulation sequences covering, for example, bus speed, bus width, larger FIFOs, and new DMA arbitration algorithms.
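This kind of alternative selection can be sketched as a brute-force sweep over a toy cost model. The delay formula, the 2 µs budget, and every parameter value below are hypothetical stand-ins for real simulation runs:

```python
from itertools import product

def simulate(bus_mhz, bus_bytes, fifo_depth):
    """Toy cost model standing in for a full simulation run: returns an
    estimated latency in microseconds for a fixed 1 KB workload."""
    transfer = 1024 / (bus_mhz * 1e6 * bus_bytes) * 1e6  # burst transfer time (us)
    stall = max(0.0, 64 - fifo_depth) * 0.05             # stalls shrink as FIFO grows
    return transfer + stall

# Sweep bus speed (MHz), bus width (bytes) and FIFO depth, keeping only
# configurations that meet a hypothetical 2 us latency budget.
candidates = product([100, 133, 200], [4, 8], [16, 32, 64])
feasible = [(simulate(*c), c) for c in candidates if simulate(*c) <= 2.0]
best = min(feasible)   # (latency, (bus_mhz, bus_bytes, fifo_depth))
print(best)
```

The same loop structure applies whether the inner call is a closed-form estimate, as here, or a full discrete-event simulation.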
For example, a synchronization mismatch between audio and video presentation could be addressed by assigning higher priority to audio and other less resource-intensive tasks, or by migrating tasks from the embedded processor to a dedicated off-chip processor.
The change can be traded off against power, performance, and cost. Resolving the bottleneck could pave the way for a better-quality product at a lower price. If certain FPGA resources are under-utilized, additional tasks can be loaded onto the same FPGA. This modification reduces the number of single points of failure, increases reliability, and, at the same time, reduces the bill of materials (BOM).
Rapid Visual Prototyping
Rapid visual prototyping can help you make better partitioning decisions. Evaluations with performance and architectural models can help eliminate clearly inferior choices, point out major problem areas, and evaluate hardware/software trade-offs.
|Figure 1. Translating a system concept into rapid visual prototyping|
Simulation is cheaper and faster than building hardware prototypes and can also help with software development, debugging, testing, documentation, and maintenance. Furthermore, early partnership with customers using visual prototypes improves feedback on design decisions, reducing time to market and increasing the likelihood of product success (Figure 1, above).
A design-level specification captures a new or incremental approach to improve system throughput, power, latency, utilization, and cost; these improvements are typically referred to as price/power/performance trade-offs.
At each step in the evolution of a design specification, well-intentioned modifications or improvements may significantly alter the system requirements. The time required to evaluate a design modification before or after the system design process has started can vary dramatically, and a visual prototype will reduce evaluation time.
To illustrate the use of the rapid visual prototype, let's consider a Layer 3 switch implemented using an FPGA for packet processing. The Layer 3 switch is a non-blocking switch, and the primary consideration is to maintain total throughput across the switch.
In product design, three factors are certain: specifications change, non-deterministic traffic creates performance uncertainty, and off-the-shelf FPGAs get faster. Products operate in environments where processing and resource consumption are a function of the incoming data and user operations. FPGA-based systems used for production must meet quality, reliability, and performance metrics to address customer requirements.
What is the optimal distribution of tasks into hardware acceleration and software on FPGAs and other board devices? How can you determine the best FPGA platform to meet your product requirements and attain the highest performance at the lowest cost?
One approach is to construct simulation models using SystemC for the FPGA platform and IP. An alternative is to use pre-built, parameterized components that are graphically instantiated to describe the hardware and software architectures. These components resemble the common IP and cores that are instantiated on the FPGA.
In a system simulation, the applications and user activity are described as flow charts in a behavior diagram. The behavior is mapped to the architecture model through a named association, and the combination is simulated with a large number of traffic sequences. Using pre-built components reduces the model construction burden and allows the designer to focus on analysis and interpretation of results.
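The named-association idea can be sketched minimally: behavior steps are charged against architecture resources purely by name, so remapping a task is a one-line change. All names, cycle counts, and clock rates below are invented for illustration:

```python
# Behavior flow: ordered steps with a nominal processing cost in cycles.
behavior = [("parse_header", 120), ("route_lookup", 400), ("forward", 80)]

# Named association: each behavior step is mapped to an architecture
# resource by name alone; repartitioning means editing this dict.
mapping = {"parse_header": "MicroBlaze",
           "route_lookup": "PowerPC",
           "forward": "MicroBlaze"}

clock_mhz = {"PowerPC": 300, "MicroBlaze": 100}   # hypothetical clock rates

def end_to_end_latency_us(behavior, mapping):
    """Charge each step's cycles to its mapped resource's clock."""
    total = 0.0
    for step, cycles in behavior:
        total += cycles / clock_mhz[mapping[step]]  # cycles / MHz -> us
    return total

print(end_to_end_latency_us(behavior, mapping))
```

Moving `route_lookup` to the MicroBlaze, for instance, is a single edit to `mapping`, after which the same simulation reports the new latency.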
System simulation helps the designer optimize product architectures by running simulations with application profiles to explore FPGA selection, hardware-versus-software decisions, peripheral devices versus performance, and partitioning of behavior onto target architectures.
|Figure 2. FPGA architecture platform model using FPGA components from Mirabilis Design|
You can use architecture exploration (Figure 2, above) to optimize every aspect of an FPGA specification, including:
1) Task distribution on soft processor cores, on-chip higher-performance RISC, and standalone processors
2) Sizing the processors, associated buses and caches, and board peripherals
3) Selecting functions requiring a co-processor or alternate accelerator
4) Determining optimal interface speeds and control signals
5) Exploring block RAM allocation schemes, cache and RAM speeds, off-chip buffering, and the impact of redundant operators
System-level analysis includes packet size versus latency, protocol overhead versus effective bandwidth, and resource utilization. In reference to the Layer 3 example, the decisions would include using:
1) On-chip PowerPC or external RISC processor for routing operations
2) Encryption algorithms using the DSP function blocks or fabric multipliers and adders
3) A dedicated soft processor core for traffic management or fabric
4) RISC processor for control or proxy rules processing
5) TCP offload using an external coprocessor or soft processor core
Should the packet packaging for network communication be performed on the on-chip processor, or would a dedicated micro-controller on the interface card be used? Can a set of parallel micro-controllers with external SDRAM support in-line spy-ware detection? What will the performance be when the packet size changes from 256 bytes to 1,512 bytes? How can you plan for future applications such as mobile IP?
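The packet-size question can be framed with a back-of-the-envelope model: a fixed per-packet processing cost plus wire time, so small packets pay proportionally more overhead. The link rate and overhead figures below are purely illustrative, not measurements:

```python
def effective_throughput_mbps(pkt_bytes, link_mbps=1000, per_pkt_overhead_us=0.5):
    """Toy model: fixed per-packet processing overhead plus transmit time.
    The 1 Gbit/s link and 0.5 us overhead are illustrative assumptions."""
    wire_us = pkt_bytes * 8 / link_mbps        # time on the wire (us)
    total_us = wire_us + per_pkt_overhead_us   # add per-packet processing cost
    return pkt_bytes * 8 / total_us            # delivered Mbit/s

# Sweep the packet sizes mentioned in the text.
for size in (256, 512, 1024, 1512):
    print(size, round(effective_throughput_mbps(size), 1))
```

Even this crude model shows why the 256-byte case, not the large-packet case, usually sets the processing budget for the switch.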
You can extend the exploration to consider the interfaces between the FPGA and board peripherals, such as SDRAM. On the Virtex FPGA, the PowerPC will be sharing the CoreConnect PLB bus with the MicroBlaze processor.
The effective bus throughput is a function of the number of data requests and the size of the local block RAM buffers. For example, you could enhance the MicroBlaze processor with a co-processor to do encryption at the bit level in the data path.
You could also use the CoreConnect PLB bus to connect the peripheral SDRAM to the PowerPC while a DMA on the OPB is used for memory access to the MicroBlaze processor.
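A crude way to reason about the request-count effect on the shared bus: charge each request a fixed arbitration cost, so batching traffic through larger block RAM buffers lowers utilization even at the same payload rate. The bus capacity and overhead figures are hypothetical, not Virtex datasheet values:

```python
def plb_utilization(requests_per_s, bytes_per_request, bus_bytes_per_s,
                    arb_overhead_bytes=16):
    """Fraction of bus bandwidth consumed by a request stream, charging a
    fixed arbitration cost (expressed in byte-times) per request."""
    cost_per_request = bytes_per_request + arb_overhead_bytes
    return requests_per_s * cost_per_request / bus_bytes_per_s

BUS = 400e6   # assumed shared-bus capacity in bytes/s (illustrative)

# Same 128 Mbyte/s payload either way; larger local block RAM buffers mean
# fewer, larger bus requests and therefore less arbitration overhead.
small_buf = plb_utilization(2_000_000, 64, BUS)    # 64-byte bursts
large_buf = plb_utilization(250_000, 512, BUS)     # 512-byte bursts
print(small_buf, large_buf)
```

The gap between the two figures is exactly the arbitration overhead recovered by buffering, which is the trade-off the buffer-sizing exploration quantifies.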
A good system simulation model can be reused for exploration of the software design: identifying high-resource-consumption threads, balancing load across multiple MicroBlaze processors, and splitting operations into smaller threads.
If a new software task or thread has data-dependent priorities, exploring the effect of priorities and task-arrival times on overall processing is a primary modeling question. If you change the priority of a critical task, will this be sufficient to improve throughput and reduce task latency?
In most cases this will be true, but there may be a relative-timing aspect to a critical task that can reduce latencies on lower-priority tasks such that both benefit from the new ordering. If peak processing is above 80% for a system processing element, the system may be vulnerable to last-minute task additions or to future growth of the system itself.
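The 80% guideline amounts to a simple headroom check over the per-element utilization statistics the simulation already produces. The element names and load values below are made up for illustration:

```python
def headroom_ok(peak_utilization, threshold=0.80):
    """Flag processing elements whose peak load leaves too little headroom
    for added tasks or future growth (threshold is the rule of thumb
    cited in the text)."""
    return peak_utilization <= threshold

# Example peak utilizations reported by a simulation run (hypothetical).
loads = {"PowerPC": 0.72, "MicroBlaze_0": 0.86}
at_risk = [pe for pe, u in loads.items() if not headroom_ok(u)]
print(at_risk)   # elements that need re-partitioning or a faster part
```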
|Figure 3. Flow chart describing the application flow diagram in VisualSim|
System modeling of the Layer 3 switch (Figure 3, above) starts by compiling the list of functions (independent of implementation), expected processing times, resource consumption, and system performance metrics. The next step is to capture a flow diagram using a graphical block diagram editor.
The flow diagrams are UML diagrams annotated with timing information. The functions in the flow are represented as delays; timed queues represent contention; and algorithms handle the data movement. The flow diagram comprises data processing, control, and any dependencies.
Data flow includes flow and traffic management, encryption, compression, routing, proxy rules, and TCP protocol handling. The control path contains the controller algorithm, branch decision trees, and weighted polling policies.
VisualSim builds scenarios to simulate the model and generate statistics. The scenarios are multiple concurrent data flows, such as connection establishment (slow path); in-line data transfer after setup of a secure channel (fast path); and protocol- and data-specific operation sequences based on data type identification.
You can use this model of the timed flow diagram for functional correctness and validation of the flows. Modeling tools such as VisualSim can provide random traffic sequences, application-specific traffic, or trace-driven stimulus for the model.
This timed flow diagram can be simulated to run a wide range of operation scenarios. The resulting output is used to select the FPGA platform and conduct initial hardware and software partitioning. In the example shown, the simulation flow diagram model defines the FPGA components and peripheral hardware using the FPGA Modeling Toolkit.
|Figure 4. Analysis output for the Layer 3 switch design|
The functions of the flow diagram are mapped to these architecture components. For each function, the simulation model automatically collects the end-to-end latency, throughput, and number of packets processed in a time period.
For the architecture, the model plots the average processing time, utilization, and effective throughput (Figure 4, above). These metrics are matched against the requirements. Exploration of the mapping and architecture is possible by varying the mapping links and replacing the selected FPGA with other FPGAs.
Early architecture exploration ensures a highly optimized product in terms of quality, reliability, performance, and cost. It provides direction for implementation plans, reduces the number of tests you need to conduct, and can shrink the development cycle by almost 30%.
You can add overhead to the models to capture growth requirements and ensure adequate performance. Graphical virtual prototypes increase shared understanding among design team members, validate customer requirements, serve well for demonstrations, and can be used to gain early design wins.
Deepak Shankar is chief executive officer at Mirabilis Design and has over 15 years of experience in the development, sales, and marketing of system-level design tools. Prior to Mirabilis Design, Mr. Shankar was VP of Business Development at both MemCall, a fabless semiconductor company, and SpinCircuit, a supply chain joint venture of HP, Cadence, and Flextronics. Prior to that, he spent many years in product marketing at Cadence Design Systems.