Building a configurable embedded processor - From Impulse C to FPGA


This “Product How-To” article focuses on how to use a certain product in an embedded system and is written by a company representative.

When most design groups think of the term “configurable processing,” their minds immediately go to configurable processors offered by IP vendors. But those designers may be overlooking an obvious and trusted alternative: processor-laden FPGAs. Today FPGA vendors offer high-performance, industry-standard MPU cores on board feature-rich FPGAs.

By using a mix of FPGA vendor and EDA vendor tools and a bit of ingenuity, embedded designers can extend the instruction set of the processors running in these FPGAs to add their own unique functions to their designs. And they can do so without having to go back to school to get a degree in hardware engineering.

Over the last couple of years, FPGA and EDA vendors have made great strides in creating software-to-FPGA tools that will allow embedded designers to effectively use FPGAs to increase the performance of their designs, while meeting timing budgets and cutting bill-of-materials costs.

Let's examine a configurable FPGA-based hardware accelerator methodology in which we'll use an auxiliary processing unit (APU) controller interface to integrate co-processing accelerators to vastly speed up a system's overall performance.

In particular, we'll employ this configurable FPGA-based hardware accelerator methodology to increase the performance of a machine vision system that formerly employed an embedded processor.

Traditionally, an application such as a machine vision system requires a substantial amount of computation, far more than what a single processor can handle. Because a single processor isn't a viable alternative, some design groups may consider using one or more higher-end DSP devices.

But increasingly designers are employing a hardware-accelerated approach using an FPGA, in which designers can implement part of the application as software running on an embedded processor (or multiple embedded processors) within the FPGA, while they implement performance-critical portions of the application, such as video image filtering, as hardware accelerators within that same FPGA.
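
To make the idea of a performance-critical portion concrete, here is a minimal sketch of the kind of image-filtering inner loop that is a natural candidate for a hardware accelerator. It is plain C rather than any vendor API, and the function name `box_blur_3x3`, the 8-bit grayscale frame layout and the border handling are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: 3x3 box blur over an 8-bit grayscale frame
 * stored row by row. Border pixels are copied unchanged. */
void box_blur_3x3(const uint8_t *src, uint8_t *dst, size_t width, size_t height)
{
    assert(src != NULL && dst != NULL);

    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            /* Pixels on the frame edge have no full 3x3 neighborhood. */
            if (y == 0 || x == 0 || y == height - 1 || x == width - 1) {
                dst[y * width + x] = src[y * width + x];
                continue;
            }

            /* Nine reads and nine additions per output pixel: this is
             * the computation a hardware accelerator would parallelize. */
            unsigned sum = 0;
            for (size_t dy = 0; dy < 3; dy++) {
                for (size_t dx = 0; dx < 3; dx++) {
                    sum += src[(y - 1 + dy) * width + (x - 1 + dx)];
                }
            }
            dst[y * width + x] = (uint8_t)(sum / 9);
        }
    }
}
```

In a partitioned design, the control code around such a filter would typically stay in software on the embedded processor, while a loop nest like this one moves into FPGA logic.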

Video processing is a major driver of advances in embedded systems and tools, and has become one of the largest areas of growth for embedded computing. Computer vision and security systems demand increasingly high levels of bandwidth to support ever-higher resolutions, faster frame rates and more complex image analysis and conversion.

Near-real-time format conversions have become critical in some applications, as have specialized algorithms such as object recognition and motion estimation.

Today's most advanced video applications include complex, pipelined algorithms that must process data at a high rate. Applications include machine vision, unmanned aerial vehicles (UAVs), medical imaging, and automotive safety.

The only way to achieve the needed performance for these types of algorithms is to use an accelerated computing strategy. Solutions for such applications might include the use of multiple high-end DSP devices, GPUs, or custom ASIC hardware.

Migrating from Discrete CPUs to FPGAs
Why migrate embedded video applications to FPGAs from discrete processors? The two primary reasons are integration and acceleration. Today's FPGAs have high capacities, which allow design teams to move multiple discrete components (the embedded processor and its various peripherals) into a single programmable device.

There are clear cost savings in integration, and also advantages related to flexibility and protection from future device obsolescence. FPGAs also offer acceleration for applications requiring a significant amount of computation, as is typical in image processing, DSP and communications.

FPGAs support acceleration by providing configurable hardware and flexible, on-chip memories. Designers can access these device resources through libraries, through hardware-level programming, or via software-to-hardware compilation.

Challenges with software-to-hardware conversion
But before embedded systems designers jump in and start using FPGAs as software platforms, they should be aware that there are challenges. Historically, to use FPGAs as software platforms, designers were required to write low-level hardware descriptions in the form of VHDL or Verilog, which are languages that are not generally part of a software programmer's expertise.

Designers also needed to figure out how and when to partition complex applications between hardware and software, and how to structure an application to take maximum advantage of hardware parallelism.

Today, FPGA and EDA vendors have come a long way in providing C compilation and optimization tools for FPGAs and are providing a new level of programming abstraction for FPGAs. With ever higher FPGA gate densities and the proliferation of FPGA embedded processors, there is strong demand for even higher levels of abstraction.

And for applications that involve embedded processors, C-to-hardware tools such as Impulse C from Impulse Accelerated Technologies Inc., shown in Figure 1 below, can abstract away many of the details of hardware-to-software communication, allowing the software programmer to focus on application partitioning without having to worry about the low-level details of the hardware.

Figure 1. Impulse C - FPGA Tools Overview

This allows software application developers to more quickly experiment with alternative software/hardware implementations. Although such tools can dramatically improve a programmer's ability to create FPGA-based applications, for the highest performance a programmer still needs to understand certain aspects of the underlying hardware.

In particular, the programmer needs to understand how partitioning decisions and C coding styles will impact performance, size and power usage. For example, the acceleration of critical computations and inner code loops must be balanced by the expense of moving data between hardware and software.
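
The balance between faster computation and the cost of moving data can be estimated with a simple first-order model. The sketch below is an illustration only: the function name `offload_speedup`, its parameters and the assumption that transfer and compute time do not overlap are not taken from any tool described here.

```c
#include <assert.h>

/* First-order estimate of the net speedup from offloading a kernel.
 * Assumption: data transfer and hardware compute do not overlap, so
 * total accelerated time = hardware compute time + transfer time. */
double offload_speedup(double sw_seconds, double hw_seconds,
                       double bytes_moved, double bus_bytes_per_second)
{
    assert(sw_seconds > 0.0 && hw_seconds >= 0.0);
    assert(bytes_moved >= 0.0 && bus_bytes_per_second > 0.0);

    double transfer_seconds = bytes_moved / bus_bytes_per_second;
    return sw_seconds / (hw_seconds + transfer_seconds);
}
```

A result below 1.0 means the data movement costs more than the accelerator saves, which suggests moving the partition boundary so that less data crosses between hardware and software.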

Fortunately, modern tools for FPGA compilation provide various types of analysis tools that can help a software programmer more clearly understand and respond to these issues.

Practically speaking, the initial results of software-to-hardware compilation from C-language descriptions will not equal the performance of hand-coded VHDL, but the turnaround time to get those first results working may be an order of magnitude better.

Still, by using software-to-hardware compilation from C, programmers can improve performance iteratively, through analysis of how the compiler maps the application to hardware and through experimentation with C-language coding styles.

Figure 2. Impulse Tools Flow showing algorithm data flow analysis and optimized partitioning.

Graphical tools can also help programmers, as they provide initial estimates of algorithm throughput such as loop latencies and pipeline effective rates. From here, the application developer may interactively change optimization options and/or iteratively modify and recompile their C code to obtain higher performance.

Such design iterations may take the form of loop optimizations, to increase pipelining efficiency for example, or the programmers can use pre-optimized FPGA library functions, such as higher-level math operations provided by the FPGA vendor. Figure 2 above shows the Impulse tool flow, highlighting data flow analysis and algorithm partitioning capability.
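
As an illustration of a loop optimization aimed at pipelining, the sketch below shows one common restructuring: keeping recent samples in local variables so that each iteration performs a single memory read instead of three. It is written in standard C with hypothetical function names. A C-to-hardware tool would normally also need a tool-specific pipelining directive on the loop, which is omitted here, and the actual pipeline rate achieved depends on the compiler and target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive 3-sample moving sum: three reads of in[] per output.
 * With a single-ported memory, this limits how often a new
 * iteration can start in a hardware pipeline. */
void moving_sum3_naive(const int32_t *in, int32_t *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 2; i < n; i++) {
        out[i - 2] = in[i - 2] + in[i - 1] + in[i];
    }
}

/* Restructured version: the two previous samples live in locals
 * (registers in hardware), so each iteration reads in[] only once. */
void moving_sum3_windowed(const int32_t *in, int32_t *out, size_t n)
{
    assert(in != NULL && out != NULL);
    if (n < 3) {
        return;
    }
    int32_t s0 = in[0];
    int32_t s1 = in[1];
    for (size_t i = 2; i < n; i++) {
        int32_t s2 = in[i];
        out[i - 2] = s0 + s1 + s2;
        s0 = s1; /* slide the window by one sample */
        s1 = s2;
    }
}
```

Both functions produce the same output, so the restructured version can be checked against the naive one in software before recompiling to hardware.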

The integration factor
Integration is a key factor in driving the performance and simplification of configurable embedded systems. For an FPGA-based configurable embedded processing solution, it's crucial for FPGA vendors to minimize the amount of FPGA logic users need to build a high-performance processing system, yet vendors must still support a wide variety of topologies.

A good FPGA solution will maintain the flexibility and the advantages of a configurable implementation but will also have the added benefit of a hardened, integrated interconnect. The result will be an embedded block that allows you to develop a wider range of high-performance processing architectures in a shorter period of time.

Next to the change from the PowerPC 405 to the integrated PowerPC 440 core, a major advance in the Virtex-5 FXT is the integration of the processor interconnect: a 5×2 crossbar switch with integrated DMA, a dedicated memory interface and bus interfaces, enabling high throughput and low latency to memory and I/O. A detailed block diagram of the integrated processor block features is shown in Figure 3 below.

Figure 3. Integrated processor block, with crossbar switch, integrated DMA, and dedicated memory and bus interfaces.

Integration provides significant savings in terms of FPGA resources, ease of implementing systems, and power reduction. In addition, the three largest Virtex-5 FXT devices, the FX100T, FX130T, and FX200T, integrate dual processor blocks, significantly expanding the options for configurable processing by providing multiple access ports to extend the processor ISA.

Figure 4. Dual Processing Platform using the Virtex-5 FX130T FPGA

Supporting the development of configurable dual processing systems, the ML510 board pictured in Figure 4 above allows two separate processing systems with separate DDR2, Ethernet, and peripheral systems to be designed on a single ATX form factor platform (Figure 5 below).

Figure 5. The configurable co-processing is showcased using the dual processing platform, the ML510, with complete dual integrated processing blocks as shown in Figure 4

Documentation, along with full schematics and Gerber plots, is available for the ML510 from the Xilinx embedded website. Next let's look at Virtex-5 FXT dual processing and how this platform provides configurable processing to accelerate overall system-level performance.

Accelerating system performance
While many systems can benefit from embedding and integrating a crossbar, some configurable systems benefit additionally from adding processor acceleration logic directly to the processor's internal pipeline and controlling this logic via special op-codes and an auxiliary interface.

This isn't as far-fetched as it might sound. One familiar application, and a classic example of co-processing, is the floating-point unit: a tightly coupled processing unit that accelerates floating-point computations, which are notoriously slow when run entirely in software.

Figure 6 below depicts an implementation using the double/single-precision FPU available through EDK interfaced directly to the APU Controller. When using the Impulse Co-Developer suite, the co-processing engines are designed with the APU interface already completed. You import your pre-defined core into the Embedded Development Kit supplied as part of the Xilinx Platform Studio kit.

Figure 6. Embedded System implemented with EDK BSB highlighting PowerPC 440 Block features.

It's not much of a leap to imagine other types of computations that might benefit from this type of hardware acceleration. Real-time video image processing, ultrasound beam forming and security applications can all benefit from hardware acceleration.

It bears repeating that while software is typically written in a high-level language like C or C++, and hardware is typically described in a synthesis language like VHDL or Verilog, there are tools available that allow you to write code in C or C++. If the execution of an algorithm function on the processor needs speeding up, the tool can generate the hardware accelerator automatically and connect the hardware-accelerated function to an auxiliary processor unit (APU) interface.

The APU provides the means to directly interface to the pipeline of the PowerPC 440 processor for low latency with a 128-bit load/store data interface. By attaching hardware accelerator engines to the APU, you can increase the overall performance of your system.
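
To get a feel for what a 128-bit load/store interface means in practice, the short sketch below counts how many 128-bit transfers are needed to move a buffer of a given size. The function name `transfers_for_buffer` and the example frame size are assumptions for illustration; real transfer counts also depend on alignment and on how the accelerator consumes the data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define APU_BUS_BYTES 16u /* 128 bits per load/store, as stated in the text */

/* Number of 128-bit transfers needed to move buffer_bytes bytes,
 * rounding up so a partial final word still counts as one transfer. */
size_t transfers_for_buffer(size_t buffer_bytes)
{
    assert(buffer_bytes <= SIZE_MAX - (APU_BUS_BYTES - 1u));
    return (buffer_bytes + APU_BUS_BYTES - 1u) / APU_BUS_BYTES;
}
```

For example, a hypothetical 640×480 frame at one byte per pixel is 307,200 bytes, which works out to 19,200 transfers of 128 bits each.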

By offloading CPU-intensive operations such as video and 3D data processing and floating-point math, embedded programmers can distribute dedicated functions, leveraging the parallelism and resources in the FPGA.

Figure 7. Dual system diagram for configurable processing with specific acceleration algorithms as defined by User Defined Instructions.

By creating custom co-processors in the FPGA logic, programmers can optimize hardware/software partitioning with the PowerPC 440 block Auxiliary Processor Unit (APU) controller. With up to 16 user-defined op-codes that programmers can easily extend in the FPGA fabric logic, there is a broad range of applications programmers can address.
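
The idea of a fixed set of user-defined op-codes, each bound to its own accelerator, can be modeled in software as a small dispatch table. The sketch below is only a conceptual model in plain C: the names `udi_table`, `udi_register` and `udi_execute` are hypothetical and do not correspond to the APU controller's programming interface or to any Xilinx or Impulse API. Only the limit of 16 slots comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UDI_SLOTS 16u /* the text cites up to 16 user-defined op-codes */

/* Hypothetical model: each op-code maps to one two-operand function. */
typedef uint32_t (*udi_fn)(uint32_t a, uint32_t b);

typedef struct {
    udi_fn slot[UDI_SLOTS];
} udi_table;

/* Example accelerator function: absolute difference of two values. */
uint32_t udi_abs_diff(uint32_t a, uint32_t b)
{
    return a > b ? a - b : b - a;
}

/* Bind fn to an op-code. Returns 0 on success, -1 if the op-code is
 * out of range or already taken. */
int udi_register(udi_table *table, unsigned opcode, udi_fn fn)
{
    assert(table != NULL && fn != NULL);
    if (opcode >= UDI_SLOTS || table->slot[opcode] != NULL) {
        return -1;
    }
    table->slot[opcode] = fn;
    return 0;
}

/* Run the function bound to an op-code. Returns 0 on success, -1 if
 * no function is bound to that op-code. */
int udi_execute(const udi_table *table, unsigned opcode,
                uint32_t a, uint32_t b, uint32_t *result)
{
    assert(table != NULL && result != NULL);
    if (opcode >= UDI_SLOTS || table->slot[opcode] == NULL) {
        return -1;
    }
    *result = table->slot[opcode](a, b);
    return 0;
}
```

In hardware, the lookup is performed by the processor's instruction decode and the bound function is a block of FPGA logic, but the model captures the design constraint: the op-code space is small, so each slot should be reserved for a function that dominates execution time.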

One specific example, for MPEG video capture, is shown in Figure 7 above and Figure 8 below. Impulse has leveraged the configurable nature of the Virtex-5 FXT to accelerate specific imaging algorithms used for MPEG decode and display.

Identifying these algorithms and accelerating them with User Defined Instructions, implemented as configurable co-processing engines using the Impulse Co-Developer tool suite, yielded an overall performance increase of 8X.

Figure 8. Expanded view of acceleration details showcasing configurable acceleration engines, achieving over 8X improvement in performance.

Configurable embedded processing has come into its own. There is now a wide variety of available solutions that span a broad range of applications. This article has examined a specific implementation and highlighted the steps necessary to create acceleration engines utilizing advanced FPGA technology and the latest development tools to achieve significant performance improvement.

The analysis included a review of the main system processing features for the configurable solution, and the implementation technique addressing many of the associated embedded design challenges.

By presenting the latest innovations in the realm of reconfigurable processing, a highly integrated and implementation-efficient solution was demonstrated, enabling embedded designers to rapidly design and deploy a flexible and scalable embedded system.

David Pellerin is Founder and CEO of Impulse Accelerated Technologies and Dan Issacs is Director, APD Embedded Marketing at Xilinx, Inc.
