Developing processor-compatible C-code for FPGA hardware acceleration

This article describes an iterative process for converting C code to run on FPGAs with or without processor cores, explains how to identify which code sections benefit most from hardware acceleration, and presents coding styles that keep the code compatible with the original processor.

FPGAs are becoming increasingly popular with software teams to accelerate critical portions of their code. In most cases these teams already have processing stacks and applications written in C that target embedded microprocessors or servers. For applications that require acceleration, a logical next step is to offload some portion of the code to an FPGA. A good way to do this is to migrate portions of the working microprocessor system to an FPGA while keeping the code base compatible with the original processor. This approach lowers risk and allows the software team to more easily experiment with alternate implementations, iterating toward an accelerated solution without creating a fundamentally different branch of the code.

This article describes how to identify which code sections can best benefit from hardware acceleration, use coding styles to retain commonality, and select hardware for both development and deployment.

Selecting code elements for hardware acceleration

FPGAs have a comparatively slow clock but can outperform CPUs by providing more opportunities for parallelism. CPUs are constrained by fixed instruction sets and fixed data paths, while FPGAs offer the flexibility to build customized processing structures that can solve a wide variety of embedded computing problems.

Efficient FPGA coding styles tend towards a “coarse grained parallel” approach to algorithm refactoring. This is coupled with automated or semi-automated compiler tools to generate finer-grained parallelism, for example by automatically scheduling parallel statements or by unrolling and pipelining inner code loops.

Microprocessors, in contrast, run at higher clock speeds and have instruction sets and other features that make programming with sequential programming models and languages more natural.

When evaluating which sections of a given application are most appropriate for FPGA acceleration, weigh both the I/O requirements and the computational demands of each candidate section. Moving data between an FPGA and a CPU can be enabled in many ways, depending on the architecture of the selected computing platform.

These architectural differences have a significant impact on which development methods are most appropriate for an application. For example, if the FPGA is connected to the host CPU via a standard bus interface such as PCI Express, there may be performance implications related to the size of individual data transfers – how many bytes are transferred in each I/O request – as well as to the latency of data movement.

C-to-FPGA programming is conceptually similar to programming for GPU accelerators, or programming for multi-core processors using threads or MPI. All of these involve using a set of functions – extensions to the C language – in support of parallel programming.

In the case of FPGAs, modern compiler tools are capable of generating highly parallel implementations from one or more C-language subroutines, which are implemented as hardware processes. If appropriate coding styles are used, these compilers are able to generate equivalent low-level hardware that operates with a high degree of cycle-by-cycle parallelism.

The nature of FPGAs makes it unlikely that a large legacy C application written using traditional C programming techniques can be compiled entirely to hardware with no changes (Figure 1 below).

Figure 1 In a traditional hardware/software development process, hardware design may represent a significant bottleneck. (Source: Practical FPGA Programming in C)

Many C applications are best-suited to processor architectures, but may be accelerated dramatically by identifying key subroutines and inner code loops and moving just those sections to hardware using data streaming or shared memory methods of I/O, and by doing some level of algorithm optimization to best utilize the computing resources found in an FPGA. High level compiler tools can assist in this application conversion by providing a parallel programming model and corresponding APIs that allow varying levels of parallelism to be expressed, extending the power and flexibility of the C language.

By using a C-to-FPGA compiler in an iterative manner, key subroutines can be partitioned and moved into dedicated hardware with relatively little effort. Software-based simulations should be performed along with generation of prototype hardware to simplify debugging and system-level optimization.

Although C compilers for FPGAs are limited in the features they can support for hardware generation, C-to-FPGA programming methods that exploit attached processors allow more complete use of the language. You can use any feature of C, for example, to develop a simulation test bench, or to describe processes that will run on FPGA-embedded CPU cores (Figure 2, below).

Figure 2 By introducing software-to-hardware compilation into the design process, it’s possible to get a working prototype faster, with more time available later for design refinement. (Source: Practical FPGA Programming in C)

For those processes that will be compiled directly to FPGA hardware, however, there are certain constraints placed on the C coding style. These constraints typically include limited support for pointers, restrictions on recursive function calls, and limits on the types of control-flow statements that may be used.

When considering C-to-FPGA, it's useful to think of the FPGA as a coprocessor or offload resource for an embedded or host processor. In this use case, an instruction-based CPU hosts traditional C programs with all the features of C or C++.

The FPGA is used to create hardware accelerators, using a somewhat more restricted subset of the C language and using the C-to-FPGA compiler. Data streams, shared memories, and signals (specified using function calls from the C-to-FPGA library) are used to move data between the processor and the FPGA hardware.

The C-to-FPGA compiler creates the interfaces between the processor and the hardware to implement data communication via streams, memories, and signals. This model of programming is similar to the methods used when programming for GPU accelerators.

Alternatively, a C-to-FPGA compiler can be used for module generation, an approach that is most practical when there is no CPU in the data path. Examples of module generation include the creation of DSP and video filters, using C-language functions and streaming I/O to describe a pipeline of connected hardware processes.
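As a rough illustration of module generation, the sketch below models a two-stage pipeline in plain, standard C. The stage names and the array buffer are hypothetical; in an actual C-to-FPGA flow, each stage would be a hardware process and a stream would replace the buffer.

```c
#include <stddef.h>

/* Plain-C model of two connected pipeline stages (names are
 * illustrative). A module-generating compiler could map each stage
 * to a hardware process, with streams replacing the array buffer. */

/* Stage 1: apply a gain of 2 to each sample. */
static void gain_stage(const int *in, int *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 2;
}

/* Stage 2: clamp each sample to the 8-bit range 0..255. */
static void clamp_stage(const int *in, int *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] > 255 ? 255 : (in[i] < 0 ? 0 : in[i]);
}

/* The pipeline: stage 1's output feeds stage 2's input, just as a
 * stream would connect the two generated hardware modules. */
void filter_pipeline(const int *in, int *out, size_t n) {
    int mid[256];          /* stands in for the inter-stage stream */
    gain_stage(in, mid, n);
    clamp_stage(mid, out, n);
}
```

In hardware, the two stages would run concurrently, each consuming and producing one sample per clock once the pipeline fills.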

Coding style
By using appropriate C coding styles and the compiler tools, it is possible to optimize for both performance and utilization. The first priority of the compiler is performance, as measured in clock cycles required to complete a given task.

A related, secondary priority is to reduce gate delays, which directly impact maximum clock rates. Because optimizing for speed can result in very large amounts of logic being generated, C-to-FPGA compilers allow optimizations such as loop unrolling and pipelining to be controlled via language-level pragmas or via compiler parameters.

Figure 3 C-to-FPGA involves using standard C entry methods and tools, plus analytics like Stage Delay, to construct efficient C-to-hardware representations of high-performance algorithms such as this Mandelbrot example.

In a typical scenario, a C-to-FPGA user is first concerned with getting a working prototype, one that may operate at reduced rates but is sufficient to verify correct functionality. Software emulation and hardware simulation can be used to analyze dataflow at a system level and to validate assumptions about parallel implementations.

Later optimizations may include using modified C coding styles, specifying certain optimizer controls, or re-partitioning the application to take better advantage of system-level and fine-grained parallelism. Iterative methods can be used to tune the application and its constituent algorithms to the target platform.

C programmers with base-level knowledge of FPGAs can achieve algorithm acceleration while concurrently reducing overall design size. If the performance of a given application is of higher importance, then extra time should be budgeted for this optimization phase, perhaps with the assistance of an experienced digital logic (hardware) designer.

If the smallest possible design size is a critical requirement, then the C-to-FPGA user should be prepared to replace certain portions of the generated hardware with hand-crafted HDL, using the generated HDL as an overall specification and functional benchmark.

Some specific coding tips include:

  • Avoid dynamic memory management (malloc(), free()) and other C library functions
  • Limit the use of pointers and structures; use array-based styles when practical
  • Keep function call depths to a minimum and avoid over-abstraction
  • Code things simply and directly
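A brief sketch of the array-based style these tips point toward. The moving-average filter below is a hypothetical example, written with a fixed-size buffer and no heap allocation or pointer arithmetic, so that a C-to-FPGA compiler can map the storage directly to registers or block RAM.

```c
#define TAPS 4   /* fixed, compile-time window size */

/* Hardware-friendly style: fixed-size arrays, no malloc()/free(),
 * simple and direct control flow. (Illustrative example only.) */
void moving_average(const int samples[], int n_samples, int averages[]) {
    int window[TAPS] = {0};   /* static buffer: maps to registers/BRAM */
    for (int i = 0; i < n_samples; i++) {
        window[i % TAPS] = samples[i];   /* TAPS is a power of two */
        int sum = 0;
        for (int t = 0; t < TAPS; t++)   /* fixed trip count: unrollable */
            sum += window[t];
        averages[i] = sum / TAPS;        /* power-of-two divide: a shift */
    }
}
```

Note that the fixed trip count of the inner loop makes it a natural candidate for full unrolling by the compiler.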

Global variables, pointers to structures, and other features of C

C-to-FPGA compilation allows specific modules (subroutines, or processes) of the C application to be ported to the FPGA. Those portions that are being moved to the FPGA as dedicated hardware are constrained in the level of C that can be written, as described above.

Those portions that will remain in software (in an embedded processor, for example) are not constrained, except by limitations of the cross-compiler being used. For hardware modules (subroutines that are being moved to the FPGA through hardware compilation), C-to-FPGA compilers generally support only limited use of structures.

Pointers are quite limited as well, and must be resolvable at compile time to array references. As for global variables, these can be accessed using shared memories (using the C-to-FPGA compiler blockread/blockwrite functions) or by declaring a global array and accessing this array in multiple C hardware processes.

Working with untimed C
Modern C-to-FPGA compilers accept untimed C. This means that the C code does not need to be decorated or extended to include information related to register boundaries, clock signals, and reset logic.

The C-to-FPGA compiler can automatically parallelize C code and insert registers to maintain proper operation of the parallelized code; there is rarely a need to express such parallelism at the level of individual statements or blocks of code. To do this, the compiler analyzes the C code, finds interdependencies, and collapses multiple C statements into single instruction stages representing a single clock cycle. This automated creation of parallel hardware can be controlled by the programmer (for size/speed tradeoffs) using compiler pragmas.
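As a small, hypothetical illustration of what that dependency analysis enables: the three products below have no interdependencies, so a scheduler could compute all of them in one clock-cycle stage, with the dependent accumulation following in a later stage.

```c
/* Illustrative only: p0, p1, and p2 do not depend on one another, so
 * a C-to-FPGA scheduler could compute all three in a single stage;
 * the final sum depends on all three and lands in a later stage. */
int weighted_sum(int a, int b, int c) {
    int p0 = a * 3;          /* independent: same stage */
    int p1 = b * 5;          /* independent: same stage */
    int p2 = c * 7;          /* independent: same stage */
    return p0 + p1 + p2;     /* dependent: subsequent stage */
}
```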

C-to-FPGA compilers have improved dramatically in recent years, but for most efficient results it is important to consider hardware-appropriate algorithmic approaches, for example to exploit alternative data path sizes or to optimize for the memory architecture available in the target FPGA.

Fixed- and floating-point math
FPGAs are flexible when it comes to math operations. While modern CPUs are optimized around 32- and 64-bit floating-point operations, FPGAs support math of many widths and many corresponding levels of numeric precision. In fact, this flexibility in data width can render the GFLOPS metric irrelevant; there are many applications in which CPUs are highly wasteful due to their constrained datapaths and the fixed widths of their fundamental operations.

C-to-FPGA compilers support floating-point types, though this depends on the FPGA device manufacturers, who provide the underlying floating-point features and libraries. Better compilers are capable of automatically scheduling and pipelining multi-cycle floating-point operations, allowing dramatic increases in performance over traditional CPUs. Some companies offer optional math libraries and other higher-performance libraries, for example implementing FFTs.

Fixed-point and arbitrary-width integer math is supported via data types and macros that can be used to implement basic fixed-point operations. These alternate numeric representations and operations can be useful in many domains, including video processing, in which the use of 24-bit data is common.
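A minimal sketch of one common fixed-point scheme, a generic Q16.16 format; the macro names below are illustrative and not taken from any particular C-to-FPGA library.

```c
#include <stdint.h>

/* Generic Q16.16 fixed point: 16 integer bits, 16 fractional bits.
 * Macro names are illustrative, not from a specific vendor library. */
typedef int32_t fixed_t;
#define FIXED_SHIFT 16
#define INT_TO_FIXED(x)  ((fixed_t)((x) << FIXED_SHIFT))
#define FIXED_TO_INT(x)  ((int)((x) >> FIXED_SHIFT))
#define FIXED_MUL(a, b)  ((fixed_t)(((int64_t)(a) * (b)) >> FIXED_SHIFT))

/* Example: scale a pixel by a fractional gain of 1.5 using only
 * integer hardware, with no floating-point unit required. */
int scale_pixel(int pixel) {
    fixed_t gain = INT_TO_FIXED(3) / 2;   /* 1.5 in Q16.16 */
    return FIXED_TO_INT(FIXED_MUL(INT_TO_FIXED(pixel), gain));
}
```

Because the word widths are explicit, an FPGA compiler can implement these operations with exactly as many multiplier and adder bits as the application needs.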

Taking advantage of loops in hardware
Loops are critical aspects of virtually all applications written in C today. In fact, modern sequential programming languages such as C place a great deal of emphasis on loops, to the extent that even the most parallelizable of operations (initializing an array, for example) are written using looping constructs.

For a C-to-FPGA compiler, then, the analysis and transformation of loops into parallel hardware is a key capability. Loop pipelining may be selectively enabled for a given loop using a compile-time pragma. Note that loop pipelining introduces additional hardware (for pipeline control) and can also introduce additional dataflow and synchronization requirements, and should be used carefully.
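A hedged sketch of pragma-controlled pipelining. The pragma spelling varies by tool (Impulse C, for example, places a `CO PIPELINE` pragma inside the loop body), and a standard C compiler simply ignores the unknown pragma.

```c
/* A loop that is a good pipelining candidate: one multiply-accumulate
 * per input element. The pragma below follows Impulse C's style; other
 * tools use different spellings, and plain C compilers ignore it. */
int dot4(const int a[4], const int b[4]) {
    int sum = 0;
    for (int i = 0; i < 4; i++) {
#pragma CO PIPELINE
        sum += a[i] * b[i];   /* one new element enters each cycle */
    }
    return sum;
}
```

Once the pipeline fills, a new iteration can begin each clock cycle rather than waiting for the previous iteration to complete.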

Unrolling is another useful method for parallelizing loops. Automatic unrolling of loops may also be controlled using a compiler pragma, or may be accomplished through C-level refactoring, for example by duplicating a loop body in the source code.

Loop unrolling can dramatically increase the performance of many types of algorithms, but can also result in large increases in the size of generated hardware. Partial unrolling, using a combination of automated optimizations and code-level refactoring, may be needed to create an implementation that balances the need for high throughput with efficient resource utilization.
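As one hypothetical example of partial unrolling by source-level refactoring, a summation loop rewritten to perform two additions per iteration with independent accumulators, which a C-to-FPGA compiler can map to parallel adders (the sketch assumes an even element count):

```c
/* Manual 2x partial unroll (illustrative). The two accumulators have
 * no dependency on each other, so the two additions in each iteration
 * can execute in parallel hardware. Assumes n is even. */
int sum_unrolled(const int data[], int n) {
    int sum0 = 0, sum1 = 0;
    for (int i = 0; i < n; i += 2) {
        sum0 += data[i];       /* adder 0 */
        sum1 += data[i + 1];   /* adder 1 */
    }
    return sum0 + sum1;        /* combine at the end */
}
```

A 4x or 8x unroll follows the same pattern, trading additional adders and registers for proportionally higher throughput.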

The C-to-FPGA compiler adds simple concepts like opening and closing streams to make it easy to create pipelined operations that compile well to hardware. For example:

void img_proc(co_stream pixels_in, co_stream pixels_out) {
    int nPixel;
    . . .
    do {
        co_stream_open(pixels_in, O_RDONLY, INT_TYPE(32));
        co_stream_open(pixels_out, O_WRONLY, INT_TYPE(32));
        while ( co_stream_read(pixels_in, &nPixel, sizeof(int)) == 0 ) {

            . . .
            // Do a series of pipelined operations here using standard C…
            . . .

            co_stream_write(pixels_out, &nPixel, sizeof(int));
        }
        co_stream_close(pixels_in);
        co_stream_close(pixels_out);
    } while (1);             // Run forever
}

Coding FPGA soft- and hard-core processors
The C-to-FPGA compiler allows hardware processes to be described in C and connected to software running on the embedded processor. API functions conceptually similar to file I/O in standard C can be used to describe streaming I/O between CPU and FPGA processes.

Hard-core processors are available with fixed instruction sets and excellent software and library support. Soft-core processors are becoming more prevalent and can be deployed singly or in banks.

Modern C-to-FPGA compilers can be used to create coprocessors for the CPU, using platform support packages to automatically generate the needed CPU-FPGA connections. This same method can be used to extend C-to-FPGA into the realm of high-performance computing (HPC), allowing an FPGA accelerator to be attached to a server.

The platform support package, or interface, creates a layer which automatically gives the software coder access to memory, I/O, busses, and other hardware features of an FPGA-enabled acceleration card.

Figure 4 Programming for FPGAs as an offload processor for a CPU requires iterative effort to route code into the most appropriate processor for the task.

Simulation and Test
C-to-FPGA tools include various methods of algorithm simulation. Better ones offer a simulation library compatible with standard ANSI C, which allows standard C debuggers such as Visual Studio or GDB to be used for emulation of parallelism.

Such environments allow multiple parallel processes written in C to be modeled as separate threads during simulation, and to communicate via streams, signals, and shared memories to validate the algorithm prior to compilation to FPGA hardware. When executed in these environments, parallel behavior is emulated using multiple threads of execution that may be observed either by using standard debugging techniques or graphically by using an application monitoring tool.

Hardware-level verification also can be enabled in C-to-FPGA tool flows through a combination of cycle-accurate simulation and automatic generation of HDL test benches. This method of simulation is important because there can be subtle differences in application behavior that are only discoverable on a cycle-by-cycle basis, using simulated input data. This is particularly true when an application has multiple levels of pipelines or involves multiple processes that must be synchronized at the level of individual data transactions.

Seven steps to success

C-to-FPGA tool flow typically entails seven steps:

  1. Design entry using Visual Studio, GCC, Eclipse, or another standard design tool.
  2. Desktop simulation using standard tools plus the C-to-FPGA compiler’s flow and analysis tools.
  3. Parallel optimization, in which the C-to-FPGA compiler generates synthesizable, parallelized HDL code.
  4. Design iteration, where user insight is critical for making design tradeoffs that enable the compiler to better use available resources and unroll loops fully. The design can remain relatively device-independent up to this point, at which a target device or board is selected, typically from a pull-down menu.
  5. Output via HDL to synthesis, in which output is sent to place-and-route tools, usually provided by the FPGA manufacturer, for the intricate allocation of physical resources.
  6. Synthesis to FPGA bitmap.
  7. Downloading to available development boards and systems, which, for the more complex boards, is greatly enhanced by a PSP, or platform support package, that abstracts away board memory, I/O, busses, etc.

While the C-to-FPGA process is best assigned to a senior engineer, it is rapidly penetrating universities and research labs worldwide and finding its niche in the engineer’s toolbox.

Brian Durwood is the CEO of Impulse Accelerated Technologies, which developed and supports Impulse C, a widely used C-to-FPGA compiler used in organizations from NASA to Harvard to Honda, and which also provides engineering services and training to help embedded system hardware and software development teams. Brian and David Pellerin met in the 1980s as part of the original team behind ABEL, one of the first successful programmable logic software tools. Brian was a Vice President at Tektronix and Virtual Vision and is a graduate of Brown and Wharton.

David Pellerin is co-founder and technical advisor to Impulse Accelerated Technologies. He is the author of “Practical FPGA Programming in C”, “Practical Design Using Programmable Logic”, and other books related to programmable hardware technologies. David’s interests in programmable logic include video processing, embedded systems, and accelerated computing for life sciences. He is a graduate of the University of Washington.
