Developing processor-compatible C-code for FPGA hardware acceleration
This article describes an iterative process for converting C code to run on FPGAs with or without processor cores, how to identify which code sections can best benefit from hardware acceleration, and coding styles to use to retain commonality.
FPGAs are becoming increasingly popular with software teams to accelerate critical portions of their code. In most cases these teams already have processing stacks and applications written in C that target embedded microprocessors or servers. For applications that require acceleration, a logical next step is to offload some portion of the code to an FPGA. A good way to do this is to migrate portions of the working microprocessor system to an FPGA while keeping the code base compatible with the original processor. This approach lowers risk and allows the software team to more easily experiment with alternate implementations, iterating toward an accelerated solution without creating a fundamentally different branch of the code.
This article describes how to identify which code sections can best benefit from hardware acceleration, use coding styles to retain commonality, and select hardware for both development and deployment.
Selecting code elements for hardware acceleration
FPGAs have a comparatively slow clock but can outperform CPUs by providing more opportunities for parallelism. CPUs are constrained by fixed instruction sets and fixed data paths, while FPGAs allow incremental flexibility to create a customized processor to solve a wide variety of embedded computing problems.
Efficient FPGA coding styles tend toward a “coarse-grained parallel” approach to algorithm refactoring. This is coupled with automated or semi-automated compiler tools that generate finer-grained parallelism, for example by automatically scheduling parallel statements or by unrolling and pipelining inner code loops.
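As an illustration of an inner loop that lends itself to unrolling and pipelining, consider the minimal sketch below. The function name fir_step, the four-tap size, and the data types are assumptions chosen for this example, not taken from any particular tool. The point is the shape of the code: fixed loop bounds, simple array indexing, and multiplies that do not depend on one another.

```c
#include <stdint.h>

#define NUM_TAPS 4  /* hypothetical filter length, fixed at compile time */

/* Shift one new sample into the delay line and return one filtered output.
 * Illustrative only: names and sizes are assumptions for this sketch. */
int32_t fir_step(int32_t delay_line[NUM_TAPS],
                 const int32_t coeffs[NUM_TAPS],
                 int32_t sample)
{
    int32_t acc = 0;
    int i;

    /* Fixed trip count: a compiler can unroll this into parallel register moves. */
    for (i = NUM_TAPS - 1; i > 0; i--) {
        delay_line[i] = delay_line[i - 1];
    }
    delay_line[0] = sample;

    /* The multiplies are independent of each other, so an unrolled version
     * could use NUM_TAPS multipliers feeding an adder tree. */
    for (i = 0; i < NUM_TAPS; i++) {
        acc += coeffs[i] * delay_line[i];
    }
    return acc;
}
```

The same function compiles unchanged for a microprocessor, where the loops simply run sequentially.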
Microprocessors, in contrast, run at higher clock speeds and have instruction sets and other features that make programming with sequential programming models and languages more natural.
When evaluating which sections of a given application are most appropriate for FPGA acceleration, weigh both the I/O requirements and the computations that most need to be accelerated. Data can be moved between an FPGA and a CPU in many ways, depending on the architecture of the selected computing platform.
These architectural differences have a significant impact on the most appropriate method of developing an application. For example, if the FPGA is connected to the host CPU via a standard bus interface such as PCI Express, both the size of individual data transfers (how many bytes are transferred in each I/O request) and the latency of data movement can affect performance.
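The trade-off between transfer size and latency can be estimated with a simple first-order model, sketched below. The model is an assumption made for illustration: each I/O request is taken to cost a fixed latency plus the time needed to move its bytes at the bus's peak bandwidth. Real buses behave less simply, and the function names here are hypothetical.

```c
/* First-order model (an assumption for illustration): each request pays a
 * fixed latency, then moves its bytes at the peak bandwidth. */
double effective_throughput(double bytes_per_request,
                            double latency_s,
                            double bandwidth_bytes_per_s)
{
    double transfer_time_s =
        latency_s + bytes_per_request / bandwidth_bytes_per_s;
    return bytes_per_request / transfer_time_s;  /* bytes per second */
}

/* Smallest request size that reaches a given fraction of peak bandwidth
 * under the same model. Assumes 0 < fraction < 1. */
double min_request_bytes(double fraction,
                         double latency_s,
                         double bandwidth_bytes_per_s)
{
    return fraction * bandwidth_bytes_per_s * latency_s / (1.0 - fraction);
}
```

Under this model, a link with one microsecond of latency and a peak rate of one gigabyte per second reaches only half of its peak rate with 1,000-byte requests, which is one reason to batch data into larger transfers where the algorithm allows it.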
C-to-FPGA programming is conceptually similar to programming for GPU accelerators, or programming for multi-core processors using threads or MPI. All of these involve using a set of functions – extensions to the C language – in support of parallel programming.
In the case of FPGAs, modern compiler tools are capable of generating highly parallel implementations from one or more C-language subroutines, which are implemented as hardware processes. If appropriate coding styles are used, these compilers are able to generate equivalent low-level hardware that operates with a high degree of cycle-by-cycle parallelism.
The nature of FPGAs makes it unlikely that a large legacy C application written using traditional C programming techniques can be compiled entirely to hardware with no changes (Figure 1, below).
Figure 1 In a traditional hardware/software development process, hardware design may represent a significant bottleneck (Source: Practical FPGA Programming in C).
Many C applications are best suited to processor architectures, but may be accelerated dramatically by identifying key subroutines and inner code loops and moving just those sections to hardware, using data streaming or shared memory methods of I/O. Some level of algorithm optimization is usually also needed to make the best use of the computing resources found in an FPGA. High-level compiler tools can assist in this conversion by providing a parallel programming model and corresponding APIs that allow varying levels of parallelism to be expressed, extending the power and flexibility of the C language.
By using a C-to-FPGA compiler in an iterative manner, key subroutines can be partitioned and moved into dedicated hardware with relatively little effort. Software-based simulations should be performed along with generation of prototype hardware to simplify debugging and system-level optimization.
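One way to keep such an iterative process safe is to retain the original routine as a reference and compare it against the hardware candidate in an ordinary software simulation. The sketch below assumes a hypothetical bit-counting routine as the function being moved. All names are illustrative, and the candidate is simply a rewrite with a fixed loop count that a hardware compiler could unroll.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference version: the loop count depends on the data. */
unsigned popcount_reference(uint32_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hardware candidate: fixed trip count of 32, one bit examined per pass. */
unsigned popcount_candidate(uint32_t x)
{
    unsigned n = 0;
    int i;
    for (i = 0; i < 32; i++) {
        n += (x >> i) & 1u;
    }
    return n;
}

/* Software test bench: run both versions over the same test vectors and
 * return how many results disagree. */
size_t count_mismatches(const uint32_t *vectors, size_t count)
{
    size_t i, mismatches = 0;
    for (i = 0; i < count; i++) {
        if (popcount_reference(vectors[i]) != popcount_candidate(vectors[i])) {
            mismatches++;
        }
    }
    return mismatches;
}
```

Because both versions are plain C, the comparison can run on a desktop machine long before any prototype hardware is generated.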
Although C compilers for FPGAs are limited in the features they can support for hardware generation, C-to-FPGA programming methods that exploit attached processors allow more complete use of the language. You can use any feature of C, for example, to develop a simulation test bench, or to describe processes that will run on FPGA-embedded CPU cores (Figure 2, below).
Figure 2 By introducing software-to-hardware compilation into the design process, it’s possible to get a working prototype faster, with more time available later for design refinement (Source: Practical FPGA Programming in C).
For those processes that will be compiled directly to FPGA hardware, however, there are certain constraints placed on the C coding style. These constraints typically include limited support for pointers and restrictions on recursive function calls and on the types of control flow statements that may be used.
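As a small illustration of these constraints, the sketch below shows the same computation written two ways. Both functions are hypothetical examples. The first relies on recursion and pointer arithmetic, which is natural on a processor. The second uses a simple counted loop and array indexing, a style that hardware compilers are generally more likely to accept. Exactly which constructs are supported depends on the specific tool.

```c
/* Processor-style version: recursion plus pointer arithmetic. */
int sum_recursive(const int *values, int count)
{
    if (count <= 0) {
        return 0;
    }
    return *values + sum_recursive(values + 1, count - 1);
}

/* Hardware-friendlier version: a counted loop with array indexing,
 * no recursion and no pointer arithmetic. */
int sum_iterative(const int values[], int count)
{
    int total = 0;
    int i;
    for (i = 0; i < count; i++) {
        total += values[i];
    }
    return total;
}
```

Both versions return the same result, so the iterative form can replace the recursive one in the processor build as well, keeping a single code base.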
When considering C-to-FPGA, it's useful to think of the FPGA as a coprocessor or offload resource for an embedded or host processor. In this use case, an instruction-based CPU hosts traditional C programs with all the features of C or C++.
The FPGA is used to create hardware accelerators, using a somewhat more restricted subset of the C language and using the C-to-FPGA compiler. Data streams, shared memories, and signals (specified using function calls from the C-to-FPGA library) are used to move data between the processor and the FPGA hardware.
The C-to-FPGA compiler creates the interfaces between the processor and the hardware to implement data communication via streams, memories, and signals. This model of programming is similar to the methods used when programming for GPU accelerators.
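The streaming style of communication can be mimicked in plain C for desktop simulation. The sketch below is not the API of any real C-to-FPGA library. The sim_stream type and its functions are hypothetical stand-ins, implemented here as a small ring buffer, meant only to show the shape of a process that reads from one stream and writes to another.

```c
#include <stdint.h>

#define SIM_STREAM_DEPTH 16  /* hypothetical FIFO depth */

/* Hypothetical software stand-in for a hardware stream (a FIFO). */
typedef struct {
    int32_t data[SIM_STREAM_DEPTH];
    unsigned head, tail, count;
} sim_stream;

void sim_stream_init(sim_stream *s)
{
    s->head = 0;
    s->tail = 0;
    s->count = 0;
}

/* Returns 0 on success, -1 if the stream is full. */
int sim_stream_write(sim_stream *s, int32_t value)
{
    if (s->count == SIM_STREAM_DEPTH) {
        return -1;
    }
    s->data[s->tail] = value;
    s->tail = (s->tail + 1) % SIM_STREAM_DEPTH;
    s->count++;
    return 0;
}

/* Returns 0 on success, -1 if the stream is empty. */
int sim_stream_read(sim_stream *s, int32_t *value)
{
    if (s->count == 0) {
        return -1;
    }
    *value = s->data[s->head];
    s->head = (s->head + 1) % SIM_STREAM_DEPTH;
    s->count--;
    return 0;
}

/* Accelerator-style process: read each sample, scale it, write it out.
 * Stops when the input is empty or the output has no room left. */
void scale_process(sim_stream *in, sim_stream *out, int32_t gain)
{
    int32_t value;
    while (out->count < SIM_STREAM_DEPTH && sim_stream_read(in, &value) == 0) {
        sim_stream_write(out, value * gain);
    }
}
```

In a real flow, the processor side would write to the input stream and read from the output stream, while the compiler would generate the hardware interface for the process in between.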
Alternatively, a C-to-FPGA compiler can be used for module generation, an approach that is most practical when there is no CPU in the data path. Examples of module generation include the creation of DSP and video filters, using C-language functions and streaming I/O to describe a pipeline of connected hardware processes.
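A pipeline of this kind can be prototyped in C as a chain of small stage functions, each handling one sample at a time. The sketch below is an illustrative assumption, not a generated module: a two-sample averaging stage followed by a clamping stage of the sort a video filter might use to keep pixel values between 0 and 255. In hardware, each stage could become its own process, with samples streaming between them.

```c
#include <stdint.h>

/* State carried between samples by the averaging stage. */
typedef struct {
    int32_t previous;
} average_state;

/* Stage 1: average the current sample with the previous one. */
int32_t average_stage(average_state *state, int32_t sample)
{
    int32_t out = (state->previous + sample) / 2;
    state->previous = sample;
    return out;
}

/* Stage 2: clamp the value to the 0..255 range. */
int32_t clamp_stage(int32_t value)
{
    if (value < 0) {
        return 0;
    }
    if (value > 255) {
        return 255;
    }
    return value;
}

/* One step of the two-stage pipeline, written sequentially for simulation. */
int32_t pipeline_step(average_state *state, int32_t sample)
{
    return clamp_stage(average_stage(state, sample));
}
```

Keeping each stage free of shared global state makes it easier to later run the stages in parallel, with one stage working on a new sample while the next finishes the previous one.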