Offloading CPUs to FPGAs – Hardware programming for software developers
Several factors are disrupting the traditional monopoly of microprocessors as the chip of choice for C algorithms. These include the cost and accessibility of cross-compilation tools, the power and speed limitations of microprocessors, and the availability of more reliable building blocks.
In this article, three university researchers break down the problem into understandable steps that the average developer can follow to determine if FPGAs are worth the (decreasing) bother and – if the answer is "yes" – how to go about it. This is based on hundreds of hours of class and lab testing. The authors are willing to share teaching materials, curricula, and advice with any certified university. If there is sufficient interest in this article, they will produce two follow-on articles covering lab work and cycle-accurate incremental improvement in more detail.
Microprocessors are not going away. They continue to represent the biggest "bang for the buck" and are at the center of most systems. FPGAs are a complementary, semi-custom, co-processing resource that is "picking off" the parallelizable tasks from CPUs. FPGAs do this – at lower clock speeds and power – by deploying multi-core parallelism.
HPRC (High Performance Reconfigurable Computing) as a branch of Computer Science is thriving. Largely driven by GPGPU (general-purpose graphics processing unit) growth, HPRC is also supported by FPGA-based applications. The programming environment is considered to be the main obstacle preventing FPGAs from being used to their full potential in accelerators. Familiarity with High Level Languages (HLLs) for hardware design is therefore becoming a necessity.
Architectural differences in C for FPGAs vs. C for CPUs
The C language, refactored for FPGA, can be characterized as a stream-oriented, process-based language. Processes are the main building blocks, and they are interconnected using streams to form the architecture of the desired hardware module. From the hardware perspective, processes are hardware modules and streams are FIFO (first in, first out) buffers.
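The process-and-stream structure described above can be approximated in plain C for illustration. The sketch below is not the C-to-FPGA toolchain's API: the names stream_t, stream_write(), stream_read(), run_pipeline(), and the FIFO depth of 4 are all hypothetical. It models a stream as a fixed-size ring buffer and steps a producer and a consumer in turn, roughly as two hardware processes joined by a FIFO would behave.

```c
#include <assert.h>
#include <stdint.h>

#define FIFO_DEPTH 4  /* hypothetical depth, fixed at compile time */

/* A unidirectional stream modeled as a fixed-size ring buffer. */
typedef struct {
    int32_t  data[FIFO_DEPTH];
    unsigned head;   /* next slot to read */
    unsigned tail;   /* next slot to write */
    unsigned count;  /* elements currently stored */
} stream_t;

/* Returns 1 on success, 0 if the FIFO is full (the writer would stall). */
static int stream_write(stream_t *s, int32_t value)
{
    if (s->count == FIFO_DEPTH) return 0;
    s->data[s->tail] = value;
    s->tail = (s->tail + 1) % FIFO_DEPTH;
    s->count++;
    return 1;
}

/* Returns 1 on success, 0 if the FIFO is empty (the reader would stall). */
static int stream_read(stream_t *s, int32_t *value)
{
    if (s->count == 0) return 0;
    *value = s->data[s->head];
    s->head = (s->head + 1) % FIFO_DEPTH;
    s->count--;
    return 1;
}

/* Producer process: pushes the next input sample if there is room. */
static void producer_step(stream_t *out, const int32_t *in, int n, int *next)
{
    if (*next < n && stream_write(out, in[*next])) (*next)++;
}

/* Consumer process: doubles one value per step, if one is available. */
static void consumer_step(stream_t *in, int32_t *out, int *done)
{
    int32_t v;
    if (stream_read(in, &v)) out[(*done)++] = 2 * v;
}

/* Steps both processes in turn until n results exist; returns steps taken. */
int run_pipeline(const int32_t *in, int32_t *out, int n)
{
    stream_t s = { {0}, 0, 0, 0 };
    int next = 0, done = 0, steps = 0;
    assert(n >= 0);
    while (done < n) {
        producer_step(&s, in, n, &next);
        consumer_step(&s, out, &done);
        steps++;
    }
    return steps;
}
```

Because the read and write helpers report a stall instead of blocking, this software model cannot hang. In real hardware a full or empty FIFO simply stalls the process, which is where deadlocks can originate.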
The C programming model is generally based on the Communicating Sequential Processes model. Every process must be classified as a hardware or a software process. It is the programmer's responsibility to ensure inter-process synchronization. Like most HLLs, C does not provide access to the clock signal, which relieves the designer from implementing cycle synchronization procedures. However, it is possible to attach HDL modules and synchronize them at the RTL level using clock signals. It is worth noting that C as a hardware design language does not permit dynamic resource allocation (e.g., "malloc()" and "calloc()").
Besides process orientation, the second distinctive feature of the language is stream orientation. Streams are unidirectional and can interconnect only two processes, which imposes restrictions on hardware module architectures designed in C. Since pipelines can become a source of deadlocks, the designer needs to pay particular attention to mechanisms for avoiding them. Unfortunately, deadlocks are difficult to trace during simulation because the "#pragma co pipeline" C-to-HDL compiler directive is ignored during software simulation. These problems are usually revealed only after implementation, when the module is tested in hardware.
In addition to streams and processes, C as a design method provides signals and semaphores. These structures are used for inter-process synchronization. The best practice is often to implement pure pipeline modules, with the lowest possible number of synchronization signals.
Figure: Software processes are converted to multiple streaming hardware processes where they use streams, signals, or memory for synchronization.
HLLs used for this purpose are generally intended to be flexible in terms of data types so as to ease HDL module integration. Typically, there will be a range of data structures available such as co_int2, co_int32, co_uint1, co_uint32, etc. These constructs are also a source of inconsistency between the software and hardware implementations.
Prior to FPGA implementation, all of the hardware modules should be simulated on a GPP (general purpose processor), where their data structures are mapped onto the types available on the GPP. Unfortunately, GPPs offer a limited set of data types, so each time a simulation is performed, the data is extended to the nearest wider data type, which changes the precision of the computation relative to the hardware. This widening happens unless a dedicated macro is used (e.g. "UADD4()" and "UDIV20()"); thus, using macros is encouraged.
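The mismatch between a narrow hardware type and its widened simulation type can be shown with a small plain-C sketch. The function names add_u4_hw() and add_u4_widened() are hypothetical, and the masking below is only an assumption about the kind of behavior a width-aware macro is meant to preserve. It is not the definition of "UADD4()".

```c
#include <assert.h>
#include <stdint.h>

/* Adds two values the way a 4-bit unsigned adder would:
 * the result wraps modulo 16 (only the low 4 bits survive). */
uint8_t add_u4_hw(uint8_t a, uint8_t b)
{
    assert(a < 16 && b < 16);
    return (uint8_t)((a + b) & 0x0F);
}

/* The same addition carried out in the nearest wider GPP type:
 * nothing wraps at 16, so the result can differ from hardware. */
uint8_t add_u4_widened(uint8_t a, uint8_t b)
{
    assert(a < 16 && b < 16);
    return (uint8_t)(a + b);
}
```

For inputs 12 and 7, the 4-bit version yields 3 while the widened version yields 19. For inputs whose sum stays below 16, the two agree, which is why such discrepancies can go unnoticed in simulation.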
Special attention should be paid to functions, since they greatly simplify modular implementation, which is a common design strategy. The following pragmas are useful: "co inline," "co implementation," "co unroll," "co pipeline," and "co set." These shape the generated modules by placing a set of restrictions on the compiler. For instance, using "inline" in a function body enables the compiler to freely modify the internal architecture of the module. Otherwise, the function is treated as a uniform module, which cannot be modified by the compiler. Static recursion is permitted and proves to be a useful structure in many applications, such as a binary tree implemented with an "add_tree()" function.
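As an illustration of static recursion, the following plain-C sketch sums a fixed number of inputs with a binary tree of additions. The name add_tree_sketch(), its signature, and the constant TREE_INPUTS are hypothetical and are not taken from the toolchain's "add_tree()". The point is only that the recursion depth follows from a compile-time constant, so a compiler could in principle expand it into a fixed adder tree.

```c
#include <assert.h>

#define TREE_INPUTS 8  /* hypothetical input count: a power of two, known at compile time */

/* Sums n consecutive values starting at index lo by splitting the range
 * in half. The recursion depth is bounded by log2(TREE_INPUTS). */
int add_tree_sketch(const int values[TREE_INPUTS], int lo, int n)
{
    /* n must be a power of two and the range must fit in the array. */
    assert(n >= 1 && (n & (n - 1)) == 0);
    assert(lo >= 0 && lo + n <= TREE_INPUTS);

    if (n == 1) return values[lo];
    return add_tree_sketch(values, lo, n / 2)
         + add_tree_sketch(values, lo + n / 2, n / 2);
}
```

Calling add_tree_sketch(v, 0, TREE_INPUTS) on an 8-element array performs seven additions arranged in three levels, rather than seven additions in a chain.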
The limitations of using C for hardware stem mostly from the restrictions introduced when adapting ANSI C to hardware design. Notable examples include:
- No dynamic recursion
- No support for unions
- No dynamic memory allocation (malloc(), free()) in hardware
- Limited support for pointers
- A pointer may only point to one block of memory
- Pointers must be determined at compilation time
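A coding style that stays within these limits can be sketched in plain C. The example below is an illustration under assumptions, not toolchain code: the names window_reset() and window_push() and the constant WINDOW are hypothetical. It keeps a sliding window in a statically sized array and addresses it with an index, so no allocation is needed and every access refers to one memory block known at compile time.

```c
#include <assert.h>
#include <stdint.h>

#define WINDOW 4  /* hypothetical window length, fixed at compile time */

static int32_t  window[WINDOW];  /* fixed storage: no malloc()/free() */
static unsigned pos;             /* an index instead of a roaming pointer */

/* Clears the window so a new sequence can be processed. */
void window_reset(void)
{
    for (unsigned i = 0; i < WINDOW; i++) window[i] = 0;
    pos = 0;
}

/* Stores one sample, overwriting the oldest, and returns the sum
 * of the last WINDOW samples. */
int32_t window_push(int32_t sample)
{
    int32_t sum = 0;
    assert(pos < WINDOW);
    window[pos] = sample;
    pos = (pos + 1) % WINDOW;
    for (unsigned i = 0; i < WINDOW; i++) sum += window[i];
    return sum;
}
```

The fixed loop bound also makes the summation a candidate for unrolling, since the number of iterations never depends on run-time data.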
There are several techniques that may optimize performance of the implemented hardware modules. Reading and writing to streams can be implemented in several ways; however, one of them may provide better pipeline performance (1 cycle per single operation).
Data access conflicts may also significantly reduce the expected performance. Therefore, it is important to prevent such conflicts by memory duplication or array scalarization (using the "co_array_config()" function). The number of combinational logic levels should be kept reasonably low, which can be achieved with a single pragma parameter (e.g., "co set Stage Delay 32").
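Memory duplication can be illustrated in plain C as follows. The function adjacent_sums_dup() and the constant N are hypothetical, and whether a given C-to-HDL compiler maps each array to its own memory is an assumption. The sketch only shows the idea: a loop body that needs two elements of the same array per iteration reads them from two separate copies, so each memory is read once per iteration.

```c
#include <assert.h>
#include <stdint.h>

#define N 8  /* hypothetical array length, fixed at compile time */

/* Computes out[i] = in[i] + in[i + 1] for i = 0 .. N-2. */
void adjacent_sums_dup(const int32_t in[N], int32_t out[N - 1])
{
    int32_t copy_a[N];  /* first copy: serves the read of element i */
    int32_t copy_b[N];  /* second copy: serves the read of element i + 1 */
    assert(in && out);

    /* Fill both copies from the input. */
    for (int i = 0; i < N; i++) {
        copy_a[i] = in[i];
        copy_b[i] = in[i];
    }

    /* Each iteration reads each copy exactly once. */
    for (int i = 0; i < N - 1; i++) {
        out[i] = copy_a[i] + copy_b[i + 1];
    }
}
```

The cost is extra memory in exchange for removing the access conflict. Scalarization goes further by turning small arrays into individual registers.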
Figure: Stage Delay Analysis provides the tools needed to see how decisions made in C algorithms will propagate in logic and clock cycles.
Generally speaking, it is recommended to use appropriate data structures to maximize data throughput. The C-to-FPGA IDE delivers a range of tools which facilitate debugging. One convenient tool is the Stage Master Explorer (SME), which may be used to examine code and pinpoint throughput bottlenecks. The measured performance in the SME is expressed through a set of four parameters, which characterize the digital module: Latency, Rate, Max. Unit Delay, and Effective Rate.
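To give a feel for how two of these parameters interact, the sketch below estimates the cycle count of a pipelined loop using standard pipeline arithmetic. It assumes that Latency is the number of cycles until the first result appears and that Rate is the number of cycles between successive iterations. The SME's exact definitions may differ, and Max. Unit Delay and Effective Rate are not modeled. The function name pipeline_cycles() is hypothetical.

```c
#include <assert.h>

/* Estimated cycles to process n items through a pipeline:
 * the first result takes `latency` cycles, and each further
 * result follows `rate` cycles after the previous one. */
unsigned long pipeline_cycles(unsigned long latency,
                              unsigned long rate,
                              unsigned long n)
{
    assert(latency >= 1 && rate >= 1);
    if (n == 0) return 0;
    return latency + (n - 1) * rate;
}
```

Under these assumptions, 100 items through a pipeline with a latency of 5 take 104 cycles at a rate of 1 and 203 cycles at a rate of 2. For long data streams, lowering the rate therefore matters far more than lowering the latency.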