Offloading CPUs to FPGAs – Hardware programming for software developers

Several factors are disrupting the traditional monopoly of microprocessors for being the chip of choice for C algorithms. These include the cost and accessibility of cross-compilation tools, the power and speed limitations of microprocessors, and the availability of more reliable building blocks.

In this article, three university researchers break down the problem into understandable steps that the average developer can follow to determine if FPGAs are worth the (decreasing) bother and – if the answer is “yes” – how to go about it. This is based on hundreds of hours of class and lab testing. The authors are willing to share teaching materials, curricula, and advice with any certified university. If there is sufficient interest in this article, they will produce two follow-on articles going into more details with regard to lab work and cycle-accurate incremental improvement.

Problem identification
Microprocessors are not going away. They continue to represent the biggest “bang for the buck” and are at the center of most systems. FPGAs are a complimentary, semi-custom, co-processing resource that is “picking off” the parallelizable tasks from CPUs. FPGAs do this – at lower clock speeds and power – by deploying multi-core parallelism.

HPRC (High Performance Reconfigurable Computing) as a branch of Computer Science is thriving. Largely driven by GPGPU (general-purpose graphics processing unit) growth, HPRC is also supported by FPGA-based applications. The programming environment is considered to be the main obstacle preventing FPGAs from being used to their full potential in accelerators. Thus, the need to gain familiarity with High Level Languages (HLLs) is inevitable.

Architectural differences in C for FPGAs vs. C for CPUs
The C language, refactored for FPGA, can be characterized as a stream-oriented, process-based language. Processes are the main building blocks interconnected using streams to form the architecture for the desired hardware module. From the hardware perspective, processes and streams are hardware modules and FIFOs (First In, First Out registers) respectively.

The C programming model is generally based on the Communicating Sequential Processes model. Every process must be classified as a hardware or a software process. It is the programmer's responsibility to ensure inter-process synchronization. Like most HLLs, C does not provide access to the clock signal, which relieves the designer from implementing cycle synchronization procedures. However, it is possible to attach HDL modules and synchronize them at the RTL level using clock signals. It is worth noting that C as a hardware design language does not permit dynamic resource allocation (e.g., “malloc()” and “calloc()”).

The second unique language construct, besides being process-oriented, is stream orientation. Streams are unidirectional and can interconnect only two processes, which imposes restrictions on hardware module architectures designed in C. Since pipelines can become a source of deadlocks, the designer particularly needs to consider mechanisms to avoid them. Unfortunately, occurrences of deadlocks are difficult to trace during simulations since the “#pragma co pipeline” C-to-HDL compiler directive is ignored during software simulation. These problems are usually revealed after implementation when the module is tested in hardware.

In addition to streams and processes, C as a design method provides signals and semaphores. These structures are used for inter-process synchronization. The best practice is often to implement pure pipeline modules, with the lowest possible number of synchronization signals.


Software processes are converted to multiple streaming
hardware processes where they use streams,
signals, or memory for synchronization.

Methodology
HLLs used for this purpose are generally intended to be flexible in terms of data types so as to ease HDL module integration. Typically, there will be a range of data structures available such as co_int2, co_int32, co_uint1, co_uint32, etc. These constructs are also a source of inconsistency between the software and hardware implementations.

Prior to FPGA implementation, all of the hardware modules should be simulated on a GPP (general purpose processor) where their data structures are mapped on the types available on the GPP. Unfortunately GPPs use limited sets of data types, so each time a simulation is performed, the data is extended to the nearest wider data type, which affects intrinsic computation precision. This operation is performed unless a dedicated macro is used (e.g. “UADD4()” and “UDIV20()”); thus, using macros is encouraged.

Special attention should be paid to functions, since they highly simplify modular implementation, which is a common design strategy. The following pragmas are useful: “co inline,” “co implementation,” “co unroll,” “co pipeline,” and “co set.” These allow module shaping, providing a set of restrictions. For instance, using “inline” in a function body enables the compiler to freely modify the internal architecture of the module. Otherwise, the function is treated as a uniform module, which cannot be modified by the compiler. Static recurrence is permitted and proves to be a useful structure in many applications such as a binary tree implementation with the “add_tree()” command.

The limitations of using C for hardware result mostly from the exceptions of the adaptation of ANSI C to hardware design. Notable examples include:

  • Lack of dynamic recurrence
  • No support for unions
  • Dynamic memory allocation is unsupported (free(), malloc()) in hardware
  • Limited support of pointers
  • A pointer may only point to one block of memory
  • Pointers must be determined at compilation time

There are several techniques that may optimize performance of the implemented hardware modules. Reading and writing to streams can be implemented in several ways; however, one of them may provide better pipeline performance (1 cycle per single operation).

Data access conflicts may also contribute significantly to any reduction of expected performance. Therefore, it is important to prevent such conflicts by memory duplication or table scalarization (using the “co_array_config()” instruction). The number of combinational logic levels should be kept reasonably low, which can be achieved with a single pragma parameter (e.g., “co set Stage Delay 32”).


Stage Delay Analysis provides the tools needed to see
how decisions made in C algorithms will propagate
in logic and clock cycles.

Generally speaking, it is recommended to use appropriate data structures to maximize data throughput. The C-to-FGPA IDE delivers a range of tools which facilitate debugging. One convenient tool is the Stage Master Explorer (SME), which may be used to examine code and pinpoint throughput bottlenecks. The measured performance in the SME is expressed through a set of four parameters, which characterize the digital module: Latency, Rate, Max. Unit Delay, and Effective Rate.Two simple teaching examples
#1 Implementation of an FPGA-accelerated HASH function using C
In this exercise, the students' task was to implement a hash function in C. The algorithm could not be implemented in a naive way (by copying and pasting it inside the hardware main function body), but it had to be re-coded and optimized as hardware.

The first step was to run a given hash function on a GPP to produce a reference output for the given input. The reference code in ANSI C is shown below. Every student had this same input vector but different functions, preventing plagiarism. Such solutions have the additional advantage of verifying if the whole solution is correct. To do so, one must only check the correctness of one output number.

It was noted that the interface between the CPU and FPGA is 32 bits wide, therefore four bytes of data should be transferred at once to the hardware module to minimize this bottleneck influence on the system throughput. The students' results differed in quality, but the best ones used pipelined operations, which resulted in a high throughput with slightly higher latency than for non-pipelined cases. A few students unrolled an internal for loop using the pragma “CO UNROLL” directive. By doing so, they reduced the hashing function execution time N times. The disadvantage of such an approach was the high usage of available logic resources.

#2 A Prime Number Generator using C
Another tutorial began by asking students to find algorithms in literature used as Prime Number Generators (PNGs). This knowledge was used during the following hands-on exercises. The students also had to write their own module(s) in VHDL, which were used in the PNG. The module could be a trivial one but – if so – the number had to be greater than one (e.g., the operation of square and incrementation by constant factor).

The modules had to be 100% compatible with the C external module standards and verified as accurate in a testbench. The students got their guidelines about the communication interface connecting the CPU and FPGA (using a 32-bit bus). The software portion of the design was a user interface that collects information regarding the lower and upper bounds between which the prime numbers had to be found.

This information had to be sent to the hardware module using a stream, after which the hardware process started to run. The algorithms selected by students generated proper primes and sent them to the software application, whose role was to present these results on-screen and write them into a text file.

PNG was used as an example for two reasons:

  • It is hard to write such a function in pure VHDL, which makes the advantages of C more visible
  • Every algorithm that can be used to generate primes is composed of many operations. Therefore it is easier to pick one and implement it in VHDL, avoiding duplication across students within the same group.

Some students decided to implement PNG in naive way using algorithms called trial division; that is, by checking if a number can be divided, without a remainder, only by itself. As a “reward,” they had to implement some “nasty” operation in VHDL, like a modulus or a square root. Many students decided to use the Sieve of Eratosthenes; a few students decided to use the “Fermat's 4k+1” and “Euler's 6k+1” algorithms to check if a number is a prime. The best method was one proposed by student Grzegorz Glowka BSc due to its adaptation of all three previously mentioned algorithms. Student Glowka observed that — in some intervals — some methods are more efficient than others, and he implemented his PNG in such a manner as to leverage this fact.

The lessons from the exercise were as follows:

  • Implement everything in C and perform software simulations, then run working applications and tests that measure their performance in the hardware
  • Find weaknesses of C-to-HDL translation and eliminate them by replacing them with regards to HDL blocks. Regenerate the whole design, implement it in hardware, and then run tests and measure their performance.
  • Consider rewriting key elements in HDL to tune performance.
  • Rewrite the whole hardware part into HDL, or only those parts that are responsible for data processing, leaving C to interface and transmit data, etc.


Results?

Since we are teachers, our students are our “results.” We are happy to report that both their grades were up (from prior years of similar coursework) and that students felt a little more prepared for corporate life. The class continues to develop. If we get positive responses from this article, we will share more of the teaching units. Also, please feel free to contact the authors (see below) to share or exchange teaching materials.

Acknowledgments
The authors would like to acknowledge the assistance given by Brian Durwood of Impulse Accelerated Technologies in the preparation of this article.

About the authors
Grzegorz Gancarczyk was born in Nowy Sacz, Poland, in 1984. He received an MSc degree in the field of electronics from the AGH University of Science and Technology (AGH-UST), Krakow, Poland, in 2009.

Since 2009, he is with the Academic Computer Centre (ACC) CYFRONET AGH, Krakow, Poland and now also with the Department of Electronics, AGH-UST, Krakow, Poland. His research interests include engineering education, statistics, stochastic processes, phenomenon of noise, digital signal processing and hardware acceleration of numerical methods. You can contact Grzegorz at

Maciej Wielgosz was born in Krakow, Poland, in 1979. He received his MSc and PhD degrees in the field of electronics from the AGH-UST, Krakow, Poland, in 2005 and 2010, respectively.

Since 2005, he is with the ACC CYFRONET AGH, Krakow, Poland and since 2009 also with the Dept. of Electr., AGH-UST, Krakow, Poland. He has published over 40 papers in journals and conferences and also one book: “FPGA implementation of the selected floating point operations” (Warszawa: Akademicka Oficyna Wydawnicza EXIT, 2010). His research interests include educational issues in electronics, data compression, neural networks and hardware acceleration of computations. You can contact Maciej at

Kazimierz Wiatr was born in Tarnow, Poland, in 1955. He received MSc and PhD degrees in the field of electrical engineering from the AGH-UST, Krakow, Poland, in 1980 and 1987, respectively, D. Hab. (habilitation) degree in electronics from the University of Technology of Lodz, Lodz, Poland, in 1999. Professor degree in 2002.

Since 1980, he works at the Dept. of Electr., AGH-UST, Krakow, Poland. Head of Reconfigurable Computing Systems Group. Since 2004 director of the ACC CYFRONET AGH. Since 2006 chairman of the board of PIONIER – Polish Optical Internet – Consortium. Between 1998 and 2002 adviser to the Prime Minister of Poland on “educational and upbringing of the young generation”. Managed 9 Polish Scientific Research Committee research grants. His works resulted in over 200 publications, 19 books, 5 patents and 35 industrial implementations. Achieved Polish Science and Higher Education Minister's Award. Has been involved with youth education for more than 30 years. One of the founders of the Polish independent scouting movement. His research interests include educational issues, processes automation, image systems, multiprocessor and many core systems, reconfigurable devices and hardware methods of calculations accelerating.

Prof. Wiatr was appointed in 2007 to a chairman of Tarnow Scientific Society. Member of the Polish Information Processing Society, European Organization for Information and Microelectronics (EUROMICRO). In the Sixth and Seventh Term Senate was a chairman of the Science, Educational and Sport Committee. Reviewer in the IEEE Expert Magazine, IEE Computer and Digital Techniques, IEE Electronic Letters, IEEE Transactions on Neural Networks, Eurasip Journal on Applied Signal Processing, Journal Machine Graphics and Vision. Prof. Wiatr can be contacted at
 


This article has also been published on the Programmable Design Line .

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.