CPU plus FPGA design flow for software developers: A new tangible reality
Recently, Brian Bailey organized a round table that resulted in a two-part article called "Supporting CPUs Plus FPGAs." The experts discussed the evolving reality of systems design based on FPGAs and CPUs. This article looks at recent developments in design flow and at how new technology can help software developers bring CPU plus FPGA platforms to market faster.
The growing interest in artificial intelligence (AI), the emergence of connected objects (IoT), and the data center acceleration trend lead us to ask: what is the common denominator among the three?
Software developers are at the center of all of these trends, and they are looking to accelerate their programs and computations. The latest technology breakthroughs, including low communication latency between FPGAs and CPUs, coupled with the relatively low power consumption of today's FPGAs, make FPGA- and CPU-based systems the right choice to achieve the desired performance. At the center of this convergence, however, software developers are hampered by the underlying complexity of FPGA technology.
Over the past few years, High-Level Synthesis (HLS) tools have greatly improved with regard to addressing today's system complexity and shortening time-to-market. However, HLS tools focus foremost on IP blocks (i.e., they are IP-centric). A broad range of system-level decisions and optimizations needed to satisfy system requirements falls outside their scope. These decisions include finding the right balance between software tasks and hardware accelerators, comparing pipeline versus parallel execution, achieving the desired data granularity, assessing communication mechanisms, and many more.
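To make one of these decisions concrete, the pipeline-versus-parallel trade-off can be approximated with a simple cycle-count model. The sketch below is illustrative only: the function names, the equal-latency stages, and the assumption that each replicated unit processes one item from start to finish are assumptions made for this example, not figures reported by any HLS tool.

```python
import math


def pipelined_cycles(items: int, stages: int, stage_cycles: int) -> int:
    """Cycles to push `items` through a linear pipeline of equal-latency stages.

    The first item needs stages * stage_cycles; every later item completes
    stage_cycles after the previous one (assumed model, no stalls).
    """
    if items == 0:
        return 0
    return (stages + items - 1) * stage_cycles


def parallel_cycles(items: int, stages: int, stage_cycles: int, copies: int) -> int:
    """Cycles when `copies` identical, non-pipelined units share the items.

    Each unit runs all stages back to back for one item at a time.
    """
    batches = math.ceil(items / copies)
    return batches * stages * stage_cycles


# Hypothetical workload: 1,000 items, 4 stages, 10 cycles per stage.
pipe = pipelined_cycles(1000, 4, 10)
para = parallel_cycles(1000, 4, 10, copies=4)
```

Under these assumptions the pipeline takes 10,030 cycles and four parallel copies take 10,000 cycles, so the two options are nearly equivalent in time. The choice would then hinge on other factors such as FPGA resources and data movement, which is exactly the kind of system-level question that sits above an IP-centric tool.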
To build these complex systems, software developers require a design flow that offers joint support of both hardware and software. Such a flow must be simple enough, and close enough to the usual software development flow, for software developers to adopt it. The flow must also provide insightful feedback about the optimization choices available to achieve the required performance goals. Some companies have recently paved the way by abstracting the technological details of the hardware design flow. These companies are inspired by the System-Level Design approaches described in "ESL Models and their Application: Electronic System Level Design and Verification in Practice."
Understanding the system-level design flow methodology
System-Level Design focuses on concerns at a higher level of abstraction. While there is a need to concentrate on the bigger picture, various levels of abstraction are used to validate, verify, refine, and integrate the different pieces of the system before it is actually developed. Even though the engineering community does not agree on a common language, the majority of design engineers start at the algorithmic level. Designers validate functional and non-functional system specifications by creating execution models in C/C++/SystemC or in environments such as MATLAB, Simulink, and LabVIEW. These high-level languages and environments are used to model the behavior of the entire system.
For the purposes of this discussion, we focus on a System-Level Design flow based on C/C++ specifications (Figure 1). The first block is divided into three steps. The first step is application profiling (i.e., hardware-software partitioning), where pieces of C/C++ code (functions, loops, etc.) are considered for migration to hardware (the FPGA). The next step is the specification of the CPU/FPGA platform (e.g., ARM Cortex-A53/FPGA, POWER8/FPGA) and the configuration of the hardware platform elements (system clock, processor cache, interconnections, etc.). The final step is to map the application tasks (based on the profiled application) between hardware and software (i.e., hard and/or soft CPU) and, at the very end, to generate an executable architecture.
Figure 1. Typical system-level design flow for CPU/FPGA
(Source: Space Codesign Systems, Inc.)
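The profiling and mapping steps of the first block can be pictured with a minimal sketch. Everything in it is hypothetical: the function names, the runtime shares, and the 10% threshold are placeholders chosen for illustration, and real tools use far richer criteria than a single hot-spot cutoff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfiledFunction:
    name: str
    runtime_share: float  # fraction of total CPU time seen in the profile


def hardware_candidates(profile, min_share=0.10):
    """Return names of functions hot enough to consider moving to the FPGA."""
    hot = [f for f in profile if f.runtime_share >= min_share]
    hot.sort(key=lambda f: f.runtime_share, reverse=True)
    return [f.name for f in hot]


def make_mapping(profile, to_hardware):
    """Assign every profiled function to either the FPGA or the CPU."""
    chosen = set(to_hardware)
    return {f.name: ("FPGA" if f.name in chosen else "CPU") for f in profile}


# Hypothetical profile of a four-function application.
profile = [
    ProfiledFunction("parse_input", 0.08),
    ProfiledFunction("filter", 0.55),
    ProfiledFunction("transform", 0.30),
    ProfiledFunction("report", 0.07),
]
candidates = hardware_candidates(profile)
mapping = make_mapping(profile, candidates)
```

Here the two hottest functions become hardware candidates and the rest stay on the CPU. A mapping of this kind is one candidate architecture, which the optimization process then has to evaluate.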
The second block of Figure 1 involves architecture optimization (also known as architectural exploration or performance verification). This is depicted in more detail in Figure 2.
Figure 2. The architecture optimization process
(Source: Space Codesign Systems, Inc.)
The architecture optimization process relies on the following estimators:
- Hardware estimation assesses metrics for the hardware partition (i.e., C/C++ code moved onto the FPGA). It can be broken down into resource, performance (e.g., loop latency), and power estimates. Hardware estimation is driven by HLS tools.
- Software estimation evaluates metrics for the C/C++ partition code running on the CPU (i.e., hard and/or soft CPU). This process is complementary to the hardware estimation step. Examples of performance metrics are processor load, task switching, and cache misses.
- Data transfer estimation consists of modeling the interfaces (i.e., memory-mapped and streaming interfaces) through which the hardware and software communicate. Examples of collected metrics are bus performance (e.g., latency and throughput), queue usage, and memory usage.
These estimates are aggregated in a database, and a system performance analysis is presented to the developer to assess whether the requirements of the system are being met. Architectures that satisfy the requirements proceed to the architecture implementation process; otherwise, further system-level optimization iterations are needed.
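The aggregation and requirement check can be sketched in a few lines. This is an assumed, deliberately pessimistic model: it adds the three latency estimates as if hardware, software, and data transfers never overlap, and it uses LUT count as the only resource metric. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Estimates:
    hw_latency_ms: float        # from hardware (HLS-driven) estimation
    sw_latency_ms: float        # from software estimation on the CPU
    transfer_latency_ms: float  # from data transfer estimation
    luts_used: int              # FPGA resource estimate


@dataclass
class Requirements:
    max_latency_ms: float
    lut_budget: int


def total_latency_ms(e: Estimates) -> float:
    """Sum the three estimates, assuming no overlap between them."""
    return e.hw_latency_ms + e.sw_latency_ms + e.transfer_latency_ms


def meets_requirements(e: Estimates, r: Requirements) -> bool:
    """True if the architecture can proceed to implementation."""
    return total_latency_ms(e) <= r.max_latency_ms and e.luts_used <= r.lut_budget


# Hypothetical numbers for one candidate architecture.
estimates = Estimates(4.0, 6.0, 2.0, luts_used=30000)
ok = meets_requirements(estimates, Requirements(max_latency_ms=15.0, lut_budget=50000))
```

A candidate that fails this check would go back for another optimization iteration, for example with a different mapping or a different communication interface.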
The last block of Figure 1 relates to the architecture implementation, where system architectures are converted to a bitstream (for FPGA implementation) using implementation tools such as Xilinx Vivado or Intel Quartus Prime. The result is a complete system that can be executed on the specific physical platform. This step must produce quality code and should be transparent to the software developer.
The lack of automated tools for architectural optimization has long been perceived as a key weakness of FPGA-based computing. Such tools have been difficult to develop because of the number of interacting system-level decisions they must handle.
To illustrate these challenges, Figure 3 shows a typical system-level optimization process during an architectural exploration for an image processing application composed of six functions (pieces of C/C++ code) to be implemented on a Zynq-7000 platform. Here, we list eight potential architectures that can be implemented on the platform. Because time-to-market constraints do not allow every architecture to be implemented, the best one must be identified quickly. This sequence of optimizations can be challenging, even for experienced hardware designers.
Figure 3. Architecture exploration with system-level decisions shown in blue
(Source: Space Codesign Systems, Inc.)
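The size of the search space is easy to underestimate: with six functions that can each run on the CPU or the FPGA, there are already 2^6 = 64 possible mappings before any other decision is considered. The following sketch enumerates them under a simple assumed cost model (additive latencies, one LUT budget). The function names and all numbers are invented for illustration and do not correspond to the application in Figure 3.

```python
from itertools import product
from typing import NamedTuple


class Func(NamedTuple):
    name: str
    sw_ms: int    # estimated latency when run on the CPU
    hw_ms: int    # estimated latency when run on the FPGA
    xfer_ms: int  # estimated CPU<->FPGA transfer cost if moved to hardware
    luts: int     # estimated FPGA resources if moved to hardware


class Candidate(NamedTuple):
    on_fpga: tuple
    latency_ms: int
    luts: int


def evaluate(funcs, choice):
    """Score one mapping; choice[i] is True if funcs[i] goes to the FPGA."""
    latency, luts, on_fpga = 0, 0, []
    for f, in_hw in zip(funcs, choice):
        if in_hw:
            latency += f.hw_ms + f.xfer_ms
            luts += f.luts
            on_fpga.append(f.name)
        else:
            latency += f.sw_ms
    return Candidate(tuple(on_fpga), latency, luts)


def explore(funcs, lut_budget):
    """Enumerate every mapping and return (all candidates, best feasible one)."""
    candidates = [evaluate(funcs, c) for c in product((False, True), repeat=len(funcs))]
    feasible = [c for c in candidates if c.luts <= lut_budget]
    return candidates, min(feasible, key=lambda c: c.latency_ms)


# Hypothetical estimates for six functions.
FUNCS = [
    Func("f1", 10, 2, 1, 8000),
    Func("f2", 4, 3, 2, 5000),
    Func("f3", 20, 4, 1, 15000),
    Func("f4", 6, 1, 1, 6000),
    Func("f5", 3, 1, 3, 2000),
    Func("f6", 12, 5, 1, 12000),
]
all_candidates, best = explore(FUNCS, lut_budget=30000)
```

With these made-up numbers, the best feasible mapping moves f1, f3, and f4 to the FPGA, for an estimated 29 ms against 55 ms in software only. Exhaustive enumeration is only workable for toy cases like this one; adding choices such as interface type or pipelining multiplies the space quickly, which is why tool support for exploration matters.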
FPGA software development tools like SDSoC/SDAccel (Xilinx), Merlin Compiler (Falcon Computing Solutions), and SpaceStudio (Space Codesign Systems) are commercial solutions that assist software developers in the design of FPGA/CPU systems while achieving system-level optimization. These tools follow a flow similar to the one described in Figures 1 and 2, and in doing so they demonstrate the existence of a new generation of system-level tools with different approaches.
SDSoC estimates the system performance in a two-step approach. First, SDSoC estimates the latencies of the hardware functions (from HLS tools) and of the data transfers (from an internal characterization of the targeted physical platform and its communication interfaces). Then, this estimate is compared against a software-only version of the application running on the physical platform.
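The comparison in the second step amounts to a speedup ratio. The sketch below is not SDSoC's actual algorithm or API; it is a minimal, assumed formulation in which the accelerated run time is the sum of hardware latency, data transfer time, and whatever software remains on the CPU. All parameter names are hypothetical.

```python
def estimated_speedup(sw_only_ms: float,
                      hw_function_ms: float,
                      data_transfer_ms: float,
                      remaining_sw_ms: float) -> float:
    """Ratio of software-only run time to the estimated accelerated run time."""
    accelerated_ms = hw_function_ms + data_transfer_ms + remaining_sw_ms
    if accelerated_ms <= 0:
        raise ValueError("accelerated run time must be positive")
    return sw_only_ms / accelerated_ms


# Hypothetical measurements: 100 ms in software only; after partitioning,
# 10 ms in hardware, 5 ms of transfers, and 25 ms left on the CPU.
speedup = estimated_speedup(100.0, 10.0, 5.0, 25.0)
```

In this made-up case the estimate is a 2.5x speedup. Note how the transfer term and the remaining software bound the gain: even an infinitely fast hardware function could not push this example beyond 100 / 30, or roughly 3.3x.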
Merlin Compiler proposes source-to-source transformation. The goal of source-to-source transformation is to reduce or eliminate the design abstraction gap between software/algorithm development and existing HLS design flows. The Merlin Compiler relies on four pragmas to infer specific FPGA designs. In addition to the four major optimizations triggered by explicit pragmas, the Merlin Compiler also contains various implicit optimizations (i.e., transform passes of the compiler), which are performed along with the pragmas to help improve the results of pipelining and parallelization.
SpaceStudio seamlessly generates an executable virtual platform (VP) for each architecture candidate (mapping). A typical VP is composed of processor core simulators connected to various models of buses, memory controllers, and other peripherals. It models the targeted platform along with data transfers in a co-simulated environment that is tailored specifically to the application. As a result, the executable VP enables more accurate performance prediction and algorithm validation of the application. It also integrates monitoring and analysis capabilities for non-intrusive performance profiling of both hardware functions and software tasks. The VP relies on HLS tools for hardware estimates, and the delays (e.g., latencies) of hardware-mapped functions are automatically annotated to increase the accuracy of the simulation. The VP can be inspected by the software developer to understand how the optimizations are implemented. Such feedback helps the software developer confirm that the applied optimizations produce the intended design.
One way to view the commercial ecosystem
Figure 4 proposes a view of the commercial ecosystem gravitating around the world of platform-based design of CPUs and FPGAs. The first (upper) box presents the main design entry at the algorithm level. The second box contains environments supporting algorithmic synthesis (i.e., from algorithm to implementation). The tools marked in bold support C/C++ design entry and perform system-level optimizations. The third box represents tools used to achieve the architectural implementation, mainly tools from FPGA vendors that perform the low-level synthesis and the bitstream generation. At the bottom of the figure, examples of CPU/FPGA platforms are illustrated.
Figure 4. Commercial ecosystem for CPU/FPGA platforms
(Source: Space Codesign Systems, Inc.)
Additionally, Table 1 lists some of the main commercial tools used in CPU/FPGA platform design.
Table 1. Commercial automation tools (*Note: A list is proposed in this review)
The ultimate goal is to democratize the development of CPU plus FPGA platforms for a wider population of users, such as the software developer community. By analogy, it took the IT industry over 50 years for programming languages to evolve into friendly languages such as Python or, more recently, Swift. A similar evolution is happening in FPGA programming. HLS tools took some time to be accepted by system designers. Today, with the advent of system-level solutions for software developers, we are entering a new phase. Commercial tools such as SpaceStudio, SDSoC, and Merlin Compiler are testimony to this acceptance process. Still, much work remains to be done in order to have a fully automated and optimized process across compilers targeting CPU plus FPGA platforms.
Guy Bois, Ing., PhD is the Founder of Space Codesign Systems and Professor in the Department of Software and Computer Engineering of Polytechnique Montréal. Guy has participated in many R&D projects in collaboration with industry leaders such as STMicroelectronics, Grass Valley, PMC Sierra, Design Workshops Technologies, and Cadabra Systems. Guy's research expertise in the field of hardware/software codesign led to the commercialization of the solution and the inception of SpaceStudio from Space Codesign Systems Inc.