High-performance embedded computing -- Hardware accelerators

João Cardoso, José Gabriel Coutinho, and Pedro Diniz

January 15, 2018

All these hardware resources allow the implementation of sophisticated hardware accelerators, possibly including one or more computing engines interconnected by the most suitable communication structure (e.g., RAM, FIFO), with native support for data streaming computations. As an illustrative example, Fig. 2.16 depicts a hardware accelerator implemented on a reconfigurable fabric (such as an FPGA), which consists of two computing engines, on-chip RAMs, and FIFOs. [Note: This “emulation” of a coarse-grained architecture over a fine-grained reconfigurable fabric, such as the one offered by existing FPGA devices, is often referred to as an “overlay” architecture, as described below.] These customized hardware resources enable architectures that deliver significant performance improvements and high energy efficiency even at low clock frequencies.

FIG. 2.16 Block diagram of an example of a hardware accelerator implemented in a reconfigurable fabric (i.e., by reconfigurable hardware).
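
To make the structure in Fig. 2.16 more concrete, the following is a minimal behavioral sketch in Python of two computing engines connected by a bounded FIFO. It is a software model only, not synthesizable code and not the design shown in the figure. The names Fifo and run_accelerator, the FIFO depth, and the two operations (a scale by 3 in the first engine and a running sum in the second) are assumptions chosen for illustration.

```python
from collections import deque


class Fifo:
    """Bounded first-in, first-out buffer, standing in for an on-chip FIFO."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def full(self):
        return len(self.items) >= self.depth

    def empty(self):
        return not self.items

    def push(self, value):
        self.items.append(value)

    def pop(self):
        return self.items.popleft()


def run_accelerator(samples, fifo_depth=4):
    """Simulate the two engines cycle by cycle; return (outputs, cycles)."""
    fifo = Fifo(fifo_depth)
    pending = deque(samples)  # input stream, e.g., read from an on-chip RAM
    outputs = []
    acc = 0
    cycles = 0
    while len(outputs) < len(samples):
        cycles += 1
        # Engine B (hypothetical running sum) is evaluated first so that a
        # value pushed in this cycle only becomes visible in the next cycle.
        if not fifo.empty():
            acc += fifo.pop()
            outputs.append(acc)
        # Engine A (hypothetical scale by 3) handles one sample per cycle
        # and stalls whenever the FIFO is full.
        if pending and not fifo.full():
            fifo.push(3 * pending.popleft())
    return outputs, cycles
```

Under these assumptions, run_accelerator([1, 2, 3, 4]) returns ([3, 9, 18, 30], 5): once the pipeline is filled, one result emerges per modeled cycle, which is the streaming behavior the FIFO is meant to support.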

The extreme flexibility of reconfigurable hardware, such as FPGAs, however, comes at a cost. First, FPGAs are not as computationally dense internally in terms of transistor devices and are thus less “space efficient” than their ASIC or GPU counterparts. Second, and more importantly, due to the lack of a “fixed structure,” they require hardware synthesis tools to derive a configuration file that defines the actual architecture logic. As a result, reconfigurable hardware accelerators impose an additional burden on the programmer, often requiring knowledge of hardware-oriented programming languages and mastery of low-level details of the physical architecture. Despite these perceived difficulties, recent advances in high-level synthesis (the term used for compilers that generate a configuration file from a high-level program description) have provided more efficient methods for mapping C and/or OpenCL descriptions to hardware. FPGA vendors also provide integrated design environments (IDEs), which can increase programming productivity by offering a sophisticated set of tools.

The FPGA designs obtained by synthesizing OpenCL code may also exploit the customization features inherent in reconfigurable hardware. Examples include customized hardware communication mechanisms between the compute units (CUs) of the OpenCL model and the use of CUs specialized for the data types and operations at hand. Another important feature, in SoC devices composed of a CPU and reconfigurable hardware, is that the hardware accelerator can be given direct access to main memory and to shared data. This can have a significant performance impact because communication through the CPU is avoided.

An alternative approach to realizing reconfigurable hardware accelerators is the use of overlay architectures. In this case, reconfigurable hardware resources are used to implement architectures that do not directly result from hardware synthesis over the native FPGA fabric. Examples of overlay architectures include coarse-grained reconfigurable arrays (CGRAs), which typically consist of an array of ALUs (word-length width), memories, and interconnect resources, each of which is synthesized using the native FPGA resources. As such, these overlay architectures provide a higher architectural abstraction that exposes coarser-grained elements. This higher abstraction offers shorter compilation times, as the mapping problem has to contend with a smaller number of coarser-grained compute units and simpler control logic, but the reduction in compile time comes at the cost of customization flexibility. These CGRAs are typically designed for specific application domains and/or implement very specific execution models (e.g., systolic arrays). Overlay architectures have been the approach of choice for the Convey Computer platforms and their vector coprocessors [20], which are especially suited for vector processing.
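
The idea of mapping onto coarser-grained elements can be illustrated with a small software model. The sketch below, in Python, treats an overlay as a fixed array of word-level ALUs whose operations and operand routing are selected by a configuration list. It is only an illustration under assumed names (ALU_OPS, run_overlay, DIFF_OF_SQUARES) and an assumed feed-forward routing rule; it does not describe any particular CGRA or the Convey coprocessors.

```python
import operator

# Operations supported by each (hypothetical) word-level ALU of the overlay.
ALU_OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_overlay(config, inputs):
    """Evaluate a configured array of ALUs.

    config: one (op, src_a, src_b) tuple per ALU. A source is ("in", i) for
            input word i or ("alu", j) for the result of an earlier ALU j.
    inputs: list of input words.
    Returns the list of all ALU results, in ALU order.
    """
    results = []

    def read(src):
        kind, index = src
        if kind == "in":
            return inputs[index]
        if kind == "alu":
            # Assumed routing rule: an ALU may only read from earlier ALUs.
            if not 0 <= index < len(results):
                raise ValueError("an ALU may only read from earlier ALUs")
            return results[index]
        raise ValueError(f"unknown source kind: {kind}")

    for op, src_a, src_b in config:
        results.append(ALU_OPS[op](read(src_a), read(src_b)))
    return results


# Configuration computing (a + b) * (a - b) on three ALUs.
DIFF_OF_SQUARES = [
    ("add", ("in", 0), ("in", 1)),
    ("sub", ("in", 0), ("in", 1)),
    ("mul", ("alu", 0), ("alu", 1)),
]
```

Here run_overlay(DIFF_OF_SQUARES, [7, 3]) returns [10, 4, 40]. Retargeting the array to another expression only changes the configuration list, which mirrors why mapping to an overlay can be faster than synthesizing for the native fabric, while the fixed set of ALU operations reflects the reduced customization flexibility.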


Recent commercial SoCs also include reconfigurable hardware fabrics (i.e., die areas with reconfigurable logic), which are able to implement custom hardware accelerators and/or other system functions, as in the case of the Xilinx Zynq device [21,13]. Fig. 2.17 presents the block diagram of a Zynq device, which includes a dual-core ARM CPU, on-chip memory, peripherals, reconfigurable hardware, transceivers, I/Os, ADCs, and memory controllers. These devices allow tightly coupled hardware/software solutions in a single chip, and are examples of the need to apply hardware/software codesign approaches (see Chapter 8) and hardware/software partitioning. In this case, the developer needs to identify the components of the application that may run on the CPU (i.e., as software components) and those that may run in the reconfigurable hardware (i.e., as hardware components and/or on a softcore processor). The extensive progress in high-level compilation and hardware synthesis tools for reconfigurable architectures substantially simplifies the development of embedded applications targeting these systems and shortens their time to market.
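
As a rough illustration of the partitioning decision described above, the Python sketch below applies one simple greedy heuristic: move a component to hardware only if it is faster there once data transfer is included, and prefer components with the largest saving per unit of area until an area budget is exhausted. This is just one of many possible strategies. The function name partition, the PROFILE table and all its numbers, the abstract area units, and the assumption that components execute one after another are hypothetical.

```python
def partition(components, area_budget):
    """Greedy hardware/software partitioning under an area budget.

    components: dict name -> (sw_time, hw_time, transfer_time, area), with
                all times in the same unit and area > 0 in abstract units.
    Returns (hw_names, total_time), assuming sequential execution.
    """

    def saving(item):
        sw, hw, xfer, _ = item[1]
        return sw - (hw + xfer)

    # Keep only components that get faster in hardware once the transfer
    # cost is included, ordered by time saved per unit of area.
    candidates = [c for c in components.items() if saving(c) > 0]
    candidates.sort(key=lambda c: saving(c) / c[1][3], reverse=True)

    hw_names, used = [], 0
    for name, (_, _, _, area) in candidates:
        if used + area <= area_budget:
            hw_names.append(name)
            used += area

    total = 0
    for name, (sw, hw, xfer, _) in components.items():
        total += hw + xfer if name in hw_names else sw
    return hw_names, total


# Hypothetical profile: (sw_time, hw_time, transfer_time, area).
PROFILE = {
    "filter": (100, 10, 5, 40),
    "fft": (80, 8, 12, 50),
    "parser": (20, 15, 10, 10),
    "encode": (30, 10, 5, 30),
}
```

With these made-up numbers, partition(PROFILE, 100) returns (["filter", "fft"], 85), compared with a total of 230 when everything runs in software. The parser stays on the CPU because its transfer cost outweighs the hardware speedup, and encode stays there because the remaining area is insufficient.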

In these SoCs, the reconfigurable hardware resources can be used to implement custom hardware accelerators (i.e., hardware accelerators specifically designed to execute part of the application), domain-specific hardware accelerators (i.e., hardware accelerators that can be used by different applications and/or sections of an application thanks to their programmability support), and even architectures consisting of multiple processors implemented using softcore CPUs (i.e., CPUs implemented using the reconfigurable hardware resources, as opposed to a hardcore CPU, which is implemented at fabrication time). The recent increases in device capacity have even enabled these SoCs to morph into MPSoCs (Multiprocessor SoCs), and recent Xilinx Zynq devices, such as the Zynq UltraScale+ [21], even include a GPU core coupled with the CPU.

FIG. 2.17 Block diagram of the Zynq-7000 device [21]: an example of a SoC with reconfigurable hardware.

The next installment in this series discusses key performance metrics used to guide the process of architectural selection. 

Reprinted with permission from Elsevier/Morgan Kaufmann, Copyright © 2017

João Manuel Paiva Cardoso, Associate Professor, Department of Informatics Engineering (DEI), Faculty of Engineering, University of Porto, Portugal. He was previously Assistant Professor in the Department of Computer Science and Engineering, Instituto Superior Técnico (IST), Technical University of Lisbon (UTL), in Lisbon (April 4, 2006 to Sept. 3, 2008), Assistant Professor (2001-2006) in the Department of Electronics and Informatics Engineering (DEEI), Faculty of Sciences and Technology, at the University of Algarve, and Teaching Assistant at the same university (1993-2001). He has been a senior researcher at INESC-ID (Systems and Computer Engineering Institute) in Lisbon and was a member of INESC-ID from 1994 to 2009.

José Gabriel de Figueiredo Coutinho, Research Associate, Imperial College. He is involved in the EU FP7 HARNESS project to integrate heterogeneous hardware and network technologies into data centre platforms, to vastly increase performance, reduce energy consumption, and lower cost profiles for important and high-value cloud applications such as real-time business analytics and the geosciences. His research interests include database functionality on heterogeneous systems, cloud computing resource management, and performance-driven mapping strategies.

Pedro C. Diniz received his M.Sc. in Electrical and Computer Engineering from the Technical University in Lisbon, Portugal and his Ph.D. in Computer Science from the University of California, Santa Barbara in 1997. Since 1997 he has been a researcher with the University of Southern California’s Information Sciences Institute (USC/ISI) and an Assistant Professor of Computer Science at the University of Southern California in Los Angeles, California. He has led and participated in many research projects funded by the U.S. government and the European Union (EU) and has authored or co-authored many internationally recognized scientific journal papers and over 100 international conference papers. Over the years he has been heavily involved in the scientific community in the areas of high-performance computing and reconfigurable and field-programmable computing.
