High-performance embedded computing — Hardware accelerators

Editor's Note: Interest in embedded systems for the Internet of Things often focuses on physical size and power consumption. Yet, the need for tiny systems by no means precludes expectations for greater functionality and higher performance. At the same time, developers need to respond to growing interest in more powerful edge systems able to mind large stables of connected systems while running resource-intensive algorithms for sensor fusion, feedback control, and even machine learning. In this environment and embedded design in general, it's important for developers to understand the nature of embedded systems architectures and methods for extracting their full performance potential. In their book, Embedded Computing for High Performance, the authors offer a detailed look at the hardware and software used to meet growing performance requirements.


Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.

By João Cardoso, José Gabriel Coutinho, and Pedro Diniz

Common hardware accelerators come in many forms, from the fully customizable ASIC designed for a specific function (e.g., a floating-point unit) to the more flexible graphics processing unit (GPU) and the highly programmable field programmable gate array (FPGA). These devices require different programming models and have distinct system-level interfaces which, not surprisingly, exhibit different trade-offs among generality of use, performance, and energy efficiency. In the following subsections, we focus on GPU- and FPGA-based hardware accelerators.


Originally used exclusively for the acceleration of graphical computation (e.g., shading), graphics processing units (GPUs) have evolved in terms of flexibility and programmability to support many other compute-intensive application domains, such as scientific and engineering applications. Internally, GPUs consist of many lightweight cores (sometimes referred to as shader cores) and on-chip memories which provide native support for high degrees of parallelism. Hence, the single program, multiple data (SPMD) paradigm is often used to program GPUs.

For embedded devices, GPUs have a relatively simple organization, as illustrated by the example in Fig. 2.13, whereas for high-end computing a typical internal organization is depicted in Fig. 2.14. Naturally, the GPU used for embedded applications illustrated in Fig. 2.13 has a relatively “flat” hierarchical organization, with multiple GPU cores sharing their access to an internal memory or L2 cache. Conversely, high-end GPUs have a much more complex internal architecture, where cores are organized hierarchically in clusters, each of which has local nonshared and shared resources, such as constant caches and L1 caches, as illustrated in Fig. 2.14. Regarding their intended target applications, the diversity of embedded domains has led to a greater diversity of embedded GPU configurations with varying characteristics (e.g., cache sizes, number of shader cores, number of ALUs per shader core). On the other hand, high-end GPUs exhibit less architectural diversity, as they are mostly designed to serve as hardware accelerators, providing architectures with many cores and vast on-chip memory resources.


FIG. 2.13 Block diagram of a GPU for embedded devices (in this case representing an ARM Mali high-performance GPU).


FIG. 2.14 Block diagram of a high-end GPU-based accelerator (in this case representing an NVIDIA Fermi GPU). SP identifies a stream processor, LDST identifies a load/store unit, SFU identifies a special function unit, and Tex identifies a texture unit.

It should be clear that these two GPU organizations exhibit very different performance and energy characteristics. For mobile devices, the GPU is primarily designed to operate with very low power, whereas for high-end scientific and engineering computing the focus is on high computation throughput.


Given the ever-present trade-off between customization (and hence performance and energy efficiency) and generality (and thus programmability), reconfigurable hardware has been gaining considerable attention as a viable platform for hardware acceleration. Reconfigurable devices can be tailored (even dynamically—at runtime) to fit the needs of specific computations, morphing into a hardware organization that can be as efficient as a custom architecture. Due to their growing internal capacity (in terms of available hardware resources), reconfigurable devices (most notably FPGAs) have been extensively used as hardware accelerators in embedded systems.


FIG. 2.15 Simplified block diagram of a typical reconfigurable fabric.

Fig. 2.15 illustrates the organization of a typical reconfigurable fabric. It consists of configurable logic blocks (CLBs), input/output blocks (IOBs), digital signal processing (DSP) components, block RAMs (BRAMs), and interconnect resources (including switch boxes). The extreme flexibility of configurable (and reconfigurable) fabric lies in the ability to use its components to build customized hardware, including customizable memories (e.g., both in terms of number of ports and size), customizable datapaths, and control units. With such configurable fabric, developers can thus implement hardware designs that match the characteristics of the computations at hand rather than reorganizing the software code of their application to suit a specific computing architecture. The on-chip memory components (BRAMs or distributed memories) can be grouped to implement large memories and/or memories with more access ports than the two access ports provided by default. This is an important feature as memory components can be customized as needed by the application.

More specifically, reconfigurable architectures allow hardware customization by providing hardware structures to implement functions with an arbitrary number of input bits; bit-level registers; bit-level interconnect resources; resource configuration to support shift registers, FIFOs, and distributed memories; and high-performance built-in arithmetic and memory components that can be configured, for instance, to implement mathematical expressions (e.g., similarly to FMA units) and to support local storage.


An example of such built-in components is the XtremeDSP DSP48 slice provided in Xilinx FPGAs (see an example in the following figure). These DSP48 slices can implement functions such as multiply, multiply accumulate (MACC), multiply add/subtract, three-input add, barrel shift, wide-bus multiplexing, magnitude comparison, bitwise logic functions, pattern detection, and wide counters. The DSP48 slices included in high-end FPGAs also support logic functions as ALU operations and provide a three- or four-input 48-bit adder and a 25 × 18-bit or 27 × 18-bit multiplier. The number of DSP slices depends on the FPGA model, but current models provide from about 1,000 to 10,000 DSP slices.

Below we present an FPGA DSP slice showing its basic components (Xilinx UltraScale DSP48E2; from: Xilinx Inc., UltraScale Architecture DSP Slice, User Guide UG579 (v1.3), November 24, 2015):


All these hardware resources allow the implementation of sophisticated hardware accelerators, possibly including one or more computing engines interconnected by the most suitable communication structure (e.g., RAM, FIFO), with native support for data streaming computations. As an illustrative example, Fig. 2.16 depicts a hardware accelerator implemented on a reconfigurable fabric (such as an FPGA) [Note: This “emulation” of a coarse-grained architecture over a fine-grained reconfigurable fabric, such as the one offered by existing FPGA devices, is often referred to as an “overlay” architecture, as described next.], which consists of two computing engines, on-chip RAMs, and FIFOs. These customized hardware resources enable the implementation of architectures with significant performance improvements, even at low clock frequencies, and with high energy efficiency.


FIG. 2.16 Block diagram of an example of a hardware accelerator implemented in a reconfigurable fabric (i.e., by reconfigurable hardware).

The extreme flexibility of reconfigurable hardware, such as FPGAs, however, comes at a cost. First, these devices are not as computationally dense internally in terms of transistor devices, and are thus less “space efficient,” when compared to their ASIC or GPU counterparts. Second, and more importantly, due to the lack of a “fixed structure,” they require the use of hardware synthesis tools to derive a configuration file that defines the actual architecture logic. As a result, reconfigurable hardware accelerators impose an additional burden on the programmer, often requiring the learning of hardware-oriented programming languages and the mastering of low-level details of the physical architecture. Despite these perceived difficulties, recent advances in high-level synthesis (a term used to describe compilers that generate a configuration file from a high-level program description) have provided more efficient methods for mapping C and/or OpenCL descriptions to hardware. FPGA vendors also provide integrated design environments (IDEs), which can increase programming productivity by offering a sophisticated set of tools.

FPGA designs obtained by synthesizing OpenCL code may also exploit the customization features inherent in reconfigurable hardware. Examples include customized hardware communication mechanisms between the compute units (CUs) of the OpenCL model and the use of specialized CUs tailored to the data types and operations at hand. Another important feature, when targeting SoC devices that combine a CPU with reconfigurable hardware, is the ability of the hardware accelerator to directly access main memory and shared data, which can have a significant performance impact since communication through the CPU is avoided.

An alternative approach to realizing reconfigurable hardware accelerators is the use of overlay architectures. In this case, reconfigurable hardware resources are used to implement architectures that do not directly result from hardware synthesis over the native FPGA fabric. Examples of overlay architectures include coarse-grained reconfigurable arrays (CGRAs), which typically consist of an array of word-width ALUs, memories, and interconnect resources, each of which is synthesized using the native FPGA resources. As such, overlay architectures provide a higher architectural abstraction that exposes coarser-grained elements. This higher abstraction offers shorter compilation times, as the mapping problem has to contend with a smaller number of coarser-grained compute units and simpler control logic, but the reduction in compile time comes at the cost of customization flexibility. CGRAs are typically designed for specific application domains and/or implement very specific execution models (e.g., systolic arrays). The use of overlay architectures has been the approach of choice of the Convey computer platforms with their Vector Coprocessors [20], which are especially suited for vector processing.


Recent commercial SoCs also include reconfigurable hardware fabrics (i.e., die areas with reconfigurable logic) able to implement custom hardware accelerators and/or other system functions, as exemplified by the Xilinx Zynq device [21,13]. Fig. 2.17 presents the block diagram of a Zynq device, which includes a dual-core ARM CPU, on-chip memory, peripherals, reconfigurable hardware, transceivers, I/Os, ADCs, and memory controllers. These devices allow tightly coupled hardware/software solutions in a single chip and exemplify the need for hardware/software codesign approaches (see Chapter 8) and hardware/software partitioning. In this case, the developer needs to identify the components of the application that should run on the CPU (i.e., as software components) and those that should run in the reconfigurable hardware (i.e., as hardware components and/or as softcore processors). The extensive progress in high-level compilation and hardware synthesis tools for reconfigurable architectures substantially simplifies the development of embedded applications targeting these systems and shortens their time to market.

In these SoCs, the reconfigurable hardware resources can be used to implement custom hardware accelerators (i.e., hardware accelerators specifically designed to execute part of the application), domain-specific hardware accelerators (i.e., hardware accelerators that, due to their programmability, can be reused across different applications and/or sections of an application), and even architectures consisting of multiple processors implemented as softcore CPUs (i.e., CPUs implemented using the reconfigurable hardware resources, as opposed to hardcore CPUs, which are implemented at fabrication time). Recent increases in device capacity have even enabled these SoCs to morph into MPSoCs (multiprocessor SoCs), and recent Xilinx Zynq devices, such as the Zynq UltraScale+ [21], even include a GPU core coupled with the CPU.


FIG. 2.17 Block diagram of the Zynq-7000 device [21]: an example of a SoC with reconfigurable hardware.

The next installment in this series discusses key performance metrics used to guide the process of architectural selection. 

Reprinted with permission from Elsevier/Morgan Kaufmann, Copyright © 2017

João Manuel Paiva Cardoso is an Associate Professor in the Department of Informatics Engineering (DEI), Faculty of Engineering, University of Porto, Portugal. He was previously an Assistant Professor in the Department of Computer Science and Engineering, Instituto Superior Técnico (IST), Technical University of Lisbon (UTL), in Lisbon (April 4, 2006-Sept. 3, 2008), an Assistant Professor (2001-2006) in the Department of Electronics and Informatics Engineering (DEEI), Faculty of Sciences and Technology, at the University of Algarve, and a Teaching Assistant at the same university (1993-2001). He has been a senior researcher at INESC-ID (Systems and Computer Engineering Institute) in Lisbon and was a member of INESC-ID from 1994 to 2009.

José Gabriel de Figueiredo Coutinho is a Research Associate at Imperial College. He is involved in the EU FP7 HARNESS project to integrate heterogeneous hardware and network technologies into data centre platforms, to vastly increase performance, reduce energy consumption, and lower cost profiles for important and high-value cloud applications such as real-time business analytics and the geosciences. His research interests include database functionality on heterogeneous systems, cloud computing resource management, and performance-driven mapping strategies.

Pedro C. Diniz received his M.Sc. in Electrical and Computer Engineering from the Technical University in Lisbon, Portugal, and his Ph.D. in Computer Science from the University of California, Santa Barbara, in 1997. Since 1997 he has been a researcher with the University of Southern California's Information Sciences Institute (USC/ISI) and an Assistant Professor of Computer Science at the University of Southern California in Los Angeles, California. He has led and participated in many research projects funded by the U.S. government and the European Union (EU), and has authored or co-authored many internationally recognized scientific journal papers and over 100 international conference papers. Over the years he has been heavily involved in the scientific community in the areas of high-performance computing and reconfigurable and field-programmable computing.
