High-performance embedded computing -- Hardware accelerators

João Cardoso, José Gabriel Coutinho, and Pedro Diniz

January 15, 2018


Editor's Note: Interest in embedded systems for the Internet of Things often focuses on physical size and power consumption. Yet, the need for tiny systems by no means precludes expectations for greater functionality and higher performance. At the same time, developers need to respond to growing interest in more powerful edge systems able to mind large stables of connected systems while running resource-intensive algorithms for sensor fusion, feedback control, and even machine learning. In this environment and embedded design in general, it's important for developers to understand the nature of embedded systems architectures and methods for extracting their full performance potential. In their book, Embedded Computing for High Performance, the authors offer a detailed look at the hardware and software used to meet growing performance requirements.


Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.



Common hardware accelerators come in many forms, from the fully customizable ASIC designed for a specific function (e.g., a floating-point unit) to the more flexible graphics processing unit (GPU) and the highly programmable field-programmable gate array (FPGA). These devices require different programming models and have distinct system-level interfaces which, not surprisingly, exhibit different trade-offs among generality of use, performance, and energy. In the following subsections, we focus on GPU- and FPGA-based hardware accelerators.


Originally used exclusively for the acceleration of graphical computation (e.g., shading), graphics processing units (GPUs) have evolved in terms of flexibility and programmability to support many other compute-intensive application domains, such as scientific and engineering applications. Internally, GPUs consist of many lightweight cores (sometimes referred to as shader cores) and on-chip memories, which provide native support for high degrees of parallelism. Hence, the single program, multiple data (SPMD) paradigm is often used to program GPUs.
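To make the SPMD idea concrete, the following is a minimal behavioral sketch in Python rather than real GPU code. The names `saxpy_kernel` and `launch` are hypothetical and chosen only for illustration. Every emulated thread runs the same function and uses its own index to select the data it works on; a real GPU would execute these threads in parallel across its cores, whereas this sketch runs them one after another.

```python
def saxpy_kernel(thread_id, a, x, y, out):
    """Body run by every emulated thread: same program, different data."""
    # Guard: a launch may start more threads than there are elements.
    if thread_id < len(x):
        out[thread_id] = a * x[thread_id] + y[thread_id]


def launch(kernel, num_threads, *args):
    """Emulate a kernel launch sequentially (assumption: no real parallelism)."""
    for thread_id in range(num_threads):
        kernel(thread_id, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)

# Launch 8 threads for 4 elements; the extra threads do nothing.
launch(saxpy_kernel, 8, 2.0, x, y, out)
print(out)
```

The point of the sketch is that the programmer writes the per-element body once and the launch decides how many instances run, which is why hardware with many lightweight cores maps well to this style.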

For embedded devices, GPUs have a relatively simple organization as illustrated by the example in Fig. 2.13, whereas for high-end computing a typical internal organization is depicted in Fig. 2.14. The GPU for embedded applications illustrated in Fig. 2.13 has a relatively “flat” hierarchical organization with multiple GPU cores sharing access to an internal memory or L2 cache. Conversely, high-end GPUs have a much more complex internal architecture in which cores are organized hierarchically in clusters, each of which has local nonshared and shared resources, such as constant caches and L1 caches, as illustrated in Fig. 2.14. Regarding their intended target applications, the diversity of the embedded domains has led to a greater diversity of embedded GPU configurations with varying characteristics (e.g., size of caches, number of shader cores, number of ALUs per shader core). On the other hand, high-end GPUs exhibit less architectural diversity as they are mostly designed to serve as hardware accelerators, providing architectures with many cores and vast on-chip memory resources.


FIG. 2.13 Block diagram of a GPU for embedded devices (in this case representing an ARM Mali high-performance GPU).


FIG. 2.14 Block diagram of a high-end GPU-based accelerator (in this case representing an NVIDIA Fermi GPU). SP identifies a stream processor, LDST identifies a load/store unit, SFU identifies a special function unit, and Tex identifies a texture unit.

It should be clear that these two GPU organizations exhibit very different performance and energy characteristics. For mobile devices, the GPU is primarily designed to operate at very low power, whereas for high-end scientific and engineering computing the focus is on high computation throughput.


Given the ever-present trade-off between customization (and hence performance and energy efficiency) and generality (and thus programmability), reconfigurable hardware has been gaining considerable attention as a viable platform for hardware acceleration. Reconfigurable devices can be tailored (even dynamically—at runtime) to fit the needs of specific computations, morphing into a hardware organization that can be as efficient as a custom architecture. Due to their growing internal capacity (in terms of available hardware resources), reconfigurable devices (most notably FPGAs) have been extensively used as hardware accelerators in embedded systems.


FIG. 2.15 Simplified block diagram of a typical reconfigurable fabric.

Fig. 2.15 illustrates the organization of a typical reconfigurable fabric. It consists of configurable logic blocks (CLBs), input/output blocks (IOBs), digital signal processing (DSP) components, block RAMs (BRAMs), and interconnect resources (including switch boxes). The extreme flexibility of configurable (and reconfigurable) fabric lies in the ability to use its components to build customized hardware, including customizable memories (e.g., both in terms of number of ports and size), customizable datapaths, and control units. With such configurable fabric, developers can thus implement hardware designs that match the characteristics of the computations at hand rather than reorganizing the software code of their application to suit a specific computing architecture. The on-chip memory components (BRAMs or distributed memories) can be grouped to implement large memories and/or memories with more access ports than the two access ports provided by default. This is an important feature as memory components can be customized as needed by the application.
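The benefit of grouping memory blocks to obtain more access ports can be illustrated with a small behavioral model in Python. This is a sketch under stated assumptions, not a description of any vendor tool: data is assumed to be spread cyclically over the banks (address modulo number of banks), each bank is assumed to serve two accesses per cycle (the two default ports mentioned above), and the helper names `bank_of` and `cycles_to_read` are hypothetical.

```python
PORTS_PER_BRAM = 2  # two access ports per block, as stated in the text


def bank_of(address, num_banks):
    """Cyclic partitioning (assumption): consecutive addresses go to different banks."""
    return address % num_banks


def cycles_to_read(addresses, num_banks, ports_per_bank=PORTS_PER_BRAM):
    """Cycles needed to read all addresses when each bank serves
    at most ports_per_bank accesses per cycle."""
    per_bank = [0] * num_banks
    for address in addresses:
        per_bank[bank_of(address, num_banks)] += 1
    # The busiest bank sets the latency; -(-n // p) is ceiling division.
    return max(-(-count // ports_per_bank) for count in per_bank)


window = list(range(8))  # eight consecutive elements
print(cycles_to_read(window, num_banks=1))
print(cycles_to_read(window, num_banks=4))
print(cycles_to_read([0, 4, 8, 12], num_banks=4))
```

With a single dual-port memory the eight reads are serialized over several cycles, while four banks can deliver all eight in one cycle. The last call shows the caveat: a stride equal to the number of banks sends every access to the same bank, so the partitioning scheme has to match the access pattern of the computation.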

More specifically, reconfigurable architectures allow hardware customization by providing hardware structures to implement functions with an arbitrary number of input bits; bit-level registers; bit-level interconnect resources; resource configuration to support shift registers, FIFOs, and distributed memories; and high-performance built-in arithmetic and memory components that can be configured, for instance, to implement mathematical expressions (e.g., similarly to FMA units) and to support local storage.
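One common way such fabrics implement arbitrary functions of a few input bits is a lookup table, where the configuration bits are simply the truth table of the function. The text above does not name this mechanism, so the Python sketch below should be read as an illustrative assumption; `configure_lut` and `lut_eval` are hypothetical names.

```python
from itertools import product


def configure_lut(func, num_inputs):
    """Precompute the truth table (the 'configuration bits') of a Boolean function."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=num_inputs)]


def lut_eval(table, *inputs):
    """Evaluate the configured function by indexing the table with the input bits."""
    index = 0
    for bit in inputs:  # first input is the most significant index bit
        index = (index << 1) | (bit & 1)
    return table[index]


# Configure a 3-input LUT as a majority function.
majority = configure_lut(lambda a, b, c: (a + b + c) >= 2, 3)
print(majority)
print(lut_eval(majority, 1, 0, 1))
```

Changing the function only changes the table contents, not the evaluation logic, which is the sense in which the same hardware structure can be reconfigured for a different computation.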


An example of such built-in components is the XtremeDSP DSP48 slice provided by Xilinx FPGAs (see the DSP slice figure below). These DSP48 slices can implement functions such as multiply, multiply-accumulate (MACC), multiply-add/subtract, three-input add, barrel shift, wide-bus multiplexing, magnitude comparison, bit-wise logic functions, pattern detection, and wide counters. The DSP48 slices in high-end FPGAs provide logic functions as ALU operations, a 3- or 4-input 48-bit adder, and a 25 × 18 or 27 × 18 bit multiplier. The number of DSP slices depends on the FPGA model, but current models provide from about 1,000 to 10,000 DSP slices.
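The multiply-accumulate function can be sketched as a bit-accurate behavioral model in Python. This is a simplification under stated assumptions: it uses the 27 × 18 bit multiplier and 48-bit adder widths mentioned above, treats all operands as two's-complement values, and ignores everything else a real slice contains (pipeline registers, cascade paths, and so on). The names `to_signed`, `macc`, and `dot_product` are hypothetical.

```python
A_BITS, B_BITS, P_BITS = 27, 18, 48  # widths taken from the text above


def to_signed(value, bits):
    """Wrap an integer into a two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def macc(acc, a, b):
    """One multiply-accumulate step: acc + a * b, wrapped to the accumulator width."""
    product = to_signed(a, A_BITS) * to_signed(b, B_BITS)
    return to_signed(acc + product, P_BITS)


def dot_product(xs, ys):
    """Accumulate a dot product one MACC step per element pair."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = macc(acc, x, y)
    return acc


print(dot_product([1, 2, 3], [4, 5, 6]))
print(dot_product([-1, 2], [3, 4]))
```

A model like this is mainly useful for checking that a fixed-point algorithm fits the available widths before committing to hardware: the accumulator silently wraps on overflow, exactly as a fixed-width adder would.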

The figure below shows an FPGA DSP slice and its basic components (Xilinx UltraScale DSP48E2; from Xilinx Inc., UltraScale Architecture DSP Slice User Guide, UG579 (v1.3), November 24, 2015).

