High-performance embedded computing — Target architectures

Editor's Note: Interest in embedded systems for the Internet of Things often focuses on physical size and power consumption. Yet, the need for tiny systems by no means precludes expectations for greater functionality and higher performance. At the same time, developers need to respond to growing interest in more powerful edge systems able to mind large stables of connected systems while running resource-intensive algorithms for sensor fusion, feedback control, and even machine learning. In this environment and embedded design in general, it's important for developers to understand the nature of embedded systems architectures and methods for extracting their full performance potential. In their book, Embedded Computing for High Performance, the authors offer a detailed look at the hardware and software used to meet growing performance requirements.

In this excerpt from the book, presented in a series of installments, the authors discuss embedded-system architectures, key approaches to enhancing performance, and the implications for power dissipation and energy consumption.


Simple embedded systems, such as the ones depicted in Fig. 2.2, can be extended by connecting the host CPU to coprocessors acting as hardware accelerators. One type of hardware accelerator is the FPGA, which is a reprogrammable silicon chip that can be customized to realize any digital circuit. Other examples of hardware accelerators include GPUs and network coprocessors (i.e., coprocessors specially devoted to the network interface and communication).


FIG. 2.3 Block diagram of a typical single CPU/single core architecture extended with hardware accelerators.

Fig. 2.3 presents a block diagram of an architecture consisting of a CPU (①) and a hardware accelerator (②). Depending on the complexity of the hardware accelerator and how it is connected to the CPU (e.g., tightly or loosely coupled), the hardware accelerator may include local memories (③: on-chip and/or external, but directly connected to the hardware accelerator). The presence of local memories, tightly coupled with the accelerator, allows local data and intermediate results to be stored, which is an important architectural feature for supporting data-intensive applications.

At a programmatic level, an important aspect to consider when using hardware accelerators is the cost of data movement between the accelerator and the host processor. Often, the accelerator only has control of its local storage, and data communication between the host processor and the accelerator is accomplished explicitly via data messaging. Despite its conceptual simplicity, this organization imposes a significant communication overhead if the CPU must be involved in transferring data from the main memory (⑤) to the accelerator (②). An alternative arrangement that incurs much lower communication overhead relies on the use of direct memory access (DMA), as the accelerator can autonomously access data in the main memory (②⟷⑤). In this case, only the data location needs to be communicated between the host CPU and the accelerator (①⟷②). In addition, and depending on the system-level connectivity of the input/output devices, the hardware accelerator may be directly connected to input/output channels (④⟷②, e.g., using FIFOs), thereby bypassing the host CPU altogether.
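The difference between the two arrangements can be sketched in C. This is a minimal illustration, not a real driver API: the names `offload_by_copy`, `offload_by_dma`, and `dma_descriptor_t` are hypothetical, and each function returns the number of bytes the host CPU itself has to move, which is the overhead the text contrasts.

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical descriptor the host fills in when DMA is available: only
 * the location and size of the data are communicated (① ⟷ ②); the
 * accelerator then fetches the payload from main memory on its own. */
typedef struct {
    const unsigned char *addr;  /* where the data lives in main memory (⑤) */
    size_t               len;   /* how many bytes the accelerator should fetch */
} dma_descriptor_t;

/* CPU-mediated transfer: the host copies every byte into the accelerator's
 * local memory (③). The CPU overhead grows linearly with the data size. */
size_t offload_by_copy(unsigned char *accel_local_mem,
                       const unsigned char *main_mem, size_t len)
{
    memcpy(accel_local_mem, main_mem, len);
    return len;                      /* bytes moved by the CPU */
}

/* DMA-based transfer: the host only writes a small descriptor.
 * The CPU overhead is constant, regardless of the payload size. */
size_t offload_by_dma(dma_descriptor_t *desc,
                      const unsigned char *main_mem, size_t len)
{
    desc->addr = main_mem;
    desc->len  = len;
    return sizeof *desc;             /* bytes moved by the CPU */
}
```

For a 4 KB payload, the copy-based path costs the host 4096 bytes of movement, while the DMA path costs it only the size of the descriptor; the gap widens with larger data sets, which is why DMA matters for data-intensive applications.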

Hardware accelerators are also used in desktops and servers, where they are commonly connected to the CPU via a PCI Express (PCIe) bus. Although PCIe provides a dedicated connection, it still imposes high data transfer latency and relatively low throughput. Thus, only computationally intensive regions of code, where the ratio of computational effort to data transferred is high, are likely to profit from being offloaded to hardware accelerators.
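This offloading criterion can be expressed as a simple first-order break-even model: offloading pays off only when the accelerator's compute time plus the round-trip transfer time over the bus beats the time the host would have taken. The function name and all parameter values below are illustrative assumptions, not measured figures or a published model.

```c
#include <stdbool.h>

/* Hypothetical break-even test for offloading a code region.
 * transfer time = bus latency + payload size / bus bandwidth;
 * we assume a symmetric round trip (data out, results back). */
bool offload_profitable(double bytes_moved,
                        double bus_bandwidth_bytes_per_s, /* e.g., PCIe link */
                        double bus_latency_s,             /* per transfer */
                        double host_compute_s,            /* time on the CPU */
                        double accel_compute_s)           /* time on the accelerator */
{
    double transfer_s = bus_latency_s
                      + bytes_moved / bus_bandwidth_bytes_per_s;
    return accel_compute_s + 2.0 * transfer_s < host_compute_s;
}
```

With illustrative numbers (1 MB moved over an 8 GB/s link with 10 µs latency), a region that takes 1 ms on the host but 0.1 ms on the accelerator is worth offloading, while a region the host already finishes in 0.2 ms is not: the transfer cost dominates, exactly the compute-to-data ratio the text describes.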

The next installment in this series describes multiprocessor and multicore architectures.

Reprinted with permission from Elsevier/Morgan Kaufmann, Copyright © 2017

João Manuel Paiva Cardoso, Associate Professor, Department of Informatics Engineering (DEI), Faculty of Engineering, University of Porto, Portugal. He was previously Assistant Professor in the Department of Computer Science and Engineering, Instituto Superior Técnico (IST), Technical University of Lisbon (UTL), in Lisbon (April 4, 2006-Sept. 3, 2008), Assistant Professor (2001-2006) in the Department of Electronics and Informatics Engineering (DEEI), Faculty of Sciences and Technology, University of Algarve, and Teaching Assistant at the same university (1993-2001). He has been a senior researcher at INESC-ID (Systems and Computer Engineering Institute) in Lisbon, and was a member of INESC-ID from 1994 to 2009.

José Gabriel de Figueiredo Coutinho, Research Associate, Imperial College. He is involved in the EU FP7 HARNESS project to integrate heterogeneous hardware and network technologies into data centre platforms, vastly increasing performance, reducing energy consumption, and lowering cost profiles for important, high-value cloud applications such as real-time business analytics and the geosciences. His research interests include database functionality on heterogeneous systems, cloud computing resource management, and performance-driven mapping strategies.

Pedro C. Diniz received his M.Sc. in Electrical and Computer Engineering from the Technical University in Lisbon, Portugal, and his Ph.D. in Computer Science from the University of California, Santa Barbara, in 1997. Since 1997 he has been a researcher at the University of Southern California's Information Sciences Institute (USC/ISI) and an Assistant Professor of Computer Science at the University of Southern California in Los Angeles, California. He has led and participated in many research projects funded by the U.S. government and the European Union, and has authored or co-authored many internationally recognized scientific journal papers and over 100 international conference papers. Over the years he has been heavily involved in the scientific community in the areas of high-performance computing and reconfigurable and field-programmable computing.
