High-performance embedded computing -- Target architectures

João Cardoso, José Gabriel Coutinho, and Pedro Diniz

December 11, 2017


Editor's Note: Interest in embedded systems for the Internet of Things often focuses on physical size and power consumption. Yet, the need for tiny systems by no means precludes expectations for greater functionality and higher performance. At the same time, developers need to respond to growing interest in more powerful edge systems able to mind large stables of connected systems while running resource-intensive algorithms for sensor fusion, feedback control, and even machine learning. In this environment and embedded design in general, it's important for developers to understand the nature of embedded systems architectures and methods for extracting their full performance potential. In their book, Embedded Computing for High Performance, the authors offer a detailed look at the hardware and software used to meet growing performance requirements.

In this excerpt from the book, the authors discuss embedded-system architectures, key approaches used for enhancing performance, and the implications for power dissipation and energy consumption.


Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.



Embedded systems are very diverse and can be organized in a myriad of ways. They can combine microprocessors and/or microcontrollers with other computing devices, such as application-specific processors (ASIPs), digital-signal processors (DSPs), and reconfigurable devices (e.g., FPGAs [1,2] and coarse-grained reconfigurable arrays—CGRAs—like TRIPS [3]), often in the form of a System-on-a-Chip (SoC). In the realm of embedded systems, and more recently in the context of scientific computing (see, e.g., [4]), the use of hardware accelerators [5] has been recognized as an efficient way to meet the required performance levels and/or energy savings (e.g., to extend battery life). Although the focus of this book is not on GPU computing (see, e.g., [6–8]), the use of GPUs is also addressed as one possible hardware accelerator in high-performance embedded systems.

Fig. 2.1 presents the main drivers that improved the performance of computing systems over the last three decades, as well as the trends for the upcoming years. Clearly, frequency scaling was the main driver for improving performance until the mid-2000s. Since then, more complex techniques have been employed to increase the computational capacity of computing platforms, including more aggressive pipelined execution, superscalar execution, multicore architectures, single instruction, multiple data (SIMD) support, fused multiply-add (FMA) units, and chip multiprocessors (CMPs). The combination of hardware accelerators with many-core architectures is currently an important source of performance gains for emerging high-performance heterogeneous architectures, and we believe it will continue to be prominent in the upcoming years.


FIG. 2.1 Indicative computer architecture trends in terms of performance enhancements.
Based on Lecture materials for “CSE 6230—High Performance Computing: Tools and Applications” course (Fall 2013): “Performance Tuning for CPU: SIMD Optimization”, by Marat Dukhan, Georgia Institute of Technology, United States.

In general, when requirements for a given application and target embedded computing system, such as execution time, power dissipation, energy consumption, and memory bandwidth, are not met, engineers and developers often resort to one or more of the following techniques:

  • Perform code optimizations to reduce the number of executed instructions or replace expensive instructions with instructions that use fewer resources while producing the same results.

  • Parallelize sections of code by taking advantage of the lack of control and data dependences, thus enabling the parallel execution of instructions by multiple functional units or multiple processor units.

  • Find the best trade-off between accuracy and other nonfunctional requirements, for example, by reducing the computational cost of specific calculations while still meeting numerical accuracy requirements.

  • Migrate (i.e., offload) sections of the code from the host CPU to one or more hardware accelerators (e.g., GPUs, FPGAs) to meet nonfunctional requirements, such as performance and energy efficiency. In some instances, developers may simply adopt a more suitable hardware platform, one whose microprocessors, peripherals, and accelerators provide more powerful computing capabilities.
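The first technique above, replacing expensive instructions with cheaper equivalents that produce the same results, is classically illustrated by strength reduction. The sketch below is illustrative only: the function names and the 320-element row stride (e.g., a QVGA frame buffer) are assumptions, not taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Naive version: multiplies by the row stride (320) on every call. */
static uint32_t pixel_offset_mul(uint32_t row, uint32_t col) {
    return row * 320u + col;
}

/* Strength-reduced version: 320 = 256 + 64, so the multiply becomes
   two shifts and an add, which can be cheaper on cores without a
   fast hardware multiplier. Both functions return identical results. */
static uint32_t pixel_offset_shift(uint32_t row, uint32_t col) {
    return (row << 8) + (row << 6) + col;
}
```

A compiler at higher optimization levels may perform this transformation automatically; doing it by hand mainly pays off on very simple in-order cores.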


A recent trend in embedded computing platforms involves the inclusion of multiple computing components that are either homogeneous or heterogeneous, and are connected using traditional bus systems or using more advanced communication systems based on networking such as those found in network-on-a-chip (NoC) devices. Typically, these architectures include multiple distributed memories, in addition to shared memories, at distinct levels of the memory hierarchy and with different address space views.


FIG. 2.2 Block diagram of typical single CPU/single core architectures: (A) with all devices connected through a system bus; (B) with the possibility to have devices connected to the CPU using direct channels.

The simplest embedded computing system consists of a single CPU/core with a system bus that connects all the system devices, from I/O peripherals to memories, as illustrated by the block diagram in Fig. 2.2A. In some cases, computing devices (e.g., the CPU) have direct access to other components using point-to-point channels (Fig. 2.2B). In other cases, all the components are accessed using a memory-mapped addressing scheme where each component is selected and assigned to a specific region of the global address space.
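Memory-mapped addressing, as described above, lets software drive peripherals with ordinary loads and stores to a region of the global address space. The sketch below is a hypothetical illustration: the register layout, field names, and the base address mentioned in the comment are assumptions, not from any real device; on a host machine the "device" is simulated with ordinary memory.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register layout of a memory-mapped, UART-like device. */
typedef struct {
    volatile uint32_t status;  /* bit 0: transmitter ready */
    volatile uint32_t txdata;  /* write a byte here to transmit */
} fake_uart_t;

/* On real hardware this pointer would be a fixed address taken from the
   system memory map, e.g. (fake_uart_t *)0x40001000u (illustrative value).
   For this host-side sketch we point it at ordinary memory instead. */
static fake_uart_t fake_uart_regs = { 1u, 0u };
static fake_uart_t *const UART0 = &fake_uart_regs;

/* Busy-wait until the device reports ready, then write the byte.
   The volatile qualifier keeps the compiler from caching or
   eliminating accesses to the device registers. */
static void uart_putc(uint8_t byte) {
    while ((UART0->status & 1u) == 0u)
        ;                      /* spin on the memory-mapped status register */
    UART0->txdata = byte;      /* this store targets the device's region */
}
```

The same pattern, a fixed base address plus a `volatile`-qualified register struct, applies whether the component hangs off the system bus of Fig. 2.2A or a direct channel as in Fig. 2.2B.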

In most high-performance systems, the CPU uses a hierarchical cache memory architecture. In such systems, the first level of memory consists of separate instruction and data caches [9]. For instance, in the system depicted in Fig. 2.2, the memory is organized as level-one (L1) instruction ($I) and data ($D) caches, a second level of cache (L2), and the main memory. There are cases, however, such as low-performance and safety-critical systems, where the system supports only one level of memory. This is because most safety-critical systems with stringent real-time requirements cannot tolerate the execution-time variability that cache memories introduce. In particular, a system may fail to meet its real-time requirements, as dictated by the worst-case execution time (WCET), when using cache memories, and therefore these systems may not include them.
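One way to see both the benefit of the L1/L2 hierarchy and the execution-time variability it introduces is to compare two traversals of the same array: both compute the same result, but only one walks memory in cache-line order. This host-side sketch is illustrative; the matrix dimensions are an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define ROWS 256
#define COLS 256

/* Row-major traversal: consecutive accesses fall in the same cache line,
   so most loads are served quickly by the L1 data cache. */
static uint64_t sum_row_major(const uint32_t m[ROWS][COLS]) {
    uint64_t s = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal of the same data: each access jumps a full row,
   touching a different cache line every time. The result is identical,
   but on a cached system the timing is not -- exactly the variability
   that WCET analysis for safety-critical systems must bound. */
static uint64_t sum_col_major(const uint32_t m[ROWS][COLS]) {
    uint64_t s = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}
```

On a cacheless single-level memory system, by contrast, both traversals take essentially the same time, which is precisely why such systems are easier to analyze for WCET.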
