High-performance embedded computing — Multiprocessor and multicore architectures

Editor's Note: Interest in embedded systems for the Internet of Things often focuses on physical size and power consumption. Yet, the need for tiny systems by no means precludes expectations for greater functionality and higher performance. At the same time, developers need to respond to growing interest in more powerful edge systems able to mind large stables of connected systems while running resource-intensive algorithms for sensor fusion, feedback control, and even machine learning. In this environment, and in embedded design in general, it's important for developers to understand the nature of embedded systems architectures and the methods for extracting their full performance potential. In their book, Embedded Computing for High Performance, the authors offer a detailed look at the hardware and software used to meet growing performance requirements.


Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.


Modern microprocessors are based on multicore architectures consisting of a number of processing cores. Typically, each core has its own instruction and data memories (L1 caches) and all cores share a second-level (L2) on-chip cache. Fig. 2.4 presents a block diagram of a typical multicore (a quad-core in this case) CPU computing system where all cores share an L2 cache. The CPU is also connected to an external memory and includes link controllers to access external system components. There are, however, multicore architectures where one L2 cache is shared by a subset of cores (e.g., each L2 cache is shared by two cores in a quad-core CPU, or by four cores in an octa-core CPU). This is common in computing systems with additional memory levels. The external memories are often grouped in multiple levels and use different storage technologies. Typically, the first level is organized using SRAM devices, whereas the second level uses DDR DRAM devices.


FIG. 2.4 Block diagram of a typical multicore architecture (quad-core CPU).


Several platforms provide FPGA-based hardware extensions to commodity CPUs. Examples include the Intel QuickAssist QPI-FPGA [10], IBM Netezza [11], CAPI [12], and Xilinx Zynq [13]. Other platforms, such as RIFFA [14], focus on vendor-independent support by providing an integration framework to interface FPGA-based accelerators with the CPU system bus using PCI Express (PCIe) links.

Other system components, such as GPIO, UART, USB interface, PCIe, network coprocessor, and power manager, are connected via a fast link and may be memory mapped. In other architectures, however, the CPU connects to these subsystems (including memory) exclusively using fast links and/or switch fabrics (e.g., via a partial crossbar), thus providing point-to-point communication channels between the architecture components.

In computing systems with higher performance demands, it is common to include more than one multicore CPU (e.g., with all CPUs integrated as a chip multiprocessor, or CMP). Figs. 2.5 and 2.6 illustrate two possible organizations of CMPs, one using a distributed memory organization (Fig. 2.5) and another using a shared memory organization (Fig. 2.6).


FIG. 2.5 Block diagram of a typical multiple CPU/multiple cores architecture (with two quad-core CPUs) using a distributed memory organization.

Fig. 2.5 presents an example of a nonuniform memory access (NUMA) architecture. In such systems, the distributed memories are viewed as one combined memory by all CPUs; however, access times and throughputs differ depending on the location of the memory and the CPU. For instance, memory accesses from a CPU located on the opposite side from the target memory incur a larger latency than accesses to a nearby memory.
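The NUMA effect described above can be illustrated with a toy cost model. This is only a sketch under stated assumptions: the latency constants `LOCAL_NS` and `REMOTE_NS` and the helper functions are hypothetical, and the numbers are placeholders rather than measurements from any real platform.

```python
# Toy NUMA latency model for a two-CPU system like the one in Fig. 2.5.
# All latency figures are made-up placeholders, not measurements.

LOCAL_NS = 80    # assumed latency when a CPU accesses its own memory
REMOTE_NS = 140  # assumed latency when the access crosses to the other CPU


def access_latency_ns(cpu, memory_node):
    """Return the assumed latency for `cpu` accessing `memory_node`."""
    return LOCAL_NS if cpu == memory_node else REMOTE_NS


def average_latency_ns(cpu, accesses):
    """Average latency over a trace of memory-node IDs touched by `cpu`."""
    return sum(access_latency_ns(cpu, node) for node in accesses) / len(accesses)


# CPU 0 touching mostly local data versus mostly remote data.
mostly_local = [0, 0, 0, 1]
mostly_remote = [1, 1, 1, 0]
print(average_latency_ns(0, mostly_local))   # 95.0
print(average_latency_ns(0, mostly_remote))  # 125.0
```

Even in this simplified model, the same number of accesses costs more when the data sits in the memory attached to the other CPU, which is why data placement matters on NUMA systems.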

CPUs also provide parallel execution support for multiple threads per core. Systems with more than one multicore CPU have the potential to have many concurrently executing threads, thus supporting multithreaded applications.


FIG. 2.6 Block diagram of a typical multiple CPU/multiple cores architecture (with two quad-core CPUs) using a shared memory organization.


The type of interconnections used in a target architecture depends on the level of performance required and the platform vendor.

An example of a switch fabric is the TeraNet [a], and an example of a fast link is HyperLink [a]. They are both used in some ARM-based SoCs proposed by Texas Instruments Inc. [a], providing efficient interconnections of the subsystems to the ARM multicore CPUs and to the external accelerators.

In Intel-based computing systems, the memory subsystem is usually connected via Intel Scalable Memory Interfaces (SMI) provided by the integrated memory controllers in the CPU. They also include a fast link connecting other subsystems to the CPU using the Intel QuickPath Interconnect (QPI) [b], a point-to-point interconnect technology. AMD provides the HyperTransport [c] technology for point-to-point links.

[a] Texas Instruments Inc. AM5K2E0x multicore ARM KeyStone II system-on-chip (SoC). SPRS864D, November 2012, revised March 2015.
[b] Intel Corp. An introduction to the Intel QuickPath Interconnect. January 30, 2009.
[c] API Networks accelerates use of HyperTransport technology with launch of industry's first HyperTransport technology-to-PCI bridge chip (press release). HyperTransport Consortium. 2001-04-02.


A current trend in computing organization is the Heterogeneous System-on-a-Chip (SoC) with multiple independent and distinct CPUs, where each processor has an arbitrary number of cores and other computing devices such as hardware accelerators. Fig. 2.7 shows a block diagram of an architecture with two distinct CPUs (one with two cores and the other with four cores) and a GPU. This computer organization is similar to the one used in mobile devices such as smartphones (see Heterogeneous SoCs).


FIG. 2.7 Block diagram of a typical heterogeneous architecture.


Parallel computing architectural models are now used extensively, in part because hardware implementations supporting them exist. One well-known example is the computing model proposed in the context of OpenCL (Open Computing Language) [15,16]; instances of the model are referred to herein as OpenCL computing devices. This model is supported by a wide range of heterogeneous architectures, including multicore CPUs and hardware accelerators such as GPUs and FPGAs. In the OpenCL approach, a program consists of two parts: the host program running on the CPU, and a set of kernels running on OpenCL-capable computing devices acting as coprocessors. Chapter 7 provides details about the OpenCL language and tools.
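The host/kernel split can be sketched in plain Python without any OpenCL runtime. The example below is only an emulation under stated assumptions: `vec_add_kernel` and `host_program` are hypothetical names, and the sequential loop stands in for what a real device could execute in parallel.

```python
# Emulation of the OpenCL host/kernel split in plain Python.
# No OpenCL runtime is used; the names below are illustrative only.


def vec_add_kernel(gid, a, b, out):
    """Kernel body: one invocation handles the element at index `gid`."""
    out[gid] = a[gid] + b[gid]


def host_program():
    """Host side: prepare buffers, launch the kernel, collect the results."""
    a = [1, 2, 3, 4]
    b = [10, 20, 30, 40]
    out = [0] * len(a)
    # On a real device these invocations could run in parallel;
    # here they are executed one after another.
    for gid in range(len(a)):
        vec_add_kernel(gid, a, b, out)
    return out


print(host_program())  # [11, 22, 33, 44]
```

The point of the split is that the kernel body only describes the work for a single index, while the host decides how many invocations to launch and where the data lives.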


An example of an SoC with an organization similar to the one presented in Fig. 2.7 is the Exynos 7 Hexa 7650 (Exynos 7650) [a], designed and manufactured by Samsung Electronics for some Samsung smartphones. The SoC includes one 1.8 GHz ARMv8-A dual-core Cortex-A72, one 1.3 GHz ARMv8-A quad-core Cortex-A53 (with seamless support for 32-bit and 64-bit instruction sets), and one Mali-T860MP3 GPU.

[a] https://en.wikipedia.org/wiki/Exynos.


FIG. 2.8 High-level block diagram of the OpenCL platform model.

Fig. 2.8 shows a high-level block diagram of a typical OpenCL platform, which may include one or more computing devices. Each device includes several Computing Units (CUs) and each Computing Unit consists of several Processing Elements (PEs). A single kernel can execute on one or more PEs in parallel within the same CU or in multiple CUs. The computing devices (e.g., a GPU board with OpenCL support) are connected to a host CPU via a shared bus.

PEs are typically SIMD units responsible for executing sequences of instructions without conditional control flow. In some cases, control flow can be supported by applying if-conversion and/or by executing both branches and then selecting the required results by multiplexing the values provided by the different branches.
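The "execute both branches, then select" idea can be shown with a small sketch. The functions `branchy` and `if_converted` are hypothetical examples, and the arithmetic select used here is just one way to model the multiplexing step; real compilers and hardware typically use predicated or select instructions.

```python
# If-conversion sketch: replace a branch with "compute both, then select".


def branchy(x):
    """Original form with conditional control flow."""
    if x < 0:
        return -x
    return x * 2


def if_converted(x):
    """Branch-free form: both sides are evaluated, a predicate picks one."""
    pred = int(x < 0)   # 1 if the 'then' branch applies, else 0
    then_val = -x       # result of the 'then' branch
    else_val = x * 2    # result of the 'else' branch
    return pred * then_val + (1 - pred) * else_val  # multiplexing step


lanes = [-3, 0, 5, -1]
print([if_converted(x) for x in lanes])  # [3, 0, 10, 1]
```

Every lane executes the same instruction sequence regardless of its data, at the cost of computing a result that is then discarded.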

As depicted in Fig. 2.9, an OpenCL computing device may also include a hierarchical memory organization consisting of private memories (registers) tightly coupled to the PEs (one memory per PE), a local memory per CU, a global read-only memory that can be written by the host CPU but only read by the computing device, and a global memory read/written by the computing device and by the host CPU. Each CU’s local memory is shared by all the CU’s PEs.
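A rough way to picture how these memory levels cooperate is a per-work-group partial sum. The sketch below is a simplified emulation, not OpenCL code: `group_partial_sums` is a hypothetical function, Python lists stand in for global and local memory, and a single accumulator stands in for a PE's private memory.

```python
# Emulated use of the memory hierarchy: global -> local -> private.


def group_partial_sums(global_mem, local_size):
    """Each work group stages its slice in local memory and reduces it."""
    partial = []  # one result per work group, written back to global memory
    for start in range(0, len(global_mem), local_size):
        # Copy this group's slice from global memory into the CU's local memory.
        local_mem = global_mem[start:start + local_size]
        acc = 0  # private memory (register) of one PE
        for value in local_mem:
            acc += value
        partial.append(acc)
    return partial


print(group_partial_sums([1, 2, 3, 4, 5, 6, 7, 8], 4))  # [10, 26]
```

In this sketch one PE per group performs the whole reduction for clarity; an actual kernel would normally spread the additions across the PEs of the CU.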

The OpenCL kernels execute through parallel work items (each one with an associated ID and seen as hardware threads) and the work items define groups (known as work groups). Each work group executes on a CU and its associated work items execute on its PEs.
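The relationship between work items, work groups, and CUs can be made concrete with a small index calculation. This is a sketch under assumptions: `decompose` is a hypothetical helper, the global size is assumed to be a multiple of the work-group size, and the round-robin assignment of groups to CUs is an illustrative policy, since actual scheduling is up to the device.

```python
# Decompose global work-item IDs into work groups and local IDs.


def decompose(global_size, local_size, num_cus):
    """Map each global work-item ID to (gid, work group, local ID, CU)."""
    assert global_size % local_size == 0, "sketch assumes an even split"
    mapping = []
    for gid in range(global_size):
        group_id = gid // local_size   # which work group the item belongs to
        local_id = gid % local_size    # position of the item inside its group
        cu = group_id % num_cus        # hypothetical round-robin scheduling
        mapping.append((gid, group_id, local_id, cu))
    return mapping


for row in decompose(global_size=8, local_size=4, num_cus=2):
    print(row)
```

With eight work items and a group size of four, items 0 to 3 form work group 0 and items 4 to 7 form work group 1, each group landing on its own CU in this model.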

The OpenCL platform is well suited to computations structured as SIMT (Single Instruction, Multiple Thread), as is the case when targeting GPUs. Recently, the main FPGA vendors have adopted the OpenCL model and provide toolchain support to map the model to the FPGA resources (i.e., making FPGAs OpenCL computing devices).


FIG. 2.9 Block diagram of the OpenCL model [15].

The next installment in this series describes core-based architectural enhancements used to improve performance.

Reprinted with permission from Elsevier/Morgan Kaufmann, Copyright © 2017

João Manuel Paiva Cardoso is an Associate Professor in the Department of Informatics Engineering (DEI), Faculty of Engineering, University of Porto, Portugal. Previously, he was an Assistant Professor in the Department of Computer Science and Engineering, Instituto Superior Técnico (IST), Technical University of Lisbon (UTL), in Lisbon (April 4, 2006 to Sept. 3, 2008), an Assistant Professor (2001-2006) in the Department of Electronics and Informatics Engineering (DEEI), Faculty of Sciences and Technology, at the University of Algarve, and a Teaching Assistant at the same university (1993-2001). He has been a senior researcher at INESC-ID (Systems and Computer Engineering Institute) in Lisbon, where he was a member from 1994 to 2009.

José Gabriel de Figueiredo Coutinho is a Research Associate at Imperial College. He is involved in the EU FP7 HARNESS project to integrate heterogeneous hardware and network technologies into data centre platforms, to vastly increase performance, reduce energy consumption, and lower cost profiles for important and high-value cloud applications such as real-time business analytics and the geosciences. His research interests include database functionality on heterogeneous systems, cloud computing resource management, and performance-driven mapping strategies.

Pedro C. Diniz received his M.Sc. in Electrical and Computer Engineering from the Technical University in Lisbon, Portugal, and his Ph.D. in Computer Science from the University of California, Santa Barbara, in 1997. Since 1997 he has been a researcher with the University of Southern California's Information Sciences Institute (USC/ISI) and an Assistant Professor of Computer Science at the University of Southern California in Los Angeles, California. He has led and participated in many research projects funded by the U.S. government and the European Union (EU) and has authored or co-authored many internationally recognized scientific journal papers and over 100 international conference papers. Over the years he has been heavily involved in the scientific community in the areas of high-performance computing and reconfigurable and field-programmable computing.
