High-performance embedded computing — Core-based enhancements

Editor's Note: Interest in embedded systems for the Internet of Things often focuses on physical size and power consumption. Yet, the need for tiny systems by no means precludes expectations for greater functionality and higher performance. At the same time, developers need to respond to growing interest in more powerful edge systems able to mind large stables of connected systems while running resource-intensive algorithms for sensor fusion, feedback control, and even machine learning. In this environment, and in embedded design in general, it's important for developers to understand the nature of embedded systems architectures and methods for extracting their full performance potential. In their book, Embedded Computing for High Performance, the authors offer a detailed look at the hardware and software used to meet growing performance requirements.


Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.


In addition to the recent multicore trend, CPU cores have also been enhanced to further support parallelism and specific types of computations. These enhancements include SIMD units, FMA units, and support for multithreading.


Single Instruction, Multiple Data (SIMD) units are hardware components that perform the same operation on multiple data operands concurrently. Typically, a SIMD unit receives as input two vectors (each one with a set of operands), performs the same operation on both sets of operands (one operand from each vector), and outputs a vector with the results. Fig. 2.10 illustrates a simple example of a SIMD unit executing four operations in parallel.

FIG. 2.10 Example of a SIMD unit executing four instances of the same operation (with different operands) in parallel.
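The lane-wise behavior in Fig. 2.10 can be sketched in C. This is a minimal illustration, not part of the original text: it assumes a GCC- or Clang-compatible compiler, since the vector_size attribute is a compiler extension rather than standard C, and the names v4f, simd_add4, and scalar_add4 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Four single-precision lanes packed into one 128-bit value.
   vector_size is a GCC/Clang extension (assumption), not standard C. */
typedef float v4f __attribute__((vector_size(16)));

/* One lane-wise addition: r[i] = a[i] + b[i] for i = 0..3. */
static v4f simd_add4(v4f a, v4f b)
{
    return a + b;
}

/* Scalar reference: the same result using four separate additions. */
static void scalar_add4(const float *a, const float *b, float *r)
{
    assert(a != NULL && b != NULL && r != NULL);
    for (size_t i = 0; i < 4; ++i)
        r[i] = a[i] + b[i];
}
```

On targets with 128-bit SIMD registers, a compiler will typically map simd_add4 to a single vector add, whereas scalar_add4 describes the same work as four independent operations.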


SIMD units have been available in Intel microprocessors since the advent of the MMX ISA extension, which was later followed by the SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) ISA extensions [17]. The MMX extensions were initially included to improve the performance of multimedia applications and other application domains requiring image and signal processing.

ARM has also introduced SIMD extensions to ARM Cortex architectures with its NEON technology. The NEON SIMD unit is 128 bits wide and includes 16 128-bit registers that can also be used as 32 64-bit registers. These registers can be thought of as vectors of elements of the same data type; the supported data types are signed/unsigned 8-, 16-, 32-, and 64-bit integers and single-precision floating point. The following example shows how a vector statement involving single-precision floating-point data can be implemented in a Cortex-A9 using the SIMD unit.
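A minimal sketch of such a vector statement, written with NEON intrinsics, is given here. It is an illustration under stated assumptions rather than the listing from the book: the statement A[0..3] = B[0..3] + C[0..3] and the function name add4_f32 are hypothetical, and a scalar fallback is included so the code also compiles on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical statement: A[0..3] = B[0..3] + C[0..3] (single precision). */
static void add4_f32(float *A, const float *B, const float *C)
{
    assert(A != NULL && B != NULL && C != NULL);
#if defined(__ARM_NEON)
    float32x4_t b = vld1q_f32(B);   /* load four floats into a 128-bit register */
    float32x4_t c = vld1q_f32(C);
    vst1q_f32(A, vaddq_f32(b, c));  /* one lane-wise add, then one vector store */
#else
    /* Portable fallback for targets without NEON. */
    for (size_t i = 0; i < 4; ++i)
        A[i] = B[i] + C[i];
#endif
}
```

With NEON enabled, the four additions are expressed as one vector add on 128-bit registers holding four single-precision elements each.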

Typically, the operations in SIMD units include basic arithmetic operations (such as addition, subtraction, multiplication, and negation) and other operations such as absolute value (abs) and square root (sqrt).

Another factor contributing to the increased performance of SIMD units is the fact that multiple data items can be simultaneously loaded/stored from/to memory exploiting the full width of the memory data bus. Fig. 2.11 depicts a simple illustrative example using SIMD units and vector processing. As can be seen in Fig. 2.11C, the code using SIMD units executes ¼ of the instructions in ¼ of the clock cycles when compared with code executing without SIMD units (Fig. 2.11B).

FIG. 2.11 Example of the use of a SIMD unit: (A) simple segment of code; (B) symbolic assembly without using SIMD support; (C) symbolic assembly considering SIMD support.

To exploit SIMD units, it is very important to be able to combine multiple load or store accesses in a single SIMD instruction. This can be achieved when using contiguous memory accesses, e.g., in the presence of unit stride accesses, and when array elements are aligned. In the previous example, arrays A, B, and C are accessed with unit stride and we assume they are aligned (i.e., the base address of each element starts at the beginning of a word, meaning that the memory address is a multiple of 4 in a byte-addressable 32-bit machine). When dealing with nonaligned arrays inside loops, it is common to align the addresses (and thus enable the use of SIMD instructions) by applying loop peeling transformations (see Chapter 5). In addition, and to match the array dimensions to the SIMD vector lengths, compilers often apply partial loop unrolling. We provide more details about how to exploit vectorization in high-level descriptions in Chapter 6.
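The combination of loop peeling and partial unrolling described above can be sketched by hand in C. This is only an illustration of what a vectorizing compiler might do automatically: the function name add_peeled, the 16-byte alignment target, and the unroll factor of 4 are assumptions, and the sketch only checks the alignment of A.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Computes A[i] = B[i] + C[i] for i in [0, n) in three phases.
   Assumption: 16-byte vectors holding four floats. Returns the number
   of trips through the unrolled main loop. */
static size_t add_peeled(float *A, const float *B, const float *C, size_t n)
{
    assert(A != NULL && B != NULL && C != NULL);
    size_t i = 0;
    size_t vector_trips = 0;

    /* 1. Peel scalar iterations until &A[i] is 16-byte aligned. */
    while (i < n && ((uintptr_t)&A[i] % 16u) != 0) {
        A[i] = B[i] + C[i];
        ++i;
    }

    /* 2. Main loop unrolled by 4: each trip can map onto one SIMD add. */
    for (; i + 4 <= n; i += 4) {
        A[i]     = B[i]     + C[i];
        A[i + 1] = B[i + 1] + C[i + 1];
        A[i + 2] = B[i + 2] + C[i + 2];
        A[i + 3] = B[i + 3] + C[i + 3];
        ++vector_trips;
    }

    /* 3. Remainder: elements left over when the count is not a multiple of 4. */
    for (; i < n; ++i)
        A[i] = B[i] + C[i];

    return vector_trips;
}
```

The peeled prologue and the remainder loop run at most three scalar iterations each, so for large n almost all of the work falls in the unrolled main loop.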


Fused Multiply-Add (FMA) units perform fused operations such as multiply-add and multiply-subtract. The main idea is to provide a CPU instruction that can perform operations with three input operands and an output result. Fig. 2.12 shows an example of an FMA unit. In this example, we consider the support of instructions D = A*B + C and D = A*B - C.

It is also common for FMA units to support single- and double-precision floating-point operations as well as integer operations and, depending on the data types, to include a rounding stage following the last operation (a single rounding step, as opposed to the two rounding steps performed when not using fused operations).

FIG. 2.12 An example of an FMA unit.

Depending on the processor architecture, the input/output of the FMA units might be associated with four distinct registers or three distinct registers, with one register shared between the result and one of the input operands of the FMA unit. The latter is depicted in the FMA unit in Fig. 2.12 when D is one of A, B, or C (depending on the FMA instruction).

In addition to the SIMD hardware support, some recent processor architectures include not only FMA units but also FMA vector operations (examples are the Intel 64 and IA-32 Architectures [18] and the ARM Cortex). In this case, a single SIMD instruction may perform the same fused operation over multiple data inputs, i.e., vector forms of D = A*B + C and D = A*B - C in which each element of D is computed from the corresponding elements of A, B, and C.
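The vector forms of the two operations can be sketched in C as lane-wise expressions over four single-precision elements. This is a hedged illustration: the vector_size attribute is a GCC/Clang extension, the names v4f, load4, vec_madd, and vec_msub are hypothetical, and whether the compiler emits a fused instruction for these expressions depends on the target and the compiler flags.

```c
#include <assert.h>
#include <stddef.h>

/* Four single-precision lanes; vector_size is a GCC/Clang extension (assumption). */
typedef float v4f __attribute__((vector_size(16)));

/* Load four consecutive floats into one vector value. */
static v4f load4(const float *p)
{
    assert(p != NULL);
    v4f v = { p[0], p[1], p[2], p[3] };
    return v;
}

/* D[0..3] = A[0..3] * B[0..3] + C[0..3] */
static v4f vec_madd(v4f a, v4f b, v4f c)
{
    return a * b + c;
}

/* D[0..3] = A[0..3] * B[0..3] - C[0..3] */
static v4f vec_msub(v4f a, v4f b, v4f c)
{
    return a * b - c;
}
```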


The FMA units recently included in Intel microprocessors are able to perform fused operations such as multiply-add, multiply-subtract, multiply-add/subtract interleave, and signed-reversed multiply on multiply-add and on multiply-subtract. These recent FMA extensions [18] provide 36 256-bit floating-point instructions to compute on 256-bit vectors, and additional 128-bit and scalar FMA instructions.

There are two types of FMA instructions: FMA3 and FMA4. An FMA3 instruction supports three input operands (i.e., three registers), and its result must be stored in one of the input registers. An FMA4 instruction, on the other hand, supports three input operands with the result stored in a dedicated output register. For instance, the Intel FMA instruction VFMADD213PS (FMA3) computes $0 = $1 * $0 + $2, while the instruction VFMADDPS (FMA4) is able to compute $0 = $1 * $2 + $3. Most recent processors implement FMA instructions using the FMA3 type.
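The difference between the two operand conventions can be illustrated with a small C model. This sketch only mimics which registers are read and written. It does not model the single rounding step of a real FMA unit, and the register array r and the function names fma3_213 and fma4 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a register file; r[k] plays the role of $k. */
enum { NUM_REGS = 4 };

/* FMA3, "213" operand order: $0 = $1 * $0 + $2.
   The destination is also a source, so the old value of $0 is lost. */
static void fma3_213(float r[NUM_REGS])
{
    assert(r != NULL);
    r[0] = r[1] * r[0] + r[2];
}

/* FMA4: $0 = $1 * $2 + $3.
   The destination is separate, so all three sources survive. */
static void fma4(float r[NUM_REGS])
{
    assert(r != NULL);
    r[0] = r[1] * r[2] + r[3];
}
```

With FMA3, code that still needs the overwritten input must copy it to another register first, whereas FMA4 leaves all of its inputs untouched.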

High-performance ARM microprocessors also include FMA support with the Vector Floating Point (VFP) unit [19]. The following example shows how two vector statements involving single-precision floating-point data and a temporary vector E (not live after these two statements) can be implemented in a Cortex-A9 using fused operations and SIMD.

The numerical results of FMA instructions may differ from the results of non-FMA instructions because of the different rounding applied to intermediate values. For instance, when the floating-point expression d=a*b+c; is computed with an FMA instruction, the multiplication a*b is performed with higher precision, and only the result of the addition is rounded to the desired floating-point precision (i.e., the precision associated with the floating-point type of d). Non-FMA instructions perform this computation differently: they first compute t=a*b with the same floating-point type for t, a, and b, and then compute d=t+c. In this case, the use of an FMA instruction may produce results with higher accuracy than the corresponding non-FMA instructions.
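The rounding difference can be reproduced with a small C experiment. This is a sketch under stated assumptions: instead of calling the C99 library function fmaf, the fused path is modeled by computing in double precision, where the product of two single-precision values is exact. That model is not bit-identical to a hardware FMA for every input, and the function names and sample values are chosen for illustration only.

```c
#include <assert.h>

/* Unfused: the product is rounded to float before the addition. */
static float mul_then_add(float a, float b, float c)
{
    volatile float t = a * b;  /* volatile forces rounding to float and blocks fusing */
    return t + c;
}

/* Model of a fused multiply-add (assumption): the float*float product is
   exact in double, so no precision is lost before the addition. */
static float fused_model(float a, float b, float c)
{
    double exact_product = (double)a * (double)b;
    return (float)(exact_product + (double)c);
}

/* Returns 1 when the two schemes disagree for the sample inputs. */
static int rounding_demo(void)
{
    const float a = 0x1.001p0f;   /* 1 + 2^-12 */
    const float c = -0x1.002p0f;  /* -(1 + 2^-11) */
    /* a*a = 1 + 2^-11 + 2^-24 exactly; as a float it rounds to 1 + 2^-11. */
    float unfused = mul_then_add(a, a, c);
    float fused = fused_model(a, a, c);
    assert(unfused == 0.0f);      /* the 2^-24 term was lost with t */
    assert(fused == 0x1p-24f);    /* the 2^-24 term survives */
    return unfused != fused;
}
```

For these inputs the unfused sequence returns exactly zero, while the fused model preserves the low-order term 2^-24 that the intermediate rounding of t discards.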


Modern microprocessors support simultaneous multithreading (SMT) by providing multiple cores and by duplicating hardware in a single core to allow native support of parallel thread execution. The execution of multiple threads within the same core is realized by time multiplexing its hardware resources and by fast context switching. Intel Hyper-Threading is an example of such a technology, which efficiently supports the balanced execution of two threads on the same core. Processors without multithreading, on the other hand, execute threads sequentially without interleaving their execution. Chapter 6 (Section 6.5) explains how to exploit multithreading using OpenMP.
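As a brief taste of the OpenMP style mentioned above, the following C sketch distributes the iterations of a reduction loop among threads. It is an assumption-laden illustration rather than an excerpt from Chapter 6: the function name parallel_sum is hypothetical, and the pragma only takes effect when OpenMP is enabled in the compiler (e.g., with -fopenmp on GCC or Clang). Otherwise it is ignored and the loop runs sequentially.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of x[0..n-1]. With OpenMP enabled, the iterations are divided among
   threads and the partial sums are combined by the reduction clause. */
static double parallel_sum(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; ++i)
        sum += x[i];
    return sum;
}
```

Because floating-point addition is not associative, the parallel version may combine partial sums in a different order than the sequential loop, so results can differ in the last bits for general inputs.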

The next installment in this series describes hardware accelerators used to enhance performance.  

Reprinted with permission from Elsevier/Morgan Kaufmann, Copyright © 2017

João Manuel Paiva Cardoso, Associate Professor, Department of Informatics Engineering (DEI), Faculty of Engineering, University of Porto, Portugal. Previously I was Assistant Professor in the Department of Computer Science and Engineering, Instituto Superior Técnico (IST), Technical University of Lisbon (UTL), in Lisbon (April 4, 2006 - Sept. 3, 2008), and Assistant Professor (2001-2006) in the Department of Electronics and Informatics Engineering (DEEI), Faculty of Sciences and Technology, at the University of Algarve, and Teaching Assistant at the same university (1993-2001). I have been a senior researcher at INESC-ID (Systems and Computer Engineering Institute) in Lisbon. I was a member of INESC-ID from 1994 to 2009.

José Gabriel de Figueiredo Coutinho, Research Associate, Imperial College. He is involved in the EU FP7 HARNESS project to integrate heterogeneous hardware and network technologies into data centre platforms, to vastly increase performance, reduce energy consumption, and lower cost profiles for important and high-value cloud applications such as real-time business analytics and the geosciences. His research interests include database functionality on heterogeneous systems, cloud computing resource management, and performance-driven mapping strategies.

Pedro C. Diniz received his M.Sc. in Electrical and Computer Engineering from the Technical University in Lisbon, Portugal and his Ph.D. from the University of California, Santa Barbara in Computer Science in 1997. Since 1997 he has been a researcher with the University of Southern California's Information Sciences Institute (USC/ISI) and an Assistant Professor of Computer Science at the University of Southern California in Los Angeles, California. He has led and participated in many research projects funded by the U.S. government and the European Union (EU) and has authored or co-authored many internationally recognized scientific journal papers and over 100 international conference papers. Over the years he has been heavily involved in the scientific community in the area of high-performance computing, reconfigurable and field-programmable computing.
