High-performance embedded computing -- Core-based enhancements

João Cardoso, José Gabriel Coutinho, and Pedro Diniz

January 08, 2018


Editor's Note: Interest in embedded systems for the Internet of Things often focuses on physical size and power consumption. Yet, the need for tiny systems by no means precludes expectations for greater functionality and higher performance. At the same time, developers need to respond to growing interest in more powerful edge systems able to mind large stables of connected systems while running resource-intensive algorithms for sensor fusion, feedback control, and even machine learning. In this environment and embedded design in general, it's important for developers to understand the nature of embedded systems architectures and methods for extracting their full performance potential. In their book, Embedded Computing for High Performance, the authors offer a detailed look at the hardware and software used to meet growing performance requirements.


Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.



In addition to the recent multicore trend, CPU cores have also been enhanced to further support parallelism and specific types of computations. These enhancements include SIMD units, FMA units, and support for multithreading.


Single Instruction, Multiple Data (SIMD) units are hardware components that perform the same operation on multiple data operands concurrently. Typically, a SIMD unit receives two vectors as input (each one holding a set of operands), performs the same operation on each pair of operands (one operand from each vector), and outputs a vector with the results. Fig. 2.10 illustrates a simple example of a SIMD unit executing four operations in parallel.

FIG. 2.10 Example of a SIMD unit executing four instances of the same operation (with different operands) in parallel.
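The behavior of such a unit can be modeled in plain C. The following is a minimal sketch, not a hardware description: the type vec4i, the helpers vec4i_load and vec4i_add, and the choice of four 32-bit integer lanes are all assumptions made for illustration. The loop only models the result; in a real SIMD unit the four additions are performed concurrently by one instruction.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* assumed lane count, matching the four operations in Fig. 2.10 */

/* Hypothetical 4-lane vector register holding 32-bit integers. */
typedef struct {
    int lane[LANES];
} vec4i;

/* Fill a vector register from LANES contiguous integers in memory. */
vec4i vec4i_load(const int *p) {
    assert(p != NULL);
    vec4i v;
    for (size_t i = 0; i < LANES; ++i)
        v.lane[i] = p[i];
    return v;
}

/* Model of one SIMD add: the same operation is applied to each pair of
   lanes (one operand from each input vector), producing a result vector. */
vec4i vec4i_add(vec4i a, vec4i b) {
    vec4i r;
    for (size_t i = 0; i < LANES; ++i)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}
```

For instance, adding the vectors {1, 2, 3, 4} and {10, 20, 30, 40} with vec4i_add yields {11, 22, 33, 44}, one result per lane.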


SIMD units have been available in Intel microprocessors since the advent of the MMX ISA extensions, which were later followed by SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) [17]. The MMX extensions were initially included to improve the performance of multimedia applications and other application domains requiring image and signal processing.

ARM has also introduced SIMD extensions to ARM Cortex architectures with its NEON technology. The NEON SIMD unit is 128 bits wide and includes 16 128-bit registers that can also be used as 32 64-bit registers. These registers can be thought of as vectors of elements of the same data type. The supported data types are signed/unsigned 8-, 16-, 32-, and 64-bit integers and single-precision floating point. The following example shows how a vector statement involving single-precision floating-point data can be implemented in a Cortex-A9 using the SIMD unit.
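As a minimal sketch of such a vector statement (an illustration, not the book's original listing), the element-wise addition C = A + B on single-precision data could be written with the NEON intrinsics vld1q_f32, vaddq_f32, and vst1q_f32 from arm_neon.h. The function name vadd_f32 and the requirement that n be a multiple of 4 are assumptions made for this example. The scalar fallback is there only so that the code also builds on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Computes c[i] = a[i] + b[i] for n single-precision elements.
   Assumption for this sketch: n is a multiple of 4, so every iteration
   fills one 128-bit NEON register (4 x 32-bit floats). */
void vadd_f32(float *c, const float *a, const float *b, size_t n) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
#if defined(__ARM_NEON)
        float32x4_t va = vld1q_f32(a + i);    /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);    /* load 4 floats from b */
        vst1q_f32(c + i, vaddq_f32(va, vb));  /* 4 additions, then store */
#else
        /* Scalar fallback with the same result on non-NEON targets. */
        for (size_t j = 0; j < 4; ++j)
            c[i + j] = a[i + j] + b[i + j];
#endif
    }
}
```

On a 32-bit ARM target such as the Cortex-A9, the NEON path typically requires the compiler to be told that the unit is present (for example, with GCC's -mfpu=neon option); otherwise the fallback path is compiled.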

Typically, the operations in SIMD units include basic arithmetic operations (such as addition, subtraction, multiplication, and negation) and other operations such as absolute value (abs) and square root (sqrt).

Another factor contributing to the increased performance of SIMD units is that multiple data items can be loaded from, or stored to, memory simultaneously, exploiting the full width of the memory data bus. Fig. 2.11 depicts a simple illustrative example using SIMD units and vector processing. As can be seen in Fig. 2.11C, the code using SIMD units executes ¼ of the instructions in ¼ of the clock cycles when compared with code executing without SIMD units (Fig. 2.11B).

FIG. 2.11 Example of the use of a SIMD unit: (A) simple segment of code; (B) symbolic assembly without using SIMD support; (C) symbolic assembly considering SIMD support.
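The ¼ ratio follows from simple counting, which the sketch below makes explicit. It rests on assumptions that are not taken from the figure: each element-wise operation costs four instructions (two loads, one add, one store), a SIMD instruction processes all lanes at once, loop overhead is ignored, and each instruction takes the same number of cycles. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed instruction mix per operation: load, load, add, store. */
enum { INSTRS_PER_OP = 4 };

/* Scalar code issues the full instruction mix once per element. */
size_t scalar_instr_count(size_t n) {
    return n * INSTRS_PER_OP;
}

/* SIMD code issues the same mix once per group of `lanes` elements.
   Assumption: n is a multiple of lanes, so no remainder loop is needed. */
size_t simd_instr_count(size_t n, size_t lanes) {
    assert(lanes > 0 && n % lanes == 0);
    return (n / lanes) * INSTRS_PER_OP;
}
```

Under these assumptions, four elements need 16 scalar instructions but only 4 SIMD instructions on a 4-lane unit, which is the ¼ ratio stated above.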

To exploit SIMD units, it is very important to be able to combine multiple load or store accesses into a single SIMD instruction. This is possible when memory accesses are contiguous (e.g., unit-stride accesses) and when array elements are aligned. In the previous example, arrays A, B, and C are accessed with unit stride and we assume they are aligned (i.e., each element starts at the beginning of a word, so its memory address is a multiple of 4 in a byte-addressable 32-bit machine). When dealing with nonaligned arrays inside loops, it is common to align the addresses (and thus enable the use of SIMD instructions) by applying loop peeling transformations (see Chapter 5). In addition, to match the array dimensions to the SIMD vector lengths, compilers often apply partial loop unrolling. We provide more details about how to exploit vectorization in high-level descriptions in Chapter 6.
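The combination of loop peeling and a vector-length main loop can be sketched in C as follows. This is an illustration under stated assumptions rather than the transformation a particular compiler emits: the function name vadd_peeled, the 4-lane width, and the 16-byte alignment target are hypothetical, and only the destination array is aligned by the peeled iterations (in practice the three arrays would need the same misalignment, or unaligned loads would be used for the sources). The inner loop over LANES stands in for a single SIMD instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4                            /* assumed SIMD width in floats */
#define VEC_BYTES (LANES * sizeof(float))  /* assumed 16-byte alignment target */

/* Computes c[i] = a[i] + b[i] for i in [0, n).
   Returns the number of peeled (scalar prologue) iterations. */
size_t vadd_peeled(float *c, const float *a, const float *b, size_t n) {
    assert(c != NULL && a != NULL && b != NULL);
    size_t i = 0;

    /* Prologue (peeled iterations): run scalar code until c + i
       reaches a 16-byte boundary. */
    while (i < n && ((uintptr_t)(c + i) % VEC_BYTES) != 0) {
        c[i] = a[i] + b[i];
        ++i;
    }
    size_t peeled = i;

    /* Main loop: one group of LANES contiguous elements per iteration.
       The inner loop models what one SIMD load/add/store sequence does. */
    for (; i + LANES <= n; i += LANES)
        for (size_t j = 0; j < LANES; ++j)
            c[i + j] = a[i + j] + b[i + j];

    /* Epilogue: leftover elements when the remaining count is not a
       multiple of LANES. */
    for (; i < n; ++i)
        c[i] = a[i] + b[i];

    return peeled;
}
```

Because a float array is at least 4-byte aligned, the prologue in this sketch runs at most LANES - 1 iterations before the main loop can start on an aligned address.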
