High-performance embedded computing -- Core-based enhancements
Editor's Note: Interest in embedded systems for the Internet of Things often focuses on physical size and power consumption. Yet, the need for tiny systems by no means precludes expectations for greater functionality and higher performance. At the same time, developers need to respond to growing interest in more powerful edge systems able to mind large stables of connected systems while running resource-intensive algorithms for sensor fusion, feedback control, and even machine learning. In this environment and embedded design in general, it's important for developers to understand the nature of embedded systems architectures and methods for extracting their full performance potential. In their book, Embedded Computing for High Performance, the authors offer a detailed look at the hardware and software used to meet growing performance requirements.
Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.
2.3 CORE-BASED ARCHITECTURAL ENHANCEMENTS
By João Cardoso, José Gabriel Coutinho, and Pedro Diniz
In addition to the recent multicore trend, CPU cores have also been enhanced to further support parallelism and specific types of computation. These enhancements include SIMD units, fused multiply-add (FMA) units, and support for multithreading.
2.3.1 SINGLE INSTRUCTION, MULTIPLE DATA UNITS
Single Instruction, Multiple Data (SIMD) units are hardware components that perform the same operation on multiple data operands concurrently. Typically, a SIMD unit receives two vectors as input (each holding a set of operands), applies the same operation to each pair of corresponding operands (one operand from each vector), and outputs a vector with the results. Fig. 2.10 illustrates a simple example of a SIMD unit executing four operations in parallel.
FIG. 2.10 Example of a SIMD unit, executing in this example four instances of the same operation (with different operands) in parallel.
Typically, the operations in SIMD units include basic arithmetic operations (such as addition, subtraction, multiplication, and negation) and other operations such as absolute value (abs) and square root (sqrt).
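The behavior described above can be sketched as a small software model in C. This is only an illustration, not real SIMD hardware or intrinsics: the type `vec4f`, the functions `vec4f_add` and `vec4f_abs`, and the lane count of four (chosen to mirror the four-operation example) are all hypothetical names and assumptions introduced here.

```c
#define LANES 4  /* assumption: four lanes, mirroring the four-operation example */

/* Hypothetical software model of a SIMD register holding LANES floats. */
typedef struct {
    float lane[LANES];
} vec4f;

/* Models one SIMD add: the same operation is applied to every pair of
 * corresponding lanes. Real hardware would do this in one instruction. */
vec4f vec4f_add(vec4f a, vec4f b)
{
    vec4f r;
    for (int i = 0; i < LANES; ++i)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Models a unary SIMD operation (absolute value) applied to every lane. */
vec4f vec4f_abs(vec4f a)
{
    vec4f r;
    for (int i = 0; i < LANES; ++i)
        r.lane[i] = (a.lane[i] < 0.0f) ? -a.lane[i] : a.lane[i];
    return r;
}
```

In this model, a call such as `vec4f_add(x, y)` stands in for a single SIMD instruction that produces all four sums at once; the loop inside the function exists only because plain C has no portable way to express the parallel lanes directly.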
Another factor contributing to the increased performance of SIMD units is that multiple data items can be loaded from, or stored to, memory simultaneously, exploiting the full width of the memory data bus. Fig. 2.11 depicts a simple illustrative example using SIMD units and vector processing. As can be seen in Fig. 2.11C, the code using SIMD units executes ¼ of the instructions in ¼ of the clock cycles when compared with code executing without SIMD units (Fig. 2.11B).
FIG. 2.11 Example of the use of a SIMD unit: (A) simple segment of code; (B) symbolic assembly without using SIMD support; (C) symbolic assembly considering SIMD support.
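The ¼ ratio can be made concrete with a short C sketch. The figure's code is not reproduced in this text, so the loop below assumes, purely for illustration, an element-wise addition of the form A[i] = B[i] + C[i]; the actual code in Fig. 2.11 may differ. The function names, the returned operation counters, and the four-lane width are hypothetical. The second function processes the arrays in groups of four, and each outer iteration stands in for one group of SIMD instructions (vector loads, one vector add, one vector store).

```c
#include <stddef.h>

#define LANES 4  /* assumption: the SIMD unit handles four floats at a time */

/* Scalar version: one add per element.
 * Returns the number of add operations issued. */
size_t add_scalar(float *A, const float *B, const float *C, size_t n)
{
    size_t ops = 0;
    for (size_t i = 0; i < n; ++i) {
        A[i] = B[i] + C[i];
        ++ops;
    }
    return ops;
}

/* Model of the SIMD version: each outer iteration represents ONE vector
 * add covering LANES consecutive elements. Elements beyond the last full
 * group of LANES are not processed here, so n is assumed to be a
 * multiple of LANES. Returns the number of (modeled) vector adds. */
size_t add_simd_model(float *A, const float *B, const float *C, size_t n)
{
    size_t ops = 0;
    for (size_t i = 0; i + LANES <= n; i += LANES) {
        for (size_t l = 0; l < LANES; ++l)   /* the LANES parallel lanes */
            A[i + l] = B[i + l] + C[i + l];
        ++ops;
    }
    return ops;
}
```

Under these assumptions, the modeled SIMD version issues one quarter as many add operations as the scalar version for the same arrays. Whether that translates into one quarter of the clock cycles on a given processor depends on its memory system and instruction latencies.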
To exploit SIMD units, it is very important to be able to combine multiple load or store accesses in a single SIMD instruction. This can be achieved when memory accesses are contiguous, e.g., in the presence of unit-stride accesses, and when array elements are aligned. In the previous example, arrays A, B, and C are accessed with unit stride, and we assume they are aligned (i.e., each element starts at the beginning of a word, so its memory address is a multiple of 4 in a byte-addressable 32-bit machine). When dealing with nonaligned arrays inside loops, it is common to align the addresses (and thus enable the use of SIMD instructions) by applying loop peeling transformations (see Chapter 5). In addition, to match the array dimensions to the SIMD vector lengths, compilers often apply partial loop unrolling. We provide more details about how to exploit vectorization in high-level descriptions in Chapter 6.
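A minimal sketch of loop peeling combined with partial unrolling, written by hand in C, might look as follows. It is not taken from the book and is not what any particular compiler emits. It assumes an element-wise addition A[i] = B[i] + C[i], a vector width of four floats (16 bytes), and that only the destination array A needs to be brought to a vector boundary; the `loop_stats` type and the `add_peeled` function are hypothetical names used for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 4
#define VEC_BYTES (LANES * sizeof(float))  /* assumption: 16-byte vectors */

/* Hypothetical bookkeeping: how many iterations each phase executed. */
typedef struct {
    size_t peeled;        /* scalar iterations before the aligned part */
    size_t vector_iters;  /* iterations of the unrolled-by-LANES loop  */
    size_t remainder;     /* scalar iterations after the aligned part  */
} loop_stats;

loop_stats add_peeled(float *A, const float *B, const float *C, size_t n)
{
    loop_stats s = {0, 0, 0};
    size_t i = 0;

    /* Loop peeling: run scalar iterations until &A[i] is a multiple of
     * VEC_BYTES, so the main loop starts on a vector boundary. */
    while (i < n && ((uintptr_t)&A[i] % VEC_BYTES) != 0) {
        A[i] = B[i] + C[i];
        ++i;
        ++s.peeled;
    }

    /* Main loop, partially unrolled by LANES: the four statements access
     * contiguous, unit-stride elements and are candidates for a single
     * SIMD load/add/store sequence. */
    for (; i + LANES <= n; i += LANES) {
        A[i]     = B[i]     + C[i];
        A[i + 1] = B[i + 1] + C[i + 1];
        A[i + 2] = B[i + 2] + C[i + 2];
        A[i + 3] = B[i + 3] + C[i + 3];
        ++s.vector_iters;
    }

    /* Remainder loop: leftover elements when the trip count is not a
     * multiple of LANES. */
    for (; i < n; ++i) {
        A[i] = B[i] + C[i];
        ++s.remainder;
    }
    return s;
}
```

In this sketch the peeled and remainder loops each execute fewer than four iterations when A is at least float-aligned, and the bulk of the work falls in the unrolled loop. The sketch aligns only the store address; B and C may still be misaligned relative to A, in which case a real implementation would need unaligned loads or a different strategy.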