High-performance embedded computing -- Parallelism and compiler optimization

João Cardoso, José Gabriel Coutinho, and Pedro Diniz

April 24, 2018


Editor's Note: With the emergence of heterogeneous multicore processors for embedded systems, developers can take advantage of powerful platforms for executing complex algorithms moving from the cloud to IoT edge devices. To take full advantage of these architectures, however, developers need to understand the nature of code retargeting mechanisms to optimize resource utilization and algorithm execution. 

This series on code retargeting complements a previous set of articles that explored advanced architectures for embedded computing -- both series excerpted from the book, Embedded Computing for High Performance. In this series, the authors discuss details of code retargeting, beginning with Part 1 on code retargeting mechanisms and continuing here with a discussion of parallel execution and compiler options. 

Elsevier is offering this and other engineering books at a 30% discount. To use this discount, click here and use code ENGIN318 during checkout.

Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.



While there have been considerable advances in compiler technology targeting CPU platforms, many of the optimizations performed automatically are limited to single-threaded execution, thus missing out on the vast computational capacity available from multicore and distributed computing. As such, developers must manually exploit the architectural features of CPUs to attain the maximum possible performance and satisfy stringent application requirements, such as handling very large workloads.


Fig. 6.2 illustrates the different parallelization opportunities for commodity CPU platforms. At the bottom of this figure, we have SIMD processing units available in each CPU core which simultaneously execute the same operation on multiple data elements, thus exploiting data-level parallelism. To enable this level of parallelism, compilers automatically convert scalar instructions into vector instructions in a process called auto-vectorization (Section 6.4). Data-level parallelism can also be enabled in the context of multicore architectures where multiple instances of a region of computations, each one processing a distinct set of data elements, can be executed in parallel using thread-level approaches such as OpenMP (Section 6.5). Next, we have task-level parallelism, which requires developers to explicitly define multiple concurrent regions of their application also using approaches such as OpenMP.
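As a minimal sketch of the two forms of data-level parallelism described above, consider the same element-wise loop written once for SIMD execution and once for thread-level execution. The function names are illustrative and not taken from the book, and whether a given compiler actually vectorizes the first loop depends on its version and flags.

```c
#include <stddef.h>

/* Element-wise y[i] = a * x[i] + y[i]. The iterations are independent, so
   the compiler may map them to SIMD instructions (auto-vectorization).
   'restrict' promises that x and y do not overlap. */
void saxpy_simd(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Same computation, with the iterations additionally split across threads.
   The pragma takes effect only when compiling with -fopenmp; otherwise it
   is ignored and the loop runs sequentially. */
void saxpy_threads(size_t n, float a, const float *restrict x, float *restrict y)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        y[i] = a * x[i] + y[i];
}
```

Both functions compute the same result; only the mapping of iterations to hardware resources differs.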


FIG. 6.2 Different levels of parallelism in CPU-based platforms: multiprocessor, multicore and SIMD processor units.

These concurrent regions of an application, implemented in OpenMP as threads, share the same address space and are scheduled by the operating system onto the available CPU cores. Large workloads can be further partitioned across a multiprocessor platform with distributed memory, where each processor has its own private memory and address space and the processors communicate with each other through message passing, for example using MPI (Section 6.6). Finally, data movement is an important consideration: data must arrive at a rate high enough to sustain a high computational throughput. For CPU platforms, one important source of optimization at this level is to leverage their hierarchical cache memory system (Section 6.7).
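Task-level parallelism within a shared address space can be sketched with OpenMP sections, as shown next. The `stats_t` type and `compute_stats` function are hypothetical names chosen for illustration; the sketch assumes a non-empty input array and relies on the two tasks writing to different fields of the shared result.

```c
#include <stddef.h>

typedef struct { long sum; int max; } stats_t;

/* Two independent tasks over the same read-only array. With -fopenmp each
   section may run on a different thread; without it the pragmas are ignored
   and the two blocks simply run one after the other. */
stats_t compute_stats(const int *data, size_t n)
{
    stats_t s = { 0, data[0] };   /* assumes n >= 1 */

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            long sum = 0;
            for (size_t i = 0; i < n; i++)
                sum += data[i];
            s.sum = sum;          /* only this section writes s.sum */
        }
        #pragma omp section
        {
            int max = data[0];
            for (size_t i = 1; i < n; i++)
                if (data[i] > max)
                    max = data[i];
            s.max = max;          /* only this section writes s.max */
        }
    }
    return s;
}
```

Because both threads see the same `data` and `s`, no explicit communication is needed, in contrast to the message-passing model used across distributed-memory processors.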


The GCC compiler supports many compilation options controlled by specific flags at the command line. The “-E” option outputs the code after the preprocessing stage (i.e., after expanding macros and resolving preprocessor directives such as #define and #ifdef). The “-S” option outputs the generated assembly code as “.s” files. The “-fverbose-asm” flag annotates the assembly output with additional information.
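To see what these options produce, one can try them on a small file that uses macros and conditional compilation. The file name `scale.c` and the macro names below are hypothetical, chosen only for this sketch.

```c
/* scale.c (hypothetical file name)
 *
 *   gcc -E scale.c                  preprocessed source on standard output
 *   gcc -S scale.c                  assembly written to scale.s
 *   gcc -S -fverbose-asm scale.c    assembly annotated with extra comments
 */
#define SCALE_FACTOR 3
#define SCALE(x) ((x) * SCALE_FACTOR)

#ifdef USE_OFFSET            /* enable with: gcc -DUSE_OFFSET ... */
#define OFFSET 10
#else
#define OFFSET 0
#endif

int scale(int v)
{
    /* After preprocessing without -DUSE_OFFSET, this line reads:
       return ((v) * 3) + 0;                                       */
    return SCALE(v) + OFFSET;
}
```

Comparing the “-E” output with and without `-DUSE_OFFSET` shows directly which branch of the `#ifdef` reached the compiler proper.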

When optimizing math operations without the need to conservatively preserve the original precision/accuracy, one can use the flag “-funsafe-math-optimizations” (or “-Ofast”). When available, one can take advantage of FMA (fused multiply-add) instructions using the flag “-ffp-contract=fast” (or “-Ofast”).
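A dot product is a convenient kernel for experimenting with these flags, since each iteration is a multiply followed by an add. The sketch below uses a hypothetical file name `dot.c`; the `-mfma` flag in the comments assumes an x86 target with FMA support, and whether the compiler fuses or reorders operations depends on the target and compiler version.

```c
#include <stddef.h>

/* Dot product. Each iteration is a multiply followed by an add, which is the
   pattern an FMA instruction covers. The running sum also forms a chain of
   dependent floating-point additions, which the compiler may reorder only
   when value-changing optimizations are permitted.

   Possible invocations (dot.c is a hypothetical file name):
     gcc -O2 -mfma -ffp-contract=fast -c dot.c
     gcc -O2 -funsafe-math-optimizations -c dot.c
     gcc -Ofast -march=native -c dot.c
*/
double dot(const double *a, const double *b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
```

Results may differ in the last bits between these builds, because a fused multiply-add rounds once instead of twice and a reordered sum accumulates rounding errors differently.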

To target the instruction set of the machine on which the code is compiled, one can use “-march=native”. Alternatively, one can explicitly select a particular target, such as “-march=corei7-avx”, or enable a specific SIMD extension with flags such as -msse, -msse2, -msse3, -mssse3, -msse4, -msse4.1, -msse4.2, -mavx, and -mavx2.
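The target flags also determine which predefined macros the compiler sets, which code can test in order to select a SIMD path at compile time. The following sketch, with illustrative function names, uses AVX intrinsics only when `__AVX__` is defined (e.g., when compiling with -mavx, or with -march=native on a CPU that supports AVX) and otherwise falls back to a scalar loop.

```c
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>       /* x86 SIMD intrinsics */
#endif

/* out[i] = a[i] + b[i]. Uses 256-bit AVX additions (8 floats at a time)
   when the compiler was told that the target supports AVX. */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#ifdef __AVX__
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   /* unaligned loads */
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#endif
    for (; i < n; i++)       /* scalar path; also handles the remainder */
        out[i] = a[i] + b[i];
}

/* Reports which path was compiled in. */
const char *simd_path(void)
{
#ifdef __AVX__
    return "avx";
#else
    return "scalar";
#endif
}
```

A binary built this way with -mavx will not run on CPUs lacking AVX, which is the usual trade-off of selecting a specific target at compile time.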

In terms of reporting, developers can use flags such as “-fopt-info-optimized”, which outputs information about the optimizations that were applied.

In terms of specific loop and code optimizations, there are several flags to enable them. For example, “-floop-interchange” applies loop interchange, “-fivopts” performs induction variable optimizations (strength reduction, induction variable merging, and induction variable elimination) on trees.
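To illustrate what loop interchange does, the sketch below shows a loop nest before and after a manual interchange. The function names and the file name in the comment are hypothetical; whether GCC performs this transformation automatically depends on the compiler version and its heuristics, and the report flag can help check what was actually applied.

```c
/* Possible invocation (interchange.c is a hypothetical file name):
     gcc -O2 -floop-interchange -fopt-info-optimized -c interchange.c */
#define N 64

/* Column-first traversal of a row-major C array: consecutive iterations of
   the inner loop are a whole row apart in memory, which uses the cache
   poorly for large arrays. */
long sum_column_first(int m[N][N])
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    return sum;
}

/* The same loops interchanged by hand: the inner loop now walks contiguous
   memory. This is the form that loop interchange aims to produce. */
long sum_row_first(int m[N][N])
{
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}
```

The interchange is legal here because integer addition can be performed in any order without changing the result.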



The most popular CPU compilers and tools supporting high-level descriptions (e.g., in C, C++) include GCC, LLVM, and ICC (Intel C/C++ compiler). These compilers support command-line options to control and guide the code generation process, for instance, by selecting a predefined sequence of compiler optimizations aimed at maximizing the performance of the application or alternatively reducing its size. As an example, the widely popular GNU GCC compiler includes 44 optimizations enabled using the flag -O1, an additional 43 if -O2 is used, and an additional 12 if -O3 is used (see Ref. [4]). Moreover, the programmer may explicitly invoke other transformations at the command-line compiler invocation. Examples of compilation options include:

  • Compiling for reduced program size (e.g., gcc -Os). Here the compiler tries to generate code with the fewest instructions possible, without considering performance improvements. This might be important when targeting computing systems with strict storage requirements;

  • Compiling with a minimal set of optimizations, i.e., using only simple compiler passes (e.g., gcc -O0), which can assist debugging;

  • Compiling with optimizations that do not substantially affect the code size but in general improve performance (e.g., gcc -O1);

  • Compiling with a selected few optimizations that do not lead to numerical accuracy issues (e.g., gcc -O2);

  • Compiling with more aggressive optimizations, e.g., considering loop unrolling, function inlining, and loop vectorization (e.g., gcc -O3);

  • Compiling with more advanced optimizations, possibly with a significant impact on accuracy for certain codes where the sequence of arithmetic instructions may be reordered (e.g., gcc -Ofast).
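When comparing builds produced with the options listed above, it can help to record which kind of build a binary came from. GCC predefines a few macros that reflect the optimization options in use; the sketch below, with a hypothetical function name, reports them. Note that these macros do not distinguish between -O1, -O2, and -O3.

```c
/* Macros predefined by GCC depending on the options used:
     __OPTIMIZE__        any optimization level above -O0
     __OPTIMIZE_SIZE__   optimizing for size (-Os)
     __FAST_MATH__       -ffast-math, which -Ofast enables           */
const char *build_profile(void)
{
#if defined(__FAST_MATH__)
    return "fast-math";
#elif defined(__OPTIMIZE_SIZE__)
    return "size";
#elif defined(__OPTIMIZE__)
    return "optimized";
#else
    return "unoptimized";
#endif
}
```

Printing this string alongside timing or code-size measurements makes it harder to mix up results from different builds.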


The GCC compiler includes a number of loop transformations such as “-floop-interchange,” “-floop-strip-mine,” “-floop-block,” “-ftree-loop-distribution,” and “-floop-unroll-and-jam” (supported via the Graphite framework). For some of the compiler flags, it is possible to set parameters that control the internal loop transformation heuristics. For example, the parameter “max-unroll-times=n” (passed to gcc after the “--param” option) sets the maximum number of times (n) that a single loop may be unrolled. To force loop unrolling, one can use the flag “-funroll-loops” and use parameters such as “max-unroll-times,” “max-unrolled-insns,” and “max-average-unrolled-insns” to control this process.
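The effect of loop unrolling can be sketched by writing the unrolled form by hand. In the example below, the function names and the file name in the comment are hypothetical, and the unroll factor of 4 is an arbitrary choice; the code the compiler actually generates depends on its heuristics and on the parameters mentioned above.

```c
#include <stddef.h>

/* Possible invocation (sum.c is a hypothetical file name):
     gcc -O2 -funroll-loops --param max-unroll-times=4 -c sum.c */

/* Original loop: one addition and one loop test per element. */
long sum_rolled(const int *v, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Hand-unrolled by 4: one loop test per four elements, followed by a
   remainder loop for the last n % 4 elements. */
long sum_unrolled4(const int *v, size_t n)
{
    long sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum += v[i];
        sum += v[i + 1];
        sum += v[i + 2];
        sum += v[i + 3];
    }
    for (; i < n; i++)
        sum += v[i];
    return sum;
}
```

Unrolling reduces loop overhead and exposes more instructions to the scheduler, at the cost of larger code, which is why the parameters bound how far the compiler may go.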

