High-performance embedded computing - Parallelism and compiler optimization - Embedded.com


Editor's Note: With the emergence of heterogeneous multicore processors for embedded systems, developers can take advantage of powerful platforms for executing the complex algorithms that are moving from the cloud to IoT edge devices. To take full advantage of these architectures, however, developers need to understand the nature of code retargeting mechanisms to optimize resource utilization and algorithm execution.

This series on code retargeting complements a previous set of articles that explored advanced architectures for embedded computing — both series excerpted from the book, Embedded Computing for High Performance.  In this series, the authors discuss details of code retargeting, beginning with Part 1 on code retargeting mechanisms and continuing here with a discussion of parallel execution and compiler options. 

Elsevier is offering this and other engineering books at a 30% discount. To use this discount, enter the code ENGIN318 during checkout.

Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.

By João Cardoso, José Gabriel Coutinho, and Pedro Diniz

While there have been considerable advances in compiler technology targeting CPU platforms, many of the optimizations performed automatically are limited to single-threaded execution, thus missing out on the vast computational capacity attainable from multicore and distributed computing. As such, developers must manually exploit the architectural features of CPUs to attain the maximum possible performance and satisfy stringent application requirements, such as handling very large workloads.


Fig. 6.2 illustrates the different parallelization opportunities for commodity CPU platforms. At the bottom of this figure, we have the SIMD processing units available in each CPU core, which simultaneously execute the same operation on multiple data elements, thus exploiting data-level parallelism. To enable this level of parallelism, compilers automatically convert scalar instructions into vector instructions in a process called auto-vectorization (Section 6.4). Data-level parallelism can also be enabled in the context of multicore architectures, where multiple instances of a region of computation, each one processing a distinct set of data elements, can be executed in parallel using thread-level approaches such as OpenMP (Section 6.5). Next, we have task-level parallelism, which requires developers to explicitly define multiple concurrent regions of their application, also using approaches such as OpenMP.


FIG. 6.2 Different levels of parallelism in CPU-based platforms: multiprocessor, multicore and SIMD processor units.

These concurrent regions of an application, implemented in OpenMP as threads, share the same address space and are scheduled by the operating system onto the available CPU cores. Large workloads can be further partitioned across a multiprocessor platform with distributed memory, where each processor has its own private memory and address space and communicates with the others using message passing, such as MPI (Section 6.6). Finally, data movement is an important consideration to ensure data arrives at rates high enough to sustain a high computational throughput. For CPU platforms, one important source of optimization at this level is leveraging their hierarchical cache memory system (Section 6.7).


The GCC compiler supports many compilation options [a] controlled by specific flags at the command line. The “-E” option outputs the code after the preprocessing stage (i.e., after applying preprocessor directives and expanding macros such as #define and #ifdef). The “-S” option outputs the generated assembly code as “.s” files. The “-fverbose-asm” flag outputs the assembly code with additional annotations.

When dealing with optimizations of math operations (and accepting that the original precision/accuracy may not be preserved), one can exercise the flag “-funsafe-math-optimizations” (or “-Ofast”). When available, one can take advantage of FMA (fused multiply-add) instructions using the flag “-ffp-contract=fast” (or “-Ofast”).

When targeting native CPU instructions, one can use “-march=native,” explicitly define a particular target such as “-march=corei7-avx,” or select a specific SIMD unit with flags such as -msse, -msse2, -msse3, -mssse3, -msse4, -msse4.1, -msse4.2, -mavx, and -mavx2.

In terms of reporting, developers can use flags such as “-fopt-info-optimized,” which outputs information about the applied optimizations.

In terms of specific loop and code optimizations, there are several flags to enable them individually. For example, “-floop-interchange” applies loop interchange, while “-fivopts” performs induction variable optimizations (strength reduction, induction variable merging, and induction variable elimination) on trees.

[a] https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html.


The most popular CPU compilers and tools supporting high-level descriptions (e.g., in C, C++) include GCC, LLVM, and ICC (the Intel C/C++ compiler). These compilers support command-line options to control and guide the code generation process, for instance, by selecting a predefined sequence of compiler optimizations aimed at maximizing the performance of the application or, alternatively, at reducing its size. As an example, the widely used GNU GCC compiler includes 44 optimizations enabled by the flag –O1, an additional 43 if –O2 is used, and an additional 12 if –O3 is used (see Ref. [4]). Moreover, the programmer may explicitly invoke other transformations when invoking the compiler at the command line. Examples of compilation options include:

  • Compiling for reduced program size (e.g., gcc –Os). Here the compiler tries to generate code with the fewest instructions possible, without considering performance improvements. This can be important when targeting computing systems with strict storage requirements;

  • Compiling with a minimal set of optimizations, i.e., using only simple compiler passes (e.g., gcc –O0), which can assist debugging;

  • Compiling with optimizations that do not substantially affect the code size but in general improve performance (e.g., gcc –O1);

  • Compiling with a select set of optimizations that do not lead to numerical accuracy issues (e.g., gcc –O2);

  • Compiling with more aggressive optimizations, e.g., considering loop unrolling, function inlining, and loop vectorization (e.g., gcc –O3);

  • Compiling with more advanced optimizations, possibly with a significant impact on accuracy for certain codes where the sequence of arithmetic instructions may be reordered (e.g., gcc –Ofast).


The GCC compiler includes a number of loop transformations [a] such as “-floop-interchange,” “-floop-strip-mine,” “-floop-block,” “-ftree-loop-distribution,” and “-floop-unroll-and-jam” (supported via the Graphite framework [b]). For some of the compiler flags, it is possible to define the value of parameters that control internal loop transformation heuristics. For example, the parameter “max-unroll-times=n” (passed to gcc following the “--param” option) sets the maximum number (n) of times a single loop is unrolled. To force loop unrolling, one can use the flag “-funroll-loops” and use parameters such as “max-unroll-times,” “max-unrolled-insns,” and “max-average-unrolled-insns” to control this process.

[a] https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html.
[b] https://gcc.gnu.org/wiki/Graphite.



It is important that compilers report information about the effectiveness of their transformations so that developers can decide which transformations to use and tune them via the parameters they support. To this effect, GCC [a] supports reporting flags. To output the intermediate representation (IR) for each function after applying loop vectorization, one can use the flag “-fdump-tree-vect-blocks=foo.dump.” The “-fdump” flag has other useful options such as “-fdump-tree-pre” and “-fdump-tree-pre-blocks,” whereas the “-fopt-info” flag has various options such as “-fopt-info-vec-all,” which reports detailed information after applying vectorization optimizations. With respect to loop vectorization information, one can use the flag “-ftree-vectorizer-verbose=n,” where n (ranging from 1 to 7) represents the level of detail to be reported (a higher value of n means that more details are reported).

[a] https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html.

Overall, these compilation options offer developers a simple mechanism to find suitable trade-offs between code size and performance. However, such mechanisms have not been extended to compiler options aimed at minimizing power and/or energy consumption. Still, optimizations that reduce the total amount of work tend to minimize both energy consumption and execution time.

Compilers also allow users to select specific compilation options from a vast repertoire of flags, often controlled using their command-line interface. The order in which the transformations associated with these flags are applied is not, however, typically controlled by the user (i.e., users may select some of the compilation phases but not their ordering). Common compiler flags control the following compiler functionality:

  • Function inlining and heuristics to control function inlining (e.g., the maximum number of statements and maximum number of instructions of the function to be inlined)—in gcc, passed as arguments using the “--param” command-line option.

  • Loop unrolling and heuristics to control loop unrolling, such as the unrolling factor.

  • Loop vectorization and the SIMD unit to be targeted (this depends on backward compatibility and is mostly used when targeting x86 architectures).

  • Math optimizations which may result in accuracy loss. The associated compiler flags may also be needed to enable the use of FMA units.

  • Targeting a specific machine, which may improve results but may lead to code incompatible with other machines.

We note that it is common for these compiler flags to also require the internal execution of specific compiler analyses and preparation phases, as well as their corresponding phase orderings. The number of available compiler flags is typically on the order of hundreds, and there are several options to report analysis information on the use of compiler optimizations (e.g., reports on attempts to auto-vectorize loops in the application code). Usually, most of these flags can also be activated through source code directives (e.g., pragmas in the case of the C programming language). Note that each compiler supports its own set of pragmas for controlling compiler options.


The problem of selecting a set of optimizations and the order in which to apply them is known as phase selection and phase ordering, respectively, and has been a long-standing research problem in the compiler community. While some combinations of transformations are clearly synergistic (such as constant propagation followed by algebraic simplification and strength reduction), the feasibility and profitability of other transformation sequences are far less obvious, since many transformations interact adversely. Not surprisingly, there has been an extensive body of research in this area using optimization techniques ranging from generalized optimization [5] to machine learning [6,7] and genetic algorithms [8,9], as well as random-based search. In particular, specializing the phase ordering to specific function properties and/or application domains may provide significant improvements in performance, power, and energy when compared to the best results attainable using the general predefined optimization sequences provided by the –Ox options and compiler flags.

Very few compilers allow developers to define a specific ordering of their phases. The GCC compiler, for example, offers no user-level support for specifying a particular phase ordering. An ordering can, however, be imposed by modifying the compiler itself, but this requires general expertise about the inner workings of a compiler, access to its source code, and deep knowledge of the target compiler architecture. In contrast, the LLVM compiler infrastructure [10] allows developers to specify the ordering of the compiler optimizations in addition to the predefined optimization sequence options.


The LLVM compiler infrastructure [10] [a] provides a tool named opt [b] that receives as input a sequence of optimizations [c] and applies that sequence to the LLVM IR (e.g., stored in a file using the bitcode [d] format, the binary format used by LLVM to store the IR, which can be generated by the clang frontend). This IR can then be translated to assembly code using the llc tool [e].

An example of a compiler sequence (phase order) that can be applied with opt is presented below. In this example, we consider a C program named x.c. The clang tool translates this program to the LLVM representation which is then passed to the opt tool. The opt tool then applies a sequence of compiler optimizations specified through the command line to generate an LLVM IR file. Finally, llc translates this LLVM IR file to assembly code, which is subsequently translated to an executable by gcc.

[a] The LLVM compiler infrastructure, http://llvm.org/.
[b] opt - LLVM optimizer, http://llvm.org/docs/CommandGuide/opt.html.
[c] LLVM's analysis and transform passes, http://llvm.org/docs/Passes.html.
[d] LLVM bitcode file format, http://llvm.org/docs/BitCodeFormat.html.
[e] llc - LLVM static compiler, http://llvm.org/docs/CommandGuide/llc.html.

This excerpt ends with this section, but this chapter in the book continues with discussions on loop vectorization, shared and distributed memory approaches, cache optimization, and more. 

Reprinted with permission from Elsevier/Morgan Kaufmann, Copyright © 2017

João Manuel Paiva Cardoso, Associate Professor, Department of Informatics Engineering (DEI), Faculty of Engineering, University of Porto, Portugal. He was previously Assistant Professor in the Department of Computer Science and Engineering, Instituto Superior Técnico (IST), Technical University of Lisbon (UTL), in Lisbon (April 4, 2006-Sept. 3, 2008), Assistant Professor (2001-2006) in the Department of Electronics and Informatics Engineering (DEEI), Faculty of Sciences and Technology, University of Algarve, and Teaching Assistant at the same university (1993-2001). He was a senior researcher at INESC-ID (Systems and Computer Engineering Institute) in Lisbon and a member of INESC-ID from 1994 to 2009.

José Gabriel de Figueiredo Coutinho, Research Associate, Imperial College. He is involved in the EU FP7 HARNESS project to integrate heterogeneous hardware and network technologies into data centre platforms, to vastly increase performance, reduce energy consumption, and lower cost profiles for important and high-value cloud applications such as real-time business analytics and the geosciences. His research interests include database functionality on heterogeneous systems, cloud computing resource management, and performance-driven mapping strategies.

Pedro C. Diniz received his M.Sc. in Electrical and Computer Engineering from the Technical University of Lisbon, Portugal, and his Ph.D. in Computer Science from the University of California, Santa Barbara, in 1997. Since 1997 he has been a researcher with the University of Southern California's Information Sciences Institute (USC/ISI) and an Assistant Professor of Computer Science at the University of Southern California in Los Angeles, California. He has led and participated in many research projects funded by the U.S. government and the European Union (EU) and has authored or co-authored many internationally recognized scientific journal papers and over 100 international conference papers. Over the years he has been heavily involved in the scientific community in the area of high-performance computing and reconfigurable and field-programmable computing.
