Editor's Note: With the emergence of heterogeneous multicore processors for embedded systems, developers can take advantage of powerful platforms for executing complex algorithms moving from the cloud to IoT edge devices. To take full advantage of these architectures, however, developers need to understand the nature of code retargeting mechanisms to optimize resource utilization and algorithm execution.
This series on code retargeting complements a previous set of articles that explored advanced architectures for embedded computing — both series excerpted from the book, Embedded Computing for High Performance. In this series, the authors discuss details of code retargeting, beginning with Part 1 on code retargeting mechanisms and continuing here with a discussion of parallel execution and compiler options.
Elsevier is offering this and other engineering books at a 30% discount. To use this discount, click here and use code ENGIN318 during checkout.
Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.
6.3 PARALLELISM AND COMPILER OPTIONS
By João Cardoso, José Gabriel Coutinho, and Pedro Diniz
While there have been considerable advances in compiler technology targeting CPU platforms, many of the optimizations performed automatically are limited to single-threaded execution, thus missing out vast amounts of computational capacity one can attain from multicore and distributed computing. As such, developers must manually exploit the architectural features of CPUs to attain the maximum possible performance to satisfy stringent application requirements, such as handling very large workloads.
6.3.1 PARALLEL EXECUTION OPPORTUNITIES
Fig. 6.2 illustrates the different parallelization opportunities for commodity CPU platforms. At the bottom of this figure, we have SIMD processing units available in each CPU core which simultaneously execute the same operation on multiple data elements, thus exploiting data-level parallelism. To enable this level of parallelism, compilers automatically convert scalar instructions into vector instructions in a process called auto-vectorization (Section 6.4). Data-level parallelism can also be enabled in the context of multicore architectures where multiple instances of a region of computations, each one processing a distinct set of data elements, can be executed in parallel using thread-level approaches such as OpenMP (Section 6.5). Next, we have task-level parallelism, which requires developers to explicitly define multiple concurrent regions of their application also using approaches such as OpenMP.
click for larger image
FIG. 6.2 Different levels of parallelism in CPU-based platforms: multiprocessor, multicore and SIMD processor units.
These concurrent regions of an application, implemented in OpenMP as threads, share the same address space and are scheduled by the operating system to the available CPU cores. Large workloads can be further partitioned to a multiprocessor platform with distributed memory, where each processor has its own private memory and address space, and communicate with each other using message-passing, such as MPI (Section 6.6). Finally, data movement is an important consideration to ensure data arrives at high enough rates to sustain a high computational throughput. For CPU platforms, one important source of optimization at this level is to leverage their hierarchical cache memory system (Section 6.7).
6.3.2 COMPILER OPTIONS
The most popular CPU compilers and tools supporting high-level descriptions (e.g., in C, C++) include GCC, LLVM, and ICC (Intel C/C++ compiler). These compilers support command-line options to control and guide the code generation process, for instance, by selecting a predefined sequence of compiler optimizations aimed at maximizing the performance of the application or alternatively reducing its size. As an example, the widely popular GNU GCC compiler includes 44 optimizations enabled using the flag –O1, an additional 43 if –O2 is used, and an additional 12 if –O3 is used (see Ref. ). Moreover, the programmer may explicitly invoke other transformations at the command-line compiler invocation. Examples of compilation options include:
Compiling for reduced program size (e.g., gcc –Os). Here the compiler tries to generate code with the fewest number of instructions as possible and without considering performance improvements. This might be important when targeting computing systems with strict storage requirements;
Compiling with a minimal set of optimizations, i.e., using only simple compiler passes (e.g., gcc –O0), which can assist debugging;
Compiling with optimizations that do not substantially affect the code size but in general improve performance (e.g., gcc –O1);
Compiling with a selected few optimizations, such that do not lead to numerical accuracy issues (e.g., gcc –O2);
Compiling with more aggressive optimizations, e.g., considering loop unrolling, function inlining, and loop vectorization (e.g., gcc –O3);
Compiling with more advanced optimizations, possibly with a significant impact on accuracy for certain codes where the sequence of arithmetic instructions may be reordered (e.g., gcc –Ofast);
Continue reading on page two >>
Overall, these compilation options offer developers a simple mechanism to find suitable trade-offs between code size and performance. However, such mechanisms are not extended to support compiler options aiming at minimizing power and/or energy consumption. Still, optimizations aimed at reducing the total amount of work tend to minimize both energy consumption and execution time.
Compilers also allow users to select specific compilation options from a vast repertory of flags, often controlled using their command-line interface. The order in which the transformations associated with these flags are applied is not, however, typically controlled by the user (i.e., users may select some of the compilation phases but not their ordering). Common compiler flags control the following compiler functionality:
Function inlining and heuristics to control function inlining (e.g., maximum number of statements and maximum number of instructions of the function to be inlined)—in gcc, passed as arguments using the “-param” command-line option.
Loop unrolling and heuristics to control loop unrolling, such as the unrolling factor.
Loop vectorization and the SIMD unit to be targeted (this depends on backward compatibility and is mostly used when targeting 86 architectures).
Math options which may result in accuracy loss. The associated compiler flags may also be needed to enable the use of FMA units.
Target a specific machine which may improve results but may lead to code incompatible with other machines. We note that it is common for these compiler flags to also require the internal execution of specific compiler analyses and preparation phases, as well as their corresponding phase orderings. The number of available compiler flags is typically in the order of hundreds, and there are several options to report analysis information on the use of compiler optimizations (e.g., report on attempts to auto-vectorize loops in the application code). Usually, most of these flags can also be activated with the use of source code directives (e.g., using pragmas in the case of the C programming language). Note that each compiler supports its own set of pragmas for controlling the compiler options.
6.3.3 COMPILER PHASE SELECTION AND ORDERING
The problem of selecting a set of optimizations and a sequence of such optimizations is known as phase selection and phase ordering, respectively, and has been a long-standing research problem in the compiler community. While some combinations of transformations are clearly synergistic (such as constant propagation followed by algebraic simplification and strength reduction), the feasibility and profitability of other transformation sequences are far less obvious since many transformations adversely interact. Not surprisingly, there has been an extensive body of research in this area with the use of optimization techniques, ranging from generalized optimization  to machine learning [6,7] and genetic algorithms [8,9], as well as random-based search. Mainly, the specialization of phase ordering, taking into account specific function properties and/or application domain, may provide significant improvements in performance, power, and energy, when compared to the best achievements using general predefined optimization sequences provided by –Ox options and compiler flags.
Very few compilers allow developers to define a specific ordering of its phases. An example of a compiler without user’s support for specifying a specific phase ordering is the GCC compiler. The ordering can be, however, specified by modifying the compiler itself, but this requires general expertize about the inner works of a compiler, the compiler source code, and deep knowledge about target compiler architecture. In contrast, the LLVM compiler infrastructure  allows developers to specify the ordering of the compiler optimizations in addition to the predefined optimization sequence options.
This excerpt ends with this section, but this chapter in the book continues with discussions on loop vectorization, shared and distributed memory approaches, cache optimization, and more.
Reprinted with permission from Elsevier/Morgan Kaufmann, Copyright © 2017
João Manuel Paiva Cardoso , Associate Professor, Department of Informatics Engineering (DEI), Faculty of Engineering, University of Porto, Portugal. Previously I was Assistant Professor in the Department of Computer Science and Engineering, Instituto Superior Técnico (IST), Technical University of Lisbon (UTL), in Lisbon (April 4, 2006- Sept. 3, 2008), and Assistant Professor (2001-2006) in the Department of Electronics and Informatics Engineering (DEEI), Faculty of Sciences and Technology, at the University of Algarve, and Teaching Assistant in the same university (1993-2001). I have been a senior researcher at INESC-ID (Systems and Computer Engineering Institute) in Lisbon. I was member of INESC-ID from 1994 to 2009.
José Gabriel de Figueiredo Coutinho , Research Associate, Imperial College. He is involved in the EU FP7 HARNESS project to intergrate heterogeneous hardware and network technologies into data centre platforms, to vastly increase performance, reduce energy consumption, and lower cost profiles for important and high-value cloud applications such as real-time business analytics and the geosciences. His research interests include database functionality on heterogeneous systems, cloud computing resource management, and performance-driven mapping strategies.
Pedro C. Diniz received his M.Sc. in Electrical and Computer Engineering from the Technical University in Lisbon, Portugal and his Ph.D. from the University of California, Santa Barbara in Computer Science in 1997. Since 1997 he has been a researcher with the University of Southern California’s Information Sciences Institute (USC/ISI) and an Assistant Professor of Computer Science at the University of Southern California in Los Angeles, California. He has lead and participated in many research projects funded by the U.S. government and the European Union (UE) and has authored or co-authored many internationally recognized scientific journal papers and over 100 international conference papers. Over the years he has been heavily involved in the scientific community in the area of high-performance computing, reconfigurable and field-programmable computing.