High-performance embedded computing -- Parallelism and compiler optimization
Editor's Note: With the emergence of heterogeneous multicore processors for embedded systems, developers can take advantage of powerful platforms for executing complex algorithms moving from the cloud to IoT edge devices. To take full advantage of these architectures, however, developers need to understand the nature of code retargeting mechanisms to optimize resource utilization and algorithm execution.
This series on code retargeting complements a previous set of articles that explored advanced architectures for embedded computing -- both series excerpted from the book, Embedded Computing for High Performance. In this series, the authors discuss details of code retargeting, beginning with Part 1 on code retargeting mechanisms and continuing here with a discussion of parallel execution and compiler options.
Elsevier is offering this and other engineering books at a 30% discount. To use this discount, click here and use code ENGIN318 during checkout.
Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.
6.3 PARALLELISM AND COMPILER OPTIONS
By João Cardoso, José Gabriel Coutinho, and Pedro Diniz
While there have been considerable advances in compiler technology targeting CPU platforms, many of the optimizations performed automatically are limited to single-threaded execution, thus missing out vast amounts of computational capacity one can attain from multicore and distributed computing. As such, developers must manually exploit the architectural features of CPUs to attain the maximum possible performance to satisfy stringent application requirements, such as handling very large workloads.
6.3.1 PARALLEL EXECUTION OPPORTUNITIES
Fig. 6.2 illustrates the different parallelization opportunities for commodity CPU platforms. At the bottom of this figure, we have SIMD processing units available in each CPU core which simultaneously execute the same operation on multiple data elements, thus exploiting data-level parallelism. To enable this level of parallelism, compilers automatically convert scalar instructions into vector instructions in a process called auto-vectorization (Section 6.4). Data-level parallelism can also be enabled in the context of multicore architectures where multiple instances of a region of computations, each one processing a distinct set of data elements, can be executed in parallel using thread-level approaches such as OpenMP (Section 6.5). Next, we have task-level parallelism, which requires developers to explicitly define multiple concurrent regions of their application also using approaches such as OpenMP.
click for larger image
FIG. 6.2 Different levels of parallelism in CPU-based platforms: multiprocessor, multicore and SIMD processor units.
These concurrent regions of an application, implemented in OpenMP as threads, share the same address space and are scheduled by the operating system to the available CPU cores. Large workloads can be further partitioned to a multiprocessor platform with distributed memory, where each processor has its own private memory and address space, and communicate with each other using message-passing, such as MPI (Section 6.6). Finally, data movement is an important consideration to ensure data arrives at high enough rates to sustain a high computational throughput. For CPU platforms, one important source of optimization at this level is to leverage their hierarchical cache memory system (Section 6.7).
6.3.2 COMPILER OPTIONS
The most popular CPU compilers and tools supporting high-level descriptions (e.g., in C, C++) include GCC, LLVM, and ICC (Intel C/C++ compiler). These compilers support command-line options to control and guide the code generation process, for instance, by selecting a predefined sequence of compiler optimizations aimed at maximizing the performance of the application or alternatively reducing its size. As an example, the widely popular GNU GCC compiler includes 44 optimizations enabled using the flag –O1, an additional 43 if –O2 is used, and an additional 12 if –O3 is used (see Ref. ). Moreover, the programmer may explicitly invoke other transformations at the command-line compiler invocation. Examples of compilation options include:
Compiling for reduced program size (e.g., gcc –Os). Here the compiler tries to generate code with the fewest number of instructions as possible and without considering performance improvements. This might be important when targeting computing systems with strict storage requirements;
Compiling with a minimal set of optimizations, i.e., using only simple compiler passes (e.g., gcc –O0), which can assist debugging;
Compiling with optimizations that do not substantially affect the code size but in general improve performance (e.g., gcc –O1);
Compiling with a selected few optimizations, such that do not lead to numerical accuracy issues (e.g., gcc –O2);
Compiling with more aggressive optimizations, e.g., considering loop unrolling, function inlining, and loop vectorization (e.g., gcc –O3);
Compiling with more advanced optimizations, possibly with a significant impact on accuracy for certain codes where the sequence of arithmetic instructions may be reordered (e.g., gcc –Ofast);
Continue reading on page two >>