Editor's Note: With the emergence of heterogeneous multicore processors for embedded systems, developers can take advantage of powerful platforms for executing complex algorithms moving from the cloud to IoT edge devices. To take full advantage of these architectures, however, developers need to understand the nature of code retargeting mechanisms to optimize resource utilization and algorithm execution.
This series on code retargeting complements a previous set of articles that explored advanced architectures for embedded computing — both series excerpted from the book, Embedded Computing for High Performance. In this series, the authors discuss details of code retargeting, beginning with this installment on code retargeting mechanisms.
Elsevier is offering this and other engineering books at a 30% discount. To use this discount, click here and use code ENGIN318 during checkout.
Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.
By João Cardoso, José Gabriel Coutinho, and Pedro Diniz
In this chapter, we focus exclusively on code retargeting issues in the context of CPU-based architectures and where the application source code is optimized to execute more efficiently on a given platform. In Chapter 7, we extend the topic of code retargeting to cover heterogeneous platforms, such as GPUs and FPGAs.
Here, we assume as input the application’s source code, written in a high-level language such as C/C++, possibly even translated from higher levels of abstraction as described in Chapter 3. This source code captures the algorithmic behavior of the application while being independent of the target platform. As a first step, this code version is compiled and executed on a CPU platform, allowing developers to test the correctness of the application with respect to its specification.
However, as computer architectures are becoming increasingly complex, with more processing cores, more heterogeneous and more distributed, applications compiled for existing machines will not fully utilize the target platform resources, even when binary compatibility is guaranteed, and thus do not run as efficiently as they could. In particular, modern-day computing platforms (see Chapter 2 for a brief overview) support many independent computational units to allow parallel computations, contain specialized architectural features to improve concurrency and computational efficiency, support hierarchical memories to increase data locality and reduce memory latency, and include fast interconnects and buses to reduce data movement overhead. Hence, to maximize performance and efficiency it is imperative to leverage all the underlying features of a computing platform.
While imperative languages, such as C and C++, allow developers to write portable applications using high-level abstractions, the complexity of modern computing platforms makes it hard for compilers to derive optimized and efficient code for each platform architecture. [Note: In this chapter, as with most of the literature, we use the term “optimization” loosely as a synonym for “performance improvement” as in general (and excluding trivial computations) program optimization is an undecidable problem at compile time (or design time).] Retargetable compilers do exist, but they typically limit themselves to generate code for different CPU instruction set architectures. This limitation forces developers to manually optimize and refine their applications to support the target computation platform. This process requires considerable expertise to port the application, including applying code optimization strategies based on best practices, understanding available retargeting mechanisms, and deploying and testing the application. The resulting code, while being more efficient, often becomes difficult to maintain since it becomes polluted with artifacts used for optimization (e.g., with new API calls and data structures, code transformations). For this reason, developers often keep two versions of their source code: a functional version used to verify correctness and an optimized version which runs more efficiently on a specific platform.
This chapter is organized as follows. In Section 6.2, we provide a brief overview of common retargeting mechanisms. Section 6.3 briefly describes parallelism opportunities in CPU-based platforms and compiler options, including phase selection and ordering. Section 6.4 focuses on loop vectorization to maximize single-threaded execution. Section 6.5 covers multithreading on shared memory multicore architectures, and Section 6.6 describes how to leverage platforms with multiprocessors using distributed memory. Section 6.7 explains CPU cache optimizations. Section 6.8 presents examples of LARA strategies related to code retargeting. The remaining sections provide further reading on the topics presented in this chapter.
To illustrate the various retargeting mechanisms described in the following sections, we consider the multiplication of two matrices A (nX rows and nY columns) and B (nY rows and nZ columns) producing a matrix C (nX rows and nZ columns), generically described as follows:
This matrix product computation is a key operation in many algorithms and can be potentially time consuming for very large matrix size inputs. Although several optimized algorithms have been proposed for the general matrix multiplication operation, we focus, as a starting point, on a simple implementation in C, described as follows:
To compute each element of matrix C (see Fig. 6.1), the algorithm iterates over row i of matrix A and column j of matrix B , multiplying in a pair-wise fashion the elements from both matrices, and then adding the resulting products to obtain C ij .
FIG. 6.1 Matrix multiplication kernel.
Although this is a simple example, it provides a rich set of retargeting and optimization opportunities (clearly not exhaustively) on different platforms. Yet, its use is not meant to provide the most effective implementation for each target system. In practice, we recommend using domain-specific libraries, such as BLAS , whenever highly optimized implementations are sought.
6.2 RETARGETING MECHANISMS
Developers have at their disposal different mechanisms to optimize and fine-tune their applications for a given computing platform. Given the complexity of today’s platforms and the lack of design tools to fully automate the optimization process, there is often a trade-off between how much effort an expert developer spends optimizing their application and how portable the optimized code becomes. In this context we can identify the following retargeting mechanisms:
Compiler control options. This mechanism controls the code optimization process by configuring the platform’s toolchain options. For instance, some options select the target processor architecture (e.g., 32- or 64-bit), others select the use of a specific set of instructions (e.g., SSE, SSE2, AVX, AVX2), or to set the level of optimization (favoring, for instance, faster generated code over code size). Additional options enable language features and language standards (including libraries). In general, this mechanism requires the least amount of developer effort, but it is also the least effective in fully exploiting the capabilities of modern-day computation platforms given the inherent limitations of today’s compilers.
Code transformations. Programmers often need to manually rewrite their code to adopt practices and code styles specific to the platform’s toolchain and runtime system. This is the case, for instance, with the use of hardware compilation tools for application-specific architectures on FPGAs. These tools typically impose restrictions on the use of certain programming language constructs and styles (e.g., arrays instead of pointers), as otherwise they can generate inefficient hardware designs. There are also transformations (see Chapter 5) one can apply systematically to optimize the application on a hardware platform while guaranteeing compilation correctness with respect to the original code. Even when performing automatic code transformations by a compiler, developers might need to identify a sequence of code transformations with the correct parametrization (e.g., loop unroll factors) to derive implementations complying with given nonfunctional requirements. There are instances, however, when developers must completely change the algorithm to better adapt to a platform, for example, by replacing a recursive algorithm with an iterative stream-based version when targeting an FPGA-based accelerator.
Code annotations. Developers can annotate their programs using comments or directives (e.g., pragmas in C/C++) to support a programming model or to guide a compiler optimization process. Code annotations are not invasive and usually do not require changing the code logic. Nonetheless, this method has been successfully used by developers to parallelize their sequential applications by identifying all the concurrent regions of their code. Such approaches include directive-driven programming models as is the example of OpenMP . The code annotation mechanism has the advantage to allow programs to preserve a high degree of portability since code annotations can be easily removed or ignored if not activated or even supported.
Domain-specific libraries. Libraries are often employed to extend the compiler capabilities and generate efficient code. In contrast with code annotations, integrating a library into an application often requires considerable changes in the application source code, including introducing new data structures and replacing existing code with library calls. Hence, portability can become an issue if the library is not widely supported on different platforms. Still, libraries can provide an abstraction layer for an application domain, allowing developers to write applications using the library interface without having to provide implementation details. This way, applications can be ported with little effort to a different platform as long as an optimized library implementation is available for that platform, thus also providing some degree of performance portability. An example of a portable application domain library is ATLAS  for linear algebra, which is available for many platforms. In addition, libraries can support a programming model for a specific computation platform, allowing the generation of efficient code. The MPI library , for instance, supports platforms with distributed memory.
Domain-specific languages. Another mechanism for retargeting applications involves rewriting parts of the source code using a domain-specific language (DSL)  to support an application domain or a specific type of platform. DSLs can be embedded within the host language (e.g., C++, Scala), often leading to “cleaner” code when compared to code that results from the use of domain-specific libraries. Alternatively, DSLs can be realized using a separate language and/or programming model (e.g., OpenCL) than the one employed by the host. In this case, the DSL requires its own toolchain support. Other widely used approaches to assist code retargeting include: compiler auto-vectorization (in the context of CPU SIMD units), OpenMP (in the context of parallelization using shared memory architectures), OpenCL (in the context of CPUs, GPUs, and FPGAs), MPI (in the context of distributed memory architectures), and High-Level Synthesis (HLS) tools to translate C programs to reconfigurable hardware provided by FPGAs.
Table 6.1 summarizes these approaches and the retargeting mechanisms used by each of them: the auto-vectorization process is mainly driven by code transformations—manual or automatic—to infer CPU vector instructions; OpenMP requires code annotations and library calls to parallelize code targeting multicore/processor platforms with shared memory, and with the latest OpenMP specification the support for vectorization; MPI requires library calls to query the environment and trigger communication between nodes, and code must be restructured to support the MPI model; OpenCL requires modifying the host code to query the platform and configure the devices for execution and translating hotspots into OpenCL kernels; HLS requires the use of directives and code transformations to make the source code more amenable to FPGA acceleration by the HLS tool, and the host code must be modified to interface to the hardware accelerator.
The next installment in this series discusses parallelism and compilers.
Reprinted with permission from Elsevier/Morgan Kaufmann, Copyright © 2017
João Manuel Paiva Cardoso , Associate Professor, Department of Informatics Engineering (DEI), Faculty of Engineering, University of Porto, Portugal. Previously I was Assistant Professor in the Department of Computer Science and Engineering, Instituto Superior Técnico (IST), Technical University of Lisbon (UTL), in Lisbon (April 4, 2006- Sept. 3, 2008), and Assistant Professor (2001-2006) in the Department of Electronics and Informatics Engineering (DEEI), Faculty of Sciences and Technology, at the University of Algarve, and Teaching Assistant in the same university (1993-2001). I have been a senior researcher at INESC-ID (Systems and Computer Engineering Institute) in Lisbon. I was member of INESC-ID from 1994 to 2009.
José Gabriel de Figueiredo Coutinho , Research Associate, Imperial College. He is involved in the EU FP7 HARNESS project to intergrate heterogeneous hardware and network technologies into data centre platforms, to vastly increase performance, reduce energy consumption, and lower cost profiles for important and high-value cloud applications such as real-time business analytics and the geosciences. His research interests include database functionality on heterogeneous systems, cloud computing resource management, and performance-driven mapping strategies.
Pedro C. Diniz received his M.Sc. in Electrical and Computer Engineering from the Technical University in Lisbon, Portugal and his Ph.D. from the University of California, Santa Barbara in Computer Science in 1997. Since 1997 he has been a researcher with the University of Southern California’s Information Sciences Institute (USC/ISI) and an Assistant Professor of Computer Science at the University of Southern California in Los Angeles, California. He has lead and participated in many research projects funded by the U.S. government and the European Union (UE) and has authored or co-authored many internationally recognized scientific journal papers and over 100 international conference papers. Over the years he has been heavily involved in the scientific community in the area of high-performance computing, reconfigurable and field-programmable computing.