High-performance embedded computing -- Code retargeting mechanisms
Editor's Note: With the emergence of heterogeneous multicore processors for embedded systems, developers can take advantage of powerful platforms for executing complex algorithms moving from the cloud to IoT edge devices. To take full advantage of these architectures, however, developers need to understand the nature of code retargeting mechanisms to optimize resource utilization and algorithm execution.
This series on code retargeting complements a previous set of articles that explored advanced architectures for embedded computing -- both series excerpted from the book, Embedded Computing for High Performance. In this series, the authors discuss details of code retargeting, beginning with this installment on code retargeting mechanisms.
Elsevier is offering this and other engineering books at a 30% discount. To use this discount, click here and use code ENGIN318 during checkout.
Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.
By João Cardoso, José Gabriel Coutinho, and Pedro Diniz
In this chapter, we focus exclusively on code retargeting issues in the context of CPU-based architectures and where the application source code is optimized to execute more efficiently on a given platform. In Chapter 7, we extend the topic of code retargeting to cover heterogeneous platforms, such as GPUs and FPGAs.
Here, we assume as input the application’s source code, written in a high-level language such as C/C++, possibly even translated from higher levels of abstraction as described in Chapter 3. This source code captures the algorithmic behavior of the application while being independent of the target platform. As a first step, this code version is compiled and executed on a CPU platform, allowing developers to test the correctness of the application with respect to its specification.
However, as computer architectures are becoming increasingly complex, with more processing cores, more heterogeneous and more distributed, applications compiled for existing machines will not fully utilize the target platform resources, even when binary compatibility is guaranteed, and thus do not run as efficiently as they could. In particular, modern-day computing platforms (see Chapter 2 for a brief overview) support many independent computational units to allow parallel computations, contain specialized architectural features to improve concurrency and computational efficiency, support hierarchical memories to increase data locality and reduce memory latency, and include fast interconnects and buses to reduce data movement overhead. Hence, to maximize performance and efficiency it is imperative to leverage all the underlying features of a computing platform.
While imperative languages, such as C and C++, allow developers to write portable applications using high-level abstractions, the complexity of modern computing platforms makes it hard for compilers to derive optimized and efficient code for each platform architecture. [Note: In this chapter, as with most of the literature, we use the term “optimization” loosely as a synonym for “performance improvement” as in general (and excluding trivial computations) program optimization is an undecidable problem at compile time (or design time).] Retargetable compilers do exist, but they typically limit themselves to generate code for different CPU instruction set architectures. This limitation forces developers to manually optimize and refine their applications to support the target computation platform. This process requires considerable expertise to port the application, including applying code optimization strategies based on best practices, understanding available retargeting mechanisms, and deploying and testing the application. The resulting code, while being more efficient, often becomes difficult to maintain since it becomes polluted with artifacts used for optimization (e.g., with new API calls and data structures, code transformations). For this reason, developers often keep two versions of their source code: a functional version used to verify correctness and an optimized version which runs more efficiently on a specific platform.
This chapter is organized as follows. In Section 6.2, we provide a brief overview of common retargeting mechanisms. Section 6.3 briefly describes parallelism opportunities in CPU-based platforms and compiler options, including phase selection and ordering. Section 6.4 focuses on loop vectorization to maximize single-threaded execution. Section 6.5 covers multithreading on shared memory multicore architectures, and Section 6.6 describes how to leverage platforms with multiprocessors using distributed memory. Section 6.7 explains CPU cache optimizations. Section 6.8 presents examples of LARA strategies related to code retargeting. The remaining sections provide further reading on the topics presented in this chapter.
To illustrate the various retargeting mechanisms described in the following sections, we consider the multiplication of two matrices A (nX rows and nY columns) and B (nY rows and nZ columns) producing a matrix C (nX rows and nZ columns), generically described as follows:
This matrix product computation is a key operation in many algorithms and can be potentially time consuming for very large matrix size inputs. Although several optimized algorithms have been proposed for the general matrix multiplication operation, we focus, as a starting point, on a simple implementation in C, described as follows:
To compute each element of matrix C (see Fig. 6.1), the algorithm iterates over row i of matrix A and column j of matrix B, multiplying in a pair-wise fashion the elements from both matrices, and then adding the resulting products to obtain Cij.
FIG. 6.1 Matrix multiplication kernel.
Although this is a simple example, it provides a rich set of retargeting and optimization opportunities (clearly not exhaustively) on different platforms. Yet, its use is not meant to provide the most effective implementation for each target system. In practice, we recommend using domain-specific libraries, such as BLAS , whenever highly optimized implementations are sought.