High-performance embedded computing -- Performance
Editor's Note: Interest in embedded systems for the Internet of Things often focuses on physical size and power consumption. Yet the need for tiny systems by no means precludes expectations for greater functionality and higher performance. At the same time, developers need to respond to growing interest in more powerful edge systems able to mind large stables of connected devices while running resource-intensive algorithms for sensor fusion, feedback control, and even machine learning. In this environment, and in embedded design in general, it is important for developers to understand the nature of embedded systems architectures and the methods for extracting their full performance potential. In their book, Embedded Computing for High Performance, the authors offer a detailed look at the hardware and software used to meet growing performance requirements.
Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.
By João Cardoso, José Gabriel Coutinho, and Pedro Diniz
In the embedded systems domain, an important challenge for developers is to meet nonfunctional requirements, such as execution time, memory capacity, and energy consumption. In this context, a developer must consider and then evaluate different solutions to optimize the performance of a system. As part of this evaluation, it is important for developers to identify the most suitable performance metrics to guide this process.
The performance of an application is defined as the reciprocal of the application’s execution time. Other common metrics include the number of clock cycles used to execute a function, a specific code section, or the entire application. Additionally, metrics can be driven by the application’s nonfunctional requirements; for instance, task latency measures the number of clock cycles required to perform a specific task, and task throughput measures the number of tasks completed per unit of time. Common measures of throughput also include packets or frames per second and samples per clock cycle.
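The metrics above can be sketched as simple helper functions. This is an illustrative sketch only; the function names and all numeric values are assumptions made for the example, not part of the book's text.

```python
def performance(t_exec_s):
    """Performance as the reciprocal of execution time."""
    return 1.0 / t_exec_s

def task_latency_cycles(start_cycle, end_cycle):
    """Task latency: clock cycles elapsed between the start and end of a task."""
    return end_cycle - start_cycle

def throughput(tasks_completed, elapsed_s):
    """Task throughput: tasks completed per unit of time."""
    return tasks_completed / elapsed_s

# Hypothetical measurements, for illustration only:
print(performance(0.25))                  # 4.0 (executions per second)
print(task_latency_cycles(1_000, 5_000))  # 4000 (cycles)
print(throughput(120, 2.0))               # 60.0 (tasks per second)
```

Note that performance and throughput coincide only when tasks execute back to back with no idle time; latency, by contrast, is a per-task measure.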
Scalability is also often an important design consideration, as it highlights how the application's performance changes with different dataset sizes and/or when using additional cores, CPUs, or hardware accelerators. Scalability drives common resource usage analyses, such as assessing the impact of the number of threads on the application's execution time and energy consumption, and it is key to understanding the level of parallelization and the hardware resources required to meet specific performance goals.
In terms of raw performance metrics, the execution time of an application or task, designated as Texec, is computed as the number of clock cycles the hardware (e.g., the CPU) takes to complete it, multiplied by the period of the operating clock (or, equivalently, divided by the clock frequency), as presented in Eq. (2.1): Texec = Ncycles × T = Ncycles / f. In most cases, when dealing with CPUs, the number of clock cycles measured is not the number of elapsed CPU clock cycles but the number of cycles reported by a hardware timer, which runs at a much lower clock frequency. In this case, Eq. (2.1) is still valid, but the period (T) or the clock frequency (f) must be that of the hardware timer used.
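The relation in Eq. (2.1), Texec = Ncycles / f, can be sketched directly. The cycle counts and frequencies below are hypothetical values chosen for illustration; in particular, the 32,768 Hz timer is a common real-time-clock frequency used here only as an assumed example.

```python
def t_exec(n_cycles, f_hz):
    """Eq. (2.1): execution time = cycles * clock period = cycles / clock frequency."""
    return n_cycles / f_hz

# Same computation seen by two different counters (hypothetical numbers):
# - a CPU cycle counter running at 1 GHz
print(t_exec(2_000_000, 1_000_000_000))  # 0.002 seconds
# - a slow 32,768 Hz hardware timer, which reports far fewer (coarser) ticks
print(t_exec(65, 32_768))                # roughly the same interval, ~0.00198 s
```

The second call illustrates the point in the text: the formula is unchanged, but the cycle count and frequency must both refer to the same timer, and a slow timer quantizes the measurement more coarsely.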
When offloading computations to a hardware accelerator, the execution time is determined by the three main factors in Eq. (2.2): the execution time of the section of the application running on the CPU (TCPU), the execution time associated with the data communication (TComm) between the CPU and the hardware accelerator (HA), and the execution time of the section of the application running on the HA (THA). The execution time model reflected by Eq. (2.2) assumes no overlap between CPU execution, data communication, and hardware accelerator execution, i.e., the CPU is stalled while the hardware accelerator is executing. Still, it is possible to capture a full or partial overlap between data communication and HA execution, and/or between CPU execution and HA execution, by reducing the TComm and THA terms in Eq. (2.2) to account for the observed overlap.
When considering hardware accelerators, Eq. (2.1) measures TCPU, which is the time the application spends on CPU execution (including multiple CPUs and/or many-cores). The number of elapsed clock cycles is measured from the beginning of the execution until the end, which is a reliable measurement since the CPU is the component where execution starts and finishes.
Another useful metric in many contexts is the speedup of the execution, which quantifies the performance improvement of an optimized version over a baseline implementation. Eq. (2.3) computes the speedup as the ratio of the performance of the optimized version (Perfoptimized) to the baseline performance (Perfbaseline), i.e., the performance achieved by the application and/or system we want to improve.
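Since performance is the reciprocal of execution time, the speedup ratio of Eq. (2.3) can equivalently be computed from execution times. A small sketch with made-up measurements:

```python
def speedup(perf_optimized, perf_baseline):
    """Eq. (2.3): speedup = Perf_optimized / Perf_baseline."""
    return perf_optimized / perf_baseline

def speedup_from_times(t_baseline, t_optimized):
    """Equivalent form using execution times, since performance = 1 / Texec."""
    return t_baseline / t_optimized

# Hypothetical example: baseline takes 10 s, optimized version takes 2.5 s.
print(speedup_from_times(10.0, 2.5))  # 4.0
# The same result via performances (1/10 = 0.1 and 1/2.5 = 0.4 tasks/s):
print(speedup(0.4, 0.1))              # 4.0
```

A speedup greater than 1 indicates an improvement over the baseline; a value below 1 indicates a slowdown.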
When focusing on performance improvements, it is common to first identify the most critical sections (hotspots) of the code and then attempt to optimize those sections. The identification of these critical sections is usually based on profiling tools (e.g., GNU gprof), code analysis (see Chapter 4), and/or performance models. A well-known rule of thumb (the 90/10 locality rule) states that “90% of the execution time of an application is spent in 10% of the code.” In other words, the critical sections of an application typically lie in approximately 10% of the code, most often in the form of loops.