High-performance embedded computing — Performance

Editor's Note: Interest in embedded systems for the Internet of Things often focuses on physical size and power consumption. Yet, the need for tiny systems by no means precludes expectations for greater functionality and higher performance. At the same time, developers need to respond to growing interest in more powerful edge systems able to mind large stables of connected systems while running resource-intensive algorithms for sensor fusion, feedback control, and even machine learning. In this environment and embedded design in general, it's important for developers to understand the nature of embedded systems architectures and methods for extracting their full performance potential. In their book, Embedded Computing for High Performance, the authors offer a detailed look at the hardware and software used to meet growing performance requirements.


Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.

By João Cardoso, José Gabriel Coutinho, and Pedro Diniz

In the embedded systems domain, an important challenge for developers is to meet nonfunctional requirements, such as execution time, memory capacity, and energy consumption. In this context, a developer must consider and then evaluate different solutions to optimize the performance of a system. As part of this evaluation, it is important for developers to identify the most suitable performance metrics to guide this process.

The performance of an application is defined as the reciprocal (multiplicative inverse) of the application's execution time. Other common metrics count the number of clock cycles used to execute a function, a specific code section, or the entire application. Additionally, metrics can be driven by the application's nonfunctional requirements, for instance: task latency measures the number of clock cycles required to perform a specific task, and task throughput measures the number of tasks completed per time unit. Common measures of throughput also include packets or frames per second and samples per clock cycle.

Scalability is also often an important design consideration as it highlights how the application performance changes with different dataset sizes and/or when using additional cores/CPUs/hardware accelerators. Scalability drives common resource usage analysis such as the impact of the number of threads on the application execution time and energy consumption, and is key to understanding the required level of parallelization and hardware resources to meet specific performance goals.

In terms of raw performance metrics, the execution time of an application or task, designated as Texec, is computed as the number of clock cycles the hardware (e.g., the CPU) takes to complete it, multiplied by the period of the operating clock (or divided by the clock frequency) as presented in Eq. (2.1). In most cases, when dealing with CPUs, the number of clock cycles measured is not the number of CPU clock cycles elapsed, but the number of cycles reported by a hardware timer, which executes at a much lower clock frequency. In this case, Eq. (2.1) is still valid but the period (T) or the clock frequency (f ) must reflect the hardware timer used.
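Eq. (2.1) is not reproduced here, but from the description it computes Texec as the cycle count multiplied by the clock period (equivalently, divided by the clock frequency). A minimal sketch in Python, with illustrative function and variable names:

```python
def exec_time(clock_cycles: int, clock_freq_hz: float) -> float:
    # Eq. (2.1): Texec = Ncycles * T = Ncycles / f.
    # When cycles come from a hardware timer rather than the CPU,
    # clock_freq_hz must be the timer's frequency, not the CPU's.
    return clock_cycles / clock_freq_hz

# A task that takes 2,000,000 ticks of a 1 MHz hardware timer:
print(exec_time(2_000_000, 1e6))  # 2.0 (seconds)
```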

When offloading computations to a hardware accelerator, the execution time is determined by the three main factors in Eq. (2.2), namely the execution time of the section of the application running on the CPU (TCPU), the execution time associated with the data communication (TComm) between the CPU and the hardware accelerator (HA), and the execution time of the section of the application running on the HA (THA). The execution time model reflected by Eq. (2.2) assumes no overlap between CPU execution, HA execution, and data communication, i.e., the CPU is stalled while the hardware accelerator is executing. Still, it is possible to model a full/partial overlap between data communications and HA execution and/or between CPU execution and HA execution by reducing the values of the terms TComm and THA in Eq. (2.2) to account for the observed overlap.
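A sketch of the Eq. (2.2) model, Texec = TCPU + TComm + THA. The overlap parameters are our illustrative addition, following the text's suggestion of discounting the overlapped portions of TComm and THA:

```python
def offload_exec_time(t_cpu: float, t_comm: float, t_ha: float,
                      comm_overlap: float = 0.0, ha_overlap: float = 0.0) -> float:
    # Eq. (2.2), no overlap: Texec = TCPU + TComm + THA.
    # The overlap factors (0..1) discount the portion of communication
    # and accelerator time hidden behind other execution.
    return (t_cpu
            + t_comm * (1.0 - comm_overlap)
            + t_ha * (1.0 - ha_overlap))

# No overlap: the three terms simply add up.
print(offload_exec_time(1.0, 0.2, 0.5))                    # 1.7
# Communication fully hidden behind CPU execution:
print(offload_exec_time(1.0, 0.2, 0.5, comm_overlap=1.0))  # 1.5
```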

When considering hardware accelerators, Eq. (2.1) measures TCPU, which is the time the application spends on CPU execution (including multiple CPUs and/or many-cores). The number of elapsed clock cycles is measured from the beginning of the execution until the end, which is a reliable measurement as this is the component where the execution starts and finishes.

Another useful metric in many contexts is the speedup of the execution, which quantifies the performance improvement of an optimized version over a baseline implementation. Eq. (2.3) shows how the speedup is calculated from the performance of the optimized version (Perfoptimized) and the baseline performance (Perfbaseline), i.e., the performance achieved by the application and/or system we want to improve.
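A one-line sketch of Eq. (2.3). Since performance is the reciprocal of execution time, the same ratio can equivalently be computed as T_baseline / T_optimized:

```python
def speedup(perf_baseline: float, perf_optimized: float) -> float:
    # Eq. (2.3): speedup = Perf_optimized / Perf_baseline,
    # equivalently T_baseline / T_optimized.
    return perf_optimized / perf_baseline

# A kernel whose throughput goes from 10 to 25 frames per second:
print(speedup(10.0, 25.0))  # 2.5
```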

When focusing on performance improvements, it is common to first identify the most critical sections (hotspots) of the code and then attempt to optimize these sections. The identification of these critical sections is usually based on the use of profiling tools (e.g., GNU gprof) and/or code analysis (see Chapter 4), as well as performance models. A well-known rule of thumb (known as the 90/10 locality rule [9]) states that "90% of the execution time of an application is spent in 10% of the code." In other words, the critical sections of an application reside in approximately 10% of its code, most often in the form of loops.


Amdahl’s law [22,23] states that the performance improvement of a program is limited by the sections that must be executed sequentially, and thus it can be used to estimate the potential for speeding up applications using parallelization and/or hardware acceleration. In short, Amdahl’s law states that the speedup achieved by accelerating portions of an application is limited by the code sections that are not accelerated. More formally, the achievable speedup is limited by the fraction of the application’s execution time that is not improved (1 − f) and by the actual speedup (s) achieved for the remaining fraction f of the execution. This application speedup can thus be expressed by Eq. (2.4).
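Eq. (2.4) is the standard Amdahl formula, Speedup = 1 / ((1 − f) + f/s). A minimal sketch:

```python
def amdahl_speedup(f: float, s: float) -> float:
    # Eq. (2.4): overall speedup = 1 / ((1 - f) + f / s),
    # where f is the fraction of execution time that is improved
    # and s is the speedup achieved on that fraction.
    return 1.0 / ((1.0 - f) + f / s)

# Accelerating 90% of the execution time by 10x yields only ~5.26x overall:
print(round(amdahl_speedup(0.9, 10.0), 2))  # 5.26
```

Note how quickly the unimproved 10% dominates: even an infinite s on the same 90% fraction could never exceed 10x overall.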

This relation indicates that, to attain substantial overall performance improvements, one must improve sections of the application that represent a significant fraction f of the global execution time. Conversely, if a computation has a substantial fraction 1 − f that cannot be (or is not) improved, that fraction limits the achievable speedup of the overall application. As the speedup s of the fraction f increases, possibly using additional resources [Note: Under the assumption that there are no hardware resource dependences or data dependences precluding the effective use of these resources for concurrent execution.] and/or advanced compiler transformations, the overall application speedup asymptotically tends to the value given by Eq. (2.5).

Fig. 2.18 shows the speedup limit for different fractions f being accelerated, considering s as infinite. As can be observed, to attain a speedup of 2, the fraction f of the original application code whose execution time needs to be reduced to a negligible amount is 50%. To attain a speedup of 10, the fraction f needs to be 90%. Thus, higher performance improvements can only be achieved if increasingly large sections of the code have their execution time substantially reduced. In other words, the maximum achievable performance is limited by the intrinsically sequential sections of the application’s code that cannot be, or simply are not, subject to optimization.
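The asymptotic limit of Eq. (2.5) follows by letting s grow without bound in Eq. (2.4), leaving 1 / (1 − f). A quick check reproduces the 50% and 90% figures above:

```python
def speedup_limit(f: float) -> float:
    # Eq. (2.5): as s -> infinity, Eq. (2.4) tends to 1 / (1 - f).
    return 1.0 / (1.0 - f)

print(speedup_limit(0.5))  # 2.0
print(speedup_limit(0.9))  # ~10 (10.000000000000002 due to floating point)
```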


FIG. 2.18 Speedup limit with respect to the fraction of the application's execution time being accelerated (considering the theoretical case where those fractions' execution time is reduced to zero, e.g., by using infinite hardware resources).

This speedup analysis, while theoretical, is in practice very important as it can provide insights into the speedup limits, thus providing an early cost-benefit analysis for programmer effort and/or additional hardware resource use.

As an example of the usefulness of such a performance model, even a simplified one, we consider the Libmad library for MP3 decoding and its theoretical limits for possible acceleration. Fig. 2.19 shows the profiling results when running this program on a typical desktop PC. Using the idealized performance model with no communication cost, we would expect a maximum speedup of 1.11× when considering the hardware acceleration of dct32, a function very amenable to hardware implementation. To achieve a 10× speedup, we must consider accelerating at least the top 9 functions revealed by the profiling results, which together represent over 90% of the overall program’s execution time. A plot of the maximum attainable speedup is depicted in Fig. 2.20. Here, the speedup is computed by cumulatively optimizing the most time-consuming functions in the Libmad library. This plot captures the results of three experiments. The first experiment reveals an idealized speedup assuming no communication cost with a hardware accelerator nor any other source of overhead. The second experiment assumes that, for each function offloaded to the hardware accelerator, the communication cost is 1% of its execution time (a hypothetical case where the cost of data communication is proportional to the computing time). The last experiment assumes this communication overhead climbs to 5%. The exponential effect shown in Fig. 2.18 is very noticeable in the first experiment presented in Fig. 2.20. The ideal acceleration of the last three, least time-consuming functions presented (i.e., “III_freqinver,” “III_reorder,” and “III_exponents”), all with negligible execution time relative to the overall execution time of the program (see Fig. 2.19), increases the speedup from 52.29× to 70.53×, 97.84×, and 126.38×, respectively.
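The cumulative-speedup curves of Fig. 2.20 can be reproduced with a few lines of code. The sketch below uses hypothetical hotspot fractions, not the actual Libmad profile; the comm_overhead parameter mirrors the per-function 1% and 5% communication costs of the second and third experiments:

```python
def cumulative_speedup(fractions, comm_overhead: float = 0.0) -> float:
    # Each offloaded function's time ideally drops to comm_overhead times
    # its original time; the rest of the program runs unchanged.
    accelerated = sum(fractions)
    return 1.0 / ((1.0 - accelerated) + accelerated * comm_overhead)

profile = [0.40, 0.25, 0.15, 0.10]  # hypothetical fractions of total time
for k in range(1, len(profile) + 1):
    print(k, round(cumulative_speedup(profile[:k], comm_overhead=0.01), 2))
```

As in the figure, each additional offloaded function buys progressively more speedup as the unaccelerated remainder shrinks, until the fixed communication overhead caps the curve.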


FIG. 2.19 Profiling results for the Libmad library for MP3 decoding considering the 20 most time-consuming functions.


FIG. 2.20 Maximum attainable speedup for various communication cost scenarios when cumulatively optimizing each source code function of the Libmad library code (note that this is simply indicative and assumes that all the selected functions can be offloaded to hardware accelerators and that their implementations in those accelerators have negligible execution time).

Clearly, this idealized scenario is overly optimistic and in practice infeasible. Also, it can be immediately apparent that the communication and synchronization costs need to be kept to a minimum as they quickly erode any performance gains due to acceleration of even a substantial fraction of the application’s execution time. Yet, even a simple model can help developers choose a feasible section of the application’s source code to improve on.


The roofline model [24,25] is an increasingly popular method for capturing the compute-memory ratio of a computation and hence quickly identifying whether the computation is compute or memory bound. The roofline model is often represented as a two-dimensional plot with performance (Flops/Cycle or Flops/s) on the Y-axis and operational (also known as arithmetic or numeric) intensity (Flops/Byte) on the X-axis, as depicted in Fig. 2.21.


FIG. 2.21 Illustrative roofline model plot with several architectural bandwidth bounds and instruction optimization bounds.

This plot delimits various performance regions, based on the architectural features of the computing system at hand, in which computational kernels can be placed depending on the features used and the compiler transformations exercised. Clearly, the region labeled 0 is infeasible, as it implies bandwidth values beyond what the computing system can provide. The line labeled “peak stream bandwidth” thus identifies the bandwidth upper bound for any code design. The plot also contains other upper-bound lines corresponding to the use (or not) of specific architectural features. For instance, the line labeled “without software prefetching” indicates that codes not taking advantage of prefetching cannot move beyond this performance bound. Similarly, other horizontal lines bound the performance of a code design that does not use SIMD instructions or simply does not take advantage of instruction-level parallelism (ILP). Ideally, developers want their code to reach as high as possible, breaking some of these ceilings through compiler transformations and code restructuring, to exhibit the highest possible performance rate.

In short, the goal of the roofline model is to set performance expectations for a specific architecture and a specific code, guiding developers to tune their code and take advantage of the target computing system. The model can also be used to expose performance bottlenecks and to guide code transformations. Codes that are inherently compute bound, i.e., that have a high operational intensity, can be “pulled” from region ➀ to region ➂ with the use of source and compiler optimizations. However, if the code is inherently memory bound (lower operational intensity), the most effective performance improvement may be to engage in transformations that reduce or hide the latency of memory operations, by prefetching or by exploiting application-level data reuse (in effect eliminating memory accesses). As an example, Fig. 2.22 [26] depicts the roofline plots for three representative Intel MKL (Math Kernel Library) kernels (“daxpy,” “dgemv,” “dgemm”) and an FFT kernel for various input dataset sizes, considering sequential and parallel versions of the code executing on an i7 Core Sandy Bridge E system.
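The roofline bound itself is a one-liner: attainable performance is the minimum of the compute ceiling and the bandwidth ceiling scaled by operational intensity. A sketch using the single-core figures quoted for Fig. 2.22 (8 Flops/Cycle peak performance, 6.2 Bytes/Cycle peak bandwidth):

```python
def roofline_bound(peak_perf: float, peak_bw: float, op_intensity: float) -> float:
    # Attainable Flops/Cycle = min(compute ceiling, bandwidth * intensity).
    return min(peak_perf, peak_bw * op_intensity)

# Single-core Sandy Bridge figures from Fig. 2.22A:
print(roofline_bound(8.0, 6.2, 0.5))  # 3.1 -> memory bound
print(roofline_bound(8.0, 6.2, 4.0))  # 8.0 -> compute bound
```

The ridge point, where the two ceilings meet (here at 8.0 / 6.2 ≈ 1.29 Flops/Byte), separates memory-bound kernels from compute-bound ones.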


FIG. 2.22 Examples of roofline plots for Intel MKL BLAS 1–3 and FFT kernels on an i7-3930K Core Sandy Bridge E system: sequential code (A) and parallel code (B). [From Ofenbeck G, Steinmann R, Cabezas VC, Spampinato DG, Püschel M. Applying the roofline model. In: IEEE international symposium on performance analysis of systems and software (ISPASS’2014), Monterey, CA, March 23–25, 2014. p. 76–85.]

The peak performance and memory bandwidth depicted in the roofline plots are estimated using specific microbenchmarks on single and multicore platforms, exercising different memory access patterns. Moving from the single-core execution in Fig. 2.22A to the multicore execution in Fig. 2.22B has two major effects. First, it increases the maximum performance ceiling from 8 to 48 Flops/Cycle, as expected given the use of multiple independent cores. Second, it increases the peak bandwidth from 6.2 to 10.31 Bytes/Cycle, likely due to optimized memory management techniques used in NUMA architectures, where memory is allocated “near” the core where the thread executes (e.g., first-touch policy). Kernels daxpy and dgemv are memory bound (low operational intensity) in both sequential and parallel cases, while dgemm is compute bound, reaching close to peak performance in both cases. The FFT kernel, on the other hand, goes from compute bound in the sequential version to memory bound in the parallel version, which means that any optimization aiming to improve the FFT kernel’s performance will first hit the bandwidth ceiling.


In other contexts, specific sections of the code, or tasks, are required to execute within prescribed time bounds, and so their execution time must be modeled. This is the case with many embedded systems, such as real-time and safety-critical systems, which require the execution of specific tasks to be time bounded so as to ensure correct behavior per the system requirements (see, e.g., [27], Chapter 10). As an example, in an automotive engine, specific control operations must occur within very stringent timing constraints. Similarly, in nuclear reactor software, safety-related control actions have to take place at precise times once alarm conditions are triggered. In these contexts, developers must guarantee that the worst-case execution time (WCET) of specific tasks is below a specific time bound so that, as is common practice, they can execute within the schedule time slot allocated to them by the underlying real-time operating system (RTOS).

Embedded software developers have resorted to a combination of static analysis (instruction counting and execution path coverage analysis) and restrictions on the sharing of resources with other tasks, or even on preemption, so that feasible execution time upper bounds can be derived (see, e.g., [28]).

The existence of architectural elements with nondeterministic timing behavior, such as cache memories and branch prediction, renders the estimation of WCET, even for input-bounded code, very complex (see, e.g., [29]). Because of these limitations, and in instances where satisfying strict timing deadlines is of paramount importance, e.g., in safety applications, software developers opt for hardware components that do not exhibit nondeterministic behavior, often at the expense of lower expected average performance.

The next installment in this series discusses power and energy consumption concerns.   

Reprinted with permission from Elsevier/Morgan Kaufmann, Copyright © 2017

João Manuel Paiva Cardoso, Associate Professor, Department of Informatics Engineering (DEI), Faculty of Engineering, University of Porto, Portugal. He was previously Assistant Professor in the Department of Computer Science and Engineering, Instituto Superior Técnico (IST), Technical University of Lisbon (UTL), in Lisbon (April 4, 2006–Sept. 3, 2008), Assistant Professor (2001–2006) in the Department of Electronics and Informatics Engineering (DEEI), Faculty of Sciences and Technology, University of Algarve, and Teaching Assistant at the same university (1993–2001). He has been a senior researcher at INESC-ID (Systems and Computer Engineering Institute) in Lisbon and was a member of INESC-ID from 1994 to 2009.

José Gabriel de Figueiredo Coutinho, Research Associate, Imperial College. He is involved in the EU FP7 HARNESS project to integrate heterogeneous hardware and network technologies into data centre platforms, to vastly increase performance, reduce energy consumption, and lower cost profiles for important and high-value cloud applications such as real-time business analytics and the geosciences. His research interests include database functionality on heterogeneous systems, cloud computing resource management, and performance-driven mapping strategies.

Pedro C. Diniz received his M.Sc. in Electrical and Computer Engineering from the Technical University of Lisbon, Portugal, and his Ph.D. in Computer Science from the University of California, Santa Barbara, in 1997. Since 1997 he has been a researcher with the University of Southern California’s Information Sciences Institute (USC/ISI) and an Assistant Professor of Computer Science at the University of Southern California in Los Angeles, California. He has led and participated in many research projects funded by the U.S. government and the European Union (EU) and has authored or co-authored many internationally recognized scientific journal papers and over 100 international conference papers. Over the years he has been heavily involved in the scientific community in the area of high-performance, reconfigurable, and field-programmable computing.
