High-performance embedded computing -- Comparing results

João Cardoso, José Gabriel Coutinho, and Pedro Diniz

March 12, 2018


Editor's Note: Interest in embedded systems for the Internet of Things often focuses on physical size and power consumption. Yet, the need for tiny systems by no means precludes expectations for greater functionality and higher performance. At the same time, developers need to respond to growing interest in more powerful edge systems able to mind large stables of connected systems while running resource-intensive algorithms for sensor fusion, feedback control, and even machine learning. In this environment and embedded design in general, it's important for developers to understand the nature of embedded systems architectures and methods for extracting their full performance potential. In their book, Embedded Computing for High Performance, the authors offer a detailed look at the hardware and software used to meet growing performance requirements.


Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.



When comparing execution times, power dissipation, and energy consumption, it is common to compare the average (arithmetic mean) of the results from a number of application runs. Averaging reduces the impact of measurement fluctuations, since most of the time measurements are influenced by unrelated CPU activities and by the precision of the measuring technique. Depending on the number of measurements, it may also be important to report standard deviations. When comparing speedups and throughputs, however, it is better to compare the geometric mean of the speedups/throughputs achieved by the optimizations or improvements for each benchmark/application used in the experiments, because the arithmetic mean of speedups may lead to wrong conclusions. [Note: How did this get published? Pitfalls in experimental evaluation of computing systems. Talk by José Nelson Amaral, Languages, Compilers, Tools and Theory for Embedded Systems (LCTES’12), June 12, 2012, Beijing.]

In most cases, the measurements come from real executions on the target platform. They rely on hardware timers to measure clock cycles (and the corresponding execution time) and on sensors to measure the current being supplied. Most embedded computing platforms do not include current sensors, so it is sometimes necessary to use a third-party board that can be attached to the power supply (e.g., by using a shunt resistor between the power supply and the device under measurement).
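To illustrate the point about means made above, the following minimal Python sketch compares the arithmetic and geometric means of a set of speedups. The benchmark speedups are hypothetical values chosen for illustration, not measurements from the book, and the helper names are our own. Implementation A is twice as fast as B on one benchmark and half as fast on the other, so neither should come out ahead overall.

```python
import math
import statistics


def mean_and_stdev(samples):
    """Arithmetic mean and sample standard deviation of repeated runs."""
    return statistics.mean(samples), statistics.stdev(samples)


def geometric_mean(values):
    """Geometric mean, computed via logarithms; values must be positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical execution times (ms) of one application over five runs.
runs_ms = [10.2, 9.8, 10.1, 9.9, 10.0]
avg_ms, std_ms = mean_and_stdev(runs_ms)

# Hypothetical per-benchmark speedups of A over B, and of B over A.
a_over_b = [2.0, 0.5]
b_over_a = [1.0 / s for s in a_over_b]

# The arithmetic mean claims each implementation is faster than the other.
arith_a = statistics.mean(a_over_b)
arith_b = statistics.mean(b_over_a)

# The geometric mean is consistent regardless of which one is the baseline.
geo_a = geometric_mean(a_over_b)
geo_b = geometric_mean(b_over_a)

print(f"runs: mean={avg_ms:.2f} ms, stdev={std_ms:.2f} ms")
print(f"arithmetic mean: A/B={arith_a:.2f}, B/A={arith_b:.2f}")
print(f"geometric mean:  A/B={geo_a:.2f}, B/A={geo_b:.2f}")
```

With these inputs the arithmetic mean is above 1 in both directions, which is contradictory. The geometric mean of the inverted speedups is the reciprocal of the original geometric mean, so the conclusion does not depend on the choice of baseline.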

There are cases where performance evaluations are carried out using cycle-accurate simulators (i.e., simulators that execute at the clock cycle level and thus report accurate latencies) and power/energy models. Cycle-accurate simulations can be very time-consuming. An alternative is to use instruction-level simulators (i.e., simulators that model the execution of the instructions but not the clock cycles elapsed) and/or performance/power/energy models. Because they simulate faster, instruction-level simulators are used by most virtual platforms to simulate entire systems, including the operating system.

In many cases, results are compared using metrics calculated from actual measurements. The most relevant example is the speedup, which allows the comparison of performance improvements over a baseline (reference) solution. In the case of multicore and many-core architectures, in addition to the characteristics of the target platform (e.g., memories, CPUs, hardware accelerators), it is common to report the number and kind of cores used for a specific implementation, the number of threads, and the clock frequencies used.
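As a small sketch of the speedup metric described above, the Python snippet below divides the mean execution time of a baseline by the mean execution time of an optimized version. The timing values and the function name `speedup` are assumptions made for illustration only.

```python
import statistics


def speedup(baseline_times, optimized_times):
    """Speedup of the optimized version over the baseline.

    Each argument is a list of execution times from repeated runs,
    in the same unit. Values above 1 mean the optimized version is faster.
    """
    return statistics.mean(baseline_times) / statistics.mean(optimized_times)


# Hypothetical execution times (ms) from three runs of each version.
baseline_ms = [10.0, 10.4, 9.6]
optimized_ms = [5.0, 5.2, 4.8]

result = speedup(baseline_ms, optimized_ms)
print(f"speedup over baseline: {result:.2f}x")
```

Reporting the speedup alongside the platform details listed above (cores, threads, clock frequencies) makes it possible for others to interpret and reproduce the comparison.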

For FPGAs, it is common to evaluate the number of hardware resources used and the maximum clock frequency achieved by a specific design. These metrics are reported by vendor-specific tools at more than one level of the toolchain, with the lower levels of the toolchain providing the highest accuracy. Typically, the metrics reported depend on the target FPGA vendor or family of FPGAs. In the case of Xilinx FPGAs, the metrics reported include the number of LUTs (distinguishing the ones used as registers, as logic, or as both), slices, DSPs, and BRAMs.


This chapter described the main architectures currently used for high-performance embedded computing and for general-purpose computing. The descriptions include multicore and many-core integrated circuits (ICs) and hardware accelerators (e.g., GPUs and FPGAs). We presented aspects to take into account in terms of performance, and discussed the impact of offloading computations to hardware accelerators and the use of the roofline model to reveal the potential for performance improvements of an application. Since power dissipation and energy consumption are of paramount importance in embedded systems, even when the primary goal is to provide high performance, we described the major contributing factors for power dissipation and energy consumption, and some techniques to reduce them.


This chapter focused on many topics for which there exists an extensive bibliography. We include in this section some references we believe are appropriate as a starting point for readers interested in learning more about some of the topics covered in this chapter.
