High-performance embedded computing — Comparing results

Editor's Note: Interest in embedded systems for the Internet of Things often focuses on physical size and power consumption. Yet, the need for tiny systems by no means precludes expectations for greater functionality and higher performance. At the same time, developers need to respond to growing interest in more powerful edge systems able to mind large stables of connected systems while running resource-intensive algorithms for sensor fusion, feedback control, and even machine learning. In this environment and embedded design in general, it's important for developers to understand the nature of embedded systems architectures and methods for extracting their full performance potential. In their book, Embedded Computing for High Performance, the authors offer a detailed look at the hardware and software used to meet growing performance requirements.


Adapted from Embedded Computing for High Performance, by João Cardoso, José Gabriel Coutinho, Pedro Diniz.

By João Cardoso, José Gabriel Coutinho, and Pedro Diniz

When comparing execution times, power dissipation, and energy consumption, it is common to compare the average (arithmetic mean) over a number of application runs. Averaging reduces the impact of measurement fluctuations, since measurements are often influenced by unrelated CPU activities and by the precision of the measuring techniques. Depending on the number of measurements, it may also be important to report standard deviations. When comparing speedups and throughputs, however, it is preferable to compare the geometric mean of the speedups/throughputs achieved by the optimizations or improvements for each benchmark/application used in the experiments; using the arithmetic mean of speedups may lead to wrong conclusions. [Note: "How did this get published? Pitfalls in experimental evaluation of computing systems." Talk by José Nelson Amaral, Languages, Compilers, Tools and Theory for Embedded Systems (LCTES'12), June 12, 2012, Beijing.] In most cases, measurements rely on real executions on the target platform, taking advantage of hardware timers to count clock cycles (and derive the corresponding execution time) and of sensors to measure the current being supplied. Most embedded computing platforms do not include current sensors, so it is sometimes necessary to attach a third-party board to the power supply (e.g., using a shunt resistor between the power supply and the device under measurement).
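The statistics above can be made concrete with a short sketch. The run times and speedups below are hypothetical values chosen for illustration; note how the arithmetic mean of the speedups paints a rosier picture than the geometric mean.

```python
import math
import statistics

# Hypothetical execution times (seconds) from repeated runs of one benchmark;
# averaging smooths fluctuations caused by unrelated CPU activity.
runs = [1.02, 0.98, 1.05, 0.99, 1.01]
mean_time = statistics.mean(runs)
std_dev = statistics.stdev(runs)  # worth reporting alongside the mean

# Hypothetical per-benchmark speedups of an optimized version over a baseline:
# one large win and one slowdown.
speedups = [4.0, 0.5]

arith = statistics.mean(speedups)                  # 2.25: suggests a large overall gain
geo = math.prod(speedups) ** (1 / len(speedups))   # ~1.41: the fairer summary

print(f"mean time: {mean_time:.3f} s (stddev {std_dev:.3f})")
print(f"arithmetic mean of speedups: {arith:.2f}")
print(f"geometric mean of speedups: {geo:.2f}")
```

Because the geometric mean multiplies the ratios before taking the root, a 4x gain on one benchmark cannot mask a 2x loss on another, which is exactly the pitfall the footnoted talk warns about.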

In some cases, performance evaluations are carried out using cycle-accurate simulators (i.e., simulators that model execution at the clock cycle level and thus report accurate latencies) together with power/energy models. Cycle-accurate simulations can be very time-consuming; an alternative is to use instruction-level simulators (i.e., simulators that model the execution of instructions but not the clock cycles elapsed) and/or performance/power/energy models. Because they provide faster simulations, instruction-level simulators are used by most virtual platforms to simulate entire systems, including the presence of operating systems.
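As a rough illustration of the kind of estimate an instruction-level model produces, the sketch below computes execution time as instruction count times average CPI divided by clock frequency. All numbers are assumed values, not measurements from any real platform or simulator.

```python
# Toy instruction-level performance model (illustrative only):
# estimated time = instruction count x average CPI / clock frequency.
def estimated_time(instructions: int, avg_cpi: float, freq_hz: float) -> float:
    cycles = instructions * avg_cpi  # cycle count predicted by the model
    return cycles / freq_hz          # seconds

# e.g., 10 million instructions, average CPI of 1.2, a 100 MHz embedded core:
t = estimated_time(10_000_000, 1.2, 100e6)
print(f"estimated execution time: {t * 1e3:.1f} ms")  # 120.0 ms
```

A cycle-accurate simulator would instead derive the cycle count from the microarchitecture (pipeline stalls, cache misses, etc.) rather than from a single averaged CPI, which is what makes it slower but more precise.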

The comparison of results is in many cases conducted using metrics derived from actual measurements. The most relevant example is the speedup, which allows the comparison of performance improvements over a baseline (reference) solution. In the case of multicore and many-core architectures, it is common to report, in addition to the characteristics of the target platform (e.g., memories, CPUs, hardware accelerators), the number and kind of cores used for a specific implementation, the number of threads, and the clock frequencies used.

Regarding the use of FPGAs, it is common to evaluate the number of hardware resources used and the maximum clock frequency achieved by a specific design. These metrics are reported by vendor-specific tools at more than one level of the toolchain, with the highest accuracy provided by the lower levels. Typically, the metrics reported depend on the target FPGA vendor or family of FPGAs. In the case of Xilinx FPGAs, the metrics reported include the number of LUTs (distinguishing the ones used as registers, as logic, or as both), slices, DSPs, and BRAMs.


Summary

This chapter described the main architectures currently used for high-performance embedded computing and for general-purpose computing. The descriptions include multicore and many-core integrated circuits (ICs) and hardware accelerators (e.g., GPUs and FPGAs). We presented aspects to take into account in terms of performance and discussed the impact of offloading computations to hardware accelerators and the use of the roofline model to reveal the potential for performance improvements of an application. Since power dissipation and energy consumption are of paramount importance in embedded systems, even when the primary goal is to provide high performance, we described the major contributing factors for power dissipation and energy consumption, and some techniques to reduce them.


Further reading

This chapter focused on many topics for which there exists an extensive bibliography. We include in this section some references we believe are appropriate as a starting point for readers interested in learning more about some of the topics covered in this chapter.

Readers interested in the advances of reconfigurable computing and reconfigurable architectures can refer to, e.g., [1,36–38]. An introduction to FPGA architectures is provided in Ref. [2]. The overviews in Refs. [39,40] provide useful information on code transformations and compiler optimizations for reconfigurable architectures in the context of reconfigurable computing. Overlay architectures provide interesting approaches and have attracted the attention of academia as well as of industry (see, e.g., [20,41]).

A useful introduction about GPU computing is presented in Ref. [6]. In addition, there have been several research efforts comparing the performance of GPUs vs FPGAs (see, e.g., [42]).

Computation offloading has been the focus of many authors, covering not only the migration of computations to local hardware accelerators but also their migration to servers and/or HPC platforms. Examples of computation offloading include the migration of computations from mobile devices to servers (see, e.g., [43]).

Exploiting parallelism is critical for enhancing performance. In this context, when improving the performance of an application, developers should consider Amdahl's law and its extensions to the multicore era [44,45] to guide code transformations and optimizations, as well as code partitioning and mapping. [Note that Amdahl's law can also be applied to other metrics, such as power and energy consumption.]
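Amdahl's law bounds the overall speedup by the fraction of execution that remains serial. The sketch below uses illustrative numbers (a 90% parallelizable program) rather than figures from any specific application.

```python
# Amdahl's law: speedup = 1 / ((1 - f) + f / n), where f is the fraction of
# execution that benefits from parallelization and n the number of cores.
def amdahl_speedup(f: float, n: int) -> float:
    return 1.0 / ((1.0 - f) + f / n)

for n in (2, 4, 8, 16):
    print(f"{n:2d} cores: speedup {amdahl_speedup(0.9, n):.2f}")
# Even with 90% of the execution parallelized, 16 cores yield only ~6.4x,
# and the limit as n grows is 1 / (1 - 0.9) = 10x.
```

This is why the multicore-era extensions cited above [44,45] matter: the serial fraction, not the core count, quickly becomes the dominant constraint.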

There have been many approaches proposed to reduce power and energy consumption using DVFS (see, e.g., [46]). A recent and useful survey about DVFS and DPM is presented in [47]. Readers interested in power-aware computing topics are referred to Ref. [48]. Mobile devices typically have strict power and energy consumption requirements; thus, several approaches have addressed APIs for monitoring power dissipation in mobile devices. An energy consumption model for Android-based smartphones is PowerTutor [49]. Readers interested in self-built energy model approaches can refer to Ref. [50].
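The rationale behind DVFS can be sketched with the familiar dynamic-power relation, where power scales with effective capacitance times voltage squared times frequency, and voltage can often be lowered together with frequency. The capacitance, voltage, and frequency values below are illustrative assumptions, not taken from any datasheet.

```python
# Hedged DVFS sketch: dynamic power = C_eff * V^2 * f. For a CPU-bound task,
# lowering f stretches execution time, but the V^2 reduction still cuts energy.
# Static/leakage power is deliberately ignored here (it is what causes the
# diminishing returns discussed in the DVFS literature).
def dynamic_power(c_eff: float, v: float, f: float) -> float:
    return c_eff * v * v * f  # watts

def energy_cpu_bound(cycles: float, c_eff: float, v: float, f: float) -> float:
    time = cycles / f                          # seconds to retire the cycles
    return dynamic_power(c_eff, v, f) * time   # joules

cycles = 1e9
hi = energy_cpu_bound(cycles, 1e-9, 1.2, 1.0e9)  # 1 GHz at 1.2 V -> 1.44 J
lo = energy_cpu_bound(cycles, 1e-9, 0.9, 0.5e9)  # 0.5 GHz at 0.9 V -> 0.81 J
print(f"energy at 1 GHz/1.2 V: {hi:.2f} J; at 0.5 GHz/0.9 V: {lo:.2f} J")
```

Once leakage is added, running slower stops paying off below some frequency, which is the "diminishing returns" result studied in Ref. [30].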

While the use of simple performance models is well understood for single-core architectures, further understanding is needed to extend or revisit Amdahl’s law and its main lessons and extensions to the multicore era as addressed by recent research [44,45]. The complexity of emerging architectures with processor heterogeneity and custom processing engines will also likely lead to advances in more sophisticated models such as the roofline model [51,25] briefly described in this chapter. 
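The roofline model mentioned above reduces to a simple bound: attainable performance is the minimum of peak compute throughput and memory bandwidth times arithmetic intensity. The platform numbers below are hypothetical, chosen only to show where the ridge point falls.

```python
# Roofline bound: attainable GFLOP/s = min(peak compute, bandwidth * intensity),
# with arithmetic intensity (AI) in FLOPs per byte moved from memory.
def roofline(peak_gflops: float, bw_gbs: float, intensity: float) -> float:
    return min(peak_gflops, bw_gbs * intensity)

PEAK, BW = 100.0, 25.0  # GFLOP/s and GB/s for a hypothetical embedded platform
for ai in (0.5, 2.0, 4.0, 8.0):
    print(f"AI {ai:4.1f} FLOP/byte -> attainable {roofline(PEAK, BW, ai):6.1f} GFLOP/s")
# Kernels with AI below PEAK / BW = 4 FLOP/byte are memory-bound on this
# hypothetical machine; above it they are compute-bound.
```

Extending this picture to heterogeneous platforms, with one roof per accelerator and per memory channel, is precisely the kind of refinement the research cited above [51,25] pursues.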


References

[1] Pocek K, Tessier R, DeHon A. Birth and adolescence of reconfigurable computing: a survey of the first 20 years of field-programmable custom computing machines. In: The highlights of the first twenty years of the IEEE intl. symp. on field-programmable custom computing machines, Seattle, WA, April; 2013.

[2] Kuon I, Tessier R, Rose J. FPGA architecture: survey and challenges. Found Trends Electron Des Autom 2008;2(2):135–253.

[3] Burger D, Keckler SW, McKinley KS, Dahlin M, John LK, Lin C, et al. Scaling to the end of silicon with EDGE architectures. Computer 2004;37(7):44–55.

[4] Weber R, Gothandaraman A, Hinde RJ, Peterson GD. Comparing hardware accelerators in scientific applications: a case study. IEEE Trans Parallel Distrib Syst 2011;22(1):58–68.

[5] Ogawa Y, Iida M, Amagasaki M, Kuga M, Sueyoshi T. A reconfigurable java accelerator with software compatibility for embedded systems. SIGARCH Comput Archit News 2014;41(5):71–6.

[6] Owens JD, Houston M, Luebke D, Green S, Stone JE, Phillips JC. GPU computing. Proc IEEE 2008;96(5):879–99.

[7] Nickolls J, Dally WJ. The GPU computing era. IEEE Micro 2010;30(2):56–69.

[8] Mittal S, Vetter JS. A Survey of CPU-GPU heterogeneous computing techniques. ACM Comput Surv 2015;47(4). Article 69, 35 pages.

[9] Hennessy JL, Patterson DA. Computer architecture—a quantitative approach. 5th ed. San Francisco, CA, USA: Morgan Kaufmann; 2012.

[10] Oliver N, Sharma RR, Chang S, Chitlur B, Garcia E, Grecco J, et al. A reconfigurable computing system based on a cache-coherent fabric. In: Proceedings of the international conference on reconfigurable computing and FPGAs (Reconfig’11). Washington, DC: IEEE Computer Society; 2011. p. 80–5.

[11] Singh M, Leonhardi B. Introduction to the IBM Netezza warehouse appliance. In: Litoiu M, Stroulia E, MacKay S, editors. Proceedings of the 2011 conference of the center for advanced studies on collaborative research (CASCON’11). Riverton, NJ: IBM Corp; 2011. p. 385–6.

[12] Stuecheli J, Bart B, Johns CR, Siegel MS. CAPI: a coherent accelerator processor interface. IBM J Res Dev 2015;59(1):7:1–7:7.

[13] Crockett LH, Elliot RA, Enderwitz MA, Stewart RW. The Zynq book: embedded processing with the ARM cortex-A9 on the Xilinx Zynq-7000 all programmable SoC. UK: Strathclyde Academic Media; 2014.

[14] Jacobsen M, Richmond D, Hogains M, Kastner R. RIFFA 2.1: a reusable integration framework for FPGA accelerators. ACM Trans Reconfig Technol Syst 2015;8(4):22.

[15] Howes L. The OpenCL specification. Version: 2.1, document revision: 23, Khronos OpenCL Working Group. Last revision date: November 11, 2015. https://www.khronos.org/registry/cl/specs/opencl-2.1.pdf.

[16] Kaeli DR, Mistry P, Schaa D, Zhang DP. Heterogeneous computing with OpenCL 2.0. 1st ed. San Francisco, CA: Morgan Kaufmann Publishers Inc.; 2015.

[17] Kusswurm D. Modern X86 assembly language programming: 32-Bit, 64-Bit, SSE, and AVX. 1st ed. Berkely, CA: Apress; 2014.

[18] Intel Corp., Intel® 64 and IA-32 architectures optimization reference manual. Order number: 248966-033; June 2016.

[19] Langbridge JA. Professional embedded ARM development. 1st ed. Birmingham: Wrox Press Ltd.; 2014.

[20] Bakos JD. High-performance heterogeneous computing with the convey HC-1. Comput Sci Eng 2010;12(6):80–7.

[21] Xilinx Inc., Zynq-7000 all programmable SoC overview. Product specification, DS190 (v1.10); September 27, 2016.

[22] Amdahl GM. Computer architecture and Amdahl’s law. Computer 2013;46(12):38–46.

[23] Amdahl GM. Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings AFIPS spring joint computer conference; 1967. p. 483–5.

[24] Williams S, Waterman A, Patterson D. Roofline: an insightful visual performance model for multicore architectures. Commun ACM Apr. 2009;52(4):65–76.

[25] Williams S. The roofline model. In: Bailey DH, Lucas RF, Williams S, editors. Performance tuning of scientific applications. Boca Raton, FL, USA: Chapman & Hall/CRC Computational Science, CRC Press; 2010. p. 195–215.

[26] Ofenbeck G, Steinmann R, Cabezas VC, Spampinato DG, Puschel M. Applying the roofline model. In: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’2014), Monterey, CA, March 23–25; 2014. p. 76–85.

[27] Kopetz H. Real-time systems: design principles for distributed embedded applications. New York, NY, USA: Springer; 2011.

[28] Altmeyer S, Gebhard G. WCET analysis for preemptive scheduling. In: Proc. of the 8th intl. workshop on worst-case execution time (WCET) analysis; 2008.

[29] Hardy D, Puaut I. WCET analysis of instruction cache hierarchies. J Syst Archit 2011;57(7):677–94. ISSN: 1383–7621.

[30] Le Sueur E, Heiser G. Dynamic voltage and frequency scaling: the laws of diminishing returns. In: Proceedings of the 2010 International conference on power aware computing and systems (HotPower’10). Berkeley, CA: USENIX Association; 2010. p. 1–8.

[31] Laros III J, Pedretti K, Kelly SM, Shu W, Ferreira K, Vandyke J, et al. Energy delay product. In: Energy-efficient high performance computing. Springer briefs in computer science. London: Springer; 2013. p. 51–5.

[32] Benini L, Bogliolo A, De Micheli G. A survey of design techniques for system-level dynamic power management. In: De Micheli G, Ernst R, Wolf W, editors. Readings in hardware/software co-design. Norwell, MA: Kluwer Academic Publishers; 2001. p. 231–48.

[33] Jha NK. Low power system scheduling and synthesis. In: Proceedings of the IEEE/ACM international conference on computer-aided design (ICCAD’01). Piscataway, NJ: IEEE Press; 2001. p. 259–63.

[34] Dennard RH, Gaensslen FH, Rideout VL, Bassous E, LeBlanc AR. Design of ion-implanted MOSFET's with very small physical dimensions. IEEE J Solid State Circuits 1974;SC–9(5):256–68.

[35] Esmaeilzadeh H, Blem E, St.Amant R, Sankaralingam K, Burger D. Dark silicon and the end of multicore scaling. SIGARCH Comput Archit News 2011;39(3):365–76.

[36] Trimberger S. Three ages of FPGAs: a retrospective on the first thirty years of FPGA technology. Proc IEEE 2015;103:318–31.

[37] Hauck S, DeHon A, editors. Reconfigurable computing: the theory and practice of FPGA-based computation. San Francisco, CA, USA: Morgan Kaufmann/Elsevier; 2008.

[38] Najjar WA, Ienne P. Reconfigurable computing. IEEE Micro 2014;34(1):4–6. special issue introduction.

[39] Cardoso JMP, Diniz PC. Compilation techniques for reconfigurable architectures. New York, NY, USA: Springer; 2008. October.

[40] Cardoso JMP, Diniz PC, Weinhardt M. Compiling for reconfigurable computing: a survey. ACM Comput Surv 2010;42(4):1–65.

[41] Rashid R, Steffan JG, Betz V. Comparing performance, productivity and scalability of the TILT overlay processor to OpenCL HLS. In: IEEE Conference on field-programmable technology (FPT'14); 2014. p. 20–7.

[42] Cope B, Cheung PYK, Luk W, Howes LW. Performance comparison of graphics processors to reconfigurable logic: a case study. IEEE Trans Comput 2010;59(4):433–48.

[43] Kun Y, Shumao O, Hsiao-Hwa C. On effective offloading services for resource-constrained mobile devices running heavier mobile internet applications. IEEE Commun Mag 2008;46(1):56–63.

[44] Hill MD, Marty MR. Amdahl’s law in the multicore era. IEEE Comput 2008;41:33–8.

[45] Cassidy AS, Andreou AG. Beyond Amdahl’s law: an objective function that links multiprocessor performance gains to delay and energy. IEEE Trans Comput 2012;61(8):1110–26.

[46] Saha S, Ravindran B. An experimental evaluation of real-time DVFS scheduling algorithms. In: Proc. of the 5th annual international systems and storage conf. (SYSTOR’12). New York, NY: ACM; 2012. Article 7, 12 pages.

[47] Bambagini M, Marinoni M, Aydin H, Buttazzo G. Energy-aware scheduling for real-time systems: a survey. ACM Trans Embed Comput Syst 2016;15(1):1–34.

[48] Melhem R, Graybill R, editors. Power aware computing. Norwell, MA, USA: Kluwer Academic/Plenum Publishers; May, 2002.

[49] Zhang L, Tiwana B, Qian Z, Wang Z, Dick RP, Morley Mao Z, et al. Accurate online power estimation and automatic battery behavior based power model generation for smartphones. In: Proc. of the 8th Int. conf. on hardware/software codesign and system synthesis (CODES+ISSS 2010), Scottsdale, AZ, October 24–28; 2010. p. 105–14.

[50] Dong M, Zhong L. Self-constructive high-rate system energy modeling for battery-powered mobile systems. In: Proceedings of the 9th international conference on Mobile systems, applications, and services (MobiSys'11). New York, NY: ACM; 2011. p. 335–48.

[51] Williams S, Waterman A, Patterson D. Roofline: an insightful visual performance model for multicore architectures. Commun ACM 2009;52(4):65–76.

The next installment in this series discusses code retargeting for CPU-based platforms.

Reprinted with permission from Elsevier/Morgan Kaufmann, Copyright © 2017

João Manuel Paiva Cardoso, Associate Professor, Department of Informatics Engineering (DEI), Faculty of Engineering, University of Porto, Portugal. He was previously Assistant Professor in the Department of Computer Science and Engineering, Instituto Superior Técnico (IST), Technical University of Lisbon (UTL), in Lisbon (April 4, 2006–Sept. 3, 2008), and Assistant Professor (2001–2006) in the Department of Electronics and Informatics Engineering (DEEI), Faculty of Sciences and Technology, at the University of Algarve, where he was also a Teaching Assistant (1993–2001). He has been a senior researcher at INESC-ID (Systems and Computer Engineering Institute) in Lisbon, and was a member of INESC-ID from 1994 to 2009.

José Gabriel de Figueiredo Coutinho, Research Associate, Imperial College. He is involved in the EU FP7 HARNESS project to integrate heterogeneous hardware and network technologies into data centre platforms, to vastly increase performance, reduce energy consumption, and lower cost profiles for important and high-value cloud applications such as real-time business analytics and the geosciences. His research interests include database functionality on heterogeneous systems, cloud computing resource management, and performance-driven mapping strategies.

Pedro C. Diniz received his M.Sc. in Electrical and Computer Engineering from the Technical University in Lisbon, Portugal, and his Ph.D. in Computer Science from the University of California, Santa Barbara, in 1997. Since 1997 he has been a researcher with the University of Southern California's Information Sciences Institute (USC/ISI) and an Assistant Professor of Computer Science at the University of Southern California in Los Angeles, California. He has led and participated in many research projects funded by the U.S. government and the European Union (EU) and has authored or co-authored many internationally recognized scientific journal papers and over 100 international conference papers. Over the years he has been heavily involved in the scientific community in the area of high-performance computing, reconfigurable and field-programmable computing.
