High-performance embedded computing -- Performance

João Cardoso, José Gabriel Coutinho, and Pedro Diniz

January 23, 2018



Amdahl’s law [22,23] states that the performance improvement of a program is limited by the sections that must be executed sequentially, and it can thus be used to estimate the potential for speeding up applications using parallelization and/or hardware acceleration. In short, Amdahl’s law states that the speedup achieved by accelerating portions of an application is limited by the code sections that are not accelerated. More formally, the achievable speedup is limited by the fraction (1 − f) of the application’s execution time that is not improved and by the actual speedup s one can achieve for the remaining fraction f of the application’s execution. This application speedup can thus be expressed by Eq. (2.4).
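Eq. (2.4) itself is not reproduced in this excerpt; the standard form of Amdahl’s law it refers to, written in terms of the fractions (1 − f) and f and the local speedup s defined above, is:

```latex
\mathrm{Speedup}(f, s) = \frac{1}{(1 - f) + \dfrac{f}{s}} \qquad (2.4)
```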

This relation indicates that one must improve fractions f that represent a significant share of the global execution time if we are to attain substantial overall performance improvements. Conversely, if a computation has a substantial fraction that cannot be (or is not) improved, that fraction limits the achievable speedup of the overall application. As the speedup s of the fraction f increases, possibly by using additional resources [Note: under the assumption that there are no hardware resource dependences or data dependences precluding the effective use of these resources for concurrent execution] and/or advanced compiler transformations, the overall application speedup asymptotically tends to the limit given by Eq. (2.5).
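Eq. (2.5) is likewise not reproduced in this excerpt; taking the limit of Amdahl’s law as s grows without bound gives the standard form:

```latex
\lim_{s \to \infty} \mathrm{Speedup}(f, s) = \frac{1}{1 - f} \qquad (2.5)
```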

Fig. 2.18 shows the speedup limit when considering different fractions f being accelerated and considering s as infinite. As can be observed, to attain a speedup of 2, the fraction f of the original application code whose execution time needs to be reduced to a negligible amount is 50%. To attain a speedup of 10, the fraction f needs to be 90%. Higher performance improvements can thus only be achieved if increasingly large sections of the code have their execution time substantially reduced. In other words, the maximum achievable performance is limited by the intrinsically sequential sections of the application’s code that cannot be, or simply are not, subject to optimization.
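These limit values can be checked with a few lines of Python; this is just the 1/(1 − f) ceiling from the limit of Amdahl’s law, evaluated for the two fractions discussed above:

```python
def speedup_limit(f):
    """Maximum attainable speedup when the fraction f of the execution
    time is reduced to a negligible amount (s -> infinity in Amdahl's law)."""
    return 1.0 / (1.0 - f)

# Matching the values discussed in the text:
# accelerating 50% of the execution time caps the speedup at 2,
# accelerating 90% caps it at 10.
limit_half = speedup_limit(0.5)
limit_ninety = speedup_limit(0.9)
```
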


FIG. 2.18 Speedup limit with respect to the fraction of the application’s execution time being accelerated (considering the theoretical case of reducing those fractions’ execution time to zero, e.g., by using infinite hardware resources).

This speedup analysis, while theoretical, is in practice very important as it can provide insights into the speedup limits, thus providing an early cost-benefit analysis for programmer effort and/or additional hardware resource use.

As an example of the usefulness of such a performance model, even if simplified, we consider the Libmad library for MP3 decoding and its theoretical limits for acceleration. Fig. 2.19 shows the profiling results when running this program on a typical desktop PC. Using the idealized performance model with no communication cost, we would expect a maximum speedup of 1.11 when considering the hardware acceleration of dct32, a function very amenable to hardware implementation. To achieve a 10× speedup, we must consider accelerating at least the top 9 functions revealed by the profiling results, which together represent over 90% of the overall program’s execution time. A plot of the maximum attainable speedup is depicted in Fig. 2.20. Here, the speedup is computed by cumulatively optimizing the most time-consuming functions in the Libmad library. The plot captures the results of three experiments. The first reveals an idealized speedup assuming no communication cost with a hardware accelerator nor any other source of overhead. The second assumes that, for each function offloaded to the hardware accelerator, the communication cost is 1% of its execution time (a hypothetical case where the cost of data communication is proportional to the computing time). The last experiment assumes this communication overhead climbs to 5%. The exponential effect shown in Fig. 2.18 is very noticeable in the first experiment presented in Fig. 2.20. The ideal acceleration of the last three, least time-consuming functions presented (i.e., “III_freqinver,” “III_reorder,” and “III_exponents”), all with negligible execution time relative to the overall execution time of the program (see Fig. 2.19), increases the speedup from 52.29 to 70.53, 97.84, and 126.38, respectively.
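A minimal sketch of how such cumulative-speedup curves can be computed, under the same assumptions as the experiments above: each offloaded function runs in negligible time, but still pays a communication cost proportional to its original execution time. The profile fractions below are hypothetical placeholders, not the actual Libmad measurements:

```python
def cumulative_speedup(fractions, overhead=0.0):
    """Speedup when each fraction in `fractions` is reduced to negligible
    execution time, but each offloaded function still pays a communication
    cost equal to `overhead` times its original execution time."""
    accelerated = sum(fractions)  # total fraction of execution time offloaded
    remaining = (1.0 - accelerated) + overhead * accelerated
    return 1.0 / remaining

# Hypothetical per-function execution-time fractions (NOT the real profile),
# ordered from most to least time-consuming:
profile = [0.40, 0.25, 0.15, 0.10]

# Cumulatively offload the top-k functions under the three overhead scenarios:
curves = [
    (k,
     cumulative_speedup(profile[:k]),        # idealized, no overhead
     cumulative_speedup(profile[:k], 0.01),  # 1% communication cost
     cumulative_speedup(profile[:k], 0.05))  # 5% communication cost
    for k in range(1, len(profile) + 1)
]
```

Even this toy model reproduces the qualitative behavior of Fig. 2.20: the idealized curve grows steeply as the accelerated fraction approaches 1, while even a small communication overhead flattens it considerably.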


FIG. 2.19 Profiling results for the Libmad library for MP3 decoding considering the 20 most time-consuming functions.


FIG. 2.20 Maximum attainable speedup for various communication cost scenarios when cumulatively optimizing each source code function of the Libmad library (note that this is simply indicative and assumes that all the selected functions can be offloaded to hardware accelerators and that their implementation in those accelerators has negligible execution time).

Clearly, this idealized scenario is overly optimistic and in practice infeasible. It is also immediately apparent that communication and synchronization costs need to be kept to a minimum, as they quickly erode the performance gains from accelerating even a substantial fraction of the application’s execution time. Yet even a simple model like this one can help developers choose a feasible section of the application’s source code to improve.
