How to Reduce Code Size (and Memory Cost) Without Sacrificing Performance
Today's intelligent compilers offer many options for squeezing more performance out of application code. Many of these optimizations, however, tend to increase overall code size.
As a result, once developers of optimized application code have reached the required performance specifications, there still remains the challenge of bringing code size back under control.
Through an iterative process of building application code using different compiler optimization options and profiling the result, developers can hone in and identify infrequently used and non-critical sections of code to trade off performance where it matters least for reduced code size, providing minimal impact on system performance. Often, varying compiler options to reduce code size can enable developers to decrease the amount of on-chip and external memory an application requires without adversely affecting performance, thereby reducing the overall bill of materials (BOM).
Optimizing for Code Size
Code size optimization is a part of the fine tuning phase of development, typically after debugging has been completed. At this point, compiler options have already been selected to optimize code so that the application meets performance specs. These options however were selected on the basis of performance only and typically are applied broadly to large sections of code. This fine-tuning stage represents an exploration to find a configuration of options that better maximizes the performance/code size trade-off better than those initially selected by the developer.
The challenge here is to shrink code size without adversely affecting performance. Compilers offer a number of optimizations, such as loop unrolling and in-lining code, that can increase performance while increasing code size, as shown in the code samples below.
The value of such optimizations is dependent upon the number of times a particular function is called. As shown, Code Sample 1 offers the smallest code size and lowest performance. Code Sample 3 offers the highest performance and largest code size. Code Sample 2 offers an alternative between these two extremes.
Consider two blocks of code, one part of a critical computational loop that represents a significant portion of CPU utilization for a particular algorithm and the other an initialization block that is only run once the first time an application is turned on. In both cases, performance can be improved by the compiler implementing loop unrolling and function in-lining at the expense of increasing code size. Doubling the performance in the critical computational loop provides a significant increase in overall performance. Doubling the performance in the initialization code, however, offers an increase of 0% in overall performance after the application has successfully powered on.
Typically, the initial optimization configuration created during the debugging phase affects the system in broad ways, such as unrolling loops that may or may not be critical to performance. Profiling code helps establish where optimizations will yield the greatest recurring gains in performance compared to the increase in code size. In the example above, optimizing initialization code to reduce code size will leave more headroom to optimize code in critical loops for performance.
It is important to realize that the initial optimization configuration is made up of less efficient optimizations than is ideally possible. For example, eliminating the performance optimization in initialization code actually frees up memory without adversely affecting performance. This additional memory can then be used to optimize other code that will yield more significant benefits. As a result, tuning for code size could actually result in a smaller program with improved performance.
Maximizing the Performance/Code Size Trade-Off
In this stage, developers begin to narrow the scope of configuration options to specific functions or portions of functions, making the trade-off of larger code for improved performance where doing so yields the most benefit. Note that fine tuning for code size does not involve program data in any way. For example, whether to place data on-chip or in external memory needed to be decided during the debug phase in order to meet performance goals. This optimization stage focuses only on code size, trading off available performance headroom for reduced code space requirements.
Code size tuning is a comparative process. For example, the relative value of optimizing code that initializes a loop is less than that compared to code within the loop but more than code that manages, say, the user interface. The more often code is called and the less memory required to implement an optimization, the greater the impact of the optimization. Put in mathematical terms, the ideal optimization configuration at a functional level is the summation of all functions based on their size, speed, and number of times they are executed:
Herein lies the challenge to developers. The number of optimization options available, as well as the range (i.e., scope) of code over which they are applied, makes it extremely difficult for developers to determine the most efficient opportunities for trading off between performance and code size. For large applications, the number of possible compiler option combinations at the functional level is tremendous. If performed manually, profiling an application can be extremely tedious and time-consuming and is not an efficient use of a developer’s valuable time. Even if recompiling an entire application with new compiler switches and then executing code to profile performance is a matter of minutes using automated tools, employing brute force to calculate all the possible scenarios simply is not feasible.
Reducing the Problem
The most significant factor in accelerating the optimization of code size is avoid extensive time-consuming profiling on a configuration-by-configuration basis. Fortunately, this optimization problem can be significantly reduced using statistical methods to locate a probable estimate. In this way, not only can developers quickly identify a configuration of compiler optimization options that meets both code size and performance requirements, the entire process can be reduced to a matter of hours compared to weeks.
This is possible because developers only need to find a combination of optimization options that meets or exceeds performance and memory specs. For example, say that a particular processor offers two memory sizes. Once code size has been reduced to the point that the smaller memory capacity is sufficient, there's no additional benefit to be gained from further reducing code size. Thus, any combination of optimization options that results in this code size or smaller while providing sufficient performance is good enough.
The fact that the developer only has to find one combination that meets performance and code size requirements – not just the best one – significantly simplifies this process by greatly expanding the set of acceptable combinations. As a result, developers can use an iterative process to locate an acceptable combination, in many cases only requiring a single iteration.
Instead of starting with a single configuration combination and making iterative changes that either increase or decrease the overall efficiency of optimizations, developers can instead use several configurations to not only profile code execution but to profile optimization efficiency as well. For example, profiling two combinations that represent optimization extremes (i.e. unroll every loop and unroll no loops) provides the minimum and maximum performance and code size for each function based on this optimization. Code size is calculated by the actual size of the function and performance is profiled by executing the application to determine how many times the function is called under typical operating conditions.
Initially, this only provides two system-level data points. However, from these two profiles it is possible to estimate the impact on performance and code size of applying or not applying the optimization to each function. As per Equations 4 and 5, below, the performance and code size of unrolling every loop would be:
The relative impact of unrolling and rolling different loops can be easily and accurately estimated by substituting Fxunrolled and Fxrolled in all of the possible combinations. Doing so will provide a performance/code size curve across this optimization option (see Figure 1, below).
|Figure 1: The performance versus code size curve shows estimated results for each of the possible optimization configurations. A developer can locate an acceptable configuration by choosing any point along the curve meeting below the minimum performance requirement and to the left of the maximum acceptable memory will be sufficient.(Source: TI)|
The curve shows performance versus code size estimates for each of the possible optimization configurations. A developer can locate an acceptable configuration by choosing a point below the minimum performance requirement and to the left of the maximum acceptable memory. Any point along the curve meeting both of these requirements will be sufficient.
The beauty of creating the performance/code size curve is that it enables developers to likewise trade increased code size for more performance. Since memory comes in quantum sizes, once code fits into the smallest memory size for the processor (or it has been determined that getting down to the next smaller size is not possible), there’s no reason to shrink code size any further.
The curve allows developers to then apply code size back to performance, yielding benefits such as operating the CPU at a lower frequency to achieve better power consumption, especially effective if dynamic power-saving mechanisms are in place.
Using three configurations representing various optimization extremes has been found to provide a sufficient representation of the effects of each optimization option to provide a performance/code size curve that maximizes the performance/code size trade-off. Developers can select a point anywhere on the curve that meets either performance or code size requirements, based on the particular limitations of the application.
The final step is to recompile the code with this particular combination to establish that the actual performance and code size requirements are met. Minor variations can arise from real-time system events such as interrupts that vary code execution order and actual execution time. Also, determining the actual number of times a function is called can be tricky; for example, a function may be in-lined in some cases and not in others. Additionally, the in-lined version of the function may be optimized differently in different functions. In general, estimates are accurate within a few percent.
The advantage of the statistical approach is that the process requires approximately as much time as it takes to recompile and profile an application for the three extreme configurations. Typically, creating the performance/code size curves takes significantly less time than running even a single profile, given the relative simplicity of these calculations compared to simulating the application.
This said, there are certain techniques that can be employed to speed up curve processing and simplify the final set of choices presented to the developer. For example, a function may not change significantly between two configuration options, either resulting in equivalent code or a one cycle or one bite difference.
In such cases, the weight of this function can be reduced to a single value, eliminating the need to iterate across the function and substantially reducing the number of calculations that need to be made. Alternatively, a developer has the choice to define critical functions, based on profiling information or an understanding of the code. In this way certain optimizations can be forced, further reducing the statistical space.
In the case that none of the points on the curve meet the performance and code size specs for the application, the developer can select additional or new combination extremes to expose a new curve and subset of combinations until the developer finds one to meet specifications. Of course, it is possible that the code size expectations are unrealistic and that the application needs to be redesigned.
It may also be the case that profiling has not accurately weighted the most critical sections of code. Consider a digital camera with a function that displays the current image on the CMOS sensor onto the display. While a user is framing a picture, this function is called quite frequently. When the photo is taken, however, the function is not called.
Profiling data would reveal that the display function is called with a high frequency. At first glance, this would seem like excellent code to optimize for speed. The reality is that the display function is called when the processor is doing little else so increased performance is not important; the processor will simply sit idle longer.
This difficulty can be resolved with exit points. Exit points are used in profiling to end a program that would otherwise run forever. Exit points, however, can also be used to further fine tune performance by limiting the subset of code that is profiled. Using several start and exit points, a developer can limit performance profiling to specific operating circumstances, such as when the system is stressed.
In the case of a digital camera, placing a start point at the moment when the picture button is pressed and an exit point at the moment the picture is captured will profile the most compute intensive functions of the processor; i.e., those times when the processor is stretched to capacity. Exit point techniques can also exclude binary library code which cannot be optimized. Limiting profiling in this way can maximize performance and code size trade offs.
Even Finer Tuning
Note that this technique has been described in granularity down to the functional level. Many functions with critical loops, however, are also comprised of initialization and exception code. If the function is optimized as a whole, the optimization of non-critical code will reduce the statistical efficiency of the function. In those cases where the developer is having trouble meeting code size and performance requirements, specifying optimizations for different sections of key functions will improve overall efficiency.
Depending upon the profiling tools available and means of specifying optimization choices it may be challenging to apply and measure the effects of specifying optimizations for different sections of a function. If so, it might make sense to make critical loop code, for example, a separate function.
Since this function is only called once in the code, it is relatively straightforward to ensure that the compiler will inline the loop back into the function without introducing any overhead or affecting performance. However, now the critical loop is a separate function itself and can be optimized appropriately apart from its accompany initialization and exit code.
By varying compiler-based optimizations, developers can trade off between performance and code size. By combining automated profiling and statistically-based estimation techniques, developers can reduce the human effort -- and error – involved in managing code size.
Resulting performance/code size curves enable developers to quickly and easily select not only a highly efficient configuration that would otherwise take weeks to discover, but to further maximize application performance by applying optimizations where they yield the most benefit. In this way, developers are sure to minimize both performance and memory requirements at the same time.
Ruby Li is System Software Engineer at Texas Instruments, Inc.