Achieving better software performance through memory-oriented code optimization - Part 1
Editor’s Note: In Part 1 in a series excerpted from his book Software Engineering for Embedded Systems, Mike Brogioli of Polymathic Consulting provides tips on various software code and compiler optimization techniques that can be used to improve memory performance in embedded systems.
Optimization metrics for compiled code are not always measured in execution clock cycles on the target architecture. Consider modern cellular telephone or wireless devices, which may download executables over a wireless network connection or backhaul infrastructure. In such cases, it is often advantageous for the compiler to reduce the size of the compiled code that must be downloaded to the wireless device. By reducing the size of the code, savings are achieved in terms of bandwidth required for each wireless point of download.
Optimization metrics such as memory system performance of compiled code are often important to developers. These are metrics correlated to the dynamic run-time behavior of not only the compiled code on the target processor, but also the underlying memory system, caches, DRAM, and buses, etc.
By efficiently arranging the data within the application or, more specifically, the order in which data and corresponding data structures are accessed by the application dynamically at run-time, significant performance improvements can be gained at the memory-system level. In addition, vectorizing compilers can improve performance due to spatial locality of data when SIMD instruction sets are present and varying memory-system alignment conditions are met.
The next section describes optimization techniques that may be used to improve application code size. The first techniques presented fall under the category of compiler “flag mining”, which is the means by which different permutations of compile-time options are used to achieve the desired results on generated code. In addition, lower-level system details are presented, such as application binary interfaces and multiple encoding instruction set architectures, as vehicles to further reduce code size in the presence of resource-constrained systems.
Code size optimizations
In compiling a source code project for execution on a target architecture, it is often desirable for the resulting code size to be reduced as much as possible. Reasons for this pertain to both the amount of space in memory the code will occupy at program run-time and the potential reduction in the amount of instruction cache needed by the device. In reducing the code size of a given executable, a number of factors can be tweaked during the compilation process to accommodate this need.
Compiler flags and flag mining. Typically, users will begin by configuring the compiler to build the program for size optimization, frequently using a compiler command line option such as —Os, as is available in the GNU GCC compiler as of version 4.5. When building for code size, it is not uncommon for the compiler to disable other optimizations that frequently result in improvements in the run-time performance of the code. Examples of these might be loop optimizations such as loop unrolling, or software pipelining, which typically are performed in an attempt to increase the run-time performance of the code at the cost of increases in the compiled code size. This is due to the fact that the compiler will insert additional code into the optimized loops such as prologue and epilogue code in the case of software pipelining, or additional copies of the loop body in the case of loop unrolling.
In the event that users do not want to disable all optimization or build exclusively at optimization level —O0 with code size optimization enabled, users may also want to disable functionality such as function inlining via either a compiler command line option or compiler pragma, depending on the build tools system and functionality supported. It is often the case that at higher levels of program optimization, specifically when optimizing for program run-time performance, compilers will attempt to inline copies of a function, whereby the body of the function code is inlined into the calling procedure, rather than the calling procedure being required to make a call into a callee procedure, resulting in a change of program flow and obvious system side effects. By specifying either as a command line option or a via a customer compiler pragma, the user can prevent the tools from inadvertently inlining various functions which would result in an increase in the overall code size of the compiled application.
When a development team is building code for a production release, or in a user case scenario when debugging information is no longer needed in the executable, it may also be beneficial to strip out debugging information and symbol table information. In doing this, significant reductions in object file and executable file sizes can be achieved. Furthermore, in stripping out all label information, some level of IP protection may be afforded to the user in that consumers of the executable will have a difficult time reverse engineering the various functions being called within the program.
Target ISA for size and performance tradeoffs
Various target architectures in the embedded space may afford additional degrees of freedom when trying to reduce code size of the input application. Quite often it is advantageous for the system developer to take into consideration not only the algorithmic complexity and software architecture of their code, but also the types of arithmetic required and how well those types of arithmetic and system requirements map to the underlying target architecture. For example, an application that requires heavy use of 32-bit arithmetic may run functionally correctly on an architecture that is primarily tuned for 16-bit arithmetic; however, an architecture tuned for 32-bit arithmetic can provide a number of improvements in terms of both performance, code size, and perhaps power consumption.
Variable-length instruction encoding is one particular technology that a given target architecture may support, which can be effectively exploited by the build tools to reduce overall code size. In variable-length instruction coding schemes, certain instructions within the target processor’s ISA may have what is referred to as “premium encodings”, whereby those instructions most commonly used can be represented in a reduced binary footprint. One example of this might be a 32-bit embedded Power Architecture device, whereby frequently used instructions such as integer add are also represented with a premium 16-bit encoding. When the source application is compiled for size optimization, the build tools will attempt to map as many instructions as possible to their premium encoding counterpart in an attempt to reduce the overall footprint of the resulting executable.
Freescale Semiconductor supports this feature in the Power Architecture cores for embedded computing, as well as in their StarCore line of DSPs. Other embedded processor designs such as those by ARM Limited and Texas Instruments’ DSP have also employed variable encoding formats for premium instructions, in an effort to curb the resulting executable’s code size footprint.
In the case of Freescale’s Power Architecture, Freescale states that both standard 32-bit code and 16-bit premium-encoded code can be mixed interchangeably within the executable on a flash page size access basis. Other architectures may opt to specify the encoding within some format of prefix bits, allowing an even finer level of code intermingling.
It should be mentioned than the reduced-footprint premium encoding of instructions in a variable-length encoding architecture often comes at the cost of reduced functionality. This is due to the reduction in the number of bits that are afforded in encoding the instruction, often reduced from 32 bits to 16 bits.
An example of a non-premium encoding instruction versus a premium encoding instruction might be an integer arithmetic ADD instruction. On a non-premium-encoded variant of the instruction, the source and destination operations of the ADD instruction may be any of the 32 general-purpose integer registers within the target architecture’s register file. In the case of a premium-encoded instruction, whereby only 16 bits of encoding space are afforded, the premium-encoded ADD instruction may only be permitted to use R0—R7 as source and destination registers, in an effort to reduce the number of bits used in the source and register destination encodings. Although it may not readily be apparent to the application programmer, this can result in subtle, albeit minor, performance degradations. These are often due to additional copy instructions that may be required to move source and destination operations around to adjacent instructions in the assembly schedule because of restrictions placed on the premium-encoded variants.
As evidence of the benefits and potential drawbacks of using variable-length encoding instruction set architectures as a vehicle for code size reduction, benchmarking of typical embedded codes when targeting Power Architecture devices has shown VLE, or variable- length encoding, enabled code to be approximately 30% smaller in code footprint size than standard Power Architecture code while only exhibiting a 5% reduction in code performance. Resulting minor degradations in code performance are typical, due to limitations in functionality when using a reduced instruction encoding format of an instruction.
Floating-point arithmetic and arithmetic emulation may be another somewhat obfuscated source of code size explosion. Consider the case in which the user’s source code contains loops of intensive floating-point arithmetic when targeting an architecture lacking native floating-point functionality in hardware. In order to functionally support the floating-point arithmetic, the build tools will often need to substitute in code to perform floating-point arithmetic emulation at program run-time. This typically entails trapping to a floating-point emulation library that provides the required functionality, such as floating-point division, using the existing non-floating-point instructions natively supported on the target architecture.
As one might predict, it is not uncommon for a given floating-point emulation routine to require hundreds of target processor clock cycles to emulate the floating-point operation, which execute over tens if not hundreds of floating-point emulation instructions. In addition to the obvious performance overhead incurred versus code targeting a processor with native floating-point support in hardware, significant code size increases will occur due to the inclusion of floating-point emulation libraries or inlined floating-point emulation code. By correctly matching the types of arithmetic contained in the source application with the underlying native hardware support of the target architecture, reductions in the overall resulting executable size can be achieved with some effort.
Tuning the ABI for code size
In software engineering, the application binary interface (ABI) is the low-level software interface between a given program and the operating system, system libraries, and even inter-module communication within the program itself. The ABI itself is a specification for how a given system represents items such as data types, data sizes, alignment of data elements and structures, calling conventions and related modes of operations. In addition, a given ABI may specify the binary format of object files and program libraries. The calling convention and alignment may be areas of interest to those wishing to reduce the overall code size of their application by using a custom calling convention within their particular application.
A given target processor and related ABI will often specify a calling convention to be used between functions within the application, as well as calls to the underlying operating system, run-time libraries and so forth. It is often desirable for a vendor to specify a default calling convention that affords a reasonable level of performance for the general use case when making calls between caller and callee procedures within an application.
At the same time, such a default calling convention may also attempt to make reasonable reductions in the code size generated in both the caller and callee procedures for maintaining machine- level state consistency between both the caller and callee. Often, however, this is not ideal for an application developer who demands either tight restrictions on code size or, in other cases, high levels of compiled code performance in key system kernels of hot paths within the call graph.
Consider for example the function defined in Figure 12.1, which passes a large number of 16-bit integer values from the caller to the callee procedure:
short tap_00, tap_01, tap_02, tap_03,
tap_04, tap_05, tap_06, tap_07,
// some computation occurs, setting up taps
callee_result = callee_procedure (tap_00, tap_01,
//subsequent computation occurs based on results
long callee_procedure(short tap_00, short tap_01,
short tap_02, short tap_03,
short tap_04, short tap_05,
short tap_06, short tap_07)
// do computation
Looking at this example, it can be seen that the caller procedure computes a number of 16- bit values that must be passed as input parameters to the callee procedure. The callee procedure will then use these input values to compute some result that is passed back to the caller procedure to use in subsequent computation.
Let’s also assume that we are dealing with a somewhat trivialized ABI that is succinct for this illustrative example. The ABI assumes a 32-bit general purpose embedded architecture that has a 32-bit general purpose register file. The default calling convention for this ABI states that the first two char, short, or integer, values that are passed to a callee procedure get passed in general-purpose registers R00 and R01, with subsequent parameters being passed from the caller to the callee via the stack. This might be typical for a processor targeting a mobile embedded device that was sensitive to both performance and code size. The resulting assembly might look something like Figure 12.2.
;* NOTE: Using default ABI, R00 and R01 can be used to pass
;* parameters from caller to callee, all other parameters
;* must be passed via the stack.
;* SP_TAP_00 contains tap_00
;* SP_TAP_01 contains tap_01
;* SP_TAP_02 contains tap_02
;* SP_TAP_03 contains tap_03
;* SP_TAP_04 contains tap_04
;* SP_TAP_05 contains tap_05
;* SP_TAP_06 contains tap_06
;* SP_TAP_07 contains tap_07
;* some computation setting tap_00 .. tap_07 in local memory
;* and various bookkeeping.
;* all parameters that cannot be passed via default ABI
;* configuration must be pushed onto the stack.
LOAD R00, (SP+TAP_03);
PUSH R00 /* SP+4
LOAD R00, (SP+TAP_04);
PUSH R00 /* SP+4
LOAD R00, (SP+TAP_05);
PUSH R05 /* SP+4
LOAD R00, (SP+TAP_06);
PUSH R00 /* SP+4
LOAD R00, (SP+TAP_07);
PUSH R00 /* SP+4
;* R00 contains tap_00
;* R01 contains tap_01
;* tap_2 through tap_07 have been passed via the stack, as seen
;* previously being setup in caller_procedure via the push operations.
;* Upon entry, callee_procedure must transfer all of the input parameters
;* passed via the stack into registers for local computation. This
;* requires additional instructions both on the caller side (to put on
;* the stack) as well as the callee size (to restore from the stack).
;* NOTE: INSERT PROS AND CONS
;* ADJUST STACK POINTER TO NOW POINT TO CALLEE'S STACK FRAME
;* SO WE CAN ACESS DATA PASSED STACK IN ABI COMPLIANCE
POP R07; ;* tap_07 into R07, SP-=4
POP R06; ;* tap_06 into R06, SP-=4
POP R05; ;* tap_05 into R05, SP-=4
POP R04; ;* tap_04 into R04, SP-=4
POP R03; ;* tap_03 into R03, SP-=4
POP R02; ;* tap_02 into R02, SP-=4
;* Perform local computation on input parameters now stored
;* in registers R00-R007,storing result into SP+RESULT_OFFSET