Achieving better software performance through memory-oriented code optimization – Part 1

Editor’s Note: In Part 1 of a series excerpted from his book Software Engineering for Embedded Systems, Mike Brogioli of Polymathic Consulting provides tips on software code and compiler optimization techniques that can be used to improve memory performance in embedded systems.

Optimization metrics for compiled code are not always measured in execution clock cycles on the target architecture. Consider modern cellular telephone or wireless devices, which may download executables over a wireless network connection or backhaul infrastructure. In such cases, it is often advantageous for the compiler to reduce the size of the compiled code that must be downloaded to the wireless device. By reducing the size of the code, savings are achieved in the bandwidth required for each download to the device.

Optimization metrics such as the memory-system performance of compiled code are often important to developers. These metrics are correlated with the dynamic run-time behavior not only of the compiled code on the target processor, but also of the underlying memory system: caches, DRAM, buses, and so on.

By efficiently arranging the data within the application or, more specifically, the order in which data and corresponding data structures are accessed by the application dynamically at run-time, significant performance improvements can be gained at the memory-system level. In addition, vectorizing compilers can improve performance via the spatial locality of data when SIMD instruction sets are present and the memory system's alignment requirements are met.
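As a brief, hedged illustration, the C fragment below shows the kind of data arrangement a vectorizing compiler can exploit; the alignment attribute is GCC-style syntax and the array names, sizes, and function are purely illustrative:

     /* Contiguous arrays with known alignment give the vectorizer the
        spatial locality and alignment guarantees it needs to emit SIMD
        loads and stores (GCC-style attribute, illustrative only). */
     float a[1024] __attribute__((aligned(16)));
     float b[1024] __attribute__((aligned(16)));
     float c[1024] __attribute__((aligned(16)));

     void vec_add(int n)    /* assumes n <= 1024 */
     {
          /* Unit-stride accesses over adjacent elements are prime
             candidates for SIMD vectorization. */
          for (int i = 0; i < n; i++)
               c[i] = a[i] + b[i];
     }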

The next section describes optimization techniques that may be used to improve application code size. The first techniques presented fall under the category of compiler “flag mining”, which is the means by which different permutations of compile-time options are used to achieve the desired results on generated code. In addition, lower-level system details are presented, such as application binary interfaces and multiple encoding instruction set architectures, as vehicles to further reduce code size in the presence of resource-constrained systems.

Code size optimizations
In compiling a source code project for execution on a target architecture, it is often desirable for the resulting code size to be reduced as much as possible. Reasons for this pertain to both the amount of space in memory the code will occupy at program run-time and the potential reduction in the amount of instruction cache needed by the device. In reducing the code size of a given executable, a number of factors can be tweaked during the compilation process to accommodate this need.

Compiler flags and flag mining. Typically, users will begin by configuring the compiler to build the program for size optimization, frequently using a compiler command line option such as -Os, as available in the GNU GCC compiler. When building for code size, it is not uncommon for the compiler to disable other optimizations that frequently improve the run-time performance of the code. Examples include loop optimizations such as loop unrolling and software pipelining, which are typically performed in an attempt to increase run-time performance at the cost of increased compiled code size. This is because the compiler inserts additional code into the optimized loops: prologue and epilogue code in the case of software pipelining, or additional copies of the loop body in the case of loop unrolling.
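As a sketch of what such flag mining might look like with the GNU tools (flag spellings vary by compiler and version, and the file names here are illustrative):

     # Optimize for size; disable an optimization known to grow code
     gcc -Os -fno-unroll-loops -c kernel.c -o kernel.o

     # Place functions and data in their own sections so the linker
     # can discard anything unreferenced
     gcc -Os -ffunction-sections -fdata-sections -c kernel.c -o kernel.o
     gcc kernel.o -Wl,--gc-sections -o app.elf

     # Compare the resulting footprints across permutations
     size kernel.o app.elf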

In the event that users do not want to disable all optimization, or to build exclusively at optimization level -O0 with code size optimization enabled, they may also want to disable functionality such as function inlining, via either a compiler command line option or a compiler pragma, depending on the build tools and the functionality supported. At higher levels of program optimization, specifically when optimizing for run-time performance, compilers will often attempt to inline copies of a function, whereby the body of the function is placed into the calling procedure, rather than the calling procedure making a call into a callee procedure, with the resulting change of program flow and attendant system side effects. By specifying either a command line option or a custom compiler pragma, the user can prevent the tools from inadvertently inlining functions, which would otherwise increase the overall code size of the compiled application.
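With GCC-style tools, for example, inlining can be suppressed globally with the -fno-inline command line option, or per function with an attribute; the function below is purely illustrative:

     /* Keep a single out-of-line copy of this function in the
        executable rather than inlining it at every call site. */
     __attribute__((noinline))
     static int scale_sample(int x)
     {
          return (x * 3) >> 1;
     }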

When a development team is building code for a production release, or in a use case where debugging information is no longer needed in the executable, it may also be beneficial to strip out debugging information and symbol table information. In doing this, significant reductions in object file and executable file sizes can be achieved. Furthermore, by stripping out all label information, some level of IP protection may be afforded, in that consumers of the executable will have a more difficult time reverse engineering the various functions being called within the program.
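With the GNU binutils, for instance, this can be done after the build with the strip utility (a sketch; exact section behavior varies by target, and the file name is illustrative):

     strip --strip-debug app.elf   # remove debugging sections only
     strip --strip-all app.elf     # remove all symbol information as well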

Target ISA for size and performance tradeoffs
Various target architectures in the embedded space may afford additional degrees of freedom when trying to reduce the code size of the input application. Quite often it is advantageous for the system developer to take into consideration not only the algorithmic complexity and software architecture of their code, but also the types of arithmetic required and how well those types map to the underlying target architecture. For example, an application that requires heavy use of 32-bit arithmetic may run functionally correctly on an architecture primarily tuned for 16-bit arithmetic; however, an architecture tuned for 32-bit arithmetic can provide improvements in performance, code size, and perhaps power consumption.

Variable-length instruction encoding is one particular technology that a given target architecture may support, which can be effectively exploited by the build tools to reduce overall code size. In variable-length instruction encoding schemes, certain instructions within the target processor’s ISA may have what are referred to as “premium encodings”, whereby those instructions most commonly used can be represented in a reduced binary footprint. One example of this might be a 32-bit embedded Power Architecture device, whereby frequently used instructions such as integer add are also represented with a premium 16-bit encoding. When the source application is compiled for size optimization, the build tools will attempt to map as many instructions as possible to their premium encoding counterparts in an attempt to reduce the overall footprint of the resulting executable.

Freescale Semiconductor supports this feature in its Power Architecture cores for embedded computing, as well as in its StarCore line of DSPs. Other embedded processor designs, such as those from ARM Limited and Texas Instruments’ DSPs, have also employed variable encoding formats for premium instructions in an effort to curb the resulting executable’s code size footprint.

In the case of Freescale’s Power Architecture, Freescale states that standard 32-bit code and 16-bit premium-encoded code can be mixed interchangeably within the executable on a flash-page-size access basis. Other architectures may opt to specify the encoding within some form of prefix bits, allowing an even finer granularity of code intermingling.

It should be mentioned that the reduced-footprint premium encoding of instructions in a variable-length encoding architecture often comes at the cost of reduced functionality. This is due to the reduction in the number of bits afforded for encoding the instruction, often from 32 bits down to 16 bits.

An example of a non-premium-encoded instruction versus a premium-encoded instruction might be an integer arithmetic ADD. In the non-premium variant of the instruction, the source and destination operands of the ADD may be any of the 32 general-purpose integer registers within the target architecture’s register file. In the premium-encoded variant, where only 16 bits of encoding space are afforded, the ADD instruction may only be permitted to use R0-R7 as source and destination registers, reducing the number of bits used in the source and destination register encodings. Although it may not be readily apparent to the application programmer, this can result in subtle, albeit minor, performance degradations. These are often due to additional copy instructions that may be required to move source and destination operands into range for adjacent instructions in the assembly schedule, because of the restrictions placed on the premium-encoded variants.
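To make the trade-off concrete, here is a hedged sketch in Power Architecture VLE-style assembly; the se_ prefix denotes a 16-bit “short encoding”, and the specific registers shown illustrate the restricted operand range rather than any one core’s exact rules:

     add    r14, r15, r16    ;* full 32-bit encoding: any of the 32 GPRs allowed
     se_add r3, r4           ;* 16-bit premium encoding: r3 += r4; operands
                             ;* restricted to a small subset of the register file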

As evidence of the benefits and potential drawbacks of using variable-length encoding instruction set architectures as a vehicle for code size reduction, benchmarking of typical embedded codes targeting Power Architecture devices has shown VLE (variable-length encoding) code to be approximately 30% smaller in code footprint than standard Power Architecture code, while exhibiting only a 5% reduction in performance. Such minor degradations in performance are typical, owing to the limitations in functionality of a reduced instruction encoding format.

Floating-point arithmetic and arithmetic emulation may be another, somewhat hidden, source of code size explosion. Consider the case in which the user’s source code contains loops of intensive floating-point arithmetic when targeting an architecture lacking native floating-point functionality in hardware. In order to support the floating-point arithmetic functionally, the build tools will often need to substitute code that performs floating-point arithmetic emulation at program run-time. This typically entails trapping to a floating-point emulation library that provides the required functionality, such as floating-point division, using the non-floating-point instructions natively supported on the target architecture.

As one might predict, it is not uncommon for a given floating-point emulation routine to require hundreds of target processor clock cycles to emulate the floating-point operation, executing tens if not hundreds of emulation instructions. In addition to the obvious performance overhead versus code targeting a processor with native floating-point support in hardware, significant code size increases occur due to the inclusion of floating-point emulation libraries or inlined floating-point emulation code. By correctly matching the types of arithmetic contained in the source application to the native hardware support of the target architecture, reductions in the overall executable size can be achieved with some effort.
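As an illustration, consider the fragment below compiled for a core with no hardware floating-point unit; on an ARM EABI target, for example, the division is lowered to a call into a runtime helper such as __aeabi_fdiv rather than a single divide instruction (helper names vary by ABI and toolchain, and the function itself is illustrative):

     float normalize(float sum, float count)
     {
          /* On a soft-float target this division becomes a library call
             (e.g., __aeabi_fdiv on ARM EABI), costing tens to hundreds
             of cycles and pulling the emulation routine into the image. */
          return sum / count;
     }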

Tuning the ABI for code size
In software engineering, the application binary interface (ABI) is the low-level software interface between a given program and the operating system, system libraries, and even inter-module communication within the program itself. The ABI is a specification of how a given system represents items such as data types, data sizes, alignment of data elements and structures, calling conventions, and related modes of operation. In addition, a given ABI may specify the binary format of object files and program libraries. The calling convention and alignment rules are of particular interest to those wishing to reduce the overall code size of their application by using a custom calling convention.

A given target processor and its related ABI will often specify a calling convention to be used between functions within the application, as well as for calls to the underlying operating system, run-time libraries, and so forth. It is often desirable for a vendor to specify a default calling convention that affords a reasonable level of performance for the general use case when making calls between caller and callee procedures within an application.

At the same time, such a default calling convention may also attempt reasonable reductions in the code size generated in both the caller and callee procedures for maintaining machine-level state consistency between the two. Often, however, this is not ideal for an application developer who demands either tight restrictions on code size or, in other cases, high levels of compiled code performance in key system kernels or hot paths within the call graph.

Consider, for example, the function defined in Figure 12.1, which passes a large number of 16-bit integer values from the caller to the callee procedure:

     long callee_procedure(short tap_00, short tap_01,
                           short tap_02, short tap_03,
                           short tap_04, short tap_05,
                           short tap_06, short tap_07);

     void caller_procedure(void)
     {
        short tap_00, tap_01, tap_02, tap_03,
              tap_04, tap_05, tap_06, tap_07;
        long callee_result;

        // some computation occurs, setting up taps

        callee_result = callee_procedure(tap_00, tap_01,
                                         tap_02, tap_03,
                                         tap_04, tap_05,
                                         tap_06, tap_07);

        // subsequent computation occurs based on results
     }

     long callee_procedure(short tap_00, short tap_01,
                           short tap_02, short tap_03,
                           short tap_04, short tap_05,
                           short tap_06, short tap_07)
     {
          long result;
          // do computation
          return result;
     }

Figure 12.1: Example C-language caller and callee procedure.

Looking at this example, it can be seen that the caller procedure computes a number of 16-bit values that must be passed as input parameters to the callee procedure. The callee procedure will then use these input values to compute some result that is passed back to the caller procedure for use in subsequent computation.

Let’s also assume that we are dealing with a somewhat trivialized ABI, kept succinct for this illustrative example. The ABI assumes a 32-bit general-purpose embedded architecture with a 32-bit general-purpose register file. The default calling convention for this ABI states that the first two char, short, or integer values passed to a callee procedure are passed in general-purpose registers R00 and R01, with subsequent parameters being passed from the caller to the callee via the stack. This might be typical for a processor targeting a mobile embedded device sensitive to both performance and code size. The resulting assembly might look something like Figure 12.2.

     ;***********************************************
     ;* NOTE: Using default ABI, R00 and R01 can be used to pass
     ;* parameters from caller to callee, all other parameters
     ;* must be passed via the stack.
     ;* SP_TAP_00 contains tap_00
     ;* SP_TAP_01 contains tap_01
     ;* SP_TAP_02 contains tap_02
     ;* SP_TAP_03 contains tap_03
     ;* SP_TAP_04 contains tap_04
     ;* SP_TAP_05 contains tap_05
     ;* SP_TAP_06 contains tap_06
     ;* SP_TAP_07 contains tap_07
     ;*
     ;*********************************************************************
     _caller_procedure:
          ;* some computation setting tap_00 .. tap_07 in local memory
          ;* and various bookkeeping.

          ;* all parameters that cannot be passed via default ABI
          ;* configuration must be pushed onto the stack.

     LOAD R00, (SP+TAP_02);
     PUSH R00                  ;* SP += 4

     LOAD R00, (SP+TAP_03);
     PUSH R00                  ;* SP += 4

     LOAD R00, (SP+TAP_04);
     PUSH R00                  ;* SP += 4

     LOAD R00, (SP+TAP_05);
     PUSH R00                  ;* SP += 4

     LOAD R00, (SP+TAP_06);
     PUSH R00                  ;* SP += 4

     LOAD R00, (SP+TAP_07);
     PUSH R00                  ;* SP += 4

     ;* tap_00 and tap_01 are passed in registers per the default ABI
     LOAD R00, (SP+TAP_00);
     LOAD R01, (SP+TAP_01);

     CALL _callee_procedure;

     ;*******************************************************************
     ;* R00 contains tap_00
     ;* R01 contains tap_01
     ;* tap_02 through tap_07 have been passed via the stack, as seen
     ;* previously being set up in caller_procedure via the push operations.
     ;* Upon entry, callee_procedure must transfer all of the input parameters
     ;* passed via the stack into registers for local computation.  This
     ;* requires additional instructions both on the caller side (to put on
     ;* the stack) as well as the callee side (to restore from the stack).
     ;*
     ;*******************************************************************

     _callee_procedure:
          ;* ADJUST STACK POINTER TO NOW POINT TO CALLEE'S STACK FRAME
          ;* SO WE CAN ACCESS DATA PASSED ON THE STACK IN ABI COMPLIANCE
          ;*

          POP R07;        ;* tap_07 into R07, SP-=4
          POP R06;        ;* tap_06 into R06, SP-=4
          POP R05;        ;* tap_05 into R05, SP-=4
          POP R04;        ;* tap_04 into R04, SP-=4
          POP R03;        ;* tap_03 into R03, SP-=4
          POP R02;        ;* tap_02 into R02, SP-=4

          ;* Perform local computation on input parameters now stored
          ;* in registers R00-R07, storing result into SP+RESULT_OFFSET
          ;*

Figure 12.2: Example assembly language caller and callee procedure (default calling convention).

We can see here that the default ABI has been used as the vehicle for communication between the caller and callee procedures, named caller_procedure() and callee_procedure() respectively. Looking at the assembly code generated for caller_procedure, we can see that the local variables computed within caller_procedure, namely tap_00 through tap_07, are loaded from memory within the local procedure and copied onto the stack for passing to the callee routine, callee_procedure. Because the default calling convention specified by this example ABI states that the first two char, short, or integral-type parameters may be passed via registers from the caller to the callee procedure, the compiler has taken the liberty of passing tap_00 and tap_01 using target processor registers R00 and R01 respectively.

It is interesting to note that fewer instructions are required for setting up parameters to be passed via registers than for those passed via the stack. Additionally, it can be seen from the callee procedure that significantly more instructions must be inserted into the callee procedure by the compiler to restore parameters passed via copy on the stack from the caller function into registers for local computation within the callee routine. While this affords a very nice abstraction of computation between the caller and the callee routines, clearly a user who wishes to reduce the code size of the resulting executable might consider alternative means of communication between caller and callee routines.

This is where custom calling conventions may be used with a given ABI to further improve performance or, in the case of this example, to further reduce code size as well. Suppose that the user has altered the calling convention within the ABI to be used for these two procedures. Let’s call this new user-specified calling convention “user_calling_convention”. The user has now stated that, rather than passing only the first two parameters from the caller function to the callee in registers with subsequent parameters passed via the stack, user_calling_convention may pass up to eight parameters from the caller to the callee function via registers, namely R00-R07.

In doing this, the tools will need to account for the additional registers used for parameter passing and the bookkeeping required on both sides of the caller/callee boundary; however, for this user’s example code, which passes a large number of parameters from caller to callee, a benefit can be gained. Figure 12.3 illustrates the assembly code the user could expect to be generated using the user_calling_convention customization specified by the developer.

Referring to Figure 12.3, it can be seen that by using the user_calling_convention, the resulting assembly generated by the compiler looks quite different from that of the default calling convention. By permitting the build tools to pass additional parameters between caller and callee functions using registers, a drastic reduction in the number of instructions generated for each procedure is evident. Specifically, caller_procedure can be seen to require far fewer moves to the stack before the call to callee_procedure. This is due to the fact that additional hardware registers are now afforded to the calling convention, whereby values from the caller’s memory space may simply be loaded into registers before making the call, rather than being loaded into registers and then copied onto the stack (and the stack pointer possibly being adjusted explicitly).
      
     ;***********************************************
     ;* NOTE: Using user_calling_convention, registers R00 through R07
     ;* may all be used to pass parameters from caller to callee, so no
     ;* parameters need to be passed via the stack for this call.
     ;* SP_TAP_00 contains tap_00
     ;* SP_TAP_01 contains tap_01
     ;* SP_TAP_02 contains tap_02
     ;* SP_TAP_03 contains tap_03
     ;* SP_TAP_04 contains tap_04
     ;* SP_TAP_05 contains tap_05
     ;* SP_TAP_06 contains tap_06
     ;* SP_TAP_07 contains tap_07
     ;*
     ;*********************************************************************
     _caller_procedure:
          ;* some computation setting tap_00 .. tap_07 in local memory
          ;* and various bookkeeping.

          ;* all eight parameters are loaded directly from the caller's
          ;* memory space into registers before making the call; no
          ;* copies onto the stack are required.

     LOAD R00, (SP+TAP_00);
     LOAD R01, (SP+TAP_01);
     LOAD R02, (SP+TAP_02);
     LOAD R03, (SP+TAP_03);
     LOAD R04, (SP+TAP_04);
     LOAD R05, (SP+TAP_05);
     LOAD R06, (SP+TAP_06);
     LOAD R07, (SP+TAP_07);

     CALL _callee_procedure;

     ;*******************************************************************
     ;* R00 through R07 contain tap_00 through tap_07 on entry.
     ;* No parameters have been passed via the stack, so callee_procedure
     ;* requires no additional instructions to restore parameters from
     ;* the stack into registers for local computation.
     ;*
     ;*******************************************************************

     _callee_procedure:
          ;* Perform local computation on input parameters now stored
          ;* in registers R00-R07, storing result into SP+RESULT_OFFSET
          ;*

Figure 12.3: Example assembly language caller and callee procedure (optimized, custom calling convention).

Similarly, referring to callee_procedure, it can be seen that a number of instructions have been removed from the previous example’s generated assembly. Once again, this is due to the fact that parameters are now being passed from the caller to the callee function via the register file, rather than being pushed onto and pulled off the stack.

As such, the callee does not need the additional instruction overhead of copying parameters from the stack into registers for local computation. In this particular example, not only is it likely that performance improvements will be seen, due to fewer instructions executing dynamically at run-time, but code size has also been reduced, due to the number of instructions statically removed from the executable.

While this example has shown how custom calling conventions can be used as part of an embedded system’s larger ABI to reduce code size and tailor memory optimization, there are a number of other concepts that may also play into this.

Subjects such as spill-code insertion by the compiler, the compiler’s ability to compute stack frame sizes so as to utilize standard MOVE instructions to and from the stack frame rather than PUSH/POP-style instructions, and SIMD-style move operations to the stack, whereby increased instruction density further increases performance and reduces code size overhead, are left as further reading and considered beyond the scope of this example.

Caveat emptor: compiler optimization orthogonal to code size!
When compiling code for a production release, developers often want to exploit as much compile-time optimization of their source code as possible in order to achieve the best performance. While building projects with -Os as an option will tune the code for optimal code size, it may also restrict the amount of optimization performed by the compiler, since some optimizations increase code size. As such, a user may want to keep an eye out for errant optimizations, typically around loop nests, and selectively disable them on a case-by-case basis rather than disabling them for an entire project build. Most compilers support a list of pragmas that can be inserted to control compile-time behavior; examples of such pragmas can be found in the documentation for your target processor’s build tools.

Software pipelining is one optimization that can result in increased code size, due to additional instructions that are inserted before and after the loop body of the transformed loop. When the compiler or assembly programmer software-pipelines a loop, overlapping iterations of a given loop nest are scheduled concurrently, with associated “set up” and “tear down” code inserted before and after the loop body.

These additional instructions inserted in the set up and tear down, or prologue and epilogue as they are often referred to in the compiler community, can result in increased instruction counts and code sizes. Typically a compiler will offer a pragma such as “#pragma noswp” to disable software pipelining for a given loop nest, or for given loops within a source code file. Users may want to utilize such a pragma on a loop-by-loop basis to reduce increases in code size associated with select loops that may not be performance-critical or on the dominant run-time paths of the application.
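A sketch of such per-loop control is shown below, using the “#pragma noswp” spelling cited above; pragma placement and spelling are toolchain-specific, and the function is illustrative only:

     void copy_taps(short *dst, const short *src, int n)
     {
          /* Not on the hot path: suppress software pipelining here to
             avoid its prologue/epilogue code (pragma is toolchain-specific). */
          #pragma noswp
          for (int i = 0; i < n; i++)
               dst[i] = src[i];
     }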

Loop unrolling is another fundamental compiler loop optimization that often increases the performance of loop nests at run-time. By unrolling a loop so that multiple iterations of the loop reside in the loop body, additional instruction-level parallelism is exposed for the compiler to schedule on the target processor; in addition, fewer branches with branch delay slots must be executed to cover the entire iteration space of the loop nest, potentially increasing the performance of the loop as well.

Because multiple iterations of the loop are cloned and inserted into the loop body by the compiler, however, the body of the loop nest typically grows as a multiple of the unroll factor. Users wishing to maintain a modest code size may wish to selectively disable loop unrolling for certain loops in their production code, at the cost of compiled code run-time performance. By selecting those loop nests that are not on the performance-critical path of the application, savings in code size can be achieved without impacting performance along the dominant run-time path of the application.

Typically, compilers will support pragmas to control loop-unrolling behavior, such as the minimum number of iterations a loop will execute or the unroll factors to apply. Examples of disabling loop unrolling via a pragma are often of the form “#pragma nounroll”. Please refer to your compiler’s documentation for the correct syntax for this and related functionality.
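A minimal sketch of this usage, again following the pragma form cited above (for comparison, GCC 8 and later spell a similar control “#pragma GCC unroll n”; the function here is illustrative):

     void accumulate(long *acc, const short *x, int n)
     {
          /* Off the dominant run-time path: keep this loop rolled to
             save code size (pragma spelling is toolchain-specific). */
          #pragma nounroll
          for (int i = 0; i < n; i++)
               *acc += x[i];
     }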

Procedure inlining is another optimization that aims to improve the performance of compiled code at the cost of compiled code size. When procedures are inlined, the callee procedure that is the target of a caller procedure’s invocation site is physically inlined into the body of the caller procedure. Consider the example in Figure 12.4.

Instead of making a call to callee_procedure() every time caller_procedure() is invoked, the compiler may opt to substitute the body of callee_procedure directly into the body of caller_procedure to avoid the overhead associated with the function call. In doing this, the statement a + b is substituted into the body of caller_procedure in the hope of improving run-time performance by eliminating the function call overhead, and hopefully providing better instruction cache performance. If this inlining is performed for all call sites of callee_procedure within the application, however, one can see how multiple inlinings can quickly lead to an explosion in the size of the application, especially where callee_procedure contains more than a simple addition statement.

     int callee_procedure(int a, int b);

     int caller_procedure(void)
     {
          int result, a, b;

          // intermediate computation, setting a and b
          result = callee_procedure(a, b);
          return result;
     }

     int callee_procedure(int a, int b)
     {
          return a + b;
     }

Figure 12.4: Candidate function inlining use case.

As such, users may wish to manually disable function inlining, either for their entire application or for selected procedures, via a compiler-provided pragma. Typical pragmas are of the form “#pragma noinline” and will prevent the tools from inlining the marked procedure at compilation time.
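A minimal sketch of that form, applied to the example of Figure 12.4 (pragma placement and spelling are toolchain-specific):

     /* Force an out-of-line call at every call site, trading call
        overhead for a single copy of the function body. */
     #pragma noinline
     int callee_procedure(int a, int b)
     {
          return a + b;
     }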

Part 2: Memory layout optimization

Dr. Michael C. Brogioli is principal and founder of Polymathic Consulting, as well as an adjunct professor of computer engineering at Rice University, Houston, Texas. Prior to Polymathic, he was a senior member of the technical staff and chief architect at Freescale Semiconductor. He has also served in several roles at TI's Advanced Architecture and Chip Technology Research group and Intel's Advanced Microprocessor Research Lab. He holds a PhD/MSc in electrical and computer engineering from Rice, as well as a BSc in electrical engineering from Rensselaer Polytechnic Institute.

Used with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2012, this article was excerpted from Software Engineering for Embedded Systems, written and edited by Robert Oshana and Mark Kraeling.
