Advanced Compiler Optimization Techniques

April 12, 2002


In recent years, a new generation of advanced embedded products has emerged. These products, ranging from laser printers to network routers to video games, increasingly call for enormous levels of processing power, levels that can only be achieved through the use of high-speed RISC and CISC microprocessors. While developers can now choose from a wide variety of low-cost, high-performance processors, market demand for better, faster products has forced developers to seek increased performance in a variety of ways. However, since all developers have similar access to the latest high-speed hardware components, competitive advantage is difficult to achieve through hardware alone. As a result, competitive product advantage is increasingly being sought, and achieved, through superior software.

The shift in emphasis to superior software performance has dramatically raised the importance of software development tools. This is especially true for the compiler, the one software development tool with the greatest impact on a product's ultimate performance. With the widespread use of high-level languages such as C and C++ for embedded software development, compiler optimization technology plays a more critical role than ever in helping developers achieve their overall design goals.

The purpose of this article is to help developers of embedded systems understand and appreciate the underlying technology and benefits of today's most advanced compiler optimization techniques.


Why Care About the Compiler?
C and C++ compilers have come a long way from the simple code translators of the past. Although the emergence of standards such as ANSI C has led some developers to treat compilers as commodity products, two forces have combined to create notable differences in optimization technology from one compiler to the next. One stems from the architectural nature of today's high-speed processors. With complex instruction pipelines and on-board instruction and data caches, today's advanced processors are highly dependent on the compiler to structure object code for optimal performance. Creating this optimal structure is difficult and highly CPU-specific, causing large differences in program performance depending on which optimization techniques are employed. The second force creating these compiler differences is market pressure. With developers demanding ever-higher performance and/or denser code, certain compiler vendors have committed themselves to devising increasingly sophisticated optimization techniques.

Compiler suites that employ the latest optimization techniques offer many benefits to embedded system developers. With challenging real-time performance goals, cost constraints, and the pressures to deliver complex products in less time, developers increasingly rely on a compiler's intimate knowledge of a processor's instruction set and behavior to produce optimal code. The benefits help developers to achieve fundamental project goals, such as:

  • Higher Performance
    Compilers employing the latest optimization technology routinely produce code 20-30% faster than standard compilers, and in some cases, two to three times faster. This kind of performance boost is critical for maximizing system performance. In addition, since a large majority (often more than 90%) of program execution time is spent in application code rather than in the operating system, compiler performance is much more significant than RTOS performance for overall application speed.

  • Lower Cost
    Superior optimization enables developers to reduce system cost and improve products in a number of ways. First, a higher-performance product can be produced without resorting to a higher-speed, higher-cost processor. A slower processor also permits the use of lower-speed, less costly memory. Compiler optimization techniques also have a very significant effect on code size. With decreased memory requirements, developers can further reduce production costs by using less memory, or they may opt to add features that make their product more competitive.

  • Reduced Development Time
    Embedded software applications have become much more complex. As a result, code reuse is often essential to delivering new products on time. The emphasis on code reuse has led to a major shift from assembly code to high-level languages such as C and C++.

However, effective code reuse and maintenance requires more than just programming in a high-level language. It is important to use structured and object-oriented programming techniques to ensure modular and readable code. In the past, less sophisticated compiler optimization technologies forced developers to avoid such programming practices and hand-tune the source code for the target architecture. Today, a highly optimizing compiler enables developers to write the most readable and maintainable source code with the confidence that the compiler can generate the optimal binary implementation.

Some compilers also have optimizations specially designed for code tuned for older architectures, thus helping legacy code execute more efficiently on today's higher performance processors. This reduces the need to rewrite existing code.

Finally, embedded applications typically have stringent timing and throughput specifications. Usually these cannot be tested until development is near completion. However, performance problems detected at such a late stage can significantly delay project completion. A superior optimizing compiler makes it easier to meet performance specifications earlier and reduces the need for time-consuming hand-optimization.

With a compiler's optimization capability affecting so many aspects of product development, it is more important than ever to understand and evaluate a compiler's optimization technology. After all, developers put a lot of effort into the code they write and they should be careful to get the most out of it.


An Overview of Optimizing Compiler Technology
In order to take an in-depth look at some of the most advanced optimization techniques, we will first review the different parts of an optimizing compiler and explain the terminology used throughout the remainder of this paper. As an example, we will use the technology from Wind River's Diab compiler because of its modularity and strong CPU-specific and application-specific optimization techniques.

Figure 1:  Compiler optimization

The compiler "front-end" consists of a language-specific parser that creates a language-independent representation of the program using an "intermediate language".

The compiler back-end is divided into five stages, as indicated in figure 1. Optimization, in fact, occurs at each stage, although global optimization is the most important. The global optimizer performs both high-level program analysis and a wide range of general optimizations. These general optimizations are database-driven so they can be tuned to a particular architecture. For example, the global optimizer uses the database to determine the number of registers available for a particular architecture.

The code selection and generation processes utilize the same database as the global optimizer to determine the optimal assembly language instructions to implement the intermediate language operations. The code selection process reserves a number of registers for use as scratch pads for temporary variables. The prime purpose of the code generator is to finalize register usage by ensuring that any unused scratch registers are utilized for storing variables that would otherwise have to be fetched from memory.
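To make the idea of a database-driven back end more concrete, the sketch below shows one way such a target description might be organized. It is a hypothetical illustration (the names, fields, and register counts are assumptions, not the Diab compiler's actual data structures): a per-architecture record tells the optimizer roughly how many registers exist and how they are classified, and the register allocator simply consults it.

/* Hypothetical target-description record; a real compiler database holds
   far more detail (instruction costs, pipeline models, addressing modes). */
typedef struct {
    const char *name;          /* architecture name                    */
    int         num_regs;      /* general-purpose registers (approx.)  */
    int         num_scratch;   /* caller-saved (volatile) registers    */
    int         num_preserved; /* callee-saved (preserved) registers   */
} TargetDesc;

/* Illustrative entries only; the counts are rough approximations. */
static const TargetDesc ppc_desc  = { "PowerPC", 32, 11, 18 };
static const TargetDesc m68k_desc = { "68K",     16,  4, 11 };

/* The global optimizer can then ask one question for any architecture. */
static int registers_available(const TargetDesc *t)
{
    return t->num_scratch + t->num_preserved;
}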

Once register usage has been finalized, the dependencies between instructions become clear. This sets the stage for the peephole optimizer, which initially performs "obvious elimination", in which it scans the generated code for clearly inefficient sequences. However, its primary purpose is instruction rescheduling, which reduces the potential for run-time pipeline stalls.


Sophisticated Optimization Techniques
As mentioned before, one might be led to believe that the advent of faster processors diminishes the importance of the compiler. In reality, the reverse is true. Higher processor performance is being attained by a switch to faster processors, particularly RISC architectures. RISC processors achieve greater performance through their ability to execute relatively simple instructions very quickly. Since processor throughput is dependent on the execution unit constantly being fed instructions, performance can be severely degraded if the processor has to frequently wait while instructions or data are obtained from or written to external memory. The role of the compiler in minimizing such undesirable events is crucial for optimal performance.

With the latest CISC processors themselves becoming more RISC-like, and with market pressure for faster, denser code, a new class of compiler optimization techniques has emerged. In many cases, these new optimizations involve sophisticated program analysis techniques that have greatly broadened the scope for applying well-established optimizations such as constant or copy propagation. In effect, these new analysis techniques enable the compiler to critique high level C or C++ source code in order to combine optimizations more judiciously for maximum effect.

To describe how some of the latest optimization and analysis techniques work, we will examine the following leading edge techniques and illustrate their effect with demonstrative code fragments. The techniques we will examine are:

  • Interprocedural Analysis
  • Live analysis of global and static variables
  • Profile-driven optimization
  • Inlining
  • Complex branch optimization

Interprocedural Analysis
As discussed earlier, performance can be seriously impacted if the processor has to interact too frequently with main memory. The most effective way to reduce memory accesses is to minimize load and store operations by keeping as many variables as possible in registers.

One particularly effective method for eliminating unnecessary loads and stores is Interprocedural Analysis, or IPA. IPA works by examining the actual source code of a function whenever it is called. This enables it to accurately determine whether the function will access any static variables currently held in registers. Since IPA works in conjunction with alias analysis, it can even determine which variables are accessed through pointer references. As a result, the compiler stores to memory only those static variables actually used by the function, rather than all those currently held in registers. Since a function will rarely affect most of the static variables present in registers at the time it is called, this analysis prevents many unnecessary store (and subsequent load) operations.

For functions called in loop bodies, IPA enables more aggressive use of optimizations such as moving loop invariants outside the body of the loop. For example, if IPA reveals that the arguments of a subexpression calculated in the loop are not affected by the function call, the subexpression could be moved outside the loop where it is only calculated once.

int arr[100];
int c, d;

int foo(int a, int b)
{
    return a/b;
}

void bar()
{
    int i;
    for (i = 0; i < 100; i++) {
        arr[i] = foo(c, d);
    }
}

gets transformed to

void bar()
{
    int i;
    int temp = c/d;
    for (i = 0; i < 100; i++) {
        arr[i] = temp;
    }
}

In addition to eliminating redundant loads and stores, IPA enables the compiler to improve overall register utilization. Once the compiler knows exactly which registers need to be saved and restored, it can ignore the normal calling conventions that place restrictions on how registers are treated when a function call is made. This results in two major benefits:

  • Any context saving of preserved registers by the calling function can be minimized by saving only those registers actually used by the called function.

  • The code at the function call site can store variables in any scratch registers that the called function does not use. In the following sequence, the local variable copied from g is kept in the volatile register r4 after the optimizer determines that f1 does not use r4. This frees up another register for other purposes and/or reduces function call overhead, since a preserved register does not need to be saved and restored on entry and exit.

lwz  r4,g@sdarx(r0)
addi r3,r4,1
bl   f1
add  r3,r4,r3

Since a function may only use a few of the preserved and scratch registers, IPA can greatly increase the number of available registers without creating the need to perform time-consuming context saves and restores.

Diab's highly optimizing compiler suite provides a unique twist to IPA that goes beyond simply analyzing how a function uses registers and variables. It is common practice for applications to contain many conditional statements that check for errors, with an error or exception handler invoked to respond to the problem. In the case of a critical error, it is likely that the error handler will abort or restart the system rather than return to the original point of execution. By determining whether or not a function returns, IPA can eliminate unnecessary context saving and restoring. More importantly, if the handler never returns, the compiler can completely ignore its effect on static variables and registers, improving optimization of the code around the call point.
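The fragment below is a hedged illustration of this idea; the function and variable names are hypothetical, not taken from the article. Because fatal_error() never returns, the register copy of the static variable does not have to be written back to memory before the call on the error path.

#include <stdio.h>
#include <stdlib.h>

static int counter;   /* static variable the compiler would like to keep in a register */

/* Error handler that never returns to its caller. */
static void fatal_error(const char *msg)
{
    fprintf(stderr, "fatal: %s\n", msg);
    abort();
}

void process(int value)
{
    /* If IPA determines that fatal_error() never returns, 'counter' can
       stay in a register across the call: there is no need to store it
       to memory on the error path, and no need to reload it afterwards. */
    counter++;
    if (value < 0)
        fatal_error("negative value");
    counter += value;
}

int main(void)
{
    process(3);
    printf("counter = %d\n", counter);
    return 0;
}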


Live-Variable Analysis for Statics and Global Variables
Programs written for 8- and 16-bit architectures often make extensive use of global and static variables rather than local variables. This was done for performance reasons because such architectures typically provide a very limited number of registers. Unfortunately, when such code is executed on today's register-rich architectures it may substantially decrease performance by forcing the processor to access memory every time it reads or updates a variable value.

One solution is to rewrite such code, but most developers would prefer not to modify already working code, especially if the original programmer is no longer available. A better alternative is to let the compiler solve the problem. Live-variable analysis enables the compiler to determine which variables should be placed in registers. Although compilers routinely perform this analysis for local variables, performing it for static and global variables is much more difficult. In particular, it requires sophisticated alias analysis, which determines which variables may currently be referenced by pointers, as illustrated by the following code fragment:

static int x, y, z ;
int *p = &x ;
z = y ;
*p = (4*j) ;
z += 45 ;
z = *p - z;

Alias analysis determines that *p cannot point to z. The compiler can now identify sections of code where there are frequent accesses to a particular global or static variable, and place that variable in a register for the duration of the code section to eliminate the need to access main memory. In the following code, all but the final store to z have been replaced by a register:

static int x, y, z ;
int *p = &x ;
register int tmp;
tmp = y ;
*p = (4*j) ;
tmp += 45 ;
z = *p - tmp;


Profile-Driven Optimization
Many experienced embedded developers have used program performance analysis tools at some stage of a project. Such tools analyze run-time program execution and measure the amount of time spent in particular modules, the number of times a source line, function, or block has been executed, or whether a source line has been executed at all. These tools often provide surprising insights into how an application really behaves when executing on the target. Some compilers can generate this profiling instrumentation themselves. Better yet, some can use the information gathered by the profiler to restructure a program for optimal performance.

Such an approach is known as profile-driven, or application specific, optimization. To employ this technique, the compiler generates a version of the application containing minimal instrumentation that writes out information on program behavior to a file or a buffer in memory, which can be uploaded to the host computer. This information is then fed back to the compiler the next time the application is compiled.

Figure 2:  Profile-driven optimization
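As a rough sketch of what this minimal instrumentation can amount to (a generic illustration, not the Diab compiler's actual mechanism; the macro and file names are assumptions), the compiler can emit a counter increment in each instrumented block and arrange for the counts to be written out when the program exits:

#include <stdio.h>
#include <stdlib.h>

/* One counter per instrumented block; the compiler assigns the indices
   and knows how many blocks it instrumented. */
#define NUM_BLOCKS 3
static unsigned long block_count[NUM_BLOCKS];

/* The instrumentation inserted into the generated code is essentially: */
#define PROFILE(id) (block_count[(id)]++)

/* Write the counts to a file that is fed back to the compiler. */
static void dump_profile(void)
{
    int i;
    FILE *f = fopen("profile.dat", "w");
    if (!f)
        return;
    for (i = 0; i < NUM_BLOCKS; i++)
        fprintf(f, "%d %lu\n", i, block_count[i]);
    fclose(f);
}

int main(void)
{
    int i;
    atexit(dump_profile);          /* dump the counts at program exit */
    for (i = 0; i < 1000; i++) {
        PROFILE(0);                /* loop body: executed 1000 times  */
        if (i == 500)
            PROFILE(1);            /* rarely executed branch          */
        else
            PROFILE(2);            /* frequently executed branch      */
    }
    return 0;
}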

The profiling data enables the optimizer to make more intelligent optimization decisions based on real-world run-time data rather than speculation. For example, register allocation can be based on the actual number of times a variable is used, and if-then-else clauses can be swapped so that the more frequently executed path becomes the fall-through case.

The compiler can also use the profiler data to combine optimizations more intelligently. For example, some optimizations, such as loop unrolling or inlining, increase code size to gain speed. It makes little sense to apply loop unrolling to rarely executed loops if minimal code size is also an important goal. However, these optimizations should be applied as aggressively as possible to frequently executed sections of code.
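For readers unfamiliar with loop unrolling, the sketch below shows the basic transformation on a simple copy loop (an illustrative, hand-written example, not compiler output). The unrolled form executes one test and branch per four elements instead of one per element, at the cost of a larger loop body, which is why it only pays off in frequently executed loops.

/* Original loop: one test and branch per element. */
void copy(int *dst, const int *src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Unrolled by four (assuming, for simplicity, that n is a multiple of 4):
   one test and branch per four elements. */
void copy_unrolled(int *dst, const int *src, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
}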

The profiling data also records how often branches are taken. For processors that support branch prediction, such as the PowerPC, this enables the compiler to set the branch prediction bit in the opcodes of conditional branch instructions. By accurately predicting whether a branch will be taken, the compiler can minimize pipeline stalls that force the processor to wait for instructions to be fetched from memory. For this C code:

while (long_time) {
    if (rare_condition) {
        do_something();
    } else {
        do_something_else();
    }
}

The Diab compiler will generate a bc+ form of the conditional branch, indicating that the forward branch is predicted taken rather than the default of not taken that the regular bc instruction implies.


Function Inlining
Function inlining improves performance by replacing a call to a function with the body of the function itself. This eliminates the overhead of jumping to and returning from a subroutine. While most compilers perform inlining, many have limitations that greatly reduce the number of functions actually inlined. For example, some compilers cannot inline functions with static members. Since by itself inlining may have a significant effect on program performance, it is important to verify that a compiler performs inlining effectively.

An additional advantage is that placing the function code "inline" exposes it to further optimization, enhancing performance even more. A good example of this is a swap function. The Diab compiler will eliminate the address and indirection operations in the process of inlining.

int x, y;

void swap(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

int main()
{
    swap(&x, &y);
    return 0;
}

will get transformed to

int main()
{
    int t = x;
    x = y;
    y = t;
    return 0;
}

The Diab compiler can perform inlining even on recursive functions. While the number of levels of recursion to inline may be determined by default or user-defined values, this is an excellent example of how profiling data can improve the effectiveness of an optimization. In this case, the compiler can determine the typical depth of recursion and inline the function that many times before reverting to standard function calls.
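As a hedged, hand-written illustration (not compiler output), inlining a recursive factorial two levels deep effectively peels the first two recursion levels into straight-line code and falls back to a real call only for deeper recursion:

/* Original recursive function. */
int fact(int n)
{
    if (n <= 1)
        return 1;
    return n * fact(n - 1);
}

/* What inlining the first two levels of recursion amounts to. */
int fact_inlined(int n)
{
    if (n <= 1)
        return 1;
    /* first inlined level: fact(n - 1) */
    if (n - 1 <= 1)
        return n;
    /* second inlined level: fact(n - 2) */
    if (n - 2 <= 1)
        return n * (n - 1);
    /* deeper recursion reverts to a standard call */
    return n * (n - 1) * (n - 2) * fact(n - 3);
}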

Another way in which profiling can boost the effectiveness of function inlining is through a technique called partial inlining. If profiling data reveals that a function begins with a simple conditional check that usually results in an immediate return, that part of the function can be inlined while the remainder is treated as a true function call:

void foo()
{
    if (condition) {
        /* many lines of function body */
    }
}

void bar()
{
    foo();
}

bar will be transformed to:

void bar()
{
    if (condition) {
        foo();
    }
}

saving the function call overhead for foo.

Although function inlining is often thought of as an optimization that trades off increased program size for speed, this is not necessarily true. Especially in C++ programs, it is common to have functions that consist of simply one or two lines of code. In such cases, the instructions required to set up and return from a function call may exceed the number of instructions to directly execute the function, making function inlining a size reduction optimization. Rather than follow the standard practice of disabling inlining when optimizing for minimal code size, the Diab compiler checks the size of a function to verify if it can be inlined without increasing code size.
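As a rough illustration (an assumed example, not taken from the article), consider a one-line accessor. Out of line, each use costs argument setup, a call, and a return; inlined, it collapses to a single load, so inlining it reduces both size and time:

struct counter {
    int value;
};

/* A typical one- or two-line function. */
static int counter_value(const struct counter *c)
{
    return c->value;
}

int total(const struct counter *a, const struct counter *b)
{
    /* After inlining, this is simply two loads and an add; the two call
       sequences it replaces would have been larger and slower. */
    return counter_value(a) + counter_value(b);
}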


Complex Branch Optimization
With code reuse becoming a critical factor in delivering products to market faster, writing well-structured readable code is very important. In some cases structured programming techniques lead to code that is less efficient. Compilers, of course, do not have to worry about "code reviews" or program maintenance and have no qualms about using constructs such as unstructured jumps ("goto"). This enables a sophisticated optimizer to "reimplement" the program logic for greater performance.

A good example of an optimization that reimplements structured programs for higher performance is complex branch optimization, which restructures code written with flag variables and structured control flow into direct conditional branches and fall-throughs. A code fragment of a loop with multiple exits demonstrates how this optimization works:

fl = 0;
while (!fl) {
    stmt1();
    if (e1) break;
    if (e2) {
        fl = 1;
    }
}
if (fl) {
    stmt2();
}

becomes:

L1:
    stmt1();
    if (e1) goto L2;   /* fl is still 0, so skip stmt2 */
    if (!e2) goto L1;  /* neither exit condition: loop again */
    stmt2();           /* e2 would have set fl, so stmt2 executes */
L2:

The optimizer detects that the loop will always be entered and that stmt1() will always be executed. It also recognizes that stmt2() will never be executed when fl is false, and that when the loop exits via e1, fl is still false. This enables it to produce a much more efficient implementation of the same code, with the flag variable eliminated entirely. However, programmers are not encouraged to adopt this coding style themselves!


Fine-Tuning
To extract maximum performance out of a particular processor, it is essential to perform architecture-specific optimizations. Although such optimizations have usually been associated with the code selection and code generation phases, to be truly effective each optimization phase must have an intimate knowledge of the target architecture. We will examine how a new technique known as architectural analysis is applied, in addition to examples of code selection, peephole, and instruction scheduling optimizations for the PowerPC, ColdFire, and 680X0/683XX (68K) processor families.


Architectural Analysis
In our earlier discussions of global optimizations, we illustrated several examples where the optimizer "reimplemented" program logic for greater efficiency. Architectural analysis enables the optimizer to reimplement program logic to take advantage of architectural specifics, such as addressing modes. For example, the global optimizer for the Diab 68K compiler would modify the following code fragment:

p[0] = 0 ;
p[1] = 1 ;
p += 2 ;

to:

*p++ = 0 ;
*p++ = 1 ;

or, in 68000 assembly language:

clr.b (a5)+
move.b #1,(a5)+

This takes advantage of the 68K's post-increment addressing mode. Of course, many C programs developed for 68K family processors have been coded to take advantage of such addressing modes. This becomes a drawback when porting an application to a RISC processor like the PowerPC, which lacks a post-increment addressing mode. Fortunately, architectural analysis works both ways, as illustrated by the following code fragment:

for (i = 0; i < 100; i++) {
    *p++ = 0;
}

The above code is very efficient for an architecture with a post-increment addressing mode, but the PowerPC store-with-update instruction is, in effect, a pre-incrementing mode. To utilize this, the compiler rewrites the code to:

p--;
for (i = 0; i < 100; i++) {
    *++p = 0;
}

The inner body in PowerPC assembly language:

stbu r4,1(r3) # *(r3+1) = r4, r3 += 1


Code Selection
The code selection phase performs optimization by replacing two or more operations in the intermediate language representation with a single real instruction. Architectural analysis enables the global optimizer to reorder operations in such a way as to maximize the opportunities for optimization in the code selection phase. For example, on the PowerPC the global optimizer transforms:

for (i = 0; i < 100; i++) {
    *++p = 0;
}

to:

for (i = 100; --i != 0; ) {
    *++p = 0;
}

The test expression, --i != 0, can utilize the Count Register in the PowerPC architecture. Using this register, the whole test and branch can be replaced with one PowerPC instruction, bdnz. In PowerPC assembly language, the complete loop becomes:

.L4:
stbu r4,1(r3)
bdnz .L4


Peephole Optimization
Traditionally, the purpose of a peephole optimizer has been to rectify failures by the earlier phases of optimization. With the emergence of advanced global optimization and code selection techniques, the role of the peephole optimizer is declining in importance, simply because advanced optimization techniques rarely produce sequences of obviously inefficient code. However, register allocation and code selection are NP-complete problems, and performing a "perfect" analysis would cause unacceptably long compile times. Peephole optimization therefore remains useful for fixing inefficiencies in a few corner cases.
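As a hedged sketch of what such a pass does (a toy illustration over a made-up textual instruction form, not the Diab compiler's internal representation), a peephole optimizer slides a small window over the generated code and removes patterns such as a load that immediately reloads a value just stored to the same location:

#include <stdio.h>
#include <string.h>

/* Toy instruction: a mnemonic, a register operand, and a memory operand. */
typedef struct {
    char op[8];
    char reg[8];
    char addr[16];
    int  dead;          /* marked for removal by the peephole pass */
} Insn;

/* Remove a load that immediately follows a store of the same register
   to the same address: the value is already in the register. */
static void peephole(Insn *code, int n)
{
    int i;
    for (i = 0; i + 1 < n; i++) {
        if (strcmp(code[i].op, "store") == 0 &&
            strcmp(code[i + 1].op, "load") == 0 &&
            strcmp(code[i].reg,  code[i + 1].reg)  == 0 &&
            strcmp(code[i].addr, code[i + 1].addr) == 0) {
            code[i + 1].dead = 1;       /* redundant reload */
        }
    }
}

int main(void)
{
    Insn code[] = {
        { "store", "r4", "x", 0 },
        { "load",  "r4", "x", 0 },     /* removed by the pass */
        { "add",   "r5", "",  0 },
    };
    int n = (int)(sizeof code / sizeof code[0]);
    int i;

    peephole(code, n);
    for (i = 0; i < n; i++)
        if (!code[i].dead)
            printf("%s %s %s\n", code[i].op, code[i].reg, code[i].addr);
    return 0;
}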


Instruction Scheduling
Instruction scheduling is a significant follow-on technique to peephole optimization. Since instruction scheduling often needs to understand data dependencies between instructions, it should be performed after all other optimizations. Instruction scheduling is a particularly good example of the growing need to tailor optimizations not only to a particular architecture, but also to specific variants within a processor family. This is because it is highly dependent on the implementation of the instruction pipeline.

The PPC603e and MPC860 (also known as PowerQUICC), both members of the PowerPC family, provide a good example of how instruction scheduling requirements differ radically even within the same processor family. The 603e is capable of issuing two instructions per cycle. In addition, it has built-in hardware support for managing data dependencies, such as one instruction requiring an operand being calculated by another instruction. Therefore, instruction scheduling for a 603e needs to order the code to maximize the number of instruction pairs that can be executed in parallel. For example:

sum += p[0] ;
sum += p[1] ;

This example includes two loads and two adds. Depending on the surrounding instructions, the best schedule on the 603e keeps it issuing two instructions per cycle, which means the two loads should fall into different issue pairs:

lwz r12,0(r3) # these two instructions can be issued the same cycle
add r4,r4,r12
lwz r11,4(r3) # next cycle
add r4,r4,r11

In contrast, the 860 executes only a single instruction per cycle and lacks special hardware for preventing pipeline stalls caused by data dependencies. To produce optimal code, the instruction scheduler must focus on minimizing the number of times an instruction is using an operand calculated by, for example, a load instruction immediately before it. Otherwise the pipeline cannot overlap all its operations smoothly.

lwz r12,0(r3) # takes two cycles to load r12
lwz r11,4(r3) # load r11 now instead of stalling
add r4,r4,r12 # now r12 is ready
add r4,r4,r11 # and now r11 is ready


Conclusions
Compiler optimization technology has been steadily advancing as more sophisticated processors hit the market. For high-end embedded processors, today's most advanced compilers take advantage of both CPU-specific and application specific information to optimize code intelligently for optimal results. In addition, the introduction of sophisticated new optimization and analysis techniques enables state-of-the-art compilers to offer significant advantages over compilers employing less sophisticated optimization technology. With compiler performance being the single most important variable affecting program performance, developers should carefully examine a compiler's optimization capability before selecting a compiler.
