In recent years, a new generation of advanced embedded
products has emerged. These products, ranging from laser
printers to network routers to video games, increasingly call
for enormous levels of processing power"levels which can only
be achieved through the use of high-speed RISC and CISC
microprocessors. While developers can now choose from a wide
variety of low-cost, high performance processors, market
demands for better, faster products have forced developers to
seek increased performance in a variety of ways. However, since
all developers have similar access to the latest high-speed
hardware components, competitive advantage is difficult to
achieve through hardware alone. As a result, competitive
product advantage is increasingly being sought and achieved
through superior software.
The shift in emphasis to superior software performance has
raised the importance of software development tools
dramatically. This is especially true for the compiler"the one
software development tool with the greatest impact on a
product's ultimate performance. With the widespread use of
high-level languages such as C and C++ for embedded software
development, compiler optimization technology plays a more
critical role than ever in helping developers to achieve their
overall design goals.
The purpose of this article is to help developers of
embedded systems understand and appreciate the underlying
technology and benefits of today's most advanced compiler
optimization techniques.
Why Care About the Compiler?
C and C++ compilers have come a long way from the simple code
translators of the past. Although the emergence of standards
such as ANSI C have led some developers to treat compilers as
commodity products, two forces have combined to create notable
differences in optimization technology from one compiler to the
next. One stems from the architectural nature of today's
high-speed processors. With complex instruction pipelines and
on-board instruction and data caches, today's advanced
processors are highly dependent on the compiler to structure
object code for optimal performance. Creating this optimal
structure is difficult and highly CPU-specific, causing large
differences in program performance depending on which
optimization techniques are employed. The second force creating
these compiler differences is market pressure. With developers
demanding ever-higher performance and/or denser code, certain
compiler vendors have committed themselves to devising
increasingly sophisticated optimization techniques.
Compiler suites that employ the latest optimization
techniques offer many benefits to embedded system developers.
With challenging real-time performance goals, cost constraints,
and the pressures to deliver complex products in less time,
developers increasingly rely on a compiler's intimate knowledge
of a processor's instruction set and behavior to produce
optimal code. The benefits help developers to achieve
fundamental project goals, such as:
- Higher Performance
Compilers employing the latest optimization technology
routinely produce code 20-30% faster than standard compilers,
and in some cases, two to three times faster. This kind of
performance boost is critical for maximizing system
performance. In addition, since a large majority (often more
than 90%) of program execution time is spent in application
code rather than in the operating system, compiler
performance is much more significant than RTOS performance
for overall application speed.
- Lower Cost
Superior optimization enables developers to reduce system
cost and improve products in a number of ways. First, a
higher performance product could be produced without having
to resort to a higher-speed and higher cost processor. A
slower processor also permits the use of lower-speed, less
costly memory. Compiler optimization techniques also have a
very significant effect on code size. With decreased memory
requirements, developers can further reduce production costs
by using less memory, or they may opt to add additional
features to make their product more competitive.
- Reduced Development Time
Embedded software applications have become much more complex.
As a result, code reuse is often essential to delivering new
products on time. The emphasis on code reuse has led to a
major shift from assembly code to high-level languages such
as C and C++.
However, effective code reuse and maintenance requires more
than just programming in a high-level language. It is important
to use techniques such as structured and object-oriented
programming tools and techniques to ensure modular and readable
code. In the past, less sophisticated compiler optimization
technologies forced developers to avoid such programming
practices and hand-tune the source code for the target
architecture. Today, a highly optimizing compiler enables
developers to write the most readable and maintainable source
code with the confidence that the compiler can generate the
optimal binary implementation.
Some compilers also have optimizations specially designed
for code tuned for older architectures, thus helping legacy
code execute more efficiently on today's higher performance
processors. This reduces the need to rewrite existing code.
Finally, embedded applications typically have stringent
timing and throughput specifications. Usually these cannot be
tested until development is near completion. However,
performance problems detected at such a late stage can
significantly delay project completion. A superior optimizing
compiler makes it easier to meet performance specifications
earlier and reduces the need for time-consuming
hand-optimization.
With a compiler's optimization capability affecting so many
aspects of product development, it is more important than ever
to understand and evaluate a compiler's optimization
technology. After all, developers put a lot of effort into the
code they write and they should be careful to get the most out
of it.
An Overview of Optimizing Compiler Technology
In order to take an in-depth look at some of most advanced
optimization techniques, we will first review the different
parts of an optimizing compiler and explain the terminology
used throughout the remainder of this paper. As an example, we
will use the technology from Wind River's Diab compiler because
of its modularity and strong CPU-specific and application
specific optimization techniques.
Figure 1: Compiler optimization
The compiler "front-end" consists of a language specific
parser that creates a language independent representation of
the program using an "intermediate language".
The compiler back-end is divided into five stages, as
indicated in figure 1. Optimization, in fact, occurs at each
stage, although global optimization is the most important. The
global optimizer performs both high-level program analysis and
a wide range of general optimizations. These general
optimizations are database-driven so they can be tuned to a
particular architecture. For example, the global optimizer uses
the database to determine the number of registers available for
a particular architecture.
The code selection and generation processes utilize the same
database as the global optimizer to determine the optimal
assembly language instructions to implement the intermediate
language operations. The code selection process reserves a
number of registers for use as scratch pads for temporary
variables. The prime purpose of the code generator is to
finalize register usage by ensuring that any unused scratch
registers are utilized for storing variables that would
otherwise have to be fetched from memory.
Once register usage has been finalized, the dependencies
between instructions become clear. This sets the stage for the
peephole optimizer, which initially performs "obvious
elimination", in which it scans the generated code for clearly
inefficient sequences. However, its primary purpose is
instruction rescheduling, which reduces the potential for
run-time pipeline stalls.
Sophisticated Optimization Techniques
As mentioned before, one might be led to believe that the
advent of faster processors diminishes the importance of the
compiler. In reality, the reverse is true. Higher processor
performance is being attained by a switch to faster processors,
particularly RISC architectures. RISC processors achieve
greater performance through their ability to execute relatively
simple instructions very quickly. Since processor throughput is
dependent on the execution unit constantly being fed
instructions, performance can be severely degraded if the
processor has to frequently wait while instructions or data are
obtained from or written to external memory. The role of the
compiler in minimizing such undesirable events is crucial for
optimal performance.
With the latest CISC processors themselves becoming more
RISC-like, and with market pressure for faster, denser code, a
new class of compiler optimization techniques has emerged. In
many cases, these new optimizations involve sophisticated
program analysis techniques that have greatly broadened the
scope for applying well-established optimizations such as
constant or copy propagation. In effect, these new analysis
techniques enable the compiler to critique high level C or C++
source code in order to combine optimizations more judiciously
for maximum effect.
To describe how some of the latest optimization and analysis
techniques work, we will examine the following leading edge
techniques and illustrate their effect with demonstrative code
fragments. The techniques we will examine are:
- Interprocedural Analysis
- Live analysis of global and static variables
- Profile-driven optimization
- Inlining
- Complex branch optimization.
Interprocedural Analysis
As discussed earlier, performance can be seriously impacted if
the processor has to interact too frequently with main memory.
The most effective way to reduce memory accesses is to minimize
load and store operations by keeping as many variables as
possible in registers.
One particularly effective method for eliminating
unnecessary loads and stores is Interprocedural Analysis, or
IPA. IPA works by examining the actual source code of a
function whenever it is called. This enables it to accurately
determine whether the function will access any static variables
currently in registers. Since IPA can work in conjunction with
alias analysis, it can even determine which variables are being
accessed as a result of pointer references. As result, it only
stores into memory those static variables used by the function
rather than all those currently held in registers. Since a
function will rarely affect most of the static variables
actually present in registers at the time it is called, this
analysis prevents many unnecessary store (and subsequent load)
operations.
For functions called in loop bodies, IPA enables more
aggressive use of optimizations such as moving loop invariants
outside the body of the loop. For example, if IPA reveals that
the arguments of a subexpression calculated in the loop are not
affected by the function call, the subexpression could be moved
outside the loop where it is only calculated once.
int arr[100];
int c, d;
int foo(int a, int b)
{
return a/b;
}
void bar()
{
int i ;
for (i=0; i < 100; i++) {
arr[i] = foo(c,d) ;
}
}
gets transformed to
void bar()
{
int i ;
temp = c/d;
for (i=0; i < 100; i++) {
arr[i] = temp ;
}
}
In addition to eliminating redundant loads and stores, IPA
enables the compiler to improve overall register utilization.
Once the compiler knows exactly which registers need to be
saved and restored, it can ignore normal calling conventions
that place constrictions on how registers are treated when a
function call is made. This results in two major benefits:
Any context saving of the preserved registers by the calling
function can be minimized by saving only those registers
actually used by the function.
The code at the function call site can now store variables
in any scratch registers that are not used by the function. In
the following sequence, the local variable copied from g is
kept in the volatile register r4 after the optimizer determines
that f1 does not use r4. This analysis frees up another
register for other purposes and/or reduces function call
overhead since a preserved register does not need to be
preserved and restored on entry and exit.
lwzr4,g@sdarx(r0)
addir3,r4,1
blf1
addr3,r4,r3
Since a function may only use a few of the preserved and
scratch registers, IPA can greatly increase the number of
available registers without creating the need to perform
time-consuming context saves and restores.
Diab's highly optimizing compiler suite provides a unique
twist to IPA that goes beyond simply analyzing how the function
uses registers and variables. It is common practice for
applications to contain many conditional statements that check
for errors. An error or exception handler is then invoked to
respond to the problem. In the case of a critical error, it is
likely that the error handler will abort or restart the system
rather than return to the original point of execution. By
examining whether a function returns or not, IPA can eliminate
unnecessary context saving and restoring. However, of greater
importance is a compiler's ability to completely ignore the
operation of the function on static variables or registers used
by the error handler to improve optimization of the code around
the call point.
Live-Variable Analysis for Statics and Global
Variables
Programs written for 8- and 16-bit architectures often make
extensive use of global and static variables rather than local
variables. This was done for performance reasons because such
architectures typically provide a very limited number of
registers. Unfortunately, when such code is executed on today's
register-rich architectures it may substantially decrease
performance by forcing the processor to access memory every
time it reads or updates a variable value.
One solution is to rewrite such code, but most developers
would prefer not to modify already working code, especially if
the original programmer is no longer available. A better
alternative is using the compiler to solve this problem.
Live-variable analysis enables the compiler to determine which
variables should be placed in registers. Although frequently
used by compilers with local variables, performing such
analysis for static and global variables is much more
difficult. In particular, it is essential for the compiler to
have sophisticated alias analysis. Alias analysis determines
which variables are currently referenced by pointers, as
illustrated by the following code fragment:
static int x, y, z
;
int *p = &x ;
z = y ;
*p = (4*j) ;
z += 45 ;
z = *p - z;
Alias analysis determines that *p cannot point to z. The
compiler can now identify sections of code where there are
frequent accesses to a particular global or static variable.
The compiler then places the variable in a register for the
duration of the code section to eliminate the need to access
main memory. In the following code, all but the last occurrence
of z has been replaced by a register:
static int x, y, z
;
int *p = &x ;
register int tmp;
tmp = y ;
*p = (4*j) ;
tmp += 45 ;
z = *p - tmp;
Profile-Driven Optimization
Many experienced embedded developers have used program
performance analysis tools at some stage of a project. Such
tools analyze run-time program execution and measure the amount
of time spent in certain modules, and/or the amount of times a
source line or function or block has been executed, or whether
a source line has been executed at all. These tools often
provide surprising insights into how an application is really
behaving when executing on the target. Some compilers have
features for generating program-profiling capabilities. Better
yet, some can make use of the information gathered by a
profiler to restructure a program for optimal performance.
Such an approach is known as profile-driven, or application
specific, optimization. To employ this technique, the compiler
generates a version of the application containing minimal
instrumentation that writes out information on program behavior
to a file or a buffer in memory, which can be uploaded to the
host computer. This information is then fed back to the
compiler the next time the application is compiled.
Figure 2: Profile-driven optimization
The profiling data enables the optimizer to make more
intelligent optimization decisions based on real world run-time
data rather than speculation. For example, register allocation
is based on the actual number of times a variable is used and
if-then-else clauses are swapped if the first part is executed
more often.
The compiler can also use the profiler data to combine
optimizations more intelligently. A good example is that some
optimizations, such as loop unrolling or inlining, trade off
speed for size. It makes little sense to apply loop unrolling
to rarely executed loops if minimal code size is also an
important goal. However, these optimizations should be applied
as aggressively as possible to frequently executed sections of
code.
The profiling data also records how often branches are
taken. For processors that support branch prediction, such as
the PowerPC, this enables the compiler to set the branch
prediction bit in the opcodes for conditional branch
instructions. By accurately predicting whether a branch will be
taken, the compiler can minimize pipeline stalls that force the
processor to wait for instructions to be fetched from memory.
For this C code :
while (long_time)
{
if (rare_condition) {
do_something;
} else {
do_something_else;
}
}
The Diab compiler will generate a bc+ instruction for the
conditional branch instruction indicating that the forward
branch is predicted to be taken rather than the default not
taken as the regular bc instruction would indicate.
Function Inlining
Function inlining improves performance by replacing a call to a
function with the body of the function itself. This eliminates
the overhead of jumping to and returning from a subroutine.
While most compilers perform inlining, many have limitations
that greatly reduce the number of functions actually inlined.
For example, some compilers cannot inline functions with static
members. Since by itself inlining may have a significant effect
on program performance, it is important to verify that a
compiler performs inlining effectively.
An additional advantage is that placing the function code
"inline" exposes it to further optimization, enhancing
performance even more. A good example of this is a swap
function. The Diab compiler will eliminate the address and
indirection operations in the process of inlining.
void swap (int *a, int
*b)
{
int t = *a;
*a = *b;
*b = t;
}
main()
{
swap(&x,&y) ;
}
will get transformed to
main()
{
int t=x;
x = y;
y = t;
}
The Diab compiler can perform inlining even on recursive
functions. While the number of levels of recursion that should
be inlined may be determined by default or user-defined values,
this is an excellent example of how profiling data can improve
the effectiveness of a program. In this case, the compiler can
determine the typical number of levels of recursion and inline
the function the appropriate number of times before reverting
to standard function calls.
Another manner in which profiling can boost the
effectiveness of function inlining is through a technique
called partial inlining. If profiling data reveals that the
function begins with a simple conditional check that usually
results in an immediate return, this part of the function can
be inlined, but remainder treated as a true function call:
void foo ()
{
if (condition) {
/* many lines of function body */
}
}
bar()
{
foo() ;
}
bar will be transformed to:
bar()
{
if (condition) {
foo() ;
}
}
saving the function call overhead for foo.
Although function inlining is often thought of as an
optimization that trades off increased program size for speed,
this is not necessarily true. Especially in C++ programs, it is
common to have functions that consist of simply one or two
lines of code. In such cases, the instructions required to set
up and return from a function call may exceed the number of
instructions to directly execute the function, making function
inlining a size reduction optimization. Rather than follow the
standard practice of disabling inlining when optimizing for
minimal code size, the Diab compiler checks the size of a
function to verify if it can be inlined without increasing code
size.
Complex Branch Optimization
With code reuse becoming a critical factor in delivering
products to market faster, writing well-structured readable
code is very important. In some cases structured programming
techniques lead to code that is less efficient. Compilers, of
course, do not have to worry about "code reviews" or program
maintenance and have no qualms about using constructs such as
unstructured jumps ("goto"). This enables a sophisticated
optimizer to "reimplement" the program logic for greater
performance.
A good example of an optimization that reimplements
structured programs for higher performance is the complex
branch optimization. This optimization rewrites code containing
branches and fall-throughs to conditional branches where the
outcome can be compared. A code fragment of a loop with
multiple exits demonstrates how this optimization works:
fl = 0;
while (!fl) {
stml () ;
if (e1) break ;
if (e2) {
fl = 1 ;
}
}
if (f1) {
stm2 () ;
}
becomes:
L1:
stml ();
if (e1) goto L2; /* skip stm2 since f1 is 0 */
if (!e2) goto L1; /* if !e2 fall through */
stmt2 ();
L2:
The optimizer detects that the loop will always be entered
and that stmt() will always be executed. The compiler
recognizes that stmt2() will never be executed if fl is false
and if e1 is true, fl will always be false. This enables it to
produce a much more efficient implementation of the same code.
However, it is not recommended for programmers to follow this
coding style themselves!
Fine-Tuning
To extract maximum performance out of a particular processor,
it is essential to perform architecture-specific optimizations.
Although such optimizations have usually been associated with
the code selection and code generation phases, to be truly
effective each optimization phase must have an intimate
knowledge of the target architecture. We will examine how a new
technique known as architectural analysis is applied, in
addition to examples of code selection, peephole, and
instruction scheduling optimizations for the PowerPC, ColdFire,
and 680X0/683XX (68K) processor families.
Architectural Analysis
In our earlier discussions of global optimizations, we
illustrated several examples where the optimizer
"reimplemented" program logic for greater efficiency.
Architectural analysis enables the optimizer to reimplement
program logic to take advantage of architectural specifics,
such as addressing modes. For example, the global optimizer for
the Diab 68K compiler would modify the following code
fragment:
p[0] = 0 ;
p[1] = 1 ;
p += 2 ;
to:
*p++ = 0 ;
*p++ = 1 ;
or, in 68000 assembly language:
clr.b (a5)+
move.b #1,(a5)+
This takes advantage of the 68K's post-increment addressing
mode. Of course, many C programs developed to execute on 68K
family processors have been coded to take advantage of such.
This becomes a drawback when porting an application to a RISC
processor like the PowerPC, which lacks a post-increment
addressing mode. Fortunately, architecture analysis works both
ways, as illustrated by the following code fragment:
for(i = 0; i < 100;
i++) {
*p++ = 0 ;
}
The above code is very efficient for an architecture with a
post-increment addressing mode, but the PowerPC
store-with-update instruction is, in effect, a pre-incrementing
mode. To utilize this, the compiler rewrites the code to:
p--;
for(i = 0; i < 100; i++) {
*++p = 0 ;
}
The inner body in PowerPC assembly language:
stbu r4,1(r3) #
*(r3+1) = r4, r3 += 1
Code Selection
The code selection phase performs optimization by replacing two
or more operations in the intermediate language representation
with a single real instruction. Architectural analysis enables
the global optimizer to reorder operations in such a way as to
maximize the opportunities for optimization in the code
selection phase. The following example illustrates this
point:
for(i = 0; i < 100;
i++) {
*++p = 0 ;
}
to:
for(i = 100; --i != 0;
i) {
*++p = 0 ;
}
The test expression, --i != 0, can utilize the Count
Register in the PowerPC architecture. Using this register, the
whole test and branch can be replaced with one PowerPC
instruction, bdnz. The complete loop becomes in PowerPC
assembly language:
.L4:
stbu r4,1(r3)
bdnz .L4
Peephole Optimization
Traditionally, the purpose of a peephole optimizer has been to
rectify failures by the earlier phases of optimization. With
the emergence of advanced global optimization and code
selection techniques, the role of the peephole optimizer is
declining in importance. This is simply because advanced
optimization techniques rarely produce sequences of obviously
inefficient code. However, register allocation and code
selection is an NP-complete problem and doing the "perfect"
analysis would cause unacceptably long compile times. Thus,
peephole optimization is still useful to fix inefficiencies for
a few corner cases.
Instruction Scheduling
Instruction scheduling is a significant follow-on technique to
peephole optimization. Since instruction scheduling often needs
to understand data dependencies between instructions, it should
be performed after all other optimizations. Instruction
scheduling is a particularly good example of the growing need
to tailor optimizations not only to a particular architecture,
but also to specific variants within a processor family. This
is because it is highly dependent on the implementation of the
instruction pipeline.
The PPC603e and MPC860 (also known as PowerQUICC), both
members of the PowerPC family, provide a good example of how
instruction scheduling requirements differ radically even
within the same processor family. The 603e is capable of
issuing two instructions per cycle. In addition, it has
built-in hardware support for managing data dependencies, such
as one instruction requiring an operand being calculated by
another instruction. Therefore, instruction scheduling for a
603e needs to order the code to maximize the number of
instruction pairs that can be executed in parallel. For
example:
sum += p[0] ;
sum += p[1] ;
This example includes two loads and two adds. Depending on
the surrounding instructions, it is better on the 603e to make
sure that it can issue two instructions per cycle, which means
that the loads should be in different pairs:
lwz r12,0(r3) #
these two instructions can be issued the same cycle
add r4,r4,r12
lwz r11,4(r3) # next cycle
add r4,r4,r11
In contrast, the 860 executes only a single instruction per
cycle and lacks special hardware for preventing pipeline stalls
caused by data dependencies. To produce optimal code, the
instruction scheduler must focus on minimizing the number of
times an instruction is using an operand calculated by, for
example, a load instruction immediately before it. Otherwise
the pipeline cannot overlap all its operations smoothly.
lwz r12,0(r3) #
takes two cycles to load r12
lwz r11,4(r3) # load r11 now instead of stalling
add r4,r4,r12 # now r12 is ready
add r4,r4,r11 # and now r11 is ready
Conclusions
Compiler optimization technology has been steadily advancing as
more sophisticated processors hit the market. For high-end
embedded processors, today's most advanced compilers take
advantage of both CPU-specific and application specific
information to optimize code intelligently for optimal results.
In addition, the introduction of sophisticated new optimization
and analysis techniques enables state-of-the-art compilers to
offer significant advantages over compilers employing less
sophisticated optimization technology. With compiler
performance being the single most important variable affecting
program performance, developers should carefully examine a
compiler's optimization capability before selecting a
compiler.