It is useful to understand how a high-level
language program is translated into instructions. First, since
implementing an
embedded computing system often requires controlling the instruction
sequences used to handle interrupts, placement of data and instructions
in memory, and so forth, understanding how the compiler works can help
you know when you cannot rely on the compiler.
Next, because many
applications are also performance sensitive, understanding how code is
generated can help you meet your performance goals, either by writing
high-level code that gets compiled into the instructions you want or by
recognizing when you must write your own assembly code.
Compilation =
translation + optimization. Compilation combines
translation and optimization. The high-level
language program is translated into the lower-level form of
instructions; optimizations try to generate better instruction
sequences than would be possible if the brute force technique of
independently translating source code statements were used.
Optimization techniques focus on more of the program to
ensure that
compilation decisions that appear to be good for one statement are not
unnecessarily problematic for other parts of the program.
The compilation process is summarized in Figure 5-12 below.
Compilation begins with high-level language code such as C and
generally produces assembly code. (Directly producing object code
simply duplicates the functions of an assembler, which is a very
desirable stand-alone program to have.)
The highlevel language program
is parsed to break it into statements and expressions. In addition, a
symbol table is generated, which includes all the named objects in the
program. Some compilers may then perform higher-level optimizations
that can be viewed as modifying the high-level language program input
without reference to instructions.
 |
| Figure
5-12. The compilation process. |
Simplifying arithmetic expressions is one example of a
machineindependent optimization. Not all compilers do such
optimizations, and compilers can vary widely regarding which
combinations of machine-independent optimizations they do perform.
Instruction-level optimizations are aimed at generating code.
They may
work directly on real instructions or on a pseudoinstruction format
that is later mapped onto the instructions of the target CPU. This
level of optimization also helps modularize the compiler by allowing
code generation to create simpler code that is
later optimized. For example, consider the following array access code:
x[i] = c*x[i];
A simple code generator would
generate the address for x[i] twice, once for each appearance in the
statement. The later optimization phases can recognize this as an
example of common expressions that need not be duplicated.
While in
this simple case it would be possible to create a code generator that
never generated the redundant expression, taking into account every
such optimization at code generation time is very difficult. We get
better code and more reliable compilers by generating simple code first
and then optimizing it.
Statement
Translation
In this section, we consider the basic job of translating the
high-level language program with little or no optimization. Let's first
consider how to translate an expression.
A large amount of the code in
a typical application consists of arithmetic and logical expressions.
Understanding how to compile a single expression, as described in
Example 5-2 below, is a good first step in understanding the entire
compilation process.
Example 5-2 Compiling an
Arithmetic Expression
In the following arithmetic expression,
a*b + 5*(c ' d)
the variable is written in terms of
program variables. In some machines we may be able to perform
memory-to-memory arithmetic directly on the locations corresponding to
those variables. However, in many machines, such as the ARM, we must
first load the variables into registers. This requires choosing which
registers receive not only the named variables but also intermediate
results such as (c ' d).
The code for the expression can be built by walking the
data flow graph. The data flow graph for the expression appears below.

The temporary variables for the intermediate values and
final result have been named w, x, y, and z. To generate code, we walk
from the tree's root (where z, the final result, is generated) by
traversing the nodes in post order. During the walk, we generate
instructions to cover the operation at every node. The path is
presented below.

The nodes are numbered in the order in which code is
generated. Since every node in the data flow graph corresponds to an
operation that is directly supported by the instruction set, we simply
generate an instruction at every node. Since we are making an arbitrary
register assignment, we can use up the registers in order starting with
r1. The resulting ARM code follows:
; operator 1 (+)
ADR r4,a ; get address for a
MOV r1,[r4] ; load a
ADR r4,b ; get address for b
MOV r2,[r4] ; load b
ADD r3,r1,r2 ; put w into r3
; operator 2 (')
ADR r4,c ; get address for c
MOV r4,[r4] ; load c
ADR r4,d ; get address for d
MOV r5,[r4] ; load d
SUB r6,r4,r5 ; put x into r6
; operator 3 (*)
MUL r7,r6,#5 ; operator 3, puts y into r7
; operator 4 (+)
ADD r8,r7,r3 ; operator 4, puts z into r8
The equivalent SHARC code appears below.
! operator 1 (+)
R1 = DM(a); ! load a
R2 = DM(b); ! load b
R3 = R1 + R2; ! compute a + b
! operator 2 (')
R4 = DM(c); ! load c
R5 = DM(d); ! load d
R6 = R4 ' R5; ! compute c ' d
! operator 3 (*)
R7 = 5;
R8 = R6 * R7; ! compute y
! operator 4 (+)
R9 = R3 + R8; ! compute z
One obvious optimization is to reuse a register whose
value is no longer needed. In the case of the intermediate values w, x,
and y, we know that they cannot be used after the end of the expression
(e.g., in another expression) since they have no name in the C program.
However, the final result z may in fact be used in a C assignment and
the value reused later in the program. In this case we would need to
know when the register is no longer needed to determine its best use.
 |
| Figure
5-13. Flow of control in C and control flow diagrams |
In the previous example, we made an arbitrary
allocation of variables to registers for simplicity. When we have large
programs with multiple expressions, we must allocate registers more
carefully since CPUs have a limited number of registers. We will
consider register allocation later.
We also need to be able to translate control
structures. Since conditionals are controlled by expressions, the code
generation techniques of the last example can be used for those
expressions, leaving us with the task of generating code for the flow
of control itself.
Figure 5-13 above
shows a simple example of changing flow
of control in C - an if statement, in which the condition controls
whether the true or false branch of the if is taken. Figure 5-13 also
shows the control flow diagram for the if statement. Example 5-3
illustrates how to implement conditionals in assembly language.
Example 5-3 Generating Code for a
Conditional
Consider the following C statement:
if (a + b > 0)
x = 5;
else
x = 7;
The CDFG for the statement appears below.

We know how to generate the code for the expressions.
We can generate the control flow code by walking the CDFG. One ordered
walk through the CDFG follows:

To generate code, we must assign a label to the first
instruction at the end of a directed edge and create a branch for each
edge that does not go to the immediately following instruction. The
exact steps to be taken at the branch points depend on the target
architecture.
On some machines, evaluating expressions generates
condition codes that we can test in subsequent branches, and on other
machines we must use test-and-branch instructions. ARM allows us to
test condition codes, so we get the following ARM code for the 1-2-3
walk:
ADR r5,a ; get address for a
LDR r1,[r5] ; load a
ADR r5,b ; get address for b
LDR r2,b ; load b
ADD r3,r1,r2
BLE label3 ; true condition falls through branch
; true case
LDR r3,#5 ; load constant
ADR r5,x
STR r3, [r5] ; store value into x
B stmtend ; done with the true case
; false case
label3 LDR r3,#7 ; load constant
ADR r5,x ; get address of x
STR r3,[r5] ; store value into x
stmtend ...
The 1-2 and 3-4 edges do not require a branch and label
because they are straight-line code. In contrast, the 1-3 and 2-4 edges
do require a branch and a label for the target.
Since expressions are generally created as
straight-line code, they typically require careful consideration of the
order in which the operations are executed. We have much more freedom
when generating conditional code because the branches ensure that the
flow of control goes to the right block of code. If we walk the CDFG in
a different order and lay out the code blocks in a different order in
memory, we still get valid code as long as we properly place branches.
Drawing a control flow graph based on the while form of
the loop helps us understand how to translate it into instructions.

C compilers can generate (using the -s flag) assembler
source, which some compilers intersperse with the C code. Such code is
a very good way to learn about both assembly language programming and
compilation. Web Aid 5-Assembly has examples of assembly output from C
compilers for the ARM and SHARC.
Procedures
Another major code generation problem is the creation of procedures.
Generating code for procedures is relatively straightforward once we
know the procedure linkage appropriate for the CPU. At the procedure
definition, we generate the code to handle the procedure call and
return. At each call of the procedure, we set up the procedure
parameters and make the call.
The CPU's subroutine call mechanism is usually not
sufficient to directly support procedures in modern programming
languages. The procedure linkage mechanism provides a way for the
program to
pass parameters into the program and for the procedure to return a
value.
It also provides help in restoring the values of
registers that
the procedure has modified. All procedures in a given programming
language use the same linkage mechanism (although different languages
may use different linkages). The mechanism can also be used to call
handwritten assembly language routines from compiled code.
Procedure stacks are typically built to grow down from
high addresses. A stack pointer (sp) defines the end
of the current frame, while a frame pointer (fp)
defines the end of the last frame. (The fp is technically necessary
only if the stack frame can be grown by the procedure during
execution.)
The procedure can refer to an element in the frame by
addressing relative to sp. When a new procedure is called, the sp and
fp are modified to push another frame onto the stack.
The ARM Procedure Call Standard (APCS) is a good
illustration of a typical procedure linkage mechanism. Although the
stack frames are in main memory, understanding how registers are used
is key to understanding the mechanism, as explained below.
- r0"r3 are used to pass parameters into the
procedure. r0 is also used to hold the return value. If more than four
parameters are required, they are put on the stack frame.
- r4"r7 hold register variables.
- r11 is the frame pointer and r13 is the stack
pointer.
- r10 holds the limiting address on stack size,
which is used to check for stack overflows.
Other registers have additional uses in the protocol.
 |
| Figure
5-14. Layout of a one-dimensional array in memory |
Data Structures
The compiler must also translate references to data structures into
references to raw memories. In general, this requires address
computations. Some of these computations can be done at compile time
while others must be done at run time. arrays Arrays are interesting
because the address of an array element must in general be computed at
run time, since the array index may change. Let us first consider
one-dimensional arrays:
a[i]
The layout of the array in memory is shown in Figure
5-14 above. The zeroth element is stored as the first element of
the array,
the first element directly below, and so on. We can create a pointer
for the array that points to the array's head, namely, a[0]. If we call
that pointer aptr for convenience, then we can rewrite the reading
of a
[i] as
*(aptr + i)
Two-dimensional arrays are more challenging. There are
multiple possible ways to lay out a two-dimensional array in memory, as
shown in Figure 5-15 below. In
this form, which is known as row major,
the inner variable of the array (j in a[i,j]) varies most quickly.
(Fortran uses a different organization known as column major.)
Two-dimensional arrays also require more sophisticated
addressing - in
particular, we must know the size of the array. Let us consider the
row-major form. If the a[] array is of size N × M, then we can
turn the two-dimensional array access into a one-dimensional array
access. Thus,
a[i,j]
becomes
a[i*M + j]
where the maximum value for j is M ' 1.
 |
| Figure
5-15. Memory layout for two-dimensional arrays |
A C struct is easier to address. As shown in Figure
5-16 below, a structure is implemented as a contiguous block of
memory.
Fields in the structure can be accessed using constant offsets to the
base address of the structure. In this example, if field1 is four bytes
long, then field2 can be accessed as
*(aptr + 4)
This addition can usually be done at compile time,
requiring only the indirection itself to fetch the memory location
during execution.
Expression Simplification
Expression simplification is a useful area for machine-independent
transformations. We can use the laws of algebra to simplify
expressions. Consider the following expression:
a*b + a*c
We can use the distributive law to rewrite the
expression as
a*(b + c)
 |
| Figure
5-16. C structure layout and access. |
Since the new expression has only two operations rather
than three for the original form, it is almost certainly cheaper,
because it is both faster and smaller. Such transformations make some
broad assumptions about the relative cost of operations.
In some cases,
simple generalizations about the cost of operations may be misleading.
For example, a CPU with a multiply-andaccumulate instruction may be
able to do a multiply and addition as cheaply as it can do an addition.
However, such situations can often be taken care of in code generation.
We can also use the laws of arithmetic to further
simplify expressions on constants. Consider the following C statement:
for (i = 0; i < 8 + 1; i++)
We can simplify 8 + 1 to 9 at compile time - there is
no
need to perform that arithmetic while the program is executing. Why
would a program ever contain expressions that evaluate to constants?
Using named constants rather than numbers is good programming practice
and often leads to constant expression. The original form of the for
statement could have been
for (i = 0; i < NOPS + 1; i++)
where, for example, the added 1 takes care of a
trailing null character.
Dead Code Elimination
Code that will never be executed can be safely removed from the
program. The general problem of identifying code that will never be
executed is difficult, but there are some important special cases where
it can be done. Programmers will intentionally introduce dead code in
certain situations.
Consider this C code fragment:
#define DEBUG 0 ...
if (DEBUG) print_debug_stuff();
In the above case, the
print_debug_stuff() function is never executed, but the code allows the
programmer to override the preprocessor variable definition (perhaps
with a compile-time flag) to enable the debugging code.
This case is
easy to analyze because the condition is the constant 0, which C uses
for the false condition. Since there is no else clause in the if
statement, the compiler can totally eliminate the if statement,
rewriting the CDFG to provide a direct edge between the statements
before and after the if.
Some dead code may be introduced by the compiler. For
example, certain optimizations introduce copy statements that copy one
variable to another. If uses of the first variable can be replaced by
references to the second one, then the copy statement becomes dead code
that can be eliminated.
Procedure Inlining
Another machine-independent transformation that requires a little more
evaluation is procedure inlining. An inlined procedure does not have a
separate procedure body and procedure linkage; rather, the body of the
procedure is substituted in place for the procedure call. Figure 5-17 below
shows an example of function inlining in C.
The C++ programming
language provides an inline construct that tells the compiler to
generate inline code for a function. In this case, an inlined procedure
is generated in expanded form whenever possible. However, inlining is
not always the best thing to do. It does eliminate the procedure
linkage instructions. However, in a cached system, having multiple
copies of the function body may actually slow down the fetches of these
instructions. Inlining also increases code size, and memory may be
precious.
Function definition:
int foo(a,b,c) { return a + b " c; }
Function call:
z = foo(w,x,y,);
Inlining result:
z = w + x " y;
Figure 5-17. Function inlining in C.
Loop Transformations
Loops are important program structures—although they are compactly
described in the source code, they often use a large fraction of the
computation time. Many techniques have been designed to optimize loops.
A simple but useful transformation is known as loop
unrolling, which is illustrated in Example 5-4. Loop unrolling
is important because it helps expose parallelism that can be used by
later stages of the compiler.
Example 5-4 Loop Unrolling
A simple loop in C follows:
for (i = 0; i < N; i++) {
a[i]=b[i]*c[i];
}
This loop is executed a fixed number of times, namely,
N. A straightforward implementation of the loop would create and
initialize the loop variable i, update its value on every iteration,
and test it to see whether to exit the loop. However, since the loop is
executed a fixed number of times, we can generate more direct code.
If we let N = 4, then we can substitute the above C
code for the following loop:
a[0] = b[0]*c[0];
a[1] = b[1]*c[1];
a[2] = b[2]*c[2];
a[3] = b[3]*c[3];
This unrolled code has no loop overhead code at all,
that is, no iteration variable and no tests. But the unrolled loop has
the same problems as the inlined procedure—it may interfere with the
cache and expands the amount of code required.
We do not, of course, have to fully unroll loops.
Rather than unroll the above loop four times, we could unroll it twice.
The following code results:
for (i = 0; i < 2; i++) {
a[i*2] = b[i*2]*c[i*2];
a[i*2 + 1] = b[i*2 + 1]*c[i*2 + 1];
}
In this case, since all operations in the two lines of
the loop body are independent, later stages of the compiler may be able
to generate code that allows them to be executed efficiently on the
CPU's pipeline.
Loop fusion combines two or more loops
into a single loop. For this transformation to be legal, two conditions
must be satisfied. First, the loops must iterate over the same values.
Second, the loop bodies must not have dependencies that would be
violated if they are executed together—for example, if the second
loop's ith iteration depends on the results of the i + 1th iteration of
the first loop, the two loops cannot be combined. Loop
distribution is the opposite of loop fusion, that is,
decomposing a single loop into multiple loops.
Loop tiling breaks up a loop into a
set of nested loops, with each inner loop performing the operations on
a subset of the data. An example is shown in Figure 5-18. Here, each
loop is broken up into tiles of size two. Each loop is split into two
loops—for example, the inner ii loop iterates within the tile and the
outer i loop iterates across the tiles.
The result is that the pattern
of accesses across the a array is drastically different - rather than
walking across one row in its entirety, the code walks through rows and
columns following the tile structure. Loop tiling changes the order in
which array elements are accessed, thereby allowing us to better
control the behavior of the cache during loop execution.
 |
| Figure
5-18. Loop tiling. |
We can also modify the arrays being indexed in loops. Array
padding adds dummy data elements to a loop in order to change
the layout of the array in the cache. Although these array locations
will not be used, they do change how the useful array elements fall
into cache lines. Judicious padding can in some cases significantly
reduce the number of cache conflicts during loop execution.
Next in Part 4: The creation of procedures
To read Part 1, go to "Program design and analysis."
To read Part 2 , go to " Models
of program, assemblers and linkers."
Used with the permission
of the publisher, Newnes/Elsevier,
this series of six articles is based on copyrighted material from "Computers
as Components: Principles of Embedded Computer System Design" by Wayne Wolf. The book
can be purchased on line.
Wayne Wolf is currently the Georgia
Research Alliance Eminent Scholar holding the Rhesa "Ray" S. Farmer,
Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech's
School of Electrical and Computer Engineering (ECE). Previously a
professor of
electrical engineering at Princeton University, he worked at AT&T
Bell Laboratories. He has served as editor in chief of the ACM Transactions
on Embedded Computing and
of Design
Automation for Embedded Systems.