The basics of programming embedded processors: Part 3

Wayne Wolf

August 07, 2007

Wayne Wolf August 07, 2007

It is useful to understand how a high-level language program is translated into instructions. First, since implementing an embedded computing system often requires controlling the instruction sequences used to handle interrupts, placement of data and instructions in memory, and so forth, understanding how the compiler works can help you know when you cannot rely on the compiler.

Next, because many applications are also performance sensitive, understanding how code is generated can help you meet your performance goals, either by writing high-level code that gets compiled into the instructions you want or by recognizing when you must write your own assembly code.

Compilation = translation + optimization. Compilation combines translation and optimization. The high-level language program is translated into the lower-level form of instructions; optimizations try to generate better instruction sequences than would be possible if the brute force technique of independently translating source code statements were used.

Optimization techniques focus on more of the program to ensure that compilation decisions that appear to be good for one statement are not unnecessarily problematic for other parts of the program.

The compilation process is summarized in Figure 5-12 below. Compilation begins with high-level language code such as C and generally produces assembly code. (Directly producing object code simply duplicates the functions of an assembler, which is a very desirable stand-alone program to have.)

The highlevel language program is parsed to break it into statements and expressions. In addition, a symbol table is generated, which includes all the named objects in the program. Some compilers may then perform higher-level optimizations that can be viewed as modifying the high-level language program input without reference to instructions.

Figure 5-12. The compilation process.

Simplifying arithmetic expressions is one example of a machineindependent optimization. Not all compilers do such optimizations, and compilers can vary widely regarding which combinations of machine-independent optimizations they do perform. Instruction-level optimizations are aimed at generating code.

They may work directly on real instructions or on a pseudoinstruction format that is later mapped onto the instructions of the target CPU. This level of optimization also helps modularize the compiler by allowing code generation to create simpler code that is later optimized. For example, consider the following array access code:

x[i] = c*x[i];

A simple code generator would generate the address for x[i] twice, once for each appearance in the statement. The later optimization phases can recognize this as an example of common expressions that need not be duplicated.

While in this simple case it would be possible to create a code generator that never generated the redundant expression, taking into account every such optimization at code generation time is very difficult. We get better code and more reliable compilers by generating simple code first and then optimizing it.

Statement Translation
In this section, we consider the basic job of translating the high-level language program with little or no optimization. Let's first consider how to translate an expression.

A large amount of the code in a typical application consists of arithmetic and logical expressions. Understanding how to compile a single expression, as described in Example 5-2 below, is a good first step in understanding the entire compilation process.

Example 5-2 Compiling an Arithmetic Expression
In the following arithmetic expression,

a*b + 5*(c ' d)

the variable is written in terms of program variables. In some machines we may be able to perform memory-to-memory arithmetic directly on the locations corresponding to those variables. However, in many machines, such as the ARM, we must first load the variables into registers. This requires choosing which registers receive not only the named variables but also intermediate results such as (c ' d).

The code for the expression can be built by walking the data flow graph. The data flow graph for the expression appears below.

The temporary variables for the intermediate values and final result have been named w, x, y, and z. To generate code, we walk from the tree's root (where z, the final result, is generated) by traversing the nodes in post order. During the walk, we generate instructions to cover the operation at every node. The path is presented below.

The nodes are numbered in the order in which code is generated. Since every node in the data flow graph corresponds to an operation that is directly supported by the instruction set, we simply generate an instruction at every node. Since we are making an arbitrary register assignment, we can use up the registers in order starting with r1. The resulting ARM code follows:

; operator 1 (+) 
ADR r4,a ; get address for a
MOV r1,[r4] ; load a
ADR r4,b ; get address for b
MOV r2,[r4] ; load b
ADD r3,r1,r2 ; put w into r3
; operator 2 (')
ADR r4,c ; get address for c
MOV r4,[r4] ; load c
ADR r4,d ; get address for d
MOV r5,[r4] ; load d
SUB r6,r4,r5 ; put x into r6
; operator 3 (*)
MUL r7,r6,#5 ; operator 3, puts y into r7
; operator 4 (+)
ADD r8,r7,r3 ; operator 4, puts z into r8

The equivalent SHARC code appears below.

! operator 1 (+) 
R1 = DM(a); ! load a
R2 = DM(b); ! load b
R3 = R1 + R2; ! compute a + b
! operator 2 (')
R4 = DM(c); ! load c
R5 = DM(d); ! load d
R6 = R4 ' R5; ! compute c ' d
! operator 3 (*)
R7 = 5;
R8 = R6 * R7; ! compute y
! operator 4 (+)
R9 = R3 + R8; ! compute z

One obvious optimization is to reuse a register whose value is no longer needed. In the case of the intermediate values w, x, and y, we know that they cannot be used after the end of the expression (e.g., in another expression) since they have no name in the C program. However, the final result z may in fact be used in a C assignment and the value reused later in the program. In this case we would need to know when the register is no longer needed to determine its best use.

Figure 5-13. Flow of control in C and control flow diagrams

In the previous example, we made an arbitrary allocation of variables to registers for simplicity. When we have large programs with multiple expressions, we must allocate registers more carefully since CPUs have a limited number of registers. We will consider register allocation later.

We also need to be able to translate control structures. Since conditionals are controlled by expressions, the code generation techniques of the last example can be used for those expressions, leaving us with the task of generating code for the flow of control itself.

Figure 5-13 above shows a simple example of changing flow of control in C - an if statement, in which the condition controls whether the true or false branch of the if is taken. Figure 5-13 also shows the control flow diagram for the if statement. Example 5-3 illustrates how to implement conditionals in assembly language.

Example 5-3 Generating Code for a Conditional
Consider the following C statement:

if (a + b > 0)
x = 5;
x = 7;

The CDFG for the statement appears below.

We know how to generate the code for the expressions. We can generate the control flow code by walking the CDFG. One ordered walk through the CDFG follows:

To generate code, we must assign a label to the first instruction at the end of a directed edge and create a branch for each edge that does not go to the immediately following instruction. The exact steps to be taken at the branch points depend on the target architecture.

On some machines, evaluating expressions generates condition codes that we can test in subsequent branches, and on other machines we must use test-and-branch instructions. ARM allows us to test condition codes, so we get the following ARM code for the 1-2-3 walk:

ADR r5,a ; get address for a 
LDR r1,[r5] ; load a
ADR r5,b ; get address for b
LDR r2,b ; load b
ADD r3,r1,r2
BLE label3 ; true condition falls through branch
; true case
LDR r3,#5 ; load constant
ADR r5,x
STR r3, [r5] ; store value into x
B stmtend ; done with the true case
; false case
label3 LDR r3,#7 ; load constant
ADR r5,x ; get address of x
STR r3,[r5] ; store value into x
stmtend ...

The 1-2 and 3-4 edges do not require a branch and label because they are straight-line code. In contrast, the 1-3 and 2-4 edges do require a branch and a label for the target.

Since expressions are generally created as straight-line code, they typically require careful consideration of the order in which the operations are executed. We have much more freedom when generating conditional code because the branches ensure that the flow of control goes to the right block of code. If we walk the CDFG in a different order and lay out the code blocks in a different order in memory, we still get valid code as long as we properly place branches.

Drawing a control flow graph based on the while form of the loop helps us understand how to translate it into instructions.

C compilers can generate (using the -s flag) assembler source, which some compilers intersperse with the C code. Such code is a very good way to learn about both assembly language programming and compilation. Web Aid 5-Assembly has examples of assembly output from C compilers for the ARM and SHARC.

Another major code generation problem is the creation of procedures. Generating code for procedures is relatively straightforward once we know the procedure linkage appropriate for the CPU. At the procedure definition, we generate the code to handle the procedure call and return. At each call of the procedure, we set up the procedure parameters and make the call.

The CPU's subroutine call mechanism is usually not sufficient to directly support procedures in modern programming languages. The procedure linkage mechanism provides a way for the program to pass parameters into the program and for the procedure to return a value.

It also provides help in restoring the values of registers that the procedure has modified. All procedures in a given programming language use the same linkage mechanism (although different languages may use different linkages). The mechanism can also be used to call handwritten assembly language routines from compiled code.

Procedure stacks are typically built to grow down from high addresses. A stack pointer (sp) defines the end of the current frame, while a frame pointer (fp) defines the end of the last frame. (The fp is technically necessary only if the stack frame can be grown by the procedure during execution.)

The procedure can refer to an element in the frame by addressing relative to sp. When a new procedure is called, the sp and fp are modified to push another frame onto the stack.

The ARM Procedure Call Standard (APCS) is a good illustration of a typical procedure linkage mechanism. Although the stack frames are in main memory, understanding how registers are used is key to understanding the mechanism, as explained below.

  • r0"r3 are used to pass parameters into the procedure. r0 is also used to hold the return value. If more than four parameters are required, they are put on the stack frame.
  • r4"r7 hold register variables.
  • r11 is the frame pointer and r13 is the stack pointer.
  • r10 holds the limiting address on stack size, which is used to check for stack overflows.

Other registers have additional uses in the protocol.

Figure 5-14. Layout of a one-dimensional array in memory

Data Structures
The compiler must also translate references to data structures into references to raw memories. In general, this requires address computations. Some of these computations can be done at compile time while others must be done at run time. arrays Arrays are interesting because the address of an array element must in general be computed at run time, since the array index may change. Let us first consider one-dimensional arrays:


The layout of the array in memory is shown in Figure 5-14 above. The zeroth element is stored as the first element of the array, the first element directly below, and so on. We can create a pointer for the array that points to the array's head, namely, a[0]. If we call that pointer aptr for convenience, then we can rewrite the reading of  a [i] as

*(aptr + i)

Two-dimensional arrays are more challenging. There are multiple possible ways to lay out a two-dimensional array in memory, as shown in Figure 5-15 below. In this form, which is known as row major, the inner variable of the array (j in a[i,j]) varies most quickly. (Fortran uses a different organization known as column major.)

Two-dimensional arrays also require more sophisticated addressing - in particular, we must know the size of the array. Let us consider the row-major form. If the a[] array is of size N × M, then we can turn the two-dimensional array access into a one-dimensional array access. Thus,



a[i*M + j]

where the maximum value for j is M ' 1.

Figure 5-15. Memory layout for two-dimensional arrays

A C struct is easier to address. As shown in Figure 5-16 below, a structure is implemented as a contiguous block of memory. Fields in the structure can be accessed using constant offsets to the base address of the structure. In this example, if field1 is four bytes long, then field2 can be accessed as

*(aptr + 4)

This addition can usually be done at compile time, requiring only the indirection itself to fetch the memory location during execution.

Expression Simplification
Expression simplification is a useful area for machine-independent transformations. We can use the laws of algebra to simplify expressions. Consider the following expression:

a*b + a*c

We can use the distributive law to rewrite the expression as

a*(b + c)

Figure 5-16. C structure layout and access.

Since the new expression has only two operations rather than three for the original form, it is almost certainly cheaper, because it is both faster and smaller. Such transformations make some broad assumptions about the relative cost of operations.

In some cases, simple generalizations about the cost of operations may be misleading. For example, a CPU with a multiply-andaccumulate instruction may be able to do a multiply and addition as cheaply as it can do an addition. However, such situations can often be taken care of in code generation.

We can also use the laws of arithmetic to further simplify expressions on constants. Consider the following C statement:

for (i = 0; i < 8 + 1; i++)

We can simplify 8 + 1 to 9 at compile time - there is no need to perform that arithmetic while the program is executing. Why would a program ever contain expressions that evaluate to constants? Using named constants rather than numbers is good programming practice and often leads to constant expression. The original form of the for statement could have been

for (i = 0; i < NOPS + 1; i++)

where, for example, the added 1 takes care of a trailing null character.

Dead Code Elimination
Code that will never be executed can be safely removed from the program. The general problem of identifying code that will never be executed is difficult, but there are some important special cases where it can be done. Programmers will intentionally introduce dead code in certain situations.

Consider this C code fragment:

#define DEBUG 0 ... if (DEBUG) print_debug_stuff();

In the above case, the print_debug_stuff() function is never executed, but the code allows the programmer to override the preprocessor variable definition (perhaps with a compile-time flag) to enable the debugging code.

This case is easy to analyze because the condition is the constant 0, which C uses for the false condition. Since there is no else clause in the if statement, the compiler can totally eliminate the if statement, rewriting the CDFG to provide a direct edge between the statements before and after the if.

Some dead code may be introduced by the compiler. For example, certain optimizations introduce copy statements that copy one variable to another. If uses of the first variable can be replaced by references to the second one, then the copy statement becomes dead code that can be eliminated.

Procedure Inlining
Another machine-independent transformation that requires a little more evaluation is procedure inlining. An inlined procedure does not have a separate procedure body and procedure linkage; rather, the body of the procedure is substituted in place for the procedure call. Figure 5-17 below shows an example of function inlining in C.

The C++ programming language provides an inline construct that tells the compiler to generate inline code for a function. In this case, an inlined procedure is generated in expanded form whenever possible. However, inlining is not always the best thing to do. It does eliminate the procedure linkage instructions. However, in a cached system, having multiple copies of the function body may actually slow down the fetches of these instructions. Inlining also increases code size, and memory may be precious.

Function definition:
int foo(a,b,c) { return a + b " c; }

Function call:
z = foo(w,x,y,);

Inlining result:
z = w + x " y;

Figure 5-17. Function inlining in C. 

Loop Transformations
Loops are important program structures—although they are compactly described in the source code, they often use a large fraction of the computation time. Many techniques have been designed to optimize loops.

A simple but useful transformation is known as loop unrolling, which is illustrated in Example 5-4. Loop unrolling is important because it helps expose parallelism that can be used by later stages of the compiler.

Example 5-4 Loop Unrolling
A simple loop in C follows:

 for (i = 0; i < N; i++) {

This loop is executed a fixed number of times, namely, N. A straightforward implementation of the loop would create and initialize the loop variable i, update its value on every iteration, and test it to see whether to exit the loop. However, since the loop is executed a fixed number of times, we can generate more direct code.

If we let N = 4, then we can substitute the above C code for the following loop:

 a[0] = b[0]*c[0]; 
a[1] = b[1]*c[1];
a[2] = b[2]*c[2];
a[3] = b[3]*c[3];

This unrolled code has no loop overhead code at all, that is, no iteration variable and no tests. But the unrolled loop has the same problems as the inlined procedure—it may interfere with the cache and expands the amount of code required.

We do not, of course, have to fully unroll loops. Rather than unroll the above loop four times, we could unroll it twice. The following code results:

 for (i = 0; i < 2; i++) { 
a[i*2] = b[i*2]*c[i*2];
a[i*2 + 1] = b[i*2 + 1]*c[i*2 + 1];

In this case, since all operations in the two lines of the loop body are independent, later stages of the compiler may be able to generate code that allows them to be executed efficiently on the CPU's pipeline.

Loop fusion combines two or more loops into a single loop. For this transformation to be legal, two conditions must be satisfied. First, the loops must iterate over the same values. Second, the loop bodies must not have dependencies that would be violated if they are executed together—for example, if the second loop's ith iteration depends on the results of the i + 1th iteration of the first loop, the two loops cannot be combined. Loop distribution is the opposite of loop fusion, that is, decomposing a single loop into multiple loops.

Loop tiling breaks up a loop into a set of nested loops, with each inner loop performing the operations on a subset of the data. An example is shown in Figure 5-18. Here, each loop is broken up into tiles of size two. Each loop is split into two loops—for example, the inner ii loop iterates within the tile and the outer i loop iterates across the tiles.

The result is that the pattern of accesses across the a array is drastically different - rather than walking across one row in its entirety, the code walks through rows and columns following the tile structure. Loop tiling changes the order in which array elements are accessed, thereby allowing us to better control the behavior of the cache during loop execution.

Figure 5-18. Loop tiling.

We can also modify the arrays being indexed in loops. Array padding adds dummy data elements to a loop in order to change the layout of the array in the cache. Although these array locations will not be used, they do change how the useful array elements fall into cache lines. Judicious padding can in some cases significantly reduce the number of cache conflicts during loop execution.

Next in Part 4: The creation of procedures
To read Part 1, go to "Program design and analysis."
To read Part 2 , go to " Models of program, assemblers and linkers."

Used with the permission of the publisher, Newnes/Elsevier, this series of six articles is based on copyrighted material from "Computers as Components: Principles of Embedded Computer System Design" by Wayne Wolf. The book can be purchased on line.

Wayne Wolf  is currently the Georgia Research Alliance Eminent Scholar holding the Rhesa "Ray" S. Farmer, Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech's School of Electrical and Computer Engineering (ECE). Previously a professor of electrical engineering at Princeton University, he worked at AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing and of Design Automation for Embedded Systems.

Loading comments...