The basics of programming embedded processors: Part 3

It is useful to understand how a high-level language program is translated into instructions. First, since implementing an embedded computing system often requires controlling the instruction sequences used to handle interrupts, placement of data and instructions in memory, and so forth, understanding how the compiler works can help you know when you cannot rely on the compiler.

Next, because many applications are also performance sensitive, understanding how code is generated can help you meet your performance goals, either by writing high-level code that gets compiled into the instructions you want or by recognizing when you must write your own assembly code.

Compilation = translation + optimization. Compilation combines translation and optimization. The high-level language program is translated into the lower-level form of instructions; optimizations try to generate better instruction sequences than would be possible if the brute force technique of independently translating source code statements were used.

Optimization techniques focus on more of the program to ensure that compilation decisions that appear to be good for one statement are not unnecessarily problematic for other parts of the program.

The compilation process is summarized in Figure 5-12 below. Compilation begins with high-level language code such as C and generally produces assembly code. (Directly producing object code simply duplicates the functions of an assembler, which is a very desirable stand-alone program to have.)

The high-level language program is parsed to break it into statements and expressions. In addition, a symbol table is generated, which includes all the named objects in the program. Some compilers may then perform higher-level optimizations that can be viewed as modifying the high-level language program input without reference to instructions.
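To make this concrete, a symbol-table entry might record an identifier's name, type, scope, and assigned location. The following C sketch is purely illustrative - real compilers keep richer records and vary widely in the fields they store:

/* A minimal symbol-table entry sketch; the fields are
   illustrative, not those of any particular compiler. */
struct symbol {
    char name[32];  /* the identifier as written in the source */
    int  type;      /* encoded type, e.g., int, char, pointer */
    int  scope;     /* nesting level at which the name is visible */
    int  offset;    /* location assigned in a frame or data segment */
};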

Figure 5-12. The compilation process.

Simplifying arithmetic expressions is one example of a machine-independent optimization. Not all compilers do such optimizations, and compilers can vary widely regarding which combinations of machine-independent optimizations they do perform. Instruction-level optimizations are aimed at generating code.

They may work directly on real instructions or on a pseudoinstruction format that is later mapped onto the instructions of the target CPU. This level of optimization also helps modularize the compiler by allowing code generation to create simpler code that is later optimized. For example, consider the following array access code:

x[i] = c*x[i];

A simple code generator would generate the address for x[i] twice, once for each appearance in the statement. The later optimization phases can recognize this as a common expression that need not be computed twice.

While in this simple case it would be possible to create a code generator that never generated the redundant expression, taking into account every such optimization at code generation time is very difficult. We get better code and more reliable compilers by generating simple code first and then optimizing it.
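To see the redundancy at the source level, the following sketch shows roughly what the naive translation computes and what the optimizer reduces it to. The temporaries and function wrappers are illustrative - a real compiler performs this on its intermediate code, not on C:

void scale_naive(int x[], int i, int c)
{
    int *t1 = x + i;   /* address computed for the read */
    int  v  = c * (*t1);
    int *t2 = x + i;   /* the same address, recomputed for the write */
    *t2 = v;
}

void scale_optimized(int x[], int i, int c)
{
    int *t = x + i;    /* the common expression, computed once */
    *t = c * (*t);
}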

Statement Translation
In this section, we consider the basic job of translating the high-level language program with little or no optimization. Let's first consider how to translate an expression.

A large amount of the code in a typical application consists of arithmetic and logical expressions. Understanding how to compile a single expression, as described in Example 5-2 below, is a good first step in understanding the entire compilation process.

Example 5-2 Compiling an Arithmetic Expression
In the following arithmetic expression,

a*b + 5*(c - d)

the expression is written in terms of program variables. In some machines we may be able to perform memory-to-memory arithmetic directly on the locations corresponding to those variables. However, in many machines, such as the ARM, we must first load the variables into registers. This requires choosing which registers receive not only the named variables but also intermediate results such as (c - d).

The code for the expression can be built by walking the data flow graph. The data flow graph for the expression appears below.

The temporary variables for the intermediate values and final result have been named w, x, y, and z. To generate code, we walk from the tree's root (where z, the final result, is generated) by traversing the nodes in post order. During the walk, we generate instructions to cover the operation at every node. The path is presented below.

The nodes are numbered in the order in which code is generated. Since every node in the data flow graph corresponds to an operation that is directly supported by the instruction set, we simply generate an instruction at every node. Since we are making an arbitrary register assignment, we can use up the registers in order starting with r1. The resulting ARM code follows:

; operator 1 (+)
ADR r4,a ; get address for a
LDR r1,[r4] ; load a
ADR r4,b ; get address for b
LDR r2,[r4] ; load b
ADD r3,r1,r2 ; put w into r3
; operator 2 (-)
ADR r4,c ; get address for c
LDR r4,[r4] ; load c
ADR r5,d ; get address for d
LDR r5,[r5] ; load d
SUB r6,r4,r5 ; put x into r6
; operator 3 (*)
MOV r4,#5 ; load the constant 5 (MUL cannot take an immediate operand)
MUL r7,r4,r6 ; operator 3, puts y into r7
; operator 4 (+)
ADD r8,r7,r3 ; operator 4, puts z into r8

The equivalent SHARC code appears below.

! operator 1 (+) 
R1 = DM(a); ! load a
R2 = DM(b); ! load b
R3 = R1 + R2; ! compute a + b
! operator 2 (-)
R4 = DM(c); ! load c
R5 = DM(d); ! load d
R6 = R4 - R5; ! compute c - d
! operator 3 (*)
R7 = 5;
R8 = R6 * R7; ! compute y
! operator 4 (+)
R9 = R3 + R8; ! compute z

One obvious optimization is to reuse a register whose value is no longer needed. In the case of the intermediate values w, x, and y, we know that they cannot be used after the end of the expression (e.g., in another expression) since they have no name in the C program. However, the final result z may in fact be used in a C assignment and the value reused later in the program. In this case we would need to know when the register is no longer needed to determine its best use.

Figure 5-13. Flow of control in C and control flow diagrams.

In the previous example, we made an arbitrary allocation of variables to registers for simplicity. When we have large programs with multiple expressions, we must allocate registers more carefully since CPUs have a limited number of registers. We will consider register allocation later.

We also need to be able to translate control structures. Since conditionals are controlled by expressions, the code generation techniques of the last example can be used for those expressions, leaving us with the task of generating code for the flow of control itself.

Figure 5-13 above shows a simple example of changing flow of control in C - an if statement, in which the condition controls whether the true or false branch of the if is taken. Figure 5-13 also shows the control flow diagram for the if statement. Example 5-3 illustrates how to implement conditionals in assembly language.

Example 5-3 Generating Code for a Conditional
Consider the following C statement:

if (a + b > 0)
x = 5;
else
x = 7;

The CDFG for the statement appears below.

We know how to generate the code for the expressions. We can generate the control flow code by walking the CDFG. One ordered walk through the CDFG follows:

To generate code, we must assign a label to the first instruction at the end of a directed edge and create a branch for each edge that does not go to the immediately following instruction. The exact steps to be taken at the branch points depend on the target architecture.

On some machines, evaluating expressions generates condition codes that we can test in subsequent branches, and on other machines we must use test-and-branch instructions. ARM allows us to test condition codes, so we get the following ARM code for the 1-2-3 walk:

ADR r5,a ; get address for a
LDR r1,[r5] ; load a
ADR r5,b ; get address for b
LDR r2,[r5] ; load b
ADDS r3,r1,r2 ; compute a + b and set the condition codes
BLE label3 ; true condition falls through branch
; true case
MOV r3,#5 ; load constant
ADR r5,x ; get address of x
STR r3,[r5] ; store value into x
B stmtend ; done with the true case
; false case
label3 MOV r3,#7 ; load constant
ADR r5,x ; get address of x
STR r3,[r5] ; store value into x
stmtend ...

The 1-2 and 3-4 edges do not require a branch and label because they are straight-line code. In contrast, the 1-3 and 2-4 edges do require a branch and a label for the target.

Since expressions are generally created as straight-line code, they typically require careful consideration of the order in which the operations are executed. We have much more freedom when generating conditional code because the branches ensure that the flow of control goes to the right block of code. If we walk the CDFG in a different order and lay out the code blocks in a different order in memory, we still get valid code as long as we properly place branches.

Drawing a control flow graph based on the while form of the loop helps us understand how to translate it into instructions.

C compilers can generate assembler source (using the -S flag), which some compilers intersperse with the C code. Such code is a very good way to learn about both assembly language programming and compilation. Web Aid 5-Assembly has examples of assembly output from C compilers for the ARM and SHARC.

Procedures
Another major code generation problem is the creation of procedures. Generating code for procedures is relatively straightforward once we know the procedure linkage appropriate for the CPU. At the procedure definition, we generate the code to handle the procedure call and return. At each call of the procedure, we set up the procedure parameters and make the call.

The CPU's subroutine call mechanism is usually not sufficient to directly support procedures in modern programming languages. The procedure linkage mechanism provides a way for the program to pass parameters into the procedure and for the procedure to return a value.

It also provides help in restoring the values of registers that the procedure has modified. All procedures in a given programming language use the same linkage mechanism (although different languages may use different linkages). The mechanism can also be used to call handwritten assembly language routines from compiled code.

Procedure stacks are typically built to grow down from high addresses. A stack pointer (sp) defines the end of the current frame, while a frame pointer (fp) defines the end of the last frame. (The fp is technically necessary only if the stack frame can be grown by the procedure during execution.)

The procedure can refer to an element in the frame by addressing relative to sp. When a new procedure is called, the sp and fp are modified to push another frame onto the stack.

The ARM Procedure Call Standard (APCS) is a good illustration of a typical procedure linkage mechanism. Although the stack frames are in main memory, understanding how registers are used is key to understanding the mechanism, as explained below.

  • r0-r3 are used to pass parameters into the procedure. r0 is also used to hold the return value. If more than four parameters are required, they are put on the stack frame.
  • r4-r7 hold register variables.
  • r11 is the frame pointer and r13 is the stack pointer.
  • r10 holds the limiting address on stack size, which is used to check for stack overflows.

Other registers have additional uses in the protocol.
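As a rough illustration of these conventions, consider how the arguments of a C call map onto the registers. The functions below are hypothetical, and the comments simply restate the APCS rules listed above:

int sum4(int a, int b, int c, int d)   /* a through d arrive in r0-r3 */
{
    return a + b + c + d;              /* the result is returned in r0 */
}

int sum5(int a, int b, int c, int d, int e)   /* the fifth parameter, e,
                                                 is passed on the stack frame */
{
    return a + b + c + d + e;
}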

Figure 5-14. Layout of a one-dimensional array in memory.

Data Structures
The compiler must also translate references to data structures into references to raw memory. In general, this requires address computations. Some of these computations can be done at compile time while others must be done at run time. Arrays are interesting because the address of an array element must in general be computed at run time, since the array index may change. Let us first consider one-dimensional arrays:

a[i]

The layout of the array in memory is shown in Figure 5-14 above. The zeroth element is stored as the first element of the array, the first element directly below, and so on. We can create a pointer for the array that points to the array's head, namely, a[0]. If we call that pointer aptr for convenience, then we can rewrite the reading of a[i] as

*(aptr + i)
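In other words, the element address is the array's base address plus the index scaled by the element size. The following fragment is a minimal sketch of that arithmetic, with illustrative names and a 4-byte int assumed:

int a[16];

int read_element(int i)
{
    int *aptr = &a[0];    /* pointer to the array's head */
    return *(aptr + i);   /* same element as a[i]:
                             address = base + i*sizeof(int) */
}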

Two-dimensional arrays are more challenging. There are multiple possible ways to lay out a two-dimensional array in memory, as shown in Figure 5-15 below. In this form, which is known as row major, the inner variable of the array (j in a[i,j]) varies most quickly. (Fortran uses a different organization known as column major.)

Two-dimensional arrays also require more sophisticated addressing - in particular, we must know the size of the array. Let us consider the row-major form. If the a[] array is of size N × M, then we can turn the two-dimensional array access into a one-dimensional array access. Thus,

a[i,j]

becomes

a[i*M + j]

where the maximum value for j is M - 1.
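A short C sketch of the same computation follows; the dimensions and the accessor function are illustrative:

#define N 4   /* number of rows */
#define M 6   /* number of columns */

int a[N*M];   /* the two-dimensional array stored in row-major order */

int get(int i, int j)
{
    return a[i*M + j];   /* j varies fastest within a row */
}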

Figure 5-15. Memory layout for two-dimensional arrays.

A C struct is easier to address. As shown in Figure 5-16 below, a structure is implemented as a contiguous block of memory. Fields in the structure can be accessed using constant offsets to the base address of the structure. In this example, if field1 is four bytes long, then field2 can be accessed as

*(aptr + 4)

This addition can usually be done at compile time, requiring only the indirection itself to fetch the memory location during execution.
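The same idea can be written out in portable C using offsetof. The structure below is a hypothetical stand-in for the one in Figure 5-16:

#include <stddef.h>

struct rec {
    int  field1;   /* four bytes on a typical 32-bit target */
    char field2;
};

char read_field2(struct rec *aptr)
{
    /* offsetof(struct rec, field2) is a compile-time constant, so
       only the indirection itself happens at run time; this is
       equivalent to aptr->field2. */
    return *((char *)aptr + offsetof(struct rec, field2));
}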

Expression Simplification
Expression simplification is a useful area for machine-independent transformations. We can use the laws of algebra to simplify expressions. Consider the following expression:

a*b + a*c

We can use the distributive law to rewrite theexpression as

a*(b + c)

Figure 5-16. C structure layout and access.

Since the new expression has only two operations rather than three for the original form, it is almost certainly cheaper, because it is both faster and smaller. Such transformations make some broad assumptions about the relative cost of operations.

In some cases, simple generalizations about the cost of operations may be misleading. For example, a CPU with a multiply-and-accumulate instruction may be able to do a multiply and addition as cheaply as it can do an addition. However, such situations can often be taken care of in code generation.

We can also use the laws of arithmetic to further simplify expressions on constants. Consider the following C statement:

for (i = 0; i < 8 + 1; i++)

We can simplify 8 + 1 to 9 at compile time - there is no need to perform that arithmetic while the program is executing. Why would a program ever contain expressions that evaluate to constants? Using named constants rather than numbers is good programming practice and often leads to constant expressions. The original form of the for statement could have been

for (i = 0; i < NOPS + 1; i++)

where, for example, the added 1 takes care of a trailing null character.

Dead Code Elimination
Code that will never be executed can be safely removed from the program. The general problem of identifying code that will never be executed is difficult, but there are some important special cases where it can be done. Programmers will intentionally introduce dead code in certain situations.

Consider this C code fragment:

#define DEBUG 0
…
if (DEBUG) print_debug_stuff();

In the above case, the print_debug_stuff() function is never executed, but the code allows the programmer to override the preprocessor variable definition (perhaps with a compile-time flag) to enable the debugging code.

This case is easy to analyze because the condition is the constant 0, which C uses for the false condition. Since there is no else clause in the if statement, the compiler can totally eliminate the if statement, rewriting the CDFG to provide a direct edge between the statements before and after the if.

Some dead code may be introduced by the compiler. For example, certain optimizations introduce copy statements that copy one variable to another. If uses of the first variable can be replaced by references to the second one, then the copy statement becomes dead code that can be eliminated.
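A small source-level sketch of that situation follows. The names are illustrative, and a real compiler would do this on its intermediate representation rather than on C source:

int compute(void);    /* hypothetical producer of a value */

int example(void)
{
    int x = compute();
    int y = x;        /* copy introduced by an earlier transformation */
    return y + 1;     /* uses of y can be replaced by x, after which
                         the copy y = x is dead and can be removed */
}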

Procedure Inlining
Another machine-independent transformation that requires a little more evaluation is procedure inlining. An inlined procedure does not have a separate procedure body and procedure linkage; rather, the body of the procedure is substituted in place for the procedure call. Figure 5-17 below shows an example of function inlining in C.

The C++ programming language provides an inline construct that tells the compiler to generate inline code for a function. In this case, an inlined procedure is generated in expanded form whenever possible. However, inlining is not always the best thing to do. It does eliminate the procedure linkage instructions. However, in a cached system, having multiple copies of the function body may actually slow down the fetches of these instructions. Inlining also increases code size, and memory may be precious.

Function definition:
int foo(int a, int b, int c) { return a + b - c; }

Function call:
z = foo(w,x,y);

Inlining result:
z = w + x - y;

Figure 5-17. Function inlining in C.

Loop Transformations
Loops are important program structures - although they are compactly described in the source code, they often use a large fraction of the computation time. Many techniques have been designed to optimize loops.

A simple but useful transformation is known as loop unrolling, which is illustrated in Example 5-4. Loop unrolling is important because it helps expose parallelism that can be used by later stages of the compiler.

Example 5-4 Loop Unrolling
A simple loop in C follows:

for (i = 0; i < N; i++) {
    a[i] = b[i]*c[i];
}

This loop is executed a fixed number of times, namely, N. A straightforward implementation of the loop would create and initialize the loop variable i, update its value on every iteration, and test it to see whether to exit the loop. However, since the loop is executed a fixed number of times, we can generate more direct code.

If we let N = 4, then we can replace the above C code with the following fully unrolled version:

 a[0] = b[0]*c[0]; 
a[1] = b[1]*c[1];
a[2] = b[2]*c[2];
a[3] = b[3]*c[3];

This unrolled code has no loop overhead code at all, that is, no iteration variable and no tests. But the unrolled loop has the same problems as the inlined procedure - it may interfere with the cache, and it expands the amount of code required.

We do not, of course, have to fully unroll loops. Rather than unroll the above loop four times, we could unroll it twice. The following code results:

for (i = 0; i < 2; i++) {
    a[i*2] = b[i*2]*c[i*2];
    a[i*2 + 1] = b[i*2 + 1]*c[i*2 + 1];
}

In this case, since all operations in the two lines of the loop body are independent, later stages of the compiler may be able to generate code that allows them to be executed efficiently on the CPU's pipeline.

Loop fusion combines two or more loops into a single loop. For this transformation to be legal, two conditions must be satisfied. First, the loops must iterate over the same values. Second, the loop bodies must not have dependencies that would be violated if they are executed together - for example, if the second loop's ith iteration depends on the results of the (i + 1)th iteration of the first loop, the two loops cannot be combined. Loop distribution is the opposite of loop fusion, that is, decomposing a single loop into multiple loops.
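A minimal sketch of a legal fusion follows, with illustrative arrays and bounds:

#define N 64
int a[N], b[N], c[N];

void before_fusion(void)
{
    int i;
    for (i = 0; i < N; i++) a[i] = b[i] + 1;
    for (i = 0; i < N; i++) c[i] = a[i]*2;
}

void after_fusion(void)
{
    int i;
    for (i = 0; i < N; i++) {
        a[i] = b[i] + 1;
        c[i] = a[i]*2;   /* legal: iteration i uses only this
                            iteration's a[i], computed just above */
    }
}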

Loop tiling breaks up a loop into a set of nested loops, with each inner loop performing the operations on a subset of the data. An example is shown in Figure 5-18. Here, each loop is broken up into tiles of size two. Each loop is split into two loops - for example, the inner ii loop iterates within the tile and the outer i loop iterates across the tiles.

The result is that the pattern of accesses across the a array is drastically different - rather than walking across one row in its entirety, the code walks through rows and columns following the tile structure. Loop tiling changes the order in which array elements are accessed, thereby allowing us to better control the behavior of the cache during loop execution.
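In the spirit of Figure 5-18, the following sketch shows a doubly nested loop before and after tiling with tiles of size two. The loop body and bounds are illustrative, and N is assumed to be a multiple of the tile size:

#define N 8
int x[N][N], a[N][N], b[N][N];

void untiled(void)
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            x[i][j] = a[i][j]*b[j][i];
}

void tiled(void)
{
    int i, j, ii, jj;
    for (i = 0; i < N; i += 2)                /* iterate across tiles */
        for (j = 0; j < N; j += 2)
            for (ii = i; ii < i + 2; ii++)    /* iterate within a tile */
                for (jj = j; jj < j + 2; jj++)
                    x[ii][jj] = a[ii][jj]*b[jj][ii];
}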

Figure 5-18. Loop tiling.

We can also modify the arrays being indexed in loops. Array padding adds dummy data elements to an array in order to change the layout of the array in the cache. Although these array locations will not be used, they do change how the useful array elements fall into cache lines. Judicious padding can in some cases significantly reduce the number of cache conflicts during loop execution.
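A brief sketch of the idea follows; the sizes are illustrative, since the right amount of padding depends on the cache geometry of the target:

/* Rows of a power-of-two-sized array tend to map onto the same
   cache sets, so walking a column conflicts repeatedly. */
int a[64][64];

/* One dummy element per row shifts successive rows into different
   cache sets; the extra column is never referenced. */
int a_padded[64][64 + 1];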

Next in Part 4: The creation of procedures.
To read Part 1, go to "Program design and analysis."
To read Part 2, go to "Models of programs, assemblers and linkers."

Used with the permission of the publisher, Newnes/Elsevier, this series of six articles is based on copyrighted material from "Computers as Components: Principles of Embedded Computer System Design" by Wayne Wolf. The book can be purchased on line.

Wayne Wolf is currently the Georgia Research Alliance Eminent Scholar holding the Rhesa "Ray" S. Farmer, Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech's School of Electrical and Computer Engineering (ECE). Previously a professor of electrical engineering at Princeton University, he worked at AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing Systems and of Design Automation for Embedded Systems.
