The basics of programming embedded processors: Part 4 -

The basics of programming embedded processors: Part 4

Another major code generation problem is the creation of procedures.Generating code for procedures is relatively straightforward once weknow the procedure linkage appropriate for the CPU. At the proceduredefinition, we generate the code to handle the procedure call andreturn. At each call of the procedure, we set up the procedureparameters and make the call.

The CPU's subroutine callmechanism is usually notsufficient to directly support procedures in modern programminglanguages. The procedure linkage mechanism provides a way for theprogram topass parameters into the program and for the procedure to return avalue.

It also provides help in restoring the values ofregisters thatthe procedure has modified. All procedures in a given programminglanguage use the same linkage mechanism (although different languagesmay use different linkages). The mechanism can also be used to callhandwritten assembly language routines from compiled code.

Procedure stacks are typically built to grow down fromhigh addresses. A stack pointer (sp) defines the endof the current frame, while a frame pointer (fp)defines the end of the last frame. (The fp is technically necessaryonly if the stack frame can be grown by the procedure duringexecution.)

The procedure can refer to an element in the frame byaddressing relative to sp. When a new procedure is called, the sp andfp are modified to push another frame onto the stack.

The ARM Procedure Call Standard (APCS) is a goodillustration of a typical procedure linkage mechanism. Although thestack frames are in main memory, understanding how registers are usedis key to understanding the mechanism, as explained below.

  • r0″r3 are used to pass parameters into theprocedure. r0 is also used to hold the return value. If more than fourparameters are required, they are put on the stack frame.
  • r4″r7 hold register variables.
  • r11 is the frame pointer and r13 is the stackpointer.
  • r10 holds the limiting address on stack size,which is used to check for stack overflows.

Other registers have additional uses in the protocol.

Figure5-14. Layout of a one-dimensional array in memory

Data Structures
The compiler must also translate references to data structures intoreferences to raw memories. In general, this requires addresscomputations. Some of these computations can be done at compile timewhile others must be done at run time. arrays Arrays are interestingbecause the address of an array element must in general be computed atrun time, since the array index may change. Let us first considerone-dimensional arrays:


The layout of the array in memory is shown in Figure5-14 above. The zeroth element is stored as the first element ofthe array,the first element directly below, and so on. We can create a pointerfor the array that points to the array's head, namely, a[0]. If we callthat pointer aptr for convenience, then we can rewrite the readingof  a[i] as

*(aptr + i)

Two-dimensional arrays are more challenging. There aremultiple possible ways to lay out a two-dimensional array in memory, asshown in Figure 5-15 below. Inthis form, which is known as row major,the inner variable of the array (j in a[i,j]) varies most quickly.(Fortran uses a different organization known as column major.)

Two-dimensional arrays also require more sophisticatedaddressing – inparticular, we must know the size of the array. Let us consider therow-major form. If the a[] array is of size N × M, then we canturn the two-dimensional array access into a one-dimensional arrayaccess. Thus,



a[i*M + j]

where the maximum value for j is M ' 1.

Figure5-15. Memory layout for two-dimensional arrays

A C struct is easier to address. As shown in Figure5-16 below, a structure is implemented as a contiguous block ofmemory.Fields in the structure can be accessed using constant offsets to thebase address of the structure. In this example, if field1 is four byteslong, then field2 can be accessed as

*(aptr + 4)

This addition can usually be done at compile time,requiring only the indirection itself to fetch the memory locationduring execution.

Expression simplification is a useful area for machine-independenttransformations. We can use the laws of algebra to simplifyexpressions. Consider the following expression:

a*b + a*c

We can use the distributive law to rewrite theexpression as

a*(b + c)

Figure5-16. C structure layout and access.

Since the new expression has only two operations ratherthan three for the original form, it is almost certainly cheaper,because it is both faster and smaller. Such transformations make somebroad assumptions about the relative cost of operations.

In some cases,simple generalizations about the cost of operations may be misleading.For example, a CPU with a multiply-andaccumulate instruction may beable to do a multiply and addition as cheaply as it can do an addition.However, such situations can often be taken care of in code generation.

We can also use the laws of arithmetic to furthersimplify expressions on constants. Consider the following C statement:

for (i = 0; i < 8 + 1; i++)

We can simplify 8 + 1 to 9 at compile time – there isnoneed to perform that arithmetic while the program is executing. Whywould a program ever contain expressions that evaluate to constants?Using named constants rather than numbers is good programming practiceand often leads to constant expression. The original form of the forstatement could have been

for (i = 0; i < NOPS + 1; i++)

where, for example, the added 1 takes care of atrailing null character.

Dead Code Elimination
Code that will never be executed can be safely removed from theprogram. The general problem of identifying code that will never beexecuted is difficult, but there are some important special cases whereit can be done. Programmers will intentionally introduce dead code incertain situations.

Consider this C code fragment:

#define DEBUG 0 …if (DEBUG) print_debug_stuff();

In the above case, theprint_debug_stuff() function is never executed, but the code allows theprogrammer to override the preprocessor variable definition (perhapswith a compile-time flag) to enable the debugging code.

This case iseasy to analyze because the condition is the constant 0, which C usesfor the false condition. Since there is no else clause in the ifstatement, the compiler can totally eliminate the if statement,rewriting the CDFG to provide a direct edge between the statementsbefore and after the if.

Some dead code may be introduced by the compiler. Forexample, certain optimizations introduce copy statements that copy onevariable to another. If uses of the first variable can be replaced byreferences to the second one, then the copy statement becomes dead codethat can be eliminated.

Procedure Inlining
Another machine-independent transformation that requires a little moreevaluation is procedure inlining. An inlined procedure does not have aseparate procedure body and procedure linkage; rather, the body of theprocedure is substituted in place for the procedure call. Figure 5-17belowshows an example of function inlining in C.

The C++ programminglanguage provides an inline construct that tells the compiler togenerate inline code for a function. In this case, an inlined procedureis generated in expanded form whenever possible. However, inlining isnot always the best thing to do. It does eliminate the procedurelinkage instructions. However, in a cached system, having multiplecopies of the function body may actually slow down the fetches of theseinstructions. Inlining also increases code size, and memory may beprecious.

int foo(a,b,c) { return a + b ” c; }

z = foo(w,x,y,);

z = w + x ” y;

Figure 5-17. Functioninlining in C. 

Loop Transformations
Loops are important program structures – although they are compactlydescribed in the source code, they often use a large fraction of thecomputation time. Many techniques have been designed to optimize loops.

A simple but useful transformation is known as loopunrolling, which is illustrated in Example 5-4 below. Loop unrollingis important because it helps expose parallelism that can be used bylater stages of the compiler.

Example 5-4 LoopUnrolling
A simple loop in C follows:

 for (i = 0; i < N; i++) {

This loop is executed a fixed number of times, namely,N. A straightforward implementation of the loop would create andinitialize the loop variable i, update its value on every iteration,and test it to see whether to exit the loop. However, since the loop isexecuted a fixed number of times, we can generate more direct code.

If we let N = 4, then we can substitute the above Ccode for the following loop:

 a[0] = b[0]*c[0]; 
a[1] = b[1]*c[1];
a[2] = b[2]*c[2];
a[3] = b[3]*c[3];

This unrolled code has no loop overhead code at all,that is, no iteration variable and no tests. But the unrolled loop hasthe same problems as the inlined procedure—it may interfere with thecache and expands the amount of code required.

We do not, of course, have to fully unroll loops.Rather than unroll the above loop four times, we could unroll it twice.The following code results:

 for (i = 0; i < 2; i++) { 
a[i*2] = b[i*2]*c[i*2];
a[i*2 + 1] = b[i*2 + 1]*c[i*2 + 1];

In this case, since all operations in the two lines ofthe loop body are independent, later stages of the compiler may be ableto generate code that allows them to be executed efficiently on theCPU's pipeline.

Loop fusion combines two or more loopsinto a single loop. For this transformation to be legal, two conditionsmust be satisfied. First, the loops must iterate over the same values.Second, the loop bodies must not have dependencies that would beviolated if they are executed together – for example, if the secondloop's ith iteration depends on the results of the i + 1th iteration ofthe first loop, the two loops cannot be combined. Loopdistribution is the opposite of loop fusion, that is,decomposing a single loop into multiple loops.

Loop tiling breaks up a loop into aset of nested loops, with each inner loop performing the operations ona subset of the data. An example is shown in Figure 5-18 below. Here,eachloop is broken up into tiles of size two. Each loop is split into twoloops – for example, the inner ii loop iterates within the tile and theouter i loop iterates across the tiles.

The result is that the patternof accesses across the a array is drastically different – rather thanwalking across one row in its entirety, the code walks through rows andcolumns following the tile structure. Loop tiling changes the order inwhich array elements are accessed, thereby allowing us to bettercontrol the behavior of the cache during loop execution.

Figure5-18. Loop tiling.

We can also modify the arrays being indexed in loops.Arraypadding adds dummy data elements to a loop in order to changethe layout of the array in the cache. Although these array locationswill not be used, they do change how the useful array elements fallinto cache lines. Judicious padding can in some cases significantlyreduce the number of cache conflicts during loop execution.

Next in Part 5: Register allocationand scheduling
To read Part 1, go to “Programdesign and analysis.
To read Part 2 , go to ” Modelsof program, assemblers and linkers.
To read Part 3, go to  “Basiccompilation techniques

Used with the permissionof the publisher, Newnes/Elsevier,this series of articles is based on copyrighted material from “Computersas Components: Principles of Embedded Computer System Design” byWayne Wolf. The bookcan be purchased on line.

Wayne Wolf  is currently the GeorgiaResearch Alliance Eminent Scholar holding the Rhesa “Ray” S. Farmer,Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech'sSchool of Electrical and Computer Engineering (ECE). Previously aprofessor ofelectrical engineering at Princeton University, he worked at AT&TBell Laboratories. He has served as editor in chief of the ACM Transactionson Embedded Computing andof DesignAutomation for Embedded Systems.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.