The basics of programming embedded processors: Part 5 -

The basics of programming embedded processors: Part 5

Two important aspects of the code compilation  phase in embeddedprogram development are register allocation and schedule.

Given a block of code, wewant to choose assignments ofvariables (both declared and temporary) to registers to minimize thetotal number of required registers. Example 5-5 illustrates theimportance of proper register allocation.

Example 5-5 RegisterAllocation
To keep the example small, we assume that we can use only four of theARM's registers. In fact, such a restriction is notunthinkable – programming conventions can reserve certain registers forspecial purposes and significantly reduce the number of general-purposeregisters available.

Consider the following C code:
w = a + b; /* statement 1 */
x = c + w; /* statement 2 */
y = c + d; /* statement 3 */

A naive register allocation, assigning each variable toa separate register, would require seven registers for the sevenvariables in the above code. However, we can do much better by reusinga register once the value stored in the register is no longer needed.

To understand how to do this, we can draw a lifetimegraph that showsthe statements on which each statement is used. Appearing below is alifetime graph in which the x axis is the statement number in the Ccode and the y axis shows the variables.

A horizontal line stretches from the first statementwhere the variable is used to the last use of the variable; a variableis said to be live during this interval. At each statement, we candetermine every variable currently in use. The maximum number ofvariables in use at any statement determines the maximum number ofregisters required.

In this case, statement two requires threeregisters: c, w, and x. This fits within the four registers limitation.By reusing registers once their current values are no longer needed, wecan write code that requires no more than four registers. Appearingbelow is one register assignment.

 a r0 
b r1
c r2
d r0
w r3
x r0
y r3

The ARM assembly code that uses the above registerassignment follows:

If a section of code requires more registers than areavailable, we must spillsome of the values out tomemory temporarily. After computing some values, we write the values totemporary memory locations, reuse those registers in othercomputations, and then reread the old values from the temporarylocations to resume work.

Spilling registers is problematic in severalrespects. For example, it requires extra CPU time and uses up bothinstruction and data memory. Putting effort into register allocation toavoid unnecessary register spills is worth your time.

We can solve register allocation problems by buildinga conflict graph and solving a graph coloringproblem. As shown in Figure 5-19below , each variable in the high-levellanguage code is represented by a node. An edge is added between twonodes if they are both live at the same time.

The graph coloringproblem is to use the smallest number of distinct colors to color allthe nodes such that no two nodes are directly connected by an edge ofthe same color. The figure shows a satisfying coloring that uses threecolors. Graph coloring is NP-complete, but there are efficientheuristic algorithms that can give good results on typical registerallocation problems.

Figure5-19. Using graph coloring to solve the problem of Example 5-5.

Lifetime analysis assumes that we have alreadydetermined the order in which we will evaluate operations. In manycases, we have freedom in the order in which we do things. Consider thefollowing expression:

(a + b) * (c – d)

We have to do the multiplication last, but we can doeither the addition or the subtraction first. Different orders ofloads, stores, and arithmetic operations may also result in differentexecution times on pipelined machines.

If we can keep values inregisters without having to reread them from main memory, we can saveexecution time and reduce code size as well. Example 5-6 illustrateshow proper operator scheduling can improve register allocation.

Example 5-6 .Operator Schedulingfor Register Allocation
Here is sample C code fragment:

w = a + b; /* statement 1 */
x = c + d; /* statement 2 */
y = x + e; /* statement 3 */
z = a ' b; /* statement 4 */

If we compile the statements in the order in whichthey were written, we get the register graph below.

Since w is needed until the last statement, we needfive registers at statement 3, even though only three registers areneeded for the statement at line 3. If we swap statements 3 and 4(renumbering them)we reduce our requirements to three registers. Themodified C code follows:

 w =a + b; /* statement 1 */
 w= a – b; /* statement 2' */
 w= c + d; /* statement 3' */
 w= x + e; /* statement 4' */

The lifetime graph for the new code appears below.

Compare the ARM assembly code for the two codefragments. We have written both assuming that we have only four freeregisters. In the before version, we do not have to write outany values, but we must read a and b twice. The after versionallows us to retain all values in registers as long as we need them.

 Before version After version 
 LDR r0,a LDR r0,a 
LDR r1,b LDR r1,b
ADD r2,r0,r1 ADD r2,r1,r0
STR r2,w ; w = a + b STR r2,w ; w = a + b
LDRr r0,c SUB r2,r0,r1
LDR r1,d STR r2,z ; z = a ' b
ADD r2,r0,r1 LDR r0,c
STR r2,x ; x = c + d LDR r1,d
LDR r1,e ADD r2,r1,r0
ADD r0,r1,r2 STR r2,x ; x = c + d
STR r0,y ; y = x + e LDR r1,e
LDR r0,a ; reload a ADD r0,r1,r2
LDR r1,b ; reload b STR r0,y ; y = x + e
SUB r2,r1,r0
STR r2,z ; z = a " b

We have some freedom to choose the order in which operations will beperformed. We can use this to our advantage—for example, we may be ableto improve the register allocation by changing the order in whichoperations are performed, thereby changing the lifetimes of thevariables.

We can solve scheduling problems by keeping track ofresource utilization over time. We do not have to know the exactmicroarchitecture of the CPU – all we have to know is that, forexample,instruction types 1 and 2 both use resource A while instruction types 3and 4 use resource B.

CPU manufacturers generally disclose enoughinformation about the microarchitecture to allow us to scheduleinstructions even when they do not provide a detailed description ofthe CPU's internals.

We can keep track of CPU resources during instructionscheduling using a reservation table. Asillustrated in Figure 5-20 below ,rows in the table represent instructionexecution time slots and columns represent resources that must bescheduled.

Before scheduling an instruction to be executed at aparticular time, we check the reservation table to determine whetherall resources needed by the instruction are available at that time.

Upon scheduling the instruction, we update the table tonote allresources used by that instruction. Various algorithms can be used forthe scheduling itself, depending on the types of resources andinstructions involved, but the reservation table provides a goodsummary of the state of an instruction scheduling problem in progress.

We can also schedule instructions to maximizeperformance. When an instruction that takes more cycles thannormal to finish is in the pipeline, pipeline bubbles appear thatreduce performance. Software pipelining is a technique for reorderinginstructions across several loop iterations to reduce pipeline bubbles.Software pipelining is illustrated in Example 5-7.

Figure5-20. A reservation table for instruction scheduling.

Example 5-7 Software Pipeliningin SHARC
Software pipelining can be illustrated with a small loop on the SHARC.Consider the following simple loop for a dot product computation:

 for (i = 0; i < N; i++)  
sum += a[i] * b[i];

The SHARC can perform several operations in parallel.However, we can't perform the necessary loads and arithmetic operationson the same cycle.

Assume that we want to rewrite the loop so that weperform two loads, an addition, and a multiplication in one iteration.However, because the result of one operation depends on others, wecan't do all these operations for the same iteration at the same time.Instead, the loop body will perform operations from the following threedifferent iterations:

1. The twofetches of the array elements are performed for availabilityin the next cycle.
2. The multiplicationa[i]*b[i] is performed on the operands fetched bythe previous loop iteration.
3. The addition into the dotproduct running sum is performed using theresult of the multiplication performed in the previous loop iteration.

When we rewrite the loop, we need to generate specialheader and trailer code that takes care of the first and lastiterations that cannot be pipelined. The C code below is designed toshow which operations can be performed in parallel on the SHARC.

In this code, none of the operations in the loop bodydepend on each other – remember that the p in p = ai*bi is writing adifferent value than is being used by sum += p, so they can operate inparallel. This allows all the operations to be performed in a singleSHARC instruction that performs two data reads, a multiply, and anaddition.

Selecting the instructions to use to implement each operation is nottrivial. There may be several different instructions that can be usedto accomplish the same goal, but they may have different executiontimes. Moreover, using one instruction for one part of the program mayaffect the instructions that can be used in adjacent code. Although wecan't discuss all the problems and methods for code generation here, alittle bit of knowledge helps us envision what the compiler is doing.

One useful technique for generating code is templatematching, illustrated in Figure5-21 below. We have a DAG thatrepresents the expression for which we want to generate code. In orderto be able to match up instructions and operations, we representinstructions using the same DAG representation. We shaded theinstruction template nodes to distinguish them from code nodes.

Eachnode has a cost, which may be simply the execution time of theinstruction or may include factors for size, power consumption, and soon. In this case, we have shown that each instruction takes the sameamount of time, and thus all have a cost of 1. Our goal is to cover allnodes in the code DAG with instruction DAGs – until we have covered thecode DAG we haven't generated code for all the operations in theexpression.

In this case, the lowest-cost covering uses themultiply-add instruction to cover both nodes. If we first tried tocover the bottom node with the multiply instruction, we would findourselves blocked from using the multiply-add instruction. Dynamicprogramming can be used to efficiently find the lowest-cost covering oftrees, and heuristics can extend the technique to DAGs.

Figure5-21. Code generation by template matching.

Understanding andUsing Your Compiler
Clearly, the compiler can vastly transform your program during thecreation of assembly language. But compilers are also substantiallydifferent in terms of the optimizations they perform. Understandingyour compiler can help you get the best code out of it.

Studying the assembly language output of the compileris a good way to learn about what the compiler does. Some compilerswill annotate sections of code to help you make the correspondencebetween the source and assembler output. Starting with small examplesthat exercise only a few types of state-ments will help.

You canexperiment with different optimization levels (the -O flag on most Ccompilers). You can also try writing the same algorithm in several waysto see how the compiler's output changes. If you can't get yourcompiler to generate the code you want, you may need to write your ownassembly language. You can do this by writing it from scratch ormodifying the output of the compiler.

If you write your own assemblycode, you must ensure that it conforms to all compiler conventions,such as procedure call linkage. If you modify the compiler output, youshould be sure that you have the algorithm right before you startwriting code so that you don't have to repeatedly edit the compiler'sassembly language output. You also need to clearly document the factthat the high-level language source is, in fact, not the code used inthe system.

Figure5-22. Structure of a program interpretation system

Interpreters and JITCompilers
Programs are not always compiled and then separately executed. In somecases, it may make sense to translate the program into instructionsduring execution. Two well-known techniques for on-the-fly translationare interpretation and just-in-time (JIT)compilation. The trade-offs for both techniques are similar.Interpretation or JIT compilation adds overhead – both time and memory- toexecution.

However, that overhead may be more than made up for insomecircumstances. For example, if only parts of the program are executedover some period of time, interpretation or JIT compilation may savememory, even taking overhead into account. Interpretation and JITcompilation also provide added security when programs arrive over thenetwork.

An interpreter translates programstatements one at a time. The program may be expressed in a high-levellanguage, with Forth being a prime example of an embedded language thatis interpreted. An interpreter may also interpret instructions in someabstract machine language.

As illustrated in Figure5-22 above , theinterpreter sits between the program and the machine. It translates onestatement of the program at a time. The interpreter may or may notgenerate an explicit piece of code to represent the statement.

Becausethe interpreter translates only a very small piece of the program atany given time, a small amount of memory is used to hold intermediaterepresentations of the program. In many cases, a Forth program plus theForth interpreter are smaller than the equivalent native machine code.

JIT compilers have been used for manyyears, but are best known today for their use in Java environments. AJIT compiler is somewhere between an interpreter and astand-alone compiler.

A JIT compiler produces executable code segmentsfor pieces of the program. However, it compiles a section of theprogram (such as a function) only when it knows it will be executed.Unlike an interpreter, it saves the compiled version of the code sothat the code does not have to be retranslated the next time it isexecuted.

A JIT compiler saves some execution time overheadrelative toan interpreter because it does not translate the same piece of codemultiple times, but it also uses more memory for the intermediaterepresentation. The JIT compiler usually generates machine codedirectly rather than building intermediate program representation datastructures such as the CDFG. A JIT compiler also usually performs onlysimple optimizations as compared to a stand-alone compiler.

Next in Part 6, Analysis andoptimization of execution time.
To read Part 1, go to “Programdesign and analysis.
To read Part 2 , go to ” Modelsof program, assemblers and linkers.”
To read Part 3, go to “BasicCompilation Techniques
To read Part 4, go to “The creation ofprocedures

Used with the permissionof the publisher, Newnes/Elsevier,this series of six articles is based on copyrighted material from “Computersas Components: Principles of Embedded Computer System Design” byWayne Wolf. The bookcan be purchased on line.

Wayne Wolf  is currently the GeorgiaResearch Alliance Eminent Scholar holding the Rhesa “Ray” S. Farmer,Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech'sSchool of Electrical and Computer Engineering (ECE). Previously aprofessor ofelectrical engineering at Princeton University, he worked at AT&TBell Laboratories. He has served as editor in chief of the ACM Transactionson Embedded Computing andof DesignAutomation for Embedded Systems.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.