Back to the Basics - Practical Embedded Coding Tips: Part 5 -

Back to the Basics – Practical Embedded Coding Tips: Part 5

A C compiler is a basic tool for most embedded systems programmers. Itis the tool by which the ideas and algorithms in your application(expressed as C source code) are transformed into machine codeexecutable by your target processor.

To a large extent, the C compiler determines how large theexecutable code for the application will be. A compiler performs manytransformations on a program in order to generate the best possiblecode.

Examples of such transformations are storing values in registersinstead of memory, removing code that does nothing useful, reorderingcomputations in a more efficient order, and replacing arithmeticoperations by cheaper operations.

The C language provides the compiler with significant freedomregarding how to precisely implement each C operation on the targetsystem. This freedom is an important factor in why C can usually becompiled very efficiently, but a programmer needs to be aware of thecompiler's freedom in order to write robust code.

To most programmers of embedded systems, the case that a programdoes not quite fit into the available memory is a familiarphenomenon. Recoding parts of an application in assembly language orthrowing out functionality may seem to be the only alternatives, whilethe solution could be as simple as rewriting the C code in a morecompiler-friendly manner.

In order to write code that is compiler friendly, you need to have aworking understanding of compilers. Some simple changes to a program,like changing the data type of a frequently-accessed variable, can havea big impact on code size while other changes have no effect at all.Having an idea of what a compiler can and cannot do makes suchoptimization work much easier.

Assembly programs specify both what, how, and the precise order inwhich calculations should be carried out. A C program, on the otherhand, only specifies the calculations that should be performed. Withsome restrictions, the order and the technique used to realize thecalculations are up to the compiler.

The compiler will look at the code and try to understand what isbeing calculated. It will then generate the best possible code, giventhe information it has managed to obtain, locally within a singlestatement and also across entire functions and sometimes even wholeprograms.

The Structure of a Compiler
In general, a program is processed in six main steps in a moderncompiler (not all compilers follow this blueprint completely, but as aconceptual guide it is sufficient):

Parser. The conversion from C source code to anintermediate language.
High-level optimization. Optimizations on theintermediate code.
Code generation. Generation of target machine code fromthe intermediate code.
Low-level optimization. Optimizations on the machinecode.
Assembly. Generation of an object file that can be linkedfrom the target machine code.
Linking. Linking of all the code for a program into anexecutable or downloadable file.

The parser parses the C source code, checking the syntax andgenerating error messages if syntactical errors are found in thesource. If no errors are found, the parser then generates intermediatecode (an internal representation of the parsed code), and compilationproceeds with the first optimization pass.

The high-level optimizer transforms the code to make it better. Theoptimizer has a large number of transformations available that canimprove the code, and will perform those that it deems relevant to theprogram at hand. Note that we use the word “transformation” and not”optimization.”

“Optimization” is a bit of a misnomer. It conveys the intuition thata change always improves a program and that we actually find optimalsolutions, while in fact optimal solutions are very expensive or evenimpossible to find (undecidable, in computer science lingo). To ensurereasonable compilation times and the termination of the compilationprocess, the compiler has to use heuristic methods (“good guesses”).

Transforming a program is a highly nonlinear activity, wheredifferent orderings of transformations will yield different results,and some transformations may actually make the code worse. Piling onmore “optimizations” will not necessarily yield better code.

When the high-level optimizer is done, the code generator transformsthe intermediate code to the target processor instruction set. Thisstage is performed, piece-by-piece, on the intermediate code from theoptimizer, and the compiler will try to do smart things on the level ofa single expression or statement, but not across several statements.

The code generator will also have to account for any differencesbetween the C language and the target processor. For example, 32-bitarithmetic will have to be broken down to 8-bit arithmetic for asmall-embedded target (like an Intel 8051, Motorola 68HC08, SamsungSAM8, or Microchip PIC).

A very important part of code generation is allocating registers tovariables. The goal is to keep as many values as possible in registers,since register-based operations are typically faster and smaller thanmemory-based operations.

After the code generator is done, another phase of optimizationtakes place, where transformations are performed on the target code.The low-level optimizer will clean up after the code generator (whichsometimes makes suboptimal coding decisions), and perform moretransformations.

There are many transformations that can only be applied on thetarget code, and some that are repeated from the high-level phase, buton a lower level. For example, transformations like removing a “clearcarry” instruction if we already know that the carry flag is zero areonly possible at the target code level, since the flags are not visiblebefore code generation.

After the low-level optimizer is finished, the code is sent to anassembler and output to an object file. All the object files of aprogram are then linked to produce a final binary executable ROM image(in some format appropriate for the target). The linker may alsoperform some optimizations, for example, by discarding unusedfunctions.

Thus, one can see that the seemingly simple task of compiling a Cprogram is actually a rather long and winding road through a highlycomplex system. Different transformations may interact, and a localimprovement may be worse for the whole program. For example, anexpression can typically be evaluated more efficiently if given moretemporary registers.

Taking a local view, it thus seems to be a good idea to provide asmany registers as necessary. A global effect, however, is thatvariables in registers may have to be spilled to memory, which could bemore expensive than evaluating the expression with fewer registers.

The Meaning of a Program
Before the compiler can apply transformations to a program, it mustanalyze the code to determine which transformations are legal andlikely to result in improvements. The legality of transformations isdetermined by the semantics laid down by the C language standard.

The most basic interpretation of a C program is that only statementsthat have side effects or compute values used for performing sideeffects are relevant to the meaning of a program.

Side effects are any statements that change the global state of theprogram. Examples that are generally considered to be side effects arewriting to a screen, changing a global variable, reading a volatilevariable, and calling unknown functions.

The calculations between the side effects are carried out accordingto the principle of “do what I mean, not what I say.” The compiler willtry to rewrite each expression into the most efficient form possible,but a rewrite is only possible if the result of the rewritten code isthe same as the original expression. The C standard defines what isconsidered “the same,” and sets the limits of allowable optimizations.

Basic Transformations . A modern compiler performs alarge number of basic transformations that act locally, like foldingconstant expressions, replacing expensive operations by cheaper ones(“strength reduction”), removing redundant calculations, and movinginvariant calculations outside of loops. The compiler can do mostmechanical improvements just as well as a human programmer, but withouttiring or making mistakes.

The Table below shows (in C form for readability) sometypical basic transformations performed by a modern C compiler. Notethat an important implication of this basic cleanup is that you canwrite code in a readable way and let the compiler calculate constantexpressions and worry about using the most efficient operations.

All code that is not considered useful—according to the definitionin the previous section—is removed. This removal of unreachable oruseless computations can cause some unexpected effects. An importantexample is that empty loops are completely discarded, making “emptydelay loops” useless. The code shown below stopped workingproperly when upgrading to a modern compiler that removed uselesscomputations:

Register Allocation . Processors usually give betterperformance and require smaller code when calculations are performedusing registers instead of memory. Therefore, the compiler will try toassign the variables in a function to registers.

A local variable or parameter will not need any RAM allocated at allif the variable can be kept in registers for the duration of thefunction. If there are more variables than registers available, thecompiler needs to decide which of the variables to keep in registers,and which to put in memory.

This is the problem of register allocation , and it cannot besolved optimally. Instead, heuristic techniques are used. Thealgorithms used can be quite sensitive, and even small changes to afunction may considerably alter the register allocation.

Note that a variable only needs a register when it is being used. Ifa variable is used only in a small part of a function, it will beregister allocated in that part, but it will not exist in the rest ofthe function. This explains why a debugger sometimes tells you that avariable is “optimized away at this point.”

The register allocator is limited by the language rules of C—forexample, global variables have to be written back to memory whencalling other functions, since they can be accessed by the calledfunction, and all changes to global variables must be visible to allfunctions. Between function calls, global variables can be kept inregisters.

Note that there are times when you do not want variables to beregister allocated. For example, reading an I/O port or spinning on alock, you want each read in the source code to be made from memory,since the variable can be changed outside the control of your program.This is where the volatile keyword is to be used. It signals to thecompiler that the variable should not ever be allocated in registers,but read from memory (or written) each time it is accessed.

In general, only simple values like integers, floats, and pointersare considered for register allocation. Arrays have to reside in memorysince they are designed to be accessed through pointers, and structuresare usually too large. In addition, on small processors, large valueslike 32-bit integers and floats may be hard to allocate to registers,and maybe only 16-bit and 8-bit variables will be given registers.

To help register allocation, you should strive to keep the number ofsimultaneously live variables low. Also, try to use the smallestpossible data types for your variables, as this will reduce the numberof required registers on 8-bit and 16-bit processors.

Function Calls . As assembly programmers well know,calling a function written in a high-level language can be rathercomplicated and costly. The calling function must save global variablesback to memory, make sure to move local variables to the registers thatsurvive the call (or save to the stack), and parameters may have to bepushed on the stack.

Inside the called function, registers will have to be saved,parameters taken off the stack, and space allocated on the stack forlocal variables. For large functions with many parameters andvariables, the effort required for a call can be quite large.

Modern compilers do their best, however, to reduce the cost of afunction call, especially the use of stack space. A number of registerswill be designated for parameters, so that short parameter lists willmost likely be passed entirely in registers. Likewise, the return valuewill be put in a register, and local variables will only be put on thestack if they cannot be allocated to registers.

The number of register parameters will vary wildly between differentcompilers and architecture. In most cases, at least four registers aremade available for parameters. Note also that just like for registerallocation, only small parameter types will be passed in registers.

Arrays are always passed as pointers to the array (C semantics dictate that), andstructures are usually copied to the stack and the structure parameterchanged to a pointer to a structure. That pointer might be passed in aregister, however. To save stack space, it is thus a good idea toalways use pointers to structures as parameters and not the structuresthemselves.

C supports functions with variable numbers of arguments. This isused in standard library functions like printf() and scanf() to providea convenient interface. However, the implementation of variable numbersof arguments to a function incurs significant overhead.

All arguments have to be put on the stack, since the function mustbe able to step through the parameter list using pointers to arguments,and the code accessing the arguments is much less efficient than forfixed parameter lists. There is no type-checking on the arguments,which increases the risk of bugs. Variable numbers of arguments shouldnot be used in embedded systems!

Function Inlining . It is good programming practice tobreak out common pieces of computation and accesses to shared datastructures into (small) functions. This, however, brings with it thecost of calling a function each time something should be done. In orderto mitigate this cost, the compiler transformation of function inlininghas been developed. Inlining a function means that a copy of the codefor the function is placed in the calling function, and the call isremoved.

Inlining is a very efficient method to speed up the code, since thefunction call overhead is avoided but the same computations carriedout. Many programmers do this manually by using preprocessor macros forcommon pieces of code instead of functions, but macros lack the typechecking of functions and produce harder-to-find bugs.

The executable code will often grow as a result of inlining, sincecode is being copied into several places. Inlining may also help shrinkthe code: for small functions, the code size cost of a function callmight be bigger than the code for the function. In this case, inlininga function will actually save code size (as well as speed up theprogram).

The main problem when inlining for size is to estimate the gains incode size (when optimizing for speed, the gain is almost guaranteed).Since inlining in general increases the code size, the inliner has tobe quite conservative. The effect of inlining on code size cannot beexactly determined, since the code of the calling function isdisturbed, with nonlinear effects.

To reduce the code size, the ideal would be to inline all calls to afunction, which allows us to remove the function from the programaltogether. This is only possible if all calls are known, i.e., areplaced in the same source file as the function, and the function ismarked static , so that it cannot be seen from other files.

Otherwise, the function will have to be kept (even though it mightstill be inlined at some calls), and we rely on the linker to remove itif it is not called. Since this decreases the likely gain frominlining, we are less likely to inline such a function.

Low-Level Code Compression . A common transformation onthe target code level is to find common sequences of instructions fromseveral functions, and break them out into subroutines. Thistransformation can be very effective at shrinking the executable codeof a program, at the cost of performing more jumps (note that thistransformation only introduces machine-level subroutine calls and notfull-strength function calls). Experience shows a gain from 10″30% forthis transformation.

Linker . The linker should be considered an integralpart of the compilation system, since there are some transformationsthat are performed in the linker. The most basic embedded-systemslinker should remove all unused functions and variables from a program,and only include the parts of the standard libraries that are actuallyused.

The granularity at which program parts are discarded varies, fromfiles or library modules down to individual functions or even snippetsof code. The smaller the granularity, the better the linker.Unfortunately, some linkers derived from desktop systems work on a perfile basis, and this will give unnecessarily big code.

Some linkers also perform postcompilation transformations on theprogram. Common transformation is the removal of unnecessary bank andpage switches (that cannot be done at compile-time since the exactallocation of variable addresses is unknown at that time) and codecompression as discussed above extended to the entire program.

Controlling Compiler Optimization
A compiler can be instructed to compile a program with different goals,usually speed or size. For each setting, a set of transformations hasbeen selected that tend to work towards the goal—maximal speed (minimalexecution time) or minimal size. The settings should be consideredapproximate. To give better control, most compilers also allowindividual transformations to be enabled or disabled.

For size optimization, the compiler uses a combination oftransformations that tend to generate smaller code, but it might failin some cases, due to the characteristics of the compiled program. Asan example, the fact that function inlining is more aggressive forspeed optimization makes some programs smaller on the speed settingthan on the size setting.

The example data below demonstrates this, the two programswere compiled with the same version of the same compiler, using thesame memory and data model settings, but optimizing for speed or size:

Program 1 gets slightly smaller with speed optimization, whileprogram 2 is considerably larger, an effect we traced to the fact thatfunction inlining was lucky on program 1. The conclusion is that oneshould always try to compile a program with different optimizationsettings and see what happens.

It is often worthwhile to use different compilation settings fordifferent files in a project: put the code that must run very quicklyinto a separate file and compile that for minimal execution time(maximum speed), and the rest of the code for minimal code size. Thiswill give a small program, which is still fast enough where it matters.Some compilers allow different optimization settings for differentfunctions in the same source file using #pragma directives.

Memory Model . Embedded microcontrollers are usuallyavailable in several variants, each with a different amount of programand data memory. For smaller chips, the fact that the amount of memorythat can be addressed is limited can be exploited by the compiler togenerate smaller code.

An 8-bit direct pointer uses less code memory than a 24-bit bankedpointer where software has to switch banks before each access. Thisgoes for code as well as data.

For example, some Atmel AVR chips have a code area of only 8 kb,which allows a small jump with an offset of +/- 4 kb to reach all codememory, using wraparound to jump from high addresses to low addresses.Taking advantage of this yields smaller and faster code.

The capacity of the target chip can be communicated to the compilerusing a memory model option. There are usually several different memorymodels available, ranging from “small” up to “huge.” In general,function calls get more expensive as the amount of code allowedincreases, and data accesses and pointers get bigger and more expensiveas the amount of accessible data increases.

Make sure to use the smallest model that fits your target chip andapplication—this might give you large savings in code size.

Next in Part 6: Writing compiler-friendly C-code
To read Part 1 in this series, go to Reentrancy, atomic variables and recursion.
To read Part 2 in this series, go to Asynchronous Hardware/Firmware
To read Part 3, go to Metastable States
To read Part 4, go to Dealing With Interrupt Latency

JakobEngblom ( technical marketing manager atat Virtutech.He has a MSc in computer science and a PhD in Computer Systems fromUppsala University, and hasworked with programming tools and simulation tools for embedded andreal-time systems since 1997. 

He was a contributor of material to “ The Firmware Handbook,” editedby Jack Ganssle, upon which this series of articles was based andprintedwith permission from Newnes, a division of Elsevier. Copyright 2008.  Forother publications by Jakob Engblom, see

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.