To a large extent, the C compiler determines how large the executable code for the application will be. A compiler performs many transformations on a program in order to generate the best possible code.
Examples of such transformations are storing values in registers instead of memory, removing code that does nothing useful, reordering computations in a more efficient order, and replacing arithmetic operations by cheaper operations.
The C language provides the compiler with significant freedom regarding how to precisely implement each C operation on the target system. This freedom is an important factor in why C can usually be compiled very efficiently, but a programmer needs to be aware of the compiler's freedom in order to write robust code.
To most programmers of embedded systems, the case that a program does not quite fit into the available memory is a familiar phenomenon. Recoding parts of an application in assembly language or throwing out functionality may seem to be the only alternatives, while the solution could be as simple as rewriting the C code in a more compiler-friendly manner.
In order to write code that is compiler friendly, you need to have a working understanding of compilers. Some simple changes to a program, like changing the data type of a frequently-accessed variable, can have a big impact on code size while other changes have no effect at all. Having an idea of what a compiler can and cannot do makes such optimization work much easier.
Assembly programs specify both what, how, and the precise order in which calculations should be carried out. A C program, on the other hand, only specifies the calculations that should be performed. With some restrictions, the order and the technique used to realize the calculations are up to the compiler.
The compiler will look at the code and try to understand what is being calculated. It will then generate the best possible code, given the information it has managed to obtain, locally within a single statement and also across entire functions and sometimes even whole programs.
The Structure of a Compiler
In general, a program is processed in six main steps in a modern
compiler (not all compilers follow this blueprint completely, but as a
conceptual guide it is sufficient):
Parser. The conversion from C source code to an
intermediate language.
High-level optimization. Optimizations on the
intermediate code.
Code generation. Generation of target machine code from
the intermediate code.
Low-level optimization. Optimizations on the machine
code.
Assembly. Generation of an object file that can be linked
from the target machine code.
Linking. Linking of all the code for a program into an
executable or downloadable file.
The parser parses the C source code, checking the syntax and generating error messages if syntactical errors are found in the source. If no errors are found, the parser then generates intermediate code (an internal representation of the parsed code), and compilation proceeds with the first optimization pass.
The high-level optimizer transforms the code to make it better. The optimizer has a large number of transformations available that can improve the code, and will perform those that it deems relevant to the program at hand. Note that we use the word "transformation" and not "optimization."
"Optimization" is a bit of a misnomer. It conveys the intuition that a change always improves a program and that we actually find optimal solutions, while in fact optimal solutions are very expensive or even impossible to find (undecidable, in computer science lingo). To ensure reasonable compilation times and the termination of the compilation process, the compiler has to use heuristic methods ("good guesses").
Transforming a program is a highly nonlinear activity, where different orderings of transformations will yield different results, and some transformations may actually make the code worse. Piling on more "optimizations" will not necessarily yield better code.
When the high-level optimizer is done, the code generator transforms the intermediate code to the target processor instruction set. This stage is performed, piece-by-piece, on the intermediate code from the optimizer, and the compiler will try to do smart things on the level of a single expression or statement, but not across several statements.
The code generator will also have to account for any differences between the C language and the target processor. For example, 32-bit arithmetic will have to be broken down to 8-bit arithmetic for a small-embedded target (like an Intel 8051, Motorola 68HC08, Samsung SAM8, or Microchip PIC).
A very important part of code generation is allocating registers to variables. The goal is to keep as many values as possible in registers, since register-based operations are typically faster and smaller than memory-based operations.
After the code generator is done, another phase of optimization takes place, where transformations are performed on the target code. The low-level optimizer will clean up after the code generator (which sometimes makes suboptimal coding decisions), and perform more transformations.
There are many transformations that can only be applied on the target code, and some that are repeated from the high-level phase, but on a lower level. For example, transformations like removing a "clear carry" instruction if we already know that the carry flag is zero are only possible at the target code level, since the flags are not visible before code generation.
After the low-level optimizer is finished, the code is sent to an assembler and output to an object file. All the object files of a program are then linked to produce a final binary executable ROM image (in some format appropriate for the target). The linker may also perform some optimizations, for example, by discarding unused functions.
Thus, one can see that the seemingly simple task of compiling a C program is actually a rather long and winding road through a highly complex system. Different transformations may interact, and a local improvement may be worse for the whole program. For example, an expression can typically be evaluated more efficiently if given more temporary registers.
Taking a local view, it thus seems to be a good idea to provide as many registers as necessary. A global effect, however, is that variables in registers may have to be spilled to memory, which could be more expensive than evaluating the expression with fewer registers.
The most basic interpretation of a C program is that only statements that have side effects or compute values used for performing side effects are relevant to the meaning of a program.
Side effects are any statements that change the global state of the program. Examples that are generally considered to be side effects are writing to a screen, changing a global variable, reading a volatile variable, and calling unknown functions.
The calculations between the side effects are carried out according to the principle of "do what I mean, not what I say." The compiler will try to rewrite each expression into the most efficient form possible, but a rewrite is only possible if the result of the rewritten code is the same as the original expression. The C standard defines what is considered "the same," and sets the limits of allowable optimizations.
Basic Transformations. A modern compiler performs a large number of basic transformations that act locally, like folding constant expressions, replacing expensive operations by cheaper ones ("strength reduction"), removing redundant calculations, and moving invariant calculations outside of loops. The compiler can do most mechanical improvements just as well as a human programmer, but without tiring or making mistakes.
The Table below shows (in C form for readability) some typical basic transformations performed by a modern C compiler. Note that an important implication of this basic cleanup is that you can write code in a readable way and let the compiler calculate constant expressions and worry about using the most efficient operations.
All code that is not considered useful—according to the definition in the previous section—is removed. This removal of unreachable or useless computations can cause some unexpected effects. An important example is that empty loops are completely discarded, making "empty delay loops" useless. The code shown below stopped working properly when upgrading to a modern compiler that removed useless computations:
Register Allocation.Processors usually give better performance and require smaller code when calculations are performed using registers instead of memory. Therefore, the compiler will try to assign the variables in a function to registers.
A local variable or parameter will not need any RAM allocated at all if the variable can be kept in registers for the duration of the function. If there are more variables than registers available, the compiler needs to decide which of the variables to keep in registers, and which to put in memory.
This is the problem of register allocation, and it cannot be solved optimally. Instead, heuristic techniques are used. The algorithms used can be quite sensitive, and even small changes to a function may considerably alter the register allocation.
Note that a variable only needs a register when it is being used. If a variable is used only in a small part of a function, it will be register allocated in that part, but it will not exist in the rest of the function. This explains why a debugger sometimes tells you that a variable is "optimized away at this point."
The register allocator is limited by the language rules of C—for example, global variables have to be written back to memory when calling other functions, since they can be accessed by the called function, and all changes to global variables must be visible to all functions. Between function calls, global variables can be kept in registers.
Note that there are times when you do not want variables to be register allocated. For example, reading an I/O port or spinning on a lock, you want each read in the source code to be made from memory, since the variable can be changed outside the control of your program. This is where the volatile keyword is to be used. It signals to the compiler that the variable should not ever be allocated in registers, but read from memory (or written) each time it is accessed.
In general, only simple values like integers, floats, and pointers are considered for register allocation. Arrays have to reside in memory since they are designed to be accessed through pointers, and structures are usually too large. In addition, on small processors, large values like 32-bit integers and floats may be hard to allocate to registers, and maybe only 16-bit and 8-bit variables will be given registers.
To help register allocation, you should strive to keep the number of simultaneously live variables low. Also, try to use the smallest possible data types for your variables, as this will reduce the number of required registers on 8-bit and 16-bit processors.
Inside the called function, registers will have to be saved, parameters taken off the stack, and space allocated on the stack for local variables. For large functions with many parameters and variables, the effort required for a call can be quite large.
Modern compilers do their best, however, to reduce the cost of a function call, especially the use of stack space. A number of registers will be designated for parameters, so that short parameter lists will most likely be passed entirely in registers. Likewise, the return value will be put in a register, and local variables will only be put on the stack if they cannot be allocated to registers.
The number of register parameters will vary wildly between different compilers and architecture. In most cases, at least four registers are made available for parameters. Note also that just like for register allocation, only small parameter types will be passed in registers.
Arrays are always passed as pointers to the array (C semantics dictate that), and structures are usually copied to the stack and the structure parameter changed to a pointer to a structure. That pointer might be passed in a register, however. To save stack space, it is thus a good idea to always use pointers to structures as parameters and not the structures themselves.
C supports functions with variable numbers of arguments. This is used in standard library functions like printf() and scanf() to provide a convenient interface. However, the implementation of variable numbers of arguments to a function incurs significant overhead.
All arguments have to be put on the stack, since the function must be able to step through the parameter list using pointers to arguments, and the code accessing the arguments is much less efficient than for fixed parameter lists. There is no type-checking on the arguments, which increases the risk of bugs. Variable numbers of arguments should not be used in embedded systems!
Function Inlining. It is good programming practice to break out common pieces of computation and accesses to shared data structures into (small) functions. This, however, brings with it the cost of calling a function each time something should be done. In order to mitigate this cost, the compiler transformation of function inlining has been developed. Inlining a function means that a copy of the code for the function is placed in the calling function, and the call is removed.
Inlining is a very efficient method to speed up the code, since the function call overhead is avoided but the same computations carried out. Many programmers do this manually by using preprocessor macros for common pieces of code instead of functions, but macros lack the type checking of functions and produce harder-to-find bugs.
The executable code will often grow as a result of inlining, since code is being copied into several places. Inlining may also help shrink the code: for small functions, the code size cost of a function call might be bigger than the code for the function. In this case, inlining a function will actually save code size (as well as speed up the program).
The main problem when inlining for size is to estimate the gains in code size (when optimizing for speed, the gain is almost guaranteed). Since inlining in general increases the code size, the inliner has to be quite conservative. The effect of inlining on code size cannot be exactly determined, since the code of the calling function is disturbed, with nonlinear effects.
To reduce the code size, the ideal would be to inline all calls to a function, which allows us to remove the function from the program altogether. This is only possible if all calls are known, i.e., are placed in the same source file as the function, and the function is marked static, so that it cannot be seen from other files.
Otherwise, the function will have to be kept (even though it might still be inlined at some calls), and we rely on the linker to remove it if it is not called. Since this decreases the likely gain from inlining, we are less likely to inline such a function.
Linker. The linker should be considered an integral part of the compilation system, since there are some transformations that are performed in the linker. The most basic embedded-systems linker should remove all unused functions and variables from a program, and only include the parts of the standard libraries that are actually used.
The granularity at which program parts are discarded varies, from files or library modules down to individual functions or even snippets of code. The smaller the granularity, the better the linker. Unfortunately, some linkers derived from desktop systems work on a per file basis, and this will give unnecessarily big code.
Some linkers also perform postcompilation transformations on the program. Common transformation is the removal of unnecessary bank and page switches (that cannot be done at compile-time since the exact allocation of variable addresses is unknown at that time) and code compression as discussed above extended to the entire program.
Controlling Compiler Optimization
A compiler can be instructed to compile a program with different goals,
usually speed or size. For each setting, a set of transformations has
been selected that tend to work towards the goal—maximal speed (minimal
execution time) or minimal size. The settings should be considered
approximate. To give better control, most compilers also allow
individual transformations to be enabled or disabled.
For size optimization, the compiler uses a combination of transformations that tend to generate smaller code, but it might fail in some cases, due to the characteristics of the compiled program. As an example, the fact that function inlining is more aggressive for speed optimization makes some programs smaller on the speed setting than on the size setting.
The example data below demonstrates this, the two programs were compiled with the same version of the same compiler, using the same memory and data model settings, but optimizing for speed or size:
Program 1 gets slightly smaller with speed optimization, while program 2 is considerably larger, an effect we traced to the fact that function inlining was lucky on program 1. The conclusion is that one should always try to compile a program with different optimization settings and see what happens.
It is often worthwhile to use different compilation settings for different files in a project: put the code that must run very quickly into a separate file and compile that for minimal execution time (maximum speed), and the rest of the code for minimal code size. This will give a small program, which is still fast enough where it matters. Some compilers allow different optimization settings for different functions in the same source file using #pragma directives.
Memory Model. Embedded microcontrollers are usually available in several variants, each with a different amount of program and data memory. For smaller chips, the fact that the amount of memory that can be addressed is limited can be exploited by the compiler to generate smaller code.
An 8-bit direct pointer uses less code memory than a 24-bit banked pointer where software has to switch banks before each access. This goes for code as well as data.
For example, some Atmel AVR chips have a code area of only 8 kb, which allows a small jump with an offset of +/- 4 kb to reach all code memory, using wraparound to jump from high addresses to low addresses. Taking advantage of this yields smaller and faster code.
The capacity of the target chip can be communicated to the compiler using a memory model option. There are usually several different memory models available, ranging from "small" up to "huge." In general, function calls get more expensive as the amount of code allowed increases, and data accesses and pointers get bigger and more expensive as the amount of accessible data increases.
Make sure to use the smallest model that fits your target chip and application—this might give you large savings in code size.
Next in Part 6: Writing compiler-friendly C-code
To read Part 1 in this series, go to Reentrancy, atomic variables and recursion.
To read Part 2 in this series, go to Asynchronous Hardware/Firmware
To read Part 3, go to Metastable States
To read Part 4, go to Dealing With Interrupt Latency
Jakob
Engblom (jakob@virtutech.com)
is technical marketing manager at
at Virtutech.
He has a MSc in computer science and a PhD in Computer Systems from
Uppsala University, and has
worked with programming tools and simulation tools for embedded and
real-time systems since 1997.
He was a contributor of
material to "The Firmware Handbook," edited
by Jack Ganssle, upon which this series of articles was based and
printed
with permission from Newnes, a division of Elsevier.
Copyright 2008. For
other publications by Jakob Engblom, see www.engbloms.se/jakob.html.