Simple source code changes can often result in substantial performance enhancements using modern optimizing compilers on high-end embedded processors. But why is performance necessary? After all, the capabilities of modern microprocessors dwarf the capabilities of 1980-era supercomputers.
First, the average-case response time of a real-time system is irrelevant. It is the worst-case response time that guards against a dropped call, a misprint, or other error conditions in a product. In other words, performance is necessary to minimize the response time of one's system, thereby achieving product reliability.
Second, increased performance allows for the implementation of more features without compromising the integrity of the system. Alternatively, the extra performance might be used to select a cheaper microprocessor or one that consumes less power.
There are many factors that determine the performance of a system. The choice of hardware can mean the difference between a few MIPS and a few hundred. Good data structures and algorithms are essential, and bookshelves have been filled on this topic. A good compiler is also essential. One should evaluate the features and optimization capabilities of a compiler before spending too much time working with it.
The purpose here is to explore an often-overlooked aspect of achieving maximum performance. No matter what hardware is chosen, which data structures and algorithms are employed, and which compiler is used, proper coding guidelines can dramatically impact the efficiency of one's code.
Nonetheless, developers are often unaware of the consequences of their programming habits. And unlike fundamental design decisions, many of these improvements can be made at a late stage of a project.
In the following examples, efficient coding guidelines will be illustrated using the C and C++ programming languages and, at times, PowerPC, ARM, and x86 assembly code. Many of these concepts, however, apply to other programming languages. Most of them are also processor independent, although there are some exceptions, which will be noted.
Choice of Data Type Sizes
The most fundamental data type in the C language is the integer, which is normally defined as the most efficient type for a given processor. The C language treats this type specially in that the language does not operate on smaller types, at least conceptually speaking.
If a character is incremented, the language specifies that the character is first promoted to an integer, the addition is performed as an integer, and the truncated result is stored back into the character.
In practice, using a smaller type for a local variable is computationally inefficient. Consider the following snippet of code:
m += ++c;
Consider the code a compiler generates for this statement on a RISC chip such as the PowerPC versus a CISC chip like the 80386.
On a RISC chip, one pays the price after writing to a sub-integral variable: the value must be sign or zero extended in the register to ensure that the high bits of the register are consistently set. On a CISC chip, one pays the price when mixing the short variable with a larger integer variable. And unless you're using an 8- or 16-bit microprocessor, the shorter add instruction saves nothing, so there is nothing to make up for the cost of the sign or zero extension after the increment.
Variables in memory are sometimes a different story. Because memory is scarce, it often makes sense to conserve it by using a smaller data type. Most microprocessors can load and extend the short value in a single instruction, which means that the extension is free. Also, the truncation that is necessary when storing the value can often be avoided, because sub-integral store instructions generally ignore the high bits.
These choices can be simplified by using the C99 header stdint.h. Even if your compiler does not support C99, it might provide stdint.h, or you may write it yourself. This file defines types that fit into a few categories:
1) intN_t: used when a type must be exactly N bits long.

2) int_leastN_t: used when a type must be at least a given size and when the type should be optimized for space. This is preferred for data structures.

3) int_fastN_t: used when a type must be at least a given size and when the type should be as fast as possible. This is preferred for local variables.
Larger data types should also be used sparingly. For example, C99 adds the long long type. This type has been available as an extension to C89 in most compilers for years. On a 32-bit microprocessor, these variables take up two registers and require multiple operations to perform arithmetic on them. An addition or a subtraction can usually be done inline in a few instructions.
However, an innocuous-looking shift can turn into quite a few instructions. And you can forget about division! Of course, these variables might be necessary or convenient when coding, but do not use them unless you need them and understand the corresponding code size and speed impact.
Unsigned variables can result in more efficient generated code for certain operations than signed variables. For example, unsigned division by a power of two can be performed as a right shift.
Most architectures require a few instructions to perform a signed divide by a power of two. Likewise, computing an unsigned modulus by a power of two can be implemented as an AND operation (or a bit extract). Computing a signed modulus is more complicated. In addition, it is sometimes useful that unsigned variables can represent just over twice as many positive values as their signed counterparts.
Characters deserve special mention here. Many architectures, such as ARM and PowerPC, are more efficient at loading unsigned characters than they are at loading signed characters. Other architectures, such as the V800 and SH, handle signed characters more efficiently. Most architectures' ABIs define the plain char type as whichever is most efficient. So, unless a character needs to be signed or unsigned, use the plain char type.
Use of access types
For global data, use the static keyword (or C++ anonymous namespaces) whenever possible. In some cases, static allows a compiler to deduce things about the access patterns to a variable. The static keyword also "hides" the data, which is generally a good thing from a programming practices standpoint. Declaring a function as static is also helpful in many cases.
This code can be optimized by the compiler in a couple of ways because the functions were declared as static. First, the compiler can inline the call to allocate_new_unique_number() and delete the out-of-line copy, because it is not called from any other place.
Second, if the compiler has basic information about the external allocate_other_fields() function, the compiler can sometimes tell that the function will not call back into this module. This knowledge allows it to eliminate the second load of unique_number embedded in the inlined allocate_new_unique_number() function.
On the other hand, the static keyword should be avoided whenever possible for function-local data. The data's value must be preserved between calls to the function, making it more expensive to access the data and requiring permanent RAM storage. The static keyword should only be used on function-local data when the value must be preserved across calls, or for large data allocations (such as an array) when the programmer prefers to trade consumption of stack space for permanent RAM.
Global Variable Definition
Compilers can sometimes optimize accesses to global variables if they are defined and used together in the same module. In that case, the compiler can access one variable as an offset from another variable's address, as if they were both members of the same structure. Therefore, all other things being equal, it is worthwhile to define global variables in the modules where they are used the most.
For example, if the global variable "glob" is used the most in file.c, it should be defined there. Use a definition such as:
int glob = 1;
in addition to declaring glob (with extern int glob;) in a header file so that other modules can reference it.
Some programmers get confused about uninitialized variable definitions, so it is worth clarifying how they are often implemented. Most C compilers support a feature, called common variables, where uninitialized global variable definitions are combined and resolved at link time. For example, a user could place the definition:

int glob;
in a header file and include this header file in every file in the project. While a strict reading of the C language implies that this will result in link-time errors, under a common variable model, each module will output a special common reference, which is different from a traditional definition.
The linker will combine all of the common references for a variable, and if the variable was not already defined elsewhere, allocate space for the variable in an uninitialized data section (such as .bss) and point all of the common references to this newly allocated space. On the other hand, if the user employs a definition such as:
int glob = 1;
in one module, all other uninitialized, common references would resolve to this definition.
It is best to write code without defining the same uninitialized global variable in multiple modules. Then, if you turn off the common variable feature in the compiler, the compiler is able to perform more aggressive optimizations because it knows that uninitialized global variable definitions are true definitions.
The volatile keyword tells the compiler not to optimize away accesses to the variable or type. That is, the value will be loaded each time it is needed and stored each time it is modified. Volatile variables have two major uses:
1) The variable is a memory-mapped hardware device, where the value can change asynchronously and/or where it is critical that all writes to the variable are respected.
2) The variable is used in a multi-threading environment, where another thread may modify the variable at any time.
The alternative to careful use of the volatile keyword is to disable so-called "memory optimizations." Effectively, all variables are treated as volatile when this option is chosen. Because memory optimizations are important for efficient code, developers are encouraged to choose the less conservative approach and apply volatile only where it is needed.
Separate threads often perform separate functions. As a result, they may have many variables and data structures that are not accessed by other tasks. Good software engineering practices might be able to minimize the overlap to a few shared header files. If such practices have reduced the scope of the code to a few files, it is more feasible to find the variables and data structures that are shared between threads.
The const keyword is helpful in a couple of ways. First, const variables can usually be allocated to a read-only data section, which can save on the amount of RAM required if the read-only data is allocated to flash or ROM. Second, the const keyword, when applied to a pointer or reference parameter, might allow the compiler to deduce that the call will not result in the value being modified.
The restrict keyword tells a compiler that the specified pointer is the only way to access a given piece of data. This can allow the compiler to optimize more aggressively. Consider the following example, a simple finite impulse response (FIR) filter:
Consider the code a compiler generates for the inner loop on a PowerPC target.
Better code can be generated by pulling the first load out of the loop, since in[i] does not change within the inner loop. However, the compiler cannot tell that in[i] does not change within the loop, because the in and out arrays could overlap. If the function declaration is changed to:
void fir_and_copy(int *in, const int *coeff, int *restrict out)
the compiler knows that writes through out cannot change the values in in. The restrict keyword is a bit confusing because it applies to the pointer itself rather than the data the pointer points to (contrast const int *x with int *restrict x).
Pointers and the & operator
It is usually more efficient to have a function return a scalar value than to pass the scalar value by reference or by address. Often, a value is returned from a function by passing the address of an integer to the function. For example:
Taking the address of a variable forces the compiler to allocate the variable on the stack, all but assuring less efficient code generation. Passing an argument as a C++ reference parameter has the same effect.
Declaration scope of variables
Declare a variable in the innermost scope possible, particularly when its address must be taken. In such cases, the compiler must keep the variable around on the stack until it goes out of scope. This can inhibit certain optimizations that depend on the variable being dead. For example, consider the variable "loc" in the following function:
The compiler could potentially perform a tail call for the call to func2(). This is an optimization where the frame for func(), if any, is popped, and the last call instruction to func2() is replaced by a branch to func2().
This saves the need to return to func(), which would immediately return to its caller. Instead, func2() returns directly to func()'s caller. Unfortunately, the compiler cannot employ this optimization because it cannot determine that loc is not used in the second call to func2() (which is possible if its address was saved in the first call). The following code allows for better optimization:
In this case, the compiler knows that the lifetime of loc ends before the final call, and the tail call, at least in principle, can happen. Another benefit of using inner scopes is that variables from non-overlapping scopes can share the same space on the stack. This helps to minimize stack use and can result in smaller code size on some architectures.
However, it is usually not worthwhile to create artificially smallscopes simply to bound the lifetimes of variables.
Floating Point Arithmetic
Understanding the rules of arithmetic promotion can help you avoid costly mistakes. Many embedded architectures do not implement floating point arithmetic in hardware. Some processors implement the single precision float type in hardware, but leave the double type to floating point software emulation routines.
Unless doubles are implemented in hardware, it is more efficient to do arithmetic with the single precision type. If this is the case and if single precision arithmetic is sufficient, follow these rules:
1. Write single precision floating point constants with the F suffix. For example, write 3.0F instead of 3.0. The constant 3.0 is a double precision value, which forces the surrounding arithmetic expressions to be promoted to double precision.
2. Use single precision math routines, such as sinf() instead of sin(), the double precision version.
3. Avoid old-style prototypes and function declarations, because these force float arguments to be promoted to double. So, instead of:

float foo();

use a full prototype:

float foo(float f);
This is probably only a concern for old code bases.
Variable Length Arrays
The variable length array feature, which is included in C99, might result in less efficient array access. In cases where the array is multi-dimensional and subscripts other than the first have variable lengths, the resulting code may be larger and slower. The feature is useful, but be aware that code generation can suffer.
Low Level Assembly Instructions
Sometimes it is helpful or necessary to use specific assembly instructions in embedded programming. Intrinsic functions are the best way to do this. Intrinsic functions look like function calls, but they are inlined into specific assembly instructions. Refer to your compiler vendor's documentation to determine which intrinsics are available.
Inlined assembly code uses non-portable syntax, and compilers generally make over-conservative assumptions when encountering inlined assembly, thus affecting code performance. Intrinsics can be #define'd to other names when switching from one compiler to another. This can be done once in a header file, rather than going through the code to find every place where inlined assembly was used.
For example, rather than scattering inlined assembly throughout the code to disable interrupts on the PowerPC, one can wrap a compiler intrinsic in a macro defined once in a header.
Manual Loop Tricks
Sometimes programmers feel compelled to manually unroll loops, perform strength reduction, or use other transformations that a standard compiler optimizer would handle. For example, a simple index-based copy loop is sometimes manually rewritten to use pointer post-increments. Such transformations are usually only effective on fairly specific architectures and compilers. The 32-bit ARM architecture supports the post-increment addressing mode, but the PowerPC architecture only includes a pre-increment (update) addressing mode, so for the PowerPC the pointer version would need to be written with pre-increments instead.
As a general rule, only do manual transformations for time-critical sections of your code where your compiler of choice has not been able to perform adequate optimizations on its own, even after adjusting compilation options. Write simple code for most cases and let the compiler do the optimization work for you.
The performance impact of some decisions that programmers make when writing their code can be significant. While efficient algorithmic design is of the highest importance, making intelligent choices when implementing the design can help application code perform at its highest potential.
Greg Davis is Technical Lead, Compiler Development, at Green Hills Software, Inc.
This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Silicon Valley 2006. Used with permission of the Embedded Systems Conference. For more information, please visit www.embedded.com/esc/sv.