Editor's note: In the first of a two-part series, the authors provide practical guidelines on how to use a compiler to get better performance out of the existing C code in your embedded application. Excerpted from Software engineering for embedded systems.
A necessary first step in optimizing your embedded system software for performance is to confirm functional accuracy before the optimization process begins. In the case of standards-based code (e.g., voice or video coder), there may be reference vectors already available. If not, then at least some basic tests should be written to ensure that a baseline is obtained before optimization. This makes it easy to identify an error introduced during optimization, whether from incorrect code changes by the programmer or from overly aggressive optimization by the compiler. Once tests are in place, optimization can begin. Figure 11.1 shows the basic optimization process.
It’s also important to understand the features of the development tools, as they provide many useful, time-saving features. Modern compilers perform increasingly well on embedded software, which reduces the development time required. Linkers, debuggers and other components of the tool chain will have useful code build and debugging features, but in this chapter we will focus only on the compiler.
From the compiler perspective, there are two basic ways of compiling an application: traditional compilation or global (cross-file) compilation. In traditional compilation, each source file is compiled separately and then the generated objects are linked together. In global optimization, each C file is preprocessed and all of them are passed to the optimizer together, as a single file. This enables greater optimizations (inter-procedural optimizations) to be made, as the compiler has complete visibility of the program and doesn’t have to make conservative assumptions about the external functions and references.
Global optimization does have some drawbacks, however. Programs compiled this way will take longer to compile and are harder to debug (as the compiler has taken away function boundaries and moved or eliminated variables). In the event of a compiler bug, it will be more difficult to isolate and work around when built globally. Figure 11.2 shows the compilation flow for each.
Basic compiler configuration. Before building for the first time, some basic configuration will be necessary. Perhaps the development tools come with project stationery which has the basic options configured, but if not, these items should be checked:
- Target architecture: specifying the correct target architecture will allow the best code to be generated.
- Endianness: the vendor may sell silicon with only one endianness, or the silicon may be configurable. There will likely be a default option.
- Memory model: different processors may have options for different memory model configurations.
- Initial optimization level: it’s best to disable optimizations initially.
Enabling optimizations. Optimizations may be disabled by default when no optimization level is specified and either new project stationery is created or code is built on the command line. Such code is designed for debugging only. With optimizations disabled, all variables are written to and read back from the stack, enabling the programmer to modify the value of any variable via the debugger when stopped. The code is inefficient and should not be used in production.
The levels of optimization available to the programmer will vary from vendor to vendor, but there are typically four levels (e.g., from zero to three), with three producing the most optimized code (Table 11.1). With optimizations turned off, debugging will be simpler because many debuggers have a hard time with optimized and out-of-order scheduled code, but the code will be much slower (and larger). As the level of optimization increases, more and more compiler features will be activated and compilation time will be longer.
Note that typically optimization levels can be applied at the project, module, and function level by using pragmas, allowing different functions to be compiled at different levels of optimization.
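As a rough sketch of function-level control, the example below uses GCC's `optimize` function attribute. This is an assumption about the toolchain: embedded vendor compilers usually provide their own pragma or attribute syntax, and GCC's documentation recommends this attribute mainly for debugging and experimentation. The macro names `OPT_SPEED` and `OPT_SIZE` and both functions are hypothetical, and the macros expand to nothing on other compilers.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC-specific per-function optimization levels (assumption: GCC toolchain).
 * On other compilers the macros are empty and the project-wide level applies. */
#if defined(__GNUC__) && !defined(__clang__)
#define OPT_SPEED __attribute__((optimize("O3")))
#define OPT_SIZE  __attribute__((optimize("Os")))
#else
#define OPT_SPEED
#define OPT_SIZE
#endif

/* Hypothetical hot loop: request full speed optimization. */
OPT_SPEED int32_t sum_samples(const int16_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}

/* Hypothetical rarely executed control code: request size optimization. */
OPT_SIZE int config_is_valid(int mode)
{
    return mode >= 0 && mode < 4;
}
```

The attribute changes only how each function is compiled, not what it computes, so the same tests should pass at every level.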
In addition, there will typically be an option to build for size, which can be specified at any optimization level. In practice, two optimization levels are most often used: O3 (optimize fully for speed) and O3Os (optimize for size). In a typical application, critical code is optimized for speed and the bulk of the code may be optimized for size.
Many development environments have a profiler, which enables the programmer to analyze where cycles are spent. These are valuable tools and should be used to find the critical areas. The function profiler works in the IDE and also with the command line simulator.
Understanding the embedded architecture. Before writing code for an embedded processor, it’s important to assess the architecture itself and understand the resources and capabilities available. Modern embedded architectures have many features to maximize throughput. Table 11.2 shows some features that should be understood and questions the programmer should ask.
Basic C optimization techniques
Following are some of the basic C optimization techniques that will benefit code written for all embedded processors. The central ideas are to ensure that the compiler leverages all features of the architecture and to communicate to the compiler additional information about the program that is not normally expressed in C.
Choosing the right data types. It’s important to learn the sizes of the various types on the core before starting to write code. A compiler is required to support all the required types but there may be performance implications and reasons to choose one type over another.
For example, a processor may not support a 32-bit multiplication. Use of a 32-bit type in a multiply will cause the compiler to generate a sequence of instructions. If 32-bit precision is not needed, it would be better to use 16-bit. Similarly, using a 64-bit type on a processor which does not natively support it will result in a similar construction of 64-bit arithmetic using 32-bit operations.
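A minimal sketch of the trade-off, assuming a core with a native 16x16-to-32-bit multiply but no native 32-bit multiply. The function names `dot16` and `dot32` are hypothetical, and the instruction counts described in the comments depend entirely on the target.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit operands with a 32-bit accumulator. On a core with a native
 * 16x16->32 multiply, each product can map to a single instruction.
 * The caller must keep n small enough that the sum fits in 32 bits. */
int32_t dot16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* Same computation with 32-bit operands. If the core lacks a native 32-bit
 * multiply, the compiler has to build each product from a sequence of
 * narrower operations, which costs cycles and code size. */
int64_t dot32(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int64_t)a[i] * (int64_t)b[i];
    }
    return acc;
}
```

Using the fixed-width types from `<stdint.h>` keeps the operand sizes explicit, instead of depending on what `int` or `short` happen to mean on a given core.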
Use of intrinsics in embedded design. Intrinsic functions, or intrinsics for short, are a way to express operations not possible or convenient to express in C, or target-specific features (Table 11.3). Intrinsics in combination with custom data types can allow the use of non-standard data sizes or types. They can also be used to get to application-specific instructions (e.g., Viterbi or video instructions) which cannot be automatically generated from ANSI C by the compiler. They are used like function calls but the compiler will replace them with the intended instruction or sequence of instructions. There is no calling overhead.
Some examples of features accessible via intrinsics are:
- fractional types
- disabling/enabling interrupts.
For example, an FIR filter can be rewritten to use intrinsics and therefore to specify processor operations natively (Figure 11.3). In this case, simply replacing the multiply and add operations with the intrinsic L_mac (for long multiply-accumulate) replaces two operations with one and adds the saturation function to ensure that DSP arithmetic is handled properly.
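The sketch below illustrates the idea in portable C. It is not the book's Figure 11.3. Because the real `L_mac` intrinsic is supplied by the vendor compiler, the code uses a hypothetical stand-in, `L_mac_emul`, which assumes the conventional fractional semantics: the 16-bit product is doubled (Q15 x Q15 to Q31) and both the product and the accumulation saturate to 32 bits. On a DSP toolchain, the call would instead go to the vendor intrinsic and compile to a single instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a 64-bit intermediate value to the 32-bit signed range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Hypothetical portable stand-in for a vendor L_mac intrinsic:
 * acc + 2*a*b with saturation (assumed fractional semantics). */
static int32_t L_mac_emul(int32_t acc, int16_t a, int16_t b)
{
    int32_t prod = sat32(2 * (int64_t)a * (int64_t)b);
    return sat32((int64_t)acc + prod);
}

/* One FIR output sample: the multiply and add in the loop body are
 * expressed as a single multiply-accumulate operation. */
int32_t fir_sample(const int16_t *x, const int16_t *h, size_t taps)
{
    int32_t acc = 0;
    for (size_t i = 0; i < taps; i++) {
        acc = L_mac_emul(acc, x[i], h[i]);
    }
    return acc;
}
```

Putting the operation behind one function name means the same filter source can be built with the emulation on a host and with the real intrinsic on the target.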
Function calling conventions. Each processor or platform will have different calling conventions. Some will be stack-based, others register-based, or a combination of both. Typically, the default calling convention can be overridden, which is useful for functions poorly suited to the default, such as those with many arguments, where the default convention may be inefficient.
The advantages of changing a calling convention include the ability to pass more arguments in registers rather than on the stack. For example, on some embedded processors, custom calling conventions can be specified for any function through an application configuration file and pragmas. It’s a two-step process.
Custom calling conventions are defined by using the application configuration file (a file which is included in the compilation) (Figure 11.4).
They are invoked via pragma when needed. The rest of the project continues to use the default calling convention. In the example in Figures 11.6 and 11.7, the calling convention is invoked for function TestCallingConvention.
char TestCallingConvention (int a, int b, int c, char d, short e)
#pragma call_conv TestCallingConvention mycall
Pointers and memory access
Ensuring alignment. Some embedded processors such as digital signal processors (DSPs) support loading of multiple data values across the busses as this is necessary to keep the arithmetic functional units busy. These moves are called multiple data moves (not to be confused with packed or vector moves). They move adjacent values in memory to different registers. In addition, many compiler optimizations require these multiple register moves because there is so much data to move to keep all the functional units busy.
Typically, however, a compiler aligns variables in memory to their access width. For example, an array of short (16-bit) data is aligned to 16 bits. However, to leverage multiple data moves, the data must be aligned to a higher alignment. For example, to load two 16-bit values at once, the data must be aligned to 32 bits.
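One portable way to request the higher alignment is the C11 `alignas` specifier, sketched below. Many embedded compilers also provide their own alignment pragma or attribute, so treat this as an illustration and check the compiler manual. The names `samples`, `is_aligned_to`, and `sum_pairs` are hypothetical.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* An array of 16-bit samples aligned like a 32-bit object, so that two
 * adjacent samples can be fetched with one 32-bit access if the target
 * supports it. */
static alignas(int32_t) int16_t samples[8] = {1, 2, 3, 4, 5, 6, 7, 8};

/* Check whether an address is a multiple of the given alignment in bytes. */
int is_aligned_to(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Process two samples per iteration (n is assumed to be even). With suitably
 * aligned data, a compiler for a DSP may turn each pair of loads into a
 * single multiple data move; whether it does is target-dependent. */
int32_t sum_pairs(const int16_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        acc += x[i] + x[i + 1];
    }
    return acc;
}
```

The declaration guarantees alignment only for that object. A pointer passed into `sum_pairs` carries no such promise, which is why some toolchains also offer a way to assert alignment on pointer parameters.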
Restrict and pointer aliasing. When pointers are used in the same piece of code, make sure that they cannot point to the same memory location (alias). When the compiler knows the pointers do not alias, it can put accesses to memory pointed to by those pointers in parallel, greatly improving performance. Otherwise, the compiler must assume that the pointers could alias. Communicate this to the compiler by one of two methods: using the restrict keyword or by informing the compiler that no pointers alias anywhere in the program (Figure 11.8).
The restrict keyword is a type qualifier that can be applied to pointers, references, and arrays (Tables 11.4 and 11.5). Its use represents a guarantee by the programmer that within the scope of the pointer declaration, the object pointed to can be accessed only by that pointer. A violation of this guarantee can produce undefined results.
Table 11.5: Example loop after restrict added to parameters.
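As a minimal sketch of such a loop, written for illustration and not taken from the book's tables, the hypothetical function `vec_add` below qualifies all three pointer parameters with the C99 `restrict` keyword. The qualifier is a promise by the caller that the three arrays do not overlap, which leaves the compiler free to issue the loads and stores of different iterations in parallel.

```c
#include <stddef.h>
#include <stdint.h>

/* Element-wise add. Without restrict, the compiler must assume that a store
 * to out[i] could change a later a[j] or b[j] and so must keep the memory
 * accesses in order. With restrict, that constraint is lifted.
 * Calling this with overlapping arrays is undefined behavior. */
void vec_add(int16_t *restrict out,
             const int16_t *restrict a,
             const int16_t *restrict b,
             size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = (int16_t)(a[i] + b[i]);
    }
}
```

The qualifier changes nothing about the result for valid calls. It only widens what the optimizer is allowed to assume.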
Rob Oshana has 30 years of experience in the software industry, primarily focused on embedded and real-time systems for the defense and semiconductor industries. He has BSEE, MSEE, MSCS, and MBA degrees and is a Senior Member of IEEE. Rob is a member of several Advisory Boards including the Embedded Systems group, where he is also an international speaker. He has over 200 presentations and publications in various technology fields and has written several books on embedded software technology. He is an adjunct professor at Southern Methodist University where he teaches graduate software engineering courses. He is a Distinguished Member of Technical Staff and Director of Global Software R&D for Digital Networking at Freescale Semiconductor.
Mark Kraeling is Product Manager at GE Transportation in Melbourne, Florida, where he is involved with advanced product development in real-time controls, wireless, and communications. He’s developed embedded software for the automotive and transportation industries since the early 1990s. Mark has a BSEE from Rose-Hulman, an MBA from Johns Hopkins, and an MSE from Arizona State.
Used with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2012, this article was excerpted from Software engineering for embedded systems, by Robert Oshana and Mark Kraeling.