CMP EMBEDDED.COM

Login | Register     Welcome Guest IPS  Call for Abstracts
 

Using whole-program compilation to improve MCU code density and performance



Embedded.com
Microcontroller architectures have evolved to meet the needs of 21st century applications, but compilers have barely changed at all.

While MCU architectures have evolved with radically different memory maps, and register and stack configurations, all too often conventional C-compilers cannot make good use of these vastly different architectures requiring manual optimizations that compromise portability.

The original C language, and the compilers that support it, were developed in the 1970s, when computers could not run, much less compile, the large programs of today. Programs were both developed and run on the target platform. Typically computer memories rarely exceeded 1 MB RAM and a processor clock frequency of 400 kHz was pushing the envelope.

In order to conform to the limited memories and processing power of the time, programs were split into multiple small modules (Figure 1, below), each of which was compiled independently into a sequence of low-level machine instructions.

A linker joined these object modules, along with code extracted from pre-compiled libraries, to create the final program. This approach enabled quite large programs to be run on machines with limited memory. Memory size and processor speed are now no longer constraints.

Figure 1. Independent Compilation Sequence

But embedded programs are still broken up into smaller modules, each of which is compiled independently, with little regard for cross-module register optimization, re-entrant code or variable declarations.

Frequently, there are good reasons to break embedded code into modules. Code is frequently developed by teams of engineers, each of whom is responsible for one or more modules. This approach speeds system development and also makes the application easier to manage.

But, now, the CPU speed and memory size of a typical host desktop or server computer exceed those of a typical embedded target by several orders of magnitude. So, it is now quite practicable to generate code for the entire program at once, allowing all routines, variables, stacks and registers to be optimized based on the entire program.

Although it is theoretically possible to compile a very large program from a single source, the fact is that most programs are written by teams of engineers, each of whom is responsible for one or more modules of functionality. As a result, embedded systems will always be compiled from multiple files.

Limitations of independent module compilation
While it makes sense to break embedded code into smaller modules, there is no good reason any longer to compile each module independently and there are four very important reasons not to do so.

#1 - Inconsistent Definitions between Modules. Conventional compilers that compile program modules independently can introduce several types of problem into the program because each module is compiled without knowledge of the other modules. For one, C has always presented difficulties in ensuring that variables and other objects used in multiple modules have consistent definitions.

Although it is possible to have the linker check for incompatible re-declarations of variables by different modules, this approach adds complexity and doesn't always solve the problem to due to the fact that the linker may not have enough information to detect the human error.

#2 - Sub-optimal register allocation. The way in which arguments are passed and values returned, is another potential drawback in independent module compilation. Each calling convention contains a set of rules that defines which CPU registers are to be preserved across calls.

All functions must adhere to the same calling conventions. Since it is impossible to know at the time of compilation which registers will and will not actually be used by a called function, these rules can result in sub-optimal register allocation. This is particularly true with small embedded processors.

#3 - Non-portable code and subtle bugs. Processor memory architectures pose additional problems. Many embedded processors have a complex and non-linear set of memory spaces, often with different address widths. Standard C, which assumes a single linear address space, is difficult to map onto these architectures because it is often impossible to know at compile time what memory space a variable will be located in, or which spaces a given pointer will be required to access.

For example, a pointer might need to address memories of different address widths, such as an 8-bit wide RAM and a 16-bit or wider ROM, The programmer can achieve efficient pointer usage in these architectures by explicitly declaring the address spaces that a pointer will access. However, this solution is architecture-specific and results in non-portable code. It also increases the likelihood of subtle bugs being introduced.

#4 -Compiled Stack Limitations.  Many small embedded processors do not have a fully or efficiently addressable stack for the storage of local variables. This situation is usually handled by a compiled stack where local variables are statically allocated in memory, based on a call graph. Unfortunately independent compilation cannot know the call graph until link time.

Using a compiled stack requires the programmer to nominate any functions that are called re-entrantly, so that they can be dealt with accordingly. If the programmer nominates incorrectly, the error is not revealed until link time, by which time it is too late to make changes.

A New Approach
A new compiler methodology, called Omniscient Code Generation (OCG), takes advantage of the capabilities of today's host systems to achieve fully optimized object code that takes into account all the variables, functions, stacks and registers represented in all program modules.

Rather than relying completely on the linker to uncover errors in independently compiled modules, an OCG compiler completes the initial stages of compilation for each module separately, but defers object code generation until the point at which a view of the whole program is available.

It uses a global view of the compiled program, along with various other optimization techniques made possible by the complete information that is then available to the code generator. This approach solves most of the problems associated with independent compilation, described above.

An OCG compiler still compiles each module separately. However, instead of compiling to machine code instructions in an object file, it compiles each module to an intermediate code file that represents a more abstract view of each module. It does not produce actual machine instructions or allocate registers at this time.

Figure 2. Compilation Sequence with Omniscient Code Generation

Call Graph: A Clearer View
Once all the intermediate code files are available, they are loaded by the code generator into a call graph structure. The code generator also searches intermediate code libraries and extracts, as required, any library functions that are referenced by the program. Once the call graph is complete, functions that are never called may be removed, thus saving space.

The call graph (Figure 3, below) also allows identification of any functions called recursively (not common in embedded systems) or re-entrantly, such as those called from both main-line code and interrupt functions.

These functions must either use dynamic stack space for storage of local variables, or be managed in other ways to ensure that a re-entrant call of the function does not overwrite existing data. This is achieved in a way that is completely transparent to the programmer, and without requiring non-standard extensions to the language.

If a compiled stack is to be used, it can be allocated at this point, before any machine code has been generated. OCG knows exactly how big the stack is and where it is located, before the code is generated.

Figure 3. OCG Call Graph Structure

Pointer Reference Graphs
The compiler then builds reference graphs for objects and pointers in the program. Any conflicting declarations of the same object from different modules can be detected at this point and an informative error message issued to the user. Any variables never referenced can be deleted.

Determining the memory space for each pointer is one of the most important features of OCG. An algorithm uses each instance of a variable having its address taken, plus each instance of an assignment of a pointer value to a pointer (either directly, via function return, function parameter passing, or indirectly via another pointer) and builds a data reference graph (Pointer Reference Graph).

The graph is completed and then identifies all objects that can possibly be referenced by each pointer. This information is used to determine which memory space each pointer will be required to access.

Once the set of used variables and pointers is complete, OCG allocates memory of both the stack (compiled or dynamic: the call graph is used to determine the stack size) at the same time as it allocates global and static variables.

Where there are multiple memory spaces (e.g. an architecture with banked RAM), the variables accessed most often in the program can be allocated to the memory spaces that are cheapest to access. On an 8051, for example, this would be internal, directly addressable RAM, rather than external RAM which must be accessed via a pointer.

Each pointer variable now has a set of address spaces that it will be required to access, without any specific directions provided by the user. This allows each pointer to be sized and encoded in a way which is optimally efficient for the particular architecture, while still being accurate.

Bottom-up Code Generation
Generation of machine code now begins at the bottom of the call graph, starting with those functions that do not call any other functions. Automatic in-lining of these functions may be performed if desired. In any case, the code can be generated without the constraints of rigid calling conventions.

Code generation then proceeds up the call graph, so that for each function, the code generator knows exactly which functions are called by the current function, and therefore also knows exactly what registers and other resources are available at each point. Calling conventions can be tailored to the register usage and argument type of a function, instead of following a set pattern.

Customized Library Functions
An omniscient code generator has a truly global1 view of the program. (In compiler parlance the meaning of the adjective "global" has been diluted by the use of the term global optimization to refer to optimization within one function - OCG takes a wider view than this)

Therefore, complex library functions can be implemented in a way that is specific to each particular program. A good example of this is the C library functions sprintf() and printf(). These workhorse routines for formatting text strings or output are enormously useful, but can occupy a large code footprint (5 KBytes or more), if implemented in their entirety.

The OCG code generator can analyze the format strings supplied to these functions, and determine exactly the set of format specifiers and modifiers that are used. These can then be used to create a customized version of either function as required. The code size saving can be immense.

Code for a minimal version of sprintf() implementing simple string copying can be as little as 20 or 30 bytes, whereas a version providing real number formats with specific numbers of digits could occupy 5000 bytes or more. No programmer input, other than writing the program itself, is required to benefit from this customization and optimization.

Unused Return Values
Many library functions return values that are not necessarily checked by the calling code. A compiler with OCG can establish whether the return value of a function is ever used. If the code generator determines that the return value for a particular function is never used, it can remove the code implementing the return value in that function, saving both code and cycles.

Re-entrancy without a Conventional Stack
A "conventional stack" is implemented in the hardware of the target MCU. " However, not all MCUs have a hardware stack. In these implementations the compiler must implement a "compiled stack" built at compile time rather than runtime Although this is not a great problem for most embedded applications, it makes writing re-entrant code difficult.

A compiled stack is not as good as a hardware stack as it can't implement recursion or re-entrant function calls. OCG, as opposed to standard compilers, gets around re-entrant function calls. OCG transparently addresses re-entrant code by building separate call graphs for both main-line and interrupt code.

Any functions that appear in more than one call graph can be replicated, each with its own local variable area. Using this technique reentrancy can be implemented without a conventional stack.

Customized Runtime Startup Code
Once upon a time, all embedded systems programmers were accustomed to writing their own runtime startup code (often written in assembler). Startup code is executed immediately after reset to perform housekeeping, such as clearing the uninitialized RAM area. The C language requires that uninitialized static and global variables be cleared to zero on startup. Many newer embedded compilers relieve the engineer of this task by providing canned startup code.

However, canned startup code is often much larger than necessary for a given program. For example, if the program has no uninitialized global variables, there is no need to include code to clear them. OCG makes this information available to the code generator, which then creates custom runtime startup code. In a minimal case, the startup code may be completely empty.

Smaller Code Better Performance.
One obvious advantage of OCG is smaller, faster code. More importantly, OCG allows embedded C programs to be written without the use of architecture-specific extensions.

By performing at compile time an analysis of the whole program, the code generator can make the decisions about memory placement, pointer scoping etc. that would otherwise be made by the programmer, and specified through special directives or language extensions. Since this analysis is performed every time the program is recompiled, it is always accurate and up-to-date.

Table 1

Performance comparisons of OCG to conventional compilation focus on code size - typically the most important parameter for embedded systems, but also a valid proxy for execution speed. A reduction in code size always results in an increase in execution speed unless specific optimization techniques are used that sacrifice code speed for size (such as code-threading).

A single C-language source file was compiled for the Microchip's PIC18 and Cypress' PSoC, using both traditional and OCG compilation technologies. As shown in Table 1 above, OCG compilation resulted in 17.2% less code for Microchip's PIC18 architecture compared to Hi-Tech's PICC-18 STD that shares much of the same compilation technology, except OCG. OCG-compiled code for Cypress' PSoC was 41.4% smaller, than that compiled by ImageCraft.

As shown in Table 2, below, additional tests to evaluate OCG's library optimization functions on the PIC18C242 target resulted 50% better code density than IAR's EW18 compiler and 40% better code density than Microchip's compiler. Tests were conducted on other source files, showing similar results.

Table 2

Conclusion
The somewhat irregular architectures of embedded microcontrollers are often an awkward fit with the standard C language, frequently requiring substantial architecture-specific hand crafting to achieve efficient code. Omniscient Code Generation simplifies and streamlines the programmer's job by abstracting and hiding the underlying architecture, while simultaneously delivering reduced code size and increased execution speed.

Clyde Stubbs is CEO and Founder of HI-TECH Software which provides development tools for embedded systems, offering compilers, RTOS and an Eclipse based IDE (HI-TIDE) for 8-, 16-, and 32-bit microcontroller and DSP chip architectures.

1

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Ready to take that job and shove it?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS




 :