Using whole-program compilation to improve MCU code density and performance

Microcontroller architectures have evolved to meet the needs of 21st-century applications, but compilers have barely changed at all.

While MCU architectures have evolved with radically different memory maps and register and stack configurations, all too often conventional C compilers cannot make good use of these vastly different architectures, requiring manual optimizations that compromise portability.

The original C language, and the compilers that support it, were developed in the 1970s, when computers could not run, much less compile, the large programs of today. Programs were both developed and run on the target platform. Typically, computer memories rarely exceeded 1 MB of RAM, and a processor clock frequency of 400 kHz was pushing the envelope.

In order to conform to the limited memories and processing power of the time, programs were split into multiple small modules (Figure 1, below), each of which was compiled independently into a sequence of low-level machine instructions.

A linker joined these object modules, along with code extracted from pre-compiled libraries, to create the final program. This approach enabled quite large programs to be run on machines with limited memory. Memory size and processor speed are no longer constraints.

Figure 1. Independent Compilation Sequence

But embedded programs are still broken up into smaller modules, each of which is compiled independently, with little regard for cross-module register optimization, re-entrant code or variable declarations.

There are often good reasons to break embedded code into modules. Code is frequently developed by teams of engineers, each of whom is responsible for one or more modules. This approach speeds system development and also makes the application easier to manage.

But now the CPU speed and memory size of a typical host desktop or server computer exceed those of a typical embedded target by several orders of magnitude. So it is now quite practicable to generate code for the entire program at once, allowing all routines, variables, stacks and registers to be optimized based on the entire program.

Although it is theoretically possible to compile a very large program from a single source, the fact is that most programs are written by teams of engineers, each of whom is responsible for one or more modules of functionality. As a result, embedded systems will always be compiled from multiple files.

Limitations of independent module compilation
While it makes sense to break embedded code into smaller modules, there is no longer any good reason to compile each module independently, and there are four very important reasons not to do so.

#1 - Inconsistent Definitions between Modules. Conventional compilers that compile program modules independently can introduce several types of problems into the program, because each module is compiled without knowledge of the other modules. For one, C has always presented difficulties in ensuring that variables and other objects used in multiple modules have consistent definitions.

Although it is possible to have the linker check for incompatible re-declarations of variables by different modules, this approach adds complexity and doesn't always solve the problem, because the linker may not have enough information to detect the human error.
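For example, two independently compiled modules may declare the same object with conflicting types, and a traditional toolchain may never flag the mismatch. A hypothetical two-file sketch:

    /* module_a.c */
    int counter;            /* defined here as an int */

    /* module_b.c */
    extern long counter;    /* re-declared with a different type; each
                               module compiles cleanly on its own, and
                               the linker may lack the type information
                               needed to reject the mismatch */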

#2 - Sub-optimal register allocation. The way in which arguments are passed and values are returned is another potential drawback in independent module compilation. Each calling convention contains a set of rules that defines which CPU registers are to be preserved across calls.

All functions must adhere to the same calling conventions. Since it is impossible to know at the time of compilation which registers will and will not actually be used by a called function, these rules can result in sub-optimal register allocation. This is particularly true with small embedded processors.
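To illustrate, consider a trivial leaf function (a hypothetical sketch). Under a fixed calling convention, the caller must assume that every caller-saved register is clobbered by the call, even though this function touches almost nothing:

    /* Hypothetical leaf function: it needs only one register's worth
       of state, yet a rigid convention forces callers to treat all
       caller-saved registers as destroyed. With whole-program
       knowledge, the caller could keep live values in the registers
       this function provably never uses. */
    unsigned char increment(unsigned char x)
    {
        return x + 1;
    }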

#3 - Non-portable code and subtle bugs. Processor memory architectures pose additional problems. Many embedded processors have a complex and non-linear set of memory spaces, often with different address widths. Standard C, which assumes a single linear address space, is difficult to map onto these architectures because it is often impossible to know at compile time what memory space a variable will be located in, or which spaces a given pointer will be required to access.

For example, a pointer might need to address memories of different address widths, such as an 8-bit wide RAM and a 16-bit or wider ROM. The programmer can achieve efficient pointer usage in these architectures by explicitly declaring the address spaces that a pointer will access. However, this solution is architecture-specific and results in non-portable code. It also increases the likelihood of subtle bugs being introduced.
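On the 8051, for instance, compilers such as SDCC and Keil C51 provide address-space qualifiers for exactly this purpose. A hypothetical sketch in SDCC's style (the keywords vary between compilers):

    /* Each object and pointer is tied to a specific memory space. */
    __xdata unsigned char buffer[64];     /* external RAM */
    __code const char banner[] = "OK";    /* program ROM  */

    __xdata unsigned char *bp = buffer;   /* pointer restricted to
                                             external RAM: efficient,
                                             but non-portable */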

#4 - Compiled Stack Limitations. Many small embedded processors do not have a fully or efficiently addressable stack for the storage of local variables. This situation is usually handled by a compiled stack, where local variables are statically allocated in memory based on a call graph. Unfortunately, with independent compilation the call graph is not known until link time.

Using a compiled stack requires the programmer to nominate any functions that are called re-entrantly, so that they can be dealt with accordingly. If the programmer nominates incorrectly, the error is not revealed until link time, by which time it is too late to make changes.
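Some 8051 compilers, Keil C51 among them, express this nomination with a reentrant function qualifier. A hypothetical sketch in that style:

    /* The programmer must remember to mark the function; if scale()
       is later called from both main-line code and an interrupt
       without this qualifier, the mistake surfaces only at link
       time, if at all. */
    int scale(int x) reentrant
    {
        return x * 3;
    }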

A New Approach
A new compiler methodology, called Omniscient Code Generation (OCG), takes advantage of the capabilities of today's host systems to achieve fully optimized object code that takes into account all the variables, functions, stacks and registers represented in all program modules.

Rather than relying completely on the linker to uncover errors in independently compiled modules, an OCG compiler completes the initial stages of compilation for each module separately, but defers object code generation until the point at which a view of the whole program is available.

It uses a global view of the compiled program, along with various other optimization techniques made possible by the complete information that is then available to the code generator. This approach solves most of the problems associated with independent compilation, described above.

An OCG compiler still compiles each module separately. However, instead of compiling to machine code instructions in an object file, it compiles each module to an intermediate code file that represents a more abstract view of each module. It does not produce actual machine instructions or allocate registers at this time.

Figure 2. Compilation Sequence with Omniscient Code Generation

Call Graph: A Clearer View
Once all the intermediate code files are available, they are loaded by the code generator into a call graph structure. The code generator also searches intermediate code libraries and extracts, as required, any library functions that are referenced by the program. Once the call graph is complete, functions that are never called may be removed, thus saving space.

The call graph (Figure 3, below) also allows identification of any functions called recursively (not common in embedded systems) or re-entrantly, such as those called from both main-line code and interrupt functions.

These functions must either use dynamic stack space for storage of local variables, or be managed in other ways to ensure that a re-entrant call of the function does not overwrite existing data. This is achieved in a way that is completely transparent to the programmer, and without requiring non-standard extensions to the language.
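A hypothetical sketch of the situation the call graph exposes: a helper reachable from both the main loop and an interrupt appears in two call graphs, making its statically allocated locals a re-entrancy hazard that OCG can detect and resolve (for example, by giving each copy its own local-variable area):

    static volatile unsigned int last;

    static void format_reading(unsigned int raw)
    {
        unsigned int scaled = raw * 4u;   /* on a compiled stack, this
                                             local is statically
                                             allocated: the hazard */
        last = scaled;
    }

    void main_loop(void) { format_reading(100u); }  /* main-line caller */
    void timer_isr(void) { format_reading(200u); }  /* interrupt caller */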

If a compiled stack is to be used, it can be allocated at this point, before any machine code has been generated. OCG knows exactly how big the stack is and where it is located, before the code is generated.

Figure 3. OCG Call Graph Structure

Pointer Reference Graphs
The compiler then builds reference graphs for objects and pointers in the program. Any conflicting declarations of the same object from different modules can be detected at this point and an informative error message issued to the user. Any variables never referenced can be deleted.

Determining the memory space for each pointer is one of the most important features of OCG. An algorithm takes each instance of a variable having its address taken, plus each instance of an assignment of a pointer value to a pointer (either directly, via function return or function parameter passing, or indirectly via another pointer), and builds a data reference graph (pointer reference graph).

Once the graph is complete, it identifies all objects that can possibly be referenced by each pointer. This information is used to determine which memory space each pointer will be required to access.
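A hypothetical sketch of what the analysis discovers: here the reference graph shows that p may point into two different memory spaces, so p must be sized and encoded to span both, while a pointer whose reference set falls in a single space can be encoded more cheaply, all without qualifiers from the programmer:

    unsigned char fast_buf[16];   /* e.g. internal, directly addressed RAM */
    unsigned char big_buf[256];   /* e.g. external RAM                     */

    unsigned char *p;

    void select_buffer(int big)
    {
        p = big ? big_buf : fast_buf;   /* p's reference set contains
                                           objects in both spaces */
    }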

Once the set of used variables and pointers is complete, OCG allocates memory for the stack (compiled or dynamic; the call graph is used to determine the stack size) at the same time as it allocates global and static variables.

Where there are multiple memory spaces (e.g. an architecture with banked RAM), the variables accessed most often in the program can be allocated to the memory spaces that are cheapest to access. On an 8051, for example, this would be internal, directly addressable RAM, rather than external RAM, which must be accessed via a pointer.

Each pointer variable now has a set of address spaces that it will be required to access, without any specific directions provided by the user. This allows each pointer to be sized and encoded in a way which is optimally efficient for the particular architecture, while still being accurate.

Bottom-up Code Generation
Generation of machine code now begins at the bottom of the call graph, starting with those functions that do not call any other functions. Automatic in-lining of these functions may be performed if desired. In any case, the code can be generated without the constraints of rigid calling conventions.

Code generation then proceeds up the call graph, so that for each function, the code generator knows exactly which functions are called by the current function, and therefore also knows exactly what registers and other resources are available at each point. Calling conventions can be tailored to the register usage and argument types of a function, instead of following a set pattern.

Customized Library Functions
An omniscient code generator has a truly global view of the program. (In compiler parlance, the meaning of the adjective "global" has been diluted by the use of the term "global optimization" to refer to optimization within one function; OCG takes a wider view than this.)

Therefore, complex library functions can be implemented in a way that is specific to each particular program. A good example of this is the C library functions sprintf() and printf(). These workhorse routines for formatting text strings or output are enormously useful, but can occupy a large code footprint (5 KBytes or more) if implemented in their entirety.

The OCG code generator can analyze the format strings supplied to these functions, and determine exactly the set of format specifiers and modifiers that are used. These can then be used to create a customized version of either function as required. The code size saving can be immense.

Code for a minimal version of sprintf() implementing simple string copying can be as little as 20 or 30 bytes, whereas a version providing real-number formats with specific numbers of digits could occupy 5000 bytes or more. No programmer input, other than writing the program itself, is required to benefit from this customization and optimization.
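A hypothetical example of what the analysis sees: if these are the only format strings passed to printf() and sprintf() anywhere in the program, the code generator knows that only %s and %u, with no width or precision modifiers, ever need to be supported, and can link a correspondingly tiny formatter:

    #include <stdio.h>

    void report(const char *name, unsigned value)
    {
        char line[32];
        sprintf(line, "%s=%u", name, value);  /* only %s and %u used */
        printf("%s\n", line);
    }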

Unused Return Values
Many library functions return values that are not necessarily checked by the calling code. A compiler with OCG can establish whether the return value of a function is ever used. If the code generator determines that the return value for a particular function is never used, it can remove the code implementing the return value in that function, saving both code and cycles.
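For instance (a hypothetical sketch), sprintf() returns the number of characters written, but many programs never look at it. If every call site provably discards the value, the code that computes and returns it can be dropped from the library routine:

    #include <stdio.h>

    void label(char *dst, int id)
    {
        (void)sprintf(dst, "ID%d", id);  /* return value never used */
    }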

Re-entrancy without a Conventional Stack
A "conventional stack" is implemented in the hardware of the target MCU. However, not all MCUs have a hardware stack. In these implementations the compiler must implement a "compiled stack," built at compile time rather than at runtime. Although this is not a great problem for most embedded applications, it makes writing re-entrant code difficult.

A compiled stack is not as flexible as a hardware stack, as it cannot by itself support recursion or re-entrant function calls. Unlike standard compilers, OCG gets around this limitation: it transparently addresses re-entrant code by building separate call graphs for main-line and interrupt code.

Any functions that appear in more than one call graph can be replicated, each with its own local variable area. Using this technique, re-entrancy can be implemented without a conventional stack.

Customized Runtime Startup Code
Once upon a time, all embedded systems programmers were accustomed to writing their own runtime startup code (often written in assembler). Startup code is executed immediately after reset to perform housekeeping, such as clearing the uninitialized RAM area. The C language requires that uninitialized static and global variables be cleared to zero on startup. Many newer embedded compilers relieve the engineer of this task by providing canned startup code.

However, canned startup code is often much larger than necessary for a given program. For example, if the program has no uninitialized global variables, there is no need to include code to clear them. OCG makes this information available to the code generator, which then creates custom runtime startup code. In a minimal case, the startup code may be completely empty.
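A minimal sketch of what such startup code does, written in C for clarity (the symbol names are illustrative; real startup code is usually assembler and linker-specific). If whole-program analysis shows there are no uninitialized statics, the clearing loop can be dropped entirely:

    extern unsigned char __bss_start, __bss_end;   /* hypothetical
                                                      linker symbols */

    void _startup(void)
    {
        unsigned char *p;
        for (p = &__bss_start; p < &__bss_end; p++)
            *p = 0;              /* zero uninitialized statics, as C
                                    requires */
        /* ...copy initialized data from ROM, then call main()... */
    }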

Smaller Code, Better Performance
One obvious advantage of OCG is smaller, faster code. More importantly, OCG allows embedded C programs to be written without the use of architecture-specific extensions.

By performing an analysis of the whole program at compile time, the code generator can make the decisions about memory placement, pointer scoping and so on that would otherwise be made by the programmer and specified through special directives or language extensions. Since this analysis is performed every time the program is recompiled, it is always accurate and up-to-date.

Table 1

Performance comparisons of OCG to conventional compilation focus on code size, typically the most important parameter for embedded systems, but also a valid proxy for execution speed. A reduction in code size always results in an increase in execution speed unless specific optimization techniques are used that sacrifice speed for size (such as code-threading).

A single C-language source file was compiled for Microchip's PIC18 and Cypress' PSoC, using both traditional and OCG compilation technologies. As shown in Table 1 above, OCG compilation resulted in 17.2% less code for Microchip's PIC18 architecture compared to HI-TECH's PICC-18 STD, which shares much of the same compilation technology except OCG. OCG-compiled code for Cypress' PSoC was 41.4% smaller than that compiled by ImageCraft.

As shown in Table 2, below, additional tests to evaluate OCG's library optimization functions on the PIC18C242 target resulted in 50% better code density than IAR's EW18 compiler and 40% better code density than Microchip's compiler. Tests were conducted on other source files, showing similar results.

Table 2

Conclusion
The somewhat irregular architectures of embedded microcontrollers are often an awkward fit with the standard C language, frequently requiring substantial architecture-specific hand crafting to achieve efficient code. Omniscient Code Generation simplifies and streamlines the programmer's job by abstracting and hiding the underlying architecture, while simultaneously delivering reduced code size and increased execution speed.

Clyde Stubbs is CEO and Founder of HI-TECH Software, which provides development tools for embedded systems, offering compilers, RTOS and an Eclipse-based IDE (HI-TIDE) for 8-, 16-, and 32-bit microcontroller and DSP chip architectures.
