
Code and compilers: small, fast or what?


All modern compilers generate optimized code. For embedded developers, good optimization is critical, as resources are always at a premium, but control of optimization is equally essential, as every embedded system is different. This article considers the balance between fast code and small code and why that trade-off is usually necessary. It also gives examples where the rule can be broken and the result is code that is both fast and small, which leads to a reconsideration of the true function of a compiler.

What is a compiler?
Ask an average engineer and you will get an answer something like: “A software tool that translates high level language code into assembly language or machine code.” Although this definition is not incorrect, it is rather incomplete and out of date – somewhat 1970s. A better way to think of a compiler is: “A software tool that translates an algorithm described in a high level language code into a functionally identical algorithm expressed in assembly language or machine code.” More words, yes, but a more precise definition.

The implications of this definition go beyond placating a pedant like me. It leads to a greater understanding of code generation and hints at just how good a job a modern compiler can do, along with the effect on debugging the compiled code.

Life is often about compromise, but embedded developers are not good at that. Code generation is a context in which compromise is somewhat inevitable – we call it “optimization”. All modern compilers perform optimization, of course. Some do a better job than others. A lot of the time, the compiler simply guesses which optimization will produce the best result without knowing what the designer really wants. For desktop applications, this is OK. Speed is the only important criterion, as memory is effectively free. But embedded is different.

Desktop versus embedded
To a first approximation, all desktop computers are the same. It is quite straightforward to write acceptable applications that will run on anyone’s machine, and the broad expectations of their users are identical. But embedded systems are all different – the hardware and software environment varies widely and the expectations of users are just as diverse. In many ways, this is what is particularly interesting about embedded software development.

Optimization
An embedded compiler is likely to have a great many options to control optimization. Sometimes that fine-grain control is vital; on other occasions, it can come down to a simple choice between optimization for speed or for size. This choice is curious, but it is simply an empirical observation that small code is often slower and fast code tends to need more memory.

A simplistic example would be an iterative process where some operation is performed a number of times. This could be done in two ways. The obvious way is by means of a loop:

  for n=1 to 10
  operate on data element n

The other option is to unroll the loop:
  operate on data element 1
  operate on data element 2
  operate on data element 3
  operate on data element 4
  operate on data element 5
  operate on data element 6
  operate on data element 7
  operate on data element 8
  operate on data element 9
  operate on data element 10

Clearly the first way is more compact, but it would be slower because of the increment and test instructions. The extra size of the second approach may be acceptable to gain some speed, but would soon become impractical for larger numbers of operations.

A more realistic example is function inlining. A small function can be optimized so that its actual code is placed in line at each call site. This executes faster because the call/return sequence is eliminated, but it will use more memory as there may be multiple copies of identical code.

The control of optimization for embedded code generation is not likely to get any easier, as more possibilities are coming along. Notably, there is increased interest in minimizing power consumption.

An algorithm may be selected on the basis of how much CPU power it requires to get the job done. This is subtle, because faster code lets the CPU finish and sleep sooner, but smaller code needs less memory, and memory also consumes power.

An exception
It is said that rules are made to be broken; the stipulation that a choice must always be made between small and fast code can certainly be invalid. Sometimes you can get lucky and an optimization which yields faster code is also light on memory, but this is quite unusual. Consider this code:

  #define SIZE 4

  char buffer[SIZE];

  for (i=0; i<SIZE; i++)
    buffer[i] = 0;

This is straightforward. One would expect a simple loop that counts around four times using the counter variable i. I tried this, generating code for a 32-bit device, and stepped through the code using a debugger. To my surprise, the code only seemed to execute the assignment once, not four times. Yet the array was cleared correctly. So, what was going on?

A quick look at the underlying assembly language clarified matters. The compiler had generated a single, 32-bit clear instruction, which was considerably more efficient than a loop. The loop variable did not exist at all. I experimented and found that, for different values of SIZE, various combinations of 8-, 16- and 32-bit clear instructions were generated. Only when the array size exceeded something like 12 did the compiler start generating a recognizable loop, but even that was not a byte-by-byte clear. The operation was performed 32 bits at a time.

Of course, debugging optimized code is always a challenge. Indeed, even today some debuggers do not allow debugging of fully optimized code. They give you an interesting choice: ship optimal code or debugged code. This is because the correlation between the source code and the resulting machine instructions is unclear (to both the human observer and the debugger).

Realistically, I would recommend that initial debugging be performed with optimization wound down, to avoid such confusion and enable this kind of logic to be verified carefully. Later, verify the overall functionality of the code with aggressive optimization activated.

Building code for embedded applications is not a press-button activity. An awareness of compiler functionality, code generation and optimization is essential. Although there is little need to write assembly language code, an ability to read and understand the generated code is very useful.

Optimization should be used with care and tuned appropriately for the debug phase and for shipping code.

Colin Walls has over thirty years’ experience in the electronics industry, largely dedicated to embedded software. A frequent presenter at conferences and seminars and author of numerous technical articles and two books on embedded software, Colin is an embedded software technologist with Mentor Embedded (the Mentor Graphics Embedded Software Division), and is based in the UK. Visit his blog or write to him at .

6 thoughts on “Code and compilers: small, fast or what?”

  1. Debugging optimized code is always a challenge, but it is near to impossible for those who are not good at assembly.
    Do we have any tools which regenerate C/C++ code from optimized assembly/machine code? It would be very helpful for those who are not well versed

  2. Debugging optimized code, especially on ARM7 THUMB architectures, can be a pain (2 hardware debug registers? really?) but I found that if I unoptimize the section that I am interested in and leave the rest optimized, I could get the problem

  3. I've often run into apparently incorrect execution that would correct itself when adding messages to be sent to a uart for debug purposes. The solution to timing bugs is to properly structure the code using RTOS blocking semaphores or other methods so suc

  4. No, you cannot turn assembly into anything close to sensible C or C++.

    There are many techniques you can use to make it easier to debug code, even when using powerful optimisation. A common one is to add some gratuitous volatile variables into the code t

  5. There are a couple of points that make the picture even more complex than suggested here. On some processors, unrolling loops like in the article will make them go faster. On other processors, especially more advanced processors, the loop will be faster

  6. You don't need an RTOS or semaphores – you need to understand what your code is doing, and what the hardware is doing, and structure things appropriately. The most common cause for people thinking that debug calls or disabled optimisation “fixes” their co

