Energy efficient C code for ARM devices

Chris Shore, ARM

November 3, 2010

The C compiler is not omniscient
Simply put, the C compiler cannot read your mind! Much as you try, it cannot divine your intentions simply by looking at the code you write. Further, it is restricted to examining the single compilation unit with which it is presented. In order to guarantee correct program execution, it is programmed for maximum paranoia in all circumstances.

So it must make “worst-case” assumptions about everything. The most obvious and best known example of this is “pointer aliasing.” This means that the compiler must make the assumption that any write through any pointer can change the value of any item in memory whose address is known to the program. This frequently has severe repercussions for compiler optimization.
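
A minimal sketch of the aliasing problem, using hypothetical helper names (`add_step`, `add_step_hoisted`) that do not come from the original article. In the first function the compiler has to assume that `dst[i]` and `*step` may overlap, so `*step` is reloaded on every iteration. The second reads it once into a local, which is only equivalent when the caller knows that the two do not alias.

```c
#include <stddef.h>

/* Hypothetical helper: adds *step to each of n elements of dst.
 * A store to dst[i] might change *step, so the compiler must
 * reload *step on every iteration. */
void add_step(int *dst, const int *step, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        dst[i] += *step;
    }
}

/* Same operation, but *step is read once before the loop.
 * Only equivalent to add_step() when step does not point into dst. */
void add_step_hoisted(int *dst, const int *step, size_t n)
{
    const int s = *step;
    size_t i;
    for (i = 0; i < n; i++) {
        dst[i] += s;
    }
}
```

If `step` points at an element of `dst`, the two functions produce different results. That is exactly why the compiler cannot perform this hoist on its own.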

Other examples: the compiler must assume that all global data is volatile across external function boundaries, that loop counters may take any value, and that loop tests may fail the first time round. There are many more.

The good news is that, in most cases, it is very easy for the programmer to provide extra information which helps the compiler out. In other cases, you can rewrite your code to express your intentions better and to convey the particular conditions which prevail. For instance, if you know that a particular loop will always execute at least once, a do-while loop is a much better choice than a for loop.

The for loop in C always applies the termination test before the first iteration of the loop. The compiler is therefore forced to do one of three things: place duplicate tests at top and bottom, place a branch at the top to a single test at the bottom, or place a branch at the bottom to a single test at the top which is executed first time round.

All three approaches involve extra code, extra branches or both. Yes, you can argue that modern sophisticated branch prediction hardware reduces the penalty of this but it is, in my view, still lazy code and you should be doing better!
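
The difference between the two loop forms can be sketched as follows. The function names are hypothetical, and the do-while version is only correct under the stated assumption that the caller always passes `n >= 1`.

```c
#include <stddef.h>

/* for loop: the test i < n is applied before the first iteration,
 * so the function is also correct for n == 0. */
int sum_for(const int *data, size_t n)
{
    int total = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

/* do-while loop: the body runs before any test is made.
 * Assumption: the caller guarantees n >= 1. */
int sum_do_while(const int *data, size_t n)
{
    int total = 0;
    size_t i = 0;
    do {
        total += data[i];
        i++;
    } while (i < n);
    return total;
}
```

The second form has a single test at the bottom of the loop and needs no extra branch to reach it.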

As shown below, the ARM compiler also provides several keywords for giving the compiler “meta-information” about your code at various points.

__pure: Indicates that a function has no side-effects and accesses no global data. In other words, its result depends only on its parameters, and calling it twice with the same parameters will always return the same result. This makes the function a candidate for common subexpression elimination.

__restrict: When applied to a pointer declaration, indicates that no write through this pointer changes the value of an item referenced by any other pointer. This is of particular benefit in loop optimization, where it increases the freedom of the compiler to apply transformations such as rotation, unrolling and inversion.

__promise: Indicates that, at a particular point in a program, a particular expression holds true.
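
A short sketch of how the first two keywords might be used. `PURE_FN`, `square`, `twice_square` and `scale` are hypothetical names introduced for illustration. The mapping of `__pure` to GCC's `__attribute__((const))` is an assumption made so that the sketch also builds outside the ARM compiler; `__restrict` is accepted by GCC and Clang as well.

```c
#include <stddef.h>

/* PURE_FN is a hypothetical wrapper macro, not an ARM keyword. */
#if defined(__CC_ARM)
#define PURE_FN __pure
#elif defined(__GNUC__)
#define PURE_FN __attribute__((const)) /* assumed equivalent */
#else
#define PURE_FN
#endif

/* Result depends only on x; no side effects, no global data. */
PURE_FN int square(int x)
{
    return x * x;
}

/* Because square() is marked pure, the compiler is free to
 * evaluate it once here and reuse the result. */
int twice_square(int x)
{
    return square(x) + square(x);
}

/* __restrict promises that dst and src never overlap, so loads
 * from src may be reordered around stores to dst. */
void scale(int *__restrict dst, const int *__restrict src,
           int k, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        dst[i] = src[i] * k;
    }
}
```

Both annotations are promises made by the programmer. If `scale` is called with overlapping arrays, the behaviour is undefined.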

Consider the following example:

 


The __promise intrinsic here informs the compiler that the loop counter, at this particular point, will be greater than zero and divisible by eight. This allows the compiler to treat the for loop as a do-while loop, omitting the test at the top, and also to unroll the loop by any factor up to eight without having to worry about boundary conditions.
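
A minimal sketch of a loop of the kind described, not the article's own listing. `PROMISE` and `scale_in_place` are hypothetical names. On compilers other than the ARM compiler the macro expands to nothing, so the code still builds but the compiler receives no hint.

```c
/* PROMISE is a hypothetical wrapper. It is assumed here that
 * __promise is available as an intrinsic when __CC_ARM is defined. */
#if defined(__CC_ARM)
#define PROMISE(expr) __promise(expr)
#else
#define PROMISE(expr) ((void)0)
#endif

/* Multiplies each of len elements by k.
 * Assumption: the caller always passes a len that is greater
 * than zero and a multiple of eight. */
void scale_in_place(int *data, unsigned int len, int k)
{
    unsigned int i;

    PROMISE(len > 0);        /* loop body runs at least once   */
    PROMISE((len & 7) == 0); /* len is divisible by eight      */

    for (i = 0; i < len; i++) {
        data[i] *= k;
    }
}
```

As with `__restrict`, the promise is not checked in a release build. Passing a length that violates it may produce wrong results.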

This kind of optimization is particularly important when combined with the vectorization capability of the NEON engine.

The C compiler is not omnipotent
As well as not knowing everything, the compiler cannot do everything either!

There are many instructions, particularly in more recent versions of the architecture, which cannot be generated by the C compiler at all. This is usually because they have no equivalent in the semantics of the C language.

Using SIMD instructions to operate on complex numbers stored as packed signed halfwords is a good example. C has no complex type which fits this storage method, so the compiler cannot use these very efficient instructions for carrying out very straightforward operations.

The proficient programmer could write hand-coded assembly functions to implement these operations. However, it is often much easier to use the rich set of intrinsic functions provided by the ARM C compiler.

The following example shows a short function implementing a complex multiplication using the SMUSD and SMUADX instructions provided in the ARMv6 architecture.

 

This produces the following output.

 

If the compiler is able to inline the function, then there is no function call overhead either. The obvious advantages of this approach over coding in assembly are increased portability and readability.
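
A sketch of such a function under stated assumptions: the real part is held in the low halfword and the imaginary part in the high halfword, and the result is returned as two 32-bit values. `complex32_t`, `pack_complex` and `complex_mul` are hypothetical names. With the ARM compiler the `__smusd` and `__smuadx` intrinsics are used (their exact prototypes are assumed here); elsewhere plain C functions model the arithmetic so that the sketch can be run on any host.

```c
#include <stdint.h>

#if defined(__CC_ARM)
/* ARM compiler: map straight onto the ARMv6 SIMD intrinsics. */
#define smusd(a, b)  __smusd((a), (b))
#define smuadx(a, b) __smuadx((a), (b))
#else
/* Portable model of the two instructions (no Q flag handling). */
static int32_t half_lo(uint32_t v)
{
    int32_t h = (int32_t)(v & 0xFFFFu);
    return (h & 0x8000) ? h - 0x10000 : h; /* sign-extend 16 bits */
}
static int32_t half_hi(uint32_t v)
{
    return half_lo(v >> 16);
}
/* SMUSD: low*low - high*high */
static int32_t smusd(uint32_t a, uint32_t b)
{
    return half_lo(a) * half_lo(b) - half_hi(a) * half_hi(b);
}
/* SMUADX: low*high + high*low (halfwords of b exchanged) */
static int32_t smuadx(uint32_t a, uint32_t b)
{
    return half_lo(a) * half_hi(b) + half_hi(a) * half_lo(b);
}
#endif

typedef struct {
    int32_t re;
    int32_t im;
} complex32_t;

/* Packs re into bits 15:0 and im into bits 31:16. */
uint32_t pack_complex(int16_t re, int16_t im)
{
    return (uint32_t)(uint16_t)re | ((uint32_t)(uint16_t)im << 16);
}

/* (a + bi)(c + di) = (ac - bd) + (ad + bc)i */
complex32_t complex_mul(uint32_t x, uint32_t y)
{
    complex32_t r;
    r.re = smusd(x, y);
    r.im = smuadx(x, y);
    return r;
}
```

For example, multiplying 3 + 4i by 1 - 2i this way yields 11 - 2i.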

NEON support in the compiler
The C compiler can also be used to access the functionality of the NEON Media Processing Engine via a comprehensive set of intrinsic functions and built-in data types.

Here is a straightforward implementation of an array multiplication function. The C code is shown on the left and the resulting assembly code on the right. Only the body of the loop is shown and this executes a total of eight times.

 

The next pair of sequences shows the same loop implemented using NEON intrinsics. The key point to note is that the loop has been unrolled by a factor of four to reflect the fact that the NEON load, store and multiplication instructions each deal with four 32-bit words at a time. This greatly lowers the instruction count. Due to the reduced number of iterations, the loop overhead is also reduced.

 

If you look a little more closely, you can see that the compiler output does not exactly correspond to the C source code. The order of the instructions has been changed. The compiler has done this to minimize interlocks and thus maximize throughput. This is another advantage of using intrinsics over hand coding in assembly: the compiler is able to optimize the machine instructions in the context of the surrounding code and in response to the target architecture.
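
The two loops discussed in this section might look like the sketch below. This is not the article's own listing: the function names, the unsigned 32-bit element type and the fallback path are assumptions. The intrinsics `vld1q_u32`, `vmulq_u32` and `vst1q_u32` come from `arm_neon.h`. On a target without NEON, a manually unrolled scalar loop stands in so that the code still builds.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Straightforward version: one element per iteration. */
void mul_arrays(uint32_t *dst, const uint32_t *a,
                const uint32_t *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        dst[i] = a[i] * b[i];
    }
}

/* Four elements per iteration.
 * Assumption: n is a multiple of four. */
void mul_arrays_by4(uint32_t *dst, const uint32_t *a,
                    const uint32_t *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i += 4) {
#if HAVE_NEON
        uint32x4_t va = vld1q_u32(&a[i]);      /* load 4 words   */
        uint32x4_t vb = vld1q_u32(&b[i]);      /* load 4 words   */
        vst1q_u32(&dst[i], vmulq_u32(va, vb)); /* multiply+store */
#else
        /* Scalar stand-in with the same loop structure. */
        dst[i]     = a[i]     * b[i];
        dst[i + 1] = a[i + 1] * b[i + 1];
        dst[i + 2] = a[i + 2] * b[i + 2];
        dst[i + 3] = a[i + 3] * b[i + 3];
#endif
    }
}
```

For an eight-element array the first loop runs eight times and the second only twice.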
