Energy efficient C code for ARM devices

Chris Shore, ARM

November 3, 2010

Instructions count, too
In terms of energy and time cost, instruction execution comes after memory access. Put simply, the fewer instructions you execute, the less energy you consume and the less time you take. We are assuming here that instructions are, in the main, executed from cache; for most applications, with a sensible cache configuration and a reasonable cache size, this will generally be the case.

So, message number three is “optimize for speed”. The efficiency comes in two ways: firstly, the fewer instructions you execute, the less energy that takes; secondly, the faster you accomplish your task, the sooner you can sleep the processor and save power. We will have more to say about dynamic and static power management later, when we get to operating systems, but for now it suffices to say that fast code is, in general, power-efficient code.

Further, notice that executing instructions is cheap compared to accessing memory. This leads us to conclude that algorithms which favor computation (data processing) over communication (data movement) will tend to be more power-efficient. Another way of looking at this is to say that programs which are CPU-bound tend to be more efficient than those which are memory-bound.

There is one complication which we would do well to mention at this point. Code which is compiled for size (i.e. smallest code size) may, in some circumstances, be more efficient than code which is compiled for speed.

This will occur if, by virtue of being smaller, the code makes better use of the cache. This may be regarded as an exception to the general rule discussed above, but where it does apply, the greatly increased efficiency of cache accesses means its effect can be huge.

Good coding practice
A good rule of thumb here is “Make things Match.” So, match data types to the underlying architecture, code to the available instruction set, memory use to the configuration of the platform, coding conventions to the available tools.

Similarly “Write sensible code” and make sure you know at least something about how the tools work. If you know even a little about how procedure calls and global data are implemented on the ARM architecture, there is a lot you can do to make your code more efficient.

To some extent, be driven by the way the tools work, rather than trying to drive the tools to work the way you do. Very often, the tools work the way they do for very valid architectural reasons!

Data type and size
ARM cores are 32-bit machines. That is to say that they have 32-bit registers, a 32-bit ALU, and 32-bit internal data paths. Unsurprisingly, they are good at doing things with 32-bit quantities. Some cores do implement wider buses than this between, say, the core and the cache but this does not change the fact that the core is fundamentally a 32-bit machine.

For instance, simple ALU operations on 16-bit data (signed or unsigned short) involve extra instructions to either truncate or sign-extend the result in the 32-bit register.

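As a small sketch of the effect (the function name here is illustrative, not from the article), consider adding two unsigned 16-bit values:

```c
/* Sketch: 16-bit arithmetic needs an extra narrowing step. The ADD
 * itself is done in a 32-bit register; on ARM the compiler then
 * typically emits a UXTH (or an AND / shift pair on older cores)
 * to truncate the result back to 16 bits. */
unsigned short add_u16(unsigned short a, unsigned short b)
{
    return (unsigned short)(a + b); /* extra truncation instruction here */
}
```

The same pattern with a signed short forces a sign-extension (SXTH) instead of a truncation.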

The core can, sometimes, hide these extra operations. For instance, truncating a 32-bit value to 16 bits can be accomplished at the same time as storing it to memory if a halfword store instruction is used. This accomplishes two operations for the price of one, but only if storing the value is what you actually want to do with it.
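For example (a hypothetical sketch), the cast is absorbed into the store itself:

```c
#include <stdint.h>

/* Sketch: truncation folded into the store. A halfword store (STRH)
 * writes only the low 16 bits of the register, so the explicit cast
 * costs no extra instruction when the value is being stored anyway. */
void store_low16(uint16_t *dst, uint32_t value)
{
    *dst = (uint16_t)value; /* single STRH on ARM: store and truncate */
}
```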

Later versions of the architecture provide sub-word SIMD instructions which can get around some of these inefficiencies but in most cases these instructions are hard, if not impossible, to use from C.

Remember, too, that local variables, regardless of size, always take up an entire 32-bit register when held in the register bank and an entire 32-bit word in memory when spilled on to the stack.

There is very little advantage in using sub-word types on an ARM core (as compared with many other architectures in which it can save both time and storage). That’s not to say that it doesn’t work or isn’t sometimes the right thing to do – just that there is little or no efficiency advantage in doing so.

Data alignment
In keeping with this, the ARM core has historically imposed very strict alignment requirements on both code and data placement. It is still true today that ARM cores cannot execute unaligned instructions. However, more recent cores, from architecture ARMv6 onwards, have relaxed somewhat the requirements for data accesses.

These cores support access to unaligned words and halfwords. It is tempting, therefore, to abandon all consideration of alignment in your code, enable unaligned support in the hardware and declare everything as unaligned.

This would be a mistake! For instance, loading an unaligned word may indeed take one instruction where, on earlier cores, it would have taken three or four, but there is still a performance penalty because the hardware must carry out multiple bus transactions.

Those transactions are hidden from the program and from the programmer but, at the hardware level, they are still there and they still take time.
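In C, the portable way to express an unaligned word access is via memcpy; this is a sketch of my own, not from the article. On ARMv6 and later the compiler can lower it to a single unaligned LDR, but the bus-level cost described above remains:

```c
#include <stdint.h>
#include <string.h>

/* Sketch: portable unaligned 32-bit load. The compiler chooses the
 * best sequence for the target: one hardware-assisted unaligned LDR
 * on ARMv6+, or a series of byte loads on earlier cores. The memcpy
 * call itself is optimized away. */
uint32_t load_u32_unaligned(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```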

Structures and arrays
If storage is not at too much of a premium, there are good reasons for choosing carefully the size of array elements. Firstly, making the length of each element a power of 2 simplifies the offset calculation when accessing an individual element. This is a result of the fact that the ARM instruction set allows shift operations to be combined with ALU operations and to be built into addressing modes.

For an array of elements of size 12, with the base address in r3, accessing the element at index r1 takes a sequence like this:

ADD r1, r1, r1, LSL #1 ; r1 = 3 * r1
LDR r0, [r3, r1, LSL #2] ; r0 = *(r3 + 4 * r1)

By comparison, if the element size is 16, the address calculation is much simpler:

LDR r0, [r3, r1, LSL #4] ; r0 = *(r3 + 16 * r1)

Although we will look at cache-friendly data access in more detail later, it is worth noting at this point that elements which fit neatly into cache lines make more efficient use of the automatic prefetching behavior of cache linefills.
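As a sketch (the field names are illustrative), a 12-byte record can be padded to 16 bytes so that indexing compiles to a single shifted addressing mode and elements pack evenly into cache lines:

```c
#include <stdint.h>

/* Sketch: pad a 12-byte element up to 16 bytes (a power of two).
 * Indexing then needs only a shift built into the addressing mode,
 * and four elements fill a 64-byte cache line with no straddling. */
struct element {
    int32_t x, y, z;  /* 12 bytes of payload */
    int32_t pad;      /* explicit padding to 16 bytes */
};
```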

Efficient parameter passing
Each parameter of word size or smaller is passed to a function using either a single 32-bit register or a 32-bit word on the stack. This information is buried inside a specification called the Procedure Call Standard for the ARM Architecture (AAPCS).

This, in turn, is part of a larger document called the ARM Application Binary Interface (ARM ABI). The latest version of both can be found on ARM’s website and, while the ABI document is aimed largely at tools developers, the AAPCS information is directly relevant to applications programmers.

According to the AAPCS, we have four registers available for passing parameters. So, up to four parameters can be passed very efficiently simply by loading them into registers prior to the function call. Similarly, a single word-sized (or smaller) return value can be passed back in a register and a doubleword value in two registers.

It is plain to see that trying to pass more than four parameters involves placing the remainder on the stack with the attendant cost in memory accesses, extra instructions and, therefore, energy and time. The simple rule “keep parameters to four or fewer” is well worth keeping in mind when coding procedures.
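A minimal sketch (hypothetical function) of a call that stays entirely in registers:

```c
/* Sketch: four word-sized parameters arrive in r0-r3 under the
 * AAPCS, and the word-sized result returns in r0, so the call
 * itself touches no stack memory. A fifth parameter would have
 * to be passed on the stack. */
int blend(int a, int b, int c, int d)
{
    return (a * b) + (c * d);
}
```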

Further, there are “alignment” restrictions on the use of registers when passing doublewords as parameters. In essence, a doubleword is passed in two registers, and those registers must be an even/odd pair, i.e. r0 and r1, or r2 and r3.

The following function call will pass ‘a’ in r0, ‘b’ in r2:r3 and ‘c’ on the stack. r1 cannot be used because of the alignment restrictions which apply to passing the doubleword.

fx(int a, double b, int c)

Re-declaring the function as shown below is functionally identical but passes all the parameters in registers.

fx(int a, int c, double b)
