Measuring instead of speculating - Embedded.com

Measuring instead of speculating

Some programmers think modeling memory-mapped devices as C++ objects is too costly. With some architectures, the chances are they're wrong.

Last spring, I began a series of columns on representing and manipulating memory-mapped devices in C and C++. In May, I considered the alternatives in C and argued that structures are the best way to represent device registers [1]. In June, I explained why C++ classes are even better than C structures [2].

In Standard C and C++, you can't declare a memory-mapped object at a specified absolute address, but you can initialize a pointer with the value of that address. Then you can dereference that pointer to access the object. That's what I did in those articles last spring.

The June column prompted several readers to post comments alleging that using a pointer to access a C++ class object representing a memory-mapped device incurs a performance penalty by somehow adding unnecessary indirect addressing. Interestingly, no one complained that using a pointer to access a C structure incurs a similar performance penalty. This left me wondering if the allegation is that using a C++ class is more expensive than using a C structure, or if it's that using pointers to access memory-mapped class objects is more costly than using some other means.

My impression is that the authors of the comments were more concerned about the latter–the alleged cost of using pointers. However, I suspect many of you are also interested in knowing whether using C++ classes is more expensive than using comparable C structures. I know I am. Therefore, I decided to evaluate alternative memory-mapped device implementations in both C and C++.

In my August column, I described the common standard and non-standard mechanisms for placing objects into memory-mapped locations [3]. In September, I presented alternative data representations for memory-mapped device registers that eliminate the need to use pointers to access memory-mapped devices [4]. In November, I delineated some of those alternatives more explicitly [5].

All of this brings us back to the question I set out to answer: Does eliminating pointer references from function calls and function bodies actually improve the run-time performance of C or C++ code that manipulates memory-mapped devices? To answer this, I ran some timing tests using a few different C and C++ compilers. This month, I'll describe how I wrote the tests and what conclusions I think we can draw from the results. Some of the results surprised me. I suspect they'll surprise many of you, too.

Test design considerations

Different processors support different combinations of addressing modes. Some are better at, say, absolute addressing than they are at base+offset addressing, and others are just the opposite. For a given processor, some compilers may be better than others at leveraging the addressing modes on the target processor. Thus, the results you get from measurements made with one compiler targeting one processor may not be the same as what you get with a different compiler or different target processor. No surprise there.

I have access to only a modest assortment of compilers and processors. Any conclusions that we can draw from running tests with the tools I have might be broadly applicable, but I have no illusions about discovering universal truths. Running tests on only a small set of compilers or processors can still yield useful information–just not as much as most of us would like. Therefore, I'll explain how and why I designed the test programs as I did so that you can write similar (or perhaps better) tests for other compilers and processors, make your own measurements, and share your observations with the rest of us.

For this first round of measurements, I decided to use the one evaluation board I have that I can program with multiple compilers. The board has a 50-MHz ARM processor with 512 Kbytes of memory and a small assortment of memory-mapped devices. I used three different compilers, each from a different vendor and of different vintage. Each compiler supported both C and C++. I compiled for the ARM (rather than THUMB) instruction set with little-endian byte ordering. I set each compiler to optimize for maximum speed. I didn't turn the instruction cache on.

All of the tests are variations on the same theme: the main function in each program repeatedly calls a function that accesses a memory-mapped device, and counts the number of calls it makes in a given span of time. Each program differs in (1) how it represents the registers of the memory-mapped device, (2) how it accesses those registers, and (3) whether the access functions are inline or not.

The purpose of these test programs is to provide information to help evaluate programming techniques. They're not for compiler benchmarking. Therefore, I won't identify the compiler vendors. Rather, I'll refer to each compiler by the year in which it was released: 2000, 2004, and 2010.

Implementation choices
Each program tests either a polystate implementation, a bundled monostate implementation, or an unbundled monostate implementation.

As I explained last month, a polystate implementation of a memory-mapped device uses a C structure or C++ class accessed via a pointer or reference. Polystate implementations support multiple instances of the same kind of device. The structures and classes that I presented in May and June, and that provoked the comments that triggered this investigation, are polystate implementations.

A monostate implementation eliminates the need for pointers or references to access the device. A monostate implementation for a device permits only one instance of that device. As I explained last month, a monostate implementation can be bundled or unbundled. A bundled monostate wraps its data members in an additional structure; an unbundled monostate does not.
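To make the three designs concrete, here is a minimal C++ sketch built around a hypothetical one-register counter “device”. The class names (counter_polystate, counter_unbundled, counter_bundled), the register names, and the tick function are invented for illustration and are not taken from the earlier columns. The objects here live in ordinary memory rather than at a device address.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint32_t volatile device_register;

// Polystate: registers are non-static members, so every access goes
// through 'this' and several instances can coexist.
class counter_polystate {
public:
    void reset() { count = 0; }
    std::uint32_t tick() {
        std::uint32_t const n = count + 1;
        count = n;
        return n;
    }
private:
    device_register count;
};

// Unbundled monostate: each register is its own static data member.
class counter_unbundled {
public:
    static void reset() { count = 0; }
    static std::uint32_t tick() {
        std::uint32_t const n = count + 1;
        count = n;
        return n;
    }
private:
    static device_register count;
};
device_register counter_unbundled::count;

// Bundled monostate: the registers are wrapped in one extra structure,
// and the class holds a single static object of that structure type.
class counter_bundled {
public:
    static void reset() { regs.count = 0; }
    static std::uint32_t tick() {
        std::uint32_t const n = regs.count + 1;
        regs.count = n;
        return n;
    }
private:
    struct registers {
        device_register count;
        device_register control;   // unused; shows more than one register
    };
    static registers regs;
};
counter_bundled::registers counter_bundled::regs;

// Small self-check showing how each design is called.
inline void exercise_counters() {
    counter_polystate a, b;        // two independent instances
    a.reset();
    b.reset();
    a.tick();
    assert(a.tick() == 2 && b.tick() == 1);
    counter_unbundled::reset();    // no object needed
    assert(counter_unbundled::tick() == 1);
    counter_bundled::reset();
    assert(counter_bundled::tick() == 1);
}
```

Only the polystate version can have two counters at once; the monostate versions trade that flexibility for calls that need no object, pointer, or reference.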

For each design, I wrote a C implementation and a C++ implementation, and for each of those implementations, I wrote a version that used inline functions and another that did not. Thus, for example, I wrote a polystate implementation in C with inline functions, a polystate implementation in C with non-inline functions, a polystate implementation in C++ with inline functions, a polystate implementation in C++ with non-inline functions, and so on for the bundled and unbundled monostate implementations.

Not all the C compilers I used support inline functions, so I implemented the inline functions in C using function-like macros.

Placement choices
As I noted earlier, in Standard C and C++, you can't declare a memory-mapped object at a specified absolute address, but you can initialize a pointer with the value of that address. In the following discussion of the test cases, I describe this technique as using pointer-placement.

For example, to access a timer_registers object residing at location 0xFFFF6000, you can declare a pointer called the_timer as a macro:

#define the_timer ((timer_registers *)0xFFFF6000)

or as a constant pointer:

timer_registers *const the_timer
= (timer_registers *)0xFFFF6000;

My tests that use pointer-placement use the latter form. I randomly replaced the pointer constants with macros in a few test cases and saw no difference in the generated code.

In C++, you can also write pointer casts using the reinterpret_cast operator, as in:

timer_registers *const the_timer
= reinterpret_cast<timer_registers *>(0xFFFF6000);

I use this form in my C++ tests.

In C++, but not C, you can use a reference instead of a pointer, as in:

timer_registers &the_timer
= *reinterpret_cast<timer_registers *>(0xFFFF6000);

I call this technique reference-placement.
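Pointer-placement and reference-placement can be tried on a desktop machine with the following sketch. It assumes a hypothetical two-register layout for timer_registers, and it substitutes the address of an ordinary object (simulated_timer) for a real device address such as 0xFFFF6000, because dereferencing that literal on a hosted system would fault. The names timer_address, timer_ptr, and timer_ref are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register layout; a real timer's layout will differ.
struct timer_registers {
    std::uint32_t volatile control;
    std::uint32_t volatile count;
};

// Host-side stand-in for the hardware; a real target has no such object.
static timer_registers simulated_timer;

// On the target this would be a literal such as 0xFFFF6000.
static std::uintptr_t const timer_address =
    reinterpret_cast<std::uintptr_t>(&simulated_timer);

// pointer-placement: a constant pointer initialized from the address
static timer_registers *const timer_ptr =
    reinterpret_cast<timer_registers *>(timer_address);

// reference-placement: a reference bound to the object at that address
static timer_registers &timer_ref =
    *reinterpret_cast<timer_registers *>(timer_address);

// Both names designate the same registers, so a write through one
// is visible through the other.
inline void check_same_device() {
    timer_ptr->count = 5;
    assert(timer_ref.count == 5);
    timer_ref.control = 1;
    assert(timer_ptr->control == 1);
}
```

The two forms differ only in how the code spells each access (timer_ptr->count versus timer_ref.count); whether they generate different instructions is exactly what the measurements are meant to show.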

As some readers suggested, you can declare a memory-mapped object using a standard extern declaration such as:

extern timer_registers the_timer;   

and then use a linker command to force the_timer into the desired address. I call this technique linker-placement.

Some C and C++ compilers provide language extensions that let you position an object at a specified memory address. For example, to declare a timer_registers object residing at location 0xFFFF6000, you might write:

timer_registers the_timer @ 0xFFFF6000;   

with one compiler, or:

timer_registers the_timer _at(0xFFFF6000);   

with another, or:

timer_registers the_timer
__attribute__((at(0xFFFF6000)));

with yet another. I describe test cases that use declarations such as these as using at-placement.

Of the compilers at my disposal, all three support pointer-placement (as they should, because it's standard), and all three C++ compilers support reference-placement. Only one compiler supports at-placement.

I suspect all the compilers support linker-placement, but to be honest, I could figure out how to do it with only one. However, I realized I could simulate the behavior of linker-placement easily, and increase the portability of the tests at the same time. Here's how.

A software “device”
In order to time the execution of function calls that manipulate memory-mapped devices, I needed just two devices: a timer and something else. My hardware platform has a small assortment of devices such as lights, switches and serial ports, and I could have picked any one of them. But then again, I wanted these tests to be fairly easy to migrate, and I didn't think I could rely on any one of these devices being available on other platforms.

Rather than use a real hardware device, I invented a software “device” that manipulates memory, and I wrote the tests to address the “device” as if it really were a memory-mapped device. The “device” is a Fibonacci sequence generator; each “get” operation applied to the device returns the next number in the Fibonacci sequence. A C++ implementation appears in Listing 1. The corresponding C implementation appears in Listing 2.
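For readers without the listings at hand, a polystate Fibonacci “device” in C++ might look roughly like the sketch below. This is an illustration written from the description above, not a copy of the published listing: the two-register layout (f0, f1), the reset function, the device_register typedef, and the choice to start the sequence at 0 are all assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Registers are volatile so every access really touches memory,
// as it would for a hardware device.
typedef std::uint32_t volatile device_register;

class fibonacci_registers {
public:
    // Restart the sequence at 0, 1 (an assumed starting point).
    void reset() { f0 = 0; f1 = 1; }

    // Return the current number and advance the "device" by one step.
    std::uint32_t get() {
        std::uint32_t const current = f0;
        std::uint32_t const next = f1;
        f0 = next;
        f1 = current + next;   // unsigned arithmetic wraps on overflow
        return current;
    }

private:
    device_register f0;
    device_register f1;
};

// Self-check: the first seven values after a reset.
inline void check_first_values(fibonacci_registers &fib) {
    std::uint32_t const expected[] = {0, 1, 1, 2, 3, 5, 8};
    fib.reset();
    for (std::uint32_t value : expected) {
        assert(fib.get() == value);
    }
}
```

Each call to get performs two register reads and two register writes, which gives the compiler some real addressing work to do without depending on any particular piece of hardware.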



The address range of the memory on my target evaluation board is from 0 to 0x7FFFF. I determined that my test programs weren't using any memory at the higher addresses, so I placed the fibonacci_registers object at 0x7FF00. For example, C test cases that use pointer-placement declare a pointer to the fibonacci_registers using a declaration such as:

fibonacci_registers *const fibonacci
= (fibonacci_registers *)(0x7FF00);

Programs that use at-placement declare the device as:

fibonacci_registers fibonacci @ 0x7FF00;

Using the Fibonacci “device” also gave me a way to mimic linker-placement when I couldn't figure out the linker commands to really do it. With linker-placement, the test program declares the “device” using the extern declaration:

extern fibonacci_registers fibonacci;   

When I couldn't use genuine linker-placement, I simply wrote the definition for the object in another source file, as in:

/* fibonacci.c */
#include "fibonacci.h"
fibonacci_registers fibonacci;

The linker places the compiled definition for the fibonacci object among the other global objects in the program, not at 0x7FF00. I call this technique default-placement. Using default-placement let me measure the performance of linker-placement without actually using linker-placement. On the one compiler with which I actually tested linker-placement, default-placement did indeed produce the same run times as linker-placement.

Computing run times
All of the test programs have a main function that repeatedly calls the Fibonacci device's get function, and counts the number of calls it makes in 15 seconds. The C version of main for a polystate implementation appears in Listing 3. The C++ version (not shown) is nearly identical; it uses the C++ member function notation for calls to the timer and Fibonacci functions.



The test program generates no output. I used a debugger to examine the value of iterations when the program terminated. The final value of iterations indicates the relative speed of the call to the Fibonacci get function: faster implementations of the get function yielded more iterations, while slower implementations yielded fewer iterations.
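The counting loop can be approximated on a hosted system with the sketch below. It is not the test program itself: std::chrono::steady_clock stands in for the board's hardware timer, and fibonacci_device and count_calls are hypothetical names introduced here.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Minimal stand-in for the Fibonacci "device" (hypothetical layout).
struct fibonacci_device {
    std::uint32_t volatile f0 = 0;
    std::uint32_t volatile f1 = 1;

    std::uint32_t get() {
        std::uint32_t const current = f0;
        std::uint32_t const next = f1;
        f0 = next;
        f1 = current + next;
        return current;
    }
};

// Calls 'call' repeatedly until 'window' has elapsed and returns the
// number of calls made. More iterations means a faster call.
template <typename Call>
unsigned long count_calls(std::chrono::milliseconds window, Call call) {
    assert(window.count() > 0);
    using clock = std::chrono::steady_clock;
    auto const deadline = clock::now() + window;
    unsigned long iterations = 0;
    while (clock::now() < deadline) {
        call();
        ++iterations;
    }
    return iterations;
}
```

A 15-second run would be written as count_calls(std::chrono::seconds(15), [&fib] { fib.get(); }) for some fibonacci_device fib. On a desktop machine the cost of reading the clock dominates each iteration, so the loop overhead has to be measured separately and subtracted before the counts say anything about the call.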

I thought the results would be easier to read if I converted the number of iterations into the actual execution time of the function call. The time for each iteration is simply the elapsed time for all the iterations (15 seconds) divided by the number of iterations. But that time, which I'll call Te, includes the time for loop overhead (the increments, compares and branches) as well as the function call. I wanted the time for just the function call.

To isolate it, I simply commented out the function call and ran the test again. (I did check that commenting out the function call removed only the instructions for the call.) I used that iteration count to compute To, the execution time for one iteration without the call. The execution time for each call, Tf, is then simply Te - To.
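The arithmetic is short enough to write as a function. In this sketch the name seconds_per_call and its parameters are illustrative, and the iteration counts in the example that follows are invented, not measured results.

```cpp
#include <cassert>

// Tf = Te - To, where
//   Te = elapsed / iterations counted with the call in the loop
//   To = elapsed / iterations counted with the call commented out
double seconds_per_call(double elapsed_seconds,
                        unsigned long iterations_with_call,
                        unsigned long iterations_without_call) {
    assert(iterations_with_call > 0 && iterations_without_call > 0);
    double const te = elapsed_seconds / iterations_with_call;
    double const to = elapsed_seconds / iterations_without_call;
    return te - to;
}
```

Suppose a 15-second run completes 3,000,000 iterations with the call and 5,000,000 without it. Then Te is 5 microseconds, To is 3 microseconds, and seconds_per_call(15.0, 3000000, 5000000) yields a Tf of about 2 microseconds.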

The envelope, please
Tables 1 through 3 show the results I obtained running my tests with each of the three compilers. The results in each table are sorted from fastest to slowest, with bands of shading to highlight tests with the same run times.



The tables have unequal lengths because some compilers support more testable features than others. For example, Compiler 2004 was the only one for which I could figure out how to test linker-placement. Only Compiler 2010 has a syntax for at-placement.

The tables have a column labeled “scope.” For monostate implementations, the scope is always global. For polystate implementations, the scope indicates whether the placement declaration is at the global scope (outside main) or local scope (within main). For example, the main function for a C polystate implementation using local pointer-placement would look in part like:

int main()
{
fibonacci_registers *const fibonacci
= (fibonacci_registers *)0x7FF00;
~~~
}

So what have we learned today?
It appears that using inline functions does more than anything to improve the performance of memory-mapped accesses. With each compiler, every inline implementation outperformed every non-inline implementation. In fact, with the exception of one slightly unusual case in Table 2 (the C++ polystate implementation using reference-placement at global scope), inlining erased every other factor from consideration.

Among the non-inline implementations, three things strike me as significant. One is that, in general, the best non-inline polystate implementations outperform the best non-inline monostate implementations. This is the case for all three compilers. I believe this is because the ARM architecture loves base+offset addressing, which the polystate implementations leverage effectively.

Secondly, the non-inline unbundled monostate implementations with linker-placement (or default-placement) have the worst performance of all, and the C implementation is worse than the C++ implementation. Again, this is true for every compiler. By shifting knowledge of the data layout from the compiler to the linker, this approach robs the compiler of useful information it might use to improve code quality.

The third and possibly the most striking observation is that, except for the non-inline unbundled monostate in C++, every non-inline C++ implementation outperformed every non-inline C implementation. Once again, this was true for every compiler, regardless of its age.

I gotta say, this last one surprised me. I expected the C code to be clearly better in the oldest compiler and for the gap between C and C++ to disappear with the newer ones. I didn't expect the non-inline C++ to be better than the non-inline C across the board.

As I cautioned earlier, we shouldn't put too much stock in measurements made using just one architecture, albeit a popular one. No doubt there are things we could do to improve these measurements. But I think I've done my part to lift the discussion above speculation and anecdotes. To those who maintain that C++ polystate implementations are too costly to use: the ball's now in your court.

Dan Saks is president of Saks & Associates, a C/C++ training and consulting company. E-mail him at .

Acknowledgements:
Thanks to Greg Davis, Bill Gatliff, and Nigel Jones for their valuable feedback.

Endnotes:
1. Saks, Dan. “Alternative models for memory-mapped devices,” Embedded Systems Design, May 2010, p. 9.  www.embedded.com/columns/224700534.
2. Saks, Dan. “Memory-mapped devices as C++ classes,” Embedded.com, June 2010. www.eetimes.com/discussion/other/4200572/Memory-mapped-devices-as-C–classes.
3. Saks, Dan. “Compared to what?” Embedded.com, August 2010. www.eetimes.com/discussion/other/4205983/Compared-to-what.
4. Saks, Dan. “Accessing memory-mapped classes directly.” Embedded Systems Design, September 2010, p. 9. www.eetimes.com/discussion/programming-pointers/4208573/Accessing-memory-mapped-classes.
5. Saks, Dan. “Bundled vs. unbundled monostate classes.” Embedded.com, November 12, 2010. www.eetimes.com/discussion/other/4210702/Bundled-unbundled-monostate-classes.
