Measuring instead of speculating
Some programmers think modeling memory-mapped devices as C++ objects is too costly. On some architectures, chances are they're wrong.
Last spring, I began a series of columns on representing and manipulating memory-mapped devices in C and C++. In May, I considered the alternatives in C and advocated that structures are the best way to represent device registers.1 In June, I explained why C++ classes are even better than C structures.2
In Standard C and C++, you can't declare a memory-mapped object at a specified absolute address, but you can initialize a pointer with the value of that address. Then you can dereference that pointer to access the object. That's what I did in those articles last spring.
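As a reminder of what that technique looks like, here is a minimal sketch written so that it can run on a host machine. The timer_registers layout, its member names, the enable bit, and the address 0xFFFF6000 are all hypothetical and not taken from any particular board. On real hardware the pointer would be initialized from the device's absolute address, as the comment shows; here an ordinary static object stands in for the device.

```cpp
#include <cstdint>

// Hypothetical register layout for a simple timer device.
struct timer_registers {
    std::uint32_t volatile control;
    std::uint32_t volatile count;
};

// On a real target, the pointer would hold the device's absolute address:
//     timer_registers *const timer =
//         reinterpret_cast<timer_registers *>(0xFFFF6000);  // assumed address
// On a host, an ordinary object stands in for the device so the code can run.
static timer_registers fake_timer = {0, 0};
timer_registers *const timer = &fake_timer;

// Every access goes through the pointer, just as it would on the target.
void timer_enable() {
    timer->control = timer->control | 0x1u;  // hypothetical enable bit
}

std::uint32_t timer_count() {
    return timer->count;
}
```

The members are declared volatile so that the compiler can't optimize away or merge the accesses, which matters once the object really is a device.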
The June column prompted several readers to post comments alleging that using a pointer to access a C++ class object representing a memory-mapped device incurs a performance penalty by somehow adding unnecessary indirect addressing. Interestingly, no one complained that using a pointer to access a C structure incurs a similar performance penalty. This left me wondering whether the allegation is that using a C++ class is more expensive than using a C structure, or that using pointers to access memory-mapped class objects is more costly than using some other means.
My impression is that the authors of the comments were more concerned about the latter--the alleged cost of using pointers. However, I suspect many of you are also interested in knowing whether using C++ classes is more expensive than using comparable C structures. I know I am. Therefore, I decided to evaluate alternative memory-mapped device implementations in both C and C++.
In my August column, I described the common standard and non-standard mechanisms for placing objects into memory-mapped locations.3 In September, I presented alternative data representations for memory-mapped device registers that eliminate the need to use pointers to access memory-mapped devices.4 In November, I delineated some of those alternatives more explicitly.5
All of this brings us back to the question I set out to answer: Does eliminating pointer references from function calls and function bodies actually improve the run-time performance of C or C++ code that manipulates memory-mapped devices? To answer this, I ran some timing tests using a few different C and C++ compilers. This month, I'll describe how I wrote the tests and what conclusions I think we can draw from the results. Some of the results surprised me. I suspect they'll surprise many of you, too.
Test design considerations
Different processors support different combinations of addressing modes. Some are better at, say, absolute addressing than they are at base+offset addressing, and others are just the opposite. For a given processor, some compilers may be better than others at leveraging the addressing modes on the target processor. Thus, the results you get from measurements made with one compiler targeting one processor may not be the same as what you get with a different compiler or different target processor. No surprise there.
I have access to only a modest assortment of compilers and processors. Any conclusions that we can draw from running tests with the tools I have might be broadly applicable, but I have no illusions about discovering universal truths. Running tests on only a small set of compilers or processors can still yield useful information--just not as much as most of us would like. Therefore, I'll explain how and why I designed the test programs as I did so that you can write similar (or perhaps better) tests for other compilers and processors, make your own measurements, and share your observations with the rest of us.
For this first round of measurements, I decided to use the one evaluation board I have that I can program with multiple compilers. The board has a 50-MHz ARM processor with 512 Kbytes of memory and a small assortment of memory-mapped devices. I used three different compilers, each from a different vendor and of different vintage. Each compiler supported both C and C++. I compiled for the ARM (rather than Thumb) instruction set with little-endian byte ordering. I set each compiler to optimize for maximum speed. I didn't turn the instruction cache on.
All of the tests are variations on the same theme: the main function in each program repeatedly calls a function that accesses a memory-mapped device, and counts the number of calls it makes in a given span of time. Each program differs in (1) how it represents the registers of the memory-mapped device, (2) how it accesses those registers, and (3) whether the access functions are inline or not.
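The general shape of such a test can be sketched as follows. This is only an illustration that runs on a host, not the actual test code: the names fake_register, access_device, and count_calls are hypothetical, a static volatile object stands in for the device, and std::chrono::steady_clock stands in for whatever timer the target board provides.

```cpp
#include <chrono>
#include <cstdint>

// Stand-in for a device register; on the target this would be memory-mapped.
static std::uint32_t volatile fake_register = 0;

// The function under test: one read and one write of the stand-in register.
void access_device() {
    fake_register = fake_register + 1;
}

// Call access_device repeatedly for roughly `span` and return the call count.
unsigned long count_calls(std::chrono::milliseconds span) {
    using clock = std::chrono::steady_clock;
    unsigned long calls = 0;
    clock::time_point const stop = clock::now() + span;
    while (clock::now() < stop) {
        access_device();
        ++calls;
    }
    return calls;
}
```

A higher count for the same span suggests a cheaper access function. Note that in this sketch each iteration also pays for a call to clock::now(), so the counts are meaningful only when compared against each other under identical conditions.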
The purpose of these test programs is to provide information to help evaluate programming techniques. They're not for compiler benchmarking. Therefore, I won't identify the compiler vendors. Rather, I'll refer to each compiler by the year in which it was released: 2000, 2004, and 2010.