CMP EMBEDDED.COM

Login | Register     Welcome Guest ESC Boston  esc india  Call for Abstracts
 

'Optimal' C Constructs for 8051 Microprocessors



Embedded.com
The limitations of 8-bit microprocessors sometimes make conventional C constructs produce suboptimal code. The author looks at common cases on an 8051 and suggests some workarounds.

Note: All tables for this article are located at ftp://ftp.embedded.com/pub/11jones/tables.

All listings for this article are located at ftp://ftp.embedded.com/pub/11jones/listings

The 8-bit marketplace is massive. Despite the inroads made by newer architectures, the ancient 8051 continues to hold significant market share. But old 8-bit architectures, such as the 8051 and HC11, were not designed to be programmed in C. Despite this, compiler vendors work hard to make good compilers for these microprocessors. But even with all the ingenuity in the world, compilers eventually bump up against the target's fundamental limitations, producing some surprising results. In this article, we'll look at these results and offer some solutions.

Those of you familiar with the 8051 know that it's not exactly a great architecture from a programmer's perspective. Accordingly, a lot of 8051-based projects end up hungry for every cycle and byte that can be saved. Since I enjoy working on 8-bit designs, I've spent many years wringing the last drop of performance out of such systems-and learning a lot in the process.

If you have a poor software design, inefficient algorithms, or haven't taken the basic steps advocated by the compiler supplier, nothing I've written here will make a difference. However, if you take all the steps advocated by the compiler writer and have optimal algorithms and other necessary conditions, and you still need more performance (but don't want to resort to assembly language) finding the optimal coding constructs can help you get the job done.

On 32-bit systems with a plethora of registers and addressing modes, functionally equivalent coding constructs will probably produce identical performance. On 8-bit systems, however, a subtle change in coding style can affect performance significantly. Consider the following blocks of code, which are functionally equivalent.


// 8 bytes and 42 cycles
for(i = 0; i < 10; i++)
{
_nop_(); 
}

// 15 bytes and 120 cycles i = 0; while (i++ < 10) { _nop_(); }

As the comments show, the difference between the two fragments is significant. Given the large difference, you'd think the compiler vendors would publish a nice list of optimal techniques to which one could refer. Well, I have poured over dozens of compiler manuals and have never come across such a list. I can only assume that either (a) the vendors aren't aware of these differences (which seems hard to believe), or (b) they consider it a sign of weakness in their product, and don't want to publicize the information.

Preliminaries

Working on the adage that "What they won't tell me is probably really worth knowing," I have created just such a list of optimal constructs. This list was hard won and I intend to share it with you. However, a few caveats are in order.

First, this list is only definitive for an 8051 and version 6.02 of the Keil compiler using optimization level 8, favoring execution speed over size. I also used the no integer promote (NOIP) switch. Does this mean that if you are using something else that this article is a waste of your time? I hope not! Many of the results come about because of the 8051 architecture. Thus, if you are using another 8051 compiler, I suspect that the general results will hold true. If you are using another 8-bit machine, such as the HC11 or the PIC then at the very least, the article should give you some food for thought. In either case, the source code for the tests is available at ftp://ftp.embedded.com/pub/2002/11jones. I encourage you to try it out on your system.

Second, I can't hope to cover every possible permutation of each of the scenarios. If you think I've missed an important variant, please add it to the test code and let me know your results.

"Optimal" can mean many things. I think there is a tremendous case for claiming that optimal code is code that is easy to maintain. In this article, however, I use optimal to refer to code that is optimally fast, small, and consistent in its execution time; I err on the side of speed. Generally, however, I've found that the fastest code is also normally the tightest.

On many occasions I have seen compilers do weird things with subtle changes to code. As a result, the information I'll present here should be taken as a "rule of thumb." If you apply the techniques, most of the time you'll get more optimal code. However, if you are trying to wring the last ounce of performance out of your system, you should really check the results for your specific case.

Finally, some of the tests reveal surprising weaknesses in the Keil compiler. It is not my intention to criticize their product. Indeed, I have used their compiler for over ten years, which says a lot about what I think of it. I also suspect that similar compilers would also show their idiosyncrasies when subjected to similar detailed inspection. I'd like to go on the record as saying I think Keil's products are excellent-and put a lot of the bigger names in the industry to shame.

Coding conventions
The test code that produced the results mentioned in this article is available at ftp://ftp.embedded.com/pub/2002/11jones. The code consists of a Vision2 project file, together with a C source file. If you don't have Vision2 then it should be easy enough to port the code to your own development environment.

A central problem in trying to demonstrate the various constructs is to keep the code very simple-but not so simple that the optimizer realizes the code is doing nothing useful, and hence optimizes it away to nothing. To this end, I make extensive use of the intrinsic "function call" _nop_().

In comparing various techniques, and to ensure that every technique gets a fair hearing, I make extensive use of conditional compilation. To try the various techniques, it is of course necessary to set the value of the various manifest constants in order to compile the code one way or another.

In most cases, I report the number of execution cycles, the code size, and the data size of the various fragments. The data size is instructive. However, given that this memory gets heavily overlaid on an 8051, don't get too worked up about differences in this parameter. When relevant, I also report the size of the library code.

In some of the projects, I run a lot of variants in an attempt to deduce patterns and optimal constructs. I usually only report on my conclusions. I encourage you to study the results of the various test cases and see if you agree with my conclusions.

As a final note, the execution times are normally given in cycles assuming a standard 8051 (12 clocks per instruction cycle). When I am looking at the benefits of dual data pointers, I use a processor with a different number of clocks per cycle. In either case, what is really important is the ratio (or difference) of the execution times.

Constructs to be examined
We'll be looking at these areas:

  • looping
  • function calling
  • decision making
  • calling one of N functions
  • pointer manipulation
  • copying
  • bit banging

The section on copying builds on the earlier topics, while the bit-banging section is an overall wrap up. That is, we'll apply some of the techniques to address the problem of addressing a serial device by twiddling port lines.

Looping

All nontrivial programs contain loops. Of these, the most common is the loop that has to execute a fixed number of times. C offers numerous ways to achieve this. To see if there is any significant difference between the various constructs, I put together the looping project. The looping project contains eight different constructs. After testing the eight variants, I achieved a variation in performance that is quite spectacular (2.5:1 in size and 3.5:1 in speed). Furthermore, the best performing construct is not exactly intuitive. First, however, let's look at the worst performing construct:


// Code: 15. Cycles: 124. 
void fne(void)
{
unsigned char i;
i = 0;
while (i++ < 10)
{
_nop_();
}
}

This is a fairly innocuous construct, but for various reasons it causes the compiler to generate a lousy solution. In contrast, three different constructs produced the optimal solution. The following is the most "standard" looking:


// Code: 6. Cycles: 35
void fnc(void)
{
unsigned char i;
for(i = 10; i; i--)
{
_nop_();
}
}

Note that this solution (along with the other two optimal solutions) involves counting down. In contrast, let's look at the results for a standard count-up for-loop:


// Code: 8. Cycles: 46
void fna(void)
{
unsigned char i;
for (i = 0; i < 10; i++)
{
_nop_();
}
}

This requires two more code bytes and an extra cycle per loop iteration. This is an interesting result, one in which the standard construct doesn't produce the optimal code. This result comes about because the 8051 has a Decrement and Jump Not Zero (DJNZ) instruction that is designed for efficient looping. However, it requires the loop variable to count down-hence the for-loop that counts down gives the best performance.

Note, however, that the for-loop that counts down is usually problematic if the loop variable is used within the body of the loop. For example, if we wanted to set the elements of an array to some value, we'd have a problem because the values of the looping variable would be 10 to 1 instead of 9 to 0. If you need to use the looping variable within the body of the loop (which is the more common case), then the standard count-up for-loop construct gives the best results.

It's also worth noting that if you were to use a 16-bit variable as the loop index, the optimal construct would probably change. Since I almost never have to use a 16-bit variable as a loop index, I've decided to leave this case as an exercise for the reader. If anyone does try it, I'd be interested in the results.

Function calling

Calling a function incurs overhead. On modern processors, this typically consists of pushing the required parameters onto a stack and then executing a call. The called function pops the required parameters off of the stack, performs the required work, and then returns. For many of the modern processors, the penalty is not so much the pushing and popping of parameters, as it is the fact that a function call often results in a cache miss and the attendant system slow down.

In the 8051 world, however, parameters are usually passed via registers. (It's horribly inefficient to use the compiler's reentrant model with parameters pushed onto an external stack. Consequently, it should never be used out of choice, except in a case of true need.) However, the 8051 only has a few registers to use to pass parameters. Keil's rules for parameter passing appear in Table 1.

There are several things to note:

  1. At most, only three parameters will be passed in registers.
  2. The order in which arguments are declared affects how many bytes are passed in registers.
  3. At most, only one long or float and only one generic pointer can be passed in registers.
  4. There is no provision for passing other data types such as a structure or a union in registers.

Let's look at the implications of each of these limitations.

Three parameter limit

Passing lots of parameters to a function on an 8-bit machine is not a great idea. Here are some things you can try to get around this limitation:

  • Group data into structures and pass a pointer to the structure.
  • If you need to pass four chars, you can group them as a union of four chars and one long and then pass the long. (Not pretty, but it works.)

Parameter ordering
This is a limitation that often surprises people. Consider a function foo() that requires a char, an int, and a generic pointer to be passed to it.

void foo(char a, int b, int *c);

This will result in all six bytes (one for a, two for b, and three for c) being passed in registers. However, rearranging the parameter order to:

void bar(int *c, int b, char a);

will result in only five bytes being passed in registers, with parameter a being passed in a fixed memory location. This results in a significant performance hit.

The code to demonstrate this is in Listing 1, which displays Test Case 1 of the function calling project. The first method requires three fewer cycles, three fewer code bytes, and six fewer data bytes than the second method. If this function is called often, the overall savings can be significant.

The bottom line is this: if you have a function that is called repeatedly, it is worthwhile to check the parameter order to make sure as many parameters as possible are passed in registers.

Passing pointers
Table 1 shows that only one generic pointer can be passed in registers, whereas up to three memory-space-specific pointers may be passed in registers. This is an unsubtle hint not to use generic pointers. To give you an indication of how severe the penalty is, consider Listing 2, which shows the difference between passing two generic pointers and passing two typed pointers.

The casts in foo() are intended to eliminate the time difference associated with the expression evaluation (the Keil compiler is smart enough to eliminate the overhead associated with the casts). Calling foo() required considerably more resources than calling bar(). This can all be attributed to the fact that bar() required more bytes (six versus four) to be passed, and that three of the bytes were passed in fixed memory locations. The obvious conclusion is to look very hard at any function that requires the passing of more than one generic pointer.

Passing other data types
The register argument rules are quite rigid. If you have the temerity to try and pass a bit or a structure as an argument (even if the structure consumes just two bytes), that parameter and all parameters to the right of it on the argument list are not passed in registers. Thus:

void foo(char a, int b, int *c, bit d);

will result in the first three parameters being passed in registers. However,

void bar(bit d, char a, int b, int *c);

will result in none of the parameters being passed in registers. The code to demonstrate this is Test Case 3 of the function calling project. The performance difference is significant and is shown in the Table 2.

Similarly, the following code


typedef struct 
{
char e;
char f;
} STYPE;

void foo(STYPE a, int b);

also results in none of the parameters being passed in registers, since STYPE defines an unsupported type. The code to demonstrate this is in Listing 3.

In this case, foo() uses a massive amount of resources compared to bar(). Now admittedly, this has a lot to do with the compiler calling a function to copy the structure-rather than being smart enough to realize the structure is so small that it could easily fit into the available registers. However, I think it demonstrates that passing structures to a function is a really bad idea.

On a related note, I have also discovered that in certain circumstances, even when all the parameters may be passed in registers, changing the ordering of parameters sometimes makes a difference. Unfortunately, I am unable to come up with a rule of thumb to guide you on this one. All I can do is suggest you experiment on critical functions.

Return types
No discussion of function calling would be complete without mention of the impact of the data type that is returned. The Keil documentation shows that all basic types (char, int, float, *) will be returned in registers. A bit will be returned in the Carry flag. However, they make no mention of what happens if you try and return a structure. We'll examine the case where you wish to get a copy of the data structure. Listing 4 examines this topic.

Thus, the simpler construct produced marginally better code. This is the first case of many where simplicity produces the better results.

As an aside, this example again demonstrates the inefficiency of the compiler's data structure copy routine. For a trivial case like this, it would of course be better to return a pointer and perform an explicit element-by-element copy. I just find it surprising that the compiler doesn't contain some form of heuristic algorithm for determining the optimal solution.

Function calling summary
Take a hard look at all of your function prototypes. The following changes will usually result in worthwhile performance increases:

  1. Convert generic pointer arguments to typed pointer arguments.
  2. Make sure that any bit or structure arguments are passed last.
  3. Notwithstanding #2, don't pass structures!
  4. Minimize the number of arguments-but not at the expense of violating rule 3.
  5. Look at the ordering of your parameters to see if a reordering will result in all (or more) of the parameters being passed in registers.
  6. Experiment! If you have a function that is called regularly, experiment with the order of its parameters. You may be surprised at the gains you realize.
  7. If you need to assign a structure from a function call, then return the structure instead of a pointer to the structure.

Decision making

All programs are littered with if-else and switch-case constructs. On the face of it, there isn't much one can do to change the system performance. However, certain questions that should be answered include:

  • Is the ternary operator more efficient than an if-else?
  • What's the most efficient way to test something for being zero?
  • How does an index into an array of pointers to functions compare in performance to a switch statement?

The ternary operator
The ternary operator is terse, slightly cryptic and a favorite of mine. Examination of K&R shows that the authors use the ternary operator very liberally indeed. But does the ternary operator actually accomplish anything? Well, the answer with the Keil compiler is an emphatic no, since it produces larger code and longer execution time in every test case I tried. Listing 5 shows Test Case 1 of the decision making project.

The code in main is designed to test the functions "both ways." Other tests, involving bits, floats, and so on, all produced similar results. This is the second example where a simpler coding style produces better code.

Evaluating a variable for TRUE / FALSE
This issue seems to occur frequently. In a nutshell, we are interested in determining if a variable is zero (FALSE) or nonzero (TRUE). This seems so trivial that you wouldn't imagine that performance would vary widely depending upon the constructs used. Think again!

Test Code 2 from the decision-making project is shown in Listing 6. In this case, I have three functions whose sole task is to test a byte variable and return 0 if it is zero, and 1 if it is not. The result is represented as a bit. The first function solves the problem by using the most straightforward representation. For simplicity and clarity, it is hard to beat. It also has the possible benefit of having equal execution time regardless of the value of the parameter. The second example is more obfuscated while giving no real benefit (a byte saved versus increased and variable execution times). The third function is either incredibly simple or ridiculously obtuse, depending on your perspective. It does, however, give dramatically better performance and comes about from an exceedingly neat trick by the compiler.

This is an interesting example, since I think it demonstrates two points:

As we have seen in other examples, the compiler seems to generate better code when simpler constructs are used. Hence example 1 is better than example 2.

Example 3 gives the best result because it makes use of a type cast. Type casts occur so frequently that I suspect the compiler writers have put a lot of effort into generating optimal coding constructs for them.

Summary
From the preceding examples, we can come up with a few rules of thumb for decision-making:

  1. Avoid the use of the ternary operator.
  2. When testing for TRUE / FALSE, omit all explicit tests and rely on the compiler to cast the variable to a Boolean.
1 | 2

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Ready to take that job and shove it?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS


 :