Slow and Steady Never Lost the Race
Internet Appliance Design


You can't go anywhere these days without hearing or reading about the “emerging post-PC era.” Although the treatment in the popular press often annoys me (to read it, you'd think embedded computing was a concept born yesterday), I do tend to agree that we are in a period of significant changes in how computers are integrated into our society. However, I can't decide which of the changes described by the popular press will be the most important in the long run. In fact, many of the changes anticipated there have already been going on for quite some time. Specifically, the miniaturization of computers and their transformation from general-purpose devices into devices that serve specific needs have both been going on for decades.
I think the single fact that underlies all of this talk of a post-PC era is that the demand for PCs is flattening. The PC as a product class doesn't seem able to make further inroads into consumer homes based solely on lower prices and increased computing power. That old successful formula, and the high profit margins that went with it, are seemingly dead. Most of today's computer buyers already have PCs at home and are reasonably satisfied with the amount of bang they are getting for their buck. And, more and more, all they really want is the cheapest PC in the store anyway.
However, one thing that computer buyers still aren't satisfied with is the experience of owning a PC. There are hardware upgrades, operating system upgrades, new device drivers, new application software, and new application software versions, all of which must be installed by the untrained owner. Some of these installations and upgrades have ripple effects, requiring other installations and/or upgrades to be done at the same time. And far too often an installation or upgrade will fail as the new hardware or software is apparently “rejected” by the other hardware and software already in the system.
Perhaps one of the reasons that our society is looking toward embedded systems designers for leadership going into the post-PC era is that we've got a reputation for building computer systems that are stable and easily maintained. It may even be that this reputation is well deserved. I have yet, fortunately, to own a TV, VCR, stereo, digital watch, or microwave oven that required a software or hardware upgrade. However, let's not fool ourselves into thinking we're somehow better than the engineers and computer scientists working in the PC industry. We too are capable of creating systems that crash and that are difficult to maintain. And the more complicated the assignment, the more likely that is.
In my opinion, three key factors led to the complexity of maintaining a PC. The first is the requirement of backward compatibility. Each new PC operating system or processor is expected to run applications designed years before for a very different generation of computers. The second factor is the overall complexity of the problem. The designers of PCs, their operating systems, and applications have tried to capture in one system all of the possible ways a computer can be used. (The task of creating a video game console is much simpler than that of creating a device that can run both business applications and play video games.) Last, but certainly not least, the third factor is the “Christmas buying season.” The annual rush to release new hardware and software in time for the Christmas buying season has inspired many engineering compromises. More recently, the phrase “Internet time” has made this into a year-round problem for all of us. If we are indeed going to take a leadership role in the post-PC era, we need to be humble and to think and act carefully when designing and implementing complex systems.
If we do those things, I think we can be counted on to produce a better class of computers that, though less general purpose in nature, are ultimately far more useful and likeable than the personal computers of today. Always remember that you are no smarter than your PC-industry counterpart, only more cautious. And that caution has served you and your past products well.
In this month's Internet Appliance Design section, you'll find a pair of articles that fit right into this theme. The first article is the second part of Thomas Herbert's introduction to the TCP/IP protocol suite. This month Tom takes a look at the steps embedded developers can take to make a given TCP/IP stack fit into resource-constrained systems and to improve its overall performance and reliability. The second article, by John Meadows, is an introduction to a different kind of protocol. The JetSend protocol was designed specifically to make the sharing of information between all sorts of different products simple. No special device drivers or format-specific processing software must be present on the receiving system. The result of this could be applications that require no upgrades. My own contribution to this month's section is the final installment of my three-part discussion of checksums.

Easier said than done

As I discussed last month, CRCs are among the strongest checksum algorithms available for detecting and/or correcting errors in communications packets. However, the mathematics used to compute CRCs doesn't map easily into software. In the best possible scenario, CRC computations can be done in hardware, with the results passed up to the software for placement into an outgoing packet or verification of an incoming packet's contents. Unfortunately, this is not always possible. This month I'm going to complete my discussion of checksums by showing you how to implement CRCs in software. I'll start with a naïve implementation and gradually improve the efficiency of the code as I go along.
However, I'm going to keep the discussion at the level of the C language, so further steps could be taken to improve the efficiency of the final code simply by moving into the assembly language of your particular processor.
For most software engineers, the overwhelmingly confusing thing about CRCs is their implementation. Knowing that all CRC algorithms are simply long division algorithms in disguise doesn't help. Modulo-2 binary division doesn't map particularly well to the instruction sets of off-the-shelf processors. For one thing, generally no registers are available to hold the very long bit sequence that is the numerator. For another, modulo-2 binary division is not the same as ordinary division. So even if your processor has a division instruction, you won't be able to use it.

Modulo-2 binary division

Before writing even one line of code, let's first examine the mechanics of modulo-2 binary division. We'll use the example in Figure 1 to guide us. The number to be divided is the message augmented with zeros at the end. The number of zero bits added to the message is the same as the width of the checksum (what I've been calling c); in this case four bits were added. The divisor is a (c + 1)-bit number also known as the generator polynomial.

Figure 1: An example of modulo-2 binary division

The modulo-2 division process is defined as follows:

1. If the most significant bit of the remainder is a one, set the corresponding bit of the quotient to one and XOR the remainder with the divisor; otherwise, set the corresponding bit of the quotient to zero.
2. Shift the next bit of the augmented message into the remainder.
3. Repeat steps 1 and 2 until all of the bits of the augmented message have been shifted in.
The final value of the remainder is the CRC of the given message. What's most important to notice at this point is that we never use any of the information in the quotient, either during or after computing the CRC, so we won't actually need to track the quotient in our software implementation. Also note here that the result of each XOR with the generator polynomial is a remainder that has zero in its most significant bit, so we never lose any information when the next message bit is shifted into the remainder. The quotient-tracking parts of the above algorithm have no effect on the final remainder, so those steps can be ignored in an actual CRC implementation.

Bit by bit

Listing 1 contains a naïve software implementation of the CRC computation just described. It simply attempts to implement that algorithm as it was described above for this one particular generator polynomial. Even though the unnecessary steps have been eliminated, it's extremely inefficient. Multiple C statements (at least the decrement and compare, binary AND, test for zero, and left shift operations) must be executed for each bit in the message. Given that this particular message is only eight bits long, that might not seem too costly. But what if the message contains several hundred bytes, as is typically the case in a real-world application? You don't want to execute dozens of processor opcodes for each byte of input data.

Listing 1: A first CRC implementation
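The listing itself isn't reproduced here, but a minimal sketch of such a naïve, bit-at-a-time routine might look like the following. (The crc_naive, POLY, and WIDTH names are my own, and I've assumed for illustration that the four-bit example uses the generator polynomial 10011; the actual listing may differ.)

```c
#include <stdint.h>

#define POLY   0x13u  /* the 5-bit generator polynomial 10011 (assumed) */
#define WIDTH  4      /* checksum width c, in bits */

/* Compute the 4-bit CRC of a single 8-bit message, exactly as in the
 * long-division picture: augment the message with WIDTH zero bits,
 * then XOR the generator polynomial in wherever the leading bit of
 * the working remainder is a one. */
uint8_t crc_naive(uint8_t message)
{
    /* The message, augmented with WIDTH zero bits. */
    uint16_t remainder = (uint16_t)message << WIDTH;

    /* Walk the divisor across the augmented message, one bit
     * position at a time, starting from the most significant end. */
    for (int bit = 7; bit >= 0; --bit)
    {
        if (remainder & (1u << (bit + WIDTH)))
        {
            remainder ^= (uint16_t)POLY << bit;
        }
    }

    /* Everything above the low WIDTH bits has been cancelled out;
     * what remains is the CRC. */
    return (uint8_t)(remainder & 0x0Fu);
}
```

Each XOR cancels the current leading one, just as each subtraction step does in the long division of Figure 1.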
Cleaning up

Before we start making this more efficient, the first thing to do is to clean this naïve routine up a bit. In particular, let's start making some assumptions about the applications in which it will most likely be used. First, let's assume that our CRCs are always going to be eight-, 16-, or 32-bit numbers. In other words, the remainder can be manipulated easily in software. That means that the generator polynomials will be nine, 17, or 33 bits wide, respectively. At first it seems we may be stuck with unnatural sizes and will need special register combinations, but remember these two facts:
1. We already have the information in the uppermost bit of the generator polynomial.
2. We don't need that bit for the XOR.

Given those two facts, the polynomial can also be stored in an eight-, 16-, or 32-bit register; we can simply discard the most significant bit. The register size that we use will always be equal to the width of the CRC we're calculating.
As long as we're cleaning up the code, we should also recognize that most CRCs are computed over fairly long messages. The entire message can usually be treated as an array of data bytes. The CRC algorithm should then be iterated over all of the data bytes, as well as the bits within those bytes. The result of making these two changes is the code shown in Listing 2. This implementation of the CRC calculation is still just as slow as the previous one. However, it is far more portable and can be used to compute a number of different CRCs of various widths.

Listing 2: A more portable CRC implementation
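The listing isn't reproduced above, so here is a sketch of what such a portable, byte-array implementation conventionally looks like. (The crc typedef and the WIDTH, TOPBIT, POLYNOMIAL, and crcSlow names are conventions of my own choosing; 0x1021 is the well-known CRC-CCITT generator with its always-one top bit discarded.)

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t crc;                 /* change to uint8_t or uint32_t for other widths */

#define WIDTH       (8 * sizeof(crc))
#define TOPBIT      ((crc)1 << (WIDTH - 1))
#define POLYNOMIAL  ((crc)0x1021)     /* CRC-CCITT generator, implied top bit dropped */

crc crcSlow(uint8_t const message[], size_t nBytes)
{
    crc remainder = 0;

    for (size_t byte = 0; byte < nBytes; ++byte)
    {
        /* Bring the next data byte into the top of the remainder. */
        remainder ^= (crc)message[byte] << (WIDTH - 8);

        /* Perform modulo-2 division, one bit at a time. */
        for (uint8_t bit = 8; bit > 0; --bit)
        {
            if (remainder & TOPBIT)
            {
                remainder = (remainder << 1) ^ POLYNOMIAL;
            }
            else
            {
                remainder = (remainder << 1);
            }
        }
    }

    /* The final remainder is the CRC. */
    return remainder;
}
```

Because the top bit of the polynomial is implied, the entire remainder fits in a register exactly as wide as the CRC itself.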
Byte by byte

The most common way to improve the efficiency of the CRC calculation is to throw memory at the problem. For a given input remainder and generator polynomial, the output remainder will always be the same. If you don't believe me, just reread that sentence as “for a given dividend and divisor, the remainder will always be the same.” It's true. So it's possible to precompute the output remainder for each of the possible byte-wide input remainders and store the results in a lookup table. That lookup table can then be used to speed up the CRC calculations for a given message. The speedup is realized because the message can now be processed byte by byte, rather than bit by bit.
The code to precompute the output remainders for each possible input byte is shown in Listing 3. The computed remainder for each possible byte-wide dividend is stored in the array crcTable[]. In practice, the crcInit() function could either be called during the target's initialization sequence (thus placing crcTable[] in RAM) or it could be run ahead of time on your development workstation with the results stored in the target device's ROM.

Listing 3: Computing the CRC lookup table
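A sketch of what that table-building routine might look like, reusing the same assumed names as in the earlier sketch:

```c
#include <stdint.h>

typedef uint16_t crc;

#define WIDTH       (8 * sizeof(crc))
#define TOPBIT      ((crc)1 << (WIDTH - 1))
#define POLYNOMIAL  ((crc)0x1021)

crc crcTable[256];

/* Precompute the remainder of dividing each possible byte-wide
 * dividend (placed in the top byte of the remainder) by the
 * generator polynomial. */
void crcInit(void)
{
    for (int dividend = 0; dividend < 256; ++dividend)
    {
        /* Start with the dividend followed by zeros. */
        crc remainder = (crc)dividend << (WIDTH - 8);

        /* Perform modulo-2 division, one bit at a time. */
        for (uint8_t bit = 8; bit > 0; --bit)
        {
            remainder = (remainder & TOPBIT)
                      ? (crc)((remainder << 1) ^ POLYNOMIAL)
                      : (crc)(remainder << 1);
        }

        /* Store the result into the table. */
        crcTable[dividend] = remainder;
    }
}
```

The inner loop is the same eight-step division as before; it just runs once per possible input byte, ahead of time, instead of once per message byte.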
Of course, whether it is stored in RAM or ROM, a lookup table by itself is not that useful. You'll also need a function to compute the CRC of a given message that is somehow able to make use of the values stored in that table. Without going into all of the mathematical details of why this works, suffice it to say that the previously complicated modulo-2 division can now be implemented as a series of lookups and XORs. (In modulo-2 arithmetic, XOR is both addition and subtraction.) A function that uses the lookup table contents to compute a CRC more efficiently is shown in Listing 4. The amount of processing to be done for each byte is substantially reduced.

Listing 4: A more efficient CRC implementation
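A sketch of such a table-driven routine follows; the table setup is repeated so that the fragment stands alone (again, the names are my own conventions, not necessarily those of the listing):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t crc;

#define WIDTH       (8 * sizeof(crc))
#define TOPBIT      ((crc)1 << (WIDTH - 1))
#define POLYNOMIAL  ((crc)0x1021)

static crc crcTable[256];

/* Fill crcTable[] as sketched earlier: one precomputed remainder per
 * possible byte-wide dividend. */
void crcInit(void)
{
    for (int dividend = 0; dividend < 256; ++dividend)
    {
        crc remainder = (crc)dividend << (WIDTH - 8);
        for (uint8_t bit = 8; bit > 0; --bit)
        {
            remainder = (remainder & TOPBIT)
                      ? (crc)((remainder << 1) ^ POLYNOMIAL)
                      : (crc)(remainder << 1);
        }
        crcTable[dividend] = remainder;
    }
}

/* Divide the message by the generator polynomial one byte at a time,
 * using a table lookup in place of eight single-bit division steps. */
crc crcFast(uint8_t const message[], size_t nBytes)
{
    crc remainder = 0;

    for (size_t byte = 0; byte < nBytes; ++byte)
    {
        /* XOR the next byte into the top of the remainder, then look
         * up the effect of dividing that byte out. */
        uint8_t data = message[byte] ^ (uint8_t)(remainder >> (WIDTH - 8));
        remainder = crcTable[data] ^ (crc)(remainder << 8);
    }

    return remainder;
}
```

For any message and generator polynomial, crcFast() should produce exactly the same remainder as the bit-at-a-time version; only the amount of work per byte changes.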
As you can see from the code in Listing 4, a number of fundamental operations (left and right shifts, XORs, lookups, and so on) still must be performed for each byte even with this lookup table approach. So to see exactly what has been saved (if anything), I compiled both crcSlow() and crcFast() with IAR's C compiler for the PIC family of eight-bit RISC processors.[1] I figured that compiling for such a low-end processor would give us a good worst-case comparison for the numbers of instructions to do these different types of CRC computations. The results of this experiment were as follows:
So, at least on one processor family, switching to the lookup table approach results in a more than fivefold performance improvement. That's a pretty substantial gain, considering that both implementations were written in C. A bit more could probably be done to improve the execution speed of this algorithm if an engineer with a good understanding of the target processor were assigned to hand-code or tune the assembly code. My somewhat-educated guess is that another twofold performance improvement might be possible. Actually achieving that is, as they say in textbooks, left as an exercise for the curious reader.

CRC standards and parameters

Now that we've got our basic CRC implementation nailed down, I want to talk about the various types of CRCs that you can compute with it. As I mentioned last month, several mathematically well-understood and internationally standardized CRC generator polynomials exist, and you should probably choose one of those rather than risk inventing something weaker. In addition to the generator polynomial, each of the accepted CRC standards also includes certain other parameters that describe how it should be computed. Table 1 contains the parameters for three of the most popular CRC standards.
Two of these parameters are the initial remainder and the final XOR value. The purpose of these two c-bit constants is similar to the final bit inversion step we added to the sum-of-bytes checksum algorithm two months ago.[2] Each of these parameters helps eliminate one very special, though perhaps not uncommon, class of ordinarily undetectable difference. In effect, they bulletproof an already strong checksum algorithm. To see what I mean, consider a message that begins with some number of zero bits. The remainder will never contain anything other than zero until the first one in the message is shifted into it.
That's a dangerous situation, since packets beginning with one or more zeros may be completely legitimate and a dropped or added zero would not be noticed by the CRC. (In some applications, even a packet of all zeros may be legitimate!) The simple way to eliminate this weakness is to start with a nonzero remainder. The parameter called initial remainder tells you what value to use for a particular CRC standard. And only one small change is required to the crcSlow() and crcFast() functions:

crc remainder = INITIAL_REMAINDER;

The final XOR value exists for a similar reason. To implement this capability, simply change the value that's returned by crcSlow() and crcFast() as follows:

return (remainder ^ FINAL_XOR_VALUE);
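Folding both parameters into the bit-by-bit sketch from earlier gives something like the following. (The INITIAL_REMAINDER and FINAL_XOR_VALUE constants shown are the values commonly published for the CRC-CCITT standard, used here only as an illustration.)

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t crc;

#define WIDTH              (8 * sizeof(crc))
#define TOPBIT             ((crc)1 << (WIDTH - 1))

/* CRC-CCITT parameters, as commonly published. */
#define POLYNOMIAL         ((crc)0x1021)
#define INITIAL_REMAINDER  ((crc)0xFFFF)
#define FINAL_XOR_VALUE    ((crc)0x0000)

crc crcSlow(uint8_t const message[], size_t nBytes)
{
    /* Change #1: start from the standard's initial remainder, not zero. */
    crc remainder = INITIAL_REMAINDER;

    for (size_t byte = 0; byte < nBytes; ++byte)
    {
        remainder ^= (crc)message[byte] << (WIDTH - 8);
        for (uint8_t bit = 8; bit > 0; --bit)
        {
            remainder = (remainder & TOPBIT)
                      ? (crc)((remainder << 1) ^ POLYNOMIAL)
                      : (crc)(remainder << 1);
        }
    }

    /* Change #2: XOR the standard's final XOR value into the result. */
    return (crc)(remainder ^ FINAL_XOR_VALUE);
}
```

Note that the nonzero initial remainder means leading zero bytes in the message now perturb the remainder, which is exactly the point.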
If the final XOR value consists of all ones (as it does in the CRC-32 standard), this extra step will have the same effect as complementing the final remainder. However, implementing it this way allows any possible value to be used in your specific application.
In addition to these two simple parameters, two others exist that impact the actual computation. These are the binary values reflect data and reflect remainder. The basic idea is to reverse the bit ordering of each byte within the message and/or the final remainder. The reason this is sometimes done is that a good number of the hardware CRC implementations operate on the “reflected” bit ordering of bytes that is common with some UARTs. Two slight modifications of the code are required to prepare for these capabilities. What I've generally done is to implement one function and two macros. This code is shown in Listing 5. The function is responsible for reflecting a given bit pattern. The macros simply call that function in a certain way.

Listing 5: Reflection macros and function
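The listing isn't reproduced above, but a sketch of such a reflection function and its two macros might look like this (the names are my own conventions):

```c
#include <stdint.h>

typedef uint16_t crc;

#define WIDTH  (8 * sizeof(crc))

/* Reorder the low nBits bits of data so that bit 0 becomes bit
 * nBits-1, bit 1 becomes bit nBits-2, and so on. */
static uint32_t reflect(uint32_t data, uint8_t nBits)
{
    uint32_t reflection = 0;

    for (uint8_t bit = 0; bit < nBits; ++bit)
    {
        /* If the current low bit is set, mirror it into place. */
        if (data & 1u)
        {
            reflection |= 1u << ((nBits - 1u) - bit);
        }
        data >>= 1;
    }
    return reflection;
}

/* One macro for each point where reflection may be needed: incoming
 * data bytes and the final remainder. */
#define REFLECT_DATA(X)       ((uint8_t) reflect((X), 8))
#define REFLECT_REMAINDER(X)  ((crc) reflect((X), WIDTH))
```

Because each call site goes through a macro, either kind of reflection can be disabled later simply by redefining that macro as (X).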
By inserting the macro calls at the two points where reflection may need to be done, it is easy to turn reflection on and off. To turn either kind of reflection off, simply redefine the appropriate macro as (X). That way, the unreflected data byte or remainder will be used in the computation, with no overhead cost. Also note that, for efficiency reasons, it may be desirable to compute the reflections of all 256 possible data bytes in advance and store them in a table, then redefine the REFLECT_DATA() macro to use that lookup table. Tested, full-featured implementations of both crcSlow() and crcFast() are available for download from ftp://ftp.embedded.com/pub/2000/Barr/. These implementations include the reflection capabilities I just described and can be used to implement any parameterized CRC formula. Simply change the constants and macros as necessary.
The final parameter that I've included in Table 1 is a check value for each CRC standard. This is the CRC result that's expected for the simple ASCII test message “123456789.” To test your implementation of a particular standard, simply invoke your CRC computation on that message and check the result:

crcInit();
checksum = crcFast("123456789", 9);
If checksum has the correct value after this call, then you know your implementation is correct. This is a handy way to ensure compatibility between two communicating devices with different CRC implementations or implementors. 
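Putting the pieces together, here is a sketch of a complete self-test for one reflected standard, using the parameter values commonly published for CRC-16 (generator 0x8005, zero initial remainder and final XOR value, both reflections enabled, check value 0xBB3D); these values are stated as an assumption from general knowledge rather than read out of Table 1, which isn't reproduced here:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t crc;

#define WIDTH              (8 * sizeof(crc))
#define TOPBIT             ((crc)1 << (WIDTH - 1))

/* CRC-16 parameters, as commonly published. */
#define POLYNOMIAL         ((crc)0x8005)
#define INITIAL_REMAINDER  ((crc)0x0000)
#define FINAL_XOR_VALUE    ((crc)0x0000)
#define CHECK_VALUE        ((crc)0xBB3D)

static uint32_t reflect(uint32_t data, uint8_t nBits)
{
    uint32_t reflection = 0;
    for (uint8_t bit = 0; bit < nBits; ++bit)
    {
        if (data & 1u)
        {
            reflection |= 1u << ((nBits - 1u) - bit);
        }
        data >>= 1;
    }
    return reflection;
}

#define REFLECT_DATA(X)       ((uint8_t) reflect((X), 8))
#define REFLECT_REMAINDER(X)  ((crc) reflect((X), WIDTH))

crc crcSlow(uint8_t const message[], size_t nBytes)
{
    crc remainder = INITIAL_REMAINDER;

    for (size_t byte = 0; byte < nBytes; ++byte)
    {
        /* Reflect each data byte before folding it in. */
        remainder ^= (crc)REFLECT_DATA(message[byte]) << (WIDTH - 8);

        for (uint8_t bit = 8; bit > 0; --bit)
        {
            remainder = (remainder & TOPBIT)
                      ? (crc)((remainder << 1) ^ POLYNOMIAL)
                      : (crc)(remainder << 1);
        }
    }

    /* Reflect the final remainder, then apply the final XOR value. */
    return (crc)(REFLECT_REMAINDER(remainder) ^ FINAL_XOR_VALUE);
}
```

If crcSlow((uint8_t const *)"123456789", 9) equals CHECK_VALUE, the implementation agrees with the published standard; any mismatch in polynomial, initial remainder, reflection, or final XOR will show up immediately as a wrong check value.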