FOR
THE LOVE OF THE GAME
Michael Barr
For better or worse, the game of embedded software development is changing. Nowhere else are the changes more evident than in the area of internet appliance design. In fact, you might say that in some ways, the emergence of internet appliances is to blame. The incredible momentum of the Internet has been a factor
behind all sorts of societal changes. And its impact is also being felt in the field of embedded programming.
Until recently, the vast majority of embedded systems were developed by small teams of engineers. In fact, many of the engineering teams Ive been a part of have had just two membersone hardware engineer and one software engineer. The beauty of working on such a small team is that the design and behavior of all of the software and hardware in the system is easily understood. This makes
it possible to rigorously test the system for flaws and to certify that it will run properly and predictably in the field.
Unfortunately, as we are increasingly driven to engineer more complex products and to bring them to market in far less time, we are forced to work in larger teams and to incorporate software modules developed by others. It is simply not possible to implement a complex software component like a TCP/IP stack on your own and still add value to a product that must ship yesterday to find
a market. This is having a tangible impact on the experience of embedded software development.
Personally, I think such changes are for the worse. What I love most about embedded software developmentindeed, what brought me into this field in the first placeis the simplicity and cleanliness of the job. I like working close to the hardware and knowing why each and every line of software is in the system. In short, I like building useful products from the ground up and with my bare hands. I
dont like being dependent on software modules developed by others to behave predictably and reliably. And I dont like working on engineering teams that are so large that human communication breakdowns may lead to product failures in the field. Unfortunately, I dont think we really have much choice.
These changes are coming to our industry whether we like them or not. The increasingly common requirement to connect our systems to other embedded and non-embedded computers is one of the factors driving all of this. Its up to each of us to find ways to maintain the reliability for which our embedded systems have developed a reputation, even in the face of dramatic changes in how they are built.
In this months Internet Appliance Design section youll find two articles about technologies youre far more likely to buy than develop on your own. Thomas Herberts introduction to TCP/IP is the first of a two-part article that will help you understand what youre buying. The other article, by Steve Furr, is about adding real-time capabilities to the Java platform. Hell help us understand what changes are necessary so that we can all better understand a pair of recent proposals to make Java more compatible with the deterministic demands of embedded systems. My own column, however, will continue to focus on things you can do yourselfbecause thats the part of this game that I love the best.
Strength in
numbers
Checksum algorithms based solely on addition are easy to implement and can be executed quickly. However, many common types of transmission errors cannot be detected when such simple checksums are used. This month, Ill introduce a stronger type of checksum, commonly known as a cyclic redundancy code (CRC), that is based on division instead. The error detection capabilities of a CRC make it a much stronger checksum and, therefore, often worth the price of additional computational
complexity.
Last month, I introduced you to some of the basic terminology and algorithms associated with checksums. My reason for doing that lay primarily in the discussion to come. The Internet checksum algorithm is certainly relevant to this section of our magazine, but you probably didnt need my help to understand the what, why, and how of that particular algorithm. On the other hand, most engineers could probably use some help understanding the what, why, and how of CRCs. Though used frequently in all types of computer systems, few engineers seem to really understand CRCs or their various implementations.
For my own part, I once understood only enough about CRCs to implement a particular calculation required for a project I was working on. A few years later, when I was writing my book, I made it a goal to understand the theory well enough to include source code that could be easily adapted to most of the common CRC calculations. I didnt want the readers of my book to have to go through the
steep learning curve thats usually required. However, in hindsight, I dont think I spent enough time in that book explaining the what and why behind that particular implementation. This column and my next are an attempt to do better with the what and why of CRCs and to provide a few more options for the how.
Error correction
Other than a brief mention near the beginning, I gave short shrift to error correction codes in last months column. Instead I talked mostly about error detection codes. In truth, theres not really much difference between an error detection code and an error correction code. In both cases, you take the message you want to send, compute some mathematical function over its bits (usually called a checksum), and append the resulting bits to the message during transmission.
The difference between error detection and error correction lies primarily in what happens next. If the receiving system detects an
error in the packetfor example, the received checksum bits do not accurately describe the received message bitsit may either discard the packet and request a retransmission (error detection) or attempt to repair it on its own (error correction). If packet repairs are to be attempted, the checksum is said to be an error correcting code.
The key to repairing packets is a stronger checksum algorithm. Specifically, whats needed is a checksum algorithm that distributes the set of valid bit
sequences randomly and evenly across the set of possible bit sequences. For example, if the minimum number of bits that must change to turn any one valid packet into some other valid packet is seven, then any packet with three or fewer bit inversions can be corrected automatically by the receiver. (If four bit errors occur during transmission, the packet will seem closer to some other valid packet with only three errors in it.) In this case, error correction can only be done for up to three bit errors, while
error detection can be done for up to six.
This spreading of the valid packets across the space of possible packets can be measured by the Hamming distance, which is the number of bit positions in which any two equal-length packets differ. In other words, its the number of bit errors that must occur if one of those packets is to be incorrectly received as the other. A simple example is the case of the two binary strings 1001001 and 1011010, which are separated by a Hamming distance of three. (To see which bits must be changed, simply XOR the two strings together and note the bit positions that are set. In our example, the result is 0010011.)
The beauty of all this is that the mere presence of an error detection or correction code within a packet means that not all of the possible packets are valid.
 Figure 1: A packet of Information
Figure 1 shows what a packet looks like after a checksum has been appended to it. Since the checksum bits contain redundant information (they are completely a function of the message bits that precede them), not all of the 2(m+c) possible packets are valid packets. In fact, the stronger the checksum algorithm used, the greater the number of invalid packets will be.
By adjusting the ratio of m and c and carefully selecting the checksum algorithm, we can increase the number of bits that must be in error for any one valid packet to be inadvertently changed into another valid packet during
transmission or storage and, hence, the likelihood of successful transmission. In essence, what we want to do is to maximize the minimum Hamming distance across the entire set of valid packets. In other words, to distribute the set of 2m valid packets as evenly as possible across the set of possible bit sequences of length m + c. This has the useful real-world effect of increasing the percentage of detectable and/or correctable errors.
Binary long division
It turns out that once you start to focus on maximizing the minimum Hamming distance across the entire set of valid packets, it becomes obvious that simple checksum algorithms based on binary addition dont have the necessary properties. A change in one of the message bits does not affect enough of the checksum bits during addition. Fortunately, you dont have to develop a better checksum algorithm on your own. Researchers figured out long ago that modulo-2 binary division is the simplest mathematical operation that provides the necessary properties for a strong checksum.
All of the CRC formulas you will encounter are simply checksum algorithms based on modulo-2 binary division. Though some differences exist in the specifics across different CRC formulas, the basic mathematical process is always the same:
The message bits are appended with c zero bits; this augmented message is the dividend
A predetermined c + 1-bit binary sequence, called the generator polynomial, is the divisor
The checksum is the c-bit remainder that results from the division operation
In other words, you divide the augmented message by the generator polynomial, discard the quotient, and use the remainder as your checksum. It turns out that the mathematically appealing aspect of division is that remainders fluctuate rapidly as small numbers of bits within the message are changed. Sums, products, and quotients do not share this property. To see what I mean, look at the example of modulo-2 division in Figure 2. In this example, the message contains eight bits while the checksum is to have four bits. As the division is performed, the remainder takes the values 0111, 1111, 0101, 1011, 1101, 0001, 0010, and, finally, 0100. The final remainder becomes the checksum for the given message.
 Figure 2: An example of Modulo-2 binary division
For most people, the overwhelmingly confusing thing about CRCs is the implementation. Knowing that all CRC algorithms are simply long division algorithms in disguise doesnt help. Modulo-2 binary division doesnt map well to the instruction sets of general-purpose processors. So, whereas the implementation of a checksum algorithm based on addition is straightforward, the implementation of a binary division algorithm with an m + c-bit numerator and a c + 1-bit denominator is
nowhere close.1 For one thing, there arent generally any m+ c or c + 1-bit registers in which to store the operands. Well see how to deal with this problem and many others next month, when I talk about various software implementations of the CRC algorithms. For now, lets just focus on their strengths and weaknesses as potential checksums.
Generator polynomial?
Why is the predetermined c + 1-bit divisor thats used to calculate a CRC called a generator polynomial? In my opinion, far too many explanations of CRCs actually try to answer that question. This leads their authors and readers down a long path that involves tons of detail about polynomial arithmetic and the mathematical basis for the usefulness of CRCs. This academic stuff is not important for understanding CRCs sufficiently to implement and/or use them and serves only to create potential confusion. So Im not going to answer that question here.2 Suffice it to say here only that the divisor is sometimes called a generator polynomial and that you should never make up the divisors value on your own. Several mathematically well-understood generator polynomials have been adopted as parts of various international communications standards; you should always use one of those. If you have a background in polynomial arithmetic then you know that certain generator polynomials are better than others for producing strong checksums. The ones that have been adopted internationally are among the best of these.
Table 1 lists some of the most commonly used generator polynomials for 16- and 32-bit CRCs. Remember that the width of the divisor is always one bit wider than the remainder. So, for example, youd use a 17-bit generator polynomial whenever a 16-bit checksum is required.
As is the case with other types of checksums, the width of the CRC plays an important role in the error detection capabilities of the algorithm. Ignoring special types of errors that are always detected by a particular checksum algorithm, the percentage of detectable errors is limited strictly by the width of a checksum. A checksum of c bits can only take one of 2c unique values. Since the number of possible messages is significantly larger than that, the potential exists for two or more messages to have an identical checksum. If one of those messages is somehow transformed into one of the others during transmission, the checksum will appear correct and the receiver will unknowingly accept a bad message. The chance of this happening is directly related to the width of the checksum. Specifically, the chance of such an error is 1/2c. Therefore, the probability of any random error being detected is 11/2c.
To repeat, the probability of detecting any random error increases as the width of the checksum increases. Specifically, a 16-bit checksum will detect 99.9985% of all errors. This is far better than the 99.6094% detection rate of an eight-bit checksum, but not nearly as good as the 99.9999% detection rate of a 32-bit checksum. All of this applies to both CRCs and addition-based checksums. What really sets CRCs apart, however, is the number of special cases that can be detected 100% of the time. For example, I pointed out last month that two opposite bit inversions (one bit becoming 0, the other becoming 1) in the same column of an addition would cause the error to be undetected. Well, thats not the case with a CRC.
|
Table 1: International Standard CRC Polynomials
|
|
Name
|
CRC-CCITT
|
CRC-16
|
CRC-32
|
|
Checksum width
|
16 bits
|
16 bits
|
32 bits
|
|
Generator polynomial
|
10001000000100001
|
11000000000000101
|
100000100110000010001110110110111
|
By using one of the mathematically well-understood generator polynomials like those in Table 1 to calculate a checksum, its possible to state that the following types of errors will be detected without fail:
A message with any one bit in error
A message with any two bits in error (no matter how far apart, which column, and so on)
A message with any odd number of
bits in error (no matter where they are)
A message with an error burst as wide as the checksum itself
The first class of detectable error is also detected by an addition-based checksum, or even a simple parity bit. However, the middle two classes of errors represent much stronger detection capabilities than those other types of checksum. The fourth class of detectable error sounds at first to be similar to a class of errors detected by addition-based checksums, but in the case of CRCs, any
odd number of bit errors will be detected. So the set of error bursts too wide to detect is now limited to those with an even number of bit errors. All other types of errors fall into the relatively high 1-1/2
c probability of detection.
Ethernet, SLIP, and PPP
It turns out that Ethernet, like most physical layer protocols, employs a CRC rather than an addition-based checksum. Specifically, it employs the CRC-32 algorithm. The likelihood of an error in a packet sent over Ethernet being undetected is, therefore, extremely low. Many types of common transmission errors are detected 100% of the time, with the less likely ones detected 99.9999% of the time. Even if an error would somehow manage to get through at the Ethernet layer, it would probably be detected at the IP layer checksum (if the error is in the IP header) or in the TCP or UDP layer checksum above that. After all the chances of two or more different checksum algorithms not detecting the same error is extremely remote.
However, many embedded systems that use TCP/IP will not employ Ethernet. Instead, they will use either the serial line Internet protocol (SLIP) or point-to-point protocol (PPP) to send and receive IP packets directly over a serial connection of some sort. Unfortunately, SLIP does not add a checksum or a CRC to the data from the layers above. So unless a pair of modems with error correction capabilities sits in between the two communicating systems, any transmission errors must hope to be detected by the relatively weak, addition-based Internet checksum described last month. The newer, compressed SLIP (CSLIP) shares this weakness with its predecessor.
PPP, on the other hand, does include a 16-bit CRC in each of its frames, which can carry the same maximum size IP packet as an Ethernet frame. So while PPP doesnt offer the same amount of error detection capability as Ethernet, by using PPP youll at least avoid the much larger number of undetected errors that may occur with SLIP or CSLIP.
In next months final installment on checksums, Ill look at various software implementations of CRCs. Well start with an inefficient, but comprehendible, implementation and work to gradually increase its efficiency. Youll see then that the desire for an efficient implementation is the cause of much of the confusion surrounding CRCs. In the meantime, stay connected
.
References
1. Implementing modulo-2 division is much more straightforward in hardware than it is in software. You simply need to shift the message bits through a linear feedback shift register as they are received. The bits of the divisor are represented by physical connections in the feedback paths. Due to the increased simplicity and efficiency, CRCs are usually implemented in hardware whenever possible.
Back
2. If you really want to understand the underlying mathematical basis for CRCs, I recommend the following reference: Bertsekas, Dimitri and Robert Gallager. Data Networks, second ed. Inglewood Cliffs, NJ: Prentice-Hall, 1992, pp. 61-64.
Back
Michael Barr is the technical editor of Embedded Systems Programming. He holds BS and MS degrees in electrical engineering from the University of Maryland. Prior to joining the magazine, Michael spent half a decade developing embedded software and device drivers. He is also the author of the book
Programming Embedded Systems in C and C++ (OReilly & Associates).
|