# Efficient CRC calculation with minimal memory footprint

Almost every form of digital information exchange can introduce communication errors. Sometimes, these errors can be ignored (for example, an erroneous pixel in a high-resolution video is merely unnoticeable). However, most of the time, they cannot be ignored, and we want to ensure that the receiver gets an absolutely correct stream of information.

In order to overcome the inherent inaccuracy of information transmission, a few methods for error detection and correction have been developed. Generally speaking, these methods introduce some redundancy to the actual message, which in turn can be used to detect errors and in some cases to recover the original data.

One of the most common methods is the use of the *CRC (cyclic redundancy check) function,* a family of codes commonly used to ensure data integrity in digital data streams by detecting errors due to noise or dropouts of bits in the message stream. The CRC calculates a series of bits (also commonly referred to as a CRC) that is appended to the data stream and transmitted along with the message. The receiver performs a CRC function on the message and compares the result with the received CRC code to verify the integrity of the message. CRC is commonly used in a number of applications such as digital communications and computer data storage systems.

The CRC is performed through binary polynomial division between the transmitted message and a polynomial divisor and is usually implemented using a *linear feedback shift register (LFSR)* . An LFSR is a shift register where its next state is a linear combination of its previous state and input bits. In our context, the linear operators are Logical XOR and Logical AND. Since the operation of an LFSR circuit is deterministic, and the CRC is shorter than the message, typically few messages are mapped to each CRC value. A well-chosen polynomial ensures an evenly distributed mapping of messages to CRC values (for example, if all messages were mapped to the same CRC, detection of an error in a message bit would be impossible).

The trick in an embedded systems design is to implement this technique in the most efficient way possible and in the smallest possible footprint. In this article, we discuss aspects of the theory and implementation of LFSRs and CRCs, illustrated using a family of instructions on Analog Devices' Blackfin processor specifically defined for addressing the task of efficient LFSR implementation. We also provide a method for converting an implementation from an Internal LFSR form to an External LFSR form.

**Down to basics**

The CRC was developed to answer the requirement for an error detection mechanism. The code consists of a bit field (of some set length) that is transmitted with the message (usually appended to the message) over the communication medium. On the receiver side, a second CRC is calculated from the received message and then compared with the received CRC. If both fields are not the same, an error in the transmission has occurred (it is not known, though, if the message is bad, the CRC is bad, or both are bad).

In order to calculate the CRC code for a message, we look at the problem as an exercise in polynomial algebra over Galois Field GF(2). In short, *GF(2)* is a field consisting of the elements 0 and 1, with the **+** and ***** operators defined modulo-2; in other words, **+** is the Logical XOR and ***** is the Logical AND.

**Method**

For message string **M** of **k** bits in length and CRC field of length **n** bits, we define the following polynomials:

**M** _{k-1 } **(x)** is a polynomial of degree **k-1** , of the form:

**M** _{k-1 } **(x)** = *m* _{k } _{-1} *x* ^{k } ^{-1} + *m* _{k } _{-2} *x* ^{k } ^{-2} +…+ *m* _{0} *x* ^{0}

**G** _{n } **(x)** is the CRC generator polynomial of degree **n** :

*G* _{n } (*x* ) = *x* ^{n } + *g* _{n } _{-1} *x* ^{n } ^{-1} +…+ *g* _{0} *x* ^{0}

**C** _{n-1 } **(x)** is the calculated CRC polynomial of degree **n-1** :

*C* _{n } _{-1} (*x* ) = *c* _{n } _{-1} *x* ^{n } ^{-1} + *c* _{n } _{-2} *x* ^{n } ^{-2} +…+ *c* _{0} *x* ^{0}

The CRC is determined in this way:

*C* _{n } _{-1} (*x* ) = {*M* _{k } _{-1} (*x* ) â€¢ *x* ^{n } } *mod* *G* _{n } (*x* )

The *mod* division is performed modulo-2 over GF(2). The **k** bits of the message string are appended by **n** bits of zeros (this is the *x* ^{n } term in the equation). This kind of division is commonly implemented using LFSR circuits. There happen to be two canonical forms of LFSR implementation, which are interchangeable in that they will calculate the same result, given the same message and divisor polynomial.

One form, known as an *External, Type I* or *Fibonacci LFSR* has XOR gates in the feedback path, and the feedback terms are sampled from the relevant taps, added (modulo 2), and then fed back to the least significant bit (lsb).

Another form, known as an *Internal, Type II* or *Galois LFSR* takes the highest order bit (msb) as the feedback term that is fed back into the relevant taps through the XOR gates placed between the taps. This form is the more popular, as it seems to be faster when implemented with hardware logic gates. In fact, one can see the similarity between the algebraic process and this LFSR division. However, many implementers are unaware of the equivalence of the two forms.

An Internal CRC LFSR is implemented as shown in Figure 1. The equivalent External LFSR has the form shown in Figure 2.

As mentioned previously, the two circuits are equivalent, and the simple rule for conversion from an Internal form to an External form is:

1. While keeping the LFSR flow direction the same, reverse the order of the feedback taps;

2. After the first **k** clocks, feed the first tap (*S* _{0} ) with **n** zeros and read the **n** output bits (which are the required CRC) as the sum of the feedback and the input.

In both cases, **M** is fed in starting with the highest order bit *m* _{k } _{-1} . For example, consider two equivalent implementations of the generator polynomial *G* _{5} (*x* )=*x* ^{5} +*x* ^{4} +*x* ^{2} +1, shown in Figures 3 and 4.

Given the same bit sequence at the input, the resultant CRC from the two circuits will be the same.

**Initialization of the LFSR**

In practice, there are a few variations on the subject of CRC definition and implementation. These variations regard bit ordering, bit inversion and LFSR initialization. Several standards state that the LFSR taps are initialized to zero. For some good reasons, other standards define the taps to be initialized to ones. In general, any initialization string may be valid. Given an initialization string for an internal LFSR, we need a method for conversion of that string to the equivalent external LFSR. The following method is applied:

Consider an internal LFSR initialized with an initialization string **I** _{n } . The resulting CRC after **k+n** clocks will be just the same as if we have a LFSR initialized with zeros, and the message prepended by **I** _{n } (advancing **n+k+n** clocks in total). This is true because the first **n** bits just propagate through the LFSR without being fed back. So, after **n** clocks we are at the point where the LFSR contains **I** _{n } and the message is about to be fed in.

Since we know the two LFSR types are equivalent when initialized with zeros, we can repeat the same practice with the external LFSR. This time, after **n** clocks the taps are populated with some string that is determined by the generator polynomial–but is a well defined string for a given **I** _{n } . So, we can precalculate this equivalent initialization string and use it to initialize our LFSR.

For example consider the generator polynomial from the previous section *G* _{5} (*x* )=*x* ^{5} +*x* ^{4} +*x* ^{2} +1. If we need to initialize the internal LFSR state with **[11111]** , what will be the initialization string of the external form? Initializing it to **[00000]** and feeding with **[11111]** , the following LFSR states are calculated with each clock:

If we use this string to initialize our external LFSR, we can calculate a CRC of the message as if it was done with an internal LFSR initialized with **[11111]** .

**Implementing CRC**

When implementing CRC in software, a common way of accelerating the calculations is through the use of lookup tables. This requires allocation of memory space that grows exponentially with increasing performance (generally, we need 2^{p} rows of **n** bits each for **p** look ahead acceleration).

However, if you are building a system based on the Blackfin processor, you can take advantage of some instructions included in the basic Instruction Set Architecture (ISA) specifically defined for efficient LFSR implementation. These instructions–BXOR and BXORSHIFT–offer excellent performance, without a requirement for lookup tables. This fact makes it ideal for use when available memory is in shortage. Using these instructions, one can calculate an External CRC at a rate of two cycles per input bit. The example program in Listing 1 shows how to use these instructions.

This program consists of three functional parts. The first part is the initialization code. Here, the LFSR state (*A0* ) and the mask (*A1* ), which contains the coefficients of the generator polynomial, are initialized (Make sure to initialize the high byte of *A1* as well). According to the definitions of an External LFSR as in Figure 2, *A1* 's lsb is populated with the generator polynomial's high order coefficient. In the above example program, the 32-bit message fits into one register and is fed into the LFSR starting with the msb.

The second part of the program is a loop of **k** iterations. In each iteration, one bit from the message is extracted and put in CC. Then, it is fed in and the LFSR state is updated according to the BXORSHIFT functional description (see the Blackfin instruction set documentation). That is, *A0* is masked by *A1* , then the result is XOR-reduced and added to the input bit *CC* . The resulting bit is pushed into the lsb of *A0* while the whole register is shifted to the left. In case the message is longer than 32-bits, a loop wrapping this part may be required, where a new data word is read into *R2* from the message memory buffer.

In the third part of the program, we use the second variant of BXORSHIFT, this time without feedback. However, because in this variant the shift operation takes place prior to the XOR-reduction, we start by calculating the first output bit with a BXOR instruction operating on the last state. Then we continue with loop of **n-1** iterations. In each iteration, *A0* is updated and a new output is put in *CC* . Then, the CRC register (*R0* ) is shifted left with the CC bit put into its lsb. At the end of this part, *R0* contains the **n** bits of the CRC string.

**Yaniv Sapir** is a project leader at the Digital Media Technology Center in Analog Devices, developing embedded media algorithms and software for ADI's Blackfin processor. Yaniv holds a B.Sc. in aerospace engineering from the Technion, Israel Institute of Technology. He may be reached at .

**Yosef Stein** serves as DSP principal system architect/advanced technologies manager in the Digital Media Technology Center working on the development of broadband communication and image compression enhanced instructions for Analog Devices' fixed point DSPs. Yosef holds a BSc in electrical engineering, Technion. He may be reached at .