Small femtocell base stations have recently been gaining popularity in mobile applications because of their advantages over traditional macrocells in coverage, compatibility, and cost.
Because of cost and performance constraints, femtocell designs must offer more or less the same modularity and complexity as macrocells, yet remain affordable enough to be purchased by individuals rather than communities.
But to deliver signal quality at least equal to that of traditional macrocell-based systems, femtocells must be designed with multiple channels supporting bit rates as high as 14.4 Mbps. This presents designers with a significant challenge: encoding the multichannel bit stream on the system's digital signal processing (DSP) engine while leaving sufficient compute headroom for the rest of the system's essential operations.
In this article, we describe how to implement a highly efficient algorithm based on turbo codes for use in a Blackfin-based 14.4-Mbps 3G femtocell design. The implementation consumes only about 100 of the 600 available Blackfin MIPS, leaving more than enough resources for other system operations.
Turbo codes have attracted great attention in the industry and research communities since their introduction in 1993 because of their remarkable performance: they operate within 0.7 dB of the ultimate capacity limit of a communication channel established by Claude E. Shannon.
Turbo codes were first proposed by Berrou, Glavieux, and Thitimajshima.1 They are constructed from two concatenated constituent convolutional encoders. In the turbo-coding scheme, two component codes are generated from differently interleaved versions of the same information sequence. On the decoder side, two maximum a posteriori (MAP) decoders refine the decisions in an iterative manner. The MAP decoding algorithm combines the received data and parity symbols (parity bits computed from the actual and interleaved versions of the data bits) with the other decoder's soft-output (extrinsic) information to produce increasingly reliable decisions.
Many algorithms and implementation techniques have been suggested in the literature for decoding turbo codes efficiently, but few publications address how to efficiently implement a turbo encoder for high bit-rate applications. In applications such as base stations, multiple data streams are transmitted at a time, so an efficient turbo-encoder implementation that can encode multiple bit streams with low processing power is important.
Since femtocells are targeted at individuals rather than communities, users get more dedicated wireless infrastructure with good signal strength for all of their mobile devices. In terms of data processing, macrocells and femtocells have more or less the same modules and complexity. A femtocell design, however, must be affordable precisely because it serves individuals, so using multiple costly processing devices is not an option. Although we don't cover the full femtocell architecture in this article, we discuss the turbo coding used for error correction in the 3G standard2 with the femtocell design budget in mind.
Implementing a 3G turbo encoder
The turbo encoder basically contains two constituent encoders separated by an interleaver. The schematic block diagram of the 3G turbo recursive systematic code (RSC) encoder is shown in Figure 1. Each RSC encoder contains a forward path with transfer function (1 + D + D³) and a feedback path with transfer function (1 + D² + D³). The process of interleaver address generation for each input is described in a paper from the 3rd Generation Partnership Project.2 Computing interleave addresses on the fly would require many compute units and is not an option on a two-MAC (multiply-accumulate) DSP such as the Blackfin.
Usually we precompute the interleave addresses and store them in memory; the stored addresses can then be used for both turbo encoding and decoding. If the codeword is large, we store the precomputed interleaver addresses in L3 memory; otherwise we store them in L1 memory. For larger codewords, we use the window method to encode or decode the bits and use direct memory access to transfer data (such as inputs and interleave addresses) from L3 as needed.
From Figure 1, for each input we output one systematic bit Xi and two parity bits Yi and Zi. The parity bit Zi doesn't depend directly on the actual input bit bi; it depends on the bit b'i taken from the interleaver buffer at index i. Given an input message block B of N bits, we either compute an interleaved address to fetch b'i from block B for each input bit index, or we interleave the whole block B at once, store it as an interleaved block B', and access b'i linearly with index i. Because computing an interleaved bit address according to the 3G standard is not a simple task, we prefer to precompute the addresses and store them in memory once for all N bits before encoding multiple message data blocks. In this way, the complexity of interleaver address generation drops out of the turbo-coding cost.
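The block-at-once option can be sketched in a few lines of C. This is our own illustration, not code from the article; the names (interleave_block, msg_bits, interleave_addr) are ours, and for clarity each bit occupies one byte, matching the addressable-boundary storage discussed later.

```c
/* Interleave a whole N-bit message block B once, through a precomputed
   address table, producing B' so the second encoder can read b'_i
   linearly with index i. One bit per byte for clarity. */
void interleave_block(const unsigned char *msg_bits,         /* block B  */
                      const unsigned short *interleave_addr, /* 3G table */
                      unsigned char *interleaved_bits,       /* block B' */
                      int n)
{
    int i;
    for (i = 0; i < n; i++)
        interleaved_bits[i] = msg_bits[interleave_addr[i]];
}
```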
Simple encoding. In simple encoding, we code bit-by-bit: for each input bit, we output three bits. A sample C code that simulates one RSC encoder is shown in Listing 1. The input data bits are usually accessed from memory in 8-bit bytes, since a byte is the minimum quantity the processor can access. Once we have a byte of data, encoding bit-by-bit requires unpacking the bits, which takes about 1.125 cycles per bit (nine cycles in all: the first bit unpack requires two cycles and each remaining bit one cycle, using the Blackfin rot instruction). An additional seven cycles are then needed to code each bit, since the coding involves only sequential operations. After coding, we must pack the coded bits and store them in memory for further processing and transmission; packing takes the same 1.125 cycles per bit as unpacking. Thus, a total of 10 cycles (including overhead) is required to encode one data bit and output one parity bit.
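As a concrete reference (our own C sketch in the spirit of Listing 1, which is not reproduced here), one RSC constituent encoder of Figure 1 can be simulated bit-by-bit from the two transfer functions given above:

```c
/* One 3G RSC constituent encoder, bit-by-bit.
   Forward path 1 + D + D^3, feedback path 1 + D^2 + D^3. */
typedef struct { unsigned s1, s2, s3; } rsc_state_t; /* three delay elements */

/* Encode one input bit x; returns the parity bit.
   The systematic output is simply x itself. */
unsigned rsc_encode_bit(rsc_state_t *st, unsigned x)
{
    unsigned fb = x ^ st->s2 ^ st->s3;  /* feedback path: 1 + D^2 + D^3 */
    unsigned y  = fb ^ st->s1 ^ st->s3; /* forward path:  1 + D + D^3   */
    st->s3 = st->s2;                    /* shift the register            */
    st->s2 = st->s1;
    st->s1 = fb;
    return y;
}
```

Note that the seven sequential cycles per bit quoted above come from exactly this kind of dependent XOR-and-shift chain, which the processor cannot parallelize.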
Computational complexity. Since we use a look-up table for interleaver addresses instead of computing them on the fly, we spend cycles only on table accesses (which may come for free alongside compute operations). Interleaving one data bit takes about three cycles (one to load the offset, two to compute the absolute address). Since turbo encoding involves two RSC encoders plus one interleave operation, we consume in total 25 cycles (including overhead) per data bit. At the 14.4-Mbps bit rate of femtocell applications, that amounts to about 360 MIPS, or roughly 60% of the Blackfin processor's MIPS.
Dealing with the turbo code
As we discussed earlier, a turbo encoder is a costly module at higher bit rates if it isn't implemented carefully. Next, we'll discuss techniques for implementing it efficiently. We split the turbo encoder into two parts: the first handles encoding of the bits; the second handles interleaving of the data bits.
Encoding using look-up table. Turbo encoding with two RSC encoders consumes about 20 cycles per input bit, as we explained earlier. We'll now describe a different approach using a look-up table that consumes only 2.5 cycles (for both encoders) per input bit. It requires 256 bytes of extra memory for storing look-up table data. Given the present state of the RSC encoder, it's possible to encode more than one bit at a time using a look-up table. By precomputing the table for all possible combinations of L input bits and all combinations of the three state bits, we can encode L bits at a time with a table of 2^(L+3) entries. As L increases, the table size also increases. With L = 4 (encoding four bits at a time), the look-up table has 2^7, or 128, entries, as shown in Figure 2. Each entry contains four encoded bits and three bits of updated state information; in other words, a byte (8 bits) is sufficient to represent each entry.
Looking closely at the 8-bit look-up table design, you can see that computing the 7-bit offset into the 128-entry table from four input bits (say, in register r0) and three current state bits (say, in register r1) takes four steps: extract the four data bits from the input byte into r2 (1 cycle); extract the current state from the previous look-up output into r3 (1 cycle); shift the three state bits left by four, r3 = r3 << 4 (1 cycle); and OR the extracted data bits into the state bits, r4 = r2 | r3 (1 cycle). We can avoid the extract and shift operations for the state bits by designing the look-up table appropriately: if we use two bytes per entry and place the state bits in the pre-shifted position, as shown in Figure 3, we save two of the four offset-calculation cycles (50%).
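A C sketch of the one-byte-per-entry table of Figure 2 follows. This is our own illustration: we assume a layout with the parity nibble in the low four bits and the next state in bits 4 through 6, so the next state lands pre-shifted for the following offset (the article's Figure 3 achieves the same effect with two-byte entries). The nibble is encoded MSB-first.

```c
unsigned char rsc_lut[128]; /* entry = (next_state << 4) | parity_nibble */

/* Precompute parity and next state for all 8 states x 16 input nibbles. */
void build_rsc_lut(void)
{
    int state, nib, k;
    for (state = 0; state < 8; state++) {
        for (nib = 0; nib < 16; nib++) {
            unsigned s1 = state & 1, s2 = (state >> 1) & 1, s3 = (state >> 2) & 1;
            unsigned parity = 0;
            for (k = 3; k >= 0; k--) {           /* MSB of the nibble first */
                unsigned x  = (unsigned)(nib >> k) & 1;
                unsigned fb = x ^ s2 ^ s3;       /* feedback 1 + D^2 + D^3 */
                unsigned y  = fb ^ s1 ^ s3;      /* forward  1 + D + D^3   */
                parity = (parity << 1) | y;
                s3 = s2; s2 = s1; s1 = fb;
            }
            rsc_lut[(state << 4) | nib] =
                (unsigned char)((((s3 << 2) | (s2 << 1) | s1) << 4) | parity);
        }
    }
}

/* Encode four bits at once: returns the parity nibble, updates *state. */
unsigned rsc_encode_nibble(unsigned *state, unsigned nib)
{
    unsigned e = rsc_lut[((*state) << 4) | (nib & 0xF)];
    *state = e >> 4;
    return e & 0xF;
}
```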
After computing the encoded bits, we must pack them. Since we encode four bits at a time and output one encoded 4-bit nibble at a time, packing nibbles into bytes is easy: two nibbles go into one byte in two cycles (one left shift and one OR or ADD operation). Packing the outputs of both encoders thus takes four cycles on Blackfin; by using the processor's two multiply-accumulate (MAC) units, we can pack both encoder outputs in two cycles. The Blackfin ASM code for encoding and packing a single loop iteration of the turbo encoder is shown in Listing 2, covering both RSC encoders. It's clear from the ASM code that turbo encoding one byte consumes 20 cycles, or 2.5 cycles per bit.
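In C, the shift-plus-OR nibble pack looks like the following (our sketch; which nibble lands in the high half is our assumption and simply mirrors an MSB-first bit order):

```c
/* Pack two encoded 4-bit nibbles into one byte: one left shift and
   one OR, matching the two-cycle pack described above. */
static unsigned char pack_nibbles(unsigned hi, unsigned lo)
{
    return (unsigned char)(((hi & 0xF) << 4) | (lo & 0xF));
}
```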
In the ASM code of Listing 2, we encode four bits at a time for two encoders. In reality, however, the second encoder doesn't take its data directly from the input bit-stream bytes: the input bit stream must be interleaved before being passed to the second encoder. We've already discussed interleaving for the second encoder, where the interleaved bits were stored in an interleaver buffer. There we assumed the stored bits could be accessed directly for encoding because each interleaved bit occupied an addressable unit (at minimum, one byte per bit). Since the look-up-table approach encodes in nibbles, we must pack the interleaved bits back into bytes before storing them in the interleaver buffer.
Interleaver design. Because the look-up-table-based encoding method just described needs the interleaved bits in byte form, we must pack the interleaved bits into bytes. Therefore, to feed bits to the second encoder in the right order, we perform three steps: unpack, interleave, and pack. Since the data is represented as bytes, unpacking and packing amount to de-multiplexing bytes into bits and multiplexing bits back into bytes, respectively. Because packing needs all the interleaved bits, the interleaving must be completed first. The unpacking and interleaving operations together consume about three cycles per bit, as shown in Listing 3, which unpacks and interleaves one byte of input data. The rotate-through-carry instruction rot r by -1 delivers one unpacked data bit at a time, starting with the most significant bit (MSB).
The next step is packing the interleaved data bits into bytes, the exact reverse of the unpacking operation shown in Listing 3. If we used the rotate-through-carry instruction rot r by 1 for packing, we would need two cycles per bit, since one cycle goes to moving a data bit from a register to cc. Instead, we use the compare-and-select vit_max instruction. The vit_max instruction packs two bits at a time by comparing the two least-significant (LSB) words in two registers with the two MSB words of the same registers. Because the comparison flags are saved to an accumulator, we extract the packed byte from the accumulator after four iterations. Since we cannot load two bits from the same buffer in one cycle to feed vit_max, we spend one more cycle loading data. With this, packing takes two cycles for two bits, or one cycle per bit, as shown in Listing 4.
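For reference, the whole unpack-interleave-pack pipeline of Listings 3 and 4 can be expressed in plain C (our sketch, not the article's ASM; names are ours). Bits are taken MSB-first within each byte, and addr[] is the precomputed interleaver table giving, for output bit i, the source bit position in the input stream:

```c
#include <stdlib.h>

/* Unpack packed input bytes, interleave the bits through addr[], and
   pack the result back into bytes, MSB first. nbits must be a
   multiple of 8. */
void interleave_bytes(const unsigned char *in, unsigned char *out,
                      const unsigned short *addr, int nbits)
{
    unsigned char *tmp = (unsigned char *)malloc((size_t)nbits);
    int i, j;
    /* unpack + interleave: gather each source bit into output order */
    for (i = 0; i < nbits; i++)
        tmp[i] = (in[addr[i] >> 3] >> (7 - (addr[i] & 7))) & 1;
    /* pack: eight interleaved bits back into each output byte */
    for (i = 0; i < nbits / 8; i++) {
        unsigned char b = 0;
        for (j = 0; j < 8; j++)
            b = (unsigned char)((b << 1) | tmp[8 * i + j]);
        out[i] = b;
    }
    free(tmp);
}
```

The ASM listings achieve the same result at roughly four cycles per bit by fusing the unpack step with interleaving (Listing 3) and using vit_max for the pack (Listing 4).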
Estimating computational complexity
The computational load of this implementation of the turbo-code algorithm is relatively modest, whether it's estimated in terms of cycles per bit or the memory required to perform the tasks.
Cycle estimation. As shown in Listings 3 and 4, unpacking, interleaving, and packing the interleaved data consume four cycles per bit. Encoding, as shown in Listing 2, costs 2.5 cycles per bit, for a total of 6.5 cycles per bit. Overhead cycles are not included in this estimate; assuming an overhead of one cycle per bit, turbo encoding consumes about 7.5 cycles per bit. With this efficient implementation, we use 108 MIPS of the Blackfin processor, or approximately 18% of its MIPS, at a bit rate of 14.4 Mbps, versus 60% with the simple-encoding method. With 18% of MIPS consumed by the turbo encoder, we have sufficient headroom, about 82% of processor MIPS, to fit the other modules of the femtocell base station.
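The MIPS figures above follow directly from bit rate times cycles per bit, which a one-line helper makes explicit:

```c
/* MIPS required = bit rate in Mbps x cycles consumed per bit.
   14.4 Mbps x 7.5 cycles/bit = 108 MIPS (look-up-table method)
   14.4 Mbps x 25  cycles/bit = 360 MIPS (simple method)        */
static double encoder_mips(double mbps, double cycles_per_bit)
{
    return mbps * cycles_per_bit;
}
```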
Memory estimation. With the look-up-table method for turbo encoding, we need 256 bytes of data memory to store the precomputed encoding information. The efficient method also needs a factor of eight less data memory for storing the interleaved data, since we pack the bits into bytes. Both methods require data memory for storing the interleaver addresses, because computing them on the fly is costly.
Using the precomputed look-up table, we perform turbo encoding with only 18% of processor MIPS, versus about 60% with the simple method. The efficient method spends 256 bytes of memory on the look-up table, yet its overall memory use is lower than that of the simple method.
Hazarathaiah Malepati joined Analog Devices in 2003, where he is currently working on embedded algorithm software development for the Blackfin family of processors. From 2000 to 2003, he worked as a research engineer in HIRP (HFCL-IISc research program), Bangalore, India. He has a masters in industrial electronics from KREC, Surathkal.
Yosi Stein serves as DSP principal system architect/advanced technologies manager, working in the Digital Media Technology Center on the development of a broadband communication and image compression enhanced instruction set for Analog Devices' fixed-point DSP family. Yosi holds a B.Sc. in electrical engineering from Technion–Israel Institute of Technology.
1. Berrou, C., A. Glavieux, and P. Thitimajshima, "Near Shannon Limit Error-Correcting Coding and Decoding: Turbo Codes," Proc. IEEE Int. Conf. Commun., Geneva, Switzerland, pp. 1064–1070, 1993.
2. 3GPP: 3rd Generation Partnership Project, “Technical Specification Group Radio Access Network, Multiplexing and Channel Coding,” V8.0.0, 2007.