# Use ESL synthesis techniques to replace dedicated DSPs with FPGAs

**If your application isn't the most compute intensive, you may find that using an FPGA as a replacement is a good idea.**

During the last few years, the performance and capacity of field programmable gate arrays (FPGAs) have increased dramatically. At the same time, the cost per unit of time of performing basic digital signal processing (DSP) operations, such as multiply and add, in FPGA chips has decreased dramatically.

During the same period, synthesis tools have evolved to the point where complicated DSP modules and even complete subsystems can be described in a high-level language such as C++. These tools take the high-level description and synthesize it into an ASIC or FPGA, which is tailored to the performance requirements of the design's implementation.

The convergence of these two technologies has opened some opportunities for designers. First, FPGA implementations of DSP designs become competitive with respect to dedicated digital signal processors and custom ASICs. And second, electronic system level (ESL) technology enables the designer to take advantage of this competitiveness to produce a finished design in a short time. This increased productivity reduces the nonrecurring engineering costs, further increasing the design's competitiveness.

Any given DSP algorithm specifies a number of basic DSP operations to be performed. In a dedicated DSP integrated circuit, the computation is performed serially, so the number of operations and the clock speed at which the chip runs determine the time required to execute the algorithm. In an FPGA implementation, however, the algorithm can be parallelized, if its structure allows, across a number of execution units that are tuned to its needs. Figure 1 shows two possible implementations of a finite impulse response (FIR) filter. The first implementation is the only one possible in a dedicated DSP chip. The FPGA designer, by contrast, can choose the low-performance/low-parallelism implementation, the high-performance/high-parallelism implementation, or any of the many implementations in between.

In a traditional low-level synthesis environment using RTL synthesis tools, the designer would be reluctant to explore all of the multiple degrees of parallelism available to implement the filter. It would literally take weeks or even months to find the best architecture. However, using an automated high-level synthesis tool, all the various degrees of parallelism (one multiply-add, two multiply-adds, *n* multiply-adds) can be examined in a matter of hours. Listing 1 shows an example of how this filter can be coded as a C++ class, while Figure 2 shows the architectural constraints where the filter's loops can be fully (full parallelization) or partially (partial parallelization) unrolled. In this case, a parallelization of the algorithm to use two execution units (partial unroll by 2) has been selected.
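Listing 1 itself isn't reproduced here. As a rough illustration only, a filter of this kind might be written as the class template below. The names `FirFilter`, `NumTaps`, and `step` are our own assumptions for this sketch, not the original listing and not any tool's API. The point is that the loop bounds are compile-time constants, which is what lets a synthesis tool unroll the multiply-accumulate loop fully or partially.

```cpp
#include <array>
#include <cstddef>

// Hypothetical sketch of a FIR filter as a C++ class template.
// SampleT, CoeffT, and AccT are chosen by the designer; NumTaps fixes the
// loop bounds at compile time so that the loops can be unrolled.
template <typename SampleT, typename CoeffT, typename AccT, std::size_t NumTaps>
class FirFilter {
    static_assert(NumTaps > 0, "a FIR filter needs at least one tap");

public:
    explicit FirFilter(const std::array<CoeffT, NumTaps>& coeffs)
        : coeffs_(coeffs), delay_line_{} {}

    // Consume one input sample and return one output sample.
    AccT step(SampleT x) {
        // Shift the delay line by one position; the oldest sample is dropped.
        for (std::size_t i = NumTaps - 1; i > 0; --i) {
            delay_line_[i] = delay_line_[i - 1];
        }
        delay_line_[0] = x;

        // Multiply-accumulate loop: one multiply-add per iteration.
        // Unrolled by 2 it would occupy two multipliers; unrolled fully, NumTaps.
        AccT acc = AccT();
        for (std::size_t i = 0; i < NumTaps; ++i) {
            acc += static_cast<AccT>(coeffs_[i]) * static_cast<AccT>(delay_line_[i]);
        }
        return acc;
    }

private:
    std::array<CoeffT, NumTaps> coeffs_;
    std::array<SampleT, NumTaps> delay_line_;
};
```

Feeding a unit impulse into an instance with coefficients 1, 2, 3, 4 returns those coefficients one per call, which is a quick sanity check before any synthesis run.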

Table 1 shows a partial list of the components utilized when the hardware is synthesized. Notice that two multipliers (in bold) are used as expected. Also, notice that four, not two, adders were allocated. That's because adders are also used in the synthesis of the state machines (loops), which control the algorithm's execution.

**Lower power consumption**

When implementing a low-performance DSP algorithm, such as a voice codec, in a dedicated DSP chip, channel multiplexing is used in most cases. The reason is that the clock frequencies at which modern DSP chips operate can be very high. Such frequencies provide enough computing power to process many voice channels, which normally have a sampling frequency of only a few kilohertz.

Channel multiplexing ensures that the inputs to the serial DSP aren't correlated. Data that's not correlated is more likely to produce transitions at all stages of the DSP hardware, which implies more power consumption. By implementing various instances of low-performance hardware in an FPGA and running them at their minimum feasible clock frequency, substantial power savings can be achieved.

Again, high-level synthesis can help in creating an efficient implementation of such hardware. The array of low-performance instances can be expressed as an array of a particular C++ class in the source code, and their execution can be parallelized (unrolled) in the constraints window of a high-level synthesis tool.
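The following sketch shows one way such an array of instances might look in source form. `VoiceChannel` and `ChannelBank` are hypothetical names, and the one-pole smoother is only a stand-in for a real codec. Each instance owns its state, so the iterations of the channel loop are independent and are natural candidates for unrolling.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical per-channel block: a one-pole low-pass smoother standing in
// for a real voice codec. Each instance keeps its own state.
class VoiceChannel {
public:
    std::int32_t process(std::int32_t sample) {
        // y[n] = y[n-1] + (x[n] - y[n-1]) / 4
        state_ += (sample - state_) / 4;
        return state_;
    }

private:
    std::int32_t state_ = 0;
};

// One instance per channel instead of one time-multiplexed engine.
template <std::size_t NumChannels>
class ChannelBank {
public:
    // Process one sample per channel. No iteration depends on another,
    // so the loop can be unrolled into NumChannels parallel instances.
    std::array<std::int32_t, NumChannels>
    process(const std::array<std::int32_t, NumChannels>& in) {
        std::array<std::int32_t, NumChannels> out{};
        for (std::size_t ch = 0; ch < NumChannels; ++ch) {
            out[ch] = channels_[ch].process(in[ch]);
        }
        return out;
    }

private:
    std::array<VoiceChannel, NumChannels> channels_{};
};
```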

**Variable precision arithmetic**

The execution units of a dedicated DSP are fixed. These units can be fixed-point units (two's complement arithmetic in most cases) or floating-point units. In most applications, there's a mismatch between the precision required by the DSP algorithm at hand and the precision provided by the target DSP. This mismatch results in the designer having to choose a DSP chip that contains arithmetic execution units that can fit, but not necessarily match, the precision of the algorithm at hand. It also wastes computing resources (bits) and power. An FPGA implementation can be precisely matched to the algorithm's necessary precision and a high-level synthesis tool can synthesize to different precisions by simply changing the C++ class in which the algorithm is implemented.

**Arithmetic flexibility**

The same algorithm might need to be implemented in different arithmetic data types, including integer arithmetic, fixed- or floating-point arithmetic, and complex arithmetic. A high-level synthesis tool can synthesize the same algorithm into different arithmetic types and produce dedicated hardware to appropriately handle the arithmetic type. This flexibility isn't available on dedicated DSPs because, as previously discussed, the execution units on dedicated DSPs are fixed ahead of time.
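As a minimal sketch of this idea, the multiply-accumulate kernel below takes its arithmetic type as a template parameter. The function name `dot_product` is our own choice for illustration. The same source text then describes integer, floating-point, or complex hardware, depending only on how it is instantiated.

```cpp
#include <array>
#include <complex>
#include <cstddef>

// Generic multiply-accumulate kernel. The arithmetic type T is a template
// parameter, so one description covers int, float, std::complex, and any
// user-defined type that provides operator* and operator+=.
template <typename T, std::size_t N>
T dot_product(const std::array<T, N>& a, const std::array<T, N>& b) {
    T acc = T();
    for (std::size_t i = 0; i < N; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

Instantiating it as `dot_product<int, 3>` or as `dot_product<std::complex<double>, 2>` requires no change to the algorithm itself.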

Arithmetic flexibility can be taken to the extreme when the execution units synthesized into the FPGA aren't what's normally called an arithmetic execution unit. For example, when performing analog-to-digital conversion using a delta-sigma modulator, a stream of 1s and 0s generated by the analog front end is low-pass filtered and then decimated to obtain a pulse-coded modulation stream that represents the analog signal. The low-pass filter must interpret the input stream as high voltage having a value of -1 and low voltage having a value of +1. These semantics for the high and low values aren't what one would consider the “correct” semantics of an unsigned (0 and +1) or signed (0 or -1) bit stream.

The designer could code a class named “sign” in C++ and define the basic operations between this class and any arithmetic representation commonly used in DSP systems. The filter described in Listing 1 could be specialized to perform such a function and then synthesized into an FPGA. Figures 3 and 4 show the input and output streams for such a filter.
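A class of this kind could look roughly like the sketch below. This is our own illustration, not the code behind Figures 3 and 4: the names `Sign`, `is_high`, and `filter_tap_sum` are assumptions, and the mapping (high reads as -1, low as +1) simply follows the description above. Multiplying by a `Sign` never needs a multiplier, since it either passes the operand through or negates it.

```cpp
#include <array>
#include <cstddef>

// Hypothetical one-bit "sign" type for a delta-sigma bit stream.
// Following the text, a high level is read as -1 and a low level as +1.
class Sign {
public:
    explicit constexpr Sign(bool high) : high_(high) {}
    constexpr bool is_high() const { return high_; }

private:
    bool high_;
};

// Sign * value: pass the operand through or negate it; no multiplier needed.
template <typename T>
constexpr T operator*(Sign s, const T& x) {
    return s.is_high() ? -x : x;
}

// value * Sign: same operation with the operands swapped.
template <typename T>
constexpr T operator*(const T& x, Sign s) {
    return s * x;
}

// One filter output: the sum of coefficients, each weighted by +1 or -1.
template <typename CoeffT, std::size_t N>
CoeffT filter_tap_sum(const std::array<Sign, N>& bits,
                      const std::array<CoeffT, N>& coeffs) {
    CoeffT acc = CoeffT();
    for (std::size_t i = 0; i < N; ++i) {
        acc += bits[i] * coeffs[i];
    }
    return acc;
}
```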

Note that the decimator's input has the correct meaning of +1 and -1. Also note that, with the exception of a certain amount of time needed to flush the filter pipeline, the output of the synthesized hardware is identical to the output of the C++ high-level description. In this way, the designer can successfully specify a DSP engine that performs arithmetic operations between a sign input and a set of filter coefficients expressed as fixed-point numbers. A dedicated DSP chip can't perform these operations directly; it has to emulate them using its fixed execution units.

Another example, which shows the advantages of polymorphism, is the back-substitution operation used when solving a system of simultaneous equations. In DSP communications systems, back substitution is used as a component of the channel estimation algorithm in MIMO/OFDM systems. In such systems, the matrix operations are performed in fixed-point real or complex numbers.

A totally different application for back substitution is Reed-Solomon decoding. One way of looking at the decoder is as a back-substitution engine that performs arithmetic in a Galois field of numbers, not in what we normally call “arithmetic representations.” If a back substitution is coded as a C++ template and instantiated as two different specializations using complex and Galois fields, two completely different pieces of hardware are obtained from the same algorithm description.
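To make this concrete, here is a minimal sketch of back substitution written as a template, together with a toy field type. Everything named here is an assumption for illustration: `back_substitute` is our own function, and `GF7` is a deliberately tiny prime field (arithmetic modulo 7) chosen to keep the example short. A practical Reed-Solomon decoder would normally work in a larger field such as GF(2^8).

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>

// Solve U x = b for an upper-triangular matrix U.
// T only needs operator-, operator*, and operator/.
template <typename T, std::size_t N>
std::array<T, N> back_substitute(const std::array<std::array<T, N>, N>& U,
                                 const std::array<T, N>& b) {
    std::array<T, N> x{};
    for (std::size_t k = N; k-- > 0;) {   // last row first
        T acc = b[k];
        for (std::size_t j = k + 1; j < N; ++j) {
            acc = acc - U[k][j] * x[j];   // remove already-solved unknowns
        }
        x[k] = acc / U[k][k];
    }
    return x;
}

// Hypothetical toy Galois field GF(7): integers modulo the prime 7.
class GF7 {
public:
    constexpr GF7(unsigned v = 0) : v_(v % 7) {}
    constexpr unsigned value() const { return v_; }

    friend constexpr GF7 operator-(GF7 a, GF7 b) { return GF7(a.v_ + 7 - b.v_); }
    friend constexpr GF7 operator*(GF7 a, GF7 b) { return GF7(a.v_ * b.v_); }
    friend GF7 operator/(GF7 a, GF7 b) {
        assert(b.v_ != 0);  // zero has no multiplicative inverse
        // Fermat's little theorem: b^(7-2) is the inverse of b modulo 7.
        GF7 inv(1);
        for (int i = 0; i < 5; ++i) inv = inv * b;
        return a * inv;
    }
    friend constexpr bool operator==(GF7 a, GF7 b) { return a.v_ == b.v_; }

private:
    unsigned v_;
};
```

Instantiating `back_substitute` with `std::complex<double>` and with `GF7` uses the same loop structure but entirely different arithmetic, which is the kind of specialization described above.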

Finally, by using a combination of FPGAs and high-level synthesis tools, the user can match the implementation of the original algorithm described in C++ to a piece of dedicated hardware that meets the algorithm's performance needs. This flexibility isn't available with dedicated DSP implementations.

In some areas, dedicated DSP implementations are superior to FPGA implementations. These include very large algorithms, higher clock frequencies, and software portability.

The first advantage comes from the fact that the size of an algorithm implemented on a DSP is limited only by the amount of memory available. An FPGA implementation faces other limitations, such as the number of available resources (execution units and I/O pins, for example) and the algorithm's structure, which may or may not allow for efficient use of those resources.

The second advantage for DSPs comes from the fact that the arithmetic portion of DSP ICs is designed using full-custom design techniques similar to the ones used for designing microprocessors. Therefore, the clock frequencies at which DSPs can run are usually much higher than those of FPGAs. Because the clock frequency in an FPGA can be expected to be lower, an algorithm suitable for implementation on an FPGA must have enough inherent parallelism to produce a higher-performance or lower-power design than the DSP running at a higher clock frequency. This isn't always possible.
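This trade-off can be put into a rough formula. As a simplifying assumption of ours, take throughput to be the number of multiply-adds completed per clock cycle, $P$, times the clock frequency, $f$; these symbols are introduced here only for illustration. The FPGA then outperforms the DSP when

$$
P_{\text{FPGA}} \, f_{\text{FPGA}} > P_{\text{DSP}} \, f_{\text{DSP}}
\quad\Longrightarrow\quad
P_{\text{FPGA}} > P_{\text{DSP}} \, \frac{f_{\text{DSP}}}{f_{\text{FPGA}}}
$$

With purely hypothetical numbers, a DSP performing one multiply-add per cycle at 1 GHz and an FPGA design clocked at 200 MHz, the algorithm would have to keep more than five execution units busy in parallel before the FPGA comes out ahead on throughput.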

Next, we'll discuss various experiences using a high-level synthesis tool (Mentor Graphics' Catapult C Synthesis) with the Altera Stratix II FPGAs for the replacement of dedicated DSP chips.

One way to replace dedicated DSPs is by implementing portions of the algorithm in software using the Nios II soft processor. As discussed before, the frequencies at which the FPGA can operate are much lower than those of the dedicated DSP. So we designed and implemented various hardware accelerators to compute in parallel the most compute-intensive (and parallelizable) portions of the algorithm.

Some of the hardware coprocessors that have been successfully implemented using this approach are:

- Systolic Fast Fourier Transform array that can perform an *N*-point FFT in *N* clock cycles.
- Systolic matrix triangularization array that can reduce an *N* x *N* matrix in *N* or *N*^2 clock cycles, depending on the available memory bandwidth.
- Front end for a single sideband modulator (complex arithmetic filtering).

We attempted to perform a direct translation of two standard telecom algorithms without modifying the original source code. The first algorithm was an AMR (Adaptive Multi-Rate) codec standardized by CCITT (Comité Consultatif International Téléphonique et Télégraphique, now known as the International Telecommunication Union), and the second was a commonly used encryption algorithm.

During the codec implementation, we faced tool and FPGA capacity issues that forced us to implement only the compute-intensive (and parallelizable) portion of the algorithm in the FPGA, leaving the algorithm's serial portions to run in a dedicated DSP. The final product contains both dedicated DSP chips and FPGA coprocessors. The second experience was more successful at first: the complete encryption algorithm, with very minor modifications, was implemented in an FPGA using Catapult C. However, because the supplied C algorithm wasn't written with hardware in mind, the resulting implementation was too big.

To create more efficient hardware, we modified the initial algorithm, which was then successfully synthesized into efficient hardware using Catapult C and Altera FPGAs. Although this attempt was successful, some nonrecurring engineering costs were unavoidable in the process of modifying the algorithm.

**Partial replacement of DSP**

In another experience, we designed a motion estimation detector using a combination of a dedicated DSP, working with an FPGA as its coprocessor. The algorithm was profiled beforehand to decide which portions were suitable for a DSP implementation and which were suitable for an FPGA hardware accelerator.

The software was written with this partitioning in mind. First, the DSP portions were written with the target DSP processor in mind and, second, the motion estimation coprocessor was written in such a way that efficient video processing hardware was synthesized (a low memory bandwidth window-based algorithm). This project is still under way but the preliminary results seem promising.

**Room for improvement**

From the previous experiences, one can conclude that total or partial replacement of DSPs by FPGAs is a feasible, cost-effective alternative. However, it's not yet a push-button operation. The following improvements are desirable to facilitate this new design methodology:

**Support of most (if not all) of the features of the C/C++ programming languages.**

People who are accustomed to working with dedicated DSPs are used to compiling large algorithms out of the box without any modification. To be able to replace larger portions of the algorithm, or all of it, with an FPGA, Catapult C must support most of the language features encountered in these algorithms.

**Faster clock frequencies in FPGA.**

To be able to translate algorithms that aren't completely parallelizable into FPGAs, it's desirable that the FPGAs' clock frequencies be closer to the clock frequencies at which a dedicated DSP can run.

**Larger tool capacity.**

While commercially available high-level synthesis tools can now synthesize blocks of substantial size, it's still impossible to synthesize some large algorithms, such as the codec described earlier.

**Hardware/software codesign.**

It's important to enhance both the DSP compilers and the ESL synthesis tools to accept the same source code and produce efficient machine code or hardware, depending on the target platform. This will enable designers to defer some of the partitioning decisions to later stages of the algorithm development.

We have successfully replaced all or substantial parts of DSP-based algorithms with FPGA implementations. Given the continuous improvement of both high-level synthesis and FPGA technologies, we expect that this approach will continue to gain ground against dedicated DSP implementations.

**Sergio R. Ramírez** is a technical marketing engineer for Mentor Graphics. He held an assistant professorship of electrical engineering and was the T. Jefferson Miers Fellow at Bucknell University.

**Shawn McCloud** is the high-level synthesis product line director for Mentor Graphics, where he has also held technical and product marketing positions focusing on RTL and high-level synthesis. McCloud has a BS in electrical and computer engineering from Case Western Reserve University.