Although FFTs and FIR filters may seem complex, in reality they are built from simple add, subtract, and multiply operations. So how can these arithmetic modules, along with the shift and pipeline registers in modern FPGAs, be configured in different modes to provide greater flexibility and control at the desired level of performance? In this "How To" paper, we outline practical steps, along with common mistakes to avoid, for extracting the best results from your DSP-based FPGA designs.
High-performance, FPGA-based DSP designs typically demand high bandwidth, high throughput, and low operating power, which leaves very little room for error during design planning. To succeed with such designs, you need to understand the nuances of the design specification, the target technology architecture, and the synthesis tools. Because it is difficult to be an expert on every aspect of DSP-based design using programmable logic devices, this article outlines actions you can take to meet your objectives when handling these designs.
Know your target technologies
DSP algorithms are increasingly being integrated into products such as camera-equipped cellular phones and portable media players. FPGAs are an attractive platform for these applications, since they allow manufacturers to differentiate and compete via value-added features and upgrades, especially within the short product life spans typically seen in the consumer electronics arena. More recently, FPGA architectures have seen considerable performance improvements by virtue of new embedded features, especially dedicated DSP blocks for building compute-intensive applications.
FPGAs now incorporate embedded features that enable multiplication, accumulation, addition/subtraction and summation, all of which are commonly used for DSP functions. With these basic arithmetic functionalities, designing the overall DSP-based application becomes fast, flexible, and efficient. At the core of a typical DSP block is a multiplier feeding an adder. DSP blocks have additional features that you can leverage for improved resource utilization and performance:
- Pipelined registers between the multiplier and adder.
- Built-in registers at the inputs and output of the DSP block.
- The output can be loaded synchronously from a dedicated input, from the multiplier output, or from a combination of the two.
- You can cascade DSP blocks, such that the output of one stage feeds the next block (great for FIR filter implementation).
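To make the role of these registers concrete, the following Python sketch models a multiplier feeding an adder with registers at the inputs, between the multiplier and the adder, and at the output. It is a behavioral illustration only and does not model any specific vendor's block; the class name PipelinedMultAdd, the run helper, and the three-stage depth are assumptions chosen for this example.

```python
class PipelinedMultAdd:
    """Cycle-level sketch: a multiplier feeding an adder, with three register stages."""

    def __init__(self):
        self.in_reg = (0, 0, 0)   # input registers for a, b, c
        self.mid_reg = (0, 0)     # pipeline register: product and delayed c
        self.out_reg = 0          # output register: sum

    def clock(self, a, b, c):
        """Advance one clock edge and return the registered output."""
        # Compute every next-state value from what the registers held
        # before this edge, then update them together, as flip-flops would.
        next_out = self.mid_reg[0] + self.mid_reg[1]
        reg_a, reg_b, reg_c = self.in_reg
        next_mid = (reg_a * reg_b, reg_c)
        self.in_reg = (a, b, c)
        self.mid_reg = next_mid
        self.out_reg = next_out
        return self.out_reg


def run(samples, flush=3):
    """Feed (a, b, c) tuples, then clock in zeros to drain the pipeline."""
    dsp = PipelinedMultAdd()
    outputs = [dsp.clock(a, b, c) for a, b, c in samples]
    outputs += [dsp.clock(0, 0, 0) for _ in range(flush)]
    return outputs


if __name__ == "__main__":
    # a*b + c for each tuple appears on the third clock edge after it is applied.
    print(run([(2, 3, 4), (5, 6, 7)]))
```

In this model each result becomes visible on the third clock edge after its operands are applied. That is the latency cost of pipelining: each register stage shortens the combinational path between flip-flops but adds one clock of delay.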
High-end DSP resources are both unique and versatile. You can map several different types of functions onto each of these DSP blocks, including multiply-accumulation (MAC), multiply-addition (MADD), counters, and shifters. But this versatility makes the design more complex, especially when it comes to instantiating these DSP blocks and connecting them into the final design [1]. In addition, the many different configurations of these dedicated DSP resources make it challenging to understand all of these components and use them efficiently in real designs. The DSP blocks available in Altera's Stratix II [2], Lattice Semiconductor's LatticeECP-DSP devices [3], and Xilinx's Virtex-4 (XtremeDSP or DSP48 slice) [4] are illustrated in the following figures:
- Fig 1 shows an Altera Stratix-II DSP block.
- Fig 2 shows a LatticeECP DSP block.
- Fig 3 shows a Xilinx Virtex-4 DSP tile comprising two DSP48 slices.
Recognize the pros and cons of your synthesis tools
It takes in-depth understanding to instantiate each of these blocks and to stitch them together into the design directly as individual technology cells. Certainly, you benefit from access to techniques that can automatically take the HDL code and target it efficiently to any of these architectures. To aid in the efficient implementation of any DSP-based algorithms, make sure to use a vendor-neutral methodology that breaks the design into basic building blocks. You can then flexibly map these blocks to the DSP resources in the FPGA architecture or device that is most suitable for the target application.
Do not instantiate cells and generate netlists specific to each device vendor, because the resultant technology dependence locks you to a specific FPGA implementation. It is best to use the synthesis tool to infer all of the DSP functions whenever possible. Using generic RTL coding styles is an added benefit, because it enables you to efficiently reuse your optimized designs across different FPGA vendors and device architectures.
Another trend worth noting is that the RTL code for many of today's DSP-based designs is increasingly generated automatically by high-level algorithmic synthesis tools. Therefore, look for FPGA synthesis tools that provide the same level of advanced inferencing capability for the HDL netlists generated by high-level synthesis tools as that provided for hand-coded HDL.
Moreover, you really need to know exactly what your synthesis tool is capable of in terms of inferencing and mapping DSP blocks. These functions are not a subsidiary feature of FPGA synthesis tools. In fact, you should treat these as core functions, which should be an essential part of every FPGA design flow. These types of synthesis capabilities are further required by certain FPGA architectures, in order to achieve the optimal performance. To illustrate this point, consider a 35 x 35, fully pipelined multiplier as an example. Assume that Virtex-4 is your target device and some of the common RTL behavior for this circuit is similar to that shown in Fig 4.
Transforming this into an actual, optimal mapping algorithm (such as the example shown in Fig 5) involves at least a couple of key factors. First, your design team needs to be aware of particular nuances of the Virtex-4 architecture. Second, your synthesis tool must perform certain optimization techniques. For example:
- Based on the target device architecture, you need to know that there must be five output pipeline registers.
- Your synthesis tool must be able to accurately infer a 35 x 35 multiplier.
- Your synthesis tool is expected to perform forward and backward retiming in order to distribute the pipeline registers optimally.
- Your design team must also realize the impact of pipelining on clock latency (this may not be very critical in DSP designs and it is commonly used as a trade-off with a higher clock frequency).
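As a rough illustration of what inferring a 35 x 35 multiplier involves, the Python sketch below splits each 35-bit signed operand into a signed upper 18 bits and an unsigned lower 17 bits, forms four partial products that each fit an 18 x 18 signed multiplier, and recombines them with 17-bit shifts. The 18 x 18 multiplier width, the split points, and the function names are assumptions made for this example. The sketch checks only the arithmetic; register placement and cascade wiring should be taken from the device documentation.

```python
def split_35(x):
    """Split a 35-bit signed integer into (signed upper 18 bits, unsigned lower 17 bits)."""
    assert -(1 << 34) <= x < (1 << 34), "operand must fit in 35-bit two's complement"
    lo = x & ((1 << 17) - 1)   # unsigned: 0 .. 2**17 - 1
    hi = x >> 17               # arithmetic shift keeps the sign: -2**17 .. 2**17 - 1
    return hi, lo


def mult_35x35(a, b):
    """Multiply two 35-bit signed values using four 18 x 18 signed partial products."""
    a_hi, a_lo = split_35(a)
    b_hi, b_lo = split_35(b)
    p0 = a_lo * b_lo           # weight 2**0
    p1 = a_hi * b_lo           # weight 2**17
    p2 = a_lo * b_hi           # weight 2**17
    p3 = a_hi * b_hi           # weight 2**34
    return p0 + ((p1 + p2) << 17) + (p3 << 34)


if __name__ == "__main__":
    a, b = -(1 << 34), (1 << 34) - 1
    print(mult_35x35(a, b) == a * b)
```

Each of the four partial products would occupy one multiplier, and the shifted additions are the points at which a synthesis tool has to distribute pipeline registers.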
What are your alternatives if the synthesis tools lack any of these inference capabilities? In this case, you can try to generate this function from the Xilinx Core Generator, but there are some disadvantages to this approach. First, you cannot possibly generate all the DSP functions in your designs. Second, you are forced to manually instantiate these generated DSP cells, thereby limiting and locking your RTL source files to this particular vendor only.
As mentioned previously, MAC and MADD are the two most common DSP operators supported by FPGA synthesis tools. Within each DSP block, there is a finite number of propagation delays on each major DSP function, e.g. multiplier, adder, etc. However, in order to maximize the operating clock speed or Fmax from inputs to outputs of the DSP block, it is recommended that you arrange to provide internal pipeline registers in each stage (as was shown in Fig 4). It is extremely important for synthesis tools to correctly infer and absorb all pipeline registers into DSP blocks.
Why is it so critical to utilize all the internal pipeline registers whenever possible? Because you can no longer afford unnecessary routing delays when designing DSP circuits in the 300 to 500 MHz Fmax range. Every time a synthesis tool has to use external flip-flops as pipeline registers, it will most likely cost you a penalty in terms of external routing delays, and the only thing you can hope for is that "it is not on one of those 311-MHz critical timing paths!" You can take the following steps to minimize these common timing problems:
- Find out from your synthesis tool vendor whether certain coding guidelines are required.
- Find out which type(s) of flip-flop can be absorbed into the DSP blocks of your target device, e.g. synchronous, asynchronous, or both. This way, you avoid using the wrong type of sequential element, which would prevent the synthesis tool from absorbing pipeline registers into the DSP blocks.
- Try to code up some small MAC and MADD operators and synthesize them to make sure that (a) they are inferred correctly and (b) you get the expected quality of results (QoR).
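When you build such small test operators, it can help to have a software reference to compare simulation results against. The Python sketch below is one possible reference model for MADD and MAC. The function names and the 48-bit accumulator width are assumptions for illustration, so substitute the width of your target device.

```python
def wrap_signed(value, width):
    """Wrap an integer to `width`-bit two's complement, as a fixed-width register would."""
    mask = (1 << width) - 1
    value &= mask
    return value - (1 << width) if value >> (width - 1) else value


def madd(a, b, c, width=48):
    """Multiply-add: a*b + c, wrapped to the assumed result width."""
    return wrap_signed(a * b + c, width)


def mac(pairs, width=48):
    """Multiply-accumulate: running sum of a*b over the (a, b) pairs."""
    acc = 0
    for a, b in pairs:
        acc = madd(a, b, acc, width)   # the accumulator feeds back into the adder
    return acc


if __name__ == "__main__":
    print(madd(3, 4, 5))
    print(mac([(1, 2), (3, 4), (-5, 6)]))
```

Comparing the synthesized operator's simulated output with such a model covers step (a); the quality of results in step (b) still has to come from the synthesis and place-and-route reports.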
Understand key optimization techniques in DSP designs
As important as it is to know whether your synthesis tools offer key optimization techniques, it also helps to understand some of the most commonly used ones. This knowledge will help you debug and improve your designs much more efficiently. Three examples follow:
1. Optimizations using MAC and MADD
When writing RTL behavior for your DSP functions, there are many different ways to express the same function, which in turn may create problems for synthesis tools. This can lead to poor QoR in terms of both area utilization and Fmax. Consider the expression (a*b + c*d) + (e*f + g*h) + i. Traditional DSP synthesis will result in two MADDs for (a*b + x) and (e*f + y), where x = c*d and y = g*h. This will take up six DSP blocks: two for the MADDs, two for the multipliers "x" and "y", and two for the adders (Fig 6).
This is not an effective utilization of DSP blocks. Alternatively, using automatic inferencing techniques available in tools such as Mentor Graphics Precision Synthesis, along with other methods described in this paper, allows you to re-organize the above expression as {a*b + (g*h + (e*f + (c*d + i)))}. This leads to inferencing of four MADD functions, so the whole expression gets mapped within only four DSP48 slices (Fig 7).
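The two groupings can be compared with a short Python sketch. It checks only that the re-organized chain of four multiply-adds computes the same value as the original expression. The function names are hypothetical, and the operator counts in the comments restate the figures given in the text rather than anything the code measures.

```python
def madd(a, b, c):
    """One multiply-add operator: a*b + c."""
    return a * b + c


def tree_form(a, b, c, d, e, f, g, h, i):
    """(a*b + c*d) + (e*f + g*h) + i, grouped as originally written."""
    x = c * d                   # standalone multiplier
    y = g * h                   # standalone multiplier
    left = madd(a, b, x)        # MADD
    right = madd(e, f, y)       # MADD
    return (left + right) + i   # two standalone adders


def chain_form(a, b, c, d, e, f, g, h, i):
    """a*b + (g*h + (e*f + (c*d + i))), a chain of four MADDs."""
    s1 = madd(c, d, i)
    s2 = madd(e, f, s1)
    s3 = madd(g, h, s2)
    return madd(a, b, s3)


if __name__ == "__main__":
    args = (1, 2, 3, 4, 5, 6, 7, 8, 9)
    print(tree_form(*args), chain_form(*args))
```

The rewrite is safe because integer addition is associative, and this remains true under two's complement wraparound at a fixed result width. In the chained form, the adder input of each stage is the output of the previous stage, which is the pattern that cascaded DSP blocks are designed for.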
2. Optimizations using Cascading Connections
If you have a DSP application where the highest possible frequency is an absolute requirement, it is critical that you have the ability to build fast arithmetic operators. Consider the example shown in Fig 8, where a 48-bit adder is driving a MAC operator. You can implement this using two DSP blocks. The dedicated chaining from the output of the first DSP block to the input of the second reduces the routing delay and increases the speed of the design. This way, you can achieve a frequency of over 400 MHz along with a significant area reduction.
3. Optimizations using Scan Chains
Most FPGA technologies support scan chains for applications in which inputs arrive in a delayed manner, such as FIR filters (Fig 9). You can use scan chains to cascade DSP stages, where a stage requires the input of the previous stage delayed by a clock cycle. The output from the previous input stage is routed through a SCANOUT port to the input of the next stage. Another advantage is that the routing delay – associated with exiting the DSP to the FPGA fabric and then returning back from the FPGA fabric to the DSP – is eliminated.
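The scan-chain arrangement can be sketched in Python as a chain of stages, where each stage holds one sample register, multiplies it by its coefficient, and adds the running sum from the previous stage. This is a simplified behavioral model: it leaves out the extra pipeline registers that a real cascaded implementation would contain, and the function names are hypothetical. A direct evaluation of the FIR sum is included as a cross-check.

```python
def fir_scan_chain(samples, coeffs):
    """FIR filter modeled as DSP stages linked by a one-clock-per-stage scan chain."""
    taps = [0] * len(coeffs)        # one sample register per stage
    outputs = []
    for x in samples:
        # Clock edge: the new sample enters stage 0, and every other stage
        # takes the sample its predecessor held (scan-out to next scan-in).
        taps = [x] + taps[:-1]
        acc = 0
        for sample, coeff in zip(taps, coeffs):
            acc = sample * coeff + acc   # multiply-add, with the sum passed down the cascade
        outputs.append(acc)
    return outputs


def fir_direct(samples, coeffs):
    """Reference: y[n] = sum over k of coeffs[k] * samples[n - k]."""
    return [
        sum(coeffs[k] * samples[n - k] for k in range(len(coeffs)) if n - k >= 0)
        for n in range(len(samples))
    ]


if __name__ == "__main__":
    x, h = [1, 2, 3, 4], [1, 0, -1]
    print(fir_scan_chain(x, h), fir_direct(x, h))
```

Because both the delayed samples and the partial sums pass directly from one stage to the next, neither has to leave the DSP column, which is where the routing-delay saving described above comes from.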
Conclusion
There is no single panacea for all of the DSP design challenges encountered in programmable logic. Understanding the nuances of FPGA-based DSP architectures, and then making use of synthesis tools and techniques that provide the important optimizations in a vendor-neutral manner, are key steps in the right direction. It is also reasonable to expect the best in class from both worlds – FPGA synthesis tools as well as FPGA devices – when creating high-performance FPGA-based DSP designs.
References:
[1] Using Precision Synthesis to Design with the XtremeDSP Slice in Virtex-4, Douang Phanthavong, http://www.mentor.com/products/fpga_pld/techpubs/index.cfm
[2] Altera Stratix II DSP Blocks, http://www.altera.com/products/devices/stratix2/features/dsp/st2-dsp_block.html
[3] Lattice Semiconductor sysDSP Block Brings High DSP Performance to FPGAs, http://www.latticesemiconductor.com/products/fpga/ecp/sysdsp.cfm
[4] XtremeDSP Design User Guide, http://www.xilinx.com/bvdocs/userguides/ug073.pdf
Douang Phanthavong is a Product Marketing Engineer primarily focused on FPGA-based synthesis algorithms and advanced inference/mapping optimization techniques in the Design Creation and Synthesis Division of Mentor Graphics. Prior to Mentor, he worked as a hardware development engineer at NEC America, Inc. and as an IP designer and senior application engineer at Lattice Semiconductor. Douang has an MSEE degree from the OGI School of Science & Engineering. Douang can be reached at douang_phanthavong@mentor.com.