Editor's Note: In this series of design articles, the authors offer a close look at various design challenges and effective use of design tools and techniques for resolution of those challenges. Be sure to check out the second article in this series: Building 491MHz FPGA-based wireless radio heads
FPGAs keep getting larger, the designs more complex, and the need for high-level design (HLD) flows never seems to go away. C-based design for FPGAs has been promoted for over two decades and several such tools are currently on the market. Model-based design has also been around for a long time from multiple vendors. OpenCL for FPGAs has been getting lots of press in the last couple of years. Yet, despite all of this, 90+% of FPGA designs continue to be built using traditional Verilog or VHDL.
No one can deny the need for HLD. New FPGAs contain over 1 million logic elements, with thousands of hardened DSP and memory blocks. Some vendors' devices can even support floating-point as efficiently as fixed-point arithmetic. Data converters and interface protocols routinely run at multiple GSPS (giga samples per second), requiring highly parallel or vectorized processing. Timing closure, simulation, and verification become ever more time-consuming as design sizes grow. But HLD adoption still lags, and FPGAs are primarily programmed by hardware-centric engineers using traditional hardware description languages (HDLs).
The primary reason for this is quality of results (QoR). All high-level design tools have two key challenges to overcome. One is to translate the designer's intent into implementation when the design is described in a high-level format. This is especially difficult when software programming languages are used (C++, MATLAB, or others), which are inherently serial in nature. It is then up to the compiler to decide by how much and where to parallelize the hardware implementation. This can be aided by adding special intrinsics into the design language, but this defeats the purpose. OpenCL addresses this by having the programmer describe serial dependencies in the datapath, which is why OpenCL is often used for programming GPUs. It is then up to the OpenCL compiler to decide how to balance parallelism against throughput in the implementation. However, OpenCL programming is not exactly a common skillset in the industry.
The second key challenge is optimization. Most FPGA hardware designers take great pride in their ability to optimize their code to achieve the maximum performance in a given FPGA, in terms of design Fmax, or the achievable frequency of the system clock data rate. This requires closing timing across the entire design, which means setup and hold times have to be met for every circuit in the programmable logic and every routing path in the design. The FPGA vendors provide automated synthesis, fitting, and routing tools, but the achievable results are heavily dependent upon the quality of the Verilog and/or VHDL source code. This requires both experience and design iteration. The timing closure process is tedious and sometimes compared to “Whack-a-Mole,” meaning that when a timing problem is fixed in one location of the design, a different problem often surfaces at another location.
An oft-quoted metric for a high-level design tool is to achieve results that are no more than 10% degraded from a high-quality hand-coded design, both in terms of Fmax and the utilization of FPGA resources, typically measured in “LEs” (logic elements) or “LCs” (logic cells). In practice, very few tools can reliably deliver such results, and there is considerable skepticism among the FPGA design community when such a tool is promoted by EDA or FPGA vendors.
Having said this, there is a design tool that is being quietly adopted by FPGA engineers precisely because it not only addresses this QoR gap, but — in most cases — extends it in the other direction, meaning that the tool produces results that are usually better than their hand-coded counterparts.
This tool is called DSP Builder Advanced Blockset (the marketing folks were obviously not at their best when naming this tool). This is a model-based design tool, meaning that design entry is accomplished using models in the Mathworks' Simulink environment. The tool was first introduced to the market in 2007.
There are other model-based tools on the market, such as HDL Coder, Synplify, and System Generator; however, only DSP Builder Advanced Blockset offers the combination of the following ten features:
- Decoupling of system data rates from FPGA clock rates; native multi-channel capabilities.
- Automated timing closure at high Fmax, including auto-pipelining.
- Deterministic latency and data throughput.
- Optimal usage of FPGA hard block features.
- Design portability across FPGA families.
- Fixed- or floating-point numerical implementation.
- Support for vector manipulation.
- Math.h library.
- System simulation in the Mathworks' environment.
- Hardware simulation from the Mathworks' environment.
This combination is what allows the tool to deliver superior QoR along with the productivity advantages of a high-level simulation, design, and verification tool flow. Let's look at each of these features in a little more detail…
Decoupling of system data rates from FPGA clock rates
Using DSP Builder, the user specifies the desired design clock rate. The data rate can be higher or lower than the clock rate, sometimes dramatically so. The tool will automatically parallelize the data and represent the data buses as vectors in cases where the data rate is higher than the clock rate. Integer ratios work most efficiently (4, 8, 12, 16, 32…) but any ratio will work and the control path will insert empty data into some of the vectors to accommodate this.
This capability provides the ability to support very high data rates of many GSPS using realistic FPGA clock rates of several hundred MHz, depending upon the FPGA family.
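The arithmetic behind this decoupling can be sketched in a few lines of Python. This is only an illustrative model, not DSP Builder's actual algorithm: the function names `lanes_needed` and `lane_utilization` are hypothetical, and the "empty data" the control path inserts is modeled simply as the unused fraction of the vector slots.

```python
def lanes_needed(sample_rate_hz: int, clock_hz: int) -> int:
    """Smallest vector width (samples per clock) that sustains the data rate.

    Hypothetical helper: integer ceiling of sample_rate / clock, minimum 1.
    """
    if sample_rate_hz <= 0 or clock_hz <= 0:
        raise ValueError("rates must be positive")
    return max(1, (sample_rate_hz + clock_hz - 1) // clock_hz)


def lane_utilization(sample_rate_hz: int, clock_hz: int) -> float:
    """Fraction of vector slots that carry real samples.

    A value below 1.0 means some slots must be filled with empty (invalid)
    data, as happens for non-integer ratios.
    """
    lanes = lanes_needed(sample_rate_hz, clock_hz)
    return sample_rate_hz / (lanes * clock_hz)


# 4 GSPS at a 500 MHz clock is an integer ratio: 8 lanes, every slot used.
print(lanes_needed(4_000_000_000, 500_000_000))
# 3 GSPS at a 400 MHz clock is a ratio of 7.5: still 8 lanes, some slots empty.
print(lanes_needed(3_000_000_000, 400_000_000),
      lane_utilization(3_000_000_000, 400_000_000))
```

Under this simple model, an integer ratio keeps every lane busy on every clock, which is one way to read the remark that integer ratios work most efficiently.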
Within DSP Builder, the designer builds the datapath, often containing various rate FIR filters, memory blocks, NCOs, mixers, saturate and round blocks, and so forth. However, the designer need only lay down a single channel datapath assuming the design clocks at the required rate, regardless of the actual data rate. DSP Builder will build the data path with the specified number of channels, and vectorize (or parallelize) the design to achieve the needed data throughput. This is specified in a parameter file, which means it is easily changed, with the only effort being a recompile. The tool generates all needed control logic to handle multi-channel and higher data rates, even for complex datapaths. Further, all configuration or coefficient registers can be read or written, with the addressing and accessing logic auto-generated. This register-access logic operates at a lower clock rate than the datapath.
Automated timing closure at high Fmax, including auto-pipelining
Most high-level design tools output Verilog or VHDL files, which the FPGA vendor's synthesis, fitter, and routing tools then use to try to achieve the best possible clock rates coupled with the most compact logic implementation. The generated code is generic, relying on the FPGA vendor tools to map into a particular FPGA architecture and speed grade.
In contrast, the DSP Builder tool works with the user-specified clock rate, FPGA family, and FPGA speed grade to auto-generate optimized Verilog or VHDL (your choice) code. One key aspect of this is auto-pipelining. The designers will have register stages throughout their design, but only the registers that are algorithmically necessary. For example, a FIR filter has a specific number of data register delays between taps, or an IIR filter has a specific feedback path register delay. DSP Builder will read the specific timing parameters for the chosen FPGA/speed grade and, based on the desired clock rate, will add the appropriate stages of pipelining registers. This customizes the auto-generated code for that specific FPGA, balancing latency and register resources so as to achieve the required clock rate.
Despite these efforts, timing closure is not guaranteed. DSP Builder estimates the degree of pipelining based upon timing parameters, but the final result is not known until the design is fitted and routed in the FPGA. Therefore, an additional parameter, known as clock margin, is available. A positive clock margin tells DSP Builder to try to meet timing for a higher Fmax than specified, while a negative clock margin equates to a lower Fmax. This can be selectively applied to a portion of the design that is not closing timing, thereby directing DSP Builder to apply a greater or lesser degree of pipelining during the auto-generation of the Verilog/VHDL code.
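The interaction between target clock rate, clock margin, and pipelining depth can be illustrated with a toy model. The sketch below is an assumption-laden simplification and not the tool's real timing engine: it treats a path as a single combinational delay, ignores register setup time and routing variation, assumes the margin is expressed in MHz, and uses the hypothetical name `pipeline_stages`.

```python
import math


def pipeline_stages(path_delay_ns: float,
                    target_fmax_mhz: float,
                    clock_margin_mhz: float = 0.0) -> int:
    """Estimate how many pipeline registers split a path to meet timing.

    Toy model: the path is cut into equal segments, each of which must fit
    in one clock period of (target_fmax + clock_margin).
    """
    effective_mhz = target_fmax_mhz + clock_margin_mhz
    if effective_mhz <= 0:
        raise ValueError("effective clock rate must be positive")
    period_ns = 1000.0 / effective_mhz
    segments = math.ceil(path_delay_ns / period_ns)
    return max(0, segments - 1)


# A 7 ns path at a 400 MHz target (2.5 ns period) needs 3 segments.
print(pipeline_stages(7.0, 400.0))          # 2 registers
# A positive margin asks for more pipelining on the same path.
print(pipeline_stages(7.0, 400.0, 100.0))   # 3 registers
# A negative margin relaxes the target and saves registers.
print(pipeline_stages(7.0, 400.0, -150.0))  # 1 register
```

Even in this crude form, the model shows why clock margin is a useful knob: it changes the pipelining depth for one troublesome region without touching the algorithmic description.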
This capability will be even more significant with new FPGA families being built on Intel's 14nm process. In addition to the conventional registers contained in the logic elements, these FPGAs also provide pipeline registers in all the routing paths throughout the device. All of these registers can be leveraged by the DSP Builder tool to achieve a design Fmax as high as 1 GHz.
Deterministic latency and data throughput
Latency will be affected by the degree of pipelining applied. However, design latency is another parameter that can be fixed, or set to a maximum, within the top-level design file. The normal procedure is to leave the latency initially unconstrained and to let the tool determine the required latency to achieve the requested Fmax. At that point, the latency can be constrained or locked down, thereby providing the necessary predictability to integrate the DSP Builder-generated design into the larger FPGA project file.
Within DSP Builder designs, latency balancing is performed across parallel data paths, so the appropriate signals line up when paths are merged in the tool. Again, the designer is only concerned with algorithmic delays; relative delays across different paths are maintained throughout the automated pipelining process. This allows for easy design changes without having to keep rebalancing latency in various paths.
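The idea of latency balancing can be sketched as follows. This is a minimal illustration with hypothetical path names and latencies, assuming the simplest possible policy (pad every path up to the slowest one); the tool's internal scheduling is not described in this article and may differ.

```python
from typing import Dict


def balance_latencies(path_latencies: Dict[str, int]) -> Dict[str, int]:
    """Return the extra delay registers each path needs to match the slowest.

    `path_latencies` maps a path name to its pipeline latency in clock
    cycles after auto-pipelining.
    """
    if not path_latencies:
        return {}
    target = max(path_latencies.values())
    return {name: target - latency for name, latency in path_latencies.items()}


# Three parallel paths that merge at an adder (latencies are made up).
paths = {"mixer": 5, "fir": 12, "bypass": 0}
print(balance_latencies(paths))
# {'mixer': 7, 'fir': 0, 'bypass': 12}
```

If a later recompile changes the FIR latency from 12 to 14 cycles, rerunning the same calculation updates the padding on the other paths, which is the bookkeeping the designer would otherwise redo by hand.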
Optimal usage of FPGA hard block features
FPGA architectures continually evolve, which requires modifications to the Verilog/VHDL code to take advantage of these changes. For example, the hardened DSP blocks have evolved from supporting simple 18×18 multiplies to 18×19, 27×27, and 36×36 multiply, dual-stage 64-bit accumulators, pre-adders, built-in coefficient registers, and — most recently — single-precision floating-point multipliers and adders. Similarly, the fundamental FPGA fabric has evolved from a simple 4-input LUT (lookup table) to a factorable 6-input LUT to a new routing structure known as HyperFlex, which is the key technology in enabling the 1 GHz FPGA.
The DSP Builder toolflow is on the front lines in terms of integrating and optimizing support for these features in successive FPGA families. Because the toolflow is used to describe the algorithmic flow in a Mathworks environment, the auto-generated code can be optimized for more advanced FPGA architectures without updating the design. This provides a future-proofing of algorithms and IP, and protection for long-lasting programs where the engineer(s) who created the original design may have left the organization or moved on to other projects.
Design portability across FPGA families
Along the same lines as the previous point, any design can be mapped to different FPGA families with simple parameter changes and a re-compile. This allows an existing Kalman filter design, for example, to be reused in a low-cost FPGA at lower clock rates and with a simpler FPGA architecture.
There is no need to recode designs in order to reuse them on another program that uses a different FPGA family, thereby providing large savings in testing and verification efforts, as well as further “future-proofing” the design.
Fixed- or floating-point numerical implementation
Traditionally, all FPGA designs have used fixed-point numerical representations. Fixed-point requires simpler circuits, both in the hard DSP blocks and in programmable logic. It also works well for many applications, although it does require some engineering experience to manage the decimal point in order to maintain adequate dynamic range and signal-to-noise ratios.
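To make the range-versus-precision trade-off concrete, here is a small Python sketch of signed fixed-point quantization with saturation. The helper names `to_fixed` and `from_fixed` are hypothetical, the integer bit count is assumed to include the sign bit, and Python's built-in `round` (round-half-to-even) stands in for whatever rounding mode a real design would choose.

```python
def to_fixed(value: float, int_bits: int, frac_bits: int) -> int:
    """Quantize `value` to a signed fixed-point integer, saturating on overflow.

    `int_bits` includes the sign bit, so the word is int_bits + frac_bits wide.
    """
    total_bits = int_bits + frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    raw = round(value * (1 << frac_bits))  # round-half-to-even
    return max(lowest, min(highest, raw))


def from_fixed(raw: int, frac_bits: int) -> float:
    """Convert a fixed-point integer back to a real value."""
    return raw / (1 << frac_bits)


# Same 16-bit word, two placements of the point.
for int_bits, frac_bits in [(2, 14), (8, 8)]:
    small = from_fixed(to_fixed(0.7071, int_bits, frac_bits), frac_bits)
    large = from_fixed(to_fixed(3.0, int_bits, frac_bits), frac_bits)
    print(f"{int_bits}.{frac_bits}: 0.7071 -> {small}, 3.0 -> {large}")
```

With 14 fractional bits, 0.7071 is represented finely but 3.0 saturates just below 2.0; with 8 fractional bits, 3.0 is exact but 0.7071 is coarser. Choosing the split for every node in a datapath is the engineering effort referred to above.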
Floating-point removes this burden on the designer, and — for some newer applications, such as advanced radar processing, MIMO based communications systems (5G), and datacenter acceleration — floating-point is a must.
Floating-point can be implemented using soft logic, and no tool does a better job of supporting this capability than DSP Builder, with seven different floating-point formats. However, widespread use of floating-point in high GFLOPS applications requires that floating-point operators be hardened, rather than being implemented in programmable logic. Modern FPGA architectures now support floating-point adders and multipliers in the DSP blocks. This capability is very difficult to leverage on a wide scale using traditional FPGA design entry, but with DSP Builder, which comes equipped with a number of high GFLOPS example designs, it is as easy and natural to use as fixed-point.
Support for vector manipulation
Native support for vectors in a high-level design flow is essential to enable “super-sample rate” designs, where the data rate is much higher than the FPGA clock rate, as the samples must be processed in parallel. The design tool has to provide not just vector data types, but the ability to manipulate vectors: vector sums, vector accumulation, vector dot products, vector cross products, and so forth.
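A super-sample rate datapath can be pictured as a filter that consumes several samples per clock. The following Python sketch is a behavioral illustration only, with hypothetical function names; it models each simulated clock cycle as one block of `lanes` samples, with every lane computing its own dot product against a shared history, and checks the result against an ordinary one-sample-at-a-time FIR.

```python
from typing import List, Sequence


def fir_serial(samples: Sequence[int], taps: Sequence[int]) -> List[int]:
    """Reference FIR: one output per input sample, zero initial state."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * samples[n - k]
        out.append(acc)
    return out


def fir_vectorized(samples: Sequence[int], taps: Sequence[int],
                   lanes: int) -> List[int]:
    """FIR that processes `lanes` samples per simulated clock cycle."""
    if lanes < 1 or len(samples) % lanes:
        raise ValueError("sample count must be a multiple of lanes")
    hist_len = len(taps) - 1
    history = [0] * hist_len          # previous samples, oldest first
    out = []
    for start in range(0, len(samples), lanes):
        window = history + list(samples[start:start + lanes])
        # Each lane is an independent dot product; in hardware these
        # would all be evaluated in the same clock cycle.
        for lane in range(lanes):
            pos = hist_len + lane
            out.append(sum(tap * window[pos - k] for k, tap in enumerate(taps)))
        history = window[len(window) - hist_len:]
    return out


data = [1, 2, 3, 4, 5, 6, 7, 8]
coeffs = [1, 2, 3]
print(fir_vectorized(data, coeffs, lanes=4) == fir_serial(data, coeffs))  # True
```

The two functions produce identical outputs, but the vectorized version needs only a quarter as many "clock cycles" for four lanes, which is the trade of parallel hardware for clock rate that vector support makes possible.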
For FPGAs to act as accelerators to high performance CPUs, vector support is fundamental for highly parallel algorithms, with a focus on linear algebra.
Math.h library
Along with native floating-point support comes the need for math functions. DSP Builder provides a full complement of math functions using underlying algorithms that are specifically designed for efficient FPGA implementation.
Due to the high Fmax and low logic usage of the math library, designers no longer need to craft their implementations to minimize use of functions such as divide, square root, trigonometric, logarithm, and exponent operators.
System simulation in the Mathworks' environment
Complex algorithms must be developed and verified at the system level. Model-based designs allow the system engineer to leverage the extensive capabilities and toolboxes of Mathworks products. Oftentimes, the test bench can be more complex than the algorithm being implemented, so the ability to simulate using MATLAB and Simulink is extremely useful.
The DSP Builder design can also be simulated within the Mathworks environment, thereby allowing full visibility and easy debug before any hardware is involved. DSP Builder will also provide resource estimations (e.g., logic, DSP blocks, memory) without the need to perform any FPGA compiles, which can be fairly time-consuming. This allows for rapid design space exploration, enabling the designer to make trade-offs and ensure the design will fit into the chosen FPGA prior to invoking synthesis, fitting, and routing. Furthermore, DSP Builder will generate a ModelSim testbench using the same vectors generated by the Mathworks testbench, and also run the hardware verification. This certifies the fidelity of the DSP Builder code-generation process, ensuring the behavior observed in the original Mathworks simulation is faithfully reproduced in hardware.
Hardware simulation from the Mathworks' environment
A further option is provided to accelerate verification. Rather than running the simulation on the x86 CPU, there is an option to run on the actual FPGA hardware. This capability is naturally called “FPGA in the Loop” or FIL. In most cases, FPGA development boards are readily available and, using a USB connection between the development machine and FPGA hardware, the actual processing can be performed in real time on the FPGA. The input and output data is buffered, and the FPGA is throttled to operate at the rate data can be supplied and retrieved. For high GFLOPS or GOPS algorithms, this can provide dramatic speedups in system verification, along with the added peace of mind that the actual implementation hardware is being exercised.
In conclusion, the best way to convince skeptical engineers who will actually have to rely on the tool is to provide representative, complete, open source design examples. These are currently available, and many ship with the tool itself.
The performance, resource utilization, and design methodology of a few of these example designs will be detailed in future articles as follows:
- High throughput FFT, 4K points at 10 GSPS.
- 4 GSPS Digital Upconversion.
- Wideband Beamformer with picosecond resolution.
- 4G Remote Radio Head including CFR and DPD.
- Cholesky and QRD matrix solvers in single-precision floating-point.
- SAR radar in single-precision floating-point.
- Short-range FMCW radar in low-cost FPGAs.
- Field-Oriented Motor Control in floating-point in low cost FPGAs.
- Space-Time Adaptive radar in single-precision floating-point.
This combination of designs serves to showcase unique capabilities that are not only unavailable in competing high-level design tools, but are often extremely difficult to implement in traditional Verilog or VHDL designs. In fact, many customers have adopted DSP Builder Advanced Blockset precisely because they ran into difficulties using traditional toolflows, and were only able to achieve their design goals using the DSP Builder design flow.