Hardware-based floating-point design flow

Michael Parker, Altera Corporation

January 17, 2011

Optimal implementation of FP processing

With FPGAs, using a barrel-shifter structure for normalization requires high fan-in multiplexers at each bit position, plus the routing to connect each of the possible bit inputs. This leads to very poor fitting, slow clock rates, and excessive routing. A better solution with FPGAs is to use multipliers. For a 24-bit single-precision mantissa (the implied leading bit is now included), a 24x24 multiplier shifts the input by multiplying it by 2^N. Many FPGAs today have large numbers of hardened multiplier circuits that operate at high clock rates.
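
As a rough illustration of the technique, the C model below expresses the normalization shift as a multiplication by 2^N, which is how the operation maps onto a hardened 24x24 multiplier. The function and variable names are illustrative only and are not taken from any vendor library.

```c
#include <stdint.h>
#include <stdio.h>

/* Rough C model of normalization by multiplication: shifting a 24-bit
 * mantissa left by N bits is the same as multiplying it by 2^N, so the
 * FPGA can use a hardened 24x24 multiplier instead of a barrel shifter.
 * (In hardware, 2^N comes from decoding the leading-zero count.)         */
static uint32_t normalize_by_multiply(uint32_t mant24, unsigned lzc)
{
    uint32_t two_pow_n = 1u << lzc;   /* second multiplier operand, 2^N    */
    return mant24 * two_pow_n;        /* the multiplier performs the shift */
}

int main(void)
{
    uint32_t m = 0x012345;            /* denormalized 24-bit mantissa      */
    unsigned lzc = 7;                 /* leading zeros in the 24-bit field */
    printf("normalized = 0x%06X\n",
           (unsigned)(normalize_by_multiply(m, lzc) & 0xFFFFFF));
    return 0;
}
```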

Another technique used to minimize the amount of normalization and de-normalization is to increase the size of the mantissa. This allows the radix point to move a few positions before normalization is required, such as in a multiplier product. This is easily accomplished in an FPGA, as shown in Figure 1 below.

For most linear algebra functions, such as vector sums, vector dot products, and cross products, a 27-bit mantissa reduces the frequency of normalization by over 50%. For more nonlinear functions, such as trigonometric functions, division, and square root, a larger mantissa is needed; a 36-bit mantissa works well in these cases. The FPGA must therefore support 27x27 and 36x36 multipliers. For example, one recently announced FPGA offers over 2,000 multipliers when configured as 27x27, or over 1,000 when configured as 36x36.

 

Figure 1. New Floating-Point Approach
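
The following minimal C sketch (assumed values, not from the article) models why a few extra mantissa bits reduce normalization frequency: with a 27-bit accumulator holding 24-bit mantissas, the three guard bits absorb the carry growth of up to eight same-exponent additions, so a single normalization at the end replaces a normalize step after every add.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch: mantissas modeled as 24-bit integers, accumulated in a 27-bit
 * field. The 3 guard bits absorb the carry growth of up to 8 additions,
 * so one normalization at the end replaces a normalize after every add.  */
int main(void)
{
    uint32_t acc = 0;                     /* models a 27-bit-wide mantissa */
    for (int i = 0; i < 8; i++)
        acc += 0xFFFFFF;                  /* worst-case 24-bit operands    */

    /* single final normalization back to a 24-bit mantissa */
    int exp_adjust = 0;
    while (acc > 0xFFFFFF) { acc >>= 1; exp_adjust++; }

    printf("mantissa = 0x%06X, exponent adjust = +%d\n",
           (unsigned)acc, exp_adjust);    /* prints 0xFFFFFF, +3           */
    return 0;
}
```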

These techniques are used to build a high-performance floating-point datapath within the FPGA. Because data must still be exchanged in IEEE 754 representation for compatibility with other floating-point processing, the floating-point circuits must support this format at the boundaries of each datapath, such as a fast Fourier transform (FFT), a matrix inversion, or a sine function.

This floating-point approach has been found to yield more accurate results than performing IEEE 754 compliance at each operator. The additional mantissa bits provide better numerical accuracy, as shown in Table 1 below, while the elimination of barrel shifters permits high clock-rate performance.

 

Table 1. FPGA Floating-Point Precision Results

Table 1 lists the mean, the standard deviation, and the Frobenius norm of the error. The SD subscript refers to the IEEE 754-based single-precision architecture compared against the reference double-precision architecture, and the HD subscript refers to the hardware-based single-precision architecture compared against the same double-precision reference.
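
For reference, the Frobenius norm reported in Table 1 is the square root of the sum of the squared entries of the error matrix, formed for example as the elementwise difference between a single-precision result and the double-precision reference:

\[
\lVert E \rVert_F \;=\; \sqrt{\sum_{i}\sum_{j} e_{ij}^{2}},
\qquad E = A_{\text{single}} - A_{\text{double}}
\]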

Floating-point verification

Floating-point results cannot be verified by comparing bit for bit, as is typical in fixed-point arithmetic. The reason is that floating-point operations are not associative, which can easily be demonstrated by writing a short program in C or MATLAB that sums a selection of floating-point numbers.

Summing the same set of numbers in the opposite order produces a result that differs in the last few LSBs. To verify floating-point designs, the designer must therefore replace the bit-by-bit matching of results typically used in fixed-point data processing with a tolerance-based method that compares the hardware results to the simulation-model results.
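
A short C demonstration of both points (the values are chosen purely for illustration): the same three single-precision numbers summed left to right and right to left give results that differ in the last mantissa bit, yet pass the kind of tolerance check used to compare hardware results with the model.

```c
#include <stdio.h>
#include <math.h>

/* Illustrative values only: the same three single-precision numbers summed
 * in opposite orders differ in the last mantissa bit, yet pass the kind of
 * tolerance check used to compare hardware results with the model.        */
int main(void)
{
    float x[3] = { 1.0e7f, 0.5f, 0.5f };

    float fwd = (x[0] + x[1]) + x[2];   /* 10000000: each 0.5 is rounded away */
    float rev = (x[2] + x[1]) + x[0];   /* 10000001: the halves combine first */

    printf("forward  = %.9g\nbackward = %.9g\n", fwd, rev);

    /* tolerance-based comparison instead of bit-exact matching */
    float tol = 1.0e-6f;                /* relative tolerance */
    int ok = fabsf(fwd - rev) <= tol * fmaxf(fabsf(fwd), fabsf(rev));
    printf("agree within tolerance: %s\n", ok ? "yes" : "no");
    return 0;
}
```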

The results of an R matrix calculation in a QR decomposition are shown in Figure 2 below, using a three-dimensional plot to show the difference between the MATLAB-computed results and the hardware-computed results using an FPGA-based floating-point toolflow. Notice that the errors are in the 10^-6 range, which affects only the smallest LSBs of the single-precision mantissa.

 

Figure 2. R Matrix Error Plot

To verify the accuracy of the non-IEEE 754 approach, matrix inversion was performed using single-precision floating-point processing. The matrix-inversion function was implemented in the FPGA and tested across input matrices of different sizes. The same results were also computed in single precision on an IEEE 754-based Pentium processor. A reference result was then computed on the processor using IEEE 754 double-precision floating point, which provides near-perfect results relative to single precision.

Comparing both the IEEE 754 single-precision results and the single-precision hardware results against the double-precision reference, and computing the norms of the differences, shows that the hardware implementation gives a more accurate result than the IEEE 754 approach, due to the extra mantissa precision used in the intermediate calculations.
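
A minimal C sketch of that comparison, with placeholder matrices (in the real flow these would be the inverted matrices from the Pentium run, the FPGA datapath, and the double-precision reference): each single-precision result is scored by the Frobenius norm of its difference from the reference.

```c
#include <stdio.h>
#include <math.h>

#define N 4   /* illustrative matrix size */

/* Frobenius norm of the difference between a single-precision result and
 * the double-precision reference; a smaller value means a more accurate
 * result.                                                                 */
static double error_norm(const float a[N][N], const double ref[N][N])
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double e = (double)a[i][j] - ref[i][j];
            sum += e * e;
        }
    return sqrt(sum);
}

int main(void)
{
    /* Placeholders: in the real flow these would hold the inverted matrix
     * from the IEEE 754 single-precision processor run, from the FPGA
     * datapath, and from the double-precision reference, respectively.    */
    float  sw_single[N][N] = {{0.0f}};
    float  hw_single[N][N] = {{0.0f}};
    double reference[N][N] = {{0.0}};

    printf("||SW single - ref||_F = %g\n", error_norm(sw_single, reference));
    printf("||HW single - ref||_F = %g\n", error_norm(hw_single, reference));
    return 0;
}
```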

FPGA FP design-flow methodology

Hardware, including FPGAs, is typically designed using an HDL, either Verilog or VHDL. These languages are fairly low level, requiring the designer to specify the data width at each stage and every register level, and they do not support the synthesis of floating-point arithmetic operations.

Implementing the approach above in HDL would be very arduous, which has greatly discouraged the use of floating-point processing in hardware-based designs. A high-level toolflow is therefore needed to implement these floating-point techniques.

The design environment chosen in this case is Simulink, a widely used product from The MathWorks. Simulink is model based, which allows the designer to easily describe the data flow and parallelism in a design, traditionally a challenge when using a software language.

Compared to HDL, Simulink provides a higher level of design description, allowing the designer to describe the algorithm flow behaviorally, without needing to insert pipeline registers or know the details of the FPGA hardware, and to easily switch between fixed-point and single- or double-precision floating-point representations. This level of abstraction gives an automated tool the opportunity to optimize the RTL generation, including the floating-point synthesis.

An additional advantage of this choice is that the system can be simulated in the Simulink and MATLAB domains, and the same testbench used in system-level simulation is later used to verify the FPGA-based implementation. The automated back-end synthesis tool running under Simulink is called DSP Builder. It performs all the required floating-point optimizations to produce an efficient RTL representation. A simple circuit representation is shown in Figure 3 below.

 

Figure 3. Simple DSP Builder Floating-Point Processing Example

FPGAs, with their hardware architecture, distributed multipliers, and memory blocks, are ideal for high-bandwidth parallel processing. The distributed nature of the programmable logic, hardened blocks, and I/Os minimizes bottlenecks in the processing flow. The embedded memory blocks store intermediate results, and the extensive I/O options of FPGAs provide easy interconnects to the larger system, whether that system comprises processors, data converters, or other FPGAs.
