Hardware-based floating-point design flow - Embedded.com

Hardware-based floating-point design flow

Floating-point processing is widely used in computing for many different applications. In most software languages, floating-point variables are denoted as “float” or “double.” Integer variables are also used for what is known as fixed-point processing.

Floating-point processing utilizes a format defined in IEEE 754, and is supported by microprocessor architectures. However, the IEEE 754 format is inefficient to implement in hardware, and floating-point processing is not supported in VHDL or Verilog. Newer versions, such as SystemVerilog, allow floating-point variables, but industry-standard synthesis tools do not support synthesis of floating-point types.

This article is from a class at DesignCon 2011.

In embedded computing, fixed-point or integer-based representation is often used due to the simpler circuitry and lower power needed to implement fixed-point processing compared to floating-point processing. Many embedded computing or processing operations must be implemented in hardware, either in an ASIC or an FPGA.

However, due to technology limitations, hardware-based processing is virtually always done as fixed-point processing. While many applications could benefit from floating-point processing, this technology limitation forces a fixed-point implementation. If feasible, applications in wireless communications, radar, medical imaging, and motor control all could benefit from the high dynamic range afforded by floating-point processing.

Before discussing a new approach that enables floating-point implementation in hardware with performance similar to that of fixed-point processing, it is first necessary to discuss why floating-point processing has not been practical up to this point. This article focuses on FPGAs as the hardware-processing devices, although most of the methods discussed can be applied to any hardware architecture.

After a discussion of the challenges of implementing floating-point processing, a new approach used to overcome these issues will be presented. Next, some of the key applications for floating-point processing, involving linear algebra, are discussed, as well as the additional features needed to support these types of designs in hardware. Performance benchmarks of FPGA floating-point processing examples are also provided.

Floating-Point Issues in FPGAs

Floating-point numerical format and operations are defined by the IEEE 754 standard, but the standard's numerical representation of floating-point numbers is not hardware friendly. To begin with, the mantissa representation includes an implicit 1: each stored mantissa value in the range [0, 0.999…] actually maps to a value in the range [1, 1.999…]. Another issue is that the sign bit is treated separately, rather than using traditional two's complement signed representation.

In addition, to preserve the dynamic range of a floating-point signal, the mantissa must be normalized after every arithmetic operation. This aligns the binary point to the far left and adjusts the exponent accordingly. This is normally done using a barrel shifter, which shifts any number of bits in one clock cycle. Additionally, for each arithmetic operation, specific floating-point “special cases” must be checked for and flagged as they occur.

In floating-point processors, the CPU core has special circuits to perform these operations. Typical CPUs operate serially, so one or a small number of computational units are used to implement the sequence of software operations. Since CPU cores have a small number of floating-point computational units, the silicon area and complexity needed to implement the IEEE 754 standard are not burdensome compared to the rest of the buses, circuits, and memory needed to support the computational units.

Implementation in hardware, and in FPGAs in particular, is more challenging. In logic design, the standard format for signed numbers is two's complement. FPGAs efficiently implement adders and multipliers in this representation. So the first step is to use the signed two's complement format to represent the floating-point mantissa, including the sign bit. The implicit 1 in the IEEE 754 format is not used.

With the IEEE 754 standard, normalization and de-normalization using barrel shifters is implemented at each floating-point operation. For adder or subtracter circuits, the smaller number must first be de-normalized to match the exponent of the larger. After adding and/or subtracting the two mantissas, the result must be normalized again, and the exponent adjusted. Multiplication does not require the de-normalization step, but does require normalization of the product.

Optimal implementation of FP processing

With FPGAs, using a barrel-shifter structure for normalization requires high fan-in multiplexers for each bit location, plus the routing to connect each of the possible bit inputs. This leads to very poor fitting, slow clock rates, and excessive routing. A better solution with FPGAs is to use multipliers. For a 24-bit single-precision mantissa (the sign bit is now included), the 24×24 multiplier shifts the input by multiplying by 2^N. Many FPGAs today have very high numbers of hardened multiplier circuits that operate at high clock rates.

Another technique used to minimize the amount of normalization and de-normalization is to increase the size of the mantissa. This allows the binary point to move a few positions before normalization is required, such as in a multiplier product. This is easily accomplished in an FPGA, as shown in Figure 1 below.

For most linear algebra functions, such as vector sums, vector dot products, and cross products, a 27-bit mantissa reduces normalization frequency by over 50%. For more nonlinear functions, such as trigonometric functions, division, and square root, a larger mantissa is needed. A 36-bit mantissa works well in these cases. The FPGA must support 27×27 and 36×36 multipliers. For example, one recently announced FPGA offers over 2,000 multipliers configured as 27×27, or over 1,000 multipliers configured as 36×36.


Figure 1. New Floating-Point Approach

These techniques are used to build a high-performance floating-point datapath within the FPGA. Because IEEE 754 representation is still necessary for compatibility with other floating-point processing, the floating-point circuits must support this interface at the boundaries of each datapath, such as a fast Fourier transform (FFT), a matrix inversion, or a sine function.

This floating-point approach has been found to yield more accurate results than if IEEE 754 compliance is performed at each operator. The additional mantissa bits provide better numerical accuracy, while the elimination of barrel shifters permits high clock-rate performance, as shown in Table 1 below.


Table 1. FPGA Floating-Point Precision Results

Table 1 lists the mean, the standard deviation, and the Frobenius norm, where the SD subscript refers to the IEEE 754-based single-precision architecture in comparison with the reference double-precision architecture, and the HD subscript refers to the hardware-based single-precision architecture in comparison with the reference double-precision architecture.


Floating-point results cannot be verified by comparing bit for bit, as is typical in fixed-point arithmetic. The reason is that floating-point operations are not associative, which can be proved easily by writing a program in C or MATLAB to sum up a selection of floating-point numbers.

Summing the same set of numbers in the opposite order will result in a few different LSBs. To verify floating-point designs, the designer must replace the bit-by-bit matching of results typically used in fixed-point data processing with a tolerance-based method that compares the hardware results to the simulation model results.

The results of an R matrix calculation in a QR decomposition are shown in Figure 2 below, using a three-dimensional plot to show the difference between the MATLAB-computed results and the hardware-computed results using an FPGA-based floating-point toolflow. Notice the errors are in the 10^-6 range, which affects only the smallest LSBs in the single-precision mantissa.


Figure 2. R Matrix Error Plot

To verify the accuracy of the non-IEEE 754 approach, matrix inversion was performed using single-precision floating-point processing. The matrix-inversion function was implemented using the FPGA and tested across different-size input matrices. These results were also computed using single precision with an IEEE 754-based Pentium processor. Then a reference result was computed on the processor using IEEE 754 double-precision floating-point processing, which provides near-perfect results relative to single precision.

Comparing both the IEEE 754 single-precision results and the single-precision hardware results, and computing the norms of the differences, shows that the hardware implementation gives a more accurate result than the IEEE 754 approach, due to the extra mantissa precision used in the intermediate calculations.

FPGA FP design-flow methodology

Hardware, including FPGAs, is typically designed using an HDL, either Verilog or VHDL. These languages are fairly low level, requiring the designer to specify all data widths at each stage and specify each register level, and they do not support the synthesis of floating-point arithmetic operations.

To implement the approach above using HDL would be very arduous, which has greatly discouraged the use of floating-point processing in hardware-based designs. Therefore, a high-level toolflow is needed to implement these floating-point techniques.

The design environment chosen in this case is Simulink, a widely used product from The MathWorks. Simulink is model based, which allows the designer to easily describe the data flow and parallelism in the design, traditionally a challenge when using a software language.

Compared to HDL, Simulink provides a high level of design description, allowing the designer to describe the algorithm flow behaviorally, without needing to insert pipeline registers or know the details of the FPGA hardware, and to easily switch between fixed-point processing and single- and double-precision floating-point variables. This level of abstraction provides the opportunity for an automated tool to optimize the RTL generation, including the floating-point synthesis.

An additional advantage of this choice is that the system can be simulated in the Simulink and MATLAB domains, and the same testbench used in system-level simulation is later used to verify the FPGA-based implementation. The automated back-end synthesis tool running under Simulink is called DSP Builder. It performs all the required floating-point optimizations to produce an efficient RTL representation. A simple circuit representation is shown in Figure 3 below.


Figure 3. Simple DSP Builder Floating-Point Processing Example

FPGAs, with their hardware architecture, distributed multipliers, and memory blocks, are ideal for high-bandwidth parallel processing. The distributed nature of the programmable logic, hardened blocks, and I/Os minimizes the occurrence of bottlenecks in the processing flow. The embedded memory blocks store the intermediate data results, and the extensive I/O options of FPGAs provide easy interconnects to the larger system, whether processors, data converters, or other FPGAs.


Many designs requiring the dynamic range of floating-point processing are based on linear algebra. Using linear algebra to solve problems is typical for multi-input and multidimensional systems, such as radar, medical imaging, and wireless systems. For this reason, it is important to support vector processing of complex (quadrature) data.

Vector processing is ideally suited for parallel processing. Processing serially dramatically reduces throughput and increases latency. Due to the inherent parallelism, hardware implementation is well suited to vector processing. However, vector processing must be representable and synthesizable in the design entry process.

Figure 4 below shows a simple dot-product example where vectors are denoted by single lines and the vector length, or number of elements, is displayed, all of which are implemented in a complex, single-precision floating-point numerical representation. The vector length is a parameter set in a top-level constant file, and the example uses a special block, SumOfElements, that acts as an accumulator to add together all the partial products.

Figure 4. DSP Builder Floating-Point Dot Product Example

Managing Data Flow in Hardware

Simple flow diagrams can implement some algorithms, but other algorithms require more complex control. For example, matrix multiplication, which is a series of vector dot products, requires indexing of the rows and columns. Software languages have structures to implement looping, and the same is needed in a hardware flow. For this reason, a for-loop block has been added to the Simulink library, as shown in Figure 5 below.


Figure 5. DSP Builder For-Loop Block

In addition, multiple for loops nest, just as in a software environment, to build indexing counters and control signals, as shown in Figure 6 below. This capability is critical in many applications, including the indexing often needed for linear algebra processing.


Figure 6. DSP Builder Nested For-Loop Blocks

Solving systems of multiple equations or inverting matrices often requires back substitution, where each unknown is solved iteratively, one equation at a time. The code for this solution is as follows:

for (uint8 countA = 0; countA < 16; countA++)
    for (uint8 countB = 0; countB <= countA; countB++) {
        qc1 = countA;
        qc2 = countB;
    }
Figure 7 below shows how it is necessary to index across both vertical and horizontal elements. Using nested for-loop blocks allows complex hardware control functions to manage data flow.


Figure 7. DSP Builder Back Substitution Diagram

Many algorithms requiring the dynamic range of floating-point processing are iterative. One such example is the recursive infinite impulse response (IIR) filter function. One of the challenges is to design the data flow in such a manner as to avoid stalls in the hardware blocks, and to parallelize the critical path as much as possible for maximum throughput. Additionally, to provide for fast clock rates, adequate delay registers must be placed in the feedback path.

In the example shown in Figure 8 below, the IIR bi-quad filter is constructed just as in a textbook diagram. However, the toolflow has the capability to create multichannel designs. This example initially has four filter channels, though the design depicts the operations of only a single channel. Next, the design is increased to 20 channels.

The channel-in and channel-out blocks denote the boundaries of this function, and the user can parameterize the function to have as many channels as necessary. The tool creates the scheduling logic to manage the N channels. In this example, using 20 channels requires each channel to have a delay of 20 registers between each feedforward and feedback multiplier.

The tool automatically distributes these registers throughout the circuit to function both as algorithmic delay registers and as pipeline registers, thereby achieving a high circuit fMAX. Also, should the input data type be changed from real to complex, the tool automatically implements the adders and multipliers to handle the complex arithmetic.


Figure 8. IIR Filter

More complicated algorithms require more complicated implementations. An example of such a dataflow is depicted in Figure 9 below, in the recursive part of the Mandelbrot implementation, which computes z(n+1) = z(n)^2 + c, then uses the feedback FIFO buffer to delay feedback data until needed, streaming data without stalling. In this case, additional latency in the recursive path is needed to allow enough register levels to efficiently implement the logic. This is accomplished through the use of FIFO buffers.


Figure 9. Floating-Point Mandelbrot Design Example

Library Support Requirements

In a software programming environment, the math.h library is provided, which includes the following functions:

SIN, COS, TAN, ASIN, ACOS, ATAN, EXP, LOG, LOG10, POW(x,y), LDEXP, FLOOR, CEIL, FABS, SQRT, DIVIDE, and 1/SQRT. Given the very common use of these functions in many signal-processing algorithms, these functions are also provided in this hardware flow (Figure 10 below).


Figure 10. Math.h Functions

In an FPGA implementation, parallelism is used to reduce the latency normally associated with such functions. Computation of a trigonometric function in a processor easily takes over 100 cycles. For FPGA implementations, the use of multiplier-based algorithms and several multipliers typically results in latency and resource usage of thr
