Optimizing complex floating point calculations on FPGAs

High-performance floating point processing has long been associated with high performance CPUs. In the last few years, GPUs have also become powerful floating point processing platforms, moving beyond graphics, and known as GP-GPUs (General Purpose Graphics Processing Units). A new innovation is FPGA-based floating point processing in demanding applications. This paper focuses on FPGAs and their floating point performance and design flows and the use of OpenCL, a leading programming language for high performance floating point calculations.

GFLOPs ratings of various processing platforms have been increasing, and now the term TFLOP/s is commonly used. However, the peak GFLOP/s or TFLOP/s often provides little information on the performance of a given device in a particular platform. Rather, it indicates the total number of theoretical floating point additions or multiplies which can be performed per second. This analysis indicates that FPGAs can exceed 1 TFLOP/s [1] of single precision floating point processing.

A common algorithm of moderate complexity is the FFT. A 4096 point FFT has been implemented using single precision floating point. It is able to input and output four complex samples per clock cycle. Each FFT core can run at over 80 GFLOP/s, and a large FPGA has resources to implement seven such cores.

However, as Figure 1 indicates, the GFLOP/s of FFT algorithm on this FPGA is nearly 400 GFLOP/s. This is a “push button” OpenCL compile result, with no FPGA expertise required. Using logic-lock and DSE optimizations, the seven core design can approach the Fmax of the single core design, boosting GFLOP/s to over 500, with over 10 GFLOP/s per Watt.

This GFLOP/s per Watt is much higher than achievable CPU or GPU power efficiency. In terms of GPU comparisons, the GPU is not efficient at these length FFTs, so no benchmarks are presented. The GPU becomes efficient with FFT lengths of several hundred thousand points, at which point it can provide useful acceleration to a CPU.

Figure 1: An Altera Stratix V 5SGSD8 FPGA Floating Point FFT Performance

To summarize, the useful GFLOP/s is often a fraction of the peak or theoretical GFLOP/s. For this reason, a more useful approach is to compare performance on an algorithm that can reasonably represent the characteristics of typical applications. The more complex the algorithm, the more representative of a typical real application the benchmark is.

Third-Party Benchmarking
Rather than rely upon vendor’s peak GFLOP/s ratings to drive processing technology decisions, an alternative is to rely upon third-party evaluations using examples of representative complexity. An ideal algorithm for high-performance computing is the Cholesky Decomposition.

This algorithm is commonly used in linear algebra for efficient solving of multiple equations, and can be used to perform matrix inversion functionality. It has high complexity, and almost always requires floating point numerical representation for reasonable results. The computations required are proportional to N3, where N is the matrix dimension, so the processing requirements are often demanding. The actual GFLOP/s will depend both on the matrix size and the required matrix processing throughput.

Results on based upon an Nvidia GPU rated at 1.35 TFLOP/s are shown in Table 1 , using various libraries, as well as a Xilinx Virtex6 XC6VSX475T, an FPGA optimized for DSP processing, with a density of 475K LCs. This is similar density as the Altera FPGA used for Cholesky benchmarks.

The LAPACK and MAGMA are commercially supplied libraries, while the GPU GFLOP/s refers to the OpenCL implementation developed at University of Tennessee. The latter are clearly more optimized at smaller matrix sizes.

Table 1: GPU and Xilinx FPGA Cholesky Benchmarks from Univ. of Tennessee [2]

A mid-size Altera Stratix V FPGA (460 kLE) was also benchmarked, using the Cholesky algorithm in single precision floating point. As seen in Table 2 , the Stratix V FPGA performance on the Cholesky algorithm is much higher than Xilinx results.

Table 2: Altera FPGA Cholesky and QR Benchmarks from BDTI [3]

It should be noted that the matrix sizes are not the same. The University of Tennessee results start at matrix sizes of [512×512]. The BDTI benchmarks go up to [360×360] matrix sizes. The reason for this is that GPUs are very inefficient at smaller matrix sizes, so there is little incentive to use them to accelerate a CPU in these cases. FPGAs can operate efficiently with much smaller matrices.

Secondly, the BDTI benchmarks are per Cholesky core. Each parameterizable Cholesky core allows selection of both matrix size, vector size, and channel count. The vector size roughly determines the FPGA resources. The larger [360×360] matrix size uses a larger vector size, allowing for a single core in this FPGA, at 91 GFLOP/s. The smaller [60×60] matrix use less resources, so two cores could be implemented, for a total of 2×39 = 78 GFLOP/s. The smallest [30×30] matrix size would permit three cores, for a total of 3×26 = 78 GFLOP/s.

FPGAs seem to be much better suited for problems with smaller data sizes. One explanation is that since computational loads increase as N3, which data I/O increases as N2, eventually the I/O bottlenecks of GPU become less important as the dataset increases. The other consideration is throughput. As the matrix sizes increase, throughput in matrices per second drops dramatically due to the increase processing per matrix. At some point, the throughput becomes so low as to be unusable in many applications. In many cases, large matrices may be tiled, and the smaller individual sub-matrices processed, to address the throughput limitations due to sheer processing loads.

For FFTs, the computation load increases an N log2 N, whereas the data I/O increases as N. Again, at very large data sizes, the GPU becomes an efficient computational engine. By contrast, the FPGA is an efficient computation engine at much smaller data sizes, and better suited in many applications where FFTs sizes are in the thousands, and GPUs where FFT sizes are in the hundreds of thousands.GPU and FPGA Design Methodology
GPUs are programmed usingeither Nvidia’s proprietary CUDA language, or an open standard OpenCLlanguage. These languages are very similar in capability, with thebiggest difference being that CUDA can only be used on Nvidia GPUs.

FPGAsare typically programmed using HDL languages Verilog or VHDL. Neitherof these languages is well suited to supporting floating point designs,although the latest versions do incorporate definition, though notnecessarily synthesis, of floating point numbers. For example, in SystemVerilog, a short real variable is analogue to IEEE single (float), andreal to an IEEE double.

Synthesis of floating point datapathsinto an FPGA is very inefficient using traditional methods. The lowperformance of the Xilinx FPGAs on the Cholesky algorithm, implementedusing Xilinx Floating Point Core Gen functions, substantiates this.However, Altera offers two alternatives. The first is using a Mathworksbased design entry, known as DSPBuilder Advanced Blockset. This toolcontains support for both fixed and floating point numbers. It supportsseven different precisions of floating point, including IEEE half,single and double precisions. It also supports vectorization, which isneeded for efficient implementation of linear algebra. However, the keyis its ability to map floating point circuits efficiently onto today’sfixed point FPGA architectures, as is demonstrated by the benchmarkssupporting close to 100 GFLOP/s on the Cholesky algorithm in a mid-size28 nm FPGA [3]. By comparison, the Cholesky implementation on a similarsized Xilinx FPGA without this synthesis capability shows only 20GFLOP/s of performance on the same algorithm, using a similar densityFPGA [2].

OpenCL for FPGAs
OpenCL is familiar to GPUprogrammers. An OpenCL Compiler [5] for FPGAs means that OpenCL codewritten for AMD or Nvidia GPUs can be compiled onto an FPGA. An OpenCLCompiler from Altera enables GPU programs to use FPGAs, without thenecessity of developing the typical FPGA design skill set.

UsingOpenCL with FPGAs offers several key advantages over GPUs. First, GPUstend to be I/O limited. All input and output data needs to be passed bythe host CPU via the PCIe interface. The resulting delays can stall theGPU processing engines, resulting in lower performance.

FPGAs arewell known for their wide variety of high bandwidth I/O capabilities.This can allow data to stream in and out of the FPGA over GigabitEthernet, SRIO, or directly from ADCs and DACs. Altera has defined avendor-specific extension of the OpenCL standard to support streamingoperation.

FPGAs can also offer much lower processing latencythan a GPU, even independent of I/O bottlenecks. It is well known thatGPUs must operate on many thousands of threads to perform efficiently.This is due to the extremely long latencies to and from memory and evenbetween the many processing cores of the GPU. In effect, the GPU mustoperate on many, many tasks so that it can keep the processing coresfrom stalling as they await data, which results in very long latency forany given task.

The FPGA uses a “coarse grained parallelism”architecture instead. It creates multiple parallel, optimized datapaths,and each datapath usually outputs one result per clock clock. Thenumber of instances of the datapath depends upon the FPGA resources, butis typically much less than the number of GPU cores. However, eachdatapath instance is much higher throughput than a GPU core. A primarybenefit of this approach is low latency. Minimizing latency is acritical performance advantage in many applications.

Anotheradvantage of FPGAs is the much lower power consumption, resulting indramatically lower GFLOPs per Watt. As measured by BDTI, the GFLOP/s perWatt for complex floating point algorithms such as Cholesky is 5-6GFLOP/s per Watt [4]. GPU energy efficiency measurements are much hardto find, but using the GPU performance of 50 GFLOP/s for Cholesky andtypical power consumption of 200W, results in 0.25 GFLOP/s per Watt,which is twenty times more power consumed per useful FLOP/s.

Both OpenCL and DSPBuilder rely on a technique known as “Fused Datapath” (Figure 2 ),where floating point processing is implemented in such a fashion as todramatically reduce the number of barrel shifting circuits required,which in turn allows for large scale and high performance floating pointdesigns to be built using FPGAs.

Figure 2: Fused Datapath Implementation of Floating Point

Toreduce the frequency of implementing barrel shifting, the synthesisprocess looks for opportunities where using larger mantissa widths canoffset the need for frequent normalization and de-normalization. Theavailability of 27×27 and 36×36 hard multipliers allows forsignificantly larger multipliers than the 23 bits required by singleprecision, and also the construction of 54×54 and 72×72 multipliersallows for larger than the 52 bits required for double precision. TheFPGA logic is already optimized for implementation of large, fixed pointadder circuits, with inclusion of built-in carry look- ahead circuits.

Wherenormalization and de-normalization is required, an alternativeimplementation which avoids low performance and excessive routing is touse multipliers. For a 24-bit single-precision mantissa (including thesign bit), the 24×24 multiplier shifts the input by multiplying by 2n.Again, the availability of hardened multipliers in 27×27 and 36x36allows for extended mantissa sizes in single precision, and can be usedto construct the multiplier sizes for double precision.

A vector dot product (Figure 3 )is the underlying operation consuming the bulk of the FLOP/s used inmany linear algebra algorithms. A single precision implementation oflength 64 long vector dot product would require 64 floating pointmultipliers, followed by an adder tree made up of 63 floating pointadders. This requires implementation of many barrel shifting circuits.

Figure 3: Vector dot product optimization

Insteadthe outputs of the 64 multipliers can be de-normalized to a commonexponent, being the largest of the 64 exponents. Then these 64 outputscould be summed using a fixed point adder circuit, and a finalnormalization performed at the end of the adder tree. This localizedblock floating point dispenses with all the interim normalization andde- normalization required at each individual adder, and is shown in Figure 3 .Even with IEEE754 floating point, the number with the largest exponentis going to basically determine the exponent at the end, so this justmoved this exponent adjustment to an earlier point in the calculation.

However,when performing signal processing, the best results are found bycarrying as much precision as possible as performing truncation ofresults at the end of the calculation. The approach here compensates forthis sub-optimal early de-normalization by carrying extra mantissa bitwidths over and above that required by single precision floating point,usually from 27 to 36 bits. The mantissa extension is done with floatingpoint multipliers, to eliminate the need to normalize the product ateach step.

Note that this approach can also produce one resultper clock cycle. GPU architectures can produce all the floating pointmultiplies in parallel, but cannot efficiently do the additions inparallel. This is due to requirements that different cores pass datathrough local memories to communicate to each other, lacking theflexibility of connectivity of an FPGA architecture.

Thisapproach generates results which are more accurate that conventionalIEEE754 floating point, as is shown by the measurements in Table 3 . Similar results were obtained with BDTI’s benchmarking effort [3].

Table 3: FPGA accuracy results compared to IEEE754 floating point

The results of Table 3 were obtained by implementing large matrix inversions using theCholesky Decomposition algorithm. The same algorithm was implemented inthree different ways – in Matlab/Simulink with IEEE754 single precisionfloating point, in RTL single precision floating point using FusedDatapath, and also in Matlab with double recision floating point. Doubleprecision is about one billion times (109) more precise than singleprecision.

Table 3 compares Matlab single precision; RTLsingle precision and Matlab double precision errors, confirming theintegrity of the Fused Datapath approach. This is shown for both thenormalized error across all the complex elements in the output matrix,and matrix element with the maximum error. The overall error or norm iscalculated by using Frobenius norm given by:

Please notice that since the norm includes errors in all elements, it is often much larger than the individual errors.

Inaddition, the both DSPBuilder Advanced Blockset and OpenCL tool flowswill transparently support and optimize current designs fornext-generation FPGA architectures. Up to 100 peak GFLOP/s per Watt canbe expected, due to both architecture innovations and process technologyinnovations.

High-performance computingapplications now have a new processing platform to consider. Forspecific classes of floating point algorithms, FPGAs can provide lowerlatency and higher GFLOP/s. For nearly all applications, FPGAs canprovide superior GFLOP/s per Watt. These advantages will become evenmore dramatic with the introduction of next generation, high-performancecomputing optimized FPGAs.

Altera’s OpenCL Compiler provides anear seamless path for GPU programmers to evaluate the merits of thisnew processing architecture. Altera OpenCL is 1.2 compliant, with a fullset of math library support. It abstracts away the traditional FPGAchallenges of timing closure, DDR memory management and PCIe hostprocessor interfacing.

For non-GPU developers, Altera offersDSPBuilder Advanced Blockset tool flow, which allows developers to buildhigh Fmax fixed or floating point DSP designs, while retaining theadvantages of a Mathworks based simulation and development environment.This product has been used for years by FPGA developers seeking a highlyproductive workflow, which offers the same Fmax performance as askilled FPGA developer.

[1] Achieving One TeraFLOPS with 28-nm FPGAs , by Michael Parker, Altera Corp
[2] Performance Comparison of Cholesky Decomposition on GPUs and FPGAs ,by Depeng Yang, Junqing Sun, JunKu Lee, Getao Liang, David D. Jenkins,Gregory D. Peterson, and Husheng Li, Department of ElectricalEngineering and Computer Science, University of Tennessee.
[3] An Independent Analysis of Floating-point DSP Design Flow and Performance on Altera 28-nm FPGAs by the staff of Berkeley Design Technology, Inc.
[4] An Independent Evaluation of Floating-point DSP Energy Efficiency on Altera 28 nm FPGAs  by the staff of Berkeley Design Technology, Inc.
[5] Implementing FPGA Design with the OpenCL Standard by Deshanand Singh, Altera Corp

Michael Parker is principal architect for DSP product planning at Altera Corp.,including the Variable Precision FPGA silicon architecture for DSPapplications, DSP tool development, floating point tools and DSP andVideo intellectual property (IP). He joined Altera in January 2007 andhas more than 20 years of DSP wireless engineering design experiencewith companies such as Alvarion, Soma Networks, TCSI, Stanford Telecomand numerous startup companies. He holds an MSEE from Santa ClaraUniversity, and BSEE from Rensselaer Polytechnic Institute. This articleis from a class he taught at the Embedded Systems Conference on “UsingOpenCL to maximize complex floating point processing in FPGAs (ESC-430)

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.