Optimizing complex floating point calculations on FPGAs
High-performance floating point processing has long been associated with high-performance CPUs. In the last few years, GPUs have also become powerful floating point processing platforms, moving beyond graphics to become GP-GPUs (general-purpose graphics processing units). A newer innovation is FPGA-based floating point processing in demanding applications. This paper focuses on FPGAs: their floating point performance, their design flows, and the use of OpenCL, a leading programming language for high-performance floating point computation.
GFLOP/s ratings of processing platforms have been climbing, and the term TFLOP/s is now commonly used. However, a peak GFLOP/s or TFLOP/s figure often says little about the performance of a given device in a particular application; it simply indicates the total number of theoretical floating point additions or multiplications that can be performed per second. The analysis presented here indicates that FPGAs can exceed 1 TFLOP/s of single precision floating point processing.
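To make concrete how such a peak rating is derived, a minimal sketch follows. The unit count and clock rate are illustrative assumptions, not figures for any specific device; the convention of counting one multiply and one add per unit per cycle is the usual basis for peak ratings.

```python
# Sketch: deriving a peak GFLOP/s rating from datasheet-style numbers.
# num_fp_units and clock_mhz below are illustrative assumptions only.

def peak_gflops(num_fp_units: int, clock_mhz: float, ops_per_unit: int = 2) -> float:
    """Peak rating assumes every floating point unit issues a multiply
    and an add (ops_per_unit = 2) on every clock cycle."""
    return num_fp_units * ops_per_unit * clock_mhz * 1e6 / 1e9

# e.g. 1,000 hardened single precision units at 500 MHz:
rating = peak_gflops(num_fp_units=1000, clock_mhz=500)  # 1000 GFLOP/s = 1 TFLOP/s
```

Real workloads rarely keep every unit busy every cycle, which is exactly why the paper argues peak ratings overstate usable performance.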
A common algorithm of moderate complexity is the FFT. A 4096 point FFT has been implemented using single precision floating point. It is able to input and output four complex samples per clock cycle. Each FFT core can run at over 80 GFLOP/s, and a large FPGA has resources to implement seven such cores.
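The per-core figure can be sanity-checked with the common 5·N·log2(N) operation-count model for a complex radix-2 FFT. The 350 MHz clock below is an assumption for illustration (the text does not state the clock rate); with it, a 4096-point FFT streaming four complex samples per cycle lands in the stated "over 80 GFLOP/s" range.

```python
import math

def fft_gflops(n: int, samples_per_clock: int, clock_mhz: float) -> float:
    """Sustained GFLOP/s of a streaming FFT core, using the standard
    5*N*log2(N) floating point operation count per complex N-point FFT."""
    flops_per_fft = 5 * n * math.log2(n)
    # A core consuming samples_per_clock samples per cycle finishes one
    # N-point transform every n / samples_per_clock cycles.
    ffts_per_sec = clock_mhz * 1e6 * samples_per_clock / n
    return flops_per_fft * ffts_per_sec / 1e9

# Assumed 350 MHz clock (illustrative):
print(fft_gflops(4096, 4, 350))  # 84.0 GFLOP/s per core
```

Seven such cores at this rate give the roughly 500 GFLOP/s aggregate discussed below.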
However, as Figure 1 indicates, the FFT algorithm on this FPGA delivers nearly 400 GFLOP/s. This is a "push-button" OpenCL compile result, with no FPGA expertise required. With LogicLock and Design Space Explorer (DSE) optimizations, the seven-core design can approach the Fmax of the single-core design, boosting throughput to over 500 GFLOP/s, at over 10 GFLOP/s per Watt.
This GFLOP/s per Watt is much higher than is achievable with CPUs or GPUs. No GPU benchmarks are presented here because GPUs are inefficient at FFTs of this length; a GPU only becomes efficient, and can provide useful acceleration to a CPU, at FFT lengths of several hundred thousand points.
To summarize, the usable GFLOP/s is often a fraction of the peak or theoretical GFLOP/s. A more useful approach is therefore to compare performance on an algorithm that reasonably represents the characteristics of typical applications: the more complex the algorithm, the more representative the benchmark is of a real application.
Rather than relying upon vendors' peak GFLOP/s ratings to drive processing technology decisions, an alternative is to rely upon third-party evaluations using examples of representative complexity. An ideal algorithm for high-performance computing is the Cholesky decomposition.
This algorithm is commonly used in linear algebra to solve systems of equations efficiently, and can also be used to perform matrix inversion. It has high complexity, and almost always requires floating point numerical representation for reasonable results. The computation required is proportional to N³, where N is the matrix dimension, so the processing requirements are often demanding. The actual GFLOP/s depends both on the matrix size and on the required matrix processing throughput.
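For reference, a minimal, unoptimized Python sketch of the Cholesky decomposition (not the benchmarked implementation). The nested dot products are what make the operation count scale as N³:

```python
import math

def cholesky(a):
    """Lower-triangular factor L of a symmetric positive-definite matrix A
    (given as a list of lists), such that A = L * L^T.
    The triple-nested loop structure gives ~N**3/3 multiply-adds."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)   # diagonal: needs sqrt
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]  # off-diagonal: needs divide
    return L

# Example: [[4, 2], [2, 3]] factors into [[2, 0], [1, sqrt(2)]]
L = cholesky([[4.0, 2.0], [2.0, 3.0]])
```

The square roots and divides on the critical path are one reason the algorithm effectively demands floating point rather than fixed-point arithmetic.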
Results based upon an Nvidia GPU rated at 1.35 TFLOP/s are shown in Table 1, using various libraries, alongside results for a Xilinx Virtex-6 XC6VSX475T, an FPGA optimized for DSP processing with a density of 475K logic cells (LCs). This is similar in density to the Altera FPGA used for the Cholesky benchmarks.
LAPACK and MAGMA are widely used libraries, while the GPU GFLOP/s column refers to the OpenCL implementation developed at the University of Tennessee. The latter is clearly better optimized at smaller matrix sizes.
A mid-size Altera Stratix V FPGA (460 kLE) was also benchmarked, running the Cholesky algorithm in single precision floating point. As Table 2 shows, the Stratix V FPGA's performance on the Cholesky algorithm is much higher than the Xilinx results.
It should be noted that the matrix sizes are not the same: the University of Tennessee results start at [512x512] matrices, while the BDTI benchmarks go up to [360x360]. The reason is that GPUs are very inefficient at small matrix sizes, so there is little incentive to use them to accelerate a CPU in those cases, whereas FPGAs can operate efficiently with much smaller matrices.
Secondly, the BDTI benchmarks are per Cholesky core. Each parameterizable Cholesky core allows selection of matrix size, vector size, and channel count. The vector size roughly determines the FPGA resources consumed. The larger [360x360] matrix size uses a larger vector size, allowing a single core in this FPGA, at 91 GFLOP/s. The smaller [60x60] matrices use fewer resources, so two cores can be implemented, for a total of 2 x 39 = 78 GFLOP/s. The smallest [30x30] matrix size permits three cores, for a total of 3 x 26 = 78 GFLOP/s.
FPGAs seem to be much better suited to problems with smaller data sizes. One explanation is that since computational load increases as N³ while data I/O increases as N², the I/O bottlenecks of the GPU eventually become less important as the dataset grows. The other consideration is throughput: as matrix sizes increase, throughput in matrices per second drops dramatically because of the increased processing per matrix. At some point the throughput becomes too low to be usable in many applications. In many cases, large matrices can be tiled and the smaller individual sub-matrices processed, to address the throughput limitations imposed by sheer processing load.
For FFTs, the computational load increases as N log2 N, whereas the data I/O increases as N. Again, at very large data sizes the GPU becomes an efficient computational engine, while the FPGA is an efficient computational engine at much smaller data sizes. The FPGA is therefore better suited to the many applications where FFT sizes are in the thousands, and the GPU to those where FFT sizes are in the hundreds of thousands.
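The compute-to-I/O scaling argument of the last two paragraphs can be sketched numerically. The data sizes below are illustrative choices, not benchmark points; the key observation is that the Cholesky ratio grows linearly in N while the FFT ratio grows only logarithmically, so the FFT needs far larger N before computation dominates data movement.

```python
import math

def cholesky_compute_io_ratio(n: int) -> float:
    # Operations scale as N**3, data movement as N**2 -> ratio grows as N.
    return n ** 3 / n ** 2

def fft_compute_io_ratio(n: int) -> float:
    # Operations scale as N*log2(N), data movement as N -> ratio grows as log2(N).
    return (n * math.log2(n)) / n

# Illustrative sizes: a small matrix vs. small and very large FFTs.
print(cholesky_compute_io_ratio(64))    # 64.0  ops per datum at N = 64
print(fft_compute_io_ratio(4096))       # 12.0  ops per datum at N = 4096
print(fft_compute_io_ratio(262144))     # 18.0  ops per datum at N = 262144
```

Even a modest [64x64] matrix already does more work per transferred datum than a 262144-point FFT, which is consistent with GPUs only becoming efficient at FFT lengths in the hundreds of thousands.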