Implementing effective floating point support in wireless base station designs

Avi Gal, Dmitry Lachover, and Itay Peled, Freescale Semiconductor

July 07, 2013


Editor’s Note: In this Product How-To design article, the Freescale authors describe how the floating point capabilities of the StarCore SC3900FP DSP core incorporated into the company’s QorIQ Qonverge B4860 system-on-chip (SoC) can be used to lower the cost and improve the performance of LTE wireless base stations.

Until recently most real-world signal processing applications, especially in wireless radio communications, used fixed point arithmetic implementations because, among other things, they had significant speed and cost benefits due to reduced hardware complexity. Also, time to market and development costs were lower.

Floating point DSP was used only where other capabilities, including a wider dynamic range and greater precision, were required, such as in radar systems, medical imaging, and industrial robotics. Even there, developers had to weigh a trade-off: fixed point hardware cost less, while floating point hardware was more expensive but, in the long term, reduced the cost and complexity of the software development otherwise needed to compensate for the limitations of fixed point.

The real-world needs of MIMO wireless
In the age of ubiquitous mobile and sensor wireless communications, what constitutes real-world signal processing has changed dramatically. Wireless communications standards such as LTE (Long Term Evolution) use Multi-Input, Multi-Output (MIMO) algorithms and require increased data rates, enhanced performance, and high accuracy. These algorithms rely on matrix inversion techniques, which are susceptible to precision, scaling, and quantization problems when traditional fixed point signal processing techniques are used, forcing developers to look for floating point alternatives.

Although fixed point implementations usually provide better performance in terms of cycle count, these algorithms require a fairly large number of high precision mathematical operations, which forces the use of normalization during the calculation process. This significantly increases overhead and forces developers to run additional simulations to make sure the algorithm holds in fixed point representation.

Unlike the exponent-based nature of floating point, which dramatically increases dynamic range and exactness, fixed point signal processing requires additional operations such as scaling and common shift calculation to reach the same precision. Moreover, some fixed point-based algorithms require inter-flow realignment (associated with the same scale factor value, for example) to guarantee meaningful operations within the subsequent algorithm stages.
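The scaling and common shift steps described above can be sketched in portable C. This is a minimal illustration and not SC3900 code: the Q15 format and the helper names (mul_q15, common_shift, mul_block_scaled, mul_float) are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Q15 fractional multiply: reduce the Q30 product back to 16 bits.
   Saturation of the single overflow case (-1 * -1) is omitted for brevity. */
static int16_t mul_q15(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q30 product */
    return (int16_t)(p / 32768);           /* truncate toward zero */
}

/* Common shift: the number of left shifts that brings the largest
   magnitude in the block as close to full scale as possible. */
static int common_shift(const int16_t *x, size_t n)
{
    assert(x != NULL && n > 0);
    int32_t max_mag = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t m = x[i] < 0 ? -(int32_t)x[i] : (int32_t)x[i];
        if (m > max_mag) max_mag = m;
    }
    if (max_mag == 0) return 0;
    int shift = 0;
    while ((max_mag << (shift + 1)) <= 0x7FFF) shift++;
    return shift;
}

/* Element-wise multiply after normalizing each block by its common shift.
   The caller has to carry *exp_out along with the data:
   true Q15 value = out[i] / 2^(*exp_out). */
static void mul_block_scaled(const int16_t *a, const int16_t *b,
                             int16_t *out, size_t n, int *exp_out)
{
    int sa = common_shift(a, n);
    int sb = common_shift(b, n);
    for (size_t i = 0; i < n; i++) {
        int16_t an = (int16_t)(a[i] * (1 << sa));  /* fits: max_mag << sa <= 0x7FFF */
        int16_t bn = (int16_t)(b[i] * (1 << sb));
        out[i] = mul_q15(an, bn);
    }
    *exp_out = sa + sb;
}

/* Floating point: the exponent travels with every value, so there is
   no shift bookkeeping at all. */
static float mul_float(float a, float b)
{
    return a * b;
}
```

With small inputs such as 100 and 100 (about 0.003 in Q15), mul_q15 alone returns 0, while the normalized version keeps the significant bits at the price of an extra pass over the data and an exponent that every later stage must honor. The float version needs neither.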

A new generation of floating point, DSP-enabled system-on-chip (SoC) designs, such as the QorIQ Qonverge B4860 (Figure 1) with StarCore SC3900FP DSP cores, is now available. These new designs not only eliminate the cost and performance advantages of the fixed point solution, but also resolve some of the precision, scaling, and quantization problems.

Figure 1: The B4860 SoC.

What floating point brings to wireless design
In addition to addressing the specific problems of wireless MIMO design, floating point brings additional capabilities useful to the developer. In some cases, floating point implementations consume fewer cycles to execute than fixed point code. The floating point format offers a good tradeoff between implementation complexity and required dynamic range. Also, a general purpose DSP with floating point support benefits from easy programming and fast conversion from a Matlab simulation or C model to real-time code. These benefits shorten time to market.

The SC3900FP mentioned earlier is a high-performance, flexible vector processor. It is used in the QorIQ Qonverge B4860 and B4420 multicore SoC products targeting wireless broadband equipment. The original SC3900 has four independent data multiplication units (DMU), each of which contains eight fixed point 16-bit multipliers. Together, the four DMUs can complete thirty-two 16-bit multiply-accumulates (MACs) per cycle - up to 38.4 GMACs at 1.2 GHz.

The floating point enhanced SC3900FP version, introduced in the production version of the B4860 and B4420, adds native support for single precision, IEEE-compliant floating point arithmetic. Each DMU can now also perform native floating point operations such as multiply, add, subtract, fused multiply-add, compare, and convert. Floating point support is an integral part of the DMU (Figure 2). The main purpose of native floating point support is to provide precision superior to that of the fixed point mechanism and to remove the need to migrate algorithms, usually developed in Matlab, to fixed point.

In addition, fast calculation of the 1/x and 1/sqrt(x) functions is added to further accelerate basic floating point math. The performance of the SC3900FP native 1/x operation is up to four times better than traditional floating point reciprocal implementations. Furthermore, the fused multiply-add instruction increases accuracy. Each DMU is capable of performing up to two fused multiply-add (FMAD) operations per cycle, giving a total of 16 floating point operations per cycle, or 19.2 GFLOPS when running at a 1.2 GHz clock. The hardware is complemented by a diverse set of multiply instructions, including complex support. As a result, the advanced SC3900FP floating point instruction set ensures high precision at a relatively low cycle-count penalty vs. fixed point.
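The peak figures quoted so far follow from simple arithmetic, which the short sketch below makes explicit. The function name peak_giga_ops is hypothetical, and the calculation assumes that every unit issues its maximum number of operations on every cycle, which real code rarely sustains.

```c
#include <assert.h>

/* Peak rate in giga-operations per second:
   units * operations per unit per cycle * clock in GHz. */
static double peak_giga_ops(int units, int ops_per_unit_per_cycle, double clock_ghz)
{
    assert(units > 0 && ops_per_unit_per_cycle > 0 && clock_ghz > 0.0);
    return (double)units * ops_per_unit_per_cycle * clock_ghz;
}
```

For the fixed point case, peak_giga_ops(4, 8, 1.2) gives the 38.4 GMACs figure: four DMUs with eight 16-bit multipliers each. For floating point, peak_giga_ops(4, 2 * 2, 1.2) gives 19.2 GFLOPS: four DMUs, two fused multiply-adds per DMU, each counted as one multiply plus one add.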

Figure 2: SC3900FP Data Arithmetic Logic Unit (DALU)

The following code demonstrates the use of the SC3900FP floating point SIMD (Single Instruction Multiple Data) multiply instructions. The instructions “fmpy.2sp” and “fmsod.sax.2sp” can be combined to perform complex multiplication.

Figure 3: SC3900 complex multiplication. fmpy_2sp and fmsod_sax_2sp together calculate both the real and imaginary portions of the product.
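For readers without the toolchain at hand, the two-step scheme can be modeled in portable C. The sketch below is not Freescale code: the function names model_fmpy_2sp and model_fmsod_sax_2sp are hypothetical, and the lane behavior is an assumption inferred from the instruction names and the argument order used in this article. Rounding also differs, because the hardware fuses each multiply and add into one rounding step while this model rounds twice.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed behavior of fmpy.2sp: multiply one scalar by a pair of values. */
static void model_fmpy_2sp(float s, float re, float im,
                           float *acc_re, float *acc_im)
{
    assert(acc_re != NULL && acc_im != NULL);
    *acc_re = s * re;
    *acc_im = s * im;
}

/* Assumed behavior of fmsod.sax.2sp (read here as "subtract, add, cross"):
   the scalar multiplies the pair in crossed order, the first lane is
   subtracted from the accumulator and the second lane is added. */
static void model_fmsod_sax_2sp(float s, float re, float im,
                                float *acc_re, float *acc_im)
{
    assert(acc_re != NULL && acc_im != NULL);
    *acc_re = *acc_re - s * im;
    *acc_im = *acc_im + s * re;
}

/* (a_re + j*a_im) * (b_re + j*b_im) in two steps. */
static void complex_mul(float a_re, float a_im, float b_re, float b_im,
                        float *p_re, float *p_im)
{
    model_fmpy_2sp(a_re, b_re, b_im, p_re, p_im);       /* a_re*b_re, a_re*b_im   */
    model_fmsod_sax_2sp(a_im, b_re, b_im, p_re, p_im);  /* -a_im*b_im, +a_im*b_re */
}
```

The first step contributes the terms that involve the real part of the first operand, and the second step contributes the terms that involve its imaginary part, so the four real multiplications of a complex product are covered by two two-lane instructions.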

The SC3900FP also supports fused complex multiply-add (complex FMADD) operations (using “fmadd.2sp” instead of “fmpy.2sp”), which are equivalent to the MAC operations used in fixed point operations.

The following code demonstrates the use of the SC3900FP instructions to perform single precision complex dot-product:

   // D00 = a22 * a33
   __fmpy_2sp(a22re, a33re, a33im, &D00_r, &D00_i);
   __fmsod_sax_2sp(a22im, a33re, a33im, &D00_r, &D00_i);

   // D00 = a22 * a33 - a23 * a32
   __fmsod_ssi_2sp(a23re, a32re, a32im, &D00_r, &D00_i);
   __fmsod_asx_2sp(a23im, a32re, a32im, &D00_r, &D00_i);

Figure 4: SC3900FP complex dot-product code. The SC3900FP can perform two single precision fused complex multiply and add/subtract operations per cycle.
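The arithmetic that the four intrinsic calls carry out can be checked against a plain C reference. The sketch below is only a model under stated assumptions: the cplx type and the cmac and minor_d00 names are hypothetical, and the mapping of each step to an instruction is inferred from the comments in the listing rather than taken from Freescale documentation.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float re, im; } cplx;   /* hypothetical type for this sketch */

/* acc += sign * (a * b), split into the same two scalar-by-pair steps
   that the listing uses for each complex product. */
static void cmac(cplx *acc, cplx a, cplx b, float sign)
{
    assert(acc != NULL);
    /* Step 1: a.re times (b.re, b.im), lanes in order.
       Assumed to correspond to fmpy/fmadd (sign +) or fmsod.ssi (sign -). */
    acc->re += sign * a.re * b.re;
    acc->im += sign * a.re * b.im;
    /* Step 2: a.im times (b.im, b.re), lanes crossed, opposite signs.
       Assumed to correspond to fmsod.sax (sign +) or fmsod.asx (sign -). */
    acc->re -= sign * a.im * b.im;
    acc->im += sign * a.im * b.re;
}

/* D00 = a22 * a33 - a23 * a32 */
static cplx minor_d00(cplx a22, cplx a23, cplx a32, cplx a33)
{
    cplx d = {0.0f, 0.0f};
    cmac(&d, a22, a33, +1.0f);
    cmac(&d, a23, a32, -1.0f);
    return d;
}
```

The result has the form of a 2x2 determinant, the kind of term that appears repeatedly when a matrix is inverted through its cofactors. A reference like this is useful mainly for validating intrinsic-based code on a host machine before moving to the target.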

Solving LTE MIMO nonlinear matrix inversion problems
To illustrate how effectively the SC3900FP DSP core in the QorIQ Qonverge B4860 executes the fixed-matrix and other algorithms used in wireless LTE MIMO, the rest of this article examines its use in LTE UE (User Equipment) scheduling and in the calculation of a 4x4 matrix inversion.
