Editor’s Note: In this Product How-To design article, the Freescale authors describe how the floating point capabilities of the StarCore SC3900FP DSP core incorporated into the company’s QorIQ Qonverge B4860 system-on-chip (SoC) can be used to lower the cost and improve the performance of LTE wireless base stations.
Until recently most real-world signal processing applications, especially in wireless radio communications, used fixed point arithmetic implementations because, among other things, they had significant speed and cost benefits due to reduced hardware complexity. Also, time to market and development costs were lower.
Floating point DSP was only used where other capabilities, including a wider dynamic range and greater precision, were required, such as in radar systems, medical imaging, and industrial robotics. Even there, developers had to weigh the lower hardware cost of fixed point against floating point, which required more expensive hardware but in the long term reduced the cost and complexity of the software development needed to compensate for the limitations of fixed point.
The real-world needs of MIMO wireless
In the age of ubiquitous mobile and sensor wireless communications, what constitutes real-world signal processing has changed dramatically. Wireless communications standards such as LTE (Long Term Evolution) use Multi-Input, Multi-Output (MIMO) algorithms and require increased data rates, enhanced performance, and high accuracy. These algorithms rely on matrix inversion techniques, which are susceptible to precision, scaling, and quantization problems when traditional fixed point signal processing techniques are used, forcing developers to look for floating point alternatives.
Although fixed point implementations usually provide better performance in terms of cycle count, they require fairly large amounts of high precision mathematical operations, forcing the use of normalization during the calculation process. This significantly increases overhead and forces developers to run additional simulations to make sure the algorithm holds in fixed point representation.
Unlike the exponent-based nature of floating point, which dramatically increases dynamic range and exactness, fixed point signal processing requires additional operations such as scaling and common shift calculation to reach the same precision. Moreover, some fixed point-based algorithms require inter-flow realignment (associated with the same scale factor value, for example) to guarantee meaningful operations within the subsequent algorithm stages.
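To make the scaling overhead concrete, the following is a minimal sketch in portable C, not SC3900 code. It assumes the common Q15 format (a 16-bit integer v stands for v / 32768); the names q15_mul and common_shift and the sample values are illustrative and do not come from the original article.

```c
#include <assert.h>
#include <stdint.h>

/* Q15 format: a 16-bit integer v represents the real value v / 32768. */
typedef int16_t q15_t;

/* Multiply two Q15 values. The 32-bit product is in Q30, so it has to be
   rounded and shifted right by 15 to return to Q15. This assumes the usual
   arithmetic right shift for negative values. */
static q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q30 product */
    p = (p + (1 << 14)) >> 15;             /* round, then rescale to Q15 */
    if (p > INT16_MAX) {
        p = INT16_MAX;                     /* saturate: (-1) * (-1) overflows */
    }
    return (q15_t)p;
}

/* Common shift: the largest left shift that can be applied to every element
   of a block without overflowing 16 bits (block normalization). */
static int common_shift(const q15_t *x, int n)
{
    assert(x != 0 && n > 0);
    int32_t max_abs = 0;
    for (int k = 0; k < n; k++) {
        int32_t v = (x[k] < 0) ? -(int32_t)x[k] : (int32_t)x[k];
        if (v > max_abs) {
            max_abs = v;
        }
    }
    int shift = 0;
    while (max_abs != 0 && (max_abs << (shift + 1)) <= INT16_MAX) {
        shift++;
    }
    return shift;
}
```

In this sketch, squaring the small value 3/32768 underflows to zero unless the block is first normalized by the common shift, and the caller must then carry the scale factor through later stages. A floating point multiply needs neither step.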
A new generation of floating point, DSP-enabled, system on chip (SoC) designs such as the QorIQ Qonverge B4860 (Figure 1) with StarCore SC3900FP DSP cores are now available. These new designs not only eliminate the cost and performance advantages of the fixed point solution, but resolve some of the precision, scaling, and quantization problems.
What floating point brings to wireless design
In addition to addressing the specific problems of wireless MIMO design, floating point brings additional capabilities useful to the developer. In some cases, floating point implementations consume fewer cycles to execute than fixed point code. The floating point format offers a good tradeoff between implementation complexity and required dynamic range. Also, a general purpose DSP with floating point support benefits from easy programming and fast conversion from a Matlab simulation or C model to real-time code. These benefits speed up time to market.
The SC3900FP mentioned earlier is a high-performance, flexible vector processor. It is used in the QorIQ Qonverge B4860 and B4420 multicore SoC products targeting wireless broadband equipment. The original SC3900 has four independent data multiplication units (DMU), each of which contains eight fixed point 16-bit multipliers. Together, the four DMUs can complete thirty-two 16-bit multiply-accumulates (MACs) per cycle – up to 38.4 GMACs at 1.2 GHz.
The floating point enhanced SC3900FP version, introduced in the production version of the B4860 and B4420, adds native support for single precision, IEEE-compliant floating point arithmetic. Each DMU can now also perform native floating point operations such as multiply, add, subtract, fused multiply-add, compare, and convert. Floating point support is an integral part of the DMU (Figure 2). The main purpose of native floating point support is to provide better precision than the fixed point mechanism and to remove the need to migrate to fixed point an algorithm that is usually first implemented in Matlab.
In addition, fast calculation of the 1/x and 1/sqrt(x) functions is added to further accelerate basic floating point math. The performance of the SC3900 native 1/x operation is up to four times better than traditional floating point reciprocal implementations. Furthermore, the fused multiply-add instruction increases accuracy. Each DMU is capable of performing up to two fused multiply-add (FMAD) operations per cycle, giving a total of 16 floating point operations per cycle (4 DMUs × 2 FMADs × 2 operations each) – 19.2 GFLOPS when running at a 1.2 GHz clock. The hardware is complemented by a diverse set of multiply instructions, including complex support. As a result, the advanced SC3900FP floating point instruction set ensures high precision at a relatively low cycle-count penalty vs. fixed point.
The following code demonstrates the use of the SC3900FP floating point SIMD (Single Instruction Multiple Data) multiply instructions. The instructions “fmpy.2sp” and “fmsod.sax.2sp” can be combined to perform complex multiplication.
Figure 3: SC3900 complex multiplication. fmpy_2sp and fmsod_sax_2sp calculate both the real and imaginary portions of the product.
The SC3900FP also supports fused complex multiply-add (complex FMADD) operations (using “fmadd.2sp” instead of “fmpy.2sp”), which are equivalent to the MAC operations used in fixed point operations.
The following code demonstrates the use of the SC3900FP instructions to perform single precision complex dot-product:
// D00 = a22 * a33
__fmpy_2sp(a22re, a33re, a33im, &D00_r, &D00_i);
__fmsod_sax_2sp(a22im, a33re, a33im, &D00_r, &D00_i);
// D00 = a22 * a33 – a23 * a32
__fmsod_ssi_2sp(a23re, a32re, a32im, &D00_r, &D00_i);
__fmsod_asx_2sp(a23im, a32re, a32im, &D00_r, &D00_i);
Figure 4: SC3900FP complex dot-product code. The SC3900FP can perform two single precision fused complex multiply and add/sub per cycle.
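For readers without the SC3900FP tool chain, the listing above can be mimicked in portable C. The sketch below is only an emulation: the semantics of each intrinsic are assumed from the way the listings use them (the suffix letters read as add/subtract on the real and imaginary lanes), not taken from a reference manual, and the emu_ names and cplx_mul_sub are hypothetical. The real instructions are fused, whereas this code rounds after every multiply and add.

```c
#include <assert.h>

/* Assumed behavior: re = a * b_re, im = a * b_im. */
static void emu_fmpy_2sp(float a, float b_re, float b_im, float *re, float *im)
{
    assert(re != 0 && im != 0);
    *re = a * b_re;
    *im = a * b_im;
}

/* Assumed "sax": re -= a * b_im, im += a * b_re (cross terms). */
static void emu_fmsod_sax_2sp(float a, float b_re, float b_im, float *re, float *im)
{
    *re -= a * b_im;
    *im += a * b_re;
}

/* Assumed "ssi": re -= a * b_re, im -= a * b_im. */
static void emu_fmsod_ssi_2sp(float a, float b_re, float b_im, float *re, float *im)
{
    *re -= a * b_re;
    *im -= a * b_im;
}

/* Assumed "asx": re += a * b_im, im -= a * b_re (cross terms). */
static void emu_fmsod_asx_2sp(float a, float b_re, float b_im, float *re, float *im)
{
    *re += a * b_im;
    *im -= a * b_re;
}

/* d = x * y - u * v for complex x, y, u, v, in the same four steps as the
   D00 = a22 * a33 - a23 * a32 listing. */
static void cplx_mul_sub(float xr, float xi, float yr, float yi,
                         float ur, float ui, float vr, float vi,
                         float *dr, float *di)
{
    emu_fmpy_2sp(xr, yr, yi, dr, di);       /* real part of x times y */
    emu_fmsod_sax_2sp(xi, yr, yi, dr, di);  /* imaginary part of x times y */
    emu_fmsod_ssi_2sp(ur, vr, vi, dr, di);  /* minus real part of u times v */
    emu_fmsod_asx_2sp(ui, vr, vi, dr, di);  /* minus imaginary part of u times v */
}
```

With x = 1 + 2j, y = 3 + 4j, u = 1 + 1j, and v = 2 − 1j, the four steps accumulate x·y = −5 + 10j and then subtract u·v = 3 + 1j, leaving −8 + 9j.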
Solving LTE MIMO nonlinear matrix inversion problems
To illustrate how effectively the SC3900FP DSP core in the QorIQ Qonverge B4860 executes the fixed-matrix and other algorithms used in wireless LTE MIMO, the rest of this article examines its use in LTE UE (User Equipment) scheduling and in the calculation of a 4×4 matrix inversion.
Utility metric calculation
The goal of the UE scheduler is to optimize the utility function, which translates to maximizing the throughput of a system consisting of multiple base stations. In addressing advanced scheduling algorithms that can be executed in a real-time environment, this article focuses on performance as measured in clock cycles/CPU utilization, the use of SC3900FP SIMD processing to optimize such performance, and algorithmic performance in the wireless domain.
We evaluate SIMD implementation of the following generalized function:
is calculated as well in the implementation demonstrated below.
SC3900FP supports logarithm and exponential estimations, as well as reciprocal estimation for fast division implementation.
In addition to the advantage of diverse SC3900FP support for highly optimized, math-intensive functions, the challenge is to exploit the parallelism offered by the SIMD capabilities. Two options exist: the first is to perform the same operation on multiple users (that is, metrics) in parallel; the second is to parallelize the processing within each user.
We chose the latter as it is less intrusive in the existing codebase. The vectorized implementation of the utility function is as follows:
__ld_4fl(&PFmetrics[idx].num, &num0, &num1, &num2, &num3);
__ld_4fl(&PFmetrics[idx].den, &den0, &den1, &den2, &den3);
__ld_4fl(&PFmetrics[idx].pow, &mul0, &mul1, &mul2, &mul3);
res0 = FP_LOG2(divide(num0, den0));
res1 = FP_LOG2(divide(num1, den1));
res2 = FP_LOG2(divide(num2, den2));
res3 = FP_LOG2(divide(num3, den3));
__fmpy_pp_2sp(mul0, mul1, res0, res1, &res0, &res1);
__fmpy_pp_2sp(mul2, mul3, res2, res3, &res2, &res3);
res0 = res0 + res1 + res2 + res3;
res0 = exp2(res0);
Or, in graphical form, as follows:
The performance advantage, thanks to the parallel operation and native support of log/reciprocal instructions, is obvious. Note that the native instructions provide estimated values with 15-bit mantissa precision. A single Newton-Raphson iteration is required to achieve full precision. The SC3900 floating point cycle count performance for utility metric calculation is up to three times better than the performance on alternative floating point DSP architectures.
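The refinement step can be sketched in portable C. The hardware estimate is not available off-target, so recip_estimate_15bit below is a hypothetical stand-in that truncates 1/d to 15 mantissa bits; recip_refine applies one Newton-Raphson step, x1 = x0 · (2 − d · x0). The function fast_divide is likewise illustrative and is not the divide() helper used in the listing above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a hardware reciprocal estimate: compute 1/d and keep
   only the top 15 of the 23 stored mantissa bits. Real hardware will differ. */
static float recip_estimate_15bit(float d)
{
    assert(d != 0.0f);
    float x = 1.0f / d;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFFFF00u;              /* clear the low 8 mantissa bits */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* One Newton-Raphson step for f(x) = 1/x - d: x1 = x0 * (2 - d * x0).
   The relative error is roughly squared by each step. */
static float recip_refine(float d, float x0)
{
    return x0 * (2.0f - d * x0);
}

/* Division as estimate, one refinement step, then a multiply. */
static float fast_divide(float n, float d)
{
    return n * recip_refine(d, recip_estimate_15bit(d));
}
```

Because the error is roughly squared, an estimate good to about 15 bits becomes good to nearly the full 24-bit single precision significand after one step, which is consistent with the single iteration mentioned above.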
General linear algebra and matrix precision
Complex matrix inversion stability is affected by precision; higher precision might be required in some cases. Floating point calculation gives better precision than fixed point and may be used for greater accuracy. For an 8x8 matrix inversion, higher precision is even more critical.
Complex matrix inversion
The inverse of a square matrix A, sometimes called a reciprocal matrix, is a matrix A^-1 such that
A * A^-1 = I Eq.1
where I is the identity matrix.
Several methods exist to invert a matrix, such as Gauss-Jordan elimination, lower upper (LU) decomposition, the cofactor method, and others.
The Gauss-Jordan elimination is a method to find the inverse matrix by solving a system of linear equations. A good explanation of how this algorithm works can be found in Numerical Recipes in C. In this method, the choice of a good pivot is critical. This requires that all values of a specific column be tested against each other. Therefore, it is not well-suited for symmetric parallel code.
The description of LU decomposition also can be found in Numerical Recipes in C. This method uses decomposition of a block matrix into a lower block triangular matrix L and an upper block triangular matrix U. This method is useful for matrices larger than 4×4, but it is less efficient for our 4×4 matrix.
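Cofactor Method
We chose the cofactor method because it is suitable for symmetric parallel code and it is efficient for matrices up to 4×4. This inversion method uses the following formula:
A^-1 = (1/det(A)) * C^T Eq.2
where C is the matrix of the cofactors C(i,j) and C^T is its transpose.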
- Definition 1: If A is a square matrix, then the minor of a(i,j), denoted by M(i,j), is the determinant of the submatrix that results from removing the ith row and jth column of A.
- Definition 2: If A is a square matrix, then the cofactor of a(i,j), denoted by C(i,j), is the number (-1)^(i+j) * M(i,j).
The minor 3×3 determinants can be calculated as Eq.3:
[ a b c ]
[ d e f ] = aei – ahf + dhc – dbi + gbf – gec = a(ei – hf) + d(hc – bi) + g(bf – ec)
[ g h i ]
Using this equation, we reduce the number of complex multiplications from 12 to 9. Once the cofactor matrix C(i,j) is computed, the result is used to calculate the determinant of A. In our case, A is a 4×4 complex matrix.
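The saving can be checked with a short portable sketch using C99 complex arithmetic. The function names det3_factored and det3_expanded and the row-major array layout are assumptions made for this illustration. The first function follows the factored right-hand side of Eq.3 (nine complex multiplications), and the second follows the expanded six-term form (twelve).

```c
#include <assert.h>
#include <complex.h>

/* Determinant of the row-major 3x3 matrix [a b c; d e f; g h i] using the
   factored form a(ei - hf) + d(hc - bi) + g(bf - ec): 9 multiplications. */
static float complex det3_factored(const float complex m[9])
{
    assert(m != 0);
    const float complex a = m[0], b = m[1], c = m[2];
    const float complex d = m[3], e = m[4], f = m[5];
    const float complex g = m[6], h = m[7], i = m[8];
    return a * (e * i - h * f) + d * (h * c - b * i) + g * (b * f - e * c);
}

/* The same determinant in the expanded form
   aei - ahf + dhc - dbi + gbf - gec: 12 multiplications. */
static float complex det3_expanded(const float complex m[9])
{
    assert(m != 0);
    const float complex a = m[0], b = m[1], c = m[2];
    const float complex d = m[3], e = m[4], f = m[5];
    const float complex g = m[6], h = m[7], i = m[8];
    return a * e * i - a * h * f + d * h * c
         - d * b * i + g * b * f - g * e * c;
}
```

For the matrix with rows (1+j, 2, 0), (0, 1, j), and (3, 0, 2−j), both forms give 3 + 7j.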
Theorem 1: If A is an n×n matrix:
- Choose any row, say row i, then,
det(A) = a(i,1)C(i,1) + a(i,2)C(i,2) + … + a(i,n)C(i,n)
- Choose any column, say column j, then,
det(A) = a(1,j)C(1,j) + a(2,j)C(2,j) + … + a(n,j)C(n,j)
Complex matrix inversion implementation – cofactor calculation
From Eq.3, we can see that each minor element is calculated in three parts:
a(ei – hf) + d(hc – bi) + g(bf – ec)
Notice that the minor calculation of M(i,1) and M(i,2) uses the same matrix elements except for the first column. Similarly, M(i,3) and M(i,4) use the same matrix elements except for the last column. Thus, calculation of two minor elements requires 12 fused complex multiplications and add/subs. Obviously, floating point-based calculations avoid the scaling procedures required in fixed point-based calculations.
The following code calculates the first part of the first two minor elements: a0(ei – hf) and a1(ei – hf).
// Column0 removal: col0 = c1, col1 = c2, col2 = c3
// Column1 removal: col0 = c0, col1 = c2, col2 = c3
// a22 = e, a23 = h, a32 = f, a33 = i
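// a22 * a33
__fmpy_2sp(a22re, a33re, a33im, &D00_r, &D00_i);
__fmsod_sax_2sp(a22im, a33re, a33im, &D00_r, &D00_i);
// a22 * a33 – a23 * a32
__fmsod_ssi_2sp(a23re, a32re, a32im, &D00_r, &D00_i);
__fmsod_asx_2sp(a23im, a32re, a32im, &D00_r, &D00_i);
// a12 * a33
__fmpy_2sp(a12re, a33re, a33im, &D01_r, &D01_i);
__fmsod_sax_2sp(a12im, a33re, a33im, &D01_r, &D01_i);
// a12 * a33 – a13 * a32
__fmsod_ssi_2sp(a13re, a32re, a32im, &D01_r, &D01_i);
__fmsod_asx_2sp(a13im, a32re, a32im, &D01_r, &D01_i);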
// a11 * (a22 * a33 – a23 * a32)
__fmpy_2sp(a11re, D00_r, D00_i, &m00r, &m00i);
__fmsod_sax_2sp(a11im, D00_r, D00_i, &m00r, &m00i);
// a00 * (a22 * a33 – a23 * a32)
__fmpy_2sp(a00re, D00_r, D00_i, &m11r, &m11i);
__fmsod_sax_2sp(a00im, D00_r, D00_i, &m11r, &m11i);
To utilize the four DMUs, we calculate one row at a time (i.e., all minors for the same removed row). Each row requires four complex multiplications and 20 complex multiply and add/subs. The data loading is done by the address generation unit in parallel to the data arithmetic logic.
To avoid the transpose operation, we calculate the cofactor values column by column and write the output row by row. After we calculate the first cofactor column, we calculate the determinant using Theorem 1 (choosing the first column):
det(A) = a(0,0)C(0,0) + a(0,1)C(0,1) + … + a(0,3)C(0,3) Eq.4
Inv(detA) is calculated using the following equation:
1/detA = conj(detA) / (detA * conj(detA)) = conj(detA) / |detA|^2 Eq.5
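Putting the pieces together, the whole cofactor method fits in a short portable C sketch. It is a plain scalar illustration under stated assumptions, not the optimized SC3900FP implementation: the type name cf32, the functions minor3 and inv4_cofactor, and the row-major layout are hypothetical. It computes every minor with the factored form of Eq.3, expands the determinant with the indices printed in Eq.4, inverts it with Eq.5, and writes the transposed cofactors scaled by 1/det.

```c
#include <assert.h>
#include <complex.h>

typedef float complex cf32;

/* 3x3 minor of the row-major 4x4 matrix a with row r and column c removed,
   evaluated in the factored form of Eq.3. */
static cf32 minor3(const cf32 a[16], int r, int c)
{
    cf32 m[9];
    int k = 0;
    for (int i = 0; i < 4; i++) {
        if (i == r) continue;
        for (int j = 0; j < 4; j++) {
            if (j == c) continue;
            m[k++] = a[4 * i + j];
        }
    }
    return m[0] * (m[4] * m[8] - m[7] * m[5])
         + m[3] * (m[7] * m[2] - m[1] * m[8])
         + m[6] * (m[1] * m[5] - m[4] * m[2]);
}

/* Invert a 4x4 complex matrix by the cofactor method.
   Returns 1 on success, 0 if the determinant is exactly zero. */
static int inv4_cofactor(const cf32 a[16], cf32 inv[16])
{
    cf32 cof[16];
    assert(a != 0 && inv != 0);

    /* Cofactors: C(i,j) = (-1)^(i+j) * M(i,j). */
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            cf32 m = minor3(a, i, j);
            cof[4 * i + j] = ((i + j) & 1) ? -m : m;
        }
    }

    /* Determinant, Eq.4 as printed: sum of a(0,j) * C(0,j). */
    cf32 det = 0.0f;
    for (int j = 0; j < 4; j++) {
        det += a[j] * cof[j];
    }

    /* Eq.5: 1/det = conj(det) / |det|^2, one real division. */
    float mag2 = crealf(det) * crealf(det) + cimagf(det) * cimagf(det);
    if (mag2 == 0.0f) {
        return 0;
    }
    cf32 inv_det = conjf(det) / mag2;

    /* Inverse = transposed cofactor matrix times 1/det. */
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            inv[4 * i + j] = cof[4 * j + i] * inv_det;
        }
    }
    return 1;
}
```

A quick sanity check is to multiply the input by the returned inverse and compare the product with the identity matrix. A production version would also reuse the shared 2×2 sub-determinants between neighboring minors, as the article's listing does, instead of recomputing each minor from scratch.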
Floating point matrix inversion and the MIMO equalizer implemented on the SC3900 provide much greater precision than the fixed point implementation, at the cost of only about a 2.5x increase in cycle count, which is an excellent price for the precision gain.
In addition, StarCore SC3900 cores may be offloaded by the MAPLE-B3 baseband multi-accelerator platform embedded in the B4860. The MAPLE-B3 platform contains a number of processing engines, one of which is the EQPE (equalizer processing engine) hardware accelerator, designed to perform MMSE (Minimum Mean Square Error) MIMO equalization for OFDMA/SC-FDMA receivers and general-purpose matrix inversion. These operations, among others, are implemented using internal floating point engines.
The SC3900FP’s high-performance, native single-precision floating point support is useful in several ways: shorter time to market as MATLAB code can be used directly; higher precision for algorithms that require extended dynamic range; and relatively low cycle count for the gained precision in several algorithms, as no scaling and no quantization are required. In addition, the QorIQ Qonverge B4860 MAPLE-B3 contains the EQPE, which is designed to perform MMSE MIMO equalization processing and thus can offload the SC3900 cores from processing these algorithms.
Beyond LTE wireless base stations, the high performance SC3900FP integrated into the B4860 is a powerful device for other DSP-intensive applications in communications, medical, industrial, defense, and other fields. It will enable a new generation of DSP processors that eliminate the dilemma of choosing between a fixed point and a floating point architecture and give significant freedom and capabilities to real-time programmers of embedded systems software.
Avi Gal is a DSP applications expert in the Wireless Infrastructure Design in Freescale Israel. He has a BSc. in Mathematics and Computer Science from the Hebrew University and an MSc. in Electrical Engineering from the Tel-Aviv University.
Dmitry Lachover is DSP applications team leader and communications expert, part of the Wireless Infrastructure Design in Freescale Semiconductor. He has a BSc. in Electrical Engineering from the Technion – Israel Institute of Technology, an MSc. in Electrical Engineering – Communications from the Technion and an MBA from the Tel-Aviv University.
Itay Peled is the StarCore Architecture Manager, part of the Wireless Infrastructure Design in Freescale Semiconductor. He has a BSc. in Electrical Engineering from the Ben-Gurion University and an MBA from the Tel-Aviv University.