Making a wireless MIMO equalizer more efficient
Editor’s Note: This article describes the use of the Freescale Qonverge B4860 system-on-chip (SoC) with integrated StarCore SC3900 DSP cores to implement a 4x4 LTE MIMO (multiple-input and multiple-output) Equalizer. Performance is further enhanced through the use of an on-chip high-speed Equalizer Processing Element accelerator.
To meet growing demand for advanced 4G services, manufacturers of wireless infrastructure equipment increasingly require components that offer exceptional performance and flexibility. 4G services require multi-RAT (Radio Access Technology) devices that support Baseband Physical Layer and higher-layer requirements for FDD-LTE, TDD-LTE and LTE-Advanced base stations. To enable these technologies, the processor must provide low-latency, high-throughput communications at an affordable price, balancing high-performance, low-power processing with sufficient programmability.
To illustrate the requirements for 4G services, we examine a 4x4 LTE MIMO Equalizer implementation on the Qonverge B4860 system-on-chip (SoC) with integrated StarCore SC3900 DSP cores. Offloading the Equalizer to the processor’s high-speed Equalizer Processing Element (MAPLE-B3 EQPE) accelerator is described in detail.
The LTE Equalizer is one of the key functions in an LTE receiver and is based on complex matrix algebra: multiplication, decomposition and inversion. Offloading compute-intensive functions such as the Equalizer is of great importance to the performance and efficiency of the SoC used in an LTE base station.
This article describes in detail the LTE Equalizer performance and implementation on DSP cores embedded on a SoC. In addition, an alternative is presented for offloading the LTE Equalizer to the dedicated MIMO accelerator (equalizer processing element or EQPE) that can handle complex MIMO configurations of two, four, or eight antennas in parallel. The EQPE capabilities are presented in terms of low latency, high throughput, and optimized control; combined, these efficiently meet increasing data rate requirements. The solution delivers a significant reduction in cost per megabit for the growing data traffic in cellular LTE networks.
The Qonverge B4860 (Figure 1) is Freescale’s multicore SoC for macro base stations. It integrates six SC3900 StarCore flexible vector processor (FVP) cores and a MAPLE-B baseband multi-accelerator platform.
Figure 1: The B4860 SoC
The Qonverge B4860 is designed to deliver flexibility, integration and affordability while meeting demand from wireless base station original equipment manufacturers (OEMs) for ultra-high computational performance in baseband applications. The B4860 offers a total of 230.4 GMACs (Giga Multiply Accumulates) per second of DSP core performance. The MAPLE-B3 (third generation Multi Accelerator Platform for Baseband) combines a set of efficient high-speed processing elements (PEs) - the LTE/LTE-A MIMO EQPE is one of them.
StarCore SC3900 architecture
The SC3900 (Figure 2) was augmented to accelerate baseband PHY processing from end to end, from computation intensive kernels to control code. To excel in computation intensive DSP processing, the SC3900 introduces an optimized data path that includes very high memory bandwidth, matching register file and execution unit capacity and supporting fixed point operations along with a variety of application-specific instructions.
Each SC3900 DSP core has four data multiplication units (DMUs), each with eight 16x16 MACs, and delivers 38.4 giga multiply accumulates per second (GMACS) at 1.2 GHz.
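The per-core figure, and the 230.4 GMACS quoted for the whole device, follow directly from these numbers. A quick arithmetic check is sketched below; the variable names are illustrative only, and integer megahertz is used to avoid floating-point rounding noise.

```python
# Peak MAC throughput implied by the figures quoted in the text.
DMUS_PER_CORE = 4    # data multiplication units per SC3900 core
MACS_PER_DMU = 8     # 16x16 MACs per DMU per cycle
CLOCK_MHZ = 1200     # 1.2 GHz core clock
CORES = 6            # SC3900 cores in the B4860

macs_per_cycle = DMUS_PER_CORE * MACS_PER_DMU   # MACs per cycle per core
core_mmacs = macs_per_cycle * CLOCK_MHZ         # mega-MACs per second, one core
device_mmacs = core_mmacs * CORES               # mega-MACs per second, six cores

print(f"MACs per cycle per core: {macs_per_cycle}")
print(f"per core:   {core_mmacs / 1000} GMACS")
print(f"per device: {device_mmacs / 1000} GMACS")
```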
Moreover, the SC3900 data path is designed to be flexible, enabling each execution unit to execute different instructions and to access any register without penalty. This flexible data path is the essence of the FVP: it lets customers maintain high efficiency even in DSP code that is less parallel than traditional vector processors require. Complex matrix algebra calculations, which are the heart of the equalizer processing, are performed efficiently using its four execution units.
Figure 2: SC3900 Cores Cluster
The MAPLE-B3 (Figure 3) embedded in the B4860 enables Freescale Layer 1 acceleration for large-scale base transceiver station (BTS) SoCs. The accelerators target high-bandwidth, highly computational applications and optimize the SoC solution by offloading the most computationally intensive Layer 1 functionality from the DSP cores. The MAPLE-B3 also optimizes system traffic, memory consumption and control overhead using its advanced and flexible RISC-based control processors and embedded fabric.
The MAPLE-B3 includes a Programmable System Interface (PSIF) and a set of PEs, including the EQPE. The PSIF handles all the system integration aspects including buffer descriptor-based jobs, queuing, arbitration, and low-level PE control. All control and I/O transactions for execution of tasks are performed by the PSIF using a programmable system direct memory access (DMA) unit and an internal DMA unit to transfer data directly between PEs. These tasks are implemented in a flexible architecture using quad-RISC processors, which allows Freescale to develop SoCs that adopt standard technologies and to implement minor changes through firmware updates.
Figure 3: MAPLE-B3 Block Diagram
The calculation implemented by the EQPE for each sub-carrier is the Minimum Mean Square Error (MMSE) Equalization method shown in Equation 1.
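Written with the symbols defined below, a standard form of the MMSE estimate is the following. This is the textbook expression under the usual zero-mean assumptions; the published Equation 1 may arrange the same terms differently.

```latex
x = C_x H^{H} \left( H C_x H^{H} + C_n \right)^{-1} y
```

The dimensions are consistent with the definitions: the bracketed term is Nr x Nr, and the product of the Nt x Nt, Nt x Nr, Nr x Nr and Nr x 1 factors yields an Nt x 1 estimate.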
x [Ntx1] is the MMSE estimation of the transmitted signal
H [NrxNt] is the Channel Estimation matrix
Cx [NtxNt] is the layer covariance matrix
Cn [NrxNr] is the noise covariance matrix
y [Nrx1] is the received signal
(.)H denotes Hermitian transpose
Nt is the number of layers, and Nr is the number of receive antennas
The inputs to the EQPE are samples of H, y, and . The output is the X samples, which are the MMSE estimation of the transmitted signals (The EQPE also supports inserting samples instead of samples.)
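As a floating-point reference for what an MMSE equalizer computes per sub-carrier, the sketch below evaluates x = Cx·H^H·(H·Cx·H^H + Cn)^-1·y, one standard way of writing the MMSE estimate. It is a plain Python model with hypothetical helper names, not the EQPE or SC3900 implementation, and it makes no attempt at fixed-point scaling.

```python
def hermitian(m):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[m[r][c].conjugate() for r in range(len(m))]
            for c in range(len(m[0]))]

def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matadd(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def mmse_equalize(H, Cx, Cn, y):
    """Return Cx H^H (H Cx H^H + Cn)^-1 y for one sub-carrier."""
    Hh = hermitian(H)
    A = matadd(matmul(matmul(H, Cx), Hh), Cn)   # Nr x Nr
    z = solve(A, y)                              # (H Cx H^H + Cn)^-1 y
    return matmul(matmul(Cx, Hh), z)             # Nt x 1

# Illustrative 2x2 case: with almost no noise the estimate approaches x.
H = [[1.0 + 0j, 0.5j], [0.2 + 0j, 1.0 + 0j]]
Cx = [[1.0 + 0j, 0j], [0j, 1.0 + 0j]]
Cn = [[1e-9 + 0j, 0j], [0j, 1e-9 + 0j]]
x_true = [[1 + 1j], [-1 + 0.5j]]
y = matmul(H, x_true)
x_hat = mmse_equalize(H, Cx, Cn, y)
print(x_hat)
```

Solving the linear system instead of forming an explicit inverse is a choice made for this sketch only; the article's own implementation uses matrix inversion.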
MMSE 4x4 MIMO Equalizer implementation on SC3900
We assume that the input matrix elements are represented in fixed point Q15.
The matrix multiplication parts of Equation 1 are implemented using the four eight-multiplier DMUs. The hardware is complemented by a diverse set of multiply instructions, including complex 16x16 and 32x16 multiply instructions. Complex 16x16 multiplication is performed using the mpycx.2x instruction, which computes the real and imaginary portions of the product. All inputs and outputs come from 40-bit registers. The source operands are assumed to contain a packed complex number, where the high portion holds the real part (signed, fractional 16 bits), and the low portion holds the imaginary part (signed, fractional 16 bits). The output of the operation is stored as a 40-bit value (Figure 4).
Using the 4 DMUs, eight complex 16x16 multiplications can be performed in one cycle. Figure 5 illustrates the complex multiply instructions.
Figure 5: SC3900 complex multiplication. MPYCX_2X calculates both the real and imaginary portions of the product.
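To make the packed operand format concrete, here is a small Python model of a complex multiply on packed Q15 operands, with the real part in the high 16 bits and the imaginary part in the low 16 bits. The helper names are hypothetical, and the model only mirrors the arithmetic: the exact scaling, rounding and saturation of the SC3900 instruction are not modelled, and results are left as raw Q30 integers.

```python
def to_q15(x):
    """Quantize a float in [-1, 1) to a signed 16-bit Q15 integer, saturating."""
    return max(-32768, min(32767, int(round(x * 32768))))

def pack(re, im):
    """Pack Q15 real (high half) and imaginary (low half) into a 32-bit word."""
    return ((re & 0xFFFF) << 16) | (im & 0xFFFF)

def unpack(word):
    """Split a packed word back into signed (real, imaginary) Q15 integers."""
    def s16(v):
        return v - 0x10000 if v & 0x8000 else v
    return s16((word >> 16) & 0xFFFF), s16(word & 0xFFFF)

def cmul_q15(a_word, b_word):
    """Complex product of two packed Q15 operands, returned as Q30 integers."""
    ar, ai = unpack(a_word)
    br, bi = unpack(b_word)
    re = ar * br - ai * bi   # real part: four real multiplies in total
    im = ar * bi + ai * br   # imaginary part
    return re, im

a = pack(to_q15(0.5), to_q15(0.25))    # 0.5 + 0.25j
b = pack(to_q15(0.5), to_q15(-0.5))    # 0.5 - 0.5j
re, im = cmul_q15(a, b)
print(re / 2**30, im / 2**30)          # product scaled back to floating point
```

Each complex product needs four real 16x16 multiplies, which is consistent with 32 multipliers yielding eight complex multiplications per cycle.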
The following code demonstrates the use of the new SC3900 complex dot-product instruction which did not exist in previous generation cores. The operation of two MPYCX.2X instructions can be combined using an MPYCXD.PP.S.2X instruction.
The SC3900 also supports complex multiply-accumulate (MAC) operations. Note that in order to perform the same operation on the SC3850, six separate instructions were required.
Example 1: SC3900 complex dot product code. SC3900 performs two complex 16x16 multiplications and a complex subtraction in a single instruction.
The 4x4 matrix inversion is part of the 4x4 equalization algorithm and is implemented using a cofactor method, using the following formula (as defined here http://tutorial.math.lamar.edu/Classes/LinAlg/MethodOfCofactors.aspx):
A^(-1) = (1/det(A)) * Transpose(Cofactor(A))
This method is based on minor determinants calculations where minor 3x3 determinants can be calculated as:
a(ei - hf) + d(hc - bi) + g(bf - ec)
The number of operations and cycles is reduced using this equation by calculating the cofactor values column by column and writing the output row by row, and by utilizing the Hermitian matrix features and the 4 DMUs. The algorithm requires 16x16 complex multiplication (described above) and 32x16 complex MACs. The determinant norm is used to determine the scale factor and to perform scaling.
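The cofactor method itself can be sketched in a few lines of floating-point Python. This is a straightforward reference with hypothetical function names: it uses the 3x3 minor formula quoted above, but it does not exploit the Hermitian structure, the column-by-column ordering, or the fixed-point scaling that the optimized SC3900 implementation relies on.

```python
def det3(m):
    """3x3 determinant: a(ei - hf) + d(hc - bi) + g(bf - ec)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - h * f) + d * (h * c - b * i) + g * (b * f - e * c)

def minor(m, row, col):
    """3x3 sub-matrix of a 4x4 matrix with one row and one column removed."""
    return [[m[r][c] for c in range(4) if c != col]
            for r in range(4) if r != row]

def inverse4(m):
    """Invert a 4x4 matrix as Transpose(Cofactor(A)) / det(A)."""
    cof = [[(-1) ** (r + c) * det3(minor(m, r, c)) for c in range(4)]
           for r in range(4)]
    det = sum(m[0][c] * cof[0][c] for c in range(4))  # expand along row 0
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    return [[cof[c][r] / det for c in range(4)] for r in range(4)]

def matmul4(a, b):
    """4x4 matrix product, used here only to check the result."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

# Illustrative Hermitian, diagonally dominant test matrix.
A = [[4, 1j, 0, 0],
     [-1j, 3, 1, 0],
     [0, 1, 2, 0.5j],
     [0, 0, -0.5j, 1]]
A_inv = inverse4(A)
product = matmul4(A, A_inv)   # should be close to the identity matrix
print(product)
```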
Complex 32x16 multiplication (i.e., mixed-precision multiplication) is somewhat more complicated. As with 16x16 multiplication, one operand is a 16-bit fractional complex number in a packed complex format. The other operand is a 32-bit fractional complex number which is placed in two registers: one register holds the 32-bit real portion and the other register holds the 32-bit imaginary portion. The result is placed in two registers with 40-bit precision. The MACCXM.R.2X instruction performs this complex multiplication, rounding and accumulation.
The throughput is four complex MACs/cycle.
Example 2: SC3900 32x16 complex MAC code. In this example, b is the 32-bit complex input.
The performance is near the optimal cycle count based on the number of required complex 16x16 and 32x16 MAC operations, determinant inversion and scaling.
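The arithmetic of a mixed-precision complex MAC can be modelled in Python as follows. This is a sketch under stated assumptions, not the behaviour of the SC3900 instruction: the 32-bit operand is taken as Q31, the 16-bit operand as Q15, the product is rounded back to Q31 with round-half-up, and the 40-bit accumulator limits and saturation are not modelled. All names are hypothetical.

```python
def round_shift(value, shift):
    """Arithmetic right shift with round-half-up."""
    return (value + (1 << (shift - 1))) >> shift

def cmac_32x16(acc, b, a):
    """Return acc + b * a for complex operands.

    acc: (re, im) accumulator in Q31
    b:   (re, im) 32-bit operand in Q31
    a:   (re, im) 16-bit operand in Q15
    """
    acc_re, acc_im = acc
    b_re, b_im = b
    a_re, a_im = a
    re = b_re * a_re - b_im * a_im   # Q31 x Q15 gives a Q46 product
    im = b_re * a_im + b_im * a_re
    return acc_re + round_shift(re, 15), acc_im + round_shift(im, 15)

b = (1 << 30, 1 << 29)     # 0.5 + 0.25j in Q31
a = (16384, -16384)        # 0.5 - 0.5j in Q15
acc = (0, 0)
acc = cmac_32x16(acc, b, a)    # accumulate the product once
acc = cmac_32x16(acc, b, a)    # and a second time
print(acc[0] / 2**31, acc[1] / 2**31)
```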
MAPLE-B3 EQPE (Equalizer Processing Element)
LTE- and LTE-A-based single carrier FDMA (SC-FDMA) technology standards target much higher data throughput than current 3G technology. These higher data throughputs drive requirements for high-throughput equalization blocks. To enable the low-latency, high-throughput equalization processing required for reliable communications, equalization capacity must increase by a factor of up to 10 compared to current 3G base station designs (Figure 6).
Figure 6: MMSE Equalization Throughput
The MAPLE PSIF completely offloads the required control and configuration of the EQPE (and all other PEs) from the DSP cores in the physical layer implementation. The EQPE is a hardware accelerator, designed as part of the MAPLE-B3 platform to perform MIMO equalization for OFDMA/SC-FDMA receivers and general-purpose matrix inversion. In addition, it provides full support for a multicore SoC, in which multiple DSP cores need to be able to use a specific hardware accelerated function.
The EQPE supports MMSE (Minimum Mean Square Error) MIMO Equalization and Matrix Inversion. These operations are implemented using internal floating point engines.
EQPE features for the LTE receiver
The flexibility of EQPE enables it to perform MMSE/ZF/IRC MIMO equalization for LTE and matrix inversion. It supports the following features:
- High-precision floating point calculations.
- Input/output samples are in block-floating point; internal calculations are performed using custom floating point.
- Support for diagonal and full (Hermitian) noise and interference covariance matrix (Cn) – with configurable granularity.
- On-the-fly channel estimate matrix interpolation using configurable weights.
- Support for advanced iterative receivers (Turbo-SIC):
- Layer cancellation
- Rank reduction
- Signal covariance matrix (Cx)
- Optimized processing order – allows for pipelining with iDFT operation.
- High throughputs to provide low latency:
- MMSE equalization: up to 425 MRE/sec for 4x2/2x2, up to 210 MRE/sec for 8x2, and up to 100 MRE/sec for 8x4/4x4 equalization.
- Matrix inversion: up to 240M inversions per second (invps) for 2x2 symmetric (Hermitian) matrices and up to 96M invps for 4x4 matrices.
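For readers unfamiliar with block-floating point, the idea is that a block of samples shares a single exponent while each sample keeps its own integer mantissa. The sketch below illustrates the general technique only; the exact format used by the EQPE is not described here, so the 16-bit mantissa width and the function names are assumptions.

```python
def to_block_float(samples, mantissa_bits=16):
    """Quantize floats to signed integer mantissas sharing one exponent."""
    limit = (1 << (mantissa_bits - 1)) - 1
    peak = max(abs(s) for s in samples)
    exponent = 0
    if peak > 0:
        # Raise the exponent until the largest sample fits the mantissa...
        while peak / 2.0 ** exponent > limit:
            exponent += 1
        # ...then lower it as far as possible to keep maximum precision.
        while peak / 2.0 ** (exponent - 1) <= limit:
            exponent -= 1
    mantissas = [round(s / 2.0 ** exponent) for s in samples]
    return mantissas, exponent

def from_block_float(mantissas, exponent):
    """Reconstruct approximate float values from mantissas and exponent."""
    return [m * 2.0 ** exponent for m in mantissas]

mantissas, exponent = to_block_float([0.5, -0.25, 0.001])
restored = from_block_float(mantissas, exponent)
print(mantissas, exponent, restored)
```

Small samples in a block lose relative precision because the exponent is set by the largest sample, which is the usual trade-off against per-sample floating point.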
Depending on the use case and assumptions, the EQPE may replace up to two SC3900 cores at 1.2 GHz for a 4x4 MIMO MMSE equalizer. The EQPE latency is, on average, about three times better than that of the equalizer implemented on an SC3900 core, when normalized for clock frequency.
Comparing core performance
As stated above, the B4860 integrates six SC3900 cores running at up to 1.2 GHz, which is equivalent to an aggregate 7.2 GHz. In the above scenario, with two SC3900 cores at 1.2 GHz dedicated to the MIMO equalizer algorithm, the capacity left for the rest of the Layer 1 processing would decrease to 4.8 GHz, which is 33 percent less. The estimated core performance is for fixed-point arithmetic. The EQPE provides floating-point arithmetic in order to increase precision. Floating point is supported by the SC3900 core, but requires more megacycles per second (MCPS). Furthermore, using the EQPE reduces the total device power, as it requires less power than two SC3900 cores.
Freeing cores improves performance
By freeing two cores, the EQPE generally improves B4860 performance by 50 percent: six cores instead of four are available for the LTE uplink and downlink processing other than the MIMO equalizer. The base station developer still keeps the flexibility to implement the MIMO equalization algorithm on the cores if desired.
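The 33 percent and 50 percent figures describe the same two cores seen from different baselines, as the arithmetic below shows (using the aggregate-GHz measure from the text as a rough proxy for capacity):

```latex
\begin{align*}
\text{all six cores:} \quad & 6 \times 1.2\ \text{GHz} = 7.2\ \text{GHz} \\
\text{two cores spent on equalization:} \quad & (6 - 2) \times 1.2\ \text{GHz} = 4.8\ \text{GHz} \\
\text{capacity lost without the EQPE:} \quad & \frac{7.2 - 4.8}{7.2} \approx 33\% \\
\text{capacity gained with the EQPE:} \quad & \frac{7.2 - 4.8}{4.8} = 50\%
\end{align*}
```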
In summary, the B4860 delivers a high level of performance, flexibility and integration, combining six fully programmable new and enhanced SC3900 DSP cores, each running at up to 1.2 GHz with embedded MAPLE-B3 baseband multi-accelerator platform providing an architecture highly optimized for wireless infrastructure applications.
Dmitry Lachover is DSP applications team leader and communications expert in the Wireless Infrastructure Department in Freescale Semiconductor. He has a BSc. in Electrical Engineering from the Technion - Israel Institute of Technology and a MSc. in Electrical Engineering - Communications from the Technion.
Avi Gal is a DSP applications expert in the Wireless Infrastructure Design Department in Freescale Israel. He has a BSc. in Mathematics and Computer Science from the Hebrew University and a MSc. in Electrical Engineering from the Tel-Aviv University.
Ran Zamir is a HW architect in the Wireless Infrastructure Department in Freescale Semiconductor. He has a BSc. in Electrical Engineering from the Ben-Gurion University.