Editor's Note: In this series of design articles, the authors offer a close look at various design challenges and effective use of design tools and techniques for resolution of those challenges. Be sure to check out the first article in this series: Model-based FPGA design tool quietly gains adherents
Today’s 3G and 4G wireless systems require an enormous amount of signal processing. Smart phones offer ever more features and capabilities, but behind all of this connectivity is the wireless infrastructure. FPGAs play a key role in the hundreds of thousands of wireless base stations supporting our communications network. Smart phones demand much greater data bandwidth, which puts more pressure on the network and demands greater performance from the FPGAs within.
Due to particulars of the 3G UMTS and CDMA wireless standards, the signal processing circuits operate at multiples of 61.44 MHz. The 3G base stations typically operated at 245.76 or 368.64 MHz. The newer 4G base station designs are targeting 491.52 MHz, which is extremely challenging to support across large designs in the latest FPGAs.
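The relationship between these clock rates can be checked with a few lines of arithmetic. The sketch below is only an illustration; the helper name `multiple_of_base` is hypothetical and not part of any tool mentioned here.

```python
BASE_MHZ = 61.44  # base rate named in the text


def multiple_of_base(clock_mhz, base_mhz=BASE_MHZ):
    """Return the integer multiple if clock_mhz is a whole multiple of base_mhz, else None."""
    ratio = clock_mhz / base_mhz
    nearest = round(ratio)
    return nearest if abs(ratio - nearest) < 1e-9 else None


# Clock rates quoted in the text for 3G and 4G base stations.
CLOCKS_MHZ = [245.76, 368.64, 491.52]
multiples = {f: multiple_of_base(f) for f in CLOCKS_MHZ}

for f, m in multiples.items():
    print(f"{f} MHz = {m} x {BASE_MHZ} MHz")
```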
These designs can be built using HDL (Verilog or VHDL) by expert FPGA designers. Alternatively, there is one high level toolflow capable of delivering the required performance: DSP Builder Advanced Blockset. This article will describe in detail the design, performance and resources of a radio head design built using DSP Builder on Altera’s 20nm Arria 10 FPGAs.
Base station architecture
The base station architecture is generally divided into two sections. The baseband section performs the OFDMA/CDMA modulation and demodulation, error correction and generally all the L1 functions in the standard, and is often located at a central office location. The radio section implements the functions closely associated with the antennas: digital up and down conversion, crest factor reduction, and digital pre-distortion. Often referred to as the remote radio head, it is typically located on the antenna tower or in cabinets at the base of the antenna tower.
An antenna site typically supports three sectors, each nominally covering 120 degrees. Each sector radio head can be further described by the number of transmit antennas, the number of receive antennas, and the number of RF carriers (sometimes called channels) supported per antenna.
The interface to each antenna is through serial JESD204B interfaces to the receive ADC and transmit DAC, each operating at 491 MSPS. The interface to the baseband processing is through the CPRI protocol, which uses serial links. A system with two receive and two transmit antennas is shown in Figure 1.
Figure 1: Wireless Radio Head Block Diagram
Functionally, the processing blocks are shown below in Figure 2. In this case, each antenna supports two RF carriers. While the data rates vary for each block, the clocking rate for each block is 491 MHz. Up to 8 RX and TX antennas can be supported in a single FPGA.
Figure 2: Radio Head Signal Processing Chain
Digital upconversion begins with FIR filters. At the lower sample rate of 30.72 MSPS, the channel shaping filtering is performed. This limits the spectral width. To increase the sample rate, for example to 122.88 MSPS, half-band filters are used. The DSP Builder toolflow produces optimal structures for both types of filters, taking advantage of all the special hardware features of the chosen FPGA. The hardened DSP blocks include built-in fixed coefficient register banks, pre-adders for symmetric filters, a systolic post adder structure, and biased rounding, which allows for minimal use of programmable logic in FIR filter chains.
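To see why half-band filters are inexpensive to implement, consider the following sketch. It assumes a Hamming-windowed sinc design, which is a textbook method and not necessarily what DSP Builder generates; the function name `halfband_taps` and the 23-tap length are illustrative choices only.

```python
import math


def halfband_taps(half_len=11):
    """Windowed-sinc half-band low-pass taps for n = -half_len..half_len (Hamming window)."""
    taps = []
    for n in range(-half_len, half_len + 1):
        if n == 0:
            ideal = 0.5
        elif n % 2 == 0:
            ideal = 0.0  # sin(pi*n/2) vanishes at every even n, so these taps cost nothing
        else:
            ideal = math.sin(math.pi * n / 2) / (math.pi * n)
        window = 0.54 + 0.46 * math.cos(math.pi * n / half_len)
        taps.append(ideal * window)
    return taps


taps = halfband_taps(11)
nonzero = sum(1 for t in taps if t != 0.0)
# Symmetric tap pairs can share one multiplier through a pre-adder;
# the centre tap is counted as one additional term.
multipliers = (nonzero - 1) // 2 + 1
print(len(taps), nonzero, multipliers)
```

Nearly half of the taps are exactly zero, and the remaining ones are symmetric, which is where the pre-adders mentioned above help. Two such interpolate-by-2 stages in series would take 30.72 MSPS to 122.88 MSPS, although the exact staging in the reference design is not stated here.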
Complex mixer and NCO circuits are also needed to modulate the signal onto the selected IF carrier frequency. This again is easily supported in the toolflow. Multiple channels are supported without requiring the designer to provide control or synchronization circuits, in this case providing for 16 separate baseband channels. Best of all, the tool is able to close timing with Fmax well in excess of 491 MHz in the -1 speed grade of the Arria 10 FPGA.
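The NCO and complex mixer can be sketched behaviorally as follows. This is a floating-point illustration, assuming a 32-bit phase accumulator and a 15.36 MHz carrier offset chosen only for the example; the function names `nco` and `mix` are hypothetical, and a hardware NCO would normally use a sine lookup table or similar rather than `cmath.exp`.

```python
import cmath
import math


def nco(freq_hz, sample_rate_hz, num_samples, acc_bits=32):
    """Phase-accumulator NCO returning unit-magnitude complex samples."""
    modulus = 1 << acc_bits
    step = round(freq_hz / sample_rate_hz * modulus) % modulus  # tuning word
    acc = 0
    out = []
    for _ in range(num_samples):
        phase = 2.0 * math.pi * acc / modulus
        out.append(cmath.exp(1j * phase))
        acc = (acc + step) & (modulus - 1)  # wrap like a hardware accumulator
    return out


def mix(baseband, carrier):
    """Complex mixer: one complex multiply per sample."""
    return [b * c for b, c in zip(baseband, carrier)]


# Illustrative values: 122.88 MSPS sample rate, carrier offset of 15.36 MHz (1/8 of the rate).
carrier = nco(15.36e6, 122.88e6, 16)
shifted = mix([1 + 0j] * 16, carrier)
```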
Digital down conversion, performed in the receive path, is similar. Figure 3 depicts a design and test bench using both DUC and DDC circuits to verify proper operation in the MathWorks Simulink environment. With a sample rate of 122.88 MSPS (complex), this allows for an aggregate RF bandwidth of up to 80 MHz. Individual channels are normally 5 MHz, 10 MHz, or 20 MHz in bandwidth, depending upon the number of OFDMA carriers in each channel.
Figure 3: Digital Up and Down Conversion
FIR filters can be designed in MATLAB, and even the FIR filter command line can be entered into the DSP Builder parameterization field. The coefficients can be fixed, or read/writable. Alternatively, a coefficient file can be provided. The filter can be folded (using a data rate lower than the clock rate), super-sampled (using a data rate higher than the clock rate), and can provide rate changes (decimation or interpolation) by integer or non-integer rates. Saturation and rounding options are supported, to manage the fixed point dynamic range. Floating point FIRs are also supported, if desired.
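The benefit of folding can be estimated with simple arithmetic. The sketch below is an illustration only: the function names and the 64-tap filter length are hypothetical, rates are given in kHz to keep the arithmetic in integers, and the count ignores savings from coefficient symmetry.

```python
def folding_factor(clock_khz, sample_rate_khz):
    """How many samples' worth of work one hardware unit can do per input sample."""
    return clock_khz // sample_rate_khz


def multipliers_needed(num_taps, num_channels, sample_rate_khz, clock_khz):
    """Multiplies required per second divided by what one multiplier delivers per second."""
    macs = num_taps * num_channels * sample_rate_khz
    return -(-macs // clock_khz)  # ceiling division on integers


CLOCK_KHZ = 491_520   # 491.52 MHz clock from the text
RATE_KHZ = 30_720     # 30.72 MSPS channel filter rate from the text

factor = folding_factor(CLOCK_KHZ, RATE_KHZ)
one_channel = multipliers_needed(64, 1, RATE_KHZ, CLOCK_KHZ)
sixteen_channels = multipliers_needed(64, 16, RATE_KHZ, CLOCK_KHZ)
print(factor, one_channel, sixteen_channels)
```

With a folding factor of 16, a filter that would need 64 multipliers unfolded needs 4 for one channel, or the same 64 multipliers can serve 16 channels.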
Crest Factor Reduction
Crest factor reduction (CFR) is a more complex function, used to reduce the peak-to-average ratio of the wireless signal while minimizing the impact on signal quality, which is measured using the Error Vector Magnitude (EVM) of the demodulated signal. The larger the EVM, the more distorted the demodulated OFDMA constellation will be, which in turn degrades data recovery and system performance.
Figure 4: QAM Constellation after OFDMA Demodulation
CFR allows the high powered RF amplifiers to operate more efficiently, which allows for higher RF transmit power and reduced power dissipation in the radio head.
Two CFR methods have been explored using DSP Builder. The first method is known as “Pulse Allocation CFR”. Pulse Allocation CFR searches for the maximum peak in the signal which exceeds a defined threshold. When found, a sinc pulse is generated with the correct phase and amplitude to cancel the peak energy above the threshold. A filter is also used to ensure that spectral limits are not exceeded. Multiple iterations are used; the first iteration will cancel the largest peaks, leaving remaining nearby peaks to be cancelled by subsequent iterations. The input data is processed in blocks.
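The peak search and cancellation loop can be sketched as follows. This is a simplified behavioral model under stated assumptions, not the reference design: it cancels one peak per iteration, uses a Hamming-windowed sinc as the band-limited cancellation pulse, and the names `sinc_pulse` and `pulse_cancel_block` are hypothetical.

```python
import math


def sinc_pulse(half_len=8, bandwidth=0.5):
    """Hamming-windowed sinc with its peak (about 1.0) at the centre tap."""
    pulse = []
    for n in range(-half_len, half_len + 1):
        x = bandwidth * n
        s = 1.0 if n == 0 else math.sin(math.pi * x) / (math.pi * x)
        w = 0.54 + 0.46 * math.cos(math.pi * n / half_len)
        pulse.append(s * w)
    return pulse


def pulse_cancel_block(block, threshold, pulse, iterations=4):
    """Each iteration finds the largest peak in the block and cancels its excess."""
    out = list(block)
    half = len(pulse) // 2
    for _ in range(iterations):
        k = max(range(len(out)), key=lambda i: abs(out[i]))
        mag = abs(out[k])
        if mag <= threshold:
            break  # nothing left above the threshold
        # Same phase as the peak, amplitude equal to the part above the threshold.
        excess = out[k] * (mag - threshold) / mag
        for j, p in enumerate(pulse):
            idx = k - half + j
            if 0 <= idx < len(out):
                out[idx] -= excess * p
    return out


# Illustrative block: silence with a single peak of magnitude 1.5.
block = [0j] * 64
block[32] = 1.5 + 0j
reduced = pulse_cancel_block(block, threshold=1.0, pulse=sinc_pulse())
```

Because the pulse has side lobes, cancelling one peak perturbs its neighbors, which is one reason several iterations are needed.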
Figure 5: Pulse Allocation CFR Block Diagram
All of this can be readily implemented in DSP Builder, which allows the designer to focus on algorithm verification, and lets the tool deal with optimizing the design to the FPGA architecture, and pipelining as necessary to meet the aggressive Fmax requirements.
Verification is simplified by leveraging the test bench, plotting and visualization capabilities of the MathWorks environment. Any design can easily be debugged by adding time or frequency domain monitoring at any processing stage.
Figure 6: Pulse Allocation Implementation
The second CFR implemented is the FIR filter based method. A FIR filter is used to create a cancellation pulse for peaks that exceed the threshold. A clipping ratio is calculated based upon the magnitude exceeding the threshold.
The cancellation pulse is subtracted from a delayed version of the input data. However, to achieve desired results, two iterations in series are used, consuming more FPGA logic and DSP resources, and adding more latency.
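A minimal feed-forward sketch of this method is shown below. It is a behavioral model under assumptions: the three-tap filter is a placeholder, the function names are hypothetical, and the hardware delay line is modeled by centering the filter on each sample so that the filtered excess lines up with the input.

```python
def fir_cfr_stage(x, threshold, taps):
    """One feed-forward stage: filter the over-threshold excess, subtract it from the input."""
    half = len(taps) // 2
    excess = []
    for s in x:
        mag = abs(s)
        # Clipping ratio: fraction of the sample that lies above the threshold.
        ratio = (mag - threshold) / mag if mag > threshold else 0.0
        excess.append(s * ratio)
    out = []
    for n in range(len(x)):
        acc = 0j
        for j, t in enumerate(taps):
            idx = n + half - j  # centred filter, equivalent to delaying the input by `half`
            if 0 <= idx < len(x):
                acc += t * excess[idx]
        out.append(x[n] - acc)
    return out


def fir_cfr(x, threshold, taps, stages=2):
    """Stages in series, as described in the text."""
    for _ in range(stages):
        x = fir_cfr_stage(x, threshold, taps)
    return x


# Illustrative input: a single peak of magnitude 1.5 and a placeholder 3-tap pulse filter.
x_in = [0j] * 20
x_in[10] = 1.5 + 0j
one_stage = fir_cfr_stage(x_in, threshold=1.0, taps=[0.25, 1.0, 0.25])
two_stage = fir_cfr(x_in, threshold=1.0, taps=[0.25, 1.0, 0.25])
```

There is no search loop or feedback in this structure, only filtering and subtraction, which is consistent with a design that pipelines easily.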
Figure 7: FIR based CFR
The CFR results are summarized below. The key metric is PAR, or peak to average ratio, which is a measurement of how large the peaks of the signal are compared to the average, and is measured in dB. Because this is a statistical measure, the peaks need to be defined in terms of the frequency of occurrence. The CCDF (complementary cumulative distribution function) provides this basis, and in this case, the PAR is measured at 0.01%, which means measuring the level of peaks which occur with an average duty cycle of 1 part in 10,000. The highest peaks occur with low frequency. The EVM is also measured, to ensure that the distortion added to the signal by CFR processing is insignificant compared to the normal degradation caused by wireless transmission.
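The PAR measurement at a given CCDF point can be sketched as follows. The function name `par_db_at_ccdf` is hypothetical, and the test input is complex Gaussian noise used only as a stand-in for a multi-carrier signal; it is not the signal measured in this article.

```python
import math
import random


def par_db_at_ccdf(samples, probability=1e-4):
    """Instantaneous power level, in dB above average, exceeded by `probability` of samples."""
    powers = sorted(abs(s) ** 2 for s in samples)
    mean_power = sum(powers) / len(powers)
    # Index of the (1 - probability) quantile of instantaneous power.
    idx = min(len(powers) - 1, int(math.ceil((1.0 - probability) * len(powers))) - 1)
    return 10.0 * math.log10(powers[idx] / mean_power)


# Assumption: Gaussian noise as an illustrative wideband signal.
rng = random.Random(0)
signal = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200_000)]
par_gaussian = par_db_at_ccdf(signal)
print(f"PAR at 0.01% CCDF: {par_gaussian:.2f} dB")
```

For ideal complex Gaussian noise the instantaneous power is exponentially distributed, so the 0.01% level sits near 10·log10(ln 10^4), about 9.6 dB; a finite-length estimate will scatter around that value.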
Figure 8: CFR Results Comparison
The results show about a 3 dB improvement in PAR (the input signal has about 11 dB PAR) as compared to without CFR processing; however, only the FIR based CFR method is able to close timing well in excess of 491 MHz, with the FIR based CFR showing similar EVM and lower logic usage but at the expense of higher DSP resource usage. FIR based CFR is implemented using only feed forward logic circuits, leading to a higher Fmax as well as folding capability. If the CFR is running at a sample rate slower than the system clock rate, designers can share a single module among several antennas to save resources. The FIR based CFR method is chosen for this radio head implementation.
Digital Predistortion (DPD) is an essential part of modern radio heads which improves the efficiency of RF power amplifiers (PA). It works in concert with CFR to improve transmit efficiency. CFR acts to reduce the dynamic range of the transmit signal. DPD applies a “reverse distortion” in advance of the PA to compensate for the distortion which will be introduced during the amplification process. As an amplifier operates closer to its maximum output level, it gradually ceases to operate in a linear fashion, introducing distortion into the amplified output signal.
Figure 9: DPD Basics
The amplifier distortion is both within the bandwidth of the signal, which will degrade EVM (not visible), and adjacent to the signal (visible), which will cause out of band RF emissions, in violation of regulatory requirements.
The forward DPD path is fairly straightforward and must run at 491 MSPS, matching the rate of the DAC driving the RF circuitry. It consists of a special type of FIR filter to provide compensation for the non-linear behavior of the PA. Lookup tables (LUTs) are used, with the addresses being generated by the amplitude of the input signal into the PA. This provides a complex coefficient for every power level of the PA. The DPD must also compensate for the “memory effect”, where the recent level of the PA inputs also affects the PA response. This is done by having different LUTs for each delay stage of the FIR filter, and a filter length sufficient to cover the memory effect duration.
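The LUT-per-delay-stage structure can be sketched as follows. This is a behavioral illustration under assumptions: the function name `dpd_forward`, the linear amplitude-to-address mapping, the 4-entry tables and their contents are all placeholders, not values from the reference design.

```python
def dpd_forward(x, luts, max_amplitude=1.0):
    """y[n] = sum over m of luts[m][addr(|x[n-m]|)] * x[n-m], one LUT per delay stage."""
    lut_size = len(luts[0])
    out = []
    for n in range(len(x)):
        acc = 0j
        for m, lut in enumerate(luts):
            if n - m < 0:
                continue  # delay line not yet filled
            sample = x[n - m]
            # The sample's amplitude selects the complex coefficient for this stage.
            addr = min(lut_size - 1, int(abs(sample) / max_amplitude * lut_size))
            acc += lut[addr] * sample
        out.append(acc)
    return out


# Placeholder tables: stage 0 boosts only the highest amplitude bin,
# stage 1 adds a small contribution from the previous sample (memory effect).
luts_demo = [
    [1.0, 1.0, 1.0, 2.0],
    [0.5, 0.5, 0.5, 0.5],
]
y_demo = dpd_forward([0.5 + 0j, 0.9 + 0j], luts_demo)
```

In a deployed system the table contents would be written at runtime by the adaptive algorithm in the feedback path.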
Figure 10: DPD Forward Path
The biggest DPD complexity is in the reverse or feedback path. The PA output is sampled, and a sophisticated adaptive algorithm is used to generate the LUT values for each delay stage of the forward path FIR filter. These values are individual to each amplifier, and vary with the temperature and aging of the PA. Fortunately, the DPD feedback path is not on the main signal path, so it has no specific throughput or Fmax requirement. Generally, the DPD adaptive algorithm is partitioned across FPGA logic and software running on the built-in ARM A9 CPUs contained in the FPGA, for fast adaptation. The ARM A9 CPUs are connected to the FPGA fabric by a wide bus to enable high throughput.
The approach here is to implement the DPD forward path processing using DSP Builder, and to provide an instrumented framework for the customer to add their own DPD adaptive algorithm to generate the LUT values at runtime. Due to dependencies on the PA and RF circuitry, it is difficult to build a turn-key DPD algorithm, and this is often an area of system level differentiation for wireless OEM equipment providers. An example of the level of instrumentation provided is shown below. The DSP Builder toolflow provides for tight integration with the MathWorks environment.
Figure 11: Instrumented DPD Operation
For those customers who desire a more complete DPD solution, Altera is developing a partner based solution on Arria 10 FPGAs, which can be ported to customer radio head RF hardware. The DPD performance for a 1.9 GHz LTE radio head is shown below. This configuration has two 10 MHz LTE channels at a 44.5 dBm PA output level with a PAR of 7.9 dB.
Figure 12: Twin 10 MHz channel LTE DPD results
DSP Builder will generate a ModelSim testbench using the same vectors generated by MathWorks' testbench, and run the hardware verification. This ensures the fidelity of the DSP Builder code generation process, so that the behavior observed in MathWorks' simulation is faithfully reproduced in hardware. In addition, special “system in the loop” capabilities allow for monitoring and instrumenting algorithms running in hardware in real time, providing data to be displayed in the MathWorks environment.
This can be extended to the entire radio head signal chain, with all functional block parameters mapped to a high level GUI in MathWorks' environment, as shown in Figure 13.
Figure 13: Instrumented, Parameterizable Radio Head Development
High Fmax optimizations
While we have shown that a high level design tool can achieve the needed Fmax performance for the individual parts of a radio head, closing timing on an entire design is a further challenge. This challenge exists for both hand-coded and high level tools. For this reason, designers need to be wary of Fmax claims for small designs, because significant Fmax degradation will occur once the whole design is integrated. The place and route effort becomes more difficult for larger designs, due to the greater number of signals to route and the longer distances. As FPGAs become ever larger, this issue becomes more problematic.
Both DSP Builder and Quartus II have provisions to help manage this issue. DSP Builder will read the specific timing parameters for the chosen FPGA and speed grade, and auto-pipeline to the required clock rate. To allow a further degree of control, an additional parameter is available, known as clock margin. A positive clock margin pushes DSP Builder to try to meet timing for a higher Fmax than specified, while a negative clock margin targets a lower Fmax (when high Fmax is not required, resources can be reduced by restricting pipelining efforts). This can be selectively applied to a portion of the design not closing timing, which directs DSP Builder to apply a greater or lesser degree of pipelining during the auto-generation of the synthesized Verilog/VHDL code. Explicit “ClockMargin” settings can be applied per module (DUC, CFR, DPD, etc.) to allow for variable pipeline effort per module, as needed. Latency will be affected by the degree of pipelining applied. However, latency is another parameter which can be fixed, or set to a maximum, within the top level design file or at the module level. The normal procedure is to leave the latency initially unconstrained, and let the tool determine the required latency to achieve the requested Fmax. At that point, the latency can be constrained or locked down, to allow the needed predictability to integrate the DSP Builder generated design with the larger FPGA project file.
A number of other DSP Builder optimizations for high Fmax designs are used, such as selecting the threshold setting to steer the tool towards use of ALM registers rather than MLABs for FIR filter input delay stages, or setting a threshold to force pipelining of large counters of many bits. Designers can also add “dummy” pipeline stages into the DSP Builder design between large modules, such as CFR and DPD. This allows the Quartus II place and route software greater latitude in placement, without compromising the overall design Fmax.
These techniques are sufficient to achieve very high Fmax on moderate size designs. However, larger designs may require leveraging the floor planning capabilities within Quartus II. Traditionally, floor planning has been the first step in the common physical design flow of IC design. It has been widely used for decades in ASIC design integration. While an essential step in IC design, floor planning has not been standardized in the FPGA compilation flow, especially if Fmax requirements can be met without resorting to it. However, this changes for large FPGA designs requiring timing closure in the 500 MHz range. Fortunately, floor planning a custom logic design on the topology of a modern FPGA device such as the 20nm Arria 10 can be more straightforward than in custom IC design, inasmuch as the clock tree and all the hard-logic components such as the IO transceivers, the PCI controllers, the DSP blocks, the RAM blocks, etc. are already floor planned and fixed, location-wise. That simplifies the physical design task, allowing the engineering design teams to largely focus on the characteristics and the logic requirements of the logic circuit.
When targeting modern wireless systems, this argument can be taken a step further. Due to the support of many antennas, the floor planning method can effectively leverage both the homogeneity among the parallel data paths supporting each antenna and the design’s granularity. This allows for modular partitioning once the layout is determined for a single RX/TX antenna path, since any further antenna paths are functionally and logically identical.
The goal is to implement 8 RX and 8 TX antenna systems, with two independent LTE channels for each antenna system. This will require eight identical CPRI to JESD204B antenna paths. The chosen FPGA, the Arria 10 10AX066 (660 kLE), has the CPRI and JESD204B transceivers on the left hand side of the chip. Therefore, each antenna’s Tx/Rx path “folds” across the horizontal dimension of the chip, forming a mirrored “C” shape, because all its input and output interfaces are placed on the left side. The horizontal limit of these 8 parallel data paths can balance the distance between each data path and the left vertical edge of the chip, where the transceivers lie. Further, the FPGA’s dual core ARM A9 processing subsystem is located in the middle of the FPGA, dividing the programmable logic area into upper and lower regions. Four RX/TX circuits will be located above the ARM processors, and four below. The fitter tool will have to solve 8 almost identical sub regions, one per antenna’s data path. The radio head floorplan consists of 8 rectangular, mutually exclusive “stripy” regions, stretching from the left side of the targeted chip to the right.
The Design Partition and the LogicLock tools in Quartus II realize these floor planning rules in an efficient and reusable manner. In the system’s hierarchy, each Tx/Rx data path is a single design partition assigned to a pre-allocated LogicLock region. The physical design starts with the creation of a new partition per Tx/Rx antenna path. Next is the design of LogicLock regions for JESD204B and CPRI and their placement along the left side of the chip. Meeting timing for these IO interfaces is much easier than for the full 491 MHz data path, because their resource utilization is small and they are not required to run at 491 MHz. Then, each partition is mapped and locked onto the LogicLock regions. As previously described, the individual antenna regions are mutually exclusive, horizontally oriented, rectangular areas. Each region contains multiple cascaded and attached sub regions, such as the DUC, CFR, DPD, DDC, and IO. The final preparation step involves the LogicLock assignment of the DSP Builder modules onto the selected sub regions, according to the data path partition selected.
The last phase of the integration flow is the incremental compilation of all design partitions. This process consists of: a) adding one design partition/LogicLock region at a time, b) compiling, c) readjusting its dimensions in order to meet timing, d) locking down its P&R netlist and, finally, e) moving to the next targeted design partition, until all 8 antenna data paths have been compiled. The incremental compilation starts by targeting the design partitions surrounding the HPS on the Arria 10 floorplan. Next the top and bottom strips are compiled, and last the middle upper and lower strips. This effectively targets the most geometrically constrained partitions first and adjusts their dimensions (e.g. the height of the LogicLock region) until 491 MHz timing is met.
Figure 14: Fmax achieved across 8T8R radio head
The resources used in each antenna path will not be identical, nor will the Fmax. Place and route is a statistical process, as each region has unique boundary conditions and differing amounts of resources available. The eight antenna paths therefore differ in their utilization of LABs (logic) and DSP blocks. The radio head design's utilization averaged 64% of the LABs and 64% of the DSP blocks available in each LogicLock region.
The plot of Figure 15 presents the achieved Fmax, ordered by the LAB utilization of each Tx/Rx design, as an attempt to correlate the timing performance of each HEPTA to its most critical type of logic utilization, the LABs (74% average utilization). The performance results were captured without seed sweeps or the use of the Design Space Explorer; they were instead simply obtained with the default Fitter settings (seed 1). Among the antenna partitions, the worst Fmax is 510 MHz, the best is 565 MHz, and the average among all instances is 535 MHz. The overall design achieves 491 MHz timing closure with significant margin, by using the high level design tool DSP Builder Advanced Blockset and the floor planning resources available in Quartus II design software.
Figure 15: Fmax achieved across 8T8R radio head
The Quartus II tool suite, as well as the optimization techniques in DSP Builder, are undergoing continuing improvement with each release, and so improved results will likely be available in future releases.
Looking ahead to 5G
In the near future, 5G wireless systems will be developed. Key capabilities of these systems are MIMO and antenna steering (Figure 16). This will increase network capacity by allowing the base station antenna beams to be steered on a per user basis, on the downlink or uplink or both. This will impose very large processing requirements which are quite unlike those of current wireless systems.
Figure 16: 5G MIMO based antenna beam steering
This type of capability will require solving large numbers of simultaneous equations on a per user basis. It will require using matrix multiplication and matrix inversion based algorithms, which require both high dynamic range and high processing rates, on the order of hundreds of GFLOPS per antenna. Due to the high dynamic range, single-precision floating point will be required, unlike the traditional fixed point processing used in today’s wireless systems.
CPU based solutions do not have enough processing power or the efficiency needed. GPUs have more processing power, but not the deterministic and low latency required, nor the serial-based interface protocol support. The preferred platform will be FPGAs, but a new breed of FPGA, which incorporates hardened floating point circuits across the entire device. Shown below are the resources for a matrix multiply for a single antenna with 128 elements, which needs to be performed for each user every millisecond. The processing latency of the FPGA is on the order of 30 µs, compared to several milliseconds using a 1.5 GHz CPU.
Arria 10 is the only available FPGA family natively supporting floating point processing, and the benefits are dramatic. Shown below are the resource and power comparisons of Arria 10 and the previous generation Stratix V FPGA family.
Figure 17: Benefits of Arria 10 Floating Point processing for 5G
For this type of application, designing in Verilog or VHDL is not recommended because it will yield less than desirable results. The metrics shown are achieved with the DSP Builder toolflow, which is designed to support floating point as well as the vector based processing necessary for linear algebra-based algorithms. Future article installments will detail more complex floating point designs.
To recap, designers of very high performance communication systems should reconsider high level design tools. The DSP Builder Advanced Blockset achieved the real-world results shown here for these very challenging applications. The design examples referenced were built and implemented on Arria 10 FPGAs. The use of these high level design tools will also be very applicable in achieving Fmax of 737 MHz, and even 983 MHz, with Stratix 10 and its HyperFlex architecture, which offers dense pipelining registers in all FPGA routing lines and is well-suited to the DSP Builder tool’s optimization methodology.