Designing an ARM-based Cloud RAN cellular/wireless base station

Cellular service providers are looking for cost-effective, scalable ways to manage their networks profitably. Cloud radio access network (Cloud RAN) technology is gaining traction with service providers as an efficient means of processing wireless network signals by virtualizing baseband processing onto large server farms and ultimately reducing costs.

This article describes a novel architecture for baseband processing in mobile wireless base stations that pairs ARM Cortex-A57 processors with our modem processing unit (MPU), a real-time, reconfigurable platform that supports a wide variety of communication standards and operates in a Cloud RAN as a tightly integrated co-processor to general-purpose computers. This approach reduces power consumption, increases overall network throughput, and decreases CAPEX and OPEX by offloading tasks from older base stations that are expensive to operate.

Radio access networks
Conventional base stations are the core of wireless network RANs. They include unified RF units and baseband processing units positioned at the base station site. From an operator’s point of view, this approach has significant limitations [1]. Each base station connects to a fixed number of sector antennas outfitted for peak voice and data demand in its coverage region. With this approach, it is nearly impossible to improve system capacity, since interference mitigation techniques are difficult to employ. Furthermore, the base stations are built on proprietary platforms, and as such, they are expensive to construct, maintain, and operate.

Cloud RAN is a compelling alternative approach. It is composed of three main parts: distributed radio units, comprising antennas and remote radio units (RRUs), located at the remote site; a baseband unit (BBU) pool comprised of high-performance general-purpose processors (GPPs) located in a data center; and a high-bandwidth, low-latency optical transport network that connects the RRUs to the BBU pool. The Cloud RAN approach not only reduces the construction costs (assuming fiber optical backhaul already exists) and operation costs of each base station facility, it also allows for dynamic reallocation of virtualized processing resources from one base station to another as utilization shifts throughout the day and week.

Figure 1: Cloud RAN architecture, fully centralized solution [1]

A major technical challenge of Cloud RAN is the BBU pool implementation. According to a recent study [1], a centralized BBU pool for a medium-sized dense urban network (25 km²) should support about 100 base stations (300 sectors), while each BBU should meet the high-throughput and low-latency requirements of a modern wireless communication standard, with the goal of executing all software layers on GPPs. In addition, the BBU pool should be highly power-efficient in order to show a real decrease in power consumption compared to conventional systems built from efficient base stations.

Successful attempts to develop full base stations on a pure GPP platform are reported in recent studies [1, 2, 4]. However, these studies show that in spite of using state-of-the-art platforms and innovative techniques, such platforms are not as optimized as dedicated SoC platforms when it comes to executing intensive physical layer tasks, such as Turbo decoding, FFT, and large-scale MIMO decoding.

This gap can be mitigated by offloading the intensive processing tasks of the physical layer from the GPP to an optimized co-processor, provided the co-processor is an open platform and provides multi-standard support, a programmable radio, and other characteristics required for Cloud RAN.

Physical layer background and requirements
The LTE physical layer is discussed here to illustrate the characteristics of wireless PHY components and the challenges of implementing them on general-purpose CPUs.

Figure 2: Typical processing functions of LTE UL and DL

A typical processing chain of the LTE physical uplink shared channel (PUSCH) and physical downlink shared channel (PDSCH) is shown in Figure 2. In the UL, complex samples coming from the RRU at a rate of up to 30.72 Ms/sec per Rx antenna are fed to 2048-point FFT blocks (one FFT block per Rx antenna) and then proceed through the UL processing chain, yielding a throughput of up to 100 Mbps/sector, assuming that 2 MU-MIMO layers are used. In the DL, bits at a rate of up to 150 Mbps (per carrier) are encoded through the DL chain, modulated, pre-coded, and fed to the IFFT module of each Tx antenna, producing up to 30.72 Ms/sec per antenna.

The baseband processor must also process several control channels that are mapped together with the DL and UL data channels. Studies show that the IFFT, the turbo encoding and the MIMO pre-coding blocks are the most demanding tasks in the DL processing chain, especially when the system includes 8 Tx antennas [2, 3]. In the UL processing chain, the turbo decoder is the most demanding block, followed by the FFT, channel estimation (CE), and the MIMO equalizer.

In addition to its high throughput requirements, LTE has a stringent delay budget, as depicted in Figure 3. The physical layer HARQ protocol places the highest demand on processing delay. In the downlink, the baseband processor must decode the HARQ feedback coming from the UE (UL ACK/NACK information); then, based on the decoding result, it must decide whether to schedule new data or retransmit the previous data; and finally it must encode and transmit the data on the optical interface in less than 3 ms in order to maintain continuous transmission.

These 3 ms include the two-way propagation and transport delay of the optical interface between the BBU and the RRH, which can take up to 400 us. In the uplink, the baseband processor must decode the PUSCH and encode the corresponding HARQ feedback in less than 3 ms in the worst case. Overall, each processing chain (DL or UL) must complete in less than 2.6 ms.

Figure 3: Processing time budget, TDD UL/DL configuration = 1

Modem processing unit – co-processor to a GP CPU
The MPU is a heterogeneous, multi-core signal processing platform designed for use as a co-processor to CPUs in a Cloud RAN BBU. Just as a graphics processing unit (GPU) accelerates the graphics operations in a PC or a workstation, an MPU accelerates complex physical layer tasks common to most communication systems. It supports a large range of system partitioning solutions, from a simple accelerator to full receive and transmit chains. The MPU is controlled through a standard API implemented in C called the modem programming language (MPL). The MPL interface de-couples the internal operation of the MPU from the L1 control layer, giving the designer a powerful, flexible tool to implement various algorithms and the ability to support various air interface technologies and standards.

MPU architecture
Figure 4 shows the MPU architecture. The MPU is connected to GP CPUs through a high-speed interface for transferring data and control (PCIe, for example). It is comprised of processing elements (PEs) that process key communication tasks such as FFT/DFT, turbo and Viterbi decoding, complex arithmetic operations (required for large-scale MIMO decoding), interleaving/de-interleaving, address and code generation, and more. Each PE contains a lightweight RISC processor called a standard sequencer (SSQ) that controls the PE’s execution. The PE’s SSQ is in charge of buffer allocation, configuration of parameters, handshakes with other PEs, and more.

Data is transferred between PEs through buffers located in the memory bank. Control messages are transferred between PEs through a dedicated control interface that can connect PEs to other PEs.

Figure 4: MPU architecture

System architecture
Figure 5 shows a system block diagram of a proposed BBU. Four quad-core Cortex-A57 processors are used with an MPU co-processor. The Cortex-A57 processors are connected to the MPU through an ARM interconnect interface and a high-throughput PCIe interface. Samples coming from the remote RRUs are fed directly to a CPRI interface located in the MPU chip. The number of CPRI streams is dictated by the number of sectors and the number of antennas per sector that the BBU supports. The CoMP interface is a high-throughput interface used to exchange frequency- and/or time-domain I/Q samples between cooperating sectors.

Figure 5: BBU architecture

The system performance and capabilities are highly dependent on many system design choices and parameters, including the MPU/CPU task partition scheme, the algorithms for tasks running on the CPU, the complexity of the L1 scheduler, the network duplex mode (TDD or FDD), and the number of Rx and Tx antennas. In order to show consistent assessments, this paper assumes that a sector is configured according to Table 1.

Table 1: Sector configuration

CPU/MPU tasks partition
An earlier study reviewed various partition alternatives, attempting to find the optimal partitioning between GP CPUs and the co-processor [5]. Offloading more tasks to the co-processor improves the performance and efficiency of the GP CPU, but it also diminishes flexibility and ease of programming. An additional critical consideration is how to prevent excessive data transfers between the GP CPU and the co-processor. The most efficient alternative, which balances these considerations, is to run the entire data path on the MPU, leaving the UL CE processing, the control channel decoding and encoding, and SRS processing to the GP CPU (Figures 6 and 7).

In this example, the GP CPU runs the control channel (PUCCH, PDCCH, PCFICH, and PHICH) encoding and decoding tasks. First, these channels require few CPU resources, and second, they are particular to LTE and cover multiple formats and options that make them more suitable for software implementation. The decision to run CE on the GP CPU is a less obvious one. On the one hand, CE tasks demand significant CPU effort and heavy data transfers between the cores. On the other hand, CE is a non-standardized algorithm that strongly impacts receiver performance, so locating it in the programmable domain gives developers maximum flexibility to implement their proprietary algorithms and full control over this fundamental block.

Figure 6: CPU/MPU task partition and PE assignment for LTE DL

Figure 7: CPU/MPU task partition and PE assignment for LTE UL

Task assignment to PEs
Figures 6 and 7 show the relationship between tasks and PEs. As can be observed from the figures, one PE may execute more than one LTE task. For example, in the DL the arithmetic PE runs the layer mapping and pre-coding tasks; in the UL it runs the equalizer and the PRACH down-conversion task, and optionally it can be used to offload part of the CE task from the CPU by utilizing its matrix-inversion capabilities. This is achievable by re-using the PE in time if possible, and/or duplicating the PE as needed.

CPU/MPU interface
Throughput analysis of the MPU/CPU interface per sector is provided in Table 2. It is assumed that the above task partition is applied and that the sector is configured in accordance with the parameters provided in Table 1.

Table 2: Required throughput of CPU/MPU interface

Core partition and system capabilities
Our assessment shows that, in addition to an MPU, up to two Cortex-A57 cores are required to run the physical layer of up to 8 LTE sectors, configured according to Table 1. Less than one core is required to control the MPU chip and to process the control channels, while the rest is required to execute the CE. To guarantee high computational resources and precise timing control, these cores should be dedicated exclusively to physical layer processing, as suggested in a recent study [2].

Once we have allocated cores for processing the physical layer tasks, we must also assess how many cores per sector are required to run the upper LTE layers to realize the full system capabilities. Based on the specs of several existing base station solutions that use Cortex-A15 cores, we expect a performance improvement of up to 50% from using Cortex-A57 cores. We assume that up to two Cortex-A57 cores are needed to run the L2/3 layers of one sector. We conclude that this system is capable of running up to 8 sectors.

Barak Ullman is Chief Architect at ASOCS Ltd. He has over sixteen years of experience in research and development of wireless technologies. Prior to joining ASOCS, Mr. Ullman led the design of the system architecture for an LTE receiver at Marvell. Prior to that, he was responsible for the development of a GSM/GPRS physical layer for a UE at Intel. Mr. Ullman holds a B.Sc. in electrical engineering from Ben-Gurion University.

References

1. China Mobile Research Institute, "C-RAN: The Road Towards Green RAN," White Paper, Version 2.5, Oct. 2011.

2. Tan, K., Zhang, J., Fang, J., Liu, H., Ye, Y., Shen, Y. Z., Voelker, G., "Sora: High Performance Software Radio Using General Purpose Multi-core Processors," 2009.

3. Jianwen Chen, Xiang Chen, Jing Liu, Ming Zhao, "Open Wireless System Cloud: An Architecture for Future Wireless Communications System."

4. Jianwen Chen, Qing Wang, Zhenbo Zhu, Yonghua Lin, "An Efficient Software Radio Framework for WiMAX Physical Layer on Cell Multicore Platform."

5. Tal Kaitz, Gaby Guri, "CPU-MPU Partitioning for C-RAN Applications."
