Femtocells and small cells are considered key to next-generation wireless operator networks: besides providing coverage, they offer capacity and cost advantages over large-cell deployments. Because they use existing IP backhaul infrastructure (e.g. DSL/FTTH) and Self-Organizing Network (SON) techniques, deployment is simple and low-cost.
However, in order to achieve the system price point associated with wide consumer deployment, system cost needs to be orders of magnitude below that of a traditional macro cell solution. Also, power supply restrictions leave little room for ‘over-design’ on either the hardware or software side.
This two-part article describes one way to achieve this goal. In this first part we describe an efficient low-cost femtocell design in which a Linux-based fast-path software architecture is implemented on base-station-on-a-chip hardware containing all the necessary digital processing, from the Ethernet interface all the way to the A/D converter, including the control plane, packet processing, and Layer-1 signal processing (Figure 1).
In Part 2, we will describe how to use the principles of software performance engineering http://embedded.com/design/real-time-and-performance/4395691/Software-performance-engineering-for-embedded-systems–Part-1—What-is-SPE- to integrate hardware and software elements and to evaluate whether or not the resulting implementation meets the design goals.
The base-station-on-a-chip hardware chosen is Freescale’s BSC913x family (Figure 2), which targets a variety of use cases, for example 100 Mbps downlink and 50 Mbps uplink operation with 16 active UEs. On the Layer-1 side, this performance is achieved by combining a StarCore SC3850 high-performance DSP core with MAPLE hardware acceleration for, among other things, the physical downlink shared channel (PDSCH) and the physical uplink processing element, which decodes the physical uplink shared channel (PUSCH) into information bits.
The remainder of the software stack (L2, L3, OAM, transport components) runs on the Power Architecture e500 core with associated hardware acceleration for IPSec, 3GPP ciphering, timing, etc.
Achieving the challenging system throughputs on a single-chip solution leaves little room for inefficiencies in the software architecture and implementation. Close cooperation between software and hardware development teams is therefore crucial during the architecture and implementation phases. The work presented here focuses on the challenges imposed on the software architecture for the general-purpose Power Architecture e500 processor, and on the solution reached through close cooperation between third-party software developers and Freescale.
Using Linux as a fast-path OS architecture
To meet portability, debuggability, and code re-use targets, Linux is an obvious choice of OS for the small-cell platform. However, as is well known in the industry, standard Linux cannot meet the 1 ms hard real-time deadlines required for LTE (long term evolution) applications. Two industry approaches exist to enhance Linux to achieve real-time deadlines:
- Thin-kernel real-time approaches (Figure 3), such as Real-Time Linux and the Xenomai development framework. Such approaches create an isolated real-time environment in parallel to Linux by trapping interrupts. This means that applications need to be ported to the thin kernel that is provided by the real-time portion of Linux. Besides this drawback, debuggability can be an issue (the standard user-space Linux toolset is not available).
- Approaches that enhance the Linux kernel itself to make it fully pre-emptible and real-time capable (Figure 4). The PREEMPT_RT patches by Ingo Molnar convert Linux into a fully pre-emptible kernel with predictable response time, without loss of debuggability and with limited performance impact.
We chose the PREEMPT_RT approach for the base station application. Using the cyclictest benchmark, latency was measured to be reliably below 50 µs (Figure 5).
Note that even though the PREEMPT_RT patches bring the latency well within the 1 ms boundary imposed by the LTE standard, the performance is still an order of magnitude worse than that of a true real-time OS.
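A rough budget calculation shows why this latency still shapes the design. The sketch below uses only the two figures quoted above (the 1 ms deadline and the 50 µs upper bound); the number of task switches on the critical path per subframe is a hypothetical parameter, not a measured value.

```python
TTI_US = 1000               # 1 ms LTE deadline (from the text)
WORST_CASE_LATENCY_US = 50  # measured upper bound (from the text)


def latency_budget(switches_per_tti):
    """Return (microseconds left for useful work, fraction of the deadline lost)."""
    lost_us = switches_per_tti * WORST_CASE_LATENCY_US
    if lost_us > TTI_US:
        raise ValueError("scheduling latency alone exceeds the deadline")
    return TTI_US - lost_us, lost_us / TTI_US


# Hypothetical numbers of task switches on the critical path per subframe.
for switches in (1, 2, 4, 8):
    left_us, lost = latency_budget(switches)
    print(f"{switches} switches: {left_us} us left, {lost:.0%} of the deadline lost")
```

Under these assumptions every additional switch on the critical path costs up to 5% of the subframe, which is one way to motivate keeping the number of threads small.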
Given the worst-case latency that must be assumed for task switching, the user-space data-path portion of the L2 application is separated into the minimum number of threads that still allows partitioning between hard and soft real-time. Scheduling of tasks within a thread is done by the application. As shown in Figure 6, this results in two main threads:
- Hard real-time – Scheduler, MAC, and RLC components, with deadline-driven execution times.
  - Optionally, the uplink MAC/RLC components can be executed in a separate thread that doesn’t have a strict deadline.
- Soft real-time – PDCP, GTP, UDP components, with throughput/performance requirements but no execution deadline.
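The thread split above can be modeled as follows. This is a minimal Python sketch of application-level, run-to-completion scheduling inside one thread, not the article's implementation: the class name `CooperativeThread`, the no-op task bodies, and the injectable `clock` parameter are all assumptions made for illustration. In a real system each object would be driven by its own OS thread with an appropriate real-time priority.

```python
import time


class CooperativeThread:
    """Runs a fixed list of tasks back to back inside one OS thread."""

    def __init__(self, name, tasks, clock=time.monotonic):
        self.name = name
        self.tasks = tasks    # list of (label, callable), in execution order
        self.clock = clock    # injectable so the deadline check can be tested

    def run_once(self, budget_s=None):
        """Run every task once; return (labels in run order, deadline met?)."""
        start = self.clock()
        order = []
        for label, task in self.tasks:
            task()            # run to completion, no kernel task switch
            order.append(label)
        elapsed = self.clock() - start
        return order, (budget_s is None or elapsed <= budget_s)


def noop():
    """Placeholder for real protocol processing."""


hard_rt = CooperativeThread("hard-rt", [("scheduler", noop), ("mac", noop), ("rlc", noop)])
soft_rt = CooperativeThread("soft-rt", [("pdcp", noop), ("gtp", noop), ("udp", noop)])

print(hard_rt.run_once(budget_s=0.001))  # deadline-driven: checked against 1 ms
print(soft_rt.run_once())                # throughput-driven: no deadline
```

Because the tasks inside a thread are plain function calls, the order of execution is fixed by the application and no kernel scheduling decision is involved between them.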
Note that system calls executed from the fast-path code effectively translate to ‘Linux tasks’ that are scheduled by the kernel. Performance requirements dictate that such task-switch overhead cannot be accepted. As a result, all drivers called from the user-space application (e.g. L2/L1 interface, PDCP ciphering) are implemented as user-space-only drivers using the UIO framework (http://lwn.net/Articles/232575/). Also, as part of the performance optimization effort, care is taken to remove all but the necessary system calls from the remaining application components.
Note that the overall goal of minimizing system calls is the main driver for performance-related design decisions throughout the software architecture process.
Memory management for fast paths
The goal of the implementation is to minimize the number of data copies in fast-path operation, specifically those copy operations that involve the CPU core. The fast-path portion of the stack is implemented using two memory domains (Figure 7):
The first domain is defined by Transport, UDP/GTP, and the portion of the PDCP stack that interfaces to the GTP stack (downlink: pre-security). This memory domain is characterized by:
- Buffer size defined by transport packet size
- Single-use buffers
- Requirement for fast/optimized implementation of physical to virtual and vice versa address translation as required for interfacing to accelerator hardware (security engine, Ethernet controller)
The transport block buffers are shared between user space (UDP, GTP, PDCP) and kernel space (Ethernet, IPSec). Zero-copy operation and buffer sharing are enabled by the use of user-space buffers for both domains and a PF_PACKET-style interface between kernel space and user space.
Because their size is defined by the typical maximum Ethernet packet size (~1500 B), transport packets are smaller than the standard Linux page size (4 KB) and as such can be allocated through a malloc() mechanism.
The second memory domain is defined by the MAC/RLC and the portion of the PDCP stack that interfaces to the RLC (downlink: post-security). This memory domain is characterized by:
- Buffer size defined by maximum transport block size (and/or other L2-defined maximum buffer size)
- Multi-use buffers as required for optimized implementations of HARQ and ARQ
- Requirement for fast/optimized implementation of physical to virtual and vice versa address translation as required for interfacing to accelerator hardware (security engine, DMA engine)
Because their size is defined by the maximum L2 packet size, L2 packet buffers are typically larger than 4 KB and as such cannot be allocated as a physically contiguous buffer using standard Linux mechanisms. Hence, typical L2 stack implementations include a proprietary memory allocation mechanism early in the kernel boot, providing mmap() access for physical and virtual addressing. In the Power Architecture e500 core, physical mapping of this memory window forces the use of a single large TLB1 entry. The size of the proprietary memory allocation is set statically. Linux support for such memory mapping is provided by hugetlbfs.
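One consequence of a single contiguous window is that physical and virtual addresses differ by a constant offset, so translation in either direction is a single addition. The Python model below illustrates this under that assumption; the class name `ContiguousPool`, the base addresses, and the 8 KB buffer size are hypothetical values chosen for illustration only.

```python
class ContiguousPool:
    """Fixed-size buffers carved from one physically contiguous window."""

    def __init__(self, virt_base, phys_base, buf_size, count):
        self.virt_base = virt_base
        self.phys_base = phys_base
        self.size = buf_size * count
        # Free list of virtual addresses, one per fixed-size buffer.
        self._free = [virt_base + i * buf_size for i in range(count)]

    def alloc(self):
        """Return the virtual address of a free buffer, or None if exhausted."""
        return self._free.pop() if self._free else None

    def release(self, virt):
        """Return a buffer to the pool."""
        self._free.append(virt)

    def virt_to_phys(self, virt):
        """Translate for handing a buffer to an accelerator."""
        offset = virt - self.virt_base
        if not 0 <= offset < self.size:
            raise ValueError("address outside the pool window")
        return self.phys_base + offset

    def phys_to_virt(self, phys):
        """Translate an address reported back by an accelerator."""
        offset = phys - self.phys_base
        if not 0 <= offset < self.size:
            raise ValueError("address outside the pool window")
        return self.virt_base + offset


# Hypothetical window: 4 buffers of 8 KB each.
pool = ContiguousPool(virt_base=0x7000_0000, phys_base=0x1000_0000,
                      buf_size=8192, count=4)
buf = pool.alloc()
print(hex(buf), "->", hex(pool.virt_to_phys(buf)))
```

No page-table walk or system call is needed for either direction, which is the property the fast path relies on when passing buffers to the security and DMA engines.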
Data copy between the two memory domains is handled by the security engine, which has an integrated DMA controller. The security engine allows for a NULL-ciphering operation, and as such can be used even for bearers that do not have cryptography requirements. Using the security engine for data copy, in combination with use of the DMA engine for Transport Block building over the L2/L1 interface, allows for a zero-copy fast-path stack.
An added advantage of the proposed split-memory-domain implementation, pre- and post-security, is that both the unencrypted and the encrypted packets are available to software. This enables implementation of handoff in the PDCP layer.
Linux fast-path transport stacks
Even though the Linux stack includes a fully functional implementation of transport stack components such as Ethernet, QoS, IPSec, and UDP, initial benchmarking of this stack showed significant performance bottlenecks. For small cells, the optimum tradeoff between performance and cost (die size, power) was found to be a ‘soft fast path’ stack (Figure 8) that offloads the Linux stack by implementing the fast path (i.e. Ethernet, IPSec, and UDP traffic that carries GTP-U payload) in performance-optimized software that makes extensive use of all available hardware acceleration features, such as IPSec hardware protocol acceleration and hardware QoS.
While implemented within the Linux kernel, the application-specific fast-path stack achieves zero-copy performance by utilizing user-space buffers (Figure 9) that are made available through an optimized, socket-like interface called ‘PMAL’.
Hardware classification is used to separate transport traffic from ‘other’ Linux traffic, which allows a different buffer pool to be used for each traffic type.
The BSC913x fast-path datapath accesses the SEC engine from both user space (PDCP) and kernel space (IPSec). Access to the SEC engine must therefore go through a mechanism that is available to both without any performance penalty from copies, address translation, or locks. In the SoC hardware, this goal is achieved by accessing the security hardware through separate hardware-implemented job rings that allow lock-free access, together with the memory management scheme described above.
Because separate job rings let both kernel-space and user-space applications use SEC services, the SEC job rings are partitioned between kernel-space (transport) and user-space (PDCP) drivers. Exclusivity is on a per-channel basis, with the kernel driver acting as master for all interrupts, initialization, error handling, etc. This is consistent with well-known UIO-based drivers (see http://lwn.net/Articles/232575/ and related articles/discussions).
From a software point of view, the SEC engine is handled as an asynchronous accelerator, with job posting and result polling optionally handled by different threads. Operating the SEC engine asynchronously improves system performance because the CPU does not wait for SEC processing to complete.
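The post-then-poll pattern can be illustrated with a small software model. This is only a sketch of the asynchronous usage pattern, not the SEC driver API: the class `JobRing`, its methods, and the `hardware_step()` stand-in for the accelerator are hypothetical, and the "processing" shown is a plain copy, mimicking a NULL-cipher job.

```python
from collections import deque


class JobRing:
    """Software model of one accelerator job ring with a fixed depth."""

    def __init__(self, depth):
        self.depth = depth
        self._submitted = deque()   # jobs waiting for the accelerator
        self._completed = deque()   # results waiting to be polled

    def post(self, job_id, payload):
        """Queue a job without waiting for it; False if the ring is full."""
        if len(self._submitted) + len(self._completed) >= self.depth:
            return False
        self._submitted.append((job_id, payload))
        return True

    def hardware_step(self):
        """Stand-in for the accelerator finishing one job (NULL cipher: copy)."""
        if self._submitted:
            job_id, payload = self._submitted.popleft()
            self._completed.append((job_id, bytes(payload)))

    def poll(self):
        """Return one finished (job_id, result), or None if nothing is ready."""
        return self._completed.popleft() if self._completed else None


ring = JobRing(depth=4)
for job_id, pdu in enumerate([b"pdu-a", b"pdu-b"]):
    ring.post(job_id, pdu)          # the posting side continues immediately

print(ring.poll())                  # nothing finished yet
ring.hardware_step()
print(ring.poll())                  # first result is now available
```

The posting side never blocks on a result, so it can keep preparing packets while earlier jobs are in flight; a separate polling thread could drain results in the same way.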
L2/L1 Interface for inter-process communication
Inter-process communication is required for a number of interfaces (Figure 10). This section focuses specifically on IPC between the L2 and L1 domains, providing a communication mechanism for:
- L2/L1 interfacing as required for communication between MAC and PHY layers, providing driver level support for FAPI based communication. Given that FAPI message exchange is high-frequency and time-critical, the interface implementation is to be tuned to optimum performance for FAPI.
- Status/command exchange between the StarCore (SC) and Power Architecture (PA) sides, and L1/L2 framework synchronization as required for a fully synchronized system startup procedure.
Logically, the L2/L1 interface is based on the FAPI specification (http://www.smallcellforum.org/resources-technical-papers). The physical implementation is based on shared memory between the StarCore and Power Architecture domains.
A memory-mapped interface permits use of a DMA engine to offload data transfer from the CPU. The L2/L1 interface is designed as a push-based interface: the L2 pushes data to the L1 in the downlink, and the L1 pushes data to the L2 in the uplink.
The implementation of the L2/L1 interface defines low-level ‘channels’ (buffer rings) that each have a single producer and a single consumer, allowing for a lock-free implementation. Typically, a channel is associated with a single FAPI message type. The buffer rings themselves are allocated in the consumer-side memory controller for optimum performance, because read latency is more critical than write latency (which is hidden by DMA and/or hardware write buffering).
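The reason a single-producer, single-consumer ring needs no lock is that each index has exactly one writer. The Python model below shows that index discipline; it is an illustrative sketch rather than the actual channel implementation. The class name `SpscRing` and the message labels are hypothetical, and a real shared-memory version would additionally need memory barriers between writing a slot and publishing the index.

```python
class SpscRing:
    """Single-producer/single-consumer ring; each index has one writer."""

    def __init__(self, capacity):
        # One slot stays empty so that "full" and "empty" are distinguishable.
        self._slots = [None] * (capacity + 1)
        self._head = 0   # next slot to read, written only by the consumer
        self._tail = 0   # next slot to write, written only by the producer

    def push(self, msg):
        """Producer side: returns False instead of blocking when full."""
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False
        self._slots[self._tail] = msg
        self._tail = nxt   # publish only after the slot has been written
        return True

    def pop(self):
        """Consumer side: returns None when the ring is empty."""
        if self._head == self._tail:
            return None
        msg = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        return msg


# Hypothetical downlink channel: L2 produces, L1 consumes.
channel = SpscRing(capacity=2)
channel.push("msg-0")
channel.push("msg-1")
print(channel.push("msg-2"))   # ring is full, producer is told immediately
print(channel.pop())           # consumer receives messages in order
```

Dedicating one ring to one message type, as described above, keeps every ring in this single-writer-per-index form.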
Optionally, a channel can raise an indication, implemented as an interrupt to the consumer side; the list of supported interrupts can be defined at application compile time. The same type of indication is shared between all messages of a given channel. This type is defined when the channel is opened and cannot be modified at run time. Hardware features are used to remove the software cost associated with generating the indication.
An important aspect of this design was the use of software performance engineering techniques to achieve two goals: to provide an efficient mechanism by which to integrate hardware and software development and implementation, and to provide a means by which the architecture can be evaluated and benchmarked to make sure it meets the final design goals.
Wim Rouwet is a senior systems architect in the Digital Networking group of Freescale Semiconductor. He has a background in network processing, wireless and networking protocol development, and systems and architecture, focusing on wireless systems. His experience includes hardware and software architecture, algorithm development, performance analysis/optimization, and product development. Wim holds a master's degree in Electrical Engineering/Telecommunications from Eindhoven University of Technology.