Building a Linux-based femtocell base station using software performance engineering: Part 2

Historically, embedded systems have been built from separate specifications for hardware and software. Hardware partitioning and the definition of offload components are done from mostly inaccurate inputs, on a timeline different from that of the software design. Hardware specifications need to be available prior to the silicon design, whereas software work starts only after device (or simulator) availability. Such timelines prohibit extensive re-design of hardware based on software findings, often leading to sub-optimal overall system design and time-to-market.

In the case of a tightly constrained system (due to cost and power consumption) such as the Linux fast-path femtocell design described in Part 1 of this series, concurrent hardware/software design is clearly preferred. It is made possible by the well-defined nature of the end system: a WCDMA/LTE base station with known operational properties. A tailored Software Performance Engineering (SPE) process was used for the performance evaluation activity. In this process, performance calculators are used to model the important performance use cases for the application.

Modeling is a significant aspect of SPE. Performance modeling best practices include using performance scenarios to evaluate software architecture and design alternatives before beginning the coding and implementation phase. SPE starts with the development and analysis of the simplest model that can identify problems with the system architecture, design, or implementation plans. Detail is then added as more of the software becomes apparent.
In the case of the design described in Part 1, the first step is to break the use case down into its key components for LTE (Figure 11), based on the protocol stacks being implemented (integrated eNB):

Figure 11: LTE components and associated protocol stacks

In this example, analysis of the protocol stack defined the following key performance-impacting components:

  • RLC/MAC
  • Scheduler
  • PDCP/GTP
  • Transport (UDP, IP, IPSec, QoS, Ethernet)
  • RRC/Control plane
  • Miscellaneous (e.g. OS overhead)

Pre-silicon modeling targets two goals:

  • Definition of SoC architecture with regard to hardware sizing, for example core, cache, bus frequencies, DDR controller configuration, etc.
  • Definition of per-component cycle counts that allow for budgeting of resources to software blocks

SoC level modeling
SoC level modeling focuses on architecture exploration of the L1 and L2 domains of the SoC. From an L2/transport perspective, the modeling covers the e500v2 Core Complex, front-side L2 cache, coherency manager, and DDR queue and controller: specifically, how DDR traffic from other initiators affects, and is affected by, DDR traffic from the e500v2 Core Complex.

The other DDR traffic initiators modeled were the DSP core, Maple B2F L1 accelerator complex, antenna interface (i.e. Layer 1 processing modules), veTSEC Ethernet controller, and SEC engine. Modeling was performed using a Freescale proprietary SystemC-based simulation environment.
DDR traffic for the simulations was generated using statistical inputs to generic configurable modules from the modeling environment. Statistical inputs were provided by packet flow analysis based on the use case (for example, the amount of traffic to/from the security engine can be derived from over-the-air traffic rates), as well as by inputs derived from prototyped software (L1/L2 cache hit rates). The e500v2 Core Complex was modeled in more detail: its L1 instruction and data caches were modeled explicitly, and a statistical cache miss rate was used in the front-side L2 cache model to determine cache hits/misses.

Bus transactions generated by the modeled initiators do not carry memory-map addresses. Instead, each transaction carries the targeted DDR bank index (0 to 7 for the 8-bank DDR) and a DDR page-miss value (0 or 1) indicating whether the transaction is a page miss or a page hit, which applies as long as no other initiator has accessed that DDR bank since the initiator itself last accessed it.

If another initiator has accessed the DDR bank in the meantime, the transaction is always assumed to be a DDR page miss. The page-miss value in each transaction is set according to an intrinsic DDR page-miss rate input for each initiator; the intrinsic rate is the page-miss rate an initiator would be expected to achieve if it were the only active initiator generating traffic to DDR.
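
To make this mechanism concrete, a per-initiator statistical transaction generator along these lines could look roughly as shown below. This is a simplified C sketch, not the proprietary SystemC environment; the structure names, fields, and use of rand() are assumptions for illustration only.

    /* Simplified sketch of the per-initiator statistical DDR traffic model.
     * Illustrative only; all names and values are assumptions. */
    #include <stdint.h>
    #include <stdlib.h>

    #define NUM_DDR_BANKS 8

    struct ddr_txn {
        uint8_t bank;               /* targeted DDR bank index, 0..7      */
        uint8_t page_miss;          /* 1 = DDR page miss, 0 = page hit    */
    };

    struct initiator {
        const char *name;           /* e.g. "e500v2", "SEC", "veTSEC"     */
        double intrinsic_miss_rate; /* page-miss rate if running alone    */
    };

    /* Tracks which initiator last accessed each DDR bank. */
    static const struct initiator *last_owner[NUM_DDR_BANKS];

    static struct ddr_txn gen_txn(const struct initiator *init)
    {
        struct ddr_txn t;

        t.bank = (uint8_t)(rand() % NUM_DDR_BANKS);

        if (last_owner[t.bank] != NULL && last_owner[t.bank] != init) {
            /* Another initiator touched this bank since our last access:
             * always counted as a DDR page miss. */
            t.page_miss = 1;
        } else {
            /* Otherwise, hit/miss follows the intrinsic page-miss rate. */
            t.page_miss = ((double)rand() / RAND_MAX) < init->intrinsic_miss_rate;
        }
        last_owner[t.bank] = init;
        return t;
    }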

SoC level modeling ensures that the hardware does not have any inherent bottlenecks that limit the performance of the system: it guarantees that the device is physically capable of processing the use case and is well-balanced. SoC level modeling focuses on the following three key metrics:

  • Target 0.6-0.9 achieved Instructions Per Cycle on an e500 core
  • Total bus utilization/loading at or below 80% (leaving sufficient margin for unseen additional loading and to guard against the approximate accuracy of the model)
  • DDR memory utilization/loading at or below 50% (DDR latency typically increases significantly when DDR loading goes above this level)

Providing accurate statistics
SoC level modeling requires accurate statistical characteristics of the software components, so prototype software is implemented for each key component in the femtocell base station (RLC/MAC, Scheduler, PDCP/GTP, Transport, RRC/Control plane). Besides providing reliable statistical inputs to the SoC modeling environment, this prototype software can also be used to establish a baseline CPU utilization budget for each software component in the system, which is used throughout the product development cycle (see Part 1).

Definition of hardware acceleration is also done using the prototype software; cycle counts for each component in the system identify a prioritized list of targets for offload to hardware. Here, the hardware vs. software tradeoff is based on die size and power (mm2 and Watt cost): a function is implemented either at a fixed hardware cost or at a shared software cost.

The prototype software is executed on existing hardware platforms with the same or similar characteristics as the final product. In this example, a P2020 platform was chosen for prototyping, with configuration options such as the number and frequency of cores and the cache sizes varied to establish cycle costs for different SoC options.

Evaluating performance targets
As shown in Figure 12, performance targets (cycle counts per component) are used to create a performance report, which is used to document a software statement of work (SoW), which in turn drives internal and external software development. The SoW serves as a requirements document for all parties contributing to the system development process. Software implementation is focused on meeting the performance targets from initial architecture design through the implementation phase, with performance analysis conducted formally at each major phase of the project to ensure that goals are being met.

Figure 12: Determining performance targets

Evaluating the femtocell’s performance
Performance measurement is an important aspect of SPE. This includes planning measurement experiments to ensure that results are both representative and reproducible. Software also needs to be instrumented to facilitate SPE data collection. Finally, once the performance-critical components of the software are identified, they are measured early and often to validate the models that have been built and also to verify earlier predictions.

How this process was applied to the femtocell base station design described in Part 1, and the outcome of that evaluation, is shown in Figure 13.

Figure 13: Performance tracking through product life cycle

Note that for the purpose of such benchmarks, it is useful to identify a few key use cases, centering around bitrate, packet rate, user count, etc.

Performance metrics are divided into three categories, each of them tackling separate areas of the performance picture:

  • Cycle counts per software component
  • CPU loading summaries
  • SoC performance counters

Software component cycle counting
Per-component cycle count tracking is implemented by reading the CPU timebase (‘stopwatch’) before and after execution of each software module, and logging the cycle count readings in a logging database. This database is extracted to a host for offline analysis.
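
A minimal sketch of this kind of instrumentation is shown below, assuming a 32-bit Power Architecture core such as the e500v2. The module identifiers, log-buffer size, and helper names are illustrative and not taken from the actual product code.

    #include <stdint.h>

    /* Read the 64-bit time base on a 32-bit Power Architecture core.
     * The upper half is re-read to guard against rollover between reads. */
    static inline uint64_t tb_read(void)
    {
        uint32_t hi0, lo, hi1;
        do {
            __asm__ volatile("mftbu %0" : "=r"(hi0));
            __asm__ volatile("mftb  %0" : "=r"(lo));
            __asm__ volatile("mftbu %0" : "=r"(hi1));
        } while (hi0 != hi1);
        return ((uint64_t)hi0 << 32) | lo;
    }

    /* Simple in-memory 'logging database': one entry per measured module run,
     * extracted to the host later for offline analysis. */
    #define LOG_ENTRIES 4096        /* assumption: sized for one measurement burst */

    struct cycle_log_entry {
        uint16_t module_id;         /* e.g. RLC/MAC, Scheduler, PDCP/GTP   */
        uint64_t ticks;             /* time-base delta for this invocation */
    };

    static struct cycle_log_entry cycle_log[LOG_ENTRIES];
    static unsigned log_idx;

    static inline void log_ticks(uint16_t module_id, uint64_t start, uint64_t end)
    {
        if (log_idx < LOG_ENTRIES) {
            cycle_log[log_idx].module_id = module_id;
            cycle_log[log_idx].ticks     = end - start;
            log_idx++;
        }
    }

    /* Usage around a software module (hypothetical component entry point):
     *
     *     uint64_t t0 = tb_read();
     *     pdcp_process(pkt);
     *     log_ticks(MODULE_PDCP_GTP, t0, tb_read());
     */

Note that on such devices the time base typically ticks at a fixed fraction of the platform clock rather than at the core frequency, so the raw deltas are scaled to core cycles before being compared against the per-component budgets.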

Cycle counting identifies which software components are causing excessive CPU loading. Cycle counting is the baseline for focusing optimization effort and a useful tool for tracking actual optimization numbers against target numbers. Iteratively, the results of cycle counting are fed back into the modeling baseline for the next generation, increasing accuracy of, and trust in, the hardware/software co-design modeling approach.

Although per-component cycle counting provides good insight into which components need to be optimized, it is not always the best tool to identify where the cycles are consumed. For this purpose, a CPU loading profile at a very fine granularity (i.e. per symbol name) is more useful. An advantage of using Linux as a development environment is that such a profile can be generated with open-source tools such as OProfile.

Neither software cycle counting nor OProfile provides insight into the SoC metrics established earlier:

  • Target Instructions Per Cycle (IPC), including associated L1/L2 ICache and DCache hit/miss ratios
  • Bus utilization
  • DDR memory utilization

Such metrics are taken from SoC hardware counters on a running system. These counters are implemented in each IP block in the device, with the specific purpose of being able to track system-level performance.

SoC level performance counters are mainly useful to verify that system-level parameters such as bus/DDR loading have enough headroom. They also help identify software architectural issues such as low cache hit rates caused by poor code/data locality.

Candidates for performance optimization
Lessons learned during this project included the following checklist of items we determined are candidates for performance optimization:

Systems and software architecture. Contrary to the ‘implement first, optimize later’ mantra, we feel that it is imperative to architect the solution from the ground up in an optimized way. This means a memcpy()-free fast path, and the other topics discussed earlier in this document. Development teams that are under pressure to add features often do not have time to change an existing implementation, so the up-front architecture work is a worthwhile investment.
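
As a simple illustration of what a memcpy()-free fast path means in practice, pipeline stages can hand off buffer descriptors and adjust offsets rather than copying payload bytes. The sketch below is hypothetical and not taken from the product code.

    #include <stdint.h>

    /* A packet is described by a descriptor; the payload stays in one
     * DMA-able buffer for its entire life through the fast path. */
    struct pkt_desc {
        uint8_t  *buf;      /* start of the underlying buffer      */
        uint32_t  offset;   /* start of the currently valid data   */
        uint32_t  len;      /* length of the currently valid data  */
    };

    /* "Strip" a header (e.g. GTP) by moving the offset, not the data. */
    static inline void pkt_pull(struct pkt_desc *d, uint32_t hdr_len)
    {
        d->offset += hdr_len;
        d->len    -= hdr_len;
    }

    /* "Prepend" a header by writing into headroom reserved in the buffer. */
    static inline uint8_t *pkt_push(struct pkt_desc *d, uint32_t hdr_len)
    {
        d->offset -= hdr_len;
        d->len    += hdr_len;
        return d->buf + d->offset;
    }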

Compiler and flags. Even though GCC is the most commonly used compiler (mainly for reasons of familiarity and availability), use of a proprietary compiler can be an easy way to get a double-digit percentage performance bump with minimum effort. Also, within the GCC family, we suggest using the latest compiler version. For example, benchmarking on Power Architecture silicon showed around 5% performance improvement moving from GCC 3.4.x to GCC 4.4.x, and another 5% moving to GCC 4.7.2, which was the latest available release at the time of writing.

Optimized libraries for key functions. The practice of pushing for optimized basic libraries such as memset/memcpy is well established in the embedded industry. Other libraries, such as math, are used sporadically in a software stack, but still often enough to have a significant performance impact. Implementing such libraries in an optimized manner can offer significant performance gains.

Pre-fetching. Both instruction and data prefetching can be useful. The base station application discussed in this document uses multiple processes/threads on a single core. Given the large footprint of the codebase and the user contexts, this means that the achieved IPC is relatively low. Performance improvement thus comes from pre-fetching both code and data.
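
A minimal data-prefetch sketch is shown below. It uses GCC's __builtin_prefetch(), which typically maps to the dcbt instruction on Power Architecture cores; the packet structure and processing function are hypothetical.

    #include <stdint.h>

    struct pkt { uint8_t *data; uint32_t len; };

    void process_packet(struct pkt *p);   /* hypothetical per-packet work */

    /* Touch the next packet's descriptor and first data cache line while
     * the current packet is being processed. */
    static void process_burst(struct pkt *pkts[], unsigned count)
    {
        for (unsigned i = 0; i < count; i++) {
            if (i + 1 < count) {
                __builtin_prefetch(pkts[i + 1]);        /* descriptor  */
                __builtin_prefetch(pkts[i + 1]->data);  /* packet data */
            }
            process_packet(pkts[i]);
        }
    }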

Cache locking. Like pre-fetching, cache locking can provide an improved IPC.

Large memory pages. These limit the software load associated with MMU misses by allocating packet buffers from a single large MMU entry rather than from standard Linux 4KB pages.
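
One common way to do this from user space is to back the packet-buffer pool with huge pages, for example via mmap() with MAP_HUGETLB. The sketch below assumes a reasonably recent kernel with huge pages reserved (e.g. through /proc/sys/vm/nr_hugepages); the pool size is chosen purely for illustration.

    #include <stdio.h>
    #include <sys/mman.h>

    #define POOL_SIZE (16UL * 1024 * 1024)   /* assumption: 16 MB buffer pool */

    /* Allocate the packet-buffer pool from huge pages so the whole pool is
     * covered by a handful of TLB/MMU entries instead of thousands of 4KB
     * pages. */
    static void *alloc_buffer_pool(void)
    {
        void *pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (pool == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return NULL;
        }
        return pool;
    }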

Interrupt coalescing/NAPI. The overhead associated with interrupts and the associated task switches is significant. Most networking stacks provide options to tune the tradeoff between response time (latency) and throughput.
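
On Linux, coalescing parameters can also be adjusted programmatically through the ETHTOOL_SCOALESCE ioctl, the same mechanism the ethtool utility uses. The sketch below is a hedged example; the interface name and threshold values are chosen purely for illustration.

    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>

    /* Raise the RX coalescing thresholds on "eth0" to trade a little latency
     * for fewer interrupts. Values are illustrative, not recommendations. */
    static int set_rx_coalescing(void)
    {
        struct ethtool_coalesce ec;
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0)
            return -1;

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        ifr.ifr_data = (char *)&ec;

        memset(&ec, 0, sizeof(ec));
        ec.cmd = ETHTOOL_GCOALESCE;              /* read current settings */
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
            goto fail;

        ec.cmd = ETHTOOL_SCOALESCE;
        ec.rx_coalesce_usecs = 100;              /* wait up to 100 us ...    */
        ec.rx_max_coalesced_frames = 32;         /* ... or 32 frames per IRQ */
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
            goto fail;

        close(fd);
        return 0;

    fail:
        close(fd);
        return -1;
    }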

Optimized hardware configuration. The default device configuration as provided by the Board Support Package is typically not optimized for the final hardware. For example, DDR timing settings may be less strict than the actual hardware supports. A review of such settings between the hardware and software teams can provide an overall performance gain.

Limit context switching. Remove unnecessary context switches. Examples include checking a message queue periodically from a separate process when the check could instead be triggered on message send, or an excessive process count where tasks could better be managed from within a single software process.

Removal of system calls. System calls should normally be removed from the real-time path by replacing the required functionality with optimized user-space functions. Timers are a good example: user-space timer calls can efficiently be replaced with single-instruction time-base read functionality.

Conclusion
Achieving projected SoC goals is only possible when the system is jointly architected from a hardware and software perspective. The solution shows that Linux can be used in a high-performance, hard real-time system, given that enough consideration is given to removing interaction with the Linux kernel from the fast path. This limits the impact of the OS to a negligible amount, but requires development effort to remove kernel activity and system calls from the fast-path portion of the processing chain.

Hardware/software co-design is key to tracking and achieving performance targets. Monitoring of performance metrics throughout both the hardware and software development phases allows for an optimal system without significant re-design during the latter half of the implementation phase.

Part 1: A Linux-based single base station SoC design

Wim Rouwet is a senior systems architect in the Digital Networking group of Freescale Semiconductor. He has a background in network processing, wireless and networking protocol development, and systems and architecture, focusing on wireless systems. His experience includes hardware and software architecture, algorithm development, performance analysis/optimization, and product development. Wim holds a master's degree in Electrical Engineering/Telecommunications from Eindhoven University of Technology.
