Building a Linux-based femtocell base station using software performance engineering: Part 2
Historically, embedded systems implement separate specifications for hardware and software. Hardware partitioning and definition of offload components is done based on mostly inaccurate inputs, on a timeline that is different from that associated with the software design. Hardware specifications need to be available a priori to the silicon design, whereas software work is only started after device (or simulator) availability. Such timelines prohibit extensive re-design of hardware based on software findings, often leading to sub-optimal overall system design and time-to-market.
In a tightly constrained system (constrained by cost and power consumption) such as the Linux fast-path femtocell design described in Part 1 of this series, concurrent hardware/software design is preferred. It is made possible here by the well-defined nature of the end system: a WCDMA/LTE base station whose operation has known properties. A tailored Software Performance Engineering (SPE) process was used for the performance evaluation activity. In this process, performance calculators are used to model the important performance use cases for the application.
Modeling is a significant aspect of SPE. One performance modeling best practice is to use performance scenarios to evaluate software architecture and design alternatives before beginning the software coding and implementation phase. SPE starts with the development and analysis of the simplest model that identifies problems with the system architecture, design, or implementation plans. Detail is added to the model as more of the software becomes known.
In the case of the design described in Part 1, as a first step the use case is broken down by its key components for the case of LTE (Figure 11), based on the associated protocol stacks that are being implemented (integrated eNB):
In this example, analysis of the protocol stack identified the following key performance-impacting components:
- Transport (UDP, IP, IPSec, QoS, Ethernet)
- RRC/Control plane
- Miscellaneous (e.g. OS overhead)
Pre-silicon modeling targets two goals:
- Definition of SoC architecture with regard to hardware sizing, for example core, cache, bus frequencies, DDR controller configuration, etc.
- Definition of per-component cycle counts that allow for budgeting of resources to software blocks
SoC level modeling
SoC level modeling focuses on architecture exploration modeling of the L1 and L2 domains of the SoC. From an L2/transport perspective, the modeling focuses on the e500v2 Core Complex, the front-side L2 cache, the coherency manager, and the DDR queue and controller. Specifically, it examines how DDR traffic from other initiators affects, and is affected by, DDR traffic from the e500v2 Core Complex.
The other DDR traffic initiators modeled were the DSP core, Maple B2F L1 accelerator complex, antenna interface (i.e. Layer 1 processing modules), veTSEC Ethernet controller, and SEC engine. Modeling was performed using a Freescale proprietary SystemC-based simulation environment.
DDR traffic for the simulations was generated by feeding statistical inputs to generic configurable modules from the modeling environment. The statistical inputs came from packet flow analysis based on the use case (for example, the amount of traffic to and from the security engine can be derived from over-the-air traffic rates) and from prototyped software (L1/L2 cache hit rates). The e500v2 Core Complex was modeled in more detail, in that its L1 instruction and data caches were modeled. A statistical cache miss rate was also used in the front-side L2 cache model to determine cache hits and misses.
Bus transactions generated by the modeled initiators do not contain memory-map addresses. Instead, each transaction carries the targeted DDR bank index (0 to 7 for the 8-bank DDR) and a DDR page miss value (0 or 1). The page miss value indicates whether the transaction is a DDR page miss or hit, and it applies only if no other initiator has accessed the DDR bank since the initiator itself last accessed it.
If another initiator has accessed the DDR bank in the meantime, the transaction is always assumed to be a DDR page miss. The DDR page miss value in each transaction is set according to an intrinsic DDR page miss rate supplied as an input for each initiator. The intrinsic rate is the DDR page miss rate an initiator would be expected to achieve if it were the only active initiator generating traffic to DDR.
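The page-miss rule described above can be sketched in a few lines of Python. This is only an illustration, not the Freescale simulation environment: the `Initiator` class, the `weight` field, the uniform spread of transactions across banks, and the treatment of the first access to a bank (the intrinsic flag is used) are all assumptions made for the example.

```python
import random
from dataclasses import dataclass

NUM_BANKS = 8  # the article describes an 8-bank DDR


@dataclass
class Initiator:
    name: str
    intrinsic_miss_rate: float  # page miss rate if this were the only initiator
    weight: float = 1.0         # assumed relative share of DDR transactions


def simulate_page_misses(initiators, n_transactions, seed=0):
    """Return the effective DDR page miss rate seen by each initiator."""
    rng = random.Random(seed)
    last_accessor = [None] * NUM_BANKS  # last initiator to touch each bank
    issued = {i.name: 0 for i in initiators}
    missed = {i.name: 0 for i in initiators}
    weights = [i.weight for i in initiators]

    for _ in range(n_transactions):
        ini = rng.choices(initiators, weights=weights)[0]
        bank = rng.randrange(NUM_BANKS)  # assumed uniform bank spread
        # Page miss value (0 or 1) drawn from the intrinsic miss rate.
        flag = int(rng.random() < ini.intrinsic_miss_rate)

        # Interference rule: forced miss if another initiator used the bank last.
        other_used_bank = last_accessor[bank] not in (None, ini.name)
        miss = 1 if other_used_bank else flag

        issued[ini.name] += 1
        missed[ini.name] += miss
        last_accessor[bank] = ini.name

    return {n: (missed[n] / issued[n] if issued[n] else 0.0) for n in issued}
```

Running the function once with a single initiator and once with several shows how far the effective page miss rate rises above the intrinsic rate once other initiators share the banks, which is the interaction the SoC model is meant to expose.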
SoC level modeling ensures that the hardware does not have any inherent bottlenecks that limit the performance of the system: it guarantees that the device is physically capable of processing the use case and is well-balanced. SoC level modeling focuses on the following three key metrics:
- Achieved instructions per cycle (IPC) of 0.6-0.9 on an e500 core
- Total bus utilization/loading at or below 80% (leaving sufficient margin for unseen additional loading and to guard against the approximate accuracy of the model)
- DDR memory utilization/loading at or below 50% (DDR latency typically increases significantly when DDR loading goes above this level)
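As a rough sketch, the three metrics can be expressed as an automated check on the output of a model run. The `SocModelResult` structure and function name are hypothetical. The example also assumes that only an IPC below the lower bound of 0.6 counts as a finding, since the text gives 0.6-0.9 as a target range rather than a hard limit.

```python
from dataclasses import dataclass

IPC_TARGET_MIN = 0.6   # lower end of the 0.6-0.9 IPC target range
BUS_UTIL_MAX = 0.80    # total bus utilization at or below 80%
DDR_UTIL_MAX = 0.50    # DDR utilization at or below 50%


@dataclass
class SocModelResult:
    ipc: float              # achieved instructions per cycle on the core
    bus_utilization: float  # fraction of bus capacity in use, 0.0 to 1.0
    ddr_utilization: float  # fraction of DDR capacity in use, 0.0 to 1.0


def check_soc_metrics(result):
    """Return a list of findings; an empty list means all targets are met."""
    findings = []
    if result.ipc < IPC_TARGET_MIN:
        findings.append(f"IPC {result.ipc:.2f} is below {IPC_TARGET_MIN}")
    if result.bus_utilization > BUS_UTIL_MAX:
        findings.append(
            f"bus utilization {result.bus_utilization:.0%} exceeds {BUS_UTIL_MAX:.0%}"
        )
    if result.ddr_utilization > DDR_UTIL_MAX:
        findings.append(
            f"DDR utilization {result.ddr_utilization:.0%} exceeds {DDR_UTIL_MAX:.0%}"
        )
    return findings
```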
Providing accurate statistics
Because SoC level modeling needs accurate statistical characteristics of the software components, prototype software is implemented for each key component in the femtocell base station (RLC/MAC, Scheduler, PDCP/GTP, Transport, RRC/Control plane). Besides providing reliable statistical inputs to the SoC modeling environment, this prototype software can also be used to establish a baseline CPU utilization/budget for each software component in the system, which is then used throughout the product development cycle (see Part 1).
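One way to turn measured cycle counts into a baseline CPU budget is sketched below. It assumes cycle counts are collected per 1 ms LTE subframe (TTI) and that all components share one core. The function name and any figures passed to it are illustrative, not measurements from the project.

```python
TTIS_PER_SECOND = 1000  # LTE subframes are 1 ms long


def cpu_utilization(cycles_per_tti, core_hz):
    """Convert per-TTI cycle counts into fractions of one core's capacity."""
    cycles_available = core_hz / TTIS_PER_SECOND  # cycles in one TTI
    per_component = {
        name: cycles / cycles_available
        for name, cycles in cycles_per_tti.items()
    }
    total = sum(per_component.values())
    return per_component, total
```

For example, a component that consumes 100,000 cycles per TTI on a 1 GHz core would account for 10% of that core. Summing the per-component shares gives the total load, which can then be compared against the headroom the design is expected to keep.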
Hardware acceleration is also defined using the prototype software: the cycle counts for each component in the system yield a prioritized list of targets for offload to hardware. The hardware versus software tradeoff is based on die size and power (cost in mm² and watts), where a function is implemented either at a fixed hardware cost or at a shared software cost.
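The prioritization step can be illustrated with a short sketch that ranks components by their share of total cycles. It covers only the ranking; the die size and power tradeoff described above is not modeled, and the function name is hypothetical.

```python
def rank_offload_candidates(cycles_per_component):
    """Return (name, cycles, share_of_total) tuples, most expensive first."""
    total = sum(cycles_per_component.values())
    if total == 0:
        return []
    ranked = sorted(
        cycles_per_component.items(),
        key=lambda item: item[1],
        reverse=True,
    )
    return [(name, cycles, cycles / total) for name, cycles in ranked]
```

Components at the top of the list are the first to examine for hardware offload, subject to the area and power cost of the corresponding accelerator.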
The prototype software is executed on existing hardware platforms with the same or similar characteristics as the final product. In this example, a P2020 platform was chosen for prototyping. Its configuration, such as the number and frequency of cores and the cache size, was varied to establish cycle costs under different SoC options.
Evaluating performance targets
As shown in Figure 12, the performance targets (cycle counts per component) are used to create a performance report. This report is used to document a software statement of work (SoW), which in turn is used for internal and external software development and serves as a requirements document for all parties contributing to the system development process. Software implementation is focused on meeting the performance targets from initial architecture design through the implementation phase, with performance analysis conducted formally at each major phase of the project to ensure that goals are being met.
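A per-phase performance check of this kind could be as simple as comparing measured cycle counts against the targets in the performance report. The sketch below is one possible form. The function name and the optional `tolerance` parameter are assumptions, and components without a measurement yet are skipped rather than flagged.

```python
def components_over_target(targets, measured, tolerance=0.0):
    """Return {component: cycles over target} for components exceeding budget.

    targets and measured map component names to cycle counts.
    tolerance is an allowed fractional overshoot (0.05 means 5%).
    """
    over = {}
    for name, target in targets.items():
        actual = measured.get(name)
        if actual is None:
            continue  # not measured in this phase yet
        if actual > target * (1 + tolerance):
            over[name] = actual - target
    return over
```

An empty result at a project milestone indicates that every measured component is within its cycle budget. A non-empty result names the components that need optimization or a revised budget.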