The trend towards high-quality multimedia content and higher communication bandwidth drastically increases the complexity of the underlying SoC architecture. In previous designs a single application processor was sufficient to run the rather simple phone software and to control the modem subsystem. Today numerous dedicated IP blocks are necessary to perform the multimedia functions with the required performance and energy efficiency.

Figure 1. Block diagram of a multi-media mobile phone.
The high-level block-diagram of the multi-media subsystem of a mobile phone is depicted in Figure 1. The four components on the top are initiators on the bus, whereas the multi-port memory controller is a target.
Design Time Performance Analysis Issues
The goal of the architecture definition phase is to determine the optimal configuration of the design parameters in interconnect and memory subsystems, in order to deliver sufficient performance at minimal cost. In the past, the performance requirements were analyzed using spread-sheets. However, this static performance analysis approach is not applicable for the complexity of today's SoC platforms.
Multiple Initiators: As shown in the block diagram, we have a much higher number of IP blocks, which act as masters on the interconnect architecture.
Dynamic traffic: The traffic generated by the multimedia accelerators is rather bursty and varies greatly depending on the use case. As an example, a viewfinder operation will deliver quite regular memory accesses, since the data is processed in raster scan order. On the other hand, functions like video encoding or decoding tend to exhibit scattered memory accesses, especially with the latest generation of video CODECs. Another example is the influence of the frame buffer organization on the memory accesses: a coplanar organization will provide quite linear accesses, whereas a planar organization will require interleaved accesses through several frame buffer planes. Another factor influencing the traffic pattern is the dimension of the accessed objects: single-dimension objects will again provide linear accesses, whereas the stride of 2-D objects will induce scattered and interleaved accesses. The combination of all possible configurations rapidly exceeds the capabilities of performance analysis using spread-sheets.
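The difference between linear and strided accesses can be illustrated with a minimal Python sketch. The function names, the frame geometry, and the 1 KiB page size are illustrative assumptions and not parameters of the actual platform.

```python
def raster_addresses(base, total_bytes, word=4):
    """Word addresses of a 1-D object: one contiguous run."""
    return [base + offset for offset in range(0, total_bytes, word)]


def block_addresses(base, stride_bytes, block_w_bytes, block_h, word=4):
    """Word addresses of a 2-D block inside a larger frame.

    Each block row is contiguous, but consecutive rows are one
    line stride apart in memory.
    """
    return [base + row * stride_bytes + col
            for row in range(block_h)
            for col in range(0, block_w_bytes, word)]


def page_switches(addresses, page_bytes=1024):
    """Count how often consecutive accesses land in different pages."""
    pages = [a // page_bytes for a in addresses]
    return sum(1 for prev, cur in zip(pages, pages[1:]) if prev != cur)


# Both patterns transfer 1024 bytes (256 words).
linear = raster_addresses(0, 1024)
block = block_addresses(0, stride_bytes=2048, block_w_bytes=64, block_h=16)
print(page_switches(linear))  # 0
print(page_switches(block))   # 15
```

Although both patterns move the same amount of data, the 2-D block touches a new page on every row in this example, which is the kind of effect a static bandwidth budget does not capture.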
Arbitration: To cope with such a complex workload, we need multiple levels of arbitration and queuing in the bus matrix and in the multi-port memory controller. This hierarchical arbitration mechanism cannot be accurately predicted without a proper system simulation environment.
QoS: The Memory Controller offers advanced Quality-of-Service (QoS) features like bandwidth reservation for the multimedia blocks and a low latency access for the MCU.
This results in the following set of configuration parameters, which should be optimized by the architect:
Interconnect: bus-width, clock-period, topology, arbitration algorithm, priorities.
Memory Controller: bus-width, number of ports, low latency versus high bandwidth port, buffering, number of access beats.
The design space becomes even larger through the configuration parameters in the IP blocks: the data layout of the video frames in the memory and the memory access pattern are highly configurable. This in turn has a significant impact on the effective DRAM bandwidth and access latency. One typically faces tens of parameters. The key challenge is then to isolate the most important ones and ensure the right tuning setup.
The ESL tool helps us to traverse the design space in a coordinated way. This is done by setting up simulation runs to sweep certain design parameters. The analysis results from the simulations are stored in separate databases. The comparison of the results unveils the significance of a design parameter with respect to a certain performance metric, such as the time it takes to render one frame, the bandwidth headroom on the bus, or the average latency of the MCU transactions. Understanding the significance of a design parameter allows us to decide whether the corresponding implementation cost would be justified by the performance improvement.
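The idea of a parameter sweep and the comparison of its results can be sketched in a few lines of Python. This is only an illustration: `toy_run` is a made-up placeholder standing in for a real simulation run, and the parameter names and the formula inside it are assumptions.

```python
from itertools import product
from statistics import mean


def sweep(run, space):
    """Run every parameter combination; return a list of (config, metric)."""
    names = list(space)
    results = []
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        results.append((config, run(config)))
    return results


def significance(results, name):
    """Spread of the mean metric across the values of one parameter."""
    by_value = {}
    for config, metric in results:
        by_value.setdefault(config[name], []).append(metric)
    means = [mean(metrics) for metrics in by_value.values()]
    return max(means) - min(means)


def toy_run(config):
    # Hypothetical stand-in for one simulation run; returns a frame
    # time in arbitrary units. The formula is invented for illustration.
    return 1000 / config["bus_width_bits"] + 2 / config["buffer_entries"]


space = {"bus_width_bits": [32, 64], "buffer_entries": [2, 4]}
results = sweep(toy_run, space)
print(significance(results, "bus_width_bits"))  # 15.625
print(significance(results, "buffer_entries"))  # 0.5
```

In this toy setup the bus width dominates the metric, so it would be the parameter worth weighing against its implementation cost.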
Run-Time Performance Analysis Issues
Many performance-relevant parameters can be configured at run-time by the embedded software. Therefore performance analysis is not only a design-time issue, but also a post-silicon run-time issue.
For example, during embedded software development we faced a performance limitation with our current GSM chips. Both the baseband and the application processor use the SDRAM memory controller. When the multimedia application was tested on the prototype board, the software developer saw black pixels in the image processed by the camera sensor. We analyzed this issue in the architecture group using the prototype board and logic analyzers. It was very difficult to identify the root cause by just looking at hardware traces coming from the board. It turned out that this was not a functional error in the hardware, but that the memory controller was not configured correctly by the software. This led to congestion in the queues of the memory arbiter, which in the end caused the pixel errors on the display. This problem could have been identified earlier with a different method.
In summary, today's highly re-usable IP blocks offer many design-time and run-time configuration parameters to tune the block for the specific SoC. However, traditional design methodologies hinder us from taking advantage of all this flexibility.
Using spreadsheets is no longer an option, because QoS based queuing and arbitration of dynamic workloads makes it impossible to predict the actual performance and utilization for a specific configuration of interconnect and memory subsystems.
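The limitation of a static calculation can be shown with a small Python sketch. It models a single shared port that serves requests in order, which is a strong simplification of the real multi-level arbitration, and all numbers are made up for illustration.

```python
def worst_latency(arrivals, service_cycles):
    """Serve requests in arrival order on one shared port.

    Returns the worst-case latency (completion time minus arrival
    time) in cycles.
    """
    free_at = 0
    worst = 0
    for t in sorted(arrivals):
        start = max(t, free_at)          # wait until the port is free
        free_at = start + service_cycles
        worst = max(worst, free_at - t)
    return worst


# Both workloads issue 8 requests within 80 cycles, so a spreadsheet
# would report the same average utilization (8 * 8 / 80 = 80%).
regular = [i * 10 for i in range(8)]     # one request every 10 cycles
bursty = [0] * 4 + [40] * 4              # two bursts of four requests

print(worst_latency(regular, service_cycles=8))  # 8
print(worst_latency(bursty, service_cycles=8))   # 32
```

The average load is identical, yet the worst-case latency differs by a factor of four in this toy case. Only a simulation that follows the requests over time exposes this difference.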
Using RTL simulation is not an option for architecture analysis due to the long turn-around time for compiling and running a simulation. We also lack statistical analysis capabilities for performance related metrics like throughput and latency of the different components in the system.
Emulation solves the simulation speed issue of RTL simulations and is heavily used for RTL sign-off. However the late availability, lack of performance analysis views, and the long turn-around times are not addressed, so emulation alone is not a good fit for early architecture analysis studies.
Development boards are typically used for post-silicon SW debugging, but this is not at all suitable for performance analysis. The visibility, especially into interconnect and memory architecture, is very limited. This makes it very hard to identify the origin of a performance issue.
The shortcomings of our current design and debugging methodology motivated us to try out a new approach based on commercially available ESL technology, which is described in the next section. We have selected CoWare Platform Architect as a SystemC-based ESL environment for platform capture and performance analysis. Together with the RTL co-simulation capabilities and the ESL model library it delivers all the necessary ingredients for our architecture exploration and validation use-model.
Methodology/Approach
The goal was to build an ESL model of the performance relevant portion of the SoC platform. It is important to define the right modeling approach. For our use-case, we need cycle accuracy for interconnect and memory subsystems, which are at the center of our investigation. Also the traffic needs to be sufficiently accurate to reproduce specific scenarios. Our job is to define the SoC architecture, so obviously we don't want to spend too much time creating the models ourselves.
Given these requirements on the model accuracy and modeling effort, we decided to use a combination of models for the components in our platform.
The easiest part was the modeling of the AHB bus matrix. Here we used the fully cycle-accurate SystemC TLM model, which is available in the commercial ESL model library. We only had to instantiate, connect and configure the bus nodes according to our wishes. The AHB model provided all the configuration options of the real AHB protocol and is instrumented with all the integrated performance analysis views we need for our architectural investigation.
We did not have SystemC models of the initiator IP blocks (MCU, Camera, Render, and Display), and we did not want to spend the time to create them. In any case, we are only interested in the bus transactions generated by these components and not in their functional behavior. Therefore we used the Generic File Reader Bus Master (GFRBM) provided by CoWare. This model reads in a transaction trace file in the Socket Transaction Language (STL) and generates the corresponding bus transactions. The GFRBM is generic in that it can be hooked to different bus protocols by means of transactors, and it generates cycle-accurate traffic.
For the memory subsystem we used RTL co-simulation between the RTL memory controller and memory on the one side and the rest of the SystemC model on the other side. We had no ESL model of our proprietary memory subsystem available, and it would have been too much effort to create a cycle-accurate model of a complex multi-port memory controller ourselves. As a replacement we used the RTL co-simulation capability provided by the commercial ESL environment. The resulting simulation speed is sufficient for our architecture analysis work.
Assembling the platform from existing library elements and the RTL memory sub-system was straightforward and a matter of a few hours. Of course the majority of the blocks in the system are omitted or modeled using the trace-driven initiators. Still, this partial model of our platform provides us with exactly the configurability and accuracy we need for our investigations.
So far we used this performance model of our phone platform for two use-cases:
Validation of Run-Time Performance Analysis Issues
As a first experiment, we validated the performance limitation in the existing platform. We configured the memory controller in the same way as the real software does and stimulated the bus using the Generic File Reader Bus Masters. We converted the original board traces into STL files driving the GFRBM. This way we were able to reproduce the performance limitation as observed in the real system. The bus analysis views immediately revealed the contention on the memory controller ports as the root cause of the problem.
This exercise convinced us of the fidelity of the analysis results obtained from the ESL simulation environment. Subsequently we used the same setup for the investigation of architectural alternatives as described in the next section.
Traffic Generation Utility
As the first step in setting up the architecture exploration experiments we created a small utility, which generates input trace files for the GFRBM initiators from a high-level traffic description. The traffic description is tailored to our image processing accelerators, which access the memory in a very specific way. The traffic description file contains the following attributes:
By modifying the attributes of the traffic description we can mimic the different design parameters and operating modes, such as the frame rate of the video camera. This way it is very easy for us to set up all kinds of scenarios that can occur in the real platform. The accuracy of the traffic generation has been validated by comparing the traces from the ESL model against the reference traces. The reference traces are derived from the development board of the previous design, which uses the same IP blocks for the multi-media subsystem.
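The principle of deriving a trace from a few high-level attributes can be sketched as follows. This is not the utility described here: the attribute names are hypothetical, the output is a plain list of tuples rather than STL, and the bursts are spread evenly over the frame period for simplicity.

```python
def generate_trace(frame_rate_hz, frame_bytes, burst_bytes,
                   base_addr=0, frames=1):
    """Return a list of (time_ns, address, length) read requests.

    All parameters are hypothetical attributes chosen for this sketch.
    Each frame is split into equally sized bursts that are issued at
    regular intervals within the frame period.
    """
    period_ns = 1_000_000_000 // frame_rate_hz
    bursts = frame_bytes // burst_bytes
    gap_ns = period_ns // bursts
    trace = []
    for frame in range(frames):
        for burst in range(bursts):
            trace.append((frame * period_ns + burst * gap_ns,
                          base_addr + burst * burst_bytes,
                          burst_bytes))
    return trace


slow = generate_trace(frame_rate_hz=25, frame_bytes=1024, burst_bytes=64)
fast = generate_trace(frame_rate_hz=50, frame_bytes=1024, burst_bytes=64)
print(slow[1])  # (2500000, 64, 64)
print(fast[1])  # (1250000, 64, 64)
```

Changing a single attribute, here the frame rate, yields a new trace with the same addresses but a different request spacing.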
Performance Analysis Scenarios
In this section we discuss experiments we conducted with the ESL performance model of our chip architecture. The absolute performance metrics are only available to our customers. Here we restrict the results to relative numbers.
A snapshot of the performance analysis results is depicted in Figure 2.

Figure 2. Performance analysis results.
The lower left view shows the contribution from each of the initiators to the overall transaction throughput. The upper right view shows the relative contention in each of the 3 output stages, which are connected to the input-ports of the memory controller. The results are statistically aggregated over intervals of 500 micro-seconds to analyze the dynamics of the system over time. This view allows us to easily identify bottlenecks in the interconnect and memory subsystem.
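The interval-based aggregation behind such a view can be sketched in Python. The record layout `(time_us, initiator, bytes)` is an assumption made for this example and is not the data format of the analysis tool.

```python
from collections import defaultdict


def throughput_per_interval(transactions, interval_us=500):
    """Sum the transferred bytes per initiator and time interval.

    transactions: iterable of (time_us, initiator, nbytes) records.
    Returns {interval_index: {initiator: total_bytes}}.
    """
    bins = defaultdict(lambda: defaultdict(int))
    for time_us, initiator, nbytes in transactions:
        bins[int(time_us // interval_us)][initiator] += nbytes
    return {index: dict(per_initiator)
            for index, per_initiator in sorted(bins.items())}


sample = [
    (10, "camera", 64),
    (499, "display", 32),
    (500, "camera", 64),
    (1200, "mcu", 16),
]
print(throughput_per_interval(sample))
# {0: {'camera': 64, 'display': 32}, 1: {'camera': 64}, 2: {'mcu': 16}}
```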
The following enumeration briefly summarizes the results we obtained from our performance analysis studies.
Address Mapping
In this scenario we investigated the impact of the data organization. The memory controller supports "full-row", "full page", and "bank-interleaved" operation modes and the possibility to map data differently into two separately configurable memory regions with a physical to logical address conversion.
We had the possibility to simulate several combinations and find a good trade off between throughput and power consumption.
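Why the address mapping matters can be illustrated with a generic Python sketch. The two decode functions below are textbook-style schemes with assumed bit widths; they are not the operation modes of the memory controller discussed here.

```python
def decode_bank_interleaved(addr, col_bits=10, bank_bits=2):
    """row | bank | column: neighboring 1 KiB regions map to different banks."""
    bank = (addr >> col_bits) & ((1 << bank_bits) - 1)
    row = addr >> (col_bits + bank_bits)
    return bank, row


def decode_bank_contiguous(addr, col_bits=10, row_bits=12, bank_bits=2):
    """bank | row | column: neighboring 1 KiB regions are rows of one bank."""
    row = (addr >> col_bits) & ((1 << row_bits) - 1)
    bank = (addr >> (col_bits + row_bits)) & ((1 << bank_bits) - 1)
    return bank, row


def row_misses(addresses, decode):
    """Count accesses that need a different row than the one open in their bank."""
    open_rows = {}
    misses = 0
    for addr in addresses:
        bank, row = decode(addr)
        if open_rows.get(bank) != row:
            misses += 1
            open_rows[bank] = row
    return misses


# Two streams accessed alternately, e.g. a source and a destination buffer.
stream_a = [0 + 4 * i for i in range(16)]
stream_b = [1024 + 4 * i for i in range(16)]
mixed = [addr for pair in zip(stream_a, stream_b) for addr in pair]

print(row_misses(mixed, decode_bank_interleaved))  # 2
print(row_misses(mixed, decode_bank_contiguous))   # 32
```

In this example the interleaved mapping keeps one row open per stream, whereas the contiguous mapping forces a row change on every access. A real trade-off study also has to account for the power cost of keeping several banks active, which this sketch does not model.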
Validation Quality of Service
In this exercise we validated that the multimedia subsystem does not impair the performance of the other parts of the system (MCU, modem subsystem). The multimedia components (Camera, Rendering Engine, and Display Controller) share one port on the memory controller, whereas the other ports are reserved for other subsystems. We applied the stimuli representing the other subsystems to the memory controller ports and measured the resulting throughput and latency. Not surprisingly, the memory controller is able to separate the traffic streams from the different memory ports such that the low latency requirements of the MCU are satisfied regardless of whether the multimedia subsystem is active.
Increased Bus Frequency
Here we investigated the potential for increasing the memory throughput by using a higher clock frequency for the memory controller. Increasing the memory controller clock by 17% increases the overall throughput by less than 5%. We discarded this option due to the high effort we foresee to implement this change.
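One way to reason about such a result is a simple two-component model in which only part of the transfer time shrinks with the faster clock. The Python sketch below is an assumption for illustration and not the analysis performed in the simulation. The fraction of 0.3 is a made-up value.

```python
def throughput_gain(clock_gain, scaled_fraction):
    """Relative throughput gain for a given relative clock increase.

    scaled_fraction is the share of the original transfer time that
    scales with the clock; the remainder is assumed to stay constant.
    """
    new_time = (1.0 - scaled_fraction) + scaled_fraction / (1.0 + clock_gain)
    return 1.0 / new_time - 1.0


# If everything scaled with the clock, 17% more clock would give 17% more throughput.
print(round(throughput_gain(0.17, 1.0), 3))  # 0.17
# If only 30% of the time scales, the gain stays below 5%.
print(round(throughput_gain(0.17, 0.3), 3))  # 0.046
```

Under this simplified model, a gain below 5% for a 17% faster clock would mean that roughly a third or less of the transfer time depends on the memory controller clock.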
Enable Bufferable Flag
The AHB protocol allows specifying a "bufferable" flag for each transaction. The memory controller could take advantage of this information, because enabling the internal buffers would improve the memory bandwidth and reduce the transaction latency. However, this flag is currently not used by the multimedia subsystem. We added the bufferable flag to the STL stimuli files where applicable and found a 10% improvement compared to the default setting.
Memory Controller Configuration
We found that the current driver software for the memory controller does not exploit the full performance potential of this complex block. By adjusting the Quality-of-Service settings of the memory controller to the current operation mode the memory bandwidth can be significantly improved.
Summary and Outlook
We are very happy with the new way of doing performance analysis, as our initial work immediately provided value to our product design. Previously we were using spread-sheets for very high-level analysis, and we validated the performance using emulation only once the RTL became available. Spread-sheets are no longer able to capture the effects of multiple levels of arbitration and queuing in multi-master systems. Emulation comes far too late and is not flexible enough to carry out architecture performance studies, e.g. it is not easy to vary the memory controller clock independently of the bus clock.
We have adopted CoWare Platform Architect together with the CoWare ESL Model Library. ESL design in a commercial tool environment gives us far more flexibility to explore architectural alternatives and quantify potential performance improvements. By changing the attributes of the traffic generation utility we can easily set up a large set of scenarios, which would be far more difficult with the real IP blocks. The ESL model also gives us a lot more flexibility, as we can freely modify the bus clock, the arbitration policy and priorities, and even the bus topology.
Moving forward we are planning to replace the RTL model of the memory controller and memory with a SystemC transaction-level model (TLM). This will further improve the simulation speed and will give us more flexibility to explore further architectural options. The SystemC TLM model will be either generated from the RTL model or manually created by our central IP modeling team.
For the next generation of our NXP cellular systems product platforms, we will continue to use and extend this approach to carry out much broader architectural studies, e.g. to assess the benefit of replacing the AHB multi-layer bus with an AXI bus. Our next generation products will encompass an increasingly integrated set of audio, image & video and telecom features. This method will contribute to an optimization of the system architecture very early in the development phase.
About the Authors:
Danilo Piergentili is a system architect in the Feature Phone Product Line of the Mobile & Personal Business Unit at NXP Semiconductors. He graduated from the University of Rome "Tor Vergata" with a master's degree in electronics. He can be reached at: danilo.piergentili@nxp.com.
David Coupe is a multimedia architect in the Feature Phone Product Line of the Mobile & Personal Business Unit at NXP Semiconductors. He graduated from ISEN in Lille (FR) with a microelectronics engineer diploma in 1987. He can be reached at: david.coupe@nxp.com.