How Texas Instruments' designers used the SystemC hardware design language to do performance modeling when creating both the company's OMAP-2 platform and the devices based on it.
As an embedded systems designer, you may find you're working more with hardware design languages and the system-on-chip (SoC). Perhaps you're building boards or systems from components; either way, you now often deal with SoCs in ASIC or FPGA form. How SoCs are modeled and simulated may feel like a prequel to your design, but it's a valuable story.
This article discusses the role of performance modeling in creating both the OMAP-2 platform and the devices based on it. The OMAP-2 is a platform from Texas Instruments for creating SoCs. The platform is underpinned by a basic set of rules and guidelines covering programming models, bus interfaces, and RTL (register transfer level) design.
The platform is highly generic. It's capable of supporting a wide range of functional and performance requirements, some of which may be unknown when the platform is created.
Surprisingly, the same can be true for the specific devices, which are frequently openly programmable and expected to have a life extending beyond that of the products that drive their development. It is, however, generally true that device requirements are more precise than platform requirements.
Because the SoCs built from OMAP-2 are highly complex, it's not possible to analyze performance satisfactorily using static calculations such as in spreadsheets. Therefore, simulation is used.
The simulation technology must, first and foremost, make test cases and models easy to create and produce credible results. The emphasis on test-case creation follows from the complexity of the devices and from the way an SoC platform such as OMAP-2 is used. The whole motivation is to move from marketing requirements to RTL freeze and tape-out in a very short time, and in many cases large parts of the software will be written by the end customer rather than by the SoC provider (Texas Instruments, in this article). The performance-area-power tradeoff of a proposed new SoC must therefore be achieved without the aid of “the software.”
Secondary requirements are simulation speed, visibility of results and behavior, modularity and reusability, and the ability to integrate legacy and third-party models. We created a modeling technology based on:
- Standard cycle-based modeling technology for bus interfaces taken from the Open Core Protocol International Partnership (OCP-IP).
- Privately-developed technology for test-case specification, module configuration, run-time control, and results extraction.
We used cycle-based interfaces throughout, because cycle accuracy is required in some areas and use of a single interface technology throughout the platform was essential. Cycle-accurate interfaces do not necessarily imply cycle-accurate functionality, and in general the OMAP-2 simulations can be described as timing-approximate.
The aim is to move to public domain technology in all areas as soon as appropriate solutions become available. We never use this modeling technology for software development but independently create virtual SoC platforms for software development.
The challenges for the future lie in making this technology usable outside the core OMAP-2 architecture team and in being able to import models from third-party suppliers. Achievement of these goals is currently hampered by the lack of public standards for specifying test cases and configuring and controlling modules.
OMAP-2 overview
The OMAP-2 platform provides the basic building blocks to create a general-purpose computer system-on-a-chip [1]. It's designed for application and modem processors for mobile telephones.
The principal shared characteristics of the modules to be found in the OMAP-2 platform (shown in Figure 1) are:
- Bus interfaces. All OMAP-2 modules use the same protocol, namely OCP (Open Core Protocol) [2].
- Power management functionality and interfaces.
- Interrupt/direct memory access (DMA)-request interfaces.
- Synthesis scripts assuring timing closure at common frequencies at common process nodes.
- Programming models derived from a common base and common principles.
- Security- and debug-related functionality.
One vital element of the OMAP-2 platform that is not shown in Figure 1 is the interconnect or NOC (network on chip) technology. It's absent because it's largely transparent to the user, whether software developer or hardware integrator. The NOC enables the processors and DMA controllers to access the memories and peripherals, using a common SoC-wide memory map. The NOC provides:
- Address-based routing of bus requests.
- Arbitration for concurrent access to shared memories.
- Adaptation of OCP features between incompatible initiators and targets, for example bus-width conversion, conversion of burst types to those supported at the target, or conversion from single-request-multiple-data to multirequest-multiple-data bursts.
- Clock-frequency conversion between modules running at different rates.
- Programmable address-based connectivity control for enhanced security.
- Detection, routing, and logging of error events.
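The first item in this list, address-based routing against a common SoC-wide memory map, can be pictured with a small sketch. This is plain C++ rather than generated NOC code, and every name and address in it (`MapEntry`, `AddressRouter`, `boot_rom`, `uart`, `sdram`) is invented for illustration; it only shows the idea that a request's address alone selects its target, and that an unmapped address is something the NOC would have to report as an error.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One entry of a hypothetical memory map: the half-open address range
// [base, base + size) belongs to the named target.
struct MapEntry {
    std::uint64_t base;
    std::uint64_t size;
    std::string target;
};

// Minimal address decoder. A real NOC also arbitrates, converts bursts
// and crosses clock domains; none of that is modeled here.
class AddressRouter {
public:
    void add(std::uint64_t base, std::uint64_t size, std::string target) {
        assert(size > 0);
        entries_.push_back(MapEntry{base, size, std::move(target)});
    }

    // Returns the entry that owns addr, or nullptr if addr is unmapped.
    // The pointer is only valid until the next call to add().
    const MapEntry* route(std::uint64_t addr) const {
        for (const MapEntry& e : entries_) {
            if (addr >= e.base && addr - e.base < e.size) return &e;
        }
        return nullptr;
    }

private:
    std::vector<MapEntry> entries_;
};

// Example map; the ranges are made up and do not describe any OMAP device.
inline AddressRouter make_example_router() {
    AddressRouter r;
    r.add(0x00000000u, 0x00100000u, "boot_rom");
    r.add(0x40000000u, 0x00010000u, "uart");
    r.add(0x80000000u, 0x10000000u, "sdram");
    return r;
}
```

With this map, a request to `0x80000010` would be routed to `sdram`, while a request to `0x50000000` would find no owner and return `nullptr`.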
Typically multiple levels of NOC are in an OMAP device, and different NOC technologies are available within OMAP-2, optimized for the different levels. Superficially these provide the same functionality but are very different in terms of performance and area. The connectivity between the principal processors and the principal memories is critical to system performance and is allowed to consume more silicon area and power than the paths to rarely-used peripherals.
OCP and the modular multilevel NOC are cornerstones of the OMAP-2 platform. They permit the rapid creation of SoC products. The architects of the new product are able to select the processors and peripherals they desire and be confident that these are compatible with each other and that they can be connected as required. This is in some ways an intrinsically bottom-up process, with apparently little scope for optimization except through selection of the modules from the library, if the development timescales are to be held. One module, however, in every SoC is created specifically for that SoC—the NOC. By playing with the topology, the level of concurrency, and the level of pipelining in the NOC, it's possible to create SoCs from the same basic modules with quite different capabilities.
This approach to SoC creation puts product performance analysis in the critical path. The product architects are able to fashion an SoC rapidly from existing material and to know immediately how big it will be, how fast it will run, and (to a first approximation) how much power it will consume; but they must also know whether it meets its performance requirements. For this, architecture-level simulation is used, based on transaction-level modeling (TLM) concepts and the SystemC language [3]. This simulation capability is also a part of the OMAP-2 platform. The basic requirement on it is to be able to provide feedback on questions of product performance in the timescales for definition of an OMAP-2-based SoC, in the project phase before development resources are allocated and development starts. During this definition phase, the SoC architecture is not stable and the performance analysis technology must live with this fact.
The OMAP-2 performance modeling technology is used for the following:
- Support of OMAP-2 platform development and maintenance.
- Support of SoC product definition: validation of the SoC's performance before RTL development starts.
- Validation of details of SoC implementation, in particular the NOC configurations, during development.
- Provision of reference performance data to RTL and silicon validation teams.
- Response to queries from marketing and customers when new applications of an existing SoC design are proposed.
- Support of customers wishing to optimize the implementation of their application on the SoC (which DMA is better to use, what size of burst should be used, what are the best arbitration options).
Figure 2 shows a simplified representation of the top level of an OMAP-2 SoC performance model.
All the boxes are SystemC modules connected by SystemC channels. The modules fall into a small number of different categories:
- Subsystems, which are just hierarchical divisions and contain further modules of the same sorts, connected in the same way. The hierarchy in general matches the hardware hierarchy of the SoC. Typically a subsystem is composed of one or more processors and one or more DMA controllers or traffic generators.
- Processors: Three different styles of processor model are used:
Stochastic, in which the processor generates random instructions, pretends to fetch them, then executes them. External memory accesses for fetch, load, and store are filtered by stochastic cache models, in which the decision whether an access hits or misses is made through comparison of a random number with a cache miss ratio parameter. The power of such models is that with a small number of parameters, representative bus activity can be created, even of the most complex software. Cache miss ratio and code profiling statistics are available for many classes of software, and so a large range of tests (such as protocol stack, signal processing, high-level operating system with user applications, games) can be run without significant software development or porting effort. Furthermore, because no actual software is run, the parameters can be slightly degraded to test the sensitivity of the SoC to potential software variations. The models provide estimates of processor MIPS and simulated NOC and memory traffic.
Such models have been developed for RISC and DSP processors, Harvard and unified-memory, with L1 and L2 caches. Although they do not implement the function of processors, they may be said to be cycle-accurate. CPU and cache pipelines are modeled correctly, write buffers are implemented, and so on.
Trace-driven. Where the performance of the processor for a specific piece of software is the primary consideration, a more detailed model is required that takes into consideration not only the statistics of the software but also the order of instructions executed. To achieve this, cycle-accurate processor and cache models are available, which replay a trace of the software execution rather than executing it afresh. The advantage of this is that software from another platform, for example a previous-generation OMAP, may be tested without first being ported. Software porting, especially where OSes are involved, is a major task and is not attempted during the product-definition phase. Furthermore, the use of traces, which include the effects of user I/O, provides repeatability, which is hard to achieve if the software is actually executed, for example in a game environment.
Instruction-set simulators. Although used less than the other processor types, it's possible to instantiate a cycle-accurate instruction-set simulator (ISS) in the OMAP-2 performance simulation. This is at the moment restricted to DSPs. Such processor models are heavily used in DSP software optimization, and instantiation in the SoC model allows the effects of the overall system (for example, increased latency caused by congestion on external memory) to be taken into consideration. The SoC model is not in general safe to use for the processor—there is no guarantee that memory exists where it should or that memory will not be over-written by some random interference traffic, so the processor I/O is usually taken from the host filesystem and the external memory is fully cached inside the processor model.
There is no requirement for an ISS model of the main RISC processor of the OMAP SoC. The cost of implementing a SystemC SoC model capable of supporting the interesting applications is too high, even before the cost of software porting and maintenance is considered. Configuration of the SoC, a task done by the RISC CPU in reality, is more easily accomplished in performance simulation by direct configuration of the modules.
- Memory controllers and memories: The memory controllers are modeled in a fully-cycle-accurate way. However, they aren't normally connected to memories. Although a read operation will return data at the correct time, it will not be the correct data. In general, the whole OMAP-2 performance-simulation platform can be described as dataless.
- DMAs and peripherals: Similarly to the memory controllers, DMAs and peripherals are modeled cycle-accurately, but certain aspects of their functionality are not implemented. In the case of a DMA, it is the run-time programmability that is not present. Whereas in reality a processor writes to DMA registers to provoke a transfer, in the simulation the transfer is simply requested from the DMA model through a SystemC interface. This may be done at elaboration time, optionally with a delayed start, or at any time during the simulation by some other process.
Peripherals of interest to the performance simulation include serial-port controllers, cryptographic accelerators, and so on. A serial-port controller model would be cycle-accurate on its bus and DMA/CPU interfaces, but the serial data would not exist. Likewise a cryptographic accelerator would not encrypt the data given it, but it would act as though it had, making dummy data available at the correct time. In both cases any configuration (such as baud rate) would be done using a high-level SystemC interface and not by writing to simulated registers.
- Generic traffic generators: Many of the main bandwidth consumers in an OMAP SoC have relatively simple and repetitive traffic patterns. The best examples of this are the display and camera controllers. In the OMAP-2 performance model such things are represented by simple traffic generators. These generators have a range of addressing modes and traffic types (such as burst/non-burst and SRMD). They generate traffic at a constant rate (with optional jitter) and may have real-time requirements and internal pipelining limitations. By combining several such generators, relatively complex traffic flows may be created. They may also be configured to behave in a highly randomized way, to create a sort of background load in the system.
- NOCs: The networks-on-chip are the only fully cycle- and functional-accurate parts of the SoC model. The NOC technology used in OMAP-2 is based on generation of an NOC from a configuration file, which contains details of all the initiators and targets and the desired topology of the NOC. It is possible to generate both RTL code and SystemC code from the same input.
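The stochastic cache models described under the processor styles above decide hit or miss by comparing a random number with a miss-ratio parameter. A minimal sketch of that single decision follows, in plain C++ rather than SystemC. The class name `StochasticCache`, its methods, and the choice of `std::mt19937` as the random source are assumptions made for illustration, not details of the TI models, which also model pipelines and write buffers.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical stochastic cache filter: it holds no tags and no data,
// only a miss ratio and a seeded random-number generator.
class StochasticCache {
public:
    StochasticCache(double miss_ratio, std::uint32_t seed)
        : miss_ratio_(miss_ratio), rng_(seed), uniform_(0.0, 1.0) {
        assert(miss_ratio >= 0.0 && miss_ratio <= 1.0);
    }

    // Returns true if this access misses, i.e. it would have to be
    // forwarded to the NOC as an external memory access.
    bool access_misses() {
        ++accesses_;
        const bool miss = uniform_(rng_) < miss_ratio_;
        if (miss) ++misses_;
        return miss;
    }

    std::uint64_t accesses() const { return accesses_; }
    std::uint64_t misses() const { return misses_; }

    // Miss ratio actually observed so far (0 if nothing was accessed).
    double observed_miss_ratio() const {
        return accesses_ ? static_cast<double>(misses_) / accesses_ : 0.0;
    }

private:
    double miss_ratio_;
    std::mt19937 rng_;
    std::uniform_real_distribution<double> uniform_;
    std::uint64_t accesses_ = 0;
    std::uint64_t misses_ = 0;
};
```

Because the seed is fixed, a run is repeatable, and raising `miss_ratio` slightly is all it takes to try the kind of sensitivity test described above.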
Interfaces and channels
The modules just described all support the same basic set of SystemC interfaces. A small set of SystemC channels is used to connect them together.
- OCP TL1: The OCP-IP has proposed a method for SystemC modeling of OCP interfaces [3]. Documentation and SystemC code (interfaces, channels, and data types) are available. The proposals cover a wide range of abstraction levels: TL1 being cycle-accurate; TL2 being protocol-specific with approximate timing, and so on. In the OMAP performance model the OCP TL1 technology is used exclusively. It can be argued that many of the simulations do not require cycle-accuracy and certainly many of the traffic generators or peripheral models are not in any way cycle-accurate in their functionality. However, the advantages of having a single interface and a single channel to deal with outweigh the simulation speed gains that might be available in a mixed TL1/TL2 simulation platform.
The OCP-TL1 channel includes a monitor interface, and a simple monitor that dumps a trace to a text file is available. A TI-developed statistics-gathering monitor is also used in the OMAP simulations to enable bandwidths and latencies to be extracted as simulation outputs. Any OCP interface in the SoC may be monitored in this way.
OCP is a synchronous protocol. A clock is associated with every point-to-point OCP connection. In the SystemC model, the synchronisation is accomplished using sc_clock() objects. All modules with OCP ports also have sc_clock input ports.
- Interrupts and DMA requests: TI has developed a simple TL1 interface for DMA requests and interrupts, and a SystemC channel for connecting interrupt generators to interrupt consumers. The main point of interest in this technology is that a single channel is instantiated in the whole SoC simulation. This allows the routing of interrupts and DMA requests to be done at run time, based on a configuration file, rather than being hard-wired as in reality.
- Static configuration: All the modules and channels in an OMAP SoC performance model support an elaboration-time/end-of-elaboration configuration procedure. This is used for:
- Providing hardware parameters to generic modules, for example cache sizes, filenames for trace or executable binaries, FIFO depths, bus widths, and clock frequencies.
- Providing modules with configuration information that would in the hardware be provided through register writes, for example baud rates, arbitration parameters, and FIFO trigger thresholds.
- Providing behavioral parameters to autonomous initiators, for example cache miss ratios, display refresh rates and screen size, and DMA transfer parameters.
Each module or channel has a set of parameters that may be set, and the parameter values are passed to it in the form of an STL map, templated with a pair of STL strings. The first string is the parameter name and the second includes a letter to indicate the type and then the parameter value. Such maps may be read from text files created by a user in a text editor. For example:
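What follows is a minimal sketch of the idea only, since the exact file syntax used at TI is not given here. It assumes one `name type:value` pair per line, with type letters `i`, `f`, and `s` for integer, floating-point, and string. The colon separator, the type letters, the `#` comment convention, the parameter names, and the helper functions are all assumptions.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Assumed file contents (hypothetical parameter names):
//   # cache configuration
//   size_bytes   i:16384
//   miss_ratio   f:0.02
//   trace_file   s:boot.trc

// Parameter name -> "<type letter>:<value>", both plain strings.
using ParamMap = std::map<std::string, std::string>;

// Reads whitespace-separated name/value pairs, one pair per line.
// Blank lines and lines whose first field starts with '#' are skipped.
inline ParamMap read_params(std::istream& in) {
    ParamMap params;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string name, typed_value;
        if (!(fields >> name >> typed_value) || name[0] == '#') continue;
        params[name] = typed_value;
    }
    return params;
}

// Checks the type letter and returns the text after "<letter>:".
inline std::string typed_payload(const ParamMap& p, const std::string& name,
                                 char type_letter) {
    assert(!name.empty());
    const std::string& v = p.at(name);  // throws std::out_of_range if absent
    if (v.size() < 2 || v[0] != type_letter || v[1] != ':') {
        throw std::invalid_argument("wrong type for parameter " + name);
    }
    return v.substr(2);
}

inline long get_int(const ParamMap& p, const std::string& name) {
    return std::stol(typed_payload(p, name, 'i'));
}

inline double get_double(const ParamMap& p, const std::string& name) {
    return std::stod(typed_payload(p, name, 'f'));
}

inline std::string get_string(const ParamMap& p, const std::string& name) {
    return typed_payload(p, name, 's');
}
```

Keeping every value as a string means one map type can carry hardware parameters, register-style settings, and behavioral parameters alike; the type letter lets the receiving module reject a value of the wrong kind when it reads the map.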
- Run-time control: Most of the simulations that run on the OMAP performance model need only static configuration of the initiators in order to produce the desired behavior. However, in such simulations, initiators do not interact. A process completing on the DSP cannot trigger the start of a DMA transfer, for example. In order to address this limitation, the modules (mainly the initiators) also support a second interface, which allows dynamic control of their behavior during the simulation. Basically a secondary pure-functional simulation runs: it may start tasks on the OMAP initiators and is informed when these tasks complete. Preemption of tasks is possible, so the secondary simulation can model multiple tasks using the same CPU under control of an RTOS. This is illustrated in Figure 3.
The top half of Figure 3 shows a simplified representation of a video conference application. It's purely functional, simply a chain of functions that have to be completed, each one triggering others. A complex application like this is hierarchical. In implementation, each function is mapped to some hardware, for example a DMA controller or a CPU. The black arrows show such a mapping. In our simulation technology, two separate simulations run within the same SystemC sc_main(). The pure-functional simulation provides the interactions between functions. For example, it waits until the MPEG compression is complete before starting a DMA transfer of the compressed data to mass storage.
The other simulation is the OMAP performance model as described earlier. Each function in the pure-functional simulation includes a set of parameters for one of the OMAP modules, on which it will be executed. The two simulations are linked by a set of schedulers, which allow multiple functions to be active at the same time even if they share the same CPU. For example several functions can run in turn on a stochastic CPU, with its own cache miss ratios and a number of instructions to execute before it is done.
The interface implemented by the OMAP-2 initiator models to enable this dynamic control is:
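The sketch below is an assumption about the general shape such an interface could take, not TI's actual code. It is plain C++ with no SystemC dependency. It captures the three capabilities described above: starting a task on an initiator, being informed when it completes, and preempting it. `DynamicControlIf`, `TaskParams`, `ToyInitiator`, and `step()` are hypothetical names, and the toy initiator simply counts down "instructions" instead of generating bus traffic.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>

using TaskId = int;
using DoneCallback = std::function<void(TaskId)>;

// Hypothetical per-task parameters; a real set would carry more,
// for example cache miss ratios or DMA transfer details.
struct TaskParams {
    long instructions;
};

// Hypothetical dynamic-control interface offered by an initiator model.
class DynamicControlIf {
public:
    virtual ~DynamicControlIf() {}
    virtual TaskId start_task(const TaskParams& params, DoneCallback on_done) = 0;
    virtual void suspend_task(TaskId id) = 0;  // preempt
    virtual void resume_task(TaskId id) = 0;
};

// Toy initiator: each step() retires one instruction of the
// lowest-numbered task that is not suspended.
class ToyInitiator : public DynamicControlIf {
    struct Task {
        long remaining = 0;
        bool suspended = false;
        DoneCallback on_done;
    };

public:
    TaskId start_task(const TaskParams& params, DoneCallback on_done) override {
        assert(params.instructions > 0);
        Task t;
        t.remaining = params.instructions;
        t.on_done = std::move(on_done);
        tasks_[next_id_] = std::move(t);
        return next_id_++;
    }

    void suspend_task(TaskId id) override { tasks_.at(id).suspended = true; }
    void resume_task(TaskId id) override { tasks_.at(id).suspended = false; }

    // Returns false if no task was able to run.
    bool step() {
        for (auto it = tasks_.begin(); it != tasks_.end(); ++it) {
            if (it->second.suspended) continue;
            if (--it->second.remaining == 0) {
                const TaskId id = it->first;
                DoneCallback done = std::move(it->second.on_done);
                tasks_.erase(it);
                if (done) done(id);  // the callback may start further tasks
            }
            return true;
        }
        return false;
    }

private:
    std::map<TaskId, Task> tasks_;
    TaskId next_id_ = 0;
};
```

In this sketch, a functional-level scheduler would model preemption by suspending a running task, starting a more urgent one, and resuming the first after the completion callback fires. A completion callback on one initiator can also start a task on another, which is the DSP-triggers-DMA interaction that static configuration alone cannot express.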
Use cases (applications) defined in this way can also be executed standalone and can be reused from a model of one SoC to a model of another, without requiring the same function-to-hardware mapping.
Other OMAP-2 simulation platforms
The architecture-level SoC performance model described here is not the only simulation model of an OMAP-2 SoC that is created. All of the following models are available:
- SystemC architecture-level performance model.
- Virtual platform model for software development.
- RTL simulator, including the options to substitute fast ISSs or simple traffic generators for the processors.
- FPGA model.
The different models serve different purposes, require different levels of effort to use, and become available at different times during the project. The SystemC performance model is always available first and is always the simplest to create and use. The virtual platform is the next to become available. It is used for software development and has very little timing accuracy. TI uses Virtio technology to create this model rather than SystemC [5]. The lack of accurate timing in this model means that low-level software has to be validated on another platform, which is the motivation behind the FPGA model. The FPGA model can also be used for performance investigations. It complements the SystemC model, being much less flexible and requiring software but having a degree of completeness and accuracy that is not attempted in SystemC. RTL simulations are in general too slow for either software development or performance investigations but are the final reference in cases of doubt and have the advantage of complete visibility into the SoC behavior.
It would appear that the choice of two different technologies for the virtual platform and the performance model is inefficient, wasting potential code reuse. However, the two have completely different (almost fully orthogonal) requirements, and at module level almost no code reuse is possible. This is illustrated in Figures 4 and 5.
Figure 4 shows a breakdown of a module into different aspects, whose importance varies depending on the level of abstraction. This example is an OCP slave, a peripheral of some kind, with a register file, some functionality triggered by writes to the registers, and some timing associated with execution of the function. Furthermore the peripheral has a bus interface that is compliant with some protocol, in this case OCP. A complete model of the peripheral would implement all this. A generic model architecture following this breakdown has often been proposed in the ESL industry.
Figure 5 shows that the model needed for the SoC architect's performance analysis is completely orthogonal to that needed by the software developer's virtual platform. On the left, we see that the architect needs the timing and the bus interface. The architect is concerned that the parameters of the bus interface are correctly chosen (such that the function can be implemented) and needs to be able to look at the cycle-by-cycle behaviour on that interface. But the functionality is of no interest to the architect. Suppose this is a cryptographic accelerator: it doesn't matter to the architect whether the encryption is done properly or not. In fact, it may be a hindrance if apparently random data is visible on the bus, making it much more difficult to correlate input and output. Furthermore, the architect will want to be able to trigger the function without writing into the registers via the bus, because that would require writing or modifying software, a time-consuming activity that may not even be possible.
On the right, by contrast, we see that in the virtual platform only the encryption function and the registers are important. The software engineer does not care at all about the bus interface, and the virtual platform generally discards timing to improve simulation speed.
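The architect's view of the cryptographic accelerator can be sketched as a model that never touches data and only answers the question "when would the result be ready?". The code below is a hypothetical illustration in plain C++: the class name, the block size, and the cycle costs are invented, and the `start()` call stands in for the direct trigger that replaces register writes over the bus.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical architect's-view model of a block-cipher accelerator.
// There are no registers and no encryption function, only timing.
class DatalessCryptoModel {
public:
    DatalessCryptoModel(std::uint32_t block_bytes,
                        std::uint32_t cycles_per_block,
                        std::uint32_t setup_cycles)
        : block_bytes_(block_bytes),
          cycles_per_block_(cycles_per_block),
          setup_cycles_(setup_cycles) {
        assert(block_bytes > 0);
    }

    // Direct trigger at cycle `now` for `payload_bytes` of input.
    // Returns the cycle at which the (dummy) output becomes available.
    // A request arriving while the unit is busy queues behind it.
    std::uint64_t start(std::uint64_t now, std::uint32_t payload_bytes) {
        const std::uint64_t begin = now > busy_until_ ? now : busy_until_;
        // Round the payload up to a whole number of cipher blocks.
        const std::uint64_t blocks =
            (static_cast<std::uint64_t>(payload_bytes) + block_bytes_ - 1) /
            block_bytes_;
        busy_until_ = begin + setup_cycles_ + blocks * cycles_per_block_;
        return busy_until_;
    }

private:
    std::uint32_t block_bytes_;
    std::uint32_t cycles_per_block_;
    std::uint32_t setup_cycles_;
    std::uint64_t busy_until_ = 0;
};
```

A virtual-platform model of the same peripheral would be close to the inverse of this: a register file and a working encryption routine, with the timing calculation left out.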
For NOCs and memory controllers, the difference between the architect's view and the programmer's view is yet starker. These modules are the central core of the architect's model, but do not even exist in the programmer's view, except maybe as a few register stubs, their main functionality being fully software-transparent.
Usage example of model
The performance model exists because a generic and rapid-deployment performance model is essential for a platform-based SoC factory, as we've discussed in this article. But more than this, certain things cannot be achieved without this kind of technology. Here is an example in which the performance limits of an OMAP-2 SoC have been probed using the model.
The use case is a videoconference. This is easy to say, but when it comes to the details many choices need to be made. Development of software able to implement all these choices as run-time or even compile-time options is practically impossible. On FPGA or silicon each videoconference to be analysed requires development resources. On the SystemC performance model, on the other hand, a regression of 144 videoconferences has been created that can be applied to any OMAP application processor. The results of this are available to OMAP marketing for understanding the limits of each platform in advance of specific customer queries. Figure 6 shows the results.
Some of the parameters that are varied in order to create the 144 scenarios include:
- Display size, refresh rate, and orientation.
- Configuration of windows on the display, including rescaling and rotation requirements for the video.
- Size of the base image used in the videoconference.
- Compression algorithm used (MPEG or other, with stabilization or without, and so on).
- Mapping of videoconference functional elements to OMAP hardware.
- Configuration of SoC, including external memory size and performance, arbitration options, burst usage, and clock frequencies.
The figure shows some of the results generated. These are bandwidths measured on the external memories. Similar bar charts exist for latencies, CPU occupancies, FIFO occupancies for hard-real-time functions, and so on.
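Figures of this kind come from the statistics-gathering monitors attached to OCP interfaces. As a rough illustration of the arithmetic such a monitor performs, the sketch below accumulates per-transfer request and response cycles and derives mean latency, worst-case latency, and average bandwidth. It is plain C++, not the TI monitor; the class and method names, and the choice of measuring bandwidth over the window from first request to last response, are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical statistics monitor for one bus interface.
class StatsMonitor {
public:
    // Records one completed transfer.
    void record(std::uint64_t request_cycle, std::uint64_t response_cycle,
                std::uint32_t bytes) {
        assert(response_cycle >= request_cycle);
        const std::uint64_t latency = response_cycle - request_cycle;
        if (count_ == 0 || request_cycle < first_request_) {
            first_request_ = request_cycle;
        }
        if (response_cycle > last_response_) last_response_ = response_cycle;
        if (latency > max_latency_) max_latency_ = latency;
        total_latency_ += latency;
        total_bytes_ += bytes;
        ++count_;
    }

    double mean_latency_cycles() const {
        return count_ ? static_cast<double>(total_latency_) / count_ : 0.0;
    }

    std::uint64_t max_latency_cycles() const { return max_latency_; }

    // Average bandwidth in bytes per second, for a bus clock of clock_hz,
    // over the window from the first request to the last response.
    double bandwidth_bytes_per_s(double clock_hz) const {
        const std::uint64_t window = last_response_ - first_request_;
        return window ? total_bytes_ * clock_hz / window : 0.0;
    }

private:
    std::uint64_t count_ = 0;
    std::uint64_t total_bytes_ = 0;
    std::uint64_t total_latency_ = 0;
    std::uint64_t max_latency_ = 0;
    std::uint64_t first_request_ = 0;
    std::uint64_t last_response_ = 0;
};
```

For instance, two transfers of 64 and 96 bytes spanning cycles 0 to 40 give 160 bytes over 40 cycles, or 4 bytes per cycle; multiplying by the clock frequency converts that to bytes per second.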
The need for standards
In this article, I've described the OMAP-2 platform and its SystemC-based performance-modeling infrastructure. This infrastructure is one of the technologies essential if real benefits are to be drawn from a platform-based SoC methodology.
My emphasis has been more on the platform-user's requirements and workflow than on those of the platform-supplier. Within TI's OMAP organization, the distinction between platform-user and platform-supplier is relatively small and most of the issues raised apply to both.
The use of SystemC for performance modeling must fit into a broader methodology for SoC definition and development. Electronic System Level [Design], or ESL, is generally used as an umbrella term for this kind of thing. ESL encompasses themes as diverse as synthesis from sequential C code to RTL and virtual platform use for early software development. Within TI, a number of tools and technologies have been or are being adopted and SystemC is seen as a part of the overall ESL puzzle rather than as a central uniting theme. Non-SystemC ESL activity includes use of executable specifications, requirements and use-case capture, top-level SoC integration automation, and memory-map and register-map capture. The performance-analysis modeling environment must interwork with these tools, and therefore it's important that the SystemC TLM technology not restrict itself to a SystemC-for-everything worldview. It's not expected that the OMAP performance model will be used to generate RTL. Rather, it's expected that the same tool will generate the top-level RTL and the performance model.
Work on the modeling platform is continuing but in some areas there is a strong desire for public standards, to replace the ad-hoc technology developed within TI, making it cleaner, available to third-party suppliers, and supportable by EDA vendors. Also there needs to be widespread agreement on the types of model that are required. So far, it seems that the types of model used in OMAP-2 performance modeling have not been widely proposed.
James Aldis is co-chair of the OCP-IP system-level design working group as well as a senior member of Group Technical Staff at Texas Instruments, where he works on the architecture of OMAP SoCs, specifically on-chip networking and SoC performance modeling. He joined TI in 2002. Previously he worked at Ascom AG in Switzerland on specification and implementation of wireless LAN, cellular and powerline communications modems. He has many academic publications and has made contributions to standardisation of GSM, UMTS, 802.11, and the language SystemC. His degree is in pure mathematics from the University of Liverpool and his PhD is from the University of York, on the subject of coded modulation and multidimensional geometry.
1. OMAP Platform, http://focus.ti.com/general/docs/wtbu/wtbugencontent.tsp?templateId=6123&navigationId=11988&path=templatedata/cm/general/data/wtbovrvw/omap
2. Open Core Protocol (OCP), www.ocpip.org
3. For OCP transaction-level modeling, see: Haverinen, Anssi, Maxime Leclercq, Norman Weyrich, and Drew Wingard. “White Paper for SystemC based SoC Communication Modeling for the OCP Protocol,” V1.0, October 14, 2002, www.ocpip.org/uploads/documents/ocpip_wp_SystemC_Communication_Modeling_2002.pdf
4. SystemC, www.systemc.org
5. OMAP Code Development Tools, http://focus.ti.com/general/docs/wtbu/wtbugencontent.tsp?templateId=6123&navigtionId=12013&path=templatedata/cm/general/data/wtbmiddl/omap_development