Functional TLM simplifies heterogeneous multiprocessor software development - Embedded.com

Multi-processor architectures are becoming prevalent in today’s embedded systems to keep up with growing computational requirements, throughput and integrated system features.

As an example, high-end smartphones already contain a plethora of micro-processors (MPUs) and digital signal processors (DSPs) to provide advanced (2.75G and 3G) modem and application processing, as well as WiFi, GPS and Bluetooth functionality.

Current embedded software design practices are not especially well equipped to deal with the complexities of developing inter-processor communication (IPC) software for these heterogeneous architectures.

However, virtual prototyping technology is emerging that allows the creation of a high-performance, functional software model of an embedded system that is so complete that it fully mirrors the hardware functionality.

Based on functional transaction-level modeling (F-TLM), a virtual platform can be generated by combining high-speed processor instruction-set simulators and high-level, fully functional C/C++ models of the hardware building blocks. The result is a high-level model of hardware, sophisticated enough for a software developer to substitute for the physical device.

With such platforms (Figure 1 below), software teams can start developing, integrating and testing software code long before silicon is available. This technology enables concurrent software development at all levels, including ROM code, firmware, device drivers, OS porting, middleware and application development.

F-TLM balances requirements for multiprocessor software
F-TLM-based virtual platform modeling strikes the right balance between instruction-set simulators, hardware/peripheral models and system I/O to allow early and concurrent software development. F-TLMs focus on the hardware aspects that are relevant to the software developer, and typically avoid hardware details that are not exposed through the software programming model. The right F-TLM environment combines three essential elements:

1. Instruction-accurate ISSs: the F-TLM-based CPU models generated for the heterogeneous devices in a typical multiprocessor design are capable of modeling CPU state and executing target program binaries. They include MMU models, which can be complemented with functional cache models that give run-time statistics about cache hit/miss numbers. By instruction-accurate, we mean that they execute target program binaries on an instruction-by-instruction basis and are therefore binary compatible. However, no model of the CPU’s pipeline is included and no cycle detail is maintained.

2. High-level transaction-level bus models: the complex bus traffic in a multiprocessor design is reduced to simple read and write transactions, and details such as the different bus phases, chip selects, re-tries and bus turn-around after arbitration are abstracted away. Transactions are implemented as read and write function calls on simple bus models, which model the address decoding and the control registers of various bus elements (e.g. bus bridges, arbiters, etc.).

3. Functional peripheral models: the register interface, programming model, functionality and communication with other peripherals or system I/O are modeled for each hardware peripheral in a multiprocessor design. The modeling focus is on the interaction between the software and the functionality of the peripheral. For example, programming a certain control/command register of a camera interface controller might start a data flow from a secondary phone camera.

This type of interaction and functionality can be included in the functional model. Hardware aspects like internal pipelining, arbitration on internal hardware accelerators, flow control to access system busses, etc. are typically not relevant from the perspective of the running software, and are hence not included in the platform model.
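
The bus and peripheral elements above can be sketched in a few lines of C++. This is a minimal illustration, not code from any actual virtual platform: the names SimpleBus, Peripheral and CameraInterface, the register offsets and the pixel values are all assumptions made up for the example. It shows a transaction reduced to a function call with address decoding, and a control-register write that starts a data flow.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical interface: software sees a peripheral only through registers.
struct Peripheral {
    virtual ~Peripheral() = default;
    virtual std::uint32_t read(std::uint32_t offset) = 0;
    virtual void write(std::uint32_t offset, std::uint32_t value) = 0;
};

// Transaction-level bus: one transaction is one function call. Only address
// decoding is modeled; bus phases, re-tries and arbitration are left out.
class SimpleBus {
public:
    void map(std::uint32_t base, std::uint32_t size, Peripheral& p) {
        assert(size > 0);
        regions_.push_back({base, size, &p});
    }
    std::uint32_t read(std::uint32_t addr) {
        const Region& r = decode(addr);
        return r.target->read(addr - r.base);
    }
    void write(std::uint32_t addr, std::uint32_t value) {
        const Region& r = decode(addr);
        r.target->write(addr - r.base, value);
    }

private:
    struct Region {
        std::uint32_t base;
        std::uint32_t size;
        Peripheral* target;
    };
    const Region& decode(std::uint32_t addr) const {
        for (const Region& r : regions_)
            if (addr >= r.base && addr - r.base < r.size) return r;
        throw std::out_of_range("bus error: unmapped address");
    }
    std::vector<Region> regions_;
};

// Functional model of an imaginary camera interface controller.
class CameraInterface : public Peripheral {
public:
    enum : std::uint32_t {
        CTRL = 0x0,    // bit 0: start capture
        STATUS = 0x4,  // bit 0: data available
        DATA = 0x8     // each read pops one pixel word
    };

    std::uint32_t read(std::uint32_t offset) override {
        if (offset == STATUS) return frame_.empty() ? 0u : 1u;
        if (offset == DATA && !frame_.empty()) {
            std::uint32_t word = frame_.back();
            frame_.pop_back();
            return word;
        }
        return 0u;
    }
    void write(std::uint32_t offset, std::uint32_t value) override {
        // Setting the start bit makes data available at once; sensor timing
        // and internal pipelining are deliberately not modeled.
        if (offset == CTRL && (value & 1u)) frame_ = {0xC0FFEEu, 0xBEEFu};
    }

private:
    std::vector<std::uint32_t> frame_;
};
```

A driver under test would map the controller at some base address, write the start bit through the bus, then poll STATUS and drain DATA, exactly as it would against the register map of the real device.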

Table 1 above shows an example of achievable execution times for F-TLMs (both in absolute clock time and in MIPS normalized per GHz of the host PC) for a complete board-level simulator, consisting of a complex multiprocessor-based System-on-Chip (SoC) and several board-level discrete components. Note the lower effective MIPS per GHz during the OS boot phase, when the OS programs the various peripherals and waits for each peripheral to return after initialization.
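
The normalization mentioned above is simple arithmetic: divide the achieved MIPS by the host clock frequency in GHz. A small helper, written here only as an illustration (the function name and its parameters are assumptions, not part of any tool described in this article), makes the calculation explicit.

```cpp
#include <cassert>

// Effective simulation speed in MIPS, normalized per GHz of host clock.
// target_instructions: instructions executed by the simulated CPU(s)
// wall_seconds:        elapsed wall-clock time on the host
// host_ghz:            host PC clock frequency in GHz
double mips_per_ghz(double target_instructions, double wall_seconds,
                    double host_ghz) {
    assert(wall_seconds > 0.0 && host_ghz > 0.0);
    double mips = target_instructions / wall_seconds / 1e6;
    return mips / host_ghz;
}
```

With made-up numbers, 3 billion target instructions simulated in 60 seconds on a 2.5 GHz host correspond to 50 MIPS, or 20 MIPS per GHz.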

Cycle-accurate versus F-TLM models
In contrast to F-TLM models, cycle-accurate models, though they provide a great level of detail and timing metrics, are notoriously hard to validate, take considerably more time to develop and run more slowly, typically on the order of 500 kilocycles per second. This performance level, though acceptable for low-level software development tasks like ROM code and firmware where the amount of software to be run is small, is too slow for effective high-level OS porting, middleware integration and application development.

An F-TLM-based virtual prototyping approach has several advantages in the complex multiprocessor environment typical of many embedded consumer devices.

Deeply integrated SoCs at the heart of such systems contain several dozen peripherals and multiple on- and off-chip busses, some of which might not be visible on the physical target because of pin or JTAG limitations, making programming and debugging these devices increasingly complex.

Providing increased visibility and control over the target is very desirable to keep software development productivity high for these new SoCs. The goal of high overall execution speed, and the observation that 95 percent of simulation time is spent in the CPU ISS(s) and the memory models, drove the simulator development towards the use of native, compiled C++ code for the critical pieces like CPU ISSs and memory models.

In such an F-TLM environment, a graphical finite-state machine (FSM) language, such as ‘Magic-C’, is required to capture peripherals and their functionality. Magic-C combines some of the graphical depiction capabilities of the Specification and Description Language (SDL) with the execution power of ANSI-C.

The concurrent, communicating FSM execution paradigm (see Figure 2 below) allows for easy description of concurrent hardware entities, whilst its graphical nature enables a graphical hardware debugger, which can be attached to the running simulation and which supports novel features like hardware breakpoints and hardware single-stepping.

With this graphical language in place, a developer can simultaneously debug hardware, through the Magic-C hardware debugger, and software, through a classical software debugger attached to the running simulation.
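
The idea of concurrent, communicating FSMs with hardware breakpoints and single-stepping can be approximated in plain C++. The sketch below is not Magic-C and does not reflect its syntax or debugger; the Fsm and FsmSimulator types, the states and the events are all hypothetical. It only illustrates that a breakpoint can be tied to a hardware state change rather than to a line of software.

```cpp
#include <cassert>
#include <deque>
#include <vector>

enum class State { Idle, Busy, Done };
enum class Event { Start, Tick };

// One hardware block modeled as a finite-state machine with an event queue.
struct Fsm {
    State state = State::Idle;
    std::deque<Event> inbox;
    Fsm* peer = nullptr;  // machine that is started when this one completes

    // Consume one queued event and take at most one transition.
    void step() {
        if (inbox.empty()) return;
        Event e = inbox.front();
        inbox.pop_front();
        if (state == State::Idle && e == Event::Start) {
            state = State::Busy;
        } else if (state == State::Busy && e == Event::Tick) {
            state = State::Done;
            if (peer) peer->inbox.push_back(Event::Start);
        }
    }
};

// Steps all machines in a fixed order and watches for one state entry.
struct FsmSimulator {
    std::vector<Fsm*> machines;
    const Fsm* watched = nullptr;       // breakpoint: this machine ...
    State watched_state = State::Done;  // ... entering this state

    // "Hardware single-step": every machine takes at most one transition.
    // Returns true if the watched machine entered the watched state.
    bool single_step() {
        bool hit = false;
        for (Fsm* m : machines) {
            State before = m->state;
            m->step();
            if (m == watched && before != watched_state &&
                m->state == watched_state) {
                hit = true;
            }
        }
        return hit;
    }

    // Run until the breakpoint hits. Returns the number of steps taken,
    // or -1 if max_steps elapsed without a hit.
    int run(int max_steps) {
        assert(max_steps >= 0);
        for (int i = 1; i <= max_steps; ++i)
            if (single_step()) return i;
        return -1;
    }
};
```

Because the simulator owns every state transition, stopping on "machine B entered Busy" is as cheap as a comparison, something a physical target cannot offer for internal hardware state.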

The simulation framework should also leverage standard model interfaces and APIs as much as possible, to promote re-use of component models. Mostly, this means a set of standard transaction-level interfaces and a standard level of abstraction, which all peripheral models use to connect to busses.

Building a multiprocessor SoC with F-TLM
The power of an F-TLM approach in modeling multiprocessor architectures is apparent in how it has been used in the development of applications based on Texas Instruments’ OMAP Platform. OMAP is a robust software infrastructure and comprehensive support network for the rapid development of Internet appliances, 2.5G and 3G wireless handsets and PDAs, and other multimedia-enhanced devices.

To achieve this, OMAP SoCs leverage an advanced heterogeneous RISC/DSP architecture, combined with dedicated 2D/3D graphics accelerators and Imaging Video Accelerators (IVA), some of them containing an additional RISC core, resulting in a network of different concurrent on-chip CPUs.

Simulator Use Model – To accelerate both internal and external software development for the OMAP architecture, TI joined with Virtio to produce several OMAP virtual platforms, months ahead of silicon availability. A screenshot of the OMAP virtual platform is depicted in Figure 3 below.

By aligning internal software development phases with OMAP platform deliveries, software development started as early as four weeks after the beginning of the platform development. All of this occurred while the architectural spec and hardware design were still being determined, realizing true parallel development.

The F-TLM-based virtual platform enabled early development of, first, the open OS boot loader using low-level ARM debug tools, then the OS HAL and kernel port, and finally the extended set of device drivers. All of this was done with the actual target development tools, with no changes required to the development flow.

At the same time, the DSP teams worked on the porting and bring-up of the DSP/BIOS real-time operating system. In a later phase, the team developed the inter-processor communication (IPC) layer and components between the RISC core and DSP, and added these to the high-level OS and DSP/BIOS sides.

Once the Board Support Packages (BSPs) were developed and certified, TI delivered to initial customers a desktop development environment consisting of the OMAP virtual platform and the BSP, enabling customers to start integrating device middleware and critical applications and accelerate their device development.

Inter-Processor Communication (IPC) Benefits – Development of the IPC software for a network of heterogeneous processors is typically one of the most challenging tasks of the overall software development for these SoCs. Several key features of virtual platforms provided significant contributions to the acceleration of this development and the increase in development productivity.

Virtual platforms provide increased system visibility, which makes it easier for the developer to isolate and debug IPC problems. Magic-C hardware breakpoints and debugging features provide visibility into the state of on-chip IPC hardware (for example on-chip mailboxes and semaphores), and the overall system and CPU state at any point in time.

In contrast to a physical development target, when a CPU on the virtual platform is stopped through a JTAG debugger connection, the complete system, including the hardware peripheral clocks, is stopped rather than only that specific CPU instance.

This causes the IPC hardware (and all other peripherals) to stop, synchronized with the CPUs. Thus, at any point in time, the system state is intact and not corrupted by interrupts firing while the CPU is halted, for example. Preserving the system state whenever the CPU is stopped contributes to improved development productivity.
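
The synchronized halt described above follows naturally when the CPU and the peripherals advance from the same simulation loop. The following is a minimal sketch under that assumption; VirtualPlatform, Cpu and Timer are hypothetical names, and the timer stands in for any peripheral that could raise an interrupt.

```cpp
#include <cassert>
#include <cstdint>

// Imaginary timer peripheral that raises an interrupt every `period` ticks.
struct Timer {
    std::uint32_t count = 0;
    std::uint32_t period = 0;
    bool irq_pending = false;
    void tick() {
        if (++count == period) {
            irq_pending = true;
            count = 0;
        }
    }
};

// Stand-in for an instruction-set simulator: it only counts instructions.
struct Cpu {
    std::uint64_t instructions = 0;
    void execute_one() { ++instructions; }
};

struct VirtualPlatform {
    Cpu cpu;
    Timer timer;
    bool halted = false;  // set by the debugger

    // CPU and peripheral clocks advance together or not at all, so a halt
    // freezes the timer in the same step in which it freezes the CPU.
    void run(int steps) {
        assert(steps >= 0);
        for (int i = 0; i < steps && !halted; ++i) {
            cpu.execute_one();
            timer.tick();
        }
    }
};
```

While `halted` is set, calls to `run` change nothing: the timer count stays where it was and no interrupt becomes pending, which is the property that keeps the system state intact during a debug stop.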

Deterministic simulation execution and multi-core scheduling algorithms ensure predictability, which makes bugs easy to reproduce, run after run of the simulation.
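
One common way to obtain this kind of determinism is to interleave the simulated cores in a fixed order, each for a fixed instruction quantum, instead of letting host threads race. The sketch below assumes that approach; it is not a description of the scheduler used in the OMAP virtual platforms, and the Core type and run_schedule function are made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// A simulated core, reduced to what the scheduler needs to know.
struct Core {
    char name;               // e.g. 'A' for the RISC core, 'D' for the DSP
    std::uint32_t quantum;   // instructions executed per scheduling slot
    std::uint64_t executed;  // total instructions executed so far
};

// Round-robin over the cores in a fixed order. The schedule depends only on
// the inputs, never on host timing, so every run yields the same trace.
std::string run_schedule(std::vector<Core> cores, int rounds) {
    assert(rounds >= 0);
    std::string trace;
    for (int r = 0; r < rounds; ++r) {
        for (Core& c : cores) {
            c.executed += c.quantum;
            trace += c.name;
            trace += std::to_string(c.executed);
            trace += ' ';
        }
    }
    return trace;
}
```

Running the same configuration twice produces identical traces, so an IPC bug that appears at a given point in the interleaving appears at the same point on every run.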

In addition, simulation supports tight multi-core debugging by providing low-level multi-core JTAG-type control over the target. For example, any debugger can halt the complete platform execution whenever an interesting event happens in either the DSP or RISC domain, and the engineer can inspect both domains at that event point.

With advanced simulation technology, all software development tasks, including DSP/BIOS operating system porting and IPC software development, were completed prior to silicon tape-out. An initial internal software productivity survey indicates productivity increased by a measurable factor of 2x to 5x, compared with development using a physical target.

Filip Thoen is Chief Technical Officer, Virtio Corp., Campbell, Calif.
