
Picking the right MPSoC-based video architecture: Part 3

Multimedia SoCs frequently house multiple processor cores that share the task of running the operating system and controlling the critical and noncritical on-chip functional-unit resources.

In this context, an efficient bus architecture and arbitration scheme (to reduce contention) play important roles in maximizing system performance. In addition, for many applications, the performance of a multiprocessor system relies heavily on efficient communication between the processors and a balanced distribution of computing tasks among them.

With multiple CPUs and a plethora of functional units, on-chip communication poses a critical design problem that is often solved using multilevel hierarchical busses connected via local bridges, the bridges primarily serving as protocol converters between different bus systems and/or connectors between busses with different speeds (e.g., high-speed processor bus and low-speed peripheral bus).

CoreConnect, for example, has three levels of hierarchy: processor local bus (PLB), on-chip peripheral bus (OPB), and device control register (DCR). PLB provides a high-performance and low-latency processor bus with separate read and write transactions, whereas OPB provides low speed with separate read and write data busses to reduce bottlenecks caused by slow I/O devices; the daisy-chained DCR offers a relatively low-speed datapath for communicating status and configuration information.

The advanced microcontroller bus architecture (AMBA) from ARM has two levels of hierarchy: the advanced high performance bus (AHB), similar to PLB, and the advanced peripheral bus (APB), similar to OPB. CoreConnect and AMBA are both pipelined busses with bridges to increase the communication efficiency between the high- and low-speed busses (for data transfer between them).

CoreFrame from Palmchip Company, on the other hand, is a nonpipelined bus that also has two independent bus types: Mbus for memory transfer and Palmbus for I/O devices. The PNX-8500 SoC from Philips is no exception; it also features, as shown in Figure 14-8, an on-chip hierarchical bus structure supported by local bridges.

Even though the bus structures across various SoCs (mentioned above) resemble one another at a coarse level of comparison, they do differ in their characteristics and implementation details (e.g., function, width, delay, throughput, pipelining, utilization, number of segments, number of busses, number and type of bridges, number of bus agents, and so on); an in-depth analysis of the target application, the desired timing, and the on-chip communication pattern determine the exact implementation.

To this end, the next few sections offer insights into the analysis that guided the final bus implementation in the PNX-8500 SoC.

PNX-8500 Structure
In PNX-8500, the multimedia processing and the control processing functions are split between two CPUs—the TriMedia CPU (TM32) and the MIPS RISC CPU (MIPS32). Thus, each CPU is responsible for the peripherals that belong to its task domain. This leads to the concept of separate processor busses, whereby each CPU controls all the devices on its local bus.

Not all the peripheral devices, however, can be owned by one CPU in all use cases (applications), and so provisions have been made so that every peripheral is still accessible from both CPUs, but with a preference. If a peripheral is indeed shared at run time, the CPUs must negotiate its availability through the use of semaphores.
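That negotiation can be pictured as a simple claim/release protocol built on a shared, memory-mapped semaphore register. The sketch below is only an illustration of the idea; the register address, its latching behavior, and the CPU identifiers are assumptions and do not reflect the actual PNX-8500 register map or semaphore hardware.

```c
#include <stdint.h>

/* Hypothetical memory-mapped hardware semaphore guarding one shared
 * peripheral; the real PNX-8500 register map is not shown here. */
#define SEM_PERIPH_X  (*(volatile uint32_t *)0x1BE00000u)
#define SEM_FREE      0u

/* Each CPU (TM32 or MIPS32) writes its own nonzero ID to claim the
 * peripheral.  A hardware semaphore of this style is assumed to latch
 * only the first writer, so reading back tells us whether we won. */
static int claim_peripheral(uint32_t cpu_id)
{
    if (SEM_PERIPH_X != SEM_FREE)
        return 0;                        /* owned by the other CPU  */
    SEM_PERIPH_X = cpu_id;               /* attempt to claim        */
    return (SEM_PERIPH_X == cpu_id);     /* did our write stick?    */
}

static void release_peripheral(void)
{
    SEM_PERIPH_X = SEM_FREE;             /* hand the peripheral back */
}
```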

Combining both CPUs in one system (from an SoC point of view) lowers the overall cost of computing by sharing system resources such as main memory, disk, and network interfaces.

All the on-chip functional units (peripherals) are programmable via CPU writes to their control registers. Because these control registers are memory-mapped, programmed reads from and writes to them are commonly referred to as memory-mapped input/output (MMIO) or programmed input/output (PIO) transactions. Even though each peripheral can be addressed by both of the CPUs, it is "usually" read or written by the CPU to whose local bus it is connected.
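In software, such a PIO transaction is nothing more than a load or store to a fixed physical address through a volatile pointer. The fragment below sketches that idiom; the peripheral base address, register offsets, and bit meanings are invented for illustration and are not the actual PNX-8500 memory map.

```c
#include <stdint.h>

/* Hypothetical base address and register layout for one peripheral. */
#define VIDEO_IN_BASE   0x1BD00000u
#define REG_CTRL        0x00u            /* control register */
#define REG_STATUS      0x04u            /* status register  */

#define MMIO32(addr)    (*(volatile uint32_t *)(addr))

static void video_in_start(void)
{
    /* PIO write: program the control register to enable capture. */
    MMIO32(VIDEO_IN_BASE + REG_CTRL) = 0x1u;

    /* PIO read: poll the status register until the unit reports ready. */
    while ((MMIO32(VIDEO_IN_BASE + REG_STATUS) & 0x1u) == 0u)
        ;
}
```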

Bus System Requirements
A variety of bus-architecture options were explored for PNX-8500 before deciding on the exact bus structure. The implementation was guided by the following architectural requirements:

* the cache traffic of a CPU must be separated from its register-access traffic.

* the register-access traffic from the two CPUs should be separated.

* the CPUs must have a high-performance and low-latency path to memory (relative to the peripherals).

* each CPU must have a low-latency access to the peripherals on its local bus.

* all the registers in the various peripheral units must be accessible from the two CPUs, the PCI block, the BOOT block, and the EJTAG block.

With the bus requirements nailed down, the next step was to decide on the bus topology. However, before exploring different topologies, a study of both tristate and point-to-point bus implementations was conducted in order to find out which of the two better suited the needs.

Tristate vs Point-to-Point Bus Design
A comparative study of tristate versus point-to-point bus implementations, as outlined in Table 14-1 below, shows that a point-to-point bus architecture is desirable for designs requiring high performance, simple testability, and a simpler physical layout.

Table 14.1. Comparison of Tristate and Point-to-point busses.

However, one main problem with the point-to-point bus architecture is that it does not easily allow for multiple-master access to peripherals. For example, if a peripheral requires access by four masters, it is required to have four slave interfaces; adding an additional master will require changes to the peripheral to support five masters. Thus, the point-to-point bus is not very modular or scalable.

For PNX-8500, it was decided to use a high-performance point-to-point memory bus—the MMI bus—for bandwidth- and latency-critical access to the external SDRAM.

For access to modules' slave control registers and for lower bandwidth direct memory access (DMA) peripherals, a tristated bus—the PI bus—seemed more appropriate.

Another advantage of using the PI tristate bus was that the PI bus had been used extensively throughout the company and, therefore, a large portfolio of IPs with this interface was already available.

Figure 14.9. Bus topology options.

Comparison of Bus Topologies
As mentioned before, high-bandwidth peripherals and peripherals requiring low-latency access to the external memory clearly call for a direct interface to the point-to-point memory bus.

For the other peripherals that do not require very high DMA bandwidth, the following three architectural options, as illustrated in Figure 14-9 above, were evaluated:

* shared PIO and DMA on a common PI bus.
* split PIO and DMA on separate PI busses.
* split PIO and DMA on the PI and the MMI bus, respectively.

A comparison of the different options is shown in Table 14-2 below.

For PNX-8500, it was decided to go for Option 1, with shared PIO and DMA on a common PI bus, for the modules that are not bandwidth-hungry. The main reason is that most of the existing portfolio of IPs already supported this topology and, therefore, this option had the least risk.

Table 14.2. Comparison of bus topologies.

The Final Structure
The backbone of the on-chip communication infrastructure in PNX-8500 was finally provided by two separate bus systems: a 64-bit point-to-point high-performance MMI bus, also called the memory bus or the DMA bus, and a 32-bit tristated PI bus (Peripheral Interconnect Open Processor Initiative Standard 324).

The MMI bus provides high-speed memory access to those on-chip units that require high bandwidth and low latency. There is no PIO traffic on this bus. The bus connects to the external memory (SDRAM) via a 64-bit, 143-MHz memory management interface that generates the required SDRAM protocol but isolates the on-chip resources by using the proprietary DVP protocol.
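As a rough check of the headroom this interface provides: an 8-byte (64-bit) datapath clocked at 143 MHz yields a peak theoretical throughput of 8 × 143 × 10^6 ≈ 1.14 GB/s, which is then shared among the MMI clients according to the arbitration scheme described next.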

The MMI also controls access (by the on-chip components) to the memory highway via a round-robin arbitration algorithm with programmable bandwidths.
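One way to picture a round-robin arbiter with programmable bandwidths is a weighted round-robin scheme in which each client is programmed with a share of transfer slots per arbitration round. The C model below is purely illustrative of that idea, with invented client counts and shares; it is not the actual MMI arbiter implementation.

```c
#include <stdint.h>

#define NUM_CLIENTS 4

/* Programmable bandwidth shares: slots granted per arbitration round. */
static uint8_t share[NUM_CLIENTS]  = { 4, 2, 1, 1 };
static uint8_t credit[NUM_CLIENTS];        /* slots left this round */

/* Returns the index of the client granted the next memory slot.
 * request[i] is nonzero if client i has a pending transfer. */
static int mmi_arbitrate(const uint8_t request[NUM_CLIENTS])
{
    static int last = NUM_CLIENTS - 1;

    /* Scan in round-robin order, granting only clients with credit left. */
    for (int n = 0; n < NUM_CLIENTS; n++) {
        int i = (last + 1 + n) % NUM_CLIENTS;
        if (request[i] && credit[i] > 0) {
            credit[i]--;
            last = i;
            return i;
        }
    }

    /* No requester has credit left: start a new round and retry once. */
    for (int i = 0; i < NUM_CLIENTS; i++)
        credit[i] = share[i];
    for (int n = 0; n < NUM_CLIENTS; n++) {
        int i = (last + 1 + n) % NUM_CLIENTS;
        if (request[i]) {
            credit[i]--;
            last = i;
            return i;
        }
    }
    return -1;   /* bus idle: no pending requests */
}
```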

Unlike the memory bus, the PI bus is not only used for MMIO reads and writes to memory-mapped control registers of the various peripherals, but it also provides a medium-bandwidth DMA path via a bridged gateway connection to the MMI bus. The PI bus itself is divided into three different segments:

F-PI (Fast PI) bus: used for low-latency access to the memory and selected peripherals by the MIPS CPU.

M-PI (MIPS-PI) bus: used for access to the peripherals typically controlled by the MIPS CPU.

T-PI (TriMedia-PI) bus: used for access to the peripherals typically controlled by the TriMedia TM32 CPU.

The MMI bus and the various PI bus segments are connected via a number of bridges, as shown in Figure 14-8. The FC-Bridge, the MC-Bridge, and the TC-Bridge act as gateways providing memory access to the corresponding PI segments, whereas the C-Bridge acts as a crossover PI-to-PI MMIO bridge that allows memory-mapped I/O access from each processor to control and/or observe the status of all peripheral modules.

The M-Bridge bridges transactions between the fast F-PI bus and the slower M-PI peripheral bus segment on the MIPS side.
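The net effect of the bridges is that both CPUs see a single flat MMIO address map, and simple address decoding determines which PI segment (and therefore which bridge path) serves a given register access. The sketch below illustrates that routing decision with invented address windows; the real PNX-8500 decode boundaries are not given in this article.

```c
#include <stdint.h>

typedef enum { SEG_FPI, SEG_MPI, SEG_TPI } pi_segment_t;

/* Hypothetical MMIO windows for the three PI segments. */
#define MPI_BASE  0x1B000000u
#define MPI_END   0x1B7FFFFFu
#define TPI_BASE  0x1B800000u
#define TPI_END   0x1BFFFFFFu

/* Decide which PI segment serves an MMIO address.  An access that
 * originates on the "other" CPU's side simply crosses the C-Bridge
 * first and is then decoded the same way. */
static pi_segment_t decode_pi_segment(uint32_t addr)
{
    if (addr >= MPI_BASE && addr <= MPI_END)
        return SEG_MPI;     /* peripherals usually owned by the MIPS */
    if (addr >= TPI_BASE && addr <= TPI_END)
        return SEG_TPI;     /* peripherals usually owned by the TM32 */
    return SEG_FPI;         /* fast PI bus: memory and select devices */
}
```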

Designing for Testability
With the push toward increased product performance and higher design complexity, the current “giant” SoCs not only incorporate multiple IP cores, they also mix diverse circuits such as digital random logic, functional units, processor cores, static memories, embedded dynamic random access memories (DRAMs), and analog circuits on a single chip.

New circuit types such as FPGAs, flash memories, radio frequency (RF) devices, and microwave devices are also on the verge of becoming a regular feature of the on-chip circuitry.

It is only a matter of time before we move beyond the realm of conventional circuits and “electronics-only” ICs and start integrating optical devices and micro-electro-mechanical (MEM) elements in a regular SoC.

Today's core-based SoC designs, unfortunately, pose multiple testability problems. For one, the nonhomogeneous circuit types and cores exhibit different defect behaviors and require different test solutions. Second, the cores may originate from widely different sources and therefore have varying degrees of "test friendliness."

An easy, cost-effective, and widely accepted method to alleviate these problems is to embed in the design, besides the traditional scan and built-in self-test (BIST) circuitry, special hardware, called sources and sinks [617], that provides easy accessibility (controllability and observability) of internal points corresponding to each circuit type and/or each IP core.

To guarantee adequate testability of cores in an SoC, as well as their test interoperability and test reusability, a new test standard—the IEEE P1500 embedded core test standard—has emerged in recent years.

The P1500 standard suggests using module-level boundary-scan structures, called wrappers, that allow intercore and intracore test functions to be carried out via a test access mechanism (TAM). The wrapper isolates an IP core from its environment and ensures that:

1) the IP core itself can be tested after it has been instantiated in the SoC.

2) the interconnect structures between the cores can also be tested.

In the case of the PNX-8500, very high test coverage was finally achieved through a combination of a very large suite of functional tests, a full-scan design methodology, BIST of the larger memories and the CPU caches, and test-shell isolation of each IP core. The test-shell isolation guarantees that every IP core is completely testable.

Test isolation is obtained by ensuring that each IP core input and output is both controllable and observable. Controllable inputs and observable outputs facilitate stand-alone as well as parallel testing of a core.

Observable inputs and controllable outputs, on the other hand, allow development of interconnect tests that verify bus connections between different IP cores. Busses and all interconnectivity between the on-chip peripherals and the CPUs in PNX-8500 were tested using interconnect tests; these tests are relatively straightforward to implement when every peripheral contains a specially designed test shell.
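Conceptually, a test shell places a multiplexed cell on every core input and output so that the pin can be driven from, or captured into, a test chain instead of the functional path. The C model below only illustrates that muxing idea; it is not the actual IEEE P1500 wrapper-cell design used in PNX-8500.

```c
#include <stdint.h>

/* Behavioral model of one wrapper cell on a core input pin. */
typedef struct {
    uint8_t functional_value;  /* value arriving from the SoC interconnect */
    uint8_t scan_value;        /* value shifted in through the test chain  */
    uint8_t test_mode;         /* 0 = normal operation, 1 = test mode      */
} wrapper_cell_t;

/* In normal mode the core sees its functional input; in test mode the
 * test chain drives the pin, making the input fully controllable. */
static uint8_t core_input(const wrapper_cell_t *c)
{
    return c->test_mode ? c->scan_value : c->functional_value;
}

/* The same cell also captures the functional value into the test chain,
 * making the pin observable for interconnect tests between cores. */
static uint8_t capture_for_scan(const wrapper_cell_t *c)
{
    return c->functional_value;
}
```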

To read Part 1, go to Architectural approaches to video processing.

To read Part 2, go to Optimal CPU configurations and interconnections.

Next in Part 4: Application-driven architecture design.

This series of articles is based on copyrighted material submitted by Santanu Dutta, Jenns Rennert, Tiehan Lv and Guang Yang to "Multiprocessor Systems-on-Chips," edited by Wayne Wolf and Ahmed Amine Jerraya. It is used with the permission of the publisher, Morgan Kaufmann, an imprint of Elsevier. The book can be purchased online.

Santanu Dutta is a design engineering manager and technical lead in the connected multimedia solutions group at Philips Semiconductors, now NXP Semiconductors. Jenns Rennert is a senior systems engineer at Micronas GmbH. Tiehan Lv attended Princeton University, where he received a PhD in electrical engineering. He also has B.S. and M.Eng. degrees from Peking University. Guang Yang is a research scientist at the Philips Research Laboratories.
