For the ever-increasing set of media-processing applications, speeding up the execution of a single instruction stream often yields only a limited gain in overall system performance. Intuition and experiments suggest that for these applications, much better performance can be achieved by employing multiple processors that share the burden of controlling the necessary real-time and non-real-time tasks.
In addition to integrating various audio, video, and peripheral interface units, an emerging trend for multimedia SoCs is to muster enough processing power by utilizing multiple CPU cores.
In fact, quite a few of the currently available ICs already feature two CPUs. For a typical digital television, digital video set-top box, or DVD recorder system, Philips provides silicon that integrates a MIPS core and a TriMedia processor.
For a mobile handset baseband processor, Philips offers a combination of an ARM core or multiple ARM cores and an Adelante digital signal processor (DSP). The OMAP5910 SoC from TI also incorporates two CPU cores—a TMS320C55x digital signal processing core and a TI-enhanced ARM core.
For the PNX-8500, an architectural decision was made quite early in the design cycle to use the TM32 TriMedia core together with the standard MIPS32 reduced instruction set computing (RISC) core.
The choice was guided by the pre-existing (external) software stacks for the target application and Philips' own portfolio of processor cores with existing applications and standard compilers. The requirements on the RISC processor were high performance, capability to run popular embedded operating systems, and efficient control of infrastructure peripherals.
The VLIW processor, on the other hand, was required to have very high performance and a multimedia-enhanced instruction set suitable for audio and video processing. This combination allows the system to be balanced and tasks to be distributed between the two CPUs.
The scheduling task in PNX-8500 is usually assigned to TM32—the CPU with the faster response time. Besides running all the audio decoding and processing functions, the TriMedia core also implements other nontrivial multimedia algorithms that are not supported directly by the hardware functional units.
The MIPS core, on the other hand, runs the operating system and, on top of it, the software application provided by the service provider. The application software deals with the accessibility of the service (conditional access) and the general control functions. All graphics-related functions are also handled by the MIPS processor.
Today's SoC architectures, as mentioned above, frequently exhibit only a few more or less powerful embedded processors that, depending on the application, might be RISC processors, VLIW cores, or DSPs. Recent trends in the embedded CPU IP market, however, show an increasing preference toward reconfigurable computing and, therefore, toward compile-time-configurable CPUs such as those offered by Tensilica and ARC.
Extensible and configurable (tailored) processors offer many of the benefits of hardware accelerators (adding hardware for specific processing problems), while solving some of the problems associated with the design of hardware accelerators.
These reconfigurable CPUs are generated with software tools and customized with regard to not only the cache and memory size but also the number and kind of peripherals that need to be supported.
Another very helpful feature of these CPUs is the capability to add custom operations associated with hardware extensions such as lookup tables, x-y memories, add-compare-select units, and multiply-accumulate (MAC) units. These extensions make the CPUs very “DSP-like” and therefore well suited for baseband communication or media-processing tasks.
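To make the idea of custom operations more concrete, the following sketch models in plain Python what two of the extensions named above compute: a multiply-accumulate step used inside a finite impulse response filter, and an add-compare-select step of the kind used in Viterbi decoding. The function names and the sample data are illustrative assumptions and do not come from any vendor's tool flow. On a configurable CPU, each helper would correspond to a single custom instruction instead of several generic ones.

```python
def mac(acc, a, b):
    """Multiply-accumulate: the work a MAC unit performs in one step."""
    return acc + a * b


def fir(samples, coeffs):
    """FIR filter built from repeated MAC steps (history starts at zero)."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc = mac(acc, samples[n - k], c)
        out.append(acc)
    return out


def acs(metric_a, branch_a, metric_b, branch_b):
    """Add-compare-select: add branch costs, keep the smaller path metric.

    Returns (surviving metric, index of the surviving path).
    """
    cand_a = metric_a + branch_a
    cand_b = metric_b + branch_b
    return (cand_a, 0) if cand_a <= cand_b else (cand_b, 1)
```

In software, the inner loop of `fir` costs a multiply, an add, and loop overhead per tap; a hardware MAC extension collapses the multiply and add into one operation, which is where the "DSP-like" behavior comes from.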
Reconfigurable hardware is believed by many to be an ideal candidate for use in SoCs because it offers a level of flexibility not available with more traditional circuitry.
Hardware reuse, for example, allows a single configurable architecture to implement many potential applications. Another flexibility offered is easy postfabrication modification that allows alterations in the target applications, bug fixes, and reuse of the SoC across multiple similar deployments to amortize design costs.
It has been argued, however, that reconfigurable logic should be used for tasks with moderate timing constraints, so that the design benefits from the silicon area saved through more efficient utilization. It has also been argued that general-purpose CPUs are equipped with many functional units, which makes it very difficult to exploit the hardware resources optimally.
With reconfigurable hardware, however, it is possible to “synthesize” the required units at the right time and to occupy only the chip area that is needed for the execution of the current task(s). Besides, dynamically reconfigurable systems also permit adaptive adjustments during run time.
The choice of the CPU architecture is determined not only by the raw processing power required but also by the chosen software architecture and the set of use cases that need to be fulfilled.
A few big monolithic CPUs may be very suitable for traditional, high-performance, general-purpose processing tasks, but they are less capable of efficiently performing many small (and special) tasks, as required, for instance, in the realm of mobile multimedia communications.
The task-switch overhead and the special functions needed to perform the multitude of tasks (e.g., in baseband processing) can impact and reduce the performance of a single monolithic general-purpose CPU.
In such a situation, it might be a better choice to disassemble the required algorithm into its base components and then map them individually onto more specialized hardware. A network of small DSPs or DSP-like CPUs with special extensions targeted toward the specifics of the application provides a more flexible platform for the mapping process, but it comes with its own set of challenges.
Problems to be solved in the new scenario comprise sharing the available storage and automating the dataflow between the CPUs, without having to introduce a new bottleneck in the form of a shared-memory resource.
The usage of a shared memory among multiple CPUs, although often welcomed by software programmers, is probably best avoided wherever possible, because it is generally associated with high-throughput demands on the off-chip memory subsystem.
Data streaming is a much more desirable alternative from a hardware point of view and is acceptable as long as such streaming is assisted by the hardware components and rendered semitransparent to the software drivers.
In fact, Philips is advocating, via the Nexperia platform, the use of multiple smaller streaming embedded processors controlled by another embedded processor to ensure the speed and reliability of a given design while saving silicon costs.
Data streaming usually assumes use-case-dependent flexible interconnections between the software and the hardware components. The software view of these connections should be abstracted to the concept of pipelines (blocking reads and writes).
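The pipeline abstraction described above can be sketched in software as bounded queues between tasks, where a write blocks while the buffer is full and a read blocks until data arrives. The example below is a minimal illustration using Python threads; the stage names, the `END` sentinel, and the buffer depth are assumptions made for this sketch, and in a real SoC the buffering would be handled by the interconnect hardware instead of a software queue.

```python
import queue
import threading

END = object()  # end-of-stream marker (an assumption of this sketch)


def producer(out_q, samples):
    """Source task: pushes samples downstream."""
    for s in samples:
        out_q.put(s)          # blocks while the downstream buffer is full
    out_q.put(END)


def scale_stage(in_q, out_q, gain):
    """Processing task: runs only when input data is available."""
    while True:
        item = in_q.get()     # blocks until data arrives
        if item is END:
            out_q.put(END)
            return
        out_q.put(item * gain)


def run_pipeline(samples, gain, depth=2):
    """Connect producer -> scale_stage -> consumer with bounded buffers."""
    q1 = queue.Queue(maxsize=depth)
    q2 = queue.Queue(maxsize=depth)
    workers = [
        threading.Thread(target=producer, args=(q1, samples)),
        threading.Thread(target=scale_stage, args=(q1, q2, gain)),
    ]
    for w in workers:
        w.start()
    results = []
    while True:               # the calling thread acts as the sink task
        item = q2.get()
        if item is END:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results
```

Calling `run_pipeline([1, 2, 3, 4, 5], gain=3)` streams the samples through both stages. Neither stage polls or manages buffer occupancy explicitly, which is the property the blocking read/write abstraction is meant to provide.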
Tree interconnect structures, which loosely connect highly concentrated clusters of CPUs, are an efficient way to keep the wiring within reason. However, to offload the CPUs from the interprocessor communication tasks, smart interconnects are desirable. These interconnects can automatically manage the buffers and dataflows associated with two partners in the network without any CPU intervention.
Data-triggered software tasks, via interrupt mechanisms, result in very efficient data-driven communication and processing within such a system. Note that the network processing elements, or processing nodes as they are called, are usually of a heterogeneous nature.
Depending on the use cases, however, a mix of general-purpose CPUs to perform control tasks, DSP-like compute elements with specific extensions for signal processing, and highly specialized hardware functions for high-performance computations (e.g., large fast Fourier transforms [FFTs] or filters for high data rates) is usually desirable.
The mix can be further augmented by field programmable gate arrays (FPGAs) and/or highly programmable array-configurable hardware in order to perform functions such as interfacing to off-chip peripherals. Size and power considerations become extremely important for such heterogeneous multiprocessors, and so does a detailed use-case analysis to identify the algorithmic requirements.
The processing networks mentioned earlier are very demanding with respect to the choice and use of software tools. Scheduling of resources and programming of the processing chain is a particularly nontrivial task. To alleviate the problem, a static mapping of a given algorithm to the hardware can be initially performed.
However, as the tools (software) get more sophisticated and as the hardware becomes capable of automatic resource scheduling and buffer management, new functions such as dynamic hardware-resource allocation at run time, real-time task switching, and time sharing of the hardware resources between different applications become possible.
Advanced software tools can potentially ease the task of mapping a certain algorithm onto a computing (processor) array. A configurable custom library of common functions (e.g., FFT, finite impulse response filtering, Viterbi decoding, Reed Solomon decoding, modulation, and so on) can also assist the mapping process.
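As a rough illustration of the static mapping mentioned above, the sketch below assigns each task of an algorithm to the least-loaded processing node of the kind it needs. This greedy heuristic is only one possible approach and is not taken from the source; the task names, node names, and cost figures in the usage note are hypothetical.

```python
def static_map(tasks, nodes):
    """Assign each task to the least-loaded node of the kind it requires.

    tasks: list of (name, required_kind, cost) tuples
    nodes: dict mapping node name -> kind
    Returns (assignment dict, load dict).
    """
    load = {name: 0 for name in nodes}
    assignment = {}
    # Place the most expensive tasks first so they spread out more evenly.
    for name, kind, cost in sorted(tasks, key=lambda t: -t[2]):
        candidates = [n for n, k in nodes.items() if k == kind]
        if not candidates:
            raise ValueError(f"no node of kind {kind!r} for task {name!r}")
        target = min(candidates, key=lambda n: load[n])
        assignment[name] = target
        load[target] += cost
    return assignment, load
```

For example, with nodes `{"cpu0": "control", "dsp0": "dsp", "dsp1": "dsp"}` and tasks `[("ui", "control", 5), ("fft", "dsp", 40), ("fir", "dsp", 25), ("viterbi", "dsp", 30)]`, the FFT lands on `dsp0` while the Viterbi decoder and the FIR filter share `dsp1`. Because the mapping is fixed before run time, it needs no scheduler support from the hardware, at the price of not adapting to changing workloads.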
Flexibility (programmability, adaptability, and upgradability) in multimedia systems mandates that the system-level functionality be implemented more in software than in hardware. Market data indicate that this is already the case today, with more than 80% of the system development efforts being in software.
Software productivity, however, is rapidly becoming a bottleneck in multimedia SoC designs because even though the amount of software is increasing exponentially, the efficiency of software design is unable to keep pace. Efficient software design environments, effective software reuse standards, easy software portability, and widespread software compatibility among products and product generations are, therefore, absolutely essential for the successful design of future multimedia systems.
The challenge of SoC integration and IP reuse
As the electronics industry demands ever more powerful and cheaper products, the latest chip design trend, thanks to the fabulous growth in the semiconductor industry, is to build all the circuitry needed for the complete electronic system on a single chip.
Remarkable advances in manufacturing technology have blurred the traditional separation between component design and system design by allowing the merger of various components—pre-designed modules, hard or soft IP blocks, and so on—on the same silicon substrate. The buzzword used to describe such designs is SoC design.
|Figure 14.6. SoC design example|
As depicted in Figure 14.6 above, these chips are no longer stand-alone components but complete silicon boards comprising nonhomogeneous circuit components and encapsulating complex system knowledge. SoC designs require all functions from the front-end to the back-end design to be integrated into a seamless flow. Figure 14.7 shows a high-level abstraction of such a design flow.
Successful SoC designs today are assembled at the IP block level, and they demand concurrent hardware-software design (co-design) and careful IP integration. To this end, reusability or recycling of an IP core, with an easy-to-use and/or an easy-to-modify interface, becomes extremely important.
A reusable IP core, often called a virtual component, typically refers to a preimplemented, optimized, and reusable module with a standardized interface that can be quickly adapted and reliably integrated, with little or no modification, to build single-chip systems.
Reusable hard (placed, routed, and verified logic), firm (a synthesized netlist with floorplanning or physical-placement guidance), or soft (synthesizable RTL description) IP cores are usually obtained from independent vendors, or they get exchanged and/or sold between various departments and cost centers of a company.
The latest design trend is to implement true systems on chips that depend on realizing the system functionality by integrating and composing preexisting, plug-and-play IP cores (each of which may potentially be designed in an ASIC flow) from different vendors and/or design groups, and possibly porting them to advanced technologies with smaller geometries.
A digital television or a digital set-top box SoC can, for example, make use of IP cores for video processing (e.g., video capture, picture enhancement and scaling, picture artifact and noise reduction, MPEG encoding and decoding, display processing, and so on), IP cores for audio processing, memory controllers, on-chip processors, and bus interfaces (e.g., USB, 1394, UART, and so on).
The design of such a chip is primarily a design-integration or a design-composition process, where a co-design methodology is followed to identify various virtual components, the components are stitched together by designing the necessary glue logic between them, and the assembled logic is pushed through a tightly coupled set of electronic design automation (EDA) tools in an integrated synthesis environment.
With the ability to integrate millions of gates on a single chip, the SoC design bottleneck is no longer in obtaining higher density on the chip—the bottleneck is more in IP selection, design planning, design optimization, hardware-software co-design, design verification, and the low-level electrical problems which one needs to cope with.
Some of the electrical problems, which did not pose that serious a threat in traditional designs but now need to be carefully considered, are signal integrity, signal noise, harmonic frequencies, transmission-line effects, thermal and voltage gradients (across the chip), clock speeds, self and mutual inductances, and delay uncertainties.
Besides, an SoC implemented using IP cores can potentially lead to a design with multiple clock domains that pose difficulties for accurate timing analysis and DFT implementation. Therefore, some of the main (and new) requirements of a successful SoC design are:
1) a well-defined on-chip interconnect architecture and communication subsystem.
2) adequate supply of high-quality, reusable (i.e., should have standardized interfaces and should allow easy creation of derivatives), retargetable (i.e., should easily be mapped to different processes and architectures), and reconfigurable (i.e., should be possible to tailor different parameters to meet functionality and performance requirements) IP cores.
3) IP-core compliance to certain rules (for architecture, design, verification, packaging, and testing) that enable integration with minimal effort.
4) availability of IP-evaluation frameworks to evaluate a third-party IP before committing to use it.
5) efficient use of design plan synthesis methods for early exploration of alternative design topologies in order to determine an optimal design flow.
6) hardware-software partitioning and co-design.
7) logic synthesis linked with physical design.
8) performance-driven place-and-route integrated with timing/power analysis.
9) a well-crafted verification and emulation environment that facilitates hardware-software coverification and reuse of drivers and tests across simulation, emulation, and validation.
10) an in-depth system-level design and analysis.
The panacea/promise of platform-based design
In order to emphasize systematic reuse, lower development cost, minimize development risks, and reduce time-to-market, by spinning off quick derivatives, another recent trend is to exploit the benefits of a platform-based design.
The basic idea behind the platform-based approach is to avoid designing a chip from scratch; some portion of the chip's architecture, as Richard Goering points out, is predefined for a specific type of application.
Usually, there is the hardware architectural platform comprising one or more processors, programmable IP cores, a memory subsystem, a communication network, and an input/output (I/O) subsystem; an application programming interface (API) then provides the required software abstraction by wrapping the essential parts (of the architectural platform) via device drivers and a real-time operating system (RTOS). Depending on the platform type, users might customize by adding hardware IP, programming FPGA logic, or writing embedded software.
When a company designs a variety of similar systems, it is really advantageous to incorporate the commonalities between different designs into a template from which the designers can derive individual designs; a platform, for Claasen, is, therefore, a restrictive set of rules and guidelines for the hardware and the software architecture, together with a suite of building blocks which fit into that architecture.
The Virtual Socket Interface Alliance's (VSIA) Platform-Based Design Development Working Group (PBD DWG) defines an SoC platform as “a library of virtual components and an architectural framework, consisting of a set of integrated and pre-qualified software and hardware virtual components (VCs), models, EDA and software tools, libraries and methodology, to support rapid product development through architectural exploration, integration and verification.”
Another common definition is the “creation of a stable microprocessor-based architecture that can be rapidly extended, customized for a range of applications, and delivered to customers for quick deployment.”
Whatever may be the definition, the primary goal is to hide the details of the design through a layered system of abstractions and allow maximum reuse of architecture, hardware components, software drivers, and subsystem functionality, whereby different hardware and/or software components can be added easily to the base platform, and several derivatives can be created quickly.
In this manner, a family of products can be delivered instead of a single product, and future generations can build on the original hardware and software investments.
There are essentially four basic steps to using a platform-based design methodology: defining the platform design methodology, creating the platform, defining the derivative design methodology, and creating the derivative SoC product(s).
Since it is usually easier to de-configure than configure, the starting point in a platform-based approach is often a de-configurable and extensible prototype reference design built from reusable IP components; some of the existing IP cores are then modified (extended), some are removed (de-configured), and some new ones added during the integration process.
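The derive-by-modification step described above can be pictured as a small operation on a configuration table: start from the reference design, drop some blocks, override parameters of others, and add new ones. The sketch below is purely illustrative; the block names and parameters in the usage note are hypothetical and do not describe any actual platform's configuration format.

```python
def derive(reference, remove=(), extend=None, add=None):
    """Return a derivative configuration built from a reference design.

    reference: dict mapping IP block name -> parameter dict
    remove: names of blocks to de-configure (drop)
    extend: dict mapping existing block name -> parameter overrides
    add: dict mapping new block name -> parameter dict
    """
    extend = extend or {}
    add = add or {}
    unknown = (set(remove) | set(extend)) - set(reference)
    if unknown:
        raise KeyError(f"not in reference design: {sorted(unknown)}")
    clash = set(add) & set(reference)
    if clash:
        raise KeyError(f"already in reference design: {sorted(clash)}")
    # Copy surviving blocks, applying any overrides; the reference is untouched.
    derivative = {
        name: {**params, **extend.get(name, {})}
        for name, params in reference.items()
        if name not in remove
    }
    derivative.update({name: dict(params) for name, params in add.items()})
    return derivative
```

Given a reference such as `{"uart": {"count": 3}, "usb": {"role": "host"}, "ieee1394": {}}`, a call like `derive(ref, remove=["ieee1394"], extend={"uart": {"count": 1}}, add={"ethernet": {"speed_mbps": 100}})` yields a derivative without the 1394 block, with a single UART, and with a new Ethernet block, while leaving the reference design unchanged for the next derivative.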
One of the best known examples of a full SoC applications platform is the Nexperia Digital Video Platform (DVP). It is a scalable platform architecture for a wide range of digital video applications.
Time-to-market savings was a strong motivation for the design team that created both the platform and the first iteration of that architecture, the PNX-8500 media-processing chip. The chip provides a highly integrated multimedia SoC solution for the integrated digital TV (iDTV), the home gateway, the set-top box market, and the emerging connected-home market.
As illustrated in Figure 14.8 below, the PNX-8500 media-processing chip, in the true spirit of a platform IC, incorporates various IP blocks and entails a high level of IP and design reuse.
|Figure 14.8. Architectural blocks of PNX-8500 digital television processor|
Besides featuring multiple processors (a 32-bit VLIW TriMedia core, TM32, together with a standard 32-bit MIPS RISC core, MIPS32), the PNX-8500 SoC integrates separate audio, video, peripheral, and infrastructure subsystems that were assembled from IP blocks designed or available in-house or obtained from outside.
The peripheral subsystem comprises general-purpose on-chip peripheral IPs such as a universal serial bus (USB) host controller, three universal asynchronous receiver/transmitter (UART) interfaces, two multimaster inter-integrated circuit (I2C) interfaces, one synchronous serial interface (SSI) for implementing soft modems, a general-purpose I/O (GPIO) module with infrared remote-receive capability, a 1394 link-layer controller that provides a PHY-LINK interface, and a PCI/XIO expansion-bus interface unit to connect to a variety of board-level memory components.
The audio subsystem is centered around an audio I/O (AIO) block featuring three audio input and three audio output IP modules and an SPDIO IP block that provides the Sony Philips Digital Interface (SPDIF) input and output capabilities.
The video subsystem is quite pronounced, with well-defined IP blocks for high-quality video processing and display. It boasts two instances of a video input processor (VIP) to capture standard-definition video (NTSC, PAL) and a memory-based scaler (MBS) that not only deinterlaces, scales, and converts video but also filters graphics.
Also included are two instances of an advanced image composition processor (AICP) that is responsible for combining images from the main memory and composing the final picture that is displayed, a slice-level MPEG2 video decoder (MPEG) that is suitable for HD decoding, an MPEG system processor (MSP) that, besides filtering, demultiplexing, and processing transport streams (packets), also provides conditional access, and a 2D rendering engine (2D) that is used to accelerate graphics functions.
The infrastructure subsystem comprises a memory management interface (MMI) that provides and controls access to the external memory and a hierarchical on-chip bus system whose segments are connected by local bridges.
To read Part 1, go to Architectural approaches to video processing.
Next in Part 3: The ever-critical bus structure.
This series of articles is based on copyrighted material submitted by Santanu Dutta, Jenns Rennert, Tiehan Lv and Guang Yang to “Multiprocessor Systems-On-Chips,” edited by Wayne Wolf and Ahmed Amine Jerraya. It is used with the permission of the publisher, Morgan Kaufmann, an imprint of Elsevier. The book can be purchased on-line.
Santanu Dutta is a design engineering manager and technical lead in the connected multimedia solutions group at Philips Semiconductor, now NXP Semiconductor. Jenns Rennert is senior systems engineer at Micronas GmbH. Tiehan Lv attended Princeton University, where he received a PhD in electrical engineering. He also has B.S. and M.Eng. degrees from Peking University. Guang Yang is a research scientist at the Philips Research Laboratories.