Multicore systems-on-chip can handle embedded designs

Many design problems are conveniently concurrent and are easy to attack with multiple processor cores, though not necessarily using a symmetric multiprocessing (SMP) architecture. Big semiconductor and server vendors currently offer SMP multicore processors, which are good for solving certain kinds of design problems. And large servers and server farms support applications such as Web query requests that follow a SAMD model: single application, multiple data (an oversimplification, perhaps, but a useful one).

SAMD applications date to early, proprietary mainframe networks used for certain big applications such as real-time airline reservation and check-in systems or real-time banking. These applications are particularly suitable for SMP multicore processors: They essentially run the same kind of code, they do not exhibit data locality, and the number of cores running the application makes no material difference other than speed.

A large number of homogeneous, cache-coherent processors organized into SMP clusters seems a reasonable technology to apply to such applications, and multicore chips and servers from Intel, Advanced Micro Devices, Sun, IBM and others seem a reasonable way to exploit the inherent parallelism needed to satisfy many simultaneous user requests. Other applications, such as packet processing, may also be able to exploit SMP multicore or multiprocessor systems. SMP is simply not a good processing model for many embedded systems, however, because it eases the processor designer's task of creating a multiple-processor system but makes poor compromises in terms of power consumption, heat dissipation and software development requirements.

Graphics processing may also be “embarrassingly parallel.” Such applications can be cut up into multiple threads, each acting in parallel on part of the data. Special-purpose graphics engines such as the IBM-Toshiba-Sony Cell processor (interestingly, not really an SMP multicore machine) and other graphics chips offered by Nvidia and others are attracting some interest. Embarrassingly parallel applications get harder to find beyond graphics and scene rendering and can be extremely hard to program, despite the availability of special multicore API libraries.

However, very few real-world applications are “embarrassingly parallel.” For every Google, bank or airline, millions of ordinary users exist whose computers are already migrating to multicore processors. Big service providers may eat all the cores they can get, but desktops and laptops may top out at two to four processors. The future economics of SMP multicore chips remain perplexing. We may be facing a de facto solution in profound need of the right desktop problem.

[Figure: A personal video recorder (PVR) illustrates “compositional concurrency.” Seven identified processing blocks (shown in gray) have clearly defined tasks.]

But don't lose heart. Expanding architectural thinking beyond SMP multicores uncovers at least two kinds of concurrency that easily exploit multiple processors via heterogeneous, not homogeneous, concurrency. Many embedded systems exhibit such “convenient concurrency.”

Compositional concurrency
The first such system architecture exists in many consumer devices, including mobile phones, portable multimedia players and multifunction devices. This sort of parallelism can be called “compositional concurrency”: Various subsystems, each containing one or more processors optimized for a particular set of tasks, are woven together into a product. Communications are structured so that subsystems communicate only when needed. For example, a user-interface subsystem running on a controller may need to turn audio processing on or off; control the digital camera imaging functions; or interrupt video processing to stop, pause or change the video stream. In this kind of concurrent system, many subsystems operate simultaneously but have been designed to interact at only a high level without clashing.
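This style of interaction can be sketched in a few lines of Python. The sketch is host-side and purely illustrative: the subsystem names, the on/off message vocabulary and the controller interface are all invented for this example, not drawn from any real product. The point it demonstrates is that subsystems share no state and see only coarse, high-level control messages:

```python
from queue import Queue

# Each subsystem runs independently and reacts only to high-level
# control messages; there is no shared state between subsystems.
class Subsystem:
    def __init__(self, name):
        self.name = name
        self.inbox = Queue()   # the only way in: high-level messages
        self.running = False
        self.log = []

    def handle(self, msg):
        if msg == "on":
            self.running = True
        elif msg == "off":
            self.running = False
        self.log.append((self.name, msg))

# The user-interface controller sends only coarse commands; it never
# touches the audio or camera data paths directly.
class UiController:
    def __init__(self, subsystems):
        self.subsystems = subsystems

    def send(self, target, msg):
        self.subsystems[target].inbox.put(msg)

subsystems = {"audio": Subsystem("audio"), "camera": Subsystem("camera")}
ui = UiController(subsystems)
ui.send("audio", "off")    # mute audio processing
ui.send("camera", "on")    # start imaging

# Each subsystem drains its own inbox on its own schedule.
for sub in subsystems.values():
    while not sub.inbox.empty():
        sub.handle(sub.inbox.get())

print(subsystems["audio"].running, subsystems["camera"].running)
# prints: False True
```

Because the only coupling is the message queue, each subsystem can be designed, tested and powered independently, which is exactly the property that makes this architecture convenient.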

Some engineers might criticize this sort of architecture because of its theoretical inefficiency in terms of gate and processor count. Ten, 20 or even more processor cores could, in theory, be replaced with just a few general-purpose cores running at much higher clock rates.

But that criticism is misguided. When processors were expensive, design styles favored the use of a few big, fast processors. When Dennard scaling (also called classical scaling) ended at 90 nanometers, transistors continued to get much smaller at each IC fabrication node but no longer got much faster or dissipated less power. In fact, static leakage current has started to increase.

[Figure: A Super 3G mobile phone chip uses 18 separate processing blocks. The blocks divide and conquer communications processing at a high level.]

As a result, the big processors' power dissipation and energy consumption have become unmanageable at high clock rates, and system designers are forced to adopt design styles that reduce system clock rates before their chips burn to cinders under normal operating conditions.

There are some decided system-level advantages of compositionally concurrent system design:

•Distributing computing tasks over more on-chip processors trades transistors for clock rate, reducing overall system power and energy consumption. In light of Moore's Law and the end of Dennard scaling, this is a very good engineering trade-off to make.

•Subsystems can be more easily powered down when not used, instead of keeping all the cores in a multicore SMP system running. Subsystems can be shut off completely and restarted quickly or can be throttled back by using complex dynamic voltage- and frequency-scaling algorithms based on predicted task loads.

•Because these subsystems are task-specific, they run more efficiently on application-specific instruction-set processors (ASIPs), which are far more area- and power-efficient than are general-purpose processors. That means the gate advantage of fewer general-purpose cores may be much less than it might seem upon first consideration.

•Compositionally concurrent system designs avoid complex interactions and synchronizations among subsystems. Shutting down the camera subsystem in a compositional product is a trivial task in software; ensuring that the same task can safely be suspended in a cooperative, multitasking environment running on an SMP system can be much more complex. Ensuring that a four-core SMP system running a mobile phone and its audio, video and camera functions will not drop a 911 emergency response call while other applications are running, or that low-priority applications will be properly suspended when a high-priority task interrupts, can be a nightmare of analysis involving “death by simulation.” Reasonably independent subsystems that interact only at a high level are far easier to validate, both individually and compositionally.
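The power-down and frequency-scaling point above can be illustrated with a toy per-subsystem policy. Everything in this sketch is invented for illustration: the frequency table, the lowest-step-that-covers-the-load rule and the assumption of one MIPS of work per MHz of clock are not taken from any real product.

```python
# Hypothetical per-subsystem DVFS policy: pick the lowest clock rate
# that still covers the predicted task load, or gate the subsystem
# off entirely (0 MHz) when it is idle.
FREQ_STEPS_MHZ = [0, 100, 200, 400]   # 0 means powered down

def select_frequency(predicted_load_mips):
    """Return the lowest frequency step that covers the predicted load,
    assuming (for illustration only) 1 MIPS of work per MHz of clock."""
    for freq in FREQ_STEPS_MHZ:
        if freq >= predicted_load_mips:
            return freq
    return FREQ_STEPS_MHZ[-1]         # saturate at the top step

print(select_frequency(0))     # idle subsystem: power it down -> 0
print(select_frequency(150))   # moderate load -> 200
print(select_frequency(999))   # overload: run flat out -> 400
```

A real policy would of course weigh voltage, transition latency and deadline margins, but even this toy version shows why independent subsystems make the decision tractable: each one can be throttled or gated on its own predicted load, without reasoning about every thread in the system.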

Pipelined data flow
Pipelined data flow, the second kind of concurrency, complements compositional concurrency. Computation often can be divided into a pipeline of individual task engines. Each task engine processes and then emits data blocks (frames, samples and so forth). Once a task is completed, the processed data block passes to the next engine in the chain. Such asymmetric multiprocessing algorithms appear in many signal- and image-processing applications that range from cell phone baseband processing to video and still-image processing. Pipelining permits substantial concurrent processing and even sharper application of ASIP principles: Each of the heterogeneous processors in the pipeline can be highly tuned to just one part of the task.
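A minimal host-side sketch of such a pipeline, in Python, uses one thread per task engine with a queue between stages. The three stage functions here are trivial stand-ins for real signal-processing kernels, and the sentinel-based shutdown is one common convention, not a requirement of the pattern:

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the data stream

def stage(worker, inbox, outbox):
    """Run one task engine: pull a block, process it, pass it on."""
    while True:
        block = inbox.get()
        if block is STOP:
            outbox.put(STOP)   # propagate shutdown down the pipeline
            return
        outbox.put(worker(block))

# Three heterogeneous "engines," each tuned to one part of the job.
def parse(block):     return block + 1   # stand-in for bitstream parsing
def transform(block): return block * 2   # stand-in for pixel math
def emit(block):      return block - 3   # stand-in for output formatting

def run_pipeline(blocks):
    queues = [queue.Queue() for _ in range(4)]
    workers = [parse, transform, emit]
    threads = [threading.Thread(target=stage, args=(w, queues[i], queues[i + 1]))
               for i, w in enumerate(workers)]
    for t in threads:
        t.start()
    for b in blocks:          # feed data blocks into the first engine
        queues[0].put(b)
    queues[0].put(STOP)
    out = []
    while True:               # collect results from the last engine
        block = queues[-1].get()
        if block is STOP:
            break
        out.append(block)
    for t in threads:
        t.join()
    return out

print(run_pipeline([1, 2, 3]))  # -> [1, 3, 5]
```

Once the stream is longer than the pipeline is deep, all three engines work on different blocks simultaneously, which is where the concurrency, and the opportunity to tune each engine to its one task, comes from.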

For example, Tensilica's Diamond Standard 388VDO Video Engine mates two appropriately but differently configured 32-bit processor cores with a DMA controller to create a digital-video codec subsystem. One processor core in the subsystem is configured as a stream processor and the other as a pixel processor. The stream processor accelerates serial processing, such as bitstream parsing, entropy decoding and control functions. The pixel processor works on the video data plane and performs parallel computations on pixel data using a single-instruction, multiple-data (SIMD) instruction architecture. Both processors have different local-memory and data-width configurations, as required by their functional partition. This configuration decodes H.264 D1 main profile video while running at 200 MHz, something that is easily achieved with 130-nm technology and is even easier to fabricate with more-advanced IC fabrication processes.

A Pentium-class processor decodes H.264 D1 main-profile video running at a clock rate of between 1 and 2 GHz while dissipating several tens of watts. A paper presented at the recent International Conference on Consumer Electronics discussed decoding H.264 D1 Main Profile video using 120 percent of a 600-MHz TI TMS320DM642 DSP, putting the required clock rate at 720 MHz. Unfortunately, SoC processors that run at 720 MHz, much less at 1 to 2 GHz, cannot be synthesized using available ASIC foundry technologies. In this case, pipeline processing drops the required clock frequency considerably below that of the “one big, fast processor” design approach and allows the video decoder to be fabricated with conventional ASIC manufacturing technology.

Combining the compositional subsystem style of design with asymmetric multiprocessing (AMP) in each subsystem makes it apparent that products in the consumer, portable and media spaces may need 10 to 100 processors, each optimized to a specific task in the product's function set. Programming each AMP application is easier than programming each multithreaded SMP application, because fewer intertask dependencies are involved. This design approach is eminently practical; it avoids many of the optimization headaches associated with multiple application threads running on a limited set of identical processors in an SMP system.

Grant Martin is chief scientist at Tensilica and holds graduate and postgraduate degrees in mathematics from the University of Waterloo, Canada.
Steven Leibson is Tensilica's technology evangelist. He has a BSEE from Case Western Reserve University and worked at HP and Cadnetix before becoming a journalist at EDN.
