For many applications, allocating performance among all of the tasks in a system-on-chip (SoC) design is much easier and provides greater design flexibility with multiple CPUs than with just one control processor and multiple blocks of logic. Just using bigger control processors will not satisfy the widely varying computational demands of many of today’s designs because bigger processors often require too much power, especially for consumer devices.
Multiple processor design changes the role of processors, allowing programmability to be designed into many functions while keeping power budgets under control. Multiple processor designs are now found in a number of applications ranging from cellular phones and ink-jet printers all the way up to huge network routers.
The biggest advantage of using multiple processors as SoC task blocks is that they’re programmable, so changes can be made in software after the chip design is finished. This means that complex state machines can be implemented in firmware running on the processor, significantly reducing verification time. And one SoC can often be used for multiple products, turning features on and off as necessary.
Multiple processor design promotes much more efficient use of memory blocks. A multiple processor-based approach makes most of the memories processor-visible, processor-controlled, processor-managed, processor-tested, and processor-initialized. Additionally, this reduces overall memory requirements while promoting the flexible sharing and reuse of on-chip memories.
But how do you pick the right embedded processors for multiple CPU designs? How do you partition your design to take maximum advantage of multiple processors? How do you manage the software between all of the processors? How do you connect them and manage communications in the hardware?
Partitioning the multiple processor SoC design
At the conceptual level, the entire system can be treated as a constellation of concurrent, interacting subsystems or tasks. Each task communicates with other subsystems and shares common resources (memory, shared data structures, network points). Developers start from a set of tasks for the system and exploit the parallelism by applying a spectrum of techniques, including four basic actions:
1. Allocate (mostly) independent tasks to different processors, with communications among tasks expressed via shared memory and messages.
2. Speed up each individual task by optimizing the processor on which it runs using a configurable processor.
3. For particularly performance-critical tasks, decompose the task into a set of parallel tasks running on a set of optimized, inter-communicating processors.
4. Combine multiple low-bandwidth tasks on one processor by time-slicing. This approach degrades parallelism, but may improve SoC cost and efficiency if the processor has enough available computation cycles.
These methods interact with one another, so iterative refinement is often essential, particularly as the design evolves.
When a system’s functions are partitioned into multiple interacting function blocks, there are several possible organizational forms or structures including:
- Heterogeneous tasks Distinct, loosely coupled subsystems that can be implemented largely independently of each other. Figure 1 shows a system where networking, video and audio processing tasks are implemented in separate processors, sharing common memory, bus and I/O resources.

Figure 1- Simple heterogeneous system partitioning.
- Parallel tasks Communications equipment, for example, often supports multiple communications ports, voice channels, or wireless frequency-band controllers, as shown in Figure 2. Even when the parallelism isn’t obvious, many system applications still lend themselves to parallel implementation. For example, in an image-processing system the operations on one part of a frame may be largely independent of operations on another part of that same frame. Creating a two-dimensional array of sub-image processors may achieve high parallelism without substantial algorithm redesign.

Figure 2- Parallel task system partitioning.
- Pipelined tasks Phases of the algorithms can naturally be performed on one block of data while a subsequent phase is performed on an earlier block (also called a systolic-processing array). Figure 3 shows a pipelined architecture with multiple steps to produce the final decoded video stream.

Figure 3- Pipelined task system partitioning.
- Hybrids Real systems usually require a mixture of these partitioning styles.