Embedded DSP Software Design on a Multicore SoC Architecture: Part 1
Designing and building embedded systems is a difficult task, given the inherent scarcity of resources in embedded systems (processing power, memory, throughput, battery life, and cost). Various trade-offs are made between these resources when designing an embedded system.
Modern embedded systems use devices with multiple processing units manufactured on a single chip, creating a multicore system-on-a-chip (SoC). This approach can increase the processing power and throughput of the system while at the same time increasing battery life and reducing overall cost.
One example of a DSP-based SoC is shown in Figure 11.1 below. Multicore approaches keep hardware design in the low-frequency range (each individual processor can run at a lower speed, which reduces overall power consumption as well as heat generation), offering significant price, performance, and flexibility advantages (in software design and partitioning) over higher-speed single-core designs.
|Figure 11.1. Block diagram of a DSP SoC|
There are several characteristics of SoCs that we will discuss. I will use an example processor to demonstrate these characteristics and how they are deployed in an existing SoC.
1. Customized to the application - Like embedded systems in general, SoCs are customized to an application space. As an example, I will reference the video application space. A block diagram showing the flow of an embedded video application is shown in Figure 11.2 below.
This system consists of input capture, real-time signal processing, and output display components. Building such a flexible system involves multiple technologies, including analog formats, video converters, digital formats, and digital processing. An SoC processor incorporates a system of components: processing elements, peripherals, memories, I/O, and so forth, to implement a system such as the one shown in Figure 11.2.
|Figure 11.2 Digital video system application model (courtesy of Texas Instruments)|
An example of an SoC processor that implements a digital video system is shown in Figure 11.3 below. This processor consists of various components to input, process, and output digital video information. More about the details of this in a moment.
2. SoCs improve the power/performance ratio - Large processors running at high frequencies consume more power and are more expensive to cool. Several smaller processors running at lower frequencies can perform the same amount of work without consuming as much energy and power.
In Figure 11.1, the ARM processor, the two DSPs, and the hardware accelerators can run a large signal processing application efficiently by properly partitioning the application across these four different processing elements.
3. Many apps require programmability - SoCs contain multiple programmable processing elements. These are required for a number of reasons:
New technology - Programmability supports upgradeability, and programmable devices can track evolving standards better than nonprogrammable devices. For example, as new video codec standards are developed, the algorithms to support these new standards can be added to a programmable processing element easily. New features are also easier to add.
Support for multiple standards and algorithms - Some digital video applications require support for multiple video standards, resolutions, and quality levels. It's easier to implement these on a programmable system.
Full algorithm control - A programmable system gives the designer the ability to customize and/or optimize a specific algorithm as necessary, which gives the application developer more control over differentiating the application.
Software reuse in future systems - By developing digital video software as components, these can be reused and repackaged as building blocks for future systems as necessary.
4. Constraints such as real-time, power, and cost - There are many constraints in real-time embedded systems. Many of these constraints are met by customizing to the application.
|Figure 11.3. A SoC processor customized for Digital Video Systems (courtesy of Texas Instruments)|
5. Special instructions - SoCs have special CPU instructions to speed up the application. As an example, the SoC in Figure 11.3 above contains special instructions on the DSP to accelerate operations such as:
32-bit multiply instructions for extended precision computation
Expanded arithmetic functions to support FFT and DCT algorithms
Improved complex multiplications
Double dot product instructions for improving throughput of FIR loops
Parallel packing instructions
Enhanced Galois Field Multiply
Each of these instructions accelerates the processing of certain digital video algorithms. Of course, compiler support is necessary to schedule these instructions, so the tools become an important part of the entire system as well.
6. Extensible - Many SoCs are extensible in ways such as word size and cache size. Special tooling is also made available to analyze systems as these system parameters are changed.
7. Hardware acceleration " There are several benefits to using hardware acceleration in an SoC. The primary reason is better cost/performance ratio. Fast processors are costly. By partitioning into several smaller processing elements, cost can be reduced in the overall system. Smaller processing elements also consume less power and can actually be better at implementing real-time systems as the dedicated units can respond more efficiently to external events.
Hardware accelerators are useful in applications that have algorithmic functions that do not map well to a CPU architecture. For example, algorithms that perform heavy bit manipulation need many registers, and a traditional CPU register model may not be suited to executing these algorithms efficiently.
A specialized hardware accelerator that performs bit manipulation efficiently can be built to sit beside the CPU, which then uses it for bit manipulation operations. Highly responsive I/O operations are another area where a dedicated accelerator with an attached I/O peripheral will perform better.
Finally, applications that are required to process streams of data, such as many wireless and multimedia applications, do not map well to the traditional CPU architecture, especially those that implement caching systems.
|Figure 11.4 Block diagram of the video processing subsystem acceleration module of the SoC in Figure 11.3 (courtesy of Texas Instruments)|
Since each streaming data element may have a limited lifetime, processing will require constant thrashing of the cache as new data elements arrive. A specialized hardware accelerator with special fetch logic can be implemented to provide dedicated support for these data streams.
Hardware acceleration is used on SoCs as a way to efficiently execute classes of algorithms. As mentioned in the chapter on power optimization, using accelerators where possible can lower overall system power, since these accelerators are customized to one class of processing and therefore perform those calculations very efficiently.
The SoC in Figure 11.3 has hardware acceleration support. In particular, the video processing sub-system (VPSS) as well as the Video Acceleration block within the DSP subsystem are examples of hardware acceleration blocks used to efficiently process video algorithms.
Figure 11.4 above shows a block diagram of the VPSS. This hardware accelerator contains:
A front end module containing:
CCDC (charge-coupled device controller)
Resizer (accepts data from the previewer or from external memory and resizes from ¼x to 4x)
And a back end module containing:
Color space conversion
This VPSS processing element eases the overall DSP/ARM loading through hardware acceleration. An example application using the VPSS is shown in Figure 11.5 below.
|Figure 11.5 A Video phone example using the VPSS acceleration module (courtesy of Texas Instruments)|
8. Heterogeneous memory systems - Many SoC devices contain separate memories for the different processing elements. This provides a performance boost because of lower latencies on memory accesses, as well as lower power from reduced bus arbitration and switching.
The programmable coprocessor shown in Figure 11.6 below is optimized for imaging and video applications. Specifically, this accelerator is optimized to perform operations such as filtering, scaling, matrix multiplication, addition, subtraction, summing of absolute differences, and other related computations.
Much of the computation is specified in the form of commands which operate on arrays of streaming data. A simple set of APIs can be used to make processing calls into this accelerator. In that sense, a single command can drive hundreds or thousands of cycles.
As discussed previously, accelerators are used to perform computations that do not map efficiently to a CPU. The accelerator in Figure 11.6 below is an example of an accelerator that performs efficient operations using parallel computation.
|Figure 11.6 A hardware accelerator example; video and imaging coprocessor (courtesy of Texas Instruments)|
This accelerator has an 8-parallel multiply-accumulate (MAC) engine, which significantly accelerates classes of signal processing algorithms that require this type of parallel computation.
The variable length code/decode (VLCD) module in this accelerator supports the following fundamental operations very efficiently:
Quantization and inverse quantization (Q/IQ)
Variable length coding and decoding (VLC/VLD)
Zigzag scan flexibility
The design of this block is such that it operates on one macroblock of data at a time (a maximum of six 8x8 blocks in 4:2:0 format). Before starting to encode or decode a bitstream, the proper registers and memory in the VLCD module must first be initialized by the application software.
This hardware accelerator also contains a block called a sequencer which is really just a 16-bit microprocessor targeted for simple control, address calculation, and loop control functions. This simple processing element offloads the sequential operations from the DSP.
The application developer can program this sequencer to coordinate the operations among the other accelerator elements, including the iMX, VLCD, system DMA, and the DSP. The sequencer code is compiled using simple macro support tools and is linked with the DSP code, to be loaded later by the CPU at run time.
One of the other driving factors for the development of SoC technology is the fact that there is an increasing demand for programmable performance. For many applications, performance requirements are increasing faster than the ability of a single CPU to keep pace.
The allocation of performance, and thus response time, for complex real-time systems is often easier with multiple CPUs. And dedicated CPUs in peripherals or special accelerators can offload low-level functionality from a main CPU, allowing it to focus on higher-level functions.

Next in Part 2: Software architecture for a SoC
Robert Oshana is an engineering manager in the Software Development Organization of Texas Instruments DSP Systems business. He is responsible for the development of hardware and software debug technology for many of TI's programmable devices. He has 25 years of real-time embedded development experience.
Used with the permission of the publisher, Newnes/Elsevier, this series of two articles is based on material from DSP Software Development Techniques for Embedded and Real-Time Systems, by Robert Oshana.