Embedded DSP Software Design on a Multicore SoC Architecture: Part 1 - Embedded.com


Designing and building embedded systems is a difficult task, given the inherent scarcity of resources in embedded systems (processing power, memory, throughput, battery life, and cost). Various trade-offs are made between these resources when designing an embedded system.

Modern embedded systems use devices with multiple processing units manufactured on a single chip. This kind of multicore system-on-a-chip (SoC) can increase the processing power and throughput of the system while at the same time increasing battery life and reducing overall cost.

One example of a DSP-based SoC is shown in Figure 11.1 below. Multicore approaches keep hardware design in the low frequency range (each individual processor can run at a lower speed, which reduces overall power consumption as well as heat generation), offering significant price, performance, and flexibility (in software design and partitioning) advantages over higher speed single-core designs.

Figure 11.1. Block diagram of a DSP SoC

There are several characteristics of SoCs that we will discuss [1]. I will use an example processor to demonstrate these characteristics and how they are deployed in an existing SoC.

1. Customized to the application: Like embedded systems in general, SoCs are customized to an application space. As an example, I will reference the video application space. A block diagram showing the flow of an embedded video application is shown in Figure 11.2 below.

This system consists of input capture, real-time signal processing, and output display components. Building a flexible system of this kind involves multiple technologies, including analog formats, video converters, digital formats, and digital processing. An SoC processor will incorporate a system of components (processing elements, peripherals, memories, I/O, and so forth) to implement a system such as that shown in Figure 11.2 below.

Figure 11.2. Digital video system application model (courtesy of Texas Instruments)

An example of an SoC processor that implements a digital video system is shown in Figure 11.3 below. This processor consists of various components to input, process, and output digital video information. More about the details of this in a moment.

2. SoCs improve power/performance ratio: Large processors running at high frequencies consume more power, and are more expensive to cool. Several smaller processors running at a lower frequency can perform the same amount of work without consuming as much energy and power.

In Figure 11.1, the ARM processor, the two DSPs, and the hardware accelerators can run a large signal processing application efficiently by properly partitioning the application across these four different processing elements.
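The power argument in item 2 can be made concrete with the standard first-order model for CMOS dynamic power, P = C x V^2 x f. The C sketch below is not from the original text: the function names and all numeric values are illustrative assumptions. It also assumes that a core running at a lower clock can be run at a lower supply voltage, which is where the saving comes from in this model. With the voltage held fixed, two cores at half the clock draw the same dynamic power as one core at the full clock.

```c
#include <assert.h>

/* First-order CMOS dynamic power model: P = C * V^2 * f.
 * c_eff : effective switched capacitance in farads (assumed value)
 * v     : supply voltage in volts
 * f_hz  : clock frequency in hertz
 * Static (leakage) power is deliberately ignored in this sketch. */
double dynamic_power_watts(double c_eff, double v, double f_hz)
{
    assert(c_eff >= 0.0 && v >= 0.0 && f_hz >= 0.0);
    return c_eff * v * v * f_hz;
}

/* Total dynamic power of n_cores identical cores, each clocked at f_hz. */
double multicore_power_watts(int n_cores, double c_eff, double v, double f_hz)
{
    assert(n_cores > 0);
    return n_cores * dynamic_power_watts(c_eff, v, f_hz);
}
```

With the assumed values C = 1 nF, one core at 600 MHz and 1.2 V comes to about 0.86 W, while two cores at 300 MHz and 0.9 V come to about 0.49 W for the same aggregate clock rate.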

3. Many applications require programmability: SoCs contain multiple programmable processing elements. These are required for a number of reasons:

New technology: Programmable devices are easier to upgrade and change than nonprogrammable devices. For example, as new video codec technology is developed, the algorithms to support these new standards can be implemented on a programmable processing element easily. New features are also easier to add.
Support for multiple standards and algorithms: Some digital video applications require support for multiple video standards, resolutions, and quality levels. It is easier to implement these on a programmable system.
Full algorithm control: A programmable system gives the designer the ability to customize and/or optimize a specific algorithm as necessary, which gives the application developer more control over differentiation of the application.
Software reuse in future systems: When digital video software is developed as components, these can be reused and repackaged as building blocks for future systems as necessary.

4. Constraints such as real-time, power, cost: There are many constraints in real-time embedded systems. Many of these constraints are met by customizing to the application.

Figure 11.3. An SoC processor customized for digital video systems (courtesy of Texas Instruments)

5. Special instructions: SoCs have special CPU instructions to speed up the application. As an example, the SoC in Figure 11.3 above contains special instructions on the DSP to accelerate operations such as:
32-bit multiply instructions for extended precision computation
Expanded arithmetic functions to support FFT and DCT algorithms
Improved complex multiplication
Double dot product instructions for improving throughput of FIR loops
Parallel packing instructions
Enhanced Galois Field Multiply

Each of these instructions accelerates the processing of certain digital video algorithms. Of course, compiler support is necessary to schedule these instructions, so the tools become an important part of the entire system as well.
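To illustrate what a double dot product instruction buys in an FIR loop, here is a minimal portable C reference model. It is a sketch, not the intrinsic or instruction of any particular DSP: the names `dot2` and `fir_sample_dot2` are hypothetical, and the 64-bit accumulator is an assumption chosen to keep the C code free of overflow (real DSP accumulators are often narrower). On hardware with such an instruction, the body of `dot2` would be a single operation, so the loop runs half as many iterations as a one-multiply-per-step loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference model of a two-way dot product: two 16x16-bit multiplies
 * whose products are summed in one step. */
static int64_t dot2(int16_t a0, int16_t a1, int16_t b0, int16_t b1)
{
    return (int64_t)a0 * b0 + (int64_t)a1 * b1;
}

/* Compute one FIR output sample: sum over k of x[k] * h[k].
 * x points at the n_taps-sample input window already aligned with h.
 * n_taps must be even so that taps can be consumed in pairs. */
int64_t fir_sample_dot2(const int16_t *x, const int16_t *h, size_t n_taps)
{
    int64_t acc = 0;
    assert(n_taps % 2 == 0);
    for (size_t k = 0; k < n_taps; k += 2) {
        acc += dot2(x[k], x[k + 1], h[k], h[k + 1]);
    }
    return acc;
}
```

In practice the compiler or an intrinsic maps the paired multiply onto the special instruction, which is why the text stresses tool support.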

6. Extensible: Many SoCs are extensible in ways such as word size and cache size. Special tooling is also made available to analyze systems as these system parameters are changed.

7. Hardware acceleration: There are several benefits to using hardware acceleration in an SoC. The primary reason is a better cost/performance ratio. Fast processors are costly. By partitioning into several smaller processing elements, cost can be reduced in the overall system. Smaller processing elements also consume less power and can actually be better at implementing real-time systems, as the dedicated units can respond more efficiently to external events.

Hardware accelerators are useful in applications that have algorithmic functions that do not map to a CPU architecture well. For example, algorithms that require a lot of bit manipulation require a lot of registers. A traditional CPU register model may not be suited to efficiently execute these algorithms.

A specialized hardware accelerator that performs bit manipulation efficiently can be built to sit beside the CPU and be used by the CPU for bit manipulation operations. Highly responsive I/O operations are another area where a dedicated accelerator with an attached I/O peripheral will perform better.

Finally, applications that are required to process streams of data, such as many wireless and multimedia applications, do not map well to the traditional CPU architecture, especially those that implement caching systems.

Figure 11.4. Block diagram of the video processing subsystem acceleration module of the SoC in Figure 11.3 (courtesy of Texas Instruments)

Since each streaming data element may have a limited lifetime, processing will require the constant thrashing of cache for new data elements. A specialized hardware accelerator with special fetch logic can be implemented to provide dedicated support to these data streams.

Hardware acceleration is used on SoCs as a way to efficiently execute classes of algorithms. We mentioned in the chapter on power optimization how the use of accelerators, where possible, can lower overall system power, since these accelerators are customized to the class of processing and therefore perform these calculations very efficiently.

The SoC in Figure 11.3 has hardware acceleration support. In particular, the video processing subsystem (VPSS) as well as the Video Acceleration block within the DSP subsystem are examples of hardware acceleration blocks used to efficiently process video algorithms.

Figure 11.4 above shows a block diagram of the VPSS. This hardware accelerator contains:

A front end module containing:
CCDC (charge-coupled device controller)
Resizer (accepts data from the previewer or from external memory and resizes from ¼x to 4x)

And a back end module containing:
Color space conversion
Digital output
On-screen display

This VPSS processing element eases the overall DSP/ARM loading through hardware acceleration. An example application using the VPSS is shown in Figure 11.5 below.

Figure 11.5. A video phone example using the VPSS acceleration module (courtesy of Texas Instruments)

8. Heterogeneous memory systems: Many SoC devices contain separate memories for the different processing elements. This provides a performance boost because of lower latencies on memory accesses, as well as lower power from reduced bus arbitration and switching.

The programmable video and imaging coprocessor shown in Figure 11.6 below is optimized for imaging and video applications. Specifically, this accelerator is optimized to perform operations such as filtering, scaling, matrix multiplication, addition, subtraction, summing absolute differences, and other related computations.

Much of the computation is specified in the form of commands which operate on arrays of streaming data. A simple set of APIs can be used to make processing calls into this accelerator. In that sense, a single command can drive hundreds or thousands of cycles.
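The command-driven model can be sketched in C as follows. This is a software stand-in under stated assumptions, not the coprocessor's real interface: `accel_op_t`, `accel_cmd_t`, and `accel_run` are hypothetical names invented for illustration, and only two of the operations listed above (addition and sum of absolute differences) are modeled. The point is that the caller fills in one descriptor and makes one call, and the whole array is processed behind it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command set (illustration only). */
typedef enum { ACCEL_OP_ADD, ACCEL_OP_SAD } accel_op_t;

/* Hypothetical command descriptor: one command covers a whole array. */
typedef struct {
    accel_op_t     op;     /* operation to apply */
    const int16_t *src_a;  /* first input array */
    const int16_t *src_b;  /* second input array */
    int32_t       *dst;    /* ADD: count results; SAD: a single result */
    size_t         count;  /* number of elements to process */
} accel_cmd_t;

/* Software model of the accelerator executing one command.
 * Returns 0 on success, -1 for an unknown operation.
 * Assumes count is small enough that a SAD total fits in 32 bits. */
int accel_run(const accel_cmd_t *cmd)
{
    assert(cmd && cmd->src_a && cmd->src_b && cmd->dst);
    switch (cmd->op) {
    case ACCEL_OP_ADD:
        for (size_t i = 0; i < cmd->count; ++i)
            cmd->dst[i] = (int32_t)cmd->src_a[i] + cmd->src_b[i];
        return 0;
    case ACCEL_OP_SAD: {
        int32_t sum = 0;
        for (size_t i = 0; i < cmd->count; ++i) {
            int32_t d = (int32_t)cmd->src_a[i] - cmd->src_b[i];
            sum += (d < 0) ? -d : d;
        }
        cmd->dst[0] = sum;
        return 0;
    }
    default:
        return -1;
    }
}
```

On real hardware the loop bodies would run inside the accelerator while the DSP is free to do other work, which is what lets one command account for hundreds or thousands of cycles.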

As discussed previously, accelerators are used to perform computations that do not map efficiently to a CPU. The accelerator in Figure 11.6 below is an example of an accelerator that performs efficient operations using parallel computation.

Figure 11.6. A hardware accelerator example: video and imaging coprocessor (courtesy of Texas Instruments)

This accelerator has an 8-parallel multiply-accumulate (MAC) engine which significantly accelerates classes of signal processing algorithms that require this type of parallel computation. Examples include:

JPEG encode and decode
MPEG-1/2/4 encode and decode
H.263 encode and decode
WMV9 decode
H.264 baseline profile decode
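As a rough picture of what an 8-parallel MAC engine does, the following C sketch keeps eight independent accumulators, one per lane, and combines them only at the end. The name `mac8_dot`, the lane layout, and the 64-bit accumulator width are assumptions made for illustration; they are not taken from the device. In hardware the eight lane updates of the inner loop would happen in the same cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAC_LANES 8  /* number of parallel multiply-accumulate lanes */

/* Dot product of x and c computed lane by lane.
 * n must be a multiple of MAC_LANES. */
int64_t mac8_dot(const int16_t *x, const int16_t *c, size_t n)
{
    int64_t lane_acc[MAC_LANES] = {0};
    int64_t total = 0;

    assert(n % MAC_LANES == 0);

    /* Each outer step models one engine cycle: 8 MACs at once. */
    for (size_t i = 0; i < n; i += MAC_LANES) {
        for (size_t lane = 0; lane < MAC_LANES; ++lane) {
            lane_acc[lane] += (int64_t)x[i + lane] * c[i + lane];
        }
    }

    /* Final reduction of the per-lane partial sums. */
    for (size_t lane = 0; lane < MAC_LANES; ++lane) {
        total += lane_acc[lane];
    }
    return total;
}
```

Because the lanes never depend on each other until the final reduction, the work divides cleanly across the parallel multipliers.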

The variable length code/decode (VLCD) module in this accelerator supports the following fundamental operations very efficiently:

Quantization and inverse quantization (Q/IQ)
Variable length coding and decoding (VLC/VLD)
Huffman tables
Zigzag scan flexibility

The design of this block is such that it operates on one macroblock of data at a time (at most six 8×8 blocks in 4:2:0 format: four luma and two chroma). Before starting to encode or decode a bitstream, the proper registers and memory in the VLCD module must first be initialized by the application software.
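For readers unfamiliar with the zigzag scan that the VLCD module performs, here is a minimal C sketch of the conventional 8×8 zigzag order used by JPEG and MPEG-style codecs. It shows only the fixed, conventional scan; the module's configurable scan patterns and its register setup are not modeled, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Fill order[] with the conventional 8x8 zigzag scan.
 * order[k] is the row-major index (row * 8 + col) of the k-th
 * coefficient visited. */
void zigzag_order_8x8(uint8_t order[64])
{
    int n = 0;
    for (int s = 0; s <= 14; ++s) {      /* s = row + col: one diagonal */
        if (s % 2 == 0) {                /* even diagonal: move up-right */
            int row = (s < 8) ? s : 7;
            int col = s - row;
            while (row >= 0 && col < 8) {
                order[n++] = (uint8_t)(row * 8 + col);
                --row;
                ++col;
            }
        } else {                         /* odd diagonal: move down-left */
            int col = (s < 8) ? s : 7;
            int row = s - col;
            while (col >= 0 && row < 8) {
                order[n++] = (uint8_t)(row * 8 + col);
                ++row;
                --col;
            }
        }
    }
    assert(n == 64);
}

/* Reorder one 8x8 block of coefficients from row-major to zigzag order. */
void zigzag_scan_8x8(const int16_t block[64], int16_t out[64])
{
    uint8_t order[64];
    zigzag_order_8x8(order);
    for (int k = 0; k < 64; ++k) {
        out[k] = block[order[k]];
    }
}
```

The scan places low-frequency coefficients first, so the long runs of zeros left after quantization end up together, which is what makes the subsequent variable length coding effective.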

This hardware accelerator also contains a block called a sequencer, which is really just a 16-bit microprocessor targeted for simple control, address calculation, and loop control functions. This simple processing element offloads the sequential operations from the DSP.

The application developer can program this sequencer to coordinate the operations among the other accelerator elements, including the iMX, VLCD, System DMA, and the DSP. The sequencer code is compiled by the support tools using a simple macro, and is linked with the DSP code to be loaded later by the CPU at run time.

One of the other driving factors for the development of SoC technology is the fact that there is an increasing demand for programmable performance. For many applications, performance requirements are increasing faster than the ability of a single CPU to keep pace.

The allocation of performance, and thus response time, for complex real-time systems is often easier with multiple CPUs. In addition, dedicated CPUs in peripherals or special accelerators can offload low-level functionality from a main CPU, allowing it to focus on higher-level functions.

Next in Part 2: Software architecture for an SoC

Robert Oshana is an engineering manager in the Software Development Organization of Texas Instruments' DSP Systems business. He is responsible for the development of hardware and software debug technology for many of TI's programmable devices. He has 25 years of real-time embedded development experience.

Used with the permission of the publisher, Newnes/Elsevier, this series of two articles is based on material from DSP Software Development Techniques for Embedded and Real-Time Systems, by Robert Oshana.

1. Multiprocessor systems-on-chips, by Ahmed Jerraya, Hannu Tenhunen and Wayne Wolf, page 36, IEEE Computer, July 2005.
2. Embedded Software in Real-Time Signal Processing Systems: Design Technologies, Proceedings of the IEEE, vol. 85, no. 3, March 1997.
3. A Software/Hardware Co-design Methodology for Embedded Microprocessor Core Design, IEEE 1999.
4. Component-Based Design Approach for Multicore SoCs, Copyright 2002, ACM.
5. A Customizable Embedded SoC Platform Architecture, IEEE IWSOC'04 (International Workshop on System-on-Chip for Real-Time Applications).
6. How virtual prototypes aid SoC hardware design, Hellestrand, Graham. EEdesign.com, May 2004.
7. Panel Weighs Hardware, Software Design Options, Edwards, Chris. EETUK.com, Jun 2000.
8. Back to the Basics: Programmable SoCs, Zeidman, Bob. Embedded.com, July 2005.
