Designing and building embedded systems is a difficult task, given
their inherent scarcity of resources (processing power, memory,
throughput, battery life, and cost). Designers must trade these
resources off against one another.
Modern embedded systems use devices with multiple processing units
manufactured on a single chip. Such a multicore system-on-a-chip (SoC)
can increase the processing power and throughput of the system while at
the same time increasing battery life and reducing overall cost.
One example of a DSP-based SoC is shown in Figure 11.1 below.
Multicore approaches keep hardware design in the low frequency range
(each individual processor can run at a lower speed, which reduces
overall power consumption as well as heat generation), offering
significant price, performance, and flexibility (in software design and
partitioning) advantages over higher-speed single-core designs.
Figure 11.1. Block diagram of a DSP SoC
There are several characteristics of SoCs that we will
discuss [1]. I will use an example processor to demonstrate these
characteristics and how they are deployed in an existing SoC.
1. Customized to the application - Like embedded systems in general,
SoCs are customized to an application space. As an example, I will
reference the video application space. Figure 11.2 below shows a block
diagram of the flow in an embedded video application.
This system consists of input capture, real-time
signal processing, and output display components. Building such a
flexible system involves multiple technologies, including analog
formats, video converters, digital formats, and digital processing. An
SoC processor incorporates a system of components (processing elements,
peripherals, memories, I/O, and so forth) to implement a system such as
that shown in Figure 11.2 below.
Figure 11.2. Digital video system application model (courtesy of Texas
Instruments)
An example of an SoC processor that implements a digital video system
is shown in Figure 11.3 below. This processor consists of various
components to input, process, and output digital video information. Its
details are discussed shortly.
2. SoCs improve power/performance ratio - Large processors running at
high frequencies consume more power and are more expensive to cool.
Several smaller processors running at a lower frequency can perform the
same amount of work while consuming less energy and power.
In Figure 11.1, the ARM processor, the two DSPs, and
the hardware accelerators can run a large signal processing application
efficiently when the application is properly partitioned across these
four different processing elements.
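The power argument above can be made concrete with the standard first-order CMOS dynamic power model, P = C x V^2 x f. The sketch below is illustrative only: the capacitance, voltage, and frequency values are made-up assumptions rather than figures for any device discussed here, and it assumes that the workload splits perfectly across two cores and that the supply voltage can be lowered along with the clock frequency.

```python
def dynamic_power(c_eff, v_dd, f_hz):
    """First-order CMOS dynamic power in watts: P = C * V^2 * f."""
    return c_eff * v_dd ** 2 * f_hz

# Hypothetical values chosen only for illustration.
C_EFF = 1e-9  # effective switched capacitance per core, in farads

# One core clocked at 1 GHz, assumed to need 1.2 V to meet timing.
single = dynamic_power(C_EFF, 1.2, 1.0e9)

# Two cores at 500 MHz each, assumed to run at a reduced 0.9 V.
dual = 2 * dynamic_power(C_EFF, 0.9, 0.5e9)

print(f"single core: {single:.2f} W, dual core: {dual:.2f} W")
```

Under these assumptions the two slower cores deliver the same aggregate clock rate for roughly 0.81 W instead of 1.44 W. The saving comes almost entirely from the squared voltage term, so it shrinks if the voltage cannot be reduced or if the application does not partition cleanly.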
3. Many applications require programmability - SoCs contain multiple
programmable processing elements. These are required for a number of
reasons:
New technology - Programmability makes upgrades and changes easier than
with nonprogrammable devices. For example, as new video codec
technology is developed, the algorithms to support these new standards
can be implemented easily on a programmable processing element. New
features are also easier to add.
Support for multiple standards and algorithms - Some digital video
applications require support for multiple video standards, resolutions,
and quality levels. It is easier to implement these on a programmable
system.
Full algorithm control - A programmable system lets the designer
customize and/or optimize a specific algorithm as necessary, which
gives the application developer more control over differentiation of
the application.
Software reuse in future systems - When digital video software is
developed as components, these can be reused or repackaged as building
blocks for future systems as necessary.
4. Constraints such as real-time, power, and cost - There are many
constraints in real-time embedded systems. Many of these constraints
are met by customizing the SoC to the application.
Figure 11.3. An SoC processor customized for digital video systems
(courtesy of Texas Instruments)
5. Special instructions - SoCs have
special CPU instructions to speed up the application. As an example,
the SoC in Figure 11.3 above
contains special instructions on the DSP to accelerate operations such
as:
32-bit multiply instructions for extended precision computation
Expanded arithmetic functions to support FFT and DCT algorithms
Improved complex multiplication
Double dot product instructions for improving throughput of FIR loops
Parallel packing instructions
Enhanced Galois field multiply
Each of these instructions accelerates the processing
of certain digital video algorithms. Of course, compiler support is
necessary to schedule these instructions, so the tools become an
important part of the entire system as well.
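To see why a dedicated instruction helps, consider what a Galois field multiply costs in plain software. The sketch below multiplies two bytes in GF(2^8) with a shift-and-XOR loop. The field size and the reduction polynomial (0x11B, the one used by AES) are assumptions chosen for illustration; the article does not say which fields or polynomials the DSP instruction supports, and `gf256_mul` is a hypothetical helper name.

```python
def gf256_mul(a, b, poly=0x11B):
    """Multiply two bytes in GF(2^8) using shift-and-XOR arithmetic.

    `poly` is the reduction polynomial; 0x11B is an assumed example.
    """
    result = 0
    for _ in range(8):
        if b & 1:            # add (XOR) the current multiple of a
            result ^= a
        b >>= 1
        a <<= 1              # multiply a by x
        if a & 0x100:        # reduce if the degree reached 8
            a ^= poly
    return result

print(hex(gf256_mul(0x57, 0x83)))
```

Each software multiply takes eight loop iterations with several operations apiece. A single-instruction equivalent removes that loop from the inner kernels of error-correction style code, which is the kind of gain the instruction list above is aiming at.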
6. Extensible - Many SoCs are extensible in ways
such as word size and cache size. Special tooling is also made
available to analyze systems as these system parameters are changed.
7. Hardware acceleration - There are several benefits to
using hardware acceleration in an SoC. The primary reason is a better
cost/performance ratio. Fast processors are costly. By partitioning
into several smaller processing elements, cost can be reduced in the
overall system. Smaller processing elements also consume less power and
can actually be better at implementing real-time systems, as the
dedicated units can respond more efficiently to external events.
Hardware accelerators are useful in applications that
have algorithmic functions that do not map to a CPU architecture well.
For example, algorithms that require a lot of bit manipulation require
a lot of registers. A traditional CPU register model may not be suited
to efficiently execute these algorithms.
A specialized hardware accelerator can be built that
performs bit manipulation efficiently. It sits beside the CPU and is
used by the CPU for bit manipulation operations. Highly responsive I/O
operations are another area where a dedicated accelerator with an
attached I/O peripheral will perform better.
Finally, applications that are required to process
streams of data, such as many wireless and multimedia applications, do
not map well to the traditional CPU architecture, especially those that
implement caching systems.
Figure 11.4. Block diagram of the video processing subsystem
acceleration module of the SoC in Figure 11.3 (courtesy of Texas
Instruments)
Since each streaming data element may have a limited
lifetime, processing it constantly thrashes the cache as new data
elements arrive. A specialized hardware accelerator with special
fetch logic can be implemented to provide dedicated
support to these data streams.
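The cache behavior described above can be illustrated with a toy model. The sketch below simulates a fully associative LRU cache and compares a streaming access pattern, where every cache line is touched once, with a small working set that is revisited. It is a simplified assumption-laden model: addresses are treated as whole cache lines, and real effects such as prefetching and spatial locality within a line are ignored. The cache size and access counts are arbitrary.

```python
from collections import OrderedDict

def hit_rate(addresses, cache_lines):
    """Hit rate of a fully associative LRU cache holding `cache_lines` lines."""
    cache = OrderedDict()
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(addresses)

# Streaming: 10,000 distinct line addresses, each touched exactly once.
streaming = hit_rate(range(10_000), cache_lines=64)

# Control-style code: a 32-line working set revisited over and over.
looping = hit_rate([i % 32 for i in range(10_000)], cache_lines=64)

print(f"streaming: {streaming:.4f}, looping: {looping:.4f}")
```

In this model the stream never hits, because each line is evicted before it could be reused, while the looping pattern misses only on its first pass. That gap is the motivation for dedicated fetch logic that moves stream data without going through the cache.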
Hardware acceleration is used on SoCs as a way to
efficiently execute classes of algorithms. As mentioned in the chapter
on power optimization, using accelerators where possible can
lower overall system power, since these accelerators are customized to
the class of processing and therefore perform these calculations very
efficiently.
The SoC in Figure 11.3 has hardware acceleration
support. In particular, the video processing subsystem (VPSS) as well
as the Video Acceleration block within the DSP subsystem are examples
of hardware acceleration blocks used to efficiently process video
algorithms.
Figure 11.4 above shows a block diagram of the VPSS. This hardware
accelerator contains:
A front end module containing:
  CCDC (charge-coupled device controller)
  Previewer
  Resizer (accepts data from the previewer or from external memory and
  resizes from ¼x to 4x)
And a back end module containing:
  Color space conversion
  DACs
  Digital output
  On-screen display
This VPSS processing element reduces the overall
DSP/ARM load through hardware acceleration. An example application
using the VPSS is shown in Figure 11.5 below.
Figure 11.5. A video phone example using the VPSS acceleration module
(courtesy of Texas Instruments)
8. Heterogeneous memory systems - Many SoC devices contain separate
memories for the different processing elements. This provides a
performance boost because of lower latencies on memory accesses, as
well as lower power from reduced bus arbitration and switching.
The programmable video and imaging coprocessor shown in Figure 11.6 is
optimized for imaging and video applications. Specifically, this
accelerator is optimized to perform operations such as filtering,
scaling, matrix multiplication, addition, subtraction, summing absolute
differences, and other related computations.
Much of the computation is specified in the form of
commands which operate on arrays of streaming data. A simple set of
APIs can be used to make processing calls into this accelerator. In
that sense, a single command can drive hundreds or thousands of cycles.
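The idea that one command stands for many element-level operations can be sketched with the sum of absolute differences (SAD), one of the operations listed above. The function below is a plain software model and not the coprocessor's actual API, which the article does not describe; the name `sad` and the list-of-lists block layout are assumptions made for illustration.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )

# Two hypothetical 8x8 pixel blocks; every reference pixel is off by one.
current = [[10 * r + c for c in range(8)] for r in range(8)]
reference = [[10 * r + c + 1 for c in range(8)] for r in range(8)]

print(sad(current, reference))
```

A single call on an 8x8 block hides 64 subtract, absolute-value, and accumulate steps. On an accelerator driven by array-level commands, the host issues one such request and the hardware spends the cycles, rather than the CPU iterating element by element.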
As discussed previously, accelerators are used to
perform computations that do not map efficiently to a CPU. The
accelerator in Figure 11.6 below
is an example of an accelerator that performs efficient operations
using parallel computation.
Figure 11.6. A hardware accelerator example: video and imaging
coprocessor (courtesy of Texas Instruments)
This accelerator has an 8-parallel multiply
accumulate (MAC) engine which significantly accelerates classes of
signal processing algorithms that require this type of parallel
computation. Examples include:
JPEG encode and decode
MPEG-1/2/4 encode and decode
H.263 encode and decode
WMV9 decode
H.264 baseline profile decode
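A rough way to picture the benefit of an 8-parallel MAC engine is to count how many steps a dot product needs when eight multiply-accumulates happen at once. The sketch below is a behavioral model only: `mac8` is a hypothetical name, the step count is not a cycle-accurate figure for any real device, and pipeline and memory effects are ignored.

```python
def mac8(coeffs, samples, lanes=8):
    """Dot product computed `lanes` multiply-accumulates at a time.

    Returns (result, steps), where steps counts groups of parallel MACs.
    """
    acc = 0
    steps = 0
    for i in range(0, len(coeffs), lanes):
        # One step models all lanes multiplying and accumulating together.
        acc += sum(c * x for c, x in zip(coeffs[i:i + lanes], samples[i:i + lanes]))
        steps += 1
    return acc, steps

result, steps = mac8(list(range(16)), [1] * 16)
print(result, steps)
```

In this model a 16-tap dot product finishes in 2 steps instead of the 16 a single MAC unit would need. Kernels dominated by such inner products, including the transforms and filters inside the codecs listed above, are the ones that benefit most.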
The variable length code/decode (VLCD) module in this
accelerator supports the following fundamental operations very
efficiently:
Quantization and inverse quantization (Q/IQ)
Variable length coding and decoding (VLC/VLD)
Huffman tables
Zigzag scan flexibility
The design of this block is such that it operates on
one macroblock of data at a time (a maximum of six 8x8 blocks in 4:2:0
format). Before starting to encode or decode a bitstream, the
application software must first initialize the proper registers and
memory in the VLCD module.
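Two of the ideas above, the zigzag scan and the six-block macroblock, can be shown in a few lines. The sketch below generates the conventional zigzag order for an 8x8 block in software; it says nothing about how the VLCD module is actually programmed, and the mention of "zigzag scan flexibility" suggests the hardware may support other orders too. The six blocks are the usual 4:2:0 layout of four luma blocks plus one block for each chroma component. All names here are hypothetical.

```python
# In 4:2:0 format a 16x16 macroblock carries four 8x8 luma blocks
# and one 8x8 block for each of the two chroma components.
MACROBLOCK_BLOCKS = ("Y0", "Y1", "Y2", "Y3", "Cb", "Cr")

def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zigzag order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col selects one anti-diagonal
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals run bottom-left to top-right, odd ones the reverse.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order

def zigzag_scan(block):
    """Flatten a square block into a list following the zigzag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]

block = [[8 * r + c for c in range(8)] for r in range(8)]
print(zigzag_scan(block)[:10])
```

The scan places low-frequency coefficients first, so after quantization the trailing entries tend to be zeros that variable length coding can represent compactly. An encoder would repeat this for each of the six blocks in `MACROBLOCK_BLOCKS`.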
This hardware accelerator also contains a block
called a sequencer which is really just a 16-bit microprocessor
targeted for simple control, address calculation, and loop control
functions. This simple processing element offloads the sequential
operations from the DSP.
The application developer can program this sequencer
to coordinate the operations among the other accelerator elements,
including the iMX, VLCD, System DMA, and the DSP. The sequencer code is
compiled from simple macros using support tools and is linked with
the DSP code, to be loaded later by the CPU at run time.
One of the other driving factors for the development
of SoC technology is the fact that there is an increasing demand for
programmable performance. For many applications, performance
requirements are increasing faster than the ability of a single CPU to
keep pace.
The allocation of performance, and thus response
time, for complex real-time systems is often easier with multiple CPUs.
In addition, dedicated CPUs in peripherals or special accelerators can
offload low-level functionality from a main CPU, allowing it to focus
on higher-level functions.
Next in Part 2: Software architecture for an SoC
Robert Oshana is
an engineering manager in the Software Development Organization of Texas Instruments DSP Systems business.
He is responsible for the development of hardware and software debug
technology for many of TI's programmable devices. He has 25 years of
real-time embedded development experience.
Used with the permission of the publisher, Newnes/Elsevier, this series
of two articles is based on material from DSP Software Development
Techniques for Embedded and Real Time Systems, by Robert Oshana.
References
1. Multiprocessor systems-on-chips, by Ahmed Jerraya, Hannu Tenhunen
and Wayne Wolf, page 36, IEEE Computer, July 2005.
2. Embedded Software in Real-Time Signal Processing Systems: Design
Technologies, Proceedings of the IEEE, vol. 85, no. 3, March 1997.
3. A Software/Hardware Co-design Methodology for Embedded
Microprocessor Core Design, IEEE 1999.
4. Component-Based Design Approach for Multicore SoCs, Copyright 2002,
ACM.
5. A Customizable Embedded SoC Platform Architecture, IEEE IWSOC'04
(International Workshop on System-on-Chip for Real-Time Applications).
6. How virtual prototypes aid SoC hardware design, Hellestrand, Graham.
EEdesign.com May 2004.
7. Panel Weighs Hardware, Software Design Options, Edwards, Chris.
EETUK.com Jun 2000.
8. Back to the Basics: Programmable SoCs. Zeidman, Bob. Embedded.com
July 2005.