No one digital signal processor is right for every application. Knowing what to expect from digital signal processing is the first step in finding a chip that fits your needs.
by Don Morgan
When the digital signal processor (DSP) was first introduced in the early 1980s, it was considered a specialty processor for certain exotic, leading-edge applications. Since then, it has found its way into home computers, electric automobiles, motion controllers, music, and video; the list is endless.
Its uses are so widespread because the DSP isn't a specialty item; rather, it's a mathematical facilitator. We find that by applying the mathematics of a subject (the same math we use to design and simulate a product or application), we can create a product that's more compact and consistent, and easier to maintain and change, than one we cobble together out of materials of the physical universe.
The DSP combines an arithmetic-specific instruction set with efficiency and speed. Any system that can be described as linear time-invariant (LTI) can probably be approximated by the DSP and benefit from its facility. Of course, applications exist that are still beyond the throughput of the generic DSP chip, and there are solutions for these cases too. Meanwhile, DSPs themselves continue to gain speed and efficiency.
What is digital signal processing?
The concept of a signal is difficult to define. On an intuitive level, it may be as simple as a wiggle, or lack thereof, in the physical universe. From an engineering viewpoint, a signal may be defined as a function of one or more variables that contains information about the behavior or nature of some phenomenon. A data sequence may be such a function. If we modify this sequence or respond to it in any way, we have performed signal processing. Digital signal processing occurs in the digital domain and enjoys the benefits and burdens of that environment.
There is no single definitive DSP environment. Digital signal processing can occur in microcontrollers, application-specific ICs with no intelligent controller, CISC CPUs, RISC CPUs, PLDs, FPGAs, and digital signal processors. Processing can take place in real time, as with motion control, or be time-constrained, as with many audio/video applications, or completely off-line, as with rendering. The needs of the application determine the theater, and therefore, the devices used. It is for this reason that we have so many DSPs to choose from.
Mathematics is the primary tool used in signal processing and certainly in DSP. The capacitors, resistors, inductors, and active elements of previous technologies are replaced by the math used to design them. And, though signal processing can occur anywhere, it is performed most efficiently on a machine designed to perform mathematical operations.
What do we expect of the DSP?
The DSP isn't designed to perform the same operations as the CPU in your computer. Its instruction set is tailored to those operations necessary for fast arithmetic processing. The operations it needs are entirely numeric, often array operations involving infinite matrices. The core of the processing is the multiply/accumulate, the heart of the dot product, the convolution, the correlation, and the integral. These are the operations necessary to describe an LTI system, the focus of all engineering applications.
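To make the multiply/accumulate concrete, here is a minimal Python sketch (plain host code, not DSP code and not taken from any vendor library) showing how a dot product and an FIR convolution both reduce to repeated multiply/accumulate steps. The names `mac_dot` and `fir_filter` are illustrative assumptions, and samples before the start of the signal are assumed to be zero.

```python
def mac_dot(coeffs, samples):
    """Dot product built from repeated multiply/accumulate steps."""
    acc = 0
    for c, x in zip(coeffs, samples):
        acc += c * x  # one multiply/accumulate per term
    return acc


def fir_filter(coeffs, signal):
    """Direct-form FIR filter: each output is one dot product."""
    out = []
    for n in range(len(signal)):
        # Gather x[n], x[n-1], ... (assumption: zero before the signal starts).
        window = [signal[n - k] if n - k >= 0 else 0
                  for k in range(len(coeffs))]
        out.append(mac_dot(coeffs, window))
    return out


if __name__ == "__main__":
    # Feeding in a unit impulse returns the coefficients themselves.
    print(fir_filter([1, 2, 3], [1, 0, 0, 0]))
```

On a DSP, each pass through the inner loop would typically map to a single multiply/accumulate instruction, which is why the instruction set is built around it.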
Often when we think of the DSP, we think of a kind of CPU, which isn't always correct. Digital signal processing can be performed with dedicated chips that perform one function. This option can be beneficial from the standpoint of economics as well as efficiency. Having an entire DSP with concomitant memory, programming, and interface isn't always necessary, especially when the job can be performed by a single chip requiring little more than a simple setup. Realize too that DSPs, even the fastest, may not be fast enough for the operation you have in mind; this is still the case for many radio and cell phone applications.
Components that do DSP functions
Many of the functions normally associated with DSPs are available in individual ICs without the cost, the generality, or the need for specialized programming. These devices also provide something that the standard DSP cannot: the speed required by systems involved in high-frequency processing, such as radio and cell phones. These commercial off-the-shelf (COTS) devices implement such functions as FIR filters, quadrature decoders, multipliers, half-band filters, numerically controlled oscillators, histogrammers, video image filters, and convolvers.
FPGAs and PLDs can also be used to perform DSP functions. A generalized product can be built that is programmed for a particular purpose just before it is shipped. As is the case with programmable parts, upgrades and modifications are more easily made in the field, and these parts can run at sample rates well in excess of even the fastest DSPs available today.
This approach can prove quite an advantage in some systems. For more detail on ICs, PLDs, and FPGAs in DSP, see "Digital Signal Processing With or Without a DSP," p. 93.
What do we want from a DSP?
Many applications require functions or programmability that cannot be supplied by stand-alone parts. What should we look for in a DSP? There is no way to answer that question definitively; the application determines what is needed.
The nature of the DSP is to perform numerical processing at high speeds. It isn't only math, however, that the architecture of a DSP supports. It also supports a specialized approach to addressing that allows access to data and program in one instruction cycle. It supports special addressing schemes that allow for automatic indexing through rotary (circular) buffers and nonlinear addressing of this data in operations that control the buffers. And it supports operations such as bit-reverse addressing.
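The automatic indexing through a rotary buffer can be pictured with a short Python sketch. This is only a software model under stated assumptions: the class name `CircularBuffer` and its methods are hypothetical, and the explicit modulo operation stands in for the wraparound that a DSP's address generator would perform in hardware at no cycle cost.

```python
class CircularBuffer:
    """Fixed-length delay line with wraparound (modulo) addressing."""

    def __init__(self, length):
        self.data = [0] * length
        self.index = 0  # next write position, which holds the oldest sample

    def push(self, sample):
        """Overwrite the oldest sample and advance the index with wraparound."""
        self.data[self.index] = sample
        self.index = (self.index + 1) % len(self.data)

    def newest_first(self):
        """Return the samples ordered from most recent to oldest."""
        n = len(self.data)
        return [self.data[(self.index - 1 - k) % n] for k in range(n)]


if __name__ == "__main__":
    buf = CircularBuffer(3)
    for sample in (1, 2, 3, 4):
        buf.push(sample)
    print(buf.newest_first())  # the sample 1 has been overwritten
```

Because no data is ever moved, only the index, a filter can keep its most recent inputs available without the copying that a linear buffer would need.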
Often a DSP will be designed into a system as a coprocessor. The control processor would be responsible for the handling of system interfaces that involve bit banging, peripherals, and human interface, while the DSP would be solely responsible for processing streams of data.
The choice of the DSP is based on the application. As an aid in determining what is best for your application, Table 1 describes some of the more important aspects of DSP architecture.
Three more items are often talked about, but their meanings tend to be more subjective: higher clock speeds, ease of hardware implementation, and easy coding.
DSPs come in many forms
The first DSPs were integer units with a multiplier and an ALU on board to implement the multiply/accumulate instruction that is the core of LTI systems. Through the years, the complexity and requirements increased, floating point was added, and the clock speeds got higher. The drive toward more efficient processing produced single-cycle instructions, pipelined jumps, conditional execution, dual pipelines, and SIMD.
Many companies manufacture DSPs, but probably the best known and most popular are Texas Instruments, Analog Devices, and Motorola. Each produces both basic and highly complex DSPs. Following is a small sampling of DSPs offered by some of these manufacturers.
DSP56K family
This family has many components, all of them originating with Motorola's DSP56001/2. This core, with its integer arithmetic unit, became the CPU for a number of derivative processors, including the DSP56004, DSP56007, DSP56009, and DSP56011, which were dedicated to audio, and the DSP56005 and DSP56006, which were mainly motion control chips.
The architecture of this chip made it a nice fit for audio applications. Motorola recognized this fact: the DSP56004, DSP56007, and DSP56011 provide two serial input lines and three serial output lines (I2S) that interface easily to standard audio chips. This easy fit made it a natural for a number of audio implementations. Currently, AC-3, Pro Logic, and DTS algorithms are masked onto this chip.
The 24-bit word actually makes it a good fit for many other applications, as well. It is easily proved that the longer the accumulator, the less the quantization error will be. So for many applications from motion control to audio, this DSP has found a home.
Most instructions (except jumps, compares, tests, and so on) execute in one instruction cycle, which consists of two clocks. So for a 20MHz clock, we have approximately 10 MIPS. Of course, few algorithms execute without jumps, compares, or tests.
The device has a Harvard architecture, allowing dual data moves combined with concurrent arithmetic operations such as multiplies and accumulates, which really means that it's possible to approach high throughputs. A 1,024-point FFT will take 3.39ms using 24-bit arithmetic.
The instruction set includes no-overhead looping, which allows an FIR filter to be coded with only two instructions and executed in 2(n+1) clock cycles.
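The cycle arithmetic quoted for this family can be checked with a few lines of Python. This is a worked example rather than a benchmark: it assumes that n in 2(n+1) is the number of filter taps, it uses the two-clocks-per-instruction-cycle figure given above, and the function names are illustrative only.

```python
def mips(clock_hz, clocks_per_instruction_cycle=2):
    """Peak MIPS, assuming every instruction takes one instruction cycle."""
    return clock_hz / clocks_per_instruction_cycle / 1e6


def fir_clocks(taps):
    """Clock cycles for an FIR filter, using the 2(n+1) figure from the text.

    Assumption: n is the number of taps.
    """
    return 2 * (taps + 1)


def fir_time_us(taps, clock_hz):
    """Time for one FIR output in microseconds at the given clock rate."""
    return fir_clocks(taps) / clock_hz * 1e6


if __name__ == "__main__":
    print(mips(20e6))             # peak rate at a 20MHz clock
    print(fir_clocks(64))         # clocks for a 64-tap filter
    print(fir_time_us(64, 20e6))  # the same filter expressed in microseconds
```

Under these assumptions a 20MHz part peaks at 10 MIPS, and a 64-tap filter needs 130 clocks, or about 6.5 microseconds per output sample.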
In addition, the addressing modes available include bit reversal for the FFT butterflies and flags in the status register for block floating point. This processor has a good deal of arithmetic power for computing FFTs.
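For readers who have not met bit-reversed addressing, the following Python sketch shows the index permutation that a radix-2 FFT needs. It is a software illustration only; the function names are assumptions, and on a DSP with a bit-reverse addressing mode the address generator would typically produce this ordering in hardware.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # shift the low bit into the result
        index >>= 1
    return result


def bit_reversed_order(n):
    """Bit-reversed index sequence for an n-point transform."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


if __name__ == "__main__":
    print(bit_reversed_order(8))
```

For an 8-point transform the sequence is 0, 4, 2, 6, 1, 5, 3, 7. Without hardware support, this reordering costs a separate pass over the data.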
Motorola has since introduced a newer family of DSPs: the DSP56300 series. Many of the problems in the earlier DSP56K family were fixed in this series. This chip has genuine single-cycle operations, as well as a barrel shifter, but there is still no pipelined jump.
The SHARC
The ADSP-21065 (also known as the SHARC) was introduced several years ago by Analog Devices and immediately became very popular for a number of applications, including instrumentation, motion control, and audio processing. Its architecture is attractive in that it boasts four data buses and single-cycle operation that includes two data fetches, one program fetch, and an I/O access. It provides for both integer and floating-point arithmetic, with 32 bits for the integer and 32 and 40 bits for the floating-point. It also has a pipelined branch, a large amount of configurable on-board (dual-port) RAM, serial support for popular A/Ds and D/As, a number of DMA channels, and multiprocessor support. When this chip was released, it was expensive. Since then, a number of versions of the DSP have been introduced that are quite affordable.
This processor seems to have been designed to facilitate transform type processing used in scientific and multimedia applications. Some instructions even incorporate features used in the butterfly additions and subtractions that are part of these transforms.
Recently, Analog Devices announced that it would be coming out with another addition to the SHARC family, the ADSP-21160. This new part is substantially the same as the ADSP-21065, with the addition of a parallel and identical processing unit incorporating a shifter, ALU, and multiplier. Unlike a fully parallel processing device, the ADSP-21160 is not a dual pipeline part. The second unit is only used in SIMD mode, which is enabled by setting a bit in a control register.
In SIMD mode, most of the instructions will act on both processing units instead of just one. With a 100MHz clock, this chip is capable of performing an FFT in 90µs. Each tap of an FIR filter will take 5ns and an IIR biquad will take 20ns.
The TMS320C6000 series
Texas Instruments probably has the longest history in the DSP business. Since the early 1980s, they have produced a series of parts that covers almost any application.
If your application requires muscle and speed, the TMS320C6xx from TI is probably the best COTS general-purpose device available. Many impressive aspects of the construction of this processor contribute to its efficiency and speed. Not only does it have a fully pipelined branch, but each instruction is also conditional. Combined with single-cycle operation and a fast clock, this can make for some very fast processing. But there is one more addition: the device possesses a dual pipeline. All of these factors contribute to a leap in efficiency that has Texas Instruments touting anywhere from 1,200 MIPS to 2,000 MIPS, depending on the clock speeds involved. Of course, results vary depending upon the application and the care with which the code is written.
This processor uses a technique known as VLIW (very long instruction word), which can allow up to eight instructions to be executed in parallel, each proceeding through the pipeline in parallel. The key to efficient programming in a multiple-pipeline environment is the scheduling of instructions so that no pipe is stalled. This makes hand-coding such a device an extremely arduous task.
To help, TI provides a C compiler, an assembler, and a new form called linear assembly. Linear assembly is similar to standard assembly, except that it allows the compiler to optimize the code for you. This can help achieve efficient code in a much shorter time. It doesn't remove the responsibility from the software engineer for creating strong and efficient algorithms, but it does aid in the tedious and problematic process of instruction scheduling.
The core consists of thirty-two 32-bit general purpose registers, two multipliers, and six ALUs. Currently, the floating-point core runs at 167MHz and the integer core at 200MHz. The device supports 8-, 16-, and 32-bit data types, and has 40-bit arithmetic capability. Integer and floating-point versions of the product are available. The floating-point core has support for 32- (single precision) or 64-bit (double precision) results that are fully compliant with IEEE floating-point operations.
SIMD vs. multiple pipelines
Two basic methods exist for achieving parallel operation in a CPU: SIMD and multiple pipelines. SIMD increases the amount of data processed by a single instruction. Multiple pipelines make it possible to execute more instructions in the same cycle.
Multi-pipeline architecture offers several advantages because the instructions need not be the same. The problem is that, to get the full benefit of this construction, the instructions must pair; that is, they must be compatible with one another. They cannot access the same memory locations simultaneously, one unit can't operate on the result of the other's operation until it is complete, and so on. The architecture makes low-level programming more difficult.
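The pairing constraint can be illustrated with a toy Python check. This is a deliberately simplified model and not the rule set of any real processor: the `Instr` record, its fields, and the register names are hypothetical, and actual devices add further restrictions such as functional-unit and memory-bank conflicts.

```python
from collections import namedtuple

# Hypothetical instruction record: one destination and a tuple of sources.
Instr = namedtuple("Instr", ["dest", "sources"])


def can_pair(first, second):
    """Return True if two instructions could issue in the same cycle.

    Toy rules (assumptions): the pair must not write the same location,
    and neither instruction may read the location the other one writes.
    """
    if first.dest == second.dest:
        return False  # both write the same location
    if first.dest in second.sources:
        return False  # second needs a result that is not yet complete
    if second.dest in first.sources:
        return False  # first would read a location being overwritten
    return True


if __name__ == "__main__":
    multiply = Instr("r1", ("r2", "r3"))
    independent_add = Instr("r4", ("r5", "r6"))
    dependent_add = Instr("r4", ("r1", "r6"))
    print(can_pair(multiply, independent_add))  # no shared locations
    print(can_pair(multiply, dependent_add))    # reads the multiply's result
```

A scheduler, whether a compiler or a programmer working by hand, has to apply checks of this kind to every candidate pair, which is where much of the difficulty of low-level programming on such parts comes from.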
The ADSP-21160 has no second pipeline and no problems with instruction pairing because, by definition, only one instruction exists for both units; it's the data that increases. Software written for an SIMD machine such as this one can double the throughput of certain operations.
Long FIR filters can now be written without the fear of eating up the duty cycle in the processor. In many cases, this can obviate the need for downsampling and subband coding (with all the complexity) in routines that previously required it. But it isn't only FIR filters that benefit. Transforms typically comprise iteration upon iteration of simple operations like the butterfly, or half-band FIR or IIR filters. The time required to execute these operations can be decreased substantially.
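The doubling of throughput on an FIR filter can be sketched in Python by counting loop steps. This is a conceptual model only, assuming two identical processing units that each perform one multiply/accumulate per step; the function names are illustrative, and the code says nothing about how any particular SIMD part distributes its data.

```python
def fir_output_single(coeffs, window):
    """One FIR output with a single multiply/accumulate unit."""
    acc = 0
    steps = 0
    for c, x in zip(coeffs, window):
        acc += c * x
        steps += 1  # one tap per step
    return acc, steps


def fir_output_simd2(coeffs, window):
    """The same output with one instruction driving two units per step."""
    acc_a = 0  # accumulator in the first unit (even taps)
    acc_b = 0  # accumulator in the second unit (odd taps)
    steps = 0
    for i in range(0, len(coeffs), 2):
        acc_a += coeffs[i] * window[i]
        if i + 1 < len(coeffs):
            acc_b += coeffs[i + 1] * window[i + 1]
        steps += 1  # two taps per step
    return acc_a + acc_b, steps  # final add combines the partial sums


if __name__ == "__main__":
    coeffs = [1, 2, 3, 4]
    window = [5, 6, 7, 8]
    print(fir_output_single(coeffs, window))
    print(fir_output_simd2(coeffs, window))
```

Both versions return the same sum, but the second needs half as many steps plus one final addition. The 5ns-per-tap figure quoted earlier for a 100MHz clock is consistent with two taps per 10ns cycle, assuming a single-cycle multiply/accumulate.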
Specialized DSPs
Many manufacturers have DSPs aimed at certain markets. Following Zoran's lead with its AC-3 chip, Crystal Semiconductor, a maker of high-quality A/Ds for the audio world, and Motorola have introduced a series of DSPs pre-masked with the algorithms for DTS, AC-3, and Pro Logic.
Analog Devices has a series of low-cost 16-bit DSPs with A/Ds built in and configurations approaching that of an MCU. These parts can be very handy for creating a self-sufficient system inexpensively. The bus is limited to 16 bits (about 96 dB of dynamic range), but this is quite enough for many applications in the voice band.
For the higher end, Sharp has the Butterfly DSP, a fast chip created for transform-based processing. This chip is designed specifically for applications requiring high FFT bandwidth and typically finds its way into radar, scientific, and medical systems.
Software for developing DSP applications
Besides the hardware available for performing DSP functions, a good deal of software is also available. Here are three of the most popular examples.
All of these packages will perform the math necessary to develop the coefficients for filters and simulate algorithms; some will actually produce code for target DSPs.
Mathcad is the least expensive software package, but offers a wide range of functions. With some understanding of signal processing theory and mathematics, you'll be able to produce coefficients and strategies through simulation for your application.
Matlab is a rich software package with many extensions for different areas of mathematics. It also has a compiler that allows an engineer to write a software package that will execute as an independent application on a user machine. Matlab has a signal processing package that can help model any of the popular filter forms, as well as derive the coefficients. In addition, it has an excellent simulation facility and can be used to produce code directly for certain DSPs.
Elanix offers a software package that will model systems and produce the numbers you need for the design of DSP applications. Besides producing code for some popular DSPs, it will also do the same for Xilinx FPGAs.
If DSPs are not already invading your engineering, they will be soon. Good luck and have fun.
You will find a table listing DSPs at www.embedded.com/1999/9904/9904srtable.htm.
Don Morgan is senior engineer at Ultra Stereo Labs and a consultant with 25 years' experience in signal processing, embedded systems, hardware, and software. Morgan's most recent book is Numerical Methods for DSP Systems in C.
TABLE 1: DSP checklist

| Feature | What to look for |
| --- | --- |
| ALU and Bus Width | Because the DSP is designed for arithmetic processing, the bus width must be adequate to accommodate the result of any double precision multiply operation and subsequent additions. Multiplies cannot result in overflows but additions can. A multiply/accumulate operation can continue for some time depending on the length of a given filter. If you're after high precision and accuracy, you won't want to quantize your result until the very end, so the bus must be adequate. |
| Saturation | When the result of an operation becomes greater than can be expressed within the precision of the device, you'll want it to saturate rather than roll over. Common implementations saturate to the most positive or most negative value available on your machine. The MMX instruction set on Pentium II-compatible chips also provides for saturation to zero; this is a feature that would be nice in standard DSPs. |
| Division | Division is difficult; it can be a complex and time-consuming operation. It's usually implemented on a DSP as a nonrestoring division, using a primitive that must be executed iteratively until you reach your target precision. If your application requires some sort of division, check to see that the part has some sort of primitive for doing so. Not all DSPs include a division primitive. |
| Barrel Shifter | A barrel shifter performs a multi-bit shift in one cycle. This is important for floating-point normalization and many other operations. Without this feature, a simple normalization required for floating-point operations will require as many instruction cycles as shifts. Beware, not all DSPs include a barrel shifter. |
| Logical Operations | Normally this aspect is not a problem, but it's important to know that the part you choose has the set of logical operations you will need, including AND, OR, EXCLUSIVE OR, and NEGATION. Most of these functions are available in some form or another, though not always in the form you wish. The most frequently missed is the conditional. This facility can turn a simple subtraction into a divide primitive and will generally increase the efficiency of the machine. |
| Addressing | If you do transform processing, the butterfly or bit-reverse addressing capability is important to you. Dot product and matrix operations can require long circular buffers; multi-rate processing will want these buffers to have different indices. |
| Data Paths | The number of internal data paths directly influences the number of instructions that may be executed in a single cycle. |
| Harvard Architecture | Harvard architecture allows for data and instructions to be accessed within the same cycle. As I've pointed out, it is perfectly possible to perform signal processing on a von Neumann machine, but it will take longer. |
| Single-Cycle Operations | The processor may say that it has single-cycle operation, but how many clocks per cycle does it require? What are the instruction latencies (how many pipeline stages)? |
| Parallel and Pipelined Operations | These are techniques for improving the efficiency of the DSP. You won't want to suffer pipeline problems for loops or for jumps. Look for pipelined jumps that do not require the pipeline to drain and refill every time it must branch. |
| On-chip Cache | Clearly, this can make a difference in the efficiency of the operations, especially when it can mean less-expensive, off-board memory. Check to see if it caches data as well as program code. |
| Special Operations | Look for any operations particular to the DSP you're looking at that would make your job easier. If you do a lot of transform processing, look for instructions that make it easier, and so on. |
| Special Peripherals | Would your project benefit from more on-board I/O, an A/D, or a dedicated interface to certain buses? Is it going to be communicating with other DSPs? Would more DMA be beneficial? How about dual-port RAM on the chip? |
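Two of the checklist entries, saturation and the iterative division primitive, are easier to judge with a concrete picture. The Python sketch below models both in software under stated assumptions: a 16-bit two's-complement word for the additions, unsigned operands for the division, and function names that are purely illustrative. It does not reproduce any particular chip's instructions.

```python
def saturating_add(a, b, bits=16):
    """Add two signed values, clamping to the representable range."""
    most_positive = (1 << (bits - 1)) - 1
    most_negative = -(1 << (bits - 1))
    return max(most_negative, min(most_positive, a + b))


def wrapping_add(a, b, bits=16):
    """Add two signed values with two's-complement rollover."""
    total = (a + b) & ((1 << bits) - 1)
    return total - (1 << bits) if total >= (1 << (bits - 1)) else total


def nonrestoring_divide(dividend, divisor, bits):
    """Unsigned nonrestoring division, one quotient bit per iteration.

    Assumes 0 <= dividend < 2**bits and divisor > 0. The partial
    remainder may go negative; it is corrected by adding on the next
    step instead of being restored immediately.
    """
    remainder = 0
    quotient = 0
    for i in range(bits - 1, -1, -1):
        bit = (dividend >> i) & 1
        if remainder >= 0:
            remainder = 2 * remainder + bit - divisor  # conditional subtract
        else:
            remainder = 2 * remainder + bit + divisor  # conditional add
        quotient = (quotient << 1) | (1 if remainder >= 0 else 0)
    if remainder < 0:
        remainder += divisor  # single final correction
    return quotient, remainder


if __name__ == "__main__":
    print(saturating_add(30000, 10000))  # clamps at the most positive value
    print(wrapping_add(30000, 10000))    # rolls over to a negative value
    print(nonrestoring_divide(13, 3, 4))
```

With rollover, a sum that should be large and positive comes out as -25536, which in an audio path is a loud click; saturation holds it at 32767 instead. The loop body of `nonrestoring_divide` corresponds to the kind of primitive the checklist describes: it is run once per quotient bit, so precision trades directly against cycles.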