CMP EMBEDDED.COM

A DSP for Every Application

No one digital signal processor is right for every application. Knowing what to expect from digital signal processing is the first step in finding a chip that fits your needs.

by Don Morgan

When the digital signal processor (DSP) was first introduced in the early 1980s, it was considered a specialty processor for certain exotic, leading-edge applications. Since then, it has found its way into home computers, electric automobiles, motion controllers, music, video—the list is endless.

Its uses are so widespread because the DSP isn’t a specialty item; rather, it’s a mathematical facilitator. We find that by applying the mathematics of a subject (the same math we use to design and simulate a product or application) we can create a product that’s more compact, more consistent, and easier to maintain and change than one we cobble together out of materials of the physical universe.

The DSP combines an arithmetic-specific instruction set with efficiency and speed. Any system that can be described as linear time invariant (LTI) can probably be approximated by the DSP and benefit from its capabilities. Of course, applications exist that are still beyond the throughput of the generic DSP chip, and there are solutions for these cases too. Meanwhile, the speed and efficiency of the DSPs themselves continue to improve.

What is digital signal processing?
The concept of a signal is difficult to define. On an intuitive level, it may be as simple as a wiggle, or lack thereof, in the physical universe. From an engineering viewpoint, a signal may be defined as a function of one or more variables that contains information about the behavior or nature of some phenomenon. A data sequence may be such a function. If we modify this sequence or respond to it in any way, we have performed signal processing. Digital signal processing occurs in the digital domain and enjoys the benefits and burdens of that environment.

There is no single definitive DSP environment. Digital signal processing can occur in microcontrollers, application-specific ICs with no intelligent controller, CISC CPUs, RISC CPUs, PLDs, FPGAs, and digital signal processors. Processing can take place in real-time, as with motion control, or be time-constrained, as with many audio/video applications, or completely off-line, as with rendering. The needs of the application determine the theater, and therefore, the devices used. It is for this reason that we have so many DSPs to choose from.

Mathematics is the primary tool used in signal processing and certainly in DSP. The capacitors, resistors, inductors, and active elements of previous technologies are replaced by the math used to design them. And, though signal processing can occur anywhere, it is performed most efficiently on a machine designed to perform mathematical operations.

What do we expect of the DSP?
The DSP isn’t designed to perform the same operations as the CPU in your computer. Its instruction set is tailored to those operations necessary for fast arithmetic processing. The operations it needs are entirely numeric, often array operations involving large matrices. The core of the processing is the multiply/accumulate, the heart of the dot product, the convolution, the correlation, and the integral. These are the operations necessary to describe an LTI system, the focus of a great many engineering applications.
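
To make the role of the multiply/accumulate concrete, here is a minimal sketch in Python (used here only as a modeling language, not as DSP code). The function names `mac_dot` and `fir_filter` are illustrative assumptions; they show how a dot product and an FIR convolution both reduce to the same repeated multiply/accumulate step.

```python
def mac_dot(a, b):
    """Dot product built from repeated multiply/accumulate steps."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y  # one multiply/accumulate per pair of operands
    return acc


def fir_filter(samples, coeffs):
    """Direct-form FIR: each output is a dot product of the coefficients
    with the most recent inputs (inputs before the start count as zero)."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]  # same multiply/accumulate step
        out.append(acc)
    return out


print(mac_dot([1, 2, 3], [4, 5, 6]))        # 32
print(fir_filter([1, 0, 0, 0], [3, 2, 1]))  # [3, 2, 1, 0]
```

Feeding the filter a unit impulse returns the coefficients themselves, which is a quick sanity check on any FIR implementation.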

Often when we think of the DSP, we think of a kind of CPU, which isn’t always correct. Digital signal processing can be performed with dedicated chips that perform one function. This option can be beneficial from the standpoint of economics as well as efficiency. Having an entire DSP with concomitant memory, programming, and interface isn’t always necessary—especially when the job can be performed by a single chip requiring little more than a simple setup. Realize that DSPs, even the fastest, may not be fast enough for the operation you have in mind, which is still the case for many radio and cell phone applications.

Components that do DSP functions
Many of the functions normally associated with DSPs are available in individual ICs without the cost, generality, or the need for specialized programming. These devices also do something that the standard DSP cannot: they meet the unusual speed requirements of systems involved in high-frequency processing such as radio and cell phones. These COTS devices perform such tasks as FIR filters, quadrature decoders, multipliers, half-band filters, numerically controlled oscillators, histogrammers, video image filters, and convolvers.

FPGAs and PLDs can also be used to perform DSP functions. A generalized product can be built that is programmed for a particular purpose just before it is shipped. As is the case with programmable parts, upgrades and modifications are more easily made in the field, and these parts can run at sample rates well in excess of even the fastest DSPs available today.

This approach can prove quite an advantage in some systems. For more detail on ICs, PLDs, and FPGAs in DSP, see “Digital Signal Processing With or Without a DSP,” p. 93.

What do we want from a DSP?
Many applications require functions or programmability that cannot be supplied by stand-alone parts. What should we look for in a DSP? There is no way to answer that question definitively—the application determines what is needed.

The nature of the DSP is to perform numerical processing at high speeds. It isn’t only math, however, that the architecture of a DSP supports. It also supports a specialized approach to addressing that allows access to data and program in one instruction cycle. It supports special addressing schemes that allow for automatic indexing through circular (rotary) buffers and nonlinear addressing of the data they hold. And it supports operations such as bit-reverse addressing.
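
The two addressing schemes can be modeled in a few lines. This is only a sketch of what the hardware address generators do for free; the names `bit_reverse` and `CircularBuffer` are assumptions made for illustration, not part of any vendor's tools.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (FFT reordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


class CircularBuffer:
    """Fixed-length delay line; the write index wraps automatically."""

    def __init__(self, length):
        self.data = [0] * length
        self.head = 0  # position of the next write

    def push(self, sample):
        self.data[self.head] = sample
        self.head = (self.head + 1) % len(self.data)  # modulo addressing

    def newest_first(self):
        """Return the stored samples from most recent to oldest."""
        n = len(self.data)
        return [self.data[(self.head - 1 - k) % n] for k in range(n)]


buf = CircularBuffer(4)
for sample in [1, 2, 3, 4, 5, 6]:
    buf.push(sample)

print(buf.newest_first())                      # [6, 5, 4, 3]
print([bit_reverse(i, 3) for i in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]
```

On a general-purpose CPU the modulo and the bit loop cost instructions on every access; a DSP's address unit performs them as a side effect of the data fetch.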

Often a DSP will be designed into a system as a coprocessor. The control processor would be responsible for the handling of system interfaces that involve bit banging, peripherals, and human interface, while the DSP would be solely responsible for processing streams of data.

The choice of the DSP is based on the application. As an aid in determining what is best for your application, Table 1 describes some of the more important aspects of DSP architecture.

Three more items are often talked about but their meanings tend to be more subjective. They are higher clock speeds, ease of hardware implementation, and easy coding.

DSPs come in many forms
The first DSPs were integer units with a multiplier and an ALU on board to implement the multiply/accumulate instruction that is the core of LTI systems. Through the years, the complexity and requirements increased, floating point was added, and the clock speeds got higher. The drive toward more efficient processing produced single-cycle instructions, pipelined jumps, conditional execution, dual pipelines, and SIMD.

Many companies manufacture DSPs, but probably the best known and most popular are Texas Instruments, Analog Devices, and Motorola. Each produces both basic and highly complex DSPs. Following is a small sampling of DSPs offered by some of these manufacturers.

DSP56K family
This family has many components, all of them originating with Motorola’s DSP56001/2. This core, with its integer arithmetic unit, became the CPU for a number of derivative processors, including the DSP56004, DSP56007, DSP56009, and DSP56011, which were dedicated to audio, and the DSP56005 and DSP56006, which were mainly motion control chips.

The architecture of this chip made it a nice fit for audio applications. Motorola recognized this fact, and in the DSP56004, DSP56007, and DSP56011, two serial input lines and three output lines (I2S) interface easily to the standard audio chips. This easy fit made it a natural for a number of audio implementations. Currently, AC-3, Pro Logic, and DTS algorithms are masked onto this chip.

The 24-bit word actually makes it a good fit for many other applications, as well. It is easily proved that the longer the accumulator, the less the quantization error will be. So for many applications from motion control to audio, this DSP has found a home.

Most instructions (except jumps, compares, tests, and so on) execute in one instruction cycle, which consists of two clocks. So for a 20MHz clock, we have approximately 10 MIPS. Of course, few algorithms execute without jumps, compares, or tests.

The device has a Harvard architecture, allowing dual data moves combined with concurrent arithmetic operations such as multiplies and accumulates, which really means that it’s possible to approach high throughputs. A 1,024-point FFT will take 3.39ms using 24-bit arithmetic.

The instruction set includes “no-overhead” looping, which allows an FIR filter to be coded with only two instructions and executed in 2(n + 1) clock cycles.
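
The timing figures above are easy to turn into a budget for a given filter. The sketch below assumes that n in the 2(n + 1) figure is the number of filter taps, and the 64-tap filter is a hypothetical example; the function names are illustrative.

```python
def instruction_rate_mips(clock_hz, clocks_per_instruction=2):
    """Peak instruction rate in MIPS when every instruction takes one cycle."""
    return clock_hz / clocks_per_instruction / 1e6


def fir_clocks(taps):
    """Clock cycles for one FIR output, using the 2(n + 1) figure."""
    return 2 * (taps + 1)


def fir_time_us(taps, clock_hz):
    """Time to compute one FIR output sample, in microseconds."""
    return fir_clocks(taps) * 1e6 / clock_hz


print(instruction_rate_mips(20e6))         # 10.0
print(fir_clocks(64))                      # 130
print(f"{fir_time_us(64, 20e6):.2f} us")   # 6.50 us
```

Under these assumptions a 64-tap filter on a 20MHz part consumes 6.5µs per output sample, which bounds the sample rate the filter alone can sustain.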

In addition, the addressing modes available include bit reversal for the FFT butterflies and flags in the status register for block floating point. This processor has a good deal of arithmetic power for computing FFTs.

Motorola has since introduced a newer family of DSPs: the DSP56300 series. Many of the problems in the earlier DSP56K family were fixed in this series. This chip has genuine single-cycle operations, as well as a barrel shifter, but there is still no pipelined jump.

The SHARC
The ADSP-21065 (also known as the SHARC) was introduced several years ago by Analog Devices and immediately became very popular for a number of applications including instrumentation, motion control, and audio processing. Its architecture is attractive in that it boasts four data buses and single-cycle operation that includes two data fetches, one program fetch, and an I/O access. It provides for both integer and floating-point arithmetic, with 32 bits for the integer and 32 and 40 bits for the floating-point. It also has a pipelined branch, a large amount of configurable on-board (dual-port) RAM, serial support for popular A/Ds and D/As, a number of DMA channels, and multiprocessor support. When this chip was released, it was expensive. Since then, a number of versions of the DSP have been introduced that are quite affordable.

This processor seems to have been designed to facilitate transform type processing used in scientific and multimedia applications. Some instructions even incorporate features used in the butterfly additions and subtractions that are part of these transforms.

Recently, Analog Devices announced that it would be coming out with another addition to the SHARC family, the ADSP-21160. This new part is substantially the same as the ADSP-21065, with the addition of a parallel and identical processing unit incorporating a shifter, ALU, and multiplier. Unlike a fully parallel processing device, the ADSP-21160 is not a dual pipeline part. The second unit is only used in SIMD mode, which is enabled by setting a bit in a control register.

In SIMD mode, most of the instructions will act on both processing units instead of just one. With a 100MHz clock, this chip is capable of performing an FFT in 90µs. Each tap of an FIR filter will take 5ns and an IIR Biquad will take 20ns.

The TMS320C6000 series
Texas Instruments probably has the longest history in the DSP business. Since the early 1980s, they have produced a series of parts that covers almost any application.

If your application requires muscle and speed, the TMS320C6xx from TI is probably the best COTS general-purpose device available. Many impressive aspects of the construction of this processor contribute to its efficiency and speed. Not only does it have a fully pipelined branch, but each instruction is also conditional. Combined with a single-cycle operation and a fast clock, this can make for some very fast processing. But there is one more addition: the device possesses a dual pipeline. All of these factors contribute to a leap in efficiency that has Texas Instruments touting anywhere from 1,200 MIPS to 2,000 MIPS, depending on the clock speeds involved. Of course, results vary depending upon the application and the care with which the code is written.

This processor uses a technique known as VLIW (very long instruction word), which can allow up to eight instructions to be executed in parallel, each proceeding through the pipeline in parallel. The key to efficient programming in a multiple-pipeline environment is the scheduling of instructions so that no pipe is stalled. This makes hand-coding such a device an extremely arduous task.

To help, TI provides a C compiler, an assembler, and a new form called linear assembly. Linear assembly is similar to standard assembly, except that it allows the compiler to optimize the code for you. This can help achieve efficient code in a much shorter time. It doesn’t remove the responsibility from the software engineer for creating strong and efficient algorithms; it does aid in the tedious and problematic process of instruction scheduling.

The core consists of thirty-two 32-bit general purpose registers, two multipliers, and six ALUs. Currently, the floating-point core runs at 167MHz and the integer core at 200MHz. The device supports 8-, 16-, and 32-bit data types, and has 40-bit arithmetic capability. Integer and floating-point versions of the product are available. The floating-point core has support for 32- (single precision) or 64-bit (double precision) results that are fully compliant with IEEE floating-point operations.
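
The MIPS figures quoted for this family follow from simple arithmetic: up to eight instructions per cycle multiplied by the clock rate. The sketch below is only that arithmetic, with illustrative function names; it gives a peak rate that assumes every issue slot is filled on every cycle, which real code rarely achieves.

```python
def peak_mips(clock_mhz, instructions_per_cycle=8):
    """Upper bound on instruction rate: every issue slot filled every cycle."""
    return clock_mhz * instructions_per_cycle


def clock_for_mips(target_mips, instructions_per_cycle=8):
    """Clock rate in MHz needed to reach a given peak MIPS figure."""
    return target_mips / instructions_per_cycle


print(peak_mips(167))   # 1336
print(peak_mips(200))   # 1600
print(clock_for_mips(1200), clock_for_mips(2000))   # 150.0 250.0
```

The two clock speeds mentioned above land inside the 1,200 to 2,000 MIPS range, and the ends of that range correspond to 150MHz and 250MHz clocks under the same eight-per-cycle assumption.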

SIMD vs. multiple pipelines
Two basic methods exist for achieving parallel operation in a CPU: SIMD and multiple pipelines. SIMD increases the amount of data processed by a single instruction. Multiple pipelines make it possible to execute more instructions in the same cycle.

Multi-pipeline architecture offers several advantages because the instructions need not be the same. The catch is that, to get the full benefit of this construction, the instructions must pair; that is, they must be compatible with one another. They cannot access the same memory locations simultaneously, one unit can’t operate on the result of the other’s operation until it is complete, and so on. This requirement makes low-level programming more difficult.

The ADSP-21160 has no second pipeline and no problems with instruction pairing because, by definition, only one instruction exists for both units; it’s the data that increases. Software written for a SIMD machine such as this one can double the throughput of certain operations.
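
A rough model of where the doubling comes from: in SIMD mode one instruction drives two multiply/accumulate units, so the number of instructions issued per filter output is roughly halved. The sketch below only counts instructions under that simplified assumption; it is not ADSP-21160 code, and the function names are hypothetical.

```python
def fir_output_scalar(window, coeffs):
    """One output sample; one multiply/accumulate instruction per tap."""
    acc = 0
    issued = 0
    for x, c in zip(window, coeffs):
        acc += x * c
        issued += 1
    return acc, issued


def fir_output_simd2(window, coeffs):
    """Same output with two processing units sharing each instruction."""
    acc = [0, 0]  # one accumulator per processing unit
    issued = 0
    for i in range(0, len(coeffs), 2):
        # A single instruction: both units multiply/accumulate at once,
        # each on its own pair of operands.
        for lane in range(2):
            if i + lane < len(coeffs):
                acc[lane] += window[i + lane] * coeffs[i + lane]
        issued += 1
    issued += 1  # final add that combines the two accumulators
    return acc[0] + acc[1], issued


window = [1, 2, 3, 4, 5, 6, 7, 8]
coeffs = [1] * 8
print(fir_output_scalar(window, coeffs))   # (36, 8)
print(fir_output_simd2(window, coeffs))    # (36, 5)
```

The result is identical, and for long filters the one extra combining instruction becomes negligible, so the instruction count approaches half.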

Long FIR filters can now be written without the fear of eating up the duty cycle in the processor. In many cases, this can obviate the need for downsampling and subband coding (with all the complexity) in routines that previously required it. But it isn’t only FIR filters that benefit. Transforms typically comprise iteration upon iteration of simple operations like the butterfly, or half-band FIR or IIR filters. The time required to execute these operations can be decreased substantially.

Specialized DSPs
Many manufacturers have DSPs aimed at certain markets. Following Zoran’s lead with its AC-3 chip, Crystal Semiconductor, a maker of high-quality A/Ds for the audio world, and Motorola have introduced a series of DSPs pre-masked with the algorithms for DTS, AC-3, and Pro Logic.

Analog Devices has a series of low-cost 16-bit DSPs with A/Ds built in and configurations approaching that of an MCU. These parts can be very handy for creating a self-sufficient system inexpensively. The bus is limited to 16 bits, or about 96 dB of dynamic range, but this is quite enough for many applications in the voice band.
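
The 96 dB figure comes from the ratio between the largest and smallest values a 16-bit word can express, roughly 6 dB per bit. A short sketch of that calculation (the function name is an illustrative assumption):

```python
import math


def dynamic_range_db(bits):
    """Ratio of full scale to one least-significant bit, in decibels."""
    return 20 * math.log10(2 ** bits)


print(round(dynamic_range_db(16), 1))   # 96.3
print(round(dynamic_range_db(24), 1))   # 144.5
```

The second line shows the same calculation for a 24-bit word such as the one used by the DSP56K family.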

For the higher end, Sharp has the Butterfly DSP, a fast chip created for transform-based processing. This chip is designed specifically for applications requiring high FFT bandwidth and typically finds its way into radar, scientific, and medical systems.

Software for developing DSP applications
Besides the hardware available for performing DSP functions, a good deal of software is also available. Here are three of the most popular examples.

All of these packages will perform the math necessary to develop the coefficients for filters and simulate algorithms; some will actually produce code for target DSPs.

Mathcad is the least expensive software package, but offers a wide range of functions. With some understanding of signal processing theory and mathematics, you’ll be able to produce coefficients and strategies through simulation for your application.

Matlab is a rich software package with many extensions for different areas of mathematics. It also has a compiler that allows an engineer to write a software package that will execute as an independent application on a user machine. Matlab has a signal processing package that can help model any of the popular filter forms, as well as derive the coefficients. In addition, it has an excellent simulation facility and can be used to produce code directly for certain DSPs.

Elanix offers a software package that will model systems and produce the numbers you need for the design of DSP applications. Besides producing code for some popular DSPs, it will also do the same for Xilinx FPGAs.

If DSPs are not already invading your engineering, they will be soon. Good luck and have fun.

You will find a table listing DSPs at www.embedded.com/1999/9904/9904srtable.htm.

Don Morgan is senior engineer at Ultra Stereo Labs and a consultant with 25 years of experience in signal processing, embedded systems, hardware, and software. Morgan’s most recent book is Numerical Methods for DSP Systems in C.

TABLE 1 DSP checklist
ALU and Bus Width
Because the DSP is designed for arithmetic processing, the bus width must be adequate to accommodate the result of any double precision multiply operation and subsequent additions. Multiplies cannot result in overflows but additions can. A multiply/accumulate operation can continue for some time depending on the length of a given filter. If you're after high precision and accuracy, you won't want to quantize your result until the very end, so the bus must be adequate.
Saturation
When the result of an operation becomes greater than can be expressed within the precision of the device, you'll want it to saturate rather than roll over. Common implementations saturate to the most positive or the most negative value available on your machine. The MMX instruction set on Pentium II-compatible chips also provides for saturation to zero; this is a feature that would be nice in standard DSPs.
Division
Division is difficult; it can be a complex and time-consuming operation. It's usually implemented on a DSP as a nonrestoring division, using a primitive that must be executed iteratively until you reach your target precision. If your application requires some sort of division, check to see that the DSP has a primitive for doing so. Not all DSPs include a division primitive.
Barrel Shifter
A barrel shifter performs a multi-bit shift in one cycle. This is important for floating-point normalization and many other operations. Without this feature, a simple normalization required for floating-point operations will require as many instruction cycles as shifts. Beware, not all DSPs include a barrel shifter.
Logical Operations
Normally this aspect is not a problem, but it's important to know that the part you choose has the set of logical operations you will need, including AND, OR, EXCLUSIVE OR, and NEGATION. Most of these functions are available in some form or another, though not always in the form you wish. The one most frequently missing is the conditional. This facility can turn a simple subtraction into a divide primitive and will generally increase the efficiency of the machine.
Addressing
If you do transform processing, the butterfly or bit-reverse addressing capability is important to you. Dot product and matrix operations can require long circular buffers; multi-rate processing will want these buffers to have different indices.
Data Paths
The number of internal data paths directly influences the number of instructions that may be executed in a single cycle.
Harvard Architecture
Harvard architecture allows for data and instructions to be accessed within the same cycle. As I've pointed out, it is perfectly possible to perform signal processing on a Von Neumann machine but it will take longer.
Single-Cycle Operations
The processor may say that it has single-cycle operation, but how many clocks per cycle does it require? What are the instruction latencies (how many pipeline stages)?
Parallel and Pipelined Operations
These are techniques for improving the efficiency of the DSP. You won't want to suffer pipeline problems for loops or for jumps. Look for pipelined jumps that do not require the pipeline to drain and refill every time it must branch.
On-chip Cache
Clearly, this can make a difference in the efficiency of the operations, especially when it can mean less-expensive, off-board memory. Check to see if it caches data as well as program code.
Special Operations
Look for any operations particular to the DSP you're looking at that would make your job easier. If you do a lot of transform processing, look for instructions that make it easier, and so on.
Special Peripherals
Would your project benefit from more on-board I/O, an A/D, or a dedicated interface to certain buses? Is it going to be communicating with other DSPs? Would more DMA be beneficial? How about dual-port RAM on the chip?
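
Two of the checklist items, saturation and division, are easy to model. The sketch below contrasts a saturating 16-bit add with ordinary two's-complement rollover, and builds a divide from one conditional subtract per quotient bit. Note that the divide shown is the simpler restoring-style shift-and-subtract method, swapped in for clarity, not the nonrestoring form the table describes; all names are illustrative assumptions.

```python
INT16_MAX = 32767
INT16_MIN = -32768


def sat_add16(a, b):
    """Add two 16-bit signed values, clamping instead of wrapping."""
    return max(INT16_MIN, min(INT16_MAX, a + b))


def wrap_add16(a, b):
    """Same addition with two's-complement rollover, for comparison."""
    return ((a + b + 0x8000) & 0xFFFF) - 0x8000


def divide_unsigned(dividend, divisor, bits=16):
    """Shift-and-subtract division: one conditional subtract per bit.
    Assumes 0 <= dividend < 2**bits and divisor > 0."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    quotient = 0
    remainder = 0
    for i in range(bits - 1, -1, -1):
        # Bring down the next bit of the dividend.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        if remainder >= divisor:   # the conditional step
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


print(sat_add16(30000, 10000))     # 32767
print(wrap_add16(30000, 10000))    # -25536
print(divide_unsigned(1000, 7))    # (142, 6)
```

The rollover result shows why saturation matters in a filter: a large positive sum that wraps becomes a large negative sample, which is far more damaging than a clipped one.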
