CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS


Optimization Techniques for High-Performance DSPs

by Rob Oshana

As applications grow, so does the stress on available CPU and memory resources. This article summarizes some of the techniques used in practice to gain speed increases from high-performance DSPs.

Many of today’s digital signal processing applications are subject to real-time constraints. And it seems many applications eventually grow to a point where they stress the available CPU and memory resources. Many of these applications seem like trying to fit 10 pounds of algorithms into a five-pound sack. Understanding the architecture of the digital signal processor (DSP), as well as the compiler, can speed up applications, sometimes by an order of magnitude. In this article I will summarize some of the techniques commonly used to gain orders of magnitude performance increases from high-performance DSPs. The examples I use are based on the TI 62xx family of DSPs, but the techniques can be applied to other DSPs as well.

Make the common case fast
The fundamental rule in computer design, as well as in programming real-time systems, is to “make the common case fast, and favor the frequent case.” It’s Amdahl’s Law that says the performance improvement you gain by using some faster mode of execution is limited by how often you use that faster mode of execution. So don’t spend time trying to optimize a piece of code that will hardly ever run. You won’t get much overall performance improvement, no matter how innovative you are. Instead, if you can eliminate just a few cycles from a loop that executes thousands of times, you will experience a more significant performance improvement.

Architecture styles
Hardware architecture techniques enhance the performance of processors through the use of pipelining . The concept of pipelining isn’t much different from an automobile assembly line. Each car moves through the assembly line and is constructed step by step. There are multiple cars in the assembly line at the same time, each car at a different point in the assembly process. At the end of the assembly line a new car emerges, followed closely by another new car, and so on. Putting a new car together before the previous one is completed is simply cost effective. The assembly line keeps the available workers busy doing more work and spending less time idle. The same is true in pipelined processors. A pipelined processor can start a new task before a previous task is completed. The completion rate depends on how often a new instruction can be introduced. As shown in Figure 1a and 1b, the completion time of an instruction doesn’t change. But the completion rate of instructions improves.

To further improve performance, you can use multiple pipelines. This approach is called superscalar and further exploits the concept of parallelism (as shown in Figure 1c). Some of the high-performance DSPs today, such as the Intel i860, have a superscalar design.

One way to control multiple execution units and other resources on the processor is to issue multiple instructions at the same time. Some of the latest DSPs, such as the TI C6200, are known as very long instruction word, or VLIW, machines. Each instruction in a VLIW machine can control multiple execution units on the processor ( Figure 2). For example, each VLIW instruction in the TI 6200 DSP is eight instructions long, one instruction for each of the eight potentially available execution units (L, S, M, D) shown in Figure 2. Again, the key is parallelism. In practice, however, keeping all of these execution units full all of the time is difficult because of various data dependencies. The possible performance improvement using a VLIW processor is excellent, especially for DSP applications that are designed to take advantage of this parallelism.

A superscalar or VLIW architecture offers more parallelism than a pipelined processor. But unless an algorithm or function is present that can exploit this parallelism, the extra pipe can go unused, thereby reducing the amount of parallelism that can be achieved. An algorithm that has been written to run quickly on a pipelined processor may not run nearly as efficiently on a superscalar processor. As an example, consider the algorithm in Figure 3a. This algorithm is written to take advantage of a pipelined processor. It represents a common way of computing a polynomial on a serial processor because it eliminates the need to compute p **8, p **7, and so on. This saves cycles as well as registers to hold the intermediate values.

For a VLIW device, however, this is not the most optimal way to calculate the expression. The parentheses in this algorithm unfairly constrain the compiler to compute this expression in order, thus eliminating any possibility of parallelism. If we instead decompose this expression into a number of independent expressions, the compiler can schedule these independent expressions on the parallel pipelines of the VLIW device in any convenient order. The resulting calculation uses fewer instruction cycles and a few more registers (see Figure 3b).

This approach is an example of why the programmer must have an understanding of the device architecture, the compiler, and the algorithm to determine the fastest way of executing any particular function.

Accessing memory
Memory can be a severe bottleneck in embedded systems architectures. You can reduce this problem by storing the most often referenced items in fast, on-chip memory and leave the rest in slower, off-chip memory. However, getting the data from external memory to on-chip memory takes a lot of time. If the CPU is busy moving data, it can’t be performing other, more important tasks.

Memories come in all flavors (see Figure 4). The fastest (and most expensive) memory is generally the registers found on-chip. These registers are used by the processor for various arithmetic and management functions. There never seems to be enough of it and management of this valuable resource is paramount to increasing performance. Internal memory, or on-chip memory, is also fast, residing right on the processor. The slowest memory is generally found off-chip and is referred to as external memory. As a real-time programmer, you want to reduce the number of times off-chip memory is accessed because the time to access this memory can be slow and cause huge delays in processing. The CPU pipeline must “stall,” or wait for the CPU to load this memory. Use of on-chip memory is one of the most effective ways of increasing performance. I like to think of on-chip memory as a sort of data cache, with the main difference being that I have to manage it and load what I want into it, instead of these activities being done automatically.

Direct memory access
Direct memory access (DMA) is a way of accessing memory without the intervention of the CPU. It’s accomplished using a peripheral DMA device to write data directly to and from memory, removing the burden from the CPU. The DMA is just another type of CPU whose only function is moving data around quickly. The advantage of using DMA is that the CPU can issue a few instructions to the DMA to move data, and then go back to what it was doing. The DMA moves the data in parallel with the CPU operation, as shown in Figure 5. This approach is just another way of exploiting the parallelism built into the DSP. The DMA is most useful for copying larger blocks of data. Smaller blocks of data don’t have the payoff because the setup and overhead time for the DMA make it worthwhile to just use the CPU. But when used smartly, the DMA can result in huge time savings.

Because of the large penalty associated with accessing external memory, and the cost of getting the CPU involved, the DMA should be used wherever possible. The code for this is not too overwhelming. The DMA requires a transfer control block (TCB) to tell it about the data it is going to access (where it is, where it’s going, how much there is, and so forth). A good portion of this structure can be built ahead of time. Then it is simply a matter of writing to the DMA enable register to start the operation. Starting the DMA operation well ahead of when the data is actually needed is the best approach. This gives the CPU something to do in the meantime and doesn’t force the application to wait for the data to be moved. Then, when the data is actually needed, it’s already there. The application should check to verify that the operation was successful, which requires checking a register. If the operation was done ahead of time, this should be a one-time poll of the register, rather than a spin on the register, which chews up valuable processing time.

An effective use for the DMA is to stage data on- and off-chip. The CPU can access on-chip memory much faster than off-chip memory. Having as much data as possible on-chip is the best way to improve performance. If the data being processed cannot all fit on-chip at the same time (large arrays, for example), then the data can be staged on- and off-chip in blocks using the DMA. All of the data transfers can occur in the background while the CPU is actually crunching the data.

Smart management and layout of on-chip memory can reduce the amount of times data has to be staged on- and off-chip. Developing a smart plan for how to use the on-chip memory is worth the time and effort. In general, the rule is to stage the data in and out of on-chip memory using the DMA and generate the results in on-chip memory. Because of cost and space reasons, most DSPs don’t have a lot of on-chip memory, which requires the programmer to coordinate the algorithms in such a way to efficiently use the memory available.

Instrumenting code to use the DMA does have cost penalties. Code size will go up, depending on how much of the application uses the DMA. I’ve seen code size grow by up to 30% using fully instrumented DMA. Using the DMA also adds increased complexity and synchronization to the application. You should use the DMA only in areas requiring high throughput. However, smart layout and utilization of on-chip memory, as well as judicious use of the DMA, can eliminate most of the penalty associated with accessing off-chip memory. Figure 6 is a template that shows how to use the DMA to stage blocks of data on- and off-chip, where the data can be processed quickly. This technique uses a double buffering mechanism to stage the data, so the CPU can be processing one buffer while the DMA is staging the other buffer. I have seen speed improvements of over 90% using this technique.

Parallelism is the key
The standard rule when programming superscalar and VLIW devices is to keep the pipelines full. A full pipe means efficient code. To determine how full the pipelines are, you need to spend some time inspecting the assembly language code generated by the compiler. You can usually spot inefficient code by the abundance of NOPs in the code, as they indicate inefficient use of the pipelines.

To demonstrate the advantages of parallelism in VLIW-based superscalar machines, let’s start with a simple looping program, shown in Listing 1. If we were to write a serial assembly language implementation of this program, the code would be similar to that in Listing 2. This loop uses one of the two available sides of the superscalar machine. By counting up the instructions and the NOPs, we see that it takes 26 cycles to execute each iteration of the loop. We should be able to do much better.

Note two points in this example. First, many of the execution units aren’t being used and are sitting idle. This is a waste of processor hardware. Second, this piece of assembly includes many delay slots (20 to be exact) where the CPU is stalled, waiting for data to be loaded or stored. When the CPU is stalled, nothing is happening. This is the worst thing you can do to a processor when trying to crunch large amounts of data.

You can keep the CPU busy while it is waiting for data to arrive. We can be doing other operations that aren’t dependent on the data we’re waiting for. We can also use both sides of the superscalar architecture to help us load and store other data values. The code in Listing 3 is an improvement over the serial version. We have reduced the number of NOPs from 20 to five. We are also performing some steps in parallel. Lines 4 and 5 are executing two loads at the same time into each of the two load units (D1 and D2) of the device. This code is also performing the branch operation earlier in the loop and then taking advantage of the delays associated with that operation to complete operations on the current cycle. Notice a new column is present in this code, one that specifies which execution unit you want to use for a particular operation. This flexibility to specify execution units allows you to manage your operations better.

Loop unrolling
Loop unrolling is a technique used to increase the number of instructions executed between executions of the loop branch logic. This reduces the number of times the loop branch logic is executed. Since the loop branch logic is overhead, reducing the number of times this has to execute reduces the overhead and makes the loop body, the important part of the structure, run faster. A loop can be unrolled by replicating the loop body a number of times and then changing the termination logic to comprehend the multiple iterations of the loop body (see Listing 4). The loops in 10a and 10b each take four cycles to execute, but the loop in 10b is doing four times as much work. This ratio is illustrated in Figure 7. The assembly language kernel of this loop is shown in 11a. The mapping of variables from the loop to the processor are shown in 11b. The compiler is able to structure this loop such that all required resources are stored in the register file, and the work is spread across several of the execution units. The work done by cycle for each of these units is shown in Figure 9c.

Now look at the implementation of the loop unrolled four times in Figure 8. Again, only the assembly language for the loop kernel is shown. Notice that more of the register file is being used to store the variables needed in the larger loop kernel. An additional execution unit is also being used, as well as several bytes from the stack in external memory (see Figure 8b). Also, the execution unit utilization shown in Figure 8c indicates that the execution units are being used more efficiently while still maintaining a four-cycle latency to complete the loop. This approach is an example of using all the available resources of the device to gain significant speed improvements. Although the code size looks bigger, it actually runs faster than the loop in Figure 7. (The || indicates all those instructions are executed in a single cycle.)

But if you try to get too greedy, you can hurt yourself. In Figure 9, I tried to unroll the loop eight times. At this point, the compiler couldn’t find enough registers to map all the required variables. When this happens, variables start getting stored on the stack, which is usually in external memory somewhere. This result is expensive because instead of a single-cycle read, it can now take many cycles to read each of the variables each time one is needed. This causes things to break down, as shown in Figure 9. The obvious problems are the number of bytes that are now being stored in external memory (88 vs. our previous eight) and the lack of parallelism in the assembly language kernel. The actual kernel assembly language was long and inefficient. A small part of it is shown in Figure 9. Notice the lack of || instructions and the new NOP instructions. These instructions are, effectively, stalls to the CPU when nothing else can happen. The CPU can only wait for data to arrive from external memory.

The drawback to loop unrolling is that it uses more registers in the register file as well as execution units. Different registers need to be used for each iteration. Once the available registers are used, the processor starts going to the stack to store required data. Going to the off-chip stack is expensive and may wipe out the gains achieved by unrolling the loop in the first place. Loop unrolling should only be used when the operations in a single iteration of the loop don’t use all of the available resources of the processor architecture. Check the assembly language output if you’re not sure of this. Another drawback is the code size increase. As you can see, unrolled loops, albeit faster, require more instructions and therefore, more memory.

Software pipelining
One of the best optimization strategies is to write code that can be pipelined efficiently by the compiler. Software pipelining is an optimization strategy to schedule loops and functional units efficiently. In the case of the C6200, eight functional units can be used at the same time if the compiler can figure out how to schedule each of the units every clock cycle. Sometimes a subtle change in the way the C code is structured can trigger the compiler to pipeline the loop efficiently. (Certain conditions exist in which the compiler has a hard time pipelining the code. For example, the compiler doesn’t handle conditionals well. Branches within loops can’t be pipelined because there is no way to construct an efficient pipe with branch delays.) In software pipelining, multiple iterations of a loop are scheduled to execute in parallel. The loop is reorganized in a way that each iteration in the pipelined code is made from instruction sequences selected from different iterations in the original loop. In the example in Figure 10, a five-stage loop with three iterations is shown. An initial period (cycles n and n +1), called the prolog, is when the pipes are being “primed” or initially loaded with operations. Cycles n +2 to n +4 represent the actual pipelined section of the code. It is in this section that the processor is performing three different operations (C, B, and A) for three different loops (1, 2, and 3). An epilog section is present in which the last remaining instructions are performed before exiting the loop. This is an example of a fully utilized set of pipelines that produces the fastest, most efficient code.

We saw earlier how loop unrolling offers speed improvements over simple loops. Software pipelining can be faster than loop unrolling for certain sections of code because, with loop unrolling, the prolog and epilog are only performed once (as shown in Figure 11).

Listing 5 shows the same sample piece of C code and the corresponding assembly language output. In this case the compiler was asked to attempt to pipeline the code. This is evident by the piped loop prolog and piped loop kernel sections in the assembly language output. In this case, the pipelined code isn’t as good as it could be. You can spot inefficient code by looking for how many NOPs are in the piped loop kernel of the code. In this case the piped loop kernel has a total of five NOP cycles, two in Line 16 and three in Line 20. This loop takes a total of 10 cycles to execute. The NOPs are the first indication that a more efficient loop may be possible. But how short can this loop be? One way to estimate the minimum loop size is to determine what execution unit is being used the most. In this example, the D unit is used more than any other unit, a total of three times (Lines 14, 15, and 21). There are two sides to a superscalar device, enabling each unit to be used twice (D1 and D2) per clock for a minimum two-clock loop: two D operations in one clock and one D unit in the second clock. The compiler was smart enough to use the D units on both sides of the pipe (Lines 14 and 15), enabling it to parallelize the instructions and use only one clock. Performing other instructions while waiting for the loads to complete should be possible, instead of delaying with NOPs.

In the simple for loop, the inputs are apparently not dependent on the output. In other words, there are no dependencies. But the compiler does not know that. Compilers are generally pessimistic creatures. They will not optimize something if the situation is not totally understood. The compiler takes the conservative approach and assumes the inputs can be dependent on the previous output each time through the loop. If it is known that the inputs aren’t dependent on the output, we can hint to the compiler by declaring the input1 and input2 as const, indicating that these fields will not change. This declaration is a trigger for enabling software pipelining and saving throughput. This C code is shown as Listing 6, with the corresponding assembly language.

Notice a few things when looking at this assembly language. First, the piped loop kernel has become smaller. In fact, the loop is now only two cycles long. Lines 44 to 47 are all executed in one cycle (the parallel instructions are indicated by the || symbol) and Lines 48 to 50 are executed in the second cycle of the loop. The compiler, with the additional dependency information supplied by the const declaration, has been able to take advantage of the parallelism in the execution units to schedule the inner part of the loop efficiently. But this efficiency comes at a price: the prolog and epilog portions of the code are much larger now. Tighter piped kernels will require more priming operations to coordinate all of the execution based on the various instruction and branching delays. But once primed, the kernel loop executes extremely quickly, performing operations on various iterations of the loop. The goal of software pipelining is, as I mentioned earlier, to make the common case fast. The kernel is the common case in this example, and we have made it very fast. Pipelined code may not be worth doing for loops with a small loop count. But for loops with a large loop count, executing thousands of times, software pipelining is the only way to go.

In the two cycles that the piped kernel takes to execute, a lot of things are happening. The right-hand column in the assembly listing indicates what iteration is being performed by each instruction. Each @ symbol is an iteration count. So in this kernel, Line 44 is performing a branch for iteration n +2, Lines 45 and 46 are performing loads for iteration n +4, Line 48 is storing a result for iteration n , Line 49 is performing a multiply for iteration n +2, and Line 50 is performing a subtraction for iteration n +3. And this all takes place in two cycles! The epilog is completing the operations once the piped kernel stops executing. The compiler was able to make the loop two cycles long, which is what we predicted by looking at the inefficient version of the code.

The code size for a pipelined function becomes larger, as is obvious by looking at the code produced. This is one of the trade-offs for speed that the programmer must make.

Some simple rules
Real-time programmers have always had to employ some tricks to enable software to run as quickly as possible. As processors continue to become more complicated, this tactic becomes a more difficult endeavor. For superscalar and VLIW processors, managing two separate pipelines and ensuring the highest amount of parallelism requires tools support. Optimizing compilers are helping overcome many of the obstacles of these powerful new processors, but even the compilers have limitations. Real-time programmers should not trust the compiler to perform all of the necessary optimizations for you; you must help them. The main steps to follow are:

• Study the assembly language produced by the compiler. In many instances, subtle changes to the structure of the C code can make a huge difference in how the compiler generates the assembly language. This can make the difference in the real-time performance of the system

• Use the DMA capabilities, especially for data-intensive number crunching applications common in DSP systems. The DMA can take a huge burden off of the CPU and help manage data efficiently

• Keep the pipelines full. The whole reason superscalar and VLIW processors were invented was to take advantage of parallelism. Look for areas of inefficiency in the assembly language and make modifications to allow both pipelines to be used at their full efficiency. This requires an understanding of what the compiler looks for in terms of pipelining opportunities. It also requires an understanding of the application. Many times, just re-arranging the algorithm in a different way can make it run more efficiently on the processor.

Following these steps will help ensure that your software will run as quickly as possible.

Rob Oshana is a software team leader at Raytheon Systems Co., where he is managing a real-time DSP software development effort. He has a BSEE from Worcester Polytechnic Institute and masters degrees in electrical engineering, computer science, and business administration. He can be reached at oshana@cyberramp.net.

References Hennesey, John L. and David A Patterson. Computer Architecture, A Quantitative Approach . Palo Alto, CA: Morgan Kaufmann Publishers, Inc., 1990.

TMS320C62XX Programmers Guide , Austin, TX: Texas Instruments, 1997.

Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :