CMP EMBEDDED.COM

Login | Register     Welcome Guest IPS  Call for Abstracts
 

Code techniques for processor pipeline optimization: Part 3
Optimization for Control-Oriented Operations



Embedded.com
A typical application has a set of control-oriented operations, such as compare, branching, and looping. Certain applications, such as a data sort operation, are control heavy. Even in a more data-intensive application, such as video processing (discussed in Part 1 and Part 2), control operations can consume a significant part of the application's execution time.

The optimizing control-oriented operation has three key components: branch overhead reduction, conditional execution, and addressing techniques.

Reduce Branch Cost
In a deeply pipelined device, branches can be expensive since, during a branch, many stages of a pipeline can potentially be flushed, thus causing stalls of five cycles. XScale has branch prediction logic and branch target buffers to ensure that branch penalties are as small as possible.

From the application developer's point of view, avoiding unnecessary branches and ensuring that the branches are predictable helps improve performance. The following set of examples demonstrates how to eliminate potential branches, eliminate unnecessary loop overhead, and make control flow efficient.

The branch prediction and branch target buffers can reduce branch-related penalties a great deal. The operating system and platform level software must ensure that the branch target buffer (BTB) is enabled.

The BTB has only 32 entries. The size of the branch target buffer limits the number of correctly predictable branches. Because the total number of branches executed in a program is relatively large compared to the size of the branch target buffer, it is often beneficial to minimize the number of branches in a program. Consider the following C code segment:

int foo(int a) {
    if (a > 10)
        return 0;
    else
        return 1;
}

The code generated for the if...else portion of this code segment using branches is:

    cmp     r0,#10
    ble       L1
    mov     r0,#0
    b         L2
L1:
    mov     r0,#1
L2:

This code takes three cycles to execute the else statement and four cycles for the if statement assuming best-case conditions and no branch misprediction penalties. In the case of the XScale, a branch misprediction incurs a penalty of four cycles.

If the branch is mispredicted 50 percent of the time and if both the if statement and the else statement are equally likely to be taken, on an average the code above takes 5.5 cycles to execute.

Using the XScale to execute instructions conditionally, the code generated for the preceding if...else statement is:

cmp         r0,#10
movgt      r0,#0
movle      r0,#1

The preceding code segment would not incur any branch misprediction penalties and would take three cycles to execute assuming best-case conditions. Using conditional instructions helps to speed up execution significantly.

Use Conditional Instruction
Many embedded applications, such as parsing of a packet or looking for a peak in a spectrum, are loop intensive. Branch density - the number of instructions per branch - can be high, which may mean that the number of instructions in a loop or number of instructions between two branches is low. Each loop has management overhead - counting the loop pointer and comparing for the exit condition - associated with it.

The XScale provides the ability to execute instructions conditionally based on a set of conditional flags. This conditional execution feature, combined with the ability of the instructions to modify the condition codes, makes a wide array of optimizations possible.

1 | 2 | 3

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Ready to take that job and shove it?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS




 :