Code techniques for processor pipeline optimization: Part 3
Optimization for Control-Oriented Operations
By Nigel Paver, Bradley Aldrich and Moinul Khan, Intel Corp.
Embedded.com
(07/29/08, 12:00:00 PM EDT)
A typical application has a set of control-oriented operations, such as compare, branching, and looping. Certain applications, such as a data sort operation, are control heavy. Even in a more data-intensive application, such as video processing (discussed in Part 1 and Part 2), control operations can consume a significant part of the application's execution time.

The optimizing control-oriented operation has three key components: branch overhead reduction, conditional execution, and addressing techniques.

Reduce Branch Cost
In a deeply pipelined device, branches can be expensive since, during a branch, many stages of a pipeline can potentially be flushed, thus causing stalls of five cycles. XScale has branch prediction logic and branch target buffers to ensure that branch penalties are as small as possible.

From the application developer's point of view, avoiding unnecessary branches and ensuring that the branches are predictable helps improve performance. The following set of examples demonstrates how to eliminate potential branches, eliminate unnecessary loop overhead, and make control flow efficient.

The branch prediction and branch target buffers can reduce branch-related penalties a great deal. The operating system and platform level software must ensure that the branch target buffer (BTB) is enabled.

The BTB has only 32 entries. The size of the branch target buffer limits the number of correctly predictable branches. Because the total number of branches executed in a program is relatively large compared to the size of the branch target buffer, it is often beneficial to minimize the number of branches in a program. Consider the following C code segment:

int foo(int a) {
    if (a > 10)
        return 0;
    else
        return 1;
}

The code generated for the if...else portion of this code segment using branches is:

    cmp     r0,#10
    ble       L1
    mov     r0,#0
    b         L2
L1:
    mov     r0,#1
L2:

This code takes three cycles to execute the else statement and four cycles for the if statement assuming best-case conditions and no branch misprediction penalties. In the case of the XScale, a branch misprediction incurs a penalty of four cycles.

If the branch is mispredicted 50 percent of the time and if both the if statement and the else statement are equally likely to be taken, on an average the code above takes 5.5 cycles to execute.

Using the XScale to execute instructions conditionally, the code generated for the preceding if...else statement is:

cmp         r0,#10
movgt      r0,#0
movle      r0,#1

The preceding code segment would not incur any branch misprediction penalties and would take three cycles to execute assuming best-case conditions. Using conditional instructions helps to speed up execution significantly.

Use Conditional Instruction
Many embedded applications, such as parsing of a packet or looking for a peak in a spectrum, are loop intensive. Branch density - the number of instructions per branch - can be high, which may mean that the number of instructions in a loop or number of instructions between two branches is low. Each loop has management overhead - counting the loop pointer and comparing for the exit condition - associated with it.

The XScale provides the ability to execute instructions conditionally based on a set of conditional flags. This conditional execution feature, combined with the ability of the instructions to modify the condition codes, makes a wide array of optimizations possible.

Optimizing Condition Checks
Core instructions can selectively modify the state of the condition codes. When generating code for if...else and loop conditions, it is often beneficial to make use of this feature to set condition codes, thereby eliminating the need for a subsequent compare instruction. Consider the following C statement.

if ((a + b) !=0)
    c = c + 1;

Code generated for the if condition without using an add instruction to set condition codes is:

add         r2, r0, r1
cmp        r2, #0
addne     r3, r3, #1

However, code can be optimized making use of an add instruction to set condition codes:

adds         r2, r0, r1
addne       r3, r3, #1

Condition checking for coprocessor registers can also be performed. SIMD flags in the wCASF register are updated during execution of Wireless MMX instructions. Then, using one of the three flag extraction operations - TANDC, TORC, or TEXTRC - flags for the XScale core can be updated.

This method allows checking of all or one of the SIMD fields for conditional execution. Called group conditional execution, this method is shown in the following example:

wsubhus     wR1, wR2, wR3
@ Saturating subtraction minimum of wR1
@ is zero
torch R15
@ Updating core flags with ORed
@ coprocessor flag values
addeq         r2, r2, #1
@ now executes conditional coprocessor flag

All preceding techniques of effectively using conditional execution can also be applied to the group conditional execution. For cases such as peak detection or finding a match in a vector, you can use group conditional techniques.

The instructions that increment or decrement the loop counter can also be used to modify the condition codes. Modifying the codes eliminates the need for a subsequent compare instruction. A conditional branch instruction can then be used to exit or continue with the next loop iteration.

Consider the following C code segment:

for (i = 10; i != 0; i--) {
    perform inner_kernel;
}

The optimized code generated for the preceding code segment would look like:

L6:
@equivalent to inner_kernel
    subs     r3, r3, #1
    bne     .L6

Using the above argument, it is also beneficial to rewrite loops whenever possible to make the loop exit conditions check against the value 0. For example, the code generated for the following code segment needs a compare instruction to check for the loop exit condition.

for (i = 0; i < 10; i++) {
    perform inner_kernel;
}

If the loop is rewritten as follows, the code generated avoids using a compare instruction to check for the loop exit condition.

for (i = 9; i >= 0; i--) {
    perform inner_kernel;
}

However, the use of conditional instructions should be considered carefully to ensure it improves performance. To decide when to use conditional instructions over branches, consider this hypothetical code segment:

< style="font-style: italic;">if (cond)
    if_stmt
else
    else_stmt

Using the following data:

N1Beta = number of cycles to execute the if_stmt, assuming the use of branch instructions
N2Beta = number of cycles to execute the else_stmt, assuming the use of branch instructions
P1 = percentage of times the if_stmt is likely to be executed
P2 = percentage of times likely to incur a branch misprediction penalty
N1c = number of cycles to execute the if...else portion using conditional instructions assuming the if condition to be true
N2c = number of cycles to execute the if...else portion using conditional instructions assuming the if condition to be false

Use conditional instructions when:

EQPage227

The following example illustrates a situation in which it is better to use branches instead of conditional instructions.

    cmp     r0, #0
    bne     L1
    add     r0, r0, #1
    add     r1, r1, #1
    add     r2, r2, #1
    add     r3, r3, #1
    add     r4, r4, #1
b            L2
L1:
    sub     r0, r0, #1
    sub     r1, r1, #1
    sub     r2, r2, #1
    sub     r3, r3, #1
    sub     r4, r4, #1
L2:

The CMP instruction takes one cycle to execute, the if statement takes seven cycles to execute, and the else statement takes six cycles to execute. If the code were changed to eliminate the branch instructions by using conditional instructions, the if...else statement would take 10 cycles to complete.

Assuming an equal probability of both paths being taken and that branch mispredictions occur 50 percent of the time, the cost of using conditional instructions is 11 cycles and the cost of branches is 9.5 cycles.

Optimizing Complex Expressions Using Conditional Execution
Using conditional instructions helps improve the code generated for complex expressions such as the C shortcut evaluation feature.

The use of conditional instructions in this fashion helps improve performance by minimizing the number of branches, thereby minimizing the penalties caused by branch mispredictions.

int foo(int a, int b) {
    if (a != 0 && b != 0)
        return 0;
else
        return 1;
}

The optimized code for the if condition is:

    cmp         r0,#0
    cmpne     r1,#0

This approach also reduces the utilization of branch prediction resources. With Wireless MMX technology, the flag registers can be set based on data values in the coprocessor registers or SIMD flag registers.

Use Addressing Modes Efficiently
XScale and Wireless MMX provide a variety of addressing modes that make indexing an array of objects highly efficient. The following code samples illustrate how various kinds of array operations can be optimized to make use of these addressing modes:

@ Set the contents of the word pointed to
@ by r0 to the value contained in r1 and
@ make r0 point to the next word

        wstrw wR1,[r0], #4

@ Increment the contents of r0 to make it
@ point to the next word and set the
@ contents of the word pointed to the
@ value contained in r1

        wstrw wR1, [r0, #4]!

@ Set the contents of the word pointed to
@ by r0 to the value contained in r1 and
@ make r0 point to the previous word

        wstrw wR1,[r0], #-4

@ Decrement the contents of r0 to make it
@ point to the previous word and set the
@ contents of the word pointed to the value
@ contained in r1

        wstrw wR1,[r0, #-4]!

Various addressing modes save you from explicitly spending an instruction on updating the pointer.

Miscellaneous Approaches
Apart from the techniques mentioned earlier, you might consider these tricks geared towards interesting use of the instructions. Consider the following two cases.

Optimizing the Use of Immediate Values. For programming purposes, constant values may need to be used. Constant values are created to be used as masks or known coefficients in different calculations.

The MOV or MVN instruction should be used when loading an immediate, or constant, value into a register. However, immediate move is restricted to a 12-bit number. One could load the constant from memory. Loading 32-bit or 64-bit constant values requires loading from the memory.

The compiler typically places all the constants in a literal pool close to the instructions. Literal pools are not likely to be in the data cache, which makes loading constants expensive - a main memory access. Also, LDR instruction has the potential to pollute the data cache.

It is possible to generate a whole set of constant values using a combination of MOV, MVN, ORR, BIC, and ADD instructions. Use a combination of the above instructions to set a register to a constant value. An example of this is shown in these code samples.

@Set the value of r0 to 127
        mov r0, #127
@Set the value of r0 to 0xfffffefb.
        mvn r0, #260
@Set the value of r0 to 257
        mov r0, #1
        orr r0, r0, #256
@Set the value of r0 to 0x51f
        mov r0, #0x1f
        orr r0, r0, #0x500
@Set the value of r0 to 0xf100ffff
        mvn r0, #0xff, LSL 16
        bic r0, r0, #0xe, LSL 8
@ Set the value of r0 to 0x12341234
        mov r0, #0x8d, LSL 2
        orr r0, r0, #0x1, LSL 12
        add r0, r0, r0, LSL #16
@ shifter delay of 1 cycle

<>It is possible to load any 32-bit value into a register using a sequence of four instructions. With  Wireless MMX technology, two such 32-bit values can be generated in core registers, and then transferred to coprocessor registers using TCMR, TMCRR, and TBCST instructions.

Bit Field Manipulation.
Different encryption algorithms such as Data Encryption Standard (DES), Triple DES (T-DES), and hashing functions (SHA) perform many bit-manipulation operations.

The shift and logical operations of the XScale provide a useful way of manipulating bit fields. Bit field operations can be optimized using regular instructions:

@ Set the bit number specified by
@ r1 in register r0

mov     r2, #1
orr      r0, r0, r2, asl r1

@ Clear the bit number specified by
@ r1 in register r0

mov     r2, #1
bic       r0, r0, r2, asl r1

@ Extract the bit value of the bit
@ number specified by r1 of the
@ value in r0 storing the value in r0

mov     r1, r0, asr r1
and      r0, r1, #1

@ Extract the higher order 8 bits of the
@ value in r0 storing
@ the result in r1

mov     r1, r0, lsr #24

This approach helps other applications such as video stream parsing. Wireless MMX supports 64-bit-wide bit-wise manipulation - for instance, shift, and, or - which can be effectively used for different bit-wise algorithms.

Conclusion
The methods described in this series of articles are intended for assembly language development but can also be applied during development using intrinsic functions and in-line assembly.

High-level language programming styles based on these techniques have also been presented. These programming styles demonstrate how best to use different instructions and, more specifically, how the sequence of instructions should be scheduled to reduce stalls. However, the list of methods described here is not exhaustive.

Finally, a few points to remember are:

1) Use the correct precision for the algorithm, and choose instructions accordingly.
2) Interleave instructions between the pipe to hide result and issue latency.
3) Schedule load and stores with the correct data-addressing mode.
4) Watch out for load-to-use penalty and shifter-processing latency.
5) Count down on loops to reduce loop control overhead.
6) Use conditional instructions to avoid branch costs.

To read Part 1, go to "Microarchitectural optimization philosophy."
To read Part 2, go to "Optimization for data processing-oriented operations."

This series of articles was excerpted from "Programming with Intel Wireless MMX Technology," by Nigel Paver, Bradley Aldrich and Moinul Khan. Copyright © 2004 Intel Corporation. All rights reserved.

Nigel Paver is an architect and design manager for Wireless MMX technology at Intel Corporation. Bradley Aldrich is a leading authority at Intel Corporation on image and video processing. Moinul Khan is a multimedia architect at Intel Corporation.