The optimizing control-oriented operation has three key components: branch overhead reduction, conditional execution, and addressing techniques.
Reduce Branch Cost
In a deeply pipelined device, branches can be expensive since, during a
branch, many stages of a pipeline can potentially be flushed, thus
causing stalls of five cycles. XScale has branch prediction logic and
branch target buffers to ensure that branch penalties are as small as
possible.
From the application developer's point of view, avoiding unnecessary branches and ensuring that the branches are predictable helps improve performance. The following set of examples demonstrates how to eliminate potential branches, eliminate unnecessary loop overhead, and make control flow efficient.
The branch prediction and branch target buffers can reduce branch-related penalties a great deal. The operating system and platform level software must ensure that the branch target buffer (BTB) is enabled.
The BTB has only 32 entries. The size of the branch target buffer limits the number of correctly predictable branches. Because the total number of branches executed in a program is relatively large compared to the size of the branch target buffer, it is often beneficial to minimize the number of branches in a program. Consider the following C code segment:
int
foo(int a) {
if (a > 10)
return 0;
else
return 1;
}
The code generated for the if...else portion of this code segment using branches is:
cmp r0,#10
ble
L1
mov
r0,#0
b
L2
L1:
mov
r0,#1
L2:
This code takes three cycles to execute the else statement and four cycles for the if statement assuming best-case conditions and no branch misprediction penalties. In the case of the XScale, a branch misprediction incurs a penalty of four cycles.
If the branch is mispredicted 50 percent of the time and if both the if statement and the else statement are equally likely to be taken, on an average the code above takes 5.5 cycles to execute.
Using the XScale to execute instructions conditionally, the code generated for the preceding if...else statement is:
cmp
r0,#10
movgt r0,#0
movle r0,#1
The preceding code segment would not incur any branch misprediction penalties and would take three cycles to execute assuming best-case conditions. Using conditional instructions helps to speed up execution significantly.
Use Conditional Instruction
Many embedded applications, such as parsing of a packet or looking for
a peak in a spectrum, are loop intensive. Branch density - the number
of instructions per branch - can be high, which may mean that the
number of instructions in a loop or number of instructions between two
branches is low. Each loop has management overhead - counting the loop
pointer and comparing for the exit condition - associated with it.
The XScale provides the ability to execute instructions conditionally based on a set of conditional flags. This conditional execution feature, combined with the ability of the instructions to modify the condition codes, makes a wide array of optimizations possible.
Optimizing Condition Checks
Core instructions can selectively modify the state of the condition
codes. When generating code for if...else and loop conditions, it is
often beneficial to make use of this feature to set condition codes,
thereby eliminating the need for a subsequent compare instruction.
Consider the following C statement.
if ((a
+ b) !=0)
c = c + 1;
Code generated for the if condition without using an add instruction to set condition codes is:
add
r2, r0, r1
cmp
r2, #0
addne r3, r3,
#1
However, code can be optimized making use of an add instruction to set condition codes:
adds
r2, r0, r1
addne
r3, r3, #1
Condition checking for coprocessor registers can also be performed. SIMD flags in the wCASF register are updated during execution of Wireless MMX instructions. Then, using one of the three flag extraction operations - TANDC, TORC, or TEXTRC - flags for the XScale core can be updated.
This method allows checking of all or one of the SIMD fields for conditional execution. Called group conditional execution, this method is shown in the following example:
wsubhus
wR1, wR2, wR3
@ Saturating subtraction minimum of
wR1
@ is zero
torch R15
@ Updating core flags with ORed
@ coprocessor flag values
addeq
r2, r2, #1
@ now executes conditional coprocessor
flag
All preceding techniques of effectively using conditional execution can also be applied to the group conditional execution. For cases such as peak detection or finding a match in a vector, you can use group conditional techniques.
The instructions that increment or decrement the loop counter can also be used to modify the condition codes. Modifying the codes eliminates the need for a subsequent compare instruction. A conditional branch instruction can then be used to exit or continue with the next loop iteration.
Consider the following C code segment:
for (i
= 10; i != 0; i--) {
perform
inner_kernel;
}
The optimized code generated for the preceding code segment would look like:
L6:
@equivalent to inner_kernel
subs
r3, r3, #1
bne
.L6
Using the above argument, it is also beneficial to rewrite loops whenever possible to make the loop exit conditions check against the value 0. For example, the code generated for the following code segment needs a compare instruction to check for the loop exit condition.
for (i
= 0; i < 10; i++) {
perform
inner_kernel;
}
If the loop is rewritten as follows, the code generated avoids using a compare instruction to check for the loop exit condition.
for (i
= 9; i >= 0; i--) {
perform
inner_kernel;
}
However, the use of conditional instructions should be considered carefully to ensure it improves performance. To decide when to use conditional instructions over branches, consider this hypothetical code segment:
< style="font-style: italic;">if (cond)Using the following data:
N1Beta = number of cycles
to execute the if_stmt, assuming the use of branch instructions
N2Beta = number of cycles to
execute the else_stmt, assuming the use of branch instructions
P1 = percentage of times the
if_stmt is likely to be executed
P2 = percentage of times likely
to incur a branch misprediction penalty
N1c = number of cycles to
execute the if...else portion using conditional instructions assuming
the if condition to be true
N2c = number of cycles to
execute the if...else portion using conditional instructions assuming
the if condition to be false
Use conditional instructions when:
EQPage227
The following example illustrates a situation in which it is better to use branches instead of conditional instructions.
cmp
r0, #0
bne L1
add r0, r0, #1
add r1, r1, #1
add r2, r2, #1
add r3, r3, #1
add r4, r4, #1
b
L2
L1:
sub r0, r0, #1
sub r1, r1, #1
sub r2, r2, #1
sub r3, r3, #1
sub r4, r4, #1
L2:
The CMP instruction takes one cycle to execute, the if statement takes seven cycles to execute, and the else statement takes six cycles to execute. If the code were changed to eliminate the branch instructions by using conditional instructions, the if...else statement would take 10 cycles to complete.
Assuming an equal probability of both paths being taken and that branch mispredictions occur 50 percent of the time, the cost of using conditional instructions is 11 cycles and the cost of branches is 9.5 cycles.
Optimizing Complex Expressions
Using Conditional Execution
Using conditional instructions helps improve the code generated for
complex expressions such as the C shortcut evaluation feature.
The use of conditional instructions in this fashion helps improve performance by minimizing the number of branches, thereby minimizing the penalties caused by branch mispredictions.
int
foo(int a, int b) {
if (a != 0
&& b != 0)
return 0;
else
return 1;
}
The optimized code for the if condition is:
cmp
r0,#0
cmpne r1,#0
This approach also reduces the utilization of branch prediction resources. With Wireless MMX technology, the flag registers can be set based on data values in the coprocessor registers or SIMD flag registers.
Use Addressing Modes Efficiently
XScale and Wireless MMX provide a variety of addressing modes that make
indexing an array of objects highly efficient. The following code
samples illustrate how various kinds of array operations can be
optimized to make use of these addressing modes:
@ Set the contents of
the word pointed to
@ by r0 to the value contained in r1 and
@ make r0 point to the next word
wstrw wR1,[r0], #4
@ Increment the
contents of r0 to make it
@ point to the next word and set the
@ contents of the word pointed to the
@ value contained in r1
wstrw wR1, [r0, #4]!
@ Set the contents of
the word pointed to
@ by r0 to the value contained in r1 and
@ make r0 point to the previous word
wstrw wR1,[r0], #-4
@ Decrement the
contents of r0 to make it
@ point to the previous word and set the
@ contents of the word pointed to the value
@ contained in r1
wstrw wR1,[r0, #-4]!
Various addressing modes save you from explicitly spending an instruction on updating the pointer.
Miscellaneous Approaches
Apart from the techniques mentioned earlier, you might consider these
tricks geared towards interesting use of the instructions. Consider the
following two cases.
Optimizing the Use of Immediate Values. For programming purposes, constant values may need to be used. Constant values are created to be used as masks or known coefficients in different calculations.
The MOV or MVN instruction should be used when loading an immediate, or constant, value into a register. However, immediate move is restricted to a 12-bit number. One could load the constant from memory. Loading 32-bit or 64-bit constant values requires loading from the memory.
The compiler typically places all the constants in a literal pool close to the instructions. Literal pools are not likely to be in the data cache, which makes loading constants expensive - a main memory access. Also, LDR instruction has the potential to pollute the data cache.
It is possible to generate a whole set of constant values using a combination of MOV, MVN, ORR, BIC, and ADD instructions. Use a combination of the above instructions to set a register to a constant value. An example of this is shown in these code samples.
@Set
the value of r0 to 127
mov r0, #127
@Set the value of r0 to 0xfffffefb.
mvn r0, #260
@Set the value of r0 to 257
mov r0, #1
orr r0, r0, #256
@Set the value of r0 to 0x51f
mov r0, #0x1f
orr r0, r0, #0x500
@Set the value of r0 to 0xf100ffff
mvn r0, #0xff, LSL 16
bic r0, r0, #0xe, LSL 8
@ Set the value of r0 to 0x12341234
mov r0, #0x8d, LSL 2
orr r0, r0, #0x1, LSL 12
add r0, r0, r0, LSL #16
@ shifter delay of 1 cycle
Bit Field Manipulation.
Different encryption algorithms such as Data Encryption Standard (DES),
Triple DES (T-DES), and hashing functions (SHA) perform many
bit-manipulation operations.
The shift and logical operations of the XScale provide a useful way of manipulating bit fields. Bit field operations can be optimized using regular instructions:
@ Set the bit number
specified by
@ r1 in register r0
mov
r2, #1
orr r0, r0, r2, asl r1
@ Clear the bit
number specified by
@ r1 in register r0
mov
r2, #1
bic r0, r0, r2, asl r1
@ Extract the bit
value of the bit
@ number specified by r1 of the
@ value in r0 storing the value in r0
mov
r1, r0, asr r1
and r0, r1, #1
@ Extract the higher
order 8 bits of the
@ value in r0 storing
@ the result in r1
mov r1, r0, lsr #24
This approach helps other applications such as video stream parsing. Wireless MMX supports 64-bit-wide bit-wise manipulation - for instance, shift, and, or - which can be effectively used for different bit-wise algorithms.
Conclusion
The methods described in this series of articles are intended for
assembly language development but can also be applied during
development using intrinsic functions and in-line assembly.
High-level language programming styles based on these techniques have also been presented. These programming styles demonstrate how best to use different instructions and, more specifically, how the sequence of instructions should be scheduled to reduce stalls. However, the list of methods described here is not exhaustive.
Finally, a few points to remember are:
1) Use the correct precision
for the algorithm, and choose instructions accordingly.
2) Interleave instructions
between the pipe to hide result and issue latency.
3) Schedule load and stores
with the correct data-addressing mode.
4) Watch out for load-to-use
penalty and shifter-processing latency.
5) Count down on loops to
reduce loop control overhead.
6) Use conditional
instructions to avoid branch costs.
To read Part 1, go to "Microarchitectural optimization philosophy."
To read Part 2, go to "Optimization for data processing-oriented
operations."
This series of articles was excerpted from "Programming with Intel Wireless MMX Technology," by Nigel Paver, Bradley Aldrich and Moinul Khan. Copyright © 2004 Intel Corporation. All rights reserved.
Nigel Paver is an architect and design manager for Wireless MMX technology at Intel Corporation. Bradley Aldrich is a leading authority at Intel Corporation on image and video processing. Moinul Khan is a multimedia architect at Intel Corporation.