By Nigel Paver, Bradley Aldrich and Moinul Khan, Intel Corp.
A typical application has a set of control-oriented operations, such as
compare, branching, and looping. Certain applications, such as a data
sort operation, are control heavy. Even in a more data-intensive
application, such as video processing (
discussed
in Part 1 and Part 2),
control
operations can consume a significant part of the application's
execution time.
The optimizing control-oriented operation has three key components:
branch overhead reduction, conditional execution, and addressing
techniques.
Reduce Branch Cost
In a deeply pipelined device, branches can be expensive since, during a
branch, many stages of a pipeline can potentially be flushed, thus
causing stalls of five cycles. XScale has branch prediction logic and
branch target buffers to ensure that branch penalties are as small as
possible.
From the application developer's point of view, avoiding unnecessary
branches and ensuring that the branches are predictable helps improve
performance. The following set of examples demonstrates how to
eliminate potential branches, eliminate unnecessary loop overhead, and
make control flow efficient.
The branch prediction and branch target buffers can reduce
branch-related penalties a great deal. The operating system and
platform level software must ensure that the branch target buffer (BTB)
is enabled.
The BTB has only 32 entries. The size of the branch target buffer
limits the number of correctly predictable branches. Because the total
number of branches executed in a program is relatively large compared
to the size of the branch target buffer, it is often beneficial to
minimize the number of branches in a program. Consider the following C
code segment:
int
foo(int a) {
if (a > 10)
return 0;
else
return 1;
}
The code generated for the if...else
portion of this code segment using branches is:
cmp r0,#10
ble
L1
mov
r0,#0
b
L2
L1:
mov
r0,#1
L2:
This code takes three cycles to execute the else statement and four
cycles for the if statement assuming best-case conditions and no branch
misprediction penalties. In the case of the XScale, a branch
misprediction incurs a penalty of four cycles.
If the branch is mispredicted 50 percent of the time and if both the
if statement and the else statement are equally likely to be taken, on
an average the code above takes 5.5 cycles to execute.
Using the XScale to execute instructions conditionally, the code
generated for the preceding if...else statement is:
cmp
r0,#10
movgt r0,#0
movle r0,#1
The preceding code segment would not incur any branch misprediction
penalties and would take three cycles to execute assuming best-case
conditions. Using conditional instructions helps to speed up execution
significantly.
Use Conditional Instruction
Many embedded applications, such as parsing of a packet or looking for
a peak in a spectrum, are loop intensive. Branch density - the number
of instructions per branch - can be high, which may mean that the
number of instructions in a loop or number of instructions between two
branches is low. Each loop has management overhead - counting the loop
pointer and comparing for the exit condition - associated with it.
The XScale provides the ability to execute instructions
conditionally based on a set of conditional flags. This conditional
execution feature, combined with the ability of the instructions to
modify the condition codes, makes a wide array of optimizations
possible.