By Nigel Paver, Bradley Aldrich and Moinul Khan, Intel Corp.
Scheduling in the Addition and Logical Pipeline
Most data-processing instructions for Intel XScale microarchitecture
technology and Intel Wireless MMX technology, including logical and
addition instructions, have a result latency of one cycle. Therefore,
the current instruction can use the result from the previous
data-processing instruction without any penalty. For example, a series of
additions can be performed without any stalls, such as:
waddh wR4, wR2, wR1
waddh wR5, wR4, wR1
waddh wR6, wR2, wR1
The preceding code segment does not incur any stall. The only
exception to the above is the saturation arithmetic operation. During
saturation, the result is generated one cycle later. Thus, subsequent
instructions using the result stall by a cycle, as in this instance:
waddhss wR4, wR2, wR1
waddhss wR5, wR4, wR1 @ single-cycle stall
waddhss wR6, wR2, wR1
In this example, the second saturating SIMD instruction stalls for
one cycle due to the read-after-write dependency on register wR4;
however, the third saturating SIMD instruction does not stall because
it does not depend on the result of the second. This code segment can
easily be rescheduled so that there is no stall: swapping the
locations of the second and the third WADDHSS instructions is
sufficient, because the independent instruction then fills the cycle
in which wR4 is not yet available. The pipeline for
the XScale microarchitecture also has no stalls on its logical and
simple arithmetic operations. For many applications, this feature
offers high performance.
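As a sketch of the reordering described above, the saturating sequence with its second and third instructions swapped would look as follows. The comments are illustrative annotations based on the one-cycle-later result rule stated earlier, not part of the original listing:
waddhss wR4, wR2, wR1
waddhss wR6, wR2, wR1 @ independent of wR4, issues without a stall
waddhss wR5, wR4, wR1 @ wR4 is ready by now, so no stall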
Shifting an operand by an immediate value during an arithmetic
operation is a feature of core processor instructions. This feature can
save an extra instruction for explicit shifting.
You need to be mindful of the subtle constraints posed by this
feature; if the current instruction uses the result of the previous
data processing instruction for a shift by immediate, the result
latency is two cycles. As a result, the following code segment incurs a
one-cycle stall for the MOV instruction:
sub r6, r7, r8
add r1, r2, r3
mov r4, r1, LSL #2
The following code removes the one-cycle stall:
add r1, r2, r3
sub r6, r7, r8
mov r4, r1, LSL #2
Similarly, you can use a register to specify the shift or rotate
amount for an operand. This instruction option can be very effective if
the shift amount is not known beforehand; however, a longer latency is
involved.
All data-processing instructions incur a two-cycle issue penalty and
a two-cycle result penalty when the shifter operand is shifted or
rotated based on a register. For instance, in the following code
sequence, the sub incurs a two-cycle stall since the add instruction
uses a register as a shift operand.
mov r3, #10
mul r4, r2, r3
add r5, r6, r2, LSL r3
sub r7, r8, r4 @ stalls for two cycles