By Nigel Paver, Bradley Aldrich and Moinul Khan, Intel Corp.
Data-processing operations are at the heart of any multimedia
application. Expanding on the discussion in Part 1, this part looks at
how pipeline delay characteristics affect coding style for a variety of
operations: Fast Multiply Operations, Fast Multiply and Accumulation,
Double-Word Loading and Storing, Scheduling Load and Store Multiple,
and Align and Shift.
Fast Multiply Operations
In the architecture under discussion, there are two sets of
multiplication units, one in the Intel XScale microarchitecture and the
other in the Intel Wireless MMX instructions. These two sets of
multipliers support different levels of precision of data-processing
capability.
The XScale microarchitecture supports half-word and word
multiplication with results of word and double-word width. Selecting
the correct precision for the algorithm under implementation helps
reduce the execution time; for example, SMULxy has a latency of one
cycle whereas SMULL has a latency of two cycles.
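As a rough illustration of matching multiply precision to the data, the C sketch below contrasts a 16x16 multiply that yields a word result with a 32x32 multiply that yields a double-word result. The function names mul16 and mul32_wide are hypothetical, and whether a given compiler maps them to SMULxy and SMULL depends on the toolchain and its options, so treat that mapping as an assumption rather than a guarantee.

```c
#include <stdint.h>

/* 16x16 -> 32-bit product. Both operands fit in half-words, so a
 * half-word multiply (the SMULxy class) is sufficient.
 * Hypothetical helper name, for illustration only. */
static int32_t mul16(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* 32x32 -> 64-bit product. The full double-word result requires a
 * long multiply (the SMULL class), which has the higher latency.
 * Hypothetical helper name, for illustration only. */
static int64_t mul32_wide(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;
}
```

If the algorithm's data never exceeds 16 bits, declaring it as int16_t gives the compiler the chance to pick the cheaper multiply; widening everything to 32 or 64 bits by habit removes that option.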
Multiply instructions can cause pipeline stalls due to resource
conflicts or result latencies. The following code segment incurs a
stall of zero to three cycles depending on the values in registers r1,
r2, r4, and r5 due to resource conflicts:
mul r0, r1, r2
mul r3, r4, r5 @0-3 stalls
The second multiply operation stalls for the full three cycles if
neither r1 nor r2 holds a trivial value and the S bit is set. Just as
issue latency depends on the values of the operands, the result latency
can vary between one and three cycles. In the following example, the
mov instruction incurs the result penalty:
mul r0, r1, r2
mov r4, r0 @stalls until the previous mul completes
However, if an arithmetic operation follows the multiplication
operation, it does not stall as long as no register dependency exists.
Multiply instructions should be separated from each other by the
worst-case latency, especially if you have no a priori knowledge of the
data values.
ARM instructions can set the condition flags so that subsequent
instructions execute conditionally based on those flags. A multiply
instruction that sets the condition codes blocks the multiply and
arithmetic pipeline, which stalls any subsequent instructions. In the
following example, the add instruction waits three to four cycles for
the muls instruction to finish:
muls r0, r1, r2 @mult that updates flags
add r3, r3, #1 @stalls until the muls finishes
sub r4, r4, #1
sub r5, r5, #1
Thus, it is not efficient to use the multiplication operation to
update the flags. The modified code is as follows:
mul r0, r1, r2
add r3, r3, #1
sub r4, r4, #1
sub r5, r5, #1
cmp r0, #0
The issue latency of the WMUL and WMADD instructions is one cycle;
the result and resource latencies are two cycles. The second WMUL
instruction in the following example stalls for one cycle due to the
two-cycle resource latency.
WMULUM wR0, wR1, wR2
WMULSL wR3, wR4, wR5 @one cycle stall
Hence, two WMUL instructions should be separated by one instruction.
The WADD instruction in the following example stalls for one cycle due
to the two-cycle result latency.
wmulm wR0, wR1, wR2
waddhus wR1, wR0, wR2 @stall (two-cycle result latency)
Thus, any instruction waiting on the result should be separated by
two other instructions. However, if the latter instruction is another
SIMD-multiplication instruction, then the stall is one cycle despite
data dependency.
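The interaction between resource latency and result latency can be explored with a small timing model. The C sketch below is a deliberately simplified, hypothetical model, not a cycle-accurate simulator: it assumes single issue, the two-cycle resource and result latencies quoted above for the SIMD multiplier, and no other hazards. The type Insn and the function count_stalls are names invented for this illustration.

```c
#include <stddef.h>

/* Latencies taken from the text; everything else is an assumption. */
enum { RESOURCE_LATENCY = 2, RESULT_LATENCY = 2 };

/* Hypothetical instruction descriptor for the model. */
typedef struct {
    int uses_multiplier; /* 1 if the instruction occupies the SIMD multiplier */
    int reads_prev_mul;  /* 1 if it consumes the latest multiply result */
} Insn;

/* Count stall cycles for a straight-line sequence of n instructions. */
static int count_stalls(const Insn *seq, size_t n)
{
    int cycle = 0;      /* cycle in which the next instruction would issue */
    int busy_until = 0; /* first cycle in which the multiplier is free again */
    int result_at = 0;  /* first cycle in which the multiply result is usable */
    int stalls = 0;

    for (size_t i = 0; i < n; i++) {
        int ready = cycle;
        if (seq[i].uses_multiplier && busy_until > ready)
            ready = busy_until;          /* resource conflict */
        if (seq[i].reads_prev_mul && result_at > ready)
            ready = result_at;           /* result not yet available */
        stalls += ready - cycle;
        if (seq[i].uses_multiplier) {
            busy_until = ready + RESOURCE_LATENCY;
            result_at = ready + RESULT_LATENCY;
        }
        cycle = ready + 1;               /* issue latency of one cycle */
    }
    return stalls;
}
```

Under this model, two back-to-back multiplies cost one stall cycle, and placing one independent instruction between them removes it, which matches the scheduling advice for WMUL pairs.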
Fast Multiply and Accumulation
For DSP and multimedia applications, multiply and accumulate (MAC) is
the most commonly used operation. In addition to multipliers, Intel
Wireless MMX technology offers accumulation capabilities. In the SIMD
coprocessor, any of the registers can be used as an accumulator.
Performing MAC Operations on Registers in Intel XScale Core
A MAC operation can be done using the TMIA (32-bit) and TMIAPH
(16-bit) instructions. Both take two Intel XScale core registers as
operands and accumulate the product into any of the coprocessor
registers.
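For readers who think in C, the following reference model sketches the arithmetic these two instructions are understood to perform. It assumes that TMIA accumulates a signed 32x32 product into a 64-bit register and that TMIAPH accumulates the sum of the two signed half-word products; check the instruction reference before relying on these details. The function names mac32 and mac16x2 are hypothetical.

```c
#include <stdint.h>

/* Assumed TMIA behaviour: acc += a * b, signed 32x32 into 64 bits. */
static int64_t mac32(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * (int64_t)b;
}

/* Assumed TMIAPH behaviour: acc += a.hi * b.hi + a.lo * b.lo, where
 * each 32-bit operand is treated as two packed signed half-words.
 * The casts to int16_t rely on two's complement wraparound, which
 * holds for the usual ARM toolchains. */
static int64_t mac16x2(int64_t acc, uint32_t a, uint32_t b)
{
    int16_t a_lo = (int16_t)(a & 0xFFFFu);
    int16_t a_hi = (int16_t)(a >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFFu);
    int16_t b_hi = (int16_t)(b >> 16);

    return acc + (int32_t)a_lo * b_lo + (int32_t)a_hi * b_hi;
}
```

A model like this is mainly useful as a bit-exact check when validating a hand-scheduled assembly kernel against a portable C version.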
The issue latency of the TMIA instruction is one cycle; the result
and resource latency are two cycles. The second TMIA instruction in the
following example stalls for one cycle due to the two-cycle resource
latency.
tmia wR0, r2, r3
tmia wR1, r4, r5 @stall 1 cycle
The WADD instruction in the following example stalls for one cycle
due to the two-cycle result latency.
tmia wR0, r2, r3
waddhus wR1, wR0, wR2