Fast Multiply Operations
In the architecture under discussion, there are two sets of
multiplication units, one in the Intel XScale microarchitecture and the
other in the Intel Wireless MMX instructions. These two sets of
multipliers support different levels of precision of data-processing
capability.
The XScale microarchitecture supports half-word and word multiplication with results of word and double-word width. Selecting the correct precision for the algorithm under implementation helps reduce the execution time; for example, SMULxy has a latency of one cycle whereas SMULL has a latency of two cycles.
Multiply instructions can cause pipeline stalls due to resource conflicts or result latencies. The following code segment incurs a stall of zero to three cycles depending on the values in registers r1, r2, r4, and r5 due to resource conflicts:
mul
r0,
r1, r2
mul r3, r4, r5 @0-3
stalls
The second multiply operation would stall by three cycles if r1 and r2 did not have any trivial value and the S bit was set. Just as issue latency depends on the values of the operands, the result latency can vary between one and three cycles. In the following example, the mov instruction incurs the result penalty:
mul
r0,
r1, r2
mov r4, r0 @stall
until previous mult
However, if an arithmetic operation follows the multiplication operation, it does not stall as long as no register dependency exists. Multiply instructions should be separated out from each other by the worst-case latency, especially if you have no a priori knowledge of the data value.
ARM instructions can set conditional flags so that following instructions can execute conditionally based on the flags. A multiply instruction that sets the condition codes blocks the multiply and arithmetic pipeline. Blocking stalls any subsequent instructions. For instance, in the following example, the add instruction waits three to four cycles for the muls instruction to finish.
muls
r0, r1, r2 @mult that updates flags
add r3, r3, #1
@stalls until the mul
finish
sub r4, r4, #1
sub r5, r5, #1
Thus, it is not efficient to use the multiplication operation to update the flags. The modified code is as follows:
mul
r0,
r1, r2
add r3, r3, #1
sub r4, r4, #1
sub r5, r5, #1
cmp r0, #0
The issue latency of the WMUL and WMADD instructions is one cycle; the result and resource latency are two cycles. The second WMUL instruction in the following example stalls for one cycle due to the two-cycle issue latency.
WMULUM
wR0, wR1, wR2
WMULSL wR3, wR4, wR5 @one cycle stall
Hence, two WMUL instructions should be separated by one instruction. The WADD instruction in the following example stalls for one cycle due to the two-cycle result latency.
wmulm
wR0, wR1, wR2
waddhus wR1, wR0, wR2 @two cycle
stall
Thus, any instruction waiting on the result should be separated by two other instructions. However, if the latter instruction is another SIMD-multiplication instruction, then the stall is one cycle despite data dependency.
Fast Multiply and Accumulation
For DSP and multimedia applications, multiply and accumulate (MAC) is
the most commonly used operation. In addition to multipliers, Intel
Wireless MMX technology offers accumulation capabilities. In the SIMD
coprocessor, any of the registers can be used as an accumulator.
Performing MAC Operations on Registers in Intel XScale Core A MAC operation can be done using TMIA 32-bit and TMIAPH 16-bit instructions. TMIA and TMIAPH instructions allow the use of two registers in the Intel XScale core as two operands and produce the result of multiplication and accumulation to any of the coprocessor registers.
The issue latency of the TMIA instruction is one cycle; the result and resource latency are two cycles. The second TMIA instruction in the following example stalls for one cycle due to the two-cycle resource latency.
tmia
wR0, r2, r3
tmia wR1, r4, r5
@stall 1 cycle
The WADD instruction in the following example stalls for one cycle due to the two-cycle result latency.
tmia
wR0, r2, r3
waddhus wR1, wR0, wR2
Performing MAC Operations on
Registers
Wireless MMX technology supports 16-bit SIMD multiply and
accumulate operations, where the sources and the destination use SIMD
coprocessor registers. Similar to the TMIA instruction, any of the
coprocessor registers can be used as an accumulator for this case.
The issue latency of the WMAC instruction is one cycle, and the result and resource latency is two cycles. The second WMAC instruction in the following example will stall for one cycle due to the two-cycle resource latency.
wmacs
wR0, wR2, wR3
wmacs wR1, wR4,
wR5 @stall 1 cycle
The WADD instruction in the following example stalls for one cycle due to the two-cycle result latency. However, the second WMACS does not stall for two cycles due to the internal forwarding supported by the multiplier and accumulate unit (MAU) of the coprocessor.
wmacs
wR0, wR4,
wR5
wmacs
wR0, wR2,
wR3 @stall 1 cycle
waddhss wR1,
wR0, wR2 @stall 2 cycles
It is often possible to interleave instructions and effectively overlap their execution with multicycle instructions that use the multiply pipeline. The two-cycle WMAC instruction may be easily interleaved with operations that do not use the same resources:
wmacs
wR14,
wR2, wR3
wldrd
wR3, [r4] ,
#8
wmacs
wR15, wR1,
wR0
waligni wR5,
wR6,
wR7, #4
wmacs
wR15, wR5,
wR0
wldrd
wR0, [r3], #8
In the preceding example, the WLDRD and WALIGNI instructions do not incur a stall since they are utilizing the memory and execution pipelines respectively and have no data dependencies. For interleaving WMACS with other instructions, instructions of the Intel XScale core can be used.
wmacs
wR14, wR1, wR2
add
R1, R2, R3
wmacs wR14, wR1,
wR2
mul
R4, R5, R6
Scheduling in the Addition and
Logical Pipeline
Most data-processing instructions for Intel XScale microarchitecture
technology and Intel Wireless MMX technology?including logical and
addition instructions?have a result latency of one cycle. Therefore,
the current instruction can use the result from the previous data
processing instruction without any penalty. For example, a series of
additions can be performed without any stalls, such as:
waddh
wR4, wR2, wR1
waddh wR5, wr4, wR1
waddh wR6, wR2, wR1
The preceding code segment does not incur any stall. The only exception to the above is the saturation arithmetic operation. During saturation, the result is generated one cycle later. Thus, subsequent instructions using the result stall by a cycle, as in this instance:
waddhss
wR4, wR2, wR1
waddhss wR5, wR4,
wR1 @single cycle
stall
waddhss wR6, wR2,
wR1
In this example, the second saturating SIMD instruction stalls for one cycle due to the read-after-write dependency on register wR4; however, the third saturating SIMD instruction does not stall since the two have no data dependency between each other. This code segment can be easily modified via translation such that there is no stall.
To make this modification, case swapping the locations of the second and the third WADDH is sufficient to remove the stall. The pipeline for the XScale microarchitecture also has no stalls on its logical and simple arithmetic operations. For many applications, this feature offers high performance.
Shifting an operand by an immediate value during an arithmetic operation is a feature of core processor instructions. This feature can save an extra instruction for explicit shifting.
You need to be mindful of the subtle constraints posed by this feature; if the current instruction uses the result of the previous data processing instruction for a shift by immediate, the result latency is two cycles. As a result, the following code segment incurs a one-cycle stall for the MOV instruction:
sub
r6,
r7, r8
add r1, r2, r3
mov r4, r1, LSL #2
The following code removes the one-cycle stall:
add
r1,
r2, r3
sub
r6, r7, r8
mov
r4, r1, LSL #2
Similarly, you can use a register to specify the shift or rotate amount for an operand. This instruction option can be very effective if the shift amount is not known beforehand; however, a longer latency is involved.
All data-processing instructions incur a two-cycle issue penalty and a two-cycle result penalty when the shifter operand is shifted or rotated based on a register. For instance, in the following code sequence, the sub incurs a two-cycle stall since the add instruction uses a register as a shift operand.
mov
r3,
#10
mul r4, r2, r3
add r5, r6, r2, LSL
r3
sub r7, r8, r4 @
Stalls for two cycles
Getting Data from Cache to Register
and Back Efficiently
Cache memory allows taking advantage of the data locality of the
program. Even if a data segment is in the cache, data has to be loaded
to the registers for data processing operations. For critical data
processing kernels, data cache optimization and register moving should
be optimized.
Knowing the Load-to-Use Penalty. An increased number of pipeline stages and increased complexity in design gives rise to non-unity load latency. Any load operation of word, byte, and half-word size has a result latency of three cycles if the load is the cache.
Thus, a load followed by a use should be avoided. For cases when the load gets a cache miss, the latency can be high. An example of the load-to-use stall follows:
wldrw
wR0, [r3],#4
waddw wR8, wR0, wR8
@stalls for 2
cycles
wldrw wR0,
[r3],#4
waddw wR8, wR0, wR8
@stalls for 2
cycles
wldrw wR0,
[r3],#4
waddw wR8,
wR0, wR8 @stalls for 2
cycles
Here, the stall of six cycles can be easily reduced by scheduling other instructions in the shadow of the load. A modification of the preceding code follows:
wldrw
wR0, [r3],#4
wldrw wR1, [r3],#4
wldrw wR2, [r3],#4
waddw wR8, wR0, wR8 @ no
stall
waddw wR8, wR1, wR8
@ no stall
waddw wR8, wR2, wR8
@ no stall
Note that the modified code segment uses multiple registers to target its load. This modification is known as register rotation. This technique hides cache access latency and utilizes the multiple-load buffering capability offered by the XScale microarchitecture. This particular technique is applicable to all other load operations - those of different sizes and also those in the co-processor space.
Double-Word Loading and Storing
The XScale microarchitecture supports double-word loads-and-stores from
a pair of 32-bit registers on an even boundary. Intel Wireless MMX
technology supports load-and-store operations on 64-bit registers.
When the LDRD instruction is used to load a pair of core registers, it has a result latency of three or four cycles depending on the destination register being accessed, assuming the data being loaded is in the data cache. When WLDRD is used to load a 64-bit coprocessor register, the latency is four cycles.
@ load double using Intel XScale core
ldrd r0, [r3]
orr r8, r1, #0xf @stalls for 4 cycles
mul r7, r0, r7
@ Another example
ldrd r0, [r3]
orr r8, r0,
#0xf @stalls for 3 cycles
mul r7, r1, r7
@ Load
using Intel Wireless MMX technology
wldrd wR0,
[r3]
waddw
wR1, WR0, wR2 @stalls for 4
cycles
Any memory instruction followed by a load double instruction has a resource hazard of one cycle, as shown in the next example:
@ str instruction
below will stall for 1 cycle
ldrd r0, [r3]
str r4, [r5] // 1 cycle
@ For
Intel Wireless MMX technology
wldrd
wR3,[r4],#8
wldrd
wR5,[r4],#8 @
STALL 1 cycle
wldrd
wR4,[r4],#8 @
STALL 1 cycle
waddb
wR0,wR1,wR2
waddb
wR0,wR0,wR6
waddb
wR0,wR0,wR7
The coprocessor supporting Wireless MMX technology can buffer incoming load operations up to two double-word loads at a time, or four word loads, byte loads, or half-word loads.
The overhead on issuing load transactions can be minimized by instruction scheduling and load pipelining. In most cases, interleaving other operations to avoid the penalty with back-to-back LDRD instructions is straightforward. In the following code sequence, three WLDRD instructions are issued back-to-back, incurring a stall on the second and third instruction.
wldrd
wR3,[r4],#8
wldrd wR5,[r4],#8 @
STALL
wldrd wR4,[r4],#8 @
STALL
waddb wR0,wR1,wR2
waddb wR0,wR0,wR6
waddb wR0,wR0,wR7
The same code sequence is reorganized to avoid a back-to-back issue of WLDRD instructions.
wldrd
wR3,[r4],#8
waddb wR0,wR1,wR2
wldrd
wR4,[r4],#8
waddb wR0,wR0,wR6
wldrd
wR5,[r4],#8
waddb
wR0,wR0,wR7
Always try to separate three multiple WLDRD instructions so that only two are outstanding at any one time and the loads are always interleaved with other instructions
wldrd
wR0,[r2],#8
wzero wR15
wldrd wR1,[r4],#8
subs r3,r3,#8
wldrd wR3,[r4],#8
Always try to interleave additional operations between the load instruction and the instruction that will first use the cached data.
wldrd
wR0,[r2],#8
wzero
wR15
wldrd
wR1,[r4],#8
subs
r3,r3,#8
wldrd
wR3,[r4],#8
wmacs
wR15,wR1,wR0
subs
r4,r4,#1
Similarly, WSRTD and STRD store data from coprocessor registers and from core register pairs. Like WLDRD and LDRD, store instructions also offer a stall for any memory operation followed by double-word store instructions.
Scheduling Load and Store Multiple
(LDM/STM)
Load and store multiple are two instructions - LDM and STM - that can
be used to load a set of core registers. These instructions are often
used for saving and retrieving the state of the processor.
LDM and STM instructions have an issue latency of 2 to 20 cycles, depending on the number of registers being loaded or stored. The issue latency is typically two cycles plus an additional cycle for each of the registers loaded or stored, assuming a data cache hit.
The instruction following an LDM stalls whether or not this instruction depends on the results of the load. While these instructions are useful to ease code development, they have two drawbacks: they have a two-cycle delay of issue latency and they are not used for loading and storing registers that support Wireless MMX technology.
Optimizing Align and Shift
The auxiliary registers are designed to hold constants that are
invariant across the lifetime of an inner loop calculation. For this
reason, values loaded into the auxiliary registers are not forwarded to
data operations.
The intended use of the registers is that the shift or alignment offset is loaded into a wCGRn register before the main loop is entered, and then the shift to alignment offset is used repeatedly inside the loop without change.
If the value in a wCGRx register is changed and an instruction immediately afterward tries to use the loaded value, then the coprocessor stalls until the loaded value has reached the control register file.
For most kernels, the alignment values and shift amount values do not change during the execution of the kernel. For example, consider an algorithm that accesses a large data array where each element has 16-bit accuracy and has been stored in a packed fashion in the memory.
Using Wireless MMX technology, four elements of this data array can be processed concurrently. If the data structure is aligned at a 64-bit boundary, Intel Wireless MMX technology can access the data by a simple WLDRD instruction. For instance:
wldrd
wR0, [r1],#8
.. use wR0 now
..
However, if the data is not aligned to a 64-bit boundary, it will be necessary to perform alignment. For the unaligned case, the data segment can be from a 64-bit boundary by an amount of one to seven bytes.
The last three bits of the pointer's address can determine the exact offset. Be aware that the misalignment for successive double words does not change throughout the array. You can keep the misalignment constant stored in a control register and perform alignment on successive accesses.
bic
r1, r2, #7 @ r1
gets aligned address
xor r0, r2, r1 @ r0 now contains misalignment
tmcr
wCGR0, r0 @ WCGR0 now gets misalignment
wldrd wR0, [r1],#8
wldrd wR1, [r1],#8
..
..
waligni wR2, wR0, wR1, #0
.. use wR2 now..
Similarly, control registers can be used to determine a shift amount. Some algorithms require a certain level of accuracy?range and precision?during the computation.
Following any multiplication or accumulation, you need to use a
right shift of the resultant value. This correction can be maintained
easily by using a control register-based shift operation.
This series of articles was excerpted from "Programming with Intel Wireless MMX Technology," by Nigel Paver, Bradley Aldrich and Moinul Khan. Copyright © 2004 Intel Corporation. All rights reserved.
Nigel Paver is an architect and
design manager for Wireless MMX technology at Intel
Corporation. Bradley Aldrich is a leading authority at Intel
Corporation on image and video processing. Moinul Khan is a multimedia
architect at Intel Corporation.