Code techniques for processor pipeline optimization: Part 2
Optimization for data processing operations
By Nigel Paver, Bradley Aldrich and Moinul Khan, Intel Corp.
Embedded.com
(07/29/08, 01:00:00 AM EDT)
Data-processing operations are at the heart of any multimedia application. So expaniding on the discussion in Part 1, what we will be concerned with here is the impact of the pipeline delay characteristics on the coding style in a variety of operations such as Fast Multiply Operations, Fast Multiply and Accumulation, Double-Word Loading and Storing, Scheduling Load and Store Multiple, and Align and Shift.

Fast Multiply Operations
In the architecture under discussion, there are two sets of multiplication units, one in the Intel XScale microarchitecture and the other in the Intel Wireless MMX instructions. These two sets of multipliers support different levels of precision of data-processing capability.

The XScale microarchitecture supports half-word and word multiplication with results of word and double-word width. Selecting the correct precision for the algorithm under implementation helps reduce the execution time; for example, SMULxy has a latency of one cycle whereas SMULL has a latency of two cycles.

Multiply instructions can cause pipeline stalls due to resource conflicts or result latencies. The following code segment incurs a stall of zero to three cycles depending on the values in registers r1, r2, r4, and r5 due to resource conflicts:

mul     r0, r1, r2
mul     r3, r4, r5 @0-3 stalls

The second multiply operation would stall by three cycles if r1 and r2 did not have any trivial value and the S bit was set. Just as issue latency depends on the values of the operands, the result latency can vary between one and three cycles. In the following example, the mov instruction incurs the result penalty:

mul     r0, r1, r2
mov     r4, r0 @stall until previous mult

However, if an arithmetic operation follows the multiplication operation, it does not stall as long as no register dependency exists. Multiply instructions should be separated out from each other by the worst-case latency, especially if you have no a priori knowledge of the data value.

ARM instructions can set conditional flags so that following instructions can execute conditionally based on the flags. A multiply instruction that sets the condition codes blocks the multiply and arithmetic pipeline. Blocking stalls any subsequent instructions. For instance, in the following example, the add instruction waits three to four cycles for the muls instruction to finish.

muls    r0, r1, r2 @mult that updates flags
add     r3, r3, #1 @stalls until the mul finish
sub     r4, r4, #1
sub     r5, r5, #1

Thus, it is not efficient to use the multiplication operation to update the flags. The modified code is as follows:

mul     r0, r1, r2
add     r3, r3, #1
sub     r4, r4, #1
sub     r5, r5, #1
cmp     r0, #0

The issue latency of the WMUL and WMADD instructions is one cycle; the result and resource latency are two cycles. The second WMUL instruction in the following example stalls for one cycle due to the two-cycle issue latency.

WMULUM wR0, wR1, wR2
WMULSL wR3, wR4, wR5 @one cycle stall

Hence, two WMUL instructions should be separated by one instruction. The WADD instruction in the following example stalls for one cycle due to the two-cycle result latency.

wmulm  wR0, wR1, wR2
waddhus  wR1, wR0, wR2 @two cycle stall

Thus, any instruction waiting on the result should be separated by two other instructions. However, if the latter instruction is another SIMD-multiplication instruction, then the stall is one cycle despite data dependency.

Fast Multiply and Accumulation
For DSP and multimedia applications, multiply and accumulate (MAC) is the most commonly used operation. In addition to multipliers, Intel Wireless MMX technology offers accumulation capabilities. In the SIMD coprocessor, any of the registers can be used as an accumulator.

Performing MAC Operations on Registers in Intel XScale Core A MAC operation can be done using TMIA 32-bit and TMIAPH 16-bit instructions. TMIA and TMIAPH instructions allow the use of two registers in the Intel XScale core as two operands and produce the result of multiplication and accumulation to any of the coprocessor registers.

The issue latency of the TMIA instruction is one cycle; the result and resource latency are two cycles. The second TMIA instruction in the following example stalls for one cycle due to the two-cycle resource latency.

tmia     wR0, r2, r3
tmia     wR1, r4, r5 @stall 1 cycle

The WADD instruction in the following example stalls for one cycle due to the two-cycle result latency.

tmia         wR0, r2, r3
waddhus wR1, wR0, wR2

Performing MAC Operations on Registers
 Wireless MMX technology supports 16-bit SIMD multiply and accumulate operations, where the sources and the destination use SIMD coprocessor registers. Similar to the TMIA instruction, any of the coprocessor registers can be used as an accumulator for this case.

The issue latency of the WMAC instruction is one cycle, and the result and resource latency is two cycles. The second WMAC instruction in the following example will stall for one cycle due to the two-cycle resource latency.

wmacs     wR0, wR2, wR3
wmacs     wR1, wR4, wR5 @stall 1 cycle

The WADD instruction in the following example stalls for one cycle due to the two-cycle result latency. However, the second WMACS does not stall for two cycles due to the internal forwarding supported by the multiplier and accumulate unit (MAU) of the coprocessor.

wmacs     wR0, wR4, wR5
wmacs     wR0, wR2, wR3 @stall 1 cycle
waddhss  wR1, wR0, wR2 @stall 2 cycles

It is often possible to interleave instructions and effectively overlap their execution with multicycle instructions that use the multiply pipeline. The two-cycle WMAC instruction may be easily interleaved with operations that do not use the same resources:

wmacs    wR14, wR2, wR3
wldrd     wR3, [r4] , #8
wmacs    wR15, wR1, wR0
waligni  wR5, wR6, wR7, #4
wmacs   wR15, wR5, wR0
wldrd    wR0, [r3], #8

In the preceding example, the WLDRD and WALIGNI instructions do not incur a stall since they are utilizing the memory and execution pipelines respectively and have no data dependencies. For interleaving WMACS with other instructions, instructions of the Intel XScale core can be used.

wmacs     wR14, wR1, wR2
add          R1, R2, R3
wmacs     wR14, wR1, wR2
mul         R4, R5, R6

Scheduling in the Addition and Logical Pipeline
Most data-processing instructions for Intel XScale microarchitecture technology and Intel Wireless MMX technology?including logical and addition instructions?have a result latency of one cycle. Therefore, the current instruction can use the result from the previous data processing instruction without any penalty. For example, a series of additions can be performed without any stalls, such as:

waddh     wR4, wR2, wR1
waddh     wR5, wr4, wR1
waddh     wR6, wR2, wR1

The preceding code segment does not incur any stall. The only exception to the above is the saturation arithmetic operation. During saturation, the result is generated one cycle later. Thus, subsequent instructions using the result stall by a cycle, as in this instance:

waddhss     wR4, wR2, wR1
waddhss     wR5, wR4, wR1 @single cycle stall
waddhss     wR6, wR2, wR1

In this example, the second saturating SIMD instruction stalls for one cycle due to the read-after-write dependency on register wR4; however, the third saturating SIMD instruction does not stall since the two have no data dependency between each other. This code segment can be easily modified via translation such that there is no stall.

To make this modification, case swapping the locations of the second and the third WADDH is sufficient to remove the stall. The pipeline for the XScale microarchitecture also has no stalls on its logical and simple arithmetic operations. For many applications, this feature offers high performance.

Shifting an operand by an immediate value during an arithmetic operation is a feature of core processor instructions. This feature can save an extra instruction for explicit shifting.

You need to be mindful of the subtle constraints posed by this feature; if the current instruction uses the result of the previous data processing instruction for a shift by immediate, the result latency is two cycles. As a result, the following code segment incurs a one-cycle stall for the MOV instruction:

sub      r6, r7, r8
add      r1, r2, r3
mov     r4, r1, LSL #2

The following code removes the one-cycle stall:

add         r1, r2, r3
sub         r6, r7, r8
mov        r4, r1, LSL #2

Similarly, you can use a register to specify the shift or rotate amount for an operand. This instruction option can be very effective if the shift amount is not known beforehand; however, a longer latency is involved.

All data-processing instructions incur a two-cycle issue penalty and a two-cycle result penalty when the shifter operand is shifted or rotated based on a register. For instance, in the following code sequence, the sub incurs a two-cycle stall since the add instruction uses a register as a shift operand.

mov     r3, #10
mul     r4, r2, r3
add     r5, r6, r2, LSL r3
sub     r7, r8, r4 @ Stalls for two cycles

Getting Data from Cache to Register and Back Efficiently
Cache memory allows taking advantage of the data locality of the program. Even if a data segment is in the cache, data has to be loaded to the registers for data processing operations. For critical data processing kernels, data cache optimization and register moving should be optimized.

Knowing the Load-to-Use Penalty. An increased number of pipeline stages and increased complexity in design gives rise to non-unity load latency. Any load operation of word, byte, and half-word size has a result latency of three cycles if the load is the cache.

Thus, a load followed by a use should be avoided. For cases when the load gets a cache miss, the latency can be high. An example of the load-to-use stall follows:

wldrw      wR0, [r3],#4
waddw     wR8, wR0, wR8 @stalls for 2 cycles
wldrw      wR0, [r3],#4
waddw     wR8, wR0, wR8 @stalls for 2 cycles
wldrw      wR0, [r3],#4
waddw     wR8, wR0, wR8 @stalls for 2 cycles

Here, the stall of six cycles can be easily reduced by scheduling other instructions in the shadow of the load. A modification of the preceding code follows:

wldrw     wR0, [r3],#4
wldrw     wR1, [r3],#4
wldrw     wR2, [r3],#4
waddw    wR8, wR0, wR8 @ no stall
waddw    wR8, wR1, wR8 @ no stall
waddw    wR8, wR2, wR8 @ no stall

Note that the modified code segment uses multiple registers to target its load. This modification is known as register rotation. This technique hides cache access latency and utilizes the multiple-load buffering capability offered by the XScale microarchitecture. This particular technique is applicable to all other load operations - those of different sizes and also those in the co-processor space.

Double-Word Loading and Storing
The XScale microarchitecture supports double-word loads-and-stores from a pair of 32-bit registers on an even boundary. Intel Wireless MMX technology supports load-and-store operations on 64-bit registers.

When the LDRD instruction is used to load a pair of core registers, it has a result latency of three or four cycles depending on the destination register being accessed, assuming the data being loaded is in the data cache. When WLDRD is used to load a 64-bit coprocessor register, the latency is four cycles.

@ load double using Intel XScale core

ldrd r0, [r3]
orr r8, r1, #0xf @stalls for 4 cycles mul r7, r0, r7

@ Another example
ldrd     r0, [r3]
orr      r8, r0, #0xf @stalls for 3 cycles
mul     r7, r1, r7

@ Load using Intel Wireless MMX technology
wldrd      wR0, [r3]

waddw     wR1, WR0, wR2 @stalls for 4 cycles

Any memory instruction followed by a load double instruction has a resource hazard of one cycle, as shown in the next example:

@ str instruction below will stall for 1 cycle ldrd r0, [r3]
str r4, [r5] // 1 cycle

@ For Intel Wireless MMX technology
wldrd     wR3,[r4],#8
wldrd     wR5,[r4],#8         @ STALL 1 cycle
wldrd     wR4,[r4],#8         @ STALL 1 cycle
waddb    wR0,wR1,wR2
waddb    wR0,wR0,wR6
waddb   wR0,wR0,wR7

The coprocessor supporting Wireless MMX technology can buffer incoming load operations up to two double-word loads at a time, or four word loads, byte loads, or half-word loads.

The overhead on issuing load transactions can be minimized by instruction scheduling and load pipelining. In most cases, interleaving other operations to avoid the penalty with back-to-back LDRD instructions is straightforward. In the following code sequence, three WLDRD instructions are issued back-to-back, incurring a stall on the second and third instruction.

wldrd     wR3,[r4],#8
wldrd     wR5,[r4],#8 @ STALL
wldrd     wR4,[r4],#8 @ STALL
waddb    wR0,wR1,wR2
waddb    wR0,wR0,wR6
waddb    wR0,wR0,wR7

The same code sequence is reorganized to avoid a back-to-back issue of WLDRD instructions.

wldrd     wR3,[r4],#8
waddb    wR0,wR1,wR2
wldrd     wR4,[r4],#8
waddb    wR0,wR0,wR6
wldrd      wR5,[r4],#8
waddb     wR0,wR0,wR7

Always try to separate three multiple WLDRD instructions so that only two are outstanding at any one time and the loads are always interleaved with other instructions

wldrd wR0,[r2],#8
wzero wR15
wldrd wR1,[r4],#8
subs r3,r3,#8
wldrd wR3,[r4],#8

Always try to interleave additional operations between the load instruction and the instruction that will first use the cached data.

wldrd     wR0,[r2],#8
wzero     wR15
wldrd     wR1,[r4],#8
subs       r3,r3,#8
wldrd     wR3,[r4],#8
wmacs    wR15,wR1,wR0
subs        r4,r4,#1

Similarly, WSRTD and STRD store data from coprocessor registers and from core register pairs. Like WLDRD and LDRD, store instructions also offer a stall for any memory operation followed by double-word store instructions.

Scheduling Load and Store Multiple (LDM/STM)
Load and store multiple are two instructions - LDM and STM - that can be used to load a set of core registers. These instructions are often used for saving and retrieving the state of the processor.

LDM and STM instructions have an issue latency of 2 to 20 cycles, depending on the number of registers being loaded or stored. The issue latency is typically two cycles plus an additional cycle for each of the registers loaded or stored, assuming a data cache hit.

The instruction following an LDM stalls whether or not this instruction depends on the results of the load. While these instructions are useful to ease code development, they have two drawbacks: they have a two-cycle delay of issue latency and they are not used for loading and storing registers that support Wireless MMX technology.

Optimizing Align and Shift
The auxiliary registers are designed to hold constants that are invariant across the lifetime of an inner loop calculation. For this reason, values loaded into the auxiliary registers are not forwarded to data operations.

The intended use of the registers is that the shift or alignment offset is loaded into a wCGRn register before the main loop is entered, and then the shift to alignment offset is used repeatedly inside the loop without change.

If the value in a wCGRx register is changed and an instruction immediately afterward tries to use the loaded value, then the coprocessor stalls until the loaded value has reached the control register file.

For most kernels, the alignment values and shift amount values do not change during the execution of the kernel. For example, consider an algorithm that accesses a large data array where each element has 16-bit accuracy and has been stored in a packed fashion in the memory.

Using Wireless MMX technology, four elements of this data array can be processed concurrently. If the data structure is aligned at a 64-bit boundary, Intel Wireless MMX technology can access the data by a simple WLDRD instruction. For instance:

wldrd     wR0, [r1],#8
.. use   wR0   now ..

However, if the data is not aligned to a 64-bit boundary, it will be necessary to perform alignment. For the unaligned case, the data segment can be from a 64-bit boundary by an amount of one to seven bytes.

The last three bits of the pointer's address can determine the exact offset. Be aware that the misalignment for successive double words does not change throughout the array. You can keep the misalignment constant stored in a control register and perform alignment on successive accesses.

bic     r1, r2, #7 @ r1 gets aligned address
xor     r0, r2, r1 @ r0 now contains misalignment

tmcr wCGR0, r0 @ WCGR0 now gets misalignment
wldrd wR0, [r1],#8
wldrd wR1, [r1],#8
..
..
waligni wR2, wR0, wR1, #0
.. use  wR2  now..

Similarly, control registers can be used to determine a shift amount. Some algorithms require a certain level of accuracy?range and precision?during the computation.

Following any multiplication or accumulation, you need to use a right shift of the resultant value. This correction can be maintained easily by using a control register-based shift operation.

Next in Part 3: "Optimization for Control-oriented operations."
To read Part 1, go to "Microarchitectural optimization philosophy."

This series of articles was excerpted from "Programming with Intel Wireless MMX Technology," by Nigel Paver, Bradley Aldrich and Moinul Khan. Copyright © 2004 Intel Corporation. All rights reserved.

Nigel Paver is an architect and design manager for Wireless MMX technology at Intel Corporation. Bradley Aldrich is a leading authority at Intel Corporation on image and video processing. Moinul Khan is a multimedia architect at Intel Corporation.