Code techniques for processor pipeline optimization: Part 2


Data-processing operations are at the heart of any multimedia application. So, expanding on the discussion in Part 1, what we will be concerned with here is the impact of the pipeline delay characteristics on coding style in a variety of operations: fast multiply, fast multiply and accumulate, double-word loading and storing, scheduling load and store multiple, and align and shift.

Fast Multiply Operations
In the architecture under discussion, there are two sets of multiplication units: one in the Intel XScale microarchitecture and the other in the Intel Wireless MMX instruction set. These two sets of multipliers support different levels of data-processing precision.

The XScale microarchitecture supports half-word and word multiplication with results of word and double-word width. Selecting the correct precision for the algorithm under implementation helps reduce the execution time; for example, SMULxy has a latency of one cycle whereas SMULL has a latency of two cycles.
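As an illustration, when only 16-bit operands are involved, the half-word multiply can replace the long multiply. This is a sketch; the register choices are arbitrary:

```asm
@ 32x32 -> 64 multiply: two-cycle latency, even when
@ only the 16-bit parts of the operands are significant
smull   r0, r1, r2, r3

@ 16x16 -> 32 multiply of the bottom half-words: one-cycle
@ latency when the full 64-bit result is not needed
smulbb  r0, r2, r3
```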

Multiply instructions can cause pipeline stalls due to resource conflicts or result latencies. The following code segment incurs a stall of zero to three cycles, depending on the values in registers r1, r2, r4, and r5, due to resource conflicts:

mul     r0, r1, r2
mul     r3, r4, r5 @ 0-3 stalls

The second multiply operation would stall for three cycles if r1 and r2 did not hold trivial values and the S bit was set. Just as the issue latency depends on the values of the operands, the result latency can vary between one and three cycles. In the following example, the mov instruction incurs the result penalty:

mul     r0, r1, r2
mov     r4, r0 @ stalls until the previous mul completes

However, if an arithmetic operation follows the multiplication operation, it does not stall as long as no register dependency exists. Multiply instructions should be separated from each other by the worst-case latency, especially if you have no a priori knowledge of the data values.

ARM instructions can set condition flags so that following instructions can execute conditionally based on the flags. A multiply instruction that sets the condition codes blocks the multiply and arithmetic pipelines. Blocking stalls any subsequent instructions. For instance, in the following example, the add instruction waits three to four cycles for the muls instruction to finish.

muls    r0, r1, r2 @ multiply that updates the flags
add     r3, r3, #1 @ stalls until the mul finishes
sub     r4, r4, #1
sub     r5, r5, #1

Thus, it is not efficient to use the multiplication operation to update the flags. The modified code is as follows:

mul     r0, r1, r2
add     r3, r3, #1
sub     r4, r4, #1
sub     r5, r5, #1
cmp     r0, #0

The issue latency of the WMUL and WMADD instructions is one cycle, and the resource latency is two cycles. The second WMUL instruction in the following example stalls for one cycle due to the two-cycle resource latency.

wmulum  wR0, wR1, wR2
wmulsl  wR3, wR4, wR5 @ one-cycle stall

Hence, two WMUL instructions should be separated by one other instruction. The WADD instruction in the following example stalls for two cycles while it waits for the multiply result.

wmulum  wR0, wR1, wR2
waddhus wR1, wR0, wR2 @ two-cycle stall

Thus, any instruction waiting on the result should be separated from the multiply by two other instructions. However, if the latter instruction is another SIMD multiplication instruction, the stall is only one cycle despite the data dependency.
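Such scheduling can be sketched as follows; the two interleaved instructions are arbitrary placeholders, and any independent work will do:

```asm
wmulum  wR0, wR1, wR2
wldrd   wR4, [r2], #8   @ independent load fills the first slot
wsubh   wR6, wR6, wR7   @ independent arithmetic fills the second
waddhus wR3, wR0, wR2   @ multiply result now available: no stall
```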

Fast Multiply and Accumulation
For DSP and multimedia applications, multiply and accumulate (MAC) is the most commonly used operation. In addition to multipliers, Intel Wireless MMX technology offers accumulation capabilities. In the SIMD coprocessor, any of the registers can be used as an accumulator.

Performing MAC Operations on Registers in Intel XScale Core
A MAC operation can be done using the TMIA (32-bit) and TMIAPH (16-bit) instructions. TMIA and TMIAPH allow the use of two registers in the Intel XScale core as operands and deliver the result of the multiplication and accumulation to any of the coprocessor registers.

The issue latency of the TMIA instruction is one cycle; the result and resource latency are two cycles. The second TMIA instruction in the following example stalls for one cycle due to the two-cycle resource latency.

tmia    wR0, r2, r3
tmia    wR1, r4, r5 @ one-cycle stall

The WADD instruction in the following example stalls for one cycle due to the two-cycle result latency.

tmia    wR0, r2, r3
waddhus wR1, wR0, wR2 @ one-cycle stall

Performing MAC Operations on SIMD Coprocessor Registers
Wireless MMX technology supports 16-bit SIMD multiply and accumulate operations, where the sources and the destination use SIMD coprocessor registers. As with the TMIA instruction, any of the coprocessor registers can be used as an accumulator in this case.

The issue latency of the WMAC instruction is one cycle, and the resource latency is two cycles. The second WMAC instruction in the following example stalls for one cycle due to the two-cycle resource latency.

wmacs   wR0, wR2, wR3
wmacs   wR1, wR4, wR5 @ one-cycle stall

The WADDHSS instruction in the following example stalls for two cycles while it waits for the accumulated result. The second WMACS, however, stalls for only one cycle, thanks to the internal forwarding supported by the multiply and accumulate unit (MAU) of the coprocessor.

wmacs   wR0, wR4, wR5
wmacs   wR0, wR2, wR3 @ one-cycle stall
waddhss wR1, wR0, wR2 @ two-cycle stall

It is often possible to interleave instructions and effectively overlap their execution with multicycle instructions that use the multiply pipeline. The two-cycle WMAC instruction may be easily interleaved with operations that do not use the same resources:

wmacs   wR14, wR2, wR3
wldrd   wR3, [r4], #8
wmacs   wR15, wR1, wR0
waligni wR5, wR6, wR7, #4
wmacs   wR15, wR5, wR0
wldrd   wR0, [r3], #8

In the preceding example, the WLDRD and WALIGNI instructions do not incur a stall, since they utilize the memory and execution pipelines respectively and have no data dependencies. Instructions of the Intel XScale core can also be used to interleave with WMACS:

wmacs   wR14, wR1, wR2
add     r1, r2, r3
wmacs   wR14, wR1, wR2
mul     r4, r5, r6

Scheduling in the Addition and Logical Pipeline
Most data-processing instructions for Intel XScale microarchitecture and Intel Wireless MMX technology, including logical and addition instructions, have a result latency of one cycle. Therefore, the current instruction can use the result of the previous data-processing instruction without any penalty. For example, a series of additions can be performed without any stalls:

waddh   wR4, wR2, wR1
waddh   wR5, wR4, wR1
waddh   wR6, wR2, wR1

The preceding code segment does not incur any stall. The only exception to the above is saturating arithmetic. During saturation, the result is generated one cycle later; thus, subsequent instructions using the result stall for a cycle, as in this instance:

waddhss wR4, wR2, wR1
waddhss wR5, wR4, wR1 @ single-cycle stall
waddhss wR6, wR2, wR1

In this example, the second saturating SIMD instruction stalls for one cycle due to the read-after-write dependency on register wR4; the third saturating SIMD instruction does not stall, since the two have no data dependency between each other. This code segment can easily be reordered so that there is no stall.

To make this modification, simply swapping the locations of the second and third WADDHSS instructions is sufficient to remove the stall. The pipeline of the XScale microarchitecture likewise has no stalls on its logical and simple arithmetic operations. For many applications, this feature offers high performance.
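The reordered sequence, a sketch based on the saturating example above, looks like this:

```asm
waddhss wR4, wR2, wR1
waddhss wR6, wR2, wR1 @ independent of wR4: fills the stall slot
waddhss wR5, wR4, wR1 @ wR4 is now ready: no stall
```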

Shifting an operand by an immediate value during an arithmetic operation is a feature of the core processor instructions. This feature can save an extra instruction for explicit shifting.

You need to be mindful of the subtle constraints posed by this feature: if the current instruction uses the result of the previous data-processing instruction for a shift by immediate, the result latency is two cycles. As a result, the following code segment incurs a one-cycle stall on the MOV instruction:

sub     r6, r7, r8
add     r1, r2, r3
mov     r4, r1, LSL #2 @ one-cycle stall

The following code removes the one-cycle stall:

add     r1, r2, r3
sub     r6, r7, r8
mov     r4, r1, LSL #2

Similarly, you can use a register to specify the shift or rotate amount for an operand. This option can be very effective when the shift amount is not known beforehand; however, a longer latency is involved.

All data-processing instructions incur a two-cycle issue penalty and a two-cycle result penalty when the shifter operand is shifted or rotated by a register. For instance, in the following code sequence, the sub incurs a two-cycle stall because the add instruction uses a register as the shift operand.

mov     r3, #10
mul     r4, r2, r3
add     r5, r6, r2, LSL r3
sub     r7, r8, r4 @ stalls for two cycles

Getting Data from Cache to Register and Back Efficiently
Cache memory allows a program to take advantage of data locality. Even when a data segment is in the cache, the data still has to be loaded into registers for data-processing operations. For critical data-processing kernels, both cache behavior and the movement of data into registers should be optimized.

Knowing the Load-to-Use Penalty. The increased number of pipeline stages and the increased design complexity give rise to non-unity load latency. Any load operation of word, byte, or half-word size has a result latency of three cycles if the load hits the cache.

Thus, a load immediately followed by a use of the loaded value should be avoided. When the load misses the cache, the latency can be much higher. An example of the load-to-use stall follows:

wldrw   wR0, [r3], #4
waddw   wR8, wR0, wR8 @ stalls for 2 cycles
wldrw   wR0, [r3], #4
waddw   wR8, wR0, wR8 @ stalls for 2 cycles
wldrw   wR0, [r3], #4
waddw   wR8, wR0, wR8 @ stalls for 2 cycles

Here, the stall of six cycles can easily be reduced by scheduling other instructions in the shadow of the load. A modification of the preceding code follows:

wldrw   wR0, [r3], #4
wldrw   wR1, [r3], #4
wldrw   wR2, [r3], #4
waddw   wR8, wR0, wR8 @ no stall
waddw   wR8, wR1, wR8 @ no stall
waddw   wR8, wR2, wR8 @ no stall

Note that the modified code segment targets its loads at multiple registers. This modification is known as register rotation. The technique hides cache-access latency and utilizes the multiple-load buffering capability offered by the XScale microarchitecture. It is applicable to all other load operations, including those of different sizes and those in the coprocessor space.

Double-Word Loading and Storing
The XScale microarchitecture supports double-word loads and stores using a pair of 32-bit registers starting on an even register boundary. Intel Wireless MMX technology supports load and store operations on 64-bit registers.

When the LDRD instruction is used to load a pair of core registers, it has a result latency of three or four cycles, depending on which destination register is accessed, assuming the data being loaded is in the data cache. When WLDRD is used to load a 64-bit coprocessor register, the latency is four cycles.

@ load double using Intel XScale core

ldrd    r0, [r3]
orr     r8, r1, #0xf @ stalls for 4 cycles
mul     r7, r0, r7

@ Another example
ldrd    r0, [r3]
orr     r8, r0, #0xf @ stalls for 3 cycles
mul     r7, r1, r7

@ Load using Intel Wireless MMX technology
wldrd   wR0, [r3]
waddw   wR1, wR0, wR2 @ stalls for 4 cycles

Any memory instruction that immediately follows a load-double instruction incurs a resource hazard of one cycle, as shown in the next example:

@ the str instruction below will stall for 1 cycle
ldrd    r0, [r3]
str     r4, [r5]

@ For Intel Wireless MMX technology
wldrd   wR3, [r4], #8
wldrd   wR5, [r4], #8 @ stalls 1 cycle
wldrd   wR4, [r4], #8 @ stalls 1 cycle
waddb   wR0, wR1, wR2
waddb   wR0, wR0, wR6
waddb   wR0, wR0, wR7

The coprocessor supporting Wireless MMX technology can buffer up to two outstanding double-word loads at a time, or up to four word, half-word, or byte loads.

The overhead of issuing load transactions can be minimized by instruction scheduling and load pipelining. In most cases, interleaving other operations to avoid the penalty of back-to-back WLDRD instructions is straightforward. In the following code sequence, three WLDRD instructions are issued back-to-back, incurring a stall on the second and third instructions.

wldrd   wR3, [r4], #8
wldrd   wR5, [r4], #8 @ stall
wldrd   wR4, [r4], #8 @ stall
waddb   wR0, wR1, wR2
waddb   wR0, wR0, wR6
waddb   wR0, wR0, wR7

The same code sequence is reorganized to avoid a back-to-back issueof WLDRD instructions.

wldrd   wR3, [r4], #8
waddb   wR0, wR1, wR2
wldrd   wR4, [r4], #8
waddb   wR0, wR0, wR6
wldrd   wR5, [r4], #8
waddb   wR0, wR0, wR7

Always try to separate multiple WLDRD instructions so that only two are outstanding at any one time and the loads are always interleaved with other instructions:

wzero   wR15
wldrd   wR1, [r4], #8
subs    r3, r3, #8
wldrd   wR3, [r4], #8

Always try to interleave additional operations between a load instruction and the instruction that first uses the loaded data:

wldrd   wR0, [r2], #8
wzero   wR15
wldrd   wR1, [r4], #8
subs    r3, r3, #8
wldrd   wR3, [r4], #8
wmacs   wR15, wR1, wR0
subs    r4, r4, #1

Similarly, WSTRD and STRD store data from coprocessor registers and from core register pairs. Like WLDRD and LDRD, the double-word store instructions also incur a stall for any memory operation that follows them.

Scheduling Load and Store Multiple (LDM/STM)
Load multiple (LDM) and store multiple (STM) are two instructions that can be used to load or store a set of core registers. These instructions are often used for saving and restoring the state of the processor.

LDM and STM instructions have an issue latency of 2 to 20 cycles, depending on the number of registers being loaded or stored. The issue latency is typically two cycles plus one additional cycle for each register loaded or stored, assuming a data-cache hit.

The instruction following an LDM stalls whether or not it depends on the results of the load. While these instructions ease code development, they have two drawbacks: they carry a minimum two-cycle issue latency, and they cannot be used to load or store the registers that support Wireless MMX technology.
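A typical use is a function prologue and epilogue. The sketch below is illustrative; the register list is an assumption, and with nine registers transferred each LDM/STM here has an issue latency of roughly 2 + 9 cycles per the rule above:

```asm
func:
        stmdb   sp!, {r4-r11, lr}   @ save callee-saved registers and return address
        @ ... function body ...
        ldmia   sp!, {r4-r11, pc}   @ restore registers and return
```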

Optimizing Align and Shift
The auxiliary registers are designed to hold constants that are invariant across the lifetime of an inner-loop calculation. For this reason, values loaded into the auxiliary registers are not forwarded to data operations.

The intended use of these registers is that the shift or alignment offset is loaded into a wCGRn register before the main loop is entered, and then used repeatedly inside the loop without change.

If the value in a wCGRn register is changed and an instruction immediately afterward tries to use the loaded value, the coprocessor stalls until the loaded value has reached the control register file.

For most kernels, the alignment and shift-amount values do not change during the execution of the kernel. For example, consider an algorithm that accesses a large data array where each element has 16-bit accuracy and has been stored in packed fashion in memory.

Using Wireless MMX technology, four elements of this data array can be processed concurrently. If the data structure is aligned on a 64-bit boundary, Intel Wireless MMX technology can access the data with a simple WLDRD instruction. For instance:

wldrd   wR0, [r1], #8
.. use wR0 now ..

However, if the data is not aligned to a 64-bit boundary, it is necessary to perform alignment. In the unaligned case, the data segment can be offset from a 64-bit boundary by one to seven bytes.

The last three bits of the pointer's address determine the exact offset. Be aware that the misalignment of successive double words does not change throughout the array. You can keep the misalignment constant in a control register and perform alignment on successive accesses.

bic     r1, r2, #7      @ r1 gets the aligned address
eor     r0, r2, r1      @ r0 now contains the misalignment
tmcr    wCGR0, r0       @ wCGR0 now holds the misalignment
wldrd   wR0, [r1], #8
wldrd   wR1, [r1], #8
..
walignr0 wR2, wR0, wR1  @ align using the offset in wCGR0
.. use wR2 now ..

Similarly, control registers can be used to hold a shift amount. Some algorithms require a certain level of accuracy, in both range and precision, during the computation.

Following any multiplication or accumulation, you need to right-shift the resultant value. This correction can be maintained easily by using a control register-based shift operation.
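A sketch of this pattern follows, using the GNU-assembler mnemonic wsrawg for an arithmetic right shift by a wCGR register; the loop structure, the shift amount in r0, and the register choices are illustrative assumptions:

```asm
        tmcr    wCGR1, r0           @ load the shift amount once, before the loop
loop:
        wldrd   wR1, [r4], #8
        wmulsm  wR2, wR1, wR3       @ multiply step of the kernel
        wsrawg  wR2, wR2, wCGR1     @ scale the result: arithmetic shift right by wCGR1
        wstrd   wR2, [r5], #8
        subs    r3, r3, #8
        bne     loop
```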

Next in Part 3: "Optimization for control-oriented operations."
To read Part 1, go to "Microarchitectural optimization philosophy."

This series of articles was excerpted from "Programming with Intel Wireless MMX Technology," by Nigel Paver, Bradley Aldrich, and Moinul Khan. Copyright © 2004 Intel Corporation. All rights reserved.

Nigel Paver is an architect and design manager for Wireless MMX technology at Intel Corporation. Bradley Aldrich is a leading authority at Intel Corporation on image and video processing. Moinul Khan is a multimedia architect at Intel Corporation.
