Data-processing operations are at the heart of any multimedia application. So, expanding on the discussion in Part 1, this installment looks at how to schedule multiply, data-processing, and load/store instructions efficiently on the Intel XScale microarchitecture and Intel Wireless MMX technology.
Fast Multiply Operations
In the architecture under discussion, there are two sets of multiplication units: one in the Intel XScale microarchitecture and the other supporting the Intel Wireless MMX instructions. These two sets of multipliers support different levels of precision of data-processing capability.
The XScale microarchitecture supports half-word and word multiplication with results of word and double-word width. Selecting the correct precision for the algorithm under implementation helps reduce the execution time; for example, SMULxy has a latency of one cycle whereas SMULL has a latency of two cycles.
Multiply instructions can cause pipeline stalls due to resource conflicts or result latencies. The following code segment incurs a stall of zero to three cycles, depending on the values in registers r1, r2, r4, and r5, due to resource conflicts:
mul r0, r1, r2
mul r3, r4, r5 @ 0-3 stalls
The second multiply operation would stall for the full three cycles if r1 and r2 did not contain any trivial values and the S bit was set. Just as issue latency depends on the values of the operands, the result latency can vary between one and three cycles. In the following example, the mov instruction incurs the result penalty:
mul r0, r1, r2
mov r4, r0 @ stalls until the previous multiply completes
However, if an arithmetic operation follows the multiplication operation, it does not stall as long as no register dependency exists. Multiply instructions should be separated from each other by the worst-case latency, especially if you have no a priori knowledge of the data values.
ARM instructions can set conditional flags so that following instructions can execute conditionally based on the flags. A multiply instruction that sets the condition codes blocks both the multiply and arithmetic pipelines, stalling any subsequent instructions. For instance, in the following example, the sub instruction waits three to four cycles for the muls instruction to finish.
muls r0, r1, r2 @ multiply that updates the flags
sub r4, r4, #1
Thus, it is not efficient to use the multiplication operation to update the flags. The modified code is as follows:
mul r0, r1, r2
subs r4, r4, #1
The issue latency of the WMUL and WMADD instructions is one cycle; the result and resource latencies are two cycles. The second WMUL instruction in the following example stalls for one cycle due to the two-cycle resource latency.
wmulum wR0, wR1, wR2
wmulsl wR3, wR4, wR5 @ one-cycle stall
Hence, two WMUL instructions should be separated by one instruction. The WADD instruction in the following example stalls for two cycles due to the result latency.
wmulsm wR0, wR1, wR2
waddhus wR1, wR0, wR2 @ two-cycle stall
Thus, any instruction waiting on the result should be separated from it by two other instructions. However, if the latter instruction is another SIMD multiplication instruction, then the stall is only one cycle despite the data dependency.
Fast Multiply and Accumulation
For DSP and multimedia applications, multiply and accumulate (MAC) is the most commonly used operation. In addition to multipliers, Intel Wireless MMX technology offers accumulation capabilities. In the SIMD coprocessor, any of the registers can be used as an accumulator.
Performing MAC Operations on Registers in Intel XScale Core
A MAC operation can be done using the TMIA (32-bit) and TMIAPH (16-bit) instructions. TMIA and TMIAPH allow the use of two Intel XScale core registers as operands and deliver the result of the multiplication and accumulation to any of the coprocessor registers.
The issue latency of the TMIA instruction is one cycle; the result and resource latencies are two cycles. The second TMIA instruction in the following example stalls for one cycle due to the two-cycle resource latency.
tmia wR0, r2, r3
tmia wR1, r4, r5 @ stalls 1 cycle
The WADD instruction in the following example stalls for two cycles due to the result latency.
tmia wR0, r2, r3
waddhus wR1, wR0, wR2 @ two-cycle stall
Performing MAC Operations on Registers
Wireless MMX technology supports 16-bit SIMD multiply and accumulate operations, where the sources and the destination use SIMD coprocessor registers. Similar to the TMIA instruction, any of the coprocessor registers can be used as an accumulator in this case.
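As a rough scalar model of what such a SIMD MAC computes (an illustrative C sketch with a made-up helper name, not the coprocessor implementation), four packed signed 16-bit lanes are multiplied pairwise and the products are summed into a 64-bit accumulator:

```c
#include <stdint.h>

/* Illustrative C model of a signed 16-bit SIMD multiply-accumulate:
   four packed 16-bit lanes are multiplied pairwise and the products
   are summed into the accumulator. The function name is hypothetical;
   this sketches the arithmetic only, not the coprocessor behavior. */
static int64_t simd_mac16(int64_t acc, const int16_t a[4], const int16_t b[4]) {
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i]; /* widen before multiplying */
    return acc;
}
```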
The issue latency of the WMAC instruction is one cycle; the result and resource latencies are two cycles. The second WMAC instruction in the following example stalls for one cycle due to the two-cycle resource latency.
wmacs wR0, wR2, wR3
wmacs wR1, wR4, wR5 @ stalls 1 cycle
The WADD instruction in the following example stalls for two cycles due to the result latency. The second WMACS, however, stalls for only one cycle rather than two, thanks to the internal forwarding supported by the multiplier and accumulate unit (MAU) of the coprocessor.
wmacs wR0, wR4, wR5
wmacs wR0, wR2, wR3 @ stalls 1 cycle
waddhss wR1, wR0, wR2 @ stalls 2 cycles
It is often possible to interleave instructions and effectively overlap their execution with multicycle instructions that use the multiply pipeline. The two-cycle WMAC instruction may be easily interleaved with operations that do not use the same resources:
wmacs wR14, wR2, wR3
wldrd wR3, [r4], #8
wmacs wR15, wR1, wR0
waligni wR5, wR6, wR7, #4
wmacs wR15, wR5, wR0
wldrd wR0, [r3], #8
In the preceding example, the WLDRD and WALIGNI instructions do not incur a stall since they utilize the memory and execution pipelines, respectively, and have no data dependencies. Intel XScale core instructions can also be used to interleave with WMACS:
wmacs wR14, wR1, wR2
add r1, r2, r3
mul r4, r5, r6
Scheduling in the Addition and Logical Pipeline
Most data-processing instructions for the Intel XScale microarchitecture and Intel Wireless MMX technology, including the logical and addition instructions, have a result latency of one cycle. Therefore, the current instruction can use the result of the previous data-processing instruction without any penalty. For example, a series of additions can be performed without any stalls:
waddh wR4, wR2, wR1
waddh wR5, wR4, wR1
The preceding code segment does not incur any stall. The only exception to the above is saturating arithmetic. During saturation, the result is generated one cycle later; thus, subsequent instructions using the result stall by a cycle, as in this instance:
waddhss wR4, wR2, wR1
waddhss wR5, wR4, wR1 @ single-cycle stall
waddhss wR6, wR2, wR1
In this example, the second saturating SIMD instruction stalls for one cycle due to the read-after-write dependency on register wR4; the third saturating SIMD instruction, however, does not stall since the two have no data dependency between each other. This code segment can easily be modified so that there is no stall: swapping the locations of the second and third WADDHSS instructions is sufficient to remove it. The pipeline of the XScale microarchitecture likewise has no stalls on its logical and simple arithmetic operations. For many applications, this feature offers high performance.
Shifting an operand by an immediate value during an arithmetic operation is a feature of the core processor instructions. This feature can save an extra instruction for explicit shifting.
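In C terms, this folded shift corresponds to writing the scaling directly into the arithmetic expression, which a compiler for this core can map to a single ADD with an immediate-shifted operand rather than a separate shift instruction (the helper name below is hypothetical, for illustration only):

```c
#include <stdint.h>

/* Illustrative sketch: scale-and-accumulate that a compiler can map
   to a single ARM "add r1, r2, r3, LSL #2" with the shift folded into
   the addition, instead of an explicit shift plus an add. */
static int32_t add_scaled4(int32_t a, int32_t b) {
    return a + (b << 2); /* b * 4 folded into the addition */
}
```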
You need to be mindful of the subtle constraints posed by this feature: if the current instruction uses the result of the previous data-processing instruction for a shift by immediate, the result latency is two cycles. As a result, the following code segment incurs a one-cycle stall for the MOV instruction:
sub r6, r7, r8
add r1, r2, r3
mov r4, r1, LSL #2 @ one-cycle stall
The following code removes the one-cycle stall:
add r1, r2, r3
sub r6, r7, r8
mov r4, r1, LSL #2
Similarly, you can use a register to specify the shift or rotate amount for an operand. This instruction option can be very effective if the shift amount is not known beforehand; however, a longer latency is involved.
All data-processing instructions incur a two-cycle issue penalty and a two-cycle result penalty when the shifter operand is shifted or rotated by a register. For instance, in the following code sequence, the sub incurs a two-cycle stall because the preceding add uses a register-specified shift:
add r1, r2, r3, LSL r5
sub r7, r8, r4 @ stalls for two cycles
Getting Data from Cache to Register and Back Efficiently
Cache memory allows you to take advantage of the data locality of the program. Even if a data segment is in the cache, the data still has to be loaded into registers for data-processing operations. For critical data-processing kernels, both data-cache behavior and the movement of data into registers should be optimized.
Knowing the Load-to-Use Penalty. An increased number of pipeline stages and increased design complexity give rise to non-unity load latency. Any load operation of word, byte, or half-word size has a result latency of three cycles if the load hits the cache.
Thus, a load immediately followed by a use of its result should be avoided. For cases where the load misses the cache, the latency can be much higher. An example of the load-to-use stall follows:
wldrw wR0, [r3], #4
waddw wR8, wR0, wR8 @ stalls for 2 cycles
Here, the stall can easily be reduced by scheduling other instructions in the shadow of the load. A modification of the preceding code follows:
wldrw wR0, [r3], #4
wldrw wR1, [r3], #4
waddw wR8, wR0, wR8 @ no stall
waddw wR8, wR1, wR8 @ no stall
Note that the modified code segment uses multiple registers as load targets. This modification is known as register rotation. The technique hides cache-access latency and utilizes the multiple-load buffering capability offered by the XScale microarchitecture. It is applicable to all other load operations – those of different sizes and also those in the coprocessor space.
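In portable C, the register-rotation idea can be sketched as a loop unrolled by two that alternates between two temporaries, so that each loaded value has time in flight before it is consumed (the function name and the two-buffer scheme are this sketch's assumptions, not a prescribed API):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch of register rotation: unroll by two and rotate
   between two temporaries (t0, t1) so each load has time to complete
   before its value is used. Assumes n is even, for brevity. */
static int32_t sum_rotated(const int32_t *p, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        int32_t t0 = p[i];     /* load into "register" t0 */
        int32_t t1 = p[i + 1]; /* load into t1 while t0 is in flight */
        acc += t0;             /* t0 consumed after its load has settled */
        acc += t1;
    }
    return acc;
}
```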
Double-Word Loading and Storing
The XScale microarchitecture supports double-word loads and stores from a pair of 32-bit registers on an even boundary. Intel Wireless MMX technology supports load and store operations on its 64-bit registers.
When the LDRD instruction is used to load a pair of core registers, it has a result latency of three or four cycles, depending on the destination register being accessed, assuming the data being loaded is in the data cache. When WLDRD is used to load a 64-bit coprocessor register, the latency is four cycles.
@ load double using Intel XScale core
ldrd r0, [r3]
orr r8, r1, #0xf @ stalls for 4 cycles
mul r7, r0, r7
@ another example
ldrd r0, [r3]
orr r8, r0, #0xf @ stalls for 3 cycles
mul r7, r1, r7
@ load using Intel Wireless MMX technology
wldrd wR0, [r3]
waddw wR1, wR0, wR2 @ stalls for 4 cycles
Any memory instruction that follows a load-double instruction incurs a resource hazard of one cycle, as shown in the next example:
ldrd r0, [r3]
str r4, [r5] @ stalls for 1 cycle
@ for Intel Wireless MMX technology
wldrd wR5, [r3], #8
wldrd wR4, [r4], #8 @ stalls 1 cycle
The coprocessor supporting Wireless MMX technology can buffer incoming load operations: up to two double-word loads at a time, or four word, half-word, or byte loads.
The overhead of issuing load transactions can be minimized by instruction scheduling and load pipelining. In most cases, interleaving other operations to avoid the penalty of back-to-back LDRD instructions is straightforward. In the following code sequence, three WLDRD instructions are issued back-to-back, incurring a stall on the second and third instructions.
wldrd wR6, [r4], #8
wldrd wR5, [r4], #8 @ stall
wldrd wR4, [r4], #8 @ stall
The same code sequence can be reorganized to avoid a back-to-back issue of WLDRD instructions. Always try to separate multiple WLDRD instructions so that only two are outstanding at any one time and the loads are always interleaved with other instructions.
Always try to interleave additional operations between the loadinstruction and the instruction that will first use the cached data.
Similarly, WSTRD and STRD store data from coprocessor registers and from core register pairs. Like WLDRD and LDRD, these store instructions also incur a stall for any memory operation that follows a double-word store.
Scheduling Load and Store Multiple (LDM/STM)
Load multiple and store multiple – LDM and STM – are two instructions that can be used to load or store a set of core registers. These instructions are often used for saving and retrieving the state of the processor.
LDM and STM instructions have an issue latency of 2 to 20 cycles, depending on the number of registers being loaded or stored. The issue latency is typically two cycles plus an additional cycle for each register loaded or stored, assuming a data-cache hit.
The instruction following an LDM stalls whether or not it depends on the results of the load. While these instructions ease code development, they have two drawbacks: they carry the two-cycle issue-latency overhead, and they cannot be used for loading and storing the registers of Intel Wireless MMX technology.
Optimizing Align and Shift
The auxiliary registers are designed to hold constants that are invariant across the lifetime of an inner-loop calculation. For this reason, values loaded into the auxiliary registers are not forwarded to data operations.
The intended use of these registers is that the shift or alignment offset is loaded into a wCGRn register before the main loop is entered, and then used repeatedly inside the loop without change.
If the value in a wCGRn register is changed and an instruction immediately afterward tries to use the loaded value, then the coprocessor stalls until the loaded value has reached the control register file.
For most kernels, the alignment values and shift amounts do not change during the execution of the kernel. For example, consider an algorithm that accesses a large data array where each element has 16-bit accuracy and has been stored in a packed fashion in memory.
Using Wireless MMX technology, four elements of this data array can be processed concurrently. If the data structure is aligned at a 64-bit boundary, Intel Wireless MMX technology can access the data with a simple WLDRD instruction. For instance:
wldrd wR0, [r1], #8
@ ... use wR0 now ...
However, if the data is not aligned to a 64-bit boundary, it is necessary to perform alignment. In the unaligned case, the data segment can be offset from a 64-bit boundary by one to seven bytes.
The last three bits of the pointer's address determine the exact offset. Note that the misalignment for successive double words does not change throughout the array, so you can keep the misalignment constant stored in a control register and perform alignment on successive accesses:
bic r1, r2, #7 @ r1 gets the aligned address
eor r0, r2, r1 @ r0 now contains the misalignment
tmcr wCGR0, r0 @ wCGR0 now gets the misalignment
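The same BIC/EOR arithmetic can be checked in C (the helper names here are hypothetical; only the bit manipulation is taken from the assembly above):

```c
#include <stdint.h>

/* Sketch of the alignment computation above, with hypothetical names:
   BIC #7 clears the low three address bits to round the address down
   to a 64-bit boundary; EOR against the aligned address recovers the
   0-7 byte misalignment. */
static uintptr_t align_down8(uintptr_t addr) {
    return addr & ~(uintptr_t)7;                     /* bic r1, r2, #7 */
}
static unsigned misalignment8(uintptr_t addr) {
    return (unsigned)(addr ^ align_down8(addr));     /* eor r0, r2, r1 */
}
```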
Similarly, the control registers can be used to hold a shift amount. Some algorithms require a certain level of accuracy, in both range and precision, during the computation.
Following a multiplication or accumulation, you may need to right-shift the resulting value to correct it. This correction can be maintained easily by using a control-register-based shift operation.
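As an illustration, a Q15 fixed-point multiply-accumulate keeps its range by shifting the accumulated Q30 products back down by 15. The Q15 format, the function name, and the truncating shift here are this sketch's assumptions, not anything mandated by the architecture:

```c
#include <stdint.h>

/* Illustrative Q15 fixed-point MAC: the product of two Q15 values is
   Q30, so the accumulated sum is shifted right by 15 to return to Q15.
   The format and the truncating shift are this sketch's choices. */
static int16_t mac_q15(const int16_t *a, const int16_t *b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i]; /* Q15 * Q15 -> Q30 */
    return (int16_t)(acc >> 15);              /* correct back to Q15 */
}
```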
Next in Part 3: “Optimization for Control-oriented Operations.”
To read Part 1, go to: …
This series of articles was excerpted from: …
Nigel Paver is an architect and design manager for Wireless MMX technology at Intel.