CMP EMBEDDED.COM


Code techniques for processor pipeline optimization: Part 2
Optimization for data processing operations



Embedded.com
Data-processing operations are at the heart of any multimedia application. Expanding on the discussion in Part 1, this part examines how pipeline delay characteristics affect coding style for a variety of operations: Fast Multiply Operations, Fast Multiply and Accumulation, Double-Word Loading and Storing, Scheduling Load and Store Multiple, and Align and Shift.

Fast Multiply Operations
In the architecture under discussion, there are two sets of multiplication units, one in the Intel XScale microarchitecture and the other in the Intel Wireless MMX instructions. These two sets of multipliers support different levels of precision of data-processing capability.

The XScale microarchitecture supports half-word and word multiplication with results of word and double-word width. Selecting the correct precision for the algorithm under implementation helps reduce the execution time; for example, SMULxy has a latency of one cycle whereas SMULL has a latency of two cycles.

Multiply instructions can cause pipeline stalls due to resource conflicts or result latencies. The following code segment incurs a stall of zero to three cycles depending on the values in registers r1, r2, r4, and r5 due to resource conflicts:

mul     r0, r1, r2
mul     r3, r4, r5 @0-3 stalls

The second multiply operation would stall for three cycles if r1 and r2 did not hold trivial values and the S bit was set. Just as issue latency depends on the values of the operands, the result latency can vary between one and three cycles. In the following example, the mov instruction incurs the result penalty:

mul     r0, r1, r2
mov     r4, r0 @stalls until the previous mul completes

However, if an arithmetic operation follows the multiplication operation, it does not stall as long as no register dependency exists. Multiply instructions should be separated from each other by the worst-case latency, especially if you have no a priori knowledge of the data values.
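As a rough illustration of this advice, the two multiplies from the first example can be pulled apart with independent arithmetic. This is only a sketch: registers r6 through r8 and the three filler instructions are hypothetical stand-ins for whatever useful independent work the surrounding code offers, and three fillers are chosen to cover the worst-case three-cycle stall quoted above.

```asm
mul     r0, r1, r2
add     r6, r6, #1      @ hypothetical independent work, does not touch r0
sub     r7, r7, #1      @ hypothetical independent work
add     r8, r8, #4      @ hypothetical independent work
mul     r3, r4, r5      @ issued three instructions later, past the worst-case stall
```

If the operands are known to be small, fewer fillers may be enough, since the stall depends on the data values.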

ARM instructions can set condition flags so that subsequent instructions can execute conditionally based on those flags. A multiply instruction that sets the condition codes blocks the multiply and arithmetic pipeline, which stalls any subsequent instructions. For instance, in the following example, the add instruction waits three to four cycles for the muls instruction to finish.

muls    r0, r1, r2 @mult that updates flags
add     r3, r3, #1 @stalls until the muls finishes
sub     r4, r4, #1
sub     r5, r5, #1

Thus, it is not efficient to use the multiplication operation to update the flags. The modified code is as follows:

mul     r0, r1, r2
add     r3, r3, #1
sub     r4, r4, #1
sub     r5, r5, #1
cmp     r0, #0

The issue latency of the WMUL and WMADD instructions is one cycle; the result and resource latency are two cycles. The second WMUL instruction in the following example stalls for one cycle due to the two-cycle resource latency.

WMULUM wR0, wR1, wR2
WMULSL wR3, wR4, wR5 @one cycle stall

Hence, two WMUL instructions should be separated by one instruction. The WADD instruction in the following example stalls for one cycle due to the two-cycle result latency.

wmulm  wR0, wR1, wR2
waddhus  wR1, wR0, wR2 @two cycle stall

Thus, any instruction waiting on the result should be separated by two other instructions. However, if the latter instruction is another SIMD-multiplication instruction, then the stall is one cycle despite data dependency.
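The two spacing rules above can be combined in one sequence. The following is a sketch under assumptions rather than measured code: the add on r6 is a hypothetical filler, and the ordering simply follows the guidance in the text (one instruction between two WMULs, two instructions between a WMUL and a non-multiply consumer of its result).

```asm
wmulum   wR0, wR1, wR2
add      r6, r6, #1       @ hypothetical filler separating the two multiplies
wmulsl   wR3, wR4, wR5    @ second multiply, one instruction after the first
waddhus  wR1, wR0, wR2    @ consumer of wR0, two instructions after wmulum
```

Here the second multiply doubles as one of the two instructions that separate wmulum from the waddhus that reads wR0, so only one extra filler is needed.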

Fast Multiply and Accumulation
For DSP and multimedia applications, multiply and accumulate (MAC) is the most commonly used operation. In addition to multipliers, Intel Wireless MMX technology offers accumulation capabilities. In the SIMD coprocessor, any of the registers can be used as an accumulator.

Performing MAC Operations on Registers in Intel XScale Core
A MAC operation can be done using the TMIA (32-bit) and TMIAPH (16-bit) instructions. These instructions take two Intel XScale core registers as operands and write the result of the multiplication and accumulation to any of the coprocessor registers.

The issue latency of the TMIA instruction is one cycle; the result and resource latency are two cycles. The second TMIA instruction in the following example stalls for one cycle due to the two-cycle resource latency.

tmia     wR0, r2, r3
tmia     wR1, r4, r5 @stall 1 cycle

The WADD instruction in the following example stalls for one cycle due to the two-cycle result latency.

tmia     wR0, r2, r3
waddhus  wR1, wR0, wR2
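Both TMIA stalls can be hidden with the same kind of reordering. The sketch below is illustrative only: the add on r6 is a hypothetical filler, and the destination wR3 is an assumption chosen so that the waddhus does not overwrite the second accumulator.

```asm
tmia     wR0, r2, r3
add      r6, r6, #1       @ hypothetical filler covering the two-cycle resource latency
tmia     wR1, r4, r5      @ second MAC, one instruction after the first
waddhus  wR3, wR0, wR2    @ reads wR0 well after the first tmia has produced it
```

Accumulating into two different registers (wR0 and wR1) keeps the two tmia instructions free of data dependencies on each other, so only the resource latency has to be covered.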
