CMP EMBEDDED.COM


Code techniques for processor pipeline optimization: Part 2
Optimization for data processing operations



Embedded.com

Scheduling in the Addition and Logical Pipeline
Most data-processing instructions for Intel XScale microarchitecture technology and Intel Wireless MMX technology, including logical and addition instructions, have a result latency of one cycle. The current instruction can therefore use the result of the previous data-processing instruction without any penalty. For example, a series of additions such as the following runs without any stalls:

waddh     wR4, wR2, wR1
waddh     wR5, wR4, wR1
waddh     wR6, wR2, wR1

The preceding code segment does not incur any stall. The one exception is saturating arithmetic: a saturating operation generates its result one cycle later, so a subsequent instruction that uses the result stalls for one cycle, as in this instance:

waddhss     wR4, wR2, wR1
waddhss     wR5, wR4, wR1 @ single-cycle stall
waddhss     wR6, wR2, wR1

In this example, the second saturating SIMD instruction stalls for one cycle because of its read-after-write dependency on register wR4. The third saturating SIMD instruction does not stall, since it has no data dependency on either of the instructions before it. This code segment can easily be reordered so that there is no stall.

Swapping the second and third WADDHSS instructions is sufficient to remove the stall. The pipeline for the XScale microarchitecture likewise does not stall on back-to-back logical and simple arithmetic operations, which gives many applications high performance.
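
As a sketch of the reordering described above, the same three instructions can be issued with the independent addition moved between the producer and the consumer of wR4. The register assignments are those of the example; only the order changes:

```
waddhss     wR4, wR2, wR1
waddhss     wR6, wR2, wR1 @ independent of wR4, fills the extra cycle
waddhss     wR5, wR4, wR1 @ wR4 is available by now, no stall expected
```

The results are identical because the moved instruction neither reads nor writes wR4 or wR5.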

The core's data-processing instructions can shift an operand by an immediate value as part of an arithmetic operation. This feature saves the extra instruction that an explicit shift would otherwise require.
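
To illustrate the saving, here is a minimal sketch that computes r1 + (r2 << 2) both ways. The register choices are arbitrary and used only for illustration; r3 serves as a scratch register in the first form:

```
@ Explicit shift: two instructions and a scratch register
mov      r3, r2, LSL #2
add      r0, r1, r3

@ Shift folded into the add: one instruction, no scratch register
add      r0, r1, r2, LSL #2
```

Both forms leave the same value in r0.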

You need to be mindful of a subtle constraint posed by this feature: if the current instruction takes the result of the previous data-processing instruction as the operand to be shifted by an immediate, the result latency is two cycles. As a result, the following code segment incurs a one-cycle stall for the MOV instruction:

sub      r6, r7, r8
add      r1, r2, r3
mov      r4, r1, LSL #2    @ one-cycle stall: r1 comes from the previous add

The following code removes the one-cycle stall:

add      r1, r2, r3
sub      r6, r7, r8        @ independent instruction fills the slot
mov      r4, r1, LSL #2    @ r1 is ready, no stall

Similarly, you can use a register to specify the shift or rotate amount for an operand. This instruction option can be very effective if the shift amount is not known beforehand; however, a longer latency is involved.

All data-processing instructions incur a two-cycle issue penalty and a two-cycle result penalty when the shifter operand is shifted or rotated by an amount held in a register. For instance, in the following code sequence, the sub incurs a two-cycle stall because the preceding add instruction uses a register to specify the shift amount.

mov     r3, #10
mul     r4, r2, r3
add     r5, r6, r2, LSL r3
sub     r7, r8, r4 @ Stalls for two cycles
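
When the shift amount is in fact a constant known at assembly time, one possible way to sidestep the register-shift penalty is to encode the shift as an immediate. The following is only a sketch, under the assumption that r3 always holds 10 at this point, as it does in the sequence above:

```
mov     r3, #10
mul     r4, r2, r3
add     r5, r6, r2, LSL #10 @ immediate shift instead of LSL r3
sub     r7, r8, r4
```

This change targets only the penalty associated with the register-specified shift. The sub still reads r4 from the mul, so any latency from the multiply is unaffected. If the shift amount is genuinely variable, the register form remains necessary.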
