Code techniques for processor pipeline optimization: Part 2
Optimization for data processing operations



Embedded.com

Getting Data from Cache to Register and Back Efficiently
Cache memory lets a program take advantage of data locality. Even when a data segment is in the cache, however, the data must still be loaded into registers before it can be processed. For critical data processing kernels, both data cache usage and the movement of data between cache and registers should be optimized.

Knowing the Load-to-Use Penalty. A larger number of pipeline stages and greater design complexity give rise to non-unity load latency. Any word, byte, or half-word load has a result latency of three cycles if the load hits the cache.

Thus, a load immediately followed by a use of the loaded value should be avoided. When the load misses the cache, the latency can be much higher. An example of the load-to-use stall follows:

wldrw      wR0, [r3],#4
waddw     wR8, wR0, wR8 @stalls for 2 cycles
wldrw      wR0, [r3],#4
waddw     wR8, wR0, wR8 @stalls for 2 cycles
wldrw      wR0, [r3],#4
waddw     wR8, wR0, wR8 @stalls for 2 cycles

Here, the stall of six cycles can be easily reduced by scheduling other instructions in the shadow of the load. A modification of the preceding code follows:

wldrw     wR0, [r3],#4
wldrw     wR1, [r3],#4
wldrw     wR2, [r3],#4
waddw    wR8, wR0, wR8 @ no stall
waddw    wR8, wR1, wR8 @ no stall
waddw    wR8, wR2, wR8 @ no stall

Note that the modified code segment uses a different destination register for each load. This modification is known as register rotation. The technique hides cache access latency and uses the multiple-load buffering capability offered by the XScale microarchitecture. It applies to all other load operations as well: those of different sizes and also those in the coprocessor space.
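The stall counts quoted above can be checked with a toy timing model. The sketch below is an illustration only, not a cycle-accurate XScale simulator: it assumes a strictly in-order, single-issue pipeline, a load result latency of three cycles (from the text), and a result latency of one cycle for waddw (an assumption made here). It also ignores the base-register update on r3. The names `Insn`, `count_stalls`, `stalls_back_to_back`, and `stalls_rotated` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy in-order, single-issue timing model (illustrative only). */
#define NUM_REGS 16
#define NO_REG   (-1)

typedef struct {
    int dest;     /* register written, or NO_REG */
    int src[2];   /* registers read, or NO_REG */
    int latency;  /* cycles from issue until dest can be read */
} Insn;

/* Returns the total number of stall cycles for the sequence. */
int count_stalls(const Insn *code, size_t n)
{
    int ready[NUM_REGS] = {0}; /* cycle at which each register is readable */
    int cycle = 0;             /* earliest cycle the next insn could issue */
    int stalls = 0;

    for (size_t i = 0; i < n; i++) {
        int issue = cycle;
        for (int s = 0; s < 2; s++) {
            int r = code[i].src[s];
            if (r != NO_REG) {
                assert(r >= 0 && r < NUM_REGS);
                if (ready[r] > issue)
                    issue = ready[r]; /* wait for the operand */
            }
        }
        stalls += issue - cycle;
        if (code[i].dest != NO_REG) {
            assert(code[i].dest >= 0 && code[i].dest < NUM_REGS);
            ready[code[i].dest] = issue + code[i].latency;
        }
        cycle = issue + 1;
    }
    return stalls;
}

/* Register indices: wR0 = 0, wR1 = 1, wR2 = 2, wR8 = 8.
 * Assumed latencies: load = 3 (from the text), waddw = 1 (assumption). */
#define LOAD(d)   { (d), { NO_REG, NO_REG }, 3 }
#define ADD8(s)   { 8,   { (s), 8 },         1 }

/* Load, use, load, use, load, use: every add waits on its load. */
int stalls_back_to_back(void)
{
    const Insn code[] = { LOAD(0), ADD8(0), LOAD(0), ADD8(0), LOAD(0), ADD8(0) };
    return count_stalls(code, sizeof code / sizeof code[0]);
}

/* Three loads into rotated registers, then three adds. */
int stalls_rotated(void)
{
    const Insn code[] = { LOAD(0), LOAD(1), LOAD(2), ADD8(0), ADD8(1), ADD8(2) };
    return count_stalls(code, sizeof code / sizeof code[0]);
}
```

Under these assumptions the first sequence accumulates two stall cycles per load-use pair, and the rotated sequence accumulates none, because each add issues in the cycle its operand becomes available.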

Double-Word Loading and Storing
The XScale microarchitecture supports double-word loads and stores to and from a pair of 32-bit core registers that starts at an even-numbered register. Intel Wireless MMX technology supports load-and-store operations on 64-bit registers.

When the LDRD instruction is used to load a pair of core registers, it has a result latency of three or four cycles depending on the destination register being accessed, assuming the data being loaded is in the data cache. When WLDRD is used to load a 64-bit coprocessor register, the latency is four cycles.

@ load double using Intel XScale core
ldrd     r0, [r3]
orr      r8, r1, #0xf @stalls for 4 cycles
mul     r7, r0, r7

@ Another example
ldrd     r0, [r3]
orr      r8, r0, #0xf @stalls for 3 cycles
mul     r7, r1, r7

@ Load using Intel Wireless MMX technology
wldrd      wR0, [r3]
waddw     wR1, wR0, wR2 @stalls for 4 cycles

Any memory instruction that immediately follows a load double instruction has a resource hazard of one cycle, as shown in the next example:

@ str instruction below will stall for 1 cycle
ldrd r0, [r3]
str r4, [r5] @ stalls for 1 cycle

@ For Intel Wireless MMX technology
wldrd     wR3,[r4],#8
wldrd     wR5,[r4],#8         @ STALL 1 cycle
wldrd     wR4,[r4],#8         @ STALL 1 cycle
waddb    wR0,wR1,wR2
waddb    wR0,wR0,wR6
waddb   wR0,wR0,wR7

The coprocessor supporting Wireless MMX technology can buffer up to two double-word loads at a time, or up to four word, half-word, or byte loads.

The overhead of issuing load transactions can be minimized by instruction scheduling and load pipelining. In most cases, interleaving other operations to avoid the penalty of back-to-back LDRD or WLDRD instructions is straightforward. In the following code sequence, three WLDRD instructions are issued back-to-back, incurring a stall on the second and third instructions.

wldrd     wR3,[r4],#8
wldrd     wR5,[r4],#8 @ STALL
wldrd     wR4,[r4],#8 @ STALL
waddb    wR0,wR1,wR2
waddb    wR0,wR0,wR6
waddb    wR0,wR0,wR7

The same code sequence is reorganized to avoid a back-to-back issue of WLDRD instructions.

wldrd     wR3,[r4],#8
waddb    wR0,wR1,wR2
wldrd     wR4,[r4],#8
waddb    wR0,wR0,wR6
wldrd      wR5,[r4],#8
waddb     wR0,wR0,wR7
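The effect of the reordering can be counted with a very small model. The sketch below encodes only the rule stated earlier, that a memory instruction issued directly after a double-word load stalls for one cycle, and deliberately ignores load-to-use latency and buffer occupancy. The type `OpKind` and the function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Coarse classification of an instruction for the hazard rule. */
typedef enum {
    OP_ALU,        /* non-memory instruction, e.g. waddb */
    OP_MEM,        /* word, half-word, or byte load/store */
    OP_MEM_DOUBLE  /* double-word load, e.g. ldrd or wldrd */
} OpKind;

/* Counts one stall cycle for each memory instruction that is issued
 * immediately after a double-word load. Nothing else is modeled. */
int count_double_load_hazards(const OpKind *ops, size_t n)
{
    assert(ops != NULL || n == 0);
    int stalls = 0;
    for (size_t i = 1; i < n; i++) {
        int follows_double = (ops[i - 1] == OP_MEM_DOUBLE);
        int is_memory = (ops[i] == OP_MEM || ops[i] == OP_MEM_DOUBLE);
        if (follows_double && is_memory)
            stalls++;
    }
    return stalls;
}

/* wldrd, wldrd, wldrd, waddb, waddb, waddb */
int hazards_back_to_back(void)
{
    const OpKind ops[] = { OP_MEM_DOUBLE, OP_MEM_DOUBLE, OP_MEM_DOUBLE,
                           OP_ALU, OP_ALU, OP_ALU };
    return count_double_load_hazards(ops, sizeof ops / sizeof ops[0]);
}

/* wldrd, waddb, wldrd, waddb, wldrd, waddb */
int hazards_interleaved(void)
{
    const OpKind ops[] = { OP_MEM_DOUBLE, OP_ALU, OP_MEM_DOUBLE,
                           OP_ALU, OP_MEM_DOUBLE, OP_ALU };
    return count_double_load_hazards(ops, sizeof ops / sizeof ops[0]);
}
```

With this rule alone, the back-to-back ordering is charged two stall cycles (on the second and third loads) and the interleaved ordering none, which matches the annotations in the listings.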

Always try to separate runs of three or more WLDRD instructions so that only two are outstanding at any one time and the loads are always interleaved with other instructions:

wldrd wR0,[r2],#8
wzero wR15
wldrd wR1,[r4],#8
subs r3,r3,#8
wldrd wR3,[r4],#8

Always try to interleave additional operations between the load instruction and the first instruction that uses the loaded data:

wldrd     wR0,[r2],#8
wzero     wR15
wldrd     wR1,[r4],#8
subs       r3,r3,#8
wldrd     wR3,[r4],#8
wmacs    wR15,wR1,wR0
subs        r4,r4,#1
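The same idea can be expressed at the C source level by issuing the loads for the next iteration before consuming the values loaded for the current one. The sketch below is a hypothetical analogue, not a translation of the listing above: `dot_i16_pipelined` and `dot_example` are names invented here, and a compiler is free to reschedule the loads, so the generated code should be inspected before relying on the ordering. The accumulator is 32 bits wide and is assumed not to overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of two int16_t arrays with the loads for element i
 * placed ahead of the multiply-accumulate for element i - 1. */
int32_t dot_i16_pipelined(const int16_t *a, const int16_t *b, size_t n)
{
    assert((a != NULL && b != NULL) || n == 0);
    if (n == 0)
        return 0;

    int32_t acc = 0;
    int16_t a_cur = a[0];  /* prologue: first pair of loads */
    int16_t b_cur = b[0];

    for (size_t i = 1; i < n; i++) {
        int16_t a_next = a[i];          /* loads for the next step ... */
        int16_t b_next = b[i];
        acc += (int32_t)a_cur * b_cur;  /* ... overlap with this use */
        a_cur = a_next;
        b_cur = b_next;
    }

    acc += (int32_t)a_cur * b_cur;      /* epilogue: last pair */
    return acc;
}

/* Small usage example: 1*5 + 2*6 + 3*7 + 4*8. */
int32_t dot_example(void)
{
    const int16_t a[] = { 1, 2, 3, 4 };
    const int16_t b[] = { 5, 6, 7, 8 };
    return dot_i16_pipelined(a, b, sizeof a / sizeof a[0]);
}
```

Splitting the loop into a prologue, a steady-state body, and an epilogue is what keeps each loaded value at least one operation away from its first use.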

Similarly, WSTRD and STRD store data from coprocessor registers and from core register pairs. As with WLDRD and LDRD, any memory operation that follows a double-word store instruction incurs a stall.
