By Nigel Paver, Bradley Aldrich and Moinul Khan, Intel Corp.
Getting Data from Cache to Register and Back Efficiently
Cache memory lets a program take advantage of data locality. Even when
a data segment is already in the cache, the data must still be loaded
into registers before any data processing operation can use it. For
critical data processing kernels, both data cache usage and register
movement should be optimized.
Knowing the Load-to-Use Penalty. An increased number of pipeline
stages and increased design complexity give rise to non-unity load
latency. Any load operation of word, byte, or half-word size has a
result latency of three cycles if the load hits in the cache.
Thus, a load immediately followed by a use of the loaded data should
be avoided. When the load misses the cache, the latency can be much
higher. An example of the load-to-use stall follows:
wldrw wR0, [r3],#4
waddw wR8, wR0, wR8 @ stalls for 2 cycles
wldrw wR0, [r3],#4
waddw wR8, wR0, wR8 @ stalls for 2 cycles
wldrw wR0, [r3],#4
waddw wR8, wR0, wR8 @ stalls for 2 cycles
Here, the stall of six cycles can be easily reduced by scheduling
other instructions in the shadow of the load. A modification of the
preceding code follows:
wldrw wR0, [r3],#4
wldrw wR1, [r3],#4
wldrw wR2, [r3],#4
waddw wR8, wR0, wR8 @ no stall
waddw wR8, wR1, wR8 @ no stall
waddw wR8, wR2, wR8 @ no stall
Note that the modified code segment uses multiple registers as load
targets. This modification is known as register rotation. The
technique hides cache access latency and uses the multiple-load
buffering capability offered by the XScale microarchitecture. It
applies to all other load operations as well: those of different
sizes and those in the coprocessor space.
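The effect of register rotation can be checked with a rough cycle model. The Python sketch below is an illustration only, not a model of the real XScale pipeline. It assumes in-order issue of one instruction per cycle, a three-cycle result latency for loads that hit the cache (the figure quoted above), and a one-cycle result latency for every other operation. It ignores the post-increment update of the base register. The name count_stalls and the tuple encoding of instructions are inventions for this example.

```python
def count_stalls(program, load_latency=3):
    """Return total stall cycles for a list of (mnemonic, dest, sources)."""
    ready = {}       # register name -> first cycle its value can be consumed
    next_issue = 0   # earliest cycle at which the next instruction can issue
    stalls = 0
    for mnemonic, dest, sources in program:
        # An instruction waits until every source register is ready.
        issue = max([next_issue] + [ready.get(reg, 0) for reg in sources])
        stalls += issue - next_issue
        # Assumption: loads take load_latency cycles, other operations one.
        latency = load_latency if mnemonic.startswith("wldr") else 1
        ready[dest] = issue + latency
        next_issue = issue + 1
    return stalls


# Original sequence: each waddw consumes wR0 right after it is loaded.
back_to_back = [
    ("wldrw", "wR0", ["r3"]),
    ("waddw", "wR8", ["wR0", "wR8"]),
] * 3

# Rotated sequence: three loads into wR0..wR2, then the three adds.
rotated = [
    ("wldrw", "wR0", ["r3"]),
    ("wldrw", "wR1", ["r3"]),
    ("wldrw", "wR2", ["r3"]),
    ("waddw", "wR8", ["wR0", "wR8"]),
    ("waddw", "wR8", ["wR1", "wR8"]),
    ("waddw", "wR8", ["wR2", "wR8"]),
]

print(count_stalls(back_to_back))  # 6
print(count_stalls(rotated))       # 0
```

Under these assumptions the model reproduces the six stall cycles of the first listing and none for the rotated one.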
Double-Word Loading and Storing
The XScale microarchitecture supports double-word loads and stores to
and from a pair of 32-bit registers on an even boundary. Intel Wireless
MMX technology supports load and store operations on 64-bit registers.
When the LDRD instruction is used to load a pair of core registers,
it has a result latency of three or four cycles depending on the
destination register being accessed, assuming the data being loaded is
in the data cache. When WLDRD is used to load a 64-bit coprocessor
register, the latency is four cycles.
@ Load double using Intel XScale core
ldrd r0, [r3]
orr r8, r1, #0xf @ stalls for 4 cycles
mul r7, r0, r7
@ Another example
ldrd r0, [r3]
orr r8, r0, #0xf @ stalls for 3 cycles
mul r7, r1, r7
@ Load using Intel Wireless MMX technology
wldrd wR0, [r3]
waddw wR1, wR0, wR2 @ stalls for 4 cycles
Any memory instruction that immediately follows a load double
instruction incurs a resource hazard of one cycle, as shown in the
next example:
@ str instruction below will stall for 1 cycle
ldrd r0, [r3]
str r4, [r5] @ 1 cycle
@ For Intel Wireless MMX technology
wldrd wR3,[r4],#8
wldrd wR5,[r4],#8 @ STALL 1 cycle
wldrd wR4,[r4],#8 @ STALL 1 cycle
waddb wR0,wR1,wR2
waddb wR0,wR0,wR6
waddb wR0,wR0,wR7
The coprocessor supporting Wireless MMX technology can buffer up to
two incoming double-word loads at a time, or up to four word, byte,
or half-word loads.
The overhead of issuing load transactions can be minimized by
instruction scheduling and load pipelining. In most cases, it is
straightforward to interleave other operations to avoid the penalty
of back-to-back LDRD instructions. In the following code sequence,
three WLDRD instructions are issued back-to-back, incurring a stall on
the second and third instructions.
wldrd wR3,[r4],#8
wldrd wR5,[r4],#8 @ STALL
wldrd wR4,[r4],#8 @ STALL
waddb wR0,wR1,wR2
waddb wR0,wR0,wR6
waddb wR0,wR0,wR7
The same code sequence is reorganized to avoid a back-to-back issue
of WLDRD instructions.
wldrd wR3,[r4],#8
waddb wR0,wR1,wR2
wldrd wR4,[r4],#8
waddb wR0,wR0,wR6
wldrd wR5,[r4],#8
waddb wR0,wR0,wR7
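The one-cycle penalty after a double-word load lends itself to a simple check over an instruction list. The Python sketch below is a hedged illustration: it encodes only the rule stated in this section, that a memory instruction issued directly after a double-word load stalls for one cycle, and it ignores every other source of stalls. The helper name and the set of mnemonics treated as memory instructions are assumptions made for the example.

```python
DOUBLE_LOADS = {"ldrd", "wldrd"}
# Assumption: an illustrative, not exhaustive, set of memory mnemonics.
MEMORY_OPS = DOUBLE_LOADS | {"ldr", "str", "strd", "wldrw", "wstrw", "wstrd"}


def resource_hazard_stalls(mnemonics):
    """Count memory instructions that directly follow a double-word load."""
    stalls = 0
    for previous, current in zip(mnemonics, mnemonics[1:]):
        if previous in DOUBLE_LOADS and current in MEMORY_OPS:
            stalls += 1
    return stalls


back_to_back = ["wldrd", "wldrd", "wldrd", "waddb", "waddb", "waddb"]
interleaved = ["wldrd", "waddb", "wldrd", "waddb", "wldrd", "waddb"]

print(resource_hazard_stalls(back_to_back))     # 2
print(resource_hazard_stalls(interleaved))      # 0
print(resource_hazard_stalls(["ldrd", "str"]))  # 1
```

A check like this could be run over a kernel's inner loop to flag adjacent memory instructions that are worth separating.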
Always try to separate multiple WLDRD instructions so that only two
are outstanding at any one time and the loads are always interleaved
with other instructions:
wldrd wR0,[r2],#8
wzero wR15
wldrd wR1,[r4],#8
subs r3,r3,#8
wldrd wR3,[r4],#8
Always try to interleave additional operations between the load
instruction and the first instruction that uses the loaded data:
wldrd wR0,[r2],#8
wzero wR15
wldrd wR1,[r4],#8
subs r3,r3,#8
wldrd wR3,[r4],#8
wmacs wR15,wR1,wR0
subs r4,r4,#1
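One way to audit a hand-scheduled sequence is to count how many instructions sit between each load and the first instruction that reads its destination register. The Python sketch below does this for the sequence above. The function name and the tuple encoding are made up for illustration, and the result is only a distance in instructions, not a cycle-accurate stall prediction.

```python
def load_to_use_gaps(program):
    """Map each load destination to the number of instructions between
    the load and the first later instruction that reads that register.
    Loads whose result is never read map to None."""
    gaps = {}
    for index, (mnemonic, dest, _sources) in enumerate(program):
        if not mnemonic.startswith("wldr"):
            continue
        gaps[dest] = None
        for gap, (_m, later_dest, sources) in enumerate(program[index + 1:]):
            if dest in sources:
                gaps[dest] = gap
                break
            if later_dest == dest:
                break  # overwritten before any instruction read it
    return gaps


# Encoding of the listing above as (mnemonic, dest, sources).
# Base-register updates from post-increment addressing are not modeled.
sequence = [
    ("wldrd", "wR0", ["r2"]),
    ("wzero", "wR15", []),
    ("wldrd", "wR1", ["r4"]),
    ("subs", "r3", ["r3"]),
    ("wldrd", "wR3", ["r4"]),
    ("wmacs", "wR15", ["wR1", "wR0", "wR15"]),
    ("subs", "r4", ["r4"]),
]

print(load_to_use_gaps(sequence))  # {'wR0': 4, 'wR1': 2, 'wR3': None}
```

A small gap on a frequently executed load is a hint that more independent work could be moved between the load and its consumer.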
Similarly, WSTRD and STRD store data from coprocessor registers and
from core register pairs. Like WLDRD and LDRD, the double-word store
instructions also impose a stall on any memory operation that
immediately follows them.