By Nigel Paver, Bradley Aldrich and Moinul Khan, Intel Corp.
Scheduling Load and Store Multiple (LDM/STM)
Load multiple (LDM) and store multiple (STM) are two instructions that
load or store a set of core registers in a single operation. They are
often used for saving and restoring the state of the processor.
LDM and STM instructions have an issue latency of 2 to 20 cycles,
depending on the number of registers being loaded or stored. The issue
latency is typically two cycles plus an additional cycle for each of
the registers loaded or stored, assuming a data cache hit.
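The rule of thumb above can be written as a one-line estimate. This is only a sketch of the arithmetic in the preceding paragraph: the function name is hypothetical, and the result holds only under the stated data cache hit assumption.

```c
/* Hypothetical helper (not part of any Intel API): estimated issue
 * latency, in cycles, of one LDM or STM that transfers `nregs` core
 * registers. Applies the "two cycles plus one per register" rule and
 * assumes every access hits the data cache. */
unsigned ldm_stm_issue_cycles(unsigned nregs)
{
    return 2u + nregs;
}
```

For example, an STM that saves four registers would be estimated at six issue cycles.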
The instruction following an LDM stalls whether or not it depends on
the results of the load. While LDM and STM are useful to ease code
development, they have two drawbacks: they add a two-cycle issue
latency overhead, and they cannot be used to load or store the
Wireless MMX technology registers.
Optimizing Align and Shift
The auxiliary registers are designed to hold constants that are
invariant across the lifetime of an inner loop calculation. For this
reason, values loaded into the auxiliary registers are not forwarded to
data operations.
The intended use of the registers is that the shift or alignment
offset is loaded into a wCGRn register before the main loop is entered,
and then the shift or alignment offset is used repeatedly inside the
loop without change.
If the value in a wCGRn register is changed and an instruction
immediately afterward tries to use the loaded value, then the
coprocessor stalls until the loaded value has reached the control
register file.
For most kernels, the alignment values and shift amount values do
not change during the execution of the kernel. For example, consider an
algorithm that accesses a large data array where each element has
16-bit accuracy and has been stored in a packed fashion in the memory.
Using Wireless MMX technology, four elements of this data array can
be processed concurrently. If the data structure is aligned at a 64-bit
boundary, Intel Wireless MMX technology can access the data by a simple
WLDRD instruction. For instance:
wldrd wR0, [r1], #8
.. use wR0 now
..
However, if the data is not aligned to a 64-bit boundary, it will be
necessary to perform alignment. In the unaligned case, the data
segment can be offset from a 64-bit boundary by one to seven bytes.
The three least significant bits of the pointer's address give the
exact offset. Be aware that the misalignment for successive double words
does not change throughout the array. You can keep the misalignment
constant stored in a control register and perform alignment on
successive accesses.
bic r1, r2, #7 @ r1 gets aligned address
eor r0, r2, r1 @ r0 now contains misalignment
tmcr wCGR0, r0 @ wCGR0 now gets misalignment
wldrd wR0, [r1], #8
wldrd wR1, [r1], #8
..
..
walignr0 wR2, wR0, wR1 @ align using the offset held in wCGR0
.. use wR2 now..
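To make the data flow of the listing easier to follow, here is a minimal C model of the same steps. It is a sketch rather than a description of the hardware: the function names are hypothetical, it assumes little-endian byte order, and it assumes the align step takes its low bytes from the double word loaded from the lower address.

```c
#include <stdint.h>

/* Model of the bic step: clear the low three address bits to get the
 * 64-bit-aligned base address. */
uintptr_t aligned_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)7;
}

/* Model of the misalignment computation: the byte offset (0..7) of the
 * address from its aligned base. */
unsigned misalignment(uintptr_t addr)
{
    return (unsigned)(addr & 7u);
}

/* Model of the align step: return the 8 consecutive bytes that start
 * `offset` bytes into the 16-byte value formed by `lo` (lower address)
 * followed by `hi` (next double word). */
uint64_t align_model(uint64_t lo, uint64_t hi, unsigned offset)
{
    offset &= 7u;
    if (offset == 0)
        return lo; /* a 64-bit shift by 64 would be undefined in C */
    return (lo >> (8u * offset)) | (hi << (64u - 8u * offset));
}
```

Because the offset is the same for every double word in the array, `misalignment` needs to be evaluated only once, before the loop, which mirrors loading the control register once in the listing.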
Similarly, control registers can be used to determine a shift
amount. Some algorithms require a certain level of accuracy (range and
precision) during the computation.
Following any multiplication or accumulation, you need to use a
right shift of the resultant value. This correction can be maintained
easily by using a control register-based shift operation.
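The multiply-then-shift correction can be illustrated with a scalar C sketch. The function name, the parameters and the Q15 format used in the example are assumptions made for illustration; the point is only that the shift amount is fixed before the loop and reused unchanged for every element, as a control register value would be.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model: multiply each 16-bit sample by `coeff`,
 * then shift the 32-bit product right by `shift` bits to return to the
 * working fixed-point format. `shift` is loop-invariant. No saturation
 * or rounding is applied. Right-shifting a negative value is
 * implementation-defined in C; common compilers shift arithmetically. */
void scale_samples(const int16_t *in, int16_t *out, size_t n,
                   int16_t coeff, unsigned shift)
{
    for (size_t i = 0; i < n; i++) {
        int32_t product = (int32_t)in[i] * coeff;
        out[i] = (int16_t)(product >> shift);
    }
}
```

With Q15 operands, for instance, a coefficient of 16384 represents 0.5 and a shift of 15 brings the product back to Q15, so an input of 16384 yields 8192.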
Next in Part 3: "Optimization for Control-oriented operations."
To read Part 1, go to "Microarchitectural optimization philosophy."
This series of articles was excerpted from "Programming with
Intel Wireless MMX Technology," by Nigel Paver, Bradley Aldrich and
Moinul Khan. Copyright © 2004 Intel Corporation. All rights
reserved.
Nigel Paver is an architect and
design manager for Wireless MMX technology at Intel
Corporation. Bradley Aldrich is a leading authority at Intel
Corporation on image and video processing. Moinul Khan is a multimedia
architect at Intel Corporation.