By Nigel Paver, Bradley Aldrich and Moinul Khan, Intel Corp.
Optimizing Complex Expressions
Using Conditional Execution
Conditional instructions help improve the code generated for complex
expressions, such as those that rely on C short-circuit evaluation.
Used this way, they improve performance by reducing the number of
branches, which in turn reduces the penalties caused by branch
mispredictions.
int foo(int a, int b) {
    if (a != 0 && b != 0)
        return 0;
    else
        return 1;
}
The optimized code for the if condition is:
cmp   r0, #0
cmpne r1, #0
This approach also reduces the utilization of branch prediction
resources. With Wireless MMX technology, the flag registers can be set
based on data values in the coprocessor registers or SIMD flag
registers.
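As a rough C-level illustration of the same idea (not the compiler's actual output), the two tests can be combined so that no branch separates them. After the cmp/cmpne pair, the NE condition holds only when both values are nonzero; the sketch below models that single combined result. The function names are made up for this example.

```c
/* Hypothetical names; a sketch of the idea, not compiler output. */

/* Short-circuit form: conceptually one branch per comparison. */
int foo_branching(int a, int b) {
    if (a != 0 && b != 0)
        return 0;
    else
        return 1;
}

/* Branch-free form: both comparisons are evaluated and combined with a
 * bitwise AND, mirroring the single NE result left by cmp/cmpne. */
int foo_branchless(int a, int b) {
    int both_nonzero = (a != 0) & (b != 0);
    return !both_nonzero;
}
```

Both functions return the same value for every input; whether a given compiler emits conditional instructions for either form depends on the compiler and its options.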
Use Addressing Modes Efficiently
XScale and Wireless MMX provide a variety of addressing modes that make
indexing an array of objects highly efficient. The following code
samples illustrate how various kinds of array operations can be
optimized to make use of these addressing modes:
@ Set the contents of the word pointed to by r0 to the
@ low word of wR1, then make r0 point to the next word
wstrw wR1, [r0], #4

@ Increment r0 to make it point to the next word, then set
@ the contents of that word to the low word of wR1
wstrw wR1, [r0, #4]!

@ Set the contents of the word pointed to by r0 to the
@ low word of wR1, then make r0 point to the previous word
wstrw wR1, [r0], #-4

@ Decrement r0 to make it point to the previous word, then
@ set the contents of that word to the low word of wR1
wstrw wR1, [r0, #-4]!
These addressing modes save you from spending a separate instruction
on updating the pointer.
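In C, the four store patterns correspond to the familiar pointer idioms sketched here. This is only an illustration under the assumption of 32-bit word stores; the helper names are hypothetical, and whether a compiler maps each idiom onto the matching addressing mode depends on the compiler.

```c
#include <stdint.h>

/* Hypothetical helpers. Each stores v and returns the updated pointer. */

/* Store, then advance: like "[r0], #4". */
uint32_t *store_post_inc(uint32_t *p, uint32_t v) { *p++ = v; return p; }

/* Advance, then store: like "[r0, #4]!". */
uint32_t *store_pre_inc(uint32_t *p, uint32_t v) { *++p = v; return p; }

/* Store, then step back: like "[r0], #-4". */
uint32_t *store_post_dec(uint32_t *p, uint32_t v) { *p-- = v; return p; }

/* Step back, then store: like "[r0, #-4]!". */
uint32_t *store_pre_dec(uint32_t *p, uint32_t v) { *--p = v; return p; }
```

Writing loops so that the pointer update is part of the access, rather than a separate index calculation, gives the compiler the chance to fold the update into the load or store.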
Miscellaneous Approaches
Apart from the techniques mentioned earlier, you might consider a few
tricks that make less obvious use of the instructions. Consider the
following two cases.
Optimizing the Use of Immediate Values. Programs often need constant
values, for example as masks or as known coefficients in calculations.
The MOV or MVN instruction should be used when loading an immediate,
or constant, value into a register. However, the immediate operand is
encoded in a 12-bit field (an 8-bit value rotated right by an even
number of bits), so only some constants can be expressed this way.
Other 32-bit or 64-bit constants are typically loaded from memory.
The compiler typically places all the constants in a literal pool
close to the instructions. Literal pools are not likely to be in the
data cache, so loading a constant can cost a main memory access. The
LDR instruction also has the potential to pollute the data cache.
Many constant values can instead be generated in a register using a
combination of MOV, MVN, ORR, BIC, and ADD instructions, as these code
samples show.
@ Set the value of r0 to 127
mov r0, #127
@ Set the value of r0 to 0xfffffefb
mvn r0, #260
@ Set the value of r0 to 257
mov r0, #1
orr r0, r0, #256
@ Set the value of r0 to 0x51f
mov r0, #0x1f
orr r0, r0, #0x500
@ Set the value of r0 to 0xf100ffff
mvn r0, #0x00ff0000          @ r0 = 0xff00ffff
bic r0, r0, #0x0e000000      @ clear bits 25-27
@ Set the value of r0 to 0x12341234
mov r0, #0x234               @ 0x8d shifted left by 2
orr r0, r0, #0x1000          @ 0x1 shifted left by 12
add r0, r0, r0, LSL #16      @ shifter delay of 1 cycle
It is possible to load any 32-bit value into a register using a
sequence of at most four instructions. With Wireless MMX technology,
two such 32-bit values can be generated in core registers and then
transferred to coprocessor registers using the TMCR, TMCRR, and TBCST
instructions.
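To decide whether a constant needs a multi-instruction sequence at all, it helps to test whether it fits a single immediate. The 12-bit immediate field of an ARM data-processing instruction holds an 8-bit value plus a 4-bit rotation, so an encodable constant is an 8-bit value rotated right by an even amount. The sketch below checks that property in C; the function names are hypothetical and are not part of any toolchain.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (0 <= n < 32). */
static uint32_t rotl32(uint32_t v, unsigned n) {
    return n ? (v << n) | (v >> (32 - n)) : v;
}

/* Returns 1 if v is an 8-bit value rotated right by an even amount,
 * that is, if it fits a single data-processing immediate. */
int arm_imm_encodable(uint32_t v) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Undo a rotate-right by 'rot' and look for an 8-bit result. */
        if (rotl32(v, rot) <= 0xFFu)
            return 1;
    }
    return 0;
}

/* Returns 1 if v can be produced by one MOV (v itself is encodable)
 * or one MVN (the bitwise complement of v is encodable). */
int arm_single_move(uint32_t v) {
    return arm_imm_encodable(v) || arm_imm_encodable(~v);
}
```

For example, 127 and 0xfffffefb pass `arm_single_move`, while 257 and 0x12341234 do not, which is why those values are built from two or three instructions.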
Bit Field Manipulation. Encryption algorithms such as the Data
Encryption Standard (DES) and Triple DES (T-DES), and hashing functions
such as SHA, perform many bit-manipulation operations.
The shift and logical operations of the XScale provide a useful way
of manipulating bit fields. Bit field operations can be optimized using
regular instructions:
@ Set the bit number specified by r1 in register r0
mov r2, #1
orr r0, r0, r2, asl r1

@ Clear the bit number specified by r1 in register r0
mov r2, #1
bic r0, r0, r2, asl r1

@ Extract the bit value of the bit number specified by r1
@ of the value in r0, storing the value in r0
mov r1, r0, asr r1
and r0, r1, #1

@ Extract the higher order 8 bits of the value in r0,
@ storing the result in r1
mov r1, r0, lsr #24
This approach also helps other applications such as video stream
parsing. Wireless MMX supports 64-bit-wide bit-wise operations (for
instance, shift, AND, and OR), which can be used effectively in
bit-wise algorithms.
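For code written in C rather than assembly, the same four bit-field operations can be expressed with shifts and masks, as in this minimal sketch. The helper names are hypothetical, and each helper assumes a bit number between 0 and 31.

```c
#include <stdint.h>

/* Hypothetical helpers; each assumes 0 <= bit <= 31. */

/* Set the given bit (mov #1 followed by orr with a shifted operand). */
uint32_t set_bit(uint32_t value, unsigned bit) {
    return value | (UINT32_C(1) << bit);
}

/* Clear the given bit (mov #1 followed by bic with a shifted operand). */
uint32_t clear_bit(uint32_t value, unsigned bit) {
    return value & ~(UINT32_C(1) << bit);
}

/* Extract the given bit as 0 or 1 (shift right, then and with 1). */
uint32_t extract_bit(uint32_t value, unsigned bit) {
    return (value >> bit) & 1u;
}

/* Extract the highest-order 8 bits (logical shift right by 24). */
uint32_t high_byte(uint32_t value) {
    return value >> 24;
}
```

Each helper is a shift combined with one logical operation, which is the shape that a shifted second operand can absorb into a single instruction.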
Conclusion
The methods described in this series of articles are intended for
assembly language development but can also be applied during
development using intrinsic functions and in-line assembly.
High-level language programming styles based on these techniques
have also been presented. These programming styles demonstrate how best
to use different instructions and, more specifically, how the sequence
of instructions should be scheduled to reduce stalls. However, the list
of methods described here is not exhaustive.
Finally, a few points to remember are:
1) Use the correct precision for the algorithm, and choose instructions accordingly.
2) Interleave instructions between the pipelines to hide result and issue latency.
3) Schedule loads and stores with the correct data-addressing mode.
4) Watch out for load-to-use penalty and shifter-processing latency.
5) Count down on loops to reduce loop control overhead.
6) Use conditional instructions to avoid branch costs.
To read Part 1, go to "Microarchitectural optimization philosophy."
To read Part 2, go to "Optimization for data processing-oriented
operations."
This series of articles was excerpted from "Programming with
Intel Wireless MMX Technology," by Nigel Paver, Bradley Aldrich and
Moinul Khan. Copyright © 2004 Intel Corporation. All rights
reserved.
Nigel Paver is an architect and
design manager for Wireless MMX technology at Intel
Corporation. Bradley Aldrich is a leading authority at Intel
Corporation on image and video processing. Moinul Khan is a multimedia
architect at Intel Corporation.