By Nigel Paver, Bradley Aldrich and Moinul Khan, Intel Corp.
Optimizing a code segment can greatly improve its performance. An
optimized application makes the best use of the available
microarchitectural features. In a pipelined processor, the key to
optimization is keeping every pipeline stage and functional unit
occupied with meaningful work.
This series of articles will explain the implications of the
pipelined structure in terms of performance and then offer best-known
ways to utilize all the available resources in data processing (Part 2)
and control-oriented operations (Part 3).
We will focus on low-level instruction sequences to demonstrate
different optimization techniques. While most of the techniques
described apply to pipelined processors in general, some are specific
to the Intel XScale and Intel Wireless MMX microarchitectures.
Picking the right optimization philosophy
Optimization is hierarchical and must be addressed at several levels of
the application development cycle: algorithm selection, high-level code
development, and kernel development each require their own optimization
effort. The focus here is only on optimizations useful when writing
assembly routines or kernel-level code.
Two sets of instructions are used. The Intel XScale microarchitecture
implements ARMv5TE-compliant instructions, and Intel Wireless MMX
technology implements an instruction set specifically designed to
accelerate multimedia applications.
The implementation uses two concurrent pipelines to handle the
respective instruction set architectures (ISAs). With regard to the
pipeline and ISA, there are two aspects of optimization to consider:
1) choosing the right instruction, and 2) choosing the right sequence
of instructions.
Choosing the Right Instruction. The combined instruction sets of the
Intel XScale microarchitecture and Intel Wireless MMX technology have a
large variety of instructions from which the programmer can choose
those most appropriate for a given application.
For example, a 32-bit addition can be performed on the register file
of the Intel XScale core using an ADD instruction. The same addition
can also be performed on the coprocessor register file using a WADDW
instruction.
Choosing the correct instruction for the operation and partitioning
the kernel between the two register files are the first challenges in
optimizing for the pipeline.
Intel Wireless MMX instructions are relatively orthogonal; that is,
the same operations are available for each data type: byte, half-word,
and word. Selecting the data type that matches the algorithmic needs of
the kernel gives you the most efficient use of the resources.
For example, if an algorithm requires only 16-bit precision, you can
use a 16-bit data type and process four data samples concurrently in
one 64-bit Intel Wireless MMX register. With 32-bit data you still meet
the accuracy requirement, but you get only two-way concurrency.
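
The trade-off between precision and concurrency can be sketched in
portable C. This is only a model: it emulates lane-wise addition on a
64-bit value with ordinary integer arithmetic instead of issuing
Wireless MMX instructions, and the function names (lanes_per_register,
add_4x16, add_2x32) are hypothetical. It assumes a 64-bit register and
wrap-around (non-saturating) arithmetic in each lane.

```c
#include <stdint.h>

/* Number of elements of a given width that fit in one 64-bit register
 * (assumption: the register is 64 bits wide). */
unsigned lanes_per_register(unsigned element_bits)
{
    return 64u / element_bits;
}

/* Lane-wise add with the top bit of every lane handled separately, so
 * that a carry never crosses from one lane into the next. */
static uint64_t add_lanes(uint64_t a, uint64_t b, uint64_t top_bits)
{
    /* Add everything except the top bit of each lane. */
    uint64_t low_sum = (a & ~top_bits) + (b & ~top_bits);
    /* Fold the top bits back in without generating a carry out. */
    return low_sum ^ ((a ^ b) & top_bits);
}

/* Four independent 16-bit additions in one 64-bit word. */
uint64_t add_4x16(uint64_t a, uint64_t b)
{
    return add_lanes(a, b, 0x8000800080008000ULL);
}

/* Two independent 32-bit additions in one 64-bit word. */
uint64_t add_2x32(uint64_t a, uint64_t b)
{
    return add_lanes(a, b, 0x8000000080000000ULL);
}
```

For instance, add_4x16(0x0001FFFF00037FFFULL, 0x0001000100010001ULL)
yields 0x0002000000048000: four 16-bit sums from a single operation,
each wrapping independently. The same 64 bits hold only two sums when
add_2x32 is used.
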
Choosing the Right Sequence. Intel XScale microarchitecture with Intel
Wireless MMX technology has two processing pipelines. Resource and
data-dependency hazards stall these pipelines; for the duration of a
stall, the concurrency between the pipe stages goes unused, which hurts
both performance and power.
You can employ different instruction scheduling techniques to reduce
such stalls. Instruction scheduling is the rearrangement of a sequence
of instructions to minimize pipeline stalls.
When you are writing code in a high-level language, you cannot
directly control either the selection of instructions or their
sequence. The compiler tool chain can handle some of these concerns.
However, for performance-critical applications, critical and heavily
used routines might need to be written or optimized by hand in assembly
language or using intrinsic functions. The majority of this
optimization effort is spent in stall reduction.
Stall-Directed Instruction Scheduling
The initial step in basic pipeline optimization is to understand the
pipeline and delay characteristics of each instruction. Two parameters
characterize an instruction:
1) Resource, or issue, latency. When a functional unit is processing an
instruction, it can be busy for one or more cycles. Resource or issue
latency for an instruction is the number of cycles that the functional
unit will be busy before the next instruction using the same functional
unit can be scheduled.
2) Result latency. Result latency for an instruction is the number of
cycles that the functional unit takes to produce the result. If two
instructions are back to back and the later instruction uses the data
produced by the earlier instruction, then the later instruction stalls
until that result becomes available, as determined by the result
latency.
Stalls and hazards cannot be repaired after the fact; the only remedy
is to avoid them. Stall avoidance is achieved by reordering
instructions while preserving the same functionality.
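
The effect of reordering can be illustrated with a small cycle-counting
model in C. This is a sketch under stated assumptions, not a model of
the actual XScale pipeline: it assumes a single-issue, in-order
machine, treats result latency as the number of cycles after issue at
which a result can first be consumed, and uses made-up latencies (a
load with result latency 3, adds with result latency 1). The Insn
type, count_stall_cycles, and both instruction sequences are
hypothetical.

```c
#include <stddef.h>

enum { NUM_REGS = 16, NUM_UNITS = 2 };
enum { UNIT_LOAD = 0, UNIT_ALU = 1 };
#define NO_REG (-1)

/* Hypothetical instruction record for the model (not a real encoding). */
typedef struct {
    int dest;            /* register written, or NO_REG */
    int src1;            /* first register read, or NO_REG */
    int src2;            /* second register read, or NO_REG */
    int unit;            /* functional unit used */
    int issue_latency;   /* cycles the unit stays busy */
    int result_latency;  /* cycles after issue until dest is usable */
} Insn;

/* Count stall cycles for an in-order, single-issue sequence. */
int count_stall_cycles(const Insn *seq, size_t n)
{
    int reg_ready[NUM_REGS] = {0};   /* cycle each register is usable */
    int unit_free[NUM_UNITS] = {0};  /* cycle each unit is free again */
    int next_slot = 0;               /* earliest possible issue cycle */
    int stalls = 0;

    for (size_t i = 0; i < n; ++i) {
        const Insn *in = &seq[i];
        int issue = next_slot;

        /* Data-dependency hazard: wait for source operands. */
        if (in->src1 != NO_REG && reg_ready[in->src1] > issue)
            issue = reg_ready[in->src1];
        if (in->src2 != NO_REG && reg_ready[in->src2] > issue)
            issue = reg_ready[in->src2];
        /* Resource hazard: wait for the functional unit. */
        if (unit_free[in->unit] > issue)
            issue = unit_free[in->unit];

        stalls += issue - next_slot;
        if (in->dest != NO_REG)
            reg_ready[in->dest] = issue + in->result_latency;
        unit_free[in->unit] = issue + in->issue_latency;
        next_slot = issue + 1;
    }
    return stalls;
}

/* LOAD r0; ADD r1 = r0 + r2; ADD r3 = r4 + r5; ADD r6 = r7 + r8 */
const Insn original_order[] = {
    { 0, NO_REG, NO_REG, UNIT_LOAD, 1, 3 },
    { 1, 0, 2, UNIT_ALU, 1, 1 },   /* uses r0 right after the load */
    { 3, 4, 5, UNIT_ALU, 1, 1 },
    { 6, 7, 8, UNIT_ALU, 1, 1 },
};

/* Same instructions, with the independent adds moved between the
 * load and its consumer. */
const Insn reordered[] = {
    { 0, NO_REG, NO_REG, UNIT_LOAD, 1, 3 },
    { 3, 4, 5, UNIT_ALU, 1, 1 },
    { 6, 7, 8, UNIT_ALU, 1, 1 },
    { 1, 0, 2, UNIT_ALU, 1, 1 },
};
```

With these assumed latencies, count_stall_cycles(original_order, 4)
returns 2 and count_stall_cycles(reordered, 4) returns 0. Both orders
compute the same values, because the two moved adds neither read nor
write any register involved in the load or its consumer.
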