By Nigel Paver, Bradley Aldrich and Moinul Khan, Intel Corp.
Looking for Stalls
Attempting to reschedule a complete application, one kernel at a time,
can be very cumbersome, especially when you are working with code
written in assembly language.
Once a piece of code has been developed with functional correctness,
the developer can optimize it piece by piece. Using existing knowledge
of the application or a VTune analyzer profile, you can identify which
critical piece of code has the greatest impact on the application.
After applying profiling and critical component analysis techniques,
you must measure the number of stalls in that piece of code to
determine whether you should reschedule.
To decouple the contribution of the memory subsystem stalls from the
pipeline stalls, assume that all the data for the kernel is in the data
cache. This operation can be done by making multiple function calls to
the same kernel with the same data set during the test.
Once you have measured stalls per instruction and determined that an
optimization to reduce stalls is necessary, you need to evaluate
rescheduling opportunities.
Begin by considering a window of instructions for reordering, then
gradually slide the window down as you reorder the instructions, as
shown in Figure 1 below.
 |
| Figure
1. Sliding Window of Optimization |
This method is typically known as the sliding window approach. You
need to choose the peeping window size such that it fits a small loop
or Bbasic block of code,
which in our context is defined by a sequence
of code that does not have any branches.
For particular application classes, such basic blocks could be too
small; for those cases, larger window sizes should be considered.
Avoiding Stalls
Using a sliding window approach, you can find opportunities for stall
reduction. Depending on the complexity of the code, the window size can
be limited to a single loop or a small kernel.
Once a stall is detected within the observation window, one of the
following techniques can be applied: moving instructions up or down,
loop unrolling, loop-fusion, register renaming, footprint adjustment,
and so on. After optimizing code in the current window, it slides into
next set of instructions. Two windows should moderately overlap.
If a resource hazard between two instructions exists, you can do one
of two things. Either the first instruction of the hazard can be moved
up or the later instructions can be moved down, thus creating enough
distance between the two instructions.
You can apply the same technique to stalls due to data dependency.
This approach is referred to as simple translation, since the
instructions are simply moved up or down.
In a tight kernel loop, reordering might be limited to a single
iteration of processing due to the lack of a sufficient number of
instructions. Unrolling the loop allows more instructions within a
basic block.
In addition, two consecutive loops can be combined to obtain the
same effect. Later in this chapter you will find case studies in which
this technique has been used quite extensively.
Register renaming is another approach to remove undue dependencies.
By renaming registers, certain parts of a computation can be decoupled
and performed independently; the result can be combined later. Care
should be taken not to perform excessive register renaming, so that
registers used in one window do not overwrite the register in the next
window of optimization.
Reducing the register footprint, or using the least number of
registers, is a good programming technique. Reducing register footprint
can save the overhead of hot save.
Hot save refers to storing data to the stack and reading it back as
a result of using a high number of registers. However, when the
rescheduling opportunity is limited, increasing the register footprint
can help.
Especially during loop unrolling and loop fusion, using more
registers could be necessary to compute data values concurrently for
different iterations. Intel Wireless MMX technology has a large
register file that can be effectively used in these cases.
Keeping track of the registers and usage for each window under
investigation is a good practice. Keeping track ensures that no
register is overwritten unnecessarily and the optimization effort has
not introduced any functional bugs.
Another commonly used technique to avoid stalls is to extract the
concurrency between the pipelines - more specifically, to reduce
resource hazards. This technique is known as pipeline interleaving.
The technique allows the programmer to interleave instructions
between both the pipes, providing greater opportunities for avoiding
stalls. The next level of concurrency extraction is at the functional
unit level, or intracore level.
For example, when an Intel Wireless MMX multiply instruction is
operating, another instruction can be processed concurrently in the ALU
pipe. Similarly, reordering with concurrency in mind also helps avoid
stalls due to resource hazards. .
 |
| Figure
2. Decomposition of a Code Segment into Data Blocks and Control
Constructs |
Decomposition of an Application. An application can be decomposed
into a set of connected basic blocks, where the basic blocks are data
processing tasks and the control operations that determine the flow
between the basic blocks, as shown in Figure
2 above.
Thus, for the sake of simplicity, the discussion in the rest of this
series will focus on two categories:
*Optimization for data-processing operations (
Part 2),
and
*Optimization for control-oriented operations (
Part 3).
This
series of articles was excerpted from "Programming with
Intel Wireless MMX Technology," by Nigel Paver, Bradley Aldrich and
Moinul Khan. Copyright © 2004 Intel Corporation. All rights
reserved.
Nigel Paver is an architect and
design manager for Wireless MMX technology at Intel
Corporation. Bradley Aldrich is a leading authority at Intel
Corporation on image and video processing. Moinul Khan is a multimedia
architect at Intel Corporation.