We will focus on low-level instruction sequences to demonstrate different optimization techniques. While most of the techniques described are generic to all processors, some are specific to the Intel XScale and Wireless MMX microarchitectures.
Picking the right optimization
philosophy
The optimization process is a hierarchical one that needs to be
addressed at different levels during the application development cycle.
During algorithm selection, high-level code development, and kernel
development, an optimization effort is necessary. The focus here is
simply on the optimizations useful for developing assembly routines or
kernel-level coding.
Two sets of instructions are used. Intel XScale microarchitecture implements ARM V5TE-compliant instructions, and Intel Wireless MMX technology implements an instruction set specifically designed to accelerate multimedia applications.
The implementation uses two concurrent pipelines to handle the respective instruction set architecture (ISA). You have two aspects of the optimization to consider with regard to the pipeline and ISA: 1) choosing the right instruction, and 2) choosing the right sequence of instructions.
Choosing the Right Instruction. The combined instruction sets of the Intel XScale microarchitecture and Intel Wireless MMX technology have a large variety of instructions from which the programmer can choose those most appropriate for a given application.
For example, 32-bit addition can be performed on the register file of Intel XScale core using an add instruction. An addition of this kind can also be performed on the coprocessor register file using a WADDW instruction.
Choosing the correct instruction for the operation and partitioning the desired kernel between both the register files are the first challenges in optimizing for the pipeline.
Intel Wireless MMX instructions are relatively orthogonal; that is, each instruction supports the same operations on different data types - byte, half-word, and word. Based on the algorithmic need of the kernel, selecting the correct data types offers you the most efficient use of the resources.
For example, if the required accuracy for an algorithm is only 16-bit data, you can use a 16-bit data type and process four data samples concurrently using Intel Wireless MMX Technology. However, when using 32-bit data, you meet the accuracy requirement, but you can enjoy only two-way concurrency.
Choosing the Right Sequence. Intel XScale microarchitecture with Intel Wireless MMX technology has two processing pipelines. Resource and data-dependency hazards introduce inefficiencies into the system, since, for the duration of the stalls, the concurrency between the pipe stages is not utilized, thus hurting performance and power.
You can employ different techniques for instruction scheduling to reduce such stalls. Instruction scheduling refers to the rearrangement of a sequence of instructions for the purpose of helping to minimize pipeline stalls.
When you are writing code in a high-level language, you cannot specifically control the selection of the right instruction or the selection of the right sequence of instructions. The compiler tool chain can handle some of these concerns.
However, for performance-critical applications, critical and heavily used routines might need to be written or optimized by hand in assembly language or using intrinsic functions. The majority of this optimization effort is spent in stall reduction.
Stall-Directed Instruction
Scheduling
The initial step in basic pipeline optimization is to understand the
pipeline and delay characteristics of each instruction. Two parameters
characterize an instruction:
1) Resource, or issue, latency. When a functional unit is processing an instruction, it can be busy for one or more cycles. Resource or issue latency for an instruction is the number of cycles that the functional unit will be busy before the next instruction using the same functional unit can be scheduled.
2) Result latency. Result latency for an instruction is the number of cycles that the functional unit takes to produce the result. If two instructions are back to back and the later instruction uses the data produced by the earlier instruction, then the later instruction stalls for the period of result latency.
The only way to solve the stall and hazard problem is to avoid it.
Stall avoidance is achieved by reordering instructions while
maintaining the same functionality.
Looking for Stalls
Attempting to reschedule a complete application, one kernel at a time,
can be very cumbersome, especially when you are working with code
written in assembly language.
Once a piece of code has been developed with functional correctness, the developer can optimize it piece by piece. Using existing knowledge of the application or a VTune analyzer profile, you can identify which critical piece of code has the greatest impact on the application.
After applying profiling and critical component analysis techniques, you must measure the number of stalls in that piece of code to determine whether you should reschedule.
To decouple the contribution of the memory subsystem stalls from the pipeline stalls, assume that all the data for the kernel is in the data cache. This operation can be done by making multiple function calls to the same kernel with the same data set during the test.
Once you have measured stalls per instruction and determined that an optimization to reduce stalls is necessary, you need to evaluate rescheduling opportunities.
Begin by considering a window of instructions for reordering, then gradually slide the window down as you reorder the instructions, as shown in Figure 1 below.
![]() |
| Figure 1. Sliding Window of Optimization |
This method is typically known as the sliding window approach. You need to choose the peeping window size such that it fits a small loop or Bbasic block of code, which in our context is defined by a sequence of code that does not have any branches.
For particular application classes, such basic blocks could be too small; for those cases, larger window sizes should be considered.
Avoiding Stalls
Using a sliding window approach, you can find opportunities for stall
reduction. Depending on the complexity of the code, the window size can
be limited to a single loop or a small kernel.
Once a stall is detected within the observation window, one of the following techniques can be applied: moving instructions up or down, loop unrolling, loop-fusion, register renaming, footprint adjustment, and so on. After optimizing code in the current window, it slides into next set of instructions. Two windows should moderately overlap.
If a resource hazard between two instructions exists, you can do one of two things. Either the first instruction of the hazard can be moved up or the later instructions can be moved down, thus creating enough distance between the two instructions.
You can apply the same technique to stalls due to data dependency. This approach is referred to as simple translation, since the instructions are simply moved up or down.
In a tight kernel loop, reordering might be limited to a single iteration of processing due to the lack of a sufficient number of instructions. Unrolling the loop allows more instructions within a basic block.
In addition, two consecutive loops can be combined to obtain the same effect. Later in this chapter you will find case studies in which this technique has been used quite extensively.
Register renaming is another approach to remove undue dependencies. By renaming registers, certain parts of a computation can be decoupled and performed independently; the result can be combined later. Care should be taken not to perform excessive register renaming, so that registers used in one window do not overwrite the register in the next window of optimization.
Reducing the register footprint, or using the least number of registers, is a good programming technique. Reducing register footprint can save the overhead of hot save.
Hot save refers to storing data to the stack and reading it back as a result of using a high number of registers. However, when the rescheduling opportunity is limited, increasing the register footprint can help.
Especially during loop unrolling and loop fusion, using more registers could be necessary to compute data values concurrently for different iterations. Intel Wireless MMX technology has a large register file that can be effectively used in these cases.
Keeping track of the registers and usage for each window under investigation is a good practice. Keeping track ensures that no register is overwritten unnecessarily and the optimization effort has not introduced any functional bugs.
Another commonly used technique to avoid stalls is to extract the concurrency between the pipelines - more specifically, to reduce resource hazards. This technique is known as pipeline interleaving.
The technique allows the programmer to interleave instructions between both the pipes, providing greater opportunities for avoiding stalls. The next level of concurrency extraction is at the functional unit level, or intracore level.
For example, when an Intel Wireless MMX multiply instruction is operating, another instruction can be processed concurrently in the ALU pipe. Similarly, reordering with concurrency in mind also helps avoid stalls due to resource hazards. .
![]() |
| Figure 2. Decomposition of a Code Segment into Data Blocks and Control Constructs |
Decomposition of an Application. An application can be decomposed into a set of connected basic blocks, where the basic blocks are data processing tasks and the control operations that determine the flow between the basic blocks, as shown in Figure 2 above.
Thus, for the sake of simplicity, the discussion in the rest of this
series will focus on two categories:
This series of articles was excerpted from "Programming with Intel Wireless MMX Technology," by Nigel Paver, Bradley Aldrich and Moinul Khan. Copyright © 2004 Intel Corporation. All rights reserved.
Nigel Paver is an architect and
design manager for Wireless MMX technology at Intel
Corporation. Bradley Aldrich is a leading authority at Intel
Corporation on image and video processing. Moinul Khan is a multimedia
architect at Intel Corporation.