Optimization of a code segment can contribute greatly to itsperformance. An optimized application makes best use of all availablemicroarchitectural features. In a pipelined processor, the key tooptimization is to keep all thestages of the pipelines and functional units occupied with meaningfultasks.
This series of articles will explain the implications of thepipelined structure in terms of performance and then offer best-knownways to utilize all the available resources in data processing (
We will focus on low-level instruction sequences to demonstratedifferent optimization techniques. While most of the techniquesdescribed aregeneric to all processors, some are specific to the Intel XScale andWireless MMX microarchitectures.
Picking the right optimizationphilosophy
The optimization process is a hierarchical one that needs to beaddressed at different levels during the application development cycle.During algorithm selection, high-level code development, and kerneldevelopment, an optimization effort is necessary. The focus here issimply on the optimizations useful for developing assembly routines orkernel-level coding.
Two sets of instructions are used.
The implementation uses two concurrent pipelines to handle therespective
Choosing theRight Instruction. The combined instruction sets of the IntelXScale microarchitecture and Intel Wireless MMX technology have a largevariety of instructions from which the programmer can choose those mostappropriate for a given application.
For example, 32-bit addition can be performed on the register fileof Intel XScale core using an add instruction. An addition of this kindcan also be performed on the coprocessor register file using a WADDWinstruction.
Choosing the correct instruction for the operation and partitioningthe desired kernel between both the register files are the firstchallenges in optimizing for the pipeline.
Intel Wireless MMX instructions are relatively orthogonal; that is,each instruction supports the same operations on different data types -byte, half-word, and word. Based on the algorithmic need of the kernel,selecting the correct data types offers you the most efficient use ofthe resources.
For example, if the required accuracy for an algorithm is only16-bit data, you can use a 16-bit data type and process four datasamples concurrently using Intel Wireless MMX Technology. However, whenusing 32-bit data, you meet the accuracy requirement, but you can enjoyonly two-way concurrency.
Choosing theRight Sequence. Intel XScale microarchitecture with IntelWireless MMX technology has two processing pipelines. Resource anddata-dependency hazards introduce inefficiencies into the system,since, for the duration of the stalls, the concurrency between the pipestages is not utilized, thus hurting performance and power.
You can employ different techniques for instruction scheduling toreduce such stalls. Instruction scheduling refers to the rearrangementof a sequence of instructions for the purpose of helping to minimizepipeline stalls.
When you are writing code in a high-level language, you cannotspecifically control the selection of the right instruction or theselection of the right sequence of instructions. The compiler toolchain can handle some of these concerns.
However, for performance-critical applications, critical and heavilyused routines might need to be written or optimized by hand in assemblylanguage or using intrinsic functions. The majority of thisoptimization effort is spent in stall reduction.
The initial step in basic pipeline optimization is to understand thepipeline and delay characteristics of each instruction. Two parameterscharacterize an instruction:
1) Resource, orissue, latency. When a functional unit is processing aninstruction, it can be busy for one or more cycles. Resource or issuelatency for an instruction is the number of cycles that the functionalunit will be busy before the next instruction using the same functionalunit can be scheduled.
2) Resultlatency. Result latency for an instruction is the number ofcycles that the functional unit takes to produce the result. If twoinstructions are back to back and the later instruction uses the dataproduced by the earlier instruction, then the later instruction stallsfor the period of result latency.
The only way to solve the stall and hazard problem is to avoid it.Stall avoidance is achieved by reordering instructions whilemaintaining the same functionality.
Looking for Stalls
Attempting to reschedule a complete application, one kernel at a time,can be very cumbersome, especially when you are working with codewritten in assembly language.
Once a piece of code has been developed with functional correctness,the developer can optimize it piece by piece. Using existing knowledgeof the application or a VTune analyzer profile, you can identify whichcritical piece of code has the greatest impact on the application.
After applying profiling and critical component analysis techniques,you must measure the number of stalls in that piece of code todetermine whether you should reschedule.
To decouple the contribution of the memory subsystem stalls from thepipeline stalls, assume that all the data for the kernel is in the datacache. This operation can be done by making multiple function calls tothe same kernel with the same data set during the test.
Once you have measured stalls per instruction and determined that anoptimization to reduce stalls is necessary, you need to evaluaterescheduling opportunities.
Begin by considering a window of instructions for reordering, thengradually slide the window down as you reorder the instructions, asshown in Figure 1 below .
|Figure1. Sliding Window of Optimization|
This method is typically known as the sliding window approach. Youneed to choose the peeping window size such that it fits a small loopor Bbasic block of code,which in our context is defined by a sequenceof code that does not have any branches.
For particular application classes, such basic blocks could be toosmall; for those cases, larger window sizes should be considered.
Using a sliding window approach, you can find opportunities for stallreduction. Depending on the complexity of the code, the window size canbe limited to a single loop or a small kernel.
Once a stall is detected within the observation window, one of thefollowing techniques can be applied: moving instructions up or down,loop unrolling, loop-fusion, register renaming, footprint adjustment,and so on. After optimizing code in the current window, it slides intonext set of instructions. Two windows should moderately overlap.
If a resource hazard between two instructions exists, you can do oneof two things. Either the first instruction of the hazard can be movedup or the later instructions can be moved down, thus creating enoughdistance between the two instructions.
You can apply the same technique to stalls due to data dependency.This approach is referred to as simple translation, since theinstructions are simply moved up or down.
In a tight kernel loop, reordering might be limited to a singleiteration of processing due to the lack of a sufficient number ofinstructions. Unrolling the loop allows more instructions within abasic block.
In addition, two consecutive loops can be combined to obtain thesame effect. Later in this chapter you will find case studies in whichthis technique has been used quite extensively.
Register renaming is another approach to remove undue dependencies.By renaming registers, certain parts of a computation can be decoupledand performed independently; the result can be combined later. Careshould be taken not to perform excessive register renaming, so thatregisters used in one window do not overwrite the register in the nextwindow of optimization.
Reducing the register footprint, or using the least number ofregisters, is a good programming technique. Reducing register footprintcan save the overhead of hot save.
Hot save refers to storing data to the stack and reading it back asa result of using a high number of registers. However, when therescheduling opportunity is limited, increasing the register footprintcan help.
Especially during loop unrolling and loop fusion, using moreregisters could be necessary to compute data values concurrently fordifferent iterations. Intel Wireless MMX technology has a largeregister file that can be effectively used in these cases.
Keeping track of the registers and usage for each window underinvestigation is a good practice. Keeping track ensures that noregister is overwritten unnecessarily and the optimization effort hasnot introduced any functional bugs.
Another commonly used technique to avoid stalls is to extract theconcurrency between the pipelines – more specifically, to reduceresource hazards. This technique is known as pipeline interleaving.
The technique allows the programmer to interleave instructionsbetween both the pipes, providing greater opportunities for avoidingstalls. The next level of concurrency extraction is at the functionalunit level, or intracore level.
For example, when an Intel Wireless MMX multiply instruction isoperating, another instruction can be processed concurrently in the ALUpipe. Similarly, reordering with concurrency in mind also helps avoidstalls due to resource hazards. .
|Figure2. Decomposition of a Code Segment into Data Blocks and ControlConstructs|
Decomposition of an Application. An application can be decomposedinto a set of connected basic blocks, where the basic blocks are dataprocessing tasks and the control operations that determine the flowbetween the basic blocks, as shown in Figure2 above .
Thus, for the sake of simplicity, the discussion in the rest of thisseries will focus on two categories:
Thisseries of articles was excerpted from “Programming withIntel Wireless MMX Technology,” by Nigel Paver, Bradley Aldrich andMoinul Khan. Copyright © 2004 Intel Corporation. All rightsreserved.
Nigel Paver is an architect anddesign manager for Wireless MMX technology at IntelCorporation. Bradley Aldrich is a leading authority at IntelCorporation on image and video processing. Moinul Khan is a multimediaarchitect at Intel Corporation.