
Mastering the art of microprocessor optimization

In an ideal microprocessor, one that can issue one machine instruction per clock cycle, the best performance that can be achieved is that, for every cycle, one useful instruction is completed. The processor's choice of instructions then leads to an efficient implementation of the task at hand.

The two goals of optimization are: (1) to reduce the total number of cycles required to achieve a particular task by reducing the number of cycles wherein no useful work is done and (2) to make the most effective use of the cycles consumed.

In real-world microprocessors, many factors conspire to cause deviation from ideal performance. The art of optimization lies in figuring out how to work around a particular factor or how to mitigate its effect on performance.

Optimization Strategy
The overall strategy for optimization is straightforward and can be summarized by the following recipe:

1. Identify the hotspots.
2. Identify the cause of the performance bottleneck and assess the suitability of using SIMD techniques.
3. Select a suitable SIMD optimization and method of applying it.
4. Repeat steps 1-3 until the performance goal is reached.

This article introduces you to the concepts and the terminology associated with performance optimization, laying the foundation for optimization techniques.

Identifying the Hotspots
Identifying those places where the application spends most of its time is the first step in looking for performance bottlenecks. Many tools are available for profiling an application to determine which subfunctions are used the most. For Intel Personal Internet Client Architecture (PCA)-based devices, Intel VTune Performance Analyzer is the usual choice for application profiling; this tool is available for multiple operating systems.

While you run the application on the target system, under the operating system, performance data is captured so that it can be analyzed to find the parts of the application that are consuming the most time.

The subfunctions that consume the most run-time of the application are selected first as optimization candidates. Typically, a function shows up in this group for one of two reasons:

1) The function is a computationally intense calculation.
2) The function contains some simpler computation, but it is repeated many times.

A profiling tool can determine the reason a particular subfunction is a hotspot. The repeated-operation type is more easily optimized because the function itself is usually simpler.

Once a candidate function is selected, its suitability for SIMD acceleration is evaluated against several independent criteria. If the function is suitable for SIMD acceleration, the next step is to ensure that the optimal solution is selected.

Pipelined Microprocessors
All modern microprocessors employ pipelining techniques in one form or another, and Intel Wireless MMX technology follows a similar pipeline structure. Both the main core and the coprocessor use a technique called data forwarding to send recently calculated data directly to the next instruction as soon as it is available, rather than waiting until the value has been written back to the destination register. This hardware helps minimize the number of data hazards but does not completely eliminate them.

Understanding the Impact of Stalls
Every cycle in which the processor is stalled is a cycle in which no useful work is being done. The object of optimization is to reduce or eliminate the number of wasted cycles in a given application. Stalls are the result of either of two conditions:

1) Instructions cannot be processed because they are not available or the resource to process them is not available
2) Data is not yet available

The Supply of Instructions
To keep the microprocessor busy, a ready supply of instructions to execute is required. Modern microprocessors are designed to keep frequently used instructions in fast memory close to the core.

This memory usually contains a cache of recently accessed instructions. Slower and larger backing memory is supplied to store the less frequently used instructions. If an instruction is not in the fast cache memory, it must reside in the slower memory, quite often off-chip. The arrangement of the different types of storage is often called the memory hierarchy. The small, faster caches are known as level 1 memory.

If the required instruction is not found in the level 1 cache, the instruction has to be retrieved from the next level in the memory hierarchy. If this memory is off-chip, retrieving the data can take many clock cycles.

For example, retrieval times in excess of 50 clocks are not uncommon. During this time, the core has no instructions to process, so it stalls. Cache-miss stalls can thus quite quickly have a dramatic effect on performance.

Effect of Missing the Cache
To demonstrate the effect of cache misses, consider the case of an ideal 400-megahertz core that can execute one instruction every cycle. If the instructions are always located within the cache, the number of clocks required per instruction (CPI) remains at one. The equation to calculate the time taken to execute 100 instructions is:

#Clocks for 100 Instructions (C100) = (#instructions that did not stall) * 1 + (number of stalls) * (time for each stall)

This calculation can be generalized to a short form, as follows:

C100 = Icache hit rate * 1 + Icache miss rate * stall time

Thus, if all instructions are found in the instruction cache every time – that is, a 100 percent hit rate – the time for 100 instructions is:

C100 = 100 * 1 + 0 * 50 = 100
CPI = 100 clocks / 100 instructions = 1

If in this example the next level of memory, level 2, is external memory requiring 50 clock cycles to access and two instruction cache misses occur every 100 instructions, the calculation of CPI becomes:

C100 = 98 * 1 + 2 * 50 = 198
CPI = 198 clocks / 100 instructions = 1.98

This result shows that the time taken to complete the same 100 instructions has nearly doubled or, to put it another way, the performance has been halved. The hit rate of the instruction cache and the miss penalty thus play a very important role in performance.

Figure 1 below shows a graph of the effect of instruction cache hit rate on the performance of a system. The graph assumes a processor with a 400-megahertz clock and a miss penalty of 50 clock cycles.

The X-axis shows the instruction cache hit rate. The Y-axis shows the rate at which instructions are consumed, taking stalls into account. The instruction rate can be calculated from the equation:

Instruction rate = Clock Frequency / CPI

The graph in Figure 1 shows that the effective instruction rate, or observed performance, drops off dramatically as the hit rate decreases. With a 100 percent hit rate, instructions are consumed at one per cycle, or at a rate of 400 megahertz in this case.

Figure 1. Effect of Instruction Cache Misses on Performance

With a hit rate of 90 percent, the effective instruction rate has dropped to approximately 70 megahertz, or just 17.5 percent of the ideal performance.
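To make the arithmetic concrete, here is a small C sketch (illustrative only; the helper name is hypothetical) that evaluates the two equations above for the 400-megahertz core with a 50-clock miss penalty:

#include <stdio.h>

/* Hypothetical helper: effective CPI for a given instruction cache
   hit rate (0.0 to 1.0) and miss penalty in clocks, per the
   performance equations above. */
static double effective_cpi(double hit_rate, double miss_penalty)
{
    return hit_rate * 1.0 + (1.0 - hit_rate) * miss_penalty;
}

int main(void)
{
    const double clock_mhz = 400.0;   /* 400-megahertz core    */
    const double penalty = 50.0;      /* 50-clock miss penalty */
    double cpi = effective_cpi(0.90, penalty);
    /* Prints CPI = 5.90 and a rate of about 68 MHz, matching the
       roughly 70 megahertz quoted above for a 90 percent hit rate. */
    printf("CPI = %.2f, instruction rate = %.1f MHz\n",
           cpi, clock_mhz / cpi);
    return 0;
}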

The objective of optimization is to get as close to ideal as possible. Once again, the factors affecting the performance equation are:

C100 = Icache hit rate * 1 + Icache miss rate * stall time

Clearly you can improve the performance in two ways:

1) Reduce the instruction cache miss rate
2) Reduce the stall time

Instruction Locality
Reducing the instruction cache miss rate relies upon getting good code reuse. Fortunately, many multimedia algorithms tend to have computationally intense inner loop calculations that are repeated many times on the multimedia data stream, as shown in Figure 2 below.


void snippet(void) {
    unsigned char a1[64];   /* input arrays, assumed filled elsewhere */
    unsigned char a2[64];
    unsigned char a3[64], a4[64];
    for (int i = 0; i < 64; i++) {
        a3[i] = a1[i] + a2[i];
        a4[i] = a1[i] - a2[i];
    }
}


Figure 2. Instruction Locality

The first pass through the loop causes the loop body instructions to be loaded into the cache, and the remaining 63 loop iterations are likely to find the instructions they require already in the level 1 cache.

Reducing the Stall Time
Reducing the stall time is usually a hardware implementation decision. Typically, such reductions are achieved by carefully designing the path to the next level of memory and by making the next level of memory as fast as is economical.

Resource Conflicts
Not being able to execute an instruction because the hardware resource is busy is known as a resource conflict. This event can lead to a processor stall and wasted cycles.

Some Intel Wireless MMX instructions occupy a particular pipeline stage for more than one cycle, which can lead to stalls if the following instruction needs to use the same resource. For example, the Intel Wireless MMX multiply-accumulate instruction, WMAC, uses the first stage of the multiplier pipeline twice. The code sequence below shows two WMAC instructions back to back.

WMAC wR0, wR1, wR2
WMAC wR0, wR3, wR4

The second WMAC incurs a one-cycle stall because it cannot start until the first WMAC has finished its two cycles in the first stage of the multiplier pipeline. Usually, it is quite easy to schedule other instructions between resource-constrained instructions. In this case, a typical operation would be to load data for the WMAC – using WLDR – in a sequence such as:

WLDR
WMAC
WLDR
WMAC

This sequence would not stall because the WMAC instructions no longer have a resource hazard.

Finding the Data
Not having the data needed to start a calculation is the other reason that cycles can be wasted in a microprocessor. Typically, the data an instruction needs is in one of four places:

1) In memory
2) Being calculated by a previous instruction
3) In transit from the memory or previous instruction
4) Available in the registers

These causes are ordered according to their impact on performance. Fetching data from memory has the greatest potential impact; using data readily available in the registers wastes the fewest cycles.

Accessing Data Memory. Data memory is organized using a hierarchical scheme similar to the instruction memory's. Again, fast level 1 cache is provided for frequently accessed data values. A similar performance equation can be constructed for the impact of data cache misses, and a similar graph of performance impact would result:

C100 = Dcache hit rate * 1 + Dcache miss rate * stall time

The total performance impact of a cached subsystem is a function of both instruction and data cache hit rates.

To reduce the impact of data cache misses, data structures can be organized so that frequently accessed data is grouped together in the same area of memory. Data can also be preloaded into the fast caches by special instructions.
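As an illustration – a minimal sketch, not an example from the book – the fragment below groups the frequently accessed fields into one small structure and uses GCC's __builtin_prefetch as a stand-in for a dedicated preload instruction:

/* Hot fields grouped into one small structure so they share a
   cache line instead of being scattered across memory. */
struct filter_state {
    short coeff;   /* read every iteration */
    short gain;    /* read every iteration */
};

void apply(const short *in, short *out, const struct filter_state *fs, int n)
{
    for (int i = 0; i < n; i++) {
        /* Preload data several iterations ahead of its use. */
        __builtin_prefetch(&in[i + 16]);
        out[i] = (short)(((in[i] * fs->coeff) >> 8) + fs->gain);
    }
}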

Other techniques, such as loop unrolling and load pipelining, can be employed to reduce stall times due to memory data availability. All these techniques should be considered when writing optimal code.
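Load pipelining, for example, can be sketched in plain C (assuming 16-bit samples and n >= 1): the load for the next iteration is issued before the current value is consumed, giving the load time to complete:

void scale(const short *in, short *out, int n, short k)
{
    short cur = in[0];               /* prime the pipeline         */
    for (int i = 0; i < n - 1; i++) {
        short next = in[i + 1];      /* issue the next load early  */
        out[i] = cur * k;            /* compute with current value */
        cur = next;                  /* hand the loaded value on   */
    }
    out[n - 1] = cur * k;            /* drain the pipeline         */
}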

Internal Data Hazards. Internal data hazards occur when the data being calculated by a previous instruction is not yet available. If the processor cannot get the data to the right place by use of elaborate data-forwarding schemes, it must stall until the data is available.

The magnitude of the stalls for internal data hazards is much smaller than for external memory, but even a single-cycle stall can impact performance. For example, if the inner loop is only four instructions and one stall is present, the task takes five clocks instead of four. To put it another way, the data hazard increases the loop's execution time by 25 percent.

Data hazards can usually be avoided by scheduling instructions carefully and by finding other instructions to complete while waiting for data to become available. The result latency of each instruction indicates when its result becomes available and, therefore, when that result can be safely used without incurring a stall.
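One common scheduling pattern, shown here as a scalar-C sketch rather than anything from the book, is to accumulate into two independent sums so that consecutive multiply-accumulates never wait on each other's results:

/* Dot product with two accumulators; n is assumed even. The two
   sums are independent, so each multiply-accumulate can issue
   without waiting for the other's result latency. */
int dot(const short *a, const short *b, int n)
{
    int sum0 = 0, sum1 = 0;
    for (int i = 0; i < n; i += 2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    return sum0 + sum1;              /* combine once, at the end */
}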

Identifying Algorithm Techniques
A common issue in optimizing any code sequence is that a finite number of registers is available to store frequently used variables. Register file "pressure" occurs when the limited number of registers in a processor is not sufficient to conveniently implement an algorithm. In response, developers often reorganize the algorithm to execute using multiple code sequences or take a performance hit by saving intermediate values back to memory.

For example, Intel Wireless MMX technology, with its 16 64-bit SIMD registers, and the Intel XScale microarchitecture, with its 15 32-bit registers, reduce the pressure on register storage, allowing more variables to be stored in registers with a corresponding improvement in performance.

With a larger register file, a number of optimization techniques can be applied, trading off cycle consumption for memory bandwidth and code size. Some of the best techniques include:

1) Software pipelining
2) Multi-sample calculations
3) Data and coefficient preloading
4) Algorithm fusion

Software Pipelining
Software pipelining, or loop unrolling, is the best-known optimization technique. It involves reducing the number of iterations for a loop by increasing the number of calculations done in each iteration.
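For example, the loop of Figure 2 could be unrolled by a factor of four – 16 iterations of four calculations each instead of 64 iterations of one – reducing loop-management overhead and exposing independent work for scheduling:

    for (int i = 0; i < 64; i += 4) {
        a3[i]     = a1[i]     + a2[i];
        a4[i]     = a1[i]     - a2[i];
        a3[i + 1] = a1[i + 1] + a2[i + 1];
        a4[i + 1] = a1[i + 1] - a2[i + 1];
        a3[i + 2] = a1[i + 2] + a2[i + 2];
        a4[i + 2] = a1[i + 2] - a2[i + 2];
        a3[i + 3] = a1[i + 3] + a2[i + 3];
        a4[i + 3] = a1[i + 3] - a2[i + 3];
    }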

The advantage with Intel Wireless MMX technology is in reduced cycle consumption; however, additional performance enhancements may also be achieved. Load-to-use stalls may be more easily eliminated with scheduling, and multi-cycle instructions may be interleaved with other instructions.

The disadvantages of applying the technique include an increase in code size for critical loops and restrictions on the minimum number, and multiples, of operands that are processed. Software pipelining is an important technique that forms one of the pillars of algorithm mapping with Intel Wireless MMX technology.

Multi-Sample Calculations
The multi-sample technique is one of the more popular optimization methods with Intel Wireless MMX technology. This technique – a form of software pipelining – may be applied to algorithms that can be described through nested loops. These types of algorithms appear quite often in multimedia applications.

The multi-sample approach involves loop unrolling at two levels. The first level maps to the parallel execution resources, and the second level calculates as many outputs as possible through maximum use of loaded data. In addition to cycle count improvements, an advantage is realized through reduced memory bandwidth.
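A scalar-C sketch of the idea (a hypothetical FIR filter, not an example from the book): two outputs are produced per pass through the coefficient loop, so each loaded coefficient and input sample is reused rather than fetched again:

/* Two FIR outputs per inner loop; ntaps coefficients in Q15 format,
   nout assumed even. Each h[k] and most x[] values serve both
   accumulators, halving the loads per output. */
void fir2(const short *x, const short *h, short *y, int ntaps, int nout)
{
    for (int n = 0; n < nout; n += 2) {
        int acc0 = 0, acc1 = 0;
        for (int k = 0; k < ntaps; k++) {
            short c = h[k];             /* coefficient loaded once... */
            acc0 += c * x[n + k];       /* ...used for output n       */
            acc1 += c * x[n + 1 + k];   /* ...and for output n + 1    */
        }
        y[n] = (short)(acc0 >> 15);
        y[n + 1] = (short)(acc1 >> 15);
    }
}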

Amortization of the overhead cost is achieved for other aspects of the calculation, including both data formatting and loop management operations. The disadvantages of applying this technique include an increase in code size for critical loops and restrictions on the minimum and multiples of operands. Examples are discussed in following chapters that cover digital filtering, image filtering, and video motion estimation algorithms.

Register File Loading
The large register file enables an important optimization technique. When combined with the multi-sample approach, near-ideal throughput is possible for many important algorithms.

Filter coefficients can be effectively cached for multiple applications in audio and voice processing. In video motion estimation, a source block can be preloaded and then compared to multiple candidate reference blocks.
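As a minimal sketch, assuming a fixed four-tap filter, the coefficients can be read into locals once, ahead of the loop, so the compiler can keep them resident in registers throughout:

/* Four-tap FIR with Q15 coefficients preloaded into locals; the
   coefficient loads happen once rather than once per output. */
void fir4(const short *x, const short *h, short *y, int nout)
{
    const short h0 = h[0], h1 = h[1], h2 = h[2], h3 = h[3];
    for (int n = 0; n < nout; n++)
        y[n] = (short)((h0 * x[n] + h1 * x[n + 1] +
                        h2 * x[n + 2] + h3 * x[n + 3]) >> 15);
}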

Algorithm Fusion
In multimedia applications, sequences of algorithms are used. The output from one algorithm often becomes the input to the next. If two algorithms can be combined, a single, more efficient routine can be developed in many cases.
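As a hedged illustration (hypothetical scale and offset stages, not an example from the book), fusing two single-pass pixel loops into one means each pixel is loaded and stored once instead of twice:

/* Fused version of two passes: scale each pixel, then add an
   offset (assumed non-negative), saturating to 8 bits. Run
   separately, each pass would load and store every pixel. */
void scale_then_offset(unsigned char *img, int n, int scale, int offset)
{
    for (int i = 0; i < n; i++) {
        int v = (img[i] * scale) >> 8;            /* first stage  */
        v += offset;                              /* second stage */
        img[i] = (unsigned char)(v > 255 ? 255 : v);
    }
}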

Identifying algorithms that can be merged can be a complicated task. However, the potential reduction in both cycle consumption and memory bandwidth is appealing. When applied effectively, load amortization can be achieved with dramatic results.

The choice of optimization implementation – integrated performance primitives, vectorizing compiler, C intrinsics, or assembly code – is a trade-off between development time on one hand and control of the code sequences and ultimate performance on the other.

The other factor that you must consider is how difficult maintenance of the code base becomes. Figure 3 below shows the relative complexity of the various approaches, with the quicker, easy-to-maintain IPP libraries on the left and the more time-consuming, difficult-to-maintain detailed assembly code on the right.

Figure 3. Selecting the Method to Write the Multimedia Code

In general, the choice of optimization approach should be evaluated from left to right to determine how much effort is required to accelerate a particular hotspot. The approach that yields the required performance with the minimum of effort should be selected.

This article was excerpted from "Programming with Intel® Wireless MMX Technology," by Nigel Paver, Bradley Aldrich, and Moinul Khan. Copyright © 2004 Intel Corporation. All rights reserved.

Nigel Paver is an architect and design manager for Wireless MMX technology at Intel Corporation. Bradley Aldrich is a leading authority at Intel Corporation on image and video processing. Moinul Khan is a multimedia architect at Intel Corporation.
