Code techniques for processor pipeline optimization: Part 3 -

Code techniques for processor pipeline optimization: Part 3


A typical application has a set of control-oriented operations, such ascompare, branching, and looping. Certain applications, such as a datasort operation, are control heavy. Even in a more data-intensiveapplication, such as video processing (discussedin Part 1 and Part 2),controloperations can consume a significant part of the application'sexecution time.

The optimizing control-oriented operation has three key components:branch overhead reduction, conditional execution, and addressingtechniques.

Reduce Branch Cost
In a deeply pipelined device, branches can be expensive since, during abranch, many stages of a pipeline can potentially be flushed, thuscausing stalls of five cycles. XScale has branch prediction logic andbranch target buffers to ensure that branch penalties are as small aspossible.

From the application developer's point of view, avoiding unnecessarybranches and ensuring that the branches are predictable helps improveperformance. The following set of examples demonstrates how toeliminate potential branches, eliminate unnecessary loop overhead, andmake control flow efficient.

The branch prediction and branch target buffers can reducebranch-related penalties a great deal. The operating system andplatform level software must ensure that the branch target buffer (BTB)is enabled.

The BTB has only 32 entries. The size of the branch target bufferlimits the number of correctly predictable branches. Because the totalnumber of branches executed in a program is relatively large comparedto the size of the branch target buffer, it is often beneficial tominimize the number of branches in a program. Consider the following Ccode segment:

intfoo(int a) {
    if (a > 10)        return 0;
    else        return 1;

The code generated for the if…else portion of this code segment using branches is:

   cmp     r0,#10
    ble      L1
    mov    r0,#0
    b        L2 L1:     mov    r0,#1

This code takes three cycles to execute the else statement and fourcycles for the if statement assuming best-case conditions and no branchmisprediction penalties. In the case of the XScale, a branchmisprediction incurs a penalty of four cycles.

If the branch is mispredicted 50 percent of the time and if both theif statement and the else statement are equally likely to be taken, onan average the code above takes 5.5 cycles to execute.

Using the XScale to execute instructions conditionally, the codegenerated for the preceding if…else statement is:

cmp        r0,#10 movgt      r0,#0 movle      r0,#1

The preceding code segment would not incur any branch mispredictionpenalties and would take three cycles to execute assuming best-caseconditions. Using conditional instructions helps to speed up executionsignificantly.

Use Conditional Instruction
Many embedded applications, such as parsing of a packet or looking fora peak in a spectrum, are loop intensive. Branch density – the numberof instructions per branch – can be high, which may mean that thenumber of instructions in a loop or number of instructions between twobranches is low. Each loop has management overhead – counting the looppointer and comparing for the exit condition – associated with it.

The XScale provides the ability to execute instructionsconditionally based on a set of conditional flags. This conditionalexecution feature, combined with the ability of the instructions tomodify the condition codes, makes a wide array of optimizationspossible.

Optimizing Condition Checks
Core instructions can selectively modify the state of the conditioncodes. When generating code for if…else and loop conditions, it isoften beneficial to make use of this feature to set condition codes,thereby eliminating the need for a subsequent compare instruction.Consider the following C statement.

if ((a+ b) !=0)
    c = c + 1;

Code generated for the if condition without using an add instructionto set condition codes is:

add        r2, r0, r1 cmp       r2, #0 addne     r3, r3,#1

However, code can be optimized making use of an add instruction toset condition codes:

adds        r2, r0, r1 addne      r3, r3, #1

Condition checking for coprocessor registers can also be performed.SIMD flags in the wCASF register are updated during execution ofWireless MMX instructions. Then, using one of the three flag extractionoperations – TANDC, TORC, or TEXTRC – flags for the XScale core can beupdated.

This method allows checking of all or one of the SIMD fields forconditional execution. Called group conditional execution, this methodis shown in the following example:

wsubhus    wR1, wR2, wR3
@ Saturating subtraction minimum ofwR1
@ is zero torch R15 @ Updating core flags with ORed @ coprocessor flag values addeq        r2, r2, #1
@ now executes conditional coprocessorflag

All preceding techniques of effectively using conditional executioncan also be applied to the group conditional execution. For cases suchas peak detection or finding a match in a vector, you can use groupconditional techniques.

The instructions that increment or decrement the loop counter canalso be used to modify the condition codes. Modifying the codeseliminates the need for a subsequent compare instruction. A conditionalbranch instruction can then be used to exit or continue with the nextloop iteration.

Consider the following C code segment:

for (i= 10; i != 0; i–) {

The optimized code generated for the preceding code segment wouldlook like:

L6: @equivalent to inner_kernel     subs    r3, r3, #1
    bne    .L6

Using the above argument, it is also beneficial to rewrite loopswhenever possible to make the loop exit conditions check against thevalue 0. For example, the code generated for the following code segmentneeds a compare instruction to check for the loop exit condition.

for (i= 0; i < 10; i++) {

If the loop is rewritten as follows, the code generated avoids usinga compare instruction to check for the loop exit condition.

for (i= 9; i >= 0; i–) {

However, the use of conditional instructions should be consideredcarefully to ensure it improves performance. To decide when to useconditional instructions over branches, consider this hypothetical codesegment:

< style="font-style: italic;">if(cond)
    if_stmt else     else_stmt

Using the following data:

N1Beta = number of cyclesto execute the if_stmt, assuming the use of branch instructions
N2Beta = number of cycles toexecute the else_stmt, assuming the use of branch instructions
P1 = percentage of times theif_stmt is likely to be executed
P2 = percentage of times likelyto incur a branch misprediction penalty
N1c = number of cycles toexecute the if…else portion using conditional instructions assumingthe if condition to be true
N2c = number of cycles toexecute the if…else portion using conditional instructions assumingthe if condition to be false

Use conditional instructions when:


The following example illustrates a situation in which it is betterto use branches instead of conditional instructions.

    cmp    r0, #0
   bne     L1    add     r0, r0, #1    add     r1, r1, #1    add     r2, r2, #1    add     r3, r3, #1    add     r4, r4, #1 b           L2 L1:    sub     r0, r0, #1    sub     r1, r1, #1    sub     r2, r2, #1    sub     r3, r3, #1    sub     r4, r4, #1 L2:

The CMP instruction takes one cycle to execute, the if statementtakes seven cycles to execute, and the else statement takes six cyclesto execute. If the code were changed to eliminate the branchinstructions by using conditional instructions, the if…else statementwould take 10 cycles to complete.

Assuming an equal probability of both paths being taken and thatbranch mispredictions occur 50 percent of the time, the cost of usingconditional instructions is 11 cycles and the cost of branches is 9.5cycles.

Optimizing Complex ExpressionsUsing Conditional Execution
Using conditional instructions helps improve the code generated forcomplex expressions such as the C shortcut evaluation feature.

The use of conditional instructions in this fashion helps improveperformance by minimizing the number of branches, thereby minimizingthe penalties caused by branch mispredictions.

intfoo(int a, int b) {
    if (a != 0&& b != 0)
       return 0;
else        return 1;

The optimized code for the if condition is:

    cmp        r0,#0   cmpne     r1,#0

This approach also reduces the utilization of branch predictionresources. With Wireless MMX technology, the flag registers can be setbased on data values in the coprocessor registers or SIMD flagregisters.

Use Addressing Modes Efficiently
XScale and Wireless MMX provide a variety of addressing modes that makeindexing an array of objects highly efficient. The following codesamples illustrate how various kinds of array operations can beoptimized to make use of these addressing modes:

@ Set the contents ofthe word pointed to
@ by r0 to the value contained in r1 and
@ make r0 point to the next word

       wstrw wR1,[r0], #4

@ Increment thecontents of r0 to make it
@ point to the next word and set the
@ contents of the word pointed to the
@ value contained in r1

       wstrw wR1, [r0, #4]!

@ Set the contents ofthe word pointed to
@ by r0 to the value contained in r1 and
@ make r0 point to the previous word

       wstrw wR1,[r0], #-4

@ Decrement thecontents of r0 to make it
@ point to the previous word and set the
@ contents of the word pointed to the value
@ contained in r1

       wstrw wR1,[r0, #-4]!

Various addressing modes save you from explicitly spending aninstruction on updating the pointer.

Miscellaneous Approaches
Apart from the techniques mentioned earlier, you might consider thesetricks geared towards interesting use of the instructions. Consider thefollowing two cases.

Optimizing theUse of Immediate Values. For programming purposes, constantvalues may need to be used. Constant values are created to be used asmasks or known coefficients in different calculations.

The MOV or MVN instruction should be used when loading an immediate,or constant, value into a register. However, immediate move isrestricted to a 12-bit number. One could load the constant from memory.Loading 32-bit or 64-bit constant values requires loading from thememory.

The compiler typically places all the constants in a literal poolclose to the instructions. Literal pools are not likely to be in thedata cache, which makes loading constants expensive – a main memoryaccess. Also, LDR instruction has the potential to pollute the datacache.

It is possible to generate a whole set of constant values using acombination of MOV, MVN, ORR, BIC, and ADD instructions. Use acombination of the above instructions to set a register to a constantvalue. An example of this is shown in these code samples.

@Setthe value of r0 to 127
       mov r0, #127
@Set the value of r0 to 0xfffffefb.        mvn r0, #260
@Set the value of r0 to 257        mov r0, #1
       orr r0, r0, #256
@Set the value of r0 to 0x51f        mov r0, #0x1f
       orr r0, r0, #0x500
@Set the value of r0 to 0xf100ffff        mvn r0, #0xff, LSL 16
       bic r0, r0, #0xe, LSL 8
@ Set the value of r0 to 0x12341234        mov r0, #0x8d, LSL 2
       orr r0, r0, #0x1, LSL 12
       add r0, r0, r0, LSL #16
@ shifter delay of 1 cycle

<>It is possible to load any 32-bit value into a register using asequence of four instructions. With  Wireless MMX technology, twosuch 32-bit values can be generated in core registers, and thentransferred to coprocessor registers using TCMR, TMCRR, and TBCSTinstructions.

Bit Field Manipulation.
Different encryption algorithms such as Data Encryption Standard (DES),Triple DES (T-DES), and hashing functions (SHA) perform manybit-manipulation operations.

The shift and logical operations of the XScale provide a useful wayof manipulating bit fields. Bit field operations can be optimized usingregular instructions:

@ Set the bit numberspecified by
@ r1 in register r0

mov    r2, #1
orr      r0, r0, r2, asl r1

@ Clear the bitnumber specified by
@ r1 in register r0

mov    r2, #1
bic       r0, r0, r2, asl r1

@ Extract the bitvalue of the bit
@ number specified by r1 of the
@ value in r0 storing the value in r0

mov    r1, r0, asr r1
and      r0, r1, #1

@ Extract the higherorder 8 bits of the
@ value in r0 storing
@ the result in r1

mov    r1, r0, lsr #24

This approach helps other applications such as video stream parsing.Wireless MMX supports 64-bit-wide bit-wise manipulation – for instance,shift, and, or – which can be effectively used for different bit-wisealgorithms.

The methods described in this series of articles are intended forassembly language development but can also be applied duringdevelopment using intrinsic functions and in-line assembly.

High-level language programming styles based on these techniqueshave also been presented. These programming styles demonstrate how bestto use different instructions and, more specifically, how the sequenceof instructions should be scheduled to reduce stalls. However, the listof methods described here is not exhaustive.

Finally, a few points to remember are:

1) Use the correct precisionfor the algorithm, and choose instructions accordingly.
2) Interleave instructionsbetween the pipe to hide result and issue latency.
3) Schedule load and storeswith the correct data-addressing mode.
4) Watch out for load-to-usepenalty and shifter-processing latency.
5) Count down on loops toreduce loop control overhead.
6) Use conditionalinstructions to avoid branch costs.

To read Part 1, go to “Microarchitectural optimization philosophy.”
To read Part 2, go to “Optimization for data processing-orientedoperations.”

Thisseries of articles was excerpted from “Programming withIntel Wireless MMX Technology,” by Nigel Paver, Bradley Aldrich andMoinul Khan. Copyright © 2004 Intel Corporation. All rightsreserved.

Nigel Paver is an architect anddesign manager for Wireless MMX technology at IntelCorporation . Bradley Aldrich is a leading authority at IntelCorporation on image and video processing. Moinul Khan is a multimediaarchitect at Intel Corporation.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.