CMP EMBEDDED.COM

Code techniques for processor pipeline optimization: Part 1
Microarchitectural optimization philosophy



Embedded.com
Optimizing a code segment can contribute greatly to its performance. An optimized application makes the best use of all available microarchitectural features. In a pipelined processor, the key to optimization is to keep every pipeline stage and functional unit occupied with meaningful work.

This series of articles will explain the implications of the pipelined structure in terms of performance and then offer best-known ways to utilize all the available resources in data processing (Part 2) and control-oriented operations (Part 3).

We will focus on low-level instruction sequences to demonstrate different optimization techniques. While most of the techniques described are generic to all processors, some are specific to the Intel XScale and Wireless MMX microarchitectures.

Picking the right optimization philosophy
The optimization process is a hierarchical one that needs to be addressed at different levels during the application development cycle. During algorithm selection, high-level code development, and kernel development, an optimization effort is necessary. The focus here is simply on the optimizations useful for developing assembly routines or kernel-level coding.

Two instruction sets are involved. Intel XScale microarchitecture implements ARMv5TE-compliant instructions, and Intel Wireless MMX technology implements an instruction set specifically designed to accelerate multimedia applications.

The implementation uses two concurrent pipelines, one for each instruction set architecture (ISA). With regard to the pipeline and ISA, you have two aspects of optimization to consider: 1) choosing the right instruction, and 2) choosing the right sequence of instructions.

Choosing the Right Instruction. The combined instruction sets of the Intel XScale microarchitecture and Intel Wireless MMX technology have a large variety of instructions from which the programmer can choose those most appropriate for a given application.

For example, 32-bit addition can be performed on the register file of the Intel XScale core using an ADD instruction. An addition of this kind can also be performed on the coprocessor register file using a WADDW instruction.

Choosing the correct instruction for the operation and partitioning the desired kernel between the two register files are the first challenges in optimizing for the pipeline.

Intel Wireless MMX instructions are relatively orthogonal; that is, each instruction supports the same operations on different data types: byte, half-word, and word. Selecting the data type that matches the algorithmic need of the kernel gives you the most efficient use of the resources.

For example, if an algorithm needs only 16-bit accuracy, you can use a 16-bit data type and process four data samples concurrently in each 64-bit Intel Wireless MMX register. With 32-bit data you still meet the accuracy requirement, but you get only two-way concurrency.
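The lane trade-off can be sketched in portable C. This is only a model of packed arithmetic on a 64-bit word, not Wireless MMX code: the function names add_4x16 and add_2x32 are hypothetical, and the wrap-around (non-saturating) lane behavior is an assumption chosen to resemble a plain packed add.

```c
#include <stdint.h>

/* Portable model of a packed add on one 64-bit word.
 * Assumption: lanes are unsigned and wrap around on overflow
 * (no saturation). Function names are illustrative only. */

/* Add four independent 16-bit lanes in a single 64-bit operation. */
uint64_t add_4x16(uint64_t a, uint64_t b)
{
    const uint64_t top = 0x8000800080008000ULL; /* top bit of each 16-bit lane */
    /* Add the low 15 bits of every lane; a carry can reach the lane's
     * own top bit but can never spill into the neighboring lane. */
    uint64_t low = (a & ~top) + (b & ~top);
    /* Fold in the top bits with XOR, discarding each lane's carry-out. */
    return low ^ ((a ^ b) & top);
}

/* Add two independent 32-bit lanes in a single 64-bit operation. */
uint64_t add_2x32(uint64_t a, uint64_t b)
{
    const uint64_t top = 0x8000000080000000ULL; /* top bit of each 32-bit lane */
    uint64_t low = (a & ~top) + (b & ~top);
    return low ^ ((a ^ b) & top);
}
```

For the inputs 0x0001FFFF00030004 and 0x0001000100010001, add_4x16 yields 0x0002000000040005 (the second lane wraps to zero on its own), while add_2x32 yields 0x0003000000040005 because the same bits are treated as two wider samples. The same 64 bits carry four results in one case and two in the other.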

Choosing the Right Sequence. Intel XScale microarchitecture with Intel Wireless MMX technology has two processing pipelines. Resource and data-dependency hazards stall these pipelines, and while a stall lasts, the concurrency between the pipe stages goes unused, which costs both performance and power.

You can employ different instruction-scheduling techniques to reduce such stalls. Instruction scheduling refers to rearranging a sequence of instructions to minimize pipeline stalls.

When you are writing code in a high-level language, you cannot directly control the selection of the right instruction or the right sequence of instructions. The compiler tool chain can handle some of these concerns.

However, for performance-critical applications, heavily used routines might need to be written or optimized by hand in assembly language or using intrinsic functions. The majority of this optimization effort is spent in stall reduction.

Stall-directed instruction scheduling
The initial step in basic pipeline optimization is to understand the pipeline and delay characteristics of each instruction. Two parameters characterize an instruction:

1) Resource, or issue, latency. When a functional unit is processing an instruction, it can be busy for one or more cycles. Resource or issue latency for an instruction is the number of cycles that the functional unit will be busy before the next instruction using the same functional unit can be scheduled.

2) Result latency. Result latency for an instruction is the number of cycles that the functional unit takes to produce the result. If two instructions are back to back and the later instruction uses the data produced by the earlier one, the later instruction stalls until that result is available.

The only way to solve the stall and hazard problem is to avoid it. Stall avoidance is achieved by reordering instructions while maintaining the same functionality.
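The effect of the two latency parameters, and of reordering, can be sketched with a toy cycle counter in C. Everything here is an assumption made for illustration: a single in-order issue slot, a load with issue latency 1 and result latency 3, an add with issue and result latency 1, and the names Instr, count_cycles, original, and reordered. These are not the actual Intel XScale latencies, and the model ignores everything except unit-busy and read-after-write stalls.

```c
#include <stddef.h>

#define NUM_REGS  16    /* hypothetical register count for this model */
#define NUM_UNITS 2
#define NO_REG    (-1)

enum { UNIT_ALU = 0, UNIT_LOAD = 1 };

typedef struct {
    int dst, src1, src2;   /* register numbers, or NO_REG */
    int unit;              /* functional unit the instruction occupies */
    int issue_latency;     /* cycles the unit stays busy */
    int result_latency;    /* cycles until dst can be read */
} Instr;

static int later(int a, int b) { return a > b ? a : b; }

/* Cycles needed to issue prog[0..n-1] in order, one instruction per
 * cycle at best. An instruction waits until its unit is free and its
 * source registers hold their results. */
int count_cycles(const Instr *prog, size_t n)
{
    int reg_ready[NUM_REGS] = {0};   /* cycle each register becomes readable */
    int unit_free[NUM_UNITS] = {0};  /* cycle each unit accepts new work */
    int next_issue = 0;

    for (size_t i = 0; i < n; i++) {
        const Instr *in = &prog[i];
        int t = later(next_issue, unit_free[in->unit]);     /* resource hazard */
        if (in->src1 != NO_REG) t = later(t, reg_ready[in->src1]); /* data hazard */
        if (in->src2 != NO_REG) t = later(t, reg_ready[in->src2]);

        unit_free[in->unit] = t + in->issue_latency;
        if (in->dst != NO_REG) reg_ready[in->dst] = t + in->result_latency;
        next_issue = t + 1;
    }
    return next_issue;
}

/* Assumed latencies (illustrative only): load = issue 1, result 3;
 * add = issue 1, result 1. */
#define LOAD(d, base) { d, base, NO_REG, UNIT_LOAD, 1, 3 }
#define ADD(d, a, b)  { d, a, b, UNIT_ALU, 1, 1 }

/* Each add immediately follows the load that feeds it. */
const Instr original[4]  = { LOAD(0, 8), ADD(1, 0, 2), LOAD(3, 9), ADD(4, 3, 5) };
/* Same work, with the second load hoisted between producer and consumer. */
const Instr reordered[4] = { LOAD(0, 8), LOAD(3, 9), ADD(1, 0, 2), ADD(4, 3, 5) };
```

Under these assumed latencies, count_cycles(original, 4) returns 8 and count_cycles(reordered, 4) returns 5. Both sequences compute the same values; the reordered one simply places independent work in the cycles that the first sequence spends waiting on load results.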
