A wide range of new applications is entering the mainstream of desktop, server and portable/mobile computing, including data mining, database management, complex search and pattern matching algorithms, and compression algorithms for audio, video, images and data.
To perform such complex operations more efficiently, the Intel Streaming SIMD Extensions 2 (Intel SSE2) instruction set has been extended with a much more efficient means of completing tasks such as motion vector search in video processing and packed dword multiplies generated by compilers, and with improved throughput between the CPU and graphics processor, to benefit a wide scope of applications. This improved instruction set is being released as Intel SSE4.
This article provides a quick overview of the new extensions to the instruction set architecture (ISA) and includes examples of how you can use the new instructions to optimize video encoding functions. We will also look at the new streaming load instruction, which can significantly improve the performance of applications that share data between the graphics processor and the CPU.
The 45 nanometer next generation Intel Core 2 processor family (Penryn) includes support for the Intel SSE4 instruction set as an extension to the Intel 64 Instruction Set Architecture (ISA). These new instructions deliver performance gains for SIMD (single instruction, multiple data) software and will enable the new family of microprocessors to deliver improved performance and energy efficiency with a broad range of 32-bit and 64-bit software.
54 new instructions
Intel SSE4 is a set of 54 new instructions designed to improve performance and energy efficiency of media, 3D and gaming applications. The Penryn microarchitecture supports 47 of these new instructions, with the remainder being introduced on future processors. We can group them into three areas:
1) Video accelerators include instructions to accelerate 4×4 sum of absolute differences (SAD) calculations, sub-pixel filtering and horizontal minimum search. The Intel SSE4 video accelerator instructions include a sum of absolute differences engine that can perform eight SAD calculations at once.
New instructions also include an instruction for horizontal minimum search that can look at eight values and identify the minimum value and the index of that minimum value, as well as instructions for converting packed integers into wider data types, allowing for faster integer-to-float conversions in 3D applications.
2) Graphics building blocks include common graphics primitives generalized for compiler auto-vectorization, such as packed dword multiply and floating point dot products.
3) The Streaming Load instruction enables faster reads from Uncacheable Speculative Write Combining (USWC) memory. When used in conjunction with Intel SSE2 streaming writes, Streaming Load allows for faster Memory Mapped I/O (MMIO).
Now we will take a closer look at how you can use Intel SSE4 to improve performance and energy efficiency using new instructions for Motion Vector Search and Streaming Loads.
Optimizing motion vector search
Motion estimation is one of the main bottlenecks in video encoders. It involves searching reference frames for best matches, and it can consume as much as 40 percent of the total CPU cycles used by the encoder. Search quality is an important determinant of the ultimate quality of the encoded video.
For this reason, algorithmic and SIMD optimizations designed to improve encoding speed often target the search operation. SIMD instructions provide an ideal means of optimizing motion estimation because the required arithmetic operations are performed on blocks of pixels with a high degree of parallelism.
The Intel SSE2 instruction PSADBW is widely used to optimize this operation. PSADBW computes sums of absolute differences between two operands of 16 unsigned bytes each. One sum is the result of the eight lower differences, while the other sum is the result of the eight upper differences.
A PSADBW-based block matching function finds the matching blocks for four 4×4 blocks in each call. Since the width of a 4×4 block is only 4 bytes, in order to use this instruction, two rows are first unpacked to concatenate two 4-byte data sets into 8 bytes.
Since each load gets 16 consecutive bytes, data is loaded from four consecutive blocks in one load. Therefore, it makes sense to write this function to find the matching block for four blocks in each call.
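To make the PSADBW arithmetic concrete, here is a minimal scalar sketch in C. It is a model of what the instruction computes, not the SIMD implementation itself; in real code the same operation is available through the `_mm_sad_epu8` intrinsic in `<emmintrin.h>`. The function names `psadbw_model` and `sad_4x4_psadbw` are hypothetical, and for simplicity the sketch handles a single 4×4 block rather than four blocks per call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Scalar model of PSADBW on 128-bit operands: two sums of eight
   absolute byte differences, one for each 8-byte half. */
void psadbw_model(const uint8_t a[16], const uint8_t b[16],
                  uint16_t *low_sum, uint16_t *high_sum)
{
    uint16_t lo = 0, hi = 0;
    for (int i = 0; i < 8; i++) {
        lo += (uint16_t)abs((int)a[i] - (int)b[i]);
        hi += (uint16_t)abs((int)a[i + 8] - (int)b[i + 8]);
    }
    *low_sum = lo;
    *high_sum = hi;
}

/* SAD of one 4x4 block. Each row is only 4 bytes wide, so rows are
   packed two per 8-byte half before the PSADBW step. */
uint16_t sad_4x4_psadbw(const uint8_t *cur, const uint8_t *ref, size_t stride)
{
    uint8_t a[16], b[16];
    uint16_t lo, hi;

    assert(stride >= 4);
    for (size_t row = 0; row < 4; row++) {
        memcpy(a + 4 * row, cur + row * stride, 4);
        memcpy(b + 4 * row, ref + row * stride, 4);
    }
    psadbw_model(a, b, &lo, &hi);
    return (uint16_t)(lo + hi);
}
```

The packing loop is the overhead the article refers to: PSADBW works on 8-byte groups, so a 4-pixel-wide block needs extra data movement before each SAD.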
The new MPSADBW instruction
The new MPSADBW instruction improves performance by computing eight sums of absolute differences in a single instruction. Each sum is computed from the absolute differences between two groups of four unsigned bytes.
Figure 1 below shows how the eight sums are computed using MPSADBW.
MPSADBW takes an immediate as a third operand:
* Bits 0 and 1 of the immediate value are used to select one of the four groups of 4 bytes from the source operand.
* Bit 2 of the immediate value is used to select one of the two groups of 11 bytes from the destination operand.
|Figure 1: Computing eight sums using MPSADBW|
In Figure 1 above, the box with the darkened solid outline is the selected block. The box with the darkened broken outline represents other blocks that could be selected by setting the corresponding bits in the immediate. While the ideal block size for this instruction is 4×4, other block sizes, such as 8×4 or 8×8, can also benefit from MPSADBW.
As shown in Figure 1, bits 0 and 1 of the immediate value are used to select a different 4-pixel group for the computation. To compute sums of absolute differences for block sizes that are multiples of 4×4, we repeat the MPSADBW operation using a different immediate value each time, and then add the results from the multiple MPSADBW operations using PADDUSW to yield the final results.
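The immediate encoding and the PADDUSW accumulation step can be sketched with a scalar model in C. This is an illustration of the instruction semantics described above, not production SIMD code; the corresponding intrinsics are `_mm_mpsadbw_epu8` in `<smmintrin.h>` and `_mm_adds_epu16` in `<emmintrin.h>`. The names `mpsadbw_model`, `paddusw_model` and `row_sad_8wide` are hypothetical, and the 8-pixel-wide row is an assumed example of a block size that is a multiple of 4.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of MPSADBW. dst_op is the destination operand (for
   example, reference pixels); src_op is the source operand. */
void mpsadbw_model(const uint8_t dst_op[16], const uint8_t src_op[16],
                   unsigned imm, uint16_t sums[8])
{
    assert(imm < 8u);                          /* only bits 0-2 are modeled */
    unsigned src_off = (imm & 3u) * 4u;        /* bits 0-1: 4-byte source group */
    unsigned dst_off = ((imm >> 2) & 1u) * 4u; /* bit 2: 11-byte destination group */

    for (unsigned j = 0; j < 8; j++) {
        uint16_t s = 0;
        for (unsigned k = 0; k < 4; k++)
            s += (uint16_t)abs((int)dst_op[dst_off + j + k] -
                               (int)src_op[src_off + k]);
        sums[j] = s;
    }
}

/* Model of one PADDUSW lane: unsigned 16-bit add with saturation. */
uint16_t paddusw_model(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + (uint32_t)b;
    return s > 0xFFFFu ? (uint16_t)0xFFFFu : (uint16_t)s;
}

/* SAD of one 8-pixel row of cur at eight candidate positions in ref:
   two MPSADBW steps with different immediates, added with PADDUSW. */
void row_sad_8wide(const uint8_t ref[16], const uint8_t cur[16],
                   uint16_t sums[8])
{
    uint16_t left[8], right[8];
    mpsadbw_model(ref, cur, 0u, left);   /* cur[0..3] against ref[j..j+3]   */
    mpsadbw_model(ref, cur, 5u, right);  /* cur[4..7] against ref[j+4..j+7] */
    for (int j = 0; j < 8; j++)
        sums[j] = paddusw_model(left[j], right[j]);
}
```

An immediate of 5 (binary 101) selects source group 1 and the destination group that starts at byte 4, so the right half of the row is compared against the matching right half of each candidate.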
Example: horizontal minimum search
After the sums are computed, you can use the PHMINPOSUW instruction to locate the minimum of the computed SADs, as shown in the code example in Figure 2 below.
|Figure 2: Code example for the Optimized Integer Block Matching function, finding the matching block for four 4×4 blocks in each call.|
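The original listing for Figure 2 is not reproduced here. As a rough stand-in, the following C sketch combines MPSADBW and PHMINPOSUW through the `_mm_mpsadbw_epu8` and `_mm_minpos_epu16` intrinsics to search eight horizontal candidate positions for a single 4×4 block. It is a simplified illustration under stated assumptions, not the article's function: the `best_match` type and the function name are hypothetical, each reference row is assumed to have at least 16 readable bytes, and a scalar fallback is used when the compiler is not targeting Intel SSE4.1 (for example, without `-msse4.1` on GCC or Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>
#endif

typedef struct {
    uint16_t sad;    /* smallest SAD found */
    uint16_t index;  /* horizontal offset (0..7) of the best candidate */
} best_match;

/* cur: 4x4 block, rows cur_stride bytes apart.
   ref: first candidate position; 16 readable bytes per row assumed. */
best_match search_4x4_8positions(const uint8_t *cur, size_t cur_stride,
                                 const uint8_t *ref, size_t ref_stride)
{
    best_match best;
    assert(cur_stride >= 4 && ref_stride >= 16);
#if defined(__SSE4_1__)
    __m128i acc = _mm_setzero_si128();
    for (size_t row = 0; row < 4; row++) {
        int32_t pix;
        memcpy(&pix, cur + row * cur_stride, 4);   /* 4 pixels of the block */
        __m128i c = _mm_cvtsi32_si128(pix);
        __m128i r = _mm_loadu_si128((const __m128i *)(ref + row * ref_stride));
        /* eight 4-pixel SADs for this row, accumulated with saturation */
        acc = _mm_adds_epu16(acc, _mm_mpsadbw_epu8(r, c, 0));
    }
    __m128i m = _mm_minpos_epu16(acc);             /* PHMINPOSUW */
    best.sad = (uint16_t)_mm_extract_epi16(m, 0);
    best.index = (uint16_t)_mm_extract_epi16(m, 1);
#else
    /* Scalar fallback with the same result, including first-minimum ties. */
    best.sad = UINT16_MAX;
    best.index = 0;
    for (uint16_t j = 0; j < 8; j++) {
        unsigned sad = 0;
        for (size_t row = 0; row < 4; row++)
            for (size_t k = 0; k < 4; k++)
                sad += (unsigned)abs((int)ref[row * ref_stride + j + k] -
                                     (int)cur[row * cur_stride + k]);
        if (sad < best.sad) {
            best.sad = (uint16_t)sad;
            best.index = j;
        }
    }
#endif
    return best;
}
```

A full motion search would call such a routine across many rows and candidate windows; the point of the sketch is only that the minimum and its position come out of a single PHMINPOSUW step.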
Fast SAD calculations
MPSADBW is very useful for fast SAD calculations. MPSADBW calculates and sums 32 absolute differences per instruction, which is double the yield of PSADBW. You can use the MPSADBW instruction to compute sums of absolute differences, not only for 4-byte wide blocks, but also for 8-byte wide and 16-byte wide blocks.
The built-in source shifts allow the developer to avoid many of the unaligned loads or ALIGN instructions which would otherwise be required, contributing to the overall efficiency and performance of the application.
Table 1 below shows the speed-ups from these optimizations. The results are expressed as cycles per block SAD that we computed. The speed-up column contains ratios that we computed using the Intel SSE2 results as the baseline. We tested three different block sizes: 4×4, 8×8 and 16×16.
|Table 1: Speed-up results for Motion Vector Search, measured in number of cycles per block SAD. Intel SSE4 can provide 1.6 to 3.8x faster performance than Intel SSE2. (The Intel Compiler 10.0.018 beta was used to build the code. The 'O2' and 'QxS' compiler flags were used. 'QxS' is a new flag for the compiler to generate optimized code specifically for Penryn.)|
In addition to the Intel SSE4 instructions, multi-threaded implementations can help you achieve additional performance gains in compute-intensive applications such as video codecs.
Memory Mapped I/O devices
UC (uncacheable) memory is memory whose data is not stored in any of the processor's caches. Uncacheable Speculative Write Combining (USWC) memory is an extension of UC memory that holds uncacheable data that is typically subject to sequential write operations, such as frame buffer memory.
As its name indicates, write-combining allows all writes to a cache line to be combined before they are written out to memory. USWC memory is often used in Memory Mapped I/O (MMIO) devices.
Intel SSE2 introduced the MOVNTDQ instruction for streaming writes. It improves write throughput by streaming uncacheable writes to devices that are memory-mapped to USWC memory. The streaming write instruction tells the processor that the target data is to be written directly to external memory and not cached in any level of the processor's cache hierarchy.
Streaming writes allow writes to the same cache line (typically 64 bytes) to be combined before going out to memory, as opposed to writing to memory in 16-byte chunks, which substantially improves bus utilization and write throughput.
The new MOVNTDQA instruction
Although USWC memory is primarily employed for memory that is subject to write operations (such as frame buffer memory), load operations may also occur.
For example, today's sophisticated graphics devices can perform rendering operations while the CPU performs other tasks. The CPU can then load in the data for further processing and display. One problem with this is that SSE load instructions operate on chunks of at most 16 bytes and have limited throughput when accessing USWC memory. The operation requires two separate front-side bus transactions and consumes four bus clocks.
To relieve this bottleneck, Intel SSE4 introduces the new MOVNTDQA instruction for streaming loads, improving read throughput by streaming uncacheable reads from USWC memory. MOVNTDQA allows the fetching of a 16-byte chunk within an aligned cache line of USWC memory. Similar to streaming writes, the streaming load instruction allows data to be moved in full 64-byte cache line quantities as opposed to 16-byte chunks, substantially improving bus throughput.
Loading a full cache line
Fetching a complete 64-byte cache line and temporarily holding the contents in a streaming load buffer enables the supply of subsequent loads without using the front side bus (FSB), which makes these loads much faster. Figure 3 provides a code example.
|Figure 3: Code example of loading a full cache line using the MOVNTDQA instruction|
For streaming across multiple line addresses, loads of all four chunks of a given line should be batched together. Note that the loading of each chunk using MOVNTDQA instructions does not have to be in the order shown in the example.
It is also important to note that a streaming load of a given chunk will cause a new streaming load buffer allocation if one does not currently exist. Because there are a finite number of streaming load buffers within any given micro-architectural implementation, grouping the chunks together will generally improve overall utilization.
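The original listing for Figure 3 is not reproduced here. A minimal sketch of the idea in C, using the `_mm_stream_load_si128` intrinsic for MOVNTDQA, might look like the following. The function name `load_cache_line` and the 64-byte line size constant are assumptions for illustration. When the compiler is not targeting Intel SSE4.1, the sketch falls back to `memcpy`. On ordinary cacheable memory MOVNTDQA behaves like a regular aligned load, so the streaming benefit only appears when `src` really is USWC memory, which has to be mapped by a driver or the operating system.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>
#endif

#define CACHE_LINE_BYTES 64

/* Copies one cache line from src (64-byte aligned, USWC memory in real
   use) to dst (16-byte aligned, ordinary cacheable memory). */
void load_cache_line(const void *src, void *dst)
{
    assert(((uintptr_t)src % CACHE_LINE_BYTES) == 0);
    assert(((uintptr_t)dst % 16) == 0);
#if defined(__SSE4_1__)
    __m128i *s = (__m128i *)src;
    __m128i *d = (__m128i *)dst;

    /* Batch all four 16-byte MOVNTDQA loads of the line together so
       they can be served from the same streaming load buffer. */
    __m128i c0 = _mm_stream_load_si128(s + 0);
    __m128i c1 = _mm_stream_load_si128(s + 1);
    __m128i c2 = _mm_stream_load_si128(s + 2);
    __m128i c3 = _mm_stream_load_si128(s + 3);

    _mm_store_si128(d + 0, c0);
    _mm_store_si128(d + 1, c1);
    _mm_store_si128(d + 2, c2);
    _mm_store_si128(d + 3, c3);
#else
    memcpy(dst, src, CACHE_LINE_BYTES);  /* portable fallback */
#endif
}
```

Issuing the four loads before any stores keeps the chunks of one line grouped, in line with the batching advice above.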
Streaming Load programming models
There are two common programming models for using streaming loads: Bulk Load and Operate, and Incremental Load and Operate.
Bulk Load and Operate. Here the application loads the data using streaming loads and copies it in bulk to a temporary cacheable write-back (WB) buffer. After all the data has been loaded, the CPU operates on the temporary buffer and then sends the data back to memory.
Incremental Load and Operate. In this model the application loads a single cache line using streaming load, operates on the data, and then writes it back to memory. This model operates on the data as it is loaded, as opposed to loading a large amount of data first, as is the case with the Bulk Load and Operate model.
The Bulk Load and Operate programming model generates more consistent performance gains. In the Incremental Load and Operate programming model, the data operations performed between streaming loads could interfere with streaming load performance due to contention for buffers and other processor resources. The Bulk Load and Operate programming model reduces this possibility by performing data loading and operations in separate batches.
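The structural difference between the two models can be sketched in C as follows. This is an outline of the control flow only: `load_line` is a hypothetical stand-in (a plain `memcpy`) for four batched MOVNTDQA loads of one 64-byte line, and `operate` is a placeholder for the application's processing, here assumed to be a simple byte inversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { LINE = 64 };  /* assumed cache line size in bytes */

/* Hypothetical stand-in for four batched streaming loads of one line. */
void load_line(const uint8_t *src, uint8_t *dst)
{
    memcpy(dst, src, LINE);
}

/* Placeholder for the application's processing: invert every byte. */
void operate(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint8_t)~buf[i];
}

/* Bulk Load and Operate: load everything into a temporary cacheable
   buffer first, then process the whole buffer, then write it out. */
void bulk_load_and_operate(const uint8_t *uswc, uint8_t *tmp,
                           uint8_t *out, size_t n)
{
    assert(n % LINE == 0);
    for (size_t off = 0; off < n; off += LINE)
        load_line(uswc + off, tmp + off);
    operate(tmp, n);
    memcpy(out, tmp, n);
}

/* Incremental Load and Operate: load, process and write back one
   cache line at a time. */
void incremental_load_and_operate(const uint8_t *uswc, uint8_t *out, size_t n)
{
    uint8_t line[LINE];
    assert(n % LINE == 0);
    for (size_t off = 0; off < n; off += LINE) {
        load_line(uswc + off, line);
        operate(line, LINE);
        memcpy(out + off, line, LINE);
    }
}
```

Both functions produce the same output; they differ only in whether processing is interleaved with the loads, which is where the contention described above can arise.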
|Figure 4: Memory throughput from system memory using Streaming Load. Intel SSE4 can lead to a 5.0 to 7.5x speed increase over traditional loads. Specifically, Memory Throughput = (Processor Frequency * Number of Iterations * Data Copied per Iteration) / Total Execution Time (in clock cycles). For dual-threaded implementations, the memory throughput was calculated for each thread and then added together. The test system consisted of a next generation Intel Core 2 desktop processor (Wolfdale), Intel D975XBX2KR motherboard, 2 GB DDR2 RAM PC2-8000 (667MHz), Windows* XP Professional with Service Pack 2.|
To measure the performance of streaming loads from local system memory, we used four implementations of the Bulk Load and Operate programming model to load 4KB of data from USWC memory and copy the data to a cacheable WB buffer:
* One implementation without streaming loads using the MOVDQA instruction
* Another implementation using streaming loads with the new MOVNTDQA instruction
* Two additional implementations consisting of dual-threaded versions that split the USWC segment into two parts and performed the load and copy in two threads, each bound to a separate core.
For each implementation, the 4KB load and copy loop executed approximately 10,000 iterations, and we calculated the average memory throughput by measuring the total time required.
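The throughput formula given in the Figure 4 caption translates directly into code. The helper below is a small sketch of that arithmetic; the function name `memory_throughput` and its parameter names are hypothetical, and the result is in bytes per second when the processor frequency is given in Hz.

```c
#include <assert.h>
#include <stdint.h>

/* Memory Throughput = (Processor Frequency * Number of Iterations *
   Data Copied per Iteration) / Total Execution Time in clock cycles. */
double memory_throughput(double cpu_hz, uint64_t iterations,
                         uint64_t bytes_per_iteration, uint64_t total_cycles)
{
    assert(total_cycles > 0);
    return cpu_hz * (double)iterations * (double)bytes_per_iteration
           / (double)total_cycles;
}
```

For a dual-threaded run, the same function would be applied to each thread's cycle count and the two results added, as described in the caption.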
Figure 4 above shows the memory throughput improvements. In the single-threaded implementation, using streaming loads increased memory throughput by more than 5x. In the dual-threaded implementation, streaming loads increased memory throughput by more than 7.5x.
Achieving the benefits
Intel SSE4 can improve the performance and energy efficiency of a broad range of applications, including media editors, encoders, 3D applications and games. Of course, performance gains will vary by workload and application.
So what do you need to do to incorporate the benefits of Intel SSE4 instructions into your applications?
Intel provides a variety of tools and performance libraries optimized for Intel SSE4. One such tool is the Intel C++ Compiler 10.0, which supports automatic vectorization.
The vectorization process parallelizes code by analyzing loops to determine when it may be appropriate to execute several iterations of the loop in parallel by utilizing MMX, SSE, Intel SSE2, SSE3 and Intel SSE4 instructions. Vectorization may be useful as a way to optimize application code and take advantage of new extensions when running applications on Intel processors.
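As an illustration of the kind of loop an auto-vectorizer looks for, consider the element-wise 32-bit multiply below. It is a plain C sketch with a hypothetical function name; whether a particular compiler actually vectorizes it, and whether it picks the Intel SSE4 packed dword multiply to do so, depends on the compiler version and flags, so treat that as an assumption to verify in the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Element-wise multiply of two int32 arrays, keeping the low 32 bits
   of each product. The restrict qualifiers promise the compiler that
   the arrays do not overlap, which removes a common obstacle to
   vectorization. */
void multiply_dwords(const int32_t *restrict a, const int32_t *restrict b,
                     int32_t *restrict out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Multiply as unsigned so that wraparound is well defined. */
        out[i] = (int32_t)((uint32_t)a[i] * (uint32_t)b[i]);
    }
}
```

Each iteration is independent of the others, which is the property the compiler needs in order to execute several iterations in parallel.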
While some applications might achieve performance gains by simply recompiling with this compiler, you will obtain maximum gains by manually optimizing your applications. Some of the highest-value Intel SSE4 instructions will require careful integration using intrinsic or assembly development, and may require algorithm changes. For example, achieving the benefits of streaming load will require manual integration, but the payback will be significantly improved performance.
To get started, visit the Intel Software Network. This site includes white papers and a downloadable Software Developers Kit (SDK) for Penryn and Intel SSE4. For more information on Intel SSE4 instruction set innovation,
Jeremy Saldate is Senior Technical Marketing Engineer, Software and Solutions Group,