Using the new Intel Streaming SIMD Extensions (SSE4) for audio, video and image apps
Fast SAD calculations
MPSADBW is very useful for fast SAD calculations. A single MPSADBW
instruction computes 32 absolute differences and sums them in groups of
four, which is double the yield of PSADBW. You can use the MPSADBW
instruction to compute sums of absolute differences, not only for
4-byte wide blocks, but also for 8-byte wide and 16-byte wide blocks.
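As an illustration of the 4-byte wide case, here is a minimal sketch
that uses the `_mm_mpsadbw_epu8` intrinsic to score one 4-byte block
against eight neighboring positions in a row of pixels. The function
names, the test data and the scalar fallback (used when the code is not
built with SSE4.1 enabled, for example without `-msse4.1`) are
assumptions made for this example and are not taken from the article.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#ifdef __SSE4_1__
#include <smmintrin.h>   /* SSE4.1 intrinsics, including _mm_mpsadbw_epu8 */
#endif

/* Hypothetical helper: out[j] = sum over k<4 of |window[j+k] - block[k]|
 * for j = 0..7. window needs 16 readable bytes, block needs 4. */
void sad4_x8(const uint8_t *window, const uint8_t *block, uint16_t out[8])
{
#ifdef __SSE4_1__
    int32_t four;
    memcpy(&four, block, 4);                 /* read exactly 4 block bytes */
    __m128i w = _mm_loadu_si128((const __m128i *)window);
    __m128i b = _mm_cvtsi32_si128(four);
    /* imm8 = 0: no shift of the window, use the first 4-byte group of b */
    _mm_storeu_si128((__m128i *)out, _mm_mpsadbw_epu8(w, b, 0));
#else
    for (int j = 0; j < 8; ++j) {            /* scalar reference path */
        unsigned sum = 0;
        for (int k = 0; k < 4; ++k)
            sum += (unsigned)abs((int)window[j + k] - (int)block[k]);
        out[j] = (uint16_t)sum;
    }
#endif
}

/* One-row motion search: offset (0..7) with the smallest SAD. */
int sad4_best_offset(const uint8_t *window, const uint8_t *block)
{
    uint16_t sad[8];
    int best = 0;
    sad4_x8(window, block, sad);
    for (int j = 1; j < 8; ++j)
        if (sad[j] < sad[best])
            best = j;
    return best;
}

/* Usage sketch with made-up data: the block is a copy of window[5..8],
 * so the best match should be found at offset 5. */
int sad4_demo(void)
{
    uint8_t window[16];
    for (int i = 0; i < 16; ++i)
        window[i] = (uint8_t)i;
    return sad4_best_offset(window, window + 5);
}
```

A full 4x4 block SAD would repeat the call for each of the four rows
and add the resulting vectors.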
The built-in source shifts allow the developer to avoid many of the
unaligned loads or ALIGN instructions which would otherwise be
required, contributing to the overall efficiency and performance of the
application.
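The source shifts are selected by the immediate operand: bits 1:0
choose which 4-byte group of the block is used, and bit 2 shifts the
starting position in the search window by 4 bytes. The sketch below
relies on this to build an 8-byte wide row SAD from two MPSADBW results
without reloading or realigning either operand. The function names and
data are hypothetical, and the scalar branch is only a fallback for
builds without SSE4.1.

```c
#include <stdint.h>
#include <stdlib.h>
#ifdef __SSE4_1__
#include <smmintrin.h>
#endif

/* Hypothetical helper: out[j] = sum over k<8 of |window[j+k] - block[k]|
 * for j = 0..7. window needs 16 readable bytes, block needs 8. */
void sad8_x8(const uint8_t *window, const uint8_t *block, uint16_t out[8])
{
#ifdef __SSE4_1__
    __m128i w = _mm_loadu_si128((const __m128i *)window);
    __m128i b = _mm_loadl_epi64((const __m128i *)block);  /* low 8 bytes */
    __m128i lo = _mm_mpsadbw_epu8(w, b, 0x0); /* w[j..j+3]   vs b[0..3] */
    __m128i hi = _mm_mpsadbw_epu8(w, b, 0x5); /* w[j+4..j+7] vs b[4..7] */
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(lo, hi));
#else
    for (int j = 0; j < 8; ++j) {             /* scalar reference path */
        unsigned sum = 0;
        for (int k = 0; k < 8; ++k)
            sum += (unsigned)abs((int)window[j + k] - (int)block[k]);
        out[j] = (uint16_t)sum;
    }
#endif
}

/* Usage sketch with made-up data: window[i] = i*i, block = window[3..10].
 * Returns the SAD at candidate offset j (0..7). */
int sad8_demo(int j)
{
    uint8_t window[16];
    uint16_t sad[8];
    for (int i = 0; i < 16; ++i)
        window[i] = (uint8_t)(i * i);
    sad8_x8(window, window + 3, sad);
    return sad[j];
}
```

With this data the SAD is 0 at offset 3, where the block was taken
from, and grows on either side. A 16-byte wide row would need four
MPSADBW results and a second load covering the upper part of the
window.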
Table 1 below shows the
speed-ups from these optimizations. The results are expressed as cycles
per block SAD that we computed. The speed-up column contains ratios
that we computed using the Intel SSE2 results as the baseline. We
tested three different block sizes, 4x4, 8x8 and 16x16.
Table 1: Speed-up results for motion vector search, measured in number
of cycles per block SAD. Intel SSE4 can provide 1.6 to 3.8x faster
performance than Intel SSE2[3]. (The Intel Compiler 10.0.018 beta was
used to build the code. The 'O2' and 'QxS' compiler flags were used.
'QxS' is a new flag for the compiler to generate optimized code
specifically for Penryn.)
In addition to the Intel SSE4 instructions, multi-threaded
implementations can help you achieve additional performance gains in
compute-intensive applications such as video codecs.
Memory Mapped I/O devices
UC (uncacheable) memory is memory whose data is not stored in any of
the processor's caches. Uncacheable Speculative Write Combining (USWC)
memory is an extension to UC memory. It holds uncacheable data that is
typically subject to sequential write operations, such as writes to
frame buffer memory.
As its name indicates, write-combining allows all writes to a cache
line to be combined before they are written out to memory. USWC memory
is often used in Memory Mapped I/O (MMIO) devices.
Intel SSE2 introduced the MOVNTDQ instruction for streaming writes.
It improves write throughput by streaming uncacheable writes to devices
that are memory-mapped to USWC memory. The streaming write instruction
tells the processor that the target data is to be written directly to
external memory and not cached into any level of the processor's cache
hierarchy.
Streaming writes allow the writes to the same cache line (typically 64
bytes) to be combined before they go out to memory, instead of being
written to memory in 16-byte chunks. This substantially improves bus
utilization and write throughput.
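A minimal sketch of this write pattern, using the SSE2
`_mm_stream_si128` intrinsic (which compiles to MOVNTDQ), is shown
below. The helper names are hypothetical. The demo writes to ordinary
memory so that it can run anywhere. In a real application the
destination would be a USWC mapping obtained from the operating system
or a driver, which is outside the scope of this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#ifdef __SSE2__
#include <emmintrin.h>   /* _mm_stream_si128 (MOVNTDQ), _mm_sfence */
#endif

/* Hypothetical helper: copy n_lines 64-byte lines to dst with streaming
 * stores. dst must be 16-byte aligned; 64-byte alignment keeps each
 * group of four stores inside a single cache line. */
void stream_store_lines(void *dst, const void *src, size_t n_lines)
{
#ifdef __SSE2__
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n_lines; ++i, d += 4, s += 4) {
        /* Read the four 16-byte chunks of one line... */
        __m128i c0 = _mm_loadu_si128(s + 0);
        __m128i c1 = _mm_loadu_si128(s + 1);
        __m128i c2 = _mm_loadu_si128(s + 2);
        __m128i c3 = _mm_loadu_si128(s + 3);
        /* ...and write them back to back so they can be combined. */
        _mm_stream_si128(d + 0, c0);
        _mm_stream_si128(d + 1, c1);
        _mm_stream_si128(d + 2, c2);
        _mm_stream_si128(d + 3, c3);
    }
    _mm_sfence();   /* order the streaming stores before later stores */
#else
    memcpy(dst, src, n_lines * 64);   /* portable fallback, no streaming */
#endif
}

/* Usage sketch with made-up data: returns 1 if the copy is exact. */
int stream_store_demo(void)
{
    static _Alignas(64) uint8_t dst[128];
    uint8_t src[128];
    for (int i = 0; i < 128; ++i)
        src[i] = (uint8_t)(i ^ 0x5A);
    stream_store_lines(dst, src, 2);
    return memcmp(dst, src, sizeof src) == 0;
}
```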
The new MOVNTDQA instruction
Although USWC memory is primarily employed for memory that is subject
to write operations (such as frame buffer memory), load operations may
also occur.
For example, today's sophisticated graphics devices can perform
rendering operations while the CPU performs other tasks. The CPU can
then load in the data for further processing and display. One problem
with this is that SSE load instructions operate on a maximum of 16-byte
chunks and have limited throughput when accessing USWC memory. The
operation requires two separate front-side bus transactions and
consumes four bus clocks.
To relieve this bottleneck, Intel SSE4 introduces the new MOVNTDQA
instruction for streaming loads, improving read throughput by streaming
uncacheable reads from USWC memory. MOVNTDQA allows the fetching of a
16-byte chunk within an aligned cache line of USWC memory. Similar to
streaming writes, the streaming load instruction allows data to be
moved in full 64-byte cache line quantities as opposed to a 16-byte
chunk, substantially improving bus throughput.
Loading a full cache line
Fetching a complete 64-byte cache line and temporarily holding the
contents in a streaming load buffer enables the supply of subsequent
loads without using the front side bus (FSB), which makes these loads
much faster. Figure 3 provides a code example.
Figure 3: Code example of loading a full cache line using the MOVNTDQA
instruction[4]
For streaming across multiple line addresses, loads of all four
chunks of a given line should be batched together. Note that the
loading of each chunk using MOVNTDQA instructions does not have to be
in the order shown in the example.
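The batching described above can be sketched in C with the
`_mm_stream_load_si128` intrinsic, which compiles to MOVNTDQA. This is
an illustrative sketch and not the article's original listing: the
function names are hypothetical, and the demo reads from ordinary
write-back memory, where MOVNTDQA behaves like a regular aligned load.
The streaming benefit only applies when the source is USWC memory.

```c
#include <stdint.h>
#include <string.h>
#ifdef __SSE4_1__
#include <smmintrin.h>   /* _mm_stream_load_si128 (MOVNTDQA) */
#endif

/* Hypothetical helper: load one 64-byte line with four streaming loads.
 * src must be 64-byte aligned, dst 16-byte aligned. src is non-const
 * because some compiler headers declare the intrinsic that way. */
void stream_load_line(void *dst, void *src)
{
#ifdef __SSE4_1__
    __m128i *s = (__m128i *)src;
    __m128i *d = (__m128i *)dst;
    /* Issue all four chunk loads of the line together. */
    __m128i c0 = _mm_stream_load_si128(s + 0);
    __m128i c1 = _mm_stream_load_si128(s + 1);
    __m128i c2 = _mm_stream_load_si128(s + 2);
    __m128i c3 = _mm_stream_load_si128(s + 3);
    /* Only then store the data to the cacheable destination. */
    _mm_store_si128(d + 0, c0);
    _mm_store_si128(d + 1, c1);
    _mm_store_si128(d + 2, c2);
    _mm_store_si128(d + 3, c3);
#else
    memcpy(dst, src, 64);   /* portable fallback, no streaming loads */
#endif
}

/* Usage sketch with made-up data: returns 1 if the line was copied. */
int stream_load_demo(void)
{
    static _Alignas(64) uint8_t src[64];
    static _Alignas(16) uint8_t dst[64];
    for (int i = 0; i < 64; ++i)
        src[i] = (uint8_t)(3 * i + 1);
    stream_load_line(dst, src);
    return memcmp(dst, src, 64) == 0;
}
```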
It is also important to note that a streaming load of a given chunk
will cause a new streaming load buffer allocation if one does not
currently exist. Because there are a finite number of streaming load
buffers within any given micro-architectural implementation, grouping
the chunks together will generally improve overall utilization.
Streaming Load programming models
There are two common programming models for using streaming loads: Bulk
Load and Operate, and Incremental Load and Operate.
Bulk Load and Operate. Here the application loads the data using
streaming loads and copies it in bulk to a temporary cacheable
write-back (WB) buffer. After all the data has been loaded, the CPU
operates on the temporary buffer and then sends the data back to
memory.
Incremental Load and Operate. In this model the application loads a
single cache line using a streaming load, operates on the data, and
then writes it back to memory. This model operates on the data as it is
loaded, as opposed to loading a large amount of data first, as is the
case with the Bulk Load and Operate model.
The Bulk Load and Operate programming model generates more
consistent performance gains. In the Incremental Load and Operate
programming model, the data operations performed between streaming
loads could interfere with streaming load performance due to contention
for buffers and other processor resources. The Bulk Load and Operate
programming model reduces this possibility by performing data loading
and operations in separate batches.
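The difference between the two models can be sketched as follows. The
function names and the per-line operation (inverting each byte) are
hypothetical stand-ins chosen for this example. The source buffer is
ordinary memory so that the code runs anywhere; with a real USWC source
only the loading helper would see a streaming benefit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#ifdef __SSE4_1__
#include <smmintrin.h>
#endif

enum { LINE = 64 };   /* cache line size assumed by this sketch */

/* Load one line: src 64-byte aligned, dst 16-byte aligned. */
static void load_line64(uint8_t *dst, uint8_t *src)
{
#ifdef __SSE4_1__
    __m128i c0 = _mm_stream_load_si128((__m128i *)src + 0);
    __m128i c1 = _mm_stream_load_si128((__m128i *)src + 1);
    __m128i c2 = _mm_stream_load_si128((__m128i *)src + 2);
    __m128i c3 = _mm_stream_load_si128((__m128i *)src + 3);
    _mm_store_si128((__m128i *)dst + 0, c0);
    _mm_store_si128((__m128i *)dst + 1, c1);
    _mm_store_si128((__m128i *)dst + 2, c2);
    _mm_store_si128((__m128i *)dst + 3, c3);
#else
    memcpy(dst, src, LINE);   /* portable fallback */
#endif
}

/* Hypothetical per-line operation: invert every byte. */
static void operate_line(uint8_t *out, const uint8_t *in)
{
    for (int i = 0; i < LINE; ++i)
        out[i] = (uint8_t)(255 - in[i]);
}

/* Bulk Load and Operate: fill a cacheable buffer first, then operate. */
void bulk_load_and_operate(uint8_t *out, uint8_t *src, uint8_t *tmp, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        load_line64(tmp + i * LINE, src + i * LINE);
    for (size_t i = 0; i < n; ++i)
        operate_line(out + i * LINE, tmp + i * LINE);
}

/* Incremental Load and Operate: load one line, operate, move on. */
void incremental_load_and_operate(uint8_t *out, uint8_t *src, size_t n)
{
    _Alignas(16) uint8_t line[LINE];
    for (size_t i = 0; i < n; ++i) {
        load_line64(line, src + i * LINE);
        operate_line(out + i * LINE, line);
    }
}

/* Usage sketch with made-up data: both models give the same output. */
int models_agree_demo(void)
{
    static _Alignas(64) uint8_t src[4 * LINE];
    static _Alignas(64) uint8_t tmp[4 * LINE];
    static uint8_t a[4 * LINE];
    static uint8_t b[4 * LINE];
    for (int i = 0; i < 4 * LINE; ++i)
        src[i] = (uint8_t)(i * 7);
    bulk_load_and_operate(a, src, tmp, 4);
    incremental_load_and_operate(b, src, 4);
    return memcmp(a, b, sizeof a) == 0 && a[1] == 248;
}
```

The two functions produce identical results. What differs is how the
streaming loads are interleaved with other work, which is the source of
the performance difference discussed above.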
Figure 4: Memory throughput from system memory using Streaming Load.
Intel SSE4 can lead to a 5.0 to 7.5x speed increase over traditional
loads[5]. Specifically, Memory Throughput = (Processor Frequency *
Number of Iterations * Data Copied per Iteration) / Total Execution
Time (in clock cycles). For dual-threaded implementations, the memory
throughput was calculated for each thread and then added together. The
test system consisted of a next-generation Intel Core 2 desktop
processor (Wolfdale), Intel D975XBX2KR motherboard, 2 GB DDR2 RAM
PC2-8000 (667 MHz), and Windows* XP Professional with Service Pack 2.
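The throughput formula in the caption can be written as a small helper.
The function name, parameter names and the numbers in the comment are
hypothetical and are not measurements from the article.

```c
#include <stdint.h>

/* Memory throughput in bytes per second:
 * (frequency * iterations * bytes per iteration) / elapsed clock cycles.
 * Dividing the cycle count by the frequency gives the elapsed seconds. */
double memory_throughput(double cpu_hz, uint64_t iterations,
                         uint64_t bytes_per_iteration, uint64_t total_cycles)
{
    return cpu_hz * (double)iterations * (double)bytes_per_iteration
           / (double)total_cycles;
}

/* Made-up example: a 3.0 GHz processor copying 4096 bytes 1000 times in
 * 3,000,000 cycles (1 ms) moves 4,096,000 bytes, about 4.096e9 bytes/s:
 *   memory_throughput(3.0e9, 1000, 4096, 3000000)
 */
```

For a dual-threaded run, the caption's method corresponds to calling
the helper once per thread and adding the two results.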