CMP EMBEDDED.COM


Using the new Intel Streaming SIMD Extensions (SSE4) for audio, video and image apps



Embedded.com
Fast SAD calculations
MPSADBW is well suited to fast SAD calculations. Each MPSADBW instruction computes eight 4-byte sums of absolute differences, that is, 32 absolute differences per instruction, which is double the yield of PSADBW. You can use MPSADBW to compute sums of absolute differences not only for 4-byte-wide blocks, but also for 8-byte-wide and 16-byte-wide blocks.
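To make the arithmetic concrete, the following is a minimal sketch in C, not code from the article. `mpsadbw_model` is a hypothetical scalar model of what one MPSADBW produces, and `row_sad_wide8` is a hypothetical helper that covers one row of an 8-byte-wide block with two such operations. When the compiler targets SSE4.1, the helper uses the `_mm_mpsadbw_epu8` intrinsic; otherwise it falls back to the scalar model so the sketch still runs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>
#endif

/* Hypothetical scalar model of one MPSADBW operation.
 * win: 16-byte window operand; win_off is 0 or 4 (imm8 bit 2).
 * blk: 16-byte block operand; blk_off is 0, 4, 8 or 12 (imm8 bits 1:0).
 * out[i] = sum over j = 0..3 of |win[win_off + i + j] - blk[blk_off + j]|,
 * so 8 results x 4 differences = 32 absolute differences. */
static void mpsadbw_model(const uint8_t win[16], const uint8_t blk[16],
                          int win_off, int blk_off, uint16_t out[8])
{
    assert(win_off == 0 || win_off == 4);
    assert(blk_off >= 0 && blk_off <= 12 && blk_off % 4 == 0);
    for (int i = 0; i < 8; i++) {
        unsigned sum = 0;
        for (int j = 0; j < 4; j++)
            sum += (unsigned)abs((int)win[win_off + i + j] - (int)blk[blk_off + j]);
        out[i] = (uint16_t)sum;
    }
}

/* One row of an 8-byte-wide block at 8 candidate positions:
 * left half (block bytes 0..3) plus right half (block bytes 4..7). */
static void row_sad_wide8(const uint8_t win[16], const uint8_t blk[16],
                          uint16_t out[8])
{
#if defined(__SSE4_1__)
    __m128i w = _mm_loadu_si128((const __m128i *)win);
    __m128i b = _mm_loadu_si128((const __m128i *)blk);
    /* The immediate must be a compile-time constant. */
    __m128i lo = _mm_mpsadbw_epu8(w, b, 0); /* window offset 0, block offset 0 */
    __m128i hi = _mm_mpsadbw_epu8(w, b, 5); /* 0b101: window offset 4, block offset 4 */
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(lo, hi));
#else
    uint16_t lo[8], hi[8];
    mpsadbw_model(win, blk, 0, 0, lo);
    mpsadbw_model(win, blk, 4, 4, hi);
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)(lo[i] + hi[i]);
#endif
}
```

A 16-byte-wide row would need a second 16-byte window load, because a single window operand only covers candidate bytes 0 through 14.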

The built-in source shifts, selected by the instruction's immediate operand, allow the developer to avoid many of the unaligned loads or alignment instructions (such as PALIGNR) that would otherwise be required, contributing to the overall efficiency and performance of the application.
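As a rough illustration of what those shifted comparisons amount to in a motion search, here is a scalar sketch in C. The function name `best_of_8_candidates_4x4` and its parameters are assumptions made for this example; it accumulates the SAD of a 4x4 block against eight horizontally adjacent candidate positions, which is the work that four MPSADBW operations (one per row) would perform, and then picks the smallest.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns the index (0..7) of the candidate position
 * with the smallest 4x4 SAD and stores that SAD in *best_sad.
 * cur points to the 4x4 block being matched (cur_stride bytes per row).
 * ref points to the first candidate; 11 bytes per row must be readable. */
static int best_of_8_candidates_4x4(const uint8_t *cur, int cur_stride,
                                    const uint8_t *ref, int ref_stride,
                                    unsigned *best_sad)
{
    unsigned sad[8] = {0};

    for (int row = 0; row < 4; row++) {
        /* One MPSADBW would produce these 8 partial sums for this row. */
        for (int x = 0; x < 8; x++) {
            for (int j = 0; j < 4; j++) {
                int r = ref[row * ref_stride + x + j];
                int c = cur[row * cur_stride + j];
                sad[x] += (unsigned)abs(r - c);
            }
        }
    }

    int best = 0;
    for (int x = 1; x < 8; x++)
        if (sad[x] < sad[best])
            best = x;

    *best_sad = sad[best];
    return best;
}
```

Each value of `x` corresponds to one shifted window position, so no separate unaligned load is needed per candidate in the SIMD version.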

Table 1 below shows the speed-ups from these optimizations. The results are expressed in cycles per block SAD computed. The speed-up column contains ratios that use the Intel SSE2 results as the baseline. We tested three different block sizes: 4x4, 8x8 and 16x16.

Table 1: Speed-Up Results for Motion Vector Search measured in number of cycles per block SAD. Intel SSE4 can provide 1.6 to 3.8x faster performance than Intel SSE2 [3]. (The Intel Compiler 10.0.018 beta was used to build the code. The 'O2' and 'QxS' compiler flags were used. 'QxS' is a new flag for the compiler to generate optimized code specifically for Penryn.)

In addition to the Intel SSE4 instructions, multi-threaded implementations can help you achieve additional performance gains in compute-intensive applications such as video codecs.

Memory Mapped I/O devices
UC (uncacheable) memory is memory whose data is not stored in any of the processor's caches. Uncacheable Speculative Write Combining (USWC) memory is an extension of UC memory, intended for uncacheable data that is typically subject to sequential write operations, such as frame buffer memory.

As its name indicates, write-combining allows all writes to a cache line to be combined before they are written out to memory. USWC memory is often used in Memory Mapped I/O (MMIO) devices.

Intel SSE2 introduced the MOVNTDQ instruction for streaming writes. It improves write throughput by streaming uncacheable writes to devices that are memory-mapped to USWC memory. The streaming write instruction tells the processor that the target data is to be written directly to external memory and not cached into any level of the processor's cache hierarchy.

Streaming writes allow writes to the same cache line (typically 64 bytes) to be combined before going out to memory, as opposed to being written to memory in 16-byte chunks. This substantially improves bus utilization and write throughput.
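A minimal sketch of a streaming write of one full line follows, assuming a 64-byte line and a line-aligned destination. The helper name `stream_store_line` is hypothetical. It uses the SSE2 intrinsic `_mm_stream_si128`, which corresponds to MOVNTDQ, and falls back to `memcpy` when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: write one 64-byte line to dst with four
 * streaming (non-temporal) 16-byte stores issued back to back,
 * so that they can be combined into a single line-sized write.
 * dst must be aligned to the 64-byte line; src may be unaligned. */
static void stream_store_line(void *dst, const void *src)
{
    assert(((uintptr_t)dst & 63) == 0);
#if defined(__SSE2__)
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    __m128i x0 = _mm_loadu_si128(s + 0);
    __m128i x1 = _mm_loadu_si128(s + 1);
    __m128i x2 = _mm_loadu_si128(s + 2);
    __m128i x3 = _mm_loadu_si128(s + 3);
    _mm_stream_si128(d + 0, x0);
    _mm_stream_si128(d + 1, x1);
    _mm_stream_si128(d + 2, x2);
    _mm_stream_si128(d + 3, x3);
    _mm_sfence(); /* make the streaming stores globally visible in order */
#else
    memcpy(dst, src, 64); /* portable fallback, no streaming hint */
#endif
}
```

The store fence is included because streaming stores are weakly ordered; in a loop over many lines it would normally be issued once after the last line rather than per line.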

The new MOVNTDQA instruction
Although USWC memory is primarily employed for memory that is subject to write operations (such as frame buffer memory), load operations may also occur.

For example, today's sophisticated graphics devices can perform rendering operations while the CPU performs other tasks. The CPU can then load in the data for further processing and display. One problem with this is that SSE load instructions operate on a maximum of 16-byte chunks and have limited throughput when accessing USWC memory. The operation requires two separate front-side bus transactions and consumes four bus clocks.

To relieve this bottleneck, Intel SSE4 introduces the new MOVNTDQA instruction for streaming loads, improving read throughput by streaming uncacheable reads from USWC memory. MOVNTDQA allows the fetching of a 16-byte chunk within an aligned cache line of USWC memory. Similar to streaming writes, the streaming load instruction allows data to be moved in full 64-byte cache line quantities as opposed to a 16-byte chunk, substantially improving bus throughput.

Loading a full cache line
Fetching a complete 64-byte cache line and temporarily holding the contents in a streaming load buffer enables the supply of subsequent loads without using the front side bus (FSB), which makes these loads much faster. Figure 3 provides a code example.

Figure 3: Code example of loading a full cache line using the MOVNTDQA instruction[4]
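The original figure is not reproduced here. As a stand-in, the following is a minimal sketch in C of the same idea, assuming a 64-byte line; it is not the article's own listing. The helper name `stream_load_line` is hypothetical. It uses the SSE4.1 intrinsic `_mm_stream_load_si128`, which corresponds to MOVNTDQA, and falls back to `memcpy` when SSE4.1 is not enabled at compile time. On ordinary write-back memory the instruction behaves like a normal aligned load; the streaming behavior applies to USWC memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>
#endif

/* Hypothetical helper: read one 64-byte line with four streaming loads
 * grouped together, then store the data to an ordinary buffer.
 * src_line must be aligned to the 64-byte line; dst may be unaligned. */
static void stream_load_line(void *dst, const void *src_line)
{
    assert(((uintptr_t)src_line & 63) == 0);
#if defined(__SSE4_1__)
    /* Cast drops const because some compilers declare the intrinsic
     * with a non-const pointer parameter. */
    __m128i *s = (__m128i *)src_line;
    __m128i *d = (__m128i *)dst;
    __m128i x0 = _mm_stream_load_si128(s + 0);
    __m128i x1 = _mm_stream_load_si128(s + 1);
    __m128i x2 = _mm_stream_load_si128(s + 2);
    __m128i x3 = _mm_stream_load_si128(s + 3);
    _mm_storeu_si128(d + 0, x0);
    _mm_storeu_si128(d + 1, x1);
    _mm_storeu_si128(d + 2, x2);
    _mm_storeu_si128(d + 3, x3);
#else
    memcpy(dst, src_line, 64); /* portable fallback, no streaming hint */
#endif
}
```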

For streaming across multiple line addresses, loads of all four chunks of a given line should be batched together. Note that the loading of each chunk using MOVNTDQA instructions does not have to be in the order shown in the example.

It is also important to note that a streaming load of a given chunk will cause a new streaming load buffer allocation if one does not currently exist. Because there are a finite number of streaming load buffers within any given micro-architectural implementation, grouping the chunks together will generally improve overall utilization.

Streaming Load programming models
There are two common programming models for using streaming loads: Bulk Load and Operate, and Incremental Load and Operate.

Bulk Load and Operate. Here the application loads the data using streaming loads and copies it as a bulk load to a temporary cacheable (WB) buffer. After all the data has been loaded, the CPU operates on the temporary buffer and then sends the data back to memory.

Incremental Load and Operate. In this model the application loads a single cache line using streaming load, operates on the data, and then writes it back to memory. This model operates on the data as it is loaded, as opposed to loading a large amount of data first, as is the case with the Bulk Load and Operate model.

The Bulk Load and Operate programming model generates more consistent performance gains. In the Incremental Load and Operate programming model, the data operations performed between streaming loads could interfere with streaming load performance due to contention for buffers and other processor resources. The Bulk Load and Operate programming model reduces this possibility by performing data loading and operations in separate batches.
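The two models can be sketched side by side as follows. This is an illustrative outline under stated assumptions, not the benchmark code behind the article's results: the function names, the 64-byte `LINE` constant, and the byte-inverting `operate` step are all hypothetical, and `load_line` uses `_mm_stream_load_si128` only when SSE4.1 is enabled at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>
#endif

enum { LINE = 64 }; /* assumed cache line size in bytes */

/* Load one line from src (line-aligned) into an ordinary buffer. */
static void load_line(uint8_t *dst, const uint8_t *src)
{
    assert(((uintptr_t)src & (LINE - 1)) == 0);
#if defined(__SSE4_1__)
    __m128i *s = (__m128i *)(uintptr_t)src;
    for (int k = 0; k < 4; k++)
        _mm_storeu_si128((__m128i *)dst + k, _mm_stream_load_si128(s + k));
#else
    memcpy(dst, src, LINE);
#endif
}

/* Placeholder for the real processing step: invert every byte. */
static void operate(uint8_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = (uint8_t)~p[i];
}

/* Bulk Load and Operate: stream everything into a cacheable temporary
 * buffer first, then operate on it, then write the result back. */
static void bulk_load_and_operate(uint8_t *out, const uint8_t *src,
                                  uint8_t *tmp, size_t nlines)
{
    for (size_t i = 0; i < nlines; i++)
        load_line(tmp + i * LINE, src + i * LINE);
    operate(tmp, nlines * LINE);
    memcpy(out, tmp, nlines * LINE);
}

/* Incremental Load and Operate: load, process and write back
 * one line at a time. */
static void incremental_load_and_operate(uint8_t *out, const uint8_t *src,
                                         size_t nlines)
{
    uint8_t line[LINE];
    for (size_t i = 0; i < nlines; i++) {
        load_line(line, src + i * LINE);
        operate(line, LINE);
        memcpy(out + i * LINE, line, LINE);
    }
}
```

Both functions produce the same output; the difference is only in how the loads are interleaved with the processing, which is where the contention described above can arise.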

Figure 4: Memory throughput from system memory using Streaming Load. Intel SSE4 can lead to a 5.0 to 7.5x speed increase over traditional loads [5]. Specifically, Memory Throughput = (Processor Frequency * Number of Iterations * Data Copied per Iteration) / Total Execution Time (in clock cycles). For dual-threaded implementations, the memory throughput was calculated for each thread and then added together. The test system consisted of a next-generation Intel Core 2 desktop processor (Wolfdale), Intel D975XBX2KR motherboard, 2 GB DDR2 RAM PC2-8000 (667 MHz), Windows* XP Professional with Service Pack 2.
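The throughput formula in the caption can be written as a small C function. This is only a sketch of the arithmetic; the function names and units (hertz, bytes, clock cycles, result in bytes per second) are assumptions chosen for the example, and no measured values from the article are reproduced.

```c
#include <stdint.h>

/* Memory Throughput = (Processor Frequency * Number of Iterations *
 * Data Copied per Iteration) / Total Execution Time (in clock cycles).
 * With the frequency in Hz and the data size in bytes, the result is
 * in bytes per second. */
static double memory_throughput(double cpu_hz, uint64_t iterations,
                                uint64_t bytes_per_iteration,
                                uint64_t total_cycles)
{
    return cpu_hz * (double)iterations * (double)bytes_per_iteration
           / (double)total_cycles;
}

/* Dual-threaded case: compute the throughput per thread, then add. */
static double dual_thread_throughput(double thread0, double thread1)
{
    return thread0 + thread1;
}
```

For instance, with purely hypothetical inputs of a 3.0 GHz clock, 1000 iterations of 4096 bytes each, and 3,000,000 elapsed cycles, the formula gives 3.0e9 * 1000 * 4096 / 3.0e6 = 4.096e9 bytes per second.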

