CMP EMBEDDED.COM


Using the new Intel Streaming SIMD Extensions (SSE4) for audio, video and image apps



Embedded.com
Performance gains
To measure the performance of streaming loads from local system memory, we used four implementations of the Bulk Load and Operate programming model to load 4KB of data from USWC memory and copy the data to a cacheable WB buffer:

* One implementation without streaming loads using the MOVDQA instruction

* Another implementation using streaming loads with the new MOVNTDQA instruction

* Two additional implementations consisting of dual-threaded versions that split the USWC segment into two parts and performed the load and copy in two threads, each bound to a separate core.
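The two single-threaded copy loops described above might look roughly like the following C sketch. It is an illustration under stated assumptions, not the code used for the measurements: the function names and `CHUNK_BYTES` are hypothetical, the build is assumed to use GCC or Clang, and the buffers are assumed to be 16-byte aligned. `_mm_load_si128` typically compiles to MOVDQA and `_mm_stream_load_si128` to MOVNTDQA, although the compiler is free to choose equivalent instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <smmintrin.h>  /* SSE4.1 intrinsics, including _mm_stream_load_si128 */
#define X86_BUILD 1
/* Assumption: GCC/Clang attribute that enables SSE4.1 for one function.
   No runtime CPUID check is made; the CPU is assumed to support SSE4.1. */
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

enum { CHUNK_BYTES = 4096 };  /* 4KB, matching the test described above */

/* Baseline copy: ordinary aligned loads and aligned stores. */
SSE41_FN void copy_movdqa(void *dst, const void *src, size_t bytes)
{
    /* Both pointers and the length must be multiples of 16 bytes. */
    assert((((uintptr_t)dst | (uintptr_t)src | bytes) % 16) == 0);
#ifdef X86_BUILD
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < bytes / 16; ++i)
        _mm_store_si128(d + i, _mm_load_si128(s + i));
#else
    memcpy(dst, src, bytes);  /* portable fallback for non-x86 builds */
#endif
}

/* Streaming-load copy: same loop, but each load carries the
   non-temporal (streaming) hint. */
SSE41_FN void copy_movntdqa(void *dst, const void *src, size_t bytes)
{
    assert((((uintptr_t)dst | (uintptr_t)src | bytes) % 16) == 0);
#ifdef X86_BUILD
    __m128i *d = (__m128i *)dst;
    /* The cast drops const because some headers declare the
       intrinsic's parameter as a non-const pointer. */
    __m128i *s = (__m128i *)src;
    for (size_t i = 0; i < bytes / 16; ++i)
        _mm_store_si128(d + i, _mm_stream_load_si128(s + i));
#else
    memcpy(dst, src, bytes);  /* portable fallback for non-x86 builds */
#endif
}
```

Both functions produce identical output for the same input. The streaming hint only changes behavior when the source is USWC memory; on ordinary write-back memory, such as a heap or stack buffer, the streaming load behaves like a normal load, so reproducing the gains requires a source buffer that is actually mapped as USWC.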

For each implementation, the 4KB load and copy loop executed approximately 10,000 iterations, and we calculated the average memory throughput by measuring the total time required.
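The averaging step can be sketched as a small timing harness in C. This is a minimal illustration, not the harness used for the published figures: `copy_fn`, `throughput_mb_per_s`, `measure_copy` and `copy_memcpy` are hypothetical names, `memcpy` stands in for the copy routine under test, and C11 `timespec_get` is used for portability even though a monotonic clock would be preferable for real measurements.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Signature shared by every copy routine under test (hypothetical). */
typedef void (*copy_fn)(void *dst, const void *src, size_t bytes);

/* Average throughput in MB/s (1 MB = 10^6 bytes) for a given total time.
   For example, 4096 bytes x 10,000 iterations in 1.0 s is 40.96 MB/s. */
double throughput_mb_per_s(size_t bytes_per_iter, size_t iterations,
                           double seconds)
{
    if (seconds <= 0.0)
        return 0.0;  /* elapsed time too small to measure */
    return (double)bytes_per_iter * (double)iterations / seconds / 1e6;
}

/* Run the copy `iterations` times and report the average throughput. */
double measure_copy(copy_fn fn, void *dst, const void *src,
                    size_t bytes, size_t iterations)
{
    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    for (size_t i = 0; i < iterations; ++i)
        fn(dst, src, bytes);
    timespec_get(&t1, TIME_UTC);

    double seconds = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return throughput_mb_per_s(bytes, iterations, seconds);
}

/* Stand-in copy routine so the harness can be exercised on its own. */
void copy_memcpy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}
```

Calling `measure_copy(copy_memcpy, dst, src, 4096, 10000)` mirrors the 4KB, 10,000-iteration setup. Timing the whole loop rather than each iteration keeps the clock overhead small relative to the work being measured.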

Figure 4, shown earlier, illustrates the memory throughput improvements. In the single-threaded implementation, streaming loads increased memory throughput by more than 5x. In the dual-threaded implementation, streaming loads increased memory throughput by more than 7.5x.

Achieving the benefits
Intel SSE4 can improve the performance and energy efficiency of a broad range of applications, including media editors, encoders, 3D applications and games. Of course, performance gains will vary by workload and application.

So what do you need to do to incorporate the benefits of Intel SSE4 instructions into your applications?

Intel provides a variety of tools and performance libraries optimized for Intel SSE4. One such tool is the Intel C++ Compiler 10.0, which supports automatic vectorization.

The vectorizer analyzes loops to determine when several iterations can be executed in parallel using MMX, SSE, SSE2, SSE3 and SSE4 instructions. Vectorization can be a useful way to optimize application code and take advantage of new instruction set extensions when running applications on Intel processors.
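As an illustration of the kind of loop a vectorizing compiler can work with, consider the following C sketch. The function name `blend_rows` is hypothetical, and whether a given compiler actually vectorizes the loop depends on the compiler version and the optimization options used. The `restrict` qualifiers tell the compiler the three arrays do not overlap, which removes one of the common obstacles to vectorization.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounded average of two rows of 8-bit pixels.
   Each iteration is independent of the others, so several
   iterations can be processed at once by SIMD instructions. */
void blend_rows(uint8_t *restrict dst,
                const uint8_t *restrict a,
                const uint8_t *restrict b,
                size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}
```

The loop has a simple trip count, unit-stride accesses and no dependence between iterations, which are the properties loop analysis looks for. Loops with data-dependent branches, function calls or possible pointer aliasing are much less likely to be vectorized automatically.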

While some applications might achieve performance gains simply by recompiling with this compiler, you will obtain maximum gains by manually optimizing your applications. Some of the highest-value Intel SSE4 instructions require careful integration using intrinsics or assembly code, and may require algorithm changes. For example, achieving the benefits of streaming loads requires manual integration, but the payback is significantly improved performance.

To get started, visit the Intel Software Network. This site includes white papers and a downloadable Software Developers Kit (SDK) for Penryn and Intel SSE4. More information on Intel SSE4 instruction set innovation, and on Intel's architecture and silicon technology, is available on Intel's web site.

Jeremy Saldate is Senior Technical Marketing Engineer, Software and Solutions Group, Intel Corp.

References
[1], [2], [3] Motion Estimation with Intel Streaming SIMD Extensions 4 (Intel SSE4), White Paper. Intel Software Solutions Group, 2007

[4], [5] Increasing Memory Throughput With Intel Streaming SIMD Extensions 4 (Intel SSE4) Streaming Load, White Paper. Intel Software Solutions Group, 2007
