Designing for performance with software optimization -

Designing for performance with software optimization

Optimization of applications should not be a final step in project development, but rather an ongoing design philosophy that begins from the planning stages all the way through completion.

By thinking about the optimal algorithms and data structure designs early on, greater performance benefits can be realized with less work by the time the product is mature.

But, getting into this mindset requires being aware of everything from computational complexity of algorithms, to the data structure design and implementation, and the proper use of parallelization.

Designing for Performance
Software optimization can occur at any time during application development, but the earlier you start, the better. Typically, the later you start optimizing in a development cycle the more the optimizations tend to be narrower in scope and focused more on algorithm implementations and choice of instructions, like adding a few SIMD instructions here and there, instead of broad high-level optimizations.

High-level optimizations include the basic software architecture, key data structures and buffers, algorithms, memory access patterns, and of course parallelism. It is in these high-level optimizations that huge performance improvements can be made for seemingly little effort.

Unless you are writing short applications, it is going to take a ton of more effort to make significant performance improvements at the local level as compared to the high-level or foundation-level.

That old clich of laying a solid foundation is alive and well in software optimization. Before any code is written, think about performance and what foundation-laying things can be done to guarantee a high-performance application.

Designing the performance foundation focuses on the selection and evaluation of the critical algorithms and how those algorithms store data and move it throughout the application.

The goal is to find an algorithm and general architecture that can be easily threaded, scales well to multiple processors, is easy on memory bandwidth, uses the processors' caches efficiently, stores data in formats that are SIMD friendly, isn't limited by some of the slower instructions such as the transcendental functions (e.g. sine, cosine, logarithms, and so forth), and does not bottleneck on an off-motherboard system component like the network or hard disk.

That might sound like a real tough maybe even an impossible job, but remember any extra time spent improving the foundation means less time anguishing over how to make the application faster later when time is fleeting and you only really have time left to optimize a function or two at a local-level.

Good algorithms and data layouts are the foundations of a fast application, and apt use of them can open up many more performance opportunities later on and may even avoid the need to use anything more than the compiler.

Data Movement
Data movement refers to how well and how much data moves into and out of the processor and how many times it does so. Data movement can make or break performance especially when considering multiple processors and large datasets.

Good data movement means that the data flows into the processor's caches, is operated on, gets evicted from the cache, and never needs to be reloaded all while making sure that the amount of memory needed by all the threads is within the memory bandwidth available on the target platform.

When designing an application, consider how and when every piece of data is used. Look for predictable patterns of data movement that allow for good cache efficiency while avoiding randomly accessed memory and the cache misses that come along with them. Work towards designs that use memory sparingly and find ways to avoid any unnecessary memory movements.

Dynamically allocated linked data structures like lists, queues, and trees are a common source of problems with data movement and cache efficiency because memory tends to be spread out instead of in neat cacheable chunks. When traversing these data structures, the processor is forced to load discontinuous memory that, in turn, generates cache misses.

These structures also present a problem for cache efficiency because it is rare that a full cache line's worth of data is needed. Switching these data structures to continuous arrays or at least keeping an index in an array will help to keep the demands of the memory subsystem low.

Memory and Parallelism
Throwing threads into the design equation creates a unique set of design problems. Beyond the relatively simple choices of functional or data decomposition lies what could be the most important memory issue and that is just how much memory is consumed in a unit of time.

All computers have a fixed amount of available memory bandwidth. If you are thinking that you can't do much about that, you would be correct. However, you do have control over how your application uses that bandwidth and that is what is really important.

Consider a compression and archive program like gzip. It loads a bunch of files, compresses each of them individually, and then sticks them all together. Stop right now and think for a moment how you would add parallelism to gzip .

Back so soon? One straightforward method is to divide up one file and compress it in chunks using multiple threads. But another, and equally valid, data decomposition method is to compress multiple files at the same time by compressing each file in its entirety on a separate thread. Which method is better?

The way to think about this problem is to consider the extreme case of when the number of processor cores equals the number of files to compress. So, if you are compressing 200 files the computer would have 200 processor cores.

If your design choice was to compress each file separately with each file having its own single thread, all 200 files would need to be opened, read from the disk, loaded into the processors' caches, and stored at the same time! This would be a mess.

The hard disk would be seeking forever, the amount of RAM needed would be equal to the sum of all the file sizes and the processors caches would be getting thrashed. However, reading one file at a time, dividing it into cache-sized chunks, compressing those chunks with multiple threads, and storing only one file at a time keeps memory demands much lower not to mention disk demands.

It might be a little more work to implement because the one thread per file version is already written, but in the long run, multiple threads per one file will definitely be a stronger foundation.

An even stronger foundation would include using overlapped/nonblocking I/O so that you can be loading the next file, while compressing with multiple threads the current file, and storing the previous file all at the same time.

Performance Experiments for Design
You can and should use performance experiments and prototypes to determine whether the data movement and choice of threading is good or bad in your application. A performance experiment that accesses memory in the same manner as the final application or algorithm, but does not perform any calculations, shows you the maximum speed at which that portion of your application can run.

Simple load, copy, and store memory operations should be sufficient to access memory in a meaningful and representative way. A note of caution here though, the compiler might detect that loads are doing nothing and remove that code completely. Make sure to examine the compiler's assembly code to avoid this situation and replace the do-nothing load loop with assembly if the compiler's code optimization engine skips it.

Using these experiments, it should be quick and easy to analyze and modify the algorithms to improve memory access patterns or reduce the amount of memory needed to increase performance.

Once the optimal pattern is discovered, the calculations can be put in and performance can be monitored to make sure that memory does not become a bottleneck. As you are adding in the calculations, make sure to use this opportunity to monitor how far from the speed-of-light time (do-nothing time) the code is getting.

Should it get to a point where you feel performance is heading in a bad direction, you can try other designs to make a compromise between calculation efficiency and memory usage.

When designing data structures keep the following things in mind:

* Cache efficiency. Memory that gets transferred across the bus because it is part of a loaded cache line, but never is used, is a huge waste. Find ways to organize data to increase the efficiency of the processor's caches, and as a result, transfer less memory.

Refine algorithms and their use of data structures to minimize cache conflicts, cache capacity issues, false sharing, and wasted memory loads. Think about what the best possible data layout would be and then try to work towards it tweaking the algorithm when necessary.

Remember, when working with small performance experiments, it will be easy to try a bunch of different data structures to find the one that works best.

* Instruction Parallelism. The right data organization, alignment, and padding can make it easy to add instruction parallelism. Make sure to accommodate the requirements of the SIMD instructions and multiple threads by organizing the data and adjusting algorithms when you still can do so easily.

Even if you do not plan on using the SIMD instructions or multiple threads, at least consider what the friendliest SIMD format is and move in that direction. Just because you don't specifically use the SIMD instructions does not mean the compiler cannot use them automatically for you. Furthermore, do not overlook the goal of keeping the data dependencies low.

During the design phase, you probably can guess fairly easily which algorithms will be the most costly in terms of calculation time and memory demands. These key algorithms are going to likely end up using the majority of the computing resources, and therefore they should also consume the majority of the design time optimizations.

Computational complexity is the critical starting point. But, right along with selecting an efficient algorithm comes the task of making sure that it fits well within the rest of the application and within the processor's performance features such as the cache and the SIMD instructions.

The fastest algorithm is not always the one with the lowest computational complexity or the one that is easiest to thread. When designing an algorithm or process, consider the following issues:

* Always consider the most efficient algorithm. An algorithm's computational complexity makes a huge performance difference and should always be considered a top priority. But algorithms also have memory requirements, instruction requirements, scalability issues, and data dependency issues. Consider the big picture when choosing an algorithm. Look for ways to tweak the current algorithm to get the best possible performance from it before trying a different one.

* Don't be limited by memory access patterns and cache efficiency. The algorithm mostly dictates memory access patterns and cache efficiency. Examine the memory used by a particular task or algorithm to determine the cache usage. Look for ways to minimize cache misses by changing data structures, reducing the amount of memory used, blocking, or any of a number of other memory optimizations.

If possible, determine whether an algorithm will be called with a warm (data already in the cache) or cold (data needs to be loaded) cache, and whether the data can be left in the cache to be used by a future function, if needed.

* Allow room to use the SIMD Instructions. Analyze the algorithms to determine whether the SIMD instructions can be used. The fastest way to get an extra performance boost is to do more operations at the same time using instruction-level parallelism. If the data is in a SIMD friendly format, you will need to do much less overhead-work in the inner-loops.

Most loops that use SIMD have three stages: source data conversion to a SIMD friendly format, the SIMD calculations, and finally conversion to the destination format.

Keep the conversion waste as small and quick as possible to keep the overall loop efficiency high (useful instructions compared to conversion instructions). Also, a good SIMD friendly layout will increase your chances of the compiler automatically generating the SIMD instructions for you.

* Keep data dependencies low. Data dependencies limit the maximum speed of any instruction sequence because the processor cannot start an instruction before its data is ready. You can get a good approximation of the number of clocks an algorithm will take just by counting the longest chain of data dependent instructions.

Some algorithms have few dependencies while other algorithms are full of them. Choosing an algorithm with few dependencies permits the processor to reorder more instructions in order to maximize the use of the internal execution units keeping the performance potential high.

* Plan to use multiprocessing. The multiprocessor industry is now mature. Inexpensive parallel computing with multiple cores and Hyper-Threading Technology are ubiquitous. A good algorithm should be able to be multithreaded to keep up with hardware improvements and customer demands.

Consider at design time how to decompose the algorithms for parallel execution. Even if the first version of the application is single threaded, it will be easy to improve performance later on with a good foundation.

* Find the bottlenecks and the constrained resources early. Use the prototypes and performance experiments to identify the location of bottlenecks as soon as possible; it will help to steer you towards making good algorithm choices. Off motherboard bottlenecks like network access, USB transfer rates, and disk access usually dominate most programs.

When bottlenecks are known early, you can plan more easily for the use of threads and non-waited I/O that keeps the processor doing useful things during the wait for these slower resources. Engineering time can be spent designing and improving the actual bottleneck instead of trying to squeeze out more disk performance.

* Avoid semaphores. Select an algorithm with the least amount of synchronization (critical sections, mutexes, and so on). Synchronization is usually a performance killer. When synchronization is high, the processors get stuck waiting and that's just lost performance. Keep in mind that operating system calls, such as memory allocations and GDI, also use synchronization objects which can cause the idle time to creep up.

Final key points
Keep the following design-time issues in mind:

1) Design-time optimizations can have a big impact on performance. Start early!
2) Design efficient data structures and buffers before application development to allow for the greatest flexibility laying the performance foundation for your application.
3) No substitute exists for a well-chosen highly efficient algorithm that fits well within the data movement of your application and performance features of the processor.
4) Make sure to consider the demands of multiple threads by including the total memory demands, not just any one thread.
5) Design your application to minimize the number of synchronization objects.
6) Test your designs using simple experiments that are quick to write and easy to analyze.

Copyright 2008 Intel Corporation. All rights reserved. This article is based on material The Software Optimization Cookbook, Second Edition by Richard Gerber, Aart J.C. Bik, Kevin B. Smith, and Xinmin Tian and used with the permission of Intel Press.

Richard Gerber has worked on numerous multimedia projects, 3D libraries, and computer games for Intel. As a software engineer, he worked on the Intel VTune Performance Analyzer and led training sessions on optimization techniques. Richard is the original author of The Software Optimization Cookbook and co-author of Programming with Hyper-Threading Technology.

Aart J.C. Bik holds a PhD in computer science and is a Principal Engineer at Intel Corporation, working on the development of high performance Intel C++ and Fortran compilers. Aart received an Intel Achievement Award, the company's highest award, for making the Intel Streaming SIMD Extensions easier to use through automatic vectorization. Aart is the author of The Software Vectorization Handbook.

Kevin B. Smith is a software architect for Intel's C and FORTRAN compilers. Since 1981 he has worked on optimizing compilers for Intel 8086, 80186, i960, Pentium, Pentium Pro, Pentium III, Pentium 4, and Pentium M processors.

Xinmin Tian holds a PhD in computer science and leads an Intel development group working on exploiting thread-level parallelism in high-performance Intel C++ and Fortran compilers for Intel Itanium, IA-32, Intel EM64T, and multi-core architectures.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.