Designing for performance with software optimization
Data Movement
Data movement refers to how efficiently data moves into and out of the processor, how much of it moves, and how many times it does so. Data movement can make or break performance, especially when multiple processors and large datasets are involved.
Good data movement means that data flows into the processor's caches, is operated on, is evicted from the cache, and never needs to be reloaded, all while keeping the combined memory traffic generated by all the threads within the memory bandwidth available on the target platform.
When designing an application, consider how and when every piece of data is used. Look for predictable patterns of data movement that allow for good cache efficiency, while avoiding randomly accessed memory and the cache misses that come along with it. Work toward designs that use memory sparingly, and find ways to avoid any unnecessary data movement.
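As a rough illustration of the difference a predictable access pattern makes, consider summing a two-dimensional matrix stored in row-major order. The C++ sketch below (the layout and the function names are illustrative assumptions, not tied to any particular application) touches exactly the same elements in two different orders:

#include <cstddef>
#include <vector>

// Sum a rows-by-cols matrix stored row by row in one contiguous buffer.
// Walking it in storage order streams whole cache lines and lets the
// hardware prefetcher stay ahead of the loop.
double sum_in_storage_order(const std::vector<double>& m,
                            std::size_t rows, std::size_t cols)
{
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];      // unit-stride, cache-friendly
    return total;
}

// Walking the same buffer column by column jumps cols * sizeof(double)
// bytes per access, so nearly every access touches a different cache
// line and most of each loaded line goes unused.
double sum_across_storage_order(const std::vector<double>& m,
                                std::size_t rows, std::size_t cols)
{
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];      // strided, cache-hostile
    return total;
}

Both functions return the same result; only the order in which memory is touched differs, and on a matrix much larger than the caches that difference alone is typically worth a large factor in run time.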
Dynamically allocated linked data structures such as lists, queues, and trees are a common source of problems with data movement and cache efficiency because their memory tends to be spread out rather than sitting in neat, cacheable chunks. When traversing these data structures, the processor is forced to load noncontiguous memory, which in turn generates cache misses.
These structures also hurt cache efficiency because it is rare that a full cache line's worth of useful data is loaded with each node. Switching these data structures to contiguous arrays, or at least backing them with an array and linking elements by index rather than by pointer, helps keep the demands on the memory subsystem low.
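A minimal C++ sketch of that trade-off, using the standard containers purely for illustration:

#include <list>
#include <numeric>
#include <vector>

// Each list node is a separate heap allocation, so consecutive elements
// can end up far apart in memory; every step of the traversal chases a
// pointer and may miss the cache, and the node's link pointers occupy
// cache-line space the summation never uses.
long long sum_list(const std::list<int>& values)
{
    long long total = 0;
    for (int v : values)
        total += v;
    return total;
}

// The vector keeps its elements back to back, so every cache line pulled
// in is packed with useful data and the unit-stride access pattern is
// easy for the prefetcher to follow.
long long sum_vector(const std::vector<int>& values)
{
    return std::accumulate(values.begin(), values.end(), 0LL);
}

When elements rarely need to be inserted or removed in the middle, the contiguous version is almost always the better starting point; when node identity or frequent splicing genuinely matters, keeping the nodes in an array and linking them by index preserves most of the locality.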
Memory and Parallelism
Throwing threads into the design equation creates a unique set of design problems. Beyond the relatively simple choice between functional and data decomposition lies what may be the most important memory issue of all: how much memory the application consumes per unit of time.
All computers have a fixed amount of available memory bandwidth. If you are thinking that you can't do much about that, you would be correct. However, you do have control over how your application uses that bandwidth and that is what is really important.
Consider a compression and archive program like gzip. It loads a bunch of files, compresses each of them individually, and then sticks them all together. Stop right now and think for a moment how you would add parallelism to gzip.
Back so soon? One straightforward method is to divide up one file and compress it in chunks using multiple threads. But another, and equally valid, data decomposition method is to compress multiple files at the same time by compressing each file in its entirety on a separate thread. Which method is better?
The way to think about this problem is to consider the extreme case of when the number of processor cores equals the number of files to compress. So, if you are compressing 200 files, the computer would have 200 processor cores.
If your design choice was to compress each file separately with each file having its own single thread, all 200 files would need to be opened, read from the disk, loaded into the processors' caches, and stored at the same time! This would be a mess.
The hard disk would be seeking forever, the amount of RAM needed would equal the sum of all the file sizes, and the processors' caches would be getting thrashed. However, reading one file at a time, dividing it into cache-sized chunks, compressing those chunks with multiple threads, and storing only one file at a time keeps memory demands much lower, not to mention disk demands.
It might be a little more work to implement, because the one-thread-per-file version is essentially already written, but in the long run multiple threads per file will definitely be a stronger foundation.
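A minimal sketch of the multiple-threads-per-file decomposition in C++ follows; the compress_chunk routine is a stand-in for a real compressor, and both it and the chunk size are assumptions made only for illustration:

#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Stand-in for a real compression routine (for example, a deflate
// wrapper); it simply copies the bytes so the sketch is self-contained.
std::vector<unsigned char> compress_chunk(const unsigned char* data, std::size_t size)
{
    return std::vector<unsigned char>(data, data + size);
}

// Compress one file's buffer by splitting it into cache-sized chunks and
// handing each chunk to its own task. Only this single file's data has to
// be resident in memory while the work is in flight.
std::vector<std::vector<unsigned char>>
compress_file_in_chunks(const std::vector<unsigned char>& file, std::size_t chunk_size)
{
    std::vector<std::future<std::vector<unsigned char>>> tasks;
    for (std::size_t offset = 0; offset < file.size(); offset += chunk_size)
    {
        std::size_t length = std::min(chunk_size, file.size() - offset);
        tasks.push_back(std::async(std::launch::async,
                                   compress_chunk, file.data() + offset, length));
    }

    std::vector<std::vector<unsigned char>> compressed;
    for (auto& task : tasks)
        compressed.push_back(task.get());   // chunks come back in file order
    return compressed;
}

A real implementation would also have to record per-chunk sizes and whatever framing the compressed format requires, but the memory behavior is the point here: the working set is one file plus its compressed output, no matter how many files are waiting in the queue.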
An even stronger foundation would include using overlapped or nonblocking I/O so that you can be loading the next file, compressing the current file with multiple threads, and storing the previous file all at the same time.
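A simplified version of that pipeline can be sketched with standard C++ tasks, reusing compress_file_in_chunks from the sketch above. The read and write helpers here are plain illustrative stand-ins, and the overlap shown (compressing and storing the current file while the next file loads in the background) only approximates a fully asynchronous design:

#include <cstddef>
#include <fstream>
#include <future>
#include <iterator>
#include <string>
#include <vector>

// Illustrative helpers: read a whole file into memory and write the
// compressed chunks back out. A production version would handle errors
// and use the platform's asynchronous I/O facilities instead.
std::vector<unsigned char> read_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in), {});
}

void write_file(const std::string& path,
                const std::vector<std::vector<unsigned char>>& chunks)
{
    std::ofstream out(path, std::ios::binary);
    for (const auto& chunk : chunks)
        out.write(reinterpret_cast<const char*>(chunk.data()),
                  static_cast<std::streamsize>(chunk.size()));
}

// Process files one at a time, but start reading the next file before
// compressing the current one, so the disk time for file i+1 is hidden
// behind the compression and write of file i.
void compress_all(const std::vector<std::string>& paths, std::size_t chunk_size)
{
    if (paths.empty())
        return;

    auto next = std::async(std::launch::async, read_file, paths[0]);

    for (std::size_t i = 0; i < paths.size(); ++i)
    {
        std::vector<unsigned char> current = next.get();

        if (i + 1 < paths.size())
            next = std::async(std::launch::async, read_file, paths[i + 1]);

        write_file(paths[i] + ".out",   // illustrative output name
                   compress_file_in_chunks(current, chunk_size));
    }
}

With this structure, the data in flight at any moment is bounded by roughly two files, the one being compressed and stored and the one being read, rather than by the total size of the job.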

