Parallel programming has been around for decades, though before theadvent of multi-core processors, it was an esoteric discipline.Numerous programmers have tripped over the common stumbling blocks bynow. By recognizing these problems you can avoid stumbling.
Furthermore, it is important to understand the common problemsbefore designing a parallel program, because many of the problems arisefrom the overall decomposition of the program, and cannot be easilypatched later. This series of articles looks at some of these commonproblems, their symptoms, and ways to circumvent them.
Too Many Threads
It may seem that if a little threading is good, then a lot must bebetter. In fact, having too many threads can seriously degrade programperform-ance. The impact comes in two ways. First, partitioning a fixedamount of work among too many threads gives each thread too littlework, so that the overhead of starting and terminating threads swampsthe useful work. Second, having too many concurrent software threadsincurs overhead from having to share fixed hardware resources.
When there are more software threads than hardware threads, theoperating system typically resorts to round robin scheduling. Thescheduler gives each software thread a short turn, called a time slice,to run on one of the hardware threads.
When a software thread's time slice runs out, the schedulerpreemptively suspends the thread in order to run another softwarethread on the same hardware thread. The software thread freezes in timeuntil it gets another time slice.
Time slicing ensures that all software threads make some progress.Otherwise, some software threads might hog all the hardware threads andstarve other software threads. However, this equitable distribution ofhardware threads incurs overhead. When there are too many softwarethreads, the overhead can severely degrade performance. There areseveral kinds of overhead, and it helps to know the culprits so you canspot them when they appear.
Thread registerstate. The most obvious overhead is the process of saving andrestoring a thread's register state. Suspending a software threadrequires saving the register values of the hardware thread, so thevalues can be restored later, when the software thread resumes on itsnext time slice. Typically, thread schedulers allocate big enough timeslices so that the save/restore overheads for registers areinsignificant, so this obvious overhead is in fact not much of aconcern.
A more subtle overhead of time slicing is saving and restoring athread's cache state. Modern processors rely heavily on cache memory,which can be about 10 to 100 times faster than main memory. Accessesthat hit in cache are not only much faster; they also consume nobandwidth of the memory bus. Caches are fast, but finite.
When the cache is full, a processor must evict data from the cacheto make room for new data. Typically, the choice for eviction is theleast recently used data, which more often than not is data from anearlier time slice. Thus threads tend to evict each other's data. Thenet effect is that too many threads hurt performance by fighting eachother for cache.
Thrashingvirtual memory. A similar overhead, at a different level, isthrashing virtual memory. Most systems use virtual memory, where theprocessors have an address space bigger than the actual availablememory. Virtual memory resides on disk, and the frequently usedportions are kept in real memory. Similar to caches, the least recentlyused data is evicted from memory when necessary to make room.
Each software thread requires virtual memory for its stack andprivate data structures. As with caches, time slicing causes threads tofight each other for real memory and thus hurts performance. In extremecases, there can be so many threads that the program runs out of evenvirtual memory.
The cache and virtual memory issues described arise from sharinglimited resources among too many software threads. A very different,and often more severe, problem arises called convoying, in whichsoftware threads pile up waiting to acquire a lock.
Consider what happens when a thread's time slice expires while thethread is holding a lock. All threads waiting for the lock must nowwait for the holding thread to wake up and release the lock. Theproblem is even worse if the lock implementation is fair, in which thelock is acquired in first-come first-served order. If a waiting threadis suspended, then all threads waiting behind it are blocked fromacquiring the lock.
The solution that usually works best is to limit the number of”runnable” threads to the number of hardware threads, and possiblylimit it to the number of outer-level caches.
For example, a dual-core Intel Pentium Processor Extreme Edition hastwo physical cores, each with Hyper-Threading Technology, and each withits own cache. This configuration supports four hardware threads andtwo outer-level caches.
Using all four runnable threads will work best unless the threadsneed so much cache that it causes fighting over cache, in which casemaybe only two threads is best. The only way to be sure is toexperiment. Never “hard code” the number of threads; leave it as atuning parameter.
Runnable threads, not blocked threads, cause time-slicing overhead.When a thread is blocked waiting for an external event, such as a mouseclick or disk I/O request, the operating system takes it off theround-robin schedule. Hence a blocked thread does not causetime-slicing overhead. A program may have many more software threadsthan hardware threads, and still run efficiently if most of the OSthreads are blocked.
A helpful organizing principle is to separate compute threads fromI/O threads. Compute threads should be the threads that are runnablemost of the time. Ideally, the compute threads never block on externalevents, and instead feed from task queues that provide work. The numberof compute threads should match the processor resources. The I/Othreads are threads that wait on external events most of the time, andthus do not contribute to having too many threads.
Some Useful Programming Practices
Because building efficient task queues takes some expertise, it isusually best to use existing software to do this. Common usefulpractices are as follows:
1) Let OpenMPdo the work. OpenMP lets the programmer specify loop iterationsinstead of threads. OpenMP deals with managing the threads. As long asthe programmer does not request a particular number of threads, theOpenMP implementation will strive to use the optimal number of softwarethreads.
2) Use a threadpool, which is a construct used to maintain a set of long livedsoftware threads and eliminates the overhead of initialization processof threads for short lived tasks. A thread pool is a collection oftasks which are serviced by the software threads in the pool. Eachsoftware thread finishes a task before taking on another.
For example, Windows has a routine QueueUserWorkItem. Clients addtasks by calling QueueUserWorkItem with a callback and pointer thatdefine the task. Hardware threads feed from this queue. For managedcode, Windows .NET has a class ThreadPool. Java has a class Executorfor similar purposes. Unfortunately, there is no standard thread poolsupport in POSIX threads.
3) Experts maywish to write their own task scheduler. The method of choice iscalled work stealing, where each thread has its own private collectionof tasks. When a thread runs out of tasks, it steals from anotherthread's collection. Work stealing yields good cache usage and loadbalancing. While a thread is working on its own tasks, it tends to bereusing data that is hot in its cache.
When it runs out of tasks and has to steal work, it balances theload. The trick to effective task stealing is to bias the stealingtowards large tasks, so that the thief can stay busy for a while. Theearly Cilk scheduler is a good example of how to write an effectivetask-stealing scheduler.
|Figure7.1 Unsynchronized Threads Racing against each Other Lead toNondeterministic Outcome|
Data Races, Deadlocks, and LiveLocks
Unsynchronized access to shared memory can introduce race conditions,where the program results depend nondeterministically on the relativetimings of two or more threads. Figure7.1 above shows two threads trying to add to a shared variablex, which has an initial value of 0. Depending upon the relative speedsof the threads, the final value of x can be 1, 2, or 3.
Parallel programming would be a lot easier if races were as obviousas in Figure 7.1 above . Butthe same race can be hidden by language syntax in a variety of ways, asshown by the examples in Figure 7.2below.
Update operations such as += are normally just shorthand for “temp =x; x = temp+1”, and hence can result in interleaving. Sometimes theshared location is accessed by different expressions. Sometimes theshared location is hidden by function calls. Even if each thread uses asingle instruction to fetch and update the location, there could beinterleaving, because the hardware might break the instruction intointerleaved reads and writes.
|Figure7.2 Race Conditions Hiding behind Language Syntax|
Intel Thread Checker is a powerful tool for detecting potential raceconditions. It can see past all the varieties of camouflage shown inFigure 7.2 because it deals interms of actual memory locations, nottheir names or addressing expressions.
Sometimes deliberate race conditions are intended and useful. Forexample, threads may be reading a location that is updatedasynchronously with a “latest current value.”
In such a situation, care must be taken that the writes and readsare atomic. Otherwise, garbled data may be written or read. Forexample, reads and writes of structure types are often done a word at atime or a field at a time. Types longer than the natural word size,such as 80-bit floating-point, might not be read or written atomically,depending on the architecture.
Likewise, misaligned loads and stores, when supported, are usuallynot atomic. If such an access straddles a cache line, the processorperforms the access as two separate accesses to the two constituentcache lines.
Data races can arise not only from unsynchronized access to sharedmemory, but also from synchronized access that was synchronized at toolow a level. Figure 7.3 below shows such an example.
The intent is to use a list to represent a set of keys. Each keyshould be in the list at most once. Even if the individual listoperations have safeguards against races, the combination suffers ahigher level race.
If two threads both attempt to insert the same key at the same time,they may simultaneously determine that the key is not in the list, andthen both would insert the key. What is needed is a lock that protectsnot just the list, but that also protects the invariant “no key occurstwice in list.”
|Figure7.3 A Higher-Level Race Condition Example.|
Adding the necessary lock to correct Figure 7.3 above exposes thefrustrating performance problem of locks. Building locks into low-levelcomponents is often a waste of time, because the high-level componentsthat use the components will need higher-level locks anyway. Thelower-level locks then become pointless overhead.
Fortunately, in such a scenario the high-level locking causes thelow-level locks to be uncontended, and most lock implementationsoptimize the uncontended case. Hence the performance impact is somewhatmitigated, but for best performance the superfluous locks should beremoved. Of course there are times when components should provide theirown internal locking. This topic is discussed later in the discussionof thread-safe libraries.
Dealing with Deadlocks
Race conditions are typically cured by adding a lock that protects theinvariant that might otherwise be violated by interleaved operations.Unfortunately, locks have their own hazards, most notably deadlock. Figure 7.4 below shows a deadlockinvolving two threads. Thread 1 has acquired lock A. Thread 2 hasacquired lock B. Each thread is trying to acquire the other lock.Neither thread can proceed.
|Figure7.4 Deadlock Caused by Cycle|
Though deadlock is often associated with locks, it can happen anytime a thread tries to acquire exclusive access to two more sharedresources. For example, the locks in Figure7.4 above could be files instead, where the threads are tryingto acquire exclusive file access. Deadlock can occur only if thefollowing four conditions hold true:
1. Access to each resourceis exclusive.
2. A thread is allowed to holdone resource while requesting another.
3. No thread is willing torelinquish a resource that it has acquired.
4. There is a cycle of threadstrying to acquire resources, where each resource is held by one threadand requested by another.
Deadlock can be avoided by breaking any one of these conditions.
Often the best way to avoid deadlock is to replicate a resource thatrequires exclusive access, so that each thread can have its own privatecopy. Each thread can access its own copy without needing a lock. Thecopies can be merged into a single shared copy of the resource at theend if necessary.
By eliminating locking, replication avoids deadlock and has thefurther benefit of possibly improving scalability, because the lockthat was removed might have been a source of contention.
If replication cannot be done, that is, in such cases where therereally must be only a single copy of the resource, common wisdom is toalways acquire the resources (locks) in the same order. Consistentlyordering acquisition prevents deadlock cycles. For instance, thedeadlock in Figure 7.4 above cannot occur if threads always acquire lock A before they acquire lockB.
The ordering rules that are most convenient depend upon the specificsituation. If the locks all have associated names, even something assimple as alphabetical order works. This order may sound silly, but ithas been successfully used on at least one large project.
For multiple locks in a data structure, the order is often based onthe topology of the structure. In a linked list, for instance, theagreed upon order might be to lock items in the order they appear inthe list.
In a tree structure, the order might be a pre-order traversal of thetree. Somewhat similarly, components often have a nested structure,where bigger components are built from smaller components. Forcomponents nested that way, a common order is to acquire locks in orderfrom the outside to the inside.
If there is no obvious ordering of locks, a solution is to sort thelocks by address. This approach requires that a thread know all locksthat it needs to acquire before it acquires any of them.
For instance, perhaps a thread needs to swap two containers pointedto by pointers x and y, and each container is protected by a lock. Thethread could compare “x < y" to determine which container comesfirst, and acquire the lock on the first container before acquiring alock on the second container, as Figure7.5 below illustrates.
|Figure7.5 Locks Ordered by their Addresses|
In large software projects, different programmers constructdifferent components, and by necessity should not have to understandthe inner workings of the other components. It follows that to preventaccidental deadlock, software components should try to avoid holding alock while calling code outside the component, because the call chainmay cycle around and create a deadlock cycle.
The third condition for deadlock is that no thread is willing togive up its claim on a resource. Thus another way of preventingdeadlock is for a thread to give up its claim on a resource if itcannot acquire the other resources. For this purpose, mutexes oftenhave some kind of “try lock” routine that allows a thread to attempt toacquire a lock, and give up if it cannot be acquired.
This approach is useful in scenarios where sorting the locks isimpractical. Figure 7.6 below sketches the logic for using a “try lock” approach to acquire twolocks, A and B. In Figure 7.6, a thread tries to acquire both locks,and if it cannot, it release both locks and tries again.
|Figure7.6 “Try and Back Off” Logic|
Figure 7.6 above has sometiming delays in it to prevent the hazard of live lock. Live lockoccurs when threads continually conflict with each other and back off. Figure 7.6 applies exponentialbackoff to avoid live lock. If a thread cannot acquire all the locksthat it needs, it releases any that it acquired and waits for a randomamount of time.
The random time is chosen from an interval that doubles each timethe thread backs off. Eventually, the threads involved in the conflictwill back off sufficiently that at least one will make progress. Thedisadvantage of backoff schemes is that they are not fair. There is noguarantee that a particular thread will make progress. If fairness isan issue, then it is probably best to use lock ordering to preventdeadlock.
Next in Part 2: Solutions topriority inversion and heavily contented locks.
Thisarticle was excerpted from Multi-CoreProgrammingby Shameem Akhter and Jason Roberts.Copyright © 2006 Intel Corporation. All rights reserved.
ShameemAkhter is a platform architect at Intel Corporation, focusing on singlesocket multi-core architecture and performance analysis. Jason Roberts is a senior softwareengineer at Intel, and has worked on a number of differentmulti-threaded software products that span a wide range of applicationstargeting desktop, handheld, and embedded DSP platforms.
To read more onEmbedded.com about the topics discussed in this article, go to “Moreabout multicores, multiprocessors and multithreading.”