
Common multicore programming problems: Part 2 – Heavily Contended Locks

Proper use of locks to avoid race conditions can invite performance problems if the lock becomes highly contended. The lock becomes like a tollgate on a highway. If cars arrive at the tollgate faster than the toll taker can process them, the cars will queue up in a traffic jam behind the tollgate.

Similarly, if threads try to acquire a lock faster than the rate at which a thread can execute the corresponding critical section, then program performance will suffer as threads form a “convoy” waiting to acquire the lock. Indeed, this behavior is sometimes referred to as convoying.

As mentioned in the discussion of time-slicing woes, convoying becomes even worse for fair locks, because if a thread falls asleep, all threads behind it have to wait for it to wake up. Imagine that software threads are cars and hardware threads are the drivers in those cars. This might seem like a backwards analogy, but from a car's perspective, people exist solely to move cars between parking places. If the cars form a convoy and a driver leaves his or her car, everyone else behind is stuck.

Priority Inversion
Some threading implementations allow threads to have priorities. When there are not enough hardware threads to run all software threads, the higher-priority software threads get preference.

For example, foreground tasks might be running with higher priorities than background tasks. Priorities can be useful but, paradoxically, can lead to situations where a low-priority thread blocks a high-priority thread from running.

Figure 7.7 below illustrates priority inversion. Continuing our analogy with software threads as cars and hardware threads as drivers, three cars are shown, but there is only a single driver. A low-priority car has acquired a lock so it can cross a single-lane “critical section” bridge.

Behind it waits a high-priority car. But because the high-priority car is blocked, the driver is attending the highest-priority runnable car, which is the medium-priority one. As contrived as this sounds, it actually happened on the NASA Mars Pathfinder mission.

Figure 7.7. Priority Inversion Scenario, Where High Priority Gets Blocked and Medium Priority Gets the Cycles

In real life, the problem in Figure 7.7 would be solved by bumping up the priority of the blocking car until it is out of the way. With locks, this is called priority inheritance.

When a high-priority thread needs to acquire a lock held by a low-priority thread, the scheduler bumps up the priority of the blocking thread until the lock is released. Indeed, the Mars Pathfinder problem was solved by turning on priority inheritance.

An alternative is priority ceilings, in which a priority, called the ceiling, is assigned to the mutex. The ceiling is the highest priority of any thread that is expected to hold the mutex.

When a thread acquires the mutex, its priority is immediately bumped up to the ceiling value for the duration that it holds the mutex. The priority ceilings scheme is eager to bump up a thread's priority; in contrast, the priority inheritance scheme is lazy, raising a thread's priority only when necessary.

Windows mutexes support priority inheritance by default. Pthreads mutexes do not necessarily support either protocol: both the priority inheritance and priority ceiling protocols are optional in the pthreads standard. If they exist in a particular implementation, they can be set for a mutex via the function pthread_mutexattr_setprotocol and inspected with the function pthread_mutexattr_getprotocol.
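For instance, on a pthreads implementation that provides the optional priority inheritance protocol, a mutex could be initialized with the protocol enabled roughly as follows. This is a minimal sketch, assuming the implementation defines _POSIX_THREAD_PRIO_INHERIT; the helper name init_pi_mutex is illustrative, not part of any standard API.

#include <pthread.h>
#include <unistd.h>   /* for the _POSIX_THREAD_PRIO_INHERIT feature-test macro */

/* Illustrative helper: create a mutex that uses priority inheritance,
   falling back to a plain mutex if the protocol is unavailable. */
int init_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;

#if defined(_POSIX_THREAD_PRIO_INHERIT)
    /* PTHREAD_PRIO_PROTECT (with pthread_mutexattr_setprioceiling)
       would select the priority ceiling protocol instead. */
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc != 0) {
        pthread_mutexattr_destroy(&attr);
        return rc;
    }
#endif

    rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}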

Programmers “rolling their own” locks or busy waits may encounter priority inversion if threads with different priorities are allowed to acquire the same lock. Hand-coded spin locks are a common example. If neither priority inheritance nor priority ceilings can be built into the lock or busy wait, then it is probably best to restrict the lock's contenders to threads with the same priority.
Solutions for Heavily Contended Locks
Upon encountering a heavily contended lock, the first reaction of many programmers is “I need a faster lock.” Indeed, some implementations of locks are notoriously slow, and faster locks are possible. However, no matter how fast the lock is, it must inherently serialize threads. A faster lock can thus help performance by a constant factor, but will never improve scalability. To improve scalability, either eliminate the lock or spread out the contention.

An earlier discussion of deadlock mentioned the technique of eliminating a lock by replicating the resource. That is certainly the method of choice to eliminate lock contention if it is workable. For example, consider contention for a counter of events. If each thread can have its own private counter, then no lock is necessary. If the total count is required, the counts can be summed after all threads are done counting.
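A minimal sketch of this idea follows. The names MAX_THREADS, count_event, and total_count are illustrative, and the padding simply keeps each private counter on its own cache line.

#define MAX_THREADS 8

/* One private counter per thread, padded so counters do not share a cache line. */
struct padded_counter {
    long count;
    char pad[64 - sizeof(long)];
};

static struct padded_counter counters[MAX_THREADS];

/* Each thread increments only its own slot, so no lock is needed. */
void count_event(int thread_id)
{
    counters[thread_id].count++;
}

/* After all threads have finished counting, sum the private counters. */
long total_count(int num_threads)
{
    long total = 0;
    for (int i = 0; i < num_threads; i++)
        total += counters[i].count;
    return total;
}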

If the lock on a resource cannot be eliminated, consider partitioning the resource and using a separate lock to protect each partition. The partitioning can spread out contention among the locks. For example, consider a hash table implementation where multiple threads might try to do insertions at the same time.

A simple approach to avoid race conditions is to use a single lock to protect the entire table. The lock allows only one thread into the table at a time. The drawback of this approach is that all threads must contend for the same lock, which could become a serial bottleneck. An alternative approach is to create an array of sub-tables, each with its own lock, as shown in Figure 7.8 below.

Keys are mapped to the sub-tables via a hashing function: for a given key, a thread can figure out which sub-table to inspect by using a hash function that returns a sub-table index.

Insertion of a key commences by hashing the key to one of the sub-tables, and then doing the insertion on that sub-table while holding the sub-table's lock. Given enough sub-tables and a good hash function, the threads will mostly not contend for the same sub-table and thus not contend for the same lock.
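The insertion path might look roughly like the sketch below. The sub-table type, the sub_table_insert helper, and the hash constant are illustrative assumptions, not a specific library API; the per-sub-table mutexes are assumed to be initialized at startup.

#include <pthread.h>
#include <stddef.h>

#define NUM_SUBTABLES 16    /* power of two so the hash can be masked */

/* Illustrative sub-table type; each one is an ordinary hash table. */
typedef struct sub_table sub_table_t;
void sub_table_insert(sub_table_t *t, long key, void *value);

static sub_table_t     *sub_tables[NUM_SUBTABLES];
static pthread_mutex_t  sub_locks[NUM_SUBTABLES];  /* one lock per sub-table */

/* Map a key to a sub-table index. */
static size_t sub_index(long key)
{
    return (size_t)((unsigned long)key * 2654435761u) & (NUM_SUBTABLES - 1);
}

/* Insert while holding only the chosen sub-table's lock, so threads that
   hash to different sub-tables do not contend with each other. */
void table_insert(long key, void *value)
{
    size_t i = sub_index(key);
    pthread_mutex_lock(&sub_locks[i]);
    sub_table_insert(sub_tables[i], key, value);
    pthread_mutex_unlock(&sub_locks[i]);
}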

Figure 7.8 Spreading out Contention by Partitioning a Hash Table into Multiple Sub-tables

Pursuit of the idea of spreading contention among multiple locks further leads to fine-grained locking. For example, hash tables are commonly implemented as an array of buckets, where each bucket holds keys that hashed to the same array element. In fine-grained locking, there might be a lock on each bucket.
This way multiple threads can access different buckets in parallel. This is straightforward to implement if the number of buckets is fixed. If the number of buckets has to be grown, the problem becomes more complicated, because resizing the array may require excluding all but the thread doing the resizing.

A reader-writer lock helps solve this problem, as will be explained shortly. Another pitfall is that if the buckets are very small, the space overhead of the lock may dominate.
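For the simpler fixed-size case, per-bucket locking might be sketched as follows. The node and bucket layouts are illustrative, and the per-bucket mutexes are assumed to be initialized with pthread_mutex_init at startup.

#include <pthread.h>
#include <stddef.h>

#define NUM_BUCKETS 1024    /* fixed number of buckets */

struct node { long key; void *value; struct node *next; };

struct bucket {
    pthread_mutex_t lock;   /* protects only this bucket's chain */
    struct node    *head;
};

static struct bucket buckets[NUM_BUCKETS];

/* Look up a key while holding only its bucket's lock, so lookups that
   land in different buckets proceed in parallel. */
void *table_lookup(long key)
{
    struct bucket *b = &buckets[(unsigned long)key % NUM_BUCKETS];
    void *result = NULL;
    pthread_mutex_lock(&b->lock);
    for (struct node *n = b->head; n != NULL; n = n->next) {
        if (n->key == key) {
            result = n->value;
            break;
        }
    }
    pthread_mutex_unlock(&b->lock);
    return result;
}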

If a data structure is frequently read, but infrequently written, then a reader-writer lock may help deal with contention. A reader-writer lock distinguishes readers from writers. Multiple readers can acquire the lock at the same time, but only one writer can acquire it at a time. Readers cannot acquire the lock while a writer holds it and vice versa. Thus readers contend only with writers.
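In pthreads, for example, this behavior is provided by pthread_rwlock_t. The short sketch below uses an illustrative shared value to show the read and write paths.

#include <pthread.h>

/* Shared data that is read often and written rarely. */
static pthread_rwlock_t cfg_lock = PTHREAD_RWLOCK_INITIALIZER;
static int cfg_value;

/* Many readers may hold the read lock at the same time. */
int read_config(void)
{
    pthread_rwlock_rdlock(&cfg_lock);
    int v = cfg_value;
    pthread_rwlock_unlock(&cfg_lock);
    return v;
}

/* A writer excludes both readers and other writers. */
void write_config(int v)
{
    pthread_rwlock_wrlock(&cfg_lock);
    cfg_value = v;
    pthread_rwlock_unlock(&cfg_lock);
}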

The earlier fine-grained hash table is a good example of where reader-writer locks can help if the array of buckets must be dynamically resizable. Figure 7.9 below shows a possible implementation. The table consists of an array descriptor that specifies the array's size and location. A reader-writer mutex protects this structure. Each bucket has its own plain mutex protecting it.

To access a bucket, a thread acquires two locks: a reader lock on the array descriptor, and a lock on the bucket's mutex. The thread acquires a reader lock, not a writer lock, on the reader-writer mutex even if it is planning to modify a bucket, because the reader-writer mutex protects the array descriptor, not the buckets.

If a thread needs to resize the array, it requests a writer lock on the reader-writer mutex. Once granted, the thread can safely modify the array descriptor without introducing a race condition. The overall advantage is that during times when the array is not being resized, multiple threads accessing different buckets can proceed concurrently.
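A sketch of this two-lock scheme follows. The structure layouts and the bucket_insert helper are illustrative assumptions; the point is the locking pattern: a reader lock on the descriptor plus the bucket's mutex for ordinary access, and a writer lock on the descriptor for resizing.

#include <pthread.h>
#include <stddef.h>

struct node { long key; void *value; struct node *next; };

struct bucket {
    pthread_mutex_t lock;          /* protects this bucket's contents */
    struct node    *head;
};

/* Array descriptor: the size and location of the bucket array,
   protected by a reader-writer lock. */
struct table {
    pthread_rwlock_t descr_lock;
    size_t           num_buckets;
    struct bucket   *buckets;
};

/* Illustrative helper that inserts a node into one bucket's chain. */
void bucket_insert(struct bucket *b, long key, void *value);

/* Normal insertion: a reader lock on the descriptor (even though a bucket is
   modified, the rwlock protects only the descriptor) plus the bucket's mutex. */
void table_insert(struct table *t, long key, void *value)
{
    pthread_rwlock_rdlock(&t->descr_lock);
    struct bucket *b = &t->buckets[(unsigned long)key % t->num_buckets];
    pthread_mutex_lock(&b->lock);
    bucket_insert(b, key, value);
    pthread_mutex_unlock(&b->lock);
    pthread_rwlock_unlock(&t->descr_lock);
}

/* Resizing: a writer lock on the descriptor excludes all other accesses,
   so the bucket array can be reallocated and rehashed safely. */
void table_resize(struct table *t, size_t new_num_buckets)
{
    pthread_rwlock_wrlock(&t->descr_lock);
    /* ... allocate a new bucket array, rehash the old entries,
       then update t->buckets and t->num_buckets ... */
    (void)new_num_buckets;
    pthread_rwlock_unlock(&t->descr_lock);
}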

The principal disadvantage is that a thread must obtain two locks instead of one. This increase in locking overhead can overwhelm the advantages of increased concurrency if the table is typically not subject to contention.

Figure 7.9 Hash Table with Fine-grained Locking

If writers are infrequent, reader-writer locks can greatly reduce contention. However, reader-writer locks have limitations. When the rate of incoming readers is very high, the lock implementation may suffer from memory contention problems. Thus reader-writer locks can be very useful for medium contention of readers, but may not be able to fix problems with high contention.

The reliable way to deal with high contention is to rework the parallel decomposition in a way that lowers the contention. For example, the schemes in Figure 7.8 and Figure 7.9 might be combined, so that a hash table is represented by a fixed number of sub-tables, each with fine-grained locking.

To read Part 1, go to “Threads, data races, deadlocks and live locks.”
Next in Part 3: Non-blocking algorithms, thread-safe functions and libraries

This article was excerpted from Multi-Core Programming by Shameem Akhter and Jason Roberts. Copyright © 2006 Intel Corporation. All rights reserved.

Shameem Akhter is a platform architect at Intel Corporation, focusing on single socket multi-core architecture and performance analysis. Jason Roberts is a senior software engineer at Intel, and has worked on a number of different multi-threaded software products that span a wide range of applications targeting desktop, handheld, and embedded DSP platforms.

To read more on Embedded.com about the topics discussed in this article, go to “More about multicores, multiprocessors and multithreading.”
