One way to solve the problems introduced by locks is to not use locks. Algorithms designed to do this are called non-blocking.

The defining characteristic of a non-blocking algorithm is that stopping a thread does not prevent the rest of the system from making progress. There are different non-blocking guarantees:
1) Obstruction freedom. A thread makes progress as long as there is no contention, but livelock is possible. Exponential backoff can be used to work around livelock.

2) Lock freedom. The system as a whole makes progress.

3) Wait freedom. Every thread makes progress, even when faced with contention. Very few non-blocking algorithms achieve this.
Non-blocking algorithms are immune to lock contention, priority inversion, and convoying. They have many advantages, but with these come a new set of problems that need to be understood.

Non-blocking algorithms are based on atomic operations. A few non-blocking algorithms are simple. Most are complex, because the algorithms must handle all possible interleavings of instruction streams from contending processors.

A trivial non-blocking algorithm is counting via an interlocked increment instead of a lock. The interlocked instruction avoids lock overhead and pathologies.

However, simply using atomic operations is not enough to avoid race conditions, because, as discussed before, composing thread-safe operations does not necessarily yield a thread-safe procedure. As an example, the C code in Figure 7.10 below shows the wrong way and the right way to decrement and test a reference count p->ref_count.

In the wrong code, if the count was originally 2, two threads executing the wrong code might both decrement the count, and then both see it as zero at the same time. The correct code performs the decrement and test as a single atomic operation.
|Figure 7.10. Atomic Operations and Race Conditions|
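Since the figure itself is not reproduced here, the following is a minimal sketch of the wrong and right versions, using C11 atomics rather than the Windows interlocked intrinsics; the `release` variant returns a flag (an addition for illustration) so the caller can see which thread freed the object.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct object {
    atomic_int ref_count;
    /* ...payload... */
};

/* WRONG: the decrement and the test are two separate steps.
   If the count is 2, two threads can both decrement and then
   both read 0, leading to a double free. */
void broken_release(struct object *p) {
    atomic_fetch_sub(&p->ref_count, 1);
    if (atomic_load(&p->ref_count) == 0)  /* separate read: racy */
        free(p);
}

/* RIGHT: atomic_fetch_sub returns the PREVIOUS value, so the
   decrement and the test form one atomic step. Exactly one
   thread observes the transition from 1 to 0 and frees. */
int release(struct object *p) {
    if (atomic_fetch_sub(&p->ref_count, 1) == 1) {
        free(p);
        return 1;  /* this call freed the object */
    }
    return 0;
}
```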
Most non-blocking algorithms involve a loop that attempts to perform an action using one or more compare-and-swap (CAS) operations, and retries when one of the CAS operations fails. A simple and useful example is implementing a thread-safe fetch-and-op.

A fetch-and-op reads a value from a location, computes a new value from it, and stores the new value. Figure 7.11 below illustrates both a locked version and a non-blocking version that operate on a location x.

The non-blocking version reads location x into a local temporary x_old, and computes a new value x_new = op(x_old). The routine InterlockedCompareExchange stores the new value, unless x is now different than x_old. If the store fails, the code starts over until it succeeds.
|Figure 7.11. Comparison of Locked and Lockless Code for Fetch-and-op|
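A sketch of the non-blocking version, using the C11 atomic compare-exchange in place of InterlockedCompareExchange, with "multiply by 2" as the example op from the text:

```c
#include <stdatomic.h>

/* Lockless fetch-and-op in the retry-loop style described above.
   Returns the old value, like a fetch-and-add would. */
int fetch_and_double(atomic_int *x) {
    int x_old = atomic_load(x);
    int x_new;
    do {
        x_new = x_old * 2;  /* op(x_old) */
        /* On failure, compare_exchange reloads x_old with the
           current value of *x, and the loop retries. */
    } while (!atomic_compare_exchange_weak(x, &x_old, x_new));
    return x_old;  /* the value that was replaced */
}
```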
Fetch-and-op is useful as long as the order in which various threads perform op does not matter. For example, op might be "multiply by 2." The location x must have a type for which a compare-and-exchange instruction is available.

In Figure 7.11 above, there is a time interval between when a thread executes "x_old = x" and when the thread executes InterlockedCompareExchange. During this interval, other processors might perform other fetch-and-op operations.

For example, suppose the initial value read is A. An intervening sequence of fetch-and-op operations by other processors might change x to B and then back to A.

When the original thread executes InterlockedCompareExchange, it will be as if the other processors' actions never happened. As long as the order in which op is executed does not matter, there is no problem. The net result is the same as if the fetch-and-op operations were reordered such that the intervening sequence happens before the first read.
|Figure 7.12. Lockless Implementation of a Linked Stack that May Suffer from ABA Problem|
But sometimes fetch-and-op has uses where changing x from A to B to A does make a difference. This is known as the ABA problem. Consider the lockless implementation of a stack shown in Figure 7.12 above. It is written in the fetch-and-op style, and thus has the advantage of not requiring any locks.

But the "op" is no longer a pure function, because it deals with another shared memory location: the field "next." Figure 7.13 below shows a sequence where the function BrokenLockLessPop corrupts the linked stack.

When thread 1 starts out, it sees B as next on the stack. Intervening pushes and pops then make C next on the stack, but thread 1's final InterlockedCompareExchange does not catch this switch because it only examines Top.
|Figure 7.13. Sequence Illustrating ABA Problem for Code in Figure 7.12|
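Since Figure 7.12 is not reproduced here, the following is a sketch of the broken lockless stack in the fetch-and-op style; the names Node, Top, and BrokenLockLessPop follow the text, while the C11 atomic calls stand in for the Windows intrinsics. Single-threaded it behaves correctly; the corruption needs the interleaving shown in Figure 7.13.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct Node {
    struct Node *next;
    int value;
} Node;

static _Atomic(Node *) Top;  /* top of stack, NULL when empty */

void Push(Node *n) {
    Node *t = atomic_load(&Top);
    do {
        n->next = t;  /* on CAS failure, t is refreshed and we relink */
    } while (!atomic_compare_exchange_weak(&Top, &t, n));
}

/* BROKEN: between reading t->next and the final compare-exchange,
   other threads may pop A, pop B, and push A back. The CAS then
   succeeds anyway, because it examines only Top, and the stale
   "next" value corrupts the stack (the ABA problem). */
Node *BrokenLockLessPop(void) {
    Node *t = atomic_load(&Top);
    while (t != NULL) {
        Node *next = t->next;  /* may go stale before the CAS */
        if (atomic_compare_exchange_weak(&Top, &t, next))
            return t;
        /* CAS failure refreshed t; retry */
    }
    return NULL;
}
```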
The solution to the ABA problem is to never reuse A. In a garbage-collected environment such as Java or .NET, this is simply a matter of not recycling nodes. That is, once a node has been popped, never push it again. Instead, allocate a fresh node. The garbage collector will do the hard work of checking that the memory for node A is not recycled until all extant references to it are gone.
In languages without garbage collection, the problem is harder. An old technique dating back to the IBM 370 changes ABA to ABA'. In other words, make A slightly different each time.
This is typically done by appending a serial number to the pointer. A special instruction that can do a double-wide compare-exchange is required. On IA-32, the instruction is cmpxchg8b, which does a compare-exchange on eight bytes.

On processors with Intel EM64T, it is cmpxchg16b. On Itanium processors, there is cmp8xchg16, which is not quite the same, because it compares only the first eight bytes, but exchanges all 16. However, as long as the serial number is the first eight bytes, it works for turning ABA into ABA'.
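A minimal sketch of the technique, under one simplifying assumption: the "pointer" is represented as a 32-bit node index so that the index-plus-serial pair fits in one 64-bit word, playing the role cmpxchg8b played for 32-bit pointers. All names here are illustrative, not from the book's figures.

```c
#include <stdatomic.h>
#include <stdint.h>

/* A reference is a (node index, serial number) pair packed into
   one atomically-exchangeable 64-bit word. */
typedef atomic_uint_least64_t VersionedRef;

static inline uint64_t pack(uint32_t index, uint32_t serial) {
    return ((uint64_t)serial << 32) | index;
}
static inline uint32_t ref_index(uint64_t v)  { return (uint32_t)v; }
static inline uint32_t ref_serial(uint64_t v) { return (uint32_t)(v >> 32); }

/* Succeeds only if both the index AND the serial still match what
   the caller saw earlier. On success the serial is bumped, so a
   later A -> B -> A sequence cannot fool a stale caller: it now
   sees A' (same index, different serial). */
int versioned_cas(VersionedRef *loc, uint64_t expected, uint32_t new_index) {
    uint64_t desired = pack(new_index, ref_serial(expected) + 1);
    return atomic_compare_exchange_strong(loc, &expected, desired);
}
```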
Another solution is to build a miniature garbage collector that handles pointers involved in compare-exchange operations. These pointers are called hazard pointers, because they present a hazard to lockless algorithms.

Maged Michael's paper on hazard pointers (Michael 2004) explains how to implement them. Hazard pointers are a nontrivial exercise and make assumptions about the environment, so tread with caution.
Cache Line Ping-ponging
Non-blocking algorithms can cause a lot of traffic on the memory bus as various hardware threads keep trying and retrying to perform operations on the same cache line. To service these operations, the cache line bounces back and forth ("ping-pongs") between the contending threads.

A locked algorithm may outperform the non-blocking equivalent if lock contention is sufficiently distributed and each lock says "hand off my cache line until I'm done."

Experimentation is necessary to find out whether the non-blocking or locked algorithm is better. A rough guide is that a fast spin lock protecting a critical section with no atomic operations may outperform an alternative non-blocking design that requires three or more highly contended atomic operations.
Memory Reclamation Problem
Memory reclamation is the dirty laundry of many non-blocking algorithms. For languages such as C/C++ that require the programmer to explicitly free memory, it turns out to be surprisingly difficult to call free on a node used in a non-blocking algorithm. Programmers planning to use non-blocking algorithms need to understand when this limitation arises and how to work around it.

The problem occurs for algorithms that remove nodes from linked structures, and do so by performing compare-exchange operations on fields in the nodes. For example, non-blocking algorithms for queues do this.
The reason is that when a thread removes a node from a data structure, without using a lock to exclude other threads, it never knows whether another thread is still looking at the node.
The algorithms are usually designed so that the other thread will perform a failing compare-exchange on a field in the removed node, and thus know to retry. Unfortunately, if in the meantime the node is handed to free, the field might be coincidentally set to the value that the compare-exchange expects to see.

The solution is to use a garbage collector or a mini-collector like hazard pointers. Alternatively, you may associate a free list of nodes with the data structure and not free any nodes until the data structure itself is freed.
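The free-list workaround can be sketched as follows: removed nodes are parked on a per-structure free list instead of being handed to free, so their memory stays valid for any thread still inspecting them. All names here are illustrative, and the list manipulation is shown single-threaded for clarity; a real lockless structure would update the free list with compare-exchange as well.

```c
#include <stdlib.h>
#include <stddef.h>

typedef struct QNode {
    struct QNode *next;
    int value;
} QNode;

typedef struct {
    QNode *free_list;   /* retired nodes available for reuse */
    /* ...the live structure's own fields... */
} Queue;

/* Instead of free(n): park the node for later reuse. Its memory
   remains valid, so a lagging thread's compare-exchange on one of
   its fields fails cleanly rather than reading freed memory. */
void retire_node(Queue *q, QNode *n) {
    n->next = q->free_list;
    q->free_list = n;
}

/* Reuse a retired node if one exists, else allocate a fresh one. */
QNode *get_node(Queue *q) {
    QNode *n = q->free_list;
    if (n != NULL) {
        q->free_list = n->next;
        return n;
    }
    return malloc(sizeof *n);
}

/* Memory is actually released only when the whole structure dies. */
void destroy_queue(Queue *q) {
    QNode *n = q->free_list;
    while (n != NULL) {
        QNode *next = n->next;
        free(n);
        n = next;
    }
    q->free_list = NULL;
}
```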
Non-blocking algorithms are currently a hot topic in research. Their big advantage is avoiding lock pathologies. Their primary disadvantage is that they are much more complicated than their locked counterparts.

Indeed, the discovery of a lockless algorithm is often worthy of a conference paper. Non-blocking algorithms are difficult to verify. At least one incorrect algorithm has made its way into a conference paper. Non-experts should consider the following advice:

1) Atomic increment, decrement, and fetch-and-add are generally safe to use in an intuitive fashion.

2) The fetch-and-op idiom is generally safe to use with operations that are commutative and associative.

3) The creation of non-blocking algorithms for linked data structures should be left to experts. Use algorithms from the peer-reviewed literature. Be sure to understand any memory reclamation issues.

Otherwise, for now, stick with locks. Avoid having more runnable software threads than hardware threads, and design programs to avoid lock contention. This way, the problems solved by non-blocking algorithms will not come up in the first place.
Thread-safe Functions and Libraries
The Foo example in Figure 7.2 in Part 1 underscores the importance of documenting thread safety. Defining a routine like Foo that updates unprotected hidden shared state is a poor programming practice.

In general, routines should be thread safe; that is, concurrently callable by clients. However, complete thread safety is usually unrealistic, because it would require that every call do some locking, and performance would be pathetic.

Instead, a common convention is to guarantee that instance routines are thread safe when called concurrently on different objects, but not thread safe when called concurrently on the same object.
This convention is implicit when objects do not share state. For objects that do share state, the burden falls on the implementer to protect the shared state. Figure 7.14 below shows a reference-counted implementation of strings where the issue arises. From the client's viewpoint, each string is a separate object, and thus threads should be able to operate concurrently on different strings.
|Figure 7.14. Implementer Should Ensure Thread Safety of Hidden Shared State|
In the underlying implementation, however, a string object is simply a pointer to a shared object that has the string data, and a reference count of the number of string objects that point to it.

The implementer should ensure that concurrent accesses do not corrupt the shared state. For example, the updates to the reference count should use atomic operations.
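Since Figure 7.14 is not reproduced here, the following C sketch shows the shape of the hidden shared state: each string is a pointer to a shared representation holding the character data and a reference count, and the count is updated atomically. The names are illustrative, not from the book.

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* The shared, hidden representation: two "independent" String
   objects may point at the same rep, so its count must be updated
   with atomic operations. */
typedef struct {
    atomic_int refs;
    char data[];        /* flexible array member holding the text */
} StringRep;

typedef struct { StringRep *rep; } String;

String string_create(const char *s) {
    StringRep *r = malloc(sizeof *r + strlen(s) + 1);
    atomic_init(&r->refs, 1);
    strcpy(r->data, s);
    return (String){ r };
}

/* Copying a String just shares the rep and bumps the count. */
String string_copy(String s) {
    atomic_fetch_add(&s.rep->refs, 1);
    return s;
}

/* The decrement and the zero test must be one atomic step
   (the same pitfall as in Figure 7.10). */
void string_destroy(String s) {
    if (atomic_fetch_sub(&s.rep->refs, 1) == 1)
        free(s.rep);
}
```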
When defining interfaces, care should be taken to ensure that they can be implemented efficiently in a thread-safe manner. Interfaces should not update hidden global state, because with multiple threads, it may not be clear whose global state is being updated.

The C library function strtok is one such offender. Clients use it to tokenize a string. The first call sets the state of a hidden parser, and each successive call advances the parser.

The hidden parser state makes the interface thread unsafe. Thread safety can be obtained by having the implementation put the parser in thread-local storage. But this introduces the complexity of a threading package into something that really should not need it in the first place.

A thread-safe redesign of strtok would make the parser object an explicit argument. Each thread would create its own local parser object and pass it as an argument. That way, concurrent calls could proceed blissfully without interference.
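This redesign already exists in POSIX as strtok_r (and similarly as strtok_s on Windows): the hidden parser state becomes an explicit char* argument, so each thread, and even each concurrent tokenization within a thread, keeps its own. A small usage sketch:

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>

/* Count the tokens in a string, assumed shorter than the buffer.
   The saveptr variable is the parser state made explicit: it lives
   on this call's stack, so concurrent calls cannot interfere. */
int count_tokens(const char *text, const char *delims) {
    char buf[256];
    char *saveptr;  /* per-call parser state, not hidden globals */
    int n = 0;

    strncpy(buf, text, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    for (char *tok = strtok_r(buf, delims, &saveptr);
         tok != NULL;
         tok = strtok_r(NULL, delims, &saveptr))
        n++;
    return n;
}
```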
Some libraries come in thread-safe and thread-unsafe versions. Be sure to use the thread-safe version for multi-threaded code. For example, on Windows, the compiler option /MD is required to dynamically link with the thread-safe version of the run-time library.

For debugging, the corresponding option is /MDd, which dynamically links with the "debug" version of the thread-safe run-time. Read your compiler documentation carefully about these kinds of options. Because the compilers date back to the single-core era, the defaults are often for code that is not thread safe.
Next in Part 4: "Memory and bandwidth issues and working with cache."

This article was excerpted from Multi-Core Programming by Shameem Akhter and Jason Roberts. Copyright © 2006 Intel Corporation. All rights reserved.

Shameem Akhter is a platform architect at Intel Corporation, focusing on single socket multi-core architecture and performance analysis. Jason Roberts is a senior software engineer at Intel, and has worked on a number of different multi-threaded software products that span a wide range of applications targeting desktop, handheld, and embedded DSP platforms.

To read more on Embedded.com about the topics discussed in this article, go to "More about multicores, multiprocessors and multithreading."