Common multicore programming problems: Part 1 - Threads, data races, deadlocks and livelocks

Shameem Akhter and Jason Roberts, Intel Corp.

October 26, 2007


Thrashing virtual memory. A similar overhead, at a different level, is thrashing virtual memory. Most systems use virtual memory, where the processors have an address space bigger than the actual available memory. Virtual memory resides on disk, and the frequently used portions are kept in real memory. Similar to caches, the least recently used data is evicted from memory when necessary to make room.

Each software thread requires virtual memory for its stack and private data structures. As with caches, time slicing causes threads to fight each other for real memory and thus hurts performance. In extreme cases, there can be so many threads that the program runs out of even virtual memory.

The cache and virtual memory issues described above arise from sharing limited resources among too many software threads. A very different, and often more severe, problem is convoying, in which software threads pile up waiting to acquire a lock.

Consider what happens when a thread's time slice expires while the thread is holding a lock. All threads waiting for the lock must now wait for the holding thread to wake up and release the lock. The problem is even worse if the lock implementation is fair, in which the lock is acquired in first-come first-served order. If a waiting thread is suspended, then all threads waiting behind it are blocked from acquiring the lock.

The solution that usually works best is to limit the number of "runnable" threads to the number of hardware threads, and possibly limit it to the number of outer-level caches.

For example, a dual-core Intel Pentium Processor Extreme Edition has two physical cores, each with Hyper-Threading Technology, and each with its own cache. This configuration supports four hardware threads and two outer-level caches.

Using all four runnable threads will work best unless the threads need so much cache that they fight over it, in which case two threads may be best. The only way to be sure is to experiment. Never "hard code" the number of threads; leave it as a tuning parameter.

Runnable threads, not blocked threads, cause time-slicing overhead. When a thread is blocked waiting for an external event, such as a mouse click or disk I/O request, the operating system takes it off the round-robin schedule. Hence a blocked thread does not cause time-slicing overhead. A program may have many more software threads than hardware threads, and still run efficiently if most of the OS threads are blocked.

A helpful organizing principle is to separate compute threads from I/O threads. Compute threads should be the threads that are runnable most of the time. Ideally, the compute threads never block on external events, and instead feed from task queues that provide work. The number of compute threads should match the processor resources. The I/O threads are threads that wait on external events most of the time, and thus do not contribute to having too many threads.
