Software Design Issues for Multi-core/Multiprocessor Systems
With the increased use of multiprocessor and multi-core systems in embedded applications, software design considerations now include methods to distribute the software functionality across these computing resources.
This paper will look at multi-core and multiprocessor architectures to see how they influence software design decisions and also see how different configurations of the cores and processors may affect the design. The principal focus will be on the core and processor interfaces to the memory system with particular emphasis on the on-chip cache.
Most of the discussion will consider programming models where processes or threads share memory resources. With threads, the sharing of data between threads is implicit, that is, global data is shared unless otherwise specified. Processes typically require an explicit operation such as a system call to the OS to define a shared memory region.
Figure 1 below shows the major components of a multi-core and multiprocessor system. The figure shows two processors that are separate chips each with two cores on them. Each core has its own on-chip Level 1 (L1) and Level 2 (L2) cache that connects to a system bus accessing the System Memory.
|Figure 1. Multiprocessor/Multi-core System|
From the software perspective, this system will look like four separate processors sometimes referred to as logical processors. Each has its own set of resources and the same access over the system bus to the shared System Memory resource as well as other resources such as the I/O subsystems that are not shown in the figure. The cores on a processor chip do not interact with each other except over the shared system bus.
The figure also illustrates different memory types each having access times that can vary significantly from one type to another. Data held in processor registers can be accessed most quickly. The data is effectively available immediately to the CPU. Data and instructions in the L1 cache have the next smallest access time with the access time increasing from L2 to System Memory.
This memory hierarchy is an important consideration because the access time compared between an L2 access and a System Memory access can vary by a large amount, on some systems, as much as an order of magnitude.
When designing software to run on a multi-core or multiprocessor system, a software designer’s main consideration is how to allocate the work that will be done on the available processors. The most common way to allocate this work is to use a threading model where the work is broken into separate execution units called threads that can run on the different processors at the same time to achieve parallel execution.
If the threads are completely independent of one another, their design does not have to consider how they will interact. For example, two programs running on a system as separate processes each on its own core do not have any awareness of the other process. The performance of the programs is not affected unless they contend for a shared resource such as System Memory or an I/O device.
Typical threaded applications do not have the same level of independence between threads. Most threading models use a shared memory model where all of the threads have access to the program’s global data and can also share data local to a function if the threads are used for parallel execution of portions of a function such as a loop.
The techniques for thread design and the protection of shared data areas such as the threading API, synchronization, and critical sections are not going to be discussed in this paper but it will look at how the distribution of data across the cache of the processors can affect performance.
If Thread 1 in processor 0 is operating on the same data as Thread 2 in processor 1, each processor will have its own copy of the data in its cache. The system must perform operations to ensure that the copies of the data in each cache accurately reflect the sequence of reads and writes performed on them by the threads. This is called maintaining cache coherency. These operations are performed by the system hardware.
Cache coherency is a state where each processor in a multiprocessor system sees the same value for a data item in its cache as the value that is in System Memory. This state is transparent to the software but maintaining that state can affect the software’s performance. In some cases, the data values in a processor cache or System Memory may not be updated until the next time the data is accessed.
So, at a given point in time, the value of a data item may be different between a processor’s cache and System Memory. The system maintains cache coherency by detecting when a data item is about to be accessed and performing the operations needed to update the data item with the correct value before allowing the access to complete.
If Thread 1 and Thread 2 are both reading and writing to the same data item as discussed above, the system has to perform extra operations to ensure that the threads are seeing the same data value as each read and write operation occurs.
When Thread 1 writes to the data item that is shared with Thread 2, the data item is updated in its processor’s cache and in System Memory but not immediately updated in the cache of Thread 2’s processor because Thread 2 may no longer require access to the data item. If Thread 2 then accesses the data item, the cache subsystem on its processor must first obtain the new data value from System Memory. So, the write by Thread 1 forces Thread 2 to wait for a read from System Memory the next time it accesses the data.
This sequence only takes place if the data is modified by one of the threads. If a series of writes is done by each thread, the performance of the system can seriously degrade because of all of the time spent waiting to update the data value from System Memory. This situation is referred to as “pingponging” and avoiding it is an important software design consideration when running on multiprocessor and multi-core systems.
Note that in Figure 1 above, this pingponging condition could result if the two threads were each running on a core on the same processor or each on a core of a separate processor because all of the cores interface the same way; to System Memory over the system bus.
The cache subsystem keeps track of the state of each of its cache lines. Using a technique called “snooping” it monitors all transactions that take place on the system bus and detects when a read or write operation takes place on an address that is in its cache. When the cache subsystem “snoops” a read on the system bus to an address in its cache it changes the state of the cache line to “shared”. If it snoops a write to that address, it will change the state of the cache line to “invalid”.
Because it is snooping the system bus, the cache subsystem will know if it has the only copy of a data item in its cache. If such a data item is updated by its own CPU, the cache subsystem will change the state of the cache line from “exclusive” to “modified”. If the cache subsystem snoops an access to that data item by another processor, it can stop that access from occurring, update the data item in System Memory, and then allow the other processor’s access to proceed. It will also change the state of the cache line with the data item to “shared”.
Two common programming models that help in assigning work to threads in a software design are functional decomposition and domain decomposition. Domain decomposition is also referred to as data decomposition.
Functional decomposition looks at the operations that the software has to perform and assigns each operation to a thread. Multiple operations may be grouped into one thread but the result of functional decomposition is that each operation is performed by a particular thread.
Domain decomposition looks at the data sets that the software has to process and breaks the data into smaller components that can be operated on independently. The threads in the software replicate the operations to be performed on the data but each thread will operate on a separate data component.
The types of operations and characteristics of the data that the software will process have the most influence on the choice of a programming model but it is helpful to understand how these models perform on multiprocessor and multi-core systems. Figure 2 below is an example of a program that uses functional decomposition and Figure 3, later shows a program using domain decomposition. Each example is shown on a single processor that has two cores.
|Figure 2. Functional Decomposition|
The functional decomposition figure shows an example where the same data is being operated on by the thread in each core. The data is modified first by Thread 1 and then accessed by Thread 2. The bold lines show the movement of the data from the cache in Core 0 to System Memory and then from System Memory to the cache in Core 1 so that it can be accessed by Thread 2.
The potential for pingponging exists if Thread 1 accesses the data after Thread 2 has written to it but careful software design can ensure that this will not take place. If Thread 1 can complete its processing of the data before Thread 2 accesses it, the time spent doing the update to System Memory could have negligible impact to overall system performance.
For example, if a sequence of data items is being processed, Thread 2 may be processing a previous data item, say Data 0, while Thread 1 is processing Data 1. The Data 1 processing and update to System Memory could occur in parallel with Thread 2’s processing of Data 0.
The domain decomposition In Figure 3, below, shows a simpler relationship between the threads. Thread 1 and Thread 2 perform the same functions but on different data items. The apparent simplicity of this figure may suggest that the domain decomposition model is preferred over the functional decomposition model but the software design challenges needed to support the domain decomposition model can be significant.
|Figure 3. Domain Decomposition|
In some multi-core architectures, the on-chip L2 cache may be shared. Figure 4 below shows a processor with two cores each with its own L1 cache but sharing the L2 cache.
When looking at the two programming models, this configuration could be seen as “the best of both worlds” because the functional decomposition model does not have to absorb the potential overhead of writing data back to System Memory in order to make it accessible to the cores and, at first glance, domain decomposition is unaffected by this configuration of cores and L2 cache.
However, the functional decomposition model will still have some overhead because the L1 caches are still separate. In the same way that the System Memory had to be updated between a write by one core and a read by the other, L2 cache now must be updated to maintain coherency between the two L1 caches.
Because the time to access L2 cache is so much faster than the time to access System Memory, the performance impact will be quite small when compared to the separate L2 cache configuration. Pingponging can still occur between the L1 caches so the software should still be designed to reduce or eliminate the potential for this.
Because the L2 cache is shared in this configuration, the performance of the domain decomposition model can be affected when the separate data items being processed by the threads happen to occupy the same set of cache lines in L2 cache. With separate L2 caches, the cache lines would not have this potential conflict. This kind of conflict could occur when the threads are duplicates of each other using the same data structures, data layout, and data access sequences.
|Figure 4. Multi-core Processor with Shared L2 Cache|
In many cases, a design will combine elements of functional and domain decomposition. Some parts of the software may lend themselves to representation in the functional decomposition model and not incur a performance penalty because of the nature of the data handling. Other parts of the software may need to be adapted to a domain decomposition model to improve system performance.
The selection of a programming model looks at cases where the designer knows that data is shared between threads. False sharing results when separate data items that are accessed by separate threads are allocated to the same cache line.
Since a data access causes an entire cache line to be read into cache from System Memory, if one data item in the cache line is shared, all of the data items in that cache line will be treated as shared by the cache subsystem.
Two data items could be updated in unrelated transactions by two threads running on different cores but, if the two items are in the same cache line, the cache subsystem will have to update the System Memory in order to maintain cache coherency setting up a condition where pingponging can occur.
An array of structures could be used to organize data that will be accessed by multiple threads. Each thread could access one structure from the array following the domain decomposition model. By following this model, data sharing between threads will not occur and the system avoids the performance impact of maintaining consistency between the caches of each thread’s core unless the structures used by the threads are in the same cache line.
If the cache line size is 64 bytes and the structure size of a structure in the array is 32 bytes, two structures will occupy one cache line. If the two threads accessing the two structures are running on different cores, an update to one structure in the cache line will force the entire cache line to be written to System Memory. It will also invalidate the cache line in the cache on the second core.
The next time the second structure is accessed, it will have to be read from System Memory. If a sequence of updates is done to the structures, the performance of the system can seriously degrade.
One technique to avoid false sharing is to align data items on cache line boundaries using compiler alignment directives. However, over-use of this technique can result in cache lines that are only partially used. True sharing refers to cases where the sharing of data between threads is intended by the software designer.
An example would be the structures associated with locking primitives such as semaphores. The locks may be required to synchronize access to a data area but the designer can minimize the potential impact to system performance by ensuring that a minimal number of threads is used to access the data area. In some cases, it may make sense to create copies of the data that can be operated on by several threads and then fed back to a single thread that updates the shared data area.
Figure 5 below shows a system configuration with two processors each with two cores that have a shared L2 cache. In this configuration, the behavior of the cache subsystems will vary between two cores on the same processor and two cores on different processors. If the OS assigns threads or processes using default rules, the allocation to cores could vary between runs. In one case, all of the threads could be allocated to the cores on the same processor and the performance impact associated with maintaining cache coherency could be small.
|Figure 5. Multi-processor System with Cores That Share L2 Cache|
In another case, the threads could be assigned to cores on different processors and the impact to performance could be substantial.
A software designer may want to take advantage of the relatively low overhead of maintaining cache coherency on a multi-core processor with shared L2 cache and allow some potential for pingponging if it provides other benefits to the design.
But now that the design is dependent on a certain allocation of threads to cores, how does the software implementation ensure the required allocation? In an asymmetric multi-processing model, the application first starts on one processor and then allocates tasks or threads to the other processors in a system.
In this case, the application has control over the allocation of these execution units to cores. The cores will be numbered according to an architecture-specific convention. The application will also have to manage the use of the processor resource to correctly balance the load but this is usually considered a normal part of embedded software design.
A symmetric multiprocessing OS, such as Linux, allocates processes to cores based on a set of scheduling rules. Since these rules are designed to balance the processing load over all the available cores, the allocation will probably not be the best for the performance of a particular application. Also, the allocation of processes to cores will probably be different each time the application is executed because of the variable level of asynchronous processing that can occur when the processes are scheduled to run.
Processor affinity is a thread or process attribute that tells the
operating system which cores, or logical processors, a process can run
on. The following system calls in Linux can tell the application what
the processor affinity is for a particular process and can set the
processor affinity mask for a process:
If the processor id, pid, argument is 0, the call will return a mask in cpuset that shows which processors the process can run on for sched_getaffinity or set up a new mask for a sched_setaffinity call. The cpuset mask is a bit mask where each bit corresponds to a logical processor.
For example, if a process should only run on processor 2, the bit mask in hexadecimal used in cpuset would be 0x04 where the logical processors are numbered starting at 0. The numbering of the logical processors will be architecture-specific. As an example, the cores in Figure 5 could be numbered as: 0 = Core 0 on Processor 0; 1 = Core 1 on Processor 0; 2 = Core 0 on Processor 1; and 3 = Core 1 on Processor 1; If a process can run on either core on Processor 1, the bit mask for cpuset would be set to 0x0C.
This paper discussed some of the software design issues regarding the overhead that a multiprocessor or multi-core system incurs when maintaining cache consistency. This overhead can be considerable when data must be synchronized between separate caches.
The functional decomposition and domain decomposition programming models are two approaches to allocating the work done by the software to the threads or processes and organizing the data to take advantage of the parallel processing provided by the system. The functional decomposition model is typically a more natural way of representing the tasks performed by the software but has a potential to increase the synchronization activity of the processors’ cache subsystems.
The domain decomposition model is generally less likely to incur a high level of cache synchronization overhead but is typically not as natural a way to organize the software.
False sharing may cause significant performance degradation and is dependent on the way that data is stored in cache. Care must be taken in aligning the data to avoid this.
Processor affinity can be used to assign threads or processes to specific cores or logical processors to take advantage of particular system configurations and ensure consistent system performance.
There are many things to consider when designing software for multiprocessor and multi-core systems. The software designer must consider the nature of the processing to be done by the software and the system configuration. Some generalizations or design patterns may be useful in guiding the software design for these systems.
However, the designer needs to focus on the unique aspects of the software to take the best advantage of the available computing resources while avoiding potentially large and difficult to find performance degradation.
This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Silicon Valley 2006. Used with permission of the Embedded Systems Conference.Please visit www.embedded.com/esc/sv.
Steve Daily is a technical
consulting engineer in the Software Products Division at Intel Corp.
for Intel Embedded C/C++ compiler products. Steve Daily is a technical
consulting engineer in the Software Products Division at Intel Corp.
for Intel Embedded C/C++ compiler products. Steve has worked in various
embedded software capacities over the years, most recently, as a
firmware developer for system management controllers for
telecommunications server blades.
To learn about this general subject on Embedded.com go to More about multicores, multiprocessing and tools.
1. "IA-32 Intel Architecture Optimization Reference Manual", Intel Corporation, 2005
2. “Developing Multithreaded Applications: A Platform Consistent Approach”, Intel Corporation, March 2003
3. Don Anderson and Tom Shanley, “Pentium Processor System Architecture, 2nd Edition”, Addison Wesley, 1995