With the increased use of multiprocessor and multi-core systems inembedded applications, software design considerations now includemethods to distribute the software functionality across these computingresources.
This paper will look at multi-core and multiprocessor architecturesto see how they influence software design decisions and also see howdifferent configurations of the cores and processors may affect thedesign. The principal focus will be on the core and processorinterfaces to the memory system with particular emphasis on the on-chipcache.
Most of the discussion will consider programming models whereprocesses or threads share memory resources. With threads, the sharingof data between threads is implicit, that is, global data is sharedunless otherwise specified. Processes typically require an explicitoperation such as a system call to the OS to define a shared memoryregion.
Figure 1 below shows themajor components of a multi-core and multiprocessor system. The figureshows two processors that are separate chips each with two cores onthem. Each core has its own on-chip Level 1 (L1) and Level 2 (L2) cachethat connects to a system bus accessing the System Memory.
|Figure1. Multiprocessor/Multi-core System|
From the software perspective, this system will look like fourseparate processors sometimes referred to as logical processors. Eachhas its own set of resources and the same access over the system bus tothe shared System Memory resource as well as other resources such asthe I/O subsystems that are not shown in the figure. The cores on aprocessor chip do not interact with each other except over the sharedsystem bus.
The figure also illustrates different memory types each havingaccess times that can vary significantly from one type to another. Dataheld in processor registers can be accessed most quickly. The data iseffectively available immediately to the CPU. Data and instructions inthe L1 cache have the next smallest access time with the access timeincreasing from L2 to System Memory.
This memory hierarchy is an important consideration because theaccess time compared between an L2 access and a System Memory accesscan vary by a large amount, on some systems, as much as an order ofmagnitude.
When designing software to run on a multi-core or multiprocessorsystem, a software designer’s main consideration is how to allocate thework that will be done on the available processors. The most common wayto allocate this work is to use a threading model where the work isbroken into separate execution units called threads that can run on thedifferent processors at the same time to achieve parallel execution.
If the threads are completely independent of one another, theirdesign does not have to consider how they will interact. For example,two programs running on a system as separate processes each on its owncore do not have any awareness of the other process. The performance ofthe programs is not affected unless they contend for a shared resourcesuch as System Memory or an I/O device.
Typical threaded applications do not have the same level ofindependence between threads. Most threading models use a shared memorymodel where all of the threads have access to the program’s global dataand can also share data local to a function if the threads are used forparallel execution of portions of a function such as a loop.
The techniques for thread design and the protection of shared dataareas such as the threading API, synchronization, and critical sectionsare not going to be discussed in this paper but it will look at how thedistribution of data across the cache of the processors can affectperformance.
If Thread 1 in processor 0 is operating on the same data as Thread 2 inprocessor 1, each processor will have its own copy of the data in itscache. The system must perform operations to ensure that the copies ofthe data in each cache accurately reflect the sequence of reads andwrites performed on them by the threads. This is called maintainingcache coherency. These operations are performed by the system hardware.
Cache coherency is a state where each processor in a multiprocessorsystem sees the same value for a data item in its cache as the valuethat is in System Memory. This state is transparent to the software butmaintaining that state can affect the software’s performance. In somecases, the data values in a processor cache or System Memory may not beupdated until the next time the data is accessed.
So, at a given point in time, the value of a data item may bedifferent between a processor’s cache and System Memory. The systemmaintains cache coherency by detecting when a data item is about to beaccessed and performing the operations needed to update the data itemwith the correct value before allowing the access to complete.
If Thread 1 and Thread 2 are both reading and writing to the samedata item as discussed above, the system has to perform extraoperations to ensure that the threads are seeing the same data value aseach read and write operation occurs.
When Thread 1 writes to the data item that is shared with Thread 2,the data item is updated in its processor’s cache and in System Memorybut not immediately updated in the cache of Thread 2’s processorbecause Thread 2 may no longer require access to the data item. IfThread 2 then accesses the data item, the cache subsystem on itsprocessor must first obtain the new data value from System Memory. So,the write by Thread 1 forces Thread 2 to wait for a read from SystemMemory the next time it accesses the data.
This sequence only takes place if the data is modified by one of thethreads. If a series of writes is done by each thread, the performanceof the system can seriously degrade because of all of the time spentwaiting to update the data value from System Memory. This situation isreferred to as “pingponging” and avoiding it is an important softwaredesign consideration when running on multiprocessor and multi-coresystems.
Note that in Figure 1 above ,this pingponging condition could result if the two threads were eachrunning on a core on the same processor or each on a core of a separateprocessor because all of the cores interface the same way; to SystemMemory over the system bus.
The cache subsystem keeps track of the state of each of its cachelines. Using a technique called “snooping” it monitors all transactionsthat take place on the system bus and detects when a read or writeoperation takes place on an address that is in its cache. When thecache subsystem “snoops” a read on the system bus to an address in itscache it changes the state of the cache line to “shared”. If it snoopsa write to that address, it will change the state of the cache line to“invalid”.
Because it is snooping the system bus, the cache subsystem will knowif it has the only copy of a data item in its cache. If such a dataitem is updated by its own CPU, the cache subsystem will change thestate of the cache line from “exclusive” to “modified”. If the cachesubsystem snoops an access to that data item by another processor, itcan stop that access from occurring, update the data item in SystemMemory, and then allow the other processor’s access to proceed. It willalso change the state of the cache line with the data item to “shared”.
Two common programming models that help in assigning work to threads ina software design are functional decomposition and domaindecomposition. Domain decomposition is also referred to as datadecomposition.
Functional decomposition looks at the operations that the softwarehas to perform and assigns each operation to a thread. Multipleoperations may be grouped into one thread but the result of functionaldecomposition is that each operation is performed by a particularthread.
Domain decomposition looks at the data sets that the software has toprocess and breaks the data into smaller components that can beoperated on independently. The threads in the software replicate theoperations to be performed on the data but each thread will operate ona separate data component.
The types of operations and characteristics of the data that thesoftware will process have the most influence on the choice of aprogramming model but it is helpful to understand how these modelsperform on multiprocessor and multi-core systems. Figure 2 below is an example of aprogram that uses functional decomposition and Figure 3, later shows aprogram using domain decomposition. Each example is shown on a singleprocessor that has two cores.
|Figure2. Functional Decomposition|
The functional decomposition figure shows an example where the samedata is being operated on by the thread in each core. The data ismodified first by Thread 1 and then accessed by Thread 2. The boldlines show the movement of the data from the cache in Core 0 to SystemMemory and then from System Memory to the cache in Core 1 so that itcan be accessed by Thread 2.
The potential for pingponging exists if Thread 1 accesses the dataafter Thread 2 has written to it but careful software design can ensurethat this will not take place. If Thread 1 can complete its processingof the data before Thread 2 accesses it, the time spent doing theupdate to System Memory could have negligible impact to overall systemperformance.
For example, if a sequence of data items is being processed, Thread2 may be processing a previous data item, say Data 0, while Thread 1 isprocessing Data 1. The Data 1 processing and update to System Memorycould occur in parallel with Thread 2’s processing of Data 0.
The domain decomposition In Figure3, below , shows a simpler relationship between the threads.Thread 1 and Thread 2 perform the same functions but on different dataitems. The apparent simplicity of this figure may suggest that thedomain decomposition model is preferred over the functionaldecomposition model but the software design challenges needed tosupport the domain decomposition model can be significant.
|Figure3. Domain Decomposition|
In some multi-core architectures, the on-chip L2 cache may beshared. Figure 4 below shows aprocessor with two cores each with its own L1 cache but sharing the L2cache.
When looking at the two programming models, this configuration couldbe seen as “the best of both worlds” because the functionaldecomposition model does not have to absorb the potential overhead ofwriting data back to System Memory in order to make it accessible tothe cores and, at first glance, domain decomposition is unaffected bythis configuration of cores and L2 cache.
However, the functional decomposition model will still have someoverhead because the L1 caches are still separate. In the same way thatthe System Memory had to be updated between a write by one core and aread by the other, L2 cache now must be updated to maintain coherencybetween the two L1 caches.
Because the time to access L2 cache is so much faster than the timeto access System Memory, the performance impact will be quite smallwhen compared to the separate L2 cache configuration. Pingponging canstill occur between the L1 caches so the software should still bedesigned to reduce or eliminate the potential for this.
Because the L2 cache is shared in this configuration, theperformance of the domain decomposition model can be affected when theseparate data items being processed by the threads happen to occupy thesame set of cache lines in L2 cache. With separate L2 caches, the cachelines would not have this potential conflict. This kind of conflictcould occur when the threads are duplicates of each other using thesame data structures, data layout, and data access sequences.
|Figure4. Multi-core Processor with Shared L2 Cache|
In many cases, a design will combine elements of functional anddomain decomposition. Some parts of the software may lend themselves torepresentation in the functional decomposition model and not incur aperformance penalty because of the nature of the data handling. Otherparts of the software may need to be adapted to a domain decompositionmodel to improve system performance.
The selection of a programming model looks at cases where the designerknows that data is shared between threads. False sharing results whenseparate data items that are accessed by separate threads are allocatedto the same cache line.
Since a data access causes an entire cache line to be read intocache from System Memory, if one data item in the cache line is shared,all of the data items in that cache line will be treated as shared bythe cache subsystem.
Two data items could be updated in unrelated transactions by twothreads running on different cores but, if the two items are in thesame cache line, the cache subsystem will have to update the SystemMemory in order to maintain cache coherency setting up a conditionwhere pingponging can occur.
An array of structures could be used to organize data that will beaccessed by multiple threads. Each thread could access one structurefrom the array following the domain decomposition model. By followingthis model, data sharing between threads will not occur and the systemavoids the performance impact of maintaining consistency between thecaches of each thread’s core unless the structures used by the threadsare in the same cache line.
If the cache line size is 64 bytes and the structure size of astructure in the array is 32 bytes, two structures will occupy onecache line. If the two threads accessing the two structures are runningon different cores, an update to one structure in the cache line willforce the entire cache line to be written to System Memory. It willalso invalidate the cache line in the cache on the second core.
The next time the second structure is accessed, it will have to beread from System Memory. If a sequence of updates is done to thestructures, the performance of the system can seriously degrade.
One technique to avoid false sharing is to align data items on cacheline boundaries using compiler alignment directives. However, over-useof this technique can result in cache lines that are only partiallyused. True sharing refers to cases where the sharing of data betweenthreads is intended by the software designer.
An example would be the structures associated with lockingprimitives such as semaphores. The locks may be required to synchronizeaccess to a data area but the designer can minimize the potentialimpact to system performance by ensuring that a minimal number ofthreads is used to access the data area. In some cases, it may makesense to create copies of the data that can be operated on by severalthreads and then fed back to a single thread that updates the shareddata area.
Figure 5 below shows a systemconfiguration with two processors each with two cores that have ashared L2 cache. In this configuration, the behavior of the cachesubsystems will vary between two cores on the same processor and twocores on different processors. If the OS assigns threads or processesusing default rules, the allocation to cores could vary between runs.In one case, all of the threads could be allocated to the cores on thesame processor and the performance impact associated with maintainingcache coherency could be small.
|Figure5. Multi-processor System with Cores That Share L2 Cache|
In another case, the threads could be assigned to cores on differentprocessors and the impact to performance could be substantial.
A software designer may want to take advantage of the relatively lowoverhead of maintaining cache coherency on a multi-core processor withshared L2 cache and allow some potential for pingponging if it providesother benefits to the design.
But now that the design is dependent on a certain allocation ofthreads to cores, how does the software implementation ensure therequired allocation? In an asymmetric multi-processing model, theapplication first starts on one processor and then allocates tasks orthreads to the other processors in a system.
In this case, the application has control over the allocation ofthese execution units to cores. The cores will be numbered according toan architecture-specific convention. The application will also have tomanage the use of the processor resource to correctly balance the loadbut this is usually considered a normal part of embedded softwaredesign.
A symmetric multiprocessing OS, such as Linux, allocates processesto cores based on a set of scheduling rules. Since these rules aredesigned to balance the processing load over all the available cores,the allocation will probably not be the best for the performance of aparticular application. Also, the allocation of processes to cores willprobably be different each time the application is executed because ofthe variable level of asynchronous processing that can occur when theprocesses are scheduled to run.
Processor affinity is a thread or process attribute that tells theoperating system which cores, or logical processors, a process can runon. The following system calls in Linux can tell the application whatthe processor affinity is for a particular process and can set theprocessor affinity mask for a process:
If the processor id, pid, argument is 0, the call will return a maskin cpuset that shows which processors the process can run on forsched_getaffinity or set up a new mask for a sched_setaffinity call.The cpuset mask is a bit mask where each bit corresponds to a logicalprocessor.
For example, if a process should only run on processor 2, the bitmask in hexadecimal used in cpuset would be 0x04 where the logicalprocessors are numbered starting at 0. The numbering of the logicalprocessors will be architecture-specific. As an example, the cores inFigure 5 could be numbered as: 0 = Core 0 on Processor 0; 1 = Core 1 onProcessor 0; 2 = Core 0 on Processor 1; and 3 = Core 1 on Processor 1;If a process can run on either core on Processor 1, the bit mask forcpuset would be set to 0x0C.
This paper discussed some of the software design issues regarding theoverhead that a multiprocessor or multi-core system incurs whenmaintaining cache consistency. This overhead can be considerable whendata must be synchronized between separate caches.
The functional decomposition and domain decomposition programmingmodels are two approaches to allocating the work done by the softwareto the threads or processes and organizing the data to take advantageof the parallel processing provided by the system. The functionaldecomposition model is typically a more natural way of representing thetasks performed by the software but has a potential to increase thesynchronization activity of the processors’ cache subsystems.
The domain decomposition model is generally less likely to incur ahigh level of cache synchronization overhead but is typically not asnatural a way to organize the software.
False sharing may cause significant performance degradation and isdependent on the way that data is stored in cache. Care must be takenin aligning the data to avoid this.
Processor affinity can be used to assign threads or processes tospecific cores or logical processors to take advantage of particularsystem configurations and ensure consistent system performance.
There are many things to consider when designing software formultiprocessor and multi-core systems. The software designer mustconsider the nature of the processing to be done by the software andthe system configuration. Some generalizations or design patterns maybe useful in guiding the software design for these systems.
However, the designer needs to focus on the unique aspects of thesoftware to take the best advantage of the available computingresources while avoiding potentially large and difficult to findperformance degradation.
Thisarticle is excerpted from a paper of the same name presented at theEmbedded Systems Conference Silicon Valley 2006. Used with permissionof the Embedded Systems Conference.Please visit www.embedded.com/esc/sv.
Steve Daily is a technicalconsulting engineer in the Software Products Division at Intel Corp. for Intel Embedded C/C++ compiler products. Steve Daily is a technicalconsulting engineer in the Software Products Division at Intel Corp.for Intel Embedded C/C++ compiler products. Steve has worked in variousembedded software capacities over the years, most recently, as afirmware developer for system management controllers fortelecommunications server blades.
To learn about this general subject on Embedded.com go to Moreabout multicores, multiprocessing and tools.
1. “IA-32 Intel Architecture Optimization Reference Manual”, IntelCorporation, 2005
2. “Developing Multithreaded Applications: A Platform ConsistentApproach”, Intel Corporation, March 2003
3. Don Anderson and Tom Shanley, “Pentium Processor SystemArchitecture, 2nd Edition”, Addison Wesley, 1995