Today’s embedded devices are increasingly based on multi-core designs. Systems-on-a-chip (SoCs) often contain two or more processor cores in homogeneous or heterogeneous combinations, and FPGA-based designs can include a virtually unlimited number and variety of cores. An asymmetric multiprocessing (AMP)-based RTOS is one approach to utilizing multi-core processors; symmetric multiprocessing (SMP) is another.
But from a historical perspective, multiprocessor/multicore designs have been around almost as long as the computer itself. I believe that Burroughs introduced the first commercially available one back in 1962; many more have followed. Most of them had very specific purposes in mind.
Many of the early ones were used in data centers and for scientific applications. You can also make a case for them being in the early desktops of the late 80s with the introduction of the math co-processor and device controllers such as an Ethernet controller.
The question is: why are embedded device manufacturers becoming so enamored with multi-core devices now? One reason is that embedded devices have more and more tasks being heaped on them.
My first cell phone was a StarTAC. It had a rudimentary LCD display and frankly was not very useful for anything other than making a phone call. It had a 24-hour standby time and maybe an hour of talk time once the battery was broken in. That was not a real problem though, since my first plan was for 30 minutes a month.
But look at today’s phones and the demands made on them: built-in cameras, video capabilities, nice graphical interfaces for browsing the Web, calendars and all sorts of games. It is no wonder that designers are looking at ways to get more performance and a better form factor while extending battery life.
Why multicores? Adam Smith's answer
One answer to my question may lie in economic theory, specifically Adam Smith’s “Wealth of Nations.” It opens with a dramatic recommendation that rather than each worker performing all the tasks required for pin making (rolling the wire, cutting the wire into pieces, sharpening the wire and bobbing the end), each worker should specialize in performing a single task. One worker would roll the wire; another would sharpen the wire, another bob the ends and so on.
Smith suggested that a pin factory that had adopted such a “specialization of labor” might produce tens of thousands of pins a day, whereas a pin factory in which each worker attempted to produce pins from start to finish would produce very few pins.
Heterogeneous multicore designs such as the OMAP processor, which employ both a DSP and an ARM core, embody Adam Smith’s concepts of both division of labor and specialization of labor. This approach is a good one for manufacturers of cell phones and other electronic devices, which perform a multitude of roles: the DSP efficiently performs signal-processing activities and the ARM core performs the role of a general-purpose processor. (Panel to right)
Since power consumption is a function of both voltage and frequency (power consumption increases proportionally with frequency, and with the square of the voltage), the division of labor aspect of multiprocessing is of particular interest to manufacturers of portable devices.
Most portable electronic devices do not operate their processors at peak capacity all of the time. However, to handle peak demand, manufacturers have to provide a processor that can deliver the processing power to meet peak demands. While the processor can be throttled down between periods of peak performance, this may not be an option for devices that require operation in an intermediate performance range. Most processors must either run at full speed or sit in an idle state.
This is where adding a number of slower processors may be appealing to device manufacturers. If the work can be segmented so that multiple processors together address peak performance needs, those processors can then be throttled down or put to sleep during periods of inactivity. For periods where performance demands are low, one or two of the low-performance processors can provide satisfactory processing power.
The limits on your multicore design are essentially the same as the constraints on classical multiprocessing systems: Amdahl’s law, serial code execution, code execution granularity, load imbalance, synchronization and cache coherency.
Illustrated in Figure 1 above, Amdahl’s Law states, “The non-parallel fraction of the code (i.e., overhead) imposes the upper limit on the scalability of the code.” Thus, the maximum parallel speedup S(p) for a program that has a parallel fraction f is:
S(p) < 1/(1-f)
The non-parallel (serial) fraction of the program includes the communication and synchronization overhead.
So according to Amdahl, if you have a program or a suite of tasks that is, say, 50 percent serial code, then no matter how many processors you add, you will never get more than a 2x speedup. On the other hand, if your code contains no serial code at all (like four tasks that have no communication requirements, no shared resources, and so on), then you can achieve a 10x speedup by going from one processor to ten.
Serial code execution
What is serial execution? Execution is considered serial when one or more cores in a multi-core system are not executing code. This can occur for a variety of reasons. Sometimes it is because there is no reason for code to be executing on multiple cores; basically, the system is idle. From a performance perspective, this is nothing to worry about. However, there are a number of situations where it would be desirable for all of the processors to be executing code, but they can’t because of data or resource conflicts, load imbalance, synchronization or the need to maintain cache coherency.
Spinlocks. One prime example of how code becomes serialized is when there is the potential for processes running on different processors to attempt to gain access to a resource that is already in use by another process (Figure 2, below).
OS developers use what is called a spinlock to act as a gatekeeper for a resource. A spinlock (sometimes called a spinlock mutex) is a special data structure used in multiprocessing. It is similar to a mutex or a semaphore in that it protects access to resources that cannot be accessed by more than one process at a time. The primary difference between a spinlock and a mutex is that the spinlock depends on a hardware “test and set” mechanism that arbitrates between two or more processes attempting access at the same time.
When a process attempts to access a resource that is in use by another process, the “blocked” process(es) will “spin” until the resource becomes available; hence the name spinlock. In essence, the processes queue up until they can get access to the resource. There are different algorithms that can be used to assign priority to processes waiting on a spinlock. Regardless of the algorithm used, once a process gets access to a resource, it keeps it until it decides to give it up.
The granularity of spinlocks, and of the resources they are used to protect, has a dramatic impact on the degree to which application code is serialized. Devices such as hard drives and other IO devices come to mind when I think of shared resources but, depending on the type of OS, kernel objects can also be protected by spinlocks. The granularity of a spinlock is a function of the number and type of resources that it is used to protect.
A coarse-grained implementation would be a single spinlock used to protect all IO devices. In that case, once a process gains access to any IO device, even a process that needs access to an unused resource will spin in the spinlock until the lock is released. The other extreme would be to have a spinlock protecting each shared resource in the system.
Each approach has its advantages. A coarse-grained approach is simple to implement, but has the potential to serialize the application. The fine-grained approach makes more resources available in parallel, but is more complicated and has the potential to exhibit problems such as deadlock and other classical problems associated with resource protection schemes used in single-processor systems. This problem can be overcome to a large degree through the use of watchdog applications and other types of heuristics. However, it does add to the overall overhead of the system, so it comes at a cost.
One final note: spinlocks don’t just protect system data or resources; they can also be applied to application data. For example, if some device configuration data consists of a string of bytes, it will need to be locked during an update to ensure data integrity.
Granularity of Parallel Code Execution
There are two general types of coding approaches that you can take with multicore devices: fine-grained parallelism (in the case of the Connection Machine and others that have hundreds or thousands of processors, this is called massively parallel) and coarse-grained parallelism.
Fine-grained embedded architectures (Table 1, above) typically have between two and a couple of dozen processors; massively parallel machines have hundreds or thousands. An example of a massively parallel approach would be if I wanted to add 2048 numbers together and had 1024 processors to do it with. In the first iteration, all 1024 processors each add two numbers together. The second iteration uses 512 processors to add the resulting 1024 numbers together, and so on. It would take only 11 iterations to add all 2048 numbers together in this fashion.
Coarse-grained programming (Table 2, above, and Figure 3, below), on the other hand, approaches the problem at a level we are all more familiar with: segmenting our programs at the task level. Legacy code can be easily converted to run on multi-core architectures using the coarse-grained approach.
It is usually a good fit for engineering departments, as teams are usually tasked at the application level. A good understanding of how the overall system works is necessary so that the different tasks can be prioritized and their execution ordered in a way that optimizes performance. The downside of coarse-grained parallelism is that it may lead to load imbalance.
There are advantages and disadvantages to both approaches. The fine-grained approach brings massive amounts of processing power to bear on a problem. It also has the potential to utilize the processors more equally in a load-balancing system. That being said, aside from data-plane applications, there are very few control-plane applications in the embedded space that could easily be implemented in a way that would benefit from massive parallelism. Furthermore, fine-grained segmentation is almost impossible to do without the use of a compiler that is explicitly designed to exploit massive parallelism.
On the other hand, if one has a compiler that is designed to exploit fine-grained parallelism, or the code has been hand-written to exploit a large number of processors, then it is easy for all the processors to be utilized efficiently. Another disadvantage of exploiting fine-grained parallelism is that there is a large price to pay in terms of synchronization and communication overhead.
In the context of parallel execution, synchronization overhead is the waiting and communicating incurred by interdependent tasks that must wait for other tasks to deliver intermediate results before they can begin executing again. Unlike waiting for a spinlock to be released because of resource contention, synchronization overhead is caused when a task cannot continue execution until an intermediate result is provided by another task.
In the sum-of-2048-numbers example used above, delays incurred due to memory latency, interrupts or other factors cause the execution times to vary between the different processing elements. As a result of the different execution times, tasks that are dependent on these intermediate results essentially stall until the result is ready. As the parallelism of an application increases, the potential for synchronization overhead also increases. In general, there is a point for almost all problem domains where the addition of more processing elements will cause a slowdown in performance.
Fine-grained applications typically take advantage of doing the same activity many, many times in parallel; the example above used a loop-unrolling approach. Coarse-grained segmentation approaches parallelism at the task level, and because it is almost impossible to design and segment code in such a way that each task has the same execution time, the different execution times lead to what is called “load imbalance.”
As shown in Figure 4 below, if there are a discrete number of tasks that need to be performed and the execution times are unequal, some cores will be idle while others are still executing.
Cache Coherency in uniprocessor designs
Cache memory is used to speed the execution of a microprocessor. Cache memory is typically implemented as a relatively small amount of high-speed memory. Unlike main memory, cache memory can satisfy memory reads and writes at processor frequency.
Main memory, on the other hand, typically takes two or more cycles to access. When a main memory fetch occurs, the processor stalls. This is avoided by keeping anticipated or frequently used data and instructions in cache. The processor will only stall when the information it needs is not in cache.
One side effect of using this two-layer approach to memory access is that if cache is written to, then cache and main memory no longer contain the same information. When this occurs, cache and main memory are considered incoherent. There are different approaches that can address this problem. The two most common are “write through” and “write back.”
The simplest cache coherency mechanism is “write through.” With a write-through cache, each time the cache is written to, the write also goes through to main memory; thus coherency is maintained. The downside is that the processor stalls while the write to main memory completes.
A more advanced implementation for maintaining cache coherency tries to avoid stalling during a write by buffering up writes so that they occur less frequently. When the buffered writes are written to memory, they are written in “burst mode.” This technique is called “write back.”
The write-back implementation, for obvious reasons, is more complicated than write through. The write-back implementation requires that a list of “dirty” cache lines be maintained. The dirty entries indicate cache lines whose contents are not the same as in main memory.
When the cache loads a cache line from main memory, its contents are marked as clean. When the processor writes to a cache location, an entry is made in the cache list to indicate that the cache entry is dirty. Later, whenever the caching algorithm decides, the dirty cache entries are written to main memory.
Both write-through and write-back schemes can be implemented either in hardware or in software. All modern commercial microprocessors I am aware of use hardware to enforce cache coherency.
Multi-core Cache Coherency
Utilizing cache in a multi-core design takes the problem of cache coherency to the next level. There are a number of multicore architectures on the market now that utilize shared memory. These architectures also utilize cache memory to speed execution.
Now, in addition to maintaining coherency between memory and a single cache, the cache coherency scheme must address simultaneous reads and writes to an individual processor’s cache while maintaining coherency with the shared memory as well as with all the other cache memories used by the other processors.
The only efficient mechanism for doing this is cache coherency hardware. Regardless of the hardware assist method used to maintain cache coherency, using cache in multi-core systems tends to serialize process execution. The question is to what degree. In later articles, we will explore how different hardware architectures address this.
Todd Brian is product marketing manager at Accelerated Technology, a Mentor Graphics division.
Part Two of this series will cover various aspects of multiprocessing from an RTOS perspective, looking at two different approaches, symmetric (SMP) and asymmetric (AMP) multiprocessing, and their benefits and limitations in real-time, deterministic environments.
To learn more about this topic, go to More about multicores, multiprocessing and tools.