Putting Multicore Processing in Context: Part 2 - Embedded.com

Is it a given that utilizing multicores will result in a speedup of an application? Amdahl’s Law is not the only thing that plays a role in the speedup of an application.

In general, if speedup is the sole objective when adding multiprocessors, the following must hold true: (1) the processor is overloaded and is not processing the work available in a satisfactory time frame; (2) the workload contains elements that can be divided and worked on in parallel; and (3) a suitably faster processor cannot provide the processing power needed to handle the workload in a satisfactory time.

Part 1 in this series examined the “classic” reasons why one does not get a proportional increase in performance by adding additional processors to a computing machine. Most, if not all, of them were based in some form or fashion on Amdahl’s Law.

Basically, Amdahl’s Law states that the upper limit on the speedup gained by adding additional processors is determined by the amount of serial code that is contained in the application. Code can be serial because it is explicitly written that way, or it can become serialized because it shares resources, including data: only one processor or core can access shared data at a time.
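To make the limit concrete, here is a minimal Python sketch of the usual statement of Amdahl’s Law: speedup = 1 / (s + (1 - s) / n), where s is the serial fraction and n the number of cores. The function name and the sample fractions are illustrative assumptions, not figures from this series.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    # The serial part always runs on one core; only the parallel part is divided.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload: 10 percent of the code is serial.
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} cores -> at most {amdahl_speedup(0.10, n):.2f}x")
```

With a 10 percent serial fraction the bound can never reach 10x, no matter how many cores are added, which is the "not proportional" behavior described above.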

The next step in exploring multicore processing, and whether or not it will benefit your application, is the hardware. Most embedded designs use shared memory (all cores are able to access some or all of the memory on the chip), and the cores have the capability to communicate with each other in some fashion. For most applications, the addition of more cores does not lead to a proportional increase in performance.

It is how the cores are combined, how they utilize memory and other resources, and their communication topologies that differentiate the architectures into the various classes of multicore architectures. From a hardware perspective, there are several different standard approaches used in designing multicore hardware.

I use the term “standard approaches” because there have been many different ways that cores, memory, resource access and communication mechanisms have been combined. Luckily, for the embedded space, most of that experimentation was addressed long before embedded designs incorporated multiple cores.

This article looks at two of the most commonly used multicore designs for the embedded space: Symmetric Multi-Processing (SMP) and Asymmetric Multi-Processing (AMP).

SMP Hardware Architectures
SMPs are characterized by the symmetrical nature of their organization and layout. They utilize two or more identical (homogeneous) cores that have access to a common shared memory. Another attribute of SMP architectures is that not only is each core identical, but each core has identical access to all the resources of the system: memory, disk, UARTs, Ethernet controllers, etc. Analog Devices’ Blackfin 561 and ARM’s MPCore are just two examples of SMPs.

SMPs are a cost-effective way of increasing performance in a system, rather than replicating the entire system: core, memory, IO and other resources. Multicore designers realize that the most heavily used component of a processor is the core. All other resources of a processor are idle, relative to core activity, since multiple resources cannot be utilized concurrently. As a result, the odds are that different cores can operate in parallel by utilizing different resources and data streams.

Figure 1: Symmetric Multiprocessing Hardware

Asymmetric Hardware Architectures
AMPs are characterized by the non-symmetrical nature of their architecture. In other words, AMP architectures are not bound by the SMP rule that all cores be identical, and they do not require that each processor share resources equally. Similar to SMP-based architectures, AMPs seek additional performance gains by adding multiple cores.

Unlike SMP architectures, AMP architectures seek additional performance gains by utilizing different cores or hardware configurations that are optimized to do very specific activities. Texas Instruments’ OMAP and Freescale’s i.MX family of application processors are two examples of a heterogeneous AMP architecture. The OMAP and i.MX families combine a general purpose MCU with a digital signal processing (DSP) core.

DSPs are capable of doing highly specialized mathematical operations very efficiently when compared to a general purpose MCU. What may take a general purpose MCU hundreds of cycles to do, a DSP can accomplish in only a few cycles. By combining the cores in this fashion, designers not only provide for the division of labor among cores, but also, like Adam Smith’s pin manufacturing example given in the first article, practice the concept of labor specialization. Due to the specialized nature of an AMP architecture, its area of application is narrower than that of a more general purpose SMP architecture.

Because of the complex nature of the devices for which multicores are an appealing solution, many multicore devices use a real-time operating system (RTOS) to provide operating system services such as scheduling, communication and task management. RTOSs suitable for multi-core architectures are as varied and specialized as the architectures they are geared toward. In general, they can be segmented into two camps: those suitable for SMP architectures, and those suitable for AMP-based architectures.

Figure 2: Asymmetric Multiprocessing Hardware

Real-time operating systems are regarded as AMP or SMP because they exploit the different hardware attributes of AMP and SMP architectures. Since all the cores or processors that make up the SMP architecture are exactly the same, SMP RTOSs are characterized by the use of a single instance of an RTOS image that runs on all the different cores at the same time. SMP RTOSs perform what is called “load balancing.” Load balancing involves parceling out ready-to-run tasks to available (idle) processors.
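As a rough illustration of load balancing, the Python sketch below hands each ready-to-run task to whichever core becomes idle first. It is a toy model under stated assumptions: task run times are known in advance, there are no priorities or preemption, and the function name `balance` is hypothetical rather than any RTOS API.

```python
import heapq


def balance(task_durations, cores):
    """Assign tasks, in ready order, to the core that goes idle earliest.

    Returns (assignment, makespan): which task ids ran on each core, and
    the time at which the last core finishes.
    """
    # Heap of (time at which the core becomes idle, core id).
    idle_at = [(0, core) for core in range(cores)]
    heapq.heapify(idle_at)
    assignment = {core: [] for core in range(cores)}

    for task_id, duration in enumerate(task_durations):
        free_time, core = heapq.heappop(idle_at)   # next idle core
        assignment[core].append(task_id)
        heapq.heappush(idle_at, (free_time + duration, core))

    makespan = max(finish for finish, _ in idle_at)
    return assignment, makespan


if __name__ == "__main__":
    print(balance([4, 2, 2, 2], cores=2))
```

A real SMP scheduler makes this decision at run time, from a shared ready queue that itself has to be protected, which is where the locking discussed next comes in.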

Spin locks and SMP: a review
Since SMP architectures are based on equal access to resources, resource protection is a fundamental requirement in SMP systems. As will be demonstrated in the next few examples, “Spin Lock Granularity” is of supreme importance in its effect on the efficient operation of an SMP system and is worthy of further examination. The primary mechanism for maintaining coherency in an operating system as well as its applications and data is the spin lock.

Spin locks are a logical abstraction that acts as a gatekeeper to resources such as shared data and tables, controllers and kernel services such as the scheduler. They are typically implemented as “atomic” test and set locks. In other words, the first task to gain access to the spin lock gets it and keeps the spin lock until the task is finished with it. Special hardware arbitrates between two or more tasks that attempt to gain access to a resource simultaneously. Tasks that do not get access will “spin” until the resource becomes available.
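The test-and-set behavior can be sketched in a few lines of Python. This is only an illustration: a non-blocking `threading.Lock.acquire` stands in for the atomic hardware instruction, and the `SpinLock` class and `run_demo` function are hypothetical names. A real RTOS spin lock would be written against the processor’s atomic primitives.

```python
import threading
import time


class SpinLock:
    """Busy-waiting lock built on an atomic test-and-set stand-in."""

    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        # acquire(blocking=False) tests and sets the flag in one atomic step:
        # it returns True only for the single caller that found the flag clear.
        while not self._flag.acquire(blocking=False):
            time.sleep(0)  # yield so the holder can run; hardware would simply spin

    def release(self):
        self._flag.release()


def run_demo(workers=4, increments=500):
    """Several threads bump one shared counter under the spin lock."""
    lock = SpinLock()
    counter = {"value": 0}

    def worker():
        for _ in range(increments):
            lock.acquire()
            try:
                counter["value"] += 1   # the protected shared data
            finally:
                lock.release()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


if __name__ == "__main__":
    print(run_demo())
```

While one worker holds the lock, every other worker makes no progress on the shared counter; that is the serialization the following paragraphs try to minimize.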

System and application designers have to balance two conflicting goals when determining the granularity of spin locks. From one perspective, the more spin locks a designer uses to protect resources, the more these resources may be utilized in parallel. The downside is that as the locking scheme becomes finer grained, the overhead associated with maintaining the locks grows. Furthermore, as the locking scheme gets finer, the opportunity increases for a deadlock to occur. The addition of anti-deadlock algorithms also adds to the overhead.

Since lock contention tends to serialize execution, it is intuitively obvious that the shorter the amount of time a lock is held, the less the potential for serializing to occur. The following techniques are used in both RTOSs and multicore programming to reduce the time that a lock is held.

Table 1

Symmetric OSes: master/slave and the alternatives
Primitive SMP RTOS implementations usually have a single processor that is designated as the master, and activities like interrupt handling, resource arbitration and scheduling are performed on it. The master processor is statically defined or determined at boot. In more advanced SMP implementations, load balancing is used for interrupt and scheduling tasks, and responsibility floats among the various processors. Load balancing RTOS-related tasks provides an ideal platform for “high availability” or “hot swap” capabilities.

SMP RTOSs that use the master/slave approach are the easiest to implement but provide the least performance increase of all the different SMP implementations. One reason for this is that the master acts as a bottleneck to providing kernel services. As with any other resource in an SMP RTOS, the kernel services must be protected from simultaneous access by different process demands.

One solution involves running the kernel in spin-lock mode, where only one kernel service at a time is serviced (for example, a memory create). Other processors seeking concurrent access to kernel services (for example, to the file system) have to wait until the first process is finished. Although the system services are protected from competing processes, the system operates inefficiently.

An improvement on the basic master/slave approach would be to allocate the different kernel services to different spin locks. For example, the scheduler, interrupt controller and file system would each have a spin lock. Increasing the granularity of the kernel protection data structures allows the kernel to service more tasks concurrently. Now, only processes that compete for identical kernel services will spin, waiting on a lock to be released.
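The difference between one kernel-wide lock and one lock per service can be shown with a small Python sketch. The `Kernel` class, its method names and the three service names are assumptions chosen to mirror the example above; ordinary `threading.Lock` objects stand in for spin locks.

```python
import threading


class Kernel:
    """Toy kernel whose services are guarded by one lock or by one lock each."""

    SERVICES = ("scheduler", "interrupt_controller", "file_system")

    def __init__(self, fine_grained: bool):
        if fine_grained:
            # One lock per kernel service.
            self._locks = {name: threading.Lock() for name in self.SERVICES}
        else:
            # A single lock shared by every service ("spin-lock mode").
            big_lock = threading.Lock()
            self._locks = {name: big_lock for name in self.SERVICES}

    def try_enter(self, service: str) -> bool:
        """Return True if the caller got the service, False if it would have to spin."""
        return self._locks[service].acquire(blocking=False)

    def leave(self, service: str) -> None:
        self._locks[service].release()


if __name__ == "__main__":
    coarse = Kernel(fine_grained=False)
    coarse.try_enter("scheduler")
    print("coarse, file system available:", coarse.try_enter("file_system"))

    fine = Kernel(fine_grained=True)
    fine.try_enter("scheduler")
    print("fine, file system available:", fine.try_enter("file_system"))
```

With the single lock, a request for the file system is refused while the scheduler is busy; with per-service locks only a second request for the scheduler would have to wait.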

A superior solution to the two implementations presented above is an SMP RTOS that, in addition to “load balancing” different tasks, also load balances kernel services. By threading the kernel services and allowing them to run in parallel on different microprocessors, the possibility of an unavailable kernel service decreases dramatically.

Table 2

There are other less obvious advantages to using SMP. Design teams typically focus on a single problem domain (task) at a time. Segmenting the problem domain into tasks is not only a natural approach from the human problem solving perspective, it is the same one used in the uni-processor world. Since this is how most companies’ design teams work, there is no need to re-organize an engineering team. Designing a threaded program is also a familiar technique used in the uni-processor world. No new paradigms are needed to utilize SMP effectively.

Compiler technology can also be applied to problems that benefit from decomposing the problem into smaller discrete parts that can be processed independently from each other. An example of this would be array processing, where the array can be decomposed so that parts may be attacked independently of each other. It is unlikely that the embedded arena will benefit from this type of solution anytime soon, as it is typically applied to architectures that have dozens if not hundreds of processors.

Table 3

RTOS strategies for asymmetric multiprocessing
Unlike SMP RTOSs, RTOSs for AMP architectures do not require that the hardware be symmetric or asymmetric. The primary characteristic of an AMP RTOS is that it only runs on a single core or a single processor. Put simply, an AMP solution requires one RTOS image for each core in the design. An AMP RTOS is the one that everyone is familiar with and has used in the embedded space for the last 50 years. It could be a “roll your own” or a commercial RTOS. Regardless, the RTOS image is compiled for that core and only sees the resources that the designer dictates.

The fact that AMP-targeted RTOSs don’t require a symmetrical hardware architecture, and that each core may or may not have access to different resources, makes the AMP approach very flexible in terms of how the total system can be put together. This singular characteristic makes it a very good candidate for use in a number of situations.

For heterogeneous architectures like the OMAP platform, where one core is a DSP and the other is an ARM microcontroller unit (MCU) core, the AMP solution is the only one possible. One RTOS image is compiled for and run on the DSP and the other is compiled for and run on the MCU. Just as hardware designers can provide cores that have specific functionality, software developers can deploy a mixture of RTOS functionality for each core.

As is the case with the OMAP platform, the DSP cannot see most of the system resources like network connections and storage devices; it only exists to crunch numbers. Therefore, developers of DSP applications rarely require anything beyond rudimentary RTOS services such as a scheduler and the ability to create and delete tasks.

The MCU developer, on the other hand, may need support for a man-machine interface, GUI, file system, memory management, and networking and communication protocols. To minimize the footprint of the combined RTOSs, each one may be scaled so that it provides only the functionality needed by the application software running on each core.

There is no requirement that AMP-based RTOSs have to run on heterogeneous architectures. There are a number of situations where a developer chooses to use an AMP operating system over an SMP implementation.

One example would be the case where deterministic real-time deadlines have to be met. Unlike SMP-based solutions, where the scheduler and interrupt handling mechanism may be shared amongst multiple cores, an AMP implementation, with its own scheduler and interrupt handling mechanism on each core, can respond to interrupts and deadlines without waiting to acquire a lock for a resource.

The developer can mix and match their RTOSs and hardware to provide an optimal mix of performance and features. For example, the developer can choose an optimized, deterministic real-time RTOS for the cores that respond to the real-time needs of the application, and RTOSs that offer a rich palette of services for the remaining cores. Developers can also choose third-party software for the non-real-time aspects of the system.

In an effort to reduce costs associated with software safety certifications, developers can partition the system so that only the bare minimum code pertaining to the safe operation of the device runs on one or more cores, and all the “non-safe” software runs on the other(s). While this may not make a difference for most applications, for those where it does, the final cost of developing code to meet FAA, FDA and industrial standards starts at $60 to $100 per line of code.

Partitioning and legacy code issues
Partitioning the system is a viable way of reducing the overall cost of delivering a safety certified system. Another reason to partition is the case where large amounts of legacy code were developed with a specific RTOS in mind. In this case, it is easier to reuse the total application intact, rather than port it to a new RTOS. Furthermore, it would take a significant investment in resources to determine the different interactions between the reused legacy code and the new code in a load-balancing system.

AMP solutions provide the designer with total control over how their system functions. AMP solutions leave nothing to chance. Each process is tasked to a core and has a priority assigned to it. Since no kernel resources are shared, AMP provides very deterministic behavior. If one word could describe an AMP solution, it would be “predictable.” The designer binds each task to a processor at compile time. If tasks running on different cores need to interact, the designer governs how, when and where they will interact.

One thing that SMP software developers take for granted is the ability to communicate between the different cores. Typically, AMP-based solutions have no built-in mechanism to provide a method for communication and synchronization between the two cores.

Since many of the multicore hardware solutions for the embedded space use shared memory, a very efficient inter-processor communication (IPC) mechanism can be built between the two cores. A basic implementation would provide for data marshaling if needed, a table and set of message buffers, a mechanism to protect and synchronize access to the message buffers, and a method to signal that the other core can access a message buffer.
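The pieces of such a basic implementation can be sketched in Python, with two threads standing in for the two cores. Everything here is an assumption made for illustration: the `Mailbox` class, the slot count and size, and the `<Hi` wire format (a 16-bit channel number followed by a 32-bit sample). On real hardware the buffers would live in shared memory and the signal would typically be an inter-core interrupt.

```python
import struct
import threading


class Mailbox:
    """Fixed table of message buffers shared by a sender and a receiver."""

    def __init__(self, slots=4, slot_size=16):
        self._slot_size = slot_size
        self._buffers = [bytearray(slot_size) for _ in range(slots)]  # message buffers
        self._lengths = [0] * slots                                   # table of used lengths
        self._head = 0                                                # next slot to read
        self._tail = 0                                                # next slot to write
        self._guard = threading.Lock()             # protects the table and indices
        self._free = threading.Semaphore(slots)    # counts empty slots
        self._ready = threading.Semaphore(0)       # signals "a message is waiting"

    def send(self, payload: bytes) -> None:
        if len(payload) > self._slot_size:
            raise ValueError("payload larger than a message buffer")
        self._free.acquire()                       # wait for an empty buffer
        with self._guard:
            slot = self._tail
            self._buffers[slot][:len(payload)] = payload
            self._lengths[slot] = len(payload)
            self._tail = (slot + 1) % len(self._buffers)
        self._ready.release()                      # signal the other core

    def receive(self) -> bytes:
        self._ready.acquire()                      # wait for the signal
        with self._guard:
            slot = self._head
            payload = bytes(self._buffers[slot][:self._lengths[slot]])
            self._head = (slot + 1) % len(self._buffers)
        self._free.release()                       # hand the buffer back
        return payload


def marshal_sample(channel: int, value: int) -> bytes:
    """Pack a sample into a fixed little-endian layout both cores agree on."""
    return struct.pack("<Hi", channel, value)


def unmarshal_sample(raw: bytes):
    return struct.unpack("<Hi", raw)


def demo(count=10):
    """One thread plays the DSP producing samples, another the MCU consuming them."""
    box = Mailbox()
    received = []

    def consumer():
        for _ in range(count):
            received.append(unmarshal_sample(box.receive()))

    reader = threading.Thread(target=consumer)
    reader.start()
    for i in range(count):
        box.send(marshal_sample(1, i * i))
    reader.join()
    return received


if __name__ == "__main__":
    print(demo(5))
```

The explicit byte layout matters most on heterogeneous parts, where the two cores may differ in word size or byte order; on identical cores the marshaling step can often be skipped.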

A last word about tools
One final aspect of multicore development is the availability of tools for the RTOS and the hardware target. Debuggers, like multi-core hardware and software solutions, vary in sophistication. SMP debuggers need to be able to determine what tasks have been allocated to different cores to run. The inability to set and trigger breakpoints across multiple cores makes finding and solving some bugs very difficult. Choosing the development environment used for multicore development can be as important a decision in the success of a project as the RTOS and hardware platform. The next article will address the topic of multicore development tools in detail.

To read Part 1 in this series, go to The Pros and Cons of Multicore Architectures

Todd Brian is product marketing manager in the Accelerated Technology group at Mentor Graphics Inc.

To learn about this general subject on Embedded.com go to More about multicores, multiprocessing and tools.

For further information about upcoming activities in the industry relating to multicore design, go to the Multicore Association Web site.
