
Making the Most of Multi-Core Processors: Part 1

The speed race is over. Faced with the growing energy consumption and excessive operating temperatures caused by high CPU clock speeds, microprocessor vendors have adopted a new approach to boosting system performance: integrating multiple independent processor cores on a single chip.

Intel, for example, has proclaimed that all of its new CPUs will use multi-core architectures and has recently produced a roadmap that details processors based on two, four, and eight cores.

Until now, multi-core processors for the desktop and server markets have garnered the lion’s share of media attention. But multi-core is also taking root in the embedded industry, with the introduction of processors such as the dual-core Freescale MPC8641D, the dual-core Broadcom BCM1255, the quad-core Broadcom BCM1455, and the dual-core PMC-Sierra RM9000x2.

Multi-core processors like these are poised to bring new levels of performance and scalability to networking equipment, control systems, videogame platforms, and a host of other embedded applications.

Compared to conventional uniprocessor chips, multi-core processors deliver significantly greater compute power through concurrency, offer greater system density, and run at lower clock speeds, thereby reducing thermal dissipation and power consumption issues.

There is one problem, however: most software designers and engineers have little or no expertise in the programming models and techniques used for multi-core chips. Instead of relying on increasing clock speeds to achieve greater performance, they must now learn how to achieve the highest possible utilization of every available core.

Take, for instance, the challenge of managing shared resources in a multi-core chip. In most cases, the cores have separate level 1 caches, but share a level 2 cache, memory subsystem, interrupt subsystem, and peripherals (see Figure 1, below). As a result, the system designer may need to give each core exclusive access to certain resources (for instance, specific peripheral devices) and ensure that applications running on one core don’t access resources dedicated to another core.

Figure 1 — Anatomy of a typical multi-core processor.

The presence of multiple cores can also introduce greater design complexity. For instance, to cooperate with one another, applications running on different cores may require efficient interprocess communication (IPC) mechanisms, a shared-memory data infrastructure, and appropriate synchronization primitives to protect shared resources.

Code migration is also an issue. Most embedded OEMs have already invested in a massive code base created primarily for uniprocessor architectures. These companies need OS technology and development tools that help their code achieve the greatest possible resource utilization on multi-core hardware, with minimal porting effort.

The OS chosen for a multi-core design can significantly reduce, or increase, the effort required to address these challenges. It all depends on how the OS supports the various multiprocessing modes that a multi-core chip may offer. These modes come in three basic flavors: asymmetric, symmetric, and bound.

Asymmetric Multiprocessing
Asymmetric multiprocessing, or AMP, provides an execution environment similar to that of conventional uniprocessor systems, which most developers already know and understand. Consequently, it offers a relatively straightforward path for porting legacy code. It also provides a direct mechanism for controlling how each CPU core is used and, in most cases, allows developers to work with standard debugging tools and techniques.

AMP can be either homogeneous, where each core runs the same type and version of OS, or heterogeneous, where each core runs either a different OS or a different version of the same OS. In a homogeneous environment, developers can make best use of the multiple cores by choosing an OS, such as the QNX Neutrino® RTOS, that natively supports a distributed programming model.

Properly implemented, the model will allow applications running on one core to communicate transparently with applications and system services (device drivers, protocol stacks, etc.) on other cores, but without the high CPU utilization imposed by traditional forms of interprocessor communication.

A heterogeneous environment has somewhat different requirements. In this case, the developer must either implement a proprietary communications scheme or choose two OSs that share a common infrastructure (likely Internet Protocol-based) for interprocessor communications. To help avoid resource conflicts, the OSs should also provide standardized mechanisms for accessing shared hardware components.

In virtually all cases, OS support for a lean and easy-to-use communications protocol will greatly enhance core-to-core operation. In particular, an OS built with the distributed programming paradigm in mind can take greater advantage of the parallelism provided by the multiple cores.

Inter-core IPC. Because AMP uses multiple OS instantiations, it typically requires a complete networking infrastructure to support communications between applications running on different cores. To implement the lowest level of buffer transfer, an AMP system may use I/O peripherals (for instance, Ethernet ports) assigned to each processor or use a shared memory/interrupt-based scheme, depending upon the system’s hardware capabilities. For heterogeneous operation, an AMP system will require standard protocols such as TCP/IP or possibly a clustering protocol like the Transparent Inter-Process Communications (TIPC) protocol.
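To make the networking-based approach concrete, the sketch below shows one side of a minimal TCP/IP exchange between applications running on two AMP cores, assuming each core's OS instance has its own IP address on a shared interconnect. The peer address, port, and function name are illustrative placeholders, not values from the original paper.

```c
/* Minimal sketch: an application on one AMP core sends a buffer to a
 * peer application on another core over TCP/IP. The peer address and
 * port are hypothetical placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int send_to_peer_core(const char *peer_ip, unsigned short port,
                      const void *buf, size_t len)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) {
        perror("socket");
        return -1;
    }

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(port);
    inet_pton(AF_INET, peer_ip, &peer.sin_addr);

    if (connect(sock, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        close(sock);
        return -1;
    }

    ssize_t sent = write(sock, buf, len);   /* hand the buffer to the stack */
    close(sock);
    return (sent < 0) ? -1 : 0;
}
```

In a homogeneous QNX-based system, the same exchange could instead ride on the native message-passing interface extended over TDP, as described next.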

In addition to standard protocols, a homogeneous AMP implementation may use proprietary or OS-specific protocols. For instance, by using the QNX Transparent Distributed Processing (TDP) protocol, developers can extend the native message-passing interface of the QNX Neutrino RTOS to implement remote communications over any lower-level interconnect.

With this approach, local and remote communications become one and the same: an application can use the same code to communicate with another application, regardless of whether the other application is on the local CPU core or on another core. Likewise, an application can access a device driver on a non-local processor as if that driver were running locally — this location transparency provides applications with easy access to hardware resources “owned” by other cores or processors.

Allocating resources. With AMP, the application designer has the power to decide how the shared hardware resources used by applications are divided up between the cores. Normally, this resource allocation occurs statically during boot time and includes physical memory allocation, peripheral usage, and interrupt handling. While the system could allocate the resources dynamically, doing so entails complex coordination between the cores.

In an AMP system, a process will always be scheduled to run on the same core, even when that core is experiencing maximum CPU usage and other cores are running idle. As a result, one or more cores can end up being over- or underutilized. To address the problem, the system could allow applications to migrate dynamically from one core to another.

Doing so, however, can involve complex checkpointing of state information or a possible service interruption as the application is stopped on one core and restarted on another. Also, such migration is difficult, if not impossible, if the cores run different OSs.

Symmetric Multiprocessing
Allocating resources in a multi-core design can be difficult, especially when multiple software components are unaware of how other components are employing those resources.

Symmetric multiprocessing (SMP) addresses the issue by running only one copy of an OS on all of the chip’s cores. Because the OS has insight into all system elements at all times, it can allocate resources on the multiple cores with little or no input from the application designer. Moreover, the OS can provide built-in standardized primitives, such as pthread_mutex_lock, pthread_mutex_unlock, pthread_spin_lock, and pthread_spin_unlock, that let multiple applications share these resources safely and easily.
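As a minimal illustration of these primitives, the sketch below protects a shared counter with a POSIX mutex so that threads running concurrently on different cores can update it safely; the counter and function name are illustrative, not taken from the original paper.

```c
/* Minimal sketch: protecting a shared resource with a POSIX mutex so
 * that threads scheduled on different cores update it safely. */
#include <pthread.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;   /* resource shared across cores */

void increment_shared_counter(void)
{
    pthread_mutex_lock(&counter_lock);    /* serialize access */
    shared_counter++;
    pthread_mutex_unlock(&counter_lock);  /* let threads on other cores proceed */
}
```

A spinlock (pthread_spin_lock/pthread_spin_unlock) would be the analogous choice when the critical section is short enough that briefly busy-waiting on another core is cheaper than blocking.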

By running only one copy of the OS, SMP can dynamically allocate resources to specific applications rather than to CPU cores, thereby enabling greater utilization of available hardware. It also lets system tracing tools gather operating statistics and application interactions for the multi-core chip as a whole, giving developers valuable insight into how to optimize and debug applications.

For instance, the system profiler in the Momentics development suite can track thread migration from one core to another, as well as OS primitive usage, scheduling events, core-to-core messaging, and other events, all with high-resolution timestamping. Application synchronization also becomes much easier since developers use standard OS primitives rather than complex IPC mechanisms.

Properly implemented, an SMP-enabled OS offers these benefits without forcing the developer to use specialized APIs or programming languages. In fact, developers have successfully used the POSIX standard (specifically, the pthreads API) for many years in high-end SMP environments.

Not only is POSIX widely used and well-documented, but developers can write POSIX-based code that will run on both uniprocessor and multi-core chips. In fact, some OSs allow the same binaries to run on both processor types.

A well-designed SMP OS allows the threads of execution within an application to run concurrently on any core. This concurrency makes the entire compute power of the chip available to applications at all times. If the OS provides appropriate preemption and thread-prioritization capabilities, it can also help the application designer ensure that CPU cycles go to the application that needs them the most.

Inter-core IPC. Because a single OS controls every core in an SMP system, all IPC is considered “local.” This approach can improve performance immensely, as the system no longer needs a complex networking protocol to provide communications between applications running on different cores.

Communications and synchronization can take the form of simple POSIX primitives (such as semaphores) or a native local transport capability, both of which have much higher performance than networking protocols. This approach has the added benefit of a reduced footprint if a networking protocol isn’t required for other reasons.
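For example, a counting semaphore is often enough to hand work from a producer thread on one core to a consumer thread on another, with no networking stack involved. The sketch below is minimal and illustrative: the work queue itself is omitted and the names are assumptions, not from the original paper.

```c
/* Minimal sketch: a POSIX semaphore used as a work signal between a
 * producer thread and a consumer thread that may run on different cores.
 * The actual work queue is omitted for brevity.
 * During initialization, call: sem_init(&work_available, 0, 0); */
#include <pthread.h>
#include <semaphore.h>

static sem_t work_available;

void *producer(void *arg)
{
    for (;;) {
        /* ... enqueue a work item (not shown) ... */
        sem_post(&work_available);      /* wake a consumer on any core */
    }
}

void *consumer(void *arg)
{
    for (;;) {
        sem_wait(&work_available);      /* block until work is signaled */
        /* ... dequeue and process the work item (not shown) ... */
    }
}
```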

Bound Multiprocessing (BMP)
BMP, a new approach introduced by QNX Software Systems, offers the benefits of SMP’s transparent resource management, but gives designers the ability to lock software tasks to specific cores.

As with SMP, a single copy of the OS maintains an overall view of all system resources, allowing them to be dynamically allocated and shared among applications. During application initialization, however, a setting determined by the system designer forces all of an application’s threads to execute only on a specified core.

Compared to full, floating SMP operation, this approach offers several advantages. It eliminates the cache thrashing that can reduce performance in an SMP system by allowing applications that share the same data set to run exclusively on the same core.

It also offers simpler application debugging than SMP since all execution threads within an application run on a single core. And it helps legacy applications written for uniprocessor environments to run correctly, again by letting them run on a single core.

With BMP, an application locked to one core can’t leverage other cores, even if they’re idle. That said, the OS vendor can offer tools that analyze resource utilization (including CPU usage) on a per-application basis and that suggest the best way to distribute applications across the cores for maximum performance.

If the OS also provides hooks to dynamically change the designated CPU core, then the user gains the freedom to switch dynamically from one core to another, without having to worry about checkpointing and stopping/restarting the application.

SMP has long had the capability of tying a particular thread to a single processor — this is known as thread affinity. BMP extends this thread affinity to the process level by incorporating the concept of runmask inheritance.

The runmask is a thread-level entity that determines which processors a thread can run on. In normal execution, threads are created with a runmask that allows them to execute on all processors. In BMP mode, however, all threads are created to inherit the runmask from the parent thread. This has the effect of “binding” all of the process’s resources and threads to the same processing core (or set of processing cores), giving the designer complete control over how a particular core will be used by the overall application.
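The runmask interface itself is QNX-specific, but the underlying idea (a per-thread bitmask of allowed cores) exists elsewhere as well. As an illustration only, the sketch below pins the calling thread to core 0 using the Linux-specific pthread_setaffinity_np() call; on QNX Neutrino the equivalent setting is made through the OS's own runmask interface rather than this function.

```c
/* Illustration of a thread-affinity (runmask-style) bitmask using the
 * Linux-specific pthread_setaffinity_np() call. This is an analogy for
 * the runmask concept, not the QNX Neutrino interface itself. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int bind_current_thread_to_core0(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);               /* allow execution on core 0 only */

    int rc = pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
    if (rc != 0)
        fprintf(stderr, "setaffinity failed: %d\n", rc);
    return rc;
}
```

Threads created after this call inherit the caller's affinity mask, which mirrors the runmask inheritance that BMP builds on.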

A viable migration strategy
As a midway point between AMP and SMP, BMP offers a viable migration strategy for users who wish to move towards full SMP, but are concerned that their existing code may operate incorrectly in a truly concurrent execution model. Users can port legacy code to the multi-core processor and initially bind it to a single core to ensure correct operation.

By judiciously binding applications (and possibly single threads) to specific cores, designers can also isolate potential concurrency issues down to the application and thread level. Resolving these issues will allow the application to run fully concurrently, thereby maximizing the performance gains provided by the multi-core processor.

Scaling Software on a Multi-Core System
To scale software effectively on a multi-core system, applications must be designed with concurrent operation in mind. Typically, applications written for most modern OSs today conform to a multithreaded, process-based model.

With this model, applications are broken down into processes that act as containers for resources, such as memory, virtual address space, stack, and so on. Within each process, the application is further broken down into threads.

A thread is the entity within the process that can be scheduled to run. A thread has configurable elements such as thread priority, which tells the OS how to run the thread in relation to other threads in the same or other processes.
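The sketch below shows this model in POSIX terms: a process creates a worker thread and sets an explicit scheduling priority through a thread attribute object. The priority value of 10, the SCHED_FIFO policy, and the function names are illustrative assumptions, not values from the original paper.

```c
/* Minimal sketch: creating a thread with an explicit priority using the
 * POSIX pthreads API. Policy and priority values are illustrative. */
#include <pthread.h>
#include <sched.h>

void *worker(void *arg)
{
    /* ... application work ... */
    return NULL;
}

int start_worker_thread(pthread_t *tid)
{
    pthread_attr_t attr;
    struct sched_param param = { .sched_priority = 10 };

    pthread_attr_init(&attr);
    /* Use the attributes below instead of inheriting the parent's. */
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &param);

    int rc = pthread_create(tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);
    return rc;
}
```

Note that fixed-priority policies such as SCHED_FIFO may require elevated privileges on some systems.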

In AMP mode, a process and all its threads are strictly tied to a single processor core. To distribute the application, developers must create and execute the process on each core and employ an appropriate communications protocol (for instance, TIPC or QNX TDP) to distribute, aggregate, and synchronize elements across the system.

Because of this additional complexity, developers may have to recode or re-architect their software when migrating to processors with a larger number of cores.

Because each OS scheduler in an AMP system contains thread information only for its own CPU core, threads are scheduled only in relation to other threads on that core. Consequently, a process that consumes 100% of a core can’t make use of CPU cycles from another core, even if the other core is completely idle.

Since multiple cores cannot share resources contained within a process, migrating the process from one core to another involves checkpointing, stopping the process, and restarting the process on the other core. This assumes that the application can even run on the other core, which may not be the case in a heterogeneous AMP system.

In SMP and BMP, individual threads are scheduled to run on all CPUs, and resources allocated to a process (and all of its threads) are available on all CPUs. This approach greatly simplifies the distribution, aggregation, and synchronization of an application — threads are created to run on each CPU as required, and communications between threads are handled by simple POSIX primitives (semaphores, mutexes, etc.) rather than by complex networking protocols.

Once designed, the process can run equally well on a single-core, dual-core, or N-core system, the only potential change being the number of threads that need to be created to maximize performance. In full SMP mode, an RTOS will schedule the highest-priority ready thread to execute on the first available CPU core. As a result, application threads can utilize the full extent of available CPU power rather than being restricted to a single CPU.
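One common way to make the thread count scale with the hardware is to query the number of online processors at startup and spawn that many workers, as in the sketch below. The sysconf(_SC_NPROCESSORS_ONLN) query is widely supported though not strictly POSIX-mandated, and the worker function is assumed to exist as in the earlier sketch.

```c
/* Minimal sketch: size the worker pool to the number of online cores so
 * the same binary scales from single-core to N-core systems. */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

extern void *worker(void *arg);   /* application-defined work loop */

int start_worker_pool(pthread_t **out_tids, long *out_count)
{
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncores < 1)
        ncores = 1;               /* fall back to a single worker */

    pthread_t *tids = calloc((size_t)ncores, sizeof(*tids));
    if (tids == NULL)
        return -1;

    for (long i = 0; i < ncores; i++)
        pthread_create(&tids[i], NULL, worker, NULL);

    *out_tids = tids;
    *out_count = ncores;
    return 0;
}
```

Run on a single-core system, this code simply creates one worker; on a quad-core SMP system, it creates four, with the OS free to schedule each on any available core.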

Evaluating typical multi-core configurations
In the homogeneous example shown in Figure 2, below, one core handles ingress traffic from a hardware interface while the other handles the egress traffic.

Because the traffic exists as two independent streams, the two cores don’t need to communicate or share data with each other. As a result, the OS doesn’t have to provide core-to-core IPC. It must, however, provide the realtime performance needed to manage the traffic flows.

Figure 2 — Using homogeneous AMP to handle both ingress and egress traffic.

Figure 3, below, shows another homogeneous example, but this time the two cores implement a distributed control plane, with each core handling different aspects of a data plane. To control the data plane correctly, applications running on the multiple cores must function in a coordinated fashion. To enable this coordination, the OS should provide strong IPC support, such as a shared memory infrastructure for routing table information.

Figure 3 — Using homogeneous AMP to implement a distributed control plane.

In the heterogeneous example shown in Figure 4, below, one core implements the control plane, while the other handles all the data plane traffic, which has realtime performance requirements.

In this case, the OSs running on the two cores need to support a common IPC mechanism, such as the Transparent Inter-Process Communication (TIPC) protocol, that allows the cores to communicate efficiently, possibly through shared data structures.

Figure 4 — Using heterogeneous AMP for both the control plane and data plane.

In the control plane scenario in Figure 5, below, SMP allows all of the threads in the various processes to run on any core. For instance, the command-line interface (CLI) process can run on one core while the routing application performs a compute-intensive calculation on another core.

Figure 5 — Using SMP in the control plane.

As shown in Figure 6, below, a medical system is running in BMP mode on a quad-core processor, where one core handles data acquisition, another graphics rendering, another the human-machine interface, and the other database and data-processing operations.

As in SMP, the OS is fully aware of what all the cores are doing, making operational and performance information for the system as a whole readily available. This approach spares developers the onerous task of having to gather information from each of the cores separately and then somehow combine that information for analysis.

Figure 6 — Using BMP in a quad-core medical system.

In the control plane/data plane example shown in Figure 7, below, the control plane applications (command-line interface; operations, administration, and maintenance; data plane management) run on core 0, while the data plane ingress and egress applications run on core 1. Developers can easily implement the IPC for this scenario, using either local OS mechanisms or synchronized protected shared memory structures.

Figure 7 — Using BMP for both control plane and data plane operations.

A Matter of Choice
Should a developer choose AMP, SMP, or BMP? The answer depends, of course, on the problem the developer is trying to solve. It’s important, therefore, that an OS offers robust support for each model, giving developers the flexibility to choose the best form of multiprocessing for the job at hand.

AMP works well with legacy applications, but has limited scalability beyond two cores. SMP offers transparent resource management, but may not work with software designed for uniprocessor systems.

BMP offers many of the same benefits as SMP, but allows uniprocessor applications to behave correctly, greatly simplifying the migration of legacy software. The flexibility to choose from any of these models enables developers to strike the optimal balance between performance, scalability, and ease of migration.

In Part 2, the authors will discuss the kinds of development tools needed to work effectively in complex multi-core and multiprocessing environments.

Robert Craig, PhD, is senior software developer, Dennis Keefe is development tools manager, and Paul Leroux is technology analyst at QNX Software Systems.

This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Silicon Valley 2006. Used with permission of the Embedded Systems Conference. Please visit www.embedded.com/esc/sv.

