Multicore Software Development: Fact and Fiction

Multicore is a hot topic. Half of all embedded designs have multiple processors, and 10% of embedded designs have multiple cores on a single chip. This percentage is slowly but surely increasing. In the same way that it is difficult to find single-core devices on the desktop, it is only a matter of time before the same will be true in embedded systems.

As designers have begun to adopt multicore designs, much press has been given to the challenges posed to software developers. In fact, it has become quite fashionable for industry pundits to wax histrionic about the ills that must be endured by software developers who have been launched into a new hardware world without the proper software tools and ecosystem.

While many of the popular complaints are fiction, multicore software development does, in fact, pose some serious challenges. In this article, we will try to separate fact from fiction as we discuss a few of the key issues on the table today.

1. Refactoring embedded software to achieve concurrency is a major challenge.
FICTION. It turns out that most embedded systems are already quite heavily multithreaded. It is common for embedded developers to employ real-time operating systems, and every RTOS in the world has some form of threading primitive. Embedded designers use threads to simplify the management of the independent functions in the system.

On a unicore system, threads are logically concurrent, with the operating system applying core processing power to each thread in turn. On a multicore processor, these threads are naturally and truly concurrent, usually with no change in the software required (assuming an RTOS capable of symmetric multiprocessing, or SMP).
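To make this concrete, here is a minimal sketch of the kind of threading already present in most embedded designs, written against the POSIX Pthreads API discussed later in this article. The task names are purely illustrative; the point is that the very same source runs time-sliced on a unicore and truly in parallel on an SMP system.

#include <pthread.h>
#include <stdio.h>

/* Each independent system function runs in its own thread, as is
 * typical in RTOS-based designs.  On a unicore these threads are
 * time-sliced; an SMP-capable OS runs the same binary with the
 * threads executing in parallel across cores, unchanged. */
static void *network_task(void *arg)
{
    (void)arg;
    puts("network task running");
    return NULL;
}

static void *audio_task(void *arg)
{
    (void)arg;
    puts("audio task running");
    return NULL;
}

int main(void)
{
    pthread_t net, audio;

    pthread_create(&net, NULL, network_task, NULL);
    pthread_create(&audio, NULL, audio_task, NULL);

    pthread_join(net, NULL);
    pthread_join(audio, NULL);
    return 0;
}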

Furthermore, as embedded systems have grown in complexity, adding a variety of connectivity and multimedia functions, components map naturally to threads. If the device embeds a web server, this web server uses one or more threads to serve connection requests. If the device has a file system, files are served by a number of file server threads.

Audio frameworks run as threads. CORBA and other connectivity solutions use threads. As systems designers pile on more and more applications and middleware, the number of threads increases, enabling the system to take immediate advantage of additional cores.

Of course, not all systems make optimal use of all the hardware cores. Designers may indeed want to increase concurrency by refactoring the code.

2. When refactoring software, maximize threads while minimizing processes.
FICTION. There are many ways to unlock concurrency, but coarse-grained parallelism (decomposing software into large pieces that are mapped to threads and/or processes) is arguably the most ubiquitous, portable, and effective.

Yet when deciding whether to map a new component to a thread (sharing memory space with other threads) or a process, most designers opt for the poorer choice of threads. In fact, until recently, many embedded systems were characterized by large numbers of threads, all sharing the same memory space.

The reason is largely historical: the most popular RTOSes of the 80s and early 90s did not support memory protection. Developers became accustomed to threads, and their legacy code lives on.

Of course, modern RTOSes support memory-protected processes. And while the cost (in terms of memory use and context-switching time) of a process may be a bit higher than that of a thread, that cost has reached the threshold of negligibility with today's fast processor cores and memory architectures.

In fact, designers should strive for a 1:1 ratio between threads and processes. In other words, each memory-protected component should have only a single thread of execution. Of course, it will not always be possible to reach this ideal, but designers should strive to minimize threads in each component, particularly in new code.

Whenever possible, each component should be owned by a single developer, with clear, well-defined, message-based interfaces between components. This component management philosophy minimizes unforeseen interactions and some of the nastier multithreading problems that arise when software uses many threads synchronized with mutexes and other error-prone constructs. Managing multithreaded components is simply more difficult, even with the best visualization and thread-aware debugging tools.
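As an illustration of such a message-based interface, the following sketch uses POSIX message queues to pass data between single-threaded components without any shared memory. The queue name and message format are invented for the example, and both ends are shown in one program for brevity; in a real design the producer and consumer would be separate memory-protected processes.

#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>

#define QUEUE_NAME "/sensor_data"   /* illustrative queue name */

int main(void)
{
    struct mq_attr attr = { .mq_flags = 0, .mq_maxmsg = 8,
                            .mq_msgsize = 64, .mq_curmsgs = 0 };
    mqd_t mq = mq_open(QUEUE_NAME, O_CREAT | O_RDWR, 0600, &attr);
    if (mq == (mqd_t)-1) {
        perror("mq_open");
        return 1;
    }

    const char msg[] = "temperature=23";    /* illustrative payload */
    mq_send(mq, msg, sizeof msg, 0);        /* producer side */

    char buf[64];                           /* must hold mq_msgsize */
    mq_receive(mq, buf, sizeof buf, NULL);  /* consumer side */
    printf("received: %s\n", buf);

    mq_close(mq);
    mq_unlink(QUEUE_NAME);
    return 0;
}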

Regardless of whether threads or processes are used, an SMP-capable operating system will automatically schedule the components onto the available cores. This automatic load balancing is one of the most important efficiencies realized by moving to a multicore platform.

3. The industry is suffering from a lack of multicore standardization.
FACT. A common, and valid, complaint. Multicore software needs the boost of pervasive standards. Sadly, the industry is suffering from a lack of standards in key areas. And where standards do exist, they are hobbled by politics. Open standards are important for many areas of the multicore software ecosystem. We'll cover just a few of the more visible ones: multithreading, interprocess communication, and data plane accelerators.

Multithreading. This area is actually one of the better ones; I would grade the industry a B+ here. The reason is POSIX. POSIX is a collection of open standard APIs specified by the IEEE for operating system services. POSIX threads, or Pthreads, is the part of the standard that deals with multithreading.

The Pthreads APIs provide interfaces for run control of threads, synchronization primitives, and IPC mechanisms. While other multithreading standards exist, Pthreads is the most generic, widely applicable standard. POSIX also provides primitives for managing protected processes.

Pthreads and processes are supported by Linux, UNIX, and a wide range of embedded operating systems such as INTEGRITY, LynxOS, and QNX. Even Windows supports a POSIX interface. Due to the ubiquity of POSIX, there exists a large base of application code that can be reused for embedded designs. Another strong advantage of POSIX is that independent conformance validation is available from The Open Group. The list of POSIX implementations that have been certified conformant to the latest POSIX specification can be found at PosixCertified.

By programming to the POSIX API, developers can write multithreaded applications that can be ported to any multicore platform running a POSIX-conformant operating system. POSIX conformance is a requirement for any operating system that expects to be used widely in multicore systems.

Of course, POSIX is not the only standards effort for multithreading. Multithreading is built into some programming languages, such as Java and Ada. OpenMP allows developers to add parallelizing directives to C and C++ code. None of these other standards, however, has the widespread applicability, pedigree, and acceptance of POSIX.
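To show what an OpenMP directive looks like, here is a minimal sketch; a compiler with OpenMP enabled splits the loop iterations across the available cores, while one without OpenMP support simply ignores the pragma and runs the loop serially.

#include <stdio.h>

#define N 1024

int main(void)
{
    static float a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = (float)(N - i);
    }

    /* Ask the compiler to parallelize the loop across cores. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[0] = %f\n", c[0]);
    return 0;
}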

POSIX itself needs some improvements for multicore. For example, while processes are part of the standard, scheduling primitives are thread-centric. As systems grow in complexity, it is desirable to schedule processes independently of threads. A designer may inherit large components, each with numerous constituent threads.

The designer should not need to understand the number or priority of threads within a component. Rather, the designer needs to be able to assign an allocation of CPU time to the component as a whole. Within the component, normal thread scheduling can be used.

While some modern RTOSes provide this type of hierarchical scheduling capability, the distinction between threads and processes from a scheduling perspective is currently absent from the POSIX standard.
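To see how thread-centric the standard is today, consider this sketch of the POSIX scheduling attributes: policy and priority attach to individual threads, and there is no standard call for granting a CPU-time budget to a process as a whole. The priority value is illustrative, and real-time policies such as SCHED_FIFO may require elevated privileges.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    struct sched_param param = { .sched_priority = 10 }; /* illustrative */
    pthread_t tid;

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO); /* per-thread policy   */
    pthread_attr_setschedparam(&attr, &param);      /* per-thread priority */

    if (pthread_create(&tid, &attr, worker, NULL) != 0) {
        fprintf(stderr, "pthread_create failed (SCHED_FIFO may need privileges)\n");
        return 1;
    }
    pthread_join(tid, NULL);
    return 0;
}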

Core affinity is another concept lacking in POSIX. Core affinity is a performance feature found in most SMP-capable operating systems. When a thread migrates from one core to another, the cache locality of the thread's code and data is lost and must be reloaded on the new core.

When a thread needs to run (e.g. it is the highest-priority runnable thread) and more than one core is available, the SMP operating system must intelligently choose which core to use.

The operating system usually keeps track of a thread's natural affinity. The natural affinity of a thread is defined as the core on which the thread last executed. By assigning threads to the cores that match their natural affinity, migrations and cache misses are minimized, and the embedded software will realize superior performance and power efficiency.

The other form of affinity is user-defined, and this is where the need for an API standard arises. To see where user-defined affinity is useful, consider the following example.

The SMP operating system typically provides the ability to map interrupts to specific cores. Inefficiency arises when an interrupt fires on one core, but the operating system schedules a thread to perform handling of the interrupt on a different core.

The first core must use an interprocessor interrupt (IPI) to inform the second core of the scheduling event and preempt whatever was running. Latency and overall system efficiency may be improved by forcibly binding the thread to the first core.

Another scenario involves multiple threads cooperating to fulfill a particular job (e.g. using shared data structures). Cooperating threads can be assigned the same core affinity to minimize IPIs and maximize cache utilization.

The SMP operating system typically provides a system call that enables software to assign core affinity in this manner. POSIX, however, does not yet have a standardized API for this.
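On Linux, for example, the relevant call is the GNU extension pthread_setaffinity_np(); the "_np" suffix advertises its non-portability, and other operating systems expose equivalent but incompatible interfaces, which is precisely the standardization gap. A minimal sketch, binding the calling thread to core 0 (the core number is illustrative):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int bind_to_core0(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);  /* core 0 is illustrative */

    /* Non-portable ("_np"): Linux/glibc only. */
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}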

Interprocess Communication. My grade for IPC standards is a D. Message passing has long been a mechanism used to implement parallel computing, mainly because the multicomputers historically used to host massively parallel scientific computations lacked a shared memory subsystem.

Rather, data for parallel computations are sent to the parallel cores using IPC, with the same IPC serving as a synchronization mechanism. Although the target applications may differ from their scientific brethren, multicore embedded systems often require IPC to achieve parallelism.

IPC comes in many flavors. In the scientific community, MPI (Message Passing Interface) is a widely used standard. POSIX, of course, specifies a variety of mechanisms, including pipes, FIFOs, and sockets, that were designed for loosely coupled IPC.
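As a point of reference, here is a minimal MPI sketch in which the message itself doubles as synchronization, since the receive cannot complete until the data arrives. The value sent is illustrative, and the program would be started with an MPI launcher such as mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;  /* illustrative payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}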

Some IPC mechanisms strive to build in fault tolerance, a desirable feature for loosely coupled multicore systems, since one of the major advantages of these systems is the ability to recover from single-node failures. Although IPC does not inherently provide fault tolerance, the goal is to provide the building blocks, such as automatic link-down detection and retransmission, that enable developers to build survivability into their multicore systems.

One example of an IPC that attempts to provide fault tolerance features is TIPC (Transparent Interprocess Communication). TIPC is an open source package that has been ported to multiple operating systems.

TIPC has provisions for reliable message delivery, retransmission, and communication link failover. Another standard is LINX, an open-sourced version of the Link Handler provided in Enea's OSE operating system.

Therein lies the rub: there is no consensus on IPC standards in the embedded systems industry. In fact, another IPC standard recently sprouted: the Multicore Association's CAPI. The good news is that this association has an impressive collection of big-iron members (Intel, Freescale, TI, NEC) that has the potential to foster widespread consensus. Stay tuned.

Data Plane Accelerators. Many of the heterogeneous, tightly coupled multicore architectures involve the marriage of a general-purpose processor (GPP) and one or more application-specific co-processors or accelerators.

In fact, this is arguably the most common multicore architecture, since many embedded designs are powered by custom ASICs consisting of a GPP and some sort of custom application-specific compute engine.

Texas Instruments' OMAP and DaVinci product lines include examples of such a multicore architecture. One flavor of DaVinci marries an ARM core with a TI DSP and an additional video processing subsystem. Network processors such as the Intel IXP are another example, linking an XScale GPP with network processing elements.

Unfortunately, due to the application-specific nature of these co-processors, there is usually no generic API upon which developers can rely. For example, TI provides a proprietary communications library, BiosLink, which must be used to communicate with the TI kernel running on the DSP.

TI also provides libraries for using the video acceleration capabilities on some DaVinci processors. Therefore, the operating system vendor must provide a port of these APIs in order for developers to make reasonable use of the part. And, of course, BiosLink does not run on non-TI devices.

Inasmuch as these offload engines provide similar functions, standardized APIs should be developed and used. For example, it is conceivable that the various vendors of network processors could standardize some of the APIs used to manage packet traffic. Sadly, as with IPC, there is very little consensus in the area of heterogeneous intercore communications for software developers. Grade: C-.

4. Multicore debugging tools are lagging.
FICTION. This is another one of those mythical concerns. Although there are certainly a number of IDEs that have failed to adapt to the multicore evolution, leading IDEs have been focusing on multicore support for a long time. Let's take a look at the modern multicore debugging toolbox.

On-Chip Debugging. Tightly coupled multicore processors often provide a single on-chip debug port (e.g. JTAG) that enables a host debugger, connected with a hardware probe device, to debug multiple cores simultaneously. With this capability, developers can perform low-level, synchronized run control of the multiple cores. Board bring-up and device driver development are two common uses of this type of solution.

For efficient use of this multicore hardware facility, the development tool must enable the developer to visualize all the cores of the system and choose any combination of the cores to debug, each optionally in its own window. At the same time, the tool must provide controls for synchronized running and halting of the debugged cores.

The Probe advanced hardware debug device and the MULTI IDE are one example of a development tool system that meets these requirements. The Probe's multicore debugging features were launched in 2001. MULTI, launched in the early 90s, was named in part for its ability to debug multiple processing elements simultaneously and elegantly.

Run-Mode Debugging. Run-mode debugging is useful for both tightly and loosely coupled multicore systems, and for heterogeneous as well as homogeneous systems. With run-mode debugging, the cores are never stopped. Rather, the debugger controls application threads using a communications channel (usually Ethernet) between the host PC and a target-resident debug agent.

For efficient use of this facility, the operating system mustprovide an integrated debug agent (and the associated communicationsdevice drivers) that is operating system aware and provides flexibleoptions for interrogating the system.

For example, the INTEGRITY operating system comes with a powerful debug agent that communicates with the MULTI debugger, providing the capability to debug any combination of user threads on any core, regardless of the homogeneity of the core architecture.

The user can set specialized breakpoints that enable user-defined groups of threads to be halted when another thread hits the breakpoint. Some classes of bugs require this fine-grained level of control.

To be able to halt threads on a core separate from the core running the thread that hits the breakpoint, the operating system must handle all the behind-the-scenes communication that informs the appropriate core, with minimal latency, of the event.

Multicore Event Analyzers. Many operating system vendors provide an event analysis tool. A target-resident agent logs important operating system level events, such as service calls, interrupts, context switches, and user-defined events.

The tool uploads this event log (either during execution or post-mortem) and displays the events in a timeline. The tool allows the user to zoom, select specific events for further information, generate statistical reports on execution, and perform other functions.

The event analyzer is an indispensable tool for developers of multicore software because it makes it easy to understand system behavior and locate performance bottlenecks, livelocks, or other problems.

The event analyzer must be able to show events for all threads on all the cores, with the event streams synchronized to the same time scale. The tool must be able to display IPC between the cores. The EventAnalyzer product is one example of a tool that meets these requirements.

Multicore trace is another emerging behavioral analysis capability. With on-chip trace (available on a growing number of multicore processors), the behavior of multiple cores can be recorded and synchronized.

In order to take advantage of multicore trace, the development toolset must be multicore-trace aware. This means that the tools must be capable of visualizing multiple execution streams so that developers can easily replay the execution of any combination of cores and determine what each core was doing at any particular time.

The TimeMachine suite is an example of a tool that takes these trace streams and provides analysis tools and backwards-in-time debugging. These tools not only make some of the most insidious multicore bugs easy to find, but they are also critical for non-intrusive performance analysis.

The aforementioned tools, and a number of others, make up some of the mature, effective technologies in the multicore developer's arsenal. Multicore software enablement has become a controversial topic. While there are significant challenges in architecture, design, and standardization strategies, multicore developers can meet the evolving hardware landscape with confidence.

David Kleidermacher is chief technology officer at Green Hills Software, where he has been designing compilers, software development environments, and real-time operating systems for the past 16 years. David frequently publishes articles in trade journals and presents papers at conferences on topics relating to embedded systems. He holds a BS in computer science from Cornell University and can be reached at davek@ghs.com.
