To help developers troubleshoot and optimize multi-core applications, vendors such as QNX have introduced system-tracing tools that provide a comprehensive view of a multi-core system.
As Figure 1 below shows, such tools allow the developer to visualize complex system interactions and pinpoint potential bottlenecks such as excessive message passing between cores.
They can also help analyze resource utilization, including CPU usage, on a per-application basis and suggest how to distribute applications across multiple cores for optimal performance: reducing excessive IPC between cores, finding opportunities for parallelism, eliminating resource contention, and reducing unnecessary thread migration.
|Figure 1: Analyzing system-level interactions between the many multi-core software components is beyond the scope of traditional debuggers designed to detect errors within individual programs and requires system-tracing tools to analyze how the whole multi-core system behaves.|
Reducing excessive IPC between cores
Excessive message passing between cores can impede system performance in a multi-core SMP system. Every time a core sends a message, it must write the message to memory and send an interrupt to the receiving core. The receiving core must then service the interrupt, schedule the software process or thread that will handle the message, and read the message from memory.
Together, these operations entail overhead, especially since they might not use information stored in the processor cache. As a result, processes or threads that communicate frequently with one another across cores (for example, a process that provides services to client applications) can consume a noticeable amount of system capacity.
This overhead is still considerably less than would occur in an AMP system that uses a full networking protocol for IPC, since the protocol stack would have to process the packet before delivering it to the application.
A tool that performs system-level tracing is invaluable in identifying this kind of behavior. Such a tool can gather a massive amount of system information, including hardware interrupts, kernel calls, scheduling events, thread-state changes, and various forms of interprocess communication (IPC), including signals and messages.
To be useful, however, the tool must allow the developer to filter out everything except for the information related to inter-core communications. That way, the developer can quickly see which threads are communicating between cores and identify which cores are executing a given thread over time.
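The filtering step described above can be sketched in a few lines. This is a minimal illustration, not the output format of any real tracing tool: the `MsgEvent` record and its fields are hypothetical stand-ins for whatever a trace log actually contains.

```python
from collections import Counter
from typing import NamedTuple


class MsgEvent(NamedTuple):
    """One message-send record from a hypothetical trace log."""
    time_us: int
    sender: str         # name of the sending thread
    sender_core: int    # core the sender ran on
    receiver: str       # name of the receiving thread
    receiver_core: int  # core the receiver ran on


def inter_core_pairs(events):
    """Count messages per (sender, receiver) pair that crossed cores."""
    counts = Counter()
    for ev in events:
        if ev.sender_core != ev.receiver_core:
            counts[(ev.sender, ev.receiver)] += 1
    return counts


# Made-up sample data: only client_a talks to the server across cores.
trace = [
    MsgEvent(10, "client_a", 0, "server", 1),
    MsgEvent(20, "client_b", 1, "server", 1),
    MsgEvent(30, "client_a", 0, "server", 1),
    MsgEvent(40, "logger", 0, "client_a", 0),
]
pairs = inter_core_pairs(trace)
busiest = pairs.most_common(1)[0]  # pair with the most cross-core messages
```

Messages that stay on one core are dropped, so what remains points directly at the thread pairs worth examining.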
To further isolate the problem, the tool can provide statistics such as interrupts per thread, which help identify threads that cause excessive inter-core communication. The tool could also display CPU utilization to help the developer examine core utilization from a system perspective, a process perspective, and a thread-priority perspective.
A system perspective helps identify potential problem areas, such as whether a core is underutilized or 100% utilized. A process perspective helps the developer drill down to see what threads are doing in a problem area. And a thread-priority view helps the developer detect whether certain threads are getting too much or too little CPU time because of their assigned priority. Using this information, the developer can gauge the frequency and efficiency of inter-core communications.
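The three views can be thought of as the same scheduling data grouped by different keys. The sketch below assumes a hypothetical `RunInterval` record (core, process, priority, start and end time) and made-up numbers; a real tool would derive such intervals from its own scheduling events.

```python
from collections import defaultdict
from typing import NamedTuple


class RunInterval(NamedTuple):
    """A span during which one thread ran on one core (hypothetical format)."""
    core: int
    process: str
    priority: int
    start_us: int
    end_us: int


def utilization(intervals, window_us, key):
    """Return busy time as a fraction of the window, grouped by key(interval)."""
    busy = defaultdict(int)
    for iv in intervals:
        busy[key(iv)] += iv.end_us - iv.start_us
    return {k: t / window_us for k, t in busy.items()}


intervals = [
    RunInterval(0, "router", 10, 0, 1000),
    RunInterval(1, "router", 10, 0, 200),
    RunInterval(1, "logger", 5, 200, 300),
]
window = 1000  # length of the observation window in microseconds

by_core = utilization(intervals, window, lambda iv: iv.core)          # system view
by_process = utilization(intervals, window, lambda iv: iv.process)    # process view
by_priority = utilization(intervals, window, lambda iv: iv.priority)  # priority view
```

In this sample, core 0 is fully utilized while core 1 is mostly idle. Per-process and per-priority values are measured in units of one core's time, so they can exceed 1.0 when a process runs on several cores at once.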
From there, the developer can make an informed decision as to how to correct the problem. For instance, the developer could take processes that have a high affinity for one another and bind them to the same processor core.
Doing so will reduce the overhead of inter-core communications and eliminate the attendant cache thrashing that degrades performance. The developer could then use the same system-level tracing tool to profile the new configuration and measure the efficiencies gained.
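One way to reason about a candidate binding before trying it is to total the message traffic that would still cross cores. The sketch below is a toy model under stated assumptions: the process names, message counts, and placements are invented, and the actual binding would be done through the operating system's affinity mechanism, which is not shown.

```python
def cross_core_messages(placement, traffic):
    """Sum message counts for pairs whose endpoints sit on different cores.

    placement: maps process name -> core number
    traffic:   maps (sender, receiver) -> number of messages observed
    """
    return sum(
        count
        for (src, dst), count in traffic.items()
        if placement[src] != placement[dst]
    )


# Hypothetical traffic figures, as a tracing tool might report them.
traffic = {
    ("client", "server"): 900,
    ("server", "logger"): 50,
    ("ui", "logger"): 20,
}

before = {"client": 0, "server": 1, "logger": 0, "ui": 1}
after = {"client": 0, "server": 0, "logger": 1, "ui": 1}  # co-locate client and server

cost_before = cross_core_messages(before, traffic)
cost_after = cross_core_messages(after, traffic)
```

Placing the two processes that communicate most on the same core removes most of the cross-core traffic in this sample. Any such estimate still has to be confirmed by tracing the new configuration.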
Finding opportunities for parallelism
To maximize application performance on multi-core processors, the developer must take advantage of the true hardware parallelism offered by these chipsets. But to do that, the developer must first determine where parallelism will have the biggest payoff.
Starting with a system-tracing tool, the developer can single out processes, or threads within processes, that consume large amounts of CPU time. Once a particular process or set of processes has been identified, the developer can employ an application profiler, which analyzes function-level performance within individual processes, to determine which code inside a process or thread is consuming the most CPU.
For instance, Figure 2 below shows an application profiler session in which a function named nanospin_clock() uses the majority of the CPU time. Once compute-intensive functions have been identified, the developer can apply a parallelization strategy.
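The same idea can be tried with any function-level profiler. The sketch below uses Python's standard `cProfile` and `pstats` modules rather than the profiler shown in Figure 2; `busy_spin`, `light_work`, and `workload` are hypothetical functions invented for the illustration.

```python
import cProfile
import pstats


def busy_spin(iterations):
    """Stand-in for a compute-intensive function."""
    total = 0
    i = 0
    while i < iterations:
        total += i * i
        i += 1
    return total


def light_work():
    """Stand-in for a cheap function."""
    return len("hello")


def workload():
    light_work()
    return busy_spin(200_000)


profiler = cProfile.Profile()
profiler.runcall(workload)  # run the workload under the profiler

stats = pstats.Stats(profiler)
# stats.stats maps (file, line, function name) to a tuple of
# (primitive calls, total calls, total time, cumulative time, callers).
# Index 2 is the time spent in the function itself, excluding callees.
hottest_key, _ = max(stats.stats.items(), key=lambda item: item[1][2])
hottest_name = hottest_key[2]
```

Here `hottest_name` comes out as `busy_spin`, the function that dominates the run and so the first candidate for parallelization.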
For instance, if a compute-intensive algorithm has independent steps, breaking the algorithm into multiple functions and spawning those functions as separate threads can improve performance in a multi-core symmetric multiprocessing (SMP) system.
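A minimal sketch of that strategy follows, assuming a computation (a sum of squares, chosen only for illustration) whose chunks are independent of one another. It shows the structure only: in CPython the global interpreter lock limits the speedup of CPU-bound threads, so measured gains depend on the language runtime and the OS.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """One independent step: it touches only its own chunk of the data."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, workers=4):
    """Split the input into chunks and process each chunk in its own thread."""
    size = max(1, (len(values) + workers - 1) // workers)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is handled by a worker thread; partial results are combined.
        return sum(pool.map(sum_of_squares, chunks))


values = list(range(1000))
result = parallel_sum_of_squares(values)
```

Because the steps share no mutable state, no locking is needed, and the only coordination cost is combining the partial results at the end.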
This focused approach to parallelization using system-tracing tools helps ease the transition to multi-core platforms and enables applications to realize the increased throughput and performance that these architectures promise.
|Figure 2: With an application profiler, a developer can quickly pinpoint compute-intensive functions (in this case, nanospin_clock()) as well as display call-tree information to identify the execution paths consuming the most CPU cycles.|
Eliminating resource contention
The first thing a developer may notice after moving software to a multi-core environment is that performance doesn’t increase as much as expected. In many cases, this shortfall results from resource contention.
By allowing processes to run concurrently, a multi-core processor can expose resource contention issues never encountered on a uniprocessor system. For instance, a common SMP bottleneck occurs when two or more threads share a data structure protected by a mutual exclusion lock, or mutex.
A mutex protects a resource by preventing multiple threads from accessing the resource at the same time. Thus, if two threads on separate cores both access resources locked by the mutex, those threads can spend considerable time contending for the lock. Instead of running concurrently with one another, which is the main benefit of a multi-core design, the threads must take turns executing.
|Figure 3: A system profiler provides system-tracing information that can be used to optimize multicore performance. The top-left panel shows where the greatest amount of core-to-core communication and thread migration occurs. The top-right panel “zooms in” to identify exactly when a given thread is migrating from one core to another. The bottom panel shows which part of the system-tracing log file is being analyzed.|
For an example of resource contention, consider the routing table in a networking application. In a uniprocessor environment, only one process can access the routing table at a time.
In a multi-core environment, on the other hand, threads on each core can access the table simultaneously, thereby creating contention. Being able to identify the source of such contention can save a developer a significant amount of time and frustration. In fact, eliminating contention can have a huge performance payoff, even in systems that appear to run acceptably fast.
To help identify resource contention, a well-designed system-tracing tool can:
(1) highlight processes that are frequently ready to run but blocked
(2) generate statistics for threads that are blocked because of resource contention caused by threads on other cores
(3) provide a graphical representation of core-to-core messaging.
Other features, such as a search facility for finding specific system events and a timeline that graphically displays the flow of execution with high-resolution time stamping, will allow the developer to examine high-runner cases and pinpoint the root cause of the contention.
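The second kind of statistic in the list above, blocked time caused by threads on other cores, can be sketched as a simple aggregation. The `BlockEvent` record and the sample values are hypothetical; a real trace would have to be reduced to this form first.

```python
from collections import defaultdict
from typing import NamedTuple


class BlockEvent(NamedTuple):
    """One period in which a thread waited for a resource (hypothetical format)."""
    thread: str
    core: int          # core where the waiting thread was ready to run
    holder_core: int   # core running the thread that held the resource
    blocked_us: int    # how long the thread stayed blocked


def cross_core_blocked_time(events):
    """Total blocked time per thread where the holder ran on another core,
    sorted so the worst-affected thread comes first."""
    totals = defaultdict(int)
    for ev in events:
        if ev.core != ev.holder_core:
            totals[ev.thread] += ev.blocked_us
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


events = [
    BlockEvent("rx_thread", 0, 1, 400),
    BlockEvent("tx_thread", 1, 0, 150),
    BlockEvent("rx_thread", 0, 1, 250),
    BlockEvent("stats", 1, 1, 900),  # same-core wait, excluded from the report
]
report = cross_core_blocked_time(events)
```

Sorting by total blocked time puts the most likely contention victims at the top, which is where a timeline or event search would be pointed next.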
Armed with this information, the developer can decide how best to optimize the system. For instance, the developer may break the application into more threads to further parallelize the computation or remove the source of contention by, say, replicating the resource across cores.
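The replication option can be illustrated with a shared counter, chosen here only because its copies are trivial to merge. In the first function every thread takes the same mutex on each update; in the second, each thread updates a private copy and the copies are combined once at the end. This is a sketch of the pattern, and a structure such as a routing table would need its own scheme for keeping replicas consistent.

```python
import threading


def count_shared(n_threads, n_updates):
    """All threads update one counter guarded by a single mutex."""
    lock = threading.Lock()
    total = [0]

    def worker():
        for _ in range(n_updates):
            with lock:  # every update contends for the same lock
                total[0] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def count_replicated(n_threads, n_updates):
    """Each thread updates its own private copy; copies are merged once."""
    partial = [0] * n_threads

    def worker(idx):
        local = 0
        for _ in range(n_updates):
            local += 1  # no lock needed: nothing else touches this copy
        partial[idx] = local

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


shared_total = count_shared(4, 1000)
replicated_total = count_replicated(4, 1000)
```

Both versions produce the same total, but the replicated one acquires no lock in its inner loop, so the threads never have to take turns.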
Reducing unnecessary thread migration
To achieve optimal performance in a multi-core system, some OSs support soft processor affinity: the OS scheduler will always try to dispatch a thread to the core where the thread last ran. That way, the core can often fetch the thread’s instructions directly from L1 cache, rather than having to reload them from L2 cache or main memory.
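The dispatch preference can be expressed as a very small decision rule. The function below is a toy model and not how any particular scheduler is implemented; `pick_core` and its fallback of choosing the lowest-numbered idle core are assumptions made for the illustration.

```python
def pick_core(last_core, idle_cores):
    """Prefer the core the thread last ran on; otherwise fall back to an idle core.

    Returns None when no core is idle and the thread has to wait.
    """
    if last_core in idle_cores:
        return last_core  # cache contents from the previous run may still be warm
    return min(idle_cores) if idle_cores else None


same = pick_core(2, {0, 2})    # last core is idle, so the thread stays put
moved = pick_core(2, {0, 1})   # last core is busy, so the thread migrates
waiting = pick_core(2, set())  # nothing idle
```

The affinity is "soft" because the second branch exists: the thread migrates instead of waiting for its preferred core.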
In some cases, however, threads will drift from core to core, overwriting each other’s cached instructions and forcing the L1 cache to be continuously reloaded. Each time a thread is rescheduled to run on another core, the performance advantage of using information in L1 cache is lost.
Unnecessary thread migration can result from higher-priority threads interrupting lower-priority threads. Each time such an interrupt occurs, there is the possibility that the OS will schedule the interrupt-handling thread on another core. The more cores on the chip, the higher the probability this migration will occur.
By providing performance-counter information, a system-tracing tool can make it much easier to diagnose this condition. For instance, as Figure 3 above illustrates, the profiler can provide statistics on the number of core-to-core migrations for a particular thread. By displaying these statistics graphically, the profiler can give the developer a system-wide perspective, making it easier to zoom in on potential problem areas.
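A per-thread migration count of the kind described can be derived from scheduling events alone. The sketch below assumes a hypothetical event format of `(time_us, thread, core)` tuples sorted by time; the sample data is made up.

```python
from collections import Counter


def count_migrations(schedule_events):
    """Count how often each thread was dispatched to a different core
    than the one it ran on previously.

    schedule_events: iterable of (time_us, thread, core), sorted by time.
    """
    last_core = {}
    migrations = Counter()
    for _, thread, core in schedule_events:
        if thread in last_core and last_core[thread] != core:
            migrations[thread] += 1
        last_core[thread] = core
    return migrations


events = [
    (0, "worker", 0),
    (5, "net", 1),
    (10, "worker", 1),  # worker moves from core 0 to core 1
    (15, "worker", 0),  # and back again
    (20, "net", 1),
    (25, "worker", 0),
]
migrations = count_migrations(events)
```

Threads with high counts are candidates for closer inspection on the timeline, or for binding to a core if the migrations turn out to be unnecessary.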
A Comprehensive Approach
While the right tools can greatly simplify the task of troubleshooting and optimizing a multi-core design, they can’t work in isolation. An operating system, if designed correctly, will also help reduce the complexity of deploying software on a multi-core chip.
This is especially true if the OS can transparently manage the allocation of shared hardware resources in a multi-core chip. Moreover, an OS designed for multi-core can provide the flexibility to implement the best optimization strategy that tools suggest.
Robert Craig, PhD, is senior software developer, Derrick Keefe is development tools manager, and Paul Leroux is technology analyst at QNX Software Systems.
In Part 1 in this series of articles, the authors evaluate the two most popular multiprocessing models, symmetric (SMP) and asymmetric (AMP), and compare them to a new approach the company has developed, bound multiprocessing (BMP).
This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Silicon Valley 2006. Used with permission of the Embedded Systems Conference. Please visit www.embedded.com/esc/sv.
To learn about this general subject on Embedded.com go to More about multicores, multiprocessing and tools.