Development and Optimization Techniques for Multicore Processors

Microprocessor design is experiencing a shift away from a predominantfocus on pure performance to a balanced approach that optimizes forpower as well as performance.

Multi-core processors continuethis trend and are capable of sharingwork and executing tasks on independent execution cores concurrently.In many cases, taking full advantage of the performance benefits ofthese processors will require developers to thread their applications.

Multi-core processors are comprised of multiple processor cores inthe same package and offer increased performance, but at a cost to theembedded developer. In many cases, taking advantage of the performancebenefits requires developers to thread their applications. Effectivelythreading an application is a nontrivial task that requires domainknowledge in multi-core architecture, parallelism fundamentals, and athreading methodology.

This article is focused on three main topics: (1) an overview of multi-coreprocessor architecture and the benefit that multi-core processorsoffer; (2) a primer onparallelism and discusses key topics such as scalability, parallelism& threading techniques, and, (3) a threading methodology which can be employed to effectively thread,optimize, and debug an application.

This article is excerpted from a paper ofthe same name presented atthe Embedded Systems Conference Boston 2006. Used with permission ofthe Embedded Systems Conference. For more information, please visit

Multi-core Processors
There are several definitions of multi-core processor. In this paper,multi-core processor implies two or more homogenous processor cores inthe same physical package. An example of such a processor is theIntel Core Duo processor which iscomprised of two similarprocessor cores in the same die. Other more general definitions ofmulti-core may include heterogeneous processor cores and multipleprocessors in a system; however this paper limits discussion to theprevious.

Multi-core Architecture
The move to multi-core processor architectures in the embeddedprocessor industry has been motivated by several trends occurring overthe past few years. These trends are summarized as follows:

1) Moretransistors available per die ” As transistor process technologyshrinks, more transistors are available for use in a processor design.It is now feasible to place multiple processor cores in the samepackage.

2) Clock scalingreaching limits due to power constraints ” An empirical study suggeststhat a 1% clock increase results in a 3% power increase. Space, power,and fan noise of cooling is a key customer constraint in many marketsegments.

Figure 1 below providesinsight into why multi-core is particularly attractive today. The powerand performance of three different configurations running a set ofapplications are compared. These configurations from left-to-right are:

1. Standard processorover-clocked 20%
2. Standard processor
3. Two standard processors eachunder-clocked 20%

The observation for configuration1 compared to configuration 2 is that performance is increased 1.13x with a power increase of 1.73x.This compares with configuration3 where performance is increased 1.73x with a power increase of1.02x. Onthe right class of applications, multi-core processors offer higherperformance with smaller increase in power compared to straightfrequency scaling. Two caveats from this observation is that not allapplications will be able to take equal advantage of multi-coreprocessors and that more work is required from the developer to realizethese performance gains.

Figure1. Multi-core Performance & Power Scenario<

Figure 2, below shows twotypical multi-core processor architectures. A somewhat simple manner ofenvisioning a multi-core processor is as two processors connectedtogether inside the chip packaging as opposed to a multiprocessor wheremultiple processors may be in a system, but connected via the systembus. The two multi-core architectures in Figure 2 differ in the location ofconnection between the two processor cores.

Figure2a. Multi-core processor separate L2

Figure 2a above shows two independentprocessor cores, each with a level 2 cache, and sharing system bus andconnection to memory. Figure 2b below displays two independent processor cores sharing the level 2cache, system bus, and connection to memory.

Figure2b. Multi-core processor shared L2

There are advantages and disadvantages of the two layouts; howeverthe current trend is toward shared level 2 cache which offers thefollowing advantages:

1) Size allocatedper core can be dynamically adjusted with the potential of thetotal level 2 cache being available to applications that require it.
2) Sharingof itemsbetween cores. Threads on separate cores can synchronizethrough the faster level 2 cache as opposed to main memory or the nextlevel ofcache.

Benefits of Multi-core Processors
Multi-core processors offer developers the ability to apply morecompute resources at a particular problem. These additional resourcescan be employed to offer two types of advantages, improved turnaroundtime or solving larger problem domains. An improved turnaround timeexample is the processing of a transaction in a Point-Of-Sale system.

The time it takes to process a transaction may be improved by takingadvantage of multicore processors. While one processor core is updatingthe terminal display, another processor core could be tasked withprocessing the user input. A larger problem domain example would be theservers that handle backend processing of the transactions arrivingfrom the various Point-Of-Sale terminals. By taking advantage ofmulti-core processor, any one server could handle a greater number oftransactions in the desired response time.

Figure3. Multi-core Performance Scaling Goal

Figure 3, above showspossible performance scaling improvements derived from multi-coreprocessors. The area labeled “Supralinear” indicates a performanceimprovement that scales greater than the number of processors.Supralinear speedup is rare, but possible in cases where theapplication benefits from the collective increased cache size andlayout.

The first line shows the theoretical case where the performancescales with the number of processors. The “Typical scaling” curve showswhat a sample application may return performance-wise as the number ofprocessors used to run the application increases. This curve displayswhere there is diminishing performance as the number of processorsemployed is increased. Less than linear scaling is indicative of eitherinsufficient parallelism or communication overhead of the parallelism.The goal of your effort to employ parallelism is to increase the amountof scaling possible in your application.

Parallelism Primer
Understanding the process to effectively thread an application requiresa general understanding of key parallelism concepts. This sectionprovides an overview on parallelism including such topics as estimatingthe benefits of parallelism, different methods of employingparallelism, and the advantages and disadvantages of various threadingtechniques.

The implementation of parallelism in a system can take many forms;one commonly used type is shared memory parallelism which implies thefollowing:

1) Multiplethreads execute concurrently.
2) The threads sharethe same address space. This is compared to multiple processes whichcan execute in parallel but each with a different address space.
3) Threads coordinatetheir work.
4) Threads arescheduled by the underlying operating system and require OS support.

To illustrate the keys to effective parallelism, consider an examplein the physical world of multiple workers mowing a lawn. The firstconsideration is how to divide the work evenly. This even division oflabor has the effect of keeping each worker as active as possible.Second, the workers should each have their own lawn mower; not doing sowould significantly reduce the effectiveness of the multiple workers.Finally, access to items such as the fuel can and clipping containerneeds to be coordinated. The keys to parallelism illustrated throughthis example are generalized as follows:

1) Identify theconcurrent work.
2) Divide the workevenly.
3) Create privatecopies of commonly used resources.
4)   Synchronize access to unique shared resources.

Scalability is critical to any parallelism effort because the number ofprocessor cores available in a single package is projected to increasesubstantially over the next 10 years from two cores to perhaps 8 &16 cores.

Scalability with regards to parallelism is a measure of how theperformance of an application increases as the number of processorcores in the system increases.

Amdahl's law provides thekey insight into determining the scalability of an application. Inbrief, Amdahl's law states that the performance benefit of parallelismis limited by the amount of the application that must run serially,e.g. is not or cannot be parallelized. For example, take an imageprocessing application where 300 images are to be processed with thefollowing characteristics:

1) 10minutes ofinitialization that must run serially
2) 1 minute to process oneimage(processing of different images can beaccomplished in parallel)
3) 10 minutes ofpost-processing that must run serially

Table 1 below shows thecalculations of scalability and efficiency for a number of differentprocessing cores. Efficiency is a measure of how effectively the totalnumber of processors is being employed in running the application inparallel; the goal is a 100% measure of efficiency. In other words, toincrease the benefits of parallelism and take advantage of theavailable processor cores as much of the application should beparallelized as possible.

Table1. Image Processing Scalability

Parallelism Techniques
The benefits of future multi-core architectures scale with the effortemployed by the embedded system developer. The different levels ofeffort are summarized as follows:

1. No effort
2 . Take advantage of multipleprocesses executing on a symmetric multiprocessoroperating system kernel
3. Take advantage ofmultithreading on a symmetric multiprocessor operatingsystem kernel

The first level specifies no explicit effort to take advantage offuture multi-core processors. Some benefit can be reaped in this casefrom the ever increasing single thread processor performance beingoffered in successive generations of processor cores.

Second, parallelism can be derived in many cases just by employing asymmetric multiprocessor and multi-core aware operating system suchas Linux. System processes have theability to run concurrently withyour application process in this case. Scalability may be limited asthe number of cores made available increases; there are only a limitednumber of tasks an operating system needs to accomplish and canaccomplish in parallel.

The third level is true multithreading on a symmetric multiprocessor& multi-core aware operating system. This level can take advantageof the parallelism made available at the process level (level 2) aswell as parallelism that is expressed in the multiple threads in anapplication. In order to derive this benefit, it will be required tothread the application.

One consideration in parallelism is how to decompose the work in anapplication to enable dividing the task between the different processorcores. There are two categories of decomposition summarized as follows:

* Functionaldecomposition – division based upon the type of work
* Data decomposition- division based upon the data needing processing

Functional decomposition attempts to find parallelism betweenindependent steps in your application. For example, consider anintrusion detection system that performs the following checks on apacket stream:

1. Check forscanningattacks.
2. Check for Denial of Serviceattacks.
3. Check for penetrationattacks.

As long as each step above was an independent task, it would bepossible to apply functional decomposition and execute each stepconcurrently. Data decomposition involves finding parallelism betweendata items. For example, the image processing algorithm was an exampleof data decomposition, where each image (the data) could be processedindependently of the other images. In general, it is easier to scaleapplications with data decomposition as opposed to functionaldecomposition. Lastly, it is also possible to apply both types ofdecomposition to a particular problem.

Parallelism Limiters
All problems do not lend themselves to parallelization andunderstanding some common parallelism limiters is essential. Also, someproblems may be parallelized only after special handling of thesepotential limiters. The key limiters to parallelism are summarized as:

1. Data dependency – datavalues that have dependencies between them.
2. State in routines -routineswith values live across invocations

The code shown below (removing the OpenMP pragmas ) has adata dependency between iterations on the sum variable and wouldtypically limit running the code in parallel; however there are methodssuch as using the OpenMP reduction clause that allow parallelism.

An example of state in routines is a routine that maintains databetween invocations of the routine using statically defined variables.Examples of routine with state include typical memory allocationroutines, random number generators, and I/O routines.

There are two general techniques for creating routines and loopsthat can be parallelized. Routines that are made reentrant are capableof being run in parallel. These routines would not have dependenciesbetween other invocations of the same routines. A method of testing ifa particular loop can be parallelized is to run the loop backwards interms of its loop indexes.

Threading Techniques
There are several techniques in use today to thread applications. Twobroad categories are library-based and compiler-based threading.Examples of library-based threading include Windows Thread APIand POSIX threads. Two examples ofcompiler-based threading are OpenMP and auto-parallelization. The twoexamples of library-based threading technologies are explicit in thatthe developer is responsible for explicitly creating, controlling, anddestroying the threads via function calls.

Table 2 below shows acomparison between explicit threading technologies and compiler-basedtechnologies. Explicit threading tends to be more flexible thancompiler-based threading in adapting to the particular needs requiredto parallelize an application. Explicit threading can parallelize allof the same applications that compiler-based threading can handle andmore. Explicit threading tends to be more difficult to use thancompiler-based threading.

In compiler-based threading, the compiler handles threading of theapplication with input from the user coming from command line argumentsand/or language directives. Explicit threading can perform bothfunctional and data decomposition whereas compiler-based threading ismore suited for data decomposition. Finally, explicit threading modelssuch as POSIX have been ported to several operating systems. Theavailability of compiler-based threading is dependent on compilersupport and the underlying threading support.

Table2. Comparison of explicit threading and compiler-based threading

OpenMP is an open, portable, shared memory multiprocessing applicationprogram interface supported by multiple vendors on several operatingsystems and under the following programming languages: Fortran77,Fortran 90, C, and C++. OpenMP simplifies parallel applicationdevelopment by hiding many of the details of thread management andthread communication behind a simplified programming interface.

Developers specify parallel regions of code by adding pragmas to thesource code. In addition, these pragmas communicate other informationsuch as properties of variables and simple synchronization. The samplecode in the appendix is an OpenMP program that calculates the value ofpi by summing the area under a curve. The program is very similar tothe original serial version of the code except for the addition of afew lines of code. The key line is the following pragma,

#pragmaomp parallel for reduction (+:sum) private (x)

which specifies the for loop should be executed by a team ofthreads, temporary partial results represented by the sum variableshould be aggregated or reduced at the end of the parallel region byaddition, and finally the variable x is private, meaning each threadgets its own private copy. The keys in parallelizing the code aresummed up as follows:

* Identify theconcurrent work. The concurrent work is the area calculationencompassing different parts of the curve
* Divide the work evenly. The number of rectangle areas to compute is 100000 andis equally allocated between the threads.
* Create private copies ofcommonly used resources. The variable x needs to beprivate, as each thread's copy will be different.
* Synchronize access to uniqueshared resources. The only shared resource, step
 does not require synchronization in this example because it isonlyread by thethreads, not written.

Automatic Parallelization
Automatic Parallelization, which is also called auto-parallelization,analyzes loops and creates threaded code for the loops determined to bebeneficial to parallelize. Automatic parallelization is a good firsttechnique to try in parallelizing code, as the effort to do so isfairly low. The compiler will only parallelize loops that can bedetermined to be safe to parallelize. The following tips may improvethe likelihood of successful parallelization:

1) Expose thetrip count of loops whenever possible. Thecompiler has a greaterchance of parallelizing loops whose trip counts are staticallydeterminable.

2) Avoid placingfunction calls inside loop bodies. Function calls may have effects onthe loop that cannot be determined at compile time and may preventparallelization.

3) If available,use optimization reporting options. This option provides a summary ofthe compiler's analysis of every loop and in cases where a loop cannotbe parallelized, a reason as to why not. This is useful in that even ifthe compiler cannot parallelize the loop, the developer can use theinformation gained in the report to identify regions for manualthreading.

4) If available,adjust the command line option threshold needed forautoparallelization. The compiler estimates how much computation isoccurring not occur. This can be overridden by the threshold option. Reduce use of global and static variables. These types of variablesmake it more difficult for the compiler to identify and parallelizecode.

Explicit Threading
Explicit threading includes library-based methods such as Win32* APIand POSIXthreads. Spawning threads, thread synchronization, and threaddestruction are controlled by the developer by placing calls tothreading library routines in the application code.

Typically, a fair amount of modification is required to thread anexisting application using explicit threading. The process forthreading involves identifying regions to thread, encapsulating thecode into functions, and spawning threads based upon these functions.This process makes explicit threading very amenable to functionaldecomposition.

Other advantages over compiler-based threading include the abilityto create thread pools that make it possible to queue independent anddynamic tasks without explicit thread creation. In addition, explicitthreading allows threads to have priority and affinity which provides afiner level of control for performance and tuning.

For example, the supra-linear speedup observed in the threading ofan intrusion detection system was partially due to cache benefits madepossible by processor affinity.iv Explicit threading is very generaland can express many types of concurrency over compiler-basedtechniques, but does require some knowledge to know when it is best toapply explicit threadingv.

Threading Methodology
Embedded application development is aided by two components, the rightdevelopment tools and the right development process. This sectiondetails a threading development process that can aid developersemploying threading in their application. This process is iterative andconcludes when the application requirements have been met. The firststep before beginning this process is to tune the application forserial performance.

Serial tuning is less labor intensive and can lead to dramaticperformance improvements that can enhance the benefits of parallelizingif done properly. There is a caveat; some serial optimization can limitthe ability to parallelize. The best way to consider ifoveroptimization may inhibit parallelism is to determine if youroptimizations are adding the elements that inhibit parallelism definedin the previous Parallelism Limiters section.

A classic example of loop optimization that limits parallelism viathreading is stripmining which adds dependencies across iterations of aloop. Figure 4 below detailsthe threading development process which is summarized by these foursteps:

* Analysis ” Determining which portionof the application to thread.
* Design ” Deciding how to thread andimplementing.
* Debug ” Findingand fixing threading issues in the code.
* Tune ” Optimizing for threadperformance issues.

Figure4. Threading Development Cycle

Step 1: Analysis
The analysis phase seeks to answer the following questions:

* What areas ofthe code should be targeted for threading?
* How can theseareas be grouped together to offer the best opportunity for andperformance from threading?
* What is theexpected speedup from threading?

The target area of code can be determined by using a profiling toolto determine the hot spots in the code. A hot spot is a region of codethat executes relatively more frequently than other regions of code.Examples of performanceanalysis tools that can provide hot spot information include GNUgprof and IntelVTune Performance Analyzer.

These tools monitor the execution of an application with arepresentative input set and provide a correlation between source codeand the frequency of execution. Once the most time consuming functionsare identified, the next step is to determine if these hot spots can beeffectively threaded. Some functions may be very resource intensive,but do not lend themselves to being threaded.

This could be due to many reasons like a lack of parallelizableloops, or pe

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.