Multi-core processors are everywhere. In desktop computing, it is almost impossible to buy a computer today that doesn't have a multi-core CPU inside. Multi-core technology is also having an impact in the embedded space, where increased performance per Watt presents a compelling case for migration.
Developers are increasingly turning to multi-core because they either want to improve the processing power of their product, or they want to take advantage of some other technology that is 'bundled' with the multi-core package. Because this new parallel world can also represent an engineering challenge, this article offers seven tips to help ease those first steps towards using these devices.
It's not unnatural to want to use the latest technology in our favourite embedded design. It is tempting to make a design a technological showcase, using all the latest knobs, bells and whistles. However, it is worth reminding ourselves that what is fashionable today will be 'old hat' within a relatively short period. If you have an application that works well the way it is, and is likely to keep performing adequately within the lifetime of the product, then maybe there is no point in upgrading.
One of the benefits of recent trends in processor design has been the focus on power efficiency. Prior to the introduction of multi-core, new performance barriers were reached by providing silicon that could run at ever higher clock speeds. An unfortunate by-product of this speed race was that the heat dissipated from such devices made them unsuitable for many embedded applications.
As clock speeds increased, the physical limits of the transistor technology moved ever closer. Researchers looked for new ways to increase performance without further increasing power consumption. It was discovered that by turning down the clock speed and adding additional cores to a processor, it was possible to achieve a much improved performance-per-Watt measurement.
The introduction of multi-core, along with new gate technologies and a redesign of the most power-hungry parts of a CPU, has led to processors that use significantly less power, yet deliver greater raw processing performance than their antecedents.
An example is the Intel Atom, a low power IA processor which uses 45nm Hi-K transistor gates. By implementing an in-order pipeline, adding additional deep sleep states, supporting SIMD (Single Instruction Multiple Data) instructions and using efficient instruction decoding and scheduling, Intel has produced a powerful but not power-hungry piece of silicon. Taking advantage of the lower power envelope could in itself be a valid reason for using multi-core devices in an embedded design, even if the target application is still single-threaded.
Use advanced architectural extensions
All of the latest generation of CPUs have various architectural extensions that come for 'free' and should be taken advantage of. One very effective but often underused extension is support for SIMD; that is, doing several calculations in one instruction.
The Atom processor, for example, has dedicated SIMD execution units, as can be seen in Figure 1 below.
Figure 1: The internals of the Intel low power IA architecture
Developers often ignore these advanced operations because of the perceived effort of adding such instructions to application code. While it is possible to use these instructions by adding macros, inline assembler or dedicated library functions to the application code, a favourite of many developers is to rely on the compiler to automatically insert such instructions into the generated code.
One technique, known as 'auto-vectorisation', can lead to a significant performance boost for an application. In this technique the compiler looks for calculations that are performed in a loop. By replacing such calculations with, say, Streaming SIMD Extensions (SSE) instructions, the compiler effectively reduces the number of loop iterations required. Some developers have seen their applications run twice as fast by turning on auto-vectorisation in the compiler.
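As a minimal sketch of the kind of loop the auto-vectoriser targets (the function name and parameters here are illustrative, not from any particular application), consider a simple scale-and-offset pass over a float array:

```c
#include <stddef.h>

/* A loop of this shape is a classic auto-vectorisation candidate:
 * independent iterations and unit-stride array accesses. With
 * vectorisation enabled (e.g. gcc -O2 -ftree-vectorize), the
 * compiler can replace four scalar float operations per iteration
 * with a single SSE instruction on four packed floats. The
 * 'restrict' qualifiers promise the arrays do not overlap, which
 * removes an aliasing hazard that would otherwise block the
 * transformation. */
void scale_and_offset(float *restrict dst, const float *restrict src,
                      float scale, float offset, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * scale + offset;
}
```

Most compilers can report which loops were vectorised (for example via a vectorisation report option), which is a quick way to confirm the transformation actually happened.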
Like the power gains of the previous section, using these architectural extensions may be a valid reason in itself for using a multi-core processor, even if you are not developing threaded code.
Not all programs are good candidates for parallelism. Even if your program seems to need a 'parallel facelift', it does not necessarily follow that going multi-core will help you. For example, say your product is an application running real-time weather pattern simulations, based on data collected from a number of remote sensors.
The measurements of wind speed, direction, temperature and humidity are being used to calculate the weather pattern over the next 30 minutes. Imagine that the application always produces its calculation results too late, and that the longer the application runs, the worse the timeliness of the simulation becomes.
One could assume that the poor performance is because the CPU is not powerful enough to do the calculations in time. Going parallel might be the right solution, but how do we prove this? It could be that the real bottleneck is an IO problem, with the poor application performance caused by the implementation of the remote data collection rather than excessive CPU load.
There are a number of profiling tools available that can help form a correct picture of the running program. Such analysers typically rely on runtime architectural events that are generated by the CPU. Before you migrate your application to multi-core, it would be worth analysing the application with such a tool, using the information you glean to help in the decision-making process.
There are different ways that one can introduce parallelism into the high-level design of a program. Three common strategies are functional parallelism, data parallelism and software pipelining. In functional parallelism, each task or thread is allocated a distinct job; for example, one thread might be reading a temperature transducer, while another thread is carrying out a series of CPU-intensive calculations.
In data parallelism, each task or thread carries out the same type of activity. For example, a large matrix multiplication can be shared between, say, four cores, thus reducing the time taken to perform that calculation by a factor of four.
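The matrix example can be sketched as follows. This is a minimal illustration of the decomposition only: the matrix size, the number of chunks and the function names are all made up for the example, and the chunks are shown running one after another, where in a real threaded build each call would be handed to a separate worker on its own core.

```c
#include <stddef.h>

#define N 8          /* matrix dimension (illustrative) */
#define NUM_CORES 4  /* number of chunks to split the work into */

/* Compute rows [row_start, row_end) of C = A * B. Each chunk writes
 * a disjoint set of rows of C, so the chunks are independent and
 * could run on different cores without any locking. */
static void matmul_rows(const double A[N][N], const double B[N][N],
                        double C[N][N], size_t row_start, size_t row_end)
{
    for (size_t i = row_start; i < row_end; i++)
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Data-parallel decomposition: divide the N rows into NUM_CORES
 * equal ranges, one per worker. */
void matmul(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (size_t w = 0; w < NUM_CORES; w++)
        matmul_rows(A, B, C, w * N / NUM_CORES, (w + 1) * N / NUM_CORES);
}
```

The key property is that the row ranges do not overlap, which is what makes this decomposition safe to run concurrently.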
A software pipeline is somewhat akin to a production line, where a series of workers carry out a specific duty before passing the work on to the next worker in the line. In a multi-core environment, each worker, or pipeline stage, is assigned to a different core. In traditional parallel programming, much emphasis is laid on the scalability of an application. Good scalability implies that a program running on a dual-core processor would run twice as fast on a quad-core.
In embedded systems, scalability is less important because the execution environment of the end product tends not to change; the shelf-life of the end product is usually measured in years rather than months. When moving to multi-core, the embedded engineer need not be over-sensitive to the scalability of the design, but can instead use whatever combination of data and functional parallelism delivers the best performance.
Using high-level constructs
Threading is not a new discipline, and most operating systems have an API that allows the programmer to create and manage threads. Using these APIs directly in the code is quite tough, so the recommendation is to use a higher level of abstraction. One way of implementing threading is to use various high-level constructs or extensions to the programming language.
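To see why the raw API is tough going, here is a minimal sketch using POSIX threads (the function and structure names are invented for the example). Even this trivial two-worker sum needs an argument record, void-pointer casts, and explicit create/join plumbing:

```c
#include <pthread.h>
#include <stddef.h>

/* Argument/result record passed to each worker through a void*. */
struct work {
    const int *data;
    size_t     len;
    long       sum;   /* filled in by the worker */
};

static void *sum_worker(void *arg)
{
    struct work *w = arg;
    w->sum = 0;
    for (size_t i = 0; i < w->len; i++)
        w->sum += w->data[i];
    return NULL;
}

/* Split the array in two and sum each half on its own thread. */
long parallel_sum(const int *data, size_t len)
{
    pthread_t t1, t2;
    struct work lo = { data, len / 2, 0 };
    struct work hi = { data + len / 2, len - len / 2, 0 };

    pthread_create(&t1, NULL, sum_worker, &lo);
    pthread_create(&t2, NULL, sum_worker, &hi);
    pthread_join(t1, NULL);   /* wait for both workers to finish */
    pthread_join(t2, NULL);
    return lo.sum + hi.sum;
}
```

All of this boilerplate is exactly what the high-level constructs below hide from the programmer.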
OpenMP is a pragma-based language extension for C/C++ and Fortran that allows the programmer to very easily introduce parallelism into an existing program. The standard has been adopted by a number of compiler vendors including GNU, Intel and Microsoft.
A full description of the standard can be found at www.openmp.org. With OpenMP it is easy to incrementally add parallelism to a program. Because the programming is pragma-based, your code can still be built on compilers that don't support OpenMP; the compiler in this case would just issue a warning that it has found an unsupported pragma.
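A minimal sketch of this incremental style (the function here is illustrative): a serial loop becomes parallel with a single pragma, and nothing else changes.

```c
/* Sum of squares, parallelised with one OpenMP pragma. Built with
 * OpenMP support (e.g. gcc -fopenmp), the iterations are shared
 * across the available cores and the reduction clause combines the
 * per-thread partial sums safely. Built without OpenMP support, the
 * pragma is ignored (with a warning) and the loop simply runs
 * serially, producing the same result. */
double sum_of_squares(const double *x, int n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += x[i] * x[i];
    return total;
}
```

Note that without the `reduction` clause, the concurrent updates to `total` would be a data race; the clause is what makes this one-line migration correct.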
As stated earlier, functional parallelism is potentially more interesting than data parallelism when developing an embedded application. An alternative to using OpenMP is to use one of the newly emerging language extensions which supply similar functionality. It is expected that eventually such language extensions will be adopted by an appropriate standards committee. An experimental compiler with such extensions can be found at www.whatif.intel.com.
Another approach to traditional programming languages is to use a graphical development environment. There are a number of 'program by drawing' development tools that take care of all the low-level threading implementation for the developer.
One example is National Instruments' LabVIEW, which allows the programmer to design a program diagrammatically, by connecting a number of objects together. Adding support for multi-core can be as simple as adding a loop block to the diagram.
When programs run in parallel, they can be very difficult to debug, especially when using tools that are not enabled for parallelism. Identifying and debugging issues related to the use of shared resources and shared variables, synchronisation between different threads, and dealing with deadlocks and livelocks are notoriously difficult.
However, there is now a growing number of tools available from different vendors, specifically designed to aid the debugging and tuning of parallel applications. The Intel Thread Checker and Intel Thread Profiler are examples of tools that can be used to debug and tune parallel programs.
Where no parallel debugging tools are available for the embedded target you are working on, it is legitimate practice to use standard desktop tools, carrying out the first set of tests on a desktop rather than the embedded target. It is a common experience that threading issues appearing on the target can often first be captured by running the application code on a desktop machine.
Stephen Blair-Chappell is a Technical Consulting Engineer at Intel Compiler Labs.