If we take the uniprocessor concept of multiple ALUs a step further and add separate instruction pipelines, we get the idea behind simultaneous multi-threading (SMT). Referred to as hyper-threading by Intel, SMT makes the execution units appear to be separate logical CPUs to the software. Intel is not the only manufacturer with SMT-capable processors. Both DEC and Sun Microsystems have also worked on SMT-enabled processors.
Be careful not to confuse SMT with multi-core. With SMT, the logical CPUs still typically share one set of execution resources, one L1 cache, etc. This means that SMT will not result in massive increases in computational power because of contention over the shared resources. However, performance gains in the 20-40% range are possible provided that the instruction streams are sufficiently independent.
Symmetric Multi-Processing and Multi-Core
If we duplicate all of the register sets, caches, etc. of a processor, then we move up to a new level of performance. We now have the ability to run multiple independent instruction streams, each working on its own data, on multiple separate processors. This is referred to as Multiple Instruction-Multiple Data or MIMD processing.
If the processors are of the same type and see the same memory and busses, then we have a symmetric multi-processing (SMP) machine. This “shared everything” model allows for the operating system as well as applications to run on any processor that is available at the time. This has good and bad points.
On the plus side, we can perform load balancing across the processors. With N processors in a perfect world, we should get N times the performance of a single processor. Unfortunately, the world is rarely perfect. Resource contention, especially with respect to the memory bus, will prevent us from achieving perfect speed up with each additional processor.
However, with highly multi-threaded code and little or no communications between software entities (e.g., between processes) you can see 80-90% of the “N times” speed up.
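One common way to put rough numbers on this is Amdahl's law. It is not named in the text above, but it captures the same effect: any fraction of the work that cannot run in parallel caps the achievable speed up. The sketch below is illustrative only. The parameter `serial_fraction` and both function names are assumptions introduced for this example, and real SMP scaling also depends on bus and cache effects that this simple model ignores.

```c
/* Amdahl's law: estimated speed up on n processors when serial_fraction
 * (0.0 to 1.0) of the work cannot be parallelized.
 * Hypothetical helper for illustration only. */
double amdahl_speedup(double serial_fraction, unsigned n)
{
    if (n == 0)
        return 0.0;                 /* no processors, no work done */
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / (double)n);
}

/* Fraction of the ideal "N times" speed up that the model predicts. */
double parallel_efficiency(double serial_fraction, unsigned n)
{
    if (n == 0)
        return 0.0;
    return amdahl_speedup(serial_fraction, n) / (double)n;
}
```

With 5% serial work on 4 processors, this model predicts a speed up of about 3.48, or roughly 87% of the ideal, which falls inside the 80-90% range quoted above.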
SMP has been supported by many operating systems over the years. Windows XP, Linux, QNX, LynxOS and others have long supported SMP systems. Up to 64-way SMP systems have been built, but memory/bus contention typically starts to become a significant factor beyond 4 processors.
With multiple separate CPUs, even the distance between the processors starts to become a factor with signal propagation delays. This is one of the rationales behind the multi-core processor. Figure 1, below, shows a wildly simplified view of AMD’s Athlon-64 X2 dual-core processor.
Figure 1: AMD dual-core processor architecture (Source: AMD)
In this diagram, we can see two separate CPU cores with separate L1 and L2 caches. Notice that the DDR memory subsystem is still shared between the cores. Consequently, we will never be able to achieve perfect scalability. That being said, the X2 package is compatible with many existing Socket 939 motherboards. This means that you can upgrade many single-processor systems and convert them to SMP by simply swapping out the CPU.
These SMP monster CPUs are significant power consumers. Intel’s dual-core Pentiums start at 115 watts. AMD’s processors use a bit less, but even so, neither of these processor families lends itself to battery operation.
On the other hand, the Intel Centrino Duo and the forthcoming AMD X2 Turions are both dual-core processors with significantly lower power requirements. The Centrino Duo at 2.16 GHz is estimated to be in the 24-watt range, with the Turion being just slightly higher. And the low-end dual-core Centrinos are expected to be closer to 12 watts.
This is starting to make it conceivable to run with passive heat sinks while maintaining acceptable performance in applications such as user kiosks and point-of-sale systems for streaming media, à la downloading music to your iPod from a vending machine.
Currently there are three dual-core laptops on the market, with more in the offing. Unfortunately, most of the early adopters of this technology will probably be somewhat disappointed. This is largely due to the lack of multi-threaded applications that can take advantage of the multi-core processors. As the software catches up, the use of multi-core will, no doubt, increase.
Heterogeneous Multi-Core and Convergent Processors
While Intel and AMD duke it out for control of the desktop, other companies such as Texas Instruments and Analog Devices are more firmly pursuing the embedded space. TI’s OMAP processor line combines ARM9 or ARM11 CPUs with their legendary TMS320 series DSPs onto a single piece of silicon. Given the number of interfaces on this part, as can be seen in Figure 2, below, we really have a very highly integrated SOC.
In fact, a new term has already been coined to refer to these ASICs. They’re called Super-Systems on Chip (SSOCs). The manufacturers of these parts do not typically release them for general consumption. Rather, they require a commitment of several thousand units before they begin production. This easily lends itself to manufacturers of cell phone handsets, where it is not uncommon for a manufacturer to produce several million handsets of a particular model.
Figure 2: TI OMAP 1710 SSOC (Source: TI)
The availability of the DSP core lends itself well to streaming multimedia applications such as audio and video CODECs. Whereas most general-purpose microprocessors are targeted at multi-tasking capabilities, the DSP element is more targeted at setting up dedicated pipelining for data transforms. This orientation allows the developer to off-load computationally intensive tasks to the dedicated DSP core.
These DSP cores generally have low power consumption, which makes them well suited for battery operation. Unfortunately, the typical approach to DSP code development is still to write the majority of code in assembly language. This significantly slows development by requiring developers with specialized skill sets.
To address these skill set requirements, another new class of processors is becoming available. These so-called “convergent” processor cores combine RISC and DSP features into a single core. Examples of this class of processors include the Analog Devices Blackfin and the StarCore SC1000, SC2000 and SC v5 processors.
One approach to reducing the need for assembly language development is the increased use of intrinsic functions. Intrinsic functions look like function calls, but the compiler expands them into complex DSP operations. Unfortunately, since there is no current standard for the naming conventions of intrinsic functions, there is a large possibility for incompatibilities between intrinsic libraries from different vendors (and for different processors).
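To illustrate what an intrinsic typically stands in for, the sketch below shows a saturating 16-bit add written in portable C. On a DSP, a vendor intrinsic would usually map an operation like this onto a single instruction. The name `sat_add16` is a hypothetical placeholder chosen for this example, not any vendor's actual intrinsic.

```c
#include <stdint.h>

/* Portable stand-in for a saturating 16-bit add.
 * The name sat_add16 is hypothetical; a real port would call the vendor's
 * own intrinsic here and keep this body only as a fallback. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;  /* widen so the add cannot overflow */

    if (sum > INT16_MAX)
        return INT16_MAX;                   /* clamp instead of wrapping around */
    if (sum < INT16_MIN)
        return INT16_MIN;
    return (int16_t)sum;
}
```

Hiding the vendor-specific call behind a project-local wrapper like this is one possible way to contain the naming incompatibilities described above, since only the wrapper has to change when the code moves to a different processor.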
Another possible solution can be provided by C language extensions. For example, a subset of the DSP-C language extensions is now part of the ISO Embedded C specification. This set of extensions is designed to simplify the programming of DSPs using saturating types, fixed-point types and circular addressing features. With these extensions, a fixed-point dot product can be written directly in terms of fractional types rather than hand-scaled integers.
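As a rough, portable stand-in for a fixed-point dot product, the sketch below emulates saturating Q15 fractional arithmetic in plain C. Embedded C (ISO/IEC TR 18037) defines fractional types such as `_Fract` and a `_Sat` qualifier that would express this more directly, but compiler support for them varies by target, so they are not used here. The names `q15_t`, `sat_q15` and `dot_q15` are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Q15 fixed point: a 16-bit integer v represents the value v / 32768.
 * Hypothetical type name for this sketch. */
typedef int16_t q15_t;

/* Clamp a wide intermediate result into the Q15 range. */
q15_t sat_q15(int64_t v)
{
    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (q15_t)v;
}

/* Dot product of two Q15 vectors with a saturated Q15 result. */
q15_t dot_q15(const q15_t *a, const q15_t *b, size_t n)
{
    int64_t acc = 0;                        /* Q30 accumulator with headroom */

    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];

    /* Scale Q30 back to Q15. Division truncates toward zero and avoids
     * relying on the behavior of right-shifting a negative value. */
    return sat_q15(acc / 32768);
}
```

For example, the vectors {0.5, 0.5} and {0.5, 0.5} are stored as {16384, 16384} each, and the result is 16384, which represents 0.5. A result that would exceed the representable range is clamped rather than allowed to wrap.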
Additional extensions might include SIMD vector operations that leverage vector units such as AltiVec. Many such extensions became part of FORTRAN 95, and another set can be found in processing libraries such as Mercury Computer’s Scientific Application Library. As processors become more complex, we could start seeing a whole host of language extensions that should make it easier to harness the power of these advanced processor features.
Another feature of modern processors is the introduction of FPGAs with multiple processor cores on chip. Examples such as the Xilinx Virtex-II and Virtex-4 families provide custom intellectual property (IP) building blocks coupled with as many as 4 separate processors on the same silicon. Unlike multi-core SMP processors, these processors are typically only loosely coupled and are referred to as a “shared something” resource model.
Because these processor cores do not share the same memory resources, they are asymmetric. However, the inter-processor communication is typically bus oriented rather than network oriented. Essentially, these processors are miniature, non-uniform memory architecture (NUMA) clusters.
This type of architecture is best suited to multiple, disjoint applications that require only occasional communications between the processing elements. Each processor has its own L1 and L2 caches and a separate SDRAM block.
This approach boosts total memory bandwidth by giving each processor a separate memory bus. The separate memory busses then avoid the potential bus contention issues encountered with SMP systems. However, communication between the processors starts to become the limiting factor. The less communication between the processor elements, the better the performance improvement.
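One way to keep that communication cheap is a small one-way mailbox that one processor fills and the other drains. The sketch below is a generic single-producer, single-consumer ring buffer using C11 atomics, not a description of any particular Xilinx mechanism. It assumes a small memory region that both processors can access coherently; on hardware without cache coherency, an uncached region or a hardware mailbox would be needed instead. All names (`mailbox_t`, `mbox_send`, and so on) are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MBOX_SLOTS 8u               /* must be a power of two */

typedef struct {
    atomic_uint head;               /* next slot to write; advanced by the sender only */
    atomic_uint tail;               /* next slot to read; advanced by the receiver only */
    uint32_t    slot[MBOX_SLOTS];
} mailbox_t;

void mbox_init(mailbox_t *m)
{
    atomic_init(&m->head, 0u);
    atomic_init(&m->tail, 0u);
}

/* Called by the sending processor. Returns false if the mailbox is full. */
bool mbox_send(mailbox_t *m, uint32_t msg)
{
    unsigned head = atomic_load_explicit(&m->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&m->tail, memory_order_acquire);

    if (head - tail == MBOX_SLOTS)
        return false;
    m->slot[head % MBOX_SLOTS] = msg;
    /* Release: the message is visible before the new head index is. */
    atomic_store_explicit(&m->head, head + 1u, memory_order_release);
    return true;
}

/* Called by the receiving processor. Returns false if the mailbox is empty. */
bool mbox_receive(mailbox_t *m, uint32_t *msg)
{
    unsigned tail = atomic_load_explicit(&m->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&m->head, memory_order_acquire);

    if (head == tail)
        return false;
    *msg = m->slot[tail % MBOX_SLOTS];
    /* Release: the slot is free for reuse only after it has been read. */
    atomic_store_explicit(&m->tail, tail + 1u, memory_order_release);
    return true;
}
```

Because each index is written by only one side, neither processor ever needs a lock, and the only shared traffic is the two indices plus the message slots themselves.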
Asymmetric multi-processing (AMP) is generally encountered in situations where we network multiple processors together and assign them separate tasks. But if we shrink this down to a single chip, we can reduce the communications overhead, resulting in higher performance in a smaller footprint while simultaneously reducing power consumption and thermal dissipation.
Processors are mutating at an almost alarming rate. New tweaks to architectures, deeper pipelines, larger caches and faster clock speeds mean that we, as developers, must take the time to understand the processor architecture if we are to derive the maximum performance for our applications. Unlike the desktop world where we can hope that the virtual machine is fast enough for our application and simply boost the processor clock speed until we can meet our deadlines, embedded developers have to deal with the constraints associated with limited memory footprints, battery operation and thermal dissipation.
Michael E. Anderson is CTO/Chief Scientist at The PTR Group, Inc.
In Part 1 in this series, the author looked at advanced single processor alternatives involving the various SISD, SIMD and super-scalar architectures, where they are appropriate and the programming issues involved.
This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Silicon Valley 2006. Used with permission of the Embedded Systems Conference. Please visit www.embedded.com/esc/sv.
To learn about this general subject on Embedded.com, go to More about multicores, multiprocessing and tools.