Multicore microprocessors and embedded multicore SOCs have very different needs

The term “multicore” seems to be getting a lot of use these days. For example, there's an industry association dedicated to the idea, and the IEEE Computer Society's Computer magazine [1,2] recently devoted two cover stories to the concept. Like the elephant in the poem about the blind men [3], the term appears to mean many different things to different people depending on the context.

When used to describe PC-class microprocessors, the phrase nearly always refers to on-chip arrays of identical, single-ISA (instruction-set architecture) processors that handle processing loads using homogeneous or symmetric multiprocessing (SMP) and shared memory.

For SOC designs, the term may refer to shared-memory SMP architectures, but it can also mean heterogeneous (single-ISA or multiple-ISA), single-chip, asymmetric multiprocessing (AMP) designs, with or without shared memory. Therefore, whenever you see a reference to a multicore chip or design, you need to dig deeper to clarify how the term is being used.

SMP and AMP approaches with and without shared memory can be used to solve processing problems that are beyond the capabilities of an individual microprocessor. Multicore PC and server microprocessors based on the x86 architecture started to appear after Intel and AMD hit the clock-rate wall and could no longer increase single-core-processor clock rates the way they did throughout the 1990s.

The maximum clock rates of these processors approached 4 GHz, at the cost of excessive power consumption, heat dissipation, and electromigration-related reliability concerns.

The path to further increases in processor performance through increased clock rates appeared to be blocked. An alternate path involved putting two and then four identical processor cores (and later eight and probably 16 processor cores) on a chip, with all cores running at a lower clock rate to reduce power consumption and heat dissipation.
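To see why several slower cores can beat one fast core on power, it helps to recall the standard first-order expression for CMOS switching power, P = a * C * V^2 * f. The Python sketch below is only an illustration under stated assumptions: the capacitance, voltages, and clock rates are made-up operating points, not measurements of any Intel or AMD part, and the idea that a lower clock rate permits a lower supply voltage is assumed here rather than taken from the article.

```python
def dynamic_power(capacitance, voltage, frequency, activity=1.0):
    """First-order CMOS switching power: P = a * C * V^2 * f (watts)."""
    return activity * capacitance * voltage ** 2 * frequency


# Hypothetical operating points, chosen only to illustrate the trend.
C_EFF = 1.0e-9  # assumed effective switched capacitance per core, in farads

one_fast_core = dynamic_power(C_EFF, voltage=1.4, frequency=4.0e9)
two_slow_cores = 2 * dynamic_power(C_EFF, voltage=1.1, frequency=2.5e9)

print(f"one core  at 4.0 GHz, 1.4 V: {one_fast_core:.2f} W")
print(f"two cores at 2.5 GHz, 1.1 V: {two_slow_cores:.2f} W")
```

With these assumed numbers, the two slower cores together offer more aggregate clock cycles per second (5 GHz versus 4 GHz) while dissipating less switching power. Whether the software can actually keep both cores busy is a separate question.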

Figure 1 below, adapted from an article in Microprocessor Report [4], shows high-level block diagrams for upcoming quad-core processors from Intel and AMD. Although there are some architectural differences, both designs show the result of the need to distribute the processing load of a large operating system (generally Microsoft Windows) and its large application programs over several processors.

A large shared DRAM memory (hundreds or thousands of megabytes) is required to hold the operating system, applications, and data. Each of the processor cores therefore has at least two levels of SRAM cache memory (three in the case of AMD's Barcelona processor) to serve as speed adapters that isolate each processor's high-speed execution engine from relatively slow shared memory.

Figure 1: AMD and Intel Quad-Core x86 Microprocessors

Like barnacles, SRAM cache hierarchies have accumulated around general-purpose processor cores as the disparity between processor clock rate and memory speed has grown. Although essential to this sort of architecture, cache hierarchies are inherently inefficient because they keep multiple copies of data and instruction blocks.

There is always an overhead cost (in terms of time, power dissipation, and silicon area) associated with moving information among cache hierarchy levels, although processor architects work hard to minimize these overhead penalties.
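The "speed adapter" role of a cache hierarchy can be made concrete with the textbook average-memory-access-time calculation. The sketch below is a rough illustration, not a model of any processor named in this article: the function name, the cycle counts, and the miss rates are all hypothetical values chosen for the example.

```python
def average_access_time(levels, memory_latency):
    """Average memory access time for a cache hierarchy.

    levels: list of (hit_time, local_miss_rate) tuples, ordered from the
            level-1 cache outward.
    memory_latency: cost of going all the way to shared DRAM.
    """
    time = memory_latency
    # Work inward from DRAM: each level always pays its own hit time and
    # pays the cost of everything behind it only on a miss.
    for hit_time, miss_rate in reversed(levels):
        time = hit_time + miss_rate * time
    return time


# Hypothetical latencies in processor clock cycles.
DRAM_CYCLES = 200
no_cache = average_access_time([], DRAM_CYCLES)
two_level = average_access_time([(3, 0.05), (12, 0.20)], DRAM_CYCLES)
three_level = average_access_time([(3, 0.05), (12, 0.20), (40, 0.50)], DRAM_CYCLES)

print(f"no cache:     {no_cache:.1f} cycles per access")
print(f"two levels:   {two_level:.1f} cycles per access")
print(f"three levels: {three_level:.1f} cycles per access")
```

Under these assumed figures the caches hide most of the DRAM latency, which is the benefit. The overhead described above (copies held at several levels, plus the time and power spent moving blocks between them) is the price, and this simple model does not try to capture it.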

Other vendors of server processors have also taken the multicore path. Figure 2 below shows a high-level block diagram of Sun's Niagara II server processor. Niagara II contains eight multithreaded processor cores. Each processor core has its own level-1 instruction and data caches, and the processor cores share a large level-2 cache. Four memory controllers keep the caches filled.

Figure 2: Sun Niagara II 8-Core Processor

These first three examples of multicore microprocessors illustrate the stamp-and-repeat nature of multicore design for general-purpose processors. Because each general-purpose processor core must be able to handle any system task, the processor cores tend to be identical and tend to be arranged in regular, symmetric arrays.

Multicore processor chips and SOC designs for embedded applications can resemble the general-purpose multicore arrays, as shown by the IBM Cell Broadband Engine block diagram in Figure 3 below [2].

The 9-core Cell Broadband Engine contains eight independent synergistic processor elements (SPEs). Instead of a cache hierarchy, each SPE has a 256-kbyte local memory store that it uses for holding instructions and data. An SPE cannot directly access memory outside of its local store.

Instead, it relies on the intervention of an associated memory flow controller (MFC) to transfer words between the SPE's local memory and main memory. Transfers take place across a sophisticated, high-speed (205 Gbytes/sec), 4-ring interconnect called the element interconnect bus (EIB).

The ninth on-chip processor core, which is a general-purpose processor, is also attached to the EIB network and acts as the taskmaster, scheduling and initiating processing tasks on the SPEs. Only the on-chip general-purpose processor has cache memories.

Figure 3: IBM's 9-Core Cell Broadband Engine

Although the largely symmetric configuration of the block diagram for IBM's multicore Cell Broadband Engine (CBE) superficially resembles the block diagrams of the general-purpose multicore processor arrays from AMD, Intel, and Sun shown in Figures 1 and 2, there are important differences.

First, each of the CBE's SPEs has a local memory instead of a cache. Further, the SPEs do not share memory. Although the SPEs can dip into a shared memory through requests issued to their MFCs, the shared-memory address space is not part of any SPE's directly addressable memory space.

The MFCs contain memory-management units that provide access to the separate shared-memory space using the virtual address mapping defined by the lone on-chip general-purpose processor.
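The local-store arrangement described above leads to a distinctive get, compute, put programming pattern. The following Python sketch simulates that pattern in a deliberately simplified way. It is not the Cell programming interface: the class and function names are hypothetical, transfers are modeled as synchronous copies, and address translation is omitted entirely.

```python
LOCAL_STORE_BYTES = 256 * 1024  # 256-kbyte local store, as described above


class MainMemory:
    """Shared main memory, reachable only through a flow controller."""

    def __init__(self, size):
        self.data = bytearray(size)


class MemoryFlowController:
    """Hypothetical stand-in for an MFC: copies blocks between the spaces."""

    def __init__(self, main_memory, local_store):
        self.main = main_memory
        self.local = local_store

    def _check(self, local_addr, main_addr, size):
        if size < 0 or local_addr < 0 or local_addr + size > len(self.local):
            raise ValueError("transfer does not fit in the local store")
        if main_addr < 0 or main_addr + size > len(self.main.data):
            raise ValueError("transfer falls outside main memory")

    def get(self, local_addr, main_addr, size):
        """Copy a block from main memory into the local store."""
        self._check(local_addr, main_addr, size)
        block = self.main.data[main_addr:main_addr + size]
        self.local[local_addr:local_addr + size] = block

    def put(self, local_addr, main_addr, size):
        """Copy a block from the local store back to main memory."""
        self._check(local_addr, main_addr, size)
        block = self.local[local_addr:local_addr + size]
        self.main.data[main_addr:main_addr + size] = block


def double_block(main_memory, main_addr, size):
    """Process one block the way a local-store core would."""
    local_store = bytearray(LOCAL_STORE_BYTES)
    mfc = MemoryFlowController(main_memory, local_store)
    mfc.get(0, main_addr, size)      # explicit transfer in
    for i in range(size):            # computation touches the local store only
        local_store[i] = (local_store[i] * 2) % 256
    mfc.put(0, main_addr, size)      # explicit transfer out
```

The point of the sketch is the shape of `double_block`: the processing loop never indexes main memory, so every byte it works on has to be moved explicitly, in contrast to a cached core where the hardware moves data implicitly.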

IBM's CBE architecture demonstrates a key difference between general-purpose computing and server applications on the one hand and embedded applications on the other. Shared memory spaces benefit general-purpose computing applications, while the dissimilar, real-time tasks executed for embedded applications (such as audio, video, image, and network processing) benefit from more separation between the multiple processor cores.

Due to the highly asymmetric nature of embedded tasks, even within the same system on the same chip, the silicon efficiency of embedded multicore SOC designs can benefit from the use of diverse processor cores to execute the diverse tasks.

The block diagram of a Super 3G cellphone handset processor, shown in Figure 4 below, illustrates this situation [5]. (Tasks amenable to processor-based execution are shown in gray.)

Figure 4: Block Diagram of a Super 3G Cellphone Processor

Some of the tasks in the Super 3G handset processor involve multimedia (audio, video, and image) processing; some tasks involve running the user interface; some are baseband-processing tasks; and some (those on the left) are associated with transmission processing.

No two of these tasks resemble each other (like the parts of the elephant in Saxe's poem). Many of these tasks will run on their assigned processor without needing an RTOS (real-time operating system). Others will need only the thinnest kernel, while some may require an RTOS for task supervision.

Although general-purpose processor cores can perform all of the tasks shown in Figure 4 above, they cannot perform them efficiently and will likely need relatively high clock rates to perform the required processing in the allotted time. Processor cores more closely matched to the tasks can execute these tasks in far fewer clock cycles.

Such tailored processors will therefore be able to run at lower clock rates and will consequently consume less energy. Reducing energy consumption is absolutely critical in battery-powered applications such as a cellphone handset and is increasingly important even in line-powered embedded applications as energy costs climb.
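The link between cycle count and clock rate is simple arithmetic: a task that must finish a fixed amount of work per second needs a clock of at least cycles-per-unit times units-per-second. The sketch below works one example. The cycle counts and the frame rate are invented for illustration and do not describe any core or task in Figure 4.

```python
def required_clock_hz(cycles_per_unit, units_per_second):
    """Minimum clock rate that finishes every unit of work on time."""
    return cycles_per_unit * units_per_second


# Hypothetical video task: one frame must be processed every 1/30 second.
FRAMES_PER_SECOND = 30
GENERAL_PURPOSE_CYCLES_PER_FRAME = 20_000_000  # assumed
TAILORED_CYCLES_PER_FRAME = 4_000_000          # assumed

general_purpose_hz = required_clock_hz(GENERAL_PURPOSE_CYCLES_PER_FRAME,
                                       FRAMES_PER_SECOND)
tailored_hz = required_clock_hz(TAILORED_CYCLES_PER_FRAME, FRAMES_PER_SECOND)

print(f"general-purpose core: {general_purpose_hz / 1e6:.0f} MHz")
print(f"tailored core:        {tailored_hz / 1e6:.0f} MHz")
```

In this made-up case, a fivefold reduction in cycles per frame translates directly into a fivefold lower minimum clock rate for the same deadline.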

Although the benefits of general-purpose, SMP multicore microprocessors are clear, the law of diminishing returns can rear its ugly head above four cores [6], except for server applications where the number of users can be large.
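One familiar way to picture diminishing returns is Amdahl's law, which bounds the speedup of a workload whose serial portion cannot be spread across cores. Note that this is a swapped-in illustration, not the analysis in [6], and the 80% parallel fraction used below is an arbitrary assumption.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup over one core for a partly serial workload."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


PARALLEL_FRACTION = 0.8  # assumed: 80% of the work can run in parallel

for cores in (1, 2, 4, 8, 16):
    speedup = amdahl_speedup(PARALLEL_FRACTION, cores)
    print(f"{cores:2d} cores: {speedup:.2f}x")
```

Under this assumption, going from one core to four yields a 2.5x speedup, while quadrupling again to 16 cores reaches only 4x. A server handling many independent users behaves more like a workload with a parallel fraction close to 1, which is consistent with the exception noted above.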

The Super 3G cellphone processor shown in Figure 4 above illustrates that the large number of tasks running on complex embedded SOCs can benefit from a large number of heterogeneous processor cores. Many complex embedded applications similarly benefit.

Steven Leibson is the Technology Evangelist for Tensilica, Inc. He recently co-authored a book, Engineering the Complex SOC, with Tensilica's President and CEO Chris Rowen. Leibson formerly served as the Vice President of Content and Editor in Chief of the Microprocessor Report.

[1] Nidhi Aggarwal, et al, “Isolation in Commodity Multicore Processors,” Computer Magazine, June, 2007, pages 49-59.
[2] Michael Gschwind, et al, “An Open Source Environment for Cell Broadband Engine System Software,” Computer Magazine, June, 2007, pages 37-47.
[3] John Godfrey Saxe, “The Blind Men and the Elephant.”
[4] Jim McGregor, “The New x86 Landscape,” Microprocessor Report, May 14, 2007.
[5] Eisuke Miki, “Cell Phone Technology for Super 3G and Beyond,” Microprocessor Forum, San Jose, CA, May 22, 2007.
[6] Rakesh Kumar, et al, “Heterogeneous Chip Multiprocessors,” Computer Magazine, November, 2005, pages 32-38.
