By Steve Leibson, Tensilica, Inc.
The term "multicore" seems to be getting a lot of use these days. For
example, there's an industry association dedicated to the idea, and the
IEEE Computer Society's Computer magazine [1,2] recently devoted two
cover stories to the concept. Like the elephant in the poem about the
blind men [3], the term appears to mean different things to different
people depending on the context.
When used to describe PC-class microprocessors, the phrase nearly
always refers to on-chip arrays of identical, single-ISA (instruction-set architecture)
processors that handle processing loads using homogeneous or symmetric
multiprocessing (SMP) and
shared memory.
For SOC designs, the term may refer to shared-memory SMP
architectures but it can also mean heterogeneous (single-ISA or
multiple-ISA), single-chip, asymmetric
multiprocessing (AMP) designs, with or without shared memory.
Therefore, whenever you see a reference to a multicore chip or design,
you need to dig deeper to clarify how the term is being used.
SMP and AMP approaches with and without shared memory can be used to
solve processing problems that are beyond the capabilities of an
individual microprocessor. Multicore PC and server microprocessors
based on the x86 architecture started to appear after Intel and AMD hit
the clock-rate wall and could no longer increase single-core-processor
clock rates the way they did throughout the 1990s.
The maximum clock rates of these processors approached 4 GHz, at the
cost of excessive power consumption, heat dissipation, and
electromigration-related reliability concerns.
The path to further increases in processor performance through
increased clock rates appeared to be blocked. An alternate path
involved putting two and then four identical processor cores (and later
eight and probably 16 processor cores) on a chip, with every core
running at a lower clock rate to reduce power consumption and heat
dissipation.
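The trade-off described above can be sketched with the common
first-order approximation for CMOS dynamic power, P = a * C * V^2 * f.
The sketch below is illustrative only: the capacitance, voltage, and
frequency figures are hypothetical, and it assumes that a lower clock
rate also permits a lower supply voltage, which the article does not
state.

```python
def dynamic_power_w(c_eff_farads, vdd_volts, freq_hz, activity=1.0):
    """First-order CMOS dynamic power estimate: P = activity * C * V^2 * f."""
    return activity * c_eff_farads * vdd_volts ** 2 * freq_hz


# Hypothetical effective switched capacitance per core (not from the article).
C_EFF = 1.0e-9

# One core pushed to 4 GHz at an assumed 1.4 V supply.
single_core_w = dynamic_power_w(C_EFF, 1.4, 4.0e9)

# Two cores, each at 2.5 GHz and an assumed 1.1 V supply.
dual_core_w = 2 * dynamic_power_w(C_EFF, 1.1, 2.5e9)

print(f"single core at 4.0 GHz: {single_core_w:.2f} W")
print(f"two cores at 2.5 GHz:   {dual_core_w:.2f} W")
```

Under these assumed numbers, the two slower cores offer more aggregate
clock cycles per second (5 GHz versus 4 GHz) while dissipating less
dynamic power, because power falls with the square of the supply
voltage.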
Figure 1, below, adapted
from an article in Microprocessor Report [4], shows high-level block
diagrams for upcoming quad-core processors from Intel and AMD. Although
there are some architectural differences, both designs show the result
of the need to distribute the processing load of a large operating
system (generally Microsoft Windows) and its large application programs
over several processors.
A large shared DRAM memory (hundreds or thousands of megabytes) is
required to hold the operating system, applications, and data. Each of
the processor cores therefore has at least two levels of SRAM cache
memory (three in the case of AMD's Barcelona processor) to serve as
speed adapters that isolate each processor's high-speed execution
engine from relatively slow shared memory.
Figure 1: AMD and Intel Quad-Core x86 Microprocessors
Like barnacles, SRAM cache hierarchies have accumulated around
general-purpose processor cores as the disparity between processor
clock rate and memory speed has grown. Although essential to this sort
of architecture, cache hierarchies are inherently inefficient because
they keep multiple copies of data and instruction blocks.
There is always an overhead cost (in terms of time, power
dissipation, and silicon area) associated with moving information among
cache hierarchy levels, although processor architects work hard to
minimize these overhead penalties.
Other vendors of server processors have also taken the multicore
path. Figure 2 below shows a
high-level block diagram of Sun's Niagara II server processor. Niagara
II contains eight multithreaded processor cores. Each processor core
has its own level-1 instruction and data caches and the processor cores
share a large level-2 cache. Four memory controllers keep the caches
filled.
Figure 2: Sun Niagara II 8-Core Processor
These first three examples of multicore microprocessors illustrate
the stamp-and-repeat nature of multicore design for general-purpose
processors. Because each general-purpose processor core must be able to
handle any system task, the processor cores tend to be identical and
tend to be arranged in regular, symmetric arrays.
Multicore processor chips and SOC designs for embedded applications
can resemble the general-purpose multicore arrays, as shown by the IBM
Cell Broadband Engine block diagram in Figure 3 below [2].
The 9-core Cell Broadband Engine (CBE) contains eight independent
synergistic processor elements (SPEs). Instead of a cache hierarchy,
each SPE has a 256-kbyte local memory store that it uses for holding
instructions and data. An SPE cannot directly access memory outside of
its local store.
Instead, it relies on the intervention of an associated memory flow
controller (MFC) to transfer words between the SPE's local memory and
main memory. Transfers take place across a sophisticated, high-speed
(205 Gbytes/sec), 4-ring interconnect called the element interconnect
bus (EIB).
The ninth on-chip processor core, which is a general-purpose
processor, is also attached to the EIB network and acts as the
taskmaster, scheduling and initiating processing tasks on the SPEs.
Only the on-chip general-purpose processor has cache memories.
Figure 3: IBM's 9-Core Cell Broadband Engine
Although the largely symmetric configuration of the block diagram
for IBM's multicore CBE superficially resembles the block diagrams of
the general-purpose multicore processor arrays from AMD, Intel, and Sun
shown in Figures 1 and 2, there are important differences.
First, each of the CBE's SPEs has a local memory instead of a cache.
Further, the SPEs do not share memory. Although the SPEs can dip into a
shared memory through requests issued to their MFCs, the shared-memory
address space lies outside every SPE's directly addressable memory
space.
The MFCs contain memory-management units that provide access to the
separate shared-memory space using the virtual address mapping defined
by the lone on-chip general-purpose processor.
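The local-store arrangement described above can be modeled in a few
lines. This is a toy simulation, not the Cell programming interface:
the class and function names (MemoryFlowController, get, put,
scale_block) are hypothetical, and address translation and the EIB are
left out. It shows only the pattern in which a synergistic processor
element computes on its own local store while every access to main
memory goes through an explicit block transfer.

```python
LOCAL_STORE_BYTES = 256 * 1024  # 256-kbyte local store, as stated in the text


class MemoryFlowController:
    """Hypothetical stand-in for an MFC: copies blocks between main memory
    and one processor element's local store."""

    def __init__(self, main_memory, local_store):
        self.main = main_memory
        self.local = local_store

    def _check(self, main_addr, local_addr, size):
        # Reject transfers that fall outside either memory.
        if size < 0 or main_addr < 0 or local_addr < 0:
            raise ValueError("negative address or size")
        if local_addr + size > len(self.local):
            raise ValueError("transfer does not fit in the local store")
        if main_addr + size > len(self.main):
            raise ValueError("transfer runs past the end of main memory")

    def get(self, main_addr, local_addr, size):
        """Copy a block from main memory into the local store."""
        self._check(main_addr, local_addr, size)
        self.local[local_addr:local_addr + size] = \
            self.main[main_addr:main_addr + size]

    def put(self, local_addr, main_addr, size):
        """Copy a block from the local store back to main memory."""
        self._check(main_addr, local_addr, size)
        self.main[main_addr:main_addr + size] = \
            self.local[local_addr:local_addr + size]


def scale_block(mfc, main_addr, size, factor):
    """Fetch a block, multiply each byte (modulo 256) in the local store,
    then write the block back. The compute loop touches only local memory."""
    mfc.get(main_addr, 0, size)
    for i in range(size):
        mfc.local[i] = (mfc.local[i] * factor) & 0xFF
    mfc.put(0, main_addr, size)


main_memory = bytearray(range(16))
mfc = MemoryFlowController(main_memory, bytearray(LOCAL_STORE_BYTES))
scale_block(mfc, main_addr=4, size=4, factor=2)
print(list(main_memory))
```

The design point to notice is that the compute loop never indexes main
memory; data movement is explicit and under software control rather
than hidden behind a cache.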
IBM's CBE architecture demonstrates a key difference between
general-purpose computing and server applications on one hand and
embedded applications on the other. Shared memory spaces benefit
general-purpose computing applications, while the dissimilar, real-time
tasks executed for embedded applications (such as audio, video, image,
and network processing) benefit from more separation between the
multiple processor cores.
Due to the highly asymmetric nature of embedded tasks, even within
the same system on the same chip, the silicon efficiency of embedded
multicore SOC designs can benefit from the use of diverse processor
cores to execute the diverse tasks.
The block diagram of a Super 3G cellphone handset processor, shown
in Figure 4 below, illustrates
this situation [5]. (Tasks amenable to processor-based execution are
shown in gray.)
Figure 4: Block Diagram of a Super 3G Cellphone Processor
Some of the tasks in the Super 3G handset processor involve
multimedia (audio, video, and image) processing; some tasks involve
running the user interface; some are baseband-processing tasks; and
some (those on the left) are associated with transmission processing.
None of these tasks resembles the others (like the parts of the
elephant in Saxe's poem). Many of these tasks will run on their
assigned processor without needing an RTOS (real-time operating
system). Others will need only the thinnest kernel while some may
require an RTOS for task supervision.
Although general-purpose processor cores can perform all of the
tasks shown in Figure 4 above,
they cannot perform them efficiently and will likely need relatively
high clock rates to perform the required processing in the allotted
time. Processor cores more closely matched to the tasks can execute
these tasks in far fewer clock cycles.
Such tailored processors will therefore be able to run at lower
clock rates and will consequently consume less energy. Reducing energy
consumption is absolutely critical in battery-powered applications such
as a cellphone handset and is increasingly important even in
line-powered embedded applications as energy costs climb.
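The arithmetic behind this argument is simple enough to sketch. The
cycle counts, task rate, and energy-per-cycle figure below are
hypothetical, chosen only to show the relationship. The sketch also
assumes both cores spend the same energy per clock cycle, which is a
simplification.

```python
def required_clock_hz(cycles_per_task, tasks_per_second):
    """Minimum clock rate needed to finish every task before the next arrives."""
    return cycles_per_task * tasks_per_second


def energy_per_task_j(cycles_per_task, joules_per_cycle):
    """Energy for one task, assuming a fixed energy cost per clock cycle."""
    return cycles_per_task * joules_per_cycle


TASKS_PER_SECOND = 100       # hypothetical: one task every 10 ms
JOULES_PER_CYCLE = 1.0e-10   # hypothetical: 0.1 nJ per cycle for both cores

# Hypothetical cycle counts for the same task on two kinds of core.
general_purpose_cycles = 5_000_000
tailored_cycles = 500_000

for name, cycles in (("general-purpose", general_purpose_cycles),
                     ("tailored", tailored_cycles)):
    clock_mhz = required_clock_hz(cycles, TASKS_PER_SECOND) / 1e6
    energy_uj = energy_per_task_j(cycles, JOULES_PER_CYCLE) * 1e6
    print(f"{name}: {clock_mhz:.0f} MHz, {energy_uj:.0f} uJ per task")
```

With these assumed figures, a tenfold reduction in cycle count lowers
both the required clock rate and the energy per task by the same
factor. In practice the saving could be larger if the lower clock rate
also allows a lower supply voltage.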
Although the benefits of general-purpose, SMP multicore
microprocessors are clear, the law of diminishing returns can rear its
ugly head above four cores [6], except for server applications where the
number of users can be large.
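One common way to illustrate diminishing returns of this kind is
Amdahl's law. It is offered here as an illustration and is not
necessarily the analysis used in [6]. The parallel fraction of 0.75 is
a hypothetical value.

```python
def amdahl_speedup(cores, parallel_fraction):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    if cores < 1 or not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("need cores >= 1 and 0 <= parallel_fraction <= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


PARALLEL_FRACTION = 0.75  # hypothetical share of the workload that parallelizes

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} cores: speedup {amdahl_speedup(n, PARALLEL_FRACTION):.2f}")
```

With three quarters of the work parallelizable, the speedup can never
exceed 4 however many cores are added. Going from one core to four
gains roughly 1.29x in speedup, while going from four to sixteen gains
only about 1.08x more.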
The Super 3G cellphone processor shown in Figure 4 above illustrates that the
large number of tasks running on complex embedded SOCs can benefit from
a large number of heterogeneous processor cores. Many complex embedded
applications similarly benefit.
Steven Leibson is the Technology
Evangelist for Tensilica, Inc. He recently
co-authored a book, Engineering the Complex SOC, with Tensilica's
President and CEO Chris Rowen. Leibson formerly served as the Vice
President of Content and Editor in Chief of the Microprocessor Report.
References:
[1] Nidhi Aggarwal, et al, "Isolation in Commodity Multicore
Processors," Computer Magazine, June 2007, pages 49-59.
[2] Michael Gschwind, et al, "An Open Source Environment for Cell
Broadband Engine System Software," Computer Magazine, June 2007,
pages 37-47.
[3] John Godfrey Saxe, "The Blind Men and the Elephant."
[4] Jim McGregor, "The New x86 Landscape," Microprocessor Report,
May 14, 2007.
[5] Eisuke Miki, "Cell Phone Technology for Super 3G and Beyond,"
Microprocessor Forum, San Jose, CA, May 22, 2007.
[6] Rakesh Kumar, et al, "Heterogeneous Chip Multiprocessors,"
Computer Magazine, November 2005, pages 32-38.