Processor Architectures Step Out to
New Performance, Power Levels
Embedded processor architectures proliferate, with higher clock
rates, chip-level MP, and full MP systems in SOCs.
Only a few years ago, embedded processors were a ho-hum mix of a
few RISCs, some RISC cores, and a set of aging microcontrollers.
Things have changed. Today's designers have a veritable buffet of
processor and co-processor designs and design technologies to
choose from. Processor design, powered by faster, denser, and more
plentiful silicon, has broadened beyond a simple CPU, caches, and
peripherals on a single chip.
Today's embedded processor choices include:
Chip-level MP—Multiprocessing on-chip is a
reality. Implementations range from handset SOCs integrating a DSP
+ uC, to multiple DSPs on a common bus.
True Systems-on-Chip—SOCs have gone beyond just a
processor, its memory bus and peripherals. SOCs are taking on the
aspects of systems, with multiple processors, common memory, and
peripherals, with sophisticated system buses to tie it all
together.
Parallel Processor Element Designs—Alternative architectures
built from multiple Processing Elements (PEs) that deliver massive
amounts of processing power by ganging the PEs. These processors
may function as a co-processor or be integrated with a host CPU in
an SOC.
Extensible Processors—RISCs that can be extended at
the ISA level and that rely on system-level logic synthesis to
integrate the designs.
Add-On Functionality—RISC and DSP architectures that
enable vendors and 3rd parties to add logic functionality at the
ISA level. They rely on logic synthesis to integrate the new
functions into the design.
And last, but not least, CPU speeds are up. Embedded processors are
moving up the speed curve. Embedded PowerPCs, for example, are
pushing past 500 MHz, with 700 MHz in sight by year's end. And some
embedded processors are cracking the 1 GHz barrier. SiByte has a
proprietary MIPS64 implementation, the SB-1, that will hit 1 GHz.
Silicon-Driven Revolution
It is a technical commonplace, a cliché, that silicon technology
follows Moore's Law, roughly doubling the number of transistors (or
functionality, or clock rates, or capabilities) every 18 months to
two years. It is also true. And we are now seeing the benefits of
this relentless advance up the silicon curve (time vs. capability).
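To put the doubling rule in concrete terms, here is a minimal sketch in Python. The function name and the 7 M starting count are illustrative choices, and the smooth exponential is only a rough model of the trend.

```python
def projected_transistors(start_count, months, doubling_period_months=18):
    """Project a transistor count under a simple doubling rule.

    Assumption: growth is a smooth exponential with a fixed doubling
    period (18 months by default; 24 months is the other figure cited).
    """
    return start_count * 2 ** (months / doubling_period_months)


if __name__ == "__main__":
    start = 7_000_000  # illustrative starting point, not a measured value
    for months in (0, 18, 36, 54):
        count = projected_transistors(start, months)
        print(f"{months:2d} months: {count / 1e6:.0f} M transistors")
```

Passing doubling_period_months=24 shows how much slower the curve climbs under the two-year reading of the rule.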
We are now seeing lower core voltages, accompanied by lower power
dissipation, higher clock rates, smaller geometries, and much, much
higher transistor counts. We can do more, faster, with less power,
and even cheaper than before. Core voltages are dropping to 1.2 to
1.8 V for the more advanced designs, on their way to 1.0 V or less.
Feature sizes for CMOS designs have come down to 0.12 to 0.18
microns.
These lower core voltages and smaller geometries, coupled with
power-saving design techniques, have brought down chip-level power
dissipation in spite of higher clock rates. Power dissipation for
many embedded processors, even advanced ones, has come down to a
few watts, or even 500 mW or so for low-power SOCs. Transistor
counts for basic processors are now up to 4 to 7 M transistors to
accommodate more on-chip caches and RAM, going out to 45 M
transistors or more for high-end media processors.
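Why a lower core voltage can win even as clocks rise follows from the standard first-order model of CMOS switching power, P ~ C x V^2 x f. The sketch below applies that relation. The function name and the 1.5x clock increase are hypothetical, and the model ignores leakage.

```python
def relative_dynamic_power(v_old, v_new, clock_scale=1.0, cap_scale=1.0):
    """Ratio of new to old switching power under P ~ C * V^2 * f.

    First-order model only: leakage current is ignored.
    """
    return cap_scale * (v_new / v_old) ** 2 * clock_scale


if __name__ == "__main__":
    # Core voltage drops from 1.8 V to 1.2 V (the range cited above)
    # while the clock runs 1.5x faster (a hypothetical figure).
    ratio = relative_dynamic_power(1.8, 1.2, clock_scale=1.5)
    print(f"new power is {ratio:.2f}x the old power")
```

Because voltage enters as a square, the voltage drop outweighs the faster clock in this example.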
This silicon is not free. But it is available. And with today's
higher and higher clock rates, processors need more on-chip memory
to minimize off-chip memory access delays. So processors are
bulking up on on-chip memory. Most RISCs, for example, can now
afford to run with 16 KB or even 32 KB Instruction and Data caches.
And many, following Intel's Pentium III, are moving toward large
on-chip L2 caches to localize processing and to minimize off-chip
memory access delays. IBM's PowerPC 750CX, for example, brings the
L2 cache on-chip, adding 256 KB of L2 cache and removing the
backside bus from its package.
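The payoff of on-chip memory can be estimated with the textbook average-memory-access-time relation: hit time plus miss rate times miss penalty. Every cycle count and miss rate below is hypothetical, chosen only to show the shape of the tradeoff, not to describe any processor named here.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time: hit time + miss rate * miss penalty."""
    return hit_cycles + miss_rate * miss_penalty_cycles


if __name__ == "__main__":
    # Hypothetical numbers: 1-cycle L1 hit, 5% L1 miss rate,
    # 60 cycles to reach off-chip memory.
    off_chip_only = average_access_cycles(1, 0.05, 60)

    # Add a hypothetical on-chip L2: 8-cycle hit, and only 20% of the
    # L1 misses still have to go off-chip.
    l2_penalty = average_access_cycles(8, 0.20, 60)
    with_on_chip_l2 = average_access_cycles(1, 0.05, l2_penalty)

    print(f"no L2:      {off_chip_only:.1f} cycles per access")
    print(f"on-chip L2: {with_on_chip_l2:.1f} cycles per access")
```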
Silicon Magic's DVine network/media processor takes advantage of
today's plentiful silicon. It bulks out at 45 M transistors, many
of which are used for on-chip memory and on-chip DRAM. The media
processor has 4 MB of on-chip DRAM, organized in 1-MB memory units.
Each memory unit has its own Memory Streaming Processor, which
processes the data as it streams into or out of the unit.
The cement that binds these technical advantages to more advanced
designs is logic synthesis. Without HDL-based design and the logic
synthesis to transform HDL designs into silicon, current designs
would not be possible. With chips running from 7 to 45 M+
transistors, the time for gate-level hand design is over. Today's
complex designs are the products of a new generation of synthesis
tools. These tools enable designers to easily add IP or to extend
their processor designs.
Chip-Level MP
This year has been a watershed in embedded processor design. This
was the year that chip-level MP became a reality, the year that
SOCs moved from being a way to integrate a processor with its
peripherals on one piece of silicon to taking on the
characteristics of true systems. Multiple processors on an SOC
became a working reality, one that designers could count on to
deliver a large amount of MP processing power within a realistic
silicon budget.
SOC MP ranges from paired processors, such as a RISC paired with a
uC, to full-scale MP architectures with multiple RISC or DSP
processors. In addition, a new class of MP processing has emerged,
that of multiple specialized processors or coprocessors arranged in
sequential processing order or in processing arrays. This latter
class represents the deployment of specialized math, vector,
graphic, or media processors, which collectively can deliver a very
high level of performance at modest clock rates.
Taking advantage of today's plentiful silicon, vendors are packing
multiple processors on a single die to minimize design chip counts
and costs. For example, Motorola is now integrating its 32-bit
M•CORE RISC with its 4th generation, 16-bit fixed-point
StarCore VLIW DSP for one-chip wireless transceivers. Another
example of such high-level on-chip integration is Lucent's StarPro,
which integrates 3 StarCore DSPs with 768 KB of RAM and on-chip
peripherals. A third example is Infineon's Carmel DSP, which
supports on-chip MP, with 4 or more DSPs integrated on a single
chip.
Clocks vs. Execution Units
There's a new variation on an age-old tradeoff: clock rates vs.
execution units. The idea is that maybe we don't have to go faster
if we can do more in parallel. With lots of parallel execution
units, we can run the units at slower clock rates and still get
GHz-level performance without straining the silicon. It's a
variation of the "wider rather than faster" design theme. If you
think about it, that's precisely what superscalar RISC, VLIW, and
SIMD are all about: deploying more execution units in parallel.
Sounds good, but most superscalar RISCs, VLIWs, and SIMDs can't get
that many execution units chugging away in parallel. For example, a
4-way superscalar RISC will run 4 execution units in parallel. At
best, an 8-way VLIW like TI's C6x has 8 units executing in
parallel. SIMDs do a bit better, especially for 8-bit operations: a
128-bit SIMD like Motorola's PowerPC G4 does 16 operations in
parallel. But if you need 16-bit accuracy, it does only eight
operations in parallel.
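The SIMD figures are simply the register width divided by the element width. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def simd_lanes(register_bits, element_bits):
    """Number of elements one SIMD instruction operates on."""
    if register_bits % element_bits:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


if __name__ == "__main__":
    print("4-way superscalar RISC: 4 operations per cycle")
    print("8-way VLIW:             8 operations per cycle")
    for bits in (8, 16, 32):
        lanes = simd_lanes(128, bits)
        print(f"128-bit SIMD, {bits:2d}-bit data: {lanes} operations per cycle")
```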
However, there's another way to get more parallel processing power
and deliver massive amounts of execution MIPS at relatively low
clock rates: increase the number of parallel execution units that
can be deployed together. Today's emerging parallel designs are all
over the place architecturally, but all get their top-level
performance by ganging multiple execution units for massive
parallelism.
A number of these massively parallel execution unit CPU designs
have emerged. They include:
BOPS Mantra—A very innovative MP design,
arranging arrays of PEs for math-oriented or graphics-oriented
processing. The Mantra has a 2 x 2 array of PEs, ganging together
20 parallel execution units for large-scale processing. The array
links to an interchange switch, which can cross over results to the
PEs for the next step in complex vector and DSP processing. Each PE
has 5 execution units (ALU, MAU, DSU, LOAD, and STORE), plus a data
memory and a VLIW instruction memory to drive the execution units.
The processor can act as an SIMD, with the same instructions being
stored locally at each PE. At 200 MHz, the Mantra can deliver 4,000
peak MIPS.
Chameleon CS2000—This is a dynamically reconfigurable
MP design that pairs an on-chip ARC RISC host with a 32-bit
reconfigurable processing fabric. The fabric has 4 slices, with 3
tiles per slice. Each tile has 4 memory buffers, 7 datapath units,
and 2 multipliers, for a total of 24 multipliers and 84 datapath
units or computing cells. Each datapath unit can take in two data
streams and perform shifts, word swapping, dual 16-bit adds, and
Verilog or C operators. It is configurable with FPGA-like programmable
local and layer interconnects and datapath cells. Running at 125
MHz, each slice can deliver 3,375 peak MIPS (125 MHz x 3 tiles x 9
units). If you count all four pipelined slices (one feeding the
next in a dataflow fabric), it rises to 13,500 peak MIPS.
Improv's Jazz—This is a multi-level, chip-level MP
architecture. The Jazz PSA-32-STD5's first layer consists of 5 task
engines each with I and D memory, and a shared memory with the next
task processor over. The task processors connect to a common I/O
module and link via a circular Q-Bus, one to the next, to the I/O
module. The task engines are individually configurable VLIWs with
up to 16 execution units linked with a crossbar to 4 memory ports
and a control unit. The STD-5 task engine is configured with 7
execution units: 2 ALUs, 2 MACs, 1 SHIFT, 1 Counter, and 1
BYTESWAP. This is more of a dataflow architecture with data flowing
between the task engines or processing stages, down to the
individual processing elements. Running at 100 MHz, each task
engine delivers up to 700 peak 32-bit MIPS. A Jazz PSA-STD5 with 5
task engines delivers up to 3,500 peak 32-bit MIPS.
Silicon Magic's DVine SM2700—A heavyweight, scalable
media processor that deploys multiple PEs coupled with large arrays
of on-chip DRAM. The SM2700 includes 6 RISC processors and 6 Vector
Processors. The units and memory are linked point-to-point with an
on-chip crossbar. The memory units are front-ended by a pipelined
streaming memory processor. It does media manipulation on the data
as it streams by (interpolation, decimation, data alignment,
address translation). The RISC is scalar and the Vector Processor
operates on 128-bit vectors and supports SIMD operations, as well
as zero-overhead looping and horizontal data swapping. Running at
200 MHz, the SM2700 can deliver 6,000 peak 32-bit MIPS (200 x (6 +
(6 x 4))). For 16-bit operations, the peak MIPS rate rises to
10,800. And that doesn't count the streaming memory processors,
which contribute 2 operations per cycle (on 128-bit data).
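The peak figures quoted for the Mantra, the Jazz, and the SM2700 all come from the same arithmetic: clock rate times the number of units that can each complete one operation per cycle. A minimal sketch, using only clock rates and unit counts given in the list above (the function name is an illustrative choice):

```python
def peak_mips(clock_mhz, units_per_cycle):
    """Peak MIPS, assuming every unit completes one operation per clock."""
    return clock_mhz * units_per_cycle


if __name__ == "__main__":
    results = {
        # 2 x 2 array of PEs, 5 execution units per PE, 200 MHz
        "BOPS Mantra": peak_mips(200, 2 * 2 * 5),
        # one task engine with 7 execution units, 100 MHz
        "Jazz task engine": peak_mips(100, 7),
        # five such task engines
        "Jazz, 5 task engines": 5 * peak_mips(100, 7),
        # 6 scalar RISCs + 6 vector units, 128 / 32 = 4 lanes each
        "SM2700, 32-bit": peak_mips(200, 6 + 6 * (128 // 32)),
        # same chip with 128 / 16 = 8 lanes per vector unit
        "SM2700, 16-bit": peak_mips(200, 6 + 6 * (128 // 16)),
    }
    for name, mips in results.items():
        print(f"{name:22s} {mips:6,d} peak MIPS")
```

These are peak numbers only. They assume every unit is busy on every cycle, which real code rarely achieves.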
Of course, you can also bulk up the number of deployed execution
units by integrating multiple superscalar RISCs. Running n RISCs,
each with 4 execution units, can deliver a peak of 4 x n parallel
execution units. One such example is Infineon's Carmel DSP, which
now supports on-chip MP operations on a common on-chip FPI bus.
Four such DSPs can deliver 1.2 MIPS performance, running at 166
MHz. Each DSP deploys 3 functional execution units, for a peak of 3
operations per cycle per DSP and 12 parallel execution units in
all.
Extensible Processors
Maybe fixed instruction sets and standardized ISAs aren't the best
way to get efficiency in embedded applications. Maybe the better
way is to tailor the instruction set of your SOC processor. A few
good instructions can save milliseconds. Years ago, engineers and
programmers settled on standardized processor architectures with a
fixed ISA to maximize software life and to minimize reprogramming.
Instruction sets remained fixed, while architectural
implementations varied, taking advantage of new technology and
design methodologies. While this approach did deliver
cost-effective processing, it forced the programmers to tailor
their software to the application problem, even if a hardware
assist would be much more cost effective.
Today, there is a third way, one between fixed instruction sets and
custom ISAs. With this approach, you modify the hardware and
instruction set to add specialized functionality, but do so with
full software compatibility. A new instruction or hardware
capability would be automatically included in the assemblers,
compilers, libraries, and operating software. Thus you can add a
new instruction, one that the operating system will automatically
use, or one that provides new hardware functionality available to
programmers as one or more new instructions used in functional
libraries. This addition of instruction functionality can make a
big difference for many embedded applications that handle
specialized interfacing and packet processing tasks. The right
specialized instruction could handle some tasks in one instruction
that now take 10 to 20 or more standard instructions.
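A rough way to gauge the benefit is to count the instructions that disappear when each multi-instruction sequence collapses into one custom instruction. The sketch below does that bookkeeping. The function name and the workload numbers are hypothetical, and it assumes the custom instruction issues like any other single instruction.

```python
def instructions_after_extension(total, occurrences, sequence_length):
    """Instruction count once each `sequence_length`-instruction
    sequence is replaced by a single custom instruction."""
    if occurrences * sequence_length > total:
        raise ValueError("the sequences exceed the total instruction count")
    return total - occurrences * (sequence_length - 1)


if __name__ == "__main__":
    # Hypothetical packet-processing loop: 1,000 instructions per
    # packet, of which a 15-instruction field-handling sequence
    # runs 40 times.
    before = 1_000
    after = instructions_after_extension(before, occurrences=40,
                                         sequence_length=15)
    print(f"{before} -> {after} instructions per packet "
          f"({before / after:.2f}x fewer)")
```

The gain depends on how often the replaced sequence actually runs, so a well-chosen instruction on a hot path matters far more than one on a rarely used path.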
Two fabless processor vendors—ARC and Tensilica—have
taken this extensible processor approach for their RISC CPUs. Both
companies based their designs on a RISC processor base, and both
designs enable developers to pare down instructions for a minimal
design, or to add new instruction functionality and coprocessors to
the CPU. ARC has made a name for itself by fielding a minimal RISC
core with a small footprint and a scalable instruction set.
Tensilica, a more recent arrival to the processor business, has a
16-/24-bit instruction word RISC. Both companies rely on synthesis
to handle the instruction integration.
The ARC architecture and tool chain enabled designers to tailor the
RISC CPU to their design needs, especially letting them eliminate
unnecessary instructions and functions. They could also add new
instructions, registers, and logic resources as needed, relying on
logic synthesis to integrate the logic. Later developments opened
up this add-on capability to third parties, not just licensees,
allowing a third party to add functions (and the instructions to
access them) to a library for licensees' use. ARC now has a Plug-In
program to encourage 3rd party developers to supply new functions.
These ARC Plug-Ins can include instructions, new registers, memory,
peripheral IP, custom register flags, a DSP function unit, and bus
interfaces. The Plug-Ins are automatically supported by the ARC
software tool chain.
Tensilica has taken a language and layout approach to CPU
extensibility. You can specify the added functionality in
Tensilica's special design language, TIE (Tensilica Instruction
Extension Language), a Verilog-like language. With TIE, you can
define the hardware resources, such as registers and functional
units, or coprocessors and their operations for new instructions.
Or you can extend existing instructions with new resources, such as
more registers. But the additions have to fit into Tensilica's
efficient chip layout scheme. You can also add processor and user
states, register files, instructions, coprocessors, and C data
types.
Other CPU vendors are also moving toward adding extensibility to
their architectures. Infineon, for example, has added a hardware
plug-in capability, PowerPlug, to its Carmel DSP. This feature
enables the vendor or a customer to customize the Carmel SOC via
logic synthesis. Up to 4 PowerPlug units can be added to the Carmel
core. Using these plug-ins, users can add two Infineon-supplied
MACs to the DSP core, doubling the number of MACs performed per
clock cycle from 2 to 4. User-generated PowerPlug units can be
attached to new instructions that are added to the software tool
chain, including the compilers.
To Bus Or Not To Bus
To many designers, buses are a way of life. To easily add logic or
canned functions, a bus is the way to do it. Maybe so at the board
level, but maybe not at the chip level. To most engineers,
systems-on-chip (SOC) silicon is just another extension of basic
design techniques, a variation of the age-old comment: "the more
things change, the more they stay the same." Board-level design
first developed the concept of standard plug-in buses, which
enabled designers to add logic or functionality in modular form.
Instead of adding logic and wiring its signals directly to existing
logic, designers could use buses to add or remove logic functions
without affecting the other logic.
To most engineers, buses are a natural architectural extension to
be added to silicon-based designs such as SOCs. These special
on-chip buses would provide the same advantages to silicon logic
that they provided for board level designs. With such on-chip
buses, developers could easily add peripherals, processors, or
specialized functions. Instead of integrating these elements at the
signal level, designers integrate them to an intermediate form, the
bus, where each element interfaces to the bus via a standard
interface.
But there is also an opposing argument, namely that with current
and future logic synthesis tools, who needs buses for system
integration? The argument goes that logic chunks and functions can
be easily integrated at the logic level using logic synthesis.
Moreover, with this synthesis approach, bus overhead and bus real
estate can be eliminated, delivering significant savings in chip
timing and transistors. Bus controllers are eliminated and element
interconnections can take on the logic characteristics of
point-to-point or of FPGA-like regularized interconnects. Both
eliminate the logic overhead of a system bus (arbitration,
synchronization, acknowledgment, recovery, and so on).
On the bus expansion side, a good example would be Lucent's
recently introduced StarPro. This is an MP SOC with three 16-bit,
4th generation StarCore VLIW DSPs and 768 KB of on-chip memory. Each
DSP is encapsulated with 16 KB I and D caches, and 32 KB of local
memory. All these elements—the DSP modules, main memory, and
peripherals—are integrated with a split-transaction bus.
Another example is Lexra's NetVortex network processor, which gangs
up 16 processors (15 packet and 1 control CPU) on a
split-transaction bus. These processors run with 16 KB of code RAM
and 16 KB of dual-ported data RAM (32 KB code and data RAMs, and 16
KB dual-ported RAM for the control CPU).
The split-transaction buses support multi-master operation, acting
as a central connecting resource for the multiple processors. They
also serve as a common interfacing mechanism for main memory and
the on-chip peripherals. All can be added via the bus and can be
accessed via the bus. Accessing any unit—processor, memory,
or peripheral—is standardized; it's a bus access.
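A split-transaction bus separates the request from the reply, so the bus is released while the addressed unit works and other masters can use it in the meantime. The sketch below counts bus occupancy with and without that split. The function name and every cycle count are hypothetical. They are not figures for the StarPro or NetVortex buses.

```python
def bus_cycles_held(reads, request_cycles, memory_latency,
                    response_cycles, split):
    """Cycles the bus is occupied by `reads` read transactions.

    On a non-split bus the requester holds the bus while memory
    works; on a split-transaction bus it is released in between.
    """
    per_read = request_cycles + response_cycles
    if not split:
        per_read += memory_latency
    return reads * per_read


if __name__ == "__main__":
    # Hypothetical workload: 16 reads, 1-cycle request,
    # 6-cycle memory latency, 2-cycle response.
    held = bus_cycles_held(16, 1, 6, 2, split=False)
    shared = bus_cycles_held(16, 1, 6, 2, split=True)
    print(f"non-split bus busy for {held} cycles")
    print(f"split bus busy for {shared} cycles, "
          f"leaving {held - shared} cycles for other masters")
```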
On the synthesis side, we have the Chameleon network processor,
which implements a reconfigurable processing fabric. This fabric is
integrated with an ARC RISC host CPU. The fabric itself consists of
slices, made up of multiple tiles. All told, for the CS2000 there
are 4 slices, 3 tiles per slice, with 2 multipliers, 7 datapath
unit cells and 4 memory buffers per tile. All these are dynamically
programmable. Developers can program the chip-to-slice,
tile-to-tile connections much like the global and local
programmable interconnects in an FPGA. And like an FPGA, they can
program the individual datapath cells, even implementing C or
Verilog functions.
The result is that the CS2000 provides a dynamically configurable
processing fabric with configurable datapath processing cells.
Developers can program a dataflow processor that moves the data
across the chip, passing through different processing stages, all
without the need for a datapath bus. They have access to a
massively parallel and sequential set of compute resources: 24
multipliers and 84 datapath cells.
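The totals for the fabric follow directly from the per-tile figures. A minimal sketch, with the class and field names invented for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fabric:
    """Resources of a sliced, tiled fabric (field names are illustrative)."""
    slices: int
    tiles_per_slice: int
    multipliers_per_tile: int
    datapath_units_per_tile: int
    memory_buffers_per_tile: int

    @property
    def tiles(self) -> int:
        return self.slices * self.tiles_per_slice

    def totals(self) -> dict:
        """Chip-wide resource counts: tiles times per-tile counts."""
        return {
            "multipliers": self.tiles * self.multipliers_per_tile,
            "datapath_units": self.tiles * self.datapath_units_per_tile,
            "memory_buffers": self.tiles * self.memory_buffers_per_tile,
        }


# Per-tile figures for the CS2000 as given in the text.
CS2000 = Fabric(slices=4, tiles_per_slice=3, multipliers_per_tile=2,
                datapath_units_per_tile=7, memory_buffers_per_tile=4)

if __name__ == "__main__":
    print(CS2000.tiles, "tiles:", CS2000.totals())
```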
RISC, Superscalar, VLIW, and SIMD
Today's processor design techniques include RISC, Superscalar,
VLIW, and SIMD. Each of these techniques enables designers to get
more out of their silicon, whether by squeezing down cycle logic,
executing instructions in parallel, or multiplying the number of
operations a single instruction can execute. The trick is to get
more done in the same amount of clock time.
RISC—In classic RISCs, the trick was to
squeeze down the register-to-ALU-to-register cycle for higher
execution speeds. One way to get it faster was to simplify the
logic: simplify the instruction set, use fixed multi-word
addressing, use a Load/Store architecture (operate only on
registers), use pipelining to stage execution sequentially (enabling
the next instruction to start before the current one finishes), and
use fixed instruction words. These design techniques enabled RISCs
to run faster than the older CISC (complex instruction set
computer) processors.
Superscalar—The next step to up RISC performance was
adding superscalar execution. Superscalar designs can issue more
than one RISC instruction per cycle, using multiple execution units
to execute multiple instructions in parallel. For example, many
RISCs can issue and execute an integer and a floating-point
instruction in parallel. But superscalar design techniques ran into
some natural limits: the more instructions you issue, the more
intermediate state you have to hold in case something goes wrong,
such as having to take a branch, which negates the instructions
that follow it in sequence. Superscalar has settled out into
implementations that can issue 2, 3, or 4 instructions in
parallel.
VLIW—Some new design techniques have evolved from
RISC. These include VLIW and SIMD. VLIW (very long instruction
word) implementations are a relatively successful attempt to bypass
the problems of superscalar RISC. VLIW is very much like RISC
superscalar; both techniques issue a number of RISC instructions at
once. The difference is that RISC superscalar does it dynamically
in hardware, deciding which instructions to issue and handling
intermediate scheduling problems. VLIW lets the compiler handle the
scheduling, with the hardware receiving and issuing a block of RISC
instructions.
Microprocessor Report: The Best Way to
Track Processor Architectures
The best way to track processor architectures and to explore the
different designs in some depth is the Microprocessor Report. This
monthly technical newsletter has been a Silicon Valley standard for
a number of years and provides the best architectural analysis
available. It is published by Cahners' MicroDesign Resources, which
also holds special industry conferences, like the Embedded
Processor Forum and the Microprocessor Forum. A yearly subscription
to the Microprocessor Report costs $695.
MicroDesign Resources, Sunnyvale, CA. 408-328-3900. (www.MDRonline.com).
SIMD—It turns out that SIMD
(single instruction, multiple data) has been around a long time. It
means that a single instruction controls the operation on multiple
data elements. For example, an ADD instruction causes n units to do
an add. SIMD has proved to be a very powerful mechanism,
especially for 8-, 16-, and 32-bit DSP and graphics operations
done on large register words. SIMD was a natural extension for
floating-point units in RISC and the X86 PC processors. Originally
pioneered by Sun for its SPARC and picked up by Intel for its
Pentium, SIMD enables one instruction to be applied to multiple
fields in a floating-point register word. For a 64-bit word, that
can be eight 8-bit adds, four 16-bit adds, or two 32-bit adds,
delivering an 8x, 4x, or 2x speedup. SIMD has now been extended to
other architectures and designs: Motorola's PowerPC G4 pairs a
128-bit vector engine co-processor with a G3 PPC core. The latest
SIMD designs are moving to a separate 128-bit vector unit instead
of the earlier 64-bit Floating-Point Execution Units.
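To make the "one instruction, many fields" idea concrete, here is a minimal software model of a packed add on a 64-bit word. It is a sketch only: the helper names are invented, each lane wraps around on overflow (real SIMD units often offer saturating variants as well), and the loop stands in for what the hardware does on all lanes at once.

```python
def pack(values, lane_bits):
    """Pack a list of small integers into one word, lane 0 lowest."""
    mask = (1 << lane_bits) - 1
    word = 0
    for index, value in enumerate(values):
        word |= (value & mask) << (index * lane_bits)
    return word


def unpack(word, lane_bits, word_bits=64):
    """Split a packed word back into its lanes, lane 0 first."""
    mask = (1 << lane_bits) - 1
    return [(word >> shift) & mask
            for shift in range(0, word_bits, lane_bits)]


def packed_add(a, b, lane_bits, word_bits=64):
    """Add two packed words lane by lane, wrapping within each lane."""
    if word_bits % lane_bits:
        raise ValueError("lane width must divide the word width")
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, word_bits, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        # Masking drops the carry so it cannot spill into the next lane.
        result |= (lane_sum & mask) << shift
    return result


if __name__ == "__main__":
    a = pack([1, 2, 3, 4, 5, 6, 7, 255], lane_bits=8)
    b = pack([10] * 8, lane_bits=8)
    # Eight 8-bit adds from a single packed_add call.
    print(unpack(packed_add(a, b, lane_bits=8), lane_bits=8))
```

Calling the same function with lane_bits=16 or lane_bits=32 gives the four-way and two-way cases described in the text.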