Embedded Processors 2000

Processor Architectures Step Out to New Performance, Power Levels

Embedded processor architectures proliferate, with higher clock rates, chip-level MP, and full MP systems in SOCs. Only a few years ago, embedded processors were a ho-hum mix of a few RISCs, some RISC cores, and a set of aging microcontrollers. Things have changed. Today's designers have a veritable buffet of processor and co-processor designs and design technologies to choose from. Processor design, powered by faster, denser, and more plentiful silicon, has broadened beyond a simple CPU, caches, and on-chip peripherals on a chip.

Today's embedded processor choices include:

Chip-level MP — Multiprocessing on-chip is a reality. Implementations range from handset SOCs integrating a DSP + uC, to multiple DSPs on a common bus.

True Systems-on-Chip — SOCs have gone beyond just a processor, its memory bus, and peripherals. SOCs are taking on the aspects of systems, with multiple processors, common memory, and peripherals, with sophisticated system buses to tie it all together.

Parallel Processor Element Designs — Alternate multiple Processing Element architectures that can deliver massive amounts of processing power from ganged PEs. These processors may function as a co-processor or be integrated with a host CPU in an SOC.

Extensible Processors — RISCs that can be extended at the ISA level, relying on system-level logic synthesis to integrate the designs.

Add-On Functionality — RISC and DSP architectures that enable 3rd parties and vendors to add logic functionality at the ISA level. They rely on logic synthesis to integrate the new functions into the design.

And last, but not least, CPU speeds are up. Embedded processors are moving up the speed curve. Embedded PowerPCs, for example, are pushing past 500 MHz, with 700 MHz in sight by year's end. And some embedded processors are cracking the 1 GHz barrier. SiByte has a proprietary MIPS64 implementation, the SB-1, that will hit 1 GHz.

Silicon-Driven Revolution
It is a technical commonplace, a cliché, that silicon technology follows Moore's Law, roughly doubling the number of transistors (or functionality, or clock rates, or capabilities) every 18 months to two years. It is also a truism. And we are now seeing the benefits of this relentless silicon advance up the silicon curve (time vs. capability).

We are now seeing lower core voltages, accompanied by lower power dissipation, higher clock rates, smaller geometries, and much, much higher transistor counts. We can do more, faster, with less power, and even cheaper than before. Core voltages are dropping to 1.2 to 1.8 V for the more advanced designs, on their way to 1.0 V or less. Silicon resolution for CMOS designs has come down to 0.12 to 0.18 microns.

Such lower core voltages and smaller silicon, coupled with power-saving design techniques, have brought down chip-level power dissipation, in spite of higher clock rates. Power dissipation for many embedded processors, even for advanced processors, has come down to a matter of a few Watts, or even 500 mW or so for low-power SOCs. Transistor counts for basic processors are now up to 4 to 7 M transistors to accommodate more on-chip caches and RAM, going out to 45 M transistors or more for high-end media processors.

This silicon is not free. But it is available. And with today's higher and higher clock rates, processors need more on-chip memory to minimize off-chip memory access delays. So processors are bulking up on on-chip memory. Most RISCs, for example, can now afford to run with 16 KB or even 32 KB Instruction and Data caches. And many, following Intel's Pentium III, are moving toward large on-chip L2 caches to localize processing and to minimize off-chip memory access delays. IBM's PowerPC 750CX, for example, brings the L2 cache on-chip, adding 256 KB of L2 cache and removing the backside bus from its package.

Silicon Magic's DVine network/media processor takes advantage of today's plentiful silicon. It bulks out at 45 M transistors, many of which are used for on-chip memory and on-chip DRAM. The media processor has 4 MB of on-chip DRAM, organized in 1-MB memory units. These memory units each have their own Memory Streaming Processor, which processes the data as it streams into or out of the memory unit.

The cement that binds these technical advantages to more advanced designs is logic synthesis. Without HDL-based design and the logic synthesis to transform HDL designs into silicon, current designs would not be possible. With chips running from 7 to 45 M+ transistors, the time for gate-level hand design is over. Today's complex designs are the products of a new generation of synthesis tools. These tools enable designers to easily add IP or to extend their processor designs.

Chip-Level MP
This year has been a watershed in embedded processor design. This was the year that chip-level MP became a reality. The year that SOCs moved from being a way to integrate a processor with its peripherals on one piece of silicon, to the point when SOCs started taking on the characteristics of true systems. Multiple processors on a SOC became a working reality, one that designers could count on for delivering a large amount of MP processing power within a realistic silicon budget.

SOC MP ranges from paired processors, such as a RISC paired with a uC, to full-scale MP architectures with multiple RISC or DSP processors. In addition, a new class of MP processing has emerged, that of multiple specialized processors or coprocessors arranged in sequential processing order or in processing arrays. This latter class represents the deployment of specialized math, vector, graphic, or media processors, which collectively can deliver a very high level of performance at modest clock rates.

Taking advantage of today's plentiful silicon, vendors are packing multiple processors on a single die to minimize design chip counts and costs. For example, Motorola is now integrating its 32-bit M•CORE RISC with its 4th generation, 16-bit fixed-point StarCore VLIW DSP for one-chip wireless transceivers. Another example of such high-level on-chip integration is Lucent's StarPro, which integrates 3 StarCore DSPs with 768 KB of RAM and on-chip peripherals. A third example is Infineon's Carmel DSP, which supports on-chip MP, with 4 or more DSPs integrated on a single chip.

Clocks vs. Execution Units
There's a new variation on an age-old tradeoff: clock rates vs. execution units. The idea is that maybe we don't have to go faster if we can have lots of parallel execution units. We can then run the execution units at slower clock rates and get GHz-level performance without straining the silicon. It's a variation of the “wider rather than faster” design theme. If you think about it, that's precisely what superscalar RISC, VLIW, and SIMD are all about: essentially deploying more execution units in parallel.

Sounds good, but most superscalar RISCs, VLIWs, or SIMDs can't get that many execution units chugging away in parallel. For example, a 4-way superscalar RISC will run 4 execution units in parallel. At best, a VLIW like TI's C6x with an 8-way VLIW has 8 units executing in parallel. SIMDs do a bit better, especially for 8-bit operations: a 128-bit SIMD like Motorola's PowerPC G4 does 16 executions in parallel. But if you need 16-bit accuracy, it only does eight operations in parallel.
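To put those counts side by side, here is a back-of-envelope tally in Python. The architecture names and unit counts simply restate the figures above, not vendor data sheets:

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements a SIMD register operates on per instruction."""
    return register_bits // element_bits

# Parallel operations per cycle for the approaches named in the text.
parallel_ops = {
    "4-way superscalar RISC": 4,                        # 4 execution units
    "TI C6x 8-way VLIW": 8,                             # 8 units per VLIW word
    "PowerPC G4 SIMD, 8-bit ops": simd_lanes(128, 8),   # 16 lanes
    "PowerPC G4 SIMD, 16-bit ops": simd_lanes(128, 16), # 8 lanes
}

for arch, ops in parallel_ops.items():
    print(f"{arch}: {ops} ops/cycle")
```

The SIMD rows show the accuracy tradeoff directly: halving the element width doubles the lane count.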

However, there's another way to get more parallel processing power to deliver massive amounts of execution MIPS at relatively low clock rates. New architecture designers have done this by basically upping the number of parallel execution units that can be deployed in tandem. Today's emerging parallel designs are all over the place architecturally, but basically all get their top-level performance by ganging multiple parallel execution units for massive parallelism.

A number of these massively parallel execution unit CPU designshave emerged. They include:

BOPS Mantra — A very innovative MP design, arranging arrays of PEs for math-oriented or graphics-oriented processing. The Mantra has a 2 x 2 array, ganging together 20 parallel execution units for large-scale processing. The array links to an interchange switch, which can cross over results to the PEs for the next step in complex vector and DSP processing. Each PE has 5 execution units (ALU, MAU, DSU, LOAD, STORE), plus a data memory and a VLIW instruction memory to drive the execution units. The processor can act as an SIMD, with the same instructions being stored locally at each PE. At 200 MHz, the Mantra can deliver 4,000 peak MIPS.

Chameleon CS2000 — This is a dynamically reconfigurable MP design with an ARC RISC on-chip host and a 32-bit reconfigurable processing fabric. The fabric has 4 slices, with 3 tiles per slice. Each tile has 4 memory buffers, 7 datapath units, and 2 multipliers, for a total of 24 multipliers and 84 datapath units or computing cells. Each datapath unit can take in two data streams, doing shifts, word swapping, dual 16-bit adds, and Verilog or C operators. It is configurable with FPGA-like programmable local and layer interconnects and datapath cells. Running at 125 MHz, each slice can deliver 3,375 peak MIPS (125 MHz x 3 tiles x 9 units). If you count the pipelined slices (one feeding the next in a dataflow fabric), it rises to 13,500 peak MIPS.

Improv's Jazz — This is a multi-level, chip-level MP architecture. The Jazz PSA-32-STD5's first layer consists of 5 task engines, each with I and D memory, and a shared memory with the next task processor over. The task processors connect to a common I/O module and link via a circular Q-Bus, one to the next, to the I/O module. The task engines are individually configurable VLIWs with up to 16 execution units linked with a crossbar to 4 memory ports and a control unit. The STD5 task engine is configured with 7 execution units: 2 ALUs, 2 MACs, 1 SHIFT, 1 Counter, and 1 BYTESWAP. This is more of a dataflow architecture, with data flowing between the task engines or processing stages, down to the individual processing elements. Running at 100 MHz, each task engine delivers up to 700 peak 32-bit MIPS. A 5-task-engine Jazz PSA-32-STD5 delivers up to 3,500 32-bit MIPS.

Silicon Magic's DVine SM2700 — A heavyweight, scalable media processor that deploys multiple PEs coupled with large arrays of on-chip DRAM. The SM2700 includes 6 RISC processors and 6 Vector Processors. The units and memory are linked point-to-point with an on-chip crossbar. The memory units are front-ended by a pipelined streaming memory processor, which does media manipulation on the data as it streams by (interpolation, decimation, data alignment, address translation). The RISC is scalar, and the Vector Processor operates on 128-bit vectors and supports SIMD operations, as well as zero-overhead looping and horizontal data swapping. Running at 200 MHz, the SM2700 can deliver 6,000 peak 32-bit MIPS (200 x (6 + 6 x 4)). For 16-bit operations, the peak MIPS rate rises to 10,800. And that doesn't count the streaming memory processors, which contribute 2 operations per cycle (on 128-bit data).

Of course, you can also bulk up the number of deployed execution units by integrating multiple superscalar RISCs. Running n RISCs, each with 4 execution units, can deliver a peak of 4 x n parallel execution units. One such example is that of Infineon's Carmel DSP, which now supports on-chip MP operations on a common on-chip FPI bus. Four such DSPs, each deploying 3 functional execution units, support 12 parallel execution units and can deliver nearly 2,000 peak MIPS running at 166 MHz (166 MHz x 12 units).
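All of the peak-MIPS claims above follow the same clock-times-units arithmetic. As a quick sanity check, here is that arithmetic in Python; the clock rates and unit counts are the ones quoted in the text, and "peak" ignores memory stalls and scheduling losses, just as the vendors' figures do:

```python
def peak_mips(clock_mhz: int, units: int) -> int:
    """Peak MIPS = clock (MHz) x parallel execution units; ignores stalls."""
    return clock_mhz * units

# BOPS Mantra: 2 x 2 PE array, 5 execution units per PE, at 200 MHz
assert peak_mips(200, 4 * 5) == 4000
# Improv Jazz PSA-32-STD5: 5 task engines, 7 execution units each, at 100 MHz
assert peak_mips(100, 5 * 7) == 3500
# Silicon Magic DVine SM2700: 6 RISCs + 6 four-lane vector units, at 200 MHz
assert peak_mips(200, 6 + 6 * 4) == 6000
```

The pattern is the point: each of these designs buys its headline number with unit count, not clock rate.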

Extensible Processors
Maybe fixed instruction sets and standardized ISAs aren't the best way to get efficient for embedded applications. Maybe the better way is to tailor the instruction set of your SOC processor. A few good instructions can save milliseconds. Years ago, engineers and programmers settled on standardized processor architectures with a fixed ISA to maximize software life and to minimize reprogramming. Instruction sets remained fixed, while architectural implementations varied, taking advantage of new technology and design methodologies. While this approach did deliver cost-effective processing, it forced the programmers to tailor their software to the application problem, even if a hardware assist would be much more cost effective.

Today, there is a third way, one between fixed instruction sets and custom ISAs. With this approach, you modify the hardware and instruction set to add specialized functionality, but do so with full software compatibility. A new instruction or hardware capability would be automatically included in the assemblers, compilers, libraries, and operating software. Thus you can add a new instruction, one that the operating system will automatically use, or one that provides new hardware functionality available to programmers as one or more new instructions used in functional libraries. This addition of instruction functionality can make a big difference for many embedded applications that handle specialized interfacing and packet-processing tasks. The right specialized instruction could handle some tasks in one instruction that now take 10 to 20 or more standard instructions.
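As an illustration of what a tailored instruction buys, consider a 32-bit byte swap, a staple of packet processing. On a generic ISA it decomposes into a string of shift, mask, and OR operations; the Python sketch below spells out that generic sequence, all of which a single custom byte-swap instruction would replace:

```python
def byteswap32(x: int) -> int:
    """32-bit byte reversal, spelled out as the shift/mask/OR sequence a
    generic ISA would need (roughly a dozen operations); an extensible
    processor could collapse this to one custom instruction."""
    return (((x & 0x000000FF) << 24) |
            ((x & 0x0000FF00) << 8)  |
            ((x & 0x00FF0000) >> 8)  |
            ((x & 0xFF000000) >> 24))
```

Run on every word of a packet header, the difference between one instruction and a dozen adds up quickly.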

Two fabless processor vendors—ARC and Tensilica—have taken this extensible processor approach for their RISC CPUs. Both companies based their designs on a RISC processor base, and both designs enable developers to pare down instructions for a minimal design, or to add new instruction functionality and coprocessors to the CPU. ARC has made a name for itself by fielding a minimal RISC core with a small footprint and a scalable instruction set. Tensilica, a more recent arrival to the processor business, has a 16-/24-bit instruction word RISC. Both companies rely on synthesis to handle the instruction integration.

The ARC architecture and tool chain enabled designers to tailor the RISC CPU to their design needs, especially letting them eliminate unnecessary instructions and functions. They could also add new instructions, registers, and logic resources as needed, relying on logic synthesis to integrate the logic. Later developments opened up this add-on capability to third parties, not just licensees, allowing the third party to add functions (and the instructions to access them) to a library for licensees' use. ARC now has a Plug-In program to encourage 3rd party developers to supply new functions. These ARC Plug-Ins can include instructions, new registers, memory, peripheral IP, custom register flags, a DSP function unit, and bus interfaces. The Plug-Ins are automatically supported by the ARC software tool chain.

Tensilica has taken a language and layout approach to CPU extensibility. You can specify added functionality in Tensilica's special design language, TIE (Tensilica Instruction Extension language), a Verilog-like language. With TIE, you can define the hardware resources, such as registers and functional units, or coprocessors and their operations for new instructions. Or you can extend existing instructions with new resources, such as more registers. But the additions have to fit into Tensilica's efficient chip layout scheme. You can also add processor and user states, register files, instructions, coprocessors, and C data types.

Actually, other CPU vendors are moving toward adding extensibility to their architectures. Infineon, for example, has added a hardware plug-in capability, PowerPlug, to its Carmel DSP. This feature enables the vendor or a customer to customize the Carmel SOC via logic synthesis. Up to 4 PowerPlug units can be added to the Carmel core. Using these plug-ins, users can add two Infineon-supplied MACs to the DSP core, doubling the number of MACs performed per clock cycle from 2 to 4. User-generated PowerPlug units can be attached to new instructions that are added to the software tool chain, including the compilers.

To Bus Or Not To Bus
To many designers, buses are a way of life. To easily add logic or canned functions, a bus is the way to do it. Maybe so at the board level, but maybe not at the chip level. To most engineers, systems-on-chip (SOC) silicon is just another extension of basic design techniques, a variation of the age-old comment: “the more things change, the more they stay the same.” At the logic level, board-level design first developed the concept of standard buses, that of plug-in buses that enabled designers to add additional logic or functionality in a modular form. Instead of adding logic and simply connecting the signals to existing logic, buses enabled designers to easily add or remove logic functions without affecting the other logic.

To most engineers, buses are a natural architectural extension to be added to silicon-based designs such as SOCs. These special on-chip buses would provide the same advantages to silicon logic that they provided for board-level designs. With such on-chip buses, developers could easily add peripherals, processors, or specialized functions. Instead of integrating these elements at the signal level, designers integrate them to an intermediate form, the bus, where each element interfaces to the bus via a standard interface.

But there is also an opposing argument, namely that with current and future logic synthesis tools, who needs buses for system integration? The argument goes that logic chunks and functions can be easily integrated at the logic level using logic synthesis. Moreover, with this synthesis approach, bus overhead and bus real estate can be eliminated, delivering significant savings in chip timing and transistors. Bus controllers are eliminated, and element interconnections can take on the logic characteristics of point-to-point or of FPGA-like regularized interconnects. Both eliminate the logic overhead of a system bus (arbitration, synchronization, acknowledgment, recovery, …).

On the bus expansion side, a good example would be Lucent's recently introduced StarPro. This is an MP SOC with 3 16-bit, 4th generation StarCore VLIW DSPs and 768 KB of on-chip memory. Each DSP is encapsulated with 16 KB I and D caches, and 32 KB of local memory. All these elements—the DSP modules, main memory, and peripherals—are integrated with a split-transaction bus. Another example is Lexra's NetVortex network processor, which gangs up 16 processors (15 packet and 1 control CPU) on a split-transaction bus. These processors run with 16 KB of code RAM and 16 KB of dual-ported data RAM (32 KB code and data RAMs, and 16 KB dual-ported RAM for the control CPU).

The split-transaction buses support multi-master operation, acting as a central connecting resource for the multiple processors. They also serve as a common interfacing mechanism for main memory and the on-chip peripherals. All can be added via the bus and can be accessed via the bus. Accessing any unit—processor, memory, or peripheral—is standardized; it's a bus access.

On the synthesis side, we have the Chameleon network processor, which implements a reconfigurable processing fabric. This fabric is integrated with an ARC RISC host CPU. The fabric itself consists of slices, made up of multiple tiles. All told, for the CS2000 there are 4 slices, 3 tiles per slice, with 2 multipliers, 7 datapath unit cells, and 4 memory buffers per tile. All these are dynamically programmable. Developers can program the chip-to-slice, tile-to-tile connections much like the global and local programmable interconnects in an FPGA. And like an FPGA, they can program the individual datapath cells, even implementing C or Verilog functions.

The result is that the CS2000 provides a dynamically configurable processing fabric with configurable datapath processing cells. Developers can program a dataflow processor that moves the data across the chip, passing through different processing stages, all without the need for a datapath bus. They have access to a massively parallel and sequential set of compute resources: 24 multipliers and 84 datapath cells.

RISC, Superscalar, VLIW, and SIMD
Today's processor design techniques include RISC, superscalar, VLIW, and SIMD. Each of these techniques enables designers to get more out of their silicon by squeezing down cycle logic, executing instructions in parallel, or multiplying the number of operations a single instruction can execute, respectively. The trick is to get more done in the same amount of clock time.

RISC — In classic RISCs, the trick was to squeeze down the register-to-ALU-to-register cycle for higher execution speeds. One way to get it faster was to simplify the logic: simplify the instruction set, use fixed multi-word addressing, use a Load/Store architecture (operate only on registers), use pipelining to sequentially stage execution (enabling the next instruction to start before the current one finished), and use fixed instruction words. These design techniques enabled RISCs to run faster than the older CISC (complex instruction set computer) processors.

Superscalar — The next step to up RISC performance was adding superscalar execution. Superscalar designs can issue more than one RISC instruction per cycle, using multiple execution units to execute multiple instructions in parallel. For example, many RISCs can issue and execute an integer and a floating-point instruction in parallel. But superscalar design techniques ran into some natural limits, namely that the more instructions you issue, the more intermediate state you have to hold in case something goes wrong, such as having to take a branch, which negates the instructions that follow it in sequence. Superscalar has settled out into implementations that can issue 2, 3, or 4 instructions in parallel.

VLIW — Some new design techniques have evolved from RISC. These include VLIW and SIMD. VLIW (very long instruction word) implementations are a relatively successful attempt to bypass the problems of superscalar RISC. VLIW is very like superscalar RISC; both techniques issue a number of RISC instructions. The difference is that superscalar RISC does it dynamically in hardware, deciding which instructions to issue and handling intermediate scheduling problems. VLIW lets the compiler handle the scheduling, with the hardware receiving and issuing a block of RISC instructions.

Microprocessor Report: The Best Way to Track Processor Architectures

The best way to track processor architectures and to explore the different designs in some depth is the Microprocessor Report. This monthly technical newsletter has been a Silicon Valley standard for a number of years and provides the best architectural analysis available. It is published by Cahners' MicroDesign Resources, which also holds special industry conferences, like the Embedded Processor Forum and the Microprocessor Forum. A yearly subscription to the Microprocessor Report costs $695.

MicroDesign Resources, Sunnyvale, CA. 408-328-3900. (www.MDRonline.com).

SIMD — It turns out that SIMD (single instruction, multiple data) has been around a long time. It means that a single instruction controls the operation on multiple data elements. For example, an ADD instruction causes n units to do an add. SIMD has proved to be a very powerful mechanism, especially for 8-, 16-, and 32-bit DSP and graphics operations done on large register words. SIMD was a natural extension for floating-point units in RISC and the X86 PC processors. Originally pioneered by Sun for its SPARC and picked up by Intel for its Pentium, SIMD enables one instruction to be applied to multiple fields in a floating-point register word. For a 64-bit word, that can be eight 8-bit adds, four 16-bit adds, or two 32-bit adds, delivering an 8x, 4x, or 2x speedup. SIMD has now been extended to other architectures and designs: Motorola's PowerPC G4 implements a 128-bit vector engine co-processor with a G3 PPC core. The latest SIMD designs are moving to a separate 128-bit vector unit instead of the earlier 64-bit Floating-Point Execution Units.
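The lane arithmetic behind those speedups can be shown with a tiny scalar emulation of a SIMD packed add. This is a hypothetical sketch, not any vendor's actual unit: one "instruction" performs eight independent 8-bit adds on a 64-bit word, with each lane wrapping on overflow rather than carrying into its neighbor:

```python
def packed_add(a: int, b: int, word_bits: int = 64, lane_bits: int = 8) -> int:
    """Lane-wise add on a packed word: each lane wraps independently,
    with no carry crossing a lane boundary, as in a SIMD packed add."""
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, word_bits, lane_bits):
        lane_sum = (((a >> shift) & mask) + ((b >> shift) & mask)) & mask
        result |= lane_sum << shift
    return result

# Eight 8-bit adds in one "instruction"; 0x01 + 0xFF wraps to 0x00 in its lane.
assert packed_add(0x0101010101010101, 0x01010101010101FF) == 0x0202020202020200
```

Changing `lane_bits` to 16 or 32 reproduces the four-16-bit-add and two-32-bit-add cases the text describes.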
