CMP EMBEDDED.COM


Embedded Processors 2000

Processor Architectures Step Out to New Performance, Power Levels

Embedded processor architectures proliferate, with higher clock rates, chip-level MP, and full MP systems in SOCs. Only a few years ago, embedded processors were a ho-hum mix of a few RISCs, some RISC cores, and a set of aging microcontrollers. Things have changed. Today's designers have a veritable buffet of processor and co-processor designs and design technologies to choose from. Processor design, powered by faster, denser, and more plentiful silicon, has broadened beyond a simple CPU, caches, and on-chip peripherals on a chip.

Today's embedded processor choices include:

Chip-level MP—Multiprocessing on-chip is a reality. Implementations range from handset SOCs integrating a DSP + uC, to multiple DSPs on a common bus.

True Systems-on-Chip—SOCs have gone beyond just a processor, its memory bus, and peripherals. SOCs are taking on the aspects of systems, with multiple processors, common memory, and peripherals, and with sophisticated system buses to tie it all together.

Parallel Processor Element Designs—Alternate multiple Processing Element architectures that can deliver massive amounts of processing power from ganged PEs. These processors may function as a co-processor or be integrated with a host CPU in an SOC.

Extensible Processors—RISCs that can be extended at the ISA level and that rely on system-level logic synthesis to integrate the designs.

Add-On Functionality—RISC, DSP architectures that enable 3rd parties and vendors to add logic functionality at the ISA level. They rely on logic synthesis to integrate the new functions into the design.

And last, but not least, CPU speeds are up. Embedded processors are moving up the speed curve. Embedded PowerPCs, for example, are pushing past 500 MHz, with 700 MHz in sight by year's end. And some embedded processors are cracking the 1 GHz barrier. SiByte has a proprietary MIPS64 implementation, the SB-1, that will hit 1 GHz.


Silicon-Driven Revolution
It is a technical commonplace, even a cliché, that silicon technology follows Moore's Law, roughly doubling the number of transistors (or functionality, or clock rates, or capabilities) every 18 months to two years. Everybody knows it, but it is still true. And we are now seeing the benefits of this relentless advance up the silicon curve (capability vs. time).

We are now seeing lower core voltages, accompanied by lower power dissipation, higher clock rates, smaller geometries, and much, much higher transistor counts. We can do more, faster, with less power, and more cheaply than before. Core voltages are dropping to 1.2 to 1.8 V for the more advanced designs, on their way to 1.0 V or less. Feature sizes for CMOS designs have come down to 0.12 to 0.18 microns.

Such lower core voltages and smaller silicon, coupled with power saving design techniques, have brought down chip level power dissipation, in spite of higher clock rates. Power dissipation for many embedded processors, even for advanced processors, has come down to a matter of a few Watts or even 500 mW or so for low power SOCs. Transistor counts for basic processors are now up to 4 to 7 M transistors to accommodate more on-chip caches and RAM, going out to 45 M transistors or more for high-end media processors.

This silicon is not free. But it is available. And with today's ever-higher clock rates, processors need more on-chip memory to minimize off-chip memory access delays. So processors are bulking up on on-chip memory. Most RISCs, for example, can now afford to run with 16 KB or even 32 KB Instruction and Data caches. And many, following Intel's Pentium III, are moving toward large on-chip L2 caches to localize processing. IBM's PowerPC 750CX, for example, brings the L2 cache on-chip, adding 256 KB of L2 cache and removing the backside bus from its package.

Silicon Magic's DVine network/media processor takes advantage of today's plentiful silicon. It bulks out at 45 M transistors, many of which are used for on-chip memory and on-chip DRAM. The media processor has 4 MB of on-chip DRAM, organized in 1-MB memory units. Taking advantage of plentiful silicon, these memory units have their own Memory Streaming Processor processing the data as it streams into or out of the memory unit.

The cement that binds these technical advantages to more advanced designs is logic synthesis. Without HDL-based design and the logic synthesis to transform HDL designs into silicon, current designs would not be possible. With chips running from 7 to 45 M+ transistors, the time for gate-level hand design is over. Today's complex designs are the products of a new generation of synthesis tools. These tools enable designers to easily add IP or to extend their processor designs.


Chip-Level MP
This year has been a watershed in embedded processor design. This was the year that chip-level MP became a reality, the year that SOCs moved from being a way to integrate a processor with its peripherals on one piece of silicon to taking on the characteristics of true systems. Multiple processors on an SOC became a working reality, one that designers could count on to deliver a large amount of MP processing power within a realistic silicon budget.

SOC MP ranges from paired processors, such as a RISC paired with a uC, to full-scale MP architectures with multiple RISC or DSP processors. In addition, a new class of MP processing has emerged, that of multiple specialized processors or coprocessors arranged in sequential processing order or in processing arrays. This latter class represents the deployment of specialized math, vector, graphic, or media processors, which collectively can deliver a very high level of performance at modest clock rates.

Taking advantage of today's plentiful silicon, vendors are packing multiple processors on a single die to minimize design chip counts and costs. For example, Motorola is now integrating its 32-bit M•CORE RISC with its 4th generation, 16-bit fixed-point StarCore VLIW DSP for one-chip wireless transceivers. Another example of such high-level on-chip integration is Lucent's StarPro, which integrates 3 StarCore DSPs with 768 KB of RAM and on-chip peripherals. A third example is Infineon's Carmel DSP, which supports on-chip MP, with 4 or more DSPs integrated on a single chip.


Clocks vs. Execution Units
There's a new variation on an age-old tradeoff: clock rates vs. execution units. Many designers are working from the idea that maybe we don't have to go faster if we can have lots of parallel execution units. We can then run the execution units at slower clock rates and get GHz-level performance without straining the silicon. It's a variation of the "wider rather than faster" design theme. If you think about it, that's precisely what superscalar RISC, VLIW, and SIMD are all about: deploying more execution units in parallel.

Sounds good, but most superscalar RISCs, VLIWs, or SIMDs can't get that many execution units chugging away in parallel. For example, a 4-way superscalar RISC will run 4 execution units in parallel. At best, an 8-way VLIW like TI's C6x has 8 units executing in parallel. SIMDs do a bit better, especially for 8-bit operations: a 128-bit SIMD like Motorola's PowerPC G4 does 16 operations in parallel. But if you need 16-bit accuracy, it only does eight operations in parallel.
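The SIMD figures above follow from dividing the register width by the element width. A minimal sketch of that arithmetic; the function name `simd_lanes` is only an illustration and does not come from any vendor's tools:

```python
def simd_lanes(register_bits, element_bits):
    """Number of elements one SIMD instruction operates on in parallel."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


# A 128-bit vector unit, as in the PowerPC G4 example in the text.
for element_bits in (8, 16, 32):
    lanes = simd_lanes(128, element_bits)
    print(f"128-bit SIMD, {element_bits}-bit elements: {lanes} operations in parallel")
```

The same division explains why wider data costs parallelism: doubling the element width halves the number of lanes.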

However, there's another way to get more parallel processing power to deliver massive amounts of execution MIPS at relatively low clock rates. New architecture designers have done this by basically upping the number of parallel execution units that can be deployed in tandem. Today's emerging parallel designs are all over the place architecturally, but basically all get their top-level performance by ganging multiple parallel execution units for massive parallelism.

A number of these massively parallel execution unit CPU designs have emerged. They include:

BOPS Mantra—A very innovative MP design, arranging arrays of PEs for math-oriented or graphics-oriented processing. The Mantra has a 2 x 2 array, ganging together 20 parallel execution units for large-scale processing. The array links to an interchange switch, which can cross over results to the PEs for the next step in complex vector and DSP processing. Each PE has 5 execution units (ALU, MAU, DSU, LOAD, STORE), plus a data memory and a VLIW instruction memory to drive the execution units. The processor can act as an SIMD, with the same instructions being stored locally at each PE. At 200 MHz, the Mantra can deliver 4,000 peak MIPS.

Chameleon CS2000—This is a dynamically reconfigurable MP design that pairs an on-chip ARC RISC host with a 32-bit reconfigurable processing fabric. The fabric has 4 slices, with 3 tiles per slice. Each tile has 4 memory buffers, 7 datapath units, and 2 multipliers, for a total of 24 multipliers and 84 datapath units or computing cells. Each datapath unit can take in two data streams, doing shifts, word swapping, dual 16-bit adds, and Verilog or C operators. It is configurable with FPGA-like programmable local and layer interconnects and datapath cells. Running at 125 MHz, each slice can deliver 3,387 peak MIPS (125 MHz x 3 tiles x 9 units). If you count the pipelined slices (one feeding the next in a dataflow fabric), it rises to 13,448 peak MIPS.

Improv's Jazz—This is a multi-level, chip-level MP architecture. The Jazz PSA-32-STD5's first layer consists of 5 task engines each with I and D memory, and a shared memory with the next task processor over. The task processors connect to a common I/O module and link via a circular Q-Bus, one to the next, to the I/O module. The task engines are individually configurable VLIWs with up to 16 execution units linked with a crossbar to 4 memory ports and a control unit. The STD-5 task engine is configured with 7 execution units: 2 ALUs, 2 MACs, 1 SHIFT, 1 Counter, and 1 BYTESWAP. This is more of a dataflow architecture with data flowing between the task engines or processing stages, down to the individual processing elements. Running at 100 MHz, each task engine delivers up to 700 peak 32-bit MIPS. A 5 task engine Jazz PSA-STD5 delivers up to 3,500 32-bit MIPS.

Silicon Magic's DVine SM2700—A heavyweight, scalable media processor that deploys multiple PEs coupled with large arrays of on-chip DRAM. The SM2700 includes 6 RISC processors and 6 Vector Processors. The units and memory are linked point-to-point with an on-chip crossbar. The memory units are front-ended by a pipelined streaming memory processor, which does media manipulation on the data as it streams by (interpolation, decimation, data alignment, address translation). The RISC is scalar, and the Vector Processor operates on 128-bit vectors and supports SIMD operations, as well as zero-overhead looping and horizontal data swapping. Running at 200 MHz, the SM2700 can deliver 6,000 peak 32-bit MIPS (200 x (6 + (6 x 4))). For 16-bit operations, the peak MIPS rate rises to 10,800. And that doesn't count the streaming memory processors, which contribute 2 operations per cycle (on 128-bit data).
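The peak MIPS figures quoted for these four designs all come from the same product: clock rate in MHz times the number of operations that can issue per cycle. The sketch below redoes that arithmetic from the unit counts given above. Two readings are assumptions, not statements from the vendors: the CS2000's "9 units" per tile is taken as 7 datapath units plus 2 multipliers, and the SM2700's 16-bit figure is taken as 8 lanes per 128-bit vector processor.

```python
def peak_mips(clock_mhz, ops_per_cycle):
    """Peak MIPS = clock rate (MHz) x operations issued per cycle."""
    return clock_mhz * ops_per_cycle


# (clock in MHz, operations per cycle), built from the counts in the text.
designs = {
    "BOPS Mantra": (200, 4 * 5),                   # 2 x 2 PEs, 5 units each
    "Chameleon CS2000, one slice": (125, 3 * 9),   # 3 tiles x 9 units (assumed 7 + 2)
    "Improv Jazz, one task engine": (100, 7),      # 7 execution units
    "Improv Jazz PSA-STD5": (100, 5 * 7),          # 5 task engines
    "DVine SM2700, 32-bit": (200, 6 + 6 * 4),      # 6 RISCs + 6 vector units x 4 lanes
    "DVine SM2700, 16-bit": (200, 6 + 6 * 8),      # assumed 8 lanes at 16 bits
}

results = {name: peak_mips(mhz, ops) for name, (mhz, ops) in designs.items()}

for name, mips in results.items():
    print(f"{name}: {mips:,} peak MIPS")
```

The Mantra, Jazz, and SM2700 results match the quoted numbers. For the CS2000, the formula given in the text works out to 3,375 per slice, slightly different from the quoted 3,387; the sketch reports the formula's result and does not try to explain the gap. These are peak rates, reached only when every unit has work in every cycle.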

Of course, you can also bulk up the number of deployed execution units by integrating multiple superscalar RISCs. Running n RISCs, each with 4 execution units, can deliver a peak of 4 x n parallel execution units. One such example is Infineon's Carmel DSP, which now supports on-chip MP operations on a common on-chip FPI bus. Four such DSPs can deliver 1.2 MIPS performance, running at 166 MHz. Each DSP deploys 3 functional execution units, for a peak of 3 operations per cycle, so four DSPs support 12 parallel execution units.


Extensible Processors
Maybe fixed instruction sets and standardized ISAs aren't the best way to get efficiency in embedded applications. Maybe the better way is to tailor the instruction set of your SOC processor. A few good instructions can save milliseconds. Years ago, engineers and programmers settled on standardized processor architectures with a fixed ISA to maximize software life and to minimize reprogramming. Instruction sets remained fixed, while architectural implementations varied, taking advantage of new technology and design methodologies. While this approach did deliver cost-effective processing, it forced programmers to tailor their software to the application problem, even if a hardware assist would be much more cost effective.

Today, there is a third way, one between fixed instruction sets and custom ISAs. With this approach, you modify the hardware and instruction set to add specialized functionality, but do so with full software compatibility. A new instruction or hardware capability would be automatically included in the assemblers, compilers, libraries, and operating software. Thus you can add a new instruction, one that the operating system will automatically use, or one that provides new hardware functionality available to programmers as one or more new instructions used in functional libraries. This addition of instruction functionality can make a big difference for many embedded applications that handle specialized interfacing and packet processing tasks. The right specialized instruction could handle some tasks in one instruction that now take 10 to 20 or more standard instructions.
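To make the "one instruction instead of 10 to 20" claim concrete, here is a toy model using a 32-bit byte swap, a common step in packet processing. The operation counts are an illustration only: `byteswap_standard` counts the basic shift, AND, and OR steps it performs, and `byteswap_custom` stands in for a hypothetical single added instruction. Neither reflects the instruction set of any particular processor.

```python
MASK32 = 0xFFFFFFFF


def byteswap_standard(x):
    """Byte-swap a 32-bit word with shift/AND/OR steps.

    Returns (result, count of basic operations used).
    """
    ops = 0
    b0 = (x & 0x000000FF) << 24   # AND + shift
    ops += 2
    b1 = (x & 0x0000FF00) << 8    # AND + shift
    ops += 2
    b2 = (x >> 8) & 0x0000FF00    # shift + AND
    ops += 2
    b3 = (x >> 24) & 0x000000FF   # shift + AND
    ops += 2
    result = b0 | b1 | b2 | b3    # three ORs
    ops += 3
    # The final mask only keeps Python's unbounded ints to 32 bits;
    # it is not counted as an operation.
    return result & MASK32, ops


def byteswap_custom(x):
    """Model of a hypothetical one-instruction BYTESWAP extension."""
    return int.from_bytes(x.to_bytes(4, "little"), "big"), 1


word = 0x11223344
print(hex(byteswap_standard(word)[0]), byteswap_standard(word)[1], "operations")
print(hex(byteswap_custom(word)[0]), byteswap_custom(word)[1], "operation")
```

Both functions return the same swapped word; only the operation count differs. Real savings depend on the base ISA, since some processors already provide a byte-swap or rotate instruction.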

Two fabless processor vendors—ARC and Tensilica—have taken this extensible processor approach for their RISC CPUs. Both companies based their designs on a RISC processor base, and both designs enable developers to pare down instructions for a minimal design, or to add new instruction functionality and coprocessors to the CPU. ARC has made a name for itself by fielding a minimal RISC core with a small footprint and a scalable instruction set. Tensilica, a more recent arrival to the processor business, has a 16-/24-bit instruction word RISC. Both companies rely on synthesis to handle the instruction integration.

The ARC architecture and tool chain enable designers to tailor the RISC CPU to their design needs, especially by letting them eliminate unnecessary instructions and functions. They can also add new instructions, registers, and logic resources as needed, relying on logic synthesis to integrate the logic. Later developments opened up this add-on capability to third parties, not just licensees, allowing a third party to add functions (and the instructions to access them) to a library for licensees' use. ARC now has a Plug-In program to encourage 3rd party developers to supply new functions. These ARC Plug-Ins can include instructions, new registers, memory, peripheral IP, custom register flags, a DSP function unit, and bus interfaces. The Plug-Ins are automatically supported by the ARC software tool chain.

Tensilica has taken a language and layout approach to CPU extensibility. You specify your added functionality in Tensilica's special design language, TIE (Tensilica Instruction Extension Language), a Verilog-like language. With TIE, you can define the hardware resources, such as registers and functional units, or coprocessors and their operations for new instructions. Or you can extend existing instructions with new resources, such as more registers. You can also add processor and user states, register files, instructions, coprocessors, and C data types. But the additions have to fit into Tensilica's efficient chip layout scheme.

Other CPU vendors are also moving toward adding extensibility to their architectures. Infineon, for example, has added a hardware plug-in capability, PowerPlug, to its Carmel DSP. This feature enables the vendor or a customer to customize the Carmel SOC via logic synthesis. Up to 4 PowerPlug units can be added to the Carmel core. Using these plug-ins, users can add two Infineon-supplied MACs to the DSP core, doubling the number of MACs performed per clock cycle from 2 to 4. User-generated PowerPlug units can be attached to new instructions that are added to the software tool chain, including the compilers.


To Bus Or Not To Bus
To many designers buses are a way of life. To easily add logic or canned functions, a bus is the way to do it. Maybe so at the board level, but maybe not at the chip level. To most engineers, systems-on-chip (SOC) silicon is just another extension of basic design techniques, a variation of the age-old comment: "the more things change, the more they stay the same." On a logic level, board-level design first developed the concept of standard buses, that of plug-in buses that enabled designers to add additional logic or functionality via a modular form. Instead of adding logic and simply connecting the signals to existing logic, buses enabled designers to easily add or remove logic functions without affecting the other logic.

To most engineers, buses are a natural architectural extension to be added to silicon-based designs such as SOCs. These special on-chip buses would provide the same advantages to silicon logic that they provided for board level designs. With such on-chip buses, developers could easily add peripherals, processors, or specialized functions. Instead of integrating these elements at the signal level, designers integrate them to an intermediate form, the bus, where each element interfaces to the bus via a standard interface.

But there is also an opposing argument: with current and future logic synthesis tools, who needs buses for system integration? The argument goes that logic chunks and functions can be easily integrated at the logic level using logic synthesis. Moreover, with this synthesis approach, bus overhead and bus real estate can be eliminated, delivering significant savings in chip timing and transistors. Bus controllers are eliminated, and element interconnections can take on the logic characteristics of point-to-point or of FPGA-like regularized interconnects. Both eliminate the logic overhead of a system bus (arbitration, synchronization, acknowledgment, recovery, and so on).

On the bus expansion side, a good example would be Lucent's recently introduced StarPro. This is an MP SOC with 3 16-bit, 4th generation, StarCore VLIW DSPs and 768 KB of on-chip memory. Each DSP is encapsulated with 16 KB I and D caches, and 32 KB of local memory. All these elements—the DSP modules, main memory, and peripherals—are integrated with a split-transaction bus. Another example is Lexra's NetVortex network processor, which gangs up 16 processors (15 packet and 1 control CPU) on a split-transaction bus. These processors run with 16 KB of code RAM and 16 KB of dual-ported data RAM (32 KB code and data RAMs, and 16 KB dual-ported RAM for the control CPU).

The split-transaction buses support multi-master operation, acting as a central connecting resource for the multiple processors. They also serve as a common interfacing mechanism for main memory and the on-chip peripherals. All can be added via the bus and can be accessed via the bus. Accessing any unit (processor, memory, or peripheral) is standardized; it's a bus access.
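For readers unfamiliar with the term, a split-transaction bus separates each access into a request phase and a later response phase and releases the bus in between, so other masters can issue requests while a slow target is still working. The cycle-level sketch below illustrates that general idea only. It is not a model of the StarPro or NetVortex buses, and it assumes one bus slot per cycle, a fixed target latency, and a target that can hold several outstanding requests. All names are hypothetical.

```python
from collections import deque


def simulate_split_bus(requests, latency):
    """Simulate a single shared bus with split request/response phases.

    requests: list of (master, address) in arbitration order.
    latency:  cycles between a request and the earliest possible response.
    Returns (total_cycles, log), where each log entry is
    (cycle, phase, master, address).
    """
    pending = deque(requests)   # requests still waiting for the bus
    in_flight = []              # (ready_cycle, master, address)
    log = []
    cycle = 0
    while pending or in_flight:
        ready = [t for t in in_flight if t[0] <= cycle]
        if ready:
            # A finished access gets the bus slot to return its response.
            txn = min(ready, key=lambda t: t[0])
            in_flight.remove(txn)
            log.append((cycle, "response", txn[1], txn[2]))
        elif pending:
            # Otherwise the next master issues its request and lets go of the bus.
            master, address = pending.popleft()
            in_flight.append((cycle + latency, master, address))
            log.append((cycle, "request", master, address))
        else:
            log.append((cycle, "idle", None, None))
        cycle += 1
    return cycle, log


def blocking_bus_cycles(requests, latency):
    """Cycles needed if each master holds the bus from request to response."""
    return len(requests) * (latency + 1)


reqs = [("dsp0", 0x100), ("dsp1", 0x200), ("dsp2", 0x300)]
total, log = simulate_split_bus(reqs, latency=4)
for entry in log:
    print(entry)
print("split-transaction cycles:", total)
print("blocking-bus cycles:", blocking_bus_cycles(reqs, latency=4))
```

With three masters and a four-cycle target latency, the three requests overlap their waiting time, which is why this style of bus suits multi-master MP designs.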

On the synthesis side, we have the Chameleon network processor, which implements a reconfigurable processing fabric. This fabric is integrated with an ARC RISC host CPU. The fabric itself consists of slices, made up of multiple tiles. All told, for the CS2000 there are 4 slices, 3 tiles per slice, with 2 multipliers, 7 datapath unit cells, and 4 memory buffers per tile. All these are dynamically programmable. Developers can program the chip-to-slice and tile-to-tile connections much like the global and local programmable interconnects in an FPGA. And like an FPGA, they can program the individual datapath cells, even implementing C or Verilog functions.

The result is that the CS2000 provides a dynamically configurable processing fabric with configurable datapath processing cells. Developers can program a dataflow processor that moves the data across the chip, passing through different processing stages, all without the need for a datapath bus. They have access to a massively parallel and sequential set of compute resources: 24 multipliers and 84 datapath cells.
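The resource totals follow directly from the fabric's hierarchy of slices and tiles. A short sketch of the tally, using only the per-tile counts given in the text; the variable names are illustrative:

```python
SLICES = 4
TILES_PER_SLICE = 3
PER_TILE = {"multipliers": 2, "datapath cells": 7, "memory buffers": 4}

# Every tile carries the same resources, so totals scale with the tile count.
tiles = SLICES * TILES_PER_SLICE
totals = {name: tiles * count for name, count in PER_TILE.items()}

for name, count in totals.items():
    print(f"{name}: {count}")
```

This reproduces the 24 multipliers and 84 datapath cells cited above; the total of 48 memory buffers is derived the same way and is not a figure quoted in the article.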




RISC, Superscalar, VLIW, and SIMD
Today's processor design techniques include RISC, Superscalar, VLIW, and SIMD. Each of these techniques enables designers to get more out of their silicon: RISC by squeezing down cycle logic, Superscalar and VLIW by executing instructions in parallel, and SIMD by multiplying the number of operations a single instruction can execute. The trick is to get more done in the same amount of clock time.

RISC—In classic RISCs, the trick was to squeeze down the register-to-ALU-to-register cycle for higher execution speeds. One way to get it faster was to simplify the logic: simplify the instruction set, use fixed multi-word addressing, use a Load/Store architecture (operate only on registers), use pipelining to sequentially stage execution (enabling the next instruction to start before the current one finishes), and use fixed instruction words. These design techniques enabled RISCs to run faster than the older CISC (complex instruction set computer) processors.

Superscalar—The next step to up RISC performance was adding superscalar execution. Superscalar designs can issue more than one RISC instruction per cycle, using multiple execution units to execute multiple instructions in parallel. For example, many RISCs can issue and execute an integer and a floating-point instruction in parallel. But superscalar design techniques ran into some natural limits: the more instructions you issue, the more intermediate state you have to hold in case something goes wrong, such as having to take a branch, which negates the instructions that follow it in sequence. Superscalar has settled into implementations that can issue 2, 3, or 4 instructions in parallel.

VLIW—Some new design techniques have evolved from RISC. These include VLIW and SIMD. VLIW (very long instruction word) implementations are a relatively successful attempt to bypass the problems of superscalar RISC. VLIW is very like RISC superscalar; both techniques issue a number of RISC instructions. The difference is that RISC superscalar does it dynamically in hardware, deciding which instructions to issue and handling intermediate scheduling problems. VLIW lets the compiler handle the scheduling, with the hardware receiving and issuing a block of RISC instructions.
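To show what "the compiler handles the scheduling" means in practice, the sketch below packs a short instruction sequence into fixed-width bundles, placing each operation in the earliest bundle that comes after the producers of its inputs. It is a deliberately simplified illustration and not the algorithm of any real VLIW compiler. It assumes every operation takes one cycle, any slot can run any operation, and each register is written only once, so only read-after-write dependencies matter. The function name and the sample program are hypothetical.

```python
def schedule_vliw(ops, width):
    """Greedily pack operations into VLIW bundles of at most `width` slots.

    ops: list of (name, dest_register, source_registers) in program order.
    Returns a list of bundles, each a list of operation names.
    """
    bundles = []
    written_in = {}   # register -> index of the bundle that produces it
    for name, dest, sources in ops:
        # An operation must issue after every bundle that produces its inputs.
        earliest = 0
        for reg in sources:
            if reg in written_in:
                earliest = max(earliest, written_in[reg] + 1)
        # Take the first bundle at or after `earliest` with a free slot.
        slot = earliest
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(name)
        written_in[dest] = slot
    return bundles


program = [
    ("load a", "r1", []),
    ("load b", "r2", []),
    ("add", "r3", ["r1", "r2"]),
    ("load c", "r4", []),
    ("mul", "r5", ["r3", "r4"]),
    ("load d", "r6", []),
]

for width in (4, 2):
    print(f"{width}-wide:", schedule_vliw(program, width))
```

Six operations fit in three bundles in both cases, but the dependent `add` and `mul` still serialize. That dependency chain, not the bundle width, sets the floor on cycle count, which is one reason wide VLIWs rarely keep every slot busy.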

Microprocessor Report: The Best Way to Track Processor Architectures

The best way to track processor architectures and to explore the different designs in some depth is the Microprocessor Report. This monthly technical newsletter has been a Silicon Valley standard for a number of years and provides the best architectural analysis available. It is published by Cahners' MicroDesign Resources, which also holds special industry conferences, like the Embedded Processor Forum and the Microprocessor Forum. A yearly subscription to the Microprocessor Report costs $695.

MicroDesign Resources, Sunnyvale, CA. 408-328-3900. (www.MDRonline.com).

SIMD—It turns out that SIMD (single instruction, multiple data) has been around a long time. It means that a single instruction controls the operation on multiple data elements. For example, an ADD instruction causes n units to do an add. SIMD has proved to be a very powerful mechanism, especially for 8-, 16-, and 32-bit DSP and graphics operations done on large register words. SIMD was a natural extension for floating-point units in RISC and the X86 PC processors. Originally pioneered by Sun for its SPARC and picked up by Intel for its Pentium, SIMD enables one instruction to be applied to multiple fields in a floating-point register word. For a 64-bit word, that can be eight 8-bit adds, four 16-bit adds, or two 32-bit adds, delivering an 8x, 4x, or 2x speedup. SIMD has now been extended to other architectures and designs: Motorola's PowerPC G4 implements a 128-bit vector engine co-processor with a G3 PPC core. The latest SIMD designs are moving to a separate 128-bit vector unit instead of the earlier 64-bit Floating-Point Execution Units.
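What distinguishes a packed SIMD add from an ordinary 64-bit add is that carries do not cross field boundaries: each field wraps on its own. The sketch below models that behavior in software for illustration. The helper `packed_add` is hypothetical and processes the lanes in a loop, whereas SIMD hardware performs them all in the same cycle.

```python
def packed_add(a, b, lane_bits, word_bits=64):
    """Add two register words lane by lane, with no carry between lanes."""
    lanes = word_bits // lane_bits
    mask = (1 << lane_bits) - 1
    result = 0
    for i in range(lanes):
        shift = i * lane_bits
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        # Keep only the low lane_bits so overflow wraps inside the lane.
        result |= (lane_sum & mask) << shift
    return result


a = 0x0001_FFFF_0003_0004   # four 16-bit fields: 1, 0xFFFF, 3, 4
b = 0x0001_0001_0001_0001   # add 1 to each field

print(f"packed 16-bit add: {packed_add(a, b, 16):#018x}")
print(f"plain 64-bit add:  {(a + b) & 0xFFFFFFFFFFFFFFFF:#018x}")
```

In the packed result the 0xFFFF field wraps to 0 and its neighbor becomes 2. In the plain 64-bit add, the carry leaks into the neighboring field and turns it into 3, which is why SIMD hardware breaks the carry chain at field boundaries.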
