Savvy chip architects are seeking alternatives to well-worn superscalar designs. The reason: superscalar processors have become unwieldy, bursting with pipelines and caches aimed at boosting performance. Enter very long instruction word (VLIW) architectures, a blast from the past that has found new popularity.
The hallmark of VLIW is the ability to extract a highly parallel instruction stream from the application program it is supplied with, and dole those machine instructions out to the chip's numerous execution units. But to be truly VLIW, the silicon design must be far simpler than that of a comparable superscalar part. Indeed, when it comes to VLIW, less is supposed to be more.
As I alluded to above, a VLIW chip typically has numerous execution units, laid out in a neat grid, that run multiple instructions on each clock cycle. VLIW also requires smart compiler software to schedule those instructions (no small feat, as we shall see).
Historically, the first rumblings of VLIW were heard in the early '80s, when a bevy of minicomputer companies (most notably, Cydrome and Multiflow) attempted to construct large-scale systems based on the concept. However, the technology at that time was not bulletproof, nor was the marketplace ready, and the effort fizzled.
Nevertheless, the main architects of the aforementioned companies never stopped working. They simply went off quietly and refined their techniques. They also recruited a following of hardware and software engineers dedicated to realizing the VLIW vision, in which software is as important as, if not more so than, the hardware itself. (Among the leading lights of VLIW, then and now, are Josh Fisher and Robert Rau of Hewlett-Packard Labs.)
Today, the first fruits of turnkey VLIW chip technology are upon us, whether you're talking about the server world, the desktop space, or our own embedded arena. Indeed, it's important that embedded designers distinguish between VLIW in the large and VLIW in the small. That's because embedded VLIW implementations seem to have hewn to the inventors' original intent of simplicity as a means of extracting better performance.
Indeed, VLIW in the large is best embodied by Intel's EPIC IA-64 architecture, which is more complex than the company's sister 32-bit microprocessor families. We won't focus on it here, but trust me: it is.
As for VLIW better suited to emerging embedded applications, several new types of processors have appeared. First to hit the market were media processors incorporating VLIW concepts, such as Philips's TriMedia and Chromatic's MPact Media Engine.
Next, Texas Instruments released its VLIW-based “VelociTI” C6X series of digital signal processors (see Figure 1), which found applications in cell phones and related devices. Here, the company had to expend lots of effort convincing potential customers that the chip's C compiler could actually deliver a parallel instruction stream to the DSP. A year after the part's release, TI was able to call its evangelical effort an unequivocal success. In the DSP realm, TI's competitors, such as Analog Devices and StarCore (the latter a joint venture of Motorola and Lucent), also entered the VLIW fray.
On the conventional embedded front, perhaps the most notable implementation is the Crusoe processor family from Transmeta (see “The Software Side of Crusoe,” April 2000, p.85).
Moving forward, VLIW is expected to play a key role in future embedded-silicon designs. However, it will face strong competition from system-on-chip, DSP, and multicore chips. Note, too, that this is not an either/or situation, because a VLIW design can be fabricated using system-on-chip technology.
So what exactly constitutes a VLIW chip? Architecturally (or should I say, philosophically), it's actually pretty easy to describe. Only the actual implementation is difficult (that is to say, don't try this at home)!
The basic philosophy is to build processors that, unlike their superscalar counterparts, don't spend a lot of time and silicon figuring out what to do and when to do it. The hardware is kept simple and computation-only, capable of dumbly carrying out a very large number of operations in each clock tick.
This, according to VLIW proponents, results in a more architecturally elegant chip layout. For example, the on-chip elements are repetitive (multiple execution units), the elements tile nicely, and they can be packed closely together.
Another advantage is that hardware design takes far less time than it does for a superscalar processor.
However, VLIW's vaunted advantages hinge on the software having enough “smarts” to successfully schedule a parallel stream of instructions for execution. By definition, the term VLIW refers to a wide instruction word into which the different commands, or op-codes, of all the instructions to be executed simultaneously by the VLIW processor on the next clock cycle must be packed in advance. That is, in a VLIW design, you find what can be done in parallel and then you directly tell the hardware to do it in parallel.
That instruction scheduling is performed by what's called a trace-scheduling compiler. Such a compiler uses various techniques to look ahead at very large sequences of operations, through many branches, and schedule the whole thing at once.
In marked contrast, superscalar chips perform their instruction scheduling at runtime. Therein lies the purported advantage of VLIW: packing and scheduling are done in advance by the compiler rather than at runtime, so the chip doesn't have to do the final scheduling on the fly, which consumes a lot of silicon and a lot of time on the critical path.
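To make the compile-time packing concrete, here's a minimal sketch in Python. It's purely illustrative: the op names and dependence sets are invented, and a real trace-scheduling compiler is vastly more sophisticated. But it captures the core idea of filling fixed-width bundles ahead of time so the hardware never has to reason about dependences.

```python
def pack_bundles(ops, deps, width):
    """Greedily pack ops into VLIW bundles of up to `width` slots.

    ops:  list of op names in program order
    deps: dict mapping an op to the set of ops it depends on
          (assumed to form a DAG)

    An op may enter the current bundle only if every op it depends
    on was issued in an *earlier* bundle -- ops in the same bundle
    execute simultaneously, so they must be mutually independent.
    """
    issued = set()
    remaining = list(ops)
    bundles = []
    while remaining:
        bundle = []
        for op in list(remaining):
            if len(bundle) == width:
                break
            if deps.get(op, set()) <= issued:  # all inputs already issued
                bundle.append(op)
                remaining.remove(op)
        issued.update(bundle)  # results become visible to the next bundle
        bundles.append(bundle)
    return bundles

ops = ["load_a", "load_b", "add", "load_c", "mul", "store"]
deps = {"add": {"load_a", "load_b"},
        "mul": {"add", "load_c"},
        "store": {"mul"}}
print(pack_bundles(ops, deps, width=3))
# → [['load_a', 'load_b', 'load_c'], ['add'], ['mul'], ['store']]
```

Note how the three independent loads issue together in one bundle, while the add, multiply, and store serialize behind their inputs. All of this is decided before the program ever runs; the hardware just executes each bundle's slots in lockstep.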
VLIW detractors have argued that any compiler agile enough to effectively schedule so many simultaneous instructions will be overly complex and difficult to write. At this point, it's difficult to say whether they're correct, since VLIW software (read: compilers) is, for the most part, several years behind the hardware it is being designed to control (DSPs are the exception). Until VLIW compilation software reaches the requisite level of sophistication, no one will really know whether VLIW will become firmly entrenched or remain a two-decades-old silicon dream.
Still, some notable computer scientists believe they have overcome most of the barriers to designing good trace-scheduling compilers. In addition, an increasing supply of compiler writers trained at hotbeds of trace-scheduling research, such as Rice University, Cornell, and Stanford, is beginning to come on line. Indeed, some of those people are working in Fisher's group. Others are employed by Rau, who heads a group in HP Labs' Palo Alto operation. A little-known VLIW effort is also going on at IBM's Yorktown Heights, NY, research center. A “research compiler,” called the Trimaran Compiler, is also available for free download by anyone who wants to test-drive the code.
Despite its advantages, VLIW has long had one big practical problem, which continues to delay its movement out of the research lab and into the real world. The Achilles' heel of VLIW is that, in its pure form, it offers no object-code compatibility within a given family of chips. For example, a mythical VLIW processor with three execution units wouldn't be compatible with one with five units. Nor would those two be compatible with a third implementation containing 10 execution units.
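The incompatibility is easy to see in a toy sketch (Python, with invented op names): the same stream of independent operations, packed for machines of different widths, yields different bundle layouts, and in a pure VLIW those layouts are the binary format.

```python
def pack(ops, width):
    """Chunk independent ops into bundles of exactly `width` slots,
    padding unused slots with explicit NOPs (as a pure VLIW encoding
    must, since every slot feeds a specific execution unit)."""
    bundles = [ops[i:i + width] for i in range(0, len(ops), width)]
    for b in bundles:
        b.extend(["nop"] * (width - len(b)))
    return bundles

ops = ["op0", "op1", "op2", "op3", "op4", "op5"]
print(pack(ops, 3))  # two full 3-slot bundles
print(pack(ops, 5))  # two 5-slot bundles, the second mostly NOPs
```

A 3-unit machine expects the first layout and a 5-unit machine the second; hand either one the other's binary and the slot-to-unit mapping is simply wrong.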
However, the barriers against compatibility have been slowly and steadily beaten down during the past decade. Indeed, a big selling point Transmeta uses for its Crusoe chips is that they contain code morphing technology (CMT), which enables them to run x86 code.
Whatever the advantages of VLIW, debate is likely to rage for some time over whether and how quickly the technique will move into the mainstream. Supporters believe VLIW will offer the most benefit in those applications in which there's lots of inherent parallelism to begin with, particularly when it comes to processing pixels and sound. Indeed, multimedia is most often mentioned as the prime area that could benefit.
Several technical challenges remain. Most important is finding a way to handle the problem of speculative exceptions: those cases in which the processor “predictively” proceeds with an operation ahead of a branch in the program, in the expectation that the operation will, in fact, be required. However, it sometimes turns out the operation is illegal. The result is a program stall, or, in the worst case, a program error.
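The hazard can be sketched in a few lines of Python (a hypothetical example; a None dereference stands in for a load from a bad address). The compiler hoists a load above the branch that guards it so the load can overlap other work:

```python
def guarded(p):
    # Original program order: test first, load only if it's safe.
    if p is not None:
        return p["value"]
    return 0

def hoisted(p):
    # Speculative schedule: the load issues before the guard.
    # That's a win when p is valid -- but it faults in exactly the
    # case the branch would have skipped the load.
    v = p["value"]          # speculative load: faults if p is None
    if p is not None:
        return v
    return 0

assert guarded(None) == 0   # safe: the guard squashes the load
try:
    hoisted(None)           # the hoisted load faults first
except TypeError:
    print("speculative load faulted before the guard could squash it")
```

A VLIW machine needs some mechanism to defer or suppress such faults until it knows the speculated operation was really needed; otherwise a perfectly legal program can crash.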
Another issue is that superscalar CPUs are likely to offer a key benefit that VLIW chips can't deliver. The advantage for superscalars is that they can adjust to changing conditions at runtime: things such as the latency of an operation, whether you missed in a cache, or whether two addresses were the same or different. A VLIW processor cannot make those adjustments unless it's architecturally impure and has some superscalar elements added in.
Still, problems such as CPU stalls can be minimized by tricks such as prefetching, which helps in some cases but not in others. For example, if the address of the next thing you need is itself stored behind a pointer you haven't fetched yet, you're just stuck; it's pretty hard to predict what address you're going to reference next.
The situation is different for image manipulation-something at which VLIW chips are expected to excel, and a major reason why DSP has become a hot proving ground for embedded VLIW silicon. That's because image data usually reside in linear address spaces, where it's easier to handle functions like block-copy operations.
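The contrast between the two access patterns can be shown with invented addresses (a Python sketch, not real memory): a block copy walks a fixed stride that a prefetcher can predict, while pointer-chasing reveals each address only after the previous load completes.

```python
# Linear image data: the next address is always a fixed stride away.
BASE, STRIDE = 0x1000, 4
linear_addrs = [BASE + i * STRIDE for i in range(8)]

# Pointer-chasing: each node stores the address of the next node, so
# the address stream is unknowable until each load finishes.
# (The dict stands in for memory; addresses are made up.)
heap = {0x2000: 0x2A40, 0x2A40: 0x2010, 0x2010: None}
chased = []
addr = 0x2000
while addr is not None:
    chased.append(addr)
    addr = heap[addr]   # must complete this load to learn the next address

print([hex(a) for a in linear_addrs])  # regular: 0x1000, 0x1004, 0x1008, ...
print([hex(a) for a in chased])        # irregular: 0x2000, 0x2a40, 0x2010
```

The first stream is exactly what prefetch hardware, and a trace-scheduling compiler, can exploit; the second is the "you are just stuck" case described above.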
Alexander Wolfe is editor-at-large for ESP. He holds a BE in electrical engineering from Cooper Union. He has written assembly language code for embedded systems. He co-wrote From Chips to Systems: An Introduction to Microcomputers, 2nd Edition (Sybex, 1987). E-mail him at