Compiling software to gates
Are VHDL and Verilog past their prime, soon to be replaced by C-like design languages such as SystemC, Handel-C, and others? Professor Ian Page thinks a change is at hand.

Crisis, what crisis? A hard question, not because there is no crisis, but because there are so many affecting the semiconductor industry. The crises include the rising cost to create a state-of-the-art chip, the time it takes to complete a design with hundreds of millions of transistors, the problems of attracting and retaining the key engineers who can handle such complex chips, eliminating errors from the designs before tape-out, getting to market in time to catch the market window . . . the list goes on.

Put like that, we're facing a large catalog of problems and a recipe for despondency. Another view, however, is that we really have to deal with only two problems:

  1. How to create efficient and correct designs against a backdrop of exponentially increasing complexity.
  2. How to implement those designs in silicon against a backdrop of exponentially increasing costs of production.

Fortunately, both problems have solutions that could transform the situation and give control back to the engineers and innovators to produce new generations of high added-value products. The answer may be as close as the familiar C language.

Crisis in implementation
Despite the problems of moving to deep sub-micron devices, it would be foolish to expect any early reversal of the trends of the last four decades. Costs of implementing a state-of-the-art chip are already large and escalating rapidly. The inevitable result has already been seen in fewer ASIC design starts. Because of this cost, the future of electronics looks like one of a decreasing number of “standard” chips with which we'll build our new systems.

We're now at the point where a start-up company needing to produce a state-of-the-art ASIC for its product needs about $50 million of venture-capital funding. The VCs will rarely trust the founders enough to risk this sort of capital on an unproven idea. A route that has taken many bright engineers from an innovative concept to a world-leading technology corporation is now essentially closed to newcomers.

Two of the most interesting features of the changing semiconductor landscape are the move to “multicore” as the single processor reaches its inherent limits in this technology, and the rise of the field-programmable gate array (FPGA) as it incorporates more and more system-level components into its on-chip inventory. In fact, it's not difficult to view these two observations as aspects of the same underlying trend. Extrapolate both of them and you end up with an array architecture of quite complex computing-oriented components connected together with a high-speed, flexible communications network.

These chips will be general-purpose, and thus will fit the bill for tomorrow's shortened list of standard components. Over time and with further scaling, their increased power will come from enlarging the size of the array. Their generality will come from programming the fixed, underlying architecture to make it more application specific, possibly as a final manufacturing step, but more likely in the field by reprogramming the internal hardware components and communication networks.

In a world with few ASICs, reconfigurability will give application-specific advantages to a product, and such architectures are themselves easier to design, prove, and test. It's clearly much easier to design one logic cell and replicate it than to design and test a monolithic chip. This is especially true with the move to deep sub-micron architectures.

We should expect to see a decrease in the role of the ASIC except in very high-volume products and a rise in the role of the FPGA and reconfigurable architectures that evolve from it.

Crisis in design
Turning to the other of our two “big problems,” let's say that a chief silicon designer has spent the last 18 months completing a design and getting a product launched. It's not unrealistic to assume that this project required an expensive re-spin that actually cost a great deal more in terms of lost market share because of the 10-week delay in shipping the product. Anyway, that pain behind him, the chief designer is now starting on the next design. It's going to have twice as many transistors in it, but does he get twice as many designers allocated to do the job? Does he get double the time to complete the job? Does he get the leeway to make twice as many mistakes this time around? No, of course not. In fact, the chances are that his team remains about the same size, but that the pressure is greatly increased to turn the design around more quickly and with even fewer design errors.

Since we're stuck with both increasing complexity and static resources, there's only one way to make this miracle happen: increase the productivity of the designers themselves. The electronic design automation industry has done spectacularly well by bringing out better tools and languages and encouraging design reuse. It has given us a sustained 23% annual increase in designer productivity over decades. In a saner world, this would qualify electronic design automation for a major industry award. The problem, however, is that this 23% increase falls way short of the 60% or so annual increase in complexity coming from the underlying technology, as described by Moore's Law.

The “design gap” is the distance between these two figures. It represents the exponentially increasing gap between what we need to design in order to stay in business and the productivity of the resources we have available to complete that design.

There was a time when silicon was designed by hand—the layout of polygons. Increasing complexity forced designers to become much more productive through the adoption of standard cell libraries and schematic capture, despite the loss of control and “efficiency” this entailed. A decade or so later, another step change in productivity was needed for the same reasons and we gave up even more control and efficiency and switched to hardware-description languages (such as VHDL and Verilog) and logic synthesis. We're now in the situation where nearly all the increase in designer productivity is coming from design reuse. Although it's easy to predict that even more of this will come in the future, I'm taking as my starting point that this is still not enough to maintain a competitive edge and that yet another step change in design productivity is required.

The dreaded paradigm shift
Not another paradigm shift? 'Fraid so! The means we've so far invented to cope with the design gap haven't worked well enough. The pain is so great that most of the industry knows that something different needs to be done. A fair amount of agreement is emerging that this change will mean a move to “system-level” design. Unfortunately, there's no real consensus about what the term “system level” might mean. But it seems clear that the switch to system-level design will occur fairly discontinuously, much as the switches to VHDL/Verilog and to standard cells did.

To understand what these changes in design tools might be, let's turn 180 degrees and look at who will be using them. In the West, the number of engineering graduates is falling, which is not a hopeful sign. The number of programmers, however, continues to rise. The simple truth is that software engineers are easier to find, easier to train, and they can work fast and flexibly. In the world of software, a “re-spin” is done over a coffee break at almost no cost. So if we ask, “Where is there a workforce that's readily available, now and in the future, to deal with the problems of massive design complexity against tight time and cost measures?” the only answer I see is to turn to the world of programming and the software-engineering model.

My own thinking was changed quite radically in 1990 when I used my first FPGA chip. This was when I was an academic in the Computing Laboratory at the University of Oxford. I had previously built a 256-processor, SIMD graphics engine with a microcoded controller, driven by a 40-processor MIMD array—all done with wire-wrap technology. Predictably, it had taken far too much time and effort to build this machine so when I saw this FPGA device, I thought that this was the way I wanted to build hardware in the future—especially since I was working in a mathematically oriented computer science department that didn't have the resources to build “real” hardware. Also, the future of the FPGA seemed assured, since it would benefit from every turn of the Moore's Law handle. Because of my personal need, I always saw the FPGA as an (embryonic) general-purpose substrate for building arbitrary digital hardware. It was only later that I realized the benefits might extend to a whole industry.

Another lesson history had taught me was the value of software. Every time I built a new machine, I had to create the software tools (such as a compiler) for it, then create the software infrastructure (an operating system), and then start coding applications for it. This took time and effort but even when successful, the rewards were short-lived. Typically, another turn of the Moore's Law handle made my handcrafted hardware solution irrelevant. When I switched it off for the final time and looked to build my next architectural wonder, I had to throw all that software away as well because it had been specific to the defunct machine. I finally determined that I wouldn't repeat this mistake and that any solution I might come up with for exploiting hardware in the service of demanding applications would have a reusable software basis.

Thus I'd convinced myself that reconfigurable hardware was the future. I will use that term from now, as I think FPGAs are still only a first-generation embodiment of the big idea of a general-purpose, reconfigurable substrate for special-purpose computing. Since ultimately application programs would run on this hardware, it seemed most sensible to look for a way to compile such programs directly onto the reconfigurable substrate. That way, whenever a more powerful version of the substrate came along, a simple recompile could take advantage of it.

The art of the possible
If we accept that the goal is to compile relatively ordinary programs into effective hardware, it would be great if we could write a compiler that took in the sort of C programs that programmers write today and produced a netlist from them. This is the C-to-gates approach. Unfortunately, this approach doesn't work because of the parallel nature of hardware. This shouldn't be too much of a surprise, because optimizing a sequential program onto a parallel machine has long been known to be computationally intractable, which basically means we will all die long before the compiler ever produces a totally optimal solution from its first source program.

We might hope there would be compiler algorithms that do a “good enough” job despite the complexity, as heuristics already do for the “traveling salesman” problem, another algorithmic problem where finding the absolutely optimal solution is intractable. Unfortunately, and despite 40 years of parallelizing compilers for all sorts of machines, these algorithms don't work terribly well. So the likelihood is that the older programmers among us will also die before this happens.

It's actually quite easy to build a compiler that will convert arbitrary C programs into a hardware netlist. Such a compiler might even make a half-decent job of parallelizing the program along the way. However, these systems typically only work well on small examples, and when they fail to do well on larger ones, there's no way for the designer to help them do better. The supposedly smart compiler has in fact been too smart for the designer's good. It has left him not in control of his own design and with no obvious way forward.

The stark choice before us is to give up on either our notion of what the output of the compiler should be or what its input is. Given that hardware is still far from being free, it's essential to keep hold of the requirement for consistently producing efficient hardware from the back end of the compiler. That decision forces us then to change our ideas about what we must give to the compiler as input. If it can't be something like sequential C, then it has to be a language where the designer himself is expressing the desired parallelism that our compilers are currently unable to generate for us.

The problem with this approach is the need for programmers to express their algorithms and applications in a parallel-friendly fashion, which unfortunately does make their job harder. But at least it's possible, and the upside is a powerful one. With this approach, we have a way to use programming expertise to turn complex software into good hardware, and we can write compilers that do the job well and never leave the designer out of control.

This way, the programmer is doing something similar to an electronics engineer who decides to fill the two-dimensional space on a chip with, say, two processors, four memory blocks, and 20 multipliers. The software approach raises the level of abstraction so that we're now talking about computations (algorithms and programs) rather than hardware function units. However, we're still essentially filling two-dimensional space with them, but now we deliberately give up any pretense of control over where they are placed. If the programmer wants to fill silicon with two copies of the a computation and three copies of the b computation, he can simply say something like:

par {a; a; b; b; b; }

If the compiler turns each computation into a separate piece of hardware, then the programmer is doing the same basic job as the electronics engineer, by filling up the available silicon area with useful parts of the total solution. The difference is that the programmer is doing it at a much higher level of abstraction. The programs are therefore shorter, easier to produce, easier to change, easier to get right, and easier to maintain.
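To make the analogy concrete, here is a software analogue of that par block in ordinary C with POSIX threads. This is a sketch of the semantics only; the names comp_a, comp_b, and run_par are my own inventions for illustration, and a real Handel-C compiler would turn each copy into its own piece of hardware rather than a thread. par { a; a; b; b; b; } becomes five concurrent activities, and the par block finishes only when all five have finished:

```c
#include <pthread.h>

/* Software analogue of par { a; a; b; b; b; }: two copies of
   computation a and three copies of computation b, all running
   concurrently.  Each copy writes its result into its own slot. */
static int results[5];

static void *comp_a(void *slot) { *(int *)slot = 1; return NULL; }
static void *comp_b(void *slot) { *(int *)slot = 2; return NULL; }

/* Launch the five copies in parallel, wait for all of them (the
   join is the end of the par block), and combine their results. */
int run_par(void) {
    pthread_t t[5];
    void *(*body[5])(void *) = { comp_a, comp_a, comp_b, comp_b, comp_b };
    for (int i = 0; i < 5; i++)
        pthread_create(&t[i], NULL, body[i], &results[i]);
    for (int i = 0; i < 5; i++)
        pthread_join(t[i], NULL);
    return results[0] + results[1] + results[2] + results[3] + results[4];
}
```

The join at the end is the essential point: like the par block, the whole construct completes only when its slowest branch does.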

Time, space, communications
In fact, it's a bit more complex than I indicated just now. To embed complex algorithms into hardware we need to pay detailed attention to time, space, and communications. By space, we're talking about parallelism, as discussed previously. By communications, we are talking about the way various elements of a total solution pass data and synchronization information to each other; we'll come back to this later.

Turning to time, the way that electronic engineers embed their solutions in the time domain is by considering properties such as propagation delay, clock skew, gate delay, pipeline length, and so on. However, in the same way that we don't know how to build a compiler that will parallelize automatically, we also don't know yet how to get a compiler to reliably produce a computation that is efficiently embedded in the time domain. Given the current impossibility of producing these computations reliably and well with an automatic compiler, we apply a similar argument to conclude that our input language must also have the power to express the temporal behavior of our programs.

Many of the time concepts I just mentioned are complex in practice. If we want to embed very complex computations into the time domain, we have to find a higher-level, more abstract way to talk about it. This is especially true if we want designers with a software background to do some or all of the designing. There would be no point trying to teach them about clock skew and metastability. That would require a university course and we already know that not enough people want to take that sort of course. So we must find a higher-level concept of time that allows designers, whatever their background, to worry less about the small detail and more about the big picture as they try to embed their designs into the time domain.

A number of solutions are possible here, but the one I favor is to introduce the notion of a clock cycle. Even programmers have an intuitive notion of what a clock cycle is and what it means. We've all traded in our 600MHz PCs for 2.4GHz machines and know sort of what that means and why we're doing it. There's also a neat connection by coupling the fundamental idea of time in the electronics world, the clock cycle, with the fundamental building block of our programs, the assignment statement.

Nearly all programming languages come from the same model of “imperative” programming. In this model, expressions are evaluated (things are calculated), and the results are assigned (stored) into a variable (register or store location). In fact it's only the assignment statement that causes the computation to advance at all; the rest is all control. The control statements in a program, such as if and while statements, procedure calls, and so forth are simply determining which of the assignment statements will be executed and therefore which variables will be updated. It's a bit like trying to understand how a magician performs a card trick. The only things that truly matter are where and when he puts (assigns) one of the playing cards into his pocket or up his sleeve (into a variable). All the rest is irrelevant arm waving and distraction. If you want to understand the real action of the computation, just “follow the lady.”

Putting aside that little diversion into prestidigitation, my favored solution for getting time into the software language is to link these two concepts by saying that time advances in a computation in units of one clock cycle and that assignments take exactly one clock cycle to execute. Nothing else takes any time. If you want to understand the time behavior of a program, you need only look at the assignment statements that get executed. Each one takes one clock cycle, and no other part of the computation takes any. Thus, three assignments in sequence take exactly three clock cycles, as in {a=x; b=y; c=z;}, whereas three in parallel take exactly one clock cycle, as in par {a=x; b=y; c=z;}.
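That counting rule is simple enough to state as code. The toy model below, in plain C, captures it (the names seq_cycles and par_cycles are mine, purely for illustration): sequential composition adds up the cycle counts of its statements, while parallel composition finishes when its slowest branch does.

```c
#include <stddef.h>

/* Toy model of the timing rule: every assignment costs exactly one
   clock cycle, and nothing else costs any. */

/* Cycles consumed by n branches run in sequence: the costs add. */
size_t seq_cycles(const size_t *branch, size_t n) {
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += branch[i];
    return total;
}

/* Cycles consumed by n branches run in parallel: the longest wins. */
size_t par_cycles(const size_t *branch, size_t n) {
    size_t max = 0;
    for (size_t i = 0; i < n; i++)
        if (branch[i] > max)
            max = branch[i];
    return max;
}
```

Three single-cycle assignments thus cost three cycles in sequence but only one in parallel, exactly as in the example above.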

From this example we have a framework for building parallel arrangements of processes. Although these processes can communicate with each other through shared variables, we'll also want a more direct model of communication so processes can send data and synchronization information to each other. This change is especially needed if we want to make the processes locally synchronous, but globally asynchronous to reduce the massive effect of today's high-speed global clocks.

Although many ideas are competing for the role of generalized model of communication, I'm particularly in favor of one of them, the channel of the CSP/occam system. Channels have the benefit of strong theoretical underpinning from the algebra of Communicating Sequential Processes (CSP)—see Communicating Sequential Processes , by C.A.R. Hoare, Prentice Hall, 1985—and a wealth of practical application experience. A channel has exactly two ends, one in a sending process and the other in a receiving process. A process that attempts to communicate using a channel (either sending or receiving) will be held up until the process at the other end is also ready. Only when both are ready does the communication of a single item of data (of any specified type) happen, at which point the two processes are unblocked and free to run on. Thus, these channels are characterized as point-to-point, directed, blocking, and synchronizing.
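The rendezvous behavior described here can be mimicked in ordinary software. Below is a minimal sketch in C with POSIX threads; the names chan_t, chan_send, chan_recv, and chan_demo are my own, and this illustrates the blocking, synchronizing semantics only, not how Handel-C realizes channels in hardware. chan_send does not return until a receiver has taken the value, and chan_recv blocks until a sender arrives:

```c
#include <pthread.h>

/* A point-to-point, directed, blocking, synchronizing channel
   carrying a single int per rendezvous. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int value;
    int sender_ready;    /* a sender has deposited a value   */
    int receiver_done;   /* the receiver has taken the value */
} chan_t;

void chan_init(chan_t *c) {
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->sender_ready = 0;
    c->receiver_done = 0;
}

/* Deposit v, then block until the receiving process has taken it. */
void chan_send(chan_t *c, int v) {
    pthread_mutex_lock(&c->lock);
    while (c->sender_ready)            /* one rendezvous at a time */
        pthread_cond_wait(&c->cond, &c->lock);
    c->value = v;
    c->sender_ready = 1;
    c->receiver_done = 0;
    pthread_cond_broadcast(&c->cond);
    while (!c->receiver_done)          /* wait for the receiver */
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

/* Block until a sender has arrived, then take its value. */
int chan_recv(chan_t *c) {
    pthread_mutex_lock(&c->lock);
    while (!c->sender_ready)           /* wait for a sender */
        pthread_cond_wait(&c->cond, &c->lock);
    int v = c->value;
    c->sender_ready = 0;
    c->receiver_done = 1;
    pthread_cond_broadcast(&c->cond);  /* release the sender */
    pthread_mutex_unlock(&c->lock);
    return v;
}

static void *demo_sender(void *arg) {
    chan_send((chan_t *)arg, 42);
    return NULL;
}

/* Run one rendezvous between two threads; returns the value received. */
int chan_demo(void) {
    chan_t c;
    pthread_t t;
    chan_init(&c);
    pthread_create(&t, NULL, demo_sender, &c);
    int v = chan_recv(&c);
    pthread_join(&t, NULL);
    return v;
}
```

Whichever side arrives first simply waits; only when both ends are present does the single data item cross, after which both processes run on, just as the channel model requires.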

Three strikes and you're in
Gathering the above threads together, I'm suggesting that we get programmers (in addition to everything they already have to do) to actively design their programs with respect to parallelism (space), clock cycles (time), and channels (communication). That would seem to be three strikes against this method. Surely we want to make things easier, not harder? I would love to remove each of these hindrances to system-level design, and quite a few more too, but I simply don't know how to do it. I think the goal of automatically compiling an arbitrary piece of software or a mathematical specification into an efficient silicon implementation is simply too far away at present.

Three strikes indeed, but if you buy into this model of design, you may be able to win the game and maybe the series too. It's now demonstrably possible to take mortal programmers (they should be good, but they don't need to be superhuman) and give them a compiler that works 100% of the time, rarely gives any surprises, and never puts them in a situation where they're not completely in control of the hardware they produce. Because no smart compiler tries to solve the really difficult parts of the design, this system won't suffer from the inevitable failures of the simple C-to-gates approach.

I called this the Handel model of computing, after the 18th-century German composer. When the model was applied to the C language, it resulted in Handel-C. That language, and a large-scale development environment wrapped around it, is now a commercial product sold by Celoxica, the company founded on this research.

Design exploration
When I started the research work behind this article as an academic at Oxford University in 1990, I wanted two things from it. First, I wanted to reclaim the ability to design and build hardware quickly and cheaply. Seeing that this was a problem for others, this goal evolved into a substantially larger one, namely the “redemocratization” of hardware. Small hardware companies could compete with big ones back in the days of TTL logic chips. I thought it should be possible to put the power of application-specific hardware back into the hands of individuals and small companies. The FPGA allied to a powerful compiler that raised the level of design abstraction from hardware to software seemed to be the best hope of doing that.

The other thing I was looking for in 1990 was a way to speed up computations. I was interested in highly responsive graphical user interfaces, video, and machine vision, where the data and computational bandwidths problems are huge. Some of that desired speed-up is achieved in the Handel-C approach.

What I discovered early on, and what was reinforced time and again, was that the biggest gains always came from better architectural exploration. When it takes only minutes to change some program code and test a new version with a completely different approach to parallelism, it's possible to explore much more of the design space than when working with hardware function blocks, wires, buses, and clocks.

As a hardware engineer myself, I was surprised to find that software engineers were taking my tools and regularly besting hardware engineers at hardware design. It took me a few years to realize it was the software engineering techniques that made the difference. The software approach allowed them to try out perhaps 10 or 20 different designs to every one they could try out in VHDL or Verilog. Designs done this way are better simply because the designer can dismiss many more suboptimal alternatives. All of this won't help the guy who gets it right the first time, every time; on the other hand, he's just a mythical creature. More often we just beat away at a problem, never really sure we have the “best” solution in any dimension at all. The more chances we give ourselves to explore a bit more of the design space, the more likely we are to come up with a better design. The more at bats, the more hits. In design exploration, he who works faster works better.

Does our protagonist, the chief designer we introduced earlier, live happily ever after? Maybe, maybe not. But he does adopt these techniques and finds it easier to hire the people to produce the ever-larger designs his company demands of him, and at least he stays around to design another day. In today's highly competitive world that might be as good as it gets.

Ian Page has worked for over 30 years in the design and construction of parallel computing systems, most of that time as an academic at Oxford University. He created Celoxica as a spinout company from the University based on the research covered in this article. He now runs an early-stage, high-tech investment fund in the UK and is an honorary visiting professor at Imperial College in London.
