My desktop PC executes billions of instructions per second, even when sitting more or less idle. It reliably (perfectly, actually) moves data to and from a billion bytes of DRAM in nanoseconds while sucking huge files from 400 gigabytes of hard disk space. Days and at times even weeks go by without a single glitch, yet every second of each of those days the machine slams vast amounts of data around. A single bit error will crash the machine. Yet the occasional problems all arise from imperfect software; the hardware operates flawlessly.
The engineers who design computers work with imperfect components. No one really knows the characteristics of the ICs, capacitors, or any other element they use. That 1K resistor might actually be 1,024 ohms, and though it's rated at a quarter watt, odds are it'll handle a bit more than that, if used in a reasonably-ventilated space in a non-extreme environment. The 5.00-volt power supply might be putting out more like 4.95 volts, and that figure is likely to change as components age over the years and the mains vary due to summer air-conditioning demands. The track on the PC board doesn't really act like a wire at all; it's a complex transmission line which reflects, attenuates, and distorts the digital signal it carries.
But the PC operates reliably because engineers realize their components are less than perfect. They design margin into every aspect of the system. In an ideal world perhaps just a single electron is enough to distinguish a zero from a one, but in the grimy reality of engineering, EEs push thousands or millions of electrons down each wire. Capacitors are hard to make precisely; some come with wild tolerances like -20/+80%. We're not really sure how many farads the vendor will provide, so we design in plenty of margin to ensure success.
Margin is the essence of reliable engineering. Civil engineers don't know exactly how strong a bridge beam will be, especially when it's a concrete structure poured on-site by bored and careless laborers. The bridge stands because the beam is two or three times stronger than absolutely needed. When building the Brooklyn Bridge, Roebling discovered that the wire used to suspend the entire bridge was of inferior quality (see David McCullough's wonderful book The Great Bridge). Yet his design had so much margin that the bridge still stands a century later, still held up by some of that bad wire.
In the firmware world we work with perfect components. A one is a one is a one. But there's no margin in the world of firmware engineering. One uninitialized variable, a single miscomputed bit, or just one mismatched PUSH/POP pair causes the application to crash. If the user types in unexpected data or operates knobs out of sequence, our code might kill someone (take, for example, the Therac-25).
Our programs are like bridges without margin, likely to collapse under the feeblest stress.
In truth, there are some techniques that add a bit of margin to our code. Build the system around an MMU and individual tasks might fail, but a very smart OS can save the system. Pepper the code with assert() calls to find error conditions and then take corrective action. Add redundant execution streams to identify and repair software errors. Include stack monitors, malloc() traps, and plenty of other instrumentation to build fault-tolerant code. Of course few of us actually do any of this.
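To make one of these techniques concrete, here is a minimal sketch of a stack monitor of the "paint and measure" kind. It is an illustration under stated assumptions, not a drop-in implementation: the stack is simulated with an ordinary array that grows downward from its high end, and every name in it (task_stack, FILL_PATTERN, MIN_MARGIN_WORDS, stack_paint, stack_unused_words, stack_margin_ok, simulate_usage) is hypothetical. On a real target the region would come from the linker script or the RTOS, and the check would probably run from a periodic task.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS      64u          /* size of the simulated stack (assumed) */
#define FILL_PATTERN     0xDEADBEEFu  /* value unlikely to be written by real code */
#define MIN_MARGIN_WORDS 16u          /* hypothetical margin requirement */

/* Simulated task stack; it grows downward from the high end of the array. */
static uint32_t task_stack[STACK_WORDS];

/* Paint the whole stack with a known pattern before the task runs. */
static void stack_paint(uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    for (size_t i = 0; i < words; i++) {
        stack[i] = FILL_PATTERN;
    }
}

/* Count words at the low end that still hold the pattern, i.e. words the
 * task has never touched. A word that happens to be overwritten with the
 * pattern itself would be miscounted; that is a known limit of the method. */
static size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t unused = 0;
    while (unused < words && stack[unused] == FILL_PATTERN) {
        unused++;
    }
    return unused;
}

/* Return 1 if at least min_margin words were never used, 0 otherwise. */
static int stack_margin_ok(const uint32_t *stack, size_t words, size_t min_margin)
{
    return stack_unused_words(stack, words) >= min_margin;
}

/* Stand-in for a running task: dirty 'used' words starting at the top. */
static void simulate_usage(uint32_t *stack, size_t words, size_t used)
{
    for (size_t i = 0; i < used && i < words; i++) {
        stack[words - 1u - i] = (uint32_t)i;
    }
}
```

In use, the system would call stack_paint() once at startup and stack_margin_ok() periodically. When the check fails, the firmware can take corrective action (log the event, restart the task, or fall back to a safe state) before the stack actually overflows. The margin figure is a design decision in the same spirit as choosing the next resistor size up.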
Yet these strategies reveal that firmware is topsy-turvy compared to any other engineering discipline. Software margins come at the expense of vastly increased design costs, with, in these days of cheap transistors, little in the way of increased production expenses. The Shuttle's code is probably the best software ever written. Price tag: $1,000 per line.
Bridge-building, though, is all about materials cost. A mere stroke of the designer's pen increases the size of a beam, but perhaps doubles production costs. The wise EE who's worried about pushing 0.2 watts through a 0.25-watt resistor simply uses the next size up. There's zero design effort but substantial recurring costs.
Perhaps software engineering is somewhat akin to automotive design. The car industry must minimize recurring costs, which include the costs of recalls and repairing defects. Carmakers spend billions engineering a new product. Or maybe it's like building a spacecraft; of the $800 million spent on the Mars Exploration Rovers, only a pittance is in materials cost. Engineering sucked up the largest chunk because failure is intolerable.
Both automotive and spaceflight share the same philosophy: perfection is worth plenty of engineering dollars. Sadly, that's a far cry from the approach of most firmware engineering projects, where nothing is more important than minimizing NRE and the schedule.
What do you think? Can we put more margin into our firmware? Is it worth the extra time and effort?
Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. He founded two companies specializing in embedded systems.
Yes, Jack, you can do it on a shoestring and in record time. It requires, though, having tools and libraries with these things built-in. It requires an attitude that when I'm building tools and libraries for myself, I invest more effort than when I'm cranking out application-specific code. The earlier in my career I am, the more of those tools I'll be building myself, so I have to have the attitude that these will be used for many years over many projects. Then, it's worth the effort to tell management there'll be some slip on this and take the hit early. And, if you find a piece of code being used over and over, it may be worth the effort to make it absolutely bulletproof now, so future projects will benefit. That's my two cents.
– Joe Preston
I don't know if you can really compare hardware and software reliability directly this way. As an example, you cite the Shuttle's coding cost, but remember that the overall system uses five computers, each processing the same data and then voting on the outcome. In this way even multiple hardware failures will not result in catastrophic loss.
As in other subjects, part of the problem is in training. Many of our engineers today seem to think that the hardware will always be correct and that whatever it tells the software is gospel, something that is just not true. As systems age (especially electromechanical systems), parts don't work the way they do in the lab. Yet mention being able to design a system with tolerance for effects like this and most engineers respond that the lab system doesn't work like that. We probably need to spend a bit more time in the field with real-world problems than what we see in the development labs.
– Tom Mazowiesky
Are we talking about EMBEDDED SOFTWARE? Of course margins are built into any embedded firmware, at least any that is built here. I consider analysis of interrupt timing latency, stack depth (with and without recursion), and error recovery (in embedded systems it serves little purpose to display an error message except as a last resort) a means to establish reasonable margins in embedded software. It has been my experience that code that is subjected to periodic design reviews rarely has many overt errors or bugs in it. It seems that most of the late design and early production bugs and failures are due to inadequate analysis of memory and time usage. Just as in mechanical production, if the tolerances are too tight, the item may work on the bench but is a bear to manufacture.
– Michael Weisner