Our long-held assumptions about reliable and deterministic hardware may not be valid much longer. Here's the reason why
Safety-critical systems that rely on firmware are inherently problematic. Outside of some purely academic work, software can't be “proven” correct. We can do a careful design, work out failure modes, and then perform exhaustive testing, but the odds are that bugs will still lurk in any sizeable chunk of code. DO-178B and other standards exist to help ensure reliability, but guarantees are elusive at best. No one really knows how to make a perfect program when sizes reach into the hundreds of thousands of lines and beyond.
It seems that in a capitalist economy the edge between increasing complexity and needed correctness is mediated-not well of course-by litigation. I've no doubt that some horrible accident is coming, one that will be attributed to a flawed embedded system. Inevitably it will be followed by millions of dollars in lawyers' fees and more to the victims. High-level corporate executives will suddenly understand the risks of building computer-based products.
Perhaps this parallels the Green movement, which has shown that the cost of many products, once you factor in environmental degradation and clean-up bills, greatly exceeds the sell-price. I suspect, sadly, that it won't be until the looming embedded disaster arrives that management will understand the true costs of building complex systems. Today, too many view software quality as a nice feature if there's time, but figure the company can always reflash the system when the usual bugs surface. As embedded products control ever more of our world, in ever more interconnected and complex ways-maybe in ways that exceed our understanding-quality will have to become the foundation on which all other decisions are made.
Since the dawn of the embedded world, we've regarded software with implicit mistrust. Controlling a dangerous mechanical device? There's always a hardware-based safety interlock that will take over in case of a crash. The hardware will save the day.
We generally assume the hardware is perfect. Yes, wise developers create designs that fail into a safe state, since components do die. But mostly we believe in catastrophic problems, where a chip just stops functioning correctly, a mechanical shock breaks something, or a poor solder joint, stressed by thermal cycles, lets a chip lead pop free.
Hardware is deterministic. If it's working, well, it works. Until it breaks. But what about glitches? You know, those erratic and inexplicable things the system does but we can never reproduce. A wizened old technician said to me, “If it ain't broke, I can't fix it.” This is the model for accepting rare non-repeatable events.
Years ago, I worked on a project that tought me never to allow wire-wrapped printed circuit boards in prototypes. (Wire-wrapped circuits often exhibit squirrelly problems due to long lead lengths and crosstalk between wires. But a human problem was worse.) The engineers I worked with invariably attributed every strange and irregular behavior to the lousy wire-wrapped prototype. “Once we go to PCB everything will be fine. This is just a glitch.”
The same problem repeatedly surfaced in the circuit boards, necessitating yet another design respin. “It's just a glitch” frequently means “something is terribly wrong and we don't have a clue what's going on.” The culprit was a timing error, one that created observable symptoms only rarely.
When I was in the emulator business, I was shocked by the number of target systems that exhibited erratic memory problems. Our product's built-in tests pounded hard on memory. They frequently reported unrepeatable, yet undeniably serious problems, especially with big RAM arrays. The tests were designed to create worst-case situations of heavy bus switching. Tiny amounts of power supply droop, exacerbated by the massive transitions, created occasional failures. Impedance issues were even more common; the test patterns tossed out at the highly capacitive RAMs created so much ringing that the logic couldn't always differentiate between zeroes and ones!
It makes sense for managers to institute a “no glitch” policy. Until a prototype works reliably, or until the cause of a problem is well understood, we keep improving the system design in the pursuit of a high quality product.
Hardware can be less than perfect, but in the examples I've just shared, the problems all stem from design flaws. More careful design, better reviews, and additional testing should improve such systems substantially. It's the software that's going to be the problem, since the hardware is but a platform that runs those instructions faithfully.
The times may be changing. According to Intel, any assumption that processors just do what they are told is now wrong. The latest CPUs will run your program, and pretty fast at that. But once in a while, you can expect a bit in the instruction stream to flip at random. It's the price of high technology.
Scared yet? I was after I read a September 3, 2001 EE Times article about Intel's McKinley processor. One of its neat features is an L3 cache located onboard the CPU. This about doubles some performance figures.
The cache is big and the geometry of the part quite tiny. It's fabricated with .18-micron line widths, which is astonishingly small but no longer bleeding edge. That's 180 nanometers, or something like a third of the wavelength of red light. (When light wavelengths look big, surely we've fallen through the looking glass into the realm of the fantastic.)
In the article, Intel acknowledged that this small size coupled with the large surface area of the four megabyte L3 cache creates a serious target for incoming cosmic rays. There's very little energy difference in such small parts between a zero and a one, so occasionally the cache may be struck by an incoming particle of just the right energy to flip a bit. What happens next is hard to predict. Perhaps the processor will execute a cache-miss, causing a flush and reload. But maybe not. A cache with wrong data, even when it's just a single bit, is a disaster waiting to happen.
How bad is the problem? Apparently they've looked at this statistically, and feel that odds are the average user will see less than one bit-flip per thousand years. That's a very long time. (I can't help but wonder if Intel can justify a 1,000-year mean time between failures (MBTF) partly because no one expects a Windows-based PC to run for more than a day or two between crashes anyway.)
Can embedded systems live with a 1,000-year MTBF? Perhaps. Unless one considers that high-volume production means that, worldwide, cache bits will flip rather often. Build a million widgets with these reliability parameters, and statistics suggest 1,000 bit-flips per year, divided among all of the deployed products. That is, maybe 1,000 crashes per year would be due just to this one infrequently encountered hardware flaw.
Suppose the airlines retrofit all commercial planes with a super-gee-whiz new kind of avionics that includes just a single McKinley CPU. I read somewhere that, on average, there are 7,000 airliners flying at any time. Will this cause seven crashes per year? Or could it cause even more, since radiation sources pack more punch at higher altitudes.
This is not a rant against Intel or the McKinley processor. The company showed courage by admitting the issue. The problem stems from the evolution of our technology. All vendors pushing into the deep submicron will have to grapple with erratic hardware as transistor sizes approach quantum limits. The PC industry generally leads embedded by a few years as they pioneer high density parts, but we do not follow far behind. We embedded developers can expect to see these sorts of problems in our systems before too long.
If 180nm line widths are subject to cosmic rays then what does the future hold? Most roadmaps suggest that leading-edge parts will use 50nm geometries within 10 years. To put this in perspective, a hydrogen atom is about an angstrom across. In a decade, our line widths will span a mere 500 atoms. It's hard to conceive of building anything so small, especially when a single chip will have a billion transistors made from these tiny building blocks.
Intel predicts these near-future parts will run at 20GHz. Even CISC parts are evolving towards the RISC ideal of one clock per instruction, so that processor will execute an instruction in 50 picoseconds! Not only are the transistors miniscule, they operate at speeds that give RF designers fits.
Heat, too, conspires against chip designers. The McKinley reportedly dissipates 130 watts. A few of these CPUs would make a fine toaster, and, no doubt, that much embedded intelligence will ensure the perfect piece of toast, every time.
I can't find a McKinley datasheet, but the previous generation Itanium shows a max current requirement (processor plus cache) of about 90 amps. This is not a misprint: 90 amps. That's not much less than a Honda's engine cranking current; it will drain a Die Hard in half an hour. The part comes with an elaborate thermal management subsystem that automatically degrades performance as temperature rises, to avoid self-destruction.
Heat has always been the enemy of electronics. High temperatures destroy semiconductors. Thermal stresses can cause leads to pop free of circuit boards. Connectors unseat themselves, PCBs change shape, and all sorts of mechanical issues stemming from high temperatures reduce system reliability. I can't help but wonder if simply powering systems up and down, with the resulting heating and cooling cycles, will create failures. Once we thought semiconductors lasted forever. Will this be true in the future?
It's really quite marvelous that a $1,000 PC running at 1GHz or so works as well as it does. Think about the Saganesque billions of bits-maybe closer to hundreds of billions-running around your machine each second. Even when the machine is “idle” (whatever that means), it's still slinging data at rates undreamed of a decade ago. Yet if just one of those bits flips, or is misinterpreted, or goes unstable for a nanosecond or two, the machine will crash.
Our traditional assumption of reliable hardware will not be valid in the high-density, very fast, and quite hot world that's just around the corner. Intel's disclosure of cosmic ray vulnerabilities may actually be a wake-up call to the computer community that it's time to rethink our designs.
Perhaps the vendors will counter with redundant hardware. In some cases, this makes a lot of sense. It's pretty easy-though expensive-to widen memory arrays and add error correction codes that detect and correct one and two bit errors. Much more difficult is creating reliable CPU redundancies for a reasonable price. Today the cache is at risk, but in the 50-nanometer, one-billion transistor processors of the near future, much of the part's internal logic will surely be susceptible to random bit flips as well. The space shuttle manages hardware failures using five “voting” computers, so we know it's possible to create such redundancy. But the costs are astronomical. The shuttle is an economic anomaly funded by taxpayer money. Most embedded systems must adhere to a much less generous financial model.
But the hardware has always been unreliable. Older engineers will remember the transition to 16KB (16,384 bits, that is) DRAMs in the '70s. New process techniques produced memories subject to cosmic rays. Here we are, a quarter-century later with similar problems! Most of us at the time fought with flaky DRAM arrays, blaming our designs when the chips themselves were at fault. New kinds of plastic packages were invented, until the vendors discovered that some plastics generated alpha particles, which also flipped bits. After a year or two the problems were resolved.
With the dawn of the PC age, DRAM demands skyrocketed. The parts were much less perfect than those we use today, so all PCs used 9-bit wide memory, the ninth bit being devoted to parity. Today's reliable parts have mostly eliminated this-when was the last time you saw a parity error on a PC?
In the embedded world, we design products that might be used for decades. Yet some components won't last that long. Many vendors only guarantee that their EPROMs, EEROMs, and flash devices will retain programmed data for ten years (some, like AMD, offer parts that remember much longer). After a decade, the program might flit away like a puff of smoke. Many vendors don't specify a data retention time, leaving designers to speculate when the firmware will become even less than the ghost in the machine.
We've had some issues with hardware reliability already, but have mostly dealt with the problem by ignoring it. I suspect we're going to see new kinds of firmware development techniques, designed to manage erratic hardware behavior. One obvious approach is a memory management unit that wraps protected areas around each task. A crash brings just a single task down, though what action the software takes at that point becomes problematic. Perhaps future reliable systems will need a failure mode analysis that includes recovery from such crashes. Firmware complexity will soar.
Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. Contact him at .