To gear heads like me the history of engineering is rich in stories and lore, of failings and successes, and of triumphs and defeats of individual engineers. I remember reading Michener's The Source in high school and being entranced by his description of how the engineers of Megiddo, near Jerusalem, dug a tunnel 210 feet long some 2,900 years ago. The city was under siege and its well was located outside the city walls. With uncanny skill, they bored under the walls, secretly, navigating with only the crudest of instruments, yet somehow targeting the narrow fount perfectly.
Even the humblest of artifacts of technology have fascinating stories. Friends still make fun of my reading Henry Petroski's 400-page book titled The Pencil. Yet even this simplest of all writing devices sports a complex and fascinating history, one of engineers and artisans optimizing materials and designs to give users an efficient writing instrument.
Then there's James Chiles' Inviting Disaster, a page-turner of engineering failures, from bridge collapses, airline crashes, and offshore oil platform sinkings to, horrifyingly, near nuke exchanges. Strangely, Chiles doesn't describe the famous loss of the Tacoma Narrows Bridge, which succumbed to wind-induced torsional flutter. The bridge earned the nickname “Galloping Gertie” from its rolling, undulating behavior. Motorists crossing the 2,800-foot center span sometimes felt as though they were traveling on a giant roller coaster, watching the cars ahead disappear completely for a few moments as if they had been dropped into the trough of a large wave.
Failures can be successes. When an aircraft goes down the NTSB sends investigators to determine the cause of the accident. Changes are made to the plane's design, maintenance, or training procedures. This healthy feedback loop constantly improves the safety of air travel, to the point where now it's less dangerous to fly than walk. That's plenty strange when you consider the complexity of such a machine. 400,000 pounds of aluminum traveling at 600 knots 40,000 feet up, in air that's 60 below zero, with turbines rotating at 10,000 RPM. It's astonishing the thing works at all.
Yet the concept of applying feedback, lessons learned, is relatively new. Those behind the Tacoma Narrows Bridge certainly ignored all of the lessons of bridge-building.
Clark Eldridge, the State Highway Department's lead engineer for the project, developed the bridge's original design. But federal authorities footed 45% of the bill and required Washington State to hire an outside, and more prominent, consultant. Leon Moisseiff promised that his design would cut the bridge's estimated cost in half.
Similar structures built around the same time were expensive. At $59 million and $35 million respectively, the George Washington and Golden Gate bridges had spans similar to that of the Tacoma Narrows. Moisseiff's new design cost a bit over $6 million, clearly a huge savings.
Except it fell down four months after opening day.
Moisseiff and others claimed that the wind-induced torsional flutter which led to the collapse was a new phenomenon, one never seen in civil engineering before. They seem to have forgotten the Dryburgh Abbey Bridge in Scotland which collapsed in 1818 for the same reason. Or the 1850 failure of the Basse-Chaine Bridge, a similar loss in 1854 of the Wheeling Suspension Bridge, and many others. All due to torsional flutter.
Then there was the 1939 Bronx-Whitestone Bridge, a sister design to Tacoma Narrows, which suffered the same problem but was stiffened by plate girders before a collapse.
And who designed the Bronx-Whitestone? Leon Moisseiff.
Lessons had been learned, but criminally forgotten. Today the legacy of the Tacoma Narrows failure lives on in regulations which require all federally-funded bridges to pass wind tunnel tests designed to detect torsional flutter.
In the firmware world we, too, have our share of disasters. Most were underreported, and few developers understand the proximate causes and the lessons that need to be learned. The history of embedded failures shows patterns we should (must!) identify and eliminate.
Consider the Mars Polar Lander, a 1999 triple failure. The MPL's goal was to deliver a lander on Mars for half the cost of the spectacularly successful Pathfinder mission launched two years earlier. At $265 million Pathfinder itself was much cheaper than earlier planetary spacecraft.
Shortly before it began its descent, the spacecraft released twin Deep Space 2 probes which were supposed to impact the planet's surface at some 400 MPH and return sub-strata data.
MPL crashed catastrophically. Neither DS2 probe transmitted even a squeak.
The investigation board made the not-terribly-earth-shaking observation that tired people make mistakes. The contractor used excessive overtime to meet an ambitious schedule. Mars is tough on schedules. Slip by just one day past the end of the launch window and the mission must idle for two years. In some businesses we can dicker with the boss over the due date, but you just can't negotiate with planetary geometries.
MPL workers averaged 60 to 80 hours per week for extended periods of time.
The board cited poor testing. Analysis and modeling substituted for test and validation. There's nothing wrong with analysis, but testing is like double-entry bookkeeping—it finds modeling errors and other strange behavior never anticipated when the product exists only as ethereal bits.
NASA's mantra is to test like you fly, fly what you tested. Yet no impact test of a running, powered, DS2 system ever occurred. Though planned, such tests were deleted midway through the project due to schedule considerations. Two possible reasons were found for Deep Space 2's twin flops: electronics failure in the high-g impact, and ionization around the antenna after the impacts. Strangely, the antenna was never tested in a simulation of Mars's 6 torr atmosphere.
While the DS2 probes were slamming into the Red Planet things weren't going much better on MPL. The investigation board believes the landing legs deployed when the spacecraft was 1,500 meters high, as designed. Three sensors, one per leg, signal a successful touchdown, causing the code to turn the descent engine off. Engineers knew that when the legs deployed these sensors could experience a transient, giving a false “down” reading… but somehow forgot to inform the firmware people. The glitch was latched; at 40 meters altitude the code started looking at the data, saw the false readings, and faithfully switched off the engine.
A pre-launch system test failed to detect the problem because the sensors were miswired. After correcting the wiring error the test was never repeated.
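To make the touchdown-sensor lesson concrete, here is a minimal sketch of one way firmware could refuse to trust a latched transient. It is not the MPL flight code. The names (touchdown_monitor_t, touchdown_update), the two-sample confirmation count, and the per-cycle calling convention are all my assumptions; only the 40 meter enable altitude and the three leg sensors come from the story above. The idea is simply that nothing observed above the enable altitude survives, and that a single "down" sample is never enough.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LEGS          3
#define ENABLE_ALTITUDE_M 40.0  /* from the text: code starts looking at 40 m */
#define CONFIRM_SAMPLES   2     /* assumption: consecutive "down" samples needed */

typedef struct {
    uint8_t consecutive[NUM_LEGS];  /* run length of "down" readings per leg */
} touchdown_monitor_t;

void touchdown_init(touchdown_monitor_t *m)
{
    assert(m != NULL);
    for (size_t i = 0; i < NUM_LEGS; i++) {
        m->consecutive[i] = 0;
    }
}

/* Call once per control cycle with the current (unlatched) sensor states.
 * Returns true when the descent engine should be shut down. */
bool touchdown_update(touchdown_monitor_t *m, double altitude_m,
                      const bool leg_down[NUM_LEGS])
{
    assert(m != NULL && leg_down != NULL);

    if (altitude_m > ENABLE_ALTITUDE_M) {
        /* Too high to be on the ground: throw away anything seen so far,
         * including a leg-deployment transient. */
        touchdown_init(m);
        return false;
    }

    bool confirmed = false;
    for (size_t i = 0; i < NUM_LEGS; i++) {
        if (leg_down[i]) {
            if (m->consecutive[i] < CONFIRM_SAMPLES) {
                m->consecutive[i]++;
            }
        } else {
            m->consecutive[i] = 0;  /* a single glitch never accumulates */
        }
        if (m->consecutive[i] >= CONFIRM_SAMPLES) {
            confirmed = true;
        }
    }
    return confirmed;
}
```

A transient reported at 1,500 meters is discarded, and even below 40 meters one leg has to report "down" on two successive cycles before the engine is cut. Whether two samples is the right number is exactly the kind of question a powered system test should answer.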
Then there are the twin Mars Exploration Rovers, Spirit and Opportunity, which at this writing have surpassed all mission goals and continue to function. We all heard about Spirit's dispiriting shutdown when it tried to grind a rock. Most of us know that the flash file system directory structure was full. VxWorks tossed an exception, exactly as it should have, and tried to reboot. But that required more directory space, causing another exception, another reboot, repeating forever.
Just as in unlamented DOS, deleted files still consumed directory space. A lot of old files accumulated during the coast phase to Mars still devoured memory.
Originally planned as a 90 day mission, the spacecraft were never tested for more than 9 days. In-flight operation of motors and actuators generated far more files than ever seen during the ground tests. The investigators wrote: “Although there was limited long duration testing whose purpose was to identify system memory consumption of this type, no problems were detected because the system was not exercised in the same way that it would later be used in flight.”
Test like you fly, fly what you tested.
Exception handlers were poorly implemented. They suspended critical tasks after a memory allocation failure instead of placing the system in a low-functionality safe mode.
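What might "a low-functionality safe mode" look like in code? The following is a sketch under my own assumptions, not the rover's software and not a VxWorks API: mem_guard_t, mem_guard_alloc, and the critical flag are hypothetical names. An allocation failure flips the system into a safe mode that keeps serving critical requests and sheds everything else, rather than suspending tasks. The allocator is injected through a function pointer so the failure path can actually be exercised on the bench.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { MODE_NOMINAL, MODE_SAFE } system_mode_t;
typedef void *(*alloc_fn_t)(size_t);

typedef struct {
    alloc_fn_t    alloc;     /* underlying allocator, injectable for tests */
    system_mode_t mode;
    unsigned      failures;  /* kept so the event can be reported later */
} mem_guard_t;

void mem_guard_init(mem_guard_t *g, alloc_fn_t alloc)
{
    assert(g != NULL && alloc != NULL);
    g->alloc = alloc;
    g->mode = MODE_NOMINAL;
    g->failures = 0;
}

/* Returns NULL on failure; the caller must check. A failed allocation
 * drops the system into safe mode instead of halting anything. */
void *mem_guard_alloc(mem_guard_t *g, size_t bytes, bool critical)
{
    assert(g != NULL);
    if (g->mode == MODE_SAFE && !critical) {
        return NULL;  /* shed non-essential work while in safe mode */
    }
    void *p = g->alloc(bytes);
    if (p == NULL) {
        g->failures++;
        g->mode = MODE_SAFE;
    }
    return p;
}

/* Fault-injection stub: an allocator that always fails, so the
 * error path gets tested instead of merely hoped about. */
void *failing_alloc(size_t bytes)
{
    (void)bytes;
    return NULL;
}
```

Initialize the guard with malloc for normal operation and with failing_alloc in a test harness. The stub is the important part: an exception path that has never run is an exception path that doesn't work.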
A source at NASA tells me the same VxWorks memory allocation failure has caused software crashes on at least 6 other missions. The OS isn't at fault, but it is a big and complex chunk of code. In all cases the engineers used VxWorks incorrectly. We seem unable to learn from other people's disasters. We're allowed to make a mistake—once. Repeating the same mistake over and over is a form of insanity.
It's easy to blame the engineers, but they diagnosed this difficult problem using a debugger 100 million miles away from the target system, found the problem, and uploaded a fix. Those folk rock.
In 1999 a Titan IVb (this is a really big rocket) blasted off the pad, bound to geosynchronous orbit with a military communications satellite aboard. Nine minutes into the flight the first stage shut down and separated properly. The Centaur second stage ignited and experienced instabilities about the roll axis. That coupled into both yaw and pitch deviations until the vehicle tumbled. Computers compensated by firing the reaction control system thrusters… till they ran out of fuel. The Milstar spacecraft wound up in a useless low elliptical orbit.
A number of crucial constants specified the launcher's flight behavior. That file wasn't managed by a version control system… and was lost. An engineer modified a similar file to recreate the data but entered one parameter as -0.1992476 instead of the correct -1.992476. That was it—that one little slipup cost taxpayers a billion dollars. At least there's plenty more money where that came from.
We all know to protect important files with a VCS, right? Astonishingly, in 1999 a disgruntled programmer left the FAA, deleting all of the software needed for en-route control of planes between Chicago O'Hare and the regional airports. He had encrypted a copy on his home computer. The feds busted him, of course, but FBI forensics took 6 months to decrypt the key.
Everyone makes mistakes, but no one on the Centaur program checked the engineer's work. For nearly 30 years we've known that inspections and design reviews are the most powerful techniques known to prevent errors.
The constant file was never exercised in the inertial navigation system testbed, which had been specifically designed for tests using real flight data.
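A review or a testbed run would have caught the misplaced decimal point, but even a crude plausibility check in the loader might have flagged it. Here is a minimal sketch, assuming each constant ships with an engineering-approved range. The flight_param_t structure, the parameter name, and the bounds used in the usage note are invented for illustration; only the two values -1.992476 and -0.1992476 come from the story.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *name;   /* hypothetical parameter name */
    double      value;  /* value read from the constants file */
    double      min;    /* assumed: plausible range agreed on in review */
    double      max;
} flight_param_t;

/* Returns the index of the first parameter outside [min, max],
 * or -1 if every parameter passes. NaN fails the comparison and
 * is therefore reported as out of range. */
int first_out_of_range(const flight_param_t *params, size_t count)
{
    assert(params != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        const flight_param_t *p = &params[i];
        if (!(p->value >= p->min && p->value <= p->max)) {
            return (int)i;
        }
    }
    return -1;
}
```

With an assumed range of -2.5 to -1.5, the correct -1.992476 passes and the fat-fingered -0.1992476 is rejected before it ever reaches a rocket. A range check is no substitute for an inspection, since a wrong value inside the range sails through, but it is cheap insurance.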
Test like you fly, fly what you tested.
A year later Sea Launch (check out the cool pictures of their ship-borne launch pad at www.sea-launch.com) lost the $100 million ICO F-1 spacecraft when the second stage shut down prematurely.
The ground control software had been modified to accommodate a slight change in requirements. One line of code, a conditional meant to close a valve just prior to launch, was somehow deleted. As a result all of the helium used to pressurize the second stage's fuel tanks leaked out. Pre-flight tests missed the error.
Test like you fly, fly what you tested.
This failure illustrates the intractability of software. During countdown, ground software monitored some 10,000 sensors, issuing over a million commands to the vehicle. Only one was incorrect, a 99.9999% success rate. In school a 90 is an A. Motorola's famed six sigma quality program eliminates all but 3.4 defects per million. Yet even 99.9999% isn't good enough for computer programs.
Software isn't like a bridge, where margins can be added by using a thicker beam. One bit wrong out of hundreds of millions can be enough to cause total system collapse. Margin comes from changing the structure in sometimes difficult ways, like using redundant computers with different code. In Sea Launch's case, perhaps a line or two of C that monitored the position of the valve would have made sense.
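The "line or two of C" might look something like the sketch below. I have no knowledge of Sea Launch's actual ground software, so every name here (valve_reading_t, check_helium_valve, the 500 ms staleness limit) is hypothetical. The point is that the check looks at the measured valve position rather than at whether a close command was issued, so it still works when the command itself has gone missing.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { VALVE_UNKNOWN, VALVE_OPEN, VALVE_CLOSED } valve_state_t;

typedef enum {
    LAUNCH_GO,
    LAUNCH_HOLD_VALVE_OPEN,
    LAUNCH_HOLD_SENSOR_FAULT
} launch_check_t;

typedef struct {
    valve_state_t state;   /* position reported by the sensor */
    unsigned      age_ms;  /* time since the sensor was last sampled */
} valve_reading_t;

#define MAX_READING_AGE_MS 500u  /* assumption: older data is not trusted */

/* Independent pre-launch monitor: permits launch only when a fresh
 * reading says the helium valve is closed. */
launch_check_t check_helium_valve(const valve_reading_t *r)
{
    assert(r != NULL);
    if (r->age_ms > MAX_READING_AGE_MS || r->state == VALVE_UNKNOWN) {
        return LAUNCH_HOLD_SENSOR_FAULT;
    }
    return (r->state == VALVE_CLOSED) ? LAUNCH_GO : LAUNCH_HOLD_VALVE_OPEN;
}
```

Anything other than a fresh "closed" reading holds the count. That is the software version of margin: a second, structurally different path to the same answer.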
Robert Glass in his Facts and Fallacies of Software Engineering (Addison-Wesley, 2002, ISBN 0321117425) estimates that for each 25% increase in requirements the code's complexity explodes by 100%. The number of required tests probably increases at about the same rate. Yet testing is nearly always left till the end of the project, when the schedule is at max stress. The boss is shrieking “ship it! Ship it!” while the spouse is wondering if you'll ever come home again.
The tests get shortchanged. Disaster follows.
The higher levels of the FAA's DO-178B safety critical standard require code and branch coverage tests on every single line of code and each conditional. The expense is staggering, but even those ruthless procedures aren't enough to guarantee perfection.
Last month I described how Mars Polar Lander was lost due to a software error. But this scenario played out on yet another planet, the one known as Earth, when on September 8 the Genesis mission impacted at 200 MPH. As I write this the Mishap Investigation Board hasn't released a final report. But they did say the gravity-sensing switches were installed upside down, so they couldn't detect the Earth's gravitational field.
The origins of Murphy's Law are in some dispute. The best research I can find (www.improb.com/airchives/paperair/volume9/v9i5/murphy/murphy1.html) suggests that Captain Ed Murphy complained “If there's any way they can do it wrong, they will” when he discovered that acceleration sensors on a rocket sled were installed backwards. Nearly 60 years later the same sort of mistake doomed Genesis.
Perhaps a corollary to Murphy's Law is George Santayana's oft-quoted “those who forget history are condemned to repeat it.”
NASA's mantra “test like you fly, fly what you tested” serves as an inoculation of sorts against the Murphy virus. We don't as yet know why Genesis' sensors were installed upside down, but a reasonable test regime would have identified the flaw long before launch.
The Therac 25 was a radiotherapy instrument designed to treat tumors with carefully regulated doses of radiation. Occasionally operators found that when they pressed the “give the patient a dose” button the machine made a loud clunking sound, and then illuminated the “no dose given” light. Being normal human-type operators, they did what any normal human-type person would do: press the “dose” button again. After a few iterations the patients were screaming in agony.
Between 1985 and 1988 six cases of massive overdosing resulted in three deaths.
The machines were all shut down during an investigation, which found that if the backspace button was pressed within 8 seconds of the “give the patient a dose” control being actuated, the device would give full-bore max X-rays, cooking the patient.
The code used a homebrew RTOS riddled with timing errors. Yet today, nearly two decades later, far too many of us continue to write our own operating systems. This is despite the fact that at least a hundred are available, for prices ranging from free to infinity, from royalty-free licenses to ones probably written by pirates. Even those cool but brain-dead PIC processors that have a max address space of a few hundred words have a $99 RTOS available.
Developers give me lots of technical reasons why it's impossible to use a commercial OS. Too big, too slow, wrong API—the reasons are legion. And mostly pretty dumb. EEs have long used glue logic to make incompatible parts compatible. They'd never consider building a custom UART just because of some technical obstacle. Can you imagine going to your boss and saying “this microprocessor is ideal for our application, except there are two instructions we could do so much better. So I plan to build our own out of 10 million transistors.” The boss would have you committed. Yet we software people regularly do the same by building our own code from 10 million bits. Crafting a custom OS is nearly always insane, and in the case of the Therac 25, criminal.
It's tough to pass information between tasks safely in a multithreaded system, which is why a decent RTOS has sophisticated messaging schemes. The homebrew version used in the Therac 25 didn't have such features so global variables were used instead, another contributor to the disaster.
Globals are responsible for all of the evil in the universe, from male pattern baldness to ozone depletion. Of course there are instances where they're unavoidable, but those are rare. Too many of us use them out of laziness. The OOP crowd chants “encapsulation, inheritance, polymorphism;” the faster they can utter that mantra the closer they are to OOP nirvana, it seems. Of the three, encapsulation is the most important. Both Java and C++ support encapsulation… as do assembly and C. Hide the data, bind it to just the routines that need it. Use drivers both for hardware and to access data items.
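Encapsulation in plain C takes nothing more exotic than the static keyword. The sketch below is illustrative only and has nothing to do with the Therac's actual code: the module, the centigray units, and the DOSE_MAX_CGY limit are all assumptions of mine. The variable lives at file scope, no other source file can name it, and every write goes through one routine that can validate the value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DOSE_MAX_CGY 1000u  /* hypothetical limit, for illustration only */

/* File scope and static: invisible outside this source file. */
static uint16_t s_dose_cgy;
static bool     s_dose_valid;

/* The only way to change the setting. Rejects implausible values.
 * In a multitasking system this body would also take an RTOS mutex
 * or use a message queue; that part is omitted here. */
bool dose_set(uint16_t cgy)
{
    if (cgy > DOSE_MAX_CGY) {
        return false;
    }
    s_dose_cgy = cgy;
    s_dose_valid = true;
    return true;
}

/* The only way to read the setting. Fails until a valid value is set. */
bool dose_get(uint16_t *out)
{
    assert(out != NULL);
    if (!s_dose_valid) {
        return false;
    }
    *out = s_dose_cgy;
    return true;
}
```

Compare that with a bare global that any task can scribble on at any time. The accessor pair costs a few bytes of code and buys a single place to put range checks, locking, and a breakpoint.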
The Therac's code was, as usual, a convoluted mess. Written mostly by a solo developer, it was utterly unaudited. No code inspections had been performed.
We've known since 1976 that inspections are the best way to rid programs of bugs. Testing and debugging simply don't work; most test regimens only exercise about half the code. It's quite difficult to create truly comprehensive tests, and some features, like exception handlers, are nearly impossible to invoke and exercise.
Decent inspections will find about 70% of a system's bugs for a twentieth of the cost of debugging. The Therac's programmer couldn't be bothered, which was a shame for those three dead patients.
But 1985 was a long time ago. These things just don't happen anymore. Or, do they?
Dateline Panama, 2001. Another radiotherapy device, built by a different company, zapped 28 patients. At least 8 died right after the overexposures; another 15 either have already died or are expected to die as a result.
To protect the patient physicians put lead blocks around the tumor. The operator draws the block configuration on the machine's screen using a mouse. Developers apparently expected the users to draw each one individually, though the manual didn't make that a requirement. 50 years of software engineering has taught us that users will always do unexpected things. Since the blocks encircled the tumor a number of doctors drew the entire configuration in one smooth arcing line.
The code printed out a reasonable treatment plan yet in fact delivered its maximum radiation dose.
Software continues to kill.
The FDA found the usual four horsemen of the software apocalypse at fault: inadequate testing, poor requirements, no code inspections, and no use of a defined software process.
I bet you think pacemakers are immune from firmware defects. Better think again.
In 1997 Guidant announced that one of their new pacemakers occasionally drives the patient's heartbeat to 190 beats per minute. Now, I don't know much about cardiovascular diseases, but suspect 190 BPM to be a really bad thing for a person with a sick heart.
The company reassured the pacemaking public that there wasn't really a problem; the code had been fixed and disks were being sent across the country to doctors. However, the pacemaker is implanted subcutaneously. There's no ‘net connection, no USB port or PCMCIA slot.
Turns out that it's possible to hold an inductive loop over the implanted pacemaker. A small coil in the device receives energy to charge the battery. It's possible to modulate the signal and upload new code into Flash. The robopatients were reprogrammed and no one was hurt.
The company was understandably reluctant to discuss the problem so it's impossible to get much insight into the nature of what went wrong. But clearly testing was inadequate.
Guidant is far from alone. A study in the August 15, 2001 Journal of the American Medical Association (“Recalls and Safety Alerts Involving Pacemakers and Implantable Cardioverter-Defibrillator Generators”) showed that more than 500,000 implanted pacemakers and cardioverters were recalled between 1990 and 2000. (This month's puzzler: how do you recall one of these things?)
41% of those recalls were due to firmware problems. The recall rate increased in the second half of that decade over the first. Firmware is getting worse. All five US vendors have an increasing recall rate.
The study said: “engineered (hardware) incidents [are] predictable and therefore preventable, while system (firmware) incidents are inevitable due to complex processes combining in unforeseeable ways.”
It's true that the software embedded into these marvels has grown steadily more complex over the years. But that's not an excuse for a greater recall rate. We must build better firmware when the code base grows. As the software content of the world increases a constant bug rate will lead to the collapse of civilization. We do know how to build better code. We choose not to. And that blows my mind.
Remember Los Alamos? Before they were so busily engaged in losing disks bulging with classified material this facility was charged with the final assembly of the US's nuclear weapons. Most or all of that work has stopped, reportedly, but the lab still runs experiments with plutonium.
In 1998 researchers were bringing two subcritical chunks of plutonium together in a “criticality” experiment, which measured the rate of change of neutron flux between the two halves. It would be a Real Bad Thing if the two bits actually got quite close, so they were mounted on small controllable cars, rather like a model railway. An operator uses a joystick to cautiously nudge them towards each other.
The experiment proceeded normally for a time, the cars moving at a snail's pace. Suddenly both picked up speed, careening towards each other at full speed. No doubt with thoughts of a mushroom cloud in his head, the operator hit the “shut down” button mounted on the joystick.
Nothing happened. The cars kept accelerating.
Finally actuating an emergency SCRAM control, his racing heart (happily sans defective embedded pacemaker) slowed when the cars stopped and moved apart.
The joystick had failed. A processor reading this device recognized the problem and sent an error message, a question mark, to the main controller. Unhappily, ? is ASCII 63, the largest number that fits in a 6 bit field. The main CPU interpreted the message as a big number meaning go real fast.
Two issues come to mind: the first is to test everything, even exception handlers. The second is that error handling is intrinsically difficult and must be designed carefully.
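The Los Alamos failure came down to an error code traveling in the same field as the data. One way to design that out, sketched here with hypothetical names (joy_msg_t, commanded_speed) and an assumed message layout, is to carry the status separately from the value and have the receiver treat anything other than a clean message as a command to stop. Only the 6-bit field and the '?' character come from the account above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SPEED_FIELD_MAX 63u  /* largest value a 6-bit field can hold */

typedef enum { JOY_OK = 0, JOY_FAULT = 1 } joy_status_t;

typedef struct {
    joy_status_t status;  /* out-of-band: never mixed into the speed field */
    uint8_t      speed;   /* valid only when status == JOY_OK */
} joy_msg_t;

/* Returns the speed to command. Any fault or malformed value
 * yields 0, so the failure mode is "stop" rather than "go fast". */
uint8_t commanded_speed(const joy_msg_t *msg)
{
    assert(msg != NULL);
    if (msg->status != JOY_OK) {
        return 0;
    }
    if (msg->speed > SPEED_FIELD_MAX) {
        return 0;
    }
    return msg->speed;
}
```

With this layout a failed joystick that stuffs '?' (ASCII 63) into the speed field still produces a commanded speed of zero, because the fault flag wins. The handler is three lines long and trivially testable, which is the whole idea.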
This handful of disaster stories has many common elements. On Mars Polar Lander and Deep Space 2, the Mars Exploration Rover and Titan IVb, Sea Launch, pacemakers, Therac 25 and in Los Alamos inadequate testing was a proximate cause. We know testing is hard. Yet it's usually deferred till near the end of the project, so gets shortchanged in favor of shipping now.
Tired programmers make mistakes. Well, duh. Mars Polar Lander, Deep Space 2 and the Mars Exploration Rover were lost or compromised by this well-known and preventable problem.
Crummy exception handlers were one of the proximate causes of problems with the Mars Exploration Rover, Los Alamos and plenty of other disasters.
Had a defined software process, including decent inspections, been used no one would have been killed by the Therac 25. I estimate about 2% of embedded developers inspect all new code. Yet properly-performed inspections are a silver bullet that accelerates the schedule and yields far better code.
I have a large collection of embedded disasters. Some of the stories are tragic, others enlightening, and some merely funny. What's striking is that most of the failures stem from just a handful of causes. Remember the Tacoma Narrows bridge failure I described last month? Leon Moisseiff was unable to learn from his profession's series of bridge failures from wind-induced torsional flutter, or even from his own previous encounters with the same phenomenon, so the bridge collapsed just four months after it opened.
I fear too many of us in the embedded field are 21st century Leon Moisseiffs, ignoring the screams of agony from our customers, the crashed code and buggy products that are, well, everywhere. We do have a lore of disaster. It's up to us to draw the appropriate lessons.
Jack Ganssle is the chief engineer of The Ganssle Group. Jack writes for Embedded Systems Design and has written six books on embedded systems and one on his sailing fiascoes. He started developing embedded systems in the early 1970s using the 8008. Since then, he's started and sold three electronics companies, including one of the bigger embedded tool businesses. He's developed or managed more than 100 embedded products, from deep-sea navigational gear to the White House security system. Ganssle now gives seminars to companies worldwide about better ways to develop embedded systems.