
Disaster redux!


No one wants a pacemaker, but if you had one, wouldn't you want it to be reliable? Jack shares more horror stories and lessons.

The spacecraft descended towards the planet, accelerating in the high-g field as it drew nearer. Sophisticated electronics measured the vehicle's position and environment with exquisite precision, waiting for just the right moment to deploy the parachute.

Nothing happened. The craft crashed on the surface.

Last month I described how NASA's Mars Polar Lander was lost due to a software error. But this failure played out on yet another planet, the one known as Earth, when on September 8 the Genesis mission impacted at 200MPH. As I write this the Mishap Investigation Board hasn't released a final report. But they did say the gravity-sensing switches were installed upside down so they couldn't detect the Earth's gravitational field.

The origins of Murphy's Law are in some dispute. The best research I can find (www.improb.com/airchives/paperair/volume9/v9i5/murphy/murphy1.html) suggests that Captain Ed Murphy complained, “If there's any way they can do it wrong, they will” when he discovered that acceleration sensors on a rocket sled were installed backwards. Nearly 60 years later the same sort of mistake doomed Genesis.

Perhaps a corollary to Murphy's Law is George Santayana's oft-quoted, “those who cannot remember the past are condemned to repeat it.”

NASA's mantra “test like you fly, fly like you test” serves as an inoculation of sorts against the Murphy virus. We don't as yet know why Genesis' sensors were installed upside down, but a reasonable test regime would have identified the flaw long before launch.

Last month I focused on high-profile failures from the space business. But few industries are exempt; most have their own share of firmware disasters, some of which are quite instructive. This month we'll look at spectacular disasters and near misses in a few Earth-bound industries.

Tumor zappers
The Therac 25 was a radiotherapy instrument designed to treat tumors by administering carefully regulated doses of radiation. Operators found that occasionally when they pressed the “give the patient a dose” button the machine made a loud clunking sound and then illuminated the “no dose given” light. Being normal human-type operators, they did what any normal human-type person would do: press the button again. After a few iterations the patients were screaming in agony.

Between 1985 and 1988 six cases of massive overdosing resulted in three deaths.

The machines were all shut down during an investigation, which found that the software got confused when the backspace button was pressed within eight seconds of the “give the patient a dose” control being actuated. When this sequence occurred, the device would give full-bore maximum X-rays, cooking the patient.

Software killed.

The code used a homebrew real-time operating system (RTOS) riddled with timing errors. Yet today, nearly two decades later, far too many of us continue to write our own operating systems despite the fact that at least a hundred are available, for prices ranging from free to infinity, from royalty-free licenses to ones probably written by pirates. Even those cool but brain-dead PIC processors that have a maximum address space of only a few hundred words have a $99 RTOS available.

Developers give me lots of technical reasons why it's impossible to use a commercial operating system. Too big, too slow, wrong application programming interface—the reasons are legion. And mostly pretty dumb. Electrical engineers have long used glue logic to make incompatible parts compatible. They'd never consider building a custom UART just because of some technical obstacle. Can you imagine going to your boss and saying “this microprocessor is ideal for our application, except there are two instructions we could do so much better. So I plan to build our own out of 10 million transistors.” The boss would have you committed. Yet we software people regularly do the same by building our own code from 10 million bits when perfectly sensible alternatives exist. Crafting a custom operating system is nearly always insane, and in the case of the Therac 25, criminal.

It's tough to pass information between tasks safely in a multithreaded system, which is why a decent RTOS has sophisticated messaging schemes. The homebrew version used in the Therac 25 didn't have such features so global variables were used instead, another contributor to the disaster.
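Even a minimal handoff mechanism beats naked globals. Here's a sketch of the idea in C (hypothetical names, not the Therac code and not any particular commercial RTOS API; safe only for one sending context and one receiving context):

/* Minimal single-slot mailbox sketch. */

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    volatile bool     full;     /* set by the sender, cleared by the receiver */
    volatile uint16_t payload;  /* the value being handed off */
} mailbox_t;

static mailbox_t dose_mbox;     /* private to this module */

/* Sender side: returns false rather than silently overwriting. */
bool mbox_post(uint16_t value)
{
    if (dose_mbox.full)
        return false;
    dose_mbox.payload = value;
    dose_mbox.full = true;      /* publish only after the payload is written */
    return true;
}

/* Receiver side: returns false if nothing is waiting. */
bool mbox_fetch(uint16_t *value)
{
    if (!dose_mbox.full)
        return false;
    *value = dose_mbox.payload;
    dose_mbox.full = false;
    return true;
}

A real RTOS queue adds blocking, multiple senders, and priority handling, which is exactly why buying one beats rolling your own.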

Globals are responsible for all of the evil in the universe, from male pattern baldness to ozone depletion. Of course, sometimes they're unavoidable, but those instances are rare. Too many of us use them out of laziness. The OOP crowd chants “encapsulation, inheritance, polymorphism;” the faster they can utter that mantra the closer they are to OOP nirvana, it seems. Of the three, encapsulation is the most important. Both Java and C++ support encapsulation, as do assembly and C. Hide the data, bind it to just the routines that need it. Use drivers both for hardware and to access data items.
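In C that's nothing more exotic than file-scope static data and a couple of access routines. A sketch, with hypothetical names and an assumed limit:

/* beam_current.c -- hypothetical example. The variable is file-scope
   static, so the only way to touch it is through these routines. */

#include <stdint.h>

#define BEAM_CURRENT_MAX_MA 200u     /* assumed limit, for illustration only */

static uint16_t beam_current_ma;     /* hidden: no extern declaration anywhere */

/* Returns 0 on success, -1 if the request is out of range. */
int beam_current_set(uint16_t ma)
{
    if (ma > BEAM_CURRENT_MAX_MA)
        return -1;                   /* reject bad requests at the boundary */
    beam_current_ma = ma;
    return 0;
}

uint16_t beam_current_get(void)
{
    return beam_current_ma;
}

Every caller now passes through the range check, and any misuse is confined to one small file instead of wherever someone happened to scribble on a global.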

The Therac's code was, as usual, a convoluted mess. Written mostly by a solo developer, it was utterly unaudited. No code inspections had been performed.

We've known since 1976 that inspections are the best way to rid programs of bugs. Testing and debugging simply don't work; most test regimens only exercise about half the code. It's quite difficult to create truly comprehensive tests, and some features, such as exception handlers, are nearly impossible to invoke and exercise.
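About the only way to exercise those handlers is to inject the fault on purpose. A sketch of the idea, assuming (hypothetically) that the code reads its sensor through a function pointer a test harness can swap out:

#include <stdio.h>

/* Hypothetical example: production code reads a sensor through a pointer,
   so a test can substitute a stub that fails on demand. */
typedef int (*sensor_read_fn)(int *value);

static int sensor_read_hw(int *value)   { *value = 42; return 0; }  /* normal path   */
static int sensor_read_fail(int *value) { (void)value; return -1; } /* injected fault */

static sensor_read_fn sensor_read = sensor_read_hw;

/* The exception path under test: fall back to a safe value on error. */
static int get_temperature(void)
{
    int t;
    if (sensor_read(&t) != 0)
        return 0;                       /* safe default when the sensor fails */
    return t;
}

int main(void)
{
    sensor_read = sensor_read_fail;     /* force the error path */
    printf("with injected fault: %d\n", get_temperature());  /* expect 0 */
    return 0;
}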

Decent inspections will find about 70% of a system's bugs for a twentieth of the cost of debugging. The Therac's programmer couldn't be bothered, which was a shame for those three dead patients.

But 1985 was a long time ago. These things just don't happen anymore. Or do they?

Dateline: Panama, 2001. Another radiotherapy device, built by a different company, zapped 28 patients. At least eight died right after the overexposures; another 15 are expected to die as a result or already have.

To protect their patients the physicians put lead blocks around the tumor. The operator would draw the block configuration on the machine's screen using a mouse. Developers apparently expected the users to draw each one individually, though the manual didn't make that a requirement. Fifty years of software engineering has taught us that users will always do unexpected things. Since the blocks encircled the tumor a number of doctors drew the entire configuration in one smooth arcing line.

The code printed out a reasonable treatment plan yet in fact delivered its maximum radiation dose.

Software continues to kill.

The FDA found the usual four horsemen of the software apocalypse at fault: inadequate testing, poor requirements, no code inspections, and no use of a defined software process.

Pacemaking
I bet you think heart pacemakers are immune from firmware defects. Better think again.

In 1997 Guidant announced that one of its new pacemakers occasionally drives the patient's heartbeat to 190 beats per minute. Now, I don't know much about cardiovascular diseases, but suspect 190BPM to be a really bad thing for a person with a sick heart.

The company reassured the pacemaker-buying public that there wasn't really a problem; they had fixed the code and were sending disks across the country to doctors. The pacemaker, however, is implanted subcutaneously. There's no Internet connection, no USB port, no PCMCIA slot.

Turns out that it's possible to hold an inductive loop over the implanted pacemaker to communicate with it. A small coil in the device normally receives energy to charge the battery. It's possible to modulate the signal and upload new code into flash. The robopatients were reprogrammed and no one was hurt.

The company was understandably reluctant to discuss the problem, so it's impossible to get much insight into the nature of what went wrong. But clearly the testing was inadequate.

Guidant is far from alone. A study in the August 15, 2001 Journal of the American Medical Association (“Recalls and Safety Alerts Involving Pacemakers and Implantable Cardioverter-Defibrillator Generators”) showed that more than 500,000 implanted pacemakers and cardioverters were recalled between 1990 and 2000. (This month's puzzler: how do you recall one of these things?)

Forty-one percent of those recalls were due to firmware problems. The recall rate increased in the second half of that decade compared with the first. Firmware is getting worse. All five U.S. pacemaker vendors have an increasing recall rate.

The study said, “engineered (hardware) incidents [are] predictable and therefore preventable, while system (firmware) incidents are inevitable due to complex processes combining in unforeseeable ways.”

Baloney.

It's true that the software embedded into these marvels has steadily grown more complex over the years. But that's not an excuse for a greater recall rate. We must build better firmware when the code base grows. As the software content of the world increases a constant bug rate will lead to the collapse of civilization. We do know how to build better code. We choose not to. And that blows my mind.

Plutonium perils
Remember Los Alamos National Laboratory? Before they were so busily engaged in losing disks bulging with classified material this facility was charged with the final assembly of U.S. nuclear weapons. Most or all of that work has stopped, reportedly, but the lab still runs experiments with plutonium.

In 1998 researchers were bringing two subcritical chunks of plutonium together in a “criticality” experiment that measured the rate of change of neutron flux between the two halves. It would be a Real Bad Thing if the two bits actually got quite close, so they were mounted on small controllable cars, rather like a model railway. An operator uses a joystick to cautiously nudge them toward each other.

The experiment proceeded normally for a time, the cars moving at a snail's pace. Suddenly both cars picked up speed, careening towards each other. No doubt with thoughts of a mushroom cloud in his head, the operator hit the “shut down” button mounted on the joystick.

Nothing happened. The cars kept accelerating.

Finally after he actuated an emergency SCRAM control, the operator's racing heart (happily sans defective embedded pacemaker) slowed when the cars stopped and moved apart.

The joystick had failed. A processor reading this device recognized the problem and sent an error message, a question mark, to the main controller. Unhappily, “?” is ASCII 63, the largest number that fits in a 6-bit field. The main CPU interpreted the message as a big number meaning go real fast.

Two issues come to mind: the first is to test everything, even exception handlers. The second is that error handling is intrinsically difficult and must be designed carefully.
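A sketch of the second point, using a hypothetical message format: status indicators and data should never share a channel unless the receiver checks for them, and anything out of range should fail safe.

/* Hypothetical defensive parse of a joystick message. The '?' case and
   the range limit are assumptions for illustration, not the Los Alamos code. */

#include <stdint.h>

#define JOYSTICK_MAX 50u     /* assumed full-scale speed command */
#define MSG_ERROR    '?'     /* fault indicator from the peripheral */

typedef enum { CMD_STOP, CMD_MOVE } command_t;

/* Returns CMD_MOVE with a bounded speed, or CMD_STOP on anything suspect. */
command_t parse_joystick(uint8_t raw, uint8_t *speed)
{
    if (raw == MSG_ERROR || raw > JOYSTICK_MAX) {
        *speed = 0;          /* unknown or out-of-range input: fail safe */
        return CMD_STOP;
    }
    *speed = raw;
    return CMD_MOVE;
}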

Patterns
The handful of disaster stories I've shared over the last two columns have many common elements. On Mars Polar Lander and Deep Space 2, the Mars Exploration Rover and Titan IVb, Sea Launch, the pacemakers, the Therac 25, and at Los Alamos, inadequate testing was a proximate cause. We know testing is hard. Yet it's usually deferred till near the end of the project, so it gets shortchanged in favor of shipping now.

Tired programmers make mistakes. Well, duh. Mars Polar Lander, Deep Space 2, and the Mars Exploration Rover were lost or compromised because of this well-known and preventable problem.

Crummy exception handlers were a proximate cause of problems with the Mars Exploration Rover, at Los Alamos, and in plenty of other disasters.

No one would have been killed by the Therac 25 had a defined software process, including decent inspections, been used. I estimate about 2% of embedded systems developers inspect all new code. Yet properly performed inspections are a silver bullet that accelerates the schedule and yields far better code.

I have a large collection of embedded systems disasters. Some of the stories are tragic; others enlightening, and some merely funny. What's striking is that most of the failures stem from just a handful of causes. Remember the Tacoma Narrows bridge failure I described last month? Because bridge designer Leon Moisseiff was unable to learn from his profession's series of bridge failures caused by wind-induced torsional flutter and his own previous encounters with the same phenomenon, the bridge collapsed just four months after it opened.

I fear too many of us in the embedded field are 21st-century Leon Moisseiffs, ignoring the screams of agony from our customers, the crashed code, and buggy products that are, well, everywhere. We do have a lore of disaster. It's up to us to draw the appropriate lessons.

Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. Contact him at .

Reader Response


Touch screen voting machines anyone?

The code is considered proprietary and cannot be reviewed by anyone outside the manufacturer.

Perhaps another example of software that kills. But only indirectly.

– Steve Jacobson


Great article! It is time that embedded SW becomes more process-oriented, with clear design, design reviews, and test plans.

– Giuseppe Scaglione


“…inadequate testing was a proximate cause.” Inadequate testing does not CAUSE disasters, though adequate testing MIGHT prevent them.

The goal is to not introduce faults into the system. And if they are introduced (since no one is perfect), to correct them as soon as possible. The way to do this is through training (including perhaps certification), process (including reviews at all levels from requirements through implementation), and adequate tools.

I think our tools are failing us. Which is easier to understand, a page of C code or a page of C++ code? It seems the more powerful our tools become, the more difficult they become to use and the easier it is to create (and hide) mistakes in them.

I think C. A. R. Hoare summed it up well in his 1980 Turing Award lecture: “There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.”

– John Kaufmann


I'm one of those who rolls my own RTOS, so I guess I'm a disaster waiting to happen. But I sincerely doubt there's an off-the-shelf RTOS that will do what I need in the space and environment I've got, and that it'll be bug free. It seems what Jack is saying is that a commercial RTOS is tested “more heavily” in the wild, and that statement may contain a kernel (pardon) of truth. I've often used such an argument to support using a commercial “C” compiler rather than a free, open-source compiler, primarily because a commercial compiler used in a variety of environments still has a sole repository for complaints and a sole producer of well-documented fixes and updates. Risk is therefore lowered, and at least, liability can be offloaded. On the other hand, the variety of embedded environments each individual RTOS would need to be tailored to operate within could not possibly provide a similar level of comfort. I take exception to Jack's comment that those who do not use commercial RTOS solutions are introducing potentially deadly risk factors; moreover, there surely are embedded platforms so unique they cannot be adequately controlled by a commercial RTOS. You see, I've never thought commercial RTOS programmers were necessarily better than those of us who roll our own.

– Matt Staben


Nobody bats 1000 in baseball. Humans just aren't that good as individuals. Any modern technical endeavor is a team effort. It takes this type of team effort to overcome our individual limitations. Management has the responsibility to research the best practices in the industry, see that the staff is trained in them, and implement them in practice. Management also has the responsibility to train the staff in the regulatory aspects of the industry, and to respect the regulations of the industry and not skirt the law. Management and staff share a responsibility for coming up with realistic plans and schedules and adjusting them if safety is involved, or if new risks are uncovered. Management also should take an interest in seeing that the tools of the trade are up to the job — management cuts the check for these, but often does so without understanding why saving a hundred bucks on a compiler, RTOS, etc., can cost thousands or even millions or billions!

Management also needs to be more security aware in today's environment — from terrorists to unscrupulous competitors, the stakes are higher than ever. Big name companies have seen exploits for “embedded” products posted on the internet, source code has been posted, and systems have been hacked. Security is a joint responsibility, but the policies and procedures need to be in place. As players in this game, it is in your interest to educate yourself in as many of these areas as possible for your industry, in addition to the purely technical — get to know the regulatory aspects, get to know your tools, get more knowledge about security, etc.

– Anonymous


Jack, the state I come from has very few people — if you come across someone in trouble on the highway you are expected to stop and make your best effort to aid them. Overall this works pretty well; more lives are saved than lost due to this, and it is rare that people are fined for stopping and failing to render aid. If east coasters ran the state many more people would die on the roads each year — it would be against the law to stop and help unless you were an MD and paramedic with a specialization in emergency trauma, a pretty rare thing in the desolate west. The basic idea of the Therac was not too bad, and one would have to look at the numbers for its overall success ratio — I would venture, even with the bugs, it was a darned sight better than setting a kitchen timer and using that on average. One has to look at the averages, and how things are improving over time. Back in 1985 few colleges or universities taught formal software engineering methods for safety-critical systems.

Now more universities are aware of this. Things hopefully will get even better in the future. Looking at my own situation: on one product line alone, there were over two million lives saved in a seven-year period, with fewer than 150 lives lost due to all causes (weather, pilot error, mechanical, and all other). To save those lives we all took risks — ranging from engineers and pilots going on test flights with new equipment in an experimental-category aircraft, to doctors, nurses, and others in the crews and O.R. working 24-hour shifts and then driving home.

In a perfect world we would not have to take those risks, and would still be able to save those lives. Until we get to that perfect world, we will have to accept that there are limits to what we can achieve, and all we can do is play the odds in mankind's favor to the best of our abilities.

Keep trying to improve the odds for us in the trenches doing the day to day work of embedded systems.

Embedded.com has improved many, many people's knowledge and abilities, and should be required reading in colleges and universities teaching embedded software development.

There's a lot of new stuff going on with embedded systems – and new pitfalls. Good articles might include: verifying FPGA-based processor designs, embedded C++ pitfalls, system architecture and performance estimation while using C++ (avoiding issues through system planning), debugging embedded C++, embedded DSP, built-in self test, etc.

– William Murray


I work in a small engineering design/consulting firm. There are 3 partners, all engineers. One is a mechanical design guru. One (me) does electrical design and software/firmware development, and the third does the customer and project management and everything else except design work so that the other two can concentrate on the designs.

I firmly embrace your code review mantra. The problem I have is that there's no one to review the code with (or the hardware design, for that matter). So I've come to embrace a concept I call “defensive design”. It's really just a realization that from the first blank piece of paper to the shipped product, if something goes wrong with the product's “brain” then it's my fault. This makes me take extra precaution in every aspect; more transient protection on I/O, more processor power than I'll need, multiple sanity checks in the code. And I always estimate that at least 1/3 of the software development is error handling. If the potential errors aren't identified up front and included in the design, you're in for a long and ugly debug effort.

I'm not trying to self-inflate my ego. My point is that if more designers acted like they were directly responsible for their efforts, and were rewarded/chastised based on their results, then many of the code disasters you pointed out wouldn't have happened. So my hope is that all embedded programmers and hardware designers will take heed of your “quality is free” message, and their managers will realize that when they implement some type of reward system for doing it right, maybe they'll see the culture change that we need.

– Ray Zeun


A perspective of embedded control:

Early times:

Mechanical control — subject to wear, friction, cam misalignment, etc. — limited complexity

Next Phase:

Electro-Mechanical — subject to wear, friction, cam misalignment, and contact pitting
— Even telephone switching systems were built with some electro-mechanical tech.
— did not end until the 1990s for general control
— Vehicles still use this

— Electronic, non-microprocessor
— This was the mainframe, minicomputer, transistorized-controller era. Software was incredibly expensive to develop. Many computer-controlled things were developed, ranging from chemical plants to air traffic control.

Disadvantages — air-conditioning often required and subject to failure. Size, Power, etc.
— Still some legacy systems in use

Beginnings of Computer languages

Microprocessor Era —

Explosion in number and type of things automated.

Formal Software methods for safety start to evolve

Disadvantages — growing complexity (example: a Web-enabled toaster??? Why?)
— Security — the Web/net connects many, many things
— The complex things are often getting less reliable than their simpler predecessors (a 32-bit bus has twice as many connections to break as a 16-bit bus — something to consider in a vehicle; plus, 32-bit parts may come in a BGA, which will thermally mismatch the PCB and fail sooner due to temperature cycling)

– William Murray


Many embedded systems I've worked on were simply too small and cost-sensitive to allow any sort of commercial RTOS. Indeed, they had no OS in the accepted sense at all – tasks simply ran sequentially in a round-robin loop, with no dynamic memory allocation, any interrupts being temporarily masked while data was transferred. Reliability here comes from simplicity.
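For concreteness, a minimal sketch of that kind of loop (hypothetical task names; the interrupt mask macros would be target-specific):

#include <stdint.h>

/* Target-specific interrupt masking goes in these macros
   (e.g. __disable_irq()/__enable_irq() on a Cortex-M). */
#define DISABLE_IRQ()
#define ENABLE_IRQ()

static volatile uint8_t rx_byte;   /* written by a UART ISR */
static uint8_t          rx_copy;

static void task_read_inputs(void)
{
    DISABLE_IRQ();                 /* mask interrupts only for the copy */
    rx_copy = rx_byte;
    ENABLE_IRQ();
}

static void task_control(void) { /* compute outputs from rx_copy */ }
static void task_outputs(void) { /* drive the hardware */ }

int main(void)
{
    for (;;) {                     /* round-robin: each task runs to completion */
        task_read_inputs();
        task_control();
        task_outputs();
    }
}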

I agree with the comment above that there may often be no other programmer around to review code. But reviews are such a powerful weapon that I am now convinced that it would be worth paying someone external to perform this function.

On the subject of testing – it is so often omitted at the last minute due to pressure to ship that I think we should push for it to be accepted practice to write (and review!) tests BEFORE writing a line of product code.

None of this is new to the extreme programming crowd, of course.

– Phil McKerracher


About the “commercial RTOS is better” concept: I beg to differ. One very well known RTOS begins operation by turning OFF the hardware watchdog timer and disabling supervisor mode protection. And since it was written in C, the basic idea is to simply link your code directly to the O/S binaries. To call an O/S function you do just that: call it. No parameter validation is built in, so you had better get it right or really strange things happen.

One home-built RTOS I did has per-task watchdog operation managed by the O/S and a supervisor call mechanism with parameter checking. Not to toot my horn too loudly, but aren't these things basic to a good RTOS design?
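For what it's worth, a rough sketch of the per-task watchdog idea (hypothetical names; the actual hardware kick is part-specific): the hardware dog is refreshed only when every task has checked in since the last sweep.

#include <stdbool.h>

#define NUM_TASKS 3

static volatile bool task_alive[NUM_TASKS];   /* each task sets its own flag */

/* Called by each task once per cycle. */
void wdog_checkin(unsigned task_id)
{
    if (task_id < NUM_TASKS)
        task_alive[task_id] = true;
}

/* Called periodically by the kernel. The hardware watchdog is kicked only
   if every task has checked in; otherwise it is allowed to time out and
   reset the system. */
void wdog_supervisor(void)
{
    for (unsigned i = 0; i < NUM_TASKS; i++)
        if (!task_alive[i])
            return;                           /* a task is stuck: let the dog bite */

    /* hw_watchdog_kick();  -- part-specific register write, omitted here */

    for (unsigned i = 0; i < NUM_TASKS; i++)
        task_alive[i] = false;                /* demand fresh check-ins next sweep */
}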

While I agree that global data can be a danger and is not a good idea unless well managed, I also think that the largest number of errors can be blamed on poor pointer management. And I don't see a widespread movement to eliminate the ability to manipulate pointers.

Look very closely before you buy is my suggestion since company size guarantees nothing about the product details.

– Mike Salish
