Embedded Systems Programming
Lock Up Your Software
Murphy's Law should have stated that anything that can go wrong will go wrong, while interrupts are disabled. How many of us can confidently predict the behavior of our software? To implement feature-rich software within commercially feasible deadlines, programmers have to accept that conditions will be encountered in the field that were not exercised during development. You may be able to tell me how you expect the software to behave in a given case, but most programs are too complex to predict their behavior for certain unless they have actually been tested. And even testing is no guarantee that an application will work the same way next time: a timing condition or an uninitialized value may turn out okay on one run, but not on the next.
The Technischer Überwachungsverein (TÜV) is the regulatory body that certifies medical and other electronic devices for sale in the European Union. It functions similarly to the FDA in the U.S. One of the criteria TÜV reviewers have used is to ask the following question: what is the worst thing the software might do? The assumption is that the combination of processor and software is so complex that we cannot predict what instructions might get executed, should either hardware or software go wrong.
One approach to dealing with this lack of certainty is to imagine that a malicious programmer, intent on causing the maximum amount of damage, has reprogrammed your device. If the device is a Game Boy, that does not amount to much. However, many embedded programmers control devices on which lives directly depend, or devices that could put lives in danger. In other cases, no lives are at stake, but a malfunction may cost the user money through, say, factory downtime or loss of material. In those cases it is justifiable to spend a fraction of the money at risk as insurance against failure.
Let's examine the risk again. Most programmers would consider it fanciful that software, in a failure mode, would just happen to follow the worst possible course of action. It is true that some sophisticated actions are unlikely to happen by chance. An automated guidance vehicle, for example, is not likely to chase a victim around the factory. If you look more closely at some real examples, though, many cases exist in which the worst case can easily happen. In antilock brakes, for instance, the system simply has to do nothing (a common failure mode in many programs) to achieve the worst case. The driver is blissfully unaware that the protection has been removed, and when it comes time to press the brakes, nothing happens! Because of this hazard, if the microcontroller controlling your brakes fails, the mechanical coupling between the pedal and the braking mechanism still functions. It is still possible to stop, just not as smoothly as in the computer-assisted case.
In the past, I have worked on medical ventilators. These devices pump air into a patient's lungs according to parameters defined by the physician, and monitor the patient's response. An alarm goes off if the patient is not making the effort expected, or if rates or pressures are higher or lower than expected. The worst case here is that the air pressure could be driven very high without the physician being made aware of it. In some ventilators, the air is driven by a piston, which is powered by a motor. Ideal performance often means moving the motor as fast as possible at the start of the breath, and reducing speed later on, when the pressure has to be controlled more carefully: a fairly typical feedback control loop. If the software halts immediately after instructing the motor to move at its top speed, then the motor will never be instructed to slow down or to halt. That would be okay if our alarm mechanism could alert the physician that all is not well. However, if our software has halted, the alarm code may not get a chance to run again. So a single point of failure has left our control mechanism out of control and our monitoring system disabled. Maybe you should stop reading for a moment and consider whether a similar failure could occur on your current project.
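The hazard in that design can be sketched in a few lines of C. Everything here is a toy model (the names, the numbers, and the crude pressure "plant" are all invented for illustration, not taken from any real ventilator); the point is only that control and monitoring share one thread of execution, so a single halt removes both at once:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a piston ventilator. All names and thresholds are
 * invented for illustration; this is not real ventilator code. */
typedef struct {
    int  motor_speed;   /* commanded speed, 0..100 percent */
    int  pressure;      /* simulated airway pressure */
    bool alarm_raised;
} vent_t;

/* Start of breath: command the motor to top speed. */
void breath_start(vent_t *v) { v->motor_speed = 100; }

/* The physics keeps happening whether or not software runs. */
void plant_step(vent_t *v) { v->pressure += v->motor_speed / 10; }

/* One pass of the software loop: slow the motor as pressure rises,
 * and raise the alarm if pressure gets dangerously high. If the
 * software halts, NEITHER of these ever runs again. */
void software_step(vent_t *v) {
    if (v->pressure > 30) v->motor_speed = 20;
    if (v->pressure > 80) v->alarm_raised = true;
}
```

If `software_step()` stops being called right after `breath_start()`, the simulated pressure climbs past any alarm threshold while `alarm_raised` stays false and the motor stays at full speed, which is exactly the single point of failure described above.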
A more malicious failure would not just halt the monitoring system, but also display misleading data on the user interface. Explicitly telling the user that everything is fine, when just the opposite is true, lures the user into a false sense of security. In one incident, an anesthesiologist became suspicious when a blood pressure monitor displayed data that did not fit with the patient's other symptoms. When the monitor's display was examined more carefully, it turned out that the words "Demo mode" appeared occasionally in small print.
The monitor had been placed in this mode during a routine service, and then put into use. The indication on the display that it was operating in demonstration mode was small and intermittent, not enough to attract attention in a busy ward. "Ah, but," objects the software engineer, "it was not a bug. They were just not using it properly." That objection does not hold up. Software engineers are not just responsible for writing software that conforms to requirements; they must also ensure that those requirements are safe. In this case, the requirements should have stated that the user interface should make it very obvious at all times when the device is in demonstration mode.
Now that we have seen that our mind game with maliciously inclined software can correspond to real failures, what can we do about it? Double checks in software will only give us limited protection, and it is always possible that our malicious program will disable that check at the most important time. So the solution has to be outside of software. A mechanical or electronic mechanism that restricts the actions that software can take is called an interlock. This is a lock that prevents certain actions, or makes one action dependent on a certain state, or forces a certain combination of actions to happen in a particular order.
One example is elevator doors. When the doors are open, the elevator is often electrically inhibited from moving. Similarly, open train doors prevent the train from moving.
Lock-in and lock-out
Going back to our medical ventilator example, one of the components was a safety valve that would open if the air pressure reached a certain threshold. Software could instruct the safety valve to open at a lower pressure if it detected pressure rising too quickly. Software could also close the valve, so long as the valve had been opened by software. Once the safety valve had opened because it had reached its mechanical cracking pressure, it remained open until the power was cycled. Effectively, the valve was making the decision that the software could not be trusted, on the grounds that pressure had reached a value that would not have been reached with software executing its control loop properly. Once a power cycle occurred, you could be sure that the software was reset. It was also assumed that an operator was present; otherwise, who would turn the device off and on? The safety valve was a limiting component, but it also provided a lock-out. Once software allowed the controlled activity to go outside of its operating envelope, it was locked out, and was not allowed to control the process again until some kind of human intervention took place.
Some devices cleverly restrict you to a certain mode once you are there. Many MCUs come equipped with a watchdog timer. Typically, the device can operate in one of two modes. In watchdog-enabled mode, the software must set some register on a regular basis as a confirmation that the software is still running normally. In watchdog-disabled mode the software does not have to provide this strobe. For a safety-critical system, you always want to operate in watchdog-enabled mode. However, you have to contend with our imaginary malicious programmer, and assume that one of the first things he would do, once he has taken control, is turn that watchdog timer off. This is why many MCUs that employ a watchdog timer use a lock-in. Once you enable the watchdog, you cannot disable it without a processor reset. So long as your software functions correctly up to the point where the watchdog is enabled, you are guaranteed that it will be enabled from that time forward, regardless of any bugs you may have.
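The lock-in behavior can be modeled in ordinary C. This is a simulation, not code for any particular MCU (real watchdogs live in hardware registers, and the names here are invented): once `wdt_enable()` has been called, `wdt_disable()` has no effect, so even our malicious programmer cannot switch off the strobe requirement.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated watchdog timer with a lock-in. Names are invented;
 * on a real MCU these would be hardware registers. */
typedef struct {
    bool enabled;
    int  timeout_ticks;
    int  countdown;
} watchdog_t;

/* Processor reset: the only thing that clears the enable flag. */
void wdt_init(watchdog_t *w, int timeout_ticks) {
    w->enabled = false;
    w->timeout_ticks = timeout_ticks;
    w->countdown = timeout_ticks;
}

void wdt_enable(watchdog_t *w) { w->enabled = true; }

/* The lock-in: once enabled, a disable request is simply ignored. */
void wdt_disable(watchdog_t *w) { (void)w; }

/* The strobe healthy application code must perform regularly. */
void wdt_strobe(watchdog_t *w) { w->countdown = w->timeout_ticks; }

/* Called on every timer tick; returns true when the watchdog
 * fires, i.e. would force a processor reset. */
bool wdt_tick(watchdog_t *w) {
    if (!w->enabled) return false;
    return --w->countdown <= 0;
}
```

Note that the guarantee only holds because `wdt_disable()` is a no-op after enabling; any code path that could clear `enabled` would hand our malicious programmer exactly the loophole the lock-in exists to close.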
As well as having a lock-in for certain modes, you may want a lock-out for others. Let's return to the example of the patient monitor that was left in demonstration mode. To the designers' credit, they realized that it would not be a good thing if the operator accidentally entered demonstration mode while the device was in use. So they put password protection on the demonstration mode. Typically, service technicians would know the password, while the end-users would not. This lock kept users out of demonstration mode. In the example described previously, the problem arose because the service technician had put the device into demonstration mode before returning it to use.
Once the problem was discovered by the anesthesiologist, the easy solution was to switch back to normal operating mode. Unfortunately, making that change also required a password. The medical staff did not know the password and had to replace the monitor with a spare one. The system was designed with the rule that any change into or out of demonstration mode would require a password. Such symmetry appeals to engineers, but it was not the right design decision in this case. The demonstration mode required a lock-out, but once in demonstration mode, a lock-in was not needed. In general, you want to make it difficult to change from a safe state to an unsafe state, but easy to move from an unsafe state to a safe one.
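That asymmetry, hard to enter the unsafe state but easy to leave it, takes only a few lines to express. The following is a hypothetical sketch with an invented password and mode names, not the design of any real monitor:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical mode protection: a lock-out on entering demo mode,
 * but deliberately no lock-in once there. All names and the
 * password are invented for illustration. */
typedef enum { MODE_NORMAL, MODE_DEMO } run_mode_t;

typedef struct { run_mode_t mode; } monitor_t;

static const char *SERVICE_PASSWORD = "svc-1234"; /* placeholder */

/* Unsafe direction: demand service credentials. */
bool enter_demo_mode(monitor_t *m, const char *password) {
    if (strcmp(password, SERVICE_PASSWORD) != 0)
        return false;          /* lock-out holds */
    m->mode = MODE_DEMO;
    return true;
}

/* Safe direction: always allowed, no password required. */
void exit_demo_mode(monitor_t *m) { m->mode = MODE_NORMAL; }
```

With this design, the anesthesiologist in the incident above could have returned the monitor to normal operation immediately; only the service technician's path into demo mode is guarded.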
Next month we will examine a few more uses of interlocks, and how they can keep software engineers and users out of trouble. In the meantime, if you find this topic interesting, it is well worth reading Donald Norman's The Design of Everyday Things. A small part of the book is devoted to interlocks. The rest discusses other design-for-usability issues. A small number of books are out there that make almost no mention of software, yet should be compulsory reading for software engineers. This is one of them. Norman's examples are almost all from the world of mechanical design, but the same principles often apply to usable software, especially in the embedded world. While on the topic of books, if readers have any suggestions for other non-software books that benefit software engineers, let me know about them. I will add any that look interesting to my reading list, and discuss the whole lot in a later column.
The open source debate
Bill Gatliff and I wrote a pair of articles last September where I took a fairly jaundiced view of the open source community, and the products emerging from it. It received quite a reaction, and I apologize that I did not have time to reply to everyone, but I have made all of the feedback available online at www.panelsoft.com/murphyslaw. Because many of the e-mails were lengthy, and many good points were made on both sides, I thought that my correspondents deserved a slightly wider audience. I will try to make a habit of posting interesting feedback on the Web site, so that anyone who either agrees with me or who wants to put me straight will be able to see if anyone else was of a similar view.
The responses I got to the open source article all fell fairly hard on one side or the other. One writer pointed out the virtues of the LaTeX typesetting system. I would have to agree that it is a fine tool, and I used it quite a bit myself before the days of what-you-see-is-what-you-get editors. However, LaTeX does not really enter the argument in the embedded community, since it would never be used on an embedded system itself, though, admittedly, it might assist in the documentation of a project.
The other interesting piece of feedback was a reader's reaction to how difficult it has become to download Red Hat's Cygwin distribution from their Web site. They have a vested interest in selling the distribution on CD rather than having it downloaded at no cost. Obviously it costs them money to maintain the file servers, and so this is a reasonable business direction, even if it does frustrate customers a bit. A more honest approach would be to charge a small fee for a faster download, but Red Hat realize that such an approach would probably leave them open to criticism that they were abandoning their free software philosophy. The real concern here is that business is dictating a limitation on their Web site. Consider what can happen if you extend this philosophy to the support side of Red Hat's business. If they have the opportunity to release a piece of code that is very readable, maintainable, and so easy to port that users can do it for themselves, then they are going to hurt an important revenue stream. What would your decision be?
Niall Murphy has been writing software for user interfaces and medical systems for 10 years. He is the author of Front Panel: Designing Software for Embedded User Interfaces. Murphy's training and consulting business is based in Galway, Ireland. He welcomes feedback and can be reached at .
“Potentially life-threatening medical equipment failure,” from comp.risks archive, July 1999. Available online at http://catless.ncl.ac.uk/Risks/20.48.
Murphy, Niall, “Watchdog Timers,” Embedded Systems Programming, November 2000, p. 112.
Norman, Donald A. The Design of Everyday Things. New York: Doubleday and Company, 1990.
Murphy, Niall, “Are Open Source and Innovation Compatible?” Embedded Systems Programming, September 2000, p. 78.