Lock Up Your Software, Part 2

When it comes to safety-critical applications, software is not always your biggest worry. Sometimes it's the user.

Last month, we looked at interlocks, which help prevent software from taking unsafe actions. We also invented a malicious programmer who would deliberately program our device to cause as much damage as possible from software. This month, we will look at more real-life examples of interlocks and their interaction with software, and at how those interlocks affect user behavior. In some cases, a single interlock provides convenient protection against an errant program as well as a misguided user.

In my book, I describe an interlock used on a paper cutting machine. [1] The goal of this interlock was not to prevent the software from damaging the device, but to prevent the user from damaging himself. The paper cutter has a table with a bed of air holes. The operator slides blocks of paper around on the air bed and, when he has positioned them correctly, a blade descends, cutting the paper to the required size. The hazard here is that the operator may have a hand in the way of the blade when it comes down. We want to force the operator to put his hands elsewhere while the blade is moving. Donald Norman calls interlocks that do this forcing functions. [2] The solution in the case of the paper cutter is to force the user to place each hand on a button on the device. Because the buttons are too far apart to be pressed by one hand, requiring both buttons to be held is a reasonable way to ensure that the operator's hands are away from the blade's cutting area.
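In software terms, the forcing function reduces to a simple rule: the blade may run only while both buttons are held, and it must stop the moment either button is released. Here is a minimal sketch of that rule, assuming hypothetical helper functions for the button inputs and the blade drive:

#include <stdbool.h>

/* Assumed hardware-access helpers; the real machine's I/O will differ. */
extern bool left_button_pressed(void);
extern bool right_button_pressed(void);
extern void enable_blade(void);
extern void disable_blade(void);

/* Called periodically from the main control loop. */
void poll_blade_interlock(void)
{
    /* The blade may move only while BOTH buttons are held, which keeps
       both of the operator's hands away from the cutting area. */
    if (left_button_pressed() && right_button_pressed())
        enable_blade();
    else
        disable_blade();
}

Of course, a purely software check like this is only as trustworthy as the software itself; the point of the hardware interlock is that the same rule is enforced outside the program's control.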

My inspiration to write about this mechanism came from seeing such a device in action during a tour of a printing plant. A reader of my book, Peter Turpin, worked on software for a similar device, and he mailed me an interesting anecdote. Peter told me that many of the operators of these machines were paid bonuses for productivity and could work faster if both hands were free to control one block of paper while another block was being cut. Placing both hands on the buttons while the blade was moving slowed the operator down, which in turn cost him money. As a result, many operators taped a ruler across the two buttons and leaned against it to press them, keeping both hands free. The operators increased their productivity, but decreased their safety, since accidents became more likely.

This is a classic trade-off between safety and cost. While the operator chose cost over safety, even his own safety, it was in the designer's interest to put safety first. To make the device safer, the designers moved the buttons to the sides of the machine, a bit like the buttons on a pinball machine. The operators responded by setting up a levering mechanism on each side, tied to the operator's belt. By moving his hips, the operator caused the buttons to be depressed, and again the blade could drop with both hands free.

The designers eventually won this battle of wits. The device had a login sequence in which the operator identified himself. The operator was instructed to place his hands on the metallic buttons, and the device measured the capacitance of his body. Later, when the blade was due to descend, instead of simply checking whether the buttons were being pressed, the device also checked that the capacitance across the buttons matched that operator. This method has not been fooled (yet), but Peter figures that this is only because it is difficult for the operators to find out how the device works. If they knew that it was based on capacitance, a countermeasure would probably not be that difficult to implement. Capacitance is a crude measure, and is not accurate enough to confirm a particular operator's identity, but it is accurate enough to distinguish between an operator and a wooden ruler.
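Peter did not describe the algorithm in detail, so the following sketch is only an illustration of the idea: capture a capacitance reading when the operator logs in, and require a similar reading before the blade is allowed to drop. The helper function, data types, and tolerance below are all assumptions.

#include <stdbool.h>
#include <stdint.h>

#define CAP_TOLERANCE 50u            /* assumed acceptance window, in ADC counts */

extern uint16_t measure_button_capacitance(void);   /* assumed measurement helper */

static uint16_t operator_reference;  /* captured during the login sequence */

void capture_operator_reference(void)
{
    operator_reference = measure_button_capacitance();
}

bool operator_really_present(void)
{
    /* A wooden ruler can hold the buttons down, but it will not produce
       a reading close to the operator's stored reference. */
    int32_t diff = (int32_t)measure_button_capacitance()
                 - (int32_t)operator_reference;
    if (diff < 0)
        diff = -diff;
    return diff <= (int32_t)CAP_TOLERANCE;
}

The blade-drop condition then becomes "both buttons pressed AND operator_really_present()", rather than the button test alone.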

The purpose of the interlock just described was to protect the user from himself. Many other interlocks serve the same purpose. Industrial robots are often surrounded by a fence. If a gate in the fence is opened, the robot stops moving. In a factory in Japan, a worker hopped over such a fence and was killed when the robot moved unexpectedly. [3] You have to assume that your users will have priorities different from yours, and interlocks that make unsafe actions impossible will always be a better design than interlocks that depend on a cooperative user.

Dead man controls

One interesting type of interlock is called a dead man control. This is a control that detects whether the operator has died, or is no longer at the controls for some reason. The paper cutter I mentioned previously is one example: if the operator walks away from the machine, the blade will stop moving. Dead man controls are used on trains to ensure that the driver is at the controls at all times. In one London Underground incident, a driver had weighed down the dead man lever to avoid the inconvenience of having to hold it while the train was moving. [4] The driver left the cabin while the train was at a platform, to see if one of the doors was stuck open. Another interlock ensured that the train could not leave the station while a door was open, to prevent the hazard of a passenger falling out, or of a passenger being dragged if he were stuck in a door. While the driver was on the platform, the train pulled away; presumably, whatever had blocked the door had been removed. The train halted automatically at a red light at the next station, and the driver was able to catch up on another train. No injuries resulted, but the incident shows how a number of interlocks can interact in ways the designer, and certainly the driver, does not anticipate.

Speedboats commonly have a dead man control in which a pin in the engine is attached by a cable to the driver's wrist. If the pin is pulled out, the engine stops. This control will not actually detect whether the driver has died, but it will detect that he has left the driving position. On a boat, the main objective is to ensure that the boat halts if the driver falls overboard. Unfortunately, the driver can easily leave the strap off his wrist. I witnessed the kind of danger that is created when a driver bypasses such a control. Two boats collided near a beach where I was walking. The force of the collision left one boat submerged, and the occupants of both boats were sent flying into the water. The other boat, with no one left aboard, propelled itself madly in circles, at speed, within feet of four people bobbing in the water. A third boat had to ram the out-of-control craft to prevent what could easily have become a fatal accident. Again, the moral of the story is that users will often bypass safety mechanisms that designers thought would be used in all cases.

On medical ventilators that I helped design, we usually allowed the operator to enter settings and then leave the machine unattended. If a problem arose with the machine or the patient, an audible alarm would attract attention. In one case, we implemented a dead man control because it was too dangerous to allow the physician to leave the device and patient. One way to test the strength of a patient's lungs is to allow him no air and measure how much negative pressure he can generate while attempting to breathe. Obviously, any situation that cuts off the patient's air supply must be managed carefully. In our design, the physician initiates the maneuver by holding down a button. This closes the valves that allow air to flow to the patient, while the device continuously displays the measured pressure. When the physician is satisfied that he has seen the peak pressure, he releases the button, the ventilator reverts to normal operation, and the patient, thankfully, resumes breathing.

If the physician is distracted by something during this maneuver and walks away from the patient, the ventilator reverts to normal operation because the physician is no longer holding the button. If there were no dead man control, and the maneuver were started with one button press and ended with a second, the patient could be left without air if the operator were distracted, or if someone accidentally pressed the start button. The ventilator also applies a timeout to this maneuver, so that it cannot go on indefinitely, but the timeout alone would not be a sufficient safety mechanism in this case.
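The structure of that maneuver is worth sketching, because it shows the dead man control and the timeout working together. The helper names, polling period, and timeout value below are assumptions for illustration only, not the real ventilator's design:

#include <stdbool.h>
#include <stdint.h>

#define MANEUVER_TIMEOUT_MS 30000u   /* assumed backstop limit on the maneuver */
#define POLL_PERIOD_MS        100u

extern bool maneuver_button_held(void);      /* assumed input helper */
extern void close_patient_valves(void);
extern void open_patient_valves(void);
extern void display_measured_pressure(void);
extern void delay_ms(uint32_t ms);

void run_negative_pressure_maneuver(void)
{
    uint32_t elapsed_ms = 0;

    close_patient_valves();

    /* The maneuver continues only while the physician holds the button
       AND the timeout has not expired. Releasing the button, or walking
       away, ends it immediately; the timeout is a backstop, not the
       primary safety mechanism. */
    while (maneuver_button_held() && elapsed_ms < MANEUVER_TIMEOUT_MS) {
        display_measured_pressure();
        delay_ms(POLL_PERIOD_MS);
        elapsed_ms += POLL_PERIOD_MS;
    }

    open_patient_valves();   /* always revert to normal ventilation */
}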

Can we trust software more than the user?

Another interesting class of interlock protects against both software and user error. The Therac-25 is a radiation therapy device that overdosed a number of patients over several years before the cause of the failures was discovered. [5,6] It is an important case, partly because of the influence it had on the FDA's approach to device safety, and partly because of the lessons it teaches. Many discussions of the Therac-25 focus on management failures and on code written without proper care regarding reentrancy. I discuss some of the user interface issues of the Therac-25 in my book. In this column, I am going to discuss one of the mistakes made in the area of interlocks.

The Therac series of radiation devices was capable of producing beams of different strengths, depending on whether X-ray or electron therapy was required. The devices used a filter, rotated into place on a turntable, to ensure that the patient did not receive a raw high-energy beam directly. On the Therac-20 (an earlier model), an interlock ensured that the high-energy beam could not be activated unless the turntable had been rotated to the correct position. By detecting the turntable position with microswitches, a simple electrical circuit inhibited the high-energy beam if the filter was not in position.

On the Therac-25, the turntable was rotated by a motor controlled by software, and software also read the microswitches. It was assumed that the software would never choose to turn on the high-energy beam with the filter out of position, so, as a cost-saving measure, the interlock was removed from the design. Without the interlock, our hypothetical malicious programmer was free to cause serious damage. A race condition in the code meant that, under some rare conditions, the software would indeed apply the wrong settings. The interlock that had prevented the user from applying a high-energy beam with the filter out of position would also have prevented the software from making the same mistake. The company claimed that the interlock was implemented in software, but that is a case of the fox guarding the henhouse. Software checking itself is useful, but it is not enough in safety-critical situations.

Because high-energy and low-energy beams are invisible, the safety design of the device cannot depend on the operator to spot a flaw in the system. In many of the Therac-25 overdoses, patients did not start to suffer symptoms of the damage for several days.

Allowing software to view the interlock

In this column and the previous one, I have tried to emphasize that some mechanism outside of the software should limit the software, and that the software should not have control over the interlock, because it could then bypass the interlock at the worst possible time. However, it is often software that has to explain to the operator, by means of the user interface, why some action has not taken place. This can only be done well if the software can detect when the interlock has inhibited an action. For this reason, software should be given read-only access to the interlocks. In some cases, this allows the operator to rectify the situation. In other cases, such a message would indicate a device fault that might lead to the device being returned to the manufacturer. The ability to read the state of the interlock would have allowed the Therac-25 to inform the user that the beam was not fired because the turntable was in the wrong position. On the underground train, an indicator in the cabin could tell the driver that the train is being held because a door is open. An interlock that baffles the operator or service technician as to why the device will not function creates undue frustration.
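As a sketch of what read-only access buys you, consider a beam-therapy-style device in which software can read, but not override, the turntable interlock. The helper functions and message text below are assumptions for illustration, not the Therac's actual code:

#include <stdbool.h>

extern bool filter_in_position(void);        /* assumed read-only view of the interlock */
extern bool fire_beam(void);                 /* assumed; the hardware interlock can still veto it */
extern void display_message(const char *msg);

void request_high_energy_beam(void)
{
    if (!filter_in_position()) {
        /* The hardware interlock will inhibit the beam regardless; the
           read-only check lets software tell the operator why, instead of
           leaving a baffling silent failure. */
        display_message("Beam inhibited: turntable not in position");
        return;
    }

    if (!fire_beam())
        display_message("Beam did not fire: check interlock status");
}

Note that nothing here gives the software the power to bypass the interlock; it only reports the interlock's state.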

Hopefully, some of these examples will give you a few ideas for ways to make your own product safer, or at least make you stop and think about the worst thing that can happen, and when it might happen. Remember: Murphy was an optimist!

Niall Murphy has been writing software for user interfaces and medical systems for ten years. He is the author of Front Panel: Designing Software for Embedded User Interfaces. Murphy's training and consulting business is based in Galway, Ireland. He welcomes feedback and can be reached at . Reader feedback to this column can be found at .

References

1. Murphy, Niall. Front Panel: Designing Software for Embedded User Interfaces. Lawrence, KS: R&D Books, 1998.

2. Norman, Donald. The Design of Everyday Things. New York: Doubleday, 1990.

3. Neumann, Peter. Computer Related Risks. Reading, MA: Addison-Wesley, 1995.

4. “Tube train leaves…without its driver,” from comp.risks archive, April 1990. http://catless.ncl.ac.uk/Risks/9.81.html#subj2.1

5. Leveson, Nancy. Safeware: System Safety and Computers. Reading, MA: Addison-Wesley, 1995.

6. Leveson, Nancy and Clark S. Turner. “An Investigation of the Therac-25 Accidents,” IEEE Computer, July 1993, p. 18. This article can also be found at http://courses.cs.vt.edu/~cs3604/lib/Therac_25/Therac_1.html
