Sources report that the Palo Verde Nuclear power plant in Arizona has been running for 19 years with a latent defect in its emergency coolant system. Sketchy technical details leave much to the imagination. I imagine the operators run periodic tests of the safety gear. But it’s likely those tests don’t simulate every possible failure mode.
Kudos to the Nuclear Regulatory Commission for finding the bug, and let’s breathe a sigh of relief that the worst didn’t happen.
There’s an intriguing parallel to software development. Exception handlers – the code’s safety system – are just as difficult to implement correctly and fully test.
Exception handlers are those bits of firmware we hope never execute. They’re stuffed into the system to deal with the unexpected, just as a reactor’s emergency cooling system is only there for the unlikely catastrophic fault. But the unexpected does happen from time to time. Hardware fails, unanticipated inputs crash the code, or an out-and-out design flaw – a bug – causes the code to veer out of control, with luck throwing an exception.
Or a cosmic ray flips a bit in the processor. As transistor geometries shrink our systems will be ever more vulnerable to so-called single event upsets: soft, unrepeatable hardware disruptions that will likely crash the code.
That’s when the safety systems engage. An exception handler detects the failure and takes appropriate action. If the code controls a nuke plant it may very well dump emergency coolant into the core. An avionics unit probably switches over to backup equipment. Consumer devices might simply initiate a safe reset.
But exception handlers are notoriously difficult to perfect. It’s hard to invoke those snippets of code since the developer must perfectly simulate something that should never happen. It’s even harder to design a handler that deals with every possible failure mode.
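One way to make a handler easier to invoke is to build a fault-injection hook into the code, so a test can force the "should never happen" path on demand. The sketch below is only an illustration under stated assumptions: the names (`read_sensor`, `handle_sensor_fault`, `test_inject_sensor_fault`) are hypothetical, the sensor is a stub, and "safe state" is reduced to a single flag.

```c
#include <stdbool.h>

/* Hypothetical system states for this sketch. */
typedef enum { STATE_RUNNING, STATE_SAFE } system_state_t;

static system_state_t g_state = STATE_RUNNING;

/* Test hook: while set, every sensor read reports a failure. */
static bool g_inject_sensor_fault = false;

/* Stand-in for a real driver call; returns false on failure. */
static bool read_sensor(int *out)
{
    if (g_inject_sensor_fault) {
        return false;          /* simulated hardware fault */
    }
    *out = 42;                 /* placeholder reading */
    return true;
}

/* The "exception handler": latch the system into a known safe state. */
static void handle_sensor_fault(void)
{
    g_state = STATE_SAFE;
}

/* One pass of the control loop; returns the state afterwards. */
system_state_t control_step(void)
{
    int value = 0;
    if (!read_sensor(&value)) {
        handle_sensor_fault();
    }
    (void)value;               /* a real loop would act on the reading */
    return g_state;
}

/* Test-only entry point for arming or clearing the injected fault. */
void test_inject_sensor_fault(bool enable)
{
    g_inject_sensor_fault = enable;
}
```

A test can call `control_step()` once normally, arm the hook, and call it again to confirm the handler really does drive the system to `STATE_SAFE`. This exercises only the failure modes someone thought to inject, which is exactly the limitation described above.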
My collection of firmware disasters is rife with system failures directly traceable to faulty exception handlers.
The NEAR spacecraft was, ah, “near”(ly) lost when an accelerometer transient invoked an error script… but the wrong, untested version had been shipped, causing thrusters to dump most of the fuel.
Two years earlier Clementine was lost when an unhandled floating point exception crashed the code and spewed all of the fuel into space.
The first Mars Exploration Rover, one of the wonderful twin robots today roaming the planet, crashed when the filesystem filled… and the exception handlers repeatedly tried to recover by allocating more file space.
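A recovery routine that can retry forever is itself a failure mode. A common defensive pattern, sketched here with hypothetical names and an assumed retry limit, is to bound the number of attempts and then escalate. This is not the rover's actual software, only an illustration of the idea.

```c
#include <stdbool.h>

#define MAX_RECOVERY_ATTEMPTS 3   /* assumed limit for this sketch */

typedef enum { RECOVERY_OK, RECOVERY_GAVE_UP } recovery_result_t;

/* Hypothetical recovery action supplied by the caller; true on success. */
typedef bool (*recovery_fn)(void *ctx);

/* Try a recovery action a bounded number of times, then escalate by
 * returning RECOVERY_GAVE_UP instead of looping forever. */
recovery_result_t recover_with_limit(recovery_fn attempt, void *ctx,
                                     unsigned *attempts_made)
{
    unsigned n = 0;
    while (n < MAX_RECOVERY_ATTEMPTS) {
        n++;
        if (attempt(ctx)) {
            *attempts_made = n;
            return RECOVERY_OK;
        }
    }
    *attempts_made = n;
    return RECOVERY_GAVE_UP;   /* caller should enter a degraded mode */
}

/* Stub that always fails, standing in for "allocate more file space"
 * on a full filesystem. */
bool always_fails(void *ctx)
{
    (void)ctx;
    return false;
}

/* Stub that succeeds on its second call, using ctx as a call counter. */
bool succeeds_second_time(void *ctx)
{
    unsigned *calls = (unsigned *)ctx;
    (*calls)++;
    return *calls >= 2;
}
```

What the caller does after `RECOVERY_GAVE_UP` (reset, safe mode, waiting for an operator) is a design decision this sketch leaves open.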
Ariane 5 rained half a billion dollars of aluminum over the Atlantic when the inertial navigation units shut down due to an overflow. Yet they continued to send data to the main steering computer, asserting diagnostic bits meaning “this data is no good.” The computer ignored the diagnostics, accepted the bad data, and swiveled the nozzles all the way to one side.
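The Ariane story suggests two checks a consumer of sensor data can make: honor the sender's "no good" flag, and refuse values that would overflow the destination type. The following is a minimal sketch of both, not the actual flight code. The status bit, the struct layout, and the function names are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status bit meaning "this sample is not trustworthy". */
#define NAV_STATUS_INVALID 0x01u

typedef struct {
    double  value;    /* measurement in arbitrary units */
    uint8_t status;   /* diagnostic bits from the sensor unit */
} nav_sample_t;

/* Convert to int16_t, clamping instead of overflowing.
 * Sets *saturated when the input was out of range or NaN. */
int16_t to_int16_saturating(double x, bool *saturated)
{
    *saturated = false;
    if (x != x) {              /* NaN compares unequal to itself */
        *saturated = true;
        return 0;
    }
    if (x > INT16_MAX) { *saturated = true; return INT16_MAX; }
    if (x < INT16_MIN) { *saturated = true; return INT16_MIN; }
    return (int16_t)x;
}

/* Use a sample only if its diagnostics say it is valid and it fits;
 * otherwise keep the last known-good command. Returns true if updated. */
bool update_command(const nav_sample_t *s, int16_t *command)
{
    bool saturated;
    int16_t candidate;

    if (s->status & NAV_STATUS_INVALID) {
        return false;          /* honor the "data is no good" flag */
    }
    candidate = to_int16_saturating(s->value, &saturated);
    if (saturated) {
        return false;          /* out-of-range data is treated as suspect */
    }
    *command = candidate;
    return true;
}
```

Holding the last known-good command is only one possible policy. Whether that is safer than switching to a backup source depends on the system, and this sketch takes no position on it.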
In the case of firmware I believe a lot of the problem stems from overstressed engineers putting too little time into analyzing potential failures. But that’s certainly not the case at the Arizona reactor. With an entire city at risk from a core meltdown, surely the engineers used every tool possible, like Failure Mode and Effects Analysis, to ensure the plant responds correctly to every possible insult.
Perhaps we human engineers of all stripes, software, mechanical, and nuclear, are just not good at anticipating failure modes.
That’s a scary thought.
What’s your take on safety systems and exception handlers?
Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges.
Your analogy between Palo Verde and computers may not be as close as it first appeared. The problem at Palo Verde, upon further analysis, turned out to be a problem that didn't exist. More detailed analysis of the postulated problem revealed that it was never a problem after all.
Failure modes and effects analyses have been used in design reviews of nuclear power plants for decades. But nobody ever knows about all possible failure modes. That is why nuclear plant workers keep looking for failures and keep asking questions.
Even if the postulated problem at Palo Verde had been real, it still would have taken another failure to bring the postulated failure into play, and at least one other failure to have made the postulated problem affect the safety of the plant.
Here is the main point of the original report that Palo Verde phoned in to the NRC (these calls are all recorded):
“Engineering personnel were unable to demonstrate that the original design of the Emergency Core Cooling System (ECCS) could perform its safety function for its mission time under certain postulated accident scenarios. Specifically, the Refueling Water Tank (RWT) is designed with baffles to prevent a vortex from developing and air binding the Safety Injection pumps during a Loss of Coolant Accident (LOCA). On a LOCA, the High Pressure Safety Injection pumps take a suction from the RWT and inject borated water into the Reactor Coolant System (RCS). At 7.4 percent RWT level, the source of borated water by design automatically shifts from the RWT to the containment sump. However, for small break LOCA there may be insufficient containment pressure to ensure inventory is not continuing to be drawn from the RWT. This may allow the baffles in the bottom of the RWT to uncover. With the RWT baffles uncovered, a vortex may develop, leading to potential air binding of the Safety Injection pumps before the operator manually isolates the RWT.”
Based on the information at the end of this link:
all three Palo Verde Units were not yet producing power early this morning. Below is the link to the report that said they were preparing to start up. Since they went all the way down to cold shutdown, it will take longer to restart the reactors and heat up the main turbines. http://www.elpasotimes.com/apps/pbcs.dll/article?AID=/20051019/BUSINESS/510190321/1003
– Don Kosloff
As long as we approach the work like a codesmith, treating code as something to be hammered into functionality, such exception-handling bugs will occur. We really need to think like parachute and ejection seat manufacturers. Or even the electrical fuse manufacturer.
– kalpak dabir
The company for which I work makes power supplies for in-flight seat-back video systems for a reputable airline. Suffice it to say that neither I nor my family will ever fly that airline so long as our products are aboard. Our engineering and production groups are overstressed, our products are under-tested (in both the design and production stages), and this particular product was put out by our division with the worst record (and the top brass considers them “the best”).
It's not just the flight power modules, it's endemic to the entire company.
The only reason we learned that the product was aboard this airline is that it has already forced two landings due to in-flight smoking failures, and those made the evening news.
I certainly hope other companies are not in the same situation we are.
– Andy Kunz
In my corner of the world, I've found design engineers to often be unfamiliar with the context of use of what they develop, particularly when their products are components of a larger, loosely coupled, multivendor system. As you have often remarked, designers should go to the sites where their products are used to get some sense for their real-world application that may simply be beyond the capability of a System Requirements Specification to adequately portray.
Users and customer engineers and technicians can often provide insight into critical failure modes that is grounded in school-of-hard-knocks experience, but we are rarely consulted in the early stages of product conception. Most often, the first we learn of a new product is from a sales or marketing representative (or when it shows up at our door). By then it is usually too late for us to provide useful input of any sort that could inform the development of a product.
It is not a given that customer engineers and technicians will trap most significant and/or occult problems before devices go into use, but we have done so more than once. So I'm not comfortable with the assertion that engineers aren't good at anticipating failure modes. But if we're doing our work in a vacuum, then we're not as good as we could be.
– Rick Schrenker
Exceptions are signal pathways. In an algorithmic system, pathways are very difficult to spot due to the textual and sequential (spaghetti) nature of the code. This is especially true in legacy systems. This is one of the many reasons that we should switch to a non-algorithmic, synchronous, signal-based software model. Only then will all exception pathways be clearly visible. A signal-based environment lends itself well to visual programming and automated verification. Total reliability. This is part of the goal of Project COSA.
– Louis Savain