Software has become ubiquitous, creating risks at a terrifying rate. Jack wonders if we're smart enough to manage all that code.
How good does firmware have to be? How good can it be? Is our search for perfection, or near-perfection, an exercise in futility?
Complex systems are a recent phenomenon. Many of us remember the early transistor radios, which sported no more than a half-dozen active devices. Vacuum tube televisions, common into the '70s, used 15 to 20 tubes, more or less equivalent to about the same number of transistors. In the '40s, the ENIAC computer required 18,000 tubes, so many that technicians wheeled shopping carts of spares through the room, constantly replacing those that burned out. Though that sounds like a lot of active elements, even the 25-year-old Z80 chip used a quarter of that many transistors, in a die smaller than just one of the hundreds of thousands of resistors in the ENIAC.
Now, the Pentium 4 has 45 million transistors. A big memory chip might require a third of a billion. Intel predicts that, later this decade, their processors will have a billion transistors. I'd guess that even the simplest embedded system, such as an electronic greeting card, requires thousands of active elements.
Software has grown even faster. In 1975, 10,000 lines of assembly code was considered huge. Given the development tools of the day (paper tape, cassettes for mass storage, and crude teletypes for consoles), working on projects of this size was difficult. Today 10,000 lines of C is a small program. A cell phone might contain a million lines of C or C++. This is astonishing, considering the device's small form factor and power requirements.
Another measure of software size is memory usage. The 256-byte (that's not a typo) EPROMs of 1975 meant even a measly 4KB program used 16 devices. Clearly, even small embedded systems were quite pricey. Today? 128KB of flash is nothing, even for a tiny application. The switch from 8- to 16-bit processors, and then from 16- to 32-bitters, is driven more by address space requirements than raw horsepower.
So our systems are growing rapidly in both size and complexity. They're also growing, I contend, in failure modes. Are we smart enough to build these huge applications correctly?
It's hard to make even a simple application perfect; big ones will most likely never be faultless. As the software grows, its components inevitably become more interdependent. A change in one area impacts other sections, often profoundly. Sometimes this is due to poor design; often, it's a necessary effect of system growth.
The hardware, too, is certainly a long way from perfect. Even mature processors usually come with an errata sheet, one that can rival the datasheet in size. The infamous Pentium divide bug was just one of many bugs. Even today, the Pentium 3's errata sheet (renamed “specification update”) details 83 issues. Motorola documents nearly a hundred problems in the MPC555.
What is the current state of the reliability of embedded systems? No one knows. It's an area devoid of research. Yet a lot of raw data is available, some of which suggests we're not doing well.
The Mars Pathfinder mission succeeded beyond anyone's dreams, despite a significant error. A priority inversion problem, noticed on Earth but attributed to a glitch and ignored, caused numerous hangs. A remote debug capability saved the mission. This is an instructive failure because it shows the importance of adding external hardware and/or software to deal with unanticipated software errors.
The August 15, 2001 issue of the Journal of the American Medical Association contained a study called “Recalls and Safety Alerts Involving Pacemakers and Implantable Cardioverter-Defibrillator Generators” by William H. Maisel and others. (Since these devices are implanted subcutaneously, I can't imagine how a recall works.) Surely designers of these devices are on the cutting edge of building the very best software. Yet between 1990 and 2000, firmware errors accounted for about 40% of the 523,000 devices recalled.
In the 10 years studied, we learned a lot about building better code. Tools have improved and the amount of real software engineering that takes place is much greater. Or so I thought. It turns out that the annual number of recalls increased between 1995 and 2000.
In defense of the pacemaker developers, they no doubt confront very complex problems. Interestingly, heart rhythms can be mathematically chaotic. A slight change in stimulus can cause the heartbeat to burst into quite unexpected randomness. And surely there's a wide distribution of heart behavior in different patients.
Perhaps a new QA strategy is needed for these sorts of life-critical devices. What if the software engineer were someone with heart disease who had to use the latest widget before release to the general public?
A pilot friend tells me the 747 operator's manual is a massive tome that describes everything one needs to know about the aircraft and its systems. He says that fully half of the book documents avionics (read: software) errors and workarounds.
The space shuttle's software is a glass half-empty/half-full story. It's probably the best code ever written, with an average error rate of about one per 400,000 lines. The cost? $1,000 per line. So, it is possible to write great code, but despite paying vast sums, perfection is still elusive. Like the 747, the stuff works “well enough,” which is perhaps all we can ever expect. Is this as good as it gets?
The human factor
We don't build systems that live in isolation. They're part of a complex web of systems, not the least of which is the human operator or user. When tools were simple, there weren't so many failure modes. That's not true anymore. Do you remember the U.S.S. Vincennes? She's a U.S. Navy battle cruiser, equipped with the sophisticated Aegis radar system. In July of 1988 the cruiser shot down an Iranian airliner over the Persian Gulf, killing all 290 people on board. Apparently the system knew that the target wasn't an incoming enemy warplane, but that fact was displayed on terminals that weren't easy to see. So here's a case where the system worked as designed, but the human element turned it into a terrible failure. Was the software perfect since it met the requirements?
Unfortunately, airliners have become common targets for warplanes. This past October, a Ukrainian missile apparently shot down a Sibir Tu-154 commercial jet, killing all 78 passengers and crew. While I write, the cause is unknown, or unpublished, but local officials claim the missile had been targeted on a nearby drone. It missed, flying 150 miles before hitting the jet. Software error? Human error?
The war in Afghanistan shows the perils of mixing men and machines. At least one smart bomb missed its target and landed on civilians. U.S. military sources say incorrect target data was entered. Maybe that means someone keyed in the wrong GPS coordinates. It's easy to blame an individual for mistyping, but doesn't it make more sense to look at the system as a whole, including bomb and operator? Bombs connote serious safety-critical issues. Perhaps a better design would accept targeting parameters in a string that includes a checksum, rather like credit card numbers. A mis-keyed entry would be immediately detected by the machine.
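As a minimal sketch of the checksum idea, the following uses the Luhn algorithm, the scheme behind credit card check digits. It assumes, purely for illustration, that the parameters reach the operator as a string of decimal digits with one check digit appended by whatever system generated them. The function names are hypothetical, and nothing here reflects how any real weapon or avionics system formats its data.

```python
def luhn_check_digit(payload: str) -> int:
    """Return the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk the payload right to left, doubling every second digit,
    # starting with the rightmost payload digit.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return (10 - total % 10) % 10


def append_check_digit(payload: str) -> str:
    """Produce the string an operator would be asked to key in."""
    return payload + str(luhn_check_digit(payload))


def is_valid(entry: str) -> bool:
    """Accept an entry only if its last digit matches the Luhn check digit."""
    if len(entry) < 2 or not (entry.isascii() and entry.isdigit()):
        return False
    return luhn_check_digit(entry[:-1]) == int(entry[-1])
```

A single wrong digit always fails this check, and so do most swaps of two adjacent digits (the exception is 09 keyed as 90, or the reverse). A stronger code would catch more, at the cost of a longer string to type.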
It's well known that airplanes are so automated that on occasion both pilots have slipped off into sleep as the craft flies itself. Actually, that doesn't really bother me much, since the autopilot beeps when at the destination, presumably waking the crew. But, before leaving, the fliers enter the destination in latitude/longitude format into the computers. What if they make a mistake (as has happened)? Current practice requires pilot and co-pilot to check each other's entries, which will certainly reduce the chance of failure. Why not use checksummed data instead and let the machine validate the data?
Another U.S. vessel, the Yorktown, is part of the Navy's “Smart Ship” initiative. Automating significant portions of the ship's engineering (propulsion) reduces crew needs by 10% and saves some $2.8 million per year on this one ship. Yet the computers create new vulnerabilities. Reports suggest that an operator once entered an incorrect parameter that resulted in a divide-by-zero error. The entire network of Windows NT machines crashed. The Navy claims the ship was dead in the water for about three hours; other sources (www.gcn.com/archives/gcn/1998/july13/cov2.htm) claim it was towed into port for two days of system maintenance. Users are now trained to check their parameters more carefully. I can't help but wonder what happens in the heat of battle, when these young sailors may be terrified, with smoke and fire perhaps raging. How careful will the checks be?
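One alternative to relying on careful operators is to have the machine reject a bad value at the point of entry, and to guard the division as well. The sketch below is an assumption-laden illustration: the names, the range limits, and the calculation are all hypothetical, since the details of the Yorktown software are not public in this article.

```python
import math


class ParameterError(ValueError):
    """Raised when an operator entry is rejected."""


def parse_in_range(text: str, low: float, high: float) -> float:
    """Convert keyed-in text to a number, rejecting anything outside [low, high]."""
    try:
        value = float(text)
    except ValueError:
        raise ParameterError(f"not a number: {text!r}") from None
    # float() accepts "nan" and "inf", so rule those out explicitly.
    if not math.isfinite(value):
        raise ParameterError(f"not a finite number: {text!r}")
    if not (low <= value <= high):
        raise ParameterError(f"{value} is outside {low}..{high}")
    return value


def rate_per_unit(total: float, units: float) -> float:
    """Divide, but refuse a zero divisor even if earlier checks were skipped."""
    if units == 0:
        raise ParameterError("units must be non-zero")
    return total / units
```

With a lower limit above zero, an entry of "0" is refused before it can reach any arithmetic, and the caller can prompt again instead of crashing.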
Some readers may also shudder at the thought of Windows NT controlling a safety-critical system. I admire the Navy's resolve to use a commercial product, but wonder if Windows, which is the target of many hackers' wrath, might not itself create other vulnerabilities. Will the next war be won by the nation with the best hackers?
People behave in unpredictable ways, leading to failures in even the best system designs. As our devices grow more complex, their human engineering becomes ever more important. Yet all too often this is neglected in our pursuit of technical solutions.
I'm a passionate believer in the value of firmware standards, code inspections, and a number of other activities characteristic of disciplined development. It's my experience that an ad hoc or a non-existent process generally leads to crummy products. Smaller systems can succeed through the dedication of a couple of overworked experts, but as things scale up in size, heroics become less and less effective.
Yet it seems an awful lot of us don't know about basic software engineering rules. When talking to groups I usually ask how many participants have (and use) rules about the maximum size of a function. A basic rule of software engineering is to limit routines to a page or less. Only rarely does anyone raise their hand. Most admit to huge blocks of code, sometimes thousands of lines. Often, this is a result of changes and revisions, of the code evolving over the course of time. Yet it's a practice that inevitably leads to problems.
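A size rule is easy to check mechanically. The sketch below does so for Python source using the standard ast module; it is only an analogy for firmware written in C, where a C-aware parser or metrics tool would be needed. The 50-line limit is an assumed stand-in for "a page," and the function name is hypothetical.

```python
import ast

MAX_LINES = 50  # assumption: roughly one printed page


def long_functions(source: str, max_lines: int = MAX_LINES):
    """Return (name, length) for every function longer than max_lines."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Count from the def line through the last line of the body.
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                offenders.append((node.name, length))
    return offenders
```

Run as part of a build, a check like this flags routines that have grown through revision after revision, which is how most oversized functions come about.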
Methodologies haven't solved the problem. Most are too big and too complex. I have hope for UML, which seems to offer a way to build products that integrates hardware and software, and that is an intrinsic part of development from design to implementation. But UML will fail if management won't pay for extensive training, or can't resist the urge to toss the approach when panic reigns.
The FDA, FAA, and other agencies are becoming aware of the perils of poor software, and have guidelines that can improve development. Britain's Motor Industry Software Reliability Association (MISRA) has guidelines for the safer use of C. They feel that we need to avoid certain constructs and use others in controlled ways to eliminate potential error sources. I agree.
I doubt, though, that any methodology or set of practices can, in the real world of schedule pressures and capricious management, lead to perfect products. The numbers tell the story. The very best use of code inspections, for example, will detect about 70% of the mistakes before testing begins. (However, inspections will find those errors very cheaply.) That suggests that testing must pick up the other 30%. Yet studies show that often testing checks only about 50% of the software!
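The arithmetic behind those numbers can be sketched as follows. The model is deliberately crude and rests on assumptions not made in the text: mistakes are spread evenly through the code, and testing finds every mistake in the code it exercises (test_effectiveness of 1.0, which is optimistic). The function name is hypothetical.

```python
def escaped_fraction(inspection_yield: float,
                     test_coverage: float,
                     test_effectiveness: float = 1.0) -> float:
    """Fraction of the original mistakes that survive inspection and test."""
    remaining_after_inspection = 1.0 - inspection_yield
    # Testing can only find mistakes in the code it actually exercises.
    found_by_test = test_coverage * test_effectiveness
    return remaining_after_inspection * (1.0 - found_by_test)


# 70% inspection yield, tests exercising 50% of the code
shipped = escaped_fraction(0.70, 0.50)
```

Under these assumptions roughly 15% of the original mistakes ship: about 150 out of every 1,000. Less effective tests push the figure higher.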
Sure, we can (and must) design better tests. We can, and should, use code coverage tools to ensure every execution path runs. These all lead to much better products, but not to perfection. Knowing that all of the code has run doesn't mean that complex interactions between inputs won't lead to bizarre outputs. As the number of decision paths increases, the difficulty of creating comprehensive tests skyrockets.
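A small, contrived example illustrates the gap. The function below is hypothetical. Two test cases execute every line and both outcomes of each decision, yet one of the four possible paths through it still fails.

```python
def average_rate(count: int, reset: bool, calibrate: bool) -> float:
    """Contrived routine with two independent decisions."""
    window = 10
    if reset:
        window = 0
    if calibrate:
        window += 5
    return count / window


def path_count(decisions: int) -> int:
    """Paths through a routine with this many independent two-way decisions."""
    return 2 ** decisions


# These two cases run every line and take each branch both ways.
covering_cases = [
    (100, True, True),    # window = 5
    (100, False, False),  # window = 10
]
results = [average_rate(*case) for case in covering_cases]
```

The untested combination, reset without calibrate, leaves window at zero and the division fails. With only two decisions there are four paths to try; with 30 independent decisions there are more than a billion.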
Perhaps the nature of engineering is that perfection itself is not really a goal. Products are as good as they have to be. Competition is a form of evolution that often leads to better quality. In the '70s Japanese automakers, who had practically no U.S. market share, started shipping cars that were reliable and cheap. They stunned Detroit, which was used to making a shoddy product that dealers improved and customers tolerated. Now the playing field has leveled, but at an unprecedented level of reliability.
Perfection may elude us, but we must find better ways to build our products. Wise developers spend their entire careers engaged in the search.
Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges.