When I started writing firmware and for years afterward, few people outside of the electronics design community gave a thought to the countless embedded systems around them. At the time, I found it difficult to explain to most friends and relatives what exactly it was that I did for a living. Yet embedded software was all around them at home and at work–in their phones, anti-lock brakes, laser printers, and many other important products. But to these folks, “software” was something you bought in a box at a store and installed on the one “computer” you owned.
Today, of course, there are countless embedded systems per person, and our health and wellbeing are both greatly enriched by and increasingly dependent upon proper functioning of the firmware inside. Consumers now notice them and think of them as software containers–if only because they require frequent reboots and upgrades. And there is no letup in sight: several billion more such devices are produced each year.1
Lawsuits are on the rise, too. In recent years, I've been called into U.S. District Court (as an expert witness) in several dozen lawsuits involving embedded software. I've met others with similar experiences and become aware of many other cases. Popular claims range from copyright theft and patent and trade secret infringement to traditional products liability with a firmware twist. Unfortunately, the quality and reliability of our collective firmware leave the door open to an ever-increasing number of the latter.
This code stinks!
At a recent Embedded Systems Conference, I gave a popular free talk titled “This Code Stinks! The Worst Embedded Code Ever” in which I used lousy code from real products as a teaching tool. The example code was gathered by a number of engineers from a broad swath of companies over several years.2
Listing 1 shows just one example of the bad code in that presentation. I don't know whether the snippet contains any bugs, as most of the other examples were found to. And that's a problem. Where are we supposed to begin an analysis of the code in Listing 1? What is this code supposed to do when it works? What range of input values is appropriate to test? What are the correct output values for a given input? Is this code responsible for handling out-of-range inputs gracefully?
The original listing had no comments on or around this line to help. I eventually learned that this code computes the year, accounting for extra days in leap years, given the number of days since a known reference date (such as January 1, 1970). But I note that we still don't know if it works in all cases, despite it being present in an FDA-regulated medical device. I note too that the Microsoft Zune Bug3 was buried in a much better formatted snippet of code that performed a very similar calculation.
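For contrast, here is a minimal sketch of how a days-to-year calculation of this kind might be written so that its intent, inputs, and outputs are reviewable. It is not the code from Listing 1 or from the Zune; the names is_leap_year and year_from_days are hypothetical, and the sketch assumes that day 0 is January 1, 1970 and that Gregorian leap-year rules apply.

```c
#include <stdbool.h>
#include <stdint.h>

/* Gregorian rule: every 4th year is a leap year, except century
 * years that are not divisible by 400. */
static bool is_leap_year(uint32_t year)
{
    return ((year % 4u == 0u) && (year % 100u != 0u)) || (year % 400u == 0u);
}

/* Returns the calendar year that contains the given day.
 * Assumption: day 0 is January 1, 1970 (hypothetical reference date).
 * Any uint32_t input is accepted; there is no out-of-range case. */
uint32_t year_from_days(uint32_t days_since_epoch)
{
    uint32_t year = 1970u;

    for (;;)
    {
        uint32_t days_in_year = is_leap_year(year) ? 366u : 365u;

        /* The remaining days fall inside this year, including the
         * last day of a leap year (remaining == 365). */
        if (days_since_epoch < days_in_year)
        {
            return year;
        }

        /* Every pass removes at least 365 days, so the loop must end. */
        days_since_epoch -= days_in_year;
        year++;
    }
}
```

The boundary values suggest the test cases: the first and last day of an ordinary year, the 366th day of a leap year, and the first day of the year that follows it. For example, day 14244 is December 31, 2008 and day 14245 is January 1, 2009.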
Listing 2 contains another example, this time in C++, with the bug-finding left as an exercise for the reader. You can find the full set of slides from my talk online at http://bit.ly/badcode.
Lest you think that the examples from the presentation are exceptions to the norm, found only because I and other engineers were on the prowl for bad code, consider just a couple of examples stemming from the more obvious embedded software failures.
First, recall the Patriot Missile failure in Dhahran, Saudi Arabia during the first Gulf War. Twenty-eight U.S. soldiers were killed when a Scud missile was not shot down due to improper tracking by the Patriot Missile battery protecting a military base. A report from the U.S. Government Accountability Office examined the events leading to the failure and concluded the problem was partly in the requirements: the government didn't tell the designer it would need to “operate continuously for long periods of time.” Huh!? “At the time of the incident, the battery had been operating continuously for over 100 hours”.5, 6
Now consider a more recent example. GPS-maker Garmin announced a “free, mandatory GPS software update to correct a software issue that has been discovered to cause select GPS devices to repeatedly attempt to update GPS firmware and then either shut down or no longer acquire GPS satellite signals.” This sounds to me like a bug in their bootstrap loader (a.k.a., bootloader). Many Garmin GPS units are named as affected, including members of the popular nüvi product family.7
Or consider what a consumer had to say about his Celestron SkyScout Personal Planetarium recently in a forum at Amazon.com: “I'm downloading the second firmware update release since I've had my SkyScout . . . about 3 weeks. Each release is making the device more stable.”
Finally, consider these quotes from the recent recall of a device regulated by the U.S. Food and Drug Administration–an AED (automated external defibrillator):
• “Units serviced in 2007 and upgraded with software version 02.06.00 have a remote possibility of shut down during use in cold environmental conditions. There are no known injuries or deaths associated with this issue. The units will be updated with the current version of software.”
• “All of the recalled units will be upgraded with software that corrects [another] unexpected shutdown problem. In the meantime . . . it is vital to follow the step 1-2-3 operating procedure which directs attachment of the pads after the device has been turned on. This procedure is described on the back of your device and also in the Quick Reference material inside the AED 10 case. Some pages in the user's manual may erroneously describe or show illustrations of [a different] operating procedure . . . Please disregard these erroneous instructions.”
At least one death was reported at a time when the second type of unexpected software shutdown occurred. Are bugs in the embedded software to blame for that too? If not, how did the User's Manual come to be out of sync with the firmware in a process-driven FDA-regulated environment?
Given the above, is it not appropriate to wonder if the unexplained loss of Air France 447 over the Atlantic Ocean earlier this year was firmware-related? An abrupt 650-ft. dive experienced by an Airbus A330 flight in October 2008 may offer clues to the loss of Air France 447. Authorities have blamed a pair of simultaneous computer failures for that event in the fly-by-wire A330. First, one of three redundant air data inertial reference units began giving bad data. Then, a voting algorithm intended to handle precisely such a failure in one unit by relying only on the other two failed to work as designed; the flight computer instead made decisions only on the basis of the one failed unit! “More than 100 of the 300 people on board were hurt, with broken bones, neck and spinal injuries, and severe lacerations splattering blood throughout the cabin.”8 A lawsuit is pending.
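To make the idea of voting among three redundant units concrete, here is a minimal sketch of one common approach, mid-value selection, in C. This is an illustration only and is not the algorithm used in the A330; the function name vote_median3 and the use of plain integer readings are assumptions.

```c
#include <stdint.h>

/* Mid-value select: returns the median of three redundant readings.
 * If any single unit reports a wildly wrong value, that value lands
 * at one extreme and is never the one selected. */
int32_t vote_median3(int32_t a, int32_t b, int32_t c)
{
    int32_t lo = (a < b) ? a : b;   /* smaller of the first two */
    int32_t hi = (a < b) ? b : a;   /* larger of the first two  */

    if (c < lo)
    {
        return lo;                  /* order is c, lo, hi */
    }
    if (c > hi)
    {
        return hi;                  /* order is lo, hi, c */
    }
    return c;                       /* order is lo, c, hi */
}
```

With readings of 100, 101, and 9999, the function returns 101 no matter which of the three units produced the outlier. A scheme like this tolerates one faulty unit but not two that fail in the same direction.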
Take a deep breath
Firmware bugs seem to be everywhere these days. So much so that firmware source-code analysis is even entering the courtroom in criminal cases involving data collection devices with software inside. Consider the precedent-setting case of the Alcotest 7110. After a two-year legal fight, seven defendants in New Jersey drunk driving cases won the right to have their experts review the source code for the Alcotest firmware.9
The state and the defendants both ultimately produced expert reports evaluating the quality of the firmware source code. Although the two sides' experts reached divergent opinions as to the overall code quality, several facts seem to have emerged as a result of the analysis:
• Of the available 12 bits of analog-to-digital converter precision, just 4 bits (most-significant) are used in the actual calculation. This sorts each raw blood-alcohol reading into one of 16 buckets. (I wonder how they biased the rounding on that.)
• Out-of-range A/D readings are forced to the high or low limit. This must happen with at least 32 consecutive readings before any flags are raised.
• There is no feedback mechanism for the software to ensure that actuated devices, such as an air pump and infrared sensor, are actually on or off when they are supposed to be.
• The software first averages the initial two readings. It then averages the third reading with that average. Then the fourth reading is averaged in, and so on. No comments or documentation explain the use of this formula, which causes the final reading to have a weight of 0.5 in the “average” and the one before that to have a weight of 0.25, and so forth.
• Out-of-range averages are forced to the high or low limit, too.
• Static analysis with lint produced over 19,000 warnings about the code (that's about three warnings for every five lines of source code).
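Two of the findings above are easy to reproduce in a few lines of C. The sketch below is not the Alcotest source code; the function names are hypothetical, the readings are modeled as doubles for clarity, and the 4-bit reduction is assumed to be a simple truncation, since the reports as summarized here do not say how rounding was handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Keeps only the 4 most-significant bits of a 12-bit A/D sample,
 * sorting raw values 0..4095 into buckets 0..15.
 * Assumption: truncation, with no rounding bias applied. */
uint8_t adc12_to_bucket(uint16_t raw12)
{
    return (uint8_t)((raw12 & 0x0FFFu) >> 8);
}

/* The averaging scheme described above: average the first two
 * readings, then average each later reading with the running result.
 * The last reading ends up with weight 0.5, the one before it 0.25,
 * and so on. Returns 0.0 for an empty input. */
double pairwise_running_average(const double *readings, size_t count)
{
    if (count == 0u)
    {
        return 0.0;
    }

    double average = readings[0];
    for (size_t i = 1u; i < count; i++)
    {
        average = (average + readings[i]) / 2.0;
    }
    return average;
}

/* Ordinary arithmetic mean, for comparison: every reading has equal
 * weight. Returns 0.0 for an empty input. */
double arithmetic_mean(const double *readings, size_t count)
{
    if (count == 0u)
    {
        return 0.0;
    }

    double sum = 0.0;
    for (size_t i = 0u; i < count; i++)
    {
        sum += readings[i];
    }
    return sum / (double)count;
}
```

For the four readings 0, 0, 0, 1 the pairwise scheme yields 0.5 while the arithmetic mean is 0.25; for 1, 0, 0, 0 the pairwise scheme yields 0.125. The order of the readings changes the result, which is not what most people expect from an “average.”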
What would you infer about the reliability of a defendant's blood-alcohol reading if you were on that jury? If you're so inclined, you can read the full expert reports for yourself.10
A better way
Don't let your firmware source code end up in court! Adopt a coding standard that will prevent bugs and start following it; don't wait a day. Run lint and other static analysis and code complexity tools yourself, rather than waiting for an expert witness to do it for you. Make peer code reviews a regular part of every working day on your team. And establish a testing environment and regimen that allows for regression testing at the unit and system level. These best practices won't ensure perfect quality, but they will show you tried your best.
I'll have more to say about keeping bugs out of embedded software in my next few columns. Meanwhile, try not to think about all the firmware upon which your life depends.
Michael Barr is the author of three books and over 50 articles about embedded systems design, as well as a former editor in chief of this magazine. Michael is also a popular speaker at the Embedded Systems Conference and the founder of embedded systems consultancy Netrino.
2. Minor details, including variable names and function names, were changed as needed to hide the specifics of applications, companies, or programmers.
4. Hint: This code was embedded in a piece of factory automation equipment.
5. GAO's report can be found at www.fas.org/spp/starwars/gao/im92026.htm.
6. In fact, soldiers in Israel had previously discovered that the Patriot Missile software's ability to track an incoming missile degraded in just eight hours, and they had a software upgrade to fix it.