Bugs are more manageable than you think. Tracking their patterns will help you get them under control.
It seemed a simple project, really. The 32KB ROM constraint set a reasonable limit on firmware size and complexity. Bob, the solo developer, had built several of these controllers for his company; one more wouldn't tax his skills.
His off-the-cuff estimate of 5,000 lines of C started him down the thorny path to complacency. Turns out the guess wasn't that far off, but the minimal size lured him into hasty design sans reviews. Within a week of starting the project, he was cranking code. main() was a simple loop, the ISRs trivial. Functions sprouted like weeds, none particularly tricky, but each interacting in increasingly complex ways.
Well ahead of schedule, he started debugging...and the project became mired in problems. Weeks turned into months, months into quarters, and every day, legions of elusive bugs appeared. It seemed that fixing one bug created or unmasked three others. The due date came and went, pestering QA people never left Bob alone, and the boss's demands grew ever more shrill. The project was eventually cancelled.
Sound familiar? How about this one?
The team lead was a graduate of many a development effort, and had honed his skills by constantly searching each completed project for lessons learned. A thick veneer of process mediated every phase of design, coding, and implementation. Every intermediate due date was met on time and without fuss. Design completed, coding mostly done, developers fired up their debuggers and quickly tested most of the application-level code.
But this was an embedded system, and the lead's experience largely came from creating business packages. Rumbles of discontent from the application guys grew louder as their debugging efforts were stymied by the lack of low-level drivers.
The poor sods interfacing to the complex ASIC chipset seemed unable to produce anything that worked. Sure, they had tons of code-all of which behaved in unpredictable and buggy ways. Exasperated and unable to understand the peculiar environment of firmware developers, the team lead demanded explanations. "The docs are wrong or non-existent; we don't know how these peripherals really work; the EEs keep inverting and transposing register bit assignments, and, well, at some point we had to make some assumptions about how things were supposed to work. These seem to be consistently wrong."
The original driver designs so cleanly transposed to C++ code were then changed, fixed, modified, and hacked to meet the seemingly inscrutable requirements of the ASIC until the code was an ultra-tuned and fragile assemblage that no one understood. Every tiny, apparently benign edit led to the collapse of other, unrelated, functions. Schedules slipped, the drivers never improved to any significant degree, and eventually the over-budget project was cancelled.
Both stories are true, and show that despite the best and worst of engineering practices an avalanche of bugs can destroy a project.
Bugs are inevitable. Lousy engineering produces them but great engineering is no sure cure. The very best of processes will surely reduce their incidence, but cannot eliminate them-at least in the commercial world, where time to market and costs are important concerns.
This suggests that wise developers manage bugs as aggressively as the legal department manages lawsuits. No one wants either suits or bugs, but since both are facts of life, it's critical we engage them up front, plan for them, and deal with them efficiently.
Last month I talked about making and using measurements in firmware development. It usually makes sense to look at how metrics can help us when we find consistent and expensive problems.
As I mentioned, managing by measurement is a Bad Thing in a creative environment like development. But managing without measurement is equally dysfunctional. The trick is to find a mix of reasonable metrics that yield important data easily.
To recapitulate my main message, never take data unless you're prepared to go through all four steps of the measurement process:
- Collect the data.
- Make sense of it-raw data is usually pretty raw.
- Present the data to your boss, your colleagues, or perhaps even to just yourself.
- Most importantly, act on the data. The only reason to collect software metrics is to enable change.
Stories and stats
Over the years, I've worked with an awful lot of engineering groups, and have observed that most devote somewhere around half of the schedule to debugging. Few actually plan for this much time, of course. Most usually have only a vague awareness that developers spend an awful lot of time in front of their debuggers.
It's astonishing how many development teams never even plan for bugs. Debugging is almost an afterthought, something that, despite repeated experience, folks figure won't be much of an issue. They're wrong.
Fifteen years in the in-circuit emulator (ICE) business showed me a lot about how teams go about bug management. I observed, for instance, that 70% of our ICEs were shipped FedEx overnight. Why is nearly everyone operating in panic mode?
In most cases, developers ran into real-time problems that exceeded the capability of their tools. When planning the project their egos subliminally suggested that their work would be nearly perfect. Reality hits, the ROM monitor just isn't up to tracking interrupts, tasks, DMA activity, interprocessor communications, or any of the other complexities that infest real systems. A frantic search for more powerful tools ensues.
Take it from one who has seen this behavior hundreds of times. (I'm long out of the tool business, by the way, and have no interests in it anymore, and no ax to grind.) Good tools are cheap. Their price is usually irrelevant. Spend $10,000 or more on the right debugger and you'll save much more than that over the life of the project. Plan to buy the tools early, rather than reacting to the problems that always surface. You'll save money.
A lot of developers maximized their costs by panicking and renting tools. Original contracts rarely exceeded three months; single-month rentals were quite common. Yet, practically without exception, customers extended the contracts a month at a time, at the exorbitant single-month rate. By the time a year went by, some chose to convert the rental to a purchase; many others continued the monthly rental extensions for two or three years.
Tool vendors don't care if customers purchase, rent, or lease their products. All have learned that "short-term rental" almost always translates into "long-term revenue stream." We warned our customers that they were usually better off just buying outright, but they resisted for two reasons.
The first is the "just one more bug" syndrome. The product almost works. One intractable defect holds up shipment; when fixed, we're done. Why not spend $500 to rent a tool for one month?
After watching too many developers fall prey to this condition, I've learned it's a sign of trouble. "Just one more bug and we're done" really means that the team is reacting to defects, desperately swatting them down as they appear. No one knows the real status of the product, but hopeful pronouncements to the boss keeps cancellation or layoffs at bay, for a while.
Monthly rentals were also more attractive than purchasing because bosses just don't get the debugging picture. They sense that bugs are a symptom of lousy engineers. "You want me to spend $10,000 on a tool that finds your mistakes?" How can one answer that question truthfully without appearing incompetent? No logical argument will overcome the boss's disgust.
Hardware developers need several PCB spins. Doctors spend $100,000 a year on malpractice insurance. Airplanes are outfitted with black boxes in case of crashes. Most professions account for and expect mistakes. For some reason, this is not true in software engineering.
If my observation that many of us spend half the project in debug is even close to correct, then optimizing debugging is the best way to accelerate schedules. As Michael Barr noted in his May editorial ("Code of Conduct," p. 5), we should all commit to code inspections to find bugs cheaply. Finding bugs through inspections is 20 times cheaper than through conventional debugging. Don't believe me? Discount that by an order of magnitude and inspections are still twice as effective as debugging.
Beyond inspections, it pays to have a quantitative idea of where the bugs are. Did you know that the average software project-without inspections-runs about a 5% error rate? Even with a clean compile, figure on one bug per 20 lines of code. Expect 500 bugs in a little 10,000-line project. No wonder half the schedule is eaten by debugging.
There are extremes: the Space Shuttle project runs one bug per 400,000 lines of code, at a cost of $1,000 per line. That's taxpayer money, of course-there's plenty more where that came from. I've also seen too many developers with a 20% error rate. When someone says they can crank 200 lines of C per day, figure on massive back-end schedule slippages.
Count bugs. It's a sure way to improve the quality of the code. The process of taking such measurements leads to fewer bugs, and thus faster deliveries.
Barry Boehm showed that 80% of the bugs are in 20% of the modules. That confirms a feeling that many of us have. On most projects, there are a few functions that are consistently infuriating. Change anything-even a comment-and the silly thing breaks. No one, not even the original designer, really knows how it works anymore. We try to beat the function into submission with constant tweaking, but it's an exercise in futility.
Boehm showed that development costs for these nuisance functions average four times that of any other function. So it's pretty obvious that if we can identify the problem modules quantitatively and early, we can take corrective action cheaply. In other words, count bugs per function and be ready to toss out any chunk of code that stands out from the rest. Boehm's 4x factor means it's cheaper to toss it early than to beat it into submission.
Suppose Joe Coder is assigned function X. He hunches over the computer in his miniscule cubicle, trying to ignore the PA system and phone conversations going on around him, and creates the code. It gets inspected and only then does Joe start testing.
What Joe does during debugging is between him, the computer, and God. No one else needs to know the nature of his development sins. Nor is it practical to expect Joe to log errors. Few of us can live up to such a high standard of honesty.
But at some point he releases the function to the rest of the team, with his "tested" imprimatur stamped upon it. Perhaps I should say he inflicts it on the team, as every problem that surfaces now impacts the whole project.
This is the time to count and track bugs. As soon as more than a few appear, it's time to look closely at the code and make hard decisions. Is the function turning into a nightmare? A constant source of trouble? If the bug rate is significantly higher than any other function, toss the code, start over, and get it right. Everyone is entitled to screw up occasionally, but we do know that the second time we'll get it right.
Don't just count the bugs. Log the nature of the problem. Go over the list from time to time looking for patterns, for they are sure to appear. Do you regularly seem to mix up "=" and "=="? Once the pattern is clear you're less likely to make the mistake in the first place.
Learn from mistakes
I collect embedded disaster stories as they are all instructive. One that struck me was the 1994 crash of the Chinook helicopter. As is true in most aviation accidents, many factors contributed to the crash. However, the engine control software was at least partially at fault.
After the accident, a review board attempted to study the code, giving up after reading less than 20% due to the vast number of issues found. Some 56 category 1 errors (outright coding errors or spec non-compliance) and over 400 other issues surfaced.
To me this is a prime example of dysfunctional bug measurement. Though probably necessary to assign blame and ultimately correct the problems, ex post facto software analysis always results from not taking the data up front, during development.
There's always time to do it right the second time.
Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. His e-mail address is email@example.com.
1. Davis, Alan. 201 Principles of Software Development. New York: McGraw-Hill, 1995.
3. Boehm, Barry. Software Engineering Economics. Upper Saddle River, NJ: Prentice Hall, 1981.
4. "T55 Software Verification Report-Final Report." EDS Corp.