Getting disciplined about embedded software development: Part 3 - The value of postmortems - Embedded.com

Getting disciplined about embedded software development: Part 3 – The value of postmortems

The TV camera pans across miles of woodland, showing ghastly images of wreckage. Some is identifiable: the remnants of an engine, a child's doll, scattered papers from a businessperson's briefcase; much is not. The reporter, on a mission to turn tragedy into a career, breathlessly pours facts and speculation into the microphone. Shocked viewers swear off air travel till time diminishes their sense of horror.

Yet the disaster, a calamity of ineffable proportions to those left waiting for loved ones who never come home, is in fact a success of sorts. The NTSB searches for and finds the black boxes that record the flight's final moments, and over the course of months or years reconstructs the accident.

We've all seen the stunning computer-generated final moments of a plane's crash on the Discovery Channel. Experts find the root cause of the incident and then change something. Maybe there's a mechanical flaw in the plane's structure, or perhaps an electrical fire initiated the accident. The FAA issues instructions to the aircraft's builders and users to implement an engineering change.

Perhaps the pilots were confused by their instrumentation, or they handled the wind shear incorrectly. Maybe maintenance people serviced a control surface incorrectly. Or perhaps it was found that Americans are getting fat so old loading guidelines no longer apply (as was recently the case in one incident). Changes are made to training or procedures. This sort of accident never happens again.

A jet cruises in the sparse air at 40,000 feet where it's 60 below zero. Four hundred thousand pounds of aluminum traveling at 600 knots relies on a complex web of wiring, electronics, mechanics, and plumbing to keep the passengers safe. It's astonishing a modern plane works at all, yet air travel is the safest form of transportation ever invented. The reason is the feedback loop that turns accidents into learning experiences.

Contrast the airplane accident with the carnage on our roads—over 40,000 people are killed in the United States of America each year in car crashes; another 2 million are injured. The accident ends with the car crash (plus enduring litigation); we learn nothing from either, we take no important lessons away, we make no changes in the way we drive.

Traffic slows around the emergency crews cutting a twisted body from the smashed car, but then we're soon standing hard on the accelerator again, weaving in and out of traffic inches from the bumper ahead, in a manic search to save time that may shave, at best, a few seconds from the commute.

Carmakers do improve the safety of their vehicles by adding crumple zones and air bags, but the essential fact is that the danger sprouts from poor driving. The car and driver represent a system without feedback, running wildly out of control.

Feedback stabilizes systems. Every EE knows this. Amplifiers all use negative feedback to control their output. An oscillator has positive feedback, and so, well, oscillates.
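To put a number on the amplifier case, here is a minimal sketch using the textbook closed-loop gain relation for an ideal negative-feedback amplifier, G = A / (1 + A*beta). The gain values and the function name are illustrative assumptions, not figures from any particular design.

```python
def closed_loop_gain(open_loop_gain: float, feedback_fraction: float) -> float:
    """Closed-loop gain of an ideal negative-feedback amplifier: A / (1 + A*beta)."""
    return open_loop_gain / (1.0 + open_loop_gain * feedback_fraction)


# Illustrative values only: open-loop gain A and feedback fraction beta.
nominal = closed_loop_gain(100_000, 0.01)
degraded = closed_loop_gain(50_000, 0.01)  # open-loop gain cut in half

# Relative change in closed-loop gain caused by halving the open-loop gain.
change_pct = (nominal - degraded) / nominal * 100.0

print(f"nominal closed-loop gain:  {nominal:.2f}")
print(f"degraded closed-loop gain: {degraded:.2f}")
print(f"change: {change_pct:.2f}%")
```

Halving the open-loop gain moves the closed-loop gain by only about a tenth of a percent, which is the sense in which negative feedback stabilizes the output.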

Feedback stabilizes human systems as well. The IRS's pursuit of tax cheats keeps most 1040s relatively honest. A recent awful crash on my street led to a week or two of radar enforcement. Speeds dropped to the mandated 30 mph, but the police soon moved on to other neighborhoods.

Feedback does—or should—stabilize embedded development efforts. Most of the teams I see work madly on a project, delivering late and buggy. The boss is angry and customers are screaming. Yet as soon as the thing gets out the door we immediately start developing another project. There's neither feedback nor introspection.

Resumes abound with “experience”; often that engineer with two dozen projects and 20 years behind him has actually had the same experience time after time. The same old heroics and the same bad decisions form the fabric of his career. Is it any wonder so few systems go out on time?

The role of engineering managers
In most organizations the engineering managers are held accountable for getting the products out in the scheduled time, at a budgeted cost, with a minimal number of bugs. These are noble, important goals.

How often, though, are the managers encouraged (no, required) to improve the process of designing products?

The Total Quality movement in many companies seems to have bypassed engineering altogether. Every other department is held to the cold light of scrutiny, and the processes tuned to minimize wasted effort. Engineering has a mystique of dealing with unpredictable technologies and workers immune to normal management controls. Why can't R & D be improved just like production and accounting?

Now, new technologies are a constant in this business. These technologies bring risks, risks that are tough to identify, let alone quantify. We'll always be victims of unpredictable problems.

Worse, software is very difficult to estimate. Few of us have the luxury to completely and clearly specify a project before starting. Even fewer don't suffer from creeping featurism as the project crawls toward completion.

Unfortunately, most engineering departments use these problems as excuses for continually missing goals and deadlines. The mantra “engineering is an art, not a science” weaves a spell that the process of development doesn't lend itself to improvement.

Phooey.

Engineering management is about removing obstacles to success. Mentoring the developers. Acquiring needed resources.

It's also about closing feedback loops. Finding and removing dysfunctional patterns of operation. Discovering new, better ways to get the work done. Doing things the same old way is a prescription for getting the same old results.

Doing software project postmortems
How do developers go about learning more about their craft? Buy a pile of books, perhaps read some of them, peruse the magazines, go to conferences, bring in outside gurus. These are all great and necessary steps. But it's astonishing that most refuse to learn from their own actions.

A company may spend hundreds of thousands to millions developing a project. Many things will go right and too many wrong during the work. Wise developers understand that their engineering group does indeed make products, but is also a laboratory where experiments are always in progress.

Each success is a Eureka moment, and each failure a chance to gain insight into how not to do development. Edison commented that, though he had had 1000 failures in his pursuit of some new invention, he had also learned 1000 things that do not work.

We can fool ourselves into thinking that each of these success/failure moments is a powerful learning tool. Sure, we take away some insight. But this is a casual way to learn, one that's personal and so of no benefit to other team members.

I prefer to acquire experience scientifically. Firmware development is too expensive to take any other approach. We must use the development environment as a laboratory to discover solutions to our problems. This means all projects should end with a postmortem, a process designed to suck the educational content of a particular development effort dry.

The postmortem is a formal process that starts during the project itself. Collect data artifacts as they are generated—for instance, the estimated schedule, the bug logs, and change requests. Include technical information as well, such as the estimated size (in lines of code and in object file bytes) versus actuals, real-time performance results, tool issues, etc.
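As a minimal sketch of what comparing estimates with actuals might look like, the snippet below computes the percentage overrun for a few of the metrics mentioned above. The `Metric` class, the metric names, and every number are made up for illustration; a real team would substitute whatever it actually logged.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    """One estimated-versus-actual pair collected during the project."""
    name: str
    estimated: float
    actual: float

    def overrun_pct(self) -> float:
        """Percentage by which the actual exceeded the estimate (negative if under)."""
        return (self.actual - self.estimated) / self.estimated * 100.0


def summarize(metrics):
    """Map each metric name to its overrun percentage, rounded to one decimal."""
    return {m.name: round(m.overrun_pct(), 1) for m in metrics}


# Hypothetical data, for illustration only.
metrics = [
    Metric("schedule_weeks", 20, 29),
    Metric("lines_of_code", 12_000, 15_000),
    Metric("object_bytes", 48_000, 60_000),
]

report = summarize(metrics)
for name, pct in report.items():
    print(f"{name}: {pct:+.1f}% versus estimate")
```

Even a table this small gives the history-day meeting something quantitative to argue about instead of recollections.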

After the product is released, schedule the postmortem. Do it immediately upon project completion while memories are still fresh and before the team disbands (especially in matrix organizations). My rule of thumb is to do the postmortem no more than 3 days after project completion.

Management must support the process and must make it clear this work is important.

Dysfunctional organizations that view firmware as a necessary evil will try to subvert anything that's not directly linked to writing code. In this case run a stealth postmortem, staying under the radar of the top dogs. If even the team lead doesn't buy into this sort of process-improvement endeavor, I guess you're doomed and might as well start looking for a better job.

A facilitator runs the postmortem. In many activities I advocate rotating all team members through the moderator/leader role, even those soft-spoken individuals afraid to participate in verbal exchanges. It's a great way to teach folks better social and leadership skills.

But postmortems tend to fail without a strong leader running the show. Use the team lead, or perhaps a developer well-respected by the entire group, one who is able to run a meeting.

All of the developers participate in the postmortem. We're trying to maximize the benefits, so everyone is involved and everyone learns the resulting lessons. In some cases it might make sense to bring in folks involved in the project in other ways, such as the angry customer or QA people.

The facilitator first makes it clear there are but two ways to get into trouble. First, it is the end of the project, which is probably late; we're all tired and hate each other. Despite this everyone must put in a few more hours of hard work, as the postmortem is so important. Slack off and you'll get zinged.

Second, obstruct or trash the process and expect to be fired. “Yeah, this is just another stupid process thing that is a waste of time” is a clear indication you're not interested in improving. We don't want developers who insist on remaining in a stasis field.

He or she also ensures the postmortem isn't used to beat up on a particular developer who might have been a real problem on the project. The fundamental rule of management must apply: praise publicly, discipline privately. Deal with problem people off-line.

Hold a sort of history-day meeting. Run by the facilitator, it's where we look at the problems encountered during the project. The data that was acquired during the effort is a good source of quantitative insight into the issues.

This is not a complaint session. The facilitator must be strong enough to quash criticisms and grumbling. It's also not an attempt to solve any problem. Rather, identify problems that appear solvable. Pick a few that promise the maximum return on investment.

Resist the temptation to solve all of the ills suffered during the project. I'm a child of the 1960s. At the time we thought we could save the world—we couldn't. But it was possible to implement small changes, to make some things better. Don't expect any one postmortem to lead you to firmware nirvana. Postmortems are baby steps we take to move to a higher plane. Try to do too much and the effort will collapse.

Pick a few problems—depending on the size of the group—maybe 2, 3, or 4. Break the team into groups, with each group tasked to crack a single issue.

The groups must focus on creating solutions that are implementable, and that are composed of action items. If the project suffered from unending streams of changes from the 23-year-old marketing droid, a solution like “stop accepting changes,” or “institute a change control process” is useless. These are nice sentiments but will never bear fruit.

Create plans specifying particular actions: "Joe evaluates change control tools by April 1. Selects one. Trains entire team on the process by April 15. No uncontrolled changes accepted after that date."
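One way to keep a group honest about this distinction is to record each proposed solution as a structured item and reject anything without an owner and a date. The sketch below is only an illustration: the `ActionItem` class is hypothetical, and the year is an assumption, since the plan above names only the month and day.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """A proposed fix; it is actionable only if someone owns it and it has a deadline."""
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None

    def is_actionable(self) -> bool:
        return bool(self.description.strip()) and bool(self.owner) and self.due is not None


YEAR = 2009  # assumed for illustration; the plan in the text gives no year

plan = [
    ActionItem("Evaluate change control tools and select one", "Joe", date(YEAR, 4, 1)),
    ActionItem("Train entire team on the change control process", "Joe", date(YEAR, 4, 15)),
    ActionItem("Institute a change control process"),  # a sentiment, not a plan
]

# Items that have no owner or no due date get sent back to the group.
vague = [item.description for item in plan if not item.is_actionable()]
print("Needs an owner and a date:", vague)
```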

If each group comes back and presents its solutions to the entire team, the postmortem process will absolutely fail. We engineers have huge egos. Each of us knows we can solve any problem better than almost anyone else. If team A comes in and tells me how to fix myself I'll immediately toss out a dozen alternate approaches. The meeting will descend into chaos and nothing will result.

Instead, before making any presentations, team A solicits input on its idea from each developer. This is low-key, done via water-cooler meetings. The team is looking for both ideas and buy-in. Use Congress as a model: nothing happens on the House floor. All negotiations take place in back rooms, so when the vote occurs on the floor it's all but a fait accompli.

A final meeting is held, at which time the solutions are presented and recorded. In writing. End with the post-project party. What! You don't do those? The party is an essential part of maintaining a healthy engineering group. All ones and zeros makes Joe a dull boy. The party eases tensions created by the intense work environment. But it happens only after the project is completely finished, including the postmortem.

The postmortem is done, the team disbands. Now the most important part of the postmortem begins. That's the closing of the loop, the employment of feedback to improve future projects. When the next development effort starts, the leader and all team members should—must!—read through all of the prior postmortems. This is the chance to avoid mistakes and to learn from the past. A report that's filed away in a dusty cabinet never to surface is a waste of time.

A 1999 study (“Techniques and Recommendations for Implementing Valuable Postmortems in Software Development Projects” by Gloria H. Congdon, Master's thesis at the University of Minnesota, May 1999) showed that of 56 postmortems the developers found 89% of them very worthwhile. The 11% that failed are the most interesting: developers rated them bad to awful because there was no follow-through. The postmortem took place but the results were ignored.

Enlightened management or those companies lucky enough to have a healthy process group will use the accumulated postmortems outside of project planning to synthesize risk templates. If a pattern like “every time we pick a new CPU we have massive tool problems” emerges, then it's reasonable to suggest never changing CPUs, or taking some other action to mitigate this problem.
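Spotting such a pattern can be as simple as counting how many projects reported each problem. The following is a rough sketch under the assumption that each postmortem report has been reduced to a list of short problem labels; the project names, labels, and the `recurring_problems` helper are all hypothetical.

```python
from collections import Counter


def recurring_problems(postmortems, min_projects=2):
    """Return problem labels reported by at least `min_projects` different projects."""
    counts = Counter()
    for problems in postmortems.values():
        counts.update(set(problems))  # count each label at most once per project
    return sorted(label for label, n in counts.items() if n >= min_projects)


# Hypothetical postmortem summaries, for illustration only.
postmortems = {
    "project_a": ["new CPU: tool problems", "late requirement changes"],
    "project_b": ["new CPU: tool problems", "underestimated code size"],
    "project_c": ["late requirement changes", "new CPU: tool problems"],
}

risks = recurring_problems(postmortems)
print("Candidate entries for the risk template:", risks)
```

Raising `min_projects` narrows the list to the problems that show up almost every time, which are the ones most worth a standing mitigation.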

Plane crashes, though tragic, are used in a healthy way to prevent future accidents, to save future lives. Shouldn't we employ a similar feedback mechanism to save future projects, to learn new ways to be more effective developers?

To read Part 1, go to Any idiot can write code.
To read Part 2, go to The Seven-Step Plan.

This article was printed with permission from Newnes, a division of Elsevier, Copyright 2008, from “The Art of Designing Embedded Systems, Second Edition” by Jack Ganssle. For more information about this title and other similar books, go to www.elsevierdirect.com.

With 30 years in this field Jack was one of the first embedded developers. He writes a monthly column in Embedded Systems Design about integration issues, and is the author of two embedded books: The Art of Designing Embedded Systems and The Art of Programming Embedded Systems. Jack conducts one-day training seminars that show developers how to develop better firmware, faster.
