Engineering schedules are notoriously fickle, but Jack shows a proven technique for getting close to the truth.
In 1986 the U.S. Department of Defense (DoD) estimated that the F-22, a stealth jet aircraft, would cost $12.6 billion to develop. By 2001, with the work nearing completion, the figure climbed to $28.7 billion. The 1986 schedule appraisal of 9.4 years soared to 19.2 years.
A rule of thumb suggests that smart engineers double their estimates when creating schedules. Had DoD hired a grizzled old embedded systems developer back in 1986 to apply this 2x factor the results would have been within 12% on cost and 2% on schedule.
Interestingly, most developers spend about 55% of their time on a project, closely mirroring our intuitive tendency to double estimates.
Traditional Big-Bang delivery separates a project into a number of sequentially performed steps: requirements analysis, architectural design, detailed design, coding, debugging, testing, and, with enormous luck, delivery to the anxious customer. But notice there's no explicit scheduling task. Most of us realize that it's dishonest to even attempt an estimate till the detailed design is complete, as that's the first point at which the magnitude of the project is really clear. Realistically, it may take months to get to this point.
In the real world, though, the boss wants an accurate schedule by Friday.
So we diddle triangles in Microsoft Project, trying to come up with something that seems vaguely believable, though no one involved in the project actually credits any of these estimates with truth. Our best hope is that the schedule doesn't collapse till late into the project, deferring the day of reckoning for as long as possible.
In the rare (unheard of?) case where the team does indeed get months to create the complete design before scheduling, they're forced to solve a tough equation:
schedule = effort/productivity
Simple algebra, indeed, yet usually quite insolvable. How many know their productivity, measured in lines of code per hour or any other metric?
Alternatively, the boss benevolently saves us the trouble of creating a schedule by defining the end date himself. Again there's a mad scramble to move triangles around in Project to, well, not to create an accurate schedule, but to make one that's somewhat believable. Until it inevitably falls apart, again not till some time in the distant future, we hope.
Management is quite insane when using either of these two methods. Yet they do need a reasonably accurate schedule to coordinate other business activities. When should ads start running for the new product? At what point should hiring start for the new production line? When and how will the company need to tap capital markets or draw down the line of credit? Is this product even worth the engineering effort required?
Some in the agile software community simply demand that the head honcho toughens up and accepts the fact that great software bakes at its own rate. It'll be done when it's done. But the boss has a legitimate need for an accurate schedule early, and we legitimately cannot provide one without investing months in analysis and design. There's a fundamental disconnect between management's needs and our ability to provide.
There is a middle way.
Do some architectural design, bring a group of experts together, have them estimate individually, and then use a defined process to make the estimates converge to a common meeting point. The technique is called Wideband Delphi and can be used to estimate nearly anything, from software schedules to the probability of a spacecraft failure. When originally developed by the Rand Corporation in the 1940s it was used to forecast technology trends, something perhaps too ambitious due to the difficulty of anticipating revolutionary inventions like the transistor and laser. Barry Boehm extended the method in the 1970s.
At the recent Boston Embedded Systems Conference several consultants asked how they can manage to survive when accepting fixed-price contracts. The answer: use Wideband Delphi to understand costs. How do we manage risks from switching CPUs or languages? Use Wideband Delphi. When there's no historical cost data available to aid estimation by analogy, use Wideband Delphi.
The Wideband Delphi method recognizes that the judgment of experts can be surprisingly accurate. But individuals often suffer from unpredictable biases, and groups may exhibit “follow the leader” behavior. Wideband Delphi shortcuts both problems.
Wideband Delphi typically uses three to five “experts”: experienced developers, people who understand the application domain and who will be building the system once the project starts. One of these people acts as a moderator to run the meetings and handle resulting paperwork.
The process starts by accumulating the specifications documents. One informal survey at the Embedded Systems Conference a couple of years ago suggested that 46% of us get no specs at all, so at the very least develop a list of features that the marketing droids are promising to customers.
More of us should use features as the basis for generating requirement documents. Features are, after all, the observable behavior of the system. They're the only thing the customer (the most important person on the project and the one who ultimately pays our salaries) sees.
Consider converting features to use cases, which are a great way to document requirements. A use case is a description of a behavior of a system. The description is written from the point of view of a user who has just told the system to do something in particular and is written in a common language (English) that both the user and the programmer understand. A use case captures the visible sequence of events that a system goes through in response to a single user stimulus. A visible event is one the user can see. Use cases do not describe hidden behavior at all.
Although there's lots to like about using the Unified Modeling Language, it's a complex language that our customers will never get. UML is about as useful as Esperanto when discussing a system with non-techies. Use cases, on the other hand, grease the conversation.
Here's an example of a use case for the action of a single button on an instrument:
- Description: Describes behavior of the “cal” button
- Actors: User
- Preconditions: System on, all self-tests OK, standard sample inserted.
- Main scenario: When the user presses the button the system enters the calibrate mode. All three displays blank. It reads the three color signals and applies constants (“calibration coefficients”) to make the displayed XYZ values all 100.00. When the cal is complete, the “calibrated” light comes on and stays on.
- Alternative scenario: If the input values are below (20, 20, 20) the wrong sample was inserted or the system is not functioning. Retry three times then display “- - - - -” and exit, leaving the “calibrated” light off.
- Postconditions: The three calibration constants are stored and then all other readings outside of “cal” mode are multiplied by these.
Hidden features are in this action as well: ones the customer will never see, but that we experts realize the system needs. These features may include debouncing code, a real-time operating system, protocol stacks, and the like. Add these derived features to the list of system requirements.
Next select the metric you'll use for estimation. One option is lines of code (LOC). Academics fault the LOC metric and generally advocate some form of function points as a replacement. But most of us practicing engineers have no gut feel for the scale of a routine with 100 function points (FPs). Change that to 100 LOC and we have a pretty darn good idea of the size of the code.
Oddly, the literature is full of conversions between FPs and LOC. On average, across all languages, one FP burns around 100 lines of code. For C++ the number is in the 50s. So the two metrics are, for engineering purposes at least, equivalent.
Estimates based on either LOC or FP suffer from one fatal flaw. What's the conversion factor to months? Hours are a better metric.
Never estimate in weeks or months. The terms are confusing. Does a week mean 40 work hours? A calendar week? How does the 55% utilization rate factor in?
The guestimating game
Here's how the guestimating game works. The moderator gives all requirements including both observed and derived features to each team member. A healthy company will give the team time to create at least high-level designs to divide complex requirements into numerous small tasks, each of which gets estimated via the Wideband Delphi process.
Team members scurry off to their offices and come up with their best estimate for each item. During this process they'll undoubtedly find missing tasks or functionality, which gets added to the list and sized up.
It's critical that each person log any assumptions made. Does one developer figure most of Display_result() is reused from an earlier project? That assumption has a tremendous impact on the estimate. Table 1 shows the typical reuse benefits.
Table 1: Typical reuse benefits
*Cost as a percent of the cost to develop a new module.
Source: NASA, “Manager's Handbook for Software Development,” Software Engineering Laboratory Series, SEL-84-101, November 1990
When estimating, team members ignore all schedule pressure to make an honest assessment of the actual delivery date. They also assume that they themselves will be doing the tasks.
Use units that make sense. There's no practical difference between 21 and 22 hours; our estimates will never be that good. Consider emulating the time base switch on an oscilloscope that has units of 1, 2, 5 and 10. That is, estimate 5 hours, not 4; 50, not 40.
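The oscilloscope-style rounding can be sketched in a few lines of Python. This is only an illustration: the function name `snap_to_scale` is hypothetical, and picking the nearest scale value (with ties going to the larger one) is an assumption, since the text only says to prefer values on the 1, 2, 5, 10 sequence.

```python
import math

STEPS = (1, 2, 5, 10)  # the oscilloscope-style sequence within one decade


def snap_to_scale(hours):
    """Return the 1-2-5-10 scale value closest to a raw estimate in hours."""
    if hours <= 0:
        raise ValueError("estimate must be positive")
    # Find the decade the estimate falls in: 1, 10, 100, ...
    decade = 10 ** math.floor(math.log10(hours))
    candidates = [step * decade for step in STEPS]
    # Assumption: choose the nearest value; a tie goes to the larger,
    # more cautious candidate.
    return min(candidates, key=lambda c: (abs(c - hours), -c))


for raw in (4, 21, 22, 40):
    print(raw, "->", snap_to_scale(raw))
```

With this rule 4 becomes 5 and 40 becomes 50, matching the examples in the text, while 21 and 22 both collapse to 20.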
All participants then gather in a group meeting, working on a single feature or task at a time. The moderator draws a horizontal axis on the white board representing the range of estimates for this item, and places Xs to indicate the various appraisals, keeping the source of the numbers anonymous.
At this point the team generally comes up with quite a range of predictions; the distribution can be quite disheartening for someone not used to the Wideband Delphi approach. But here's where the method shows its strength. The experts discuss the results and the assumptions each has made. Unlike other rather furtive approaches, Wideband Delphi shines a 500,000 candlepower spotlight into the estimation process.
The moderator accepts secret ballots from each participant and plots the results of a second round of estimates. The results nearly always start to converge.
The process continues till four rounds have occurred or till the estimates have converged sufficiently. Now compute the standard deviation of the final reckonings.
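The stopping rule for a single task might look something like the sketch below. The article does not say what "converged sufficiently" means, so the 15% threshold (standard deviation relative to the mean) is purely an assumption, as are the names `has_converged` and `run_delphi`, the use of the sample standard deviation, and the made-up estimates.

```python
from statistics import mean, stdev

MAX_ROUNDS = 4        # the cap on rounds given in the text
SPREAD_LIMIT = 0.15   # hypothetical: stop once stdev/mean is 15% or less


def has_converged(estimates, spread_limit=SPREAD_LIMIT):
    """True when the sample standard deviation is small relative to the mean."""
    return stdev(estimates) / mean(estimates) <= spread_limit


def run_delphi(rounds):
    """Walk through successive rounds of estimates (in hours) for one task.

    Returns (rounds_used, mean, standard deviation) for the last round used.
    """
    for number, estimates in enumerate(rounds[:MAX_ROUNDS], start=1):
        if has_converged(estimates):
            break
    return number, mean(estimates), stdev(estimates)


# Made-up ballots from four estimators over three rounds.
rounds = [
    [10, 20, 50, 20],   # round 1: wide scatter
    [20, 20, 50, 20],   # round 2: one holdout remains
    [20, 20, 25, 20],   # round 3: close enough under the assumed limit
]
result = run_delphi(rounds)
print(result)
```

In practice the moderator, not a threshold, decides when the group has talked itself out; the numeric test is just one way to make that call repeatable.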
Each feature or task has an estimate, $S_i$, with standard deviation $\sigma_i$. Sum them to create the final project schedule:
$$S_t = \sum_i S_i$$
Combine all of the standard deviations:
$$\sigma_t = \sqrt{\sum_i \sigma_i^2}$$
With a normal distribution at one standard deviation you'll be correct 68% of the time. At two sigma count on 95% accuracy. So there's a 95% probability of the project finishing between $S_{lower}$ and $S_{upper}$, defined as follows:
$$S_{lower} = S_t - 2\sigma_t$$
$$S_{upper} = S_t + 2\sigma_t$$
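The roll-up from per-task numbers to a project range can be sketched in Python. This is a minimal sketch, assuming the task estimates are independent so that their standard deviations combine as the square root of the sum of squares; the function name `project_range` and the sample task figures are hypothetical.

```python
import math


def project_range(tasks):
    """Combine per-task results into a project total and a two-sigma range.

    tasks: list of (estimate_hours, sigma_hours) pairs, one per task.
    Returns (lower, total, upper) in hours.
    """
    total = sum(estimate for estimate, _ in tasks)
    # Assumption: tasks are independent, so the variances add.
    sigma_total = math.sqrt(sum(sigma ** 2 for _, sigma in tasks))
    return total - 2 * sigma_total, total, total + 2 * sigma_total


# Made-up final-round results for three tasks: (mean hours, standard deviation).
tasks = [(20, 3), (50, 4), (100, 12)]
lower, total, upper = project_range(tasks)
print(f"{total} hours, 95% range {lower:.0f} to {upper:.0f} hours")
```

Here the sigmas combine to 13 hours, so the 170-hour total carries a range of 144 to 196 hours. Note that one uncertain task (the 12-hour sigma) dominates the spread, which is a useful hint about where more design work would pay off.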
Roadmaps get you there faster
An important outcome of Wideband Delphi is a heavily refined specification. The discussion and questioning of assumptions sharpens vague use cases. When coding does start, it's much clearer what's being built. Which, of course, only accelerates the schedule.
Wideband Delphi might sound like a lot of work. But the alternative is clairvoyance. You might as well hire a gypsy with her crystal ball.
Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges.
Maybe for the consultants at the Embedded Systems Conference it was an acceptable answer, but how does an *individual* consultant take advantage of Wideband Delphi? Like code reviews, this sounds like another one of those techniques that requires at least a small group of developers.
– Brad Peeters
Jack Replies – Brad, you make a good point. There is no way to do Wideband Delphi as a solo activity. Like baseball, certain activities require a team of some sort.
But for code inspections there are some approaches, though not ideal, that a solo developer can use. Check out Peer Reviews in Software, by Karl Wiegers. He describes all sorts of alternatives.
The ability of several independent observers to come to a better conclusion collectively than by relying on a single, even highly experienced, person has been discussed in “The Wisdom of Crowds”.
Interesting book, but I'm not sure I buy in to the conclusions. I think I need more evidence and more theory that explains the evidence.
– Vladimir Ivanovic
Do you advocate accepting the mean estimates of the 3-5 developers, or multiplying their estimates by a factor of 1.0/0.55 = 1.8? If so, is the purpose of the factor to account for vacation/sick days, previous project cleanup, and priority interrupts? At our company I have seen factors between 1.2 and 1.8 used for this purpose.
– Gary Kenaley
Jack Replies: I use the mean estimate of the 3-5 developers in hours, and then multiply by 1.8, and divide by 40, to get weeks. That then has to be adjusted for vacation and the like. The 1.8 factor accounts for the normal in-office stuff that detracts from work, like meetings, memos, and the like.
The 1.8 number may be different depending on the company. Some outfits may have problems in that developers get pulled off projects to support other projects. You can reduce the 1.8 by accounting for these interrupts: whenever they happen, add the same number of hours back into the project. But 1.8 is an industry average.
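The arithmetic in the reply above is simple enough to sketch in Python. The function name `calendar_weeks` and the sample estimates are hypothetical, and the result still needs the adjustment for vacation and similar absences that the reply mentions.

```python
from statistics import mean

OVERHEAD_FACTOR = 1.8   # roughly 1/0.55, the average cited in the reply
HOURS_PER_WEEK = 40


def calendar_weeks(estimates_hours, overhead=OVERHEAD_FACTOR):
    """Convert the estimators' mean (in hours) to working weeks.

    Vacation and similar time off are not included here.
    """
    return mean(estimates_hours) * overhead / HOURS_PER_WEEK


# Made-up final estimates from three developers, in hours.
weeks = calendar_weeks([400, 500, 600])
print(f"{weeks:.1f} weeks")
```

A shop that tracks its interrupts separately could pass a smaller `overhead`, such as the 1.2 mentioned in the question, and add the interrupt hours back in as they occur.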