When you've seen one bug, you definitely haven't seen them all. How many are there? Biology provides the answers.
How long will it take to test the software? The answer to this ubiquitous and often-vexing question depends heavily on how many defects are lurking within. In a past life (mid 1960s), I was a forestry major, and knew that wildlife biologists routinely use statistical sampling methods to estimate populations of lake fish, birds, and other animals. So what about programming bugs? Could we use one of these techniques to take an early sample and come up with a usable estimate of total defects?
Using two independent test teams to obtain this early estimate was suggested in 1970 by H. D. Mills in the IBM FSD Report, and more recently by Steve McConnell, who termed it “defect pooling” in his Software Project Survival Guide. (None of the references I list at the end of this article describes the two independent test team estimation process beyond extremely general terms.)
I explored this technique while working on a master's thesis in computer science. After examining several different models, I initially selected the Petersen Method, which has been widely used since the late 1930s to estimate animal and fish populations.[6,7] I used data from previously completed software testing performed on firmware for several digitizing oscilloscope projects at Agilent (where I work) to determine both how well and how early this prediction method worked.
Before proceeding, let me state that I am neither a wildlife biologist nor a statistician. Fortunately, I had an immense amount of help from Doug Krieger (aquatic biologist, Colorado Department of Wildlife) and Greg Kruger (statistician, Agilent).
The technique of capture-mark-recapture (also known as mark-recapture or capture-recapture) has been widely used for many years to estimate wildlife populations. The Petersen Method (also known as the Lincoln-Petersen Method) is a mark-recapture technique for estimating fish and animal populations. Examples of fish, bird, and animal population estimates can be found in publications such as Ecology, Journal of Wildlife Management, BioScience, Journal of Mammalogy, and The American Midland Naturalist.
It has been suggested that this method of tagging a known number of specimens and then catching a sample containing both marked and unmarked fish can be applied in the software world by using two independent groups of software testers. This provides a simple model for predicting the number of software errors, one that makes fewer assumptions than more sophisticated models.
The first team finds “A” defects, while the second team finds “B” defects. The “C” defects are those common to both teams. These duplicate defects correspond to the number of marked fish retrieved in a mark-recapture procedure. The estimated total number of defects in the software is then given as:[2,3,4,5]
$$N = \frac{A \times B}{C}$$
where:
N = estimated total defects in the tested software
A = total defects found by test team A
B = total defects found by test team B
C = total defects that were found by both teams (duplicate or common)
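As a minimal sketch, the simple pooled estimate N = (A × B) / C can be computed as follows. The function name and the sample counts are illustrative assumptions, not values from the projects discussed here.

```python
def pooled_defect_estimate(a: int, b: int, c: int) -> float:
    """Simple Petersen (defect pooling) estimate of total defects.

    a: defects found by team A
    b: defects found by team B
    c: defects found by both teams (must be at least 1)
    """
    if c <= 0:
        # With no common defects the ratio is undefined, so no estimate exists.
        raise ValueError("at least one common defect is required")
    return a * b / c


# Hypothetical counts, for illustration only.
estimate = pooled_defect_estimate(25, 20, 4)
print(estimate)
```

Note that the estimate is undefined until the two teams share at least one defect, which is one reason the adjusted form is attractive for small samples.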
Researchers make Simple Petersen Population Estimates of the fish world in two ways. One method is to place a given number of tagged fish in a body of water by removing fish from the population (capture), marking them for future identification, and returning them to the population. Another method is to place a known number of marked same-species fish that were raised in captivity into the population. In either case, time is allowed for both marked and unmarked subjects to mix. A sample is then taken from the population (recapture) and the ratio of the total number of fish in the sample to the number of marked fish in the sample is used to determine an estimate for the total population as indicated by the following equation:[6,7]
$$\frac{N}{M} = \frac{S}{R}$$
where:
N = total fish population; unknown
M = number of fish tagged (marked); known
S = total number of fish recaptured (both marked and unmarked); known
R = number of tagged (marked) fish recaptured; known
The total fish population (N) is then given by:
$$N = \frac{M \times S}{R}$$
The formula for Simple Petersen Population Estimates exhibits a tendency to overestimate the true population until the sample size becomes quite large. An adjusted version that appears to work better for smaller sample sizes is:[6,7]
Since my work attempted to produce an estimate of the total number of defects early in the testing process, the need to keep the sample size as low as possible meant that the Adjusted Petersen Method was more appropriate.
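For reference, a commonly cited adjusted form is Chapman's modification of the Petersen estimator, written here in the fish notation defined earlier. Treating this as the variant meant by "Adjusted Petersen Method" is an assumption; other adjusted forms exist in the literature, such as Bailey's M(S + 1)/(R + 1).

```latex
\hat{N} = \frac{(M + 1)(S + 1)}{R + 1} - 1
```

In the defect notation, M, S, and R become A, B, and C respectively, so the same expression reads (A + 1)(B + 1)/(C + 1) − 1.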
Variance of the Petersen estimator
George Seber gives a formula for the variance of the estimate:
Translating this formula into the notation used here, we have:
N = total number of software defects
A = total number of software defects found by test team A
B = total number of software defects found by test team B
C = total number of defects found by both team A and team B
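Assuming the estimator is Chapman's form, (A + 1)(B + 1)/(C + 1) − 1, the variance estimate usually attributed to Seber can be written in this notation as shown below. This is a sketch under that assumption rather than a transcription of the thesis formula.

```latex
\widehat{\mathrm{Var}}(\hat{N}) = \frac{(A + 1)(B + 1)(A - C)(B - C)}{(C + 1)^{2}\,(C + 2)}
```

The variance shrinks quickly as C grows, because C appears cubed in the denominator; this is why the number of qualified duplicates drives the width of the confidence interval.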
The difficulty of doing something is always in the details. The literature on statistical population analysis only discusses defect pooling in broad terms; no one gives any explicit instructions on how to proceed. So when I was working on this with my master's committee, we had many spirited discussions as to how to go about this. What we ended up with may need to be monitored and revised if the method becomes more widely used. What follows is a brief discussion of the methodology we adopted to apply the Adjusted Petersen Method to software.
The final defect databases for the twelve previous oscilloscope projects were analyzed. Each defect record contained an identifying number, a description, and the name of the individual tester who found it. These databases were used to compare the results when using the Adjusted Petersen Method versus the actual totals found during directed abuse testing for the twelve projects.
The names of the testers for each project were divided into two test teams (A and B). Testers were assigned to teams by using a table of random numbers. As each new tester was identified in the database, a random number was selected from the table. All odd-numbered testers were assigned to team A; those having even numbers to team B.
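The assignment step above can be sketched in a few lines. This is an illustrative stand-in that uses a pseudo-random generator instead of a printed table of random numbers; the function name, the tester names, and the seed are hypothetical.

```python
import random


def assign_teams(testers, seed=None):
    """Assign each distinct tester to team A or B by drawing a random number.

    Testers are processed in the order they first appear in the database
    export. An odd draw puts the tester on team A, an even draw on team B.
    """
    rng = random.Random(seed)  # stand-in for a table of random numbers
    teams = {}
    for name in testers:
        if name in teams:
            continue  # tester already assigned when first encountered
        number = rng.randint(0, 99999)
        teams[name] = "A" if number % 2 == 1 else "B"
    return teams


# Hypothetical submitter column from a defect export.
submitters = ["ann", "bob", "ann", "cy", "bob", "dee"]
print(assign_teams(submitters, seed=1))
```

Because assignment is by tester rather than by defect, the two teams can end up unequal in size; the estimator does not require equal teams.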
A qualified duplicate was defined as a defect found by both teams. A defect resolved as a duplicate in the database was counted as a qualified duplicate only if it could be related to another defect, either because the number of the other defect was documented in its resolution text or because the two defects could be matched from their descriptions.
The twelve project databases in the defect tracking system were then examined until the total number of qualified duplicate defects for each project reached four. The number four was picked so that the method could produce the earliest possible unbiased estimate. This technique of sampling until a predetermined number of marked items are found is known as inverse sampling.
The total number of defects found by each team was monitored, as was the total number of qualified duplicate defects they found. Duplicate defects found by members of the same team were not counted as qualified duplicates (because we felt that this was analogous to two fishermen both landing the same fish) and did not contribute to the total count of four.
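The counting rules above can be sketched as follows, assuming the duplicate links have already been established by hand. The `Report` record and its `duplicate_of` field are hypothetical and do not describe the actual defect tracking database.

```python
from typing import NamedTuple, Optional


class Report(NamedTuple):
    defect_id: int
    team: str                    # "A" or "B"
    duplicate_of: Optional[int]  # id of an earlier report of the same defect


def tally_until(reports, target=4):
    """Walk reports in submission order until `target` qualified duplicates.

    Returns (A, B, C): distinct defects found by team A, by team B, and
    by both teams. Same-team duplicates are ignored entirely.
    """
    root = {}      # report id -> id of the underlying defect
    found_by = {}  # underlying defect id -> set of teams that found it
    qualified = 0
    for r in reports:
        if r.duplicate_of is None:
            base = r.defect_id
        else:
            base = root.get(r.duplicate_of, r.duplicate_of)
        root[r.defect_id] = base
        teams = found_by.setdefault(base, set())
        if r.team not in teams:
            teams.add(r.team)
            if len(teams) == 2:
                qualified += 1  # first time the other team finds this defect
        if qualified == target:
            break  # inverse sampling: stop at the predetermined count
    a = sum("A" in t for t in found_by.values())
    b = sum("B" in t for t in found_by.values())
    return a, b, qualified


# Hypothetical reports in submission order.
reports = [
    Report(1, "A", None), Report(2, "B", None), Report(3, "B", 1),
    Report(4, "A", 1), Report(5, "A", 2), Report(6, "B", None),
]
print(tally_until(reports, target=2))
```

In this sketch a defect found by both teams counts toward A, B, and C, mirroring a marked fish that is both in the marked population and in the recapture sample.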
When the total count of qualified duplicate defects reached four, we tabulated the total number of defects found by each team and adjusted for any duplicates found within that team. Next, we applied the formula for the Adjusted Petersen Method to produce the prediction for the total number of system test defects (software, documentation, usability, and hardware) that were actually reported during the project's Directed Abuse Test. We then calculated the 95% confidence interval[7,8,9] for the estimate on the project and compared the predicted total to the actual total number of system test defects logged during Directed Abuse Tests in the Agilent defect tracking database. The actual total also was checked to see if it fell within the project's 95% confidence interval, which was determined from the formula for the variance of the Adjusted Petersen estimator given earlier (since a 95% confidence interval spans ±1.96 standard deviations, and the standard deviation is the square root of the variance[8,9]).
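The calculation described above can be sketched as follows. The code assumes the Chapman form of the adjusted estimator, (A + 1)(B + 1)/(C + 1) − 1, and the variance estimate usually attributed to Seber; both are assumptions about which variants were used, and the input counts are hypothetical.

```python
import math


def adjusted_petersen(a: int, b: int, c: int) -> float:
    """Adjusted Petersen estimate of total defects (assumed Chapman form)."""
    return (a + 1) * (b + 1) / (c + 1) - 1


def estimate_variance(a: int, b: int, c: int) -> float:
    """Variance estimate for the adjusted estimator (assumed Seber form)."""
    return (a + 1) * (b + 1) * (a - c) * (b - c) / ((c + 1) ** 2 * (c + 2))


def confidence_interval_95(a: int, b: int, c: int):
    """Normal-approximation interval: estimate +/- 1.96 standard deviations."""
    n_hat = adjusted_petersen(a, b, c)
    half_width = 1.96 * math.sqrt(estimate_variance(a, b, c))
    return n_hat - half_width, n_hat + half_width


# Hypothetical counts at the moment the fourth qualified duplicate is found.
a, b, c = 25, 20, 4
low, high = confidence_interval_95(a, b, c)
print(adjusted_petersen(a, b, c), estimate_variance(a, b, c), low, high)
```

With these made-up counts the estimate is 108.2 defects and the interval runs from roughly 40 to 177, which illustrates how wide the bounds are when only four duplicates are available.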
As you can imagine, this was quite tedious. Instead of wading through the defect tracking database, I used its report writer to create a file of the defects, identification numbers, submitters (testers), and one-line descriptions. Next, I used “sed” scripts to add the team identifiers (A or B) for the testers. Because I could not figure out a good way to automate the determination of four qualified duplicate defects, I did it by hand for all the projects examined. I then used an “awk” script to total the defects for the two test teams. The resulting numbers were then plugged into the Adjusted Petersen Method formulas via a small BASIC program.
Next, I used a two-tailed t-test to check for bias in the relative errors (difference between actual and estimate) from the four projects that produced estimates. The variances from the estimates were examined to show the efficiency of the estimator. Finally, the Percent of Testing required for an Estimate (PTFE) was calculated to determine if the estimation method provided an early estimate of the total number of defects.
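A bias check of this kind can be sketched as a one-sample, two-tailed t-test of the relative errors against zero. The error values below are invented for illustration and are not the thesis data; defining relative error as (estimate − actual) / actual is also an assumption.

```python
import math
import statistics


def one_sample_t(values, mu=0.0):
    """t statistic for the hypothesis that the mean of `values` equals mu."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return (mean - mu) / (sd / math.sqrt(n))


# Critical value of Student's t: two-tailed, alpha = 0.05, 3 degrees of freedom.
T_CRIT_DF3 = 3.182

# Hypothetical relative errors for four projects (not the thesis data).
relative_errors = [0.10, -0.20, 0.05, -0.15]

t = one_sample_t(relative_errors)
biased = abs(t) > T_CRIT_DF3
print(f"t = {t:.3f}, biased at the 5% level: {biased}")
```

With only four projects the test has little power, so failing to detect bias is weak evidence that none exists.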
The estimation model was initially applied to software projects that had been previously tested, so the total number of actual defects was already known. Of the twelve projects analyzed, six did not produce the minimum of four qualified duplicate defects required and were discarded; no estimate was possible. Differences between the testing teams and their approaches to preventing duplicate finds and reporting are some of the possible problems with performing this type of analysis after the fact.
Table 1: Results of applying the Adjusted Petersen Method to Agilent projects
Columns: Actual # of Defects | Estimated # of Defects | Variance | 95% Confidence Interval
Table 1 shows a summary of results from applying the Adjusted Petersen estimation method to the six software projects that did yield a minimum of four qualified duplicate defects. It shows that four of the projects for which an estimate was possible had an actual total number of defects (N) that fell within the 95% confidence interval of the estimate.
A two-tailed t-test was performed to check the estimation method for bias; the results were that the estimator was not statistically biased.
The variances for the estimates as shown in Table 1 provide an indication as to the efficiency of the estimator. To determine relative efficiency, another estimation method needs to be applied to these same projects, and the variances of its results compared to the variances obtained via the Adjusted Petersen Method.
But based on the experience of using this estimation model, just having the numbers that were provided by the model for the 95% confidence bounds would have been extremely important information for both the managers and the developers of these projects.
Mean percentage of testing required for an estimate (PTFE)
The mean PTFE, using four qualified duplicate defects as the minimum needed for an estimate, can be calculated from Table 1 by averaging the individual PTFE values of those four projects that did not violate basic assumptions of the estimation method.
This shows that an estimate was made for those four projects, on average, after 28.3% of the testing was completed.
It is important to be aware of the assumptions that must hold true in order to validate a statistical technique for purposes of estimation. If any of the assumptions for an estimate are invalid, its results become highly suspect or invalid. For example, if I toss a coin, the estimate that it will land head up is 50%; some of the implicit assumptions in this estimate are that the coin has both head and tail (not one-sided), that the coin is not weighted, and so on. The following assumptions are given for Petersen estimates in the literature:[6,7]
- Marked fish have the same mortality rate as unmarked fish. It is assumed in the estimation model used for software that the marked fish correspond to duplicate defects, that is, those found by both testing teams.
- Marked fish are as vulnerable to the type of fishing being used as the unmarked fish. For software, this would translate to duplicate defects being as likely to be found during testing (fishing) as non-duplicate defects.
- Marked fish do not lose their mark. After a defect is identified as a qualified duplicate in the defect tracking system, it cannot lose its mark.
- Marked fish become randomly mixed with the unmarked fish. Since the estimation method does not extract a defect, mark it, and reinsert it for later discovery, this would not appear to be a problem.
- The fishing is random over the body of water. Agilent's Directed Abuse Testing takes measures to ensure that testing is evenly distributed over the features being tested. Project 6 seemed to violate this assumption; only one new feature was added to a previous project and the testing was heavily concentrated on this one new feature. Subsequently, the actual value for the total number of defects found was outside (above) the 95% confidence interval provided by the estimation method.
- All marked fish are recognized and reported when recovered during fishing. Since qualified duplicates are used to represent the marked fish, if not all duplicates are recognized as such (possibly due to differing descriptions from the two test teams) the resulting estimate could become invalid.
- There is no significant population recruitment during fishing once marking occurs. Recruitment (population change via death, birth, emigration, or immigration) can effect a major change. Open populations experience recruitment; closed populations do not. The Petersen Method is thus an estimation model for closed populations; there are more advanced models for open populations. Project 12 appeared to violate this assumption; not all features were completed when testing began. The resulting recruitment (new defects being created) resulted in an actual value of total defects found which was outside (above) the 95% confidence interval provided by the estimation method.
The post-release defect densities for the digitizing oscilloscope projects that were part of this study are all less than 0.5 defect per KNCSS (thousand lines of non-comment source statements). Given this low number of post-release defects, it was further assumed that the total number of defects recorded in Agilent defect tracking databases for final system test are extremely close to the actual totals. Although this is an approximation, it seems reasonable since the post-release defect counts are quite small. (Post-release defects were not included in the total number to be estimated, however, since the method was attempting to estimate the total number of defects found by final system test.)
Based on the data observed and the procedures followed, the thesis concluded that the Adjusted Petersen Method provided a reasonably good early estimate of total final system test defects.
If nothing else, the exercise was valuable in that we now are using a more mathematically rigorous approach to many of our estimates (schedules, resources/budgets, sales, defects, and so on) by documenting and monitoring our assumptions and by using interval estimates (for example, 95% confidence level) instead of point estimates. If assumption(s) are violated, we revisit the estimate.
When the Adjusted Petersen Method is applied to projects as they are tested, as opposed to being simulated from past data as was done in the thesis work, new estimates could be periodically produced as the number of qualified duplicates increases past four. These subsequent estimates could then be examined at the end of the project, when the actual total number of final system test defects is known, to determine if the progression of estimates became successively more accurate. This would test the consistency of the estimator.
Future research in this area could include exploring many of the other available (and much more sophisticated) wildlife estimation techniques for both closed and open populations.[6,7] The involvement of professional statisticians would also certainly help, especially if they are experienced in biological statistics.
If you get adventurous and decide to apply this method to estimating the total number for your final systems test defects, please contact me and let me know how it works out.
In the words of George Box, “All models are wrong; some models are useful.”
Mark Lambuth is a 26-year veteran of Hewlett-Packard, now Agilent, in Colorado Springs. He received his MS in computer science (with a concentration in systems software engineering) from Colorado Technical University. He is currently involved in software quality and process work, primarily for digitizing oscilloscopes.
Special thanks for all the long hours and hard work by my master's committee: Charles Schroeder, Carol Beckman, and Bo Sanden.
1. Mills, H. D. “On the Statistical Validation of Computer Programs,” IBM FSD Report, July 1970.
2. McConnell, Steve. Software Project Survival Guide. Redmond, WA: Microsoft Press, 1998.
3. Myers, Glenford. Software Reliability: Principles and Practices. New York: John Wiley & Sons, 1976.
4. Humphrey, Watts. Introduction to the Team Software Process. Reading, MA: Addison-Wesley, 2000.
5. Schulmeyer, G. and McManus, J. Handbook of Software Quality Assurance. London: Thomson Computer Press, 1996.
6. Ricker, W. E. Computations and Interpretation of Biological Statistics of Fish Populations. Ottawa, Canada: Department of the Environment, Fisheries and Marine Sciences, 1975.
7. Seber, George. Estimation of Animal Abundance and Related Parameters. London: Edward Arnold Publishing, 1982.
8. Myers, R. and R. Walpole. Probability and Statistics for Engineers and Scientists. New York: MacMillan, 1972.
9. Bluman, A. Elementary Statistics, a Step by Step Approach (2nd edition). Dubuque, IA: Wm. C. Brown, 1995.
10. Fenton, Norman and Shari Pfleeger. Software Metrics: A Rigorous and Practical Approach. London: Thomson Computer Press, 1996.