Measuring Changes in Software with CLOC

As software developers we are often asked to measure different software characteristics. Decisions are based upon facts and figures, which are generated through taking precise measurements.

In order to better control and manage software development, computer scientists work to develop better software measurements. These measures can assist in choosing development environments, clarifying and focusing goals, and in ascertaining the cost of design decisions.

The factors involved with source code are numerous and varied. They include the degree of expertise required to develop the code and the size, complexity, reliability, and performance of the code.

Although software's complex nature makes reliance on a single measurement method impractical, there are a myriad of good techniques for quantifying software. The current techniques each provide absolute values for certain characteristics of a static piece of software.

However, it can sometimes be useful to quantify the evolution (i.e., the amount of change) that occurs in a software project (Figure 1 below), rather than to take a static absolute measurement of a single instance.


Figure 1: Evolution of the LOC and files of a software project

The evolution of a software project can be based upon the reuse of code, which increases programmer productivity by allowing for the reuse of programming artifacts such as the designs, specifications, tests, and of course the code itself [1, p. 65].

When we were recently asked to evaluate the changes in a piece of software from one version to another, we realized that the existing metrics were good at measuring single static values, but were not sufficient for quantitatively measuring the evolution of the software. We therefore devised a method to measure the changes in software from one version to another.

The new measure is called Changing Lines of Code, or “CLOC” for short, and is based upon the kinetic nature of software development. This new method is demonstrated through an analysis of two popular open-source projects: the Mozilla Firefox web browser and the Apache HTTP Server.

Traditional Methods
The traditional method for measuring software evolution is to compare the total number of lines of code between two versions. There are variations of this method that differ in the way functional and nonfunctional lines of code are counted, but they all produce the same basic results.

We concentrated on the “source lines of code” (“SLOC”) method that simply counts the number of non-blank lines of source code (“LOC”) in a program, which includes both functional and nonfunctional code.

The SLOC method can provide static absolute measurements for two completely different programs, which can roughly translate to the relative efforts involved in developing the programs. If one program has a higher SLOC count then it can reasonably be concluded that it is more complex and required more effort to develop.

This method has some major flaws when examining the evolution between subsequent versions of a single software project because it measures absolute values and does not account for the nature of software evolution [2, p. 18].

Source Code Evolution
The evolution of software from version to version is a mixture of several different kinds of changes. Figure 1 above shows how the LOC and files of a software project evolve.

There are detailed changes that absolute static measurements of the two versions cannot account for. Specifically, the maintenance and development of the software project can result in some of the original files being removed and other files continuing with either changed lines or with lines being removed.

The SLOC method is not accurate enough to measure software evolution because the subsequent version is not just the result of adding new LOC to the original version.

A software project could have a great deal of refactoring between versions that does not significantly increase the overall SLOC of the project, but still represents a significant amount of effort and evolution.

Even small modifications to existing code can represent large amounts of effort, because they involve understanding the software to be refactored and then performing additional testing [3, p. 2]. Therefore, it is necessary to take these additional details of software evolution into account.
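The point can be illustrated with a small Python sketch. The two code snippets and the helper names (`sloc`, `lines_not_in_original`) are invented for illustration, and the multiset comparison is only a rough stand-in for a real line-by-line comparison tool. Both versions have the same non-blank line count, so an SLOC comparison reports no change, yet most of the lines were rewritten.

```python
from collections import Counter


def non_blank_lines(source: str) -> list[str]:
    """Return the stripped, non-blank lines of a source string."""
    return [line.strip() for line in source.splitlines() if line.strip()]


def sloc(source: str) -> int:
    """SLOC as described in the article: the number of non-blank lines."""
    return len(non_blank_lines(source))


def lines_not_in_original(original: str, revised: str) -> int:
    """Count revised lines that have no matching line in the original.

    Assumption: a simple multiset difference, not the CodeDiff algorithm.
    """
    remaining = Counter(non_blank_lines(revised)) - Counter(non_blank_lines(original))
    return sum(remaining.values())


# Hypothetical "version 0" and "version 1" of the same function.
V0 = """
def total(items):
    result = 0
    for item in items:
        result = result + item.price
    return result
"""

V1 = """
def total(items):
    prices = [item.price for item in items]
    if not prices:
        return 0
    return sum(prices)
"""

print(sloc(V1) - sloc(V0))            # 0: SLOC sees no change
print(lines_not_in_original(V0, V1))  # 4: four of the five lines are new
```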

Changing Lines of Code Measure (CLOC)
The CLOC method addresses these discrepancies and measures the intricate changes involved. It counts the number of LOC that have been added or changed, and the number that remain unchanged.

These values are then combined to express the change in software as a rate of growth. The results can also be expressed in terms of the decay of the original code, which can be useful as a measure of how much original intellectual property (“IP”) still exists from that original code.

Measurements. The CLOC method relies upon the CodeDiff and FileCount tools in Software Analysis & Forensic Engineering Corporation's (S.A.F.E.) toolset, CodeSuite, combined with a specially developed CLOC spreadsheet.

FileCount is a function that simply counts files, lines of code, and number of bytes in a directory tree. CLOC requires that FileCount first be used to count the number of program-specific files and the number of non-blank lines in the software project's directory tree. CodeDiff is a function that exhaustively compares the lines of code in one set of source code files to those in another set of source code files.
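The kind of counting described here can be sketched in Python. This is not the FileCount tool itself; the function name `count_tree` and the default extension list are assumptions made for the example.

```python
import os


def count_tree(root: str, extensions: tuple[str, ...] = (".c", ".h")) -> dict[str, int]:
    """Count matching files, non-blank lines, and bytes under a directory tree.

    Hypothetical stand-in for a file-counting tool; the extension filter is
    an assumption about what "program-specific files" means for a project.
    """
    totals = {"files": 0, "non_blank_lines": 0, "bytes": 0}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(extensions):
                continue  # skip files that are not source code
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                data = handle.read()
            totals["files"] += 1
            totals["bytes"] += len(data)
            # A line counts only if it contains something other than whitespace.
            totals["non_blank_lines"] += sum(1 for line in data.splitlines() if line.strip())
    return totals
```

Reading the files as bytes avoids having to guess each file's text encoding, which is convenient when scanning a large, mixed source tree.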

CLOC requires that CodeDiff be used to compare same-name files from the original version to subsequent versions of the software project. As a result, code that moves to a different file is counted as new rather than continuing. This is reasonable, because the movement of source code between files typically represents work being performed.

Similarly, a file name change represents work being performed, because file names are not generally changed from version to version unless there is a significant change to the functionality of the file.

The results of the CodeDiff analysis are then exported into a CodeSuite distribution report that contains the statistical information about changes in the files and LOC.
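The same-name comparison can be approximated with Python's standard `difflib` module. This sketch is not CodeDiff, whose exhaustive comparison may match lines differently; the function name `compare_versions`, the dictionary layout (relative file path mapped to its non-blank lines), and the sample data are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def compare_versions(original: dict[str, list[str]],
                     revised: dict[str, list[str]]) -> dict[str, int]:
    """Compare same-name files and tally continuing vs. new lines and files."""
    stats = {"total_files": len(revised), "continuing_files": 0, "unchanged_files": 0,
             "total_lines": 0, "continuing_lines": 0, "new_lines": 0}
    for name, new_lines in revised.items():
        stats["total_lines"] += len(new_lines)
        if name not in original:
            # A new or renamed file: every line is treated as new.
            stats["new_lines"] += len(new_lines)
            continue
        old_lines = original[name]
        stats["continuing_files"] += 1
        if old_lines == new_lines:
            stats["unchanged_files"] += 1
        matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
        kept = sum(block.size for block in matcher.get_matching_blocks())
        stats["continuing_lines"] += kept
        stats["new_lines"] += len(new_lines) - kept  # changed or added lines
    return stats


# Hypothetical two-version project.
ORIGINAL = {
    "main.c": ["int main(void) {", "return 0;", "}"],
    "util.c": ["void helper(void) {}"],
    "old.c": ["int legacy;"],
}
REVISED = {
    "main.c": ["int main(void) {", "init();", "return 0;", "}"],
    "util.c": ["void helper(void) {}"],
    "net.c": ["void connect(void) {}", "void close(void) {}"],
}

print(compare_versions(ORIGINAL, REVISED))
```

In this example two of the three revised files continue from the original, one of them is unchanged, four lines continue, and three lines are new. Files that exist only in the original, such as `old.c`, simply drop out of the revised totals.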

The CodeDiff statistics are then combined with the FileCount numbers in the CLOC spreadsheet to generate the rate of software growth. This is demonstrated in an analysis of the Mozilla Firefox web browser. The CLOC spreadsheet for the Firefox analysis is shown in Table 1 below.


Table 1: CLOC spreadsheet for the Mozilla Firefox analysis

The data elements shown in bold are generated from formulas in the spreadsheet that use the CodeDiff and FileCount data as input, whereas the other numbers are generated automatically by FileCount and CodeDiff.

Four views of software change
The software evolution results can be expressed as four different data elements: growth, LOC decay, file decay and unchanged file decay.

Growth. The growth of each version is the ratio of the total new LOC (“TNL”) in that version to the total LOC (“TL”) in the original version. The TNL is represented by the green oval in Figure 1, and includes lines that have changed, along with completely new lines in either continuing files or completely new files.

Growth(n) = TNL_n / TL_0

LOC Decay. Sometimes it is important to know how much of the original code is still represented as a software project evolves. The LOC decay is the ratio of total continuing LOC (“CL”) to the total LOC in each subsequent version. The total continuing LOC is shown as the blue oval in Figure 1, and is the count of the LOC that are present in both the original version and the subsequent version.

LOC Decay(n) = CL_n / TL_n

File Decay. The decay can also be represented as either the file decay or the unchanged file decay. The file decay is the ratio of original files that are still remaining (“TCF”) to the total number of files in the subsequent version (“TF”).

The unchanged file decay is the ratio of the continuing files that are completely unchanged in the subsequent version (“UCF”), to the total number of files in the subsequent version.

File Decay(n) = TCF_n / TF_n
Unchanged File Decay(n) = UCF_n / TF_n
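As a worked example, the four ratios can be computed in a few lines of Python. The class and function names are hypothetical, and the counts are invented for illustration; they are not taken from the Firefox or Apache analyses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionCounts:
    """Counts for one subsequent version, named after the article's abbreviations."""
    total_lines: int       # TL for this version
    new_lines: int         # TNL: changed lines plus completely new lines
    continuing_lines: int  # CL: lines present in both the original and this version
    total_files: int       # TF for this version
    continuing_files: int  # TCF: original files still remaining
    unchanged_files: int   # UCF: continuing files that are completely unchanged


def cloc_metrics(original_total_lines: int, version: VersionCounts) -> dict[str, float]:
    """Compute growth, LOC decay, file decay, and unchanged file decay."""
    return {
        "growth": version.new_lines / original_total_lines,
        "loc_decay": version.continuing_lines / version.total_lines,
        "file_decay": version.continuing_files / version.total_files,
        "unchanged_file_decay": version.unchanged_files / version.total_files,
    }


# Invented counts: an original version of 1,000 LOC and one subsequent version.
v1 = VersionCounts(total_lines=1200, new_lines=400, continuing_lines=800,
                   total_files=50, continuing_files=40, unchanged_files=25)
metrics = cloc_metrics(original_total_lines=1000, version=v1)

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these numbers the growth is 0.40, the LOC decay is about 0.67, the file decay is 0.80, and the unchanged file decay is 0.50. Note that only growth is measured against the original version; the three decay ratios are measured against the subsequent version's own totals.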

Measured Results
The traditional SLOC measurement of growth is included as the last row of Table 1. These measurements demonstrate the advantages of the CLOC method over the SLOC method, as the visual representation in Figure 2 below clearly illustrates. According to the CLOC measurement, the Firefox software project continually grew and evolved.

Figure 2: The CLOC growth vs. SLOC growth of Mozilla Firefox software

In contrast, one can easily see in Figure 2 that the SLOC measurements rise and fall. It is safe to assume that the second most popular web browser continually grew and evolved, and that the CLOC method is therefore more accurate.

The SLOC method does not account for the more subtle changes inside the total LOC, so it fails to accurately measure the growth, evolution, and work involved in this popular open-source project.

The ups and downs of the Firefox SLOC measurements are a drastic demonstration of the traditional method's shortcomings. An examination of another popular open source project, the Apache HTTP Server, is a less drastic, but perhaps clearer demonstration of why the CLOC method should be used.

Figure 3 below shows the measured growth according to the CLOC and SLOC methods. They both follow the same basic trend, but the CLOC growth is more rapid than the SLOC growth.

Figure 3: The CLOC growth vs. SLOC growth of Apache HTTP Server software

The SLOC method leaves out a high number of lines that were deleted and replaced, as well as those that were modified. This leads to a gross under-representation of the software project's evolution, which accounts for the disparity between the two methods seen in Figure 3.

Cost Prediction
The ability to better predict the cost or effort involved in future development is the goal of accurate measurement. A developer's ability to forecast future efforts is dependent on the availability of proper metrics from the past.

Project managers and developers rely on experience and industry norms to estimate the effort [4, p. 422]. The constructive cost model (COCOMO) allows one to “estimate the cost, effort, and schedule when planning a new software development activity” [5].

The models used in COCOMO are based upon previous research, including analysis of the costs of code reuse. Depending on the relative cost of reusing code, the overall cost potentially decreases as the portion of reused code increases [6, p. 81].

Although the CLOC method provides a measurement of past software evolution and not a prediction, its greater accuracy can help developers understand the evolution of their code and may improve forecasting by both people and models.

Conclusion
The CLOC method was specifically designed to provide an accurate measurement of the evolution of a software project. This new method has already found use as a basis for valuation of different versions of large software projects in a major tax case and has been the subject of a recent academic paper [7].

By basing the design upon the kinetics of how source code actually evolves from one version of a software project to another instead of static snapshots, the CLOC method is able to more accurately measure the growth and evolution of a software project.

We hope that the CLOC method can help increase the understanding of software evolution and improve the accuracy of software development effort predictions.

Nikolaus Baer is a research engineer at Zeidman Consulting. He has written articles and given presentations about software trade secret theft and how to analyze source code. He holds a Bachelor's degree in computer engineering from UC Santa Barbara, where he attended on a Regents Scholarship. He can be contacted at Nik@ZeidmanConsulting.com.

Bob Zeidman is the founder and president of Zeidman Consulting. He is the author of several engineering textbooks and holds five patents. He has a Master's degree from Stanford University and two Bachelor's degrees from Cornell University. He can be contacted at Bob@ZeidmanConsulting.com.

References
[1] Tracz, W., Software Reuse: Motivators and Inhibitors, [ed.] Tracz, W., Software Reuse: Emerging Technology, Washington, D.C., IEEE Computer Society Press, 1988, pp. 62-67.
[2] Jones, C., Programming Productivity, San Francisco, McGraw-Hill Publishing Company, 1986.
[3] Clark, B., Devnani-Chulani, S., Boehm, B., Calibrating the COCOMO II Post-Architecture Model, International Conference on Software Engineering, 1998.
[4] Nisar, M.W., Yong-Ji, W., Manzoor, E., Software Development Effort Estimation Using Fuzzy Logic – A Survey, Fifth International Conference on Fuzzy Systems and Knowledge Discovery, 2008.
[5] USC Viterbi School of Engineering Center for Systems and Software Engineering, COCOMO II, http://csse.usc.edu/csse/research/COCOMOII/cocomo_main.html [Cited: April 30, 2009].
[6] Barnes, B., et al., A Framework and Economic Foundation for Software Reuse, [ed.] Tracz, W., Software Reuse: Emerging Technology, Washington, D.C., IEEE Computer Society Press, 1988, pp. 77-88.
[7] Baer, N. and Zeidman, B., Measuring Software Evolution with Changing Lines of Code, 24th International Conference on Computers and Their Applications (CATA-2009), April 10, 2009.
[8] Selby, R. W., Empirically Analyzing Software Reuse in a Production Environment, [ed.] Tracz, W., Software Reuse: Emerging Technology, Washington, D.C., IEEE Computer Society Press, 1988, pp. 176-189.
