As a technology journalist, I get the equivalent of a postgraduate education reading good, detailed papers and reports, though some are mind-numbingly difficult. Knowing my interest in such papers, several software developers and tool vendors have independently referred me to “Quantitative Evaluation of Static Analysis Tools,”by Shin’ichi Shirashi, Veena Mohan and Hemalatha Marimuthu, at the Toyota InfoTechnology Center in Mountain View, Ca.
In this paper, the authors describe the task of selecting the optimum code analysis tool for doing run time code analysis of software for use in Toyota's vehicles. Starting with tools from about 170 vendors of proprietary tools as well as a range of free and open source versions, they narrowed their choices down to six, those from Coverity, GrammaTech, PRQA, MathWorks, Monoidics and Klocworks to make their selections they used a complex methodology to first make their assessment and from that also derive a set of coding guidelines to help the company's development teams avoid defects proactively.
Readers of the report may disagree with the types of tests and metrics the Toyota team used to make its choices. Some developers I have talked to complain that despite the data-driven quantitative approach, at the beginning of their efforts the team made use of the qualitative and subjective judgements of a few experts they trusted. However given the number of such tools they had to evaluate, that was a choice they were almost forced to make. Even after limiting their choices in that way, there were many alternatives to evaluate and test, even after they narrowed their choices further by excluding noncommercial tools that provided no technical support, those that did not support safety critical applications, and including only those that supported the C language.
The paper describes how the Toyota team first created a set of test suites incorporating a wide variety of software defects that might cause problems in safety critical applications. They then tested the tools they had selected against that suite. Finally, because of the importance of driving buggy software out of the automobile environment, they went one step further: they used the data already collected to create several new metrics to further hone down the performance of the tools. For information on the various tests they used, they depended on several reports from the U.S. National Institute of Standards and Technology (NIST), supplemented with information from various tool vendors and users.
The choices they made and the process they came up with made it clear how many things can go wrong in a software design, especially a safety critical one, and how hard it is to pin things down. Drawing on every piece of literature they could find, they identified eight defect types that were important, including: static and dynamic memory, resource management, pointer-related, and concurrency defects as well as the use of inappropriate and dead code. From that they identified 39 defect sub-types, and from those they created 841 variations that they used in their tests. The methodology is about as comprehensive as any I have ever seen.