Improve static code analysis with Software DNA maps - Embedded.com

Improve static code analysis with Software DNA maps

An experienced programmer knows that software defects are inevitable.They are a natural part of the evolutionary process that mostapplications experience. Whether you're working on the initial versionof an application or expanding an existing application ” defectshappen.

As defects progress through the development cycle, they becomeexponentially more expensive to fix. And according to a ForresterReport produced in 2007, it is at least 30 times more costly to fixsoftware in the field versus during development.

Creating a picture, known as a Software DNA Map, opens the door forstatic analysis, which significantly improves code quality andsecurity. A Software DNA Map is an extremely accurate representation ofa software system based on understanding all operations that the buildsystem performs as well as an authentic compilation of every sourcefile in that build system.

To be useful, this map would need to address how every single filewas compiled for all targets, how each set of files was linkedtogether, the different binaries that were generated, the functionswithin each file and the corresponding callgraph, all the differentcontrol flow graphs through each function, and on and on.

By providing an accurate representation of an application to thefunction, statement, and even expression level, the Software DNA Mapenables static code analysis to overcome its previous limitations ofexcessive false positives and deliver accurate results that developerscan put to immediate use.

Combining breakthroughs in dataflow analysis with a Software DNA Maphas created substantial benefits for development organisations byenabling them to detect defects early in development. However, theperformance and accuracy of these analysis solutions can still beimproved. A groundbreaking technology called Boolean satisfiability ispoised to greatly expand static code analysis capabilities (Figure 1 below ).

Figure1: Boolean representation of a software function

A new generation in analysis
Source code analysis has had a mixed reputation in software developmentbecause analyses often took too much time or produced excessive noiseand a large percentage of bogus results (false positives). With lowsignal-to-noise ratios, most source code analysis technologies andproducts quickly became shelf-ware after a few uses.

The promise of new static analysis solutions is tantalising fordevelopers because it offers the ability to find bugs before softwareis run, improving code quality and dramatically accelerating theavailability of new applications. Though static analysis hashistorically struggled to deliver on this promise, a groundbreakingtechnique applied in the static analysis field may help fulfil itspotential.

The software development challenges of today require companies tolook for innovative ways to remove the critical fl aws andvulnerabilities in software before it gets to the quality assurance labor, worse, out into the field. According to a recent report from SQSResearch, 40% of European organisations have suffered financial loss asa result of poor or no software testing.

Fortunately for developers, source code analysis is now up to thechallenge. By combining breakthrough techniques in the application ofBoolean satisfiability to software with the latest advances in dataflowanalysis, the most sophisticated source code analysis can boastout-of-the-box false positive rates as low as 10% and scalability totens of millions of lines of C/C++ or Java code.

Software developers can use static analysis to automatically uncovererrors that are typically missed by unit testing, system testing,quality assurance, and manual code reviews. By quickly finding andfixing these hard-to-find defects at the earliest stage in the softwaredevelopment life cycle, organisations are saving millions in associatedcosts.

In complexity theory, the Boolean satisfiability problem (SAT) is adecision problem whose instance is a Boolean expression written usingonly AND, OR, NOT, variables, and parentheses. The question is, giventhe expression, is there some assignment of TRUE and FALSE values tothe variables that will make the entire expression true? A formula ofpropositional logic is said to be satisfiable if logical values can beassigned to its variables in a way that makes the formula true.

Boolean satisfiability
The concept of Boolean satisfiability is not new but recently,efficient SAT Solvers have been developed to solve very complicatedformulas with millions of variables. While used heavily in theelectronic design automation industry to aid in chip designverification technologies, the application of SAT solving to softwareanalysis has remained untouched.

A SAT Solver is a computer program that takes in a formula ofvariables under the operations of AND, OR, and NOT and determines ifthere is a mapping of each individual variable to TRUE and FALSE suchthat the entire formula evaluates to TRUE (satisfiable).

If there is at least one such mapping, the solver returns a specificsatisfying assignment. If no such assignment exists, the SAT-solverindicates that the formula is unsatisfiable and may provide a proofdemonstrating that it is unsatisfiable. The application of SAT tosoftware requires that source code be represented in a form that can beplugged automatically into a SAT Solver.

Fortunately, with a Software DNA Map, all the necessary informationfrom the code is available to transform it into any representationdesired. Because SAT Solvers deal in TRUE, FALSE, AND, OR, and NOT, therelevant portions of this program can be transformed into theseconstructs.

Take an 8bit variable as an example, such as char a; To represent'a' as TRUES and FALSES, those 8 bits (1s and 0s) can be each thoughtof as TRUES and FALSES, so 'a' becomes an array of 8 Boolean values;a(0) to a(7)

Converting code
The operations that make up expressions in the code also must betranslated. All expressions in code can be converted to an equivalentformula of AND, OR, and NOT. The thinking behind this is that acompiler must turn these operations into instructions in machine codeand that machine code must run on a chip.

The chip's circuitry is nothing more than pushing 1s and 0s througha number of gates (all of which can be simplified to AND, OR, and NOT);therefore, this indicates that such a mapping exists. For example, toconvert the expression a = = 19 into a formula, the followingexpression would do the trick: !a(0) ^ !a(1) ^ !a(2) ^ a(3) ^ !a(4) ^!a(5) ^ a(6) ^ a(7) In this example, a(0) is the mostsignifi cant-bitand a(7) is the leastsignifi cant-bit.

Once the entire Software DNA Map is represented in this format ofTRUES, FALSES, NOTS, ANDS, and ORS, a wide variety of formulas can beconstructed from this representation and SAT Solvers can be applied toanalyse the code for additional, more sophisticated quality andsecurity problems. It is this bit-accurate representation of thesoftware that enables more precise static analysis than was previouslypossible, based solely on dataflow techniques.

Writing better code
An early application of SAT Solvers for static analysis is false pathpruning. When performing datafl ow analysis, sometimes a defect will bereported on a path that is infeasible (or unsatisfiable) at runtime.This reported defect should be eliminated from the static analysisresults since there is no possible combination of variables where itcould actually happen in the software system.

The methodologies for dealing with this problem within the datafl owframework pale in comparison to a bit-accurate representation coupledwith a good SAT Solver in this regard.

For example, with this bit-accurate representation of the sourcecode, developers can construct a formula that is a conjunction (AND) ofall the conditions in a path that lead to any given defect discoveredby datafl ow analysis.

By solving this formula, the SAT Solver indicates if there is a setof values for the variables involved in all the conditions such thatthe path can actually be executed. If the SAT Solver says'satisfiable', then the path is feasible at runtime. If the SAT Solversays 'unsatisfiable', the path can never be executed and no bug shouldbe reported.

Consider the earlier discussion of static analysis history and theproblems with excessive false positive results. By identifying falsepaths with SAT, developers can prune them from their static analysisresults.

This enables the developers to focus testing and analysis efforts onpotential problems that have a real possibility of compromising theproject at hand. It's important to note that false path pruning is justone example of how SAT can be leveraged to provide more accurate sourcecode analysis.

Other potential applications of SAT include the ability to findproblems such as string overflows, integer overflows or dead code, andthe use of assertion-based checking to identify difficult-to-find logicbugs.

While some instances of the defects in these categories can bediscovered today, SAT-based analysis allows developers to build on thesuccess of existing datafl ow analysis to reach new levels of accuracyand precision in static code analysis. For years, embedded softwaredevelopers have looked for products that could automatically andeffectively find software defects early in the development cycle.

However, tools for discovering defects automatically have had manyfalse starts because of their failure to grasp a complete picture ofthe software.

SAT will benefit software developers and architects by fundamentallychanging the way they view static analysis of their source code. Byproviding complete understanding of a given build environment andsource code, a Software DNA Map can allow organisations to leverageexisting SAT techniques to produce the highest quality softwarepossible.

By leveraging SAT within static code analysis, companies canimmediately find and fix software defects in source code at thebeginning of the development cycle, impressing anyone who writessoftware and perhaps becoming the breakthrough that enables staticanalysis to deliver on more of its long-awaited promise.

Ben Chelf is the CTO of Coverity. He was a founding memberof the Stanford Computer Science Laboratory team that architected anddeveloped Coverity's underlying technology. He is one of the world'sleading experts on the commercial use of static source code analysis.Chelf holds BS and MS degrees in Computer Science from StanfordUniversity. He can be reached at bchelf@coverity.com

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.