Tracking down the tainted data in your embedded app with static analysis

It is inherently risky to assume that software inputs will always be well-formed and within reasonable ranges. In the worst case, this assumption can lead to serious security vulnerabilities and crashes. Systems built from a combination of components with different sources are at particular risk: research has proven that security vulnerabilities proliferate at the boundaries between code components, often due to innocent differences in the interpretation of interface specifications.

In the parlance of secure programming, unchecked input values are said to be tainted. Tainted data should always be a concern for developers: it can cause unexpected behavior, lead to program crashes, or even provide an avenue for attack.

An important consequence for the embedded domain is that any software that reads input from any type of sensor should treat all values from the sensor as potentially dangerous. The sensor may malfunction and report an anomalous value, or it may accurately report circumstances that had not been foreseen by the software author. It may even be possible for an attacker to gain access to the sensor’s communication channel and send values of their own choosing. Opportunities for these attacks have grown along with the increasing degree of network connectivity in embedded applications; where attackers would once have needed physical access to a device, they may now be able to attack over the network.

This article will describe some of the ways in which tainted data can cause problems, then explain how taint analysis capabilities in modern static analysis tools can help developers and users identify and eliminate these problems.

The dangers of tainted data
The riskiest input channels are those over which an attacker has control. Programmers can defend against the defects that tainted data triggers by treating inputs from potentially risky channels as hazardous until the validity of the data has been checked.
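As a minimal sketch of "hazardous until checked", the C fragment below range-checks a raw sensor reading before any other code sees it. The function name, the units, and the limits are hypothetical, chosen only for illustration; a real device would take its limits from the sensor's datasheet and the application's requirements.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limits for an illustrative temperature sensor,
 * expressed in tenths of a degree Celsius. */
#define TEMP_MIN_DECI_C (-400)
#define TEMP_MAX_DECI_C (1250)

/*
 * Validate a raw (tainted) reading before the rest of the program uses it.
 * Returns true and stores the value in *out only when the reading lies
 * within the expected range; otherwise returns false and leaves *out
 * untouched, so a previous known-good value is preserved.
 */
bool read_checked_temperature(int32_t raw, int32_t *out)
{
    if (out == NULL) {
        return false;
    }
    if (raw < TEMP_MIN_DECI_C || raw > TEMP_MAX_DECI_C) {
        return false;   /* anomalous or malicious value: reject it */
    }
    *out = raw;
    return true;
}
```

Callers work only with the value written through out, never with the raw reading, so the check cannot be bypassed by accident.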

The biggest risk of using unchecked values read from an unverified channel is that an attacker can use the channel to trigger security vulnerabilities or cause the program to crash. Many types of issues can be triggered by tainted data, including buffer overruns, SQL injection, command injection, cross-site scripting, and path traversal. (For more details on these and other classes of defect, see the Common Weakness Enumeration at Mitre.) Many of the most damaging cyber-attacks in the last two decades have been caused by the infamous buffer overrun (Figure 1). As this is such a pervasive vulnerability, and because it illustrates the importance of taint analysis, it is worth explaining in some detail.



Figure 1. A buffer overrun warning. Underlining shows the effect of taint.

There are several ways in which a buffer overrun can be exploited by an attacker, but here we describe the classic “stack smashing” attack, in which an attacker hijacks the process and forces it to run arbitrary code. Consider the following code:

   void config(void)
   {
       char buf[100];   /* fixed-size buffer on the stack */
       int count;
       …
       strcpy(buf, getenv("CONFIG"));   /* unchecked copy of tainted data */
       …
   }

In this example, the input from the outside world is through a call to getenv() that retrieves the value of the environment variable named “CONFIG”.

The programmer who wrote this code was expecting the value of the environment variable to fit in the 100 characters of buf, but there is no guarantee that this will be the case. If the attacker has sufficient system access, he or she can cause a buffer overrun by assigning CONFIG a value of 100 or more characters: strcpy() also writes a terminating null character, so only 99 characters fit safely.

Because buf is an automatic variable that will be placed on the stack as part of the activation record for the procedure, any characters after the first 100 will be written to the parts of the program stack beyond the boundaries of buf. The variable named count may be overwritten (depending on how the compiler chose to allocate space on the stack). If so, then the value of that variable is under the control of the attacker.

This is bad enough, but the real prize for the attacker is that the stack contains the address to which the program will jump once it has finished executing the procedure. To exploit this vulnerability, the attacker can set the value of CONFIG to a specially-crafted string that overwrites this return address with a different address of their choosing. When program control exits the function, it will return to that address instead of the address of the function’s caller.

If the code is executed in a sufficiently secure context, it may be impossible for an attacker to exploit this vulnerability. Nevertheless, the code is clearly risky and remains a liability if left unfixed. A programmer might also be tempted to reuse the code in a different program that does not run under the same degree of external security.

While this example takes its input from the environment, the code would be just as risky if the string were read from another input source, such as the file system or a network channel.

It is also important to note that unexpected inputs do not necessarily originate from attackers. A problematic input value may be accidentally provided by a trusted user, for example, or be generated by malfunctioning equipment. Whatever the origin of tainted input, the same analysis techniques can be applied to detect it and track its influences, and the same defense techniques apply.
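One common defense for the config() example above is to check the length of the tainted string before copying it. The sketch below is one possible way to do that, not the only one; copy_string_checked() and config_checked() are hypothetical names introduced here for illustration.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy src into dst, whose capacity is dst_size bytes including the
 * terminating NUL. Returns false and leaves dst as an empty string when
 * src is NULL or too long to fit; nothing is ever written past dst.
 */
bool copy_string_checked(char *dst, size_t dst_size, const char *src)
{
    if (dst == NULL || dst_size == 0) {
        return false;
    }
    dst[0] = '\0';
    if (src == NULL) {
        return false;               /* e.g. getenv() found no such variable */
    }
    size_t len = strlen(src);
    if (len >= dst_size) {
        return false;               /* no room for the value plus its NUL */
    }
    memcpy(dst, src, len + 1);      /* len + 1 also copies the NUL */
    return true;
}

/* Checked variant of the earlier example: the taint is validated before use. */
bool config_checked(void)
{
    char buf[100];

    if (!copy_string_checked(buf, sizeof buf, getenv("CONFIG"))) {
        return false;   /* unset or oversized: caller falls back to defaults */
    }
    /* ... buf now holds a NUL-terminated value of at most 99 characters ... */
    return true;
}
```

Rejecting an oversized value outright, rather than silently truncating it, keeps a malformed configuration from being half-accepted. The check also covers the case where getenv() returns a null pointer because the variable is not set.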

Understanding taint flow
An automated static analysis tool can play an important role by helping programmers understand how taint is flowing through a particular program. For example, in GrammaTech’s CodeSonar product, the locations of taint sources and sinks can be visualized, and taint propagation information can be overlaid on a regular code view. This can help developers understand how tainted values interact with their code and aid them in deciding how best to neutralize taint-related vulnerabilities.

In the example illustrated in Figure 1 above, first note the blue underlining on line 80. This indicates that the procedure’s parameter points to a value that is tainted by the file system.

The underlining on line 91 indicates that the value returned by compute_pkgdatadir() points to data that is tainted by the environment. The call to strcpy() then copies that data into the local buffer named “full filename” (declared on line 84), propagating taint into that buffer. Consequently, the red underlining in line 92 shows that the buffer has become tainted by a value from the environment.

The explanation for the buffer overrun confirms that the value returned by compute_pkgdatadir() originated from a call to getenv(). A user inspecting this code can thus see that there is a risk of a security vulnerability if an attacker can control the value of the environment variable.

It can be difficult to track the flow of tainted data through a program because doing so involves tracking the value as it is copied from variable to variable, possibly across procedure boundaries and through several layers of indirection. Performing this task manually is difficult and tedious even for very small programs; for most real-world applications, it is infeasible.

Automated static analysis provides a solution to this problem.

Automating taint analysis
Static analysis tools are useful because they are good at finding defects that occur only in unusual circumstances, and because they can do so very early in the development process. They can yield value before the code is even ready to be tested. They are not intended to replace or supplant traditional testing techniques, but instead are complementary.

Roughly speaking, an advanced static-analysis tool works as follows: First, it analyzes the program to create a set of representations (such as abstract syntax trees, control-flow graphs, symbol tables, and call graphs), collectively known as the ‘program model’. Then, it executes various kinds of queries on those representations in order to find defects. While superficial bugs can be found with simple pattern-matching, more sophisticated problems require correspondingly sophisticated queries.

The really serious bugs (those that cause the program to fail, such as null pointer dereferences and buffer overruns) are detected using abstract execution. The analyzer simulates the execution of the program, but instead of using concrete values, it uses equations that model the abstract state of the program. If it encounters an anomaly, the analyzer issues a warning.
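To give a flavor of abstract execution, the sketch below replaces concrete integers with intervals that bound every value a variable might hold. Intervals are a deliberately simple stand-in chosen for this illustration; the article does not say which abstractions any particular tool uses. All names here are hypothetical, and arithmetic overflow of the bounds is ignored for brevity.

```c
#include <stdbool.h>

/* An abstract value: every concrete value the variable might hold
 * lies somewhere in the closed range [lo, hi]. */
typedef struct {
    long lo;
    long hi;
} interval;

/* Abstract counterpart of a constant. */
interval interval_const(long c)
{
    interval r = { c, c };
    return r;
}

/* Abstract counterpart of a + b. */
interval interval_add(interval a, interval b)
{
    interval r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

/* Merge the states of two control-flow paths that rejoin. */
interval interval_join(interval a, interval b)
{
    interval r = { a.lo < b.lo ? a.lo : b.lo,
                   a.hi > b.hi ? a.hi : b.hi };
    return r;
}

/* Would indexing an array of `length` elements with idx possibly go
 * out of bounds? A true result is where an analyzer would warn. */
bool index_may_overrun(interval idx, long length)
{
    return idx.lo < 0 || idx.hi >= length;
}
```

For example, an index known to lie in [0, 99] is safe for a 100-element buffer, but after adding one it lies in [1, 100], and index_may_overrun() reports a possible overrun without the program ever being run.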

Figure 2 below shows an example buffer overrun warning report from CodeSonar. The report shows the path through the code that must be taken in order to trigger the bug. Interesting points along the way are highlighted. An explanation of what can go wrong is given at the point at which the overrun happens.



Figure 2. Buffer overrun warning shows the path tainted code must take to trigger a bug.

Taint analysis falls at the more sophisticated end of the analysis spectrum. Consider, for example, a C program that reads a string from a risky network port. Strings in C are typically managed through pointers, so the analysis must track both the contents of the string and the value of all pointers that might refer to the string. The characters themselves are said to be tainted, whereas the pointer is said to “point to taintedness.”

If the contents of the string are copied, for example by using strcpy(), the analysis must now account for the propagation of taint to the destination string. If the pointers are copied, the analysis must account for the propagation of the points-to-taint property.

Of course, there may, in turn, be pointers to those pointers, and even pointers to those, and the analysis must track those too. Ultimately the problem boils down to a kind of alias analysis: the tool must determine which variables access the same memory locations.
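The two kinds of propagation described above can be mimicked with a toy model. In this sketch, which is an illustration and not a description of how any real tool is implemented, each buffer is an abstract cell with a taint flag and each pointer variable refers to at most one cell. Real analyses track sets of possible targets and multiple levels of indirection. All names are hypothetical.

```c
#include <stdbool.h>

#define MAX_CELLS 8      /* abstract memory cells (e.g. string buffers) */
#define MAX_PTRS  8      /* pointer variables in the modeled program */
#define NO_TARGET (-1)

typedef struct {
    bool cell_tainted[MAX_CELLS];  /* are the cell's contents tainted? */
    int  ptr_target[MAX_PTRS];     /* cell each pointer refers to */
} taint_state;

static bool valid_ptr(int p)  { return p >= 0 && p < MAX_PTRS; }
static bool valid_cell(int c) { return c >= 0 && c < MAX_CELLS; }

/* Start with no taint and no pointer referring to anything. */
void taint_init(taint_state *s)
{
    for (int c = 0; c < MAX_CELLS; c++) s->cell_tainted[c] = false;
    for (int p = 0; p < MAX_PTRS; p++)  s->ptr_target[p] = NO_TARGET;
}

/* Models p = &cell. */
void model_address_of(taint_state *s, int p, int cell)
{
    if (valid_ptr(p) && valid_cell(cell)) s->ptr_target[p] = cell;
}

/* Does pointer p "point to taintedness"? */
bool points_to_taint(const taint_state *s, int p)
{
    return valid_ptr(p) && valid_cell(s->ptr_target[p])
        && s->cell_tainted[s->ptr_target[p]];
}

/* Models a taint source, e.g. reading from a network port into *p. */
void model_taint_source(taint_state *s, int p)
{
    if (valid_ptr(p) && valid_cell(s->ptr_target[p]))
        s->cell_tainted[s->ptr_target[p]] = true;
}

/* Models dst = src: the pointers now alias, so the points-to-taint
 * property is shared without any characters being copied. */
void model_ptr_assign(taint_state *s, int dst, int src)
{
    if (valid_ptr(dst) && valid_ptr(src))
        s->ptr_target[dst] = s->ptr_target[src];
}

/* Models strcpy(dst, src): the destination contents take on the taint
 * status of the source contents. */
void model_strcpy(taint_state *s, int dst, int src)
{
    if (valid_ptr(dst) && valid_ptr(src)
        && valid_cell(s->ptr_target[dst]) && valid_cell(s->ptr_target[src]))
        s->cell_tainted[s->ptr_target[dst]] = s->cell_tainted[s->ptr_target[src]];
}
```

Note that model_ptr_assign() changes only which cell a pointer refers to, whereas model_strcpy() changes a cell's taint flag. Keeping those two effects separate is the essence of the distinction between tainted data and pointers to tainted data.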

Static analysis tools typically also provide functionality for viewing taint analysis results. In CodeSonar, for example, the available views include a top-down visualization of taint flow in a program, as shown in Figure 3.



Figure 3. A top-down view of the call graph of the program showing modules according to the physical layout of code in files and directories. The red coloration shows the modules with the most taint sources, and the blue “glow” shows modules with taint sinks.

In this example, the user has specified that modules containing taint sources should be colored red. This is a reasonable approximation of the attack surface of the program. The code within the highlighted module is shown in the pane to the right; the underlining shows the variables that carry taint.

Summary
Taint analysis is a technique that helps programmers understand how risky data can flow from one part of the program to another. An advanced static analysis tool can perform a taint analysis and present the results to the user, making the task of understanding a program’s attack surface easier, and easing the work involved in finding and fixing serious defects.

Paul Anderson, Vice President of Engineering at GrammaTech, received his B.Sc. from Kings College, University of London and his Ph.D. from City University London. Dr. Anderson’s work has been reported in numerous articles, journal publications, book chapters, and international conferences.
