
Where testing fails

Testing, alone, cannot find all potential failure points for real-time embedded software, particularly in multithreading environments. Here are several techniques you can use to find errors that may otherwise be missed.

Most software development projects rely on some combination of code inspection, structural testing, and functional testing to identify software defects. While these conventional techniques are invaluable and uncover most software problems, they can't detect many common types of errors in today's complex systems. This article will show you how to avoid the problems that create these elusive but common errors.

Conventional methods
Structural testing or white box testing effectively finds errors in the logic, control flow, calculations, and data in the code. This test requires an unobstructed view of the inner workings of the software (hence the name “white box” or “glass box”) to look at details of the software's construction. It examines every conditional expression, mathematical manipulation, input, and output. Because of the large amount of detail to be tested, structural tests examine the software one unit at a time, typically a single function or class.

Code inspection also uses the same detailed knowledge of the implementation to find defects and potential problems. Like white box testing, inspection is generally performed on individual units of the software due to the focus and intense scrutiny required by an effective inspection process.

Unlike inspection and white box testing, functional testing or black box testing assumes no knowledge of the software's implementation. It tests outputs driven by controlled inputs. Functional testing consists of test procedures, written by the test or development staff, which specify the expected program outputs for a particular set of program inputs. After the tests are run, the testers compare the actual outputs to the expected outputs to detect problems. Black box testing effectively finds unimplemented requirements, interface problems, performance issues, and errors in the most-used features of the program.

Although these techniques combined may find most errors latent within a given software program, they have limits. Inspection and white box testing focus on only a small portion of the code at a time, ignoring the rest of the system. Black box testing generally exercises the system as a whole, but ignores the implementation details. Some important problems can only be detected by focusing on the implementation details as they interact within the entire system; these problems can't reliably be found using the conventional methods. An examination of the software system in its entirety, looking for the particular causes of specific problems, is required. Since an exhaustive analysis of every detail in a program and its interaction with all other items in the code is generally impossible, the analysis must be focused on specific areas known to cause problems. I will address three of these potential problem areas:

  • Stack overflows
  • Race conditions
  • Deadlocks

The discussion also continues online. You can read about the following problems in part two of this article:

  • Timing problems
  • Reentrancy conditions

All of these issues are prevalent in systems that employ multitasking real-time designs.

Stack overflows
Processors use the stack to store temporary variables, pass parameters to called functions, save the thread “state,” and so on. If the system doesn't use virtual memory (in other words, it can't swap pages of memory out to disk to free memory for other uses), the stack size is fixed at the time the product is delivered. If for any reason stack usage exceeds the amount the programmer allocated, the program's behavior becomes nondeterministic. This erratic behavior may lead to a catastrophic failure of the system. For this reason, it's critical to ensure that enough stack is allocated for the worst case.

The only way to ensure that a stack overflow will never occur is to analyze the code to determine the maximum stack usage for any possible path through the software and then verify that enough stack has been allocated. Testing is highly unlikely to produce the right combination of inputs at the precise instants necessary to cause this worst-case situation.

Stack-depth analysis is simple in concept:

  1. Build a call-tree for each independent thread.
  2. Determine the stack usage of every function in the call-trees.
  3. Examine each call-tree to determine which invocation path from the tree's “root” to an outer “leaf” will use the most stack (a sketch of this computation follows the list).
  4. Sum the maximum stack usage for each independent thread's call-tree.[1]
  5. Determine the maximum stack used by any interrupt service routine (ISR) within each interrupt priority level and add it to the total.[2] However, if the ISRs don't have their own stack, but use the stack of the thread that they've interrupted, add the maximum stack used by the ISRs to each thread stack.
  6. Add the stack used to save the processor state when an interrupt occurs, once for each priority level.
  7. If an RTOS is being used, add the maximum stack that it requires for its own internal usage (independent of system calls originating in the application code, which are included in step 2).
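
Steps 1 through 3 amount to finding the most expensive root-to-leaf path in each call-tree. The following minimal sketch shows that computation in C for a hypothetical four-function call-tree; in a real analysis the table would be generated from the assembly listings, and the code assumes the tree contains no recursion:

/* Worst-case stack path through a call-tree (steps 1-3).
   The function table below is hypothetical; real data comes from
   the assembly listings. */
#include <stdio.h>

#define MAX_CALLEES 4

struct func {
    const char *name;
    unsigned    own_stack;              /* bytes this function uses itself */
    int         callees[MAX_CALLEES];   /* indices into funcs[]; -1 = none */
};

/* Call-tree: main calls read_input and filter; read_input calls get_raw. */
static const struct func funcs[] = {
    { "main",       32, {  1,  3, -1, -1 } },
    { "read_input", 24, {  2, -1, -1, -1 } },
    { "get_raw",    16, { -1, -1, -1, -1 } },
    { "filter",     48, { -1, -1, -1, -1 } },
};

/* Worst-case stack used from function i down to any leaf below it.
   Assumes the call-tree is acyclic (no recursion). */
static unsigned max_stack(int i)
{
    unsigned deepest = 0;
    for (int c = 0; c < MAX_CALLEES && funcs[i].callees[c] != -1; c++) {
        unsigned d = max_stack(funcs[i].callees[c]);
        if (d > deepest)
            deepest = d;
    }
    return funcs[i].own_stack + deepest;
}

int main(void)
{
    /* main -> filter (32 + 48 = 80) beats main -> read_input -> get_raw
       (32 + 24 + 16 = 72), so the worst case is 80 bytes. */
    printf("worst-case stack for main: %u bytes\n", max_stack(0));
    return 0;
}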

Here are two important facts to consider, as well. First, a call-tree developed solely from the high-level language source code is likely to be incomplete. Most compilers use a run-time library to optimize common computational tasks such as multiplying and dividing large integers, floating-point operations, and so on. These calls will only be visible in the assembly code produced by the compiler. Run-time library functions can use significant stack space themselves and must be included in the analysis. If you're using C++, all the following types of functions (methods) must also be included in the call-tree: constructors, destructors, overloaded operators, copy constructors, and conversion functions. All function pointers must also be resolved, and the functions they call included in the analysis.

Second, compilers use a C library to implement standard functions such as memcpy(), cos(), and atof(); the source code for these routines may not be available. If the source code is available, it's possible to determine the worst-case stack usage for each library call used by the program. If the libraries are available only as object files, the compiler vendor must supply the stack used by every library routine. Without this information, the worst-case maximum stack usage of the program can't be determined by analysis. Thankfully, many compiler vendors that target embedded systems supply this information.

In general, every time a function is called, the compiler will use the stack to save the return address and pass function arguments. The automatic (local) variables of a function also typically exist on the stack. However, since the compiler is likely to optimize the code by placing parameters or local variables in registers whenever it can, it's important to check the assembly language to accurately determine stack usage. There may also be other places in the code where the compiler chooses to use the stack, for example, to store intermediate calculation results.

Some development environments packaged with a compiler include tools to produce a call-tree. Many third-party call-tree documentation tools are also available.[3] However, unless they analyze the assembly language, such tools may miss the run-time and C library calls. In any case, it's simple to develop a script to parse the assembly language files and pull out the function names and the calls made within each function. The results can be written to a file that's easily imported into a spreadsheet.
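
If you want to roll your own, here's a minimal sketch of such a script, written in C to match the rest of this article. It assumes GNU-style assembly in which a label at column zero (“name:”) begins a function and a bare call mnemonic transfers control; other instruction sets (bl, jsr) and variants such as callq would need their own patterns:

/* Scan an assembly listing and emit "caller,callee" CSV rows.
   Assumes: labels at column 0 start functions; calls use the bare
   "call" mnemonic. Adjust the patterns for your architecture. */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

int main(int argc, char *argv[])
{
    char line[256], callee[128], current[128] = "?";

    FILE *f = (argc > 1) ? fopen(argv[1], "r") : stdin;
    if (!f) {
        perror("fopen");
        return 1;
    }

    while (fgets(line, sizeof line, f)) {
        char *colon = strchr(line, ':');

        /* A label at column 0 (not a directive) starts a new function. */
        if (colon && !isspace((unsigned char)line[0]) && line[0] != '.') {
            *colon = '\0';
            strncpy(current, line, sizeof current - 1);
            continue;
        }

        /* Record each call instruction as a "caller,callee" CSV row. */
        char *p = line;
        while (isspace((unsigned char)*p))
            p++;
        if (strncmp(p, "call", 4) == 0 && isspace((unsigned char)p[4]) &&
            sscanf(p + 4, " %127s", callee) == 1)
            printf("%s,%s\n", current, callee);
    }

    if (f != stdin)
        fclose(f);
    return 0;
}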


Figure 1: Portion of a stack analysis spreadsheet

Once the stack usage for each function has been determined, the maximum stack for each thread must be calculated. Since typical programs often involve hundreds of functions, with calls extending many levels deep, a convenient way to process this information is with a spreadsheet. As shown in Figure 1, each row of the spreadsheet contains the function name, the maximum stack used by just that function (which includes the stack needed to call another function), and the list of the functions it calls. The spreadsheet is programmed to iterate through each function “root,” calculating the stack needed for it and all the functions it calls. This information is placed in the Stack Path column. Now you can easily find the maximum stack by using the Stack Path data for the root function of each thread (for example, main). This procedure accomplishes the first four steps of the stack-analysis process described earlier.

Occasionally, it may not be possible or practical to follow the stack-depth analysis process. If the source code is unavailable for the run-time or C libraries and the compiler vendor provides no stack-usage information, it's impossible to perform a complete stack analysis. In these situations, you have two options:

  1. Observe how deep the stack grows during testing and ensure that there's plenty of stack space to spare.
  2. Detect a stack overflow and provide corrective action.

Observing stack depth is straightforward:

  • Write a specific data-pattern signature, such as 0x55AA, to the entire stack section of memory.
  • Run the system under the conditions that are expected to use the most stack.
  • Examine the stack memory using an emulator or other tool to see how much of the signature pattern has been overwritten by stack usage.

Of course, following these steps doesn't guarantee that more stack won't be required under some different set of usage events. But it does indicate the minimum stack needed.
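
Here's a minimal sketch of the fill-and-measure technique in C. The stack-boundary symbols are assumptions (their names and origin vary by toolchain; they typically come from the linker script), and the code presumes a descending stack that's painted once at startup, before much of it has been used:

/* Stack painting and high-water-mark measurement.
   Assumes a descending stack bounded by linker symbols
   __stack_start (lowest address) and __stack_end (highest);
   these names are hypothetical and toolchain-specific. */
#include <stdint.h>

#define STACK_SIGNATURE 0x55AA55AAu

extern uint32_t __stack_start[];    /* lowest stack address  */
extern uint32_t __stack_end[];      /* highest stack address */

/* Call as early as possible at startup. Stops short of the region the
   stack is already using, so it doesn't corrupt its own frame. */
void stack_paint(void)
{
    for (uint32_t *p = __stack_start; p < __stack_end - 16; p++)
        *p = STACK_SIGNATURE;
}

/* Returns the maximum number of stack bytes used so far. */
uint32_t stack_max_used(void)
{
    uint32_t *p = __stack_start;

    /* Scan upward until the first word the stack has overwritten. */
    while (p < __stack_end && *p == STACK_SIGNATURE)
        p++;

    return (uint32_t)((__stack_end - p) * sizeof(uint32_t));
}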

Detecting a stack overflow at run time is possible when using a processor with a memory management unit (MMU). The memory is partitioned to “guard” the stack area with a protected memory segment. If the stack overflows, the processor will attempt to access this protected segment. The access attempt will throw an exception (for example, a SIGSEGV signal) that can be caught by the program. RTOSes compliant with the Real-Time POSIX standard provide stack-guard functionality of this nature as an option when threads are created, greatly simplifying the programmer's work. Other development environments, such as the GNU tools, have compiler switches to add the code necessary to implement stack guards but still rely on the underlying operating system to handle the stack overflow appropriately.[4] Detecting the overflow in this fashion is only part of the problem, however. For a design of this type to be effective, the system must be able to recover from the stack overflow and continue operating correctly.
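
On a POSIX system, the detection half of such a design might look like the following sketch. The handler is registered with SA_ONSTACK so that it runs on an alternate stack supplied through sigaltstack(); the ordinary stack is exactly what has been exhausted. A production system would also examine the faulting address to distinguish a stack overflow from any other bad access, and would attempt recovery rather than exit:

/* Catching a stack overflow as a SIGSEGV on a POSIX system. */
#include <signal.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>

static char alt_stack[16 * 1024];   /* the handler runs on this stack */

static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)si; (void)ctx;

    /* Only async-signal-safe calls are legal here. */
    static const char msg[] = "stack overflow or bad access\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(EXIT_FAILURE);   /* a real system would try to recover */
}

int main(void)
{
    stack_t ss = { .ss_sp = alt_stack, .ss_size = sizeof alt_stack,
                   .ss_flags = 0 };
    sigaltstack(&ss, NULL);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;   /* use the alternate stack */
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    /* ... application code ... */
    return 0;
}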

In a safety- or mission-critical application, observing stack depth during test or detecting a stack overflow at run-time may not be a sufficient risk-control measure. For some applications, it must be proven a priori that the system will never exceed its stack allocation; such proof can only be done through a complete stack depth analysis. This means that if the entire program runs in the same memory space, the analysis must be performed on all the code. However, if an MMU is used, the analysis can often be simplified. You can design the system to place all the critical code into one or more separate threads that run in their own protected memory segments. Then analyze the stack usage only on these critical threads. This simplification assumes, of course, that the critical threads can still perform correctly if the noncritical threads overflow their stacks and crash.

Since the stack usage data for the analysis is gathered from the assembly listings, whenever the code changes, the stack usage information for that module must be updated. If a different version of the compiler is used, or the optimization settings are changed, the entire analysis must be rechecked. Ideally, the compiler will provide the stack usage for each function, if not for each thread, since it has all the information necessary to calculate it. For example, Renesas provides Call Walker, a tool that's part of the company's High-performance Embedded Workshop development environment. The tool graphically shows the call-tree and the stack used by every function, including the run-time and C library functions. The Call Walker also finds the path that uses the greatest amount of stack. Using a tool like this automates steps 1 through 3 of the analysis.

Race conditions
A race condition occurs when two or more independent threads access the same resource at the same time and at least one of them modifies it. The effects of a race condition vary widely and depend on the specifics of the situation. One of the more infamous incidents involved the Therac-25 radiation therapy machine, in which race conditions contributed to accidents that seriously injured six people, killing three of them.

Listing 1 illustrates a potential race condition. The function Update_Sensor() reads raw sensor data by calling get_raw(). This data is processed by multiplying it by a scaling factor and adding an offset. The processing is performed on a temporary copy of the data, which is then written to the shared variable.

Listing 1: Race condition

extern float get_raw(void);   /* returns the raw sensor reading */
extern float shared_sensor;
void Update_Sensor(float offset, float scale)
{
    float temp = get_raw();
    temp *= scale;
    temp += offset;
    shared_sensor = temp;
}

If another thread or ISR that uses shared_sensor preempts this thread before the data is written, it will simply get the older sensor reading. Using the temporary copy avoids the problem of the preempting thread reading data that has only been partially processed. However, if this code runs on a processor with anything less than a 32-bit data bus, a race condition exists.

On an 8-bit or 16-bit processor, the write to shared_sensor is not done “atomically,” that is, without allowing an interruption. On an 8-bit processor, writing a 32-bit float value may take four instructions; on a 16-bit processor it may take two. If Update_Sensor() is preempted between these consecutive writes to shared_sensor, the preempting thread will read a value composed of some old and some new data. Depending on the application, this could have serious consequences. The solution, shown in the sketch below, is to lock the scheduler or disable interrupts around the update of the shared variable.
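
In the sketch below, DISABLE_INTERRUPTS() and ENABLE_INTERRUPTS() are placeholders for whatever critical-section primitives your compiler or RTOS provides, not a real API. Note that the processing stays outside the critical section; only the non-atomic write needs protection, so interrupts are disabled as briefly as possible:

/* Placeholders: map these to your platform's primitives. */
#define DISABLE_INTERRUPTS()   /* e.g., your compiler's disable-IRQ intrinsic */
#define ENABLE_INTERRUPTS()    /* e.g., the matching enable-IRQ intrinsic */

extern float get_raw(void);    /* returns the raw sensor reading */
extern float shared_sensor;

void Update_Sensor(float offset, float scale)
{
    float temp = get_raw();
    temp *= scale;
    temp += offset;

    /* Protect only the non-atomic, multi-word write. */
    DISABLE_INTERRUPTS();
    shared_sensor = temp;
    ENABLE_INTERRUPTS();
}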

Eliminating race conditions is often simple, but finding them after they've already been introduced into the code requires a detailed analysis.

A simple system that consists of a cyclic executive and various ISRs is straightforward to analyze for race conditions. Examine each ISR to identify all the shared variables it references; in these systems, shared variables are typically global data. Once the shared variables are identified, examine every use of each one throughout the code. Each access must be protected as necessary from potential conflicts. In simple designs, protection is usually implemented by disabling interrupts around the critical sections. Avoid problems by observing the following rules:

  • If an ISR writes to shared data, each non-atomic read outside the ISR must be protected.
  • If an ISR writes to shared data, any read-modify-write sequence outside the ISR must be protected.
  • If an ISR reads shared data, each non-atomic write to the data must be protected.
  • If both an ISR and other code check a hardware status flag to determine availability of a resource before using it, for example:

        if (!resource_busy)
        {
          // Use resource
        }

    protection must be applied from the point the flag is checked until the hardware sets the flag to indicate the resource is unavailable (a sketch follows this list).
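
The fourth rule deserves a sketch of its own. In the fragment below, resource_busy and start_transfer() are hypothetical names, and the same placeholder interrupt macros from the earlier sketch are assumed. The critical section must cover everything from the test of the flag to the operation that causes the hardware to mark the resource busy:

extern volatile int resource_busy;        /* set by hardware while in use */
extern void start_transfer(const void *); /* hypothetical; sets busy flag */

void send_if_free(const void *data)
{
    DISABLE_INTERRUPTS();
    if (!resource_busy)
    {
        /* The ISR can no longer claim the resource between the
           check and the use. */
        start_transfer(data);
    }
    ENABLE_INTERRUPTS();
}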

For more complex systems that use multiple threads at different priorities, the analysis is very similar. The rules above still apply for all data used by an ISR. In addition, the shared data used by each thread must be identified. Starting with the highest priority thread in the system, identify all the data it shares with any lower priority threads. Then apply the same four rules. Repeat for each priority level used by the software.

Note that if a round-robin scheduling algorithm is used, all threads within a given priority can preempt each other at any time. This means that the four rules of the analysis must consider all threads of the same priority, in addition to threads of lower priorities.

Multithreading systems typically use an operating system of some kind, which provides more options for protection. A mutex or semaphore could be used, or the scheduler could be locked. Sometimes, other interprocess communication (IPC) primitives can be used: instead of modifying a shared variable, a message could be sent to a message queue to indicate that data has changed. In many cases, the management of a shared resource is best left to a single thread that processes all read and write requests and prevents access conflicts internally.
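
For example, under a POSIX operating system the shared sensor value from Listing 1 could be guarded with a mutex, as in this minimal sketch. A mutex protects thread-to-thread accesses only; an ISR can't block on a mutex, so the interrupt-based rules above still apply to any data an ISR touches:

#include <pthread.h>

static float shared_sensor;
static pthread_mutex_t sensor_mutex = PTHREAD_MUTEX_INITIALIZER;

/* All readers and writers go through these two functions. */
void sensor_write(float value)
{
    pthread_mutex_lock(&sensor_mutex);
    shared_sensor = value;
    pthread_mutex_unlock(&sensor_mutex);
}

float sensor_read(void)
{
    pthread_mutex_lock(&sensor_mutex);
    float value = shared_sensor;
    pthread_mutex_unlock(&sensor_mutex);
    return value;
}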

Identifying potential race conditions in complex code can be a tedious, time-consuming process. Tools to assist range from simple scripts that identify accesses to global data to sophisticated dynamic-analysis programs such as Polyspace Verifier. Despite its difficulty, a detailed analysis of the code is the only way to identify these types of errors. Testing is unlikely to generate the exact timing sequences required to trigger a race condition repeatably.

Deadlock
Protection from access conflicts is vitally important in systems that share resources, but it can lead to another problem: deadlock. When race conditions are avoided by “locking” a resource, preventing any other thread from accessing it, the design must be evaluated to ensure that deadlock will never occur. Testing for deadlock is generally ineffective, since only a particular order of resource locking may produce it, and that ordering may not result from the most common tests.

Deadlock is only a problem in multithreading environments that lock resources. The following four conditions must all be present for a deadlock to occur; breaking any one of them eliminates deadlock:

  • Mutual exclusion—only one thread can use a locked resource at a time
  • Nonpreemption—threads cannot force another thread to release a resource
  • Hold-and-wait—threads hold resources that they have locked while waiting for any additional needed resources
  • Circular wait—a circular chain of threads exists, such that each thread holds a resource needed by the next thread in the chain


Figure 2: Circular wait


Figure 3: No deadlock


Figure 4: Deadlock with message passing

The resource allocation graph in Figure 2 shows an example of a deadlock problem. Thread 1 first locks the resource Buf, and while holding Buf, acquires Bus and then Mux. If thread 1 runs to completion, it will eventually release all of them. When thread 2 runs, it will attempt to acquire Bus, Sem, and finally, Mux. When thread 3 executes, it requires Sem and Buf.

In this design example, there are no guarantees that any of the threads will complete before another begins executing. If a thread is unable to acquire a needed resource, it will suspend execution (block) until the resource becomes available. As the system runs, the threads all lock and unlock the resources. As the relative timing of when each thread runs and acquires its resources varies, a situation can occur in which none of the threads is able to run, since each thread is waiting for a resource that is held by some other thread. For example, if thread 1 holds Buf, and thread 2 holds Bus, and thread 3 has acquired Sem, the system deadlocks. The circular wait condition is observed by following the thread allocation arrows from Buf to Bus to Sem and back again to Buf.

Once a potential deadlock problem has been identified, it's often easy to fix. In Figure 3, thread 3 has been modified to attempt to acquire Buf first, before acquiring Sem. The circular wait condition is now broken and the system will no longer be susceptible to deadlock.

Some operating systems make heavy use of message passing for inter-thread communication and synchronization. In these systems, when a thread passes a message to another thread, the sender blocks until it receives a response from the receiver. The receiving thread typically blocks until it receives a message from some other thread. Deadlock can occur in these architectures as well. To build a resource allocation graph for a message-based operating system, model the message channels as the allocated resources. An example is shown in Figure 4. Thread 2 has created channel T2 Ch. Thread 2 has this channel “locked,” so to speak, whenever it is not blocked waiting for a message on the channel. When it is blocked waiting for a message, another thread can send it a message on that channel and the message will be received immediately.

Now, consider the following system: thread 1 acquires the Mutex and sends messages to thread 2 on channel T2 Ch. Somewhere in thread 2, thread 2 sends messages to thread 3 on channel T3 Ch. Thread 3 also sends a message to thread 4 on channel T4 Ch. At some place in thread 4, it also tries to acquire the Mutex and will block if it's not available. The circular path among the resources is apparent, indicating that a deadlock could occur. For example, if at some point in time thread 1 holds Mutex and thread 4 attempts to acquire it, thread 4 will block on the Mutex. Then when thread 3 attempts to send thread 4 a message on channel T4 Ch, thread 3 will block waiting for a reply from thread 4 (since thread 4 is blocked waiting for the mutex, not blocked waiting for the message). In a similar manner, this will cause thread 2 to block when it tries to send a message to thread 3. Thread 1 will block when it tries to send a message to thread 2, and since it still holds the Mutex, the system will deadlock.

The easiest way to deal with deadlocks is to avoid them by design. Adopting any of the following design constraints will eliminate the possibility of deadlock:

  • Threads have no more than one resource locked at any time.
  • A thread completely allocates all its needed resources before it begins execution.
  • Threads that acquire multiple resources must lock (and release) them in a systemwide preset order (see the sketch after this list).
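
The third constraint is often the easiest to retrofit. Here's a minimal sketch using hypothetical POSIX mutexes named after the resources in Figure 2; every thread that needs more than one of them acquires them in the same preset order, which makes a circular wait impossible:

#include <pthread.h>

/* Systemwide lock order: Buf first, then Bus, then Mux. */
static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t bus_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mux_lock = PTHREAD_MUTEX_INITIALIZER;

void thread1_work(void)
{
    /* Acquire in the preset order... */
    pthread_mutex_lock(&buf_lock);
    pthread_mutex_lock(&bus_lock);
    pthread_mutex_lock(&mux_lock);

    /* ... use Buf, Bus, and Mux ... */

    /* ... and release in the reverse order. */
    pthread_mutex_unlock(&mux_lock);
    pthread_mutex_unlock(&bus_lock);
    pthread_mutex_unlock(&buf_lock);
}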

If deadlocks cannot be avoided by design, resource allocation graphs should be constructed. Examination of such graphs can identify potential deadlocks. By keeping careful track of all the threads in the system and the shared resources that they lock, the graph can be maintained and examined periodically for the characteristic circular wait.

Building a resource allocation graph requires identifying every protected shared resource and every thread that acquires one of them. If an operating system is utilized, the following procedure can be used:

  1. Identify all system calls that might block, such as Mutex_Lock(). Any protected shared resource will always have some blocking call associated with its access.
  2. Once the blocking calls used to acquire shared resources are identified, search the source code for every one of their occurrences.
  3. For each occurrence, note the thread that acquires the resource and the resource name. The calls themselves generally have the protected resource passed as a parameter. The location of the call within the source code indicates the thread that needs the resource. In this manner, all the protected resources, and the threads that allocate them, can be identified.
  4. Build the resource allocation graph and examine it for a circular path among any of the resources. When the number of threads and shared resources is small, it's simple to draw the graph. In more complex systems, it's better to enter the information into a spreadsheet and write a macro that walks the thread/resource allocation structure to identify potential deadlocks. Once the macro is written, changes to the resource allocations can be quickly re-evaluated. The macro can be written to ignore cycles between resources that don't cause a deadlock. In the example shown in Figure 5, there are numerous cycles between various resources, but the only potential deadlock is between threads 6 and 7 (a sketch of such a cycle check follows Figure 5).


Figure 5: Resource allocation spreadsheet
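
The cycle check itself is a simple depth-first search. The sketch below shows the kind of test such a macro, or a small standalone program, can perform. The wait-for graph is hypothetical and is stored as an adjacency matrix in which an edge from i to j means node i is waiting on node j:

/* Detect a circular wait in a wait-for graph by depth-first search. */
#include <stdio.h>
#include <stdbool.h>

#define N 5   /* nodes in the wait-for graph (hypothetical size) */

/* Example graph: 0 -> 1 -> 2 -> 0 is a cycle; 3 -> 4 is not. */
static const bool edge[N][N] = {
    { 0, 1, 0, 0, 0 },
    { 0, 0, 1, 0, 0 },
    { 1, 0, 0, 0, 0 },
    { 0, 0, 0, 0, 1 },
    { 0, 0, 0, 0, 0 },
};

static int color[N];   /* 0 = unvisited, 1 = on current path, 2 = done */

static bool has_cycle(int n)
{
    color[n] = 1;
    for (int m = 0; m < N; m++) {
        if (!edge[n][m])
            continue;
        if (color[m] == 1)                  /* back edge: circular wait */
            return true;
        if (color[m] == 0 && has_cycle(m))
            return true;
    }
    color[n] = 2;
    return false;
}

int main(void)
{
    for (int n = 0; n < N; n++) {
        if (color[n] == 0 && has_cycle(n)) {
            printf("potential deadlock: cycle reachable from node %d\n", n);
            return 1;
        }
    }
    printf("no circular wait found\n");
    return 0;
}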

In some types of systems, it may be impractical or impossible to identify every shared resource beforehand and build an allocation graph. In these situations, additional code can be added to detect a potential deadlock while the system is running. There are many variations of algorithms that attempt to optimize this detection process, but they all essentially build some sort of resource allocation graph dynamically. This graph is modified and examined whenever a resource is requested, allocated, or released by a thread to determine if a circular path exists, indicating a potential deadlock.

Once a deadlock is detected, the only way to recover from it is to force threads to release key resources. This usually means aborting the threads holding needed resources. Depending on the application, this may not be an acceptable solution. Another interesting solution is to collect the resource allocation information at run-time, and process it in a post-mortem analysis to determine if any deadlocks occurred during the program's run. Although this doesn't prevent a deadlock during the run, it does aid in finding and fixing one after the fact.

There are also some tools that can assist in finding deadlocks in code. For example, Solaris programmers can use Sun's LockLint tool to perform a static analysis of the code. It finds inconsistent use of locking techniques, identifying many of the causes of race conditions and deadlocks.

Sean M. Beatty teaches, consults, develops, tests, and analyzes software. He holds a BSEE from the University of Wisconsin at Milwaukee and has 18 years of embedded systems experience. He can be reached at .

End notes:

  1. If multiple threads share the same priority level, this may give a falsely high stack usage. A more accurate stack need can be obtained by determining what the maximum stack usage is for each thread of the same priority level at the point they could relinquish the processor to another thread of that priority (in other words, become blocked). Start with the largest maximum stack usage among the threads of equal priority. Then add the largest stack usage of one of the other threads of that same priority at the point it could relinquish the processor to another thread (of the same priority). Repeat for the remaining threads at that priority level. If thread scheduling is done on a round-robin basis within a given priority level, the maximum stack needed by each thread must be used in the analysis, since the operating system could switch the context to another thread of that priority at any time.
  2. This assumes that interrupts within a given priority level cannot preempt each other. If they can, the maximum stack used by every ISR must be added in this step.
  3. For example, Understand for C++, www.scitools.com
  4. gcc.gnu.org/onlinedocs/gccint/Stack-Checking.html