Say no to stack overflow

John Regehr

September 09, 2004

Through years of hands-on work, this embedded systems programmer has developed a little trick to take the guesswork out of the onerous problem of sizing stacks. Learn from the professor.

My first experience developing embedded software involved programming a robot the size of a trash can. The robot contained a complete PC motherboard and could be attached to a keyboard and a monitor when the robot was stationary. This development environment, of course, was very friendly as embedded systems go. Even so, my DOS-based executive would crash unpredictably, especially when my debugging code was compiled in. After exhausting all other possibilities I figured out that printf was using much more stack memory than I thought it should, something like 8KB, causing one thread's stack to overwrite storage belonging to another thread.

Of course, an experienced developer would not have been caught off-guard so easily, but the basic problem remains even today: how much memory should be allocated to the stack (or stacks) in an embedded system? Answering this question involves understanding the tension between overprovisioning the stack, which wastes memory that could be used for other purposes, and underprovisioning the stack, which risks creating a system that is prone to stack overflow.

Most developers are familiar with the testing-based approach to sizing stacks in which you initially pick a large estimate of stack size and refine the guess after observing stack depth during actual or simulated runs of the system. Or, as Jack Ganssle puts it, "With experience, one learns the standard, scientific way to compute the proper size for a stack: Pick a size at random and hope."1

Fewer developers are familiar with the analysis-based approach to sizing stacks, which determines stack size by statically analyzing code: there's no need to run it, even in a simulator. In the simplest case, you can do the analysis by hand: count the push, pop, call, and return instructions along different paths through the compiled system and add up their effects. This manual approach doesn't scale to large programs, but luckily it's not hard to automate.

In this article I focus on the analysis-based approach and show how to do a stack-size analysis for fairly simple systems using a small, extensible Perl program. The point is not that analysis should replace testing, but that the two approaches are complementary. Both approaches have advantages and each can alert you to weaknesses in the other.

An embedded system is stack-safe if it's completely impossible for a stack in the system to overflow into memory used for other purposes. This notion of safety is analogous to the type-safety guaranteed by languages such as Java, which ensure that, for example, an integer can never be interpreted as a pointer. The ideal system would be just barely stack-safe: if any less memory were allocated to any stack, it would be possible for an overflow to happen.

Analysis vs. testing
To implement the testing-based approach you just run the system and see how big the stack gets. Ideally, you test the system under heavy, diverse loads in order to exercise as many code paths as possible. Measuring the maximum extent of the stack isn't hard: simulators can record this, and in a real system you can periodically check the stack pointer or preload the stack area with known values and see how many of these get overwritten. To a large extent, the testing-based approach treats the system as a black box: it doesn't matter how or why stack memory is consumed, you want to know only how big each stack can get.
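
Checking the high-water mark after a run is mechanical enough to script. Below is a minimal host-side sketch of the "preload with known values" technique, assuming your startup code paints the stack region with the byte 0xAA and that you can pull a raw dump of RAM out of your simulator or debugger; the file name, fill value, and addresses are invented for illustration, and the usual caveat applies that a stack location which happens to hold the fill value will be miscounted.

 use strict;
 use warnings;

 my $FILL       = 0xAA;    # value the startup code painted the stack region with
 my $STACK_LOW  = 0x0200;  # lowest address of the stack region in the dump
 my $STACK_HIGH = 0x045f;  # highest address; the stack grows down from here

 open my $fh, '<:raw', 'ram_dump.bin' or die "can't open dump: $!";
 my $ram = do { local $/; <$fh> };   # slurp the whole dump
 close $fh;
 die "dump too small\n" if length ($ram) <= $STACK_HIGH;

 # scan upward from the bottom of the region; the first byte that no longer
 # holds the fill value marks the deepest point the stack reached during the run
 my $addr = $STACK_LOW;
 $addr++ while $addr <= $STACK_HIGH && ord (substr ($ram, $addr, 1)) == $FILL;

 printf "worst observed depth: %d of %d bytes\n",
   $STACK_HIGH - $addr + 1, $STACK_HIGH - $STACK_LOW + 1;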

The second way to find the maximum size of the stack is the analysis-based approach, which looks at the flow of control through a system to find the path pushing the maximum amount of data onto the stack. This path may be complex, involving the main function plus several interrupt handlers.

Notice that the testing- and analysis-based approaches are trying to do exactly the same thing: find the path through the program that produces worst-case stack behavior. In any complex system, the fundamental limitation of testing is that it's a mathematical certainty that some paths won't be tested. Consider, for example, a system containing five interrupt handlers, each of which fires an average of 50 times per second and spends 200μs at its worst-case stack depth. Assuming that interrupts arrive independently, the interrupts will exhibit their worst-case stacking only about once every 300 years. Therefore, the largest stack depth you observe during testing may well be smaller than the largest depth that can occur once your product is operating in the real world: testing underestimates the true worst case.

The analysis-based approach has the opposite problem. It often overestimates the maximum stack size. For example, if an interrupt handler is reentrant, the analysis may be forced to conclude that the worst-case stack depth is infinite, even though as the developer you may know that this interrupt arrives only twice per second and, since the CPU is mostly idle, the interrupt can't possibly preempt itself. Similarly, the analysis might assume that a timer interrupt may fire during a certain function call, when in fact this function is setting up the timer and it's known that the timer can't fire until well after the function returns.

The true worst-case stack size for any complex system is unlikely to be computable, but it can be bracketed by combining the testing and analysis approaches. The three values are related like this:

worst depth seen in testing ≤ true worst depth ≤ analytic worst depth

To some extent the gaps between these numbers can be narrowed through hard work. For example, if testing misses paths through the system, maybe you can devise more clever tests or run the tests longer. On the other hand, if the analysis misses some crucial feature of the system, such as a causal relationship between interrupt handlers, maybe the analysis can be "taught" to understand this relationship and take it into account.

The advantages of the analysis-based approach are that analysis can be very fast (a few seconds or less) compared to testing, and that analysis returns an upper bound on stack depth, resulting in systems that are guaranteed to be stack-safe. Also, by finding the worst-case path through a program, analysis identifies functions that are good candidates for inlining or optimization to reduce stack depth.

The main advantage of testing, on the other hand, is that you're going to test the system anyway, so there's no reason not to include stack size tests. If the results returned by analysis and testing are close together then life is good. If a substantial gap appears between the two results you should ask some hard questions. Are the test cases missing important paths through the code? Is the analysis getting confused in some way?

If the gap between analysis and testing can't be closed then picking a size for the stack becomes an economic tradeoff. On one hand, you might be tempted to make the stack just a little larger than the worst size seen during testing; this minimizes per-unit cost but risks creating an unreliable system. On the other hand, you might want to allocate the analyzed worst-case stack size, which leaves less RAM for other functions but eliminates the risk of stack overflow.

Stack depth analysis
The control-flow graph for an embedded system is at the core of any analysis-based approach to determining worst-case stack depth. The control-flow graph (CFG) is simply a representation of the possible movement of the program counter through your code. For example, an arithmetic or logical operation always passes control to its successor and a branch may pass control either to its successor or to its target. Although the CFG can be extracted from either the high-level language (HLL) code or the executable itself, for purposes of bounding stack depth a disassembled executable is preferable because:

  • parsing HLLs is hard, while parsing assembly language is easy
  • embedded programs written in an HLL often contain inlined assembly language, and the analysis should account for its effects
  • the HLL code for libraries and the real-time operating system (RTOS) may not be available
  • the HLL code doesn't contain enough information to effectively bound stack depth, since optimizing compilers have considerable freedom when translating HLL code into assembly

Therefore, we're mainly interested in analyzing a compiled program image. Also, it's at this point that a system is easiest to analyze since the linker has resolved external references.

For some systems, such as those written in the style of a cyclic executive, constructing the CFG is straightforward. For other systems, particularly those containing recursion and indirect calls, you'll need either a more sophisticated analysis or a human in the loop. Indirect calls and jumps come from a variety of sources, such as event loops, callbacks, switch statements, device driver interfaces, exceptions, and object dispatch tables.

The CFG alone is not enough to determine the maximum stack size; the effect each instruction has on the stack is also important. Assume for a moment that you're examining a system that only manipulates the stack through push and pop instructions. At this point, an analysis program can compute the stack effect of different paths through the CFG. Its job is to find the worst path—the one that pushes the most data onto the stack. This is easily accomplished with standard graph algorithms. However, if the system contains code that loads new values directly into the stack pointer (for example, to push an array onto the stack or to implement alloca), this simple approach is again insufficient. A stronger approach that can identify (or at least bound) the values loaded into the stack pointer is required.

A simple cyclic executive can be described by a single CFG. In the presence of interrupts or threads, a system contains multiple CFGs, one per entry point. You can compute the stack bounds for each of these graphs using the same algorithm. A complication that now arises is that we need to know how to put the individual results together to form a stack-depth bound for the whole system. If all interrupt-mode code runs with interrupts disabled, so that one handler can never preempt another, then this is easy:

worst-case depth = depth of main + worst depth of any interrupt

Since many systems enable interrupts while running interrupt handlers, the following equation is more broadly applicable:

worst-case depth = depth of main + total depth of all interrupts

There are two problems, however, with this second approach. First, it provides an answer that's potentially too low if some interrupt handlers are reentrant. Second, it provides an answer that is too high if, for example, only one interrupt enables interrupts while running. A better approach is to compute stack depth based on the interrupt preemption graph: a precise representation of which interrupt handlers can preempt which other handlers.
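
To make the preemption-graph idea concrete, here's a minimal sketch of the bookkeeping, written in the same Perl style as the tool described in the next section. The per-entry-point depths and the preemption graph below are invented for illustration; in practice the depths would come from the per-vector analysis and the graph from your knowledge of which handlers re-enable interrupts.

 use strict;
 use warnings;

 # worst-case depth of each entry point analyzed in isolation (bytes)
 my %depth = (
   main      => 23,
   timer_isr => 19,   # re-enables interrupts, so uart_isr can nest on top of it
   uart_isr  =>  7,   # runs to completion with interrupts disabled
   adc_isr   => 15,   # also runs with interrupts disabled
 );

 # preemption graph: who can run on top of whom
 my %preempts = (
   main      => [ 'timer_isr', 'uart_isr', 'adc_isr' ],
   timer_isr => [ 'uart_isr' ],
   uart_isr  => [],
   adc_isr   => [],
 );

 # depth of an entry point plus the deepest chain of handlers that can nest
 # on top of it; assumes the graph is acyclic (no reentrant handlers)
 sub worst {
   my ($ep) = @_;
   my $nested = 0;
   foreach my $p (@{ $preempts{$ep} }) {
     my $w = worst ($p);
     $nested = $w if $w > $nested;
   }
   return $depth{$ep} + $nested;
 }

 printf "whole-system bound: %d bytes\n", worst ('main');

For these made-up numbers the bound is 49 bytes, noticeably tighter than the 64 bytes you'd get by simply summing the depths of main and all three handlers.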

From a stack-analysis point of view, there's a huge difference between programs that contain reentrant interrupt handlers and programs that do not. Programs without reentrant interrupts can be shown to be stack-safe without worrying about timing effects. To see this, consider what would happen if such a system were running normally when all the interrupts fired at a very high frequency. As long as there can be at most one concurrent instance of each handler, the total stack consumption for the system is bounded, no matter how frequently interrupts arrive. On the other hand, if a system permits reentrant interrupts it becomes impossible to reason about stack depth without also reasoning about time. Typically the developer guesses at a maximum arrival rate for each interrupt based on knowledge of the hardware (or basic physics) and ensures the CPU has enough spare time to service all invocations of all interrupt handlers. In general, a better solution is to avoid building systems that permit reentrant interrupts unless they're truly necessary.

A simple tool
To illustrate stack depth analysis techniques I've used Perl to implement a simple stack-analysis tool called Stlite. It's suitable for computing the stack depth of programs that consist of an infinite loop plus interrupts. For example, Stlite successfully analyzes the onboard controller for the Autopilot project, which disassembles to about 1,100 lines of assembly code.2 The tool produces a disassembled version of the system where each instruction is annotated with the worst-case stack depth at that point in the program, but where entry points are treated in isolation. In other words, each interrupt handler is treated as being invoked on an empty stack, as shown in Listing 1. It's up to the user to determine a stack depth for the whole program by using, for example, a formula from the previous section. My version of Stlite supports only Atmel's ATmega chips, but it's easy to port to other processors.

Listing 1: A disassembled interrupt handler for an Atmel AVR processor. The left column of numbers indicates the stack depth after executing the associated instruction, assuming that the interrupt starts with an empty stack

  00000978 <__vector_4>:  
2   978: 78 94 sei
3   97a: 1f 92 push r1
4   97c: 0f 92 push r0
4   97e: 0f b6 in r0, 0x3f
5   980: 0f 92 push r0
5   982: 11 24 eor r1, r1
6   984: 2f 93 push r18
7   986: 8f 93 push r24
7   988: 21 e0 ldi r18, 0x01
7   98a: 20 93 9e 03 sts 0x039E, r18
6   98e: 8f 91 pop r24
5   990: 2f 91 pop r18
4   992: 0f 90 pop r0
4   994: 0f be out 0x3f, r0
3   996: 0f 90 pop r0
2   998: 1f 90 pop r1
0   99a: 18 95 reti

Most of Stlite is boring code dealing with parsing the assembly-language version of a program and classifying its instructions. Listing 2 shows the only interesting routine, compute_stack, which recursively explores the control-flow graph, keeping track of the worst stack depth seen so far at each point in the program. Each time the analysis visits an instruction it also visits all instructions that can be directly reached from the instruction unless a termination condition is fulfilled. The first termination condition is that a branch of the analysis may terminate after visiting an instruction with a stack depth that is less than or equal to a previously recorded stack depth at that instruction. The second condition is that a branch of the analysis can terminate when it encounters a jump to address zero; this is how recent versions of gcc for the Atmel AVR architecture reset the program when an unexpected interrupt is received. Finally, the analysis doesn't need to proceed past a return-from-interrupt or a return-from-function-call. To avoid modeling the call stack Stlite makes the simplifying assumption that each function call returns to the address that follows the call that invoked it. The full tool is less than 350 lines of Perl and can be downloaded from ftp://ftp.embedded.com/pub/2004/10regehr.

Listing 2: Perl fragment for recursively exploring a control flow graph and computing worst-case stack depth at each instruction

 sub compute_stack {
 # $addr is the address of the current instruction
 # $vec is the name of the interrupt vector we're currently looking at
 # $old_depth is the stack depth before executing this instruction
 (my $addr, my $vec, my $old_depth) = @_;

 # compute new depth
 my $new_depth = $old_depth + stack_effect ($addr);

 # termination condition 1 -- nothing new can be learned here
 return if (defined($depths{$vec}{$addr}) && $depths{$vec}{$addr} >= $new_depth);

 # record new depth
 $depths{$vec}{$addr} = $new_depth;

 # termination condition 2 -- jump to origin resets the program
 return if (is_jmp ($addr) && get_target ($addr) == $ORIGIN);

 # termination condition 3 -- ret and reti don't go anywhere in our simple model
 return if ($insns{$addr} eq 'ret' || $insns{$addr} eq 'reti');

 if (is_call ($addr) || is_branch ($addr) || is_skip ($addr) || is_jmp ($addr)) {
  compute_stack (get_target ($addr), $vec, $new_depth);
 }

 if (is_call ($addr)) {
  compute_stack ($addr + insn_size ($addr), $vec, $old_depth);
 } elsif (!is_jmp ($addr)) {
  compute_stack ($addr + insn_size ($addr), $vec, $new_depth);
 }
}

Ideally, the results returned by any program-analysis tool would be unconditionally true. Unfortunately, the results of Stlite are contingent upon the absence of a number of program features that complicate analysis, such as:

  • recursive functions
  • self-modifying code
  • the possibility of a stray memory write smashing a return address on the stack, causing the program to return to an unknown location
  • loads into the stack pointer, for example to implement alloca or to change stacks during a context switch

It's the developer's responsibility either to verify that these features aren't present or to extend the tool to handle them.

Some of these complications, such as recursion and loads into the stack pointer, aren't difficult to detect. Determining the maximum depth of recursion, however, or the range of possible values loaded into the stack pointer can be difficult. On some architectures self-modifying code is either impossible or easy to detect, but, by reduction from the halting problem, it can't be reliably detected on all architectures. It's difficult to rule out the possibility of a stray memory write overwriting a return address unless you're using a "safe" language like Java, and even then a bug in the Java runtime could do the overwriting. It's also difficult to analyze what happens in the context-switch routine of an RTOS. Instead of trying to analyze certain functions, the analysis tool should import knowledge that comes from a human expert; for example, "switching context in RTOS XXY pushes at most 28 extra bytes onto the stack" and "all interrupts are handled on a special task-neutral stack."
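
As a sketch of how simple the recursion check can be, the fragment below does a depth-first walk over a call graph looking for a cycle. The %calls table here is invented; in the real tool it would be built from the disassembly using helpers like is_call and get_target from Listing 2. Direct loads of the stack pointer are even easier to spot on the ATmega family: they appear as out instructions targeting the SPL and SPH I/O registers (0x3d and 0x3e), which a one-line pattern match over the disassembly will catch.

 use strict;
 use warnings;

 # call graph: caller address => list of callee addresses (invented data)
 my %calls = (
   0x0100 => [ 0x0200, 0x0300 ],
   0x0200 => [ 0x0300 ],
   0x0300 => [ 0x0100 ],          # back edge: this program is recursive
 );

 my %on_path;
 sub check {
   my ($func) = @_;
   if ($on_path{$func}) {
     printf "recursive call chain involving 0x%x\n", $func;
     return;
   }
   $on_path{$func} = 1;                        # on the current call chain
   check ($_) foreach @{ $calls{$func} || [] };
   delete $on_path{$func};
 }
 check (0x0100);                               # repeat for each entry point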

An industrial-strength stack-analysis tool should take all these possibilities into account or at least issue a stern warning to the developer when a potentially unsafe program feature is used. An analysis that makes assumptions about the code it's analyzing is still useful, but can obviously be dangerous if its results are misinterpreted.

In summary, stack-depth analysis is quite easy for simple programs, but becomes difficult or impossible when programs make use of too many features that confuse the analysis. To a large extent, the success of program analysis relies on the conservative, highly constrained coding style in which embedded software is often constructed. In other words, analysis tools can be useful even if they don't scale to large, dynamic programs.

Better tools
Although it's likely that simple tools similar to the one described here have been developed innumerable times, you'll find very little in the academic literature about bounding stack sizes, and there's not much commercial support, either. A reasonable question to ask is, "Why doesn't the typical embedded compiler perform stack-depth analysis?" The answer is that when generating code, a C compiler tends to take a very local view of a program. Worst-case stack depth, on the other hand, is an inherently global property that requires analysis across functions and between compilation units, libraries, and the operating system.

The first academic paper about bounding stack depth by analyzing machine code was presented by Brylow et al. in 2001; their tool works on programs for Z86 processors and generates accurate results by estimating the interrupt-preemption graph for a system.3 More recently, we've developed a stack analysis tool at the University of Utah that implements a more powerful program analysis that tracks the movement of data through registers and across arithmetic and logical operations, with the goal of computing the interrupt-preemption graph for larger, more complex programs. This tool supports Atmel's AVR processors and is described in a paper.4 This tool is not production-quality, but a snapshot can be downloaded from www.cs.utah.edu/~regehr/stacktool. On the commercial side, a tool from AbsInt called StackAnalyzer implements a variant of the method described above.5 Judging from its web site, however, StackAnalyzer doesn't appear to analyze interactions between interrupt handlers.

Reducing stack needs
Beyond simply determining the worst-case stack size for a system, it's useful to design systems that make efficient use of stack memory. Let's look at the two main ways to accomplish this.

The first method relies on the observation that two concurrent computations, whether they're in interrupts or threads, require stack memory equal to the sum of their individual requirements. On the other hand, two sequential computations only require stack memory equal to the maximum of their individual requirements. Excessive preemption is harmful in other ways, too. It incurs overhead in the form of CPU cycles spent saving and restoring program state and in developer time adding proper synchronization. Furthermore, unnecessary preemption increases the number of possible control-flow paths through a system, making software harder to understand, harder to test, less predictable, and more likely to contain latent race conditions and deadlocks.

In concrete terms, you should strictly limit the interrupts that are enabled by each interrupt handler; don't just blindly enable all interrupts. For each pair of interrupts where one can preempt the other, there should be a good reason for allowing the preemption. If interrupt handlers are running for so long that many of them need to be preempted by other interrupts, maybe you're running too much code in interrupt mode. Try writing fast interrupt handlers that do the minimum amount of work necessary to keep the hardware happy and defer most processing to a thread. Similarly, don't necessarily place independent computations in separate threads running at separate priorities. A nonpreemptive event-style scheduler running in a single thread often suffices, and it makes much more efficient use of stack memory. Victor Yodaiken offers some sound advice on this subject: "Do not add priority levels unless you can precisely specify why tasks at one level should preempt tasks at the next lower level."6

The second approach to reducing stack usage has nothing to do with concurrency; it involves tuning the stack usage of sequential code. Large stack-allocated variables, such as function-scoped arrays, should be used with caution. You should look for ways to eliminate them, for example by using an in-place sorting algorithm instead of one that makes a copy of the data. Similarly, inlining a function call removes the compiler's need to push arguments, a frame pointer, and a return address onto the stack. Inlining also improves performance in many cases by permitting the compiler to specialize the function for its calling context. However, indiscriminate inlining causes code bloat and may even increase stack size if it confuses the compiler's live-variable analysis (this is known to be a problem for gcc).

Finally, long chains of calls can result in excessive stack usage. It's sometimes possible to restructure code so that long call chains are broken up. Long call chains can also come from recursive functions, which of course should be used with extreme caution in small systems.

An analysis-based approach to computing maximum stack depth serves two purposes in helping you optimize a system. First, the analysis can give you quick feedback about whether a change you made is useful or not. For instance, if inlining a function grows the maximum stack depth, you should know about it right away. Second, stack-depth analysis helps you focus your efforts by telling you where the critical path through the system is, with respect to the stack. In other words, some functions don't contribute to the worst-case stack depth, and it's pointless to reduce their stack consumption. However, remember that as you optimize the system the worst-case path will likely change.
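
One way to get that feedback quickly is to post-process the tool's annotated output. The sketch below reads a listing in the format of Listing 1 and ranks functions by the worst depth reached inside them, which gives a rough picture of who sits on the critical path; the file name is invented, and the line format assumed here simply mirrors Listing 1.

 use strict;
 use warnings;

 my (%worst, $func);
 open my $fh, '<', 'annotated.lst' or die "can't open listing: $!";
 while (<$fh>) {
   # a label line such as "00000978 <__vector_4>:" starts a new function
   if (/^\s*[0-9a-f]+\s+<([^>]+)>:/) { $func = $1; next; }
   # an instruction line starts with the annotated depth, then the address
   next unless defined ($func) && /^\s*(\d+)\s+[0-9a-f]+:/;
   $worst{$func} = $1 if !defined ($worst{$func}) || $1 > $worst{$func};
 }
 close $fh;

 # deepest functions first: these are the candidates for inlining or slimming
 printf "%5d  %s\n", $worst{$_}, $_
   for sort { $worst{$b} <=> $worst{$a} } keys %worst;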

Know your stack
Because allocating too little memory to a stack usually causes embedded systems to fail spectacularly, most developers are familiar with the basic testing-based approach to allocating stack memory. This article takes a slightly different view by describing a complementary analysis-based approach to estimating stack size. Stack analysis has significant advantages over a purely testing-based approach because it provides rapid feedback to developers, identifies areas of the program that could benefit from optimization, and provides an upper bound on the worst-case stack depth, complementing the lower bound provided by testing.

John Regehr received a Ph.D. from the University of Virginia in 2001 and is currently an assistant professor of computer science at the University of Utah. His research goal is to create tools and techniques supporting the rapid development of efficient and reliable embedded software. His experience developing embedded systems includes work on firmware for a fast network interface, wireless sensor network nodes, and several robots. He can be reached at regehr@cs.utah.edu.

End notes

  1. Jack Ganssle. The Art of Designing Embedded Systems. Newnes, 1999, p. 90.
  2. Autopilot Project, 2004. http://autopilot.sourceforge.net.
  3. Dennis Brylow, Niels Damgaard, and Jens Palsberg. Static checking of interrupt-driven software. In Proc. of the 23rd Intl. Conf. on Software Engineering (ICSE), pages 47-56, Toronto, Canada, May 2001.
  4. John Regehr, Alastair Reid, and Kirk Webb. Eliminating stack overflow by abstract interpretation. In Proc. of the 3rd Intl. Conf. on Embedded Software (EMSOFT), pages 306-322, Philadelphia, PA, October 2003.
    http://www.cs.utah.edu/~regehr/stacktool/
  5. AbsInt. StackAnalyzer, 2004. http://www.absint.com/stackanalyzer
  6. Victor Yodaiken. Temporal inventory and real-time synchronization in RTLinuxPro, April 2003. http://hq.fsmlabs.com/yodaiken/papers/sync.pdf
