Editor's note: See the next five bugs in part 2.
Finding and killing latent bugs in embedded software is a difficult business. Heroic efforts and expensive tools are often required to trace backward from an observed crash, hang, or other unplanned run-time behavior to the root cause. In the worst cases, the root cause damages the code or data in such a way that the system still appears to work fine, or mostly fine, at least for a while.
Too often engineers give up trying to discover the cause of infrequent anomalies that cannot be easily reproduced in the lab, dismissing them as user errors or “glitches.” Yet these ghosts in the machine live on. Here's a guide to the most frequent root causes of difficult-to-reproduce bugs. Look for these top five bugs whenever you are reading firmware source code, and follow the recommended best practices to prevent them from happening to you again.
Bug 1: Race condition
A race condition is any situation in which the combined outcome of two or more threads of execution (which can be either RTOS tasks or main() and an interrupt handler) varies depending on the precise order in which the interleaved instructions of each are executed on the processor.
For example, suppose you have two threads of execution in which one regularly increments a global variable (g_counter += 1;) and the other very occasionally zeroes it (g_counter = 0;). There is a race condition here if the increment cannot always be executed atomically (in other words, in a single instruction cycle). Think of the tasks as cars approaching the same intersection, as illustrated in Figure 1. A collision between the two updates of the counter variable may never occur, or may occur only very rarely. But when it does, the counter will not actually be zeroed in memory that time; its value is corrupt at least until the next zeroing. The effect of this may have serious consequences for the system, although perhaps not until long after the actual collision.
• Best practice: Race conditions can be prevented by surrounding each critical section of code that must be executed atomically with an appropriate preemption-limiting pair of behaviors. To prevent a race condition involving an ISR, at least that interrupt must be disabled for the duration of the other code's critical section. In the case of a race between RTOS tasks, the best practice is to create a mutex specific to the shared object, which each task must acquire before entering the critical section. Note that it is not a good idea to rely on the capabilities of a specific CPU to ensure atomicity, as that only prevents the race condition until a change of compiler or CPU.
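As a sketch of the critical-section idea, the counter example above might be protected as follows. The DISABLE_INTERRUPTS/ENABLE_INTERRUPTS macros are placeholders for whatever your platform provides (for example, interrupt-disable intrinsics on your compiler); they are defined as no-ops here only so the sketch compiles and runs on a desktop:

```c
#include <stdint.h>

/* Placeholder critical-section macros; on a real target these would
 * expand to platform-specific interrupt disable/enable operations.
 * Defined as no-ops here so the sketch is self-contained. */
#define DISABLE_INTERRUPTS()
#define ENABLE_INTERRUPTS()

volatile uint32_t g_counter = 0;

/* g_counter += 1 typically compiles to three steps: load, add, store.
 * If the zeroing thread runs between the load and the store, its write
 * is lost when the stale incremented value is stored afterward.
 * Bracketing the read-modify-write makes it atomic with respect to
 * the other thread of execution. */
void counter_increment(void)
{
    DISABLE_INTERRUPTS();
    g_counter += 1;
    ENABLE_INTERRUPTS();
}

void counter_zero(void)
{
    DISABLE_INTERRUPTS();
    g_counter = 0;
    ENABLE_INTERRUPTS();
}
```

Both the increment and the zeroing are bracketed, because a critical section only protects the shared object if every access to it participates in the protocol.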
Shared data and the random timing of preemption are culprits that cause the race condition. But the error might not always occur, making the tracking of race conditions from observed symptoms to root causes incredibly difficult. It is, therefore, important to be ever-vigilant about protecting all shared objects. Each shared object is an accident waiting to happen.
• Best practice: Name all potentially shared objects (including global variables, heap objects, and peripheral registers, as well as pointers to the same) in a way that makes the risk immediately obvious to every future reader of the code; the Netrino Embedded C Coding Standard advocates the use of a “g_” prefix for this purpose. Locating all potentially shared objects is the first step in a code audit for race conditions.
Bug 2: Non-reentrant function
Technically speaking, the problem of a non-reentrant function is a special case of the race-condition problem. And, for related reasons, the run-time errors caused by a non-reentrant function generally don't occur in a reproducible way, making them just as hard to debug. Unfortunately, a non-reentrant function is also more difficult to spot in a code review than other types of race conditions.
Figure 2 shows a typical scenario. Here the software entities subject to preemption are also RTOS tasks. But rather than manipulating a shared object directly, they do so by way of function call indirection. For example, suppose that Task A calls a sockets-layer protocol function, which calls a TCP-layer protocol function, which calls an IP-layer protocol function, which calls an Ethernet driver. In order for the system to behave reliably, all of these functions must be reentrant.
But all of the functions of the Ethernet driver manipulate the same global object in the form of the registers of the Ethernet controller chip. If preemption is permitted during these register manipulations, Task B may preempt Task A after Packet A has been queued but before the transmit is begun. Then Task B calls the sockets-layer function, which calls the TCP-layer function, which calls the IP-layer function, which calls the Ethernet driver, which queues and transmits Packet B. When control of the CPU returns to Task A, it requests a transmission. Depending on the design of the Ethernet controller chip, this may either retransmit Packet B or generate an error. Packet A is lost and does not go out onto the network.
In order for the functions of this Ethernet driver to be callable from multiple RTOS tasks near-simultaneously, they must be made reentrant. If they each use only stack variables, there is nothing to do. Thus the most common style for C functions is inherently reentrant. But drivers and some other functions will be non-reentrant unless carefully designed.
The key to making functions reentrant is to suspend preemption around all accesses of peripheral registers, global variables including static local variables, persistent heap objects, and shared memory areas. This can be done either by disabling one or more interrupts or by acquiring and releasing a mutex. The specifics of the problem dictate the best solution.
• Best practice: Create and hide a mutex within each library or driver module that is not intrinsically reentrant. Make acquisition of this mutex a precondition for the manipulation of any persistent data or shared registers used within the module as a whole. For example, the same mutex may be used to prevent race conditions involving both the Ethernet controller registers and a global or static local packet counter. All functions in the module that access this data must follow the protocol and acquire the mutex before manipulating these objects.
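A minimal sketch of this module-level mutex pattern follows. The function names and the packet counter are invented for illustration, and a POSIX mutex stands in for the RTOS mutex API so the sketch is runnable on a desktop; a real driver would use its RTOS's mutex calls and manipulate actual controller registers:

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

/* Module-scope mutex, hidden from callers.  Acquiring it is the
 * precondition for touching any shared state in this module. */
static pthread_mutex_t s_eth_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Shared state protected by the mutex: in a real driver this would
 * include the Ethernet controller's registers as well. */
static uint32_t s_packets_sent = 0;

/* Stand-in for the hardware queue-and-transmit register sequence. */
static void eth_hw_send(const uint8_t *frame, size_t len)
{
    (void)frame;
    (void)len;
}

/* Reentrant public entry point: the queue + transmit register sequence
 * and the counter update happen atomically with respect to any other
 * task calling into this module. */
void eth_send(const uint8_t *frame, size_t len)
{
    pthread_mutex_lock(&s_eth_mutex);
    eth_hw_send(frame, len);
    s_packets_sent++;
    pthread_mutex_unlock(&s_eth_mutex);
}

uint32_t eth_packet_count(void)
{
    pthread_mutex_lock(&s_eth_mutex);
    uint32_t n = s_packets_sent;
    pthread_mutex_unlock(&s_eth_mutex);
    return n;
}
```

Because the mutex is static to the module, callers cannot forget to use it; reentrancy is enforced inside the driver rather than delegated to every caller.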
Beware that non-reentrant functions may come into your code base as part of third-party middleware, legacy code, or device drivers. Disturbingly, non-reentrant functions may even be part of the standard C or C++ library provided with your compiler. If you are using the GNU compiler to build RTOS-based applications, take note that you should be using the reentrant “newlib” standard C library rather than the default.
Bug 3: Missing volatile keyword
Failure to tag certain types of variables with C's volatile keyword can cause a number of unexpected behaviors in a system that works properly only when the compiler's optimizer is set to a low level or disabled. The volatile qualifier is used during variable declarations, where its purpose is to prevent optimization of the reads and writes of that variable.
For example, if you write code like that in Listing 1, the optimizer may try to make your program both faster and smaller by eliminating the first line, to the detriment of the patient's health. If, however, g_alarm is declared volatile, this optimization will not be permitted.
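Since Listing 1 appears only as an image, here is a hedged reconstruction of the kind of code it describes; only the name g_alarm comes from the text, and the constants and function name are assumed. Without the volatile qualifier, the optimizer may observe two back-to-back writes with no intervening read and delete the first one:

```c
#include <stdint.h>

#define ALARM_ON   1u
#define ALARM_OFF  0u

/* Declared volatile so the compiler must emit every write; an ISR or
 * alarm circuit outside this translation unit is assumed to observe
 * each value as it is stored. */
volatile uint8_t g_alarm = ALARM_OFF;

void alarm_self_test(void)
{
    g_alarm = ALARM_ON;   /* without volatile, this store could be removed... */
    g_alarm = ALARM_OFF;  /* ...because this store overwrites it */
}
```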
• Best practice: The volatile keyword should be used to declare every:
- Global variable accessed by an ISR and any other part of the code,
- Global variable accessed by two or more RTOS tasks (even when race conditions in those accesses have been prevented),
- Pointer to a memory-mapped peripheral register (or set of registers), and
- Delay loop counter.
Note that in addition to ensuring that all reads and writes of a given variable actually take place, the use of volatile also constrains the compiler by adding additional “sequence points.” Accesses of other volatile variables that appear before the read or write of a volatile variable in the source code must be executed before that access.
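The third bullet above can be sketched as follows. The register address is invented for illustration, and the polling loop is parameterized on a pointer so the sketch can be exercised without real hardware:

```c
#include <stdint.h>

/* Hypothetical peripheral register address, for illustration only. */
#define UART_STATUS_REG  ((volatile uint32_t *)0x40001000u)
#define UART_TX_READY    0x1u

/* Because the pointed-to type is volatile, the compiler must reread
 * the register on every loop iteration.  Without volatile it may read
 * once, cache the value, and turn this into an infinite loop. */
void uart_wait_ready(volatile uint32_t *status_reg)
{
    while ((*status_reg & UART_TX_READY) == 0u)
    {
        /* spin until the hardware sets the ready bit */
    }
}
```

On the target this would be called as uart_wait_ready(UART_STATUS_REG); in a unit test, an ordinary variable can stand in for the register.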
Bug 4: Stack overflow
Every programmer knows that a stack overflow is a Very Bad Thing. The effect of each stack overflow varies, though. The nature of the damage and the timing of the misbehavior depend entirely on which data or instructions are clobbered and how they are used. Importantly, the length of time between a stack overflow and its negative effects on the system depends on how long it is before the clobbered bits are used.
Unfortunately, stack overflow afflicts embedded systems far more often than it does desktop computers. This is for several reasons, including: (1) embedded systems usually have to get by on a smaller amount of RAM; (2) there is typically no virtual memory to fall back on (because there is no disk); (3) firmware designs based on RTOS tasks utilize multiple stacks (one per task), each of which must be sized to handle that task's worst-case stack depth; and (4) interrupt handlers may try to use those same stacks.
Further complicating this issue, no amount of testing can ensure that a particular stack is sufficiently large. You can test your system under all sorts of loading conditions but you can only test it for so long. A stack overflow that only occurs “once in a blue moon” may not be witnessed by tests that run for only “half a blue moon.” Demonstrating that a stack overflow will never occur can, under algorithmic limitations (such as no recursion), be done with a top-down analysis of the control flow of the code. But a top-down analysis will need to be redone every time the code is changed.
• Best practice: On startup, paint an unlikely memory pattern throughout the stack(s). (I like to use hex 23 3D 3D 23, which looks like a fence '#==#' in an ASCII memory dump.) At runtime, have a supervisor task periodically check that none of the paint above some pre-established high-water mark has been changed. If something is found to be amiss with a stack, log the specific error (such as which stack and how high the flood) in nonvolatile memory and do something safe for users of the product (for example, a controlled shutdown or reset) before a true overflow can occur. This is a nice additional safety feature to add to the watchdog task.
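The painting and checking steps might be sketched like this; the function names are assumed, and the code assumes a stack that grows downward toward element 0 of the array, so the untouched paint sits at the low end:

```c
#include <stdint.h>
#include <stddef.h>

#define STACK_PAINT  0x233D3D23u   /* '#==#' in an ASCII memory dump */

/* At startup, fill the entire stack region with the sentinel pattern. */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++)
    {
        stack[i] = STACK_PAINT;
    }
}

/* Count the untouched paint words at the bottom of a descending stack.
 * A supervisor task calls this periodically and raises the alarm when
 * the result drops below a pre-established high-water-mark threshold. */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;
    while (n < words && stack[n] == STACK_PAINT)
    {
        n++;
    }
    return n;
}
```

Note that this check detects a near-overflow only after the fact on its next pass, which is why the threshold should leave a healthy margin above the true bottom of the stack.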
Bug 5: Heap fragmentation
Dynamic memory allocation is not widely used by embedded software developers, and for good reasons. One of those is the problem of heap fragmentation.
All data structures created via C's malloc() standard library routine or C++'s new keyword live on the heap. The heap is a specific area in RAM of a predetermined maximum size. Initially, each allocation from the heap reduces the amount of remaining “free” space by the same number of bytes. For example, the heap in a particular system might span 10 KB starting from address 0x20200000. An allocation of a pair of 4-KB data structures would leave 2 KB of free space.
The storage for data structures that are no longer needed can be returned to the heap by a call to free() or use of the delete keyword. In theory this makes that storage space available for reuse during subsequent allocations. But the order of allocations and deallocations is generally at least pseudo-random, leading the heap to become a mess of smaller and smaller fragments.
To see how fragmentation can be a problem, consider what would happen if the first of the above 4-KB data structures is freed. Now the heap consists of one 4-KB free chunk and another 2-KB free chunk; they are not adjacent and cannot be combined. So our heap is already fragmented. Despite 6 KB of total free space, an allocation of more than 4 KB will fail.
Fragmentation is similar to entropy: both increase over time. In a long running system (in other words, most every embedded system ever created), fragmentation may eventually cause some allocation requests to fail. And what then? How should your firmware handle the case of a failed heap allocation request?
• Best practice: Avoiding all use of the heap is a sure way of preventing this bug. But if dynamic memory allocation is either necessary or convenient in your system, there is an alternative way of structuring the heap that will prevent fragmentation. The key observation is that the problem is caused by variable-sized requests. If all of the requests were of the same size, then any free block would be as good as any other, even if it happened not to be adjacent to any of the other free blocks. Figure 3 shows how the use of multiple “heaps” (each for allocation requests of a specific size) can be implemented as a “memory pool” data structure.
Many real-time operating systems feature a fixed-size memory pool API. If you have access to one of those, use it instead of malloc() and free(). Or write your own fixed-size memory pool API. You'll just need three functions: one to create a new pool (of size M chunks of N bytes each); another to allocate one chunk (from a specified pool); and a third to replace free().
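Those three functions might be sketched as below; the API names are invented for this sketch, real RTOS pool APIs differ in detail, and a production version would wrap each function in a critical section (as in Bug 1) to make the pool safe to share between tasks. Free chunks are kept in a singly linked list threaded through the chunks themselves, so the pool needs no bookkeeping storage of its own:

```c
#include <stddef.h>
#include <stdint.h>

/* A fixed-size memory pool: every chunk is the same size, so any free
 * chunk satisfies any request and fragmentation cannot occur. */
typedef struct memory_pool
{
    void *free_list;   /* head of the linked list of free chunks */
} memory_pool_t;

/* Carve a caller-supplied buffer of (chunks * chunk_size) bytes into a
 * pool.  chunk_size must be at least sizeof(void *) and the buffer
 * must be suitably aligned to hold a pointer in each chunk. */
void pool_create(memory_pool_t *pool, void *buffer,
                 size_t chunks, size_t chunk_size)
{
    uint8_t *p = buffer;
    pool->free_list = NULL;
    for (size_t i = 0; i < chunks; i++)
    {
        *(void **)p = pool->free_list;   /* link this chunk into the list */
        pool->free_list = p;
        p += chunk_size;
    }
}

/* Allocate one chunk, or return NULL if the pool is exhausted. */
void *pool_alloc(memory_pool_t *pool)
{
    void *chunk = pool->free_list;
    if (chunk != NULL)
    {
        pool->free_list = *(void **)chunk;   /* pop the head of the list */
    }
    return chunk;
}

/* Return a chunk to its pool; this is the replacement for free(). */
void pool_free(memory_pool_t *pool, void *chunk)
{
    *(void **)chunk = pool->free_list;       /* push back onto the list */
    pool->free_list = chunk;
}
```

Because allocation and deallocation are each a constant-time list push or pop, this scheme is also far more deterministic than a general-purpose heap, which matters in real-time code.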
Code review is still the best practice
You can save yourself a lot of debugging headaches by ensuring that none of these bugs is present in your system in the first place. The best way to do that is to have someone inside or outside your company perform a thorough code review. Coding standard rules that mandate the use of the best practices I've described here should also be helpful. If you suspect you have one of these nasty bugs in existing code, performing a code review may prove faster than trying to trace backward from an observed glitch to the root cause.
In an upcoming column, I'll introduce you to a few other common causes of nasty embedded software bugs.
Michael Barr is the author of three books and over 50 articles about embedded systems design, as well as a former editor in chief of this magazine. Michael is also a popular speaker at the Embedded Systems Conference and the founder of embedded systems consultancy Netrino. You may reach him at or read more by him at .