Five top causes of nasty embedded software bugs
Editor's note: See the next five bugs in part 2.
Finding and killing latent bugs in embedded software is a difficult business. Heroic efforts and expensive tools are often required to trace backward from an observed crash, hang, or other unplanned run-time behavior to the root cause. In the worst cases, the root cause damages the code or data in such a way that the system still appears to work fine or mostly fine, at least for a while.
Too often engineers give up trying to discover the cause of infrequent anomalies that cannot be easily reproduced in the lab, dismissing them as user errors or "glitches." Yet these ghosts in the machine live on. Here's a guide to the most frequent root causes of difficult-to-reproduce bugs. Look for these top five bugs whenever you are reading firmware source code. And follow the recommended best practices to prevent them from happening to you again.
Bug 1: Race condition
A race condition is any situation in which the combined outcome of two or more threads of execution (which can be either RTOS tasks or main() and an interrupt handler) varies depending on the precise order in which the interleaved instructions of each are executed on the processor.
For example, suppose you have two threads of execution in which one regularly increments a global variable (g_counter += 1;) and the other very occasionally zeroes it (g_counter = 0;). There is a race condition here if the increment cannot always be executed atomically (in other words, as a single operation that cannot be interrupted partway through). On many processors the increment compiles to separate load, add, and store instructions, so the zeroing can land between the load and the store. Think of the tasks as cars approaching the same intersection, as illustrated in Figure 1. A collision between the two updates of the counter variable may never occur, or may occur only very rarely. But when it does, the counter will not actually be zeroed in memory that time; its value is corrupt at least until the next zeroing. This may have serious consequences for the system, although perhaps not until long after the actual collision.
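The collision described above can be replayed deterministically on a host machine. The following is only a sketch: the function name simulate_lost_zeroing() and the local variable reg (standing in for a CPU register) are made up for illustration, and the "preemption" is simply written inline at the worst possible point rather than produced by a real interrupt or task switch.

```c
#include <stdint.h>

/* Shared counter, as in the example in the text. */
static volatile uint32_t g_counter;

/*
 * Replays one unlucky interleaving of "g_counter += 1;" and
 * "g_counter = 0;" step by step. The increment is written out as the
 * load / add / store sequence that a compiler may generate for it.
 * Returns the value left in g_counter afterward.
 */
uint32_t simulate_lost_zeroing(uint32_t start)
{
    uint32_t reg;        /* hypothetical stand-in for a CPU register */

    g_counter = start;

    reg = g_counter;     /* incrementing thread: load                */
    g_counter = 0;       /* preemption: other thread zeroes counter  */
    reg = reg + 1;       /* incrementing thread resumes: add         */
    g_counter = reg;     /* incrementing thread: store overwrites 0  */

    return g_counter;
}
```

Calling simulate_lost_zeroing(41) leaves 42 in the counter rather than 0 or 1: the zeroing happened, but the stale value held by the incrementing thread was written back over it.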
• Best practice: Race conditions can be prevented by surrounding critical sections of code that must be executed atomically with an appropriate preemption-limiting pair of behaviors. To prevent a race condition involving an ISR, at least the interrupt signal that triggers that ISR must be disabled for the duration of the other code's critical section. In the case of a race between RTOS tasks, the best practice is the creation of a mutex specific to that shared object, which each task must acquire before entering the critical section. Note that it is not a good idea to rely on the capabilities of a specific CPU to ensure atomicity, as that only prevents the race condition until a change of compiler or CPU.
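For the ISR case, the pairing of behaviors might look like the sketch below. Everything in the "port layer" is an assumption made for illustration: irq_save_and_disable() and irq_restore() are hypothetical names, and here they only toggle a flag so that the code compiles and runs on a host. On a real target they would wrap whatever interrupt-masking mechanism the CPU or RTOS provides. For a race between RTOS tasks, the same bracketing would use the acquire and release calls of a mutex dedicated to the counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical port layer: host-side stubs that only track a flag.
 * A real port would mask and unmask the relevant interrupt instead. */
static bool g_irq_enabled = true;

static bool irq_save_and_disable(void)
{
    bool was_enabled = g_irq_enabled;
    g_irq_enabled = false;
    return was_enabled;
}

static void irq_restore(bool was_enabled)
{
    g_irq_enabled = was_enabled;
}

/* The shared object being protected. */
static volatile uint32_t g_counter;

/* Every access to g_counter goes through a critical section, so the
 * load / add / store of the increment cannot be split by the ISR. */
void counter_increment(void)
{
    bool state = irq_save_and_disable();
    g_counter += 1;
    irq_restore(state);
}

void counter_reset(void)
{
    bool state = irq_save_and_disable();
    g_counter = 0;
    irq_restore(state);
}

uint32_t counter_read(void)
{
    bool state = irq_save_and_disable();
    uint32_t value = g_counter;
    irq_restore(state);
    return value;
}
```

Saving the previous interrupt state and restoring it, rather than unconditionally re-enabling interrupts on exit, is a design choice that keeps these functions safe to call from code that is already inside a critical section.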
Shared data and the random timing of preemption are the culprits behind every race condition. But the error does not occur on every run, which makes tracing a race condition from observed symptoms back to its root cause incredibly difficult. It is, therefore, important to be ever-vigilant about protecting all shared objects. Each shared object is an accident waiting to happen.