Complexity exaggerates the inefficiencies of our debug techniques. As we start to rely on operating systems with multiple threads and nested interrupts, an even bigger portion of our time will be spent debugging. The real-time nature of many applications makes it difficult to isolate problems when multiple events happen simultaneously. Thanks to this complexity, debugging, not development, will soon be the process that delays the completion of most products. In this article, I'll discuss some common real-time multitasking debug problems and suggest methods to prevent and troubleshoot them.
The debugging process has historically included two categories of questions that you ask to verify that the system is running as expected. The first question asks, “Where is my code executing right now?” To discover the answer you usually rely on printf statements or flashing LEDs to indicate that the application has arrived at a certain point. When the development tools support it, you might start inserting breakpoints along the path where the application should be executing. The second question asks, “Where is the value that I'm seeing coming from?” To answer this one, you most likely rely on register displays, variable information, and memory dumps. You might also try single-stepping while watching all these windows to see when a register goes awry, a memory location gets wrong data, or a pointer gets corrupted.
You might be able to employ these techniques to get your application running if only one person writes all the code, the system doesn't have a network infrastructure to keep track of, and the operating system is not swapping tasks. But these circumstances are rare today. Embedded processors often run faster than 500MHz, have all sorts of embedded peripherals for Ethernet or USB, and run full-featured operating systems. The applications scheduled by these operating systems can include many thousands of lines of code. Sprinkling printf statements and debugging with LEDs isn't realistic when so many functions are executing and the extra debug I/O affects the processor's performance. Your processor might be so fast that LED toggling occurs faster than the eye can see. Setting breakpoints in a modern embedded system is usually possible but the amount of code that's often involved in these applications can make this type of debugging unruly. With interrupts and multithreaded systems, sticking a breakpoint into the code may not even indicate the correct state of the system. Since breakpoints are placed at physical (not virtual) memory addresses they're not always aware of the thread state. Watching your register display, local variable windows, and memory windows can be helpful in spotting where an inappropriate value gets loaded but since these are static tools and don't give meaningful run-time debug information their usefulness is still limited.
The most common problems with debugging real-time software can be broadly classified in the following groups:
• Synchronization problems
• Memory and register corruption
• Interrupt-related problems
• Unintended compiler optimizations
In any system, synchronization problems arise when you've got multiple sequential execution units (meaning threads, processes, or interrupts) all running and sharing data asynchronously. All the operations on the shared data must be carried out atomically–that is, one execution unit must finish its operation before another unit can operate on it. For example, Figure 1 shows Thread-A and Thread-B both operating on shared variable “counter” in which A increments the counter and B decrements the counter. Assembly language code for Thread-A's counter++ and Thread-B's counter –– is shown in Figure 1. Assume that Thread-A is currently running and Thread-B is waiting for some other event.
Let's say the initial counter value is 2 and Thread-A is the currently executing thread. Thread-A reads the counter value into a processor register, increments it, and writes it back to the counter variable in memory.
In a multithreaded system, a high-priority thread can preempt a low-priority thread. So for example, at Time 0, while Thread-A is executing the instructions Reg1 = counter and Reg1 = Reg1+1, an event wakes up Thread-B. At this point Reg1 contains the value 3. Now Thread-B wakes up (as indicated by the blue line), reads the variable counter, gets the value of 2 (because Thread-A has not yet written the updated variable to memory), decrements it to 1, and stores the new value into counter. As shown by the brown line at Time-2, Thread-A now resumes executing and writes Reg1 (the value 3) into counter, which is currently holding the value 1. In this process, Thread-B's work was lost. The value of counter should be 2, reflecting one increment operation by Thread-A and one decrement operation by Thread-B. We've got a failure.
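The interleaving can be replayed deterministically in plain C. The sketch below is only a single-threaded model, not real threaded code: the function names and the per-thread register variables are hypothetical, and the "preemption" is simply the order in which the statements are written.

```c
#include <assert.h>

/* Shared variable in "memory"; each thread's Reg1 is modeled as a local. */
static int counter;

/* Replays the preempted sequence one step at a time. */
int replay_preempted(int initial)
{
    int reg_a, reg_b;           /* hypothetical per-thread copies of Reg1 */

    counter = initial;
    reg_a = counter;            /* Thread-A: Reg1 = counter               */
    reg_a = reg_a + 1;          /* Thread-A: Reg1 = Reg1 + 1              */
    assert(reg_a == initial + 1); /* incremented, but not yet written back */
                                /* ...Thread-B preempts Thread-A here...  */
    reg_b = counter;            /* Thread-B reads the stale value         */
    reg_b = reg_b - 1;          /* Thread-B decrements its copy           */
    counter = reg_b;            /* Thread-B stores its result             */
                                /* ...Thread-A resumes...                 */
    counter = reg_a;            /* Thread-A overwrites Thread-B's store   */
    return counter;
}

/* Replays the same work with each read-modify-write running to completion. */
int replay_atomic(int initial)
{
    counter = initial;
    counter = counter + 1;      /* Thread-A finishes before B starts */
    counter = counter - 1;      /* Thread-B operates on A's result   */
    return counter;
}
```

Starting from 2, replay_preempted() ends with counter at 3, while replay_atomic() ends at the expected 2. The difference is exactly the lost decrement.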
In multithreaded systems another common synchronization problem is with linked-list corruption. If the links aren't manipulated atomically you may get invalid or dangling links. For example consider a singly linked list in which Node-A connects to Node-B, then C, then D. If Thread-A wants to append Node-E to the end of the linked list and Thread-B wants to append Node-F, the list can easily get corrupted unless the “append node” function is handled atomically.
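One way to make the append atomic is to hold a lock for the whole walk-and-link sequence. The sketch below assumes POSIX threads as a stand-in for whatever mutex service your RTOS provides; the list_t and node_t types and the function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct node {
    char name;                  /* 'A', 'B', ... for illustration only */
    struct node *next;
} node_t;

typedef struct {
    node_t *head;
    pthread_mutex_t lock;       /* guards head and every next link */
} list_t;

void list_init(list_t *list)
{
    assert(list != NULL);
    list->head = NULL;
    pthread_mutex_init(&list->lock, NULL);
}

/* Appends n to the tail. The lock is held from the first link read to the
   final link write, so two appenders cannot both find the same tail. */
void list_append(list_t *list, node_t *n)
{
    assert(list != NULL && n != NULL);
    n->next = NULL;

    pthread_mutex_lock(&list->lock);
    node_t **link = &list->head;
    while (*link != NULL)       /* walk to the last link in the chain */
        link = &(*link)->next;
    *link = n;                  /* attach the new node */
    pthread_mutex_unlock(&list->lock);
}
```

If Thread-A appends Node-E and Thread-B appends Node-F through list_append(), whichever thread takes the lock second sees the other's node as the tail, so neither node is dropped.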
Often synchronization problems are hard to debug because they depend on timing and seem to occur randomly. Fortunately, most of these problems can be avoided by properly guarding the shared data. Most real-time operating systems offer various services to protect and synchronize access to shared data. You have to choose the most appropriate mechanism to protect shared data without affecting the performance of the system.
You might consider any of the following approaches where data is shared among multiple threads:
• Turn off the scheduler so that the current thread is never preempted, creating “unscheduled regions” in your program
• Use a semaphore or mutex to protect the shared data
• Protect critical regions by, for example, masking all interrupts
Each of these options comes with its own performance considerations, listed in Table 1. Turning off the scheduler will prevent any context switches and allow the current thread to execute until the scheduler is turned on again. This technique has a negative effect on performance, as it will delay any high-priority threads that are ready to run.
Turning off interrupts is safest and may be ideal for short periods of execution. This technique will likely increase your interrupt latency since interrupts will be ignored for some period of time. “Hard” real-time systems typically have an upper limit on the amount of time you can disable interrupts.
You can use semaphores, mutexes, or any other synchronization service to protect the data. A mutex can be owned by only one execution entity at any point; if you want to share a resource among a finite number of execution entities, semaphores are the better choice. For example, suppose your system has a shared resource, such as a message pipe, that can handle only a finite number of messages, say 10. A counting semaphore could be associated with the message pipe, with its initial count set to 10. An execution entity that wants to place a message first acquires the semaphore and then places the message. Acquiring the semaphore decrements its count; if the count is already zero, the execution entity is blocked until another entity releases the semaphore. Once a message is read from the pipe, the semaphore is released, which increments the count. In this example, at most 10 execution entities can access the shared resource at a time to place their messages.
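The counting behavior in the message-pipe example can be modeled with C11 atomics. This is only a sketch under stated assumptions: counting_sem_t and the sem_model_* functions are hypothetical names, not an RTOS API, and the acquire is non-blocking (it returns false where a real semaphore would suspend the caller).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int count;           /* free message slots remaining */
} counting_sem_t;

void sem_model_init(counting_sem_t *s, int initial)
{
    assert(initial >= 0);
    atomic_init(&s->count, initial);
}

/* Takes one slot if any is free. Returns false when the count is zero,
   which is the point where a real RTOS semaphore would block the caller. */
bool sem_model_try_acquire(counting_sem_t *s)
{
    int c = atomic_load(&s->count);
    while (c > 0) {
        /* On failure, c is refreshed with the current count and we retry. */
        if (atomic_compare_exchange_weak(&s->count, &c, c - 1))
            return true;
    }
    return false;
}

/* Called after a message has been read from the pipe. */
void sem_model_release(counting_sem_t *s)
{
    atomic_fetch_add(&s->count, 1);
}
```

With the count initialized to 10, the first ten calls to sem_model_try_acquire() succeed and the eleventh fails until some reader calls sem_model_release().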
If shared data is getting corrupted, you should first check for any simultaneous execution of threads or interrupts on the shared data. If a thread and an interrupt routine share data, interrupts must be turned off in the thread code while it accesses that data. If data is shared among multiple interrupt routines, interrupts must also be turned off, because a high-priority interrupt can preempt a low-priority interrupt. In multithreaded systems a high-priority thread can preempt a low-priority thread, so if data is shared among threads, an appropriate mechanism must be adopted to protect the shared data.
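For data shared between a thread and an interrupt routine, a common pattern is to save the interrupt state, disable interrupts, touch the data, and then restore the saved state. The sketch below is an assumption-laden model: irq_save_and_disable() and irq_restore() are hypothetical stand-ins for your processor's or RTOS's interrupt-mask calls, simulated here with an ordinary variable, and the shared counter is invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the processor's global interrupt-enable bit. */
static bool irq_enabled = true;

/* Stand-in for a real "disable and return previous state" primitive. */
static bool irq_save_and_disable(void)
{
    bool was_enabled = irq_enabled;
    irq_enabled = false;
    return was_enabled;
}

/* Restores whatever state the caller saved, enabled or not. */
static void irq_restore(bool was_enabled)
{
    irq_enabled = was_enabled;
}

static volatile int rx_bytes_pending;   /* shared by the thread and the ISR */

/* ISR side: records newly received bytes. */
void rx_isr(int received)
{
    rx_bytes_pending += received;
}

/* Thread side: consumes up to n bytes and returns how many were taken.
   The read-modify-write happens with interrupts masked. */
int consume_pending(int n)
{
    bool saved = irq_save_and_disable();
    assert(!irq_enabled);               /* the ISR cannot run in this window */
    if (n > rx_bytes_pending)
        n = rx_bytes_pending;
    rx_bytes_pending -= n;
    irq_restore(saved);
    return n;
}
```

Restoring the saved state, rather than unconditionally re-enabling interrupts, keeps the function safe to call from code that already has interrupts masked.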
Another synchronization problem is related to improper configuration of thread priorities. It's important to make sure the system-initialization threads start during boot time and initialize the entire system before starting any higher-priority threads in the system. For example, if a low-priority thread that configures a device is preempted by a high-priority thread that uses the same device, the configuration might not happen properly and cause the device to fail. To avoid such conditions you should use semaphores or other synchronization primitives provided by the operating system.
Most embedded systems use a flat memory model without a memory-management unit (MMU), so there's no hardware-supported memory protection available to you. Even processors that do offer this feature leave it up to the programmer to enable protection of various memory regions. If you activate your processor's memory protection, you must weigh the trade-off between protection and performance. Performance is a factor in enabling the MMU in embedded systems because of the time required to switch from user mode (MMU protecting memory) to supervisor mode (MMU not protecting memory space). In systems that run without memory protection, processes and threads have complete access to other processes' and threads' memory spaces. This may cause a variety of stack overflow problems.
A run-time stack is temporary memory space used during a function call for passing parameters in and return values out. It also stores local variables. A processor register will usually keep track of the memory address of the stack pointer (SP). If your program is written in a high-level language such as C, the compiler generates prologue and epilogue code to create and destroy this stack with every function. For C programs, such code will build a stack that conforms to the C run-time model specific to your processor. The run-time model defines how variables are stored on the stack and how the compiler uses the stack. Figure 2 shows an example of how memory is used on the stack.
Allocating memory to the stack takes place during run time. For standalone applications, run-time initialization code sets up a stack. If the applications are linked to an operating system, each thread or process will typically get its own stack. Stack-management functions (such as changing the thread stack) will be performed by the operating system. Stack overflow occurs when the stack pointer crosses its allocated boundary, corrupting memory and probably leading to system failure. In the example above, a stack overflow would occur if the total stack area were not sufficient to hold all the local variables.
For performance reasons, many real-time operating systems have predefined (or programmer-defined) stack sizes that will not increase dynamically during run time. That means improperly selecting the stack size for any particular thread or process may result in stack overflow.
If the application makes heavy use of large local variables (such as arrays or large structures), there's an opportunity for stack overflow when all the local variables get pushed onto the stack. Use large local variables with discretion; where appropriate you should allocate memory using malloc() or define static or global variables.
Stack overflows are sometimes difficult to debug because they occur at run time and mostly depend on the execution path. Some processors detect and trap stack exceptions in hardware, including stack overflows. Applications can register exception handlers to catch these stack-overflow exceptions during debugging. Real-time operating systems might also offer debug features, such as guard bits, to protect against stack overflow. The RTOS will either log an error message about the stack overflow or raise a request to dynamically increase the size of the stack. In the simplest case, most of today's real-time operating systems can report the peak stack size used by all the threads in the system, which can help you to configure the stack.
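Peak stack usage is commonly measured by filling the stack with a known pattern and later checking how much of the pattern survives. The sketch below assumes a stack that grows downward from the highest address; the fill value and function names are hypothetical, and a real RTOS may implement its reporting differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL 0xA5u        /* arbitrary, easy-to-spot pattern */

/* Fills the whole stack area with the pattern before the thread starts. */
void stack_paint(uint8_t *stack, size_t size)
{
    assert(stack != NULL);
    memset(stack, STACK_FILL, size);
}

/* Returns the peak number of bytes used so far.
   Assumes the stack grows from stack[size - 1] down toward stack[0], so
   the untouched region is the run of pattern bytes at the low end. */
size_t stack_peak_usage(const uint8_t *stack, size_t size)
{
    size_t untouched = 0;

    assert(stack != NULL);
    while (untouched < size && stack[untouched] == STACK_FILL)
        untouched++;
    return size - untouched;
}
```

The measurement can undercount if the deepest byte the thread wrote happens to equal the fill pattern, so treat the result as a lower bound and leave some margin when sizing the stack.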
In any interrupt-driven system, you have to consider the stack used by the interrupt-service routines (ISRs) if those interrupt routines are designed to use the current thread's (or other execution entity's) stack. In such scenarios, every thread or process should have a minimum stack size that covers the requirements of that thread or process plus the maximum cumulative stack required by all the interrupt routines.
You also have to be conscious of the requirements for any libraries that will be linked into your application. Some third-party libraries may make assumptions about the available stack space.
Register corruption and interrupt-related problems
ISRs are often written in assembly language, either for performance reasons or to better manipulate the hardware. Interrupts are asynchronous by nature, so they can occur at any time during any application's execution. A common problem with assembly-level interrupt routines is register corruption. A poorly written ISR can crash the entire system. For example, if the registers that an ISR uses are not all saved and restored, invalid values will mysteriously end up in the task that was interrupted. This leads to a variety of difficult-to-diagnose bugs and exceptions.
To illustrate the problem, consider the scenario in Figure 3 in which Thread-A has executed the instructions Reg1 = counter and Reg1 = Reg1 + 1. Now let's say an interrupt occurs at Time 1 and the processor vectors to the appropriate ISR. The ISR corrupts register Reg1 without first saving its value and restoring it later. When the ISR completes and execution returns to Thread-A, Reg1 contains an invalid value of 0. The counter will be set to the wrong value, even though Thread-A has done nothing wrong.
If you write ISRs in C rather than assembly language, the compiler will place any local variables you use in the interrupt routines onto the current stack. Stack overflow then becomes another potential problem if the stack is configured improperly, as we saw earlier.
Registers used in ISRs must always be saved before using them and restored before returning from the service routine. You also have to pay attention to the processor's status and flag registers that may be affected by arithmetic operations within the ISR. The ISR should save and restore those registers, too. If your ISRs were written in C and they use the current stack of the operating system, every thread must have enough stack space to handle interrupts or nested-interrupt stack requirements. The best practice is to keep the interrupt routines as short and simple as possible and defer the processing to a thread or a low-priority interrupt. Deferring the work to a lower-priority interrupt, or to a deferred callback as it is normally called, is fairly common. During development you can add diagnostic functions at the start and end of your interrupts to compare each register used in the ISR to ensure that the state of the system is maintained.
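The comparison half of such a diagnostic can be written in portable C, as sketched below. Capturing the actual register values has to be done with processor-specific assembly and is not shown; the reg_snapshot_t type, the register count, and the function name are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_REGS 4              /* hypothetical number of registers checked */

typedef struct {
    uint32_t r[NUM_REGS];       /* values captured by target-specific code */
} reg_snapshot_t;

/* Compares the snapshot taken on ISR entry with the one taken just before
   return. Returns the index of the first register whose value changed, or
   -1 if the ISR preserved every register. */
int first_clobbered_reg(const reg_snapshot_t *before,
                        const reg_snapshot_t *after)
{
    assert(before != NULL && after != NULL);
    for (int i = 0; i < NUM_REGS; i++) {
        if (before->r[i] != after->r[i])
            return i;
    }
    return -1;
}
```

In the Figure 3 scenario, a snapshot pair in which Reg1 held 3 on entry and 0 on exit would report index 1, pointing straight at the register the ISR failed to restore.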
Interrupt nesting enables a high-priority interrupt to preempt the execution of a low-priority interrupt. You should allocate enough space for peak stack requirements, considering the worst-case scenario in which each interrupt in your system is active and preempted by a higher-priority interrupt.
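The worst-case sizing rule is simple arithmetic: the thread's own requirement plus the stack frame of every interrupt that could be nested on top of it. The function below is a minimal sketch of that sum; its name and the byte counts in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the minimum stack size for a thread whose stack is also used by
   interrupt routines: the thread's own peak usage plus the frame of every
   ISR, assuming the worst case in which all of them are nested at once. */
size_t min_thread_stack(size_t thread_bytes,
                        const size_t *isr_bytes, size_t num_isrs)
{
    size_t total = thread_bytes;

    assert(isr_bytes != NULL || num_isrs == 0);
    for (size_t i = 0; i < num_isrs; i++)
        total += isr_bytes[i];
    return total;
}
```

For instance, a thread that needs 512 bytes in a system with three nestable ISRs using 64, 96, and 128 bytes would need at least 800 bytes of stack, before adding any safety margin.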
In-line assembly code is frequently used to manipulate memory-mapped registers and improve performance. For example, you may want to mask off interrupts by directly setting the interrupt mask register of your processor, rather than calling the equivalent application programming interface provided by the operating system. Simple operations like atomic (uninterruptible) increment and decrement functions are commonly written in assembly language, for instance. These might be used as macros in C programs, in which case the compiler might not be aware of the registers that were used inside the macros. It may therefore generate code that uses those same registers, leading to register corruption. See if your compiler offers any assembly-language constructs to describe to the compiler how a macro uses registers or memory locations. These enable the compiler to generate appropriate code.
Sometimes functions written in assembly language will be called by C functions. If the assembly code isn't written according to the C run-time calling conventions dictated by the compiler, it may lead to invalid argument passing or data corruption. For example, if the run-time model stipulates that the first two arguments are passed in processor registers R0 and R1, your assembly-language implementation must follow these rules. In other cases, the run-time model may require you to save the return address of the function on the stack. If your assembly-language implementation doesn't conform to the run-time model, it may trash some registers and lead to failure. You must be aware of your compiler's run-time semantics before you use mixed languages.
Compiler optimizations, which are always logically correct, may still result in failures. This issue is particularly troublesome with low-level device drivers. Instruction reordering is a common way for a good compiler to achieve better performance since processors can sometimes handle multiple instructions in a single cycle. The compiler therefore looks to schedule instructions so that all of the processor's execution slots are full.
For certain devices, order of execution is critical. Examples include network cards and flash memory devices. Consider the case of a network interface in which device registers are grouped into banks, with a bank-select register that must be set before accessing any of the registers from that bank. Listing 1 shows the original code, and Listing 2 gives you a possible logical shuffle of that code.
As depicted above, the compiler might shuffle the instructions from the original source. The BSR_REG (bank-select register) must be set before setting up the pointer register (PTR_REG). If the compiler shuffles these two logical blocks we'll get invalid results.
If your device driver works in the debug build but fails in the optimized build, look for shuffled instructions that have been “optimized” by the compiler. You may have to use a specific compiler flag or directive to guide the compiler away from such optimizations. You may want to do this just on a per-function basis to avoid losing performance unnecessarily.
Migrating code from one architecture to another sometimes leads to data-type problems. For example, integers may be 32 bits long in one processor architecture and 64 bits long in another. This can lead to invalid data or truncation of the data. The relationship to the compiler is that compilers are the interface between the application and the hardware. If the application makes an assumption about the size of a data type, the compiler might not complain, since the data size is still valid, but the algorithm will not return valid results. We agree it does not fit perfectly here, but we are not sure where else to place this common stumbling block.
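A common defense is to use the fixed-width types from <stdint.h> wherever the algorithm depends on a particular size, and to state the assumption so the build fails if it is violated. The sketch below is illustrative only; next_sequence() is a hypothetical function whose wraparound must happen at 32 bits regardless of how wide int is on the target.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Fails the build, rather than the product, if the assumption is wrong. */
static_assert(sizeof(uint32_t) * CHAR_BIT == 32, "uint32_t must be 32 bits");

/* Advances a sequence number that is defined to wrap at 2^32. Returning
   uint32_t forces the result back to 32 bits even on a target where the
   addition itself is carried out in a wider type. */
uint32_t next_sequence(uint32_t seq)
{
    return seq + 1u;
}
```

Had the sequence number been declared as a plain unsigned int, the wrap point would silently move when the code was ported to an architecture with a different integer width.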
Exceptions that are synchronous with program execution (as opposed to asynchronous and unpredictable interrupts) are usually the result of an improper instruction such as a divide-by-zero operation or a jump to an invalid memory address. Exceptions are also used to raise system events like cache miss, stack overflow, or hardware trace buffer full. Although there are some common exceptions for all processors, most processors will also have their own architecture-specific exceptions. The operating system typically handles such exceptions. Probably the most common exceptions you'll encounter in embedded development are those caused by corruption of the instruction memory. This could happen because of an invalid pointer, a stack overflow, or register corruption–in other words, by an earlier fault. The root cause of instruction corruption is sometimes very difficult to reproduce because of the real-time nature of the applications and the long chain of events that may lead up to it.
To begin dealing with processor exceptions, start with a default exception handler and examine the exception context (the processor's registers and relevant stack contents) when the exception occurs. Most processors have a register that holds the address of the offending instruction, or at least the instruction that raised the exception. In most cases it's easy to know why an exception occurred, but identifying the execution path that led to this failure can be troublesome. Some processors have hardware-level tracing, which allows you to see the history of instructions the chip executed most recently. Examining this trace is definitely helpful during debugging. Memory and register corruption combined with logical errors in the program are the primary reasons for exceptions. Examine the memory references or registers that caused the exception and you can narrow down the problem domain.
Exploring architecture-specific features
Most embedded processors include some type of built-in debug capability. A trace unit is one such hardware-supported tracing mechanism; it raises a trace exception when the hardware trace buffer is full. Using this trace unit, you can reconstruct the complete path of execution. Sometimes the asynchronous nature of interrupts makes debugging embedded systems pretty hard. A problem or exception might appear only in certain scenarios and under certain conditions. In such cases, the execution path or program-counter trace helps programmers narrow down the problem domain.
Watch points are like breakpoints but enable you to monitor particular memory locations as they are being changed. Watch points monitor the processor's internal data bus and raise an exception if a match is found in the watch point registers. Watch points are useful if a particular memory location is getting corrupted consistently and you can't pinpoint the instruction that's doing it. Watching the memory can identify errant code corrupting variables or pointers.
Most debuggers allow you to modify memory and registers directly. Sometimes modifying a register gives insight into what is going wrong. For instance, changing the program counter value stored on the stack can force execution to resume at a particular function. Once again, you have to be careful to set up proper values with regard to the processor's or compiler's run-time model.
Because it is among the last steps in a long development project, debugging has a direct impact on product delivery and the ever-important time to market. Debugging is also inherently difficult to schedule. The problems found can vary greatly in both complexity and elusiveness. We've covered some of the common issues encountered during embedded systems development; these overviews and proposed solutions are meant to emphasize the time savings and improvements that modern development tools and processors with rich debug facilities can bring when it comes to developing complex embedded systems.
Giuseppe Olivadoti is a field applications engineer who has worked at Analog Devices for eight years. He holds a BSEE from Northeastern University and an MBA from the University of Phoenix.
Srinivas Gollakota is a systems engineer and has worked at Analog Devices for seven years. He holds a BTech in Computer Science and Engineering from S.V University, India, and an MS in computer science from Boston University.