Watchdogs redux

February 28, 2011

A novel approach
The latest issue of IEEE Embedded Systems Letters has an article that practically grabbed me by the throat.4 Titled "Control Focused Soft Error Detection for Embedded Applications" by Karthik Shankar and Roman Lysecky, it's a bit academic and rather a slog to get through. But the authors have come up with a fascinating twist on the concept of watchdog timers. In fact, they don't use the word "watchdog," and no timers are involved.

The idea is quite simple and at first glance not particularly novel: monitor the addresses the processor issues and compare those against a profile obtained during development. If the CPU goes to an unexpected location, take some sort of remedial action. Do the same if it doesn't issue an expected address. The authors go further and compare against the number of expected loop iterations and the like.

But what got my attention is how they monitor addresses. They simply suck in compressed trace data from the processor's serial trace data port. The paper talks about using an ARM CPU, but other parts also support various kinds of serial trace data, and there's even a standard named Nexus-5001, which is IEEE–ISTO 5001–2003.5

ARM supplies a bewildering array of debugging IP, including a macrocell that sends trace data out just two pins. The so-called Program Flow Trace (PFT) is described at ARM's infocenter.6 It can be set up in a zillion different configurations, but will at the very least emit compressed information that lets one track changes in program flow. For instance, the execution of a branch instruction pushes address data out. So does Branch with Link, which is ARM's way of calling a subroutine. Similar instructions in Thumb mode do the same.

There's a lot to like about the Shankar and Lysecky's approach. First, modern processors are barricaded behind caches, pipelines, and speculative execution logic. Even if you're using a CPU that has address pins (as opposed to a self-contained microcontroller or IP on an FPGA), those address lines simply don't match anything that is going on in the processor's grey cells.

Second, pins are valuable. By squeezing addresses through a couple of debug pins few of these valuable resources are consumed. And logic is cheaper than pins; adding circuitry to decompress the trace stream and make sense of it may cost very little. In the paper, the authors state that the overhead for their particular approach was only 3% of the silicon area of an ARM1156T.

Third, most watchdogs use ad hoc approaches that just are not reliable indicators of system operation. This new approach lets the designer decide just how fine-grained the monitoring should be. Check an occasional call… or watch every instance of every branch and call.

Fourth, there's no overhead in the system's firmware. Zilch. Unlike traditional watchdog approaches, which require one to seed the code with instructions to periodically kick the dog to keep it from resetting the system, address profiling is transparent to the software.

There are a few caveats, of course. The logic needed to make sense of the address information is substantial, and is probably impractical unless implemented in an FPGA. Building such a monitor would be a lot of work. But it needs to be done once, and then can ever after be used in a succession of products. I can see this as packaged IP sold by a third party, or as an open source project.

Remember, though, there's no guarantee that your ARM CPU will have the trace logic needed. Every vendor is free to include the IP or not.

Going deeper
I haven't thought out all of the implications or possible ways to actually use this idea in a real embedded system, but here are a few ideas.

One problem with the proposed approach is that every recompile will shift all of the system's addresses, requiring the developer to re-profile the code to determine where the branches are. An alternative is to monitor just function calls, but use a level of indirection. Build a table of jump instructions that lives at a fixed address in memory; each entry corresponds to a particular function. Make calls through that table. Then monitor those jumps. One would have to be careful that the compiler didn't optimize the jumps away.

This does mean the software changes a bit, of course. But more than a few of us use these jump tables anyway. They can aid debugging and sometimes simplify on-the-go firmware updating.

ARM's program flow trace also very intriguingly sends address information out when a DSB, DMB, or ISB instruction is executed. DSB (Data Synchronization Barrier) holds up program flow until all of the instructions before it complete; DMB (Data Memory Barrier) ensures that memory accesses before it compete before one after it starts, and ISB (Instruction Synchronization Barrier) flushes the instruction pipeline, ensuring that the instructions after the ISB are fetched from memory or the cache.

Why are these interesting instructions? ARM mandates that at the very least a DMB be issued before locking or unlocking a mutex, and before incrementing or decrementing a semaphore. One could monitor these, just like watching branches, to ensure that the code is properly going through all of the activities mediated by the RTOS. In fact, monitoring only these actions may be simpler than the addresses, because program flow can change quite a lot even from a simple change (requiring re-profiling), but resource management tends to change much less frequently.

As far as the DSB and ISB instructions, I'm not quite sure how they could offer useful watchdog information, but something makes me think they could offer some interesting options.

< Previous
Page 2 of 3
Next >

Loading comments...

Parts Search Datasheets.com

KNOWLEDGE CENTER