Watchdogs

July 23, 2007

Jack Ganssle

In a number of recent emails, some readers claim that great embedded products don't need a watchdog. The correspondents reason that watchdogs are the last line of defense against software crashes, so if you write great code your system will be crash-proof.

I disagree.

Software is unique in that it's probably the only human endeavor - and certainly the only engineering field - where it's at least theoretically possible to achieve perfection. Software is unmarred by the gritty realities of poor castings, cyclic loadings and counterfeit parts that mechanical engineers must deal with. It doesn't suffer from EE nightmares like lightning strikes and poor solder joints.

But software isn't something that comes down from on high. It's designed and built by imperfect humans, who craft their code from often-misinterpreted and vague requirements, and interface the software to other complex systems whose behavior is usually poorly-specified.

Complexity grows exponentially; Robert Glass figures that for every 25% increase in the problem's difficulty, the code doubles in size. A many-million-line program can assume a number of states whose size no human can grasp.

Perfection, given these challenges, will be elusive at best. And how can one prove one's code is perfect?

The review board that studied the software-induced $500 million Ariane 5 failure reached a number of conclusions. One was that the organization had a culture that assumed software cannot fail. A half century of experience has taught us quite the opposite.

Software doesn't run in isolation. It's merely a component of a system. Watchdogs are not "software safeties." They're system safeties, designed to bring the product back to life in the event of any transient event that corrupts operation, like cosmic rays. Xilinx, Intel, Altera and many others have studied these high energy particles and have concluded that our systems are subject to random single event upsets (SEUs) due to these intruders from outer space.

Currently cosmic ray SEUs are thought to be relatively rare. One processor datasheet suggests we can expect a single error per thousand years per chip. That sounds pretty safe till you multiply that by millions, or hundreds of millions of processors shipped per year. But recent research suggests that as geometries scale below 65 nm even SRAMs will be surprisingly vulnerable to random SEUs.

Systems and software operate in a hostile world peppered with threats and imperfections that few engineers can completely anticipate or defend against. A watchdog timer, which requires insignificant resources, is cheap and effective insurance. It's the fuse EEs have routinely employed for a hundred years, and it's one that automatically resets.
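
For readers who haven't used one: in its simplest form the watchdog is serviced once per pass through the main loop, so a hang anywhere in that loop lets the timeout expire and reset the part. A minimal sketch, assuming a hypothetical memory-mapped watchdog register (the address, key value and placeholder functions are illustrative, not from any particular datasheet):

```c
#include <stdint.h>

#define WDT_KICK (*(volatile uint32_t *)0x40001000u) /* hypothetical watchdog register */
#define WDT_KEY  0xA5A5A5A5u                         /* hypothetical unlock key */

static void kick_watchdog(void)
{
    WDT_KICK = WDT_KEY;   /* restart the hardware timeout */
}

/* Placeholder work functions standing in for the real application. */
static void read_sensors(void)    { }
static void run_control_law(void) { }
static void update_outputs(void)  { }

int main(void)
{
    /* clock, peripheral and watchdog-timeout setup would go here */
    for (;;) {
        read_sensors();
        run_control_law();
        update_outputs();

        /* Kick only after the whole loop has run; if any stage hangs,
           or an SEU corrupts execution, the timeout expires and the
           part resets itself. */
        kick_watchdog();
    }
}
```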

What do you think? Do your products use a watchdog?

Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. Contact him at jack@ganssle.com. His website is www.ganssle.com.


Watchdogs are essential. In many of my products I use both a hardware one and a software one that runs in the highest-priority timer interrupt.

In the course of normal operation, the hardware dog is kicked by the application/RTOS at an appropriate rate.

The software dog is watching for excessive time use by a particular task - if a task knows it is going to take a long time, it can tell the software dog how long it expects to be (and if that is VERY long, it can handle the hardware dog during that time). Using them both helps catch issues of unexpected task runtime (sprintf a float sometime!), while the hardware one is there to catch things if the software one fails in some way.
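
A minimal sketch of this sort of two-dog arrangement, assuming a hypothetical RTOS with a 1 ms timer interrupt; the task count, names and tick rate are illustrative, not taken from Andy's system:

```c
#include <stdint.h>

#define NUM_TASKS 4

/* Per-task time budget in ticks; each task reloads its own entry when it
   starts a pass, and can ask for more time before a known long operation. */
static volatile int32_t task_budget[NUM_TASKS];

static void hw_dog_kick(void)      { /* part-specific register write */ }
static void sw_dog_fault(int task) { (void)task; for (;;) { } /* stop; let the hardware dog reset us */ }

/* A task calls this to say "I expect to be busy for up to 'ticks'". */
void sw_dog_expect(int task, int32_t ticks)
{
    task_budget[task] = ticks;
}

/* Highest-priority 1 ms timer ISR: the software dog. */
void sw_dog_tick(void)
{
    for (int t = 0; t < NUM_TASKS; t++) {
        if (--task_budget[t] < 0) {
            sw_dog_fault(t);      /* a task ran longer than it promised */
        }
    }
}

/* Kicked from a normal-priority RTOS task at the usual rate; if the
   scheduler itself is wedged, this stops running and the hardware dog bites. */
void rtos_housekeeping_task(void)
{
    hw_dog_kick();
}
```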

A watchdog is essential in ANY software product, not just for SEUs and bad code, but for a slow power supply startup or any other of a host of unforeseen circumstances.

Not having a protection mechanism in the product is begging for liability issues. Bridges, airplanes, escalators and elevators all have some sort of last ditch safety effort to prevent injury/damage. Why not code?

- Andy Kunz


Since any system reflects the limitations of its creator, the limitations of the programmer are introduced into the system. So until a perfect programmer is produced, software will always have potential flaws. Unless a program is small enough to be mathematically proven correct, a watchdog timer is a needed component of an embedded system to ensure that an UNFORESEEN problem is recoverable.

- Tom Mazowiesky


Who's watching the watchdog? Anyway... a watchdog should always be implemented, as it will always get you back to square one.

- Steve King


I have to absolutely agree that watchdogs are not merely for coding errors. And they're not merely for cosmic ray event errors.

Even perfect software executes on hardware. Even if hardware is perfectly designed, and completely shielded against cosmic rays, hardware breaks. Even if thermal and power derating is done in the design, every semiconductor will perish eventually.

Great embedded products are robust and trustworthy embedded products. No product without an independent watchdog is robust and trustworthy.

That does not mean that all embedded products must have a watchdog, if the product has no hazard potential and users can be expected to put up with occasional annoyance. Have you ever had to take the battery out of a cell phone to get a complete hardware reboot to finally exit some snit it has gotten itself into?

I have a truly excellent Wireless G router connected to my cable modem, from (one of) the top manufacturers of such equipment. In three years, I have had to pull out the power cord two or three times to get it to start working properly again.

Would a watchdog have helped my cell phone or router? Maybe. Maybe not.

But what about embedded products that do have potential hazards, such as injury to people or damage to property? Then the question is not whether a watchdog is required, but whether the watchdog is good enough to mitigate the hazards.

There are, or were, a small minority of microcontrollers with a separate RC oscillator to drive their on-chip watchdogs. The majority of such parts now, if they have an on-chip watchdog at all, drive it from the system clock derived from an oscillator or clock input to the chip. A single-point failure, such as a solder joint on the crystal or oscillator, can stop the micro and its internal watchdog. Perhaps with the motor drive running, or the high voltage switched on, or insert your particular hazard here.

And as for the software protection, I know programmers who do indeed write "great code". I don't know any who can guarantee to write perfect code. Especially these days when your embedded system might contain megabytes of commercial or open-source operating system, not written by your "great coders", and megabytes of application.

The authors of the emails you mention are certainly entitled to their opinions. But anyone who maintains that even the "greatest" code ever written eliminates all need for watchdogs is starting off at a disadvantage in convincing me that their opinions are worth consideration.

- Jack Klein


Just one reason for using watchdogs: imagine your unique and flawless code gets handed over to a team of even a few software developers - will you ensure the code stays faultless?

- Marcin Matczuk


We design electricity meters. Part of the acceptance test is a Fast Transient Burst - 4000 V fired up the Live and Neutral wires in bursts of pulses. The meters fall over under this test, but the watchdog brings them back to life, and that is acceptable. This is quite normal for any device connected to the mains supply.

One issue we have identified however is that if the watchdog is clocked off the main crystal, the FTB can stop the crystal oscillating and so kill the watchdog.

A separately-clocked watchdog is an absolute requirement in our industry and has nothing to do with code quality.

- Paul Hills


Drawing an analogy to a previous article of yours, "Contract based programming", I like to see the watchdog as the system's version of an ASSERT, the asserting condition being a heartbeat of the system.

In other words, a watchdog helps us recover from missed ASSERTs. In an ideal state, though, we would have ASSERTs in all possible fatal error cases in the system if we didn't wish to implement a watchdog.
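
One way to picture this analogy in code, assuming a hypothetical set of heartbeat flags and a part-specific kick routine (all names here are illustrative):

```c
#include <stdint.h>

#define HB_SENSOR_TASK  (1u << 0)
#define HB_CONTROL_TASK (1u << 1)
#define HB_COMMS_TASK   (1u << 2)
#define HB_ALL          (HB_SENSOR_TASK | HB_CONTROL_TASK | HB_COMMS_TASK)

static volatile uint32_t heartbeat_flags;

static void kick_hw_watchdog(void) { /* part-specific register write */ }

/* Each task sets its bit once per pass, like touching an ASSERT(alive). */
void heartbeat(uint32_t who)
{
    heartbeat_flags |= who;
}

/* Periodic check, e.g. from a timer tick: the system-level "ASSERT". */
void watchdog_service(void)
{
    if (heartbeat_flags == HB_ALL) {   /* asserting condition: everyone alive */
        kick_hw_watchdog();
        heartbeat_flags = 0;           /* demand fresh proof next period */
    }
    /* else: do nothing; the hardware watchdog times out and resets us,
       recovering from a condition no in-code ASSERT anticipated. */
}
```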

- Sachin Panemangalore


I agree a watchdog is a must in all products. We have had watchdogs in almost all the products we have designed so far.

I have one question for which I have not found a satisfactory answer.

Is a hardware watchdog better than one in software? Is there any major difference between the two? The only advantage I can think of is that a hardware watchdog external to the processor will continue to work even if the processor (housing the software watchdog) fails.

- Prasad Rotti


The last major product I worked on was a single processor controller in a vehicle. It had a 'safe failure mode' -- switch everything off and the system mechanically locks and the vehicle is still drivable, although the advanced function is missing.

We had two parallel, independently coded algorithms, with an integrator on the difference between the two, so if one went up the swannee we could reach the safe state.

We had checks on every input and output so if anything odd happened, we could head for the safe state.

The only problem was we had the one clock source, and we were driving PWM outputs. So if the clock failed during the ON portion of the PWM, the whole thing did some very strange things. Now while we never saw this failure mode, and didn't realistically expect to see it, we came up with a simple solution - drive the 'main relay' for the system from a charge pumped drive. While the input was clocked fast enough, the drive stayed on, and everything worked. Within milliseconds of the input becoming static (like when the crystal drops off the micro), the main relay drops out and all is safe again.
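
The software side of such a charge-pumped drive can be as simple as toggling a pin from the periodic control interrupt; a rough sketch, with hypothetical register and pin names:

```c
#include <stdint.h>

#define RELAY_PIN_TOGGLE (*(volatile uint32_t *)0x40020018u) /* hypothetical GPIO toggle register */
#define RELAY_PIN_MASK   (1u << 5)

/* Called from the fast periodic control interrupt (e.g. the PWM tick).
   If the clock, the interrupt, or the code stops, the pin goes static,
   the external charge pump discharges, and the main relay drops out on
   its own - no software action required to reach the safe state. */
void relay_keepalive_tick(void)
{
    RELAY_PIN_TOGGLE = RELAY_PIN_MASK;   /* flip the pin once per tick */
}
```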

This works for 'fail safe' systems. Automotive is heading towards 'fail operational' systems, and new ideas are needed - including watchdogs for getting back to square one.

- Paul Tiplady


A watchdog timer is important, especially in embedded systems. It isn't only for resetting programs when a fault condition occurs; in control systems, if something goes wrong in the hardware, such as high voltage on the outputs or a motor malfunction, the WDT can trigger the control system to put it into a safe state.

Yes, we can create perfect code and perfect hardware, but we cannot say that the hardware components will stay perfect as time goes on.

- Romeo Marcos Jr


I learned that even perfect software can fail when Flash memory is less reliable than claimed by its manufacturer.

There are several mechanisms to detect such cases; depending on where the software is corrupted, a watchdog can also help.

- Bernhard Kockoth
