In 2003 a Boeing 747-400 aircraft lost all engine and flight displays. Pilots flew on backup instruments for 45 minutes before ground technicians radioed back a fix. In 2001 the same type of aircraft experienced the same problem, which in this case wasn't repaired till the plane landed.
In both cases the fix was the same: cycle the circuit breakers. Punch reset, hit control-alt-delete, cycle power.
Though various authorities continue to look for the problem's root cause, Boeing has issued an interim solution: cycle the breakers. Just reset it.
A software problem locked up the Pathfinder spacecraft's computer as it descended to a landing on Mars. The watchdog timer brought the system back to life. It just reset the CPU.
The Clementine lunar mapper dumped all of its fuel when the software ran amok. There was no watchdog. The mission was lost because ground controllers couldn't just reset it.
One reader wrote that his stove's oven fan went whacko; apparently computer-controlled, its health was restored when he cycled power. He just reset it. My 4-year-old niece offered a helpful suggestion while I was in the middle of resolving a LAN routing problem: “Turn it off and turn it on, Uncle Jack; that always works for me.” And we all know how to fix a Windows machine that's low on resources.
We just reset it.
(I'm fortunate; in my neighborhood the electric company provides regular power cycling as a customer service.)
I wrote a series of articles about watchdog timers in ESP (“Born to fail,” “Li'l Bow Wow,” and “Watching the watchdog”) and then condensed them to a single piece, adding drawings and more thoughts. In two months on-line it has been downloaded over 4,000 times. Developers are apparently aching for a device that brings crashed systems back to life. Something that just resets it.
The culprit is buggy code, of course. The software revolution gave us tremendous functionality in the most mundane products. And it takes those capabilities away, randomly, usually at the most inconvenient moments. Sometimes, for inexplicable reasons, doing things in exactly the same manner we've used for months leads to a crash. But that's not a problem: cycle power or yank the batteries for 5 minutes. Just reset it.
We've created new life strategies to cope with the problems. My family calls me the “save it” czar, since anytime I notice anyone working on a document, spreadsheet, or other data manipulating tool, and the task bar is labeled “new document” instead of some filename that indicates at least one save took place, I slap them around. Metaphorically, of course. When using Word my left hand subconsciously rolls through the ALT-f-s save-mantra every 10 or 15 seconds. Mostly that's a habit induced by past years of struggles with Windows 3.1, 95, and 98. XP has been astonishingly reliable.
One friend adamantly insists that our code should be perfect. Once I agreed. Small apps sporting 10k lines of code or so are truly tractable.
Old-timers will recall the debates that raged 30 years ago about reset switches. Were they needed? Desirable? Or perhaps a red flag to customers that even the vendor didn't trust the code they were shipping?
Today there are no reset switches, and products are huge, often employing hundreds of thousands of lines of code. Fact is, humans write this stuff. Humans are, last I checked, imperfect. Our work products always reflect ourselves, for better or for worse. Bugs abound.
I think the next great change in firmware development will be self-recovering code: software that detects failures and initiates a graceful recovery transparent to the user. One approach is to carefully segment the code into tasks, each protected by an MMU, coupled with exception handlers smart enough to restart the flawed thread. Now that transistors are free, why not stick an MMU even in a cheap 8-bitter?
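One way to picture the restart-the-flawed-thread idea is a small supervisor loop. The sketch below is plain C with no MMU or RTOS behind it. The names `task_t`, `supervise_once`, `MAX_RESTARTS`, and the demo task are all hypothetical. A task reports its own fault by returning false, which stands in for whatever a real exception handler would signal. After a few failed restarts the supervisor gives up and asks for the old remedy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task model: step() returns false when the task has faulted. */
typedef struct {
    const char *name;
    void (*init)(void *ctx);   /* puts the task back into a known-good state */
    bool (*step)(void *ctx);   /* one unit of work; false means "faulted"    */
    void *ctx;                 /* the task's private state                   */
    unsigned restarts;         /* restarts performed so far                  */
} task_t;

enum { MAX_RESTARTS = 3 };     /* assumed limit before escalating */

typedef enum { SUP_OK, SUP_RESTARTED, SUP_ESCALATE } sup_result_t;

/* Run one step of a task. On a fault, restart only that task; once the
   restart budget is spent, tell the caller a full reset is needed. */
sup_result_t supervise_once(task_t *t)
{
    assert(t != NULL && t->init != NULL && t->step != NULL);

    if (t->step(t->ctx)) {
        return SUP_OK;
    }
    if (t->restarts >= MAX_RESTARTS) {
        return SUP_ESCALATE;
    }
    t->restarts++;
    t->init(t->ctx);
    return SUP_RESTARTED;
}

/* Demo task for illustration only: it faults on every fail_at-th step. */
typedef struct { int count; int fail_at; } demo_ctx_t;

void demo_init(void *p)
{
    demo_ctx_t *c = (demo_ctx_t *)p;
    c->count = 0;
}

bool demo_step(void *p)
{
    demo_ctx_t *c = (demo_ctx_t *)p;
    c->count++;
    return c->count != c->fail_at;
}
```

A main loop would call `supervise_once()` for each task in turn and fall back to a full reset (or simply stop kicking the watchdog) only when it returns `SUP_ESCALATE`.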
But until then we'll use the same old technique.
We'll just reset it.
Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges.
Many of the latest cell phone processors have made the leap to MMUs and protected-mode operation. Plus, all the downloaded applications are Java, which has extensive pre-runtime and runtime checking, unlike C or assembler. See the article “Java Security Guards Embedded Networks” in Wireless Systems Design by Prithvi Rao. Steps like this go a ways toward ensuring that the game you download for your all-in-one phone/PDA does not affect its ability to make that critical call. Other conservative software design steps help too. Built-in test (BIT) checks for the eventual failure of the circuitry and/or ICs and lets you know it's time to take the phone in for service, which helps ensure that your phone does not cause a problem that prevents everyone in a whole cell from making calls. In many cases semiconductor manufacturers are happy to provide sample test code to exercise their specific memory and peripheral architectures, and this can provide an additional measure of assurance that the phone is operating correctly. Some routines, like RAM and Flash tests, might only be able to run at power-up, while others can be run periodically. All this must be thought about once a reset has occurred: did it happen because of bad SW or bad HW?
All this is a lot to think about, but worth it, knowing that your wife, daughter, or mother's 911 call will go through when it needs to.
– Bill Murray
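Bill's two ideas, a self-test that can run periodically and a post-reset question of "bad SW or bad HW?", can be sketched roughly as follows. Everything named here is an assumption for illustration: the `RST_*` flag bits do not belong to any real chip, and the classification rule is only one plausible heuristic. The RAM check is a simple save/pattern/restore pass. On real hardware it would need interrupts masked around each word, and it does not catch address-line faults.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Non-destructive pattern test of one RAM region: each word is saved,
   exercised with alternating-bit patterns, then restored. */
bool ram_test(volatile uint32_t *base, size_t words)
{
    assert(base != NULL);

    for (size_t i = 0; i < words; i++) {
        uint32_t saved = base[i];

        base[i] = 0x55555555u;
        bool ok = (base[i] == 0x55555555u);

        base[i] = 0xAAAAAAAAu;
        ok = ok && (base[i] == 0xAAAAAAAAu);

        base[i] = saved;            /* restore before reporting */
        if (!ok) {
            return false;
        }
    }
    return true;
}

/* Hypothetical reset-cause flags, as a startup routine might read them
   from a status register. Real parts define their own bits. */
enum {
    RST_POWER_ON = 1u << 0,
    RST_WATCHDOG = 1u << 1,
    RST_BROWNOUT = 1u << 2
};

typedef enum { CAUSE_NORMAL, CAUSE_SUSPECT_SW, CAUSE_SUSPECT_HW } reset_cause_t;

/* Assumed heuristic: a failed self-test or a brownout points at hardware;
   a watchdog bite with healthy hardware points at software. */
reset_cause_t classify_reset(uint32_t flags, bool self_test_passed)
{
    if (!self_test_passed || (flags & RST_BROWNOUT)) {
        return CAUSE_SUSPECT_HW;
    }
    if (flags & RST_WATCHDOG) {
        return CAUSE_SUSPECT_SW;
    }
    return CAUSE_NORMAL;
}
```

Startup code could log the returned cause before clearing the flags, so that a pattern of watchdog resets shows up as a software problem rather than vanishing with each power cycle.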
I enjoyed reading “Just reset it” – of course, it is a very familiar story. I have a relevant recent experience:
I have a tri-band GSM cell phone from a very well-known manufacturer. It has numerous little bugs in the UI, but it basically works. However, I recently found a problem that only shows up when I'm in the US. Sometimes the phone goes into “loopy mode”. The display looks fine and the controls appear to work, but incoming calls get “busy” and outgoing calls just end up with a network error tone. Eventually, I found that a power cycle would restore normality. But why did this happen just in the US? Answer: because I would leave the phone switched on for days at a time [to span time zones] instead of turning it off at night as I would at home. I think I know what the bug in the software is now …
I also find that the batteries last less time in the US. The reason for this, I figured, is quite subtle. The US GSM networks function at a higher frequency than in Europe. I may be a humble software engineer, but my hardware knowledge tells me “higher frequency = higher power consumption”. QED.
– Colin Walls
My sister just bought a new house with a new gas water heater. She called me up and said there was no hot water, and they couldn't get someone out to look at it until the next day. The manual on the water heater said that if the front panel light was lit but not flashing, the controller was bad. I told her to unplug the AC power, wait 10 seconds and plug it back in.
The water heater now works fine again.
What's sad about this is that a simple hardware monostable triggered by the light circuit could reset the processor if the light stops flashing. There can't be more than a few hundred lines of code in the micro, so it's unlikely to have buggy code; more likely the hardware isn't properly hardened against power line noise and static.
– Tom Blandino
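Tom's fix is a hardware one, but the behavior of a retriggerable monostable is easy to model in a few lines. The sketch below is a software stand-in for that circuit and not a substitute for it, since code running on the same hung processor could not rescue itself. The names `pulse_monitor_t`, `monitor_init`, and `monitor_sample` are hypothetical. Each change in the light's level retriggers the timer, and if the level sits still for longer than the timeout, on or off, the model calls for a reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a missing-pulse detector watching a status light. */
typedef struct {
    uint32_t timeout_ms;    /* longest allowed gap between level changes */
    uint32_t last_edge_ms;  /* time of the most recent level change      */
    bool     last_level;    /* level seen at the previous sample         */
} pulse_monitor_t;

void monitor_init(pulse_monitor_t *m, uint32_t timeout_ms,
                  uint32_t now_ms, bool level)
{
    assert(m != NULL);
    m->timeout_ms   = timeout_ms;
    m->last_edge_ms = now_ms;
    m->last_level   = level;
}

/* Feed one sample of the light. Returns true when the light has stopped
   changing for more than timeout_ms, i.e. when reset should be asserted.
   Unsigned subtraction keeps the comparison valid across timer wraparound. */
bool monitor_sample(pulse_monitor_t *m, uint32_t now_ms, bool level)
{
    assert(m != NULL);

    if (level != m->last_level) {   /* an edge retriggers the timer */
        m->last_level   = level;
        m->last_edge_ms = now_ms;
    }
    return (uint32_t)(now_ms - m->last_edge_ms) > m->timeout_ms;
}
```

This matches the symptom in the water heater manual: a light that is lit but not flashing produces no edges, so the timeout expires just as it would for a light that is stuck off.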
Yes, I too have the cell phone that after a few days appears to be working, as the clock is being displayed, but no other functions work. Remove the battery and do a reset.
High efficiency gas furnace also got into a loop that wouldn't ignite the gas. Seems that if the pressure drops at the exact instant the electronic pilot is trying to light the burner, the firmware goes into error mode, and blinks a light. No heat until you cycle the power on the gas furnace.
Too many designers assume a reset/power cycle is perfectly ok to recover.
Today I took some empty pop bottles back for refund. The machine said out of paper, contact manager. Manager opens door, pushes some buttons, and the machine goes through an obvious reset sequence, blinking each light in sequence, and printing out the manufacturer's blurb on the paper. But I don't get my credit slip for the returned pop bottles.
I don't want my car to be reset on the highway, nor my plane to be reset in the air, nor my pacemaker or IV drip or …
Will my food spoil if my refrigerator needs a reset? Will I get radiation from the microwave if it needs a reset?
We need to better train the code writers, or cause them to be licenced, something to remove such idiotic stupidity from the products being developed.
– Paul Burega
I have another real life reboot story. My wife and I both have Sprint PCS phones and call each other a lot during the day since we have a plan where we can do that for free. After a number of days of not being able to reach me – even though we both had a full strength signal – she called Sprint for help. Their suggestion was to turn off my phone (I typically left it on 24 x 7), take out the battery, wait 5 minutes, insert battery and reboot. It worked.
The customer support person also suggested that I turn off my phone every night. The reason was something like “that's required so the system gets a chance to reset itself” or some such mumbo-jumbo. So, like every other reboot-trained monkey on the planet, I do as I'm told and reboot my phone every day. Who would have thought that Windows 2000 (and XP) would have to be rebooted MUCH less often (once a month if that) than a cell phone? And this is with a “new” phone. It was about 6 months old when we had the trouble and it's worked flawlessly since its daily reboot.
– Dan Miner