Understand your users' needs
Call me Ishmael. I'm writing this in the mid-Atlantic, bound from Baltimore to Grand Turk aboard Voyager, my 32-foot sailboat. Like Melville, I find relief from the pressures of modern life by chasing adventure at sea.
Voyager, though 28 years old, is nevertheless a child of the microprocessor revolution. Over the years I've added a lot of electronics to make sailing easier and safer; each addition brings yet one more embedded system aboard. With the exception of the beautifully designed digital VHF radio, every microprocessor-based product on the boat suffers from one or more design defects that erode the equipment's usefulness just a bit.
The environment on a small boat far offshore couldn't be worse for electrical items. Salt spray and high humidity insidiously find their way into even the best enclosures, rapidly corroding every soldered connection. Switch contacts are the first to go, followed by connectors and the traces on PC boards. A partial solution is to use only the finest gold contacts, an obvious approach all too few vendors employ.
While the marine environment is perhaps a bit extreme, every system we build is subject to some level of mechanical and electrical failure. Even in the most benign laboratory conditions contacts get dirty. It makes sense to design code that will work in at least some fashion if, say, a switch fails. In situations where failures are likely or inevitable, a wise designer will devise software solutions, even if the system cannot continue to run with complete functionality.
Given that dirty or corroded contacts are a perennial source of trouble, firmware should always check input bits for validity. Obviously, if only a single switch should ever be pressed at a time, then by all means don't accept conditions where several are asserted. But be kind to your users and take some reasonable action when faced with unreasonable inputs. Can the system ignore the extra bit and carry on? If the software sees a switch continuously pressed, it may make sense to assume there's a doofus user or failed hardware.
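A minimal sketch of such a validity check, in C. Everything here is an assumption made for illustration: up to eight switches packed one per bit, a periodic scan, and a hypothetical limit of 500 scans (about five seconds at a 10 ms scan rate) before a held switch is treated as stuck. None of it comes from any particular product.

```c
#include <stdint.h>

#define STUCK_LIMIT_TICKS 500u  /* assumed: about 5 s at a 10 ms scan rate */

typedef struct {
    uint8_t  last_bits;   /* raw pattern seen on the previous scan */
    uint16_t held_ticks;  /* consecutive scans repeating that non-zero pattern */
} switch_filter_t;

/* Call once per scan with the raw switch bits (1 = pressed).
 * Returns the index (0-7) of the single validly pressed switch, or -1
 * when nothing is pressed or the pattern is implausible: several
 * switches at once, or one switch held far longer than a user would. */
int switch_filter_update(switch_filter_t *f, uint8_t raw_bits)
{
    if (raw_bits != 0 && raw_bits == f->last_bits) {
        if (f->held_ticks < UINT16_MAX) {
            f->held_ticks++;
        }
    } else {
        f->held_ticks = 0;
    }
    f->last_bits = raw_bits;

    if (raw_bits == 0) {
        return -1;                       /* nothing pressed */
    }
    if ((raw_bits & (raw_bits - 1u)) != 0) {
        return -1;                       /* more than one bit set */
    }
    if (f->held_ticks >= STUCK_LIMIT_TICKS) {
        return -1;                       /* probably shorted or stuck */
    }

    int index = 0;
    while ((raw_bits >> index) != 1u) {
        index++;
    }
    return index;
}
```

The caller carries on with whatever it was doing when the function returns -1, so a corroded contact costs the user one switch rather than the whole product. A fuller version might also raise a diagnostic when the stuck condition is seen.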
Many systems use a debouncing algorithm that will loop forever if an input is shorted. Don't let a simple failure shut down the entire product!
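One way to avoid that trap is a debouncer that never waits. The sketch below is a counter-based version in C, called once per timer tick; the five-sample threshold and the names are assumptions chosen for illustration. Because the routine returns on every call, a shorted input simply reads as "pressed" and the rest of the firmware keeps running.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEBOUNCE_TICKS 5u   /* assumed: samples needed to accept a change */

typedef struct {
    bool    stable;  /* last accepted (debounced) state */
    uint8_t count;   /* consecutive samples that disagree with 'stable' */
} debounce_t;

/* Call once per tick with the raw input. Never blocks: the debounced
 * state changes only after DEBOUNCE_TICKS consecutive differing samples. */
bool debounce_update(debounce_t *d, bool raw)
{
    if (raw == d->stable) {
        d->count = 0;                 /* agreement: forget any bounce */
    } else if (++d->count >= DEBOUNCE_TICKS) {
        d->stable = raw;              /* change has persisted: accept it */
        d->count = 0;
    }
    return d->stable;
}
```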
For example, on this trip Voyager's digital autopilot went insane. God knows how long we went in circles till I woke up and realized there was a problem. After an entire day of tracing the circuit and looking for the source of the trouble, I found that the front-panel switches were wired in parallel with an unused external connector, all of which were scanned by the unit's 8051. These course-setting switches are used one at a time (never should more than one be pressed), and a user would never hold a switch in for more than a few seconds. The relentless sea found its way into the O-ring sealed computer module and created a high-impedance short between scan lines. The code was too simple-minded to reject the impossible signals it received and bizarrely steered us in circles.
Such poor code is inexcusable, as autopilots are famous for suffering corrosion problems. Wealthier sailors usually carry three or four units in the hope that one will survive a trip. Smarter software could help keep customers a lot happier. The engineering costs will be a bit higher, but the extra software costs nothing in production if ROM space is available. If it isn't, the company must weigh the cost of unhappy customers against a microcontroller with more program space.
Still, the designers had addressed a similar problem, though perhaps more to satisfy their own internal production requirements than to deal with frantic mid-ocean repairs. While troubleshooting the scan-line short I disassembled the unit, removed the circuit board, and clipped power to the computer so I could trace out problems with a voltmeter. Unfortunately, with the board removed an important rotary switch could not be connected into the system. I feared that the firmware, finding no rotary-switch input (another "impossible" condition), would go haywire and make tabletop diagnosis difficult. In fact it worked even without this input, indicating that the designers realized that during repair the unit's mechanical construction was such that no input could be expected. The code must have assumed some reasonable default value instead of looping for an input.
Sure, in real life most embedded systems don't have to run partially dismantled. But always remember that during production test and repair (to say nothing of field repair) your carefully engineered package might be violated. Technicians will run the system with boards hanging out and connectors dangling. If the system runs when opened up, they'll have a much easier time probing with scopes and meters to find faults.
Can your system run with important cables removed? What happens if a cable isn't connected when power is applied? If the code won't run without a cable, a technician might have to build extension wiring harnesses just to gain access to the circuit boards. Certainly no one in the field will have these harnesses. Where possible, make sure the code continues to run in some fashion with some or all of the cables removed.
I've written extensively about software diagnostics in the past. In the case of our autopilot, an off-course alarm beeped incessantly till I woke up and realized something was wrong. The unit gave no help in figuring out just what the problem was, though a trivial amount of code could have produced beep codes indicating which switches seemed to be on. As it was, I spent most of a day isolating the problem, not much fun in heavy seas.
With no feedback from the microcontroller, it's awfully hard to differentiate between switch, electronics, actuator, or flux-gate compass failures. Why not use an LED to blink error codes? Your Ford has such a self-test mode: short two wires together and it will produce a two-digit code indicating what sort of failures are where. This is embedded systems programming with style!
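As a rough illustration of how little code a blink code takes, here is a C sketch that turns a two-digit fault code into a flash pattern. The encoding is an assumption for this example (tens digit as flashes, a pause, then the ones digit as flashes, with digits 1 through 9 only); it is not the scheme any specific vendor uses. The pattern is written as text so it can be checked on a desk; real firmware would walk the string and drive the LED or beeper.

```c
#include <stddef.h>

/* Builds the blink pattern for a two-digit fault code whose digits are
 * both 1-9. '*' is one flash, ' ' is the longer pause between digits.
 * Returns the pattern length, or 0 (and an empty string) if the code
 * cannot be shown this way or the buffer is too small. */
size_t blink_pattern(unsigned code, char *out, size_t out_size)
{
    unsigned tens = code / 10u;
    unsigned ones = code % 10u;
    size_t needed = (size_t)tens + 1u + (size_t)ones;
    size_t n = 0;
    unsigned i;

    if (code > 99u || tens == 0u || ones == 0u || out_size < needed + 1u) {
        if (out_size > 0) {
            out[0] = '\0';
        }
        return 0;
    }
    for (i = 0; i < tens; i++) {
        out[n++] = '*';
    }
    out[n++] = ' ';
    for (i = 0; i < ones; i++) {
        out[n++] = '*';
    }
    out[n] = '\0';
    return n;
}
```

With this encoding, code 23 becomes two flashes, a pause, and three flashes.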
Sure, sometimes embedded systems are essentially disposable in the event of failure. Mission-critical applications, though, must be repairable and demand firmware that helps the user even when things fail. It's important to sit in your customer's shoes when deciding what is truly mission critical. If we couldn't fix Voyager's autopilot the two of us aboard would have had to steer, 24 hours a day, for almost two weeks! That's too much like working.
Similarly, never assume that the software is entirely glitch-free. Yes, even your meticulously maintained and painfully debugged code could very well harbor a latent problem. Even small embedded systems are now getting frightfully complicated. When programs were 4KB long it was reasonable to demand bug-free code. Today's multi-megaline systems will always have some lurking bugs.
It would be nice to write code that can survive any sort of software bug but surely this is impossible. However, with a little forethought you can usually craft firmware that, by its design, is robust enough to handle many sorts of faults.
Always write exception handlers. You might not expect a divide overflow or a spurious interrupt, but strange stuff does happen. The unexpected turns out to be more likely than you would think. Test those routines carefully. I'm giving a talk at the Boston Embedded Systems Conference this month about embedded systems disasters, and barely-tested exception handlers that don't quite work right are at least partially responsible for over half of the examples I'll discuss.
If the error is one for which there's no decent recovery strategy— such as a memory error— it might make sense to report the problem and at least restart the code. Any sort of service is better than a dramatic crash.
Fill unused ROM and even RAM locations with a single-byte opcode that traps to a particular address, and then put a handler there. For instance, the Z80/64180 goes to location 0x38 when executing the RST7 (FF) opcode; the 80x88 picks up a vector at 0x0C after executing INT3 (CC). The handler should try to recover gracefully, perhaps by reentering the program's main loop or even by restarting the code. This approach gives the code a prayer of recovering despite momentary hardware or software glitches that make the firmware wander off. Wandering code will likely wind up in the middle of data or even in the middle of a multibyte opcode. There's not much we can do about this, but filling ROM and RAM with a one-byte trap will improve the recovery odds quite a bit.
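The fill itself is usually done by the linker or by a small host-side tool that pads the ROM image before it is burned. A minimal C sketch of the padding step follows, assuming a Z80-style target where 0xFF is the one-byte RST to 0x38; the function name and the idea of a flat in-memory image are illustrative assumptions, not any toolchain's interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRAP_OPCODE 0xFFu   /* Z80 RST 38h; use 0xCC (INT 3) for an x86 target */

/* Pads everything after the first 'used' bytes of a ROM image with the
 * one-byte trap opcode, so stray execution lands in the trap handler. */
void fill_unused_rom(uint8_t *image, size_t image_size, size_t used)
{
    if (used < image_size) {
        memset(image + used, TRAP_OPCODE, image_size - used);
    }
}
```

Many linkers offer a fill-value option that achieves the same thing without a separate tool. Conveniently, 0xFF is also the erased state of most EPROM and flash parts, so on a Z80 the unused space often traps already.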
Be sure you can disable this extra-robust code during debugging. You don't want these routines to mask real problems. Use a conditional compile or runtime switch to vector error conditions to a breakpoint.
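A sketch of how the two switches might be combined in C. The macro name DEBUG_BUILD, the flag, and the enum are all hypothetical; the point is only that one function decides whether a trapped fault halts for the debugger or goes to the recovery path.

```c
typedef enum {
    FAULT_HALT_FOR_DEBUGGER,  /* stop at a breakpoint so the bug is visible */
    FAULT_RESTART             /* production behavior: try to recover */
} fault_action_t;

/* Runtime switch (assumed): test code or a service menu can clear this
 * to expose faults even in a release build. */
static volatile int fault_recovery_enabled = 1;

/* Called by the trap and exception handlers to decide what to do. */
fault_action_t fault_policy(void)
{
#ifdef DEBUG_BUILD
    return FAULT_HALT_FOR_DEBUGGER;   /* never mask problems while debugging */
#else
    return fault_recovery_enabled ? FAULT_RESTART : FAULT_HALT_FOR_DEBUGGER;
#endif
}
```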
Similarly, during debugging always set your emulator, simulator, or whatever to break on any access to unused locations. Otherwise, how can you be sure the code isn't banging on locations it shouldn't be? This is always a sign of a latent problem. I often hear from folks whose software runs fine from system ROM but not from emulator RAM, a sure sign of rogue code writing into code space. Over the last few years half of the systems I've examined do spurious reads and writes, sure signs of a latent bug.
Really complex loops always hold potential for locking up a system. The world is indeed growing ever more complex, and our embedded systems reflect this. Some equipment solves tortuously difficult series of equations before producing a result. For instance, sometimes we use iterative instead of deterministic algorithms to reduce matrices or converge a series. Newton's method involves solving the same equation repeatedly using the answer from step n as the input to step n+1, continuing until the errors are below some arbitrary value. What if the input data is such that a solution cannot be found within specified precision? Sometimes iterative solutions can actually start to diverge, rather than converge, making a solution impossible. Iterative algorithms are fine as long as the software is smart enough to detect that a solution is unlikely, and then give the user some options. Locking up into an infinite loop is always unacceptable.
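Here is a minimal C sketch of Newton's method with both escape hatches: a hard cap on iterations and a crude divergence test. The cap of 50, the "five growing steps in a row" heuristic, and the two sample functions are assumptions picked for illustration; a real product would tune them to its own mathematics.

```c
typedef double (*fn_t)(double);

typedef enum { SOLVE_OK, SOLVE_NO_CONVERGENCE, SOLVE_DIVERGED } solve_status_t;

#define MAX_ITERATIONS    50  /* assumed hard cap: the loop always ends */
#define MAX_GROWING_STEPS 5   /* assumed: this many growing steps = diverging */

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Newton's method: x(n+1) = x(n) - f(x(n)) / f'(x(n)).
 * On SOLVE_OK the answer is stored in *root; otherwise *root is left
 * untouched and the caller decides what to tell the user. */
solve_status_t newton_solve(fn_t f, fn_t df, double x0, double tol, double *root)
{
    double x = x0;
    double prev_step = -1.0;   /* negative means "no previous step yet" */
    int growing = 0;
    int i;

    for (i = 0; i < MAX_ITERATIONS; i++) {
        double slope = df(x);
        double delta, step;

        if (slope == 0.0) {
            return SOLVE_NO_CONVERGENCE;   /* cannot take a Newton step */
        }
        delta = f(x) / slope;
        step = abs_d(delta);
        x -= delta;

        if (step < tol) {
            *root = x;
            return SOLVE_OK;
        }
        if (prev_step >= 0.0 && step > prev_step) {
            if (++growing >= MAX_GROWING_STEPS) {
                return SOLVE_DIVERGED;     /* corrections keep getting bigger */
            }
        } else {
            growing = 0;
        }
        prev_step = step;
    }
    return SOLVE_NO_CONVERGENCE;
}

/* Sample problems for trying the solver (illustrative only). */
double f_sqrt2(double x)  { return x * x - 2.0; }       /* root at sqrt(2) */
double df_sqrt2(double x) { return 2.0 * x; }
double f_flat(double x)   { return x / (1.0 + x * x); } /* runs away from x0 = 2 */
double df_flat(double x)  { double d = 1.0 + x * x; return (1.0 - x * x) / (d * d); }
```

Starting at 1.0, f_sqrt2 converges in a handful of iterations. Starting f_flat at 2.0 sends each estimate roughly twice as far out as the last, and the solver reports SOLVE_DIVERGED instead of spinning.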
On this voyage our GPS hung several times trying to reduce crummy data from weak signals or marginal satellite geometry. Worse, even the software-controlled power switch wouldn't work when stuck in this loop. The designers left no option but to remove the unit's batteries, wait 30 minutes(!), and then restart it from scratch. Of course, after a half hour without batteries we had to reload dozens of setup parameters. Ironically, the restart required us to figure our position with the centuries old method of celestial navigation and preload that position into the GPS. A much better design would make the iterative loop read the keypad and exit when a key is pressed.
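A sketch of that idea in C, with the keypad and the solver passed in as function pointers so the loop can be exercised on a desk. All names are hypothetical, and the stand-in functions at the bottom exist only to demonstrate the loop; this is in no way the GPS vendor's code.

```c
#include <stdbool.h>

typedef bool (*step_fn_t)(void *state);  /* one solver pass; true when solved */
typedef bool (*key_fn_t)(void);          /* true if the user pressed a key */

typedef enum { FIX_DONE, FIX_CANCELLED, FIX_GAVE_UP } fix_status_t;

/* Runs solver passes until solved, the user presses a key, or the pass
 * budget runs out. The keypad is polled before every pass, so the user
 * is never more than one pass away from regaining control. */
fix_status_t solve_with_escape(step_fn_t step, void *state,
                               key_fn_t key_pressed, unsigned long max_passes)
{
    unsigned long pass;

    for (pass = 0; pass < max_passes; pass++) {
        if (key_pressed()) {
            return FIX_CANCELLED;
        }
        if (step(state)) {
            return FIX_DONE;
        }
    }
    return FIX_GAVE_UP;
}

/* Stand-ins for demonstration only. */
bool step_never_converges(void *state) { (void)state; return false; }
bool step_counts_down(void *state) { int *left = state; return --(*left) <= 0; }
int  polls_before_key = 3;
bool fake_key_pressed(void) { return polls_before_key-- <= 0; }
bool no_key(void) { return false; }
```

Polling only works if each pass is short. If a single pass can take seconds, the poll belongs inside the pass as well.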
An even better approach might have been to use a real-time operating system, with one task always reading keys in the background. An operating system that runs some sort of keypad task will inherently prevent well-behaved code from getting into unbreakable infinite loops.
Far too many years ago I worked on an 8008-based instrument that used a Gauss-Seidel iteration to produce an answer. We programmed it to escape the loop if the iteration proceeded for 20 minutes without a solution (computers were a lot slower then). In this case seven-segment LEDs displayed "HELP" to let the user know no solution was possible. Years passed and the code was made obsolete by an algorithm that converged quickly, every time. Memories of the earlier version faded. One day an ashen-faced technician came to me and explained that he was repairing a very old unit. While he was fiddling with it, it started flashing "HELP HELP," confirming his long-held belief in the supernatural.
Never, never shut the user down. He bought your product to do something. Try to keep the widget at least partially operational no matter what might go wrong.
Embedded systems often quietly compute in the background, day in and day out. You might be willing to re-setup a lab instrument if a power outage caused the unit to reset, but this just is not acceptable in a lot of other applications. I often wonder why we put up with resetting every digital clock in the house after even a one-second power failure— in this day and age there is no technical reason why they shouldn't keep track of time for at least a few minutes.
With the power grid getting ever more overloaded we must expect line-powered equipment to have to deal with regular power shortages. While it might be unreasonable to expect an embedded system to continue operating without power, I do feel that some equipment should at least reset to a reasonable mode when power is re-applied. For example, a remote data-acquisition site should start acquiring data as soon as power is restored, rather than enter some sort of setup mode. No user may be available to press the "start" key.
Can your critical equipment come back up without human intervention? If this is an important design criterion, be sure the code recognizes that the unit was at one point alive. If important variables are protected in flash or battery-backed RAM, in most cases it's easy to resume operation automatically. Be sure to maintain a checksum of the really important parameters so the code knows if the machine's data is intact.
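A minimal C sketch of the bookkeeping, assuming a small parameter block kept in battery-backed RAM or flash. The field names, the magic value, and the simple inverted byte-sum are illustrative assumptions; a CRC would catch more kinds of corruption than a sum does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PARAMS_MAGIC 0xA5u   /* assumed marker: "this block was written by us" */

typedef struct {
    uint8_t  magic;        /* PARAMS_MAGIC once the block has been saved */
    uint8_t  was_running;  /* nonzero if the unit was active before power loss */
    uint16_t course_deg;   /* example parameter worth preserving */
    uint16_t checksum;     /* covers every byte before this field */
} saved_params_t;

/* Inverted 16-bit sum of the bytes preceding the checksum field. The
 * inversion keeps an all-zero block from looking valid. */
static uint16_t params_checksum(const saved_params_t *p)
{
    const uint8_t *bytes = (const uint8_t *)p;
    uint16_t sum = 0;
    size_t i;

    for (i = 0; i < offsetof(saved_params_t, checksum); i++) {
        sum = (uint16_t)(sum + bytes[i]);
    }
    return (uint16_t)~sum;
}

/* Call after changing any parameter, before power can disappear. */
void params_seal(saved_params_t *p)
{
    p->magic = PARAMS_MAGIC;
    p->checksum = params_checksum(p);
}

/* True if the block looks like something this firmware wrote intact. */
bool params_valid(const saved_params_t *p)
{
    return p->magic == PARAMS_MAGIC && p->checksum == params_checksum(p);
}

/* At reset: resume without a human only if the data is trustworthy
 * and the unit had actually been running. */
bool should_auto_resume(const saved_params_t *p)
{
    return params_valid(p) && p->was_running != 0;
}
```

If should_auto_resume() returns false, the firmware falls back to its ordinary cold-start defaults rather than acting on garbage.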
On our voyage we run all of the boat's equipment from a pair of 12V batteries, recharged daily by the alternator on the diesel engine. If we're not careful to switch a full battery online before cranking the engine, the tremendous amount of current needed by the starter motor drags the entire 12V-system down to 8V or so, which resets every piece of electronics with an embedded computer. None of the equipment is smart enough to carry on without our help. We're forced to reenter a course into the autopilot, restart the radar, and so on. After a brownout (a not unusual condition on a cruising sailboat) no piece of embedded electronics is smart enough to remember that it had been on.
An old business adage advises one to "stick to one's knitting" — develop and sell products to markets you truly understand. If you don't deeply understand what your user expects and haven't got lots of experience operating in the industry, you can't make a product that will really satisfy your customer. Make sure your widget is designed to satisfy the user's real needs, in the real gritty world, not just under lab conditions.
But wait! The white whale's to windward! No more of this dull plodding— helm alee!
Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. Contact him at firstname.lastname@example.org.
I'll bet you enjoyed the Prologue of Tracy Kidder's "The Soul of a New Machine" (A Good Man in a Storm)!
- Grant Beattie
"While the marine environment is perhaps a bit extreme, every system we build is subject to some level of mechanical and electrical failure."
If you think sailboats are bad, try designing equipment for a coal mine. The machines are literally beating themselves against rocks, the dust suspended in the air is so thick that the air looks like a solid black wall, and there is caustic water everywhere.
The "solution" to environments such as coal dust or salt spray, usually proposed by management, is inevitably conformal coating. This is a very common misconception.
Since conformal coating is not a hermetic seal, what really happens is that the impurities in the water are kept away from the circuit, but the water itself reaches the traces. Since the water is now fairly devoid of contaminants, it acts more like a dielectric insulator. You never notice in a digital circuit, but unless debugging is an obsession, don't let it get near an RF tuning circuit or a high-impedance sensor circuit.
"If the error is one for which there's no decent recovery strategy, such as a memory error, it might make sense to report the problem and at least restart the code."
Restarting the code is not always the best option either. Take for example a system that has a Maintenance Mode and an Operator Mode. Maintenance Mode lets you do things that are generally unsafe, but required to do maintenance. Starting up in Maintenance Mode just because you crashed in that mode might be the wrong thing to do. Starting up in Operator Mode could be just as dangerous if the machine was in some odd half-done maintenance state when it crashed.
A clearer example is an elevator. In some places elevators are required to go to the lowest level when there is a fire alarm. Makes sense because smoke rises. However, what happens if the fire was started by a short circuit caused by a flood at the lowest level? The person in the elevator going down drowns rather than dies from the smoke going up. Figuring out a "Fail Safe" state is not always clear cut, unfortunately.
"Given that dirty or corroded contacts are a perennial source of trouble, firmware should always check input bits for validity."
Continuing with our elevator, in one of the Kentucky coal mines someone designed an elevator control with two buttons. One said "Run"; the other was labeled "Up/Down". The mine operator called the service section saying that the elevator would only go in the "Up" direction. The service person told the elevator operator several things to try. To which the elevator operator replied in his Kentucky coal miner voice, "There ain't no more up!" The hoist had already wound all of the cable around the winch drum. The problem was found to be someone trying to conserve micro pins. Rather than one input for "Up" and one input for "Down", there was a single "Up/Down" input: low meaning "Down", high meaning "Up". So a failure in any part of the wiring would only let the elevator go "Up", due to the CPU pull-up. They should have used two different, active-low inputs, at least.
"An even better approach might have been to use a real-time operating system."
That assumes the RTOS is bug free. I recall someone doing a Z80/64180 RTOS based system, that perhaps you remember, long ago that had a particularly hard to find bug. When I finally diagnosed and reported the bug on the BBS of the day, the reply was "it was an esoteric problem". Apparently bugs that only happen during Blue Moons are fine to leave around. :-(
"Resetting every digital clock in the house after even a one-second power failure; in this day and age there is no technical reason why they shouldn't keep track of time for at least a few minutes."
Comes down to simple cost. Management won't pay that penny extra for the required part(s). You see, the Bill of Material price is something that they can measure; it is tangible, and it is easy to do. They do not understand, in any company that I've worked for at least, that it is the System Cost that is important, not the cost of the parts. Maybe using an eleven-dollar part is better than using a three-cent part if it saves five hours of assembly time per unit, or eighteen weeks of design time, but that is hard to measure, so they don't... :-(
One last thought to ponder:
There are criminal penalties for code that injures someone in the Coal Mining industry in Austral. Is Bug=Jail going to be what it takes for people to write bug free code?
- Bob Paddock