System reliability: Do the numbers
I was talking to a group of 80 firmware engineers recently and asked "How much time do you allot for debouncing switches?" The answers were all over the map. But my follow-up, "And where did that number come from?", elicited a universal shrug. Debounce milliseconds seem to be like a mantra, handed down from some revered sensei.
It turns out that all of the proffered answers were wrong. Wrong, simply because they were based on hearsay rather than engineering. Curious about this, some years ago I tested dozens of switches and found bounce times varying from hundreds of microseconds to hundreds of milliseconds. Bounce time is highly dependent on the switch selected. Some exhibit horrifying behaviors that should (but usually don't) greatly affect hardware design. (A summary of my results is here.)
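To make the point concrete, here is a minimal sketch in C of a debouncer whose timing is derived from a measurement rather than from folklore. The 6200 microsecond worst-case bounce and the 1 ms polling tick are made-up placeholders, and the names debouncer_t and debounce_update are hypothetical; substitute whatever your scope shows for your switch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical numbers: replace with values measured on the actual switch. */
#define MEASURED_MAX_BOUNCE_US 6200u  /* assumed worst-case bounce seen on a scope */
#define SAMPLE_PERIOD_US       1000u  /* assumed period of the tick that polls the pin */
#define SAFETY_MARGIN_SAMPLES  2u     /* assumed extra margin beyond the measurement */

/* Consecutive differing samples required before a change is accepted:
 * the measured bounce time rounded up to whole samples, plus the margin. */
#define STABLE_SAMPLES \
    ((MEASURED_MAX_BOUNCE_US + SAMPLE_PERIOD_US - 1u) / SAMPLE_PERIOD_US \
     + SAFETY_MARGIN_SAMPLES)

typedef struct {
    bool     stable_state;  /* last accepted (debounced) state */
    uint32_t count;         /* consecutive samples that differ from stable_state */
} debouncer_t;

/* Call once per sample period with the raw pin level.
 * Returns the debounced state. */
bool debounce_update(debouncer_t *d, bool raw)
{
    if (raw == d->stable_state) {
        d->count = 0u;                      /* a bounce back restarts the count */
    } else if (++d->count >= STABLE_SAMPLES) {
        d->stable_state = raw;              /* stable long enough: accept it */
        d->count = 0u;
    }
    return d->stable_state;
}
```

With these placeholder values a change is accepted only after nine consecutive identical samples. The useful part is not the algorithm, which is one of many, but that the threshold traces back to a number someone actually measured.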
"How much time is your system idle?" "What percentage of flash are you using?"
If you can't answer these then you have no idea what maintenance will be like. Being down to 10 free bytes of flash or running at 99.9% CPU utilization means product support will be agonizing, or even impossible. I advocate for keeping a running set of numbers from the very beginning of the project, so one can anticipate a crisis before disaster strikes.
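One way to keep that running set of numbers is sketched below in C. It assumes you can get the flash size in use from the linker map and that the idle loop increments a counter you have calibrated on an unloaded system; the structure, function names, and thresholds are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of resource usage, logged with every build or test run. */
typedef struct {
    uint32_t flash_used_bytes;     /* assumed to come from the linker map */
    uint32_t flash_total_bytes;    /* size of the part's flash */
    uint32_t idle_count;           /* idle-loop iterations in the last window */
    uint32_t idle_count_unloaded;  /* same window, calibrated with no load */
} resource_snapshot_t;

/* Percentage of flash in use, rounded down. */
uint32_t flash_used_percent(const resource_snapshot_t *s)
{
    if (s->flash_total_bytes == 0u) {
        return 100u;  /* treat an invalid size as "full" */
    }
    return (uint32_t)(((uint64_t)s->flash_used_bytes * 100u) / s->flash_total_bytes);
}

/* Percentage of CPU time spent outside the idle loop. */
uint32_t cpu_busy_percent(const resource_snapshot_t *s)
{
    if (s->idle_count_unloaded == 0u) {
        return 100u;  /* no calibration: assume the worst */
    }
    if (s->idle_count >= s->idle_count_unloaded) {
        return 0u;    /* as idle as the unloaded system */
    }
    uint32_t idle_pct =
        (uint32_t)(((uint64_t)s->idle_count * 100u) / s->idle_count_unloaded);
    return 100u - idle_pct;
}

/* True while both numbers stay under the limits the team has agreed on. */
bool within_budget(const resource_snapshot_t *s,
                   uint32_t max_flash_percent, uint32_t max_cpu_percent)
{
    return flash_used_percent(s) <= max_flash_percent
        && cpu_busy_percent(s) <= max_cpu_percent;
}
```

Logging these two percentages at every build and failing the build when within_budget() returns false gives early warning, long before the last ten bytes are gone.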
"How long do each of your interrupt service routines (ISR) execute?"
Even on the most brain-dead simple projects with gobs of extra compute power these numbers are important. If you don't measure ISR behavior you'll never improve your ability to estimate real-time response. Does that code take a microsecond or a millisecond? The numbers will vary hugely depending on a lot of factors, but those who make a practice of taking this data will at least have a sense of what to expect; those who don't will be clueless.
Sometimes people complain that these questions are too hard to answer. An ISR might take a few microseconds, or much more than that, depending on what is going on in the system. Cache misses, pipeline stalls, and preemption by higher-priority interrupts will alter the results.
But that's exactly why we need these numbers, so we can bound the system's operation to ensure it will always behave correctly. And it's not hard to instrument the code to use an oscilloscope to get quite precise measurements. Heck, modern scopes don't even make you count boxes and multiply by the time base setting anymore. My Agilent will compute min, max and even the standard deviation of the displayed data.
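If no scope is handy, the same bookkeeping can be done in software. The following is a rough sketch in C, assuming the part has a free-running 32-bit counter you can read at ISR entry and exit; how that counter is read is hardware-specific and left out, and the names isr_stats_t, isr_stats_init, isr_stats_record, and isr_stats_mean are hypothetical. It tracks min, max, and mean only, not the standard deviation the scope provides.

```c
#include <stdint.h>

/* Running statistics for one ISR, in ticks of the (assumed) free-running counter. */
typedef struct {
    uint32_t min_ticks;
    uint32_t max_ticks;
    uint64_t total_ticks;
    uint32_t samples;
} isr_stats_t;

void isr_stats_init(isr_stats_t *s)
{
    s->min_ticks   = UINT32_MAX;
    s->max_ticks   = 0u;
    s->total_ticks = 0u;
    s->samples     = 0u;
}

/* Record one execution. start and end are counter readings taken at ISR
 * entry and exit. Unsigned subtraction gives the right answer even if the
 * counter wrapped once between the two readings. */
void isr_stats_record(isr_stats_t *s, uint32_t start, uint32_t end)
{
    uint32_t elapsed = end - start;

    if (elapsed < s->min_ticks) {
        s->min_ticks = elapsed;
    }
    if (elapsed > s->max_ticks) {
        s->max_ticks = elapsed;
    }
    s->total_ticks += elapsed;
    s->samples++;
}

/* Mean execution time in ticks, rounded down; 0 if nothing was recorded. */
uint32_t isr_stats_mean(const isr_stats_t *s)
{
    if (s->samples == 0u) {
        return 0u;
    }
    return (uint32_t)(s->total_ticks / s->samples);
}
```

The max is usually the number that matters for bounding real-time behavior. Keep in mind that the measurement itself adds a few cycles, and that preemption by a higher-priority interrupt shows up in the elapsed time, which is exactly the kind of surprise this is meant to expose.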
The firmware community doesn't seem to have the same focus on numbers shared by most other engineering disciplines. In electronics we carefully assess everything: is a 5% resistor OK here? What tempco should it be? Yeah, we need 10 uF, but we know an applied DC bias reduces an MLCC's effective capacitance, so what part should we select?
Some argue that cheap 32-bit parts mean we can just hope all of that power will be adequate for our real-time needs. But the truth is that most of these parts are quite complex. It's hard to tell, even from long sessions with the databook, how long a bus access might take, and that may vary considerably depending on what the chip is doing.
"But it works" is not good enough. If the system is right on the hairy edge of failure your in-house tests may all pass, but odds are something unexpected will cause a problem in the field.
This is engineering, not development by divine inspiration. Engineering is all about making a prediction, implementing a solution, and then measuring the result to compare it against the prediction.
What's your take? What do you measure?
Jack G. Ganssle is a lecturer and consultant on embedded development
issues. He conducts seminars on embedded systems and helps companies
with their embedded challenges, and works as an expert witness on
embedded issues. Contact him at firstname.lastname@example.org. His website is