Determinism & predictably reliable systems
Engineering is all about building predictably reliable systems. But most firmware engineers ignore the role of determinism in real-time systems. Few can answer questions like “how can you guarantee that the system won’t fail when stressed?”
Today’s hardware is often cursed with all sorts of nifty speed-enhancers like cache, pipelines, and speculative execution. All of these contribute to execution time uncertainty. The system’s performance can vary wildly depending on a lot of hard-to-predict events.
An interrupt may occur at any time, and its handler will typically evict at least part of the cache. Resuming the interrupted code then means rereading instructions from L2 or main memory, which can take a surprisingly long time. A system that runs fine but close to the edge may suddenly start missing its hard real-time deadlines.
Can you really guarantee the highest priority task will complete on time? What if there's a perfect storm of interrupts? Or of bus activity (DMA or having to yield the bus to another master)? In big systems a task may depend in very complex ways on externalities (other computers, systems, I/O) that aren't ready in time.
Preemptive multitasking is itself inherently non-deterministic, though techniques like rate-monotonic analysis can mitigate the problem. But RMA requires more analysis than most developers will ever do.
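To give a feel for what that analysis involves, here is a minimal sketch of response-time analysis, one standard schedulability test for fixed-priority preemptive tasks with rate-monotonic priorities. It assumes independent periodic tasks, deadlines equal to periods, no blocking on shared resources, and zero context-switch cost. The rm_task type and the function names are invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical task description; both fields use the same time unit. */
typedef struct {
    uint32_t wcet;    /* worst-case execution time, C */
    uint32_t period;  /* period, T (deadline assumed equal to period) */
} rm_task;

/*
 * Worst-case response time of tasks[i] under preemption by tasks[0..i-1].
 * tasks[] must be sorted by ascending period (highest priority first).
 * Iterates R = C_i + sum over j < i of ceil(R / T_j) * C_j to a fixed point.
 * Returns false if the response time would exceed the task's period.
 */
bool rm_response_time(const rm_task *tasks, size_t i, uint32_t *response)
{
    uint64_t r = tasks[i].wcet;

    for (;;) {
        uint64_t next = tasks[i].wcet;

        for (size_t j = 0; j < i; j++) {
            assert(tasks[j].period > 0);
            /* Number of times task j can be released within r (ceiling). */
            uint64_t releases = (r + tasks[j].period - 1) / tasks[j].period;
            next += releases * tasks[j].wcet;
        }
        if (next > tasks[i].period) {
            return false;               /* deadline would be missed */
        }
        if (next == r) {
            *response = (uint32_t)r;    /* fixed point reached */
            return true;
        }
        r = next;
    }
}

/* True only if every task in the set meets its deadline. */
bool rm_schedulable(const rm_task *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t unused;
        if (!rm_response_time(tasks, i, &unused)) {
            return false;
        }
    }
    return true;
}
```

For a made-up set with (C, T) of (1, 4), (2, 6), and (3, 12), the iteration settles at worst-case response times of 1, 3, and 10, so all three deadlines are met. The arithmetic is the easy part; the result is only as good as the worst-case execution times fed into it.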
Even extremely simple systems that have none of these speed-enhancing features can suffer from serious timing problems. A little bit of C code that looks quite deterministic probably makes calls to the black hole that is the runtime library, which is generally uncharacterized (in the time domain) by the vendor. Does that call take a microsecond or a week? No one knows.
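When the vendor gives no numbers, one partial remedy is to measure. The sketch below wraps a call in a timing harness and records the shortest and longest times observed. It is an illustration only: tick_fn stands in for whatever free-running timer the target provides, and sim_now and sim_library_call are a simulated timer and a stand-in "library call" (usually 10 ticks, every fourth call 50) so the example runs on a host. All of these names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*tick_fn)(void);    /* reads a free-running timer */
typedef void (*probe_fn)(void *ctx);  /* the call being characterized */

typedef struct {
    uint32_t min_ticks;
    uint32_t max_ticks;
    uint32_t runs;
} timing_stats;

/* Time `runs` invocations of fn and record the observed extremes. */
timing_stats measure_call(tick_fn now, probe_fn fn, void *ctx, uint32_t runs)
{
    assert(now != NULL && fn != NULL);
    timing_stats s = { UINT32_MAX, 0, runs };

    for (uint32_t i = 0; i < runs; i++) {
        uint32_t start = now();
        fn(ctx);
        /* Unsigned subtraction stays correct across one timer wraparound. */
        uint32_t elapsed = now() - start;

        if (elapsed < s.min_ticks) s.min_ticks = elapsed;
        if (elapsed > s.max_ticks) s.max_ticks = elapsed;
    }
    return s;
}

/* --- Simulated timer and workload, for running the sketch on a host --- */

uint32_t sim_ticks;

uint32_t sim_now(void)
{
    return sim_ticks;
}

/* Stand-in for a library call with a rare slow path. */
void sim_library_call(void *ctx)
{
    uint32_t *calls = ctx;
    (*calls)++;
    sim_ticks += (*calls % 4 == 0) ? 50u : 10u;
}
```

Run eight times against the simulated call, the harness reports a minimum of 10 ticks and a maximum of 50. Run only three times, it reports a maximum of 10, because the slow path never fires. That is the limitation of measurement: it yields the worst case you happened to see, not a guaranteed bound.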
It’s my belief that too many systems “work” due only to divine intervention. Developers chase down the usual procedural bugs and then breathe a sigh of relief that, once again, a miracle has occurred.
But all too often that gift from heaven is merely a reprieve, an indulgence, with damnation still possible or even likely when the system experiences unexpected stresses. Or when luck runs out and interrupts bunch up.
Unlike most other engineered systems, our real-time devices don’t have fuses that blow when something goes wrong. Instead of shutting down in a controlled way or falling back to a less-capable mode, the firmware collapses completely and unpredictably.
What do you do to convince yourself (at least) that the system will be reliable in the time domain?
Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. Contact him at firstname.lastname@example.org. His website is www.ganssle.com.