Software is strange stuff. If a newly-constructed bridge is fine with heavily-loaded trucks going across, odds are pretty good that little Sally on her bicycle won’t cause it to collapse. If the bridge were made of code it might be fine with a dozen 18-wheelers but fail when six of them try to cross. Or even at some random time when completely unloaded.
Most firmware is tremendously fragile. A single failure – a null pointer dereference, a divide by zero – can cause the entire system to crash. Before a bridge fails, generally cracks will start near where the loads are highest. Not so with code; that ‘divide by zero’ might be followed by millions of instructions before the system locks up or exhibits some other unwanted behavior.
There are good reasons for this. Computation is inherently fragile. Think of how many things have to go right for any decent-sized program to run! The hardware has to be almost perfect, as one bad fetch, one out of millions of bits of memory getting dropped, any of billions of logic operations per second going wrong, will cause a crash. Software is the same: millions of instructions and data values per second must sequence perfectly. The code itself must be perfect as well, yet it is comprised of millions of bits of information that interact in complex ways.
In the embedded world we usually deal with failure by going to a safe, but dead, state. A watchdog timer might reset the system. Interlocks might shut the system down but park dangerous mechanical components in a safety zone. These responses are hardly what anyone wants, but most of the time that’s the only sort of response we know how to implement. How do you respond to a divide by zero that could occur in any of a thousand divide operations?
Some languages, like C++ and Eiffel, provide mechanisms to respond to these unwanted events, but it is still hard to recover and continue computing. An FMEA (Failure Mode and Effects Analysis) followed by a strategy to mitigate the risks is helpful, but often incredibly expensive. Some systems use a supervisor, a separate processor and code base, which watches the primary system’s outputs and takes corrective action if something goes wrong. This might be a limp-home mode in a car, or, as in the Space Shuttle, a simpler set of software that does little more than get the craft safely back to Earth. These are great approaches. But they simply mask the deeper truth: that when a program fails, it does so catastrophically.
There are no easy answers, but in some cases it’s possible to contain failure. A memory management unit (MMU) can detect a component’s illegal memory accesses, so that a supervisor can restart just that component.
Old-timers remember segmented addressing on the 8086. Four segment registers let 16-bit address registers handle 1 MB of memory. They drove developers mad. At the very first Embedded Systems Conference a (in my opinion rude) member of the audience asked keynoter Andy Grove when they would move beyond that addressing scheme.
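For readers who never suffered through it, here is a minimal sketch of how the 8086 formed a real-mode address: the 16-bit segment is shifted left four bits and added to the 16-bit offset, and the result wraps at 20 bits. The function names `phys_addr` and `demo_aliasing` are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode 8086 address formation: (segment << 4) + offset,
   truncated to the 20 address lines the chip actually had (1 MB). */
static uint32_t phys_addr(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFFu;
}

/* Two different segment:offset pairs can name the same byte,
   which is part of what made the scheme so error-prone. */
static void demo_aliasing(void)
{
    assert(phys_addr(0x1234, 0x0010) == 0x12350u);
    assert(phys_addr(0x1235, 0x0000) == 0x12350u);
}
```

Since a 16-bit offset spans only 64 KB, any data structure larger than that forced the programmer to juggle segment registers by hand.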
Intel did. They came out with the 386 with a 32-bit address space, but, to the horror of almost everyone, used thousands of segment descriptors to get to all of that memory. These were managed by the on-chip MMU.
The entire software world revolted. Nearly everyone programmed the MMU to give each program a flat 32-bit address space.
That was a mistake. The MMU offered a new way to make programs run safely. Sure, an individual application might crash as programs remain fragile. But if properly constrained into its own MMU-protected sandbox it wouldn’t damage other programs. The system itself could keep running despite the failure of one part.
By and large microcontrollers were not outfitted with memory management units. That was sensible since most MCUs ran small programs doing a single thing. Even today it’s not unusual to see, say, a PIC running a program that’s just a couple of hundred bytes of code. But more modern MCUs often sport large address spaces and run complex applications. Why don’t silicon vendors provide an MMU to permit more graceful degradation of a complex application? Divide the code into a number of individual components – e.g., tasks – and put each in a protected chunk of memory. A supervisor would look for access violations and attempt to restart or recover from one component’s failure.
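The supervisor idea might look something like the following host-side sketch. It is only a simulation under stated assumptions: a real system would learn about an access violation from an MMU or MPU fault handler, whereas here a component's step function simply returns `STEP_FAULT` to stand in for that event. Every name in it (`component`, `supervisor_pass`, `flaky_step`, and so on) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* STEP_FAULT stands in for a hardware-reported access violation. */
typedef enum { STEP_OK, STEP_FAULT } step_result;

typedef struct {
    const char *name;
    int state;                        /* the component's private data */
    unsigned restarts;                /* how often the supervisor restarted it */
    void (*init)(int *state);         /* (re)initialize the component */
    step_result (*step)(int *state);  /* run one unit of work */
} component;

static void counter_init(int *state) { *state = 0; }

/* A well-behaved component: just counts its steps. */
static step_result good_step(int *state) { ++*state; return STEP_OK; }

/* A fragile component: "crashes" on every third step. */
static step_result flaky_step(int *state)
{
    return (++*state % 3 == 0) ? STEP_FAULT : STEP_OK;
}

/* Run each component once; restart only the ones that faulted. */
static void supervisor_pass(component *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (c[i].step(&c[i].state) == STEP_FAULT) {
            c[i].init(&c[i].state);
            ++c[i].restarts;
        }
    }
}

/* Six passes over one good and one flaky component. */
static unsigned demo_restart_count(void)
{
    component c[2] = {
        { "logger", 0, 0, counter_init, good_step },
        { "parser", 0, 0, counter_init, flaky_step },
    };
    for (int pass = 0; pass < 6; ++pass)
        supervisor_pass(c, 2);
    assert(c[0].state == 6 && c[0].restarts == 0); /* never disturbed */
    return c[1].restarts;
}
```

The point of the sketch is the containment: the flaky component is restarted twice while its neighbor keeps counting, untouched. Whether a restart is actually safe depends on the application, which the sketch does not attempt to model.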
Transistors are essentially free. Vendors should give us a hardware resource like this to help us build more robust programs.
Recently ARM introduced the Cortex-M7, a sort of super-core for MCUs that gives around 5 CoreMark/MHz, not far from some of Intel’s older speed demons. ST is the first company to offer M7 silicon, and their STM32F756xx comes with a memory protection unit (MPU), which is sort of a poor-person’s MMU. It allows eight memory regions that are hardware-protected from interfering with each other. Access to each region can be finely controlled to permit, for instance, fetches only, or read-only, or read/write access. The MPU is available on some other ARM cores as well. That’s a nice start.
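To make the region idea concrete, here is a small software model of the check such a unit performs on every access. It is a sketch, not the Cortex-M7 register interface: the `region` struct, the `PERM_*` flags, `access_allowed`, and the addresses in the demo are all invented for illustration, and the model assumes regions do not overlap and that an address outside every region is a violation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { PERM_READ = 1, PERM_WRITE = 2, PERM_EXEC = 4 };
#define NUM_REGIONS 8   /* eight regions, as on the part described above */

typedef struct {
    uint32_t base;      /* first address covered */
    uint32_t size;      /* number of bytes covered */
    unsigned perms;     /* OR of PERM_* flags */
    bool enabled;
} region;

/* Return true if every access type in 'wanted' is permitted at 'addr'. */
static bool access_allowed(const region *table, uint32_t addr, unsigned wanted)
{
    for (size_t i = 0; i < NUM_REGIONS; ++i) {
        const region *r = &table[i];
        if (r->enabled && addr >= r->base && addr - r->base < r->size)
            return (r->perms & wanted) == wanted;
    }
    return false;       /* no region matched: treat as a violation */
}

static void demo_regions(void)
{
    /* Unlisted entries are zero-initialized, so they stay disabled. */
    const region table[NUM_REGIONS] = {
        { 0x08000000u, 0x1000u, PERM_EXEC,              true }, /* fetches only */
        { 0x20000000u, 0x0400u, PERM_READ,              true }, /* read-only */
        { 0x20000400u, 0x0400u, PERM_READ | PERM_WRITE, true }, /* read/write */
    };
    assert(access_allowed(table, 0x08000010u, PERM_EXEC));
    assert(!access_allowed(table, 0x08000010u, PERM_READ));
    assert(!access_allowed(table, 0x20000000u, PERM_WRITE));
    assert(access_allowed(table, 0x20000400u, PERM_WRITE));
    assert(!access_allowed(table, 0x40000000u, PERM_READ));
}
```

In hardware the failed check raises a fault instead of returning false, and that fault is the hook a supervisor would use to decide which component to restart.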
ARM often cackles with delight about how Cortex-M0 parts can be made in just 0.01 mm² of silicon using a 45 nm node. That’s pretty impressive. How about adding 0.001 mm² for an MMU? The cost is tiny, the benefits can be huge. Developers who don’t need it can just leave it off. Put it in its own power domain so it consumes nothing when idle.
Many RTOSes include code to manage the MMU; the incremental firmware cost is not large.
What’s your take? Would you use an MMU if one were offered?
Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges, and works as an expert witness on embedded issues.