Soft RAM errors cause crashes, and will become ever more prevalent.
While many systems sport power-on self-tests, in some cases, for good reasons or bad, we're required to check RAM continuously as the system runs. Since the RAM test necessarily runs very slowly (so as not to turn the CPU into a tortoise), I think that in most cases these tests are ineffective. Most likely the damage from bad RAM will be done long before the test detects a problem. But sometimes the tests are required for regulatory or marketing reasons. It makes sense to try and craft the best solution possible in the hope that the system will pick up at least some errors.
It'll ease your conscience as well. No one wants to write crummy code.
Last month I discussed hard errors—those generally reproducible problems that stem from bad hardware. But some systems may be susceptible to transient problems. A bit might mysteriously change value, yet the memory array consistently passes standard RAM tests. Though these “soft” errors can come from a variety of sources (such as power fluctuations and poor electrical design), aliens account for most. Aliens, that is, meaning invaders from outer space. High-energy streams of protons and alpha particles that comprise cosmic rays do cause memory bits to flip unexpectedly.
Then there's Maxwell's curse: shoving electrons around in a circuit generates electromagnetic fields that can affect other parts of a system, or even other systems. Long ago I instrumented a steel-rolling mill. A PDP-11 main controller lived in an insulated NEMA cabinet but was connected to numerous custom controllers dispersed around the 1,000 foot-long line. A house-sized motor consuming thousands of amps reversed direction every few seconds to move the steel back and forth under the rollers. I had argued for a fiber connection between boxes but was overruled as optical connections were still somewhat expensive at the time. The motor induced so much energy that the wiring between controllers—and a lot of electronics—quite literally smoked.
The customer found money for fiber after that.
But even a small system operating in a benign environment can suffer random bit flips from electromagnetic energy created from switching logic. Big memory arrays, being highly capacitive, are particularly vulnerable. Back when I was in the in-circuit emulator business we added a RAM test feature to our products. Since our customers used the units to bring up hardware as well as software, it seemed a great idea to help them evaluate their RAM.
That feature turned out to be a nightmare as something like a third of the users called to complain that the test gave erratic results. Two successive checks might give a few very different errors. Sometimes this was a result of an impedance mismatch between the tool and the customers' systems, but in general we found that the targets' designs were suffering from Maxwell effects. Driving RAM created transients that wrote incorrect data. Or sometimes the memory was fine but EMF distorted, momentarily and erratically, the data latched into the processor.
Those effects are most severe when a lot of bits on the bus change (because that's when the most electromagnetic energy is radiated), so it was, and still is, most common to see transients when many of the data bits are different than address bits, like reading 0000 from 0xffff. So we devised stress tests to pound on the buses hard. For instance, the code snippet in x86 lingo shown in Listing 1 tap dances on the bus like King Kong on a Yugo.
A lot of bits change in a very short time. Poor design invokes Maxwell's curse and the final compare instruction will fail.
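In C the same kind of bus pounding might look like the sketch below. It is not the x86 listing referred to above; the name bus_stress, the indices, and the data patterns are all assumptions, picked so that successive cycles flip as many address and data lines as possible. Any data cache has to be off, or the accesses never reach the bus.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Alternate between two locations whose indices and data are bitwise
 * complements, so nearly every address and data line toggles per access.
 * Returns the number of read-back miscompares seen in `passes` iterations.
 */
unsigned bus_stress(volatile uint16_t *base, size_t idx_a, size_t idx_b,
                    unsigned passes)
{
    unsigned errors = 0;

    assert(base != NULL && idx_a != idx_b);
    for (unsigned i = 0; i < passes; i++) {
        base[idx_a] = 0xFF00u;                 /* address lines one way...     */
        base[idx_b] = 0x00FFu;                 /* ...then the complement       */
        if (base[idx_a] != 0xFF00u) errors++;  /* a transient corrupted a cycle */
        if (base[idx_b] != 0x00FFu) errors++;
    }
    return errors;
}
```

For a 16K-word array, bus_stress(ram, 0x2AAA, 0x1555, 100000) uses complementary 14-bit indices. A nonzero count on hardware that passes an ordinary RAM test suggests a signal-integrity problem rather than a bad chip.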
Soft errors are just as perilous as a true hardware fault. A single-bit error in the stack, even if transient, will crash any non-MMU-based system. (One wonders why MMUs aren't more common.) A corrupted variable may bring the application to its knees, create fun and embarrassing results, or put the system into a dangerous mode.
Soft errors are transient. Conventional RAM tests, as described last month, won't generally detect them. Instead of looking for errors per se, a soft RAM test checks for memory consistency. Does each location contain what we expect?
It's a simple matter to checksum memory and compare that with a previous sum or a known good value. But variables change constantly in a running program. That's why they're called “variables.” It's generally futile to try and track the microsecond-to-microsecond contents of hundreds or thousands of these, unless you have some knowledge that an important subset stays constant . . . ah, “constants” for instance.
Sometimes it takes several machine cycles to set a variable. A 16-bit processor updating a 32-bit long variable will execute several machine instructions to save the data. An interrupt between those instructions leaves the variable half-changed. If your RAM test runs as a task or is invoked by a timer ISR, it may examine a variable in an indeterminate state. Lock accesses to every value that's part of the RAM test to keep the code reentrant.
I once saw an interesting solution to the problem of finding soft errors in a program's data space. Every access to every variable was encapsulated by a driver, which computed a new checksum for the data block after each write operation. That's simpler than re-summing memory; instead figure how much the checksum changed based on the previous and new values of the item being changed. Since two values—the variable and checksum—were updated the code used semaphores to be reentrant-safe. Though slow and cumbersome, it did work.
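A minimal sketch of that kind of driver follows. It is a reconstruction of the idea, not the code I saw: the names (vars, block_sum, checked_write, block_is_consistent) are made up, and lock() and unlock() are empty placeholders for whatever semaphore or interrupt mask your system provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_VARS 8u

static uint32_t vars[N_VARS];   /* the protected data block               */
static uint32_t block_sum;      /* additive checksum of vars[], mod 2^32  */

/* Hypothetical lock hooks: replace with a semaphore or interrupt mask. */
static void lock(void)   { }
static void unlock(void) { }

/* Every write goes through here; the checksum is adjusted by the delta
 * between the new and old value instead of re-summing the whole block. */
void checked_write(size_t i, uint32_t value)
{
    assert(i < N_VARS);
    lock();
    block_sum += value - vars[i];   /* unsigned arithmetic wraps safely */
    vars[i] = value;
    unlock();
}

/* The RAM test calls this: re-sum the block and compare. */
int block_is_consistent(void)
{
    uint32_t sum = 0;
    int ok;

    lock();
    for (size_t i = 0; i < N_VARS; i++)
        sum += vars[i];
    ok = (sum == block_sum);
    unlock();
    return ok;
}
```

The lock covers both the variable and the checksum, so the test never sees one updated without the other.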
Many systems boot from relatively slow flash, copy the code into faster RAM, and run the code from RAM. Others boot from a network into RAM. Ulysses had himself tied to a mast to resist the temptation to write self-modifying code (the Odyssey was all Greek to me), a wise move that, if emulated, makes it trivial to look for soft errors in the code. On boot, checksum the executable and have the RAM test repeat that checksum, comparing the result to the initial value.
Since there's never a write to firmware, don't worry about reentrancy when checksumming the program.
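Something like the following sketch would do, summing a small chunk per call so the check steals only a few cycles at a time. The names (code_scan_t, code_scan_init, code_scan_step) and the word-wide additive sum are assumptions; start and length would come from your linker map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint32_t *start;   /* first word of the RAM-resident code  */
    size_t words;            /* length of the image in 32-bit words  */
    size_t pos;              /* next word the background scan visits */
    uint32_t running;        /* partial sum for the current sweep    */
    uint32_t reference;      /* sum taken once at boot               */
} code_scan_t;

/* Call once at boot, right after the code is copied into RAM. */
void code_scan_init(code_scan_t *s, const uint32_t *start, size_t words)
{
    assert(s != NULL && start != NULL && words > 0);
    s->start = start;
    s->words = words;
    s->pos = 0;
    s->running = 0;
    s->reference = 0;
    for (size_t i = 0; i < words; i++)
        s->reference += start[i];
}

/* Sum up to `chunk` more words. Returns 0 while a sweep is in progress
 * or ended clean, and -1 when a completed sweep disagrees with the
 * boot-time reference. */
int code_scan_step(code_scan_t *s, size_t chunk)
{
    assert(s != NULL && chunk > 0);
    size_t end = s->pos + chunk;
    if (end > s->words)
        end = s->words;
    for (; s->pos < end; s->pos++)
        s->running += s->start[s->pos];
    if (s->pos < s->words)
        return 0;                       /* sweep not finished yet */

    int bad = (s->running != s->reference);
    s->pos = 0;                         /* start the next sweep   */
    s->running = 0;
    return bad ? -1 : 0;
}
```

Call code_scan_step() from an idle loop or a low-priority task. The chunk size sets the trade between detection latency and stolen cycles.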
It's reasonable to run such tests on RAM-based program code since the firmware doesn't change during execution. And the odds are good that if an error occurs, the application won't crash before the check gets around to testing the bad location, since the program likely spends much of its time stuck in loops or off handling other activities. In today's feature-rich products the failure could occur in a specific feature that's not often used.
Is a CRC better than a simple checksum? After all, if a couple of bits in different locations fail it's entirely possible the checksum won't signal a problem. CRCs dominate communications technology since noisy channels can corrupt many bytes. But they're computationally much more expensive than a checksum, and one goal for RAM testing is to run through all locations relatively quickly while leaving as many CPU cycles as possible for the system's application code. And soft RAM errors are not like noisy communications; most likely a soft error will result in merely one word being wrong. I prefer checksum's simplicity and speed.
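The checksum's blind spot is easy to demonstrate. In this sketch (sum32 and the three sample arrays are illustrative only) a single flipped bit changes the sum, but the same bit cleared in one word and set in another leaves it untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain additive checksum over n 32-bit words, modulo 2^32. */
uint32_t sum32(const uint32_t *p, size_t n)
{
    uint32_t sum = 0;

    assert(p != NULL);
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Reference data and two corrupted versions of it. */
static const uint32_t good[2]      = { 0x00000008u, 0x00000000u };
static const uint32_t one_flip[2]  = { 0x00000009u, 0x00000000u }; /* bit 0 set in word 0        */
static const uint32_t two_flips[2] = { 0x00000000u, 0x00000008u }; /* bit 3 moved word 0 -> word 1 */
```

Here sum32(one_flip, 2) differs from sum32(good, 2), while sum32(two_flips, 2) matches it. That pairing of errors is the case I'm betting is rare in RAM.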
While marketers might like the sound of “always self-tests itself!” on a datasheet, the reality is rather grim. It's tough to do any sort of consistency check on quickly changing locations, such as the stack. The best one can hope for is to check only sections of the memory array.
Unfortunately on-the-fly RAM tests pose a Hobson's choice with respect to performance. We want to run the tests often and fast, picking up errors before they wreak havoc. But that's computationally expensive, especially when reentrancy considerations mandate taking locks or disabling interrupts. If these tests are truly important one must push the application's code to a low priority and lavish the bulk of CPU cycles on the tests. Few of us are willing to do that.
Processors with caches pose particularly thorny problems. Any cached value will not get tested. One can glibly suggest a flush before each test, but, since the check runs often and fast, such action will more or less make the cache worthless.
Dual-ported RAM or memory shared between processors must be locked during the tests. It's possible to use a separate CPU just to run the test (I've seen it done), but bus bandwidth will cripple the application processor's performance.
Finally, what do you do if the test detects an error? Log the problem, leaving some debugging breadcrumbs behind. But don't continue to run the application code. Recovery, unless there's some a priori knowledge about the problem, is usually impossible. Halt and let the watchdog issue a reset or go into a safe mode.
Hi-rel apps
Soft RAM errors are a reality that some systems cannot tolerate. You'd hate to lose a billion-dollar space mission from a single microsecond-long glitch. Banks might not be too keen on a system in which a cosmic ray changes variable check_amount from $1.00 to $1,000,001.00. Worse would be holding a press conference to explain the loss of 100 passengers because the code, though perfect, read one wrong location and “crashed.”
In these situations we need to mitigate, rather than test for, errors. When failure is not an option it's time to rely on additional hardware. The standard solution is to widen each RAM location to add an error correcting code (ECC). A substantial amount of hardware will be needed to encode and decode ECCs on every memory access as well.
A word of 2^n bits needs n+1 additional check bits to correct any single-bit error. That is, widen RAM by 6 bits to fix any one-bit error in a 32-bit word (n = 5). Use n+2 extra bits to correct any single-bit error and also detect (and flag) two-bit errors.
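Those counts fall out of the Hamming bound: r check bits can correct one error in m data bits when 2^r >= m + r + 1, and one more parity bit adds double-error detection. A small sketch to work the numbers (the function names are mine, not from any library):

```c
#include <assert.h>

/* Smallest r with 2^r >= data_bits + r + 1: check bits needed to
 * correct any single-bit error (Hamming bound). */
unsigned sec_check_bits(unsigned data_bits)
{
    unsigned r = 0;

    assert(data_bits > 0 && data_bits <= 1024);
    while ((1ul << r) < (unsigned long)data_bits + r + 1)
        r++;
    return r;
}

/* One extra overall-parity bit also detects (but cannot fix) double errors. */
unsigned secded_check_bits(unsigned data_bits)
{
    return sec_check_bits(data_bits) + 1;
}
```

For a 32-bit word this gives 6 check bits for correction alone and 7 for correct-one, detect-two. A 64-bit word needs 7 and 8.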
Note that ECC will protect against any RAM error, even one that's in the stack.
A few wrinkles remain. Poor physical organization of the memory array can defeat any ECC scheme. In the olden days DRAM was available only in one-bit-wide chips. A 16k x 16 array used sixteen 16 kbit x 1 devices. Today vendors sell all sorts of wide (x 4, x 8, and so forth) configurations. If the check bits are stored in the same chip as the data, a multiple-bit soft error may prevent error correction, or even detection. Proper design means separating data and codes into different parts. The good news is that cosmic rays usually only toggle a single bit.
Another problem: soft errors can accumulate in memory. A proton zaps location 0x1000. The program reads that location and sends the corrected value to the CPU. But 0x1000 still has an incorrect value stored. With enough bad luck another cosmic ray may strike again; now the location has two errors, exceeding the hardware correction capability. The system crashes and 60 Minutes is knocking at your door.
Complement ECC hardware with “scrubbing” code that occasionally reads and rewrites every location in RAM. ECC will clean up the data as it goes to the processor; the rewrite fixes the bad location in memory. A minimally intrusive background task can suck a few cycles now and then to perform the scrub. Reentrancy is an issue.
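A background scrubber can be as small as the sketch below. It assumes the ECC hardware corrects data on the read path; scrubber_t, scrub_step, and the empty irq_disable()/irq_restore() hooks are placeholders for your own names and your own way of making the read-rewrite pair atomic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    volatile uint32_t *base;  /* start of the ECC-protected region */
    size_t words;             /* region length in 32-bit words     */
    size_t next;              /* next word to scrub                */
} scrubber_t;

/* Hypothetical hooks: mask and restore interrupts on a real target. */
static void irq_disable(void) { }
static void irq_restore(void) { }

/* Scrub `count` words, then return so the application can run. */
void scrub_step(scrubber_t *s, size_t count)
{
    assert(s != NULL && s->base != NULL && s->words > 0);
    for (size_t i = 0; i < count; i++) {
        volatile uint32_t *p = &s->base[s->next];

        irq_disable();          /* nothing may write *p between the two lines */
        uint32_t v = *p;        /* ECC hands the CPU the corrected value      */
        *p = v;                 /* writing it back repairs the stored word    */
        irq_restore();

        s->next = (s->next + 1) % s->words;
    }
}
```

The critical section is one word long, which keeps interrupt latency small. On a part with a data cache the read must actually reach the RAM, which this sketch does not arrange.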
Some systems invoke a DMA controller every few hours to scrub RAM. Though DMA sounds like the perfect solution, it usually runs independent of the program's operation and may corrupt any non-reentrant activity going on in parallel. Unless you have an exquisite sense of how the code updates every variable, be wary of DMA in this application.
ECC is expensive. Are there any software-only solutions?
Sometimes developers maintain multiple copies of critical variables, encapsulating access to them through a driver that checks to ensure all are identical. That leaves the stack unprotected. The heap might be vulnerable as well, since a corrupt entry in the allocation table (invisible to the C programmer) could take out every malloc() and free(). One could preallocate every block on boot, though then the heap offers little of value over fixed arrays.
Such copies may use as much extra memory as does ECC.
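Here is one way such a driver might look, as a sketch. Three copies and the two-of-three vote that repairs a single bad copy are my assumptions; the approach described above only requires that the copies be compared. The names guarded_u32, guarded_write, and guarded_read are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    volatile uint32_t copy[3];   /* ideally placed in separate RAM regions */
} guarded_u32;

void guarded_write(guarded_u32 *g, uint32_t value)
{
    assert(g != NULL);
    g->copy[0] = value;
    g->copy[1] = value;
    g->copy[2] = value;
}

/* Returns 0 if all copies agree, 1 if one copy was outvoted and
 * repaired, -1 if no two copies agree (caller must go to a safe state). */
int guarded_read(guarded_u32 *g, uint32_t *out)
{
    assert(g != NULL && out != NULL);
    uint32_t a = g->copy[0], b = g->copy[1], c = g->copy[2];

    if (a == b && b == c) {
        *out = a;
        return 0;
    }
    if (a == b || a == c)
        *out = a;                /* a is in the majority */
    else if (b == c)
        *out = b;                /* a is the odd one out */
    else
        return -1;               /* three different values */

    guarded_write(g, *out);      /* rewrite so errors do not accumulate */
    return 1;
}
```

Both calls need the same locking as any other multi-word update if an ISR or another task can touch the variable.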
There may be some value in protecting critical, slowly changing, data, but my observation is that developers typically value ease of implementation over doing a real failure analysis.
Better one crash than many
Soft RAM errors will become more of a problem as memory sizes grow and device geometry shrinks. Yet they're nearly intractable in terms of software solutions, and hardware approaches are costly and eat plenty of PCB real estate. If you have an MMU, use that to put each task in its own memory space. A stack error, for instance, will likely crash just one task. The MMU exception handler can take some sort of action to restore the code or put the system into a safe state.
Jack Ganssle is a lecturer and consultant specializing in embedded systems' development issues.
I would like to pass on my recent experience on an embedded Set Top Box. We had a USB driver which was not set up and was accidentally turned on. The result was random memory modification throughout the memory space. We had two processes in the unit plus numerous kernel threads.
The process that I was working on had protected private heaps, with extensive error checking. This was done to catch application errors as we were parsing PID packets and doing extensive routing of these packets. Well, when the USB system went wild, the first thing that happened was that the heap check routines were trapping errors at random places. The other process, which did not have any error checking, appeared to be happy at the time. This problem took considerable time to isolate, because the kernel configuration was done by an outside group. What was amazing was how long a system can appear to run successfully with memory problems; the only reason the problem was known to exist was the run-time checks on the data structures.
Based upon this experience, I tend to protect important data structures such as heaps and persistent data. The protection aids in debugging at first, but it will probably be the first indication of memory write problems, induced by things such as erratic hardware.
– Glenn Edgar
There are quite a few nasty RAM defects that a lot of simple RAM-test approaches would not find. Especially in fast RAM chips, so-called “dirty faults” can be hard to detect: they only show up after a second read of the affected RAM cell. I recommend the RAM testing papers from http://ce.et.tudelft.nl/publications.php as easy-to-read scientific literature about RAM testing.
– Stephan Gruenfelder
Here's a Run-time Memory Scanner:
The Memory Integrity Scanner continuously checksums the DRAM from the bottom of the Code and Data area up to the top of the MMU page area.
The memory is divided up into a two-dimensional array of 32-bit words starting from HW_MEM_PROT_LOW_READ (0x010000) and ending at HW_MEM_PROT_LOW_WRITE (hwStackBody). 32-bit words are checked so as to allow for the detection of Breakpoints in the code during debugging. Each of the rows and columns are separately checksummed and the checksums are themselves summed. The memory and checksums are illustrated as follows:
|CC|                                <-- checksum of row checksums
| C|..|..|..|..|..|..|..|..|        <-- memory locations
+--+--+--+--+--+--+--+--+--+            from '0' to 'n'
   | C| C| C| C| C| C| C| C|  |CC|
      Column checksums         ^
  Checksum of column checksums-+
The two-dimensional checksumming allows the scanner to detect the address of single corruptions, and the checksum routine used (XOR) allows the correct value to be calculated from one row or column checksum. The other checksum allows the correction to be validated (both checksums must predict the same repair value). It can even repair multiple errors in a single row or column. It can detect multiple corruptions across multiple rows and columns and knows that it cannot reliably repair in this case.
The checksums are protected by keeping checksums of the checksums.
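The locate-and-repair step can be sketched in a few lines of C. This is an illustration of the technique, not the scanner's actual source: the 4 x 8 geometry and all names are assumptions, it handles only the single-corrupted-word case, and it leaves out the checksums of the checksums.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROWS 4u
#define COLS 8u

typedef struct {
    uint32_t row[ROWS];   /* XOR of every word in each row    */
    uint32_t col[COLS];   /* XOR of every word in each column */
} xor_sums_t;

/* m points at ROWS * COLS words laid out row by row. */
void xor_sums_compute(const uint32_t *m, xor_sums_t *s)
{
    assert(m != NULL && s != NULL);
    for (size_t r = 0; r < ROWS; r++) s->row[r] = 0;
    for (size_t c = 0; c < COLS; c++) s->col[c] = 0;
    for (size_t r = 0; r < ROWS; r++) {
        for (size_t c = 0; c < COLS; c++) {
            uint32_t w = m[r * COLS + c];
            s->row[r] ^= w;
            s->col[c] ^= w;
        }
    }
}

/* Returns 0 if memory matches ref, 1 if one word was located and
 * repaired, -1 if the damage cannot be repaired reliably. */
int xor_check_repair(uint32_t *m, const xor_sums_t *ref)
{
    xor_sums_t now;
    size_t bad_rows = 0, bad_cols = 0, br = 0, bc = 0;

    assert(m != NULL && ref != NULL);
    xor_sums_compute(m, &now);
    for (size_t r = 0; r < ROWS; r++)
        if (now.row[r] != ref->row[r]) { bad_rows++; br = r; }
    for (size_t c = 0; c < COLS; c++)
        if (now.col[c] != ref->col[c]) { bad_cols++; bc = c; }

    if (bad_rows == 0 && bad_cols == 0)
        return 0;
    if (bad_rows == 1 && bad_cols == 1) {
        uint32_t fix_r = now.row[br] ^ ref->row[br];  /* flipped bits per row sum    */
        uint32_t fix_c = now.col[bc] ^ ref->col[bc];  /* flipped bits per column sum */
        if (fix_r == fix_c) {                         /* both predict the same repair */
            m[br * COLS + bc] ^= fix_r;
            return 1;
        }
    }
    return -1;
}
```

The bad row and bad column intersect at the corrupted word, and because XOR is its own inverse, the difference between the stored and recomputed sums is exactly the set of flipped bits.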
When the Scanner is called the first time it performs the initial row and column checksums.
The Scanner performs one “pass” per iteration, where summing one row or column or summing the checksums is counted as a “pass”. Passes are run at one per INTEG_TIME_SLOW, which is currently 100 ms. If an error is detected during the Row or Column scan, the state machine switches into “find” mode where it runs the passes without any delay (but still runs one pass per System schedule) until it has located and counted all the row and column checksums that are in error.
The Scanner can be configured to run or not, to correct or not or to reboot on error detection.
To support running under the debugger, the Scanner can detect (Power PC) Breakpoint instructions in the code and ignores and counts them. It also detects when the unit is running under the debugger, and disables the reboot option in this case.
– Tom Evans