Thirteen feet of concrete won't shield your RAM from the perils of cosmic rays. What's the solution?
In 1975 Micro Instrumentation and Telemetry Systems (MITS) shocked the techie world when they introduced the Altair computer for $400 (in kit form), the same price Intel was charging for the machine's 8080 CPU. According to the U.S. Department of Labor's Bureau of Labor Statistics (www.bls.gov/cpi) that's $1,500 in today's dollars, which would buy a pretty nifty Dell running at 1,000 times the Altair's clock rate, with a dual-core processor, quarter-terabyte of hard disk, a 19-in. flat-panel monitor, and 2 Gbytes of RAM. The Altair included neither keyboard nor monitor, had only 256 bytes (that's not a typo) of RAM, and no mass storage.
I'd rather have the Dell.
The 256 bytes of RAM were pretty limiting, so the company offered a board with 4 Kbytes of DRAM for $195 (also in kit form). These boards were simply horrible and offered reliably unreliable performance. Their poor design ensured they'd drop bits randomly and frequently.
Although MITS eventually went out of business, the computer revolution it helped midwife grew furiously, as did our demand for more and cheaper memory. Intel produced the 2107 DRAM, which stored 16 kbits of data. But users reported unreliable operation that by 1978 was traced to radioactive contaminants in the chip's packaging material. It turned out the company had built a new fab on the Green River in Colorado, downstream of an abandoned uranium mine. The water used to manufacture the ceramic packages was contaminated with trace amounts of radioactivity. The previous generation of DRAM had a storage charge of about 4 million electrons per bit; the 2107, using smaller geometry, reduced that to one million, roughly the charge a single alpha particle can generate.
Polonium 210, the same stuff that reputedly killed Russian ex-spy Alexander Litvinenko, occurs naturally in well water and as a decay product of radon gas. Yet far less than one part per billion in a DRAM package will cause several bit flips per minute. This level of radioactivity is virtually undetectable, and it led to a crisis at an IBM fab in 1987 when chips experienced occasional random bit flips due to polonium contamination. Many months of work finally determined that one container of nitric acid was slightly “hot.” The vendor's bottle-cleaning machine had a small leak that occasionally emitted minute amounts of 210Po. The story of tracing the source of the contamination reads like a detective thriller.1
Cosmic rays, too, can flip logic bits, and it's almost impossible to build effective shielding against these high-energy particles. The atmosphere offers the inadequate equivalent of 13 ft. of concrete shielding. Experiments in 1984 showed that memory devices had twice as many soft errors in Denver as at sea level.2
A 2004 paper shows that because DRAM cells have shrunk faster than the charge per cell has gone down, today's parts are less vulnerable to cosmic-ray effects than those from years ago.3 But these energetic particles from deep space can create havoc in new, dense SRAMs and FPGAs. The paper claims a system with 1 Gbyte of SRAM can expect a soft error every two weeks! According to an article in EE Times, a particle depositing as little as 10 femtocoulombs of charge has enough energy to flip an SRAM bit; a decade ago the larger cells needed five times more.4
Old-timers remember that most PCs had 9-bit memory configurations, with the ninth reserved for parity to capture DRAM errors. That's ancient history on the desktop, though servers often use parity or error-correcting logic to provide the high degree of reliability users expect. Desktop PC users expect, well, frequent crashes. I wonder how many of those are from cosmic rays and not a result of Windows problems?
We've long used RAM tests of various stripes to find hard errors–bad wiring, failed chips, and the like. The very first article I wrote for this magazine, in 1990, was about testing memory;5 that piece evolved in many ways, and now my full Monty on the subject is at www.ganssle.com/testingram.pdf.6 But soft errors are more problematic.
Some systems must run continuous RAM tests as the application executes, sometimes to find hard errors–actual hardware failures–that occur over time, and sometimes to identify soft errors. Are these tests valuable? One wonders. Any sort of error in the stack space will immediately cause the system to crash unpredictably. A hardware flaw–say an address or data-line failure–takes out big chunks of memory all at once. Recovery might be impossible, though I've heard claims that memory failures in the deep-space Pioneer probe were “fixed” by modifying the code to avoid those bad locations.
But for better or worse, if your requirements demand on-the-fly RAM tests, how will you conduct these without altering critical variables and code?
Since hard and soft errors manifest themselves in completely different ways, the tests for these problems are very different. Let's look at hard errors first.
Before designing a test it's wise to consider exactly what kinds of failures may occur. On-chip RAM won't suffer from mechanical failures (such problems would probably take out the entire device), but off-chip memory uses a sometimes very complex net of wiring and circuitry. Wires break, socketed chips rattle loose, corrosion creates open circuits, and chips fail. Rarely–oh, so rarely–a single bit or small group of bits fails inside a particular memory device, but today such problems are hardly ever seen. Except in systems with the most stringent reliability requirements, it probably makes sense not to look for single-bit failures, as such tests are very slow. Most likely the system will crash long before any test elicits the problem.
Pattern sensitivity, another failure mode that used to be common, has all but disappeared. The “walking ones” test was devised to find such problems but is computationally expensive and destroys the contents of large swaths of memory. There's little reason to run such a test now that the problem has essentially disappeared.
So the tests should look for dead chips and bad connections. If there's external memory circuitry, say bus drivers, decoders, and the like, any problem those parts experience will appear as complete chip failures or bad connections.
DRAMs have a special case when external circuitry generates refresh cycles (note that a lot of embedded processors take care of this internally). Unfortunately, it's all but impossible to construct a test to find a refresh problem, other than at the initial power-on self test when no other code is running. The usual approach fills memory with a pattern and then stops all execution for a second or two before checking that the correct pattern still exists. Few users want to see their system periodically lock up for the sake of the test. On second thought, that explains the behavior of my PC.
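The fill-wait-verify approach can be sketched roughly as follows. This is my own illustration, not code from the magazine: the buffer, sizes, and delay are stand-ins, and on real hardware the delay must span many refresh periods while nothing else touches the region (in this host-side simulation the check trivially passes; only real DRAM exercises the refresh logic).

```c
#include <stdint.h>

/* Stand-in for the DRAM region under test; on real hardware this would be
   a pointer to the actual memory, e.g. (volatile uint16_t *)0x100000. */
#define TEST_WORDS 1024
static volatile uint16_t dram[TEST_WORDS];

/* Crude delay. On real hardware, calibrate this to a second or two so that
   many refresh cycles elapse with no CPU accesses to the region. */
static void idle_delay(void)
{
    for (volatile long i = 0; i < 1000000L; i++)
        ;
}

/* Power-on refresh check: fill, wait with no accesses, verify.
   Returns 0 on success, nonzero if any word lost its pattern. */
int refresh_test(void)
{
    for (int i = 0; i < TEST_WORDS; i++)
        dram[i] = (uint16_t)(0xA55A ^ i);   /* varying pattern */

    idle_delay();   /* refresh alone must hold the data during this window */

    for (int i = 0; i < TEST_WORDS; i++)
        if (dram[i] != (uint16_t)(0xA55A ^ i))
            return 1;
    return 0;
}
```

Because the whole region is overwritten and the CPU must stay idle, this only makes sense during the power-on self test, exactly as described above.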
I want to draw a distinction between testing RAM to ensure that it's functional and ensuring that the contents of its memory are consistent or have reasonable values. The latter we'll look at when considering soft errors next month.
Traditional memory tests break down when running in parallel with an executing application. We can't blithely fill RAM with a pattern that overwrites variables, stacks, and maybe even the application code. On-the-fly RAM tests must cautiously poke at just a few locations at a time and restore the contents of these values. Unless we have some prior information that the locations aren't in use, we'll have to stop the application code for a moment while conducting each step. Thus, the test runs sporadically, stealing occasional cycles from the CPU to check just a few locations at a time.
In simpler systems that constantly cycle through a main loop, it's probably best to stick a bit of code in the loop that checks a few locations and then continues on with other, more important activities. A static variable holds the last address tested so the code snippet knows where to pick up when it runs again. Alternatively we can run a periodic interrupt whose ISR checks a few locations and then returns. In a multitasking system, a low-priority task can do the same sort of thing.
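A minimal sketch of that main-loop snippet might look like the following. All names and sizes here are my own assumptions; note that a fixed-pattern check like this exercises the storage cells but says nothing about the address wiring.

```c
#include <stdint.h>

/* Stand-in for the application's RAM; on hardware these would be the
   real start and end addresses of the region under test. */
#define RAM_WORDS 4096
static uint16_t ram[RAM_WORDS];

#define WORDS_PER_SLICE 4   /* cycles stolen from each main-loop pass */

/* Called once per main-loop iteration (or from a low-priority task).
   A static index remembers where the previous call left off, so
   successive passes sweep all of memory. Returns 0 if OK. */
int ram_test_slice(void)
{
    static unsigned next_addr;   /* persists between calls */
    int errors = 0;

    for (int n = 0; n < WORDS_PER_SLICE; n++) {
        /* interrupts would be disabled here on a real system */
        uint16_t save = ram[next_addr];   /* preserve the app's data */
        ram[next_addr] = 0x5555;
        if (ram[next_addr] != 0x5555) errors++;
        ram[next_addr] = 0xAAAA;
        if (ram[next_addr] != 0xAAAA) errors++;
        ram[next_addr] = save;            /* restore before anyone notices */
        /* interrupts re-enabled here */

        next_addr = (next_addr + 1) % RAM_WORDS;   /* wrap and start over */
    }
    return errors;
}
```

The ISR and low-priority-task variants differ only in how `ram_test_slice()` gets invoked.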
If any sort of preemption is happening, turn interrupts off so the test itself can't be interrupted with RAM in a perhaps-unstable state. Pause DMA controllers and any shared-memory accesses as well.
But what does the test look like?
The usual approach is to stuff 0x5555 in a location, verify it, and then repeat using 0xAAAA. That checks exactly nothing. Snip an address line with wire cutters: the test will pass. Nothing in the test proves that the byte was written to the correct address.
Instead, let's craft an algorithm that checks address and data lines, such as the algorithm in Listing 1.
START_ADDRESS is the first location of RAM. In lines 9 and 11, and 16 and 18, we save and restore the RAM locations so that this function returns with memory unaltered. But the range from line 9 to 18 is a “critical section”–an interrupt that swaps system context while we're executing in this range may invoke another function that tries to access these same addresses. To prevent this, line 8 disables interrupts (and be sure to shut down DMA, too, if that's running). Line 7 preserves the state of the interrupts; if test_ram() were invoked with them off, we sure don't want to turn them back on! Line 19 restores the interrupts to their pre-disabled state. If you can guarantee that test_ram() will be called with interrupts enabled, simplify by removing line 7 and changing line 19 to a starkly minimal interrupt enable.
The test is simplicity itself. It stuffs a value into the first location in RAM and then, by shifting a bit and adding that to the base address, stuffs a value into other locations, each separated from the base by a single address line. This code is for a 64-Kbyte space; in 16 iterations it ensures that the address, data, and chip-select wiring is completely functional, as is the bulk functionality of the memory devices.
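Listing 1 itself isn't reproduced here, so the following is only my sketch of the technique as described, run against a simulated buffer; the names (`ram_sim`, `test_ram`) are stand-ins, and the interrupt handling shown in the actual listing is reduced to comments.

```c
#include <stdint.h>

#define NUM_ADDRESS_LINES 16   /* 64-Kbyte space */
static uint8_t ram_sim[1UL << NUM_ADDRESS_LINES];  /* stand-in for the RAM under test */

/* On real hardware this would be the true base of RAM. */
#define START_ADDRESS ((volatile uint8_t *)ram_sim)

/* Checks address, data, and chip-select wiring in 16 iterations.
   Returns 0 if the wiring looks good, nonzero otherwise. */
int test_ram(void)
{
    volatile uint8_t *base = START_ADDRESS;
    int bad = 0;

    /* interrupts (and DMA) would be disabled here, preserving prior state */
    uint8_t save_base = *base;
    *base = 0xAA;

    for (unsigned line = 0; line < NUM_ADDRESS_LINES; line++) {
        volatile uint8_t *p = base + (1UL << line);  /* differs in one address line */
        uint8_t save = *p;
        *p = 0x55;
        if (*base != 0xAA) bad++;  /* a stuck or shorted line aliases onto base */
        if (*p != 0x55) bad++;
        *p = save;                 /* restore the application's data */
    }

    *base = save_base;
    /* interrupt state restored here */
    return bad;
}
```

If the write to `base + (1 << line)` lands back on `base` because an address line is dead, the 0x55 clobbers the 0xAA and the test fails, which is exactly what the naive 0x5555/0xAAAA pattern can never detect.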
To cut down interrupt latency, you can remove the loop and test one pair of locations per call.
The code doesn't check for the uncommon problem of a few locations going bad inside a chip. If that's a concern, construct another test that replaces lines 10 to 18 with the code in Listing 2, which cycles every bit at every location, testing one address each time the routine is called. Despite my warning, the 0x5555/0xAAAA pair works because the former test checked the system's wiring.
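Again, Listing 2 isn't reproduced here; this is my own sketch of a test that cycles every bit at one address per call, with hypothetical names and a simulated buffer. It uses a walking bit rather than the 0x55/0xAA pair, which is a slightly more thorough variant of the same idea.

```c
#include <stdint.h>

#define RAM_SIZE 0x10000u
static uint8_t ram_sim[RAM_SIZE];   /* stand-in for the memory under test */

/* Tests one address per invocation, cycling every bit through the cell,
   and restores the original contents. Returns 0 if the location is good. */
int test_ram_cell(void)
{
    static uint32_t addr;   /* next location; persists across calls */
    volatile uint8_t *p = &ram_sim[addr];
    int bad = 0;

    /* interrupts off around this critical section on a real system */
    uint8_t save = *p;
    for (int b = 0; b < 8; b++) {
        uint8_t pat = (uint8_t)(1u << b);
        *p = pat;
        if (*p != pat) bad++;              /* walking one */
        *p = (uint8_t)~pat;
        if (*p != (uint8_t)~pat) bad++;    /* walking zero */
    }
    *p = save;
    /* interrupts back on */

    addr = (addr + 1) % RAM_SIZE;          /* wrap and start over */
    return bad;
}
```

At one address per call, a full sweep of a 64-Kbyte space takes 65,536 invocations, which is why this kind of test runs in the background over hours rather than at startup.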
There are caveats, however. Don't program these tests in C, for instance, unless you can ensure the tests won't touch the stack. And the value of these tests is limited, since address- and data-bus errors will most likely crash the system long before the test runs. But in some applications that use banks of memory, a wiring fault might affect just a small subsection of RAM. In other systems the ongoing test is important, even if meaningless, to meet the promises some naïve fool in marketing printed on the brochure.
Stay tuned for more on this topic next month.
Jack Ganssle is a lecturer and consultant specializing in embedded systems development issues.
1. Ziegler, J. F., et al., “IBM Experiments in Soft Fails in Computer Electronics,” IBM Journal of Research and Development, vol. 40, no. 1, January 1996.
2. Ziegler, J. F. and W. A. Lanford, “The effect of sea level cosmic rays on electronic devices,” Journal of Applied Physics, vol. 52, p. 4305, June 1981.
3. Tezzaron Semiconductor, “Soft Errors in Electronic Memory–A White Paper,” Naperville, IL, 2003-2004. Available at: www.tezzaron.com/about/papers/soft_errors_1_1_secure.pdf
4. Cataldo, Anthony, “SRAM soft errors cause hard network problems,” EE Times, August 17, 2001. Available at: www.eetimes.com/story/OEG20010817S0073
5. Ganssle, Jack G., “The Zen of Diagnostics,” Embedded Systems Programming, June 1990, p. 81. Available at: www.ganssle.com/articles/adiags1.htm
6. Ganssle, Jack G., “Testing RAM in Embedded Systems,” The Ganssle Group, 2002. Available at: www.ganssle.com/testingram.pdf
Some of the first memory tests I used were those provided in the EM180, a popular Z80 emulator back in the early '80s. The big issue back then was address-line problems: we wire-wrapped our prototypes, and DRAM RAS/CAS address multiplexing never mixed well with wire-wrapped bus structures. What the EM180 had built in was code for an “Address Line Test”:
1. Take the first target address and write 00. Set an Address Mod to 0001.
2. XOR the Address Mod with the target address and, if the result is in range, write FF there.
3. Shift the Address Mod left and repeat step 2 until the Mod shifts out to 0.
4. Bump the target address by 1 and repeat all of the above until done.
The above would run for 12 or so hours.
DRAM still uses address-line multiplexing, and it's an area ripe for signal-quality issues. The “Address Check” above generates all the possible worst-case multiplexing patterns; it's very good at wringing out DRAM.
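As I read Bob's steps (the read-back check at the end of each pass is implied rather than stated), the test might be sketched in C like this, scaled down to a small simulated buffer so it runs in moments rather than 12 hours; all names here are my own.

```c
#include <stdint.h>

#define MEM_SIZE 0x1000u        /* scaled down; the real test covered the full space */
static uint8_t mem[MEM_SIZE];   /* stand-in for the RAM under test */

/* EM180-style address line test. For each target address, write 00, then
   write FF at every address differing from it by exactly one address bit,
   and confirm the target still reads 00. Returns 0 on success. */
int em180_address_test(void)
{
    int bad = 0;
    for (uint32_t target = 0; target < MEM_SIZE; target++) {
        mem[target] = 0x00;                       /* step 1 */
        for (uint32_t mod = 1; mod < MEM_SIZE; mod <<= 1) {
            uint32_t alias = target ^ mod;        /* step 2: flip one address bit */
            if (alias < MEM_SIZE)                 /* "if in range" */
                mem[alias] = 0xFF;
        }                                          /* step 3: shift and repeat */
        if (mem[target] != 0x00)                   /* target must survive the writes */
            bad++;
    }                                              /* step 4: bump the target */
    return bad;
}
```

Every pass hits every one-bit-different address, which is what makes this so effective at exercising the multiplexed RAS/CAS patterns Bob describes.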
Now, I was one of the first ARM users in the US, and I coded the above address check into ARM assembly, along with some all-00, all-FF, rotating-1s, and rotating-0s patterns. One of two conditions would occur:
1. It would run solid for days, and I wondered if the code really worked.
2. It would keep failing, and I wondered if the code really worked.
Funny thing was, you could run code on a system and yet the memory tests would fail. I see a lot of engineers who think that since their system runs code OK, everything must be fine. WRONG. You need to run good memory tests as a four-corner test, at temperature and voltage extremes. Only when all these techniques are combined do you ensure a robust system.
When a system passed the above test it was solid. I still swear by the EM180 Address Check.
– Bob Kondner
For a few more ideas look up “Memtest86” on Google or your favorite search engine.
– Bill Murray