Making Nonvolatile Data Reliable
by David Hinerman
How do you protect data stored in nonvolatile memory? This article
contains some tricks to make nonvolatile data really tough to lose, even if someone pulls the plug on your system at exactly the wrong time.
Many computer systems being developed today, whether embedded or otherwise, require the ability to store data for future use. Unfortunately, the future may also try its hardest to erase or corrupt that data. This is especially true for embedded systems that typically don't have the luxury of an uninterruptible power supply, gigabytes of disk storage, and
unlimited RAM. Many of us, as developers of lower-cost systems, must ensure the reliability of so-called "nonvolatile data" with limited resources.
Nonvolatile data are data that must be able to be retrieved accurately at some (possibly unspecified) future time, in spite of conditions that may otherwise corrupt or erase that data. At some level this definition applies to all data and even executable code. However, we can usually assume that during run time the normal methods of storing data and code in RAM or
ROM are reliable enough to not require special techniques. Data that must survive such disasters as power failures, disk removal, or gunshot damage (don't laugh-it happens) need special care in design and implementation of storage systems.
In this article, I'll briefly describe some of the more common forms of nonvolatile storage available to the embedded system designer. I'll also address some techniques for applying them to make data as reliable as possible. This is hardly a comprehensive article-if you get
an idea here that leads you to develop a better technique, that's great, and I do hope you'll share.
Why do we need nonvolatile data?
Different systems need different levels of reliability. At the least critical level, a set of intelligently selected default values stored in ROM can be thought of as nonvolatile data. Today's microwave ovens are excellent examples-they have buttons for just about any kind of job that a microwave oven can do. Popcorn? Press the button and wait. Defrost chicken?
Press a different button and wait longer. The times and power settings all reside in a table in ROM, and they don't change over the operational life of the device. This sort of feature requires some thought on the part of the system developer, but requires very few additional resources within the product itself. The ability to remember cooking settings is hardly a critical feature-microwave ovens existed for years with nothing more complicated than a clockwork timer. If the user recognizes the convenience,
however, it can make the device a more salable product.
A slightly more complex device may be required to remember settings from the last time it was used. A remote-controlled television set is a good example: it usually remembers the last channel it was tuned to. As with the microwave oven, this feature is hardly a critical requirement, but it can be an inconvenience to have to start over at channel 2 if you want to watch channel 32. By remaining at the last used channel, the television won't be blamed if
you must surf through 30 channels to watch something different.
This sort of nonvolatility can be important when you are developing a computer-based replacement for older mechanical devices. In the olden days, television tuners were (relatively) big mechanical assemblies that stayed where you set them. Users expected the same convenience when they changed over to electronic tuning. This expectation can also apply to display instruments-mechanical dials were readable even by candlelight. While most users may
forgive a device that can't display anything when the power is off, they may not accept a device that forgot what it should have displayed once power comes back on.
This leads to the next level of complexity-data that must be maintained because it would be costly to the user to lose it. During part of my career in development, I worked for a company that built electronic power meters for the electric utility market. Readings stored in the meters would be used to calculate electric bills for the utilities'
customers. A utility company wouldn't spend hundreds of dollars for a meter unless it expected to recover that cost. Consequently, the expensive meters were installed at the utilities' largest customer sites. Losing a month's readings could mean losses in the thousands of dollars to the power company.
(By the way, this is the environment where gunshot damage is a real possibility. Meters are often mounted outdoors in rural substations, and if they show any sort of light or movement, they may be a target
for somebody with a gun who will "shoot first and ask questions later." While nobody expects a memory cartridge with a .30-caliber hole in it to retain data, any chance of recovering readings from a damaged meter is a plus in the eyes of the power company.)
The most critical need for nonvolatile data can exist in systems that affect the safety of life, health, or property. Aircraft automatic pilots and industrial process control systems come to mind as examples. Often the most hazardous conditions for humans
can also be the most likely to cause errors in a system. This is where multiple levels of backup and redundant hardware can be expected. Even then, your system must take extra care to do the right thing in the face of multiple failures.
Data lifetime
Regardless of how critical the need for nonvolatile data, there is a period of time in which the data exist-a "data lifetime," which usually begins when the data are created or generated. (It is possible that variables used to calculate
nonvolatile data may be considered nonvolatile themselves, up to the point that the desired data are safely stored away. I'll cover this in more detail later.) The lifetime ends when the nonvolatile data are no longer required. This point may not be as well defined, since there may be requirements to archive older versions of the data.
The basic stages in the life of nonvolatile data are:
- Data being created
- Moving to nonvolatile storage
- In storage
- Reading from storage
- Being used
- No longer required
You'll need to consider how to protect the data in all of these stages, or at least through the transition into the sixth stage. Even then, the act of removing nonvolatile data from storage may make the system susceptible to later errors.
In developing a system that uses nonvolatile data, you'll need to clearly identify the requirements for that data-their lifetime and how they will be used. Ask questions such as:
- Why are these data required?
- When will they be created or made available?
- When will they be required for use?
- When can they be discarded?
- Must critical values be stored, or can they be re-created from previously stored data?
- Are there some periods when the data are more vulnerable?
- What will happen if the data are destroyed prematurely?
When you've started to answer some of these questions, you should get an idea of how the data are made and used. You may not be able to answer all
the questions, and some may have multiple answers. Be prepared to go back and revise your answers, too-as you learn more, you'll likely uncover new ideas that could change the whole system.
Once you have a clearer understanding of the life cycle of your nonvolatile data, you'll need to ask some more questions:
- What events could disrupt the life of nonvolatile data?
- When are they likely to occur?
- Are certain stages more likely to be affected than others?
- How can these events be prevented or managed to minimize damage to nonvolatile data?
- What limits my ability to protect against or correct these events?
Be very critical when asking these questions. Trust nothing-hardware, software, users, documentation, and even developers are all suspect. It's all right to ask, "What if the regulated power supply fails?" The idea is to make a system that is reliable in the real world. Even if you discover that there is nothing you can do to prevent certain conditions
Types of nonvolatile storage
There are a number of technologies available for storing nonvolatile data. I'll cover a few of the more common ones here. While there are certainly others, most of them have characteristics similar to the ones described here. Thus, the same comments and suggestions apply.
Usually "nonvolatile" refers to storage that retains data even when power is removed from the system. Some storage technologies actually use no power at all when idle. Others "cheat" a little bit by
using an additional power source that is not switched off by the main power switch, or that stays on even when a power failure occurs.
Probably the most common form of nonvolatile storage used in embedded systems today is the ROM and its programmable variants such as EPROM. Typically used for storing executable code, ROM also holds initial values for variables and constant data that do not change at run time. ROM has a number of attractive advantages. It can be read very quickly-usually at full processor bus
speed-and requires zero power to retain its data when the system is turned off. The storage density is excellent. ROM is the best place to store default values that will never change during the operational life of the system.
The main disadvantage of ROM is that it generally cannot be written by the system while in operation. ROMs and EPROMs are usually programmed before being installed in the system. While some systems permit installed EPROMs to be programmed, this is usually performed when the system is
manufactured, by an external instrument. (EEPROMs can be written in a running system, however. I'll discuss them a little later.)
ROMs, while physically very reliable, can occasionally fail electrically. EPROMs in particular, which are meant to be erasable, may have some bits change inadvertently. Good anti-static handling practices should be used with any component, not just ROMs. Opaque labels over ultraviolet erasable PROMs are a must. Make sure the label is really opaque to ultraviolet light, too. Paper
labels available from office supply houses might be acceptable for development work in which you are erasing and reprogramming EPROMs every few days, but they still can leak enough light to erase a few bits in time. The best long-term labels are typically metallized plastic.
There is some debate among developers as to whether or not error-checking algorithms are useful in verifying ROM contents at run time. Proponents of run-time checking claim that it can give the user advance warning of impending failure or
errors in stored data. Opponents claim that if the ROM would fail an error-checking process, it probably wouldn't allow the system to run long enough to report it anyway.
Another popular method of nonvolatile storage is battery-supported static RAM. This technique has been used in desktop computers for years to record their hardware configuration. RAM reads and writes quickly, is available in high densities, and consumes very little power to retain data. There are even RAM chips available with built-in
batteries, which eliminate the need to design a battery and power-switching circuit into the system.
Battery-supported RAM has several disadvantages when used for nonvolatile data storage. Because it writes easily, its contents can just as easily be overwritten by a processor that isn't executing code correctly, perhaps in a low-voltage situation. Batteries, even rechargeable ones, have a finite life span. Systems sold in some markets are expected to have a useful life of many years. In such markets, a system
with a battery is assumed to require at least one maintenance shutdown, raising the cost to operate that system. Batteries can also be considered hazardous in certain environments due to potential outgassing or release of toxic material if the battery case should rupture.
Battery-supported RAM is available in traditional byte- or word-wide configurations for connection to a processor bus, and in small (usually 8-pin) ICs with a simple clocked serial interface. The serial RAMs have an additional disadvantage,
in that reading and writing are slow processes and can't usually be performed at bus speeds.
A nonvolatile storage technology that has become more popular over time is EEPROM. It has almost all of the advantages of traditional ROM, plus it can be written while a system is in operation. EEPROM's one disadvantage (when compared to ROM) is that storage densities aren't as great. Many applications don't require a lot of nonvolatile storage, so this isn't a common problem.
EEPROM has several disadvantages when
compared against RAM, however. First, EEPROM is slow to write. Writing a single location (or a block of adjacent locations in so-called "page write" EEPROMs) can take hundreds of microseconds, or even many milliseconds, to complete. During this time, other locations within the component may be unavailable for reading or writing. Another disadvantage is that EEPROMs are usually guaranteed to survive a limited number of write operations to any given location. In common components this is usually 10,000 write
cycles, although some parts may be guaranteed to accept more (at greater cost).
EEPROM, like RAM, is also available in byte-wide, word-wide, and serial configurations. The serial components are slow, which adds even more delay to the writing process. The main advantage in having a serial component is the savings in cost and circuit board space.
Flash EPROM has become familiar as a nonvolatile storage technology in the past few years, thanks to the personal computing industry. Solid-state "disks" and
field-upgradeable motherboard BIOS memories have given flash a solid market. Flash is like a blend of ROM and EEPROM technologies, with the high storage density of ROM and the writability of EEPROM.
Flash EPROM also has most of the disadvantages of EEPROM, only more so. It writes slowly, making the part unavailable for reading, like an EEPROM. It also tolerates a much smaller number of write cycles-on the order of 100. A further disadvantage of flash EPROM is that before a location may be written, at least a
portion of the component must be erased. Erasure is also a slow process, taking many seconds to complete in some cases.
Perhaps the most familiar form of nonvolatile storage is the magnetic disk. The advantages of disk (or any similar removable magnetic media) are well known: zero power to retain data, large storage capacity, and inexpensive media. The disadvantages are also well known: slow to read and write (compared to semiconductor memory), large up-front cost for the drive, and the mechanical systems are
susceptible to damage from shock and contamination. Floppy disk drives also require periodic maintenance, raising the cost of owning a disk-based system. Nevertheless, disk is an acceptable nonvolatile storage method in many applications.
I haven't discussed relative costs between different technologies (except between serial and parallel versions of the same technology) because costs can vary dramatically based on non-technical factors. When designing a system, contact the suppliers for current costs for
each type of storage you are considering.
Unreliability in nonvolatile data
Looking at the characteristics of the nonvolatile storage technologies available, you can pick out some potential causes of errors in stored data. For example, battery-supported RAM could lose data completely if the battery dies. Magnetic disks could be rendered useless by mechanical shock. UV EPROM can lose bits if it's exposed to bright light. However, these problems can be reduced or eliminated by careful design.
Knowing the characteristics of the medium, you can choose the conditions that will give it the best chance of success. But what about the data themselves? RAM is pretty much useless if you don't store anything in it. Data must be handled as carefully as, if not more carefully than, the storage medium.
Even given a perfect storage technology, we can still expect errors to try to creep into our data. This is because the process of creating, storing, retrieving, and discarding data is subject to interruptions and
mistakes. Some of the more common external causes of errors are power outages, operator error, and yes, even software errors.
Data errors can be as small as a single bit or as large as the entire contents of a system's memory. Single bit errors can frequently be caused by hardware or environmental problems such as a failed location in a chip, a static discharge during a write, or a noise spike in the power supply. Larger errors involving a byte or multiple bytes can be caused by hardware or software
problems. With multiple bytes, it is possible to have the correct bytes but in the wrong order. This type of error is almost always committed by software. There are detection and correction algorithms available that will catch and repair these kinds of errors.
Unexpected events like power outages or removal of a memory cartridge are frequent causes of corrupt data. This is especially true when the data are created by long and complex algorithms that take significant time to execute. Errors like these
are the most difficult to defend against, because they come at unexpected and uncontrollable times. However, there are methods that can improve the survival rate of data subject to these conditions. I'll discuss a few of these techniques later.
Error detection and correction
An important part of maintaining nonvolatile data is being able to detect (and possibly correct) errors. I'll mention some of the common methods here. I won't go into a lot of detail, because this topic has been covered in
countless articles and books.
Perhaps the simplest method of error detection is to just assume that the data are correct. In some situations this is perfectly acceptable, as long as a potentially bad value won't cause the system to crash or behave unexpectedly. If you're going to rely on this, you should at least check that the value is within an allowable range, and change it to a known good default value if it isn't.
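The range check described above can be sketched in a few lines of C. The limits and the default value here are hypothetical, chosen only to echo the television channel example; substitute whatever is sensible for your own data.

```c
#include <stdint.h>

/* Hypothetical limits and default for a stored channel number. */
#define CHANNEL_MIN      2u
#define CHANNEL_MAX      69u
#define CHANNEL_DEFAULT  2u

/* Return the stored value if it is plausible, otherwise a known good default. */
uint8_t sanitizeChannel(uint8_t stored)
{
    if (stored < CHANNEL_MIN || stored > CHANNEL_MAX) {
        return CHANNEL_DEFAULT;
    }
    return stored;
}
```

A check like this cannot tell a corrupted value from a valid one that happens to fall inside the range, so it only keeps the system from misbehaving.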
A common technique that can actually detect errors is to store some sort of check value with
the data. A number of algorithms are available, many with C source code, in technical literature and on the Internet. Checksum, cyclic redundancy check (CRC), parity bits, and other similar techniques fall into this category. Most can detect bit errors, and some (like CRC) are also able to detect errors in byte order and other potential mistakes. Also, some algorithms require more system resources, in CPU cycles or memory, than others. Choose a method that best fits your available resources and the error
sources you are likely to encounter.
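To make the trade-off concrete, here is a minimal sketch of two check values: an 8-bit additive checksum and a bitwise CRC-16. The CRC parameters (polynomial 0x1021, initial value 0xFFFF) are one common choice and are an assumption of this example, not something the techniques above require.

```c
#include <stddef.h>
#include <stdint.h>

/* Additive checksum: cheap, but blind to bytes stored in the wrong order. */
uint8_t checksum8(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    while (len--) {
        sum = (uint8_t)(sum + *data++);
    }
    return sum;
}

/* Bitwise CRC-16, polynomial 0x1021, initial value 0xFFFF, no bit reflection. */
uint16_t crc16(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    while (len--) {
        crc ^= (uint16_t)((uint16_t)*data++ << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}
```

If two adjacent bytes are swapped, checksum8() returns the same value as before, while crc16() returns a different one. The CRC costs eight shift-and-test steps per byte; a table-driven version trades ROM space for speed.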
Another technique that makes nonvolatile data more robust is using error checking and correcting (ECC) methods. This involves calculating a check sequence for a set of data that not only allows detection of errors, but also allows erroneous data to be reconstructed in some cases. ECC algorithms are usually more involved than checksum or CRC calculations, and require additional storage (usually less than twice as much as for the original data) for the check sequence.
Depending on the source of errors in your data, an ECC algorithm may provide enough security for your needs.
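As one small illustration of ECC, the sketch below uses a Hamming(7,4) code, which stores each 4-bit value in 7 bits and can repair any single flipped bit in that code word. This particular code is my choice for the example; it is not the only ECC method, and it will miscorrect a code word that has two bad bits.

```c
#include <stdint.h>

/* Return the bit at position pos (1..7) of a code word. */
static uint8_t bitAt(uint8_t word, int pos)
{
    return (uint8_t)((word >> (pos - 1)) & 1u);
}

/* Encode the low 4 bits of nibble into a 7-bit Hamming code word.
   Parity bits sit at positions 1, 2 and 4; data bits at 3, 5, 6 and 7. */
uint8_t hammingEncode(uint8_t nibble)
{
    uint8_t d0 = (uint8_t)(nibble & 1u);
    uint8_t d1 = (uint8_t)((nibble >> 1) & 1u);
    uint8_t d2 = (uint8_t)((nibble >> 2) & 1u);
    uint8_t d3 = (uint8_t)((nibble >> 3) & 1u);
    uint8_t p1 = (uint8_t)(d0 ^ d1 ^ d3);
    uint8_t p2 = (uint8_t)(d0 ^ d2 ^ d3);
    uint8_t p4 = (uint8_t)(d1 ^ d2 ^ d3);
    return (uint8_t)(p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
                     (d1 << 4) | (d2 << 5) | (d3 << 6));
}

/* Decode a 7-bit code word, repairing a single-bit error if one is present. */
uint8_t hammingDecode(uint8_t code)
{
    uint8_t s1 = (uint8_t)(bitAt(code, 1) ^ bitAt(code, 3) ^ bitAt(code, 5) ^ bitAt(code, 7));
    uint8_t s2 = (uint8_t)(bitAt(code, 2) ^ bitAt(code, 3) ^ bitAt(code, 6) ^ bitAt(code, 7));
    uint8_t s4 = (uint8_t)(bitAt(code, 4) ^ bitAt(code, 5) ^ bitAt(code, 6) ^ bitAt(code, 7));
    uint8_t syndrome = (uint8_t)(s1 | (s2 << 1) | (s4 << 2));

    if (syndrome != 0u) {
        /* The syndrome is the position of the bad bit; flip it back. */
        code = (uint8_t)(code ^ (1u << (syndrome - 1)));
    }
    return (uint8_t)(bitAt(code, 3) | (bitAt(code, 5) << 1) |
                     (bitAt(code, 6) << 2) | (bitAt(code, 7) << 3));
}
```

Storing 7 bits for every 4 bits of data is consistent with the "less than twice as much" storage estimate above.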
An easy method of error detection and correction is mirrored storage. Whenever a nonvolatile datum is stored, it is stored in multiple places. Whenever it is to be used, it is read from two or more places and compared. If a majority of the values match, the datum is regarded as accurate. If only two copies of the datum are used, they must have some method of error detection as well in order to decide
which is accurate in the case of a mismatch. If three or more copies are used, a majority vote can determine the value that is most likely to be correct. The down side of mirrored storage is that it requires additional storage-two, three, or more times as much as that required for the original data.
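A majority vote among three copies takes very little code. This is a minimal sketch; the function name and the 16-bit value type are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Vote among three stored copies of a value.
   Returns true and writes the agreed value if at least two copies match. */
bool voteOfThree(uint16_t a, uint16_t b, uint16_t c, uint16_t *result)
{
    if (a == b || a == c) {
        *result = a;
        return true;
    }
    if (b == c) {
        *result = b;
        return true;
    }
    return false;   /* all three disagree: no copy can be trusted */
}
```

When the vote succeeds but one copy differs, it is usually worth rewriting the odd copy so that a second failure later does not leave you without a majority.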
Whatever method of error detection and correction you choose, managing your data will be easier if you isolate them behind a set of subroutines. These routines can provide gateways that make it easier to determine when to save data to nonvolatile storage and when to calculate and test error checking codes, and they can hide the business of storing and reading multiple copies in mirrored storage. If necessary, this isolation will also allow you to experiment with different algorithms without affecting the rest of your software.
Tips and techniques
Error detection methods can tell your system if an error has already occurred in nonvolatile data. ECC can even help repair the damage. But both work after the fact. There are some ways
to help prevent errors from occurring in the first place. Prevention works best by looking for the things that can cause errors and keeping them from occurring.
Possibly the most traumatic event for nonvolatile data is an unplanned power failure. Some simple systems don't care about power outages because they'll simply reload defaults from ROM when the power comes back on. Other systems, with reliable, redundant power sources, may not care about power outages because they just won't occur. (If you're
building one of these systems, you might still consider using some of these techniques-it may help recover a system where even the redundant systems failed.) A huge number of systems, however, must save data for future use despite the presence or absence of power.
Systems like these usually have a method of detecting an imminent power failure (usually by interrupting the processor), some form of reserve power with which to complete power-failure processing, and a storage medium that will retain data when the
power finally goes off. These must all be considered carefully together in order to make a reliable system. For example, a system that has only 10 milliseconds of reserve power should not try to write data to a disk drive, because it won't make it in time. Also, a system that performs calculations that can't afford to be interrupted by a power failure must be able to disable that interrupt briefly, but not so long that any reserve power is consumed before power-down processing can be performed. For this reason,
you should avoid connecting a power failure interrupt source to the Non-Maskable Interrupt (NMI) on your processor. It must be a high priority, to be sure, but it must be maskable. (There is a way to simulate masking of an NMI in software, but it's hardly an optimum solution. More on this later.)
When designing a system that will perform power-fail detection and processing, add up the following times:
- The maximum time that the power failure interrupt may be disabled
- The latency in responding to the interrupt
- The time required to perform all power-down processing (for example, CRC calculations)
- The time required to safely write all of the necessary data to nonvolatile storage
The total of these times must safely fit into your reserve power time. When relying on residual charge in filter capacitors in the power supply to carry you through, remember these things:
- Electrolytic capacitors can lose a good deal (on the order of 50% in some cases) of capacity as they age. Assume your product will have less reserve when it's ten years old than it has when it leaves the factory
- Capacitors in general may have less capacity when they're hot
- If your device is line-powered, the line voltage may have been below normal when power finally went off. Your capacitors may not have as much charge as if they had been operating at normal line conditions
- Some storage media, like EEPROM or disk, will draw more current while they're being written to. This can drain your reserve faster
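The timing budget above is simple arithmetic, and it can be worth writing down as code so that it is rechecked whenever a number changes. The sketch below assumes all times are in microseconds and lumps the capacitor effects into a single derating percentage; the structure, the names, and the single-percentage simplification are all mine.

```c
#include <stdbool.h>
#include <stdint.h>

/* Worst-case times, in microseconds, for each step of power-fail handling. */
typedef struct {
    uint32_t maxMaskedUs;    /* longest time the power failure interrupt is disabled */
    uint32_t latencyUs;      /* latency in responding to the interrupt */
    uint32_t processingUs;   /* power-down processing, such as CRC calculations */
    uint32_t writeUs;        /* writing the data to nonvolatile storage */
} PowerFailBudget;

uint32_t worstCaseShutdownUs(const PowerFailBudget *b)
{
    return b->maxMaskedUs + b->latencyUs + b->processingUs + b->writeUs;
}

/* remainingPercent is the share of the fresh reserve time you still expect
   from an aged, hot, undercharged supply (for example, 50). */
bool reserveIsSufficient(const PowerFailBudget *b,
                         uint32_t freshReserveUs,
                         uint32_t remainingPercent)
{
    uint32_t agedReserveUs =
        (uint32_t)(((uint64_t)freshReserveUs * remainingPercent) / 100u);
    return worstCaseShutdownUs(b) <= agedReserveUs;
}
```

For example, a system that needs 6,750 microseconds in the worst case does not fit a 10-millisecond reserve once that reserve is derated to 50%, even though it fits comfortably when the product is new.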
If your power supply can't provide sufficient passive reserve power (in the form of charged capacitors), you may need to add a battery that is capable of running the system until power failure processing is complete. This same battery could be used to power static RAM for nonvolatile storage, but you'll probably want to disconnect the processor and other current consuming parts of the system when processing is done. This disconnection will require a switch of some
sort that is under the control of software or that will switch off automatically some period of time after a power failure is detected. Inexpensive ICs are available that perform these functions, but they will add complexity to your design. Remember, also, that batteries may be regarded as a maintenance item by the user, adding to the cost of operating the system. Batteries tend to lose capacity, too, with time and temperature extremes, so be conservative.
If you have full-speed nonvolatile RAM, you may
consider writing all your nonvolatile data to it at power down. But in order to save time (which may be desirable in any system, and a necessity for slow-writing memory like EEPROM) you could periodically write the data, and even error checking information, during run time. Then power failure processing becomes an exercise in not overwriting what is already safely stored. The power failure interrupt service routine becomes the place to ready the system to shut down. This may include setting memory addressing
registers to point to safe locations, turning off any supporting battery source, and placing the processor in an idle or power-saving state.
For example, let's say you're developing a data logging system that, once per second, reads an analog input and saves the reading, along with a time stamp, into a circular queue in nonvolatile storage. Listing 1 shows the main function of a program that can accomplish this task.
Because performance requirements are low, you've chosen an inexpensive 8-bit processor to
do it. That means that your multiple precision arithmetic, while still simple, is done one byte at a time. What would happen if your system took a power failure interrupt "right between the bytes"? You've already stored an A/D reading in the queue, as well as the two least significant bytes of the 4-byte time stamp. What happens if you stop right there? The most significant bytes may be zero (if the queue was cleared and that particular location has never been used before), or they may be the two most
significant bytes from a time stamp that was stored long ago. When the system starts up again, you'll have a reading with an out-of-sequence (and incorrect) time stamp.
If you could have disabled the power failure interrupt before beginning to store any new data in the queue, that would guarantee that all related values would have been stored together. In order to reduce the amount of time that the interrupt is disabled, the A/D converter manipulation and reading the time can be done in local variables outside
the disabled code. Listing 2 illustrates this method.
You could further isolate access to the nonvolatile storage by providing a function called putQueue(reading, time) and placing the disabling and enabling of power failure interrupts there. This would completely remove the need for main() to manage nonvolatile storage. In fact, managing the queue itself would require nonvolatile data (locations of oldest and newest records in the queue, as well as the number of records stored) that could be "hidden" inside
that function. These data need to be protected from power failures, too, in order to retrieve the data later.
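A putQueue() along these lines might look like the following sketch. It is not one of the numbered listings. The queue size, the record layout, the parameter name timeStamp, and the bodies of disablePowerFailure() and enablePowerFailure() are placeholders; on real hardware those two functions would mask and unmask the power failure interrupt, and the queue would live in battery-supported RAM.

```c
#include <stdint.h>

#define QUEUE_RECORDS 8u   /* hypothetical queue size */

typedef struct {
    uint16_t reading;      /* A/D reading */
    uint32_t time;         /* 4-byte time stamp */
} LogRecord;

/* Stand-ins for locations in battery-supported RAM. */
static LogRecord queue[QUEUE_RECORDS];
static uint8_t newest;     /* index of the next slot to write */
static uint8_t count;      /* number of records stored */

/* Placeholders: real versions would mask and unmask the interrupt. */
static volatile uint8_t powerFailMasked;
void disablePowerFailure(void) { powerFailMasked = 1u; }
void enablePowerFailure(void)  { powerFailMasked = 0u; }

/* Store one record. The caller gathers the reading and time stamp first,
   so the interrupt stays disabled only for the copy and the bookkeeping. */
void putQueue(uint16_t reading, uint32_t timeStamp)
{
    disablePowerFailure();
    queue[newest].reading = reading;
    queue[newest].time = timeStamp;
    newest = (uint8_t)((newest + 1u) % QUEUE_RECORDS);
    if (count < QUEUE_RECORDS) {
        count++;
    }
    enablePowerFailure();
}
```

Because the record and the queue bookkeeping are updated inside the same disabled region, a power failure sees either the old state or the new one, never half of each.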
If you had used EEPROM instead of RAM for the data logging application, you'd have some additional conditions to keep in mind. Since EEPROM typically is guaranteed to survive only 10,000 write cycles, a system that writes to a single EEPROM location every second would potentially wear out that location in about 2.8 hours (10,000 seconds). However, because you're using a circular queue, you'll be
writing to any given location every N seconds, where N is the number of records in the queue.
Or will you? What about the head and tail pointers into the queue? They must be updated every time the queue is updated. If these are single locations in the EEPROM, they can still wear out in a few hours. There is a way out of this dilemma, but it takes a little more code at initialization. At startup (i.e., in the function initQueue() in the Listings), one could scan through the queue looking for the oldest and newest records, and set the head and tail pointers (in volatile RAM) accordingly. This removes the need to save them in nonvolatile storage.
Now you have another problem. Unless there's a sizable battery in your system, you probably can't afford to disable the power failure interrupt while writes to EEPROM take place. These can take tens of milliseconds to complete. If power dies while the program is writing to the EEPROM, it's hard to tell what will be stored.
That's where error checking becomes a
necessity. Adding a checksum to each record will increase storage requirements slightly, but will allow you to determine which records are trustworthy. If, at startup, a record shows a bad checksum, you can treat it as an unused record if it is between the oldest and newest "good" records. If it is elsewhere in the queue, you should flag it as a potentially bad EEPROM chip. Or, if you had error correction data available, you could restore the record and continue.
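The startup scan and the per-record checksum can be combined as in the sketch below. It is an illustration, not the initQueue() from the listings. It assumes time stamps only increase and never wrap, it uses an inverted 8-bit sum so that an all-zero or all-0xFF (erased) record fails the check, and it simply skips bad records rather than deciding whether they indicate a failing chip.

```c
#include <stdint.h>

typedef struct {
    uint16_t reading;
    uint32_t time;
    uint8_t  checksum;
} StoredRecord;

/* Inverted 8-bit sum over the reading and time stamp bytes. */
uint8_t recordChecksum(const StoredRecord *r)
{
    uint8_t sum = (uint8_t)((r->reading & 0xFFu) + (r->reading >> 8));
    for (int i = 0; i < 4; i++) {
        sum = (uint8_t)(sum + ((r->time >> (8 * i)) & 0xFFu));
    }
    return (uint8_t)~sum;
}

/* Indexes of the oldest and newest trustworthy records, or -1 if none. */
typedef struct {
    int oldest;
    int newest;
} QueueBounds;

QueueBounds scanQueue(const StoredRecord *records, int n)
{
    QueueBounds b = { -1, -1 };
    for (int i = 0; i < n; i++) {
        if (records[i].checksum != recordChecksum(&records[i])) {
            continue;   /* bad checksum: treat as unused */
        }
        if (b.oldest < 0 || records[i].time < records[b.oldest].time) {
            b.oldest = i;
        }
        if (b.newest < 0 || records[i].time > records[b.newest].time) {
            b.newest = i;
        }
    }
    return b;
}
```

The scan reads every record once at startup, which costs time but no EEPROM write cycles.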
A queue like this can also be useful for storing in
EEPROM (or Flash EPROM, which tolerates even fewer writes) single records that update frequently. Instead of writing new data to the same location in EEPROM until it wears out, you can write it to the next available record in a queue along with an incrementing serial number and a checksum. At startup, the record with the highest serial number and a good checksum contains the most recently saved (and valid) data.
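Here is a minimal sketch of that idea for a single 16-bit value. A RAM array stands in for the EEPROM, and the slot count, names, and inverted-sum checksum are assumptions made for the example. A real implementation would also replace the struct assignment with the device's own write routine.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLOT_COUNT 4   /* hypothetical number of slots in the queue */

typedef struct {
    uint32_t serial;   /* incremented on every save */
    uint16_t value;
    uint8_t  checksum;
} Slot;

/* Stand-in for EEPROM. An all-zero slot fails the checksum, so it reads as unused. */
static Slot slots[SLOT_COUNT];

static uint8_t slotChecksum(const Slot *s)
{
    uint8_t sum = (uint8_t)((s->value & 0xFFu) + (s->value >> 8));
    for (int i = 0; i < 4; i++) {
        sum = (uint8_t)(sum + ((s->serial >> (8 * i)) & 0xFFu));
    }
    return (uint8_t)~sum;
}

/* Index of the valid slot with the highest serial number, or -1 if none. */
int findLatest(void)
{
    int latest = -1;
    for (int i = 0; i < SLOT_COUNT; i++) {
        if (slots[i].checksum != slotChecksum(&slots[i])) {
            continue;
        }
        if (latest < 0 || slots[i].serial > slots[latest].serial) {
            latest = i;
        }
    }
    return latest;
}

/* Write the new value to the slot after the latest one, spreading the wear. */
void saveValue(uint16_t value)
{
    int latest = findLatest();
    Slot next;
    next.serial = (latest < 0) ? 1u : slots[latest].serial + 1u;
    next.value = value;
    next.checksum = slotChecksum(&next);
    slots[(latest + 1) % SLOT_COUNT] = next;
}

/* Fetch the most recently saved valid value. Returns false if there is none. */
bool loadValue(uint16_t *value)
{
    int latest = findLatest();
    if (latest < 0) {
        return false;
    }
    *value = slots[latest].value;
    return true;
}
```

Scanning on every save keeps the sketch short; in practice you would scan once at startup and keep the latest index in volatile RAM. If the newest slot is damaged, loadValue() falls back to the previous save, which is often better than falling back to a default.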
When determining how much EEPROM is required for these techniques, you must take into consideration several factors:
- How long can the system run between updates to the nonvolatile data? One second? One minute? One hour? What is the impact of using data from the previous update?
- How many writes will be tolerated by the EEPROM? (The typical part is guaranteed to go 10,000 write cycles. Some parts, available at higher cost, will accept 100,000 or more writes.)
- What is the useful life of the product?
- What is the size of a single record (including time stamps, serial numbers, and error checking data)?
You can estimate the number of times your system will write to the EEPROM by dividing the useful life by the update rate. For example, a device with a 20-year life, updating every hour, will write to EEPROM 175,200 times. Don't forget leap years, which could add another 120 updates for a total of 175,320.
To get the minimum number of records the system's queue must hold, divide the number of lifetime updates by the number of guaranteed writes that the EEPROM will accept.
This would be a good place to be conservative, so let's assume 8,000 writes per location. You can expect to use 21.915 records, which you'll have to round up to 22. Any convenient number of records greater than this will work.
To determine the actual number of bytes required, multiply the number of records by the size of each record, including serial numbers, time stamps, and error checking data. For the data logging example (now recording a reading every hour), that would be two bytes of A/D reading, four
bytes of time, and one byte of checksum, for a total of seven bytes. Seven bytes per record for 22 records is 154 bytes. This would fit into most serial EEPROMs very nicely, with some room to spare. If you have the room, increase the number of records in the queue. This will improve reliability by decreasing the number of writes to each EEPROM location even more.
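The sizing arithmetic above can be captured in three small functions, which makes it easy to rerun when the update rate or the EEPROM part changes. The function names are mine; the numbers in the usage note are the ones from the example.

```c
#include <stdint.h>

/* Integer division that rounds up instead of down. */
static uint32_t divideRoundUp(uint32_t n, uint32_t d)
{
    return (n + d - 1u) / d;
}

/* Total writes over the product's life, counting leap days separately. */
uint32_t lifetimeWrites(uint32_t years, uint32_t leapDays, uint32_t updatesPerDay)
{
    return (years * 365u + leapDays) * updatesPerDay;
}

/* Smallest queue that keeps each location within its write-cycle limit. */
uint32_t minimumRecords(uint32_t totalWrites, uint32_t writesPerLocation)
{
    return divideRoundUp(totalWrites, writesPerLocation);
}

/* Storage needed for the whole queue. */
uint32_t queueBytes(uint32_t records, uint32_t bytesPerRecord)
{
    return records * bytesPerRecord;
}
```

With 20 years, 5 leap days, and 24 updates per day, lifetimeWrites() gives 175,320. Dividing by 8,000 writes per location gives 22 records, and 22 records of 7 bytes each need 154 bytes.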
Your data still isn't completely safe, however. It never really is, but there are a few more things you can do to protect it even further. Most
memories with a serial interface use a location counter within the device to perform addressing. The system sets the location counter before reading or writing that location. Some memories will automatically increment or decrement the counter when a location is read or written, but the location counter is always pointing at something. If it should be pointing at the location of something critical, and a spurious write operation started, it could overwrite that location with gibberish. True, your error
detection will probably catch it, but it could have been prevented by always setting the location counter to an unused location in the device. Much like the head landing zones on a hard disk, this is a safe place to "crash."
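The "landing zone" idea can be sketched against a simulated device. Everything here is an assumption made for illustration: the SimEeprom structure, the function names, the 256-byte size, the auto-incrementing counter, and the choice of 0xF0 through 0xFF as the unused area. A real part would be driven through its own serial protocol.

```c
#include <stdint.h>

#define EEPROM_SIZE   256u
#define LANDING_ZONE  0xF0u   /* assumed: 0xF0..0xFF hold no data */

/* Simulated serial EEPROM: storage plus an internal location counter. */
typedef struct {
    uint8_t  cells[EEPROM_SIZE];
    uint16_t counter;
} SimEeprom;

/* Point the device's location counter at an address. */
void simSetAddress(SimEeprom *dev, uint16_t addr)
{
    dev->counter = (uint16_t)(addr % EEPROM_SIZE);
}

/* Write at the current location; this simulated part auto-increments. */
void simWriteByte(SimEeprom *dev, uint8_t value)
{
    dev->cells[dev->counter] = value;
    dev->counter = (uint16_t)((dev->counter + 1u) % EEPROM_SIZE);
}

/* Write a record, then park the counter in the unused area so that a
 * spurious write lands where it does no harm. */
void writeRecord(SimEeprom *dev, uint16_t addr,
                 const uint8_t *data, uint16_t len)
{
    uint16_t i;

    simSetAddress(dev, addr);
    for (i = 0; i < len; i++) {
        simWriteByte(dev, data[i]);
    }
    simSetAddress(dev, LANDING_ZONE);
}
```

Reserving several bytes rather than one leaves some margin on a part that increments its counter, since a burst of spurious writes would otherwise walk out of the landing zone.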
As I mentioned earlier, it's better to not connect your power failure interrupt to the NMI input on the processor. However, sometimes you're handed a design that's already committed to hardware, and that's the interrupt you get. It's still possible to keep that NMI from interfering too
much.
In the listings, we used two functions (disablePowerFailure() and enablePowerFailure()) to control power failure interrupt disabling and enabling. Because you can't disable NMI, the next best thing for disablePowerFailure() to do is to set a flag that will indicate to the NMI service routine that it should not perform any processing, but simply set an "interrupt pending" flag and return. Then enablePowerFailure() can test the pending flag and perform any NMI processing that is required.
This technique
can cause problems if NMI is level-sensitive and the interrupting signal can't be canceled. In that case, your system may keep trying to service NMI after every instruction in your main code. In most cases, however, you'll be able to convince NMI to only interrupt once, either because it is edge-triggered or because you'll be able to clear the interrupting signal. Then you'll have only the impact of running a shortened NMI service routine during your disabled code.
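A minimal sketch of the two-flag scheme follows. The names disablePowerFailure() and enablePowerFailure() come from the article; the flag variables, nmiServiceRoutine(), handlePowerFailure(), and the counter used to make the behavior observable are hypothetical. It assumes an edge-triggered NMI (or one whose source can be cleared), as discussed above.

```c
#include <stdbool.h>

volatile bool powerFailMasked = false;      /* set while NMI work is deferred */
volatile bool powerFailPending = false;     /* NMI arrived while masked */
volatile unsigned powerFailHandled = 0;     /* for illustration only */

/* Hypothetical stand-in for the real power-failure processing. */
void handlePowerFailure(void)
{
    powerFailHandled++;
}

/* NMI cannot be disabled, so the service routine checks the flag. */
void nmiServiceRoutine(void)
{
    if (powerFailMasked) {
        powerFailPending = true;    /* remember it and get out quickly */
        return;
    }
    handlePowerFailure();
}

void disablePowerFailure(void)
{
    powerFailMasked = true;
}

void enablePowerFailure(void)
{
    powerFailMasked = false;
    if (powerFailPending) {         /* service what arrived while masked */
        powerFailPending = false;
        handlePowerFailure();
    }
}
```

An NMI that arrives between disablePowerFailure() and enablePowerFailure() costs only the flag test and one store; the real work runs when the protected section ends.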
Sometimes the processing required to create
a set of nonvolatile data takes too long to leave the power failure interrupt disabled while it executes. If you have 10ms of reserve power and 100ms of processing, you'll need to divide up the process into chunks of less than 10ms each. I've found that a simple state machine, with each state being a chunk of processing that has power failure interrupts disabled, is an effective way of protecting the data.
This method adds a good deal of complexity to what would otherwise be a long and tedious, but probably
straightforward, series of calculations. But this method does allow data to be created, stored, and retrieved reliably in the face of limited system resources.
Essentially, all data used in creating the final result, as well as the state variable, must be saved in nonvolatile storage. The first step (first state) is to make certain that all source data are safely stored and the state variable is incremented and also stored. This is done with the power failure interrupt disabled. By briefly enabling the
interrupt, you give the system the opportunity to handle any pending power failures. After disabling again, run the next state, which may save intermediate values in nonvolatile storage and increment the state variable again. Enable and disable again, and proceed.
In essence, you've turned your interruptible code into a program that polls for power failures when they're convenient to service. Should a power outage occur, the system need only execute the state machine on power-up to finish processing the data.
The last correctly stored state value acts as a checkpoint which tells the state machine where to resume processing.
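The checkpointing idea might look something like the sketch below. It is not the article's listing: the NvContext structure stands in for real nonvolatile storage, the three-step averaging job is invented to keep the example short, and the two interrupt-control functions are empty stubs.

```c
#include <stdint.h>

enum { ST_SAVE_SOURCE, ST_SUM, ST_AVERAGE, ST_DONE };

/* Stand-in for nonvolatile storage. A real system would write these
 * fields to EEPROM or battery-backed RAM. */
typedef struct {
    uint8_t  state;        /* checkpoint: next chunk to run */
    uint16_t source[4];    /* source data, saved in the first state */
    uint32_t sum;          /* intermediate value */
    uint16_t average;      /* final result */
} NvContext;

/* Stubs for the article's interrupt-control functions. */
void disablePowerFailure(void) {}
void enablePowerFailure(void) {}

/* Run exactly one chunk with power failure handling deferred.
 * Returns nonzero while work remains. */
int runOneState(NvContext *nv, const uint16_t *input)
{
    int i;

    disablePowerFailure();
    switch (nv->state) {
    case ST_SAVE_SOURCE:               /* input is only read here */
        for (i = 0; i < 4; i++) {
            nv->source[i] = input[i];
        }
        nv->state = ST_SUM;
        break;
    case ST_SUM:
        nv->sum = 0;
        for (i = 0; i < 4; i++) {
            nv->sum += nv->source[i];
        }
        nv->state = ST_AVERAGE;
        break;
    case ST_AVERAGE:
        nv->average = (uint16_t)(nv->sum / 4u);
        nv->state = ST_DONE;
        break;
    default:
        break;
    }
    enablePowerFailure();   /* a pending power failure is serviced here */
    return nv->state != ST_DONE;
}

/* Called in normal operation and again on power-up to finish the job. */
void runToCompletion(NvContext *nv, const uint16_t *input)
{
    while (runOneState(nv, input)) {
    }
}
```

If power fails after the ST_SUM chunk, the stored state is ST_AVERAGE, so calling runToCompletion() on power-up resumes there and uses only values already saved in NvContext.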
Other events that can be detected and signaled with an interrupt can be handled the same way. Some events, however, can come without warning. The best way to protect against these events is to keep multiple copies (for example, the most recent three sets) of nonvolatile data in a queue, with error checking and serial number data. On startup, the copy with the highest serial number and a valid checksum is used as the current data set.
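The startup selection can be sketched as follows. The record layout, the function names, and the inverted additive checksum are assumptions chosen for brevity, not the article's listing; the sketch also ignores serial number wraparound, which a long-lived product would have to handle.

```c
#include <stddef.h>
#include <stdint.h>

#define DATA_BYTES 4

typedef struct {
    uint16_t serial;             /* incremented on every update */
    uint8_t  data[DATA_BYTES];
    uint8_t  checksum;
} NvRecord;

/* Inverted 8-bit sum over the serial number and data bytes. */
uint8_t recordChecksum(const NvRecord *r)
{
    size_t i;
    uint8_t sum = (uint8_t)((r->serial & 0xFFu) + (r->serial >> 8));

    for (i = 0; i < DATA_BYTES; i++) {
        sum = (uint8_t)(sum + r->data[i]);
    }
    return (uint8_t)~sum;
}

/* Return the valid copy with the highest serial number,
 * or NULL if no copy passes its checksum. */
const NvRecord *selectCurrent(const NvRecord copies[], size_t count)
{
    size_t i;
    const NvRecord *best = NULL;

    for (i = 0; i < count; i++) {
        if (recordChecksum(&copies[i]) != copies[i].checksum) {
            continue;            /* corrupt or partially written copy */
        }
        if (best == NULL || copies[i].serial > best->serial) {
            best = &copies[i];
        }
    }
    return best;
}
```

A NULL result means every copy failed its check, which is the case where the system has to fall back on some other recovery plan.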
User Issues
Considering the things that can go wrong with nonvolatile data, it could seem that there's no point in even trying to use it. But many of the potential problems can be corrected even if they do occur. Just because a system must use its error correction code now and then, is it really failing? Or is it just doing the job it was built to do? These questions come up when one is considering the user interface. Should you notify the user of a
successfully corrected error?
Like so many engineering questions, the answer is "That depends." Some factors that affect the answer are:
- How technically sophisticated is the user? Does he or she realize that errors do occur, and that the system is handling them correctly? Or will the user see any warning as a reliability problem and buy another system? (Perhaps one that is potentially less reliable, but doesn't report its errors?)
- Are errors occurring more frequently than they should? A really robust system can keep running even when a serious problem occurs
- What do you expect the user to do if he or she receives an error report? Can the user take corrective action, or just fret over it? (Remember Steinbach's Guideline for Systems Programming: "Never test for an error condition you don't know how to handle.")
Depending on the answers to these and other questions, you might choose to employ one of the following reporting philosophies:
- Report
everything-the user needs (or wants) to know the status of the system at all times
- Perform some statistical analysis and report error tendencies outside of the normal system profile. (Deciding what a "normal system profile" is can be a challenge)
- Don't report anything unless there's no way to hide it
- Don't report anything. Period
The balance between informing the user and unnecessarily alarming him or her is difficult to find. Sometimes it's acceptable to make error data available
on request, but only issue an alarm in the case of critical (unrecoverable) errors.
Something that tends to be overlooked in designing nonvolatile data systems is how to initialize data in the first place. A really tenacious system can be impossible to initialize because it will cling to whatever data it has, no matter how useless. Plan on having a method for someone-a factory technician, at least-to be able to start the system in a known state. The procedure need not be made known to the user unless it is
a requirement, but it will make production easier in the long run. Make sure the values loaded during this process make sense-initializing the system this way won't do any good if it comes out with an unusable baud rate.
At some point, despite all your careful design and programming, the system may fail to recover corrupt data. Nothing is foolproof; all we can do is try to improve the odds. So then what do you do when the odds catch up to your system?
Again, it depends on the application, but the common
possibilities are:
- Load defaults from ROM. This can often keep the system running well enough to report the error to the user and allow the system to do its job at reduced efficiency until the error can be corrected
- Go ahead and use the data anyway, noting to the user that the data are suspect. It is conceivable that the only thing wrong is that the checksum wasn't stored in time before a power failure. If you're going to do this, verify that the data are in a valid range before
trying to use them
- Stop and demand that the user correct the situation before proceeding. If you're going this route, be sure to give the user the tools to actually do something about it. (Do you remember PARITY CHECK 1? The original IBM PC and other computers of that type used parity checking on their RAM. If an error was detected the system would halt, report the error, and wait to be turned off. The user was notified of a problem, but had no way of attempting to correct it or even saving work to
be corrected later.)
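The first two possibilities can be combined in one recovery routine, sketched below. The Config fields, the range limits, the default values, and all of the names are hypothetical; the point is only the order of the checks. The third possibility (stopping and asking the user) is not shown.

```c
#include <stdint.h>

typedef struct {
    uint32_t baudRate;
    uint8_t  retries;
} Config;

typedef enum { CFG_OK, CFG_SUSPECT, CFG_DEFAULTS } ConfigStatus;

/* Stand-in for default values held in ROM. */
const Config romDefaults = { 9600u, 3u };

/* Range check; the limits here are assumed for illustration. */
int configInRange(const Config *c)
{
    return c->baudRate >= 300u && c->baudRate <= 115200u
        && c->retries <= 10u;
}

/* Decide what to run with. Data in a valid range is used even when
 * the checksum is bad, but is reported as suspect. Anything out of
 * range is replaced by the ROM defaults. */
ConfigStatus recoverConfig(const Config *stored, int checksumValid,
                           Config *out)
{
    if (configInRange(stored)) {
        *out = *stored;
        return checksumValid ? CFG_OK : CFG_SUSPECT;
    }
    *out = romDefaults;
    return CFG_DEFAULTS;
}
```

The returned status gives the user interface something to act on: CFG_SUSPECT and CFG_DEFAULTS are the cases worth reporting.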
Nothing is foolproof
Embedded systems are often required to retain data for long periods of time-even when the power is turned off. Many technologies are available to store nonvolatile data, each with its advantages and disadvantages. Furthermore, external conditions can affect the reliability of stored data-conditions that may be exaggerated by characteristics of the chosen storage medium.
Common types of nonvolatile storage media available to the embedded system
designer are ROM, battery-supported RAM, EEPROM, Flash EPROM, and magnetic storage (such as disk or tape). All of these technologies except RAM write slowly and are susceptible to being interrupted by power outages and other unpredictable events. RAM, on the other hand, requires a backup battery that may not be acceptable to some users, and the battery is itself prone to failure as it discharges over time.
There are methods of error checking and correcting that will help the embedded system detect and repair data errors.
Like hardware, the different methods available have advantages and disadvantages that must be considered. Some algorithms run quickly but consume extra storage, others use less memory but take longer to execute. None are 100% secure.
Other methods are available to improve the odds of maintaining data reliably. They can add robustness in the face of unpredictable power and other annoying events. They can also be used to get around deficiencies in a given storage medium, if that medium is desired for its
other qualities.
However, the designer must admit that nothing is completely foolproof. The system must have an intelligent, reasoned plan in case the data are lost despite all precautions. Knowing the application, its operating environment, and the task required of it can make deciding what to do in a disaster a little easier.
David Hinerman is employed as a software engineer at Microcom Corp. in Westerville, OH. He has over 12 years of experience as an embedded software developer.