Improving reliability of non-volatile memory systems
Complex systems such as Advanced Driver-Assistance Systems (ADAS) and medical and industrial applications need to be reliable, secure, and safe. In these systems, firmware and associated data are stored in Non-Volatile Memory (NVM) because code and data must be retained when power is not being supplied. Thus, NVM plays a crucial role in system reliability.
NVM reliability can be expressed in two ways: data retention time and cycling endurance. Retention time dictates how long NVM can reliably hold data and code. Endurance measures how many times the NVM can be rewritten and still reliably hold data and code. To offset these limitations, designers often employ special host software and/or hardware, such as a Flash File System that employs wear leveling and/or error correction code (ECC) technology to ensure data has not changed since it was last written. These measures result in system overhead, often negatively impacting performance. In addition, complex remedies reduce system robustness, especially in cases of NVM operation during a power failure.
Today’s NVM employs next-generation technology to increase reliability. Companies like Cypress, with its Semper NOR Flash Memory, have introduced advanced measures such as on-die ECC and internal wear leveling to substantially improve retention and endurance in Flash NVM (see Figure 1).
Figure 1: Today’s NVM devices, like the Semper NOR Flash architecture shown here, integrate advanced measures to substantially improve retention and endurance. (Source: Cypress Semiconductor)
Retention and Endurance
NVM devices need to store data accurately and keep it available for long periods of time. At the same time, systems need to be able to update data frequently, which degrades the storage medium and thus impacts its ability to store data long term. In general, endurance in Flash devices is related to the number of program/erase (P/E) cycles that the device can sustain without losing data. Retention and endurance are also affected by use cases (i.e., how program/erase/read operations are applied) and the use environment (in particular, voltage and temperature).
When a failure occurs, regardless of the failure mechanism, bits in a Flash cell become corrupted. The result is the loss of the information stored in the cell. Bit error rate (BER) provides a common reliability metric for assessing corruption and is used to define the expected reliability of a particular application.
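As a rough illustration of the metric, BER is commonly computed as the number of corrupted bits divided by the total number of bits examined. The sketch below assumes a simple write-then-read-back comparison; the function name `bit_error_rate` and the test pattern are hypothetical and not taken from any vendor tool.

```python
def bit_error_rate(written: bytes, read_back: bytes) -> float:
    """Return flipped bits / total bits for two equally sized buffers.

    Assumption: BER is measured by comparing what was programmed
    against what is later read back from the same locations.
    """
    if len(written) != len(read_back):
        raise ValueError("buffers must have the same length")
    if not written:
        raise ValueError("buffers must not be empty")
    # XOR exposes the differing bits; count the 1s in each byte.
    flipped = sum(bin(a ^ b).count("1") for a, b in zip(written, read_back))
    return flipped / (8 * len(written))


# Hypothetical example: one flipped bit in a 2-byte (16-bit) pattern.
ber = bit_error_rate(b"\x00\xff", b"\x01\xff")
```

In practice the comparison would be run over far larger data sets and under the retention and cycling conditions the application is expected to see.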
Satisfying the targeted BER for a required combination of retention and endurance becomes more difficult as NVM devices move to more advanced technology nodes. The reason is that advanced nodes are smaller. As such, they offer less area and thus less electric charge to store information. Storing the same information with less charge makes reliability for a Flash device even more challenging to maintain.
Options for Solving the Reliability Problem
There are fundamentally two options for improving NVM reliability:
minimize the occurrence of bit errors
correct bit errors that have occurred
The probability of bit errors occurring can be minimized by reducing the degradation of cells due to program/erase cycles. This can be achieved by spreading the “wear” of such cycles evenly across all of the cells in the device so that no subset of cells is repeatedly programmed and erased. For example, excessive wear can occur when an application frequently writes to the same address or cell. The standard technique for spreading wear is known as Wear Leveling (WL).
Errors that have occurred can be corrected with Error Correction Code (ECC) technology. ECC technology is widely deployed and therefore not further discussed here. For more information, see Automatic ECC. Instead, the rest of this article will focus on the use of wear leveling to improve a system’s BER.
There are two primary types of wear leveling:
external wear leveling implemented by host software, typically as part of the Flash File System (FFS)
internal wear leveling integrated within the NVM device itself, typically managed by an integrated CPU
The primary limitation of external wear leveling is the requirement of a host processor capable of performing the task. This is typically not the case for a large number of small-footprint systems that also do not have the capacity to run a Flash File System. Such applications benefit greatly if the NVM can meet its own reliability targets without requiring assistance from the host. Integrating wear leveling in the NVM can significantly enhance reliability in small-footprint applications. Larger systems can also benefit from internal wear leveling. Integrating wear leveling in the NVM greatly simplifies system design by eliminating the need for engineers to develop their own wear leveling software and then verify it for compliance with the various safety standards. Internal wear leveling also enables any application to utilize reliable storage without requiring substantial redesign of the memory subsystem.
Internal Wear Leveling
Wear Leveling is implemented by mapping logical addresses to physical addresses. The logical addresses are presented to the user. During the lifetime of the part, the wear leveling function dynamically maps these logical addresses to changing physical addresses in order to maintain a uniform distribution of program/erase cycles over the physical addresses. Because erase operations affect a group of addresses (often referred to as a sector or block), wear level mapping has a granularity of sectors (vs. individual bytes or words).
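The sector-granular mapping described above can be sketched as a small lookup table. This is a minimal illustration, not the Semper implementation: the class name `SectorMap`, the 4 KB `SECTOR_SIZE`, and the method names are assumptions chosen for the example.

```python
SECTOR_SIZE = 4096  # hypothetical sector size in bytes


class SectorMap:
    """Logical-to-physical sector table (illustrative sketch only)."""

    def __init__(self, num_sectors: int) -> None:
        # Start with the identity mapping: logical sector i -> physical sector i.
        self.l2p = list(range(num_sectors))

    def translate(self, logical_addr: int) -> int:
        """Map a logical byte address to a physical byte address."""
        sector, offset = divmod(logical_addr, SECTOR_SIZE)
        # Only the sector number is remapped; the offset inside it is kept.
        return self.l2p[sector] * SECTOR_SIZE + offset

    def swap(self, logical_a: int, logical_b: int) -> None:
        """Exchange the physical sectors behind two logical sectors."""
        self.l2p[logical_a], self.l2p[logical_b] = (
            self.l2p[logical_b],
            self.l2p[logical_a],
        )


mapping = SectorMap(num_sectors=4)
before = mapping.translate(4100)  # logical sector 1, offset 4
mapping.swap(1, 3)                # logical sector 1 now lives in physical sector 3
after = mapping.translate(4100)
```

The user keeps addressing logical byte 4100 throughout; only the physical location behind that address changes.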
The first step is to partition the address space of the memory device into a partition that is user accessible for storing user data and code (see Sidebar: Partitioning), and a partition for storing wear leveling meta data that is not user accessible. Meta data includes information about the logical to physical sector mapping as well as power loss meta data (see Power Failures below).
Wear leveling can be applied only to the user address space. This means that the address space containing wear leveling meta data is not managed by the wear-leveling function. The reliability of that address space is instead guaranteed by providing redundancy, both inside a cell and by storing data in multiple cells. A detailed discussion of this mechanism is outside the scope of this article.
Whenever a sector in the user address space is erased and its erase count exceeds a given threshold, a remap or swap operation is initiated and the sector mapping is updated. Since program/erase operations are relatively slow (several milliseconds), the time for re-mapping and updating the mapping tables is negligible in comparison to the erase operation itself.
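A threshold-triggered swap might look like the sketch below. The article does not specify how the threshold is defined, so this example assumes it is the allowed gap between the erased sector's count and the least-worn sector's count. The class `WearLeveler` and its fields are hypothetical, and the data movement a real device performs during a swap is omitted.

```python
class WearLeveler:
    """Threshold-based sector swapping (illustrative sketch only)."""

    def __init__(self, num_sectors: int, threshold: int) -> None:
        self.l2p = list(range(num_sectors))    # logical -> physical sector
        self.erase_counts = [0] * num_sectors  # indexed by physical sector
        self.threshold = threshold             # assumed: max allowed wear gap

    def erase(self, logical: int) -> None:
        phys = self.l2p[logical]
        self.erase_counts[phys] += 1
        # Find the least-worn physical sector.
        coldest = min(range(len(self.erase_counts)),
                      key=self.erase_counts.__getitem__)
        gap = self.erase_counts[phys] - self.erase_counts[coldest]
        if gap > self.threshold:
            # Remap: the hot logical sector moves to the least-worn physical
            # sector, and whatever lived there takes the worn one.
            # (Copying the displaced data is not modeled here.)
            other = self.l2p.index(coldest)
            self.l2p[logical], self.l2p[other] = coldest, phys


# Hypothetical workload: the host erases the same logical sector 12 times.
wl = WearLeveler(num_sectors=4, threshold=2)
for _ in range(12):
    wl.erase(0)
```

With these assumed parameters, the 12 erases end up spread across all four physical sectors instead of landing on one.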
However, the logical to physical mapping information will also need to be accessed for read operations. In contrast to program/erase operations, read accesses need to complete within nanoseconds. To provide such performance, the mapping information stored in the dedicated flash address space needs to be mirrored into a fast RAM device during power-up (POR) and maintained there (see Figure 2).
Figure 2: Wear leveling mapping information needs to be mirrored in fast RAM to enable read operations to be completed within nanoseconds. (Source: Cypress Semiconductor)
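The mirroring scheme can be sketched as follows, assuming the mapping table is stored in the meta data partition as a packed array of 16-bit sector numbers. The names `MappingStore` and `Device`, the table encoding, and the write-through update are assumptions made for illustration, and power-loss protection of the update is not modeled.

```python
import struct


class MappingStore:
    """Stands in for the non-user-accessible flash partition (hypothetical)."""

    def __init__(self, num_sectors: int) -> None:
        # Assumed encoding: little-endian 16-bit physical sector numbers.
        self.flash_image = struct.pack(f"<{num_sectors}H", *range(num_sectors))
        self.flash_reads = 0  # counts slow table reads from flash

    def load(self) -> list:
        self.flash_reads += 1
        count = len(self.flash_image) // 2
        return list(struct.unpack(f"<{count}H", self.flash_image))

    def save(self, table: list) -> None:
        self.flash_image = struct.pack(f"<{len(table)}H", *table)


class Device:
    def __init__(self, store: MappingStore) -> None:
        self.store = store
        self.ram_map = []

    def power_on_reset(self) -> None:
        # Copy the table from flash into RAM once, at power-up.
        self.ram_map = self.store.load()

    def lookup(self, logical_sector: int) -> int:
        # The read path consults only the RAM mirror, never the flash copy.
        return self.ram_map[logical_sector]

    def remap(self, a: int, b: int) -> None:
        # Update the RAM mirror and write the change through to flash.
        self.ram_map[a], self.ram_map[b] = self.ram_map[b], self.ram_map[a]
        self.store.save(self.ram_map)


store = MappingStore(num_sectors=4)
dev = Device(store)
dev.power_on_reset()
dev.remap(0, 2)
hits = [dev.lookup(s) for s in range(4)]  # served entirely from RAM
```

Because remaps are written through to flash, a later power-up reloads the same table that was in RAM before power was removed.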