
Improving reliability of non-volatile memory systems

Complex systems like Advanced Driver-Assistance Systems (ADAS), medical, and industrial applications need to be reliable, secure, and safe. In these systems, firmware and associated data are stored in Non-Volatile Memory (NVM) because code and data must be retained when power is not being supplied. Thus, NVM plays a crucial role in system reliability.

NVM reliability can be expressed in two ways: data retention time and cycling endurance. Retention time dictates how long NVM can reliably hold data and code. Endurance measures how many times the NVM can be rewritten and still reliably hold data and code. To offset these limitations, designers often employ special host software and/or hardware, such as a Flash File System that uses wear leveling and/or error correction code (ECC) technology, to ensure data has not changed since it was last written. These measures introduce system overhead, often negatively impacting performance. In addition, complex remedies reduce system robustness, especially when NVM operations are interrupted by a power failure.

Today’s NVM devices employ next-generation technology to increase reliability. Companies like Cypress, with its Semper NOR Flash memory, have introduced advanced measures such as on-die ECC and internal wear leveling to substantially improve retention and endurance in Flash NVM (see Figure 1).


Figure 1: Today’s NVM devices, like the Semper NOR Flash architecture shown here, integrate advanced measures to substantially improve NVM retention and endurance. (Source: Cypress Semiconductor)

Retention and Endurance

NVM devices need to ensure that data is stored accurately and remains available for long periods of time. At the same time, systems need to update data frequently, which degrades the storage medium and thus impacts its ability to store data long term. In general, endurance in Flash devices is related to the number of program/erase (P/E) cycles that the device can sustain without losing data. Retention and endurance are also affected by use cases (i.e., how program/erase/read operations are applied) and the use environment (in particular, voltage and temperature).

When a failure occurs, regardless of the failure mechanism, bits in a Flash cell become corrupted. The result is the loss of the information stored in the cell. Bit error rate (BER) provides a common reliability metric for assessing corruption and is used to define the expected reliability of a particular application.

Satisfying the targeted BER for a required combination of retention and endurance becomes more difficult as NVM devices move to more advanced technology nodes. The reason is that advanced nodes are smaller. As such, they offer less area and thus less electric charge to store information. Storing the same information with less charge makes reliability for a Flash device even more challenging to maintain.

Options for Solving the Reliability Problem

There are fundamentally two options for improving NVM reliability:

  • minimize the occurrence of bit errors

  • correct bit errors that have occurred

The probability of bit errors occurring can be minimized by reducing the degradation of cells due to program/erase cycles. This can be achieved by spreading the “wear” of such cycles evenly across all of the cells in the device, minimizing the exposure of any individual cell to repeated storing/erasing of information. For example, excessive wear can occur when frequently writing to the same address/cell. The standard technique for spreading wear is known as Wear Leveling (WL).

Errors that have occurred can be corrected with Error Correction Code (ECC) technology. ECC technology is widely deployed and therefore not further discussed here. For more information, see Automatic ECC. Instead, the rest of this article will focus on the use of wear leveling to improve a system’s BER.

There are two primary types of wear leveling:

  • external wear leveling implemented by host software, typically as part of the Flash File System (FFS)

  • internal wear leveling integrated within the NVM device itself, typically managed by an integrated CPU

The primary limitation of external wear leveling is that it requires a host processor capable of performing the task. This is typically not the case for the large number of small-footprint systems that also lack the capacity to run a Flash File System. Such applications benefit greatly if the NVM can meet its own reliability targets without assistance from the host: integrating wear leveling into the NVM can significantly enhance reliability in small-footprint applications. Larger systems also benefit from internal wear leveling. It greatly simplifies system design by eliminating the need for engineers to develop their own wear leveling software and then verify it for compliance with the various safety standards. Internal wear leveling also enables any application to utilize reliable storage without substantial redesign of the memory subsystem.

Internal Wear Leveling

Wear leveling is implemented by mapping logical addresses to physical addresses. The logical addresses are presented to the user. During the lifetime of the part, the wear leveling function dynamically maps these logical addresses to changing physical addresses in order to maintain a uniform distribution of program/erase cycles across the physical addresses. Because erase operations affect a group of addresses (often referred to as a sector or block), wear-leveling mapping has a granularity of sectors (rather than individual bytes or words).
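As an illustration, sector-granular address translation can be sketched as follows. This is a minimal model, not the actual Semper implementation; the sector size and the table contents are assumed for the example:

```python
SECTOR_SIZE = 256 * 1024  # assumed sector size, for illustration only

# Example logical-to-physical sector map maintained by wear leveling.
logical_to_physical = {0: 3, 1: 0, 2: 1, 3: 2}

def translate(logical_addr):
    """Resolve a logical byte address to its physical location.

    The mapping is sector-granular: only the sector index is remapped;
    the byte offset within the sector is preserved unchanged.
    """
    sector, offset = divmod(logical_addr, SECTOR_SIZE)
    return logical_to_physical[sector] * SECTOR_SIZE + offset
```

Because only whole sectors are remapped, the table stays small: one entry per sector rather than one per byte or word.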

The first step is to partition the address space of the memory device into a partition that is user accessible for storing user data and code (see Sidebar: Partitioning), and a partition that is not user accessible for storing wear leveling metadata. Metadata includes information about the logical-to-physical sector mapping as well as power loss metadata (see Power Failures below).

Sidebar: Partitioning

A second level of NVM partitioning enables developers to configure a single NVM device for both long-term retention and high endurance. For example, a code partition can provide 25 years of retention while, in the same device, a data partition can be configured (e.g., for file system usage or data logging) to support endurance on the order of 1 million program/erase cycles. The sidebar figure shows an example of such partitioning.


Sidebar Figure: Next-generation NOR Flash supports a second level of partitioning to support long-term retention for code and system parameters, and high endurance for applications such as data logging. (Source: Cypress Semiconductor)

Each partition can be used just like a legacy NOR device. Once the partition is configured and set, there is no further effort required by developers to manage and utilize the partitions.

Swap candidates are chosen from sectors in “High Endurance” partitions. Together, these sectors make up what is called the Wear Leveling Pool. Since the “Long Retention” partition is excluded from the wear leveling pool, sectors in “Long Retention” will never be swapped, thus maximizing their retention.
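A sketch of how such partitions could define the wear leveling pool (the field names and sector counts are assumed for illustration and are not the actual configuration interface):

```python
# Two second-level partitions in one device: long-retention code and
# high-endurance data. Only high-endurance sectors join the pool.
partitions = [
    {"name": "code", "sectors": range(0, 64),   "mode": "long_retention"},
    {"name": "data", "sectors": range(64, 256), "mode": "high_endurance"},
]

# The wear leveling pool contains every sector of every
# high-endurance partition; long-retention sectors are excluded.
wear_leveling_pool = [
    s
    for p in partitions
    if p["mode"] == "high_endurance"
    for s in p["sectors"]
]
```

Sectors 0–63 (code) never appear in the pool, so their contents are never moved and their retention is maximized.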

Wear leveling can be applied only to the user address space, which means the address space containing wear leveling metadata is not managed by the wear-leveling function. The reliability of that address space is instead guaranteed by redundancy, both inside each cell and by storing the data in multiple cells. A detailed discussion of this mechanism is outside the scope of this paper.

Whenever a sector in the user address space is erased and its program/erase count exceeds a given threshold, a remap or swap operation is initiated and the sector mapping is updated. Since program/erase operations are relatively slow (several milliseconds), the time for re-mapping and updating the mapping tables is negligible compared to the erase operation itself.

However, the logical-to-physical mapping information must also be accessed for read operations. In contrast to program/erase operations, read accesses need to complete within nanoseconds. To provide such performance, the mapping information stored in the dedicated flash address space is mirrored into a fast RAM during power-up (POR) and maintained there (see Figure 2).
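The power-up mirroring step can be sketched like this (a simplified model; the metadata layout and field names are assumed):

```python
def load_mapping_into_ram(flash_metadata):
    """Mirror valid mapping entries from the flash metadata region
    into a RAM table at power-on, so later reads resolve from fast
    RAM without touching the (slow) flash metadata region."""
    ram_table = {}
    for entry in flash_metadata:
        if entry["valid"]:
            ram_table[entry["logical"]] = entry["physical"]
    return ram_table

# Example: two committed entries plus one left by an interrupted
# update, which is skipped when rebuilding the RAM table.
metadata = [
    {"logical": 0, "physical": 3, "valid": True},
    {"logical": 1, "physical": 0, "valid": True},
    {"logical": 2, "physical": 7, "valid": False},
]
```

After this step, every read translates through the RAM copy; the flash-resident table is only rewritten when a swap occurs.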


Figure 2: Wear leveling mapping information needs to be mirrored in fast RAM to enable read operations to be completed within nanoseconds. (Source: Cypress Semiconductor)

Figure 3 shows the flow of operation in a system employing internal wear leveling. As described earlier, wear leveling is triggered by a sector erase operation. It is important to note that for the vast majority of sector erases, the program/erase count for the sector to be erased will be below the threshold and a sector swap will not be initiated. Thus, only the standard erase procedure is performed (which ends at step 4). In the rare cases that swaps are required, the swap procedure is invoked.


Figure 3: The flow of operation in a system employing internal wear leveling, in this case, the EnduraFlex architecture implemented in Cypress’ Semper NOR Flash. (Source: Cypress Semiconductor)

Note that the mapping table is a logical-to-physical sector address map. The Validation Bit is actually three non-volatile flags that indicate whether the non-volatile operation has completed successfully. The Data Valid Bit is a mask bit that masks the invalid data to all-0 in a sector being swapped.

Step 1: Logical Sector A is mapped to Physical Sector X, Logical Sector B is mapped to Physical Sector Y, and so on.

Step 2: The user sends an erase command to Logical Sector A.

Step 3: Erase Physical Sector X

Step 4: Check if Physical Sector X reaches threshold: is a swap required?

End process if no

Step 5: Find the swap candidate sector with the minimum number of erase cycles (in this case, Physical Sector Y)

Step 6: Program Mapping Table in Flash. Note that the Mapping Table in RAM has not been updated yet.

Step 7: Program Validation Bit 1

Step 8: Copy the data in Physical Sector Y to Physical Sector X, which has already been erased.

Step 9: Program Validation Bit 2

Step 10: Update Mapping Table in RAM.

Step 11: Erase Physical Sector Y

Step 12: Program Validation Bit 3

Step 13: Erase is complete

The user’s erase of Logical Sector A is now complete, with A mapped to Physical Sector Y (blank), where Y has fewer program/erase cycles than Physical Sector X. Logical Sector B is now mapped to Physical Sector X, which holds the original data (i.e., Logical Sector B contains the same data as before the swap). By repeating this sequence on each erase that triggers a sector swap, all sectors in the wear leveling pool accumulate a uniform erase-cycling history throughout their life cycle.
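The steps above can be condensed into a small model. This is a sketch of Figure 3’s flow only: the validation bits and power-fail handling are omitted, and all names and the threshold value are assumed:

```python
def erase_logical(flash, table, counts, logical, threshold):
    """Erase a logical sector; swap physical sectors if the erased
    sector's program/erase count has reached the threshold."""
    x = table[logical]                    # Steps 1-2: resolve mapping
    flash[x] = None                       # Step 3: erase Physical X
    counts[x] += 1
    if counts[x] < threshold:             # Step 4: is a swap required?
        return                            # No: standard erase is done
    y = min(counts, key=counts.get)       # Step 5: least-worn candidate
    other = next(l for l, p in table.items() if p == y)
    table[logical], table[other] = y, x   # Steps 6/10: remap A->Y, B->X
    flash[x] = flash[y]                   # Step 8: copy Y's data into X
    flash[y] = None                       # Step 11: erase Physical Y
    counts[y] += 1

# Example: Logical A hits the threshold, so its mapping swaps with B.
flash = {0: None, 1: "data-B"}
table = {"A": 0, "B": 1}
counts = {0: 999, 1: 0}
erase_logical(flash, table, counts, "A", threshold=1000)
```

After the call, A is mapped to the blank, less-worn Sector 1 and B is mapped to Sector 0, which now holds B’s original data, matching the end state described above.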

Figure 4 shows a simulation where ~1.3M program/erase cycles are applied to a logical sector (128). The internal wear leveling function spreads the 1.3M cycles across 256 sectors, resulting in an average cycle count of 5089 per sector.


Figure 4: Simulation results of ~1.3M program/erase cycles spread across 256 sectors, for an average cycle count of 5089 per sector. (Source: Cypress Semiconductor)
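The arithmetic of that simulation can be reproduced with a toy wear-leveler (illustrative only; the real algorithm swaps on a cycle-count threshold rather than choosing per erase, but the steady-state distribution is the same). Directing 5089 × 256 = 1,302,784 erases of a single logical sector at whichever physical sector is currently least worn spreads the cycles evenly:

```python
import heapq

NUM_SECTORS = 256
TOTAL_ERASES = 5089 * NUM_SECTORS  # ~1.3M erases of one logical sector

# Min-heap of (erase_count, sector): each erase lands on the
# least-worn physical sector, which is popped and pushed back
# with its count incremented.
heap = [(0, s) for s in range(NUM_SECTORS)]
heapq.heapify(heap)
for _ in range(TOTAL_ERASES):
    count, sector = heapq.heappop(heap)
    heapq.heappush(heap, (count + 1, sector))

erase_counts = {sector: count for count, sector in heap}
```

The always-pick-the-minimum policy keeps the spread between the most- and least-worn sectors at no more than one cycle, so every sector ends at exactly the 5089-cycle average.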

Note that BER, data retention, and endurance are strongly related. BER has an exponential correlation with the number of program/erase cycles; thus, if the number of program/erase cycles is reduced by a factor of ten, the BER improves by several orders of magnitude. While a complete reliability analysis exceeds the scope of this paper, it is apparent that wear leveling has a significant positive effect on NVM reliability. Internal wear leveling, where wear leveling is integrated into the memory device, makes this level of improved reliability available to any host system in an entirely transparent manner.
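To make that scaling concrete, here is a purely illustrative model; the functional form and coefficients are assumed for the sketch and are not Cypress characterization data:

```python
import math

def ber(n_cycles, ber0=1e-12, k=1e-5):
    """Toy model: BER grows exponentially with P/E cycle count.
    ber0 (baseline error rate) and k (growth constant) are assumed
    values chosen only to illustrate the scaling behavior."""
    return ber0 * math.exp(k * n_cycles)

# Concentrating 1M cycles on one sector vs. spreading the same work
# so each sector sees only 100k cycles (a 10x reduction):
improvement = ber(1_000_000) / ber(100_000)
```

Under this model the 10x reduction in per-sector cycles improves BER by a factor of e^9, i.e., roughly four orders of magnitude, consistent with the “several orders of magnitude” statement above.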

Power Failures

One of the most critical concerns and technical challenges when implementing reliable storage is robustness against power failures. Specifically, the NVM device relies on the mapping table being correct. This raises the question of how to handle a power failure that occurs while mapping information is being updated, since an error could compromise the entire NVM device.

A power failure during a normal erase operation leaves the sector to be erased as “incompletely erased”. Flags associated with the erase operation indicate this state and prompt a re-erase after the next power-on cycle.

A power failure during wear leveling is more complicated since the physical erase operation is no longer limited to the sector the user attempted to erase. Now it involves an additional erase operation triggered by the wear leveling algorithm; see Figure 3 where physical sector Y is erased by the wear leveling algorithm and is not visible to the application. Thus, a power failure recovery routine needs to be a part of the wear leveling algorithm.

Assume in Figure 3 that power is lost at Step 11. When power is recovered, the device first reads the Mapping Table in the wear leveling address space and checks its validity in order to reconstruct the Mapping Table in RAM. Each entry of the Mapping Table in Flash contains Validation Bits (1, 2, 3). If power was lost at Step 11, the Validation Bits of the interrupted swap may read (0,0,1), indicating that Bits 1 and 2 were programmed but Bit 3 was not. The Data Valid Bit of the erased sector (Sector Y) is set to “invalid” and the swaps {A=Y, B=X} are recorded in the Mapping Table in RAM. Logical Sector A is now mapped to Physical Sector Y, but the erase of Physical Sector Y may still be incomplete. That is not a concern: the user attempted to erase Logical Sector A, the erase was interrupted, and the interruption was recorded. After the next power-up, Sector A simply needs to be re-erased. Logical Sector B is mapped to Physical Sector X and retains its original data from before wear leveling was initiated.
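The recovery decision can be sketched as a function of the three validation flags. This is a simplified model: the flag semantics are inferred from the step numbers above, the action strings are invented for the example, and in NOR flash a programmed flag actually reads 0 (the booleans here mean “flag has been programmed,” purely for readability):

```python
def recovery_action(v1, v2, v3):
    """Decide power-up recovery from a swap entry's validation flags.

    v1: mapping entry written to flash        (Step 7)
    v2: data copy into the new sector done    (Step 9)
    v3: stale sector erased, swap finished    (Step 12)
    """
    if v3:
        return "swap complete: adopt mapping as-is"
    if v1 and v2:
        # e.g. power lost around Step 11: adopt the new mapping, mark
        # the swapped-in sector invalid, and re-erase it after power-up.
        return "adopt mapping; mark sector invalid; re-erase"
    if v1:
        return "mapping written but copy incomplete: redo copy"
    return "no committed mapping: keep previous entry"
```

The key property is that every reachable flag combination maps to a deterministic action, so no power-loss point leaves the device in an ambiguous state.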


Figure 5: The power-up sequence that must be added to the wear leveling algorithm to recover from a power failure during an erase cycle. (Source: Cypress Semiconductor)

The above power-up sequence ensures that the wear leveling algorithm does not corrupt any user data, nor does it require the application to implement any special software/hardware algorithm. Because wear leveling is implemented internally, the entire process is transparent to the application. This makes internal wear leveling extremely useful for high-reliability applications, such as automotive systems, where meeting reliability targets can be challenging.

Next-generation NVM with integrated wear leveling is crucial for high-reliability industries. For example, Cypress Semper NOR Flash memory combines advanced NVM technology with wear leveling and ECC to achieve endurance of over 1 million program/erase cycles and 25 years of data retention.

Daisuke Nakata is Director of Systems Engineering for the Memory Products Division at Cypress Semiconductor. He works on systems architecture development to enable the next-generation of memory products. He holds a master’s degree of materials processing from Tohoku University, Japan.
