Flash 101: Error management in NAND Flash
Editor's Note: NAND and NOR Flash memory play an integral role in embedded systems of all sorts, but successful implementation requires careful attention to key details, all described and explained by Avinash Aravindan in this series, Flash 101, which includes the following articles, listed in order of publication:
- NAND Flash vs NOR Flash
- The NOR Flash electrical interface
- The NAND Flash electrical interface
- Types of NAND Flash
- Errors in NAND Flash
- Error Management in NAND Flash (this article)
In the previous part, we discussed the different kinds of errors in NAND Flash. Because data reliability is the most critical aspect of data storage, we need either to avoid these errors or to detect and correct them. In this part, we discuss the methods used to avoid errors, the methods used to detect errors that cannot be avoided, and the techniques used to correct them. We first discuss wear leveling, which is used to avoid or reduce the effect of memory wear. In the second section, we discuss bad block management, which is used to track bad blocks and avoid using them. In the final section, we discuss Error Correction Codes (ECC), which are used to detect and correct temporary errors in NAND Flash.
Wear leveling
As discussed previously, Flash cells are permanently damaged by repeated program and erase operations. Wear leveling is a technique used to reduce the effect of memory wear, thereby enhancing the endurance of Flash memory. To reduce wear, the number of program and erase (P/E) cycles that any one memory cell experiences needs to be kept as low as feasible. Wear leveling distributes program and erase operations more uniformly across the entire memory range or a portion of the Flash device, so that no cells or blocks experience significantly more P/E cycles than others.
Wear leveling is typically implemented by remapping the fixed logical sector address of the host system to different physical sector addresses in Flash memory. There are two major wear leveling methods: dynamic wear leveling and static wear leveling. Dynamic wear leveling uses only a portion of the available memory for wear leveling, while static wear leveling uses the entire memory range. Static wear leveling provides longer endurance than dynamic wear leveling, but it is slower and more complex. See the application note Wear Leveling for a detailed explanation of wear leveling, its advantages, and an exploration of the different wear leveling methods with some numerical examples. While that application note focuses primarily on NOR Flash memory, the same concepts apply to NAND Flash memory.
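The remapping idea can be illustrated with a minimal sketch. It is a toy model, not a production Flash translation layer: the class `DynamicWearLeveler` and all of its attribute names are hypothetical, it works at block granularity, and it assumes a free block is always available. Because only blocks that are being rewritten take part in the rotation, it is closest in spirit to dynamic wear leveling.

```python
class DynamicWearLeveler:
    """Toy logical-to-physical block remapper (illustrative only)."""

    def __init__(self, num_physical_blocks):
        self.erase_counts = [0] * num_physical_blocks   # P/E cycles per physical block
        self.free_blocks = set(range(num_physical_blocks))
        self.logical_to_physical = {}                   # logical block -> physical block
        self.storage = {}                               # physical block -> stored data

    def write(self, logical_block, data):
        # Choose the free physical block with the fewest erase cycles
        # (ties broken by block number). Raises ValueError if none is free.
        target = min(self.free_blocks, key=lambda b: (self.erase_counts[b], b))
        self.free_blocks.remove(target)
        self.storage[target] = data

        # Retire the previous copy: erase it and return it to the free pool.
        old = self.logical_to_physical.get(logical_block)
        if old is not None:
            del self.storage[old]
            self.erase_counts[old] += 1
            self.free_blocks.add(old)

        self.logical_to_physical[logical_block] = target

    def read(self, logical_block):
        return self.storage[self.logical_to_physical[logical_block]]


# The host rewrites the same logical block eight times.
leveler = DynamicWearLeveler(num_physical_blocks=4)
for value in range(8):
    leveler.write(0, value)
print(leveler.erase_counts)
```

Although the host addresses a single logical block throughout, the erase operations end up spread across all four physical blocks rather than concentrated on one.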
Bad block management
One of the major limitations of NAND Flash is bad blocks. Given that the integrity of data stored in Flash is critical, bad block management is a mandatory requirement for NAND Flash devices.
In Part 1 of this article series, I mentioned that NAND Flash devices are shipped with bad blocks scattered randomly throughout due to yield constraints. The locations of these initial bad blocks are marked in the Flash device itself before shipping. The details of how to read this bad block marking are available in the Flash device datasheet. (For an example, refer to the error management section in this NAND Flash datasheet.) Factory-marked bad block information must be read before performing any erase operation; otherwise, the information will be lost. Note that factory-marked bad blocks are tested at worst-case conditions, so some of these blocks may work in normal conditions. Even so, the use of these bad blocks should be avoided, as it can damage other good blocks. The host system stores the locations of these initial bad blocks in a bad block table.
NAND Flash devices continue to accumulate bad blocks over the lifecycle of the device due to memory wear. These additional bad blocks can be identified whenever a program or erase operation reports “Fail” in the status register. The failure to program one page in a block does not affect other pages in the same block, so the contents of the other pages in the block are copied to another good block and the old block is marked as bad. The host system adds this information to the bad block table to avoid any further use of the block.
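The bookkeeping described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: `SimulatedNand` is a hypothetical in-memory stand-in for a device (real devices expose factory markings and the status register in ways defined by their datasheets), and `BadBlockManager` and its methods are invented names rather than any real driver API.

```python
class SimulatedNand:
    """Hypothetical in-memory NAND model; not a real device interface."""

    def __init__(self, num_blocks, pages_per_block, factory_bad=()):
        self.pages = [[None] * pages_per_block for _ in range(num_blocks)]
        self.factory_bad = set(factory_bad)   # blocks marked bad before shipping
        self.failing_blocks = set()           # blocks whose programs report "Fail"

    def is_factory_marked_bad(self, block):
        return block in self.factory_bad

    def program_page(self, block, page, data):
        """Return True on success, False when the status would read "Fail"."""
        if block in self.failing_blocks:
            return False
        self.pages[block][page] = data
        return True


class BadBlockManager:
    def __init__(self, nand):
        self.nand = nand
        blocks = range(len(nand.pages))
        # Build the bad block table from the factory markings before any erase.
        self.bad_blocks = {b for b in blocks if nand.is_factory_marked_bad(b)}
        self.spare_blocks = [b for b in blocks if b not in self.bad_blocks]

    def allocate_block(self):
        return self.spare_blocks.pop(0)

    def _must_program(self, block, page, data):
        if not self.nand.program_page(block, page, data):
            raise RuntimeError("replacement block failed; a real design would retry")

    def program(self, block, page, data):
        """Program a page and return the block that ends up holding the data."""
        if self.nand.program_page(block, page, data):
            return block
        # Program failed: move the other pages to a good block, then retire this one.
        replacement = self.allocate_block()
        for p, old_data in enumerate(self.nand.pages[block]):
            if p != page and old_data is not None:
                self._must_program(replacement, p, old_data)
        self._must_program(replacement, page, data)
        self.bad_blocks.add(block)
        return replacement


nand = SimulatedNand(num_blocks=4, pages_per_block=2, factory_bad=[1])
manager = BadBlockManager(nand)
block = manager.allocate_block()
manager.program(block, 0, "page A")
nand.failing_blocks.add(block)                 # simulate wear-out of this block
block = manager.program(block, 1, "page B")    # data is relocated to a good block
print(block, sorted(manager.bad_blocks))
```

The table here lives only in memory; a real system would also need to persist it so that it survives power cycles.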
Error Correction Codes (ECC)
In the previous part, we discussed different temporary errors in Flash memories, such as read disturb, program disturb, over-programming, and retention errors, where stored data gets corrupted over time. The corruption of data due to temporary errors is often known as bit-flipping, because the state of a bit appears to be flipped. To maintain the integrity of stored data, it is essential to detect any such errors and correct them. Error Correction Codes (ECC) are a general technique used to detect and correct errors in memory devices.
Error correction codes work by adding redundant bits to the data bits so that errors can be identified and corrected. For m bits of data, k redundant bits are added, making the coded data m+k bits long. The ECC algorithm encodes these m+k bits such that only some of the 2^(m+k) possible combinations are valid codewords. Any error in the read data can thus be identified whenever an invalid codeword is detected.
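The counting argument can be made concrete with a small sketch. It assumes m = 4 data bits and k = 3 redundant bits, and it uses a Hamming(7,4) parity layout chosen purely for illustration; the function names are hypothetical. Of the 2^7 = 128 possible 7-bit words, only 16 are valid codewords, so a single flipped bit always produces an invalid word.

```python
from itertools import product

M, K = 4, 3  # data bits and redundant bits (assumed values for illustration)


def encode(data):
    """Append three parity bits to four data bits (Hamming(7,4), systematic form)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return (d1, d2, d3, d4, p1, p2, p3)


# Every possible data word maps to exactly one valid codeword.
valid_codewords = {encode(data) for data in product((0, 1), repeat=M)}


def is_valid(word):
    return tuple(word) in valid_codewords


def flip(word, position):
    """Return a copy of word with the bit at the given index inverted."""
    flipped = list(word)
    flipped[position] ^= 1
    return tuple(flipped)


print(len(valid_codewords), "valid codewords out of", 2 ** (M + K))

stored = encode((1, 0, 1, 1))
read_back = flip(stored, 2)          # simulate one bit flipping in the Flash cell
print(is_valid(stored), is_valid(read_back))
```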
A detected error can be corrected up to the capability of the ECC algorithm used. The number of redundant ECC bits (k) to be added for a block of data (m) depends on many factors. The technology node of the NAND Flash is the most important factor, as a smaller technology node requires more ECC bits. The type of NAND Flash (SLC, MLC, or TLC) is another important factor. As mentioned in Part 4 of this series, SLC requires the fewest ECC bits and TLC requires the most. The number of P/E cycles is another key factor, as the cells keep wearing out with more program and erase operations, thereby requiring more ECC bits. The number of required ECC bits also depends on the type of ECC algorithm used.
Many different ECC algorithms are in wide use, each having some advantage over the others. The most commonly used ECC algorithms in Flash memories are Hamming codes; Bose, Chaudhuri, and Hocquenghem (BCH) codes; Reed-Solomon (R-S) codes; and Low-Density Parity-Check (LDPC) codes. The algorithm to be used depends on the data reliability required for the application and on the NAND Flash device used. For example, Hamming codes extended with an overall parity bit can correct 1-bit errors and detect 2-bit errors, so they can be used for SLC Flash requiring only 1-bit ECC. For applications that require more robust ECC, another algorithm may be used. “What Types of ECC Should Be Used on Flash Memory” offers a detailed explanation of the differences between the Hamming, R-S, and BCH algorithms and of which to select for an application.
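The Hamming example can be sketched in a few lines. The code below is an illustrative single-error-correcting, double-error-detecting scheme on 4 data bits: a Hamming(7,4) code plus one overall parity bit. The function names and the 4-bit data size are assumptions made for brevity; real NAND controllers apply ECC over much larger data blocks.

```python
def secded_encode(data):
    """Encode 4 data bits into 8 bits: Hamming(7,4) plus an overall parity bit."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4        # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4        # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4        # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]   # 1-based positions 1..7
    overall = 0
    for bit in word:
        overall ^= bit
    return word + [overall]               # position 8 makes total parity even


def secded_decode(word):
    """Return (status, data); status is 'ok', 'corrected', or 'uncorrectable'."""
    bits = list(word)
    # Syndrome: XOR of the 1-based positions (1..7) that currently hold a 1.
    syndrome = 0
    for position in range(1, 8):
        if bits[position - 1]:
            syndrome ^= position
    overall = 0
    for bit in bits:
        overall ^= bit                    # 0 when the number of 1s is even

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: treat as a single-bit error. The syndrome gives its
        # position; a zero syndrome means the overall parity bit itself flipped.
        bits[syndrome - 1 if syndrome else 7] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "uncorrectable", None
    return status, [bits[2], bits[4], bits[5], bits[6]]


codeword = secded_encode([1, 0, 1, 1])

one_error = list(codeword)
one_error[5] ^= 1
print(secded_decode(one_error))

two_errors = list(codeword)
two_errors[1] ^= 1
two_errors[6] ^= 1
print(secded_decode(two_errors))
```

Three or more flipped bits can be miscorrected by this scheme, which is one reason denser Flash types rely on stronger codes.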
Avinash Aravindan is a Staff Systems Engineer at Cypress Semiconductor. His responsibilities include defining technical requirements and designing PSoC-based development kits, system design, technical review of system designs, and technical writing. He has 8+ years of industry experience. He earned his Master of Science in Research on Information and Communication Technologies (MERIT) from Universitat Politècnica de Catalunya, Barcelona, Spain, and his B.Tech from Cochin University of Science and Technology, Cochin, India. His interests include embedded systems, high-speed system design, mixed-signal system design, and statistical signal processing.