Designing reliability into embedded Systems-on-Chip

With the 21st-century focus on efficiency and productivity, factory automation equipment manufacturers have joined the aerospace and defense industries on the reliability bandwagon, striving to minimize downtime and failures on manufacturing floors. Consequently, reliability design requirements are now often mandated by factory automation equipment manufacturers. Product engineers must focus not only on embedded solutions that meet cost and performance goals, but also on devices that help ensure overall end-equipment reliability requirements are met. While integrated circuits have enabled quantum leaps in the performance, size, and overall cost of embedded systems, the reliance on various memory elements and the use of small-geometry silicon process technologies introduce reliability challenges.

An issue with some of the first Intel Dynamic Random Access Memory (DRAM) chips in the early 1970s is described in an article in the December 2015 issue of IEEE Spectrum magazine. As memory densities grew from 1 Kbit to 16 Kbits, the DRAMs started to exhibit a high number of bit errors. These errors impacted program execution and the reliability of operational data. The high bit-error rate was traced to radioactive material that had found its way into the ceramic package. The radioactive material emitted alpha particles that caused stored bits to flip erroneously from their correct logical values.

Despite improvements made to remove these alpha-emitting contaminants from those first DRAM devices, alpha particles remain an issue that affects not only the reliability of DRAMs but also that of other silicon-based memories today. Twenty-first century embedded System-on-Chip (SoC) devices with multiple processor cores, large internal caches and memories, and fixed-function logic dedicated to acceleration tasks are all susceptible to the same “soft” transient errors that can plague DRAMs.

Silicon device reliability requires managing failures that can prevent the device from functioning correctly at any point during its expected lifetime. From a design-for-reliability perspective, this means designing the device to meet market-driven transient and permanent failure-rate requirements.

Transient errors
Transient errors are random errors induced by an event that corrupts the data stored in a device (usually only a single bit). They have the following characteristics:

  • They affect both SRAM and logic

  • The device itself is not damaged and the root cause of the error is often impossible to trace

  • The error is caused by external elements and not a physical defect on the silicon itself

  • These types of errors do not contribute to silicon permanent failure metrics, as the silicon is still functional

Over time, it has been found that not only alpha particles but also neutrons can affect the accuracy of memory and logic (registers and latches), and thus the reliability of the device. To assess the impact for a chip design, alpha and neutron tests can be run on test devices. The JESD89A specification, “Measurement and Reporting of Alpha Particle and Terrestrial Cosmic Ray-Induced Soft Errors in Semiconductor Devices,” can be followed for compliance with an industry-standard measurement method. For these tests, the test devices are exposed to an alpha or neutron source while the number of bit errors is counted. The results from the alpha test can then be used to determine the impact of alpha particles for a given SoC package. The results from the neutron tests can be used to determine the impact of neutrons at different geographical locations and altitudes, scaled by the relative flux density.
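As a rough illustration of how an accelerated-beam result maps to a terrestrial rate, the scaling amounts to computing a per-device neutron cross-section and multiplying by the JESD89A reference flux at New York City sea level. The beam flux and error count below are assumed numbers for the sketch, not measured data:

```python
# Illustrative scaling of an accelerated neutron-beam test to a terrestrial
# soft error rate. All test numbers are hypothetical, not measured data.

beam_flux = 1.0e6      # neutrons / cm^2 / s at the test facility (assumed)
beam_time_s = 3600.0   # one hour of beam exposure (assumed)
bit_errors = 450       # upsets counted on the test device (assumed)

# Per-device cross-section: errors per unit neutron fluence.
fluence = beam_flux * beam_time_s        # neutrons / cm^2
cross_section = bit_errors / fluence     # cm^2 per device

# JESD89A reference terrestrial flux at New York City sea level
# (~13 neutrons / cm^2 / hour for energies above 10 MeV).
NYC_FLUX_PER_HOUR = 13.0

# FIT = failures per 1e9 device-hours.
ser_fit = cross_section * NYC_FLUX_PER_HOUR * 1.0e9
print(f"Estimated SER: {ser_fit:.0f} FIT")
```

Testing at a different altitude or location would simply swap in that site's relative flux density.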

Using the observed data collected from both the alpha and neutron tests, calculations can be made to estimate the transient Soft Error Rate (SER) for a SoC. Other factors that contribute to the calculation of SER are as follows:

  • Amount of Static Random Access Memory (SRAM) in bits and memory bit cell type

  • Amount of logic state (flip-flop and latch bits)

  • Protection included (Parity or Error Correcting Code)

  • Silicon process technology

  • Voltages for functional sub-elements

  • Package design characteristics

  • Device temperature

  • Geographical factor (altitude relative to sea level)

  • Product lifetime
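A first-order way to combine these factors is an additive model: each memory and logic block contributes a per-bit intrinsic rate, derated by its protection scheme, with an overall multiplier for location-dependent flux. The function and all the rates below are an illustrative sketch under assumed numbers, not a characterized SER model:

```python
def device_ser_fit(blocks, flux_multiplier=1.0):
    """Crude additive SER estimate in FIT (failures per 1e9 hours).

    blocks: list of (megabits, raw_fit_per_mbit, protection_derating)
    tuples, where protection_derating reflects ECC/parity coverage
    (1.0 = unprotected). flux_multiplier scales for altitude/location
    relative to New York City sea level.
    """
    total = sum(mbits * fit_per_mbit * derating
                for mbits, fit_per_mbit, derating in blocks)
    return total * flux_multiplier

# Hypothetical SoC: 8 Mbit of ECC-protected SRAM plus 0.5 Mbit of
# unprotected flip-flop state. All rates are illustrative only.
blocks = [
    (8.0, 500.0, 0.01),   # SRAM: ECC leaves ~1% residual (assumed)
    (0.5, 100.0, 1.0),    # logic state: unprotected
]
print(f"Estimated device SER: {device_ser_fit(blocks):.0f} FIT")
```

In practice each factor in the list above (process node, bit-cell type, voltage, temperature, package) folds into the per-Mbit rates, which come from the alpha and neutron characterization data.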

To address high reliability requirements in today’s embedded systems market, SoCs need to be designed with a specific SER goal. As a reference example, let’s hypothetically set a goal of achieving an overall total SER of less than 250 Failures in Time (FIT) at New York City sea level and a device temperature of 25 degrees Celsius. Sea level is specified because the cosmic-ray neutron flux increases with altitude; New York City at sea level is the common reference location for soft-error FIT rates. One FIT is defined as one failure in one billion device-hours of operation, so the Mean Time Between Failures (MTBF) in hours is one billion divided by the FIT value. For the reference example with a 250 FIT goal, the MTBF is therefore greater than 400 years. This may seem like overkill for a single device. However, consider a factory automation application such as a Programmable Logic Controller (PLC): there may be close to 100 PLCs controlling the operations of a large factory. If each PLC used an SoC with an MTBF of only 100 years instead of 400, the factory could expect roughly one device to require a restart each year due to a transient error. This would be intolerable in today’s factories, where minutes of downtime can mean millions of dollars in lost production.
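The FIT-to-MTBF arithmetic and the fleet argument can be checked in a few lines (a sketch; the 250 FIT goal and PLC counts are the hypothetical numbers from the example above):

```python
HOURS_PER_YEAR = 24 * 365  # 8760

def mtbf_years(fit):
    """MTBF in years for a failure rate in FIT (failures per 1e9 hours)."""
    return 1.0e9 / fit / HOURS_PER_YEAR

# 250 FIT -> roughly 456 years, comfortably above the 400-year claim.
print(f"{mtbf_years(250):.0f} years")

# Fleet view: 100 PLCs whose SoCs each have a 100-year MTBF.
# Expected failures per year across the fleet = fleet size / MTBF.
expected_failures_per_year = 100 / 100.0   # ~1 restart per year
```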


For a SoC, careful design consideration has to be put into each functional block of memory and logic to ensure that the total SER stays below a specific target. Error Correcting Codes (ECC) and parity bits can be employed to detect and/or correct bit errors, significantly reducing the SER across a device. A common ECC method is Single Error Correction, Double Error Detection (SECDED). With SECDED, a single-bit error is detected and corrected in hardware. Double-bit errors are detected, and the appropriate processor in the device is signaled to take action on them. With single-bit errors being corrected, the SER is reduced to the probability of a double-bit error occurring, which is far lower. Employing techniques such as SECDED ECC can dramatically reduce the transient error rate across a device and thus increase the MTBF.
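To make the SECDED behavior concrete, here is a minimal software sketch of an extended Hamming (8,4) code, the classic SECDED construction: three parity bits locate a single flipped bit, and an overall parity bit distinguishes single errors (correctable) from double errors (detect-only). Real SoC ECC protects much wider words entirely in hardware; this is only an illustration of the scheme:

```python
def secded_encode(d):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    p1 = d[0] ^ d[1] ^ d[3]           # covers codeword positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]           # covers positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]           # covers positions 4,5,6,7
    code = [p1, p2, d[0], p3, d[1], d[2], d[3]]
    overall = 0
    for b in code:
        overall ^= b                  # even parity over the whole word
    return code + [overall]

def secded_decode(c):
    """Return (data_bits, status) after single-correct / double-detect."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of a single error
    overall = 0
    for b in c:
        overall ^= b                  # nonzero => odd number of bit flips
    if syndrome == 0 and overall == 0:
        status = "no error"
    elif overall == 1:                # odd flip count: assume single error
        if syndrome:
            c[syndrome - 1] ^= 1      # correct it (else the parity bit flipped)
        status = "single-bit error corrected"
    else:                             # even flips with nonzero syndrome
        status = "double-bit error detected"
    return [c[2], c[4], c[5], c[6]], status
```

Flipping any one of the eight bits still yields the original data; flipping any two is flagged as uncorrectable, which is exactly the point at which hardware would signal a processor core to take action.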

In addition to transient errors, the other important aspect of device reliability to be considered is “hard” permanent errors due to possible failure mechanisms in the device design and silicon process. Permanent errors are repeatable errors caused by faulty device operation and typically have the following attributes:

  • The root cause is due to physical damage to the circuit

  • These types of physical errors contribute to silicon failure metrics

  • Data is lost and can no longer be restored to that location

  • Some example permanent failure mechanisms are Gate Oxide Integrity (GOI) breakdown and Electromigration (EM)

From an overall failure-mode perspective, the type of error depends on where a device is within its lifecycle relative to the traditional reliability “bathtub” curve, as shown in Figure 1.

Figure 1: Reliability Bathtub Curve (Source: Texas Instruments)

The bathtub curve provides a simplified overview of the three primary phases of a semiconductor device product lifetime.

Early life failure rate (ELFR): This phase is characterized by a relatively higher initial failure rate, which decreases rapidly. The failures observed during this phase are extrinsic failures and are typically measured as “Defective Parts Per Million” (DPPM). From a development perspective, these are removed by applying additional test screens and/or process updates.

Operating life: This phase consists of a relatively constant failure rate which remains stable over the useful lifetime of the device. This failure rate is described in units of FITs, or alternatively as MTBF in hours.

Wearout phase: This represents the point at which intrinsic wear-out mechanisms begin to dominate and the failure rate begins increasing exponentially. The product lifetime is typically defined as the time from initial production until the onset of wear-out.

To manage the intrinsic failure rate, it is necessary to have a robust design process that ensures the required level of reliability is designed in and is therefore correct by physical construction of the device. To achieve this, reliability requirements are defined and driven down to the component/library level and then validated at the SoC design level, as depicted in Figure 2.

Figure 2: Design for Intrinsic Reliability Process Flow (Source: Texas Instruments)  

To minimize both transient and permanent errors in a complex SoC, reliability has to be designed in from the ground up; it is not something that can be worked around or dealt with after the SoC is in production. And while performance and latency are always at the forefront of requirements for a SoC, reliability has to be an intrinsic part of the foundation if the device is to function properly for multiple years in reliability-critical applications such as factory automation, transportation, military, and medical. To meet the reliability demands of those markets, TI has laid this reliability foundation with the design of the high-performance 66AK2Gx DSP SoC.

Mike Hannah is a Senior Systems Engineer for Texas Instruments in the Catalog Processor Business Unit. He is a Senior Member of the Technical Staff and focuses on processor applications in industrial systems.
