Designing reliability into embedded Systems-on-a-Chip

Mike Hannah | October 18, 2016

With the 21st century focus on efficiency and productivity, factory automation equipment manufacturers have joined the aerospace and defense industries in their pursuit of reliability, striving to minimize downtime and failures on manufacturing floors. Consequently, reliability design requirements are now often mandated by factory automation equipment manufacturers. Product engineers must focus not only on embedded solutions that meet cost and performance goals, but also on devices that help ensure overall end-equipment reliability requirements. While integrated circuits have enabled quantum leaps in the performance, size, and overall cost of embedded systems, the reliance on various memory elements and the use of small-geometry silicon process technologies introduce reliability challenges.

An issue with some of the first Intel Dynamic Random Access Memory (DRAM) chips in the early 1970s is described in an article in the December 2015 issue of IEEE Spectrum magazine. As memory densities grew from 1 Kbit to 16 Kbits, the DRAMs started to exhibit a high number of bit errors. These errors impacted program execution and the reliability of the operational data. The high bit-error rate was eventually traced to radioactive material that had found its way into the ceramic package. The radioactive material emitted alpha particles that caused bits to flip erroneously from the correct logical value.

Despite improvements made to remove the alpha-emitting contaminants from those first DRAM devices, alpha particles remain an issue that affects the reliability not only of DRAMs but also of other silicon-based memories today. Twenty-first century embedded System-on-Chip (SoC) devices, with multiple processor cores, large internal caches and memories, and fixed-function logic dedicated to acceleration tasks, are all susceptible to the same “soft” transient errors that can plague DRAMs.

Silicon device reliability requires managing failures that could keep the device from functioning correctly at any point during its expected lifetime. From a design-for-reliability perspective, this means designing the device to meet market-driven transient and permanent failure rate requirements.

Transient errors
Transient errors are random errors induced by an event that corrupts the data stored in a device (usually only a single bit). They have the following characteristics:

  • They affect both SRAM and logic

  • The device itself is not damaged and the root cause of the error is often impossible to trace

  • The error is caused by external elements and not a physical defect on the silicon itself

  • These types of errors do not contribute to silicon permanent failure metrics, as the silicon is still functional

Over time, it has been found that not only alpha particles but also neutrons can corrupt memory and logic (registers and latches) and so affect the reliability of a device. To assess the impact on a chip design, alpha and neutron tests can be run on test devices. The JESD89A specification, “Measurement and Reporting of Alpha Particle and Terrestrial Cosmic Ray-Induced Soft Errors in Semiconductor Devices,” can be followed for compliance with an industry-standard measurement method. In these tests, the test devices are exposed to an alpha or neutron source and the number of bit errors is counted. The results of the alpha test can then be used to determine the impact of alpha particles on a given SoC package. The results of the neutron tests can be used to determine the impact of neutrons at different geographical locations and altitudes, scaled by the relative flux density.
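To make the flux scaling concrete, here is a minimal C sketch of how a neutron SER referenced to New York City sea level might be re-rated for a deployment site. The FIT value and the relative flux factor are hypothetical placeholders, not measured data; real numbers come from JESD89A-style test results and flux models.

#include <stdio.h>

int main(void)
{
    /* FIT measured under an accelerated neutron beam and already
       normalized to the NYC sea-level reference flux (hypothetical). */
    double fit_nyc_sea_level = 200.0;

    /* Relative neutron flux at the deployment site; high-altitude
       sites such as Denver are commonly cited as several times the
       NYC sea-level flux (the factor below is illustrative only). */
    double relative_flux = 3.5;

    double fit_at_site = fit_nyc_sea_level * relative_flux;

    printf("Estimated neutron SER at site: %.0f FIT\n", fit_at_site);
    return 0;
}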

Using the data collected from both the alpha and neutron tests, calculations can be made to estimate the transient Soft Error Rate (SER) for an SoC (a first-order roll-up is sketched in the example after this list). Other factors that contribute to the calculation of SER are as follows:

  • Amount of Static Random Access Memory (SRAM) in bits and memory bit cell type

  • Amount of logic in bits

  • Protection included (Parity or Error Correcting Code)

  • Silicon process technology

  • Voltages for functional sub-elements

  • Package design characteristics

  • Device temperature

  • Geographical factors (altitude relative to sea level)

  • Product lifetime
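The C sketch below shows one way these factors might combine in a first-order SER estimate: a per-bit raw rate scaled by the amount of SRAM and logic, then derated by the protection applied. All the per-Mbit rates and residual factors are hypothetical placeholders; real values depend on the process, bit-cell type, voltage, package, and temperature listed above.

#include <stdio.h>

int main(void)
{
    double sram_bits          = 64e6;  /* e.g. 8 MB of on-chip SRAM  */
    double logic_bits         = 2e6;   /* flip-flops and latches     */
    double fit_per_mbit_sram  = 500.0; /* hypothetical raw rate      */
    double fit_per_mbit_logic = 100.0; /* hypothetical raw rate      */

    /* Fraction of the raw rate that survives protection: ECC on the
       SRAM corrects single-bit errors, so only multi-bit upsets
       count; parity on logic detects but does not correct. */
    double ecc_residual    = 0.01;     /* hypothetical               */
    double parity_residual = 1.0;      /* detection only             */

    double sram_fit  = (sram_bits  / 1e6) * fit_per_mbit_sram  * ecc_residual;
    double logic_fit = (logic_bits / 1e6) * fit_per_mbit_logic * parity_residual;

    printf("SRAM  SER: %6.1f FIT\n", sram_fit);
    printf("Logic SER: %6.1f FIT\n", logic_fit);
    printf("Total SER: %6.1f FIT\n", sram_fit + logic_fit);
    return 0;
}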

To address high reliability requirements in today’s embedded systems market, SoCs need to be designed to a specific SER goal. As a reference example, let’s hypothetically set a goal of an overall total SER of less than 250 Failures in Time (FIT) at New York City sea level and a device temperature of 25 degrees Celsius. Sea level is specified because the cosmic-ray neutron flux increases with altitude, and New York City at sea level is the common reference point for soft-error FIT rates. One FIT is defined as one undetected failure in one billion hours of operation, so the Mean Time Between Failures (MTBF) in hours is one billion divided by the FIT value. For the reference example with a 250 FIT goal, the MTBF is therefore greater than 400 years. This may seem like overkill for a single device. However, consider a factory automation application such as a Programmable Logic Controller (PLC): there may be close to 100 PLCs controlling the operations of a large factory. If each PLC used an SoC with an MTBF of only 100 years instead of 400 years, the factory as a whole could expect roughly one device to require a restart each year due to a transient error. This would be intolerable in today’s factories, where minutes of downtime can mean millions of dollars in lost product.
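The arithmetic behind that example is simple enough to check directly; this short C sketch reproduces both the 250-FIT MTBF and the fleet-level restart rate from the paragraph above.

#include <stdio.h>

int main(void)
{
    const double HOURS_PER_YEAR = 8760.0;

    /* 1 FIT = 1 failure per 10^9 device-hours, so
       MTBF (hours) = 1e9 / FIT. */
    double fit = 250.0;                      /* SER goal from the text */
    double mtbf_years = (1e9 / fit) / HOURS_PER_YEAR;
    printf("250 FIT -> MTBF = %.0f years\n", mtbf_years);    /* ~457 */

    /* A fleet of N devices fails N times as often as one device:
       100 PLCs whose SoCs each have a 100-year MTBF. */
    double fleet_size            = 100.0;
    double per_device_mtbf_years = 100.0;
    double fleet_failures_per_year = fleet_size / per_device_mtbf_years;
    printf("Fleet of %.0f: about %.1f transient-error restarts per year\n",
           fleet_size, fleet_failures_per_year);             /* 1.0 */
    return 0;
}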
