Reliable and power-aware architectures: Measuring resilience

Editor's Note: Embedded designers must contend with a host of challenges in creating systems for harsh environments. Harsh environments present unique characteristics not only in terms of temperature extremes but also in areas including availability, security, very limited power budget, and more. In Rugged Embedded Systems, the authors present a series of papers by experts in each of the areas that can pose unusually demanding requirements. In Chapter 2 of the book, the authors address fundamental concerns in reliability and system resiliency. This series excerpts that chapter in a number of installments, including:
Reliable and power-aware architectures: Sustaining system resiliency
Reliable and power-aware architectures: Measuring resiliency (this article)
Reliable and power-aware architectures: Soft-error vulnerabilities 
Reliable and power-aware architectures: Microbenchmark generation
Reliable and power-aware architectures: Measurement and modeling

4 METRICS ON POWER-PERFORMANCE IMPACT

Technology scaling and NTV operation are two effective paths to achieving aggressive power/energy-efficiency goals. Both are fraught with resiliency challenges in prevalent CMOS logic and storage elements. When resiliency improvements enable more energy-efficient techniques (smaller node sizes, lower voltages), metrics that assess the performance and energy-efficiency gains they deliver also need to be considered. A closely related area is thermal management, where system availability and performance can be improved by proactive thermal management solutions and thermal-aware designs rather than purely reactive, thermal-emergency management approaches. Thermal-aware design and proactive management techniques that improve the thermal resiliency of a system can also improve its performance and efficiency (by reducing or eliminating the impact of thermal events), in addition to potentially improving system availability at lower cost.

In this context, efficiency improvement per unit cost of resiliency improvement constitutes an effective metric to compare different alternatives or solutions. As an example, a DRAM-only memory system might have an energy-efficiency measure E_DRAM, and a hybrid Storage Class Memory-DRAM (SCM-DRAM) system with better resiliency might have a measure E_Hybrid. If C_Hybrid is the incremental cost of supporting the more resilient hybrid system, the new measure would be evaluated as (E_Hybrid − E_DRAM)/C_Hybrid. Different alternatives for hybrid SCM-DRAM designs would then be compared based on their relative values for this measure. On a similar note, different methods to improve thermal resiliency can be compared on their improvement in average system performance or efficiency normalized to the cost of their implementation.
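As a toy illustration of this metric, the following sketch ranks hypothetical hybrid SCM-DRAM design points by (E_Hybrid − E_DRAM)/C_Hybrid; the efficiency scores and cost figures are invented placeholders, not data from the chapter:

```python
# Hypothetical design points: normalized energy-efficiency scores and incremental
# cost of the resiliency feature (all numbers are illustrative placeholders).
E_DRAM = 1.00  # baseline energy-efficiency of the DRAM-only system (normalized)

hybrid_designs = {
    "SCM-DRAM-A": {"E": 1.18, "C": 0.10},  # E: efficiency measure, C: incremental cost
    "SCM-DRAM-B": {"E": 1.25, "C": 0.22},
    "SCM-DRAM-C": {"E": 1.12, "C": 0.05},
}

def efficiency_gain_per_cost(E_hybrid, C_hybrid, E_baseline=E_DRAM):
    """(E_Hybrid - E_DRAM) / C_Hybrid: efficiency improvement per unit cost."""
    return (E_hybrid - E_baseline) / C_hybrid

ranked = sorted(
    hybrid_designs.items(),
    key=lambda kv: efficiency_gain_per_cost(kv[1]["E"], kv[1]["C"]),
    reverse=True,
)
for name, d in ranked:
    print(f"{name}: {efficiency_gain_per_cost(d['E'], d['C']):.2f}")
```

The same ranking pattern applies to thermal-resiliency alternatives: replace the efficiency measures with average performance or efficiency gains and normalize by implementation cost.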

5 HARD-ERROR VULNERABILITIES

This section describes the underlying physical mechanisms which may lead to reliability concerns in advanced integrated circuits (ICs). The considered mechanisms are those which affect the chip itself, including both the transistor level (“front end of line,” or FEOL) and wiring levels (“back end of line,” or BEOL), but not covering reliability associated with packaging—even though this is a significant area of potential field failures. The following is a nonexhaustive list of common permanent failures in advanced ICs:

(a) Electromigration (EM)—A process by which sustained unidirectional current flow through an interconnect (wire) results in a progressive increase of wire resistance, eventually leading to a permanent open fault.

(b) Time-dependent dielectric breakdown (TDDB)—A process by which sustained bias applied across transistor gate dielectrics or interconnect dielectrics causes progressive degradation toward oxide breakdown, eventually leading to permanent short or stuck-at faults.

(c) Negative bias temperature instability (NBTI)—A process by which sustained gate bias applied to a transistor device causes a gradual upward shift of its threshold voltage and degradation of carrier mobility, reducing its speed and current-drive capability and eventually leading to permanent circuit failure.

(d) Hot carrier injection (HCI)—A process by which sustained switching activity in a transistor device causes a gradual upward shift of its threshold voltage and degradation of carrier mobility, reducing its speed and current-drive capability and eventually leading to permanent circuit failure.

Reliability mechanisms can be broadly divided into two categories. “Random” or “hard” failure mechanisms are by nature statistical. This type of failure is associated with a distinct failure time; however, that failure time is a random variable, different for each similar circuit element even when subjected to the same voltage and temperature history. The first two mechanisms in the list above (i.e., EM and TDDB) fall into this category. EM typically follows a log-normal probability distribution of failure times, with σ ≈ 0.2. TDDB typically follows a Weibull distribution with shape parameter ≥ 1 in the case of state-of-the-art gate dielectrics.
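For a concrete sense of what these two statistical models imply, the single-element cumulative failure probability at a given time can be evaluated directly from the log-normal and Weibull CDFs; the scale parameters below are arbitrary placeholders, not values from the chapter:

```python
import math

def lognormal_cdf(t, t50, sigma):
    """Fraction of elements failed by time t for a log-normal distribution with
    median failure time t50 and shape parameter sigma (e.g., EM with sigma ~ 0.2)."""
    return 0.5 * (1.0 + math.erf(math.log(t / t50) / (sigma * math.sqrt(2.0))))

def weibull_cdf(t, eta, beta):
    """Fraction of elements failed by time t for a Weibull distribution with
    characteristic life eta and shape parameter beta (e.g., TDDB with beta >= 1)."""
    return 1.0 - math.exp(-((t / eta) ** beta))

# Illustrative numbers only: single-element probabilities at a 7-year use point.
print(lognormal_cdf(t=7.0, t50=10.0, sigma=0.2))   # small but nonzero fraction
print(weibull_cdf(t=7.0, eta=100.0, beta=1.5))      # chip-level targets are far tighter
```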

The other category of failure is “wearout.” This type of mechanism results in a gradual shift of electrical characteristics, which is the same for every similar circuit element subject to the same voltage and temperature history. This may eventually lead to circuit failure if and when the characteristics shift outside of the allowable operating range. This type of gradual, continuous parametric shift can lead, for example, to changes in switching speed which may in turn lead to critical path timing errors, or to reduced noise margins which may lead to register errors especially in circuits operating near their voltage limits. The last two mechanisms above (i.e., NBTI and HCI) are of this type. The reader should note that the term “wearout” is very often used in a different sense, to denote the increasing failure rate near end-of-life (EOL) in the traditional bathtub-shaped reliability curve. This statistical meaning does not consider the underlying physical mechanism.

In both wearout and random failure mechanisms the underlying physics involves a gradual accumulation of damage to the device or circuit element. The fundamental difference is that in the random or hard failure type, this damage is not readily apparent until a certain threshold is reached. For example, in electromigration (EM), metal vacancies appear and grow until finally an open circuit forms, resulting in a sudden increase in metal resistance. There is typically a gradual increase in resistance before the complete open circuit, but this is usually small enough to be ignored in practice. Fig. 3 shows a typical resistance versus time plot for electromigration. Similar characteristics are also seen in TDDB, where the phenomenon known as “progressive” breakdown has been extensively studied [17].


FIG. 3 Interconnect resistivity increase over time, under sustained EM stress.

In wearout mechanisms the damage is manifest as a continuous shift in electrical characteristic, lacking any sudden (delayed) onset. Fig. 4 shows typical threshold voltage shift versus time for NBTI. In very small transistors, wearout mechanisms such as threshold voltage shifts are subject to statistical variation because of the discrete nature of charge. Thus, even these “uniform” wearout mechanisms should be treated statistically. This is still a subject of current investigation and the treatment of the statistical distribution of NBTI and HCI has not yet seen widespread implementation in practice [18]. NBTI-induced wearout can be annealed or relaxed when the gate bias is removed, as shown in Fig. 4.


FIG. 4 Typical NBTI-induced threshold voltage stress/recovery characteristics.

Fig. 5 shows the typical analytical models for each mechanism. This figure highlights the dependence of each mechanism on physical and operational parameters (acceleration factors). Thermal activation energies are listed in the third column. EM and TDDB are the failure modes most strongly accelerated by temperature, which makes them likely candidates for reliability limiters associated with hot spots. However, since these are both random statistical processes (described by the failure-time distributions given in the figure), the failure of any particular chip cannot be guaranteed; only the probability of failure can be ascertained. In all cases the absolute failure times or parameter shifts depend strongly on structural and material parameters, e.g., insulator thickness, device size, etc.


FIG. 5 Analytical models for the considered physical mechanisms associated with reliability vulnerabilities.
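Fig. 5 itself is not reproduced in this excerpt. As a stand-in, the sketch below uses widely quoted generic forms, Black's equation for EM and an Arrhenius temperature term shared by several mechanisms, to show how strongly a hot spot can accelerate damage; the activation energy and current exponent are illustrative assumptions, not the chapter's calibrated parameters:

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_af(Ea_eV, T_use_K, T_hot_K):
    """Temperature acceleration factor exp(Ea/k * (1/T_use - 1/T_hot)):
    how much faster damage accumulates at T_hot than at T_use."""
    return math.exp(Ea_eV / K_BOLTZMANN_EV * (1.0 / T_use_K - 1.0 / T_hot_K))

def black_mttf(A, J, n, Ea_eV, T_K):
    """Black's equation for EM median time to failure: MTTF = A * J**-n * exp(Ea/kT).
    A, n, and Ea are technology-specific fitting parameters (placeholders here)."""
    return A * J ** (-n) * math.exp(Ea_eV / (K_BOLTZMANN_EV * T_K))

# Illustrative: a 30 K hot spot with an assumed Ea of 0.9 eV.
print(arrhenius_af(Ea_eV=0.9, T_use_K=358.0, T_hot_K=388.0))       # ~10x faster wearout
# With n = 1, doubling the current density halves the median EM lifetime.
print(black_mttf(1.0, 1.0, 1.0, 0.9, 358.0) / black_mttf(1.0, 2.0, 1.0, 0.9, 358.0))
```

With these placeholder values, a 30 K hot spot speeds up EM wearout by roughly an order of magnitude, which is why hot spots are flagged above as likely reliability limiters.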

EM testing typically entails measurements on various metal-line and via test structures that allow evaluation of electromigration under unidirectional current flow, with electron flow either up from a via below the metal line (via depletion) or down from a via above the metal line (line depletion). Stresses are typically carried out at elevated temperature (e.g., 250–300°C) using a constant current stress J_a on the order of 10 mA/μm². Circuits vulnerable to EM failure are those with highly loaded devices or high duty factors, i.e., those that tend to drive DC currents. The EOL criterion is typically a resistance increase (dR/R) ≥ 20% or an excess leakage current (e.g., due to metal extrusion) >1 μA at the stress condition. This is a reliability-engineering benchmark, but it may not cause circuit failure in every case.
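For a resistance trace like the one in Fig. 3, the dR/R ≥ 20% EOL criterion can be applied mechanically. A minimal sketch, using a synthetic trace in place of measured data (the numbers are invented for illustration):

```python
def em_failure_time(times, resistances, r_initial=None, threshold=0.20):
    """Return the first time at which dR/R exceeds the EOL threshold (20% by default),
    or None if the sample never fails within the measured window."""
    r0 = r_initial if r_initial is not None else resistances[0]
    for t, r in zip(times, resistances):
        if (r - r0) / r0 >= threshold:
            return t
    return None

# Synthetic trace: slow drift followed by a sudden jump (void formation), as in Fig. 3.
times = [0, 100, 200, 300, 400, 500]                 # stress hours
resistances = [1.00, 1.01, 1.02, 1.03, 1.35, 2.10]   # normalized resistance
print(em_failure_time(times, resistances))           # -> 400
```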

TDDB testing for gate dielectrics is performed by stressing individual transistors under inversion conditions (i.e., gate positive for n-fets and gate negative for p-fets). Stresses are typically carried out at elevated temperature (e.g., 100–180°C) using a constant voltage stress V_a on the order of 2–5 V in order to obtain reasonable failure times. The EOL criterion is typically any small increase in leakage current (“first breakdown”) or an excess leakage current of, say, >10 μA at the use condition. This level of leakage has been shown to impact the functionality of certain circuits such as SRAM, but may not cause all circuit types to fail.

Back-end (intermetal dielectric) TDDB is an area of increasing interest and concern, as the insulator thickness, particularly between the gate conductor and the source or drain contact metals, is now comparable to the thicknesses used under the gate only a couple of decades ago. Similar phenomenology and models describe back-end TDDB, but breakdown voltages or fields in these insulators are often limited by extrinsic or integration-related issues (etch profiles, metal contamination, etc.). For back-end TDDB, the test structures are typically comb structures.

The EM and TDDB equations in Fig. 5 give the allowed current density J_use (for EM) or voltage V_use (for TDDB) corresponding to a specified median fail time t_use for a single metal line, via, transistor gate, or other dielectric. The median failure time under accelerated stress conditions is t_a, corresponding to the stress current density J_a (for EM) or voltage V_a (for TDDB). The distribution of EM failure times follows log-normal statistics, while TDDB failure times typically follow Weibull statistics, at least in the low-failure-rate tail which is of interest.
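Assuming the EM relation in Fig. 5 takes the usual Black's-law form, a current-density power law multiplied by an Arrhenius term (an assumption here, since the figure is not reproduced), the allowed use current density follows from the accelerated-stress result by inverting that relation. All parameter values below are placeholders:

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # eV/K

def allowed_use_current(J_a, t_a, t_use, Ea_eV, T_a_K, T_use_K, n):
    """Invert a Black's-law style EM model, t50 ~ J**-n * exp(Ea/kT), to get the
    use current density yielding median fail time t_use, given a stress result
    (J_a, t_a) measured at temperature T_a_K."""
    temp_term = math.exp(Ea_eV / K_BOLTZMANN_EV * (1.0 / T_use_K - 1.0 / T_a_K))
    return J_a * ((t_a / t_use) * temp_term) ** (1.0 / n)

# Illustrative: 300°C stress at 10 mA/um^2 with a 100 h median fail time,
# derated to a 10-year median-lifetime requirement at 105°C.
# Note: this scales only the *median* lifetime; production derating additionally
# budgets for the low-failure-rate tail and the element count per chip (next paragraph).
print(allowed_use_current(J_a=10.0, t_a=100.0, t_use=10 * 8760.0,
                          Ea_eV=0.9, T_a_K=573.0, T_use_K=378.0, n=1.0))
```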

Since semiconductor technology is typically qualified to support parts-per-million chip-level reliability, with at least 1E5 lines per chip and ≈1E9 transistors per chip, the median failure time for a single wire or device must exceed the product lifetime by many orders of magnitude. Therefore, significant current, voltage, or temperature excursions may be required to induce failure on any given device or target circuit.
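The arithmetic behind that statement is simple. Assuming, for illustration, a 100 ppm chip-level failure budget spread uniformly over independent elements (the budget value is a placeholder; the element counts are the ones quoted above):

```python
# Chip-level reliability budget spread across on-chip elements.
chip_fail_target = 100e-6      # e.g., 100 ppm of chips failing over the product life
n_lines = 1e5                  # EM-vulnerable lines per chip (from the text)
n_transistors = 1e9            # transistors per chip (approximate, from the text)

# With independent elements, each element's allowed failure probability over the
# product lifetime is roughly the chip budget divided by the element count.
per_line_budget = chip_fail_target / n_lines
per_fet_budget = chip_fail_target / n_transistors
print(per_line_budget)   # ~1e-9 per line
print(per_fet_budget)    # ~1e-13 per transistor
# Hitting such tiny tail probabilities forces the *median* failure time of any
# single element far beyond the product lifetime.
```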

On the other hand, since NBTI and HCI are uniform wearout modes, with similar threshold-voltage shift or current degradation versus time for all transistors operating at the same voltage and temperature, potential failures caused by these mechanisms under voltage or temperature excursions should be more readily predictable. As NBTI and HCI degradation cause transistors to become weaker, circuits may become vulnerable to timing errors or decreased noise immunity. The NBTI and HCI equations in Fig. 5 give the threshold-voltage shift (for NBTI) or drain-current degradation (for HCI) as a function of usage time t_use and voltage V_use.

NBTI testing, like TDDB testing, is performed by stressing individual p-fet transistors under inversion conditions (gate negative). Stresses are typically carried out at elevated temperature (e.g., 80–150°C) using a constant voltage stress V_a of one to two times the normal operating voltage in order to obtain measurable shifts without oxide breakdown. A typical EOL criterion is a 50 mV threshold-voltage shift, although, as pointed out earlier, this is not a “hard” fail criterion since the shift is continuous and gradual. Circuits vulnerable to NBTI are those with low duty cycle (where a p-fet remains on for long times), since NBTI is exacerbated by sustained DC (nonswitching, transistor “on”) conditions.
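A common empirical shorthand for the kind of trend shown in Fig. 4 is a power law in stress time with voltage and Arrhenius temperature acceleration. The sketch below uses that generic form with placeholder coefficients (not the chapter's model) and checks the projection against the 50 mV criterion:

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # eV/K

def nbti_delta_vth(t_s, V, T_K, A=5e-3, gamma=3.0, Ea_eV=0.1, n=0.2):
    """Empirical NBTI threshold-voltage shift (volts): power law in stress time with
    voltage and Arrhenius temperature acceleration. All coefficients are illustrative
    placeholders, not fitted to any particular technology."""
    return A * (V ** gamma) * math.exp(-Ea_eV / (K_BOLTZMANN_EV * T_K)) * (t_s ** n)

EOL_SHIFT_V = 0.050  # 50 mV criterion from the text

ten_years_s = 10 * 365 * 24 * 3600
shift = nbti_delta_vth(t_s=ten_years_s, V=0.9, T_K=358.0)
print(f"Projected shift after 10 years: {shift * 1e3:.1f} mV, "
      f"{'exceeds' if shift > EOL_SHIFT_V else 'within'} the 50 mV criterion")
```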

HCI testing is also performed on individual transistors, but unlike TDDB and NBTI it requires a drain bias in addition to the gate bias, since the wearout is driven by energetic carriers in the channel. The drain is biased at a constant voltage stress V_a of one to two times the normal operating voltage, and the gate is typically held either at the same bias or at one-half of the drain voltage. Circuits vulnerable to HCI failure are those with highly loaded devices and high duty factors, since the damage occurs only during the conducting phase when a gate switches.

The hard-error vulnerability models discussed above (see Fig. 5) are physics-based behavioral equations that help calculate the mean time to failure (MTTF) at the individual component (device or interconnect) level. It should be noted that many of these equations evolve over time to capture the unique effects of particular semiconductor technology nodes. In fact, even within the same basic technology era, individual vendors maintain in-house, customized models that are specific to their particular foundry. The equations depicted in Fig. 5 are generic examples, representative of technologies of a recent prior era; as such, they should not be assumed to accurately represent particular, current-generation foundry technologies.

When it comes to modeling the effect of such hard errors at the system level, in the context of real application workloads, techniques like RAMP [19–21] at the processor-core level and follow-on multicore modeling work (e.g., Shin et al. [22–24]) are worth mentioning. The idea is to first collect representative utilization and duty-cycle statistics from architecture-level simulators driven by target application workloads. These are then used in conjunction with device-level physics models, as well as device density and circuit-level parameters, to deduce failures-in-time (FIT) values on a structure-by-structure basis. The structure-specific FITs are then combined appropriately to derive the overall chip FIT rate, from which the chip-level MTTF can be derived under suitable assumptions about the error incidence (distribution) functions.
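The final aggregation step of that flow is straightforward. The sketch below uses hypothetical structure names and FIT values, and a simple series-failure sum with an exponential (constant failure rate) assumption, which is only one of the possible error-incidence assumptions mentioned above:

```python
# Per-structure FIT values (failures per 1e9 device-hours), as would be produced by
# combining utilization/duty-cycle statistics with device-level models.
# Names and numbers are hypothetical placeholders.
structure_fits = {
    "integer_core": 12.0,
    "fpu": 8.5,
    "l2_cache_arrays": 25.0,
    "on_chip_interconnect": 6.0,
}

# Series failure model: the chip fails when any structure fails, so FITs add.
chip_fit = sum(structure_fits.values())

# Under a constant-failure-rate (exponential) assumption, MTTF = 1e9 / FIT hours.
chip_mttf_hours = 1e9 / chip_fit
print(f"Chip FIT: {chip_fit:.1f}, MTTF: {chip_mttf_hours:,.0f} hours "
      f"({chip_mttf_hours / 8760:.0f} years)")
```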

The next installment from this chapter discusses issues related to soft-error vulnerabilities.

Reprinted with permission from Elsevier/Morgan Kaufmann, Copyright © 2016
