Editor's Note: Embedded designers must contend with a host of challenges in creating systems for harsh environments. Harsh environments impose demanding requirements not only in terms of temperature extremes but also in areas including availability, security, tightly limited power budgets, and more. In Rugged Embedded Systems, the authors present a series of papers by experts in each of the areas that can present unusually demanding requirements. In Chapter 2 of the book, the authors address fundamental concerns in reliability and system resiliency. This series excerpts that chapter in the following installments:
– Reliable and power-aware architectures: Sustaining system resiliency
– Reliable and power-aware architectures: Measuring resiliency
– Reliable and power-aware architectures: Soft-error vulnerabilities (this article)
– Reliable and power-aware architectures: Microbenchmark generation
– Reliable and power-aware architectures: Measurement and modeling
Adapted from Rugged Embedded Systems, Computing in Harsh Environments, by Augusto Vega, Pradip Bose, and Alper Buyuktosunoglu.
CHAPTER 2. Reliable and power-aware architectures: Fundamentals and modeling
A. Vega*, P. Bose*, A. Buyuktosunoglu*, R.F. DeMara†
IBM T. J. Watson Research Center, Yorktown Heights, NY, United States*; University of Central Florida, Orlando, FL, United States†
6 SOFT-ERROR VULNERABILITIES
As CMOS technology scales and more transistors are packed onto the same chip, soft-error reliability has become an increasingly important design issue for processors. Soft errors (also called single-event upsets or transient errors), caused by alpha particles from packaging materials or by high-energy particle strikes from cosmic rays, can flip bits in storage cells or cause logic elements to generate the wrong result. Such errors have become a major challenge to processor design. If a particle strike causes a bit to flip or a piece of logic to generate the wrong result, we refer to it as a raw soft error event. Fortunately, not all raw soft errors cause the processor to fail. In a given cycle, only a fraction of the bits in a processor storage structure and some of the logic structures will affect the final program output. A raw error event that does not affect these critical bits or logic structures has no adverse effect on the program outcome and is said to be masked. For example, a soft error in the branch prediction unit or in an idle functional unit will not cause the program to fail.
Applications running on a given computing platform exhibit a wide range of tolerance to soft errors affecting the underlying hardware. For example, many image processing algorithms are quite tolerant to noise, since isolated bit-flips in image data frequently go unnoticed by the end-user or decision engine. Mukherjee et al. introduced the term architectural vulnerability factor (AVF) to quantify the architectural masking of raw soft errors in a processor structure. The AVF of a structure is effectively the probability that a visible error (failure) will occur, given a raw error event in the structure. The AVF can be calculated as the percentage of time the structure contains architecturally correct execution (ACE) bits (i.e., the bits that affect the final program output). Thus, for a storage cell, the AVF is the percentage of cycles that this cell contains ACE bits. For a logic structure, the AVF is the percentage of cycles that it processes ACE bits or instructions.
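The AVF definition can be made concrete with a small sketch. It assumes a hypothetical per-cycle trace that records how many ACE bits a structure holds in each cycle; the function name and the trace format are illustrative and do not come from the chapter.

```python
def avf(ace_bits_per_cycle, total_bits):
    """Average fraction of a structure's bits that are ACE over a trace.

    ace_bits_per_cycle: hypothetical trace, one ACE-bit count per cycle.
    total_bits: capacity of the structure in bits.
    """
    if total_bits <= 0 or not ace_bits_per_cycle:
        raise ValueError("need a non-empty trace and a positive bit count")
    cycles = len(ace_bits_per_cycle)
    return sum(ace_bits_per_cycle) / (total_bits * cycles)


# A 64-bit structure observed for 4 cycles: fully ACE, half ACE, idle, idle.
trace = [64, 32, 0, 0]
print(f"AVF = {avf(trace, total_bits=64):.3f}")  # prints "AVF = 0.375"
```

With `total_bits=1` the function reduces to the single-cell case described above: the fraction of cycles in which the cell holds an ACE bit.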
The AVF of a structure directly determines its mean time to failure (MTTF): the smaller the AVF, the larger the MTTF, and vice versa. It is therefore important to accurately estimate the AVF in the design stage to meet the reliability goal of the system. Many soft-error protection schemes, such as ECC and redundant units, carry significant space, performance, and/or energy overheads. Designing a processor without accurate knowledge of the AVF risks over- or underdesign. An AVF-oblivious design must consider the worst case, and so could incur unnecessary overhead. Conversely, a design that underestimates the AVF would not meet the desired reliability goal.
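The inverse relation between AVF and MTTF can be illustrated with a commonly used first-order model, which is an assumption here rather than something the chapter spells out: the effective failure rate is the raw soft-error rate multiplied by the AVF, and for a constant failure rate expressed in FIT (failures per 10^9 device-hours) the MTTF is 10^9 / FIT hours. The raw rate below is a made-up number.

```python
HOURS_PER_BILLION = 1e9  # one FIT = one failure per 10^9 device-hours


def effective_fit(raw_fit, avf):
    """Derate a structure's raw soft-error rate by its AVF."""
    if not 0.0 <= avf <= 1.0:
        raise ValueError("AVF must lie in [0, 1]")
    return raw_fit * avf


def mttf_hours(fit):
    """MTTF in hours for a constant failure rate given in FIT."""
    return float("inf") if fit == 0 else HOURS_PER_BILLION / fit


raw_fit = 1000.0  # made-up raw rate for one structure, for illustration only
for a in (1.0, 0.5, 0.1):
    print(f"AVF={a}: MTTF={mttf_hours(effective_fit(raw_fit, a)):.3g} hours")
```

Under this model, assuming AVF = 1.0 is the AVF-oblivious worst case; an accurate, lower AVF estimate shows how much protection overhead might be avoidable.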
Soft errors may be caused by various events including neutrons from cosmic particle incidence, alpha particles from trace radioactive content in packaging materials, and inductive noise effects (L di/dt) on the chip supply voltage resulting from aggressive forms of dynamic power management. As technology continues scaling down, early estimation of soft errors in SRAM cells and in latch and logic elements becomes even more critical. Chip design must begin with a consideration of system-level MTTF targets, and the design methodology must be able to estimate or set bounds for chip-level failure rates with reasonable accuracy in order to avoid in-field system quality problems. A balanced combination of circuit- and logic-level innovations and architecture- and software-level solutions is necessary to achieve the required resiliency to single-event upsets (SEUs). In particular, there is a need for a comprehensive understanding of the vulnerabilities associated with the various units on the chip with regard to workload behavior. When such information is available, appropriate approaches (such as selective duplication, adoption of soft-error-rate (SER) tolerant latch designs, and error-correcting code (ECC) and parity protection of SER hotspots; see Note 1) may be used for efficient error resiliency.
[Note 1: SER hotspot refers to a region of the chip that is deemed to be highly vulnerable to bit flips in an element (latch, combinational logic, or SRAM). An upset in this region is much more likely to cause a program failure than other adjoining areas. The SER hotspot profile across the chip floorplan may vary with the executing workload; however, it is quite likely that some of the SER hotspots may be largely invariant with regard to the workload. For example, portions of the instruction decode unit (through which all program instructions must pass) may well turn out to be SER hotspots across a wide range of input workloads.]
6.1 APPLICATION CHARACTERIZATION THROUGH FAULT INJECTION
One approach to assessing the level of resiliency of a system is to artificially inject faults and analyze the system’s resulting behavior. When carefully designed, fault injection campaigns can significantly increase the coverage of early fault tolerance and resilience testing. Fault injection can be applied at different levels in the hardware-software stack. In particular, application-level fault injection (AFI) is an effective approach, given that it is relatively easy and flexible to manipulate an application’s architectural state. Fig. 6 presents a high-level overview of a possible fault injector.
FIG. 6 High-level block diagram of a possible AFI framework.
In Unix/Linux scenarios, it is possible to resort to tools like the “process trace” (ptrace) debugging facility to put together a framework in which any target application can be compiled and run under user-controllable fault injection directives. The ptrace-centered facility provides a mechanism by which a parent process can observe and control the execution of another process—namely the application execution process that is targeted for fault injection. The parent process can examine and change the architected register and memory state of the monitored application execution process.
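The state-modification step of such an injector can be sketched without ptrace itself. The Python below only models it: the register file is a plain dictionary standing in for the state a parent process would read and write through the debugging interface, and the register names, the 64-bit width, and the seed are all assumptions made for illustration.

```python
import random

WORD_BITS = 64  # assumed register width


def flip_bit(value, bit):
    """Return value with one bit inverted, kept within WORD_BITS."""
    if not 0 <= bit < WORD_BITS:
        raise ValueError("bit index out of range")
    return (value ^ (1 << bit)) & ((1 << WORD_BITS) - 1)


def inject_fault(registers, rng):
    """Flip one pseudorandomly chosen bit in one pseudorandomly chosen register.

    registers: dict standing in for the architected register state that a
    real injector would read and write through a debugging interface.
    Returns (register_name, bit) so the injection can be logged and replayed.
    """
    name = rng.choice(sorted(registers))
    bit = rng.randrange(WORD_BITS)
    registers[name] = flip_bit(registers[name], bit)
    return name, bit


regs = {"r0": 0, "r1": 0xFF, "r2": 12345}  # hypothetical register names
rng = random.Random(2016)                   # fixed seed for repeatability
print(inject_fault(regs, rng), regs)
```

Seeding the generator and returning the chosen register and bit keeps each injection reproducible, which helps when a crash or hang needs to be replayed.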
A fault injection facility like the one just described is designed to identify the most vulnerable regions of a targeted processor chip from the perspective of transient and permanent fault injections. As an example, we next describe experiments where pseudorandom fault injections were targeted to corrupt register state, instruction memory state, and data memory state of the application running on a POWER7 machine under the AIX (IBM Unix) operating system. A well-known challenge in fault injection experiments is determining how many injections are necessary and sufficient for the results to be representative. Fig. 7 presents the masking saturation rate of three exemplary workloads as the number of faults injected increases. For these workloads in this particular example, 5000 injections were sufficient for the masking rate to stabilize.
FIG. 7 Masking saturation rate for CG-A, mmm, and bzip2.
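One way to check for this kind of saturation is to track the cumulative masked fraction as injections accumulate and test whether it has stopped moving. The sketch below uses synthetic outcomes drawn with a made-up masking probability; the window length and tolerance are arbitrary assumptions, not values from the chapter.

```python
import random


def running_masked_rate(outcomes):
    """Cumulative fraction of masked outcomes after each injection."""
    rates, masked = [], 0
    for n, is_masked in enumerate(outcomes, start=1):
        masked += bool(is_masked)
        rates.append(masked / n)
    return rates


def has_saturated(rates, window=1000, tol=0.01):
    """True if the rate moved by less than tol over the last `window` injections."""
    if len(rates) < window:
        return False
    recent = rates[-window:]
    return max(recent) - min(recent) < tol


rng = random.Random(7)
# Synthetic outcomes only: each injection is masked with a made-up probability.
outcomes = [rng.random() < 0.8 for _ in range(5000)]
rates = running_masked_rate(outcomes)
print(rates[-1], has_saturated(rates))
```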
As stated before, in this example a total of 5000 single-bit, pseudorandom fault injections were made into the architected register space in each controlled experiment (e.g., using various compiler optimization levels). Each such injection leads to one of the following outcomes:
– Fully masked: the bit that is flipped via the injection has no effect on the program execution profile, including the final program output or final data memory state.
– Silent data corruption (SDC): the program output or the final data memory state shows an error when compared to the fault-free, “golden” run of the program.
– Crash: the operating system terminates the program due to a detected runtime error (e.g., a divide-by-zero exception, a segmentation fault, or an illegal memory reference).
– Hang: the injected bit-flip results in a “hung” state, where there is no forward progress of the program execution; in practice, such a situation would require a user-initiated program termination or even a physical machine reboot.
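The four outcomes above can be sketched as a small classifier that compares a fault-injected run against the golden run. The `RunResult` record and its field names are hypothetical; in particular, this sketch detects masking only by comparing outputs and final memory images, which is narrower than the "no effect on the execution profile" criterion in the text.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    MASKED = "masked"
    SDC = "silent data corruption"
    CRASH = "crash"
    HANG = "hang"


@dataclass
class RunResult:
    """Hypothetical record of one program run."""
    output: bytes            # program output plus final data memory image
    crashed: bool = False    # terminated by the OS on a runtime error
    timed_out: bool = False  # no forward progress within the time limit


def classify(golden, faulty):
    """Map one fault-injected run to one of the four outcome categories."""
    if faulty.timed_out:
        return Outcome.HANG
    if faulty.crashed:
        return Outcome.CRASH
    if faulty.output != golden.output:
        return Outcome.SDC
    return Outcome.MASKED


golden = RunResult(output=b"42\n")
print(classify(golden, RunResult(output=b"43\n")))  # prints "Outcome.SDC"
```

Hangs and crashes are checked first because a run that never finishes, or is killed early, has no meaningful output to compare.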
From the measured percentage of each event’s occurrence, the corresponding failure rate can be estimated, provided the total number of random experiments (e.g., 5000) is deemed to be large enough to draw statistically meaningful inferences. Fig. 8 shows snapshot results (analyzing the effect of register state fault injections) obtained using AFI for two example applications: mmm in Fig. 8A (which is a kernel within the dgemm family) and bzip2 in Fig. 8B. The data shown in Fig. 8 suggest that vulnerability to random bit flips worsens with the degree of compiler optimization, at least for these two example applications.
FIG. 8 Fault injection experimental results for different compiler optimization levels. (A) mmm benchmark. (B) bzip2 benchmark.
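Turning outcome tallies into rates, with a rough indication of whether the sample is large enough, might look like the following sketch. The confidence interval uses the normal approximation to the binomial, which is a technique chosen here for illustration and not one the chapter names, and the tallies are made up rather than taken from Fig. 8.

```python
import math
from collections import Counter


def outcome_rates(counts, z=1.96):
    """Per-outcome fraction and normal-approximation confidence half-width."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no experiments recorded")
    rates = {}
    for outcome, k in counts.items():
        p = k / total
        half_width = z * math.sqrt(p * (1.0 - p) / total)
        rates[outcome] = (p, half_width)
    return rates


# Made-up tallies for 5000 injections; not data from the chapter.
counts = Counter(masked=4000, sdc=400, crash=500, hang=100)
for outcome, (p, hw) in outcome_rates(counts).items():
    print(f"{outcome:>6}: {100 * p:5.1f}% +/- {100 * hw:.1f}%")
```

The approximation becomes unreliable for outcomes with very few occurrences, so rare categories would call for a more careful interval.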
The next installment from this chapter discusses issues in generating appropriate microbenchmarks to observe the onset of vulnerabilities.
Reprinted with permission from Elsevier/Morgan Kaufmann, Copyright © 2016