Reliable and power-aware architectures: Sustaining system resiliency
Editor's Note: Embedded designers must contend with a host of challenges in creating systems for harsh environments. Harsh environments present unique characteristics not only in terms of temperature extremes but also in areas including availability, security, very limited power budget, and more. In Rugged Embedded Systems, the authors present a series of papers by experts in each of the areas that can present unusually demanding requirements. In this chapter from the book, the authors address fundamental concerns in reliability and system resiliency.
Elsevier is offering this and other engineering books at a 30% discount. To use this discount, click here and use code ENGIN317 during checkout.
Adapted from Rugged Embedded Systems, Computing in Harsh Environments, by Augusto Vega. Pradip Bose, Alper Buyuktosunoglu.
CHAPTER 2. Reliable and power-aware architectures: Fundamentals and modeling
A. Vega*, P. Bose*, A. Buyuktosunoglu*, R.F. DeMara†
IBM T. J. Watson Research Center, Yorktown Heights, NY, United States* University of Central Florida, Orlando, FL, United States†
Chip power consumption is one of the most challenging and transforming issues that the semiconductor industry has encountered in the past decade, and its sustained growth has resulted in various concerns, especially when it comes to chip reliability. It translates into thermal issues that could harm the chip. It can also determine (i.e., limit) battery life in the mobile arena. Furthermore, attempts to circumvent the power wall through techniques like near-threshold voltage (NTV) computing lead to other serious reliability concerns. For example, chips become more susceptible to soft errors at lower voltages. This scene becomes even more disturbing when we add an extra variable: a hostile (or harsh) surrounding environment. Harsh environmental conditions exacerbate already problematic chip power and thermal issues, and can jeopardize the operation of any conventional (i.e., nonhardened) processor.
This chapter discusses fundamental reliability concepts as well as techniques to deal with reliability issues and their power implications. The first part of the chapter discusses the concepts of error, fault, and failure, the resolution phases of resilient systems, and the definition and associated metrics of hard and soft errors. The second part presents two effective approaches to stress a system from the standpoints of resilience and power-awareness—namely fault injection and microbenchmarking. Finally, the last part of the chapter briefly introduces basic ideas related to power-performance modeling and measurement.
2 THE NEED FOR RELIABLE COMPUTER SYSTEMS
A computer system is a human-designed machine with a sole ultimate purpose: to solve human problems. In practice, this principle usually materializes as a service that the system delivers either to a person (the ultimate “consumer” of that service) or to other computer systems. The delivered service can be defined as the system’s externally perceived behavior  and when it matches what is “expected,” then the system is said to operate correctly (i.e., the service is correct). The expected service of a system is described by its functional specification which includes the description of the system functionality and performance, as well as the threshold between acceptable versus unacceptable behavior . In spite of the different (and sometimes even incongruous) definitions around system reliability, one idea is unanimously accepted: ideally, a computer system should operate correctly (i.e., stick to its functional specification) all the time; and when its internal behavior experiences anomalies, the impact on the external behavior (i.e., the delivered service) should be concealed or minimized.
In practice, a computer system can face anomalies (faults and errors) during operation which require palliative actions in order to conceal or minimize the impact on the system’s externally perceived behavior (failure). The concepts of error, fault, and failure are discussed in Section 2.1. The ultimate goal is to sustain the quality of the service (QoS) being delivered in an acceptable level. The range of possible palliative actions is broad and strongly dependent on the system type and use. For example, space-grade computers deployed on earth-orbiting satellites demand more effective (and frequently more complex) fault-handling techniques than computers embedded in mobile phones. But in most cases, these actions usually involve anomaly detection (AD), fault isolation (FI), fault diagnosis (FD), and fault recovery (FR). These four resolution phases are discussed in detail in Section 2.2.
Today, reliability has become one of the most critical aspects of computer system design. Technology scaling, per Moore’s Law has reached a stage where process variability, yield, and in-field aging threaten the economic viability of future scaling. Scaling the supply voltage down per classical Dennard’s rules has not been possible lately, because a commensurate reduction in device threshold voltage (to maintain performance targets) would result in a steep increase in leakage power. And, even a smaller rate of reduction in supply voltage needs to be done carefully—because of the soft error sensitivity to voltage. Other device parameters must be adjusted to retain per-device soft error rates at current levels in spite of scaling. Even with that accomplished, the per-chip soft error rate (SER) tends to increase with each generation due to the increased device density. Similarly, the dielectric (oxide) thickness within a transistor device has shrunk at a rate faster than the reduction in supply voltage (because of performance targets). This threatens to increase hard fail rates of processor chips beyond acceptable limits as well. It is uncertain today what will be the future impact of further miniaturization beyond the 7-nm technology node in terms of meeting an acceptable (or affordable) balance across reliability and power consumption metrics related to prospective computing systems. In particular for mission-critical systems, device reliability and system survivability pose increasingly significant challenges [2–5]. Error resiliency and self-adaptability of future electronic systems are subjects of growing interest [3, 6]. In some situations, even survivability in the form of graceful degradation is desired if a full recovery cannot be achieved. Transient, or so-called soft errors as well as permanent, hard errors in electronic devices caused by aging require autonomous mitigation as manual intervention may not be feasible . In application domains that include harsh operating environments (e.g., high altitude, which exacerbates soft error rates, or extreme temperature swings that exacerbate certain other transient and permanent failure rates), the concerns about future system reliability are of course even more pronounced. The reliability concerns of highly complex VLSI systems in sub-22 nm processes, caused by soft and hard errors, are increasing. Therefore, the importance of addressing reliability issues is on the rise. In general, a system is said to be resilient if it is capable of handling failures throughout its lifetime to maintain the desired processing performance within some tolerance.