Reliable and power-aware architectures: Sustaining system resiliency

Editor's Note: Embedded designers must contend with a host of challenges in creating systems for harsh environments. Harsh environments present unique characteristics not only in terms of temperature extremes but also in areas including availability, security, a very limited power budget, and more. In Rugged Embedded Systems, the authors present a series of papers by experts in each of the areas that can present unusually demanding requirements. In this chapter from the book, the authors address fundamental concerns in reliability and system resiliency. This series excerpts that chapter in the following installments:
– Reliable and power-aware architectures: Sustaining system resiliency (this article)
– Reliable and power-aware architectures: Measuring resiliency
– Reliable and power-aware architectures: Soft-error vulnerabilities
– Reliable and power-aware architectures: Microbenchmark generation
– Reliable and power-aware architectures: Measurement and modeling

Adapted from Rugged Embedded Systems: Computing in Harsh Environments, by Augusto Vega, Pradip Bose, and Alper Buyuktosunoglu.

CHAPTER 2. Reliable and power-aware architectures: Fundamentals and modeling

A. Vega*, P. Bose*, A. Buyuktosunoglu*, R.F. DeMara†
*IBM T. J. Watson Research Center, Yorktown Heights, NY, United States
†University of Central Florida, Orlando, FL, United States

1 INTRODUCTION

Chip power consumption is one of the most challenging and transforming issues that the semiconductor industry has encountered in the past decade, and its sustained growth has resulted in various concerns, especially when it comes to chip reliability. Excessive power consumption translates into thermal issues that can harm the chip, and it can also determine (i.e., limit) battery life in the mobile arena. Furthermore, attempts to circumvent the power wall through techniques like near-threshold voltage (NTV) computing lead to other serious reliability concerns; for example, chips become more susceptible to soft errors at lower voltages. The picture becomes even more troubling when we add an extra variable: a hostile (or harsh) surrounding environment. Harsh environmental conditions exacerbate already problematic chip power and thermal issues, and can jeopardize the operation of any conventional (i.e., nonhardened) processor.

This chapter discusses fundamental reliability concepts as well as techniques to deal with reliability issues and their power implications. The first part of the chapter discusses the concepts of error, fault, and failure, the resolution phases of resilient systems, and the definition and associated metrics of hard and soft errors. The second part presents two effective approaches to stress a system from the standpoints of resilience and power-awareness—namely fault injection and microbenchmarking. Finally, the last part of the chapter briefly introduces basic ideas related to power-performance modeling and measurement.

2 THE NEED FOR RELIABLE COMPUTER SYSTEMS

A computer system is a human-designed machine with a sole ultimate purpose: to solve human problems. In practice, this principle usually materializes as a service that the system delivers either to a person (the ultimate “consumer” of that service) or to other computer systems. The delivered service can be defined as the system’s externally perceived behavior [1], and when it matches what is “expected,” the system is said to operate correctly (i.e., the service is correct). The expected service of a system is described by its functional specification, which includes the description of the system’s functionality and performance, as well as the threshold between acceptable and unacceptable behavior [1]. In spite of the different (and sometimes even incongruous) definitions around system reliability, one idea is unanimously accepted: ideally, a computer system should operate correctly (i.e., stick to its functional specification) all the time; and when its internal behavior experiences anomalies, the impact on the external behavior (i.e., the delivered service) should be concealed or minimized.

In practice, a computer system can face anomalies (faults and errors) during operation which require palliative actions in order to conceal or minimize the impact on the system’s externally perceived behavior (failure). The concepts of error, fault, and failure are discussed in Section 2.1. The ultimate goal is to sustain the quality of service (QoS) being delivered at an acceptable level. The range of possible palliative actions is broad and strongly dependent on the system type and use. For example, space-grade computers deployed on earth-orbiting satellites demand more effective (and frequently more complex) fault-handling techniques than computers embedded in mobile phones. But in most cases, these actions usually involve anomaly detection (AD), fault isolation (FI), fault diagnosis (FD), and fault recovery (FR). These four resolution phases are discussed in detail in Section 2.2.

Today, reliability has become one of the most critical aspects of computer system design. Technology scaling per Moore’s Law has reached a stage where process variability, yield, and in-field aging threaten the economic viability of future scaling. Scaling the supply voltage down per classical Dennard rules has not been possible lately, because a commensurate reduction in device threshold voltage (to maintain performance targets) would result in a steep increase in leakage power. Even a more modest rate of supply voltage reduction must be handled carefully, because of the sensitivity of soft errors to voltage. Other device parameters must be adjusted to retain per-device soft error rates at current levels in spite of scaling. Even with that accomplished, the per-chip soft error rate (SER) tends to increase with each generation due to the increased device density. Similarly, the dielectric (oxide) thickness within a transistor device has shrunk at a rate faster than the reduction in supply voltage (because of performance targets), which threatens to increase hard fail rates of processor chips beyond acceptable limits as well. It is uncertain today whether further miniaturization beyond the 7-nm technology node will allow prospective computing systems to strike an acceptable (or affordable) balance between reliability and power consumption. In particular for mission-critical systems, device reliability and system survivability pose increasingly significant challenges [2–5]. Error resiliency and self-adaptability of future electronic systems are subjects of growing interest [3, 6]. In some situations, even survivability in the form of graceful degradation is desired if a full recovery cannot be achieved. Transient (so-called soft) errors as well as permanent (hard) errors in electronic devices caused by aging require autonomous mitigation, as manual intervention may not be feasible [7]. In application domains that involve harsh operating environments (e.g., high altitude, which exacerbates soft error rates, or extreme temperature swings, which exacerbate certain other transient and permanent failure rates), the concerns about future system reliability are of course even more pronounced. Reliability concerns for highly complex VLSI systems in sub-22-nm processes, caused by soft and hard errors, are therefore of rising importance. In general, a system is said to be resilient if it is capable of handling failures throughout its lifetime to maintain the desired processing performance within some tolerance.
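
As a back-of-the-envelope illustration of this density effect (with assumed figures, not measurements from the text): if the per-bit SER is held constant while the number of susceptible on-chip bits doubles each generation, the chip-level SER doubles as well. A minimal C sketch of this arithmetic:

#include <stdio.h>

/* Back-of-the-envelope chip-level soft error rate (SER) estimate.
 * Simplifying assumption (illustrative only): chip SER scales linearly
 * with the number of susceptible bits at a fixed per-bit rate.
 * FIT = failures in 10^9 device-hours.
 */
int main(void)
{
    const double fit_per_bit = 1.0e-5;      /* assumed per-bit SER, in FIT   */
    double bits = 1.0e9;                    /* on-chip storage bits, gen N   */

    for (int gen = 0; gen < 4; gen++) {
        double chip_fit = fit_per_bit * bits;
        printf("gen N+%d: %.0e bits -> %.1f FIT\n", gen, bits, chip_fit);
        bits *= 2.0;                        /* density doubles per generation */
    }
    return 0;
}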

2.1 SUSTAINING QUALITY OF SERVICE IN THE PRESENCE OF FAULTS, ERRORS, AND FAILURES

To advance beyond static redundancy in the nanoscale era, it is essential to consider innovative resilient techniques which distinguish between faults, errors, and failures in order to handle each of them appropriately. Fig. 1 depicts each of these terms using a layered model of system dependability.

FIG. 1. Layered model of system dependability.

The resource layer consists of all of the physical components that underlie all of the computational processes used by an (embedded) application. These physical components span a range of granularities including logic gates, field-programmable gate array (FPGA) look-up tables, circuit functional units, processor cores, and memory chips. Each physical component is considered to be viable during the current computation if it operates without exhibiting defective behavior at the time that it is utilized. On the other hand, components which exhibit defective behavior are considered to be faulty; they may be faulty initially or become faulty at any time during the mission. Initially faulty resources are a direct result of a priori manufacturing imperfections such as contaminants, or of random effects creating process variation beyond allowed design tolerances [8]. As depicted by the cumulative arc in Fig. 1, for highly scaled devices each component may transition from viable to faulty status during the mission. This transition may occur due to cumulative effects in deep submicron devices such as time-dependent dielectric breakdown (TDDB) due to electrical-field weakening of the gate oxide layer, total ionizing dose (TID) of cosmic radiation, electromigration within interconnect, and other progressive degradations over the mission lifetime. Meanwhile, transient effects such as incident alpha particles which ionize critical amounts of charge, ground bounce, and dynamic temperature variations may cause either long-lasting or intermittent reversible transitions between viable and faulty status. In this sense, faults may lie dormant whereby the physical resource is defective, yet currently unused. Later in the computation, dormant faults become active when such components are utilized.
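
The viable/faulty distinction, and the notion of dormant versus active faults, can be captured in a minimal data structure. The following C sketch is an illustrative model (the type and function names are hypothetical, not taken from the book): a fault counts as active only when the defective resource is actually utilized.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative resource-layer model: a fault is dormant while the
 * defective resource is unused, and becomes active once utilized. */
typedef enum { RES_VIABLE, RES_FAULTY } res_health_t;

typedef struct {
    const char   *name;
    res_health_t  health;   /* physical condition of the component        */
    bool          in_use;   /* is it utilized by the current computation? */
} resource_t;

/* A dormant fault: defective but not currently used. */
static bool fault_is_dormant(const resource_t *r)
{
    return r->health == RES_FAULTY && !r->in_use;
}

/* An active fault: defective and utilized, so errors may manifest. */
static bool fault_is_active(const resource_t *r)
{
    return r->health == RES_FAULTY && r->in_use;
}

int main(void)
{
    resource_t lut = { "FPGA LUT #42", RES_FAULTY, false };
    printf("%s: dormant=%d active=%d\n", lut.name,
           fault_is_dormant(&lut), fault_is_active(&lut));

    lut.in_use = true;   /* the computation now touches the faulty LUT */
    printf("%s: dormant=%d active=%d\n", lut.name,
           fault_is_dormant(&lut), fault_is_active(&lut));
    return 0;
}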

The behavioral layer shown in Fig. 1 depicts the outcome of utilizing viable and faulty physical components. Viable components result in correct behavior during the interval of observation. Meanwhile, utilization of faulty components manifests errors in the behavior according to the input/output and/or timing requirements which define the constituent computation. Still, an error which occurs but has no impact on the result of the computation is termed a silent error. Silent errors, such as a flipped bit due to a faulty memory cell at an address which is not referenced by the application, remain isolated at the behavioral layer without propagating to the application. On the other hand, errors which are articulated propagate up to the application layer.

The application layer shown in Fig. 1 depicts that correct behaviors contribute to the sustenance of compliant operation. Systems that are compliant throughout the mission at the application layer are deemed to be reliable. To remain completely compliant, all articulated errors must be concealed from the application and remain within the behavioral layer. For example, error masking techniques which employ voting schemes achieve reliability objectives by insulating articulated errors from the application. Articulated errors which reach the application cause the system to deliver degraded performance if the impact of the error can be tolerated. On the other hand, articulated errors which result in unacceptable conditions for the application incur a failure condition. Failures may be catastrophic, but more often are recoverable (e.g., using some of the techniques discussed in Chapter 4). In general, resilience techniques that can provide a continuum in QoS (spanning from completely meeting requirements down to inadequate performance from the application perspective) are very desirable. This mapping of the QoS continuum to the application states of compliant, degraded, and failure is depicted near the top of Fig. 1.
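
One way to summarize the layered model is as a mapping from error articulation to application-level state. The C sketch below is a hypothetical classification routine (not code from the book): a silent or masked error leaves the application compliant, a tolerated articulated error degrades QoS, and an untolerated one constitutes a failure.

#include <stdio.h>

/* Hypothetical mapping from behavioral-layer outcomes to
 * application-layer states (compliant / degraded / failure). */
typedef enum { APP_COMPLIANT, APP_DEGRADED, APP_FAILURE } app_state_t;

typedef struct {
    int error_occurred;   /* did a fault manifest as an error?          */
    int articulated;      /* did the error propagate past the behavior? */
    int masked;           /* was it concealed (e.g., by voting)?        */
    int tolerable;        /* can the application absorb the impact?     */
} error_event_t;

static app_state_t classify(const error_event_t *e)
{
    if (!e->error_occurred || !e->articulated || e->masked)
        return APP_COMPLIANT;          /* silent or concealed error      */
    if (e->tolerable)
        return APP_DEGRADED;           /* articulated but tolerated      */
    return APP_FAILURE;                /* articulated and unacceptable   */
}

int main(void)
{
    error_event_t silent = { 1, 0, 0, 0 };
    error_event_t voted  = { 1, 1, 1, 0 };
    error_event_t harsh  = { 1, 1, 0, 0 };
    printf("silent -> %d, masked -> %d, unmasked -> %d\n",
           classify(&silent), classify(&voted), classify(&harsh));
    return 0;
}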

2.2 PROCESSING PHASES OF COMPUTING SYSTEM RESILIENCY

A four-state model of system resiliency is shown in Fig. 2. For purposes of discussion, the initial and predominant condition is depicted as the lumped representation of the useful operational states of compliant or degraded performance in the upper center of the figure. To deal with contingencies in an attempt to return to a compliant or degraded state, resilient computing systems typically employ a sequence of resolution phases including AD, FI, FD, and FR, using a variety of techniques, some of which are described in this and the following chapters. Additionally, methods such as radiation shielding attempt to prevent certain anomalies, such as alpha particle-induced soft errors, from occurring in the first place.

FIG. 2. Resiliency-enabled processing phases.
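
A resilient runtime can be organized around these phases as a simple state machine. The sketch below is a schematic rendering of this flow in C; the detection, isolation, diagnosis, and recovery routines are placeholder stubs, not the mechanisms described in the book.

/* Schematic state machine for the AD -> FI -> FD -> FR resolution flow.
 * The phase bodies are placeholders; a real system would plug in its own
 * detection, isolation, diagnosis, and recovery mechanisms. */
#include <stdio.h>

typedef enum { OPERATIONAL, ANOMALY_DETECT, FAULT_ISOLATE,
               FAULT_DIAGNOSE, FAULT_RECOVER } phase_t;

static int  anomaly_detected(void) { return 1; }  /* stub: AD               */
static void isolate_fault(void)    { }            /* stub: FI               */
static void diagnose_fault(void)   { }            /* stub: FD               */
static int  recover_fault(void)    { return 1; }  /* stub: FR, 1 = success  */

int main(void)
{
    phase_t phase = OPERATIONAL;
    for (int step = 0; step < 6; step++) {
        switch (phase) {
        case OPERATIONAL:
            phase = anomaly_detected() ? ANOMALY_DETECT : OPERATIONAL;
            break;
        case ANOMALY_DETECT:             /* anomaly confirmed            */
            phase = FAULT_ISOLATE;
            break;
        case FAULT_ISOLATE:
            isolate_fault();             /* locate faulty component(s)   */
            phase = FAULT_DIAGNOSE;
            break;
        case FAULT_DIAGNOSE:
            diagnose_fault();            /* characterize the fault       */
            phase = FAULT_RECOVER;
            break;
        case FAULT_RECOVER:              /* reconfigure / repair         */
            phase = recover_fault() ? OPERATIONAL : FAULT_RECOVER;
            break;
        }
        printf("step %d -> phase %d\n", step, phase);
    }
    return 0;
}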

Redundancy-based AD methods are popular throughout the fault-tolerant systems community, although they incur significant area and energy overhead costs. In the comparison diagnosis model [9, 10], units are evaluated in pairs when subjected to identical inputs. Under this AD technique, any discrepancy between the units’ outputs indicates the occurrence of at least a single failure. However, two or more identical common-mode failures (CMFs) which occur simultaneously in each module may go undetected. For instance, a concurrent error detection (CED) arrangement utilizes either two concurrent replicas of a design [11] or a diverse duplex design to reduce CMFs [12]. This raises the concept of design diversity in redundant systems. Namely, triple modular redundancy (TMR) systems can be implemented using physically distinct, yet functionally identical designs. Granted, the meaning of physically distinct differs when referring to FPGAs than when referring to application-specific integrated circuits (ASICs). In FPGAs, two modules are said to be physically distinct if the look-up tables in the same relative location on both modules do not implement the same logical function. TMR systems based on diverse designs possess more immunity toward CMFs that impact multiple modules at the same time in the same manner, generally due to a common cause.
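
A minimal concrete form of the comparison diagnosis model is duplicate-and-compare: run two functionally identical units on the same input and flag any mismatch. The C fragment below sketches this idea with stand-in modules (illustrative only); as noted above, identical common-mode failures in both replicas would escape detection.

#include <stdint.h>
#include <stdio.h>

/* Stand-ins for a datapath module; in a diverse duplex design the two
 * replicas would be physically distinct but functionally identical. */
static uint32_t module_a(uint32_t x) { return x * 3u + 7u; }
static uint32_t module_b(uint32_t x) { return x * 3u + 7u; }

/* Comparison diagnosis: any output discrepancy signals at least one fault.
 * Identical common-mode failures in both replicas remain undetected. */
static int compare_and_detect(uint32_t x, uint32_t *out)
{
    uint32_t a = module_a(x);
    uint32_t b = module_b(x);
    *out = a;
    return a != b;   /* nonzero -> anomaly detected */
}

int main(void)
{
    uint32_t y;
    int mismatch = compare_and_detect(42u, &y);
    printf("output=%u anomaly=%d\n", (unsigned)y, mismatch);
    return 0;
}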

An additional primary advantage of TMR is its very low fault detection latency. A TMR-based system [13, 14] utilizes three instances of a datapath module. The outputs of these three instances become inputs to a majority voter, which in turn provides the main output of the system. In this way, besides AD capability, the system is able to mask faults at its output, provided that distinguishable faults occur within only one of the three modules. However, this incurs an increased area and power requirement to accommodate three replicated datapaths. It will be shown that these overheads can be significantly reduced by either considering some health metric, such as the instantaneous peak signal-to-noise ratio (PSNR) measure obtained within a video encoder circuit, as a precipitating indication of faults, or by periodically checking the logic resources. In contrast, simple masking methods act immediately to attempt to conceal each articulated error and return immediately to an operational state of compliant or degraded performance.
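
For illustration, the voting step of a word-level TMR arrangement can be written as a bitwise 2-of-3 majority function in a few lines of C (a sketch of the voter only; the three replicated datapaths are assumed to exist elsewhere):

#include <stdint.h>
#include <stdio.h>

/* Bitwise 2-of-3 majority vote over the outputs of three replicated
 * datapath modules: any single faulty module is outvoted bit by bit. */
static uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

int main(void)
{
    uint32_t good = 0xDEADBEEFu;
    uint32_t bad  = good ^ 0x00000400u;   /* one module with a flipped bit */
    printf("voted = 0x%08X\n", (unsigned)tmr_vote(good, bad, good));
    return 0;                             /* prints 0xDEADBEEF             */
}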

As shown in Fig. 2, FI occurs after AD identifies inconsistent output(s). Namely, FI applies functional inputs or additional test vectors in order to locate the faulty component(s) present in the resource layer. The process of FI can vary in granularity from a major module down to a component, device, or input signal line. FI may be specific with certainty or within some confidence interval. One potential benefit of identifying faulty component(s) is the ability to prune the recovery space to concentrate on resources which are known to be faulty. This can result in more rapid recovery, thus increasing system availability, which is defined as the proportion of the mission during which the system is operational. Together, these first two phases of AD and FI are often viewed as constituting an error containment strategy.
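
Availability as defined here is simply the fraction of the mission during which the system is operational. The short example below computes it for assumed, purely illustrative figures:

#include <stdio.h>

/* Availability = operational time / total mission time.
 * The figures below are illustrative assumptions only. */
int main(void)
{
    const double mission_hours  = 8760.0;  /* one year of mission time      */
    const double recovery_hours = 3.5;     /* total time spent in AD/FI/FD/FR
                                              across the mission            */
    double availability = (mission_hours - recovery_hours) / mission_hours;
    printf("availability = %.5f (%.3f%% downtime)\n",
           availability, 100.0 * recovery_hours / mission_hours);
    return 0;
}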

The FD phase consists of distinguishing the characteristics of the faulty components which have been isolated. Traditionally, in many fault-tolerant digital circuits, the components are diagnosed by evaluating their behavior under a set of test inputs. This test vector strategy can isolate faults while requiring only a small area overhead. However, the cost of evaluating an extensive number of test vectors to diagnose the functional blocks increases exponentially with the number of components and their input domains. The active dynamic redundancy approach presented in Chapter 4 combines the benefits of redundancy with a negligible computational overhead. On the other hand, static redundancy techniques reserve dedicated spare resources for fault handling.
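
A basic form of the test vector strategy is to drive the isolated component with a small set of stimulus/expected-response pairs and record which vectors fail. The sketch below is a hypothetical harness around a stand-in unit with an injected stuck-at fault; it is not the diagnosis procedure from the book.

#include <stdint.h>
#include <stdio.h>

/* Stand-in for the isolated component under diagnosis: a unit that should
 * compute x + 1 but has a stuck-at-1 fault on bit 2 of its output. */
static uint32_t unit_under_test(uint32_t x)
{
    return (x + 1u) | 0x4u;   /* injected stuck-at-1 on bit 2 */
}

int main(void)
{
    /* A small, illustrative test-vector set: {stimulus, expected}. */
    const uint32_t vectors[][2] = {
        { 0u, 1u }, { 3u, 4u }, { 7u, 8u }, { 10u, 11u }
    };
    int failures = 0;

    for (unsigned i = 0; i < sizeof vectors / sizeof vectors[0]; i++) {
        uint32_t got = unit_under_test(vectors[i][0]);
        if (got != vectors[i][1]) {
            printf("vector %u failed: in=%u expected=%u got=%u\n", i,
                   (unsigned)vectors[i][0], (unsigned)vectors[i][1],
                   (unsigned)got);
            failures++;
        }
    }
    printf("%d of %zu vectors failed\n",
           failures, sizeof vectors / sizeof vectors[0]);
    return 0;
}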

While reconfiguration and redundancy are fundamental components of an FR process, both the reconfiguration scheduling policy and the granularity of recovery affect availability during the recovery phase and the quality of recovery after fault handling. In this case, it is possible to exploit the FR algorithm’s properties so that the reconfiguration strategy is constructed while taking into account the varying priority levels associated with required functions.
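
As one illustration of a priority-aware reconfiguration schedule (a hypothetical sketch, not the algorithm referred to here): faulty modules can be queued for repair in order of the priority of the functions they support, so that the most critical functionality is restored first.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical priority-aware fault recovery scheduling: repair the
 * modules backing the highest-priority functions first. */
typedef struct {
    const char *function;
    int         priority;   /* higher value = more critical */
} repair_job_t;

static int by_priority_desc(const void *a, const void *b)
{
    const repair_job_t *ja = a, *jb = b;
    return jb->priority - ja->priority;
}

int main(void)
{
    repair_job_t queue[] = {
        { "telemetry logging",   1 },
        { "attitude control",    5 },
        { "payload compression", 3 },
    };
    size_t n = sizeof queue / sizeof queue[0];

    qsort(queue, n, sizeof queue[0], by_priority_desc);
    for (size_t i = 0; i < n; i++)
        printf("reconfigure for '%s' (priority %d)\n",
               queue[i].function, queue[i].priority);
    return 0;
}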

A system can be considered to be fault tolerant if it can continue some useful operation in the presence of failures, perhaps in a degraded mode with partially restored functionality [15]. Reliability and availability are desirable qualities of a system, which are measured in terms of service continuity and operational availability in the presence of adverse events, respectively [16]. In recent FPGA-based designs, reliability has been attained by employing the reconfigurable modules in the fault-handling flow, whereas availability is maintained by minimal interruption of the main throughput datapath. These are all considered to constitute fault-handling procedures, as depicted in Fig. 2.

The next installment from this chapter discusses issues related to measuring resilience.

Reprinted with permission from Elsevier/Morgan Kaufmann, Copyright © 2016
