Reliable and power-aware architectures: Sustaining system resiliency

Augusto Vega. Pradip Bose, and Alper Buyuktosunoglu

September 12, 2017

Augusto Vega. Pradip Bose, and Alper BuyuktosunogluSeptember 12, 2017

2.1 SUSTAINING QUALITY OF SERVICE IN THE PRESENCE OF FAULTS, ERRORS, AND FAILURES

To advance beyond static redundancy in the nanoscale era, it is essential to consider innovative resilient techniques which distinguish between faults, errors, and failures in order to handle them in innovative ways. Fig. 1 depicts each of these terms using a layered model of system dependability.

click for larger image

FIG. 1. Layered model of system dependability.

The resource layer consists of all of the physical components that underlie all of the computational processes used by an (embedded) application. These physical components span a range of granularities including logic gates, field-programmable gate array (FPGA) look-up tables, circuit functional units, processor cores, and memory chips. Each physical component is considered to be viable during the current computation if it operates without exhibiting defective behavior at the time that it is utilized. On the other hand, components which exhibit defective behavior are considered to be faulty, either initially or else may become faulty at any time during the mission. Initially faulty resources are a direct result of a priori conditions of manufacturing imperfections such as contaminants or random effects creating process variation beyond allowed design tolerances [8]. As depicted by the cumulative arc in Fig. 1, during the mission each component may transition from viable status to faulty status for highly scaled devices. This transition may occur due to cumulative effects of deep submicron devices such as time-dependent dielectric breakdown (TDDB) due to electrical field weakening of the gate oxide layer, total ionizing dose (TID) of cosmic radiation, electromigration within interconnect, and other progressive degradations over the mission lifetime. Meanwhile, transient effects such as incident alpha particles which ionize critical amounts of charge, ground bounce, and dynamic temperature variations may cause either long lasting or intermittent reversible transitions between viable and faulty status. In this sense, faults may lie dormant whereby the physical resource is defective, yet currently unused. Later in the computations, dormant faults become active when such components are utilized.

The behavioral layer shown in Fig. 1 depicts the outcome of utilizing viable and faulty physical components. Viable components result in correct behavior during the interval of observation. Meanwhile, utilization of faulty components manifests errors in the behavior according to the input/output and/or timing requirements which define the constituent computation. Still, an error which occurs but does not incur any impact to the result of the computation is termed a silent error. Silent errors, such as a flipped bit due to a faulty memory cell at an address which is not referenced by the application, remain isolated at the behavioral layer without propagating to the application. On the other hand, errors which are articulated propagate up to the application layer.

The application layer shown in Fig. 1 depicts that correct behaviors contribute to sustenance of compliant operation. Systems that are compliant throughout the mission at the application layer are deemed to be reliable. To remain completely compliant, all articulated errors must be concealed from the application to remain within the behavioral layer. For example, error masking techniques which employ voting schemes achieve reliability objectives by insulating articulated errors from the application. Articulated errors which reach the application cause the system to have degraded performance if the impact of the error can be tolerated. On the other hand, articulated errors which result in unacceptable conditions to the application incur a failure condition. Failures may be catastrophic, but more often are recoverable—e.g., using some of the techniques discussed in Chapter 4. In general, resilience techniques that can provide a continuum in QoS (spanning from completing meeting requirements down to inadequate performance from the application perspective) are very desirable. This mapping of the QoS continuum to application states of compliant, degraded, and failure is depicted near the top of Fig. 1.

< Previous
Page 2 of 3
Next >

Loading comments...