A Survey of fault models and fault tolerance methods for 2D bus-based multi-core systems and TSV based 3D NOC many-core systems

by Shashikiran Venkatesha, et al.

Reliability has taken centre stage in the development of high-performance computing processors. Recent times have seen a surge of interest in formulating fault and failure models, understanding failure mechanisms, and devising fault mitigation methods to improve system reliability. The article presents the underlying concepts one after the other to give a better understanding of the damage caused by radiation, the relevant fault models, and the effects of faults. We examine state-of-the-art fault mitigation techniques at the logical layer for digital CMOS-based designs and SRAM-based FPGAs. The CMOS SRAM structure is the same in both digital CMOS and FPGA designs. An understanding of resilient SRAM-based FPGAs is necessary for developing resilient prototypes, and it facilitates faster integration of digital CMOS designs. At the micro-architectural and architectural layers, error detection and recovery methods are discussed for bus-based multi-core systems. The through-silicon via (TSV) based 3D Network on Chip (NOC) is a prospective solution for integrating many cores on a single die, and a suitable interconnection approach for petascale computing on many-core systems. The article presents an elaborate discussion of fault models, failure mechanisms, resilient 3D routers, and defect tolerance methods for TSV-based 3D NOC many-core systems. Core redundancy, self-diagnosis, and distributed diagnosis at the hardware level are examined for many-core systems. Overall, the article presents a gamut of fault tolerance solutions, from the logic level to the processor core level, in multi-core and many-core scenarios.


