A Survey of Fault Mitigation Techniques for Multi-core Architectures

by Shashikiran Venkatesha, et al.

Fault tolerance in multi-core architectures has attracted the attention of the research community for the past 20 years. Rapid improvements in CMOS technology have resulted in exponential growth in transistor density, and the challenges of designing resilient multi-core architectures have grown at the same pace. This article surveys fault-tolerant methods for multi-core architectures, including fault detection, recovery, reconfigurability, and repair techniques. Salvaging at the micro-architectural and architectural levels is also discussed. The gamut of fault-tolerant approaches discussed in this article yields tangible improvements in the reliability of multi-core architectures. Every concept in the seminal articles is examined with respect to relevant metrics such as performance cost, area overhead, fault coverage, level of protection, detection latency, and Mean Time To Failure. The existing literature is critically examined. New research directions are presented in the form of new fault-tolerant design alternatives for both homogeneous and heterogeneous multi-core architectures. A brief analytical approach to fault-tolerance modeling is also suggested for modern Intel- and AMD-based homogeneous multi-core architectures, to help readers understand these architectures with respect to performance degradation, memory access time, and execution time.


