Dependability Analysis of Data Storage Systems in Presence of Soft Errors
In recent years, high availability and reliability of Data Storage Systems (DSS) have been significantly threatened by soft errors occurring in storage controllers. Due to their specific functionality and hardware-software stack, error propagation and manifestation in DSS is quite different from general-purpose computing architectures. To our knowledge, no previous study has examined the system-level effects of soft errors on the availability and reliability of data storage systems. In this paper, we first analyze the effects of soft errors occurring in the server processors of storage controllers on the entire storage system dependability. To this end, we implemented the major functions of a typical data storage system controller, running on a full stack of storage system operating system, and developed a framework to perform fault injection experiments using a full system simulator. We then propose a new metric, Storage System Vulnerability Factor (SSVF), to accurately capture the impact of soft errors in storage systems. By conducting extensive experiments, it is revealed that depending on the controller configuration, up to 40 unrecoverable soft errors in this part will result in Data Loss (DL) in an irreversible manner. However, soft errors in the rest of cache memory filled by Operating System (OS) and storage applications will result in Data Unavailability (DU) at the storage system level. Our analysis also shows that Detectable Unrecoverable Errors (DUEs) on the cache data field are the major cause of DU in storage systems, while Silent Data Corruptions (SDCs) in the cache tag and data field are mainly the cause of DL in storage systems.
READ FULL TEXT