Multi-Layer Fault Tolerance Techniques for High Reliability and Performance: Devices, Systems and Data Centers Open Access
In cloud computing data centers, failures may propagate quickly and widely, affecting many physical machines and users. Soft errors in particular are a major source of failures in computer systems. They may occur in various data center components, such as the CPU, main memory, and flash storage, causing system failures, virtual machine (VM) failures, abnormal application aborts, or even silent data corruption (SDC).

Fault tolerance techniques have been proposed at various levels in cloud computing data centers. However, they have limitations in handling soft errors. At the device level, the increasing error rates in flash storage require stronger error-correction codes (ECC) to handle multi-bit errors, resulting in additional overhead in performance, energy, and area. At the system level, while VM checkpointing is effective at protecting selected VMs, the virtualization infrastructure that provides this functionality is itself not well protected from soft errors, which may still cause failures in the virtualization infrastructure that affect all VMs within it. At the data center level, dynamic voltage/frequency scaling (DVFS) may lower the CPU voltage to reduce CPU power consumption. However, the reduced voltage increases soft error rates in the CPU and potentially causes more soft-error-induced failures. If DVFS is not used properly, it may increase the overall operational cost.

To improve the reliability of cloud computing systems against soft errors, we design techniques at the device, virtualization system, and data center levels to address these limitations. At the device level, we propose a new flash storage architecture, SoftFlash, which aims to reduce the need for strong ECC by leveraging the inherent error tolerance of application data.
Our results show that for many data-centric applications, the proposed SoftFlash system can achieve acceptable results (or better in certain cases), with a 40% performance improvement and a third of the energy consumption.

At the data center level, we propose a data center management framework, DUAL, which consists of new virtual machine power and reliability analysis tools. The framework is designed to balance the dual needs of a data center: reducing energy consumption and providing high reliability. The evaluations show that DUAL can help maintain the desired reliability and reduce power consumption, which in turn lowers the overall operational cost of a data center.

At the virtualization system level, we conduct an in-depth analysis of the reliability risks of the virtualization infrastructure. Based on this analysis, we design Xentry, which focuses on limiting error propagation within and from the hypervisor. The experimental results show that Xentry incurs very small performance overhead and detects over 99% of the injected faults. To further improve the reliability of the hypervisor, we design and implement redundant hypervisor execution, DualVisor, to provide both detection and recovery capability. We evaluate various design parameters and selectively replicate hypervisor executions. DualVisor covers 87% of the total number of hypervisor executions with less than 6% overhead (with 2 to 4 VMs).
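
The general idea behind redundant execution can be illustrated with a small sketch. This is not DualVisor's implementation, which the abstract does not describe; it is a generic example, written under the assumption that a handler can be re-run deterministically on a private copy of its input state. The names `run_redundantly`, `handle_request`, and `flip_bit` are hypothetical and chosen only for illustration. The handler runs twice, a mismatch between the two results signals a fault (detection), and a third run breaks the tie (recovery).

```python
import copy
from typing import Callable, Dict, Optional, Tuple

State = Dict[str, int]
Handler = Callable[[State], State]


def run_redundantly(
    handler: Handler,
    state: State,
    fault: Optional[Callable[[State], None]] = None,
) -> Tuple[State, bool]:
    """Run `handler` twice on private copies of `state` and compare results.

    Returns (committed_state, mismatch_detected). The optional `fault`
    callback is a hypothetical hook that corrupts the first result in
    order to simulate a soft error.
    """
    primary = handler(copy.deepcopy(state))
    if fault is not None:
        fault(primary)  # simulated soft error in the primary result
    replica = handler(copy.deepcopy(state))

    if primary == replica:
        return primary, False  # results agree: commit without extra work

    # Detection: the two executions disagree.
    # Recovery: a third execution decides which result to commit.
    tiebreak = handler(copy.deepcopy(state))
    if tiebreak == replica:
        return replica, True
    if tiebreak == primary:
        return primary, True
    raise RuntimeError("no two executions agree; cannot recover")


def handle_request(state: State) -> State:
    """Toy deterministic handler: increments a counter."""
    state["counter"] += 1
    return state


def flip_bit(state: State) -> None:
    """Simulated soft error: flips bit 3 of the counter."""
    state["counter"] ^= 1 << 3


if __name__ == "__main__":
    initial = {"counter": 5}
    committed, detected = run_redundantly(handle_request, initial, fault=flip_bit)
    print(committed, detected)
```

Replicating every execution in this way at least doubles the work, which is why replicating only selected executions, as the abstract describes, trades some coverage for lower overhead.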