Lightning: Striking the Secure Isolation on GPU Clouds with Transient Hardware Faults
GPU clouds have become a popular computing platform because of the cost of owning and maintaining high-performance computing clusters. Many cloud architectures have also been proposed to ensure a secure execution environment for guest applications by enforcing strong security policies to isolate the untrusted hypervisor from the guest virtual machines (VMs). In this paper, we study the impact of GPU chip's hardware faults on the security of cloud "trusted" execution environment using Deep Neural Network (DNN) as the underlying application. We show that transient hardware faults of GPUs can be generated by exploiting the Dynamic Voltage and Frequency Scaling (DVFS) technology, and these faults may cause computation errors, but they have limited impact on the inference accuracy of DNN due to the robustness and fault-tolerant nature of well-developed DNN models. To take full advantage of these transient hardware faults, we propose the Lightning attack to locate the fault injection targets of DNNs and to control the fault injection precision in terms of timing and position. We conduct experiments on three commodity GPUs to attack four widely-used DNNs. Experimental results show that the proposed attack can reduce the inference accuracy of the models by as high as 78.3% and 64.5% on average. More importantly, 67.9% of the targeted attacks have successfully misled the models to give our desired incorrect inference result. This demonstrates that the secure isolation on GPU clouds is vulnerable against transient hardware faults and the computation results may not be trusted.
READ FULL TEXT