Kavli Affiliate: Ke Wang
| First 5 Authors: Youssef A. Ait Alama, Sampada Sakpal, Ke Wang, Razvan Bunescu, Avinash Karanth
| Summary:
Hardware failures are a growing challenge for machine learning accelerators,
many of which are based on systolic arrays. When a permanent hardware failure
occurs in a systolic array, existing solutions include localizing and isolating
the faulty processing element (PE), using a redundant PE for re-execution, or
in some extreme cases decommissioning the entire accelerator for further
investigation. In this paper, we propose novel algorithmic approaches that
mitigate permanent hardware faults in neural network (NN) accelerators by
uniquely integrating the behavior of the faulty component instead of bypassing
it. In doing so, we aim for a more sustainable use of the accelerator where
faulty hardware is neither bypassed nor discarded, instead being given a second
life. We first introduce a CUDA-accelerated systolic array simulator in
PyTorch, which enabled us to quantify the impact of permanent faults appearing
on links connecting two PEs or in weight registers, where one bit is stuck at 0
or 1 in the float32, float16, or bfloat16 representation. We then propose
several algorithmic mitigation techniques for a subset of stuck-at faults, such
as Invertible Scaling or Shifting of activations and weights, or fine-tuning
with the faulty behavior. Notably, the proposed techniques do not require any
hardware modification, instead relying on existing components of widely used
systolic-array-based accelerators, such as normalization, activation, and
storage units. Extensive experimental evaluations using fully connected and
convolutional NNs trained on MNIST, CIFAR-10 and ImageNet show that the
proposed fault-tolerant approach matches or gets very close to the original
fault-free accuracy.
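
To make the fault model concrete, the following minimal sketch emulates a single stuck-at bit in a float32 weight register using only the Python standard library. It is not the authors' PyTorch simulator: the helper names (`stuck_at`, `faulty_multiply`, `scaled_multiply`) are hypothetical, and the power-of-two rescaling shown is only one plausible reading of the "Invertible Scaling" idea, assumed here for illustration rather than taken from the paper.

```python
import struct


def float_to_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 pattern of x after rounding to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def stuck_at(x: float, bit: int, value: int) -> float:
    """Emulate a register whose `bit` (0 = LSB, 31 = sign) is stuck at `value`."""
    bits = float_to_bits(x)
    mask = 1 << bit
    bits = (bits | mask) if value else (bits & ~mask)
    return bits_to_float(bits)


def faulty_multiply(w: float, a: float, bit: int, value: int) -> float:
    """Multiply activation `a` by weight `w` read back from the faulty register."""
    return stuck_at(w, bit, value) * a


def scaled_multiply(w: float, a: float, bit: int, value: int, scale: float) -> float:
    """Hypothetical mitigation: store w * scale so the stuck bit already holds
    its forced value, then undo the scaling on the product."""
    return faulty_multiply(w * scale, a, bit, value) / scale


if __name__ == "__main__":
    # Bit 30 is the most significant exponent bit in float32.
    print(faulty_multiply(3.0, 2.0, bit=30, value=0))              # corrupted product
    print(scaled_multiply(3.0, 2.0, bit=30, value=0, scale=0.25))  # recovered product
```

In this sketch a weight of 3.0 collapses to a subnormal value when the top exponent bit is stuck at 0, whereas storing 0.75 (the weight scaled by a power of two) leaves the stuck bit at the value it would have had anyway, so dividing the product by the same factor recovers the intended result. Choosing the scale per layer or per PE, and where the inverse scaling is applied, are design decisions this sketch does not address.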
| Search Query: ArXiv Query: search_query=au:"Ke Wang"&id_list=&start=0&max_results=3