On the Faithfulness Measurements for Model Interpretations
Recent years have witnessed the emergence of a variety of post-hoc interpretations that aim to uncover how natural language processing (NLP) models make predictions. Despite the surge of new interpretations, it remains an open problem how to define and quantitatively measure the faithfulness of interpretations, i.e., to what extent they conform to the reasoning process behind the model. To tackle these issues, we start with three criteria: the removal-based criterion, the sensitivity of interpretations, and the stability of interpretations, that quantify different notions of faithfulness, and propose novel paradigms to systematically evaluate interpretations in NLP. Our results show that the performance of interpretations under different criteria of faithfulness could vary substantially. Motivated by the desideratum of these faithfulness notions, we introduce a new class of interpretation methods that adopt techniques from the adversarial robustness domain. Empirical results show that our proposed methods achieve top performance under all three criteria. Along with experiments and analysis on both the text classification and the dependency parsing tasks, we come to a more comprehensive understanding of the diverse set of interpretations.
READ FULL TEXT