Analyzing the Interpretability Robustness of Self-Explaining Models

by Haizhong Zheng, et al.

Recently, interpretable models called self-explaining models (SEMs) have been proposed with the goal of providing interpretability robustness. We evaluate the interpretability robustness of SEMs and show that the explanations they currently provide are not robust to adversarial inputs. Specifically, we successfully craft adversarial inputs that leave the model's output unchanged but cause significant changes in its explanations. We find that although current SEMs use stable coefficients to map interpretable basis concepts to output labels, they do not account for the robustness of the first stage of the model, which generates those concepts from the input, leading to non-robust explanations. Our work makes the case for future research on generating interpretable basis concepts in a robust way.
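The attack described above can be illustrated with a minimal sketch. This is not the authors' code: it assumes a toy two-stage SEM in which a concept encoder produces basis concepts and fixed coefficients map concepts to class scores, and it uses a simple dependency-free random search (the paper's setting suggests a gradient-based attack) to find a perturbation that preserves the predicted label while shifting the concept activations, i.e., the explanation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy self-explaining model (assumption for illustration):
#   concepts(x) = ReLU(W x)   -- first stage: interpretable basis concepts
#   scores(x)   = theta @ concepts(x)  -- stable coefficients theta
W = rng.normal(size=(5, 10))      # 10-d input -> 5 concepts
theta = rng.normal(size=(3, 5))   # 5 concepts -> 3 class scores

def concepts(x):
    return np.maximum(W @ x, 0.0)

def predict(x):
    return int(np.argmax(theta @ concepts(x)))

def explanation_shift(x, x_adv):
    # How much the explanation (concept activations) moves.
    return np.linalg.norm(concepts(x_adv) - concepts(x))

def attack(x, eps=0.5, steps=2000):
    # Random search in an L-inf ball: keep the predicted label fixed
    # while maximizing the change in concept activations.
    label = predict(x)
    best, best_shift = x.copy(), 0.0
    for _ in range(steps):
        cand = x + rng.uniform(-eps, eps, size=x.shape)
        if predict(cand) == label:
            s = explanation_shift(x, cand)
            if s > best_shift:
                best, best_shift = cand, s
    return best, best_shift

x = rng.normal(size=10)
x_adv, shift = attack(x)
# The perturbed input gets the same label but a different explanation.
```

This is the core failure mode the paper identifies: because the concept-generating first stage is not itself robust, inputs that are indistinguishable at the output level can receive substantially different explanations.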


