ER-TEST: Evaluating Explanation Regularization Methods for NLP Models

by Brihi Joshi, et al.

Neural language models' (NLMs') reasoning processes are notoriously hard to explain. Recently, there has been much progress in automatically generating machine rationales of NLM behavior, but less in utilizing those rationales to improve NLM behavior. For the latter, explanation regularization (ER) aims to improve NLM generalization by pushing the machine rationales to align with human rationales. Whereas prior works primarily evaluate such ER models via in-distribution (ID) generalization, ER's impact on out-of-distribution (OOD) generalization is largely underexplored. Moreover, little is understood about how ER model performance is affected by the choice of ER criteria or by the number/choice of training instances with human rationales. In light of this, we propose ER-TEST, a protocol for evaluating ER models' OOD generalization along three dimensions: (1) unseen datasets, (2) contrast set tests, and (3) functional tests. Using ER-TEST, we study three key questions: (A) Which ER criteria are most effective for the given OOD setting? (B) How is ER affected by the number/choice of training instances with human rationales? (C) Is ER effective with distantly supervised human rationales? ER-TEST enables comprehensive analysis of these questions by considering a diverse range of tasks and datasets. Through ER-TEST, we show that ER has little impact on ID performance, but can yield large gains on OOD performance w.r.t. (1)-(3). Also, we find that the best ER criterion is task-dependent, while ER can improve OOD performance even with limited and distantly supervised human rationales.
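
To make the idea of explanation regularization concrete, the following is a minimal sketch of an ER-style training objective: a task loss plus a penalty for disagreement between a machine rationale and a human rationale. It is an illustration under stated assumptions, not the paper's implementation. The function names, the `strength` weight, and the choice of mean squared error as the ER criterion are all hypothetical; the abstract notes that the best ER criterion is task-dependent. Instances without a human rationale fall back to the task loss alone, which mirrors the setting where only some training instances are annotated.

```python
import math


def cross_entropy(probs, label):
    """Task loss: negative log-probability assigned to the gold label."""
    return -math.log(probs[label])


def mse_alignment(machine_rationale, human_rationale):
    """Hypothetical ER criterion: mean squared error between per-token
    machine importance scores and human importance scores."""
    if len(machine_rationale) != len(human_rationale):
        raise ValueError("rationales must cover the same tokens")
    squared = [(m - h) ** 2 for m, h in zip(machine_rationale, human_rationale)]
    return sum(squared) / len(squared)


def er_loss(probs, label, machine_rationale, human_rationale=None, strength=1.0):
    """Task loss plus a weighted alignment penalty.

    `strength` is an assumed hyperparameter that trades off task accuracy
    against rationale alignment. When no human rationale is available for
    an instance, only the task loss is returned.
    """
    task = cross_entropy(probs, label)
    if human_rationale is None:
        return task
    return task + strength * mse_alignment(machine_rationale, human_rationale)


if __name__ == "__main__":
    probs = [0.5, 0.5]            # model's predicted class distribution
    machine = [0.5, 0.5]          # machine rationale over two tokens
    human = [1.0, 0.0]            # human marks only the first token
    print(er_loss(probs, 0, machine, human, strength=2.0))
    print(er_loss(probs, 0, machine))  # unannotated instance
```

Swapping `mse_alignment` for another distance (for example, a divergence between normalized score distributions) changes the ER criterion without touching the rest of the objective, which is the kind of variation the questions above are concerned with.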


