It's better to say "I can't answer" than answering incorrectly: Towards Safety critical NLP systems

by   Neeraj Varshney, et al.

In order to make AI systems more reliable and their adoption in safety critical applications possible, it is essential to impart the capability to abstain from answering when their prediction is likely to be incorrect and seek human intervention. Recently proposed "selective answering" techniques model calibration as a binary classification task. We argue that, not all incorrectly answered questions are incorrect to the same extent and the same is true for correctly answered questions. Hence, treating all correct predictions equally and all incorrect predictions equally constraints calibration. In this work, we propose a methodology that incorporates the degree of correctness, shifting away from classification labels as it directly tries to predict the probability of model's prediction being correct. We show the efficacy of the proposed method on existing Natural Language Inference (NLI) datasets by training on SNLI and evaluating on MNLI mismatched and matched datasets. Our approach improves area under the curve (AUC) of risk-coverage plot by 10.22% and 8.06% over maxProb with respect to the maximum possible improvement on MNLI mismatched and matched set respectively. In order to evaluate our method on Out of Distribution (OOD) datasets, we propose a novel setup involving questions with a variety of reasoning skills. Our setup includes a test set for each of the five reasoning skills: numerical, logical, qualitative, abductive and commonsense. We select confidence threshold for each of the approaches where the in-domain accuracy (SNLI) is 99%. Our results show that, the proposed method outperforms existing approaches by abstaining on 2.6% more OOD questions at respective confidence thresholds.


page 1

page 2

page 3

page 4


Investigating the Failure Modes of the AUC metric and Exploring Alternatives for Evaluating Systems in Safety Critical Applications

With the increasing importance of safety requirements associated with th...

Calibrating Structured Output Predictors for Natural Language Processing

We address the problem of calibrating prediction confidence for output e...

Rethinking Confidence Calibration for Failure Prediction

Reliable confidence estimation for the predictions is important in many ...

Post-Abstention: Towards Reliably Re-Attempting the Abstained Instances in QA

Despite remarkable progress made in natural language processing, even th...

On Calibrating Semantic Segmentation Models: Analysis and An Algorithm

We study the problem of semantic segmentation calibration. For image cla...

What Else Do I Need to Know? The Effect of Background Information on Users' Reliance on AI Systems

AI systems have shown impressive performance at answering questions by r...

Calibration of Machine Reading Systems at Scale

In typical machine learning systems, an estimate of the probability of t...

Please sign up or login with your details

Forgot password? Click here to reset