Measuring the Robustness of Natural Language Processing Models to Domain Shifts

by   Nitay Calderon, et al.

Large Language Models have shown promising performance on various tasks, including fine-tuning, few-shot learning, and zero-shot learning. However, their performance on domains without labeled data still lags behind those with labeled data, which we refer as the Domain Robustness (DR) challenge. Existing research on DR suffers from disparate setups, lack of evaluation task variety, and reliance on challenge sets. In this paper, we explore the DR challenge of both fine-tuned and few-shot learning models in natural domain shift settings. We introduce a DR benchmark comprising diverse NLP tasks, including sentence and token-level classification, QA, and generation, each task consists of several domains. We propose two views of the DR challenge: Source Drop (SD) and Target Drop (TD), which alternate between the source and target in-domain performance as reference points. We find that in significant proportions of domain shifts, either SD or TD is positive, but not both, emphasizing the importance of considering both measures as diagnostic tools. Our experimental results demonstrate the persistent existence of the DR challenge in both fine-tuning and few-shot learning models, though it is less pronounced in the latter. We also find that increasing the fine-tuned model size improves performance, particularly in classification.


page 1

page 2

page 3

page 4


Making Pre-trained Language Models Better Few-shot Learners

The recent GPT-3 model (Brown et al., 2020) achieves remarkable few-shot...

On Codex Prompt Engineering for OCL Generation: An Empirical Study

The Object Constraint Language (OCL) is a declarative language that adds...

COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning

We present a new zero-shot dense retrieval (ZeroDR) method, COCO-DR, to ...

Adversarial Robustness of Prompt-based Few-Shot Learning for Natural Language Understanding

State-of-the-art few-shot learning (FSL) methods leverage prompt-based f...

LawngNLI: A Long-Premise Benchmark for In-Domain Generalization from Short to Long Contexts and for Implication-Based Retrieval

Natural language inference has trended toward studying contexts beyond t...

Revisiting Distance Metric Learning for Few-Shot Natural Language Classification

Distance Metric Learning (DML) has attracted much attention in image pro...

Feature Transformation Ensemble Model with Batch Spectral Regularization for Cross-Domain Few-Shot Classification

Deep learning models often require much annotated data to obtain good pe...

Please sign up or login with your details

Forgot password? Click here to reset