An Analysis of Dataset Overlap on Winograd-Style Tasks

11/09/2020
by Ali Emami, et al.

The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR). Model performance on the WSC has quickly progressed from chance-level to near-human using neural language models trained on massive corpora. In this paper, we analyze the effects of varying degrees of overlap between these training corpora and the test instances in WSC-style tasks. We find that a large number of test instances overlap considerably with the corpora on which state-of-the-art models are (pre)trained, and that a significant drop in classification accuracy occurs when we evaluate models on instances with minimal overlap. Based on these results, we develop the KnowRef-60K dataset, which consists of over 60k pronoun disambiguation problems scraped from web data. KnowRef-60K is the largest corpus to date for WSC-style common-sense reasoning and exhibits a significantly lower proportion of overlaps with current pretraining corpora.
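The abstract does not say how overlap between a test instance and a pretraining corpus is measured. As a rough illustration only, the sketch below assumes a simple word n-gram criterion: the fraction of an instance's n-grams that also occur somewhere in the corpus. The function names (`tokenize`, `ngrams`, `build_index`, `overlap_fraction`), the tokenization rule, and the choice of `n = 3` are all assumptions made for this example and are not taken from the paper.

```python
import re
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens (assumed scheme)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens: List[str], n: int) -> Set[NGram]:
    """Return the set of distinct contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that appears in any corpus document."""
    index: Set[NGram] = set()
    for document in corpus:
        index |= ngrams(tokenize(document), n)
    return index


def overlap_fraction(instance: str, index: Set[NGram], n: int) -> float:
    """Fraction of the instance's distinct n-grams that are found in the index.

    Returns 0.0 when the instance is shorter than n tokens.
    """
    grams = ngrams(tokenize(instance), n)
    if not grams:
        return 0.0
    return len(grams & index) / len(grams)


if __name__ == "__main__":
    # Hypothetical one-document "corpus" and a WSC-style test instance.
    corpus = ["The trophy does not fit in the suitcase because it is too big."]
    instance = "The trophy does not fit in the suitcase because it is too small."
    index = build_index(corpus, n=3)
    print(overlap_fraction(instance, index, n=3))
```

Under this assumed measure, instances could be grouped by overlap fraction and accuracy reported per group, which mirrors the comparison described above between high-overlap and minimal-overlap instances. A real pretraining corpus would be far too large for an in-memory set, so a hashed or disk-backed index would be needed in practice.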


