More capable language models increasingly saturate existing task benchma...
Despite agreement on the importance of detecting out-of-distribution (OO...
Recent years have seen numerous NLP datasets introduced to evaluate the ...
Many crowdsourced NLP datasets contain systematic gaps and biases that a...
A growing body of work shows that models exploit annotation artifacts to...
Performance on the Winograd Schema Challenge (WSC), a respected English ...