Large Language Models Struggle to Learn Long-Tail Knowledge

by Nikhil Kandpal, et al.

The internet contains a wealth of knowledge, from the birthdays of historical figures to tutorials on how to code, all of which may be learned by language models. However, the number of times a given piece of information appears on the web varies enormously. In this paper, we study the relationship between the knowledge memorized by large language models and the information in their pre-training datasets. In particular, we show that a language model's ability to answer a fact-based question relates to how many documents associated with that question were seen during pre-training. We identify these relevant documents by entity linking pre-training datasets and counting documents that contain the same entities as a given question-answer pair. Our results demonstrate strong correlational and causal relationships between accuracy and relevant document count for numerous question answering datasets (e.g., TriviaQA), pre-training corpora (e.g., ROOTS), and model sizes (e.g., 176B parameters). Moreover, although larger models are better at learning long-tail knowledge, we estimate that today's models must be scaled by many orders of magnitude to reach competitive QA performance on questions with little support in the pre-training data. Finally, we show that retrieval augmentation can reduce the dependence on relevant document count, presenting a promising approach for capturing the long tail.
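
The relevant-document count described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes entity linking has already been run, so that each pre-training document and each question-answer pair is represented as a set of entity IDs. The function name `count_relevant_documents`, the entity IDs, and the toy corpus are all hypothetical. Requiring a question entity and an answer entity to co-occur in a document is one reading of "contain the same entities", which the abstract does not spell out.

```python
from typing import Iterable, Set


def count_relevant_documents(
    question_entities: Set[str],
    answer_entities: Set[str],
    corpus_entities: Iterable[Set[str]],
) -> int:
    """Count documents that mention at least one question entity
    and at least one answer entity (assumed co-occurrence criterion)."""
    count = 0
    for doc_entities in corpus_entities:
        has_question_entity = bool(doc_entities & question_entities)
        has_answer_entity = bool(doc_entities & answer_entities)
        if has_question_entity and has_answer_entity:
            count += 1
    return count


# Toy corpus: each document is reduced to its (hypothetical) linked entity IDs.
corpus = [
    {"Dante_Alighieri", "Florence", "Divine_Comedy"},
    {"Dante_Alighieri", "Italy"},
    {"Florence", "Renaissance"},
    {"Dante_Alighieri", "Florence"},
]

# Question: "In what city was Dante born?"  Answer: "Florence"
n = count_relevant_documents({"Dante_Alighieri"}, {"Florence"}, corpus)
print(n)
```

Under this criterion only the first and last toy documents count, since they mention both the question entity and the answer entity. Grouping questions by this count and comparing QA accuracy across groups is the kind of analysis the abstract describes.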




