LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond

by Philippe Laban, et al.

With the recent appearance of large language models (LLMs) in practical settings, methods that can effectively detect factual inconsistencies are crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few LLMs perform competitively on classification benchmarks for factual inconsistency detection compared to traditional non-LLM methods. However, a closer analysis reveals that most LLMs fail on more complex formulations of the task, and it exposes issues with existing evaluation benchmarks that affect evaluation precision. To address this, we propose a new protocol for creating inconsistency detection benchmarks and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, with estimated inter-annotator agreement of about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting the gaps in LLMs' ability to reason about facts and detect inconsistencies when they occur.
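The abstract does not say which agreement statistic the ~0.9 figure uses; as an illustration only, the sketch below assumes Cohen's kappa, a common chance-corrected measure for two annotators labeling the same items (here, hypothetical binary consistent/inconsistent judgments).

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement between two annotators,
    corrected for the agreement expected by chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Fraction of items where both annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (observed - expected) / (1 - expected)

# Hypothetical consistency judgments (1 = consistent, 0 = inconsistent).
annotator_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
annotator_b = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
print(round(cohens_kappa(annotator_a, annotator_b), 2))  # → 0.78
```

A kappa near 0.9, as reported for SummEdits, would indicate near-perfect agreement; values computed this way range from 1 (perfect) down through 0 (chance-level).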
