Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia

05/10/2021
by KayYen Wong, et al.

Wikipedia is the largest online encyclopedia, used by algorithms and web users as a central hub of reliable information on the web. The quality and reliability of Wikipedia content are maintained by a community of volunteer editors. Machine learning and information retrieval algorithms could help scale up editors' manual efforts around Wikipedia content reliability. However, there is a lack of large-scale data to support the development of such research. To fill this gap, in this paper we propose Wiki-Reliability, the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues. To build this dataset, we rely on Wikipedia "templates". Templates are tags used by expert Wikipedia editors to indicate content issues, such as the presence of a "non-neutral point of view" or "contradictory articles", and they serve as a strong signal for detecting reliability issues in a revision. We select the 10 most popular reliability-related templates on Wikipedia and propose an effective method to label almost 1M samples of Wikipedia article revisions as positive or negative with respect to each template. Each positive/negative example in the dataset comes with the full article text and 20 features from the revision's metadata. We provide an overview of the possible downstream tasks enabled by such data, and show that Wiki-Reliability can be used to train large-scale models for content reliability prediction. We release all data and code for public use.
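The abstract describes deriving labels from the addition and removal of maintenance templates in an article's revision history. Below is a minimal sketch of one plausible way to do this, not necessarily the paper's exact labeling procedure: the template name "POV", the helper functions `has_template` and `label_revisions`, and the toy revision IDs are all illustrative assumptions.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical template for illustration; the paper uses the 10 most popular
# reliability-related maintenance templates on English Wikipedia.
TEMPLATE = "POV"

def has_template(wikitext: str, name: str) -> bool:
    """Return True if a maintenance tag like {{POV}} or {{POV|...}} appears in the wikitext."""
    pattern = r"\{\{\s*" + re.escape(name) + r"\s*[|}]"
    return re.search(pattern, wikitext, flags=re.IGNORECASE) is not None

def label_revisions(
    revisions: Iterable[Tuple[int, str]], name: str = TEMPLATE
) -> List[Tuple[int, int]]:
    """Scan an article's revisions in chronological order and emit (revision_id, label)
    pairs: 1 when the template is added (the content likely exhibits the issue),
    0 when the template is later removed (the issue was presumably addressed)."""
    labels = []
    prev_present = False
    for rev_id, wikitext in revisions:
        present = has_template(wikitext, name)
        if present and not prev_present:
            labels.append((rev_id, 1))   # template added -> positive example
        elif not present and prev_present:
            labels.append((rev_id, 0))   # template removed -> negative example
        prev_present = present
    return labels

# Toy revision history (illustrative only):
history = [
    (101, "Some article text."),
    (102, "{{POV}} Some article text."),         # editor flags a neutrality issue
    (103, "Some article text, now rewritten."),  # flag removed after cleanup
]
print(label_revisions(history))  # [(102, 1), (103, 0)]
```

In the released dataset, each such positive/negative revision is additionally paired with the full article text and 20 metadata features, so a labeling pass like the one sketched above would be only the first step of the pipeline.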
