WikiLinkGraphs: A complete, longitudinal and multi-language dataset of the Wikipedia link networks

02/12/2019
by   Cristian Consonni, et al.

Wikipedia articles contain multiple links connecting a subject to other pages of the encyclopedia. In Wikipedia parlance, these links are called internal links or wikilinks. We present a complete dataset of the network of internal Wikipedia links for the 9 largest language editions. The dataset contains yearly snapshots of the network and spans 17 years, from the creation of Wikipedia in 2001 to March 1st, 2018. While previous work has mostly focused on the complete hyperlink graph, which also includes links automatically generated by templates, we parsed each revision of each article to track links appearing in the main text. In this way we obtained a cleaner network, discarding more than half of the links and representing all and only the links intentionally added by editors. We describe in detail how the Wikipedia dumps were processed and the challenges we encountered, including the need to handle special pages such as redirects, i.e., alternative article titles. We present descriptive statistics of several snapshots of this network. Finally, we propose several research opportunities that can be explored using this new dataset.
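To make the idea of tracking only main-text links concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the function names, the regular expressions, and the `redirects` mapping are all assumptions made for illustration, and real wikitext has many more edge cases (comments, `<nowiki>` blocks, namespaces, tables) than this handles. The sketch drops template blocks, collects the targets of the remaining `[[...]]` links, and follows redirects to their final title.

```python
from __future__ import annotations

import re

# Hypothetical patterns, for illustration only.
# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
WIKILINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
# Matches an innermost {{...}} block (one with no braces inside it).
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")


def strip_templates(wikitext: str) -> str:
    """Remove {{...}} blocks, innermost first, so nested templates vanish too."""
    previous = None
    while previous != wikitext:
        previous = wikitext
        wikitext = TEMPLATE_RE.sub("", wikitext)
    return wikitext


def normalize_title(title: str) -> str:
    """Turn underscores into spaces, trim, and capitalize the first letter."""
    title = title.replace("_", " ").strip()
    return title[:1].upper() + title[1:]


def extract_wikilinks(wikitext: str) -> list[str]:
    """Return the targets of links that appear outside of templates."""
    body = strip_templates(wikitext)
    return [normalize_title(m.group(1)) for m in WIKILINK_RE.finditer(body)]


def resolve_redirects(targets: list[str], redirects: dict[str, str]) -> list[str]:
    """Follow redirect chains to their end, stopping if a cycle is detected."""
    resolved = []
    for title in targets:
        seen = set()
        while title in redirects and title not in seen:
            seen.add(title)
            title = redirects[title]
        resolved.append(title)
    return resolved
```

For example, given a revision whose text links to `[[Italy]]` in a sentence and to `[[Lazio]]` only inside an infobox template, `extract_wikilinks` keeps the first and discards the second. Passing the result through `resolve_redirects` with a mapping such as `{"Eternal City": "Rome"}` replaces an alternative title with the article it points to.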


