Building a Massive Corpus for Named Entity Recognition using Free Open Data Sources

08/13/2019
by Daniel Specht Menezes, et al.

With recent progress in machine learning, driven by techniques such as deep learning, many tasks can be solved successfully once a large enough training dataset is available. Nonetheless, human-annotated datasets are often expensive to produce, especially when labels are fine-grained, as is the case for Named Entity Recognition (NER), a task that operates with labels at the word level. In this paper, we propose a method to automatically generate labeled datasets for NER from public data sources by exploiting links and structured data from DBpedia and Wikipedia. Due to the massive size of these data sources, the resulting dataset, SESAME (available at https://sesame-pt.github.io), is composed of millions of labeled sentences. We detail the method to generate the dataset, report relevant statistics, and design a baseline using a neural network, showing that our dataset helps build better NER predictors.
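To make the idea of exploiting links for word-level labels concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the `ENTITY_CLASS` table, the `label_sentence` function, the BIO tagging scheme, and the whitespace tokenization are all assumptions made for illustration. The table stands in for type information that could be looked up in DBpedia, and the input is a sentence in wikitext form, where links appear as `[[Target]]` or `[[Target|anchor]]`.

```python
import re

# Hypothetical lookup from Wikipedia page title to a coarse entity class,
# standing in for type information that could be read from DBpedia.
ENTITY_CLASS = {
    "Rio de Janeiro": "LOC",
    "Petrobras": "ORG",
}

# Matches [[Target]] and [[Target|anchor text]] wiki links.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def label_sentence(wikitext, entity_class=ENTITY_CLASS):
    """Return (token, label) pairs using BIO tags (an assumed scheme)."""
    pairs = []
    pos = 0
    for match in WIKILINK.finditer(wikitext):
        # Text before the link carries no entity label.
        pairs.extend((tok, "O") for tok in wikitext[pos:match.start()].split())
        target = match.group(1).strip()
        anchor = (match.group(2) or match.group(1)).strip()
        cls = entity_class.get(target)
        for i, tok in enumerate(anchor.split()):
            if cls is None:
                # Link target with no known class: treat as a non-entity.
                pairs.append((tok, "O"))
            else:
                pairs.append((tok, ("B-" if i == 0 else "I-") + cls))
        pos = match.end()
    # Remaining text after the last link.
    pairs.extend((tok, "O") for tok in wikitext[pos:].split())
    return pairs
```

Calling `label_sentence("A [[Petrobras]] tem sede no [[Rio de Janeiro|Rio]] .")` labels `Petrobras` as `B-ORG` and `Rio` as `B-LOC`, with every other token tagged `O`. Whitespace splitting keeps punctuation attached to neighboring words unless the input is pre-tokenized, so a real pipeline would need a proper tokenizer and a way to handle entity mentions that are not linked.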


