Using Titles vs. Full-text as Source for Automated Semantic Document Annotation

by Lukas Galke, et al.
Christian-Albrechts-Universität zu Kiel

A significant part of the largest Knowledge Graph today, the Linked Open Data cloud, consists of metadata about documents such as publications, news reports, and other media articles. While widespread access to document metadata is a tremendous advancement, it is not yet easy to assign semantic annotations and organize the documents along semantic concepts. Providing semantic annotations such as concepts from SKOS thesauri is a classical research topic, but it is typically conducted on the full-text of the documents. For the first time, we offer a systematic comparison of classification approaches to investigate how far semantic annotations can be conducted using just the metadata of the documents, such as titles published as labels on the Linked Open Data cloud. We compare the classifications obtained from analyzing the documents' titles with semantic annotations obtained from analyzing the full-text. Apart from the prominent text classification baselines kNN and SVM, we also compare recent techniques of Learning to Rank and neural networks and revisit the traditional methods logistic regression, Rocchio, and Naive Bayes. The results show that across three of our four datasets, the performance of the classifications using only titles reaches over 90% of the classification performance when using the full-text. Thus, conducting document classification by just using the titles is a reasonable approach for automated semantic annotation and opens up new possibilities for enriching Knowledge Graphs.
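To make the "share of full-text performance" comparison concrete, the following is a minimal sketch of how such a ratio could be computed for multi-label annotations. It assumes a sample-averaged F1 score as the quality measure, which the abstract does not specify. The function names and the toy label sets are hypothetical and purely illustrative; they are not results from the paper.

```python
def sample_f1(gold, pred):
    """F1 between the gold and predicted label sets of one document."""
    gold, pred = set(gold), set(pred)
    if not gold and not pred:
        return 1.0  # nothing to annotate and nothing predicted
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_sample_f1(gold_sets, pred_sets):
    """Average the per-document F1 over all documents."""
    scores = [sample_f1(g, p) for g, p in zip(gold_sets, pred_sets)]
    return sum(scores) / len(scores)


def relative_performance(title_score, fulltext_score):
    """Title-based score expressed as a fraction of the full-text score."""
    return title_score / fulltext_score


# Toy, made-up annotations for three documents (assumption, not paper data).
gold = [{"economics", "trade"}, {"health"}, {"energy", "policy"}]
fulltext_pred = [{"economics", "trade"}, {"health"}, {"energy"}]
title_pred = [{"economics"}, {"health"}, {"energy"}]

fulltext_f1 = mean_sample_f1(gold, fulltext_pred)
title_f1 = mean_sample_f1(gold, title_pred)
ratio = relative_performance(title_f1, fulltext_f1)
print(f"title-based F1 is {ratio:.1%} of full-text F1")
```

In this toy case the title-based classifier misses one more label than the full-text classifier, so its score is a fraction of the full-text score. The claim in the abstract corresponds to this ratio exceeding 0.9 on three of the four datasets.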




