Using Deep Learning For Title-Based Semantic Subject Indexing To Reach Competitive Performance to Full-Text

by Florian Mai, et al.
Christian-Albrechts-Universität zu Kiel

For (semi-)automated subject indexing systems in digital libraries, it is often more practical to use metadata such as the title of a publication instead of the full-text or the abstract. Therefore, it is desirable to have text mining and text classification algorithms that perform well on the title of a publication alone. So far, classification performance on titles is not competitive with performance on full-texts when the same number of training samples is used. However, title data is much easier to obtain in large quantities and to use for training than full-text data. In this paper, we investigate how models trained on increasing amounts of title data compare to models trained on a constant number of full-texts. We evaluate this question on a large-scale dataset from the medical domain (PubMed) and one from economics (EconBiz). In these datasets, the titles and annotations of millions of publications are available, outnumbering the available full-texts by a factor of 20 and 15, respectively. To exploit these large amounts of data to their full potential, we develop three strong deep learning classifiers and evaluate their performance on the two datasets. The results are promising. On the EconBiz dataset, all three classifiers outperform their full-text counterparts by a large margin, and the best title-based classifier outperforms the best full-text method by 9.9%. On the PubMed dataset, the best title-based classifier almost reaches the performance of the best full-text classifier, with a difference of only 2.9%.
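The paper's classifiers are deep neural networks, which are not reproduced here. As a minimal, self-contained illustration of the underlying task — assigning multiple subject labels to a publication from its title alone — the following toy sketch implements a simple per-label Naive Bayes scorer in pure Python. All class names, labels, and example titles are invented for illustration; this is not the authors' method.

```python
import math
from collections import defaultdict


class TitleSubjectIndexer:
    """Toy multi-label subject indexer: scores each subject label with a
    Naive Bayes estimate computed from publication titles only."""

    def __init__(self):
        self.label_docs = defaultdict(int)                    # docs per label
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()
        self.n_docs = 0

    @staticmethod
    def tokenize(title):
        return title.lower().split()

    def fit(self, titles, label_sets):
        """Count word occurrences per subject label over the training titles."""
        for title, labels in zip(titles, label_sets):
            self.n_docs += 1
            tokens = self.tokenize(title)
            self.vocab.update(tokens)
            for label in labels:
                self.label_docs[label] += 1
                for tok in tokens:
                    self.word_counts[label][tok] += 1

    def predict(self, title, k=1):
        """Return the k highest-scoring subject labels for a title."""
        tokens = self.tokenize(title)
        scores = {}
        for label, n_label in self.label_docs.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n_label / self.n_docs)           # label prior
            for tok in tokens:                                # Laplace-smoothed likelihoods
                count = self.word_counts[label].get(tok, 0)
                score += math.log((count + 1) / (total + len(self.vocab)))
            scores[label] = score
        return sorted(scores, key=scores.get, reverse=True)[:k]


# Invented toy data, loosely mirroring the two domains in the paper.
indexer = TitleSubjectIndexer()
indexer.fit(
    [
        "deep learning for text classification",
        "monetary policy and inflation dynamics",
        "neural networks for image classification",
        "inflation targeting in emerging economies",
    ],
    [
        {"machine learning"},
        {"economics"},
        {"machine learning"},
        {"economics"},
    ],
)
print(indexer.predict("deep neural text models", k=1))  # → ['machine learning']
```

In the paper's setting, the appeal of titles is that such training pairs (title, subject labels) exist for millions of publications, whereas full-texts are far scarcer; the deep models evaluated there replace the hand-built scorer above with learned representations.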


