OAG-BERT: Pre-train Heterogeneous Entity-augmented Academic Language Model

03/03/2021
by   Xiao Liu, et al.

Enriching language models with domain knowledge is crucial but difficult. Based on Open Academic Graph (OAG), the world's largest public academic graph, we pre-train an academic language model, OAG-BERT, which integrates massive heterogeneous entities including papers, authors, concepts, venues, and affiliations. To better endow OAG-BERT with the ability to capture entity information, we develop novel pre-training strategies, including heterogeneous entity type embedding, entity-aware 2D positional encoding, and span-aware entity masking. For zero-shot inference, we design a special decoding strategy that allows OAG-BERT to generate entity names from scratch. We evaluate OAG-BERT on various downstream academic tasks, including NLP benchmarks, zero-shot entity inference, heterogeneous graph link prediction, and author name disambiguation. The results demonstrate the effectiveness of the proposed pre-training approach for both comprehending academic texts and modeling knowledge from heterogeneous entities. OAG-BERT has been deployed in multiple real-world applications, such as reviewer recommendation for the NSFC (National Natural Science Foundation of China) and paper tagging in the AMiner system. It is also available to the public through the CogDL package.
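
The abstract names three entity-aware pre-training components. As a rough illustration only, the sketch below (our own assumption about how the pieces might fit together, not the authors' released code; all module names, type ids, and sizes are hypothetical) shows token embeddings being summed with a heterogeneous entity-type embedding and a 2D positional encoding whose first dimension indexes which entity a token belongs to and whose second dimension indexes the token's offset inside that entity span:

    # Minimal sketch of an entity-aware input embedding (assumed design, not OAG-BERT's actual code).
    import torch
    import torch.nn as nn

    class EntityAwareEmbedding(nn.Module):
        def __init__(self, vocab_size=30522, hidden=768,
                     num_entity_types=6,      # e.g. text, paper, author, concept, venue, affiliation
                     max_entities=64, max_span_len=128):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, hidden)
            self.type_emb = nn.Embedding(num_entity_types, hidden)    # heterogeneous entity type embedding
            self.pos_entity_emb = nn.Embedding(max_entities, hidden)  # 2D position, dim 1: which entity
            self.pos_offset_emb = nn.Embedding(max_span_len, hidden)  # 2D position, dim 2: offset in span
            self.norm = nn.LayerNorm(hidden)

        def forward(self, token_ids, entity_type_ids, entity_index_ids, span_offset_ids):
            # All inputs are LongTensors of shape (batch, seq_len); the four embeddings are summed.
            x = (self.token_emb(token_ids)
                 + self.type_emb(entity_type_ids)
                 + self.pos_entity_emb(entity_index_ids)
                 + self.pos_offset_emb(span_offset_ids))
            return self.norm(x)

    # Usage: a five-token paper title followed by an author entity and a venue entity.
    emb = EntityAwareEmbedding()
    token_ids        = torch.randint(0, 30522, (1, 10))
    entity_type_ids  = torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1, 2, 2]])  # 0=text, 1=author, 2=venue (assumed ids)
    entity_index_ids = torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1, 2, 2]])  # first positional dimension
    span_offset_ids  = torch.tensor([[0, 1, 2, 3, 4, 0, 1, 2, 0, 1]])  # second positional dimension
    print(emb(token_ids, entity_type_ids, entity_index_ids, span_offset_ids).shape)  # torch.Size([1, 10, 768])

Span-aware entity masking would then mask whole spans identified by the entity indices rather than individual tokens; that step is omitted from this sketch.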


research 09/08/2021
NSP-BERT: A Prompt-based Zero-Shot Learner Through an Original Pre-training Task–Next Sentence Prediction
Using prompts to utilize language models to perform various downstream t...

research 04/18/2021
CEAR: Cross-Entity Aware Reranker for Knowledge Base Completion
Pre-trained language models (LMs) like BERT have shown to store factual ...

research 09/02/2021
TravelBERT: Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation
Existing technologies expand BERT from different perspectives, e.g. desi...

research 06/05/2023
Graph-Aware Language Model Pre-Training on a Large Graph Corpus Can Help Multiple Graph Applications
Model pre-training on large text corpora has been demonstrated effective...

research 10/12/2020
Zero-shot Entity Linking with Efficient Long Range Sequence Modeling
This paper considers the problem of zero-shot entity linking, in which a...

research 05/26/2021
Zero-shot Medical Entity Retrieval without Annotation: Learning From Rich Knowledge Graph Semantics
Medical entity retrieval is an integral component for understanding and ...

research 09/10/2020
RadLex Normalization in Radiology Reports
Radiology reports have been widely used for extraction of various clinic...
