A Deep Representation Empowered Distant Supervision Paradigm for Clinical Information Extraction

by Yanshan Wang, et al.

Objective: To automatically create large labeled training datasets and reduce the feature engineering effort required to train accurate machine learning models for clinical information extraction. Materials and Methods: We propose a distant supervision paradigm empowered by deep representation for extracting information from clinical text. In this paradigm, rule-based NLP algorithms are used to generate weak labels and create large training datasets automatically. Additionally, we use pre-trained word embeddings as the deep representation to eliminate the need for task-specific feature engineering. We evaluated the effectiveness of the proposed paradigm on two clinical information extraction tasks: smoking status extraction and proximal femur (hip) fracture extraction. We tested three prevalent machine learning models: Convolutional Neural Networks (CNN), Support Vector Machine (SVM), and Random Forest (RF). Results: The results indicate that CNN is the best fit for the proposed distant supervision paradigm. Given large datasets, it outperforms the rule-based NLP algorithms by capturing additional extraction patterns. We also verified the advantage of the word embedding feature representation in the paradigm over term frequency-inverse document frequency (tf-idf) and topic modeling representations. Discussion: In the clinical domain, the limited amount of labeled data is always a bottleneck for applying machine learning. Additionally, the performance of machine learning approaches depends heavily on task-specific feature engineering. The proposed paradigm could alleviate these problems by leveraging rule-based NLP algorithms to assign weak labels automatically and by using the word embedding feature representation to eliminate the need for task-specific feature engineering.
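The paradigm described in the abstract (rules assign weak labels, word embeddings supply features, a model is trained on the weakly labeled data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the regular expressions are hypothetical stand-ins for a rule-based NLP algorithm, the two-dimensional `EMBEDDINGS` table is hand-made in place of pre-trained word embeddings, and a nearest-centroid classifier is swapped in for the CNN, SVM, and RF models that the paper evaluates.

```python
import re

# Hypothetical keyword rules standing in for a rule-based NLP algorithm.
SMOKING_RULE = re.compile(r"\b(smokes|smoker)\b", re.IGNORECASE)
NEGATION_RULE = re.compile(r"\b(denies|never|non)\b", re.IGNORECASE)

# Hand-made 2-d vectors standing in for pre-trained word embeddings (assumption).
EMBEDDINGS = {
    "smokes": (1.0, 0.0),
    "smoker": (0.9, 0.1),
    "cigarettes": (0.8, 0.1),
    "tobacco": (0.8, 0.2),
    "denies": (0.0, 1.0),
    "never": (0.1, 0.9),
    "non": (0.0, 0.8),
}


def weak_label(note):
    """Weak label from the rules: 1 = smoker, 0 = non-smoker or no evidence."""
    if SMOKING_RULE.search(note) and not NEGATION_RULE.search(note):
        return 1
    return 0


def embed(note):
    """Represent a note as the mean of the embeddings of its known tokens."""
    tokens = re.findall(r"[a-z]+", note.lower())
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return (0.0, 0.0)
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(2))


def train_centroids(notes):
    """Fit one centroid per weak label; no manual annotation is involved."""
    grouped = {}
    for note in notes:
        grouped.setdefault(weak_label(note), []).append(embed(note))
    return {
        label: tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(2))
        for label, vecs in grouped.items()
    }


def predict(centroids, note):
    """Return the label whose centroid is closest to the note embedding."""
    vec = embed(note)
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(vec, centroids[label])),
    )


# Unlabeled notes (invented examples); labels come only from the rules.
TRAIN_NOTES = [
    "Patient smokes daily.",
    "Current smoker, uses tobacco.",
    "Patient denies tobacco use.",
    "Never smoker.",
    "Non-smoker.",
]
centroids = train_centroids(TRAIN_NOTES)

unseen = "Two packs of cigarettes per day."
print(weak_label(unseen))          # rules miss this phrasing
print(predict(centroids, unseen))  # embedding-based model can still catch it
```

In this toy setup the rules assign 0 to the unseen note because it contains neither "smokes" nor "smoker", while the classifier assigns 1 because "cigarettes" lies close to the smoker centroid in the embedding space. That is one way a model trained on weak labels could pick up extraction patterns the rules do not encode, in the spirit of the result reported above.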




Neural Language Models with Distant Supervision to Identify Major Depressive Disorder from Clinical Notes

Major depressive disorder (MDD) is a prevalent psychiatric disorder that...

Word and Document Embeddings based on Neural Network Approaches

Data representation is a fundamental task in machine learning. The repre...

Automated Labeling of German Chest X-Ray Radiology Reports using Deep Learning

Radiologists are in short supply globally, and deep learning models offe...

Chinese Event Extraction Using DeepNeural Network with Word Embedding

A lot of prior work on event extraction has exploited a variety of featu...

Recurrent neural networks with specialized word embeddings for health-domain named-entity recognition

Background. Previous state-of-the-art systems on Drug Name Recognition (...

Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes

PURPOSE: The medical literature relevant to germline genetics is growing...

MT-Clinical BERT: Scaling Clinical Information Extraction with Multitask Learning

Clinical notes contain an abundance of important but not-readily accessi...
