Text Mining to Identify and Extract Novel Disease Treatments From Unstructured Datasets

by Rahul Yedida, et al.

Objective: We aim to learn potential novel cures for diseases from unstructured text sources. More specifically, we seek to extract drug-disease pairs of potential cures by simple reasoning over the structure of spoken text.

Materials and Methods: We use Google Cloud to transcribe podcast episodes of an NPR radio show. We then build a pipeline that systematically pre-processes the text to ensure quality input to the core classification model, which feeds into a series of post-processing steps that produce filtered results. The classification model itself uses a language model pre-trained on PubMed text. The modular nature of the pipeline eases future development in this area, since higher-quality components can be substituted at each stage. As a validation measure, we use ROBOKOP, an engine over a medical knowledge graph with only validated pathways, as a ground truth source for checking the existence of the proposed pairs. For the proposed pairs not found in ROBOKOP, we provide further verification using Chemotext.

Results: We found 30.4 [...] For example, our model successfully identified that Omeprazole can help treat heartburn. We discuss the significance of this result, showing some examples of the proposed pairs.

Discussion and Conclusion: The agreement of our results with the existing knowledge source indicates a step in the right direction. Given the plug-and-play nature of our framework, it is easy to add, remove, or modify parts to improve the model as necessary. We discuss the results with some examples, and note that this is a potentially new line of research with further scope to be explored. Although our approach was originally oriented toward radio podcast transcripts, it is input-agnostic and could be applied to any source of textual data and to any problem of interest.
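To make the plug-and-play structure described in the abstract concrete, the following is a minimal sketch of a staged pipeline in Python. It is not the authors' implementation: every function name, the toy vocabularies, and the keyword-based `classify` stand-in are assumptions made purely for illustration. The abstract's classifier uses a language model pre-trained on PubMed text and validation is done against ROBOKOP; here a plain Python set plays the role of the knowledge source.

```python
from typing import Callable, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (drug, disease)

# Hypothetical toy vocabularies; a real system would use a trained model instead.
DRUGS = {"omeprazole", "ibuprofen"}
DISEASES = {"heartburn", "headache"}
TREATMENT_CUES = {"treat", "treats", "help", "helps", "relieve", "relieves"}


def preprocess(transcript: str) -> List[str]:
    """Split a transcript into lowercase, non-empty sentences."""
    normalized = transcript.replace("?", ".").replace("!", ".")
    return [s.strip().lower() for s in normalized.split(".") if s.strip()]


def classify(sentence: str) -> List[Pair]:
    """Toy stand-in for the classifier: a drug and a disease co-occurring
    with a treatment cue word in the same sentence form a candidate pair."""
    words = {w.strip(",;:") for w in sentence.split()}
    if not words & TREATMENT_CUES:
        return []
    return [(d, c) for d in sorted(words & DRUGS) for c in sorted(words & DISEASES)]


def postprocess(pairs: Iterable[Pair]) -> List[Pair]:
    """Drop duplicate pairs while preserving first-seen order."""
    seen: Set[Pair] = set()
    unique: List[Pair] = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


def validate(pairs: Iterable[Pair], known: Set[Pair]) -> List[Tuple[Pair, bool]]:
    """Flag each pair by whether it appears in the reference set
    (a local stand-in for a knowledge-graph lookup)."""
    return [(pair, pair in known) for pair in pairs]


def run_pipeline(
    transcript: str,
    known: Set[Pair],
    pre: Callable[[str], List[str]] = preprocess,
    clf: Callable[[str], List[Pair]] = classify,
    post: Callable[[Iterable[Pair]], List[Pair]] = postprocess,
) -> List[Tuple[Pair, bool]]:
    """Run the stages in order; each stage can be swapped independently."""
    candidates = [pair for sentence in pre(transcript) for pair in clf(sentence)]
    return validate(post(candidates), known)


if __name__ == "__main__":
    text = "Doctors say omeprazole can help treat heartburn. Ibuprofen relieves a headache!"
    for pair, found in run_pipeline(text, known={("omeprazole", "heartburn")}):
        print(pair, "found in reference set" if found else "not found")
```

Because each stage is passed in as a plain callable, replacing the toy `classify` with a model-backed function, or `preprocess` with a better sentence splitter, requires no change to `run_pipeline`. Pairs flagged as not found would be the ones sent on for further verification.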


