De-identifying Hospital Discharge Summaries: An End-to-End Framework using Ensemble of De-Identifiers

by Leibo Liu, et al.

Objective: Electronic Medical Records (EMRs) contain clinical narrative text of great potential value to medical researchers. However, this information is mixed with Protected Health Information (PHI), which presents risks to patient and clinician confidentiality. This paper presents an end-to-end de-identification framework that automatically removes PHI from hospital discharge summaries.

Materials and Methods: Our corpus comprised 600 hospital discharge summaries extracted from the EMRs of two principal referral hospitals in Sydney, Australia. Our end-to-end de-identification framework consists of three components: 1) Annotation: labelling PHI in the 600 hospital discharge summaries using five pre-defined categories (person, address, date of birth, individual identification number, and phone/fax number); 2) Modelling: training and evaluating ensembles of named entity recognition (NER) models built with three natural language processing (NLP) toolkits (Stanza, FLAIR and spaCy) on both balanced and imbalanced datasets; and 3) De-identification: removing PHI from the hospital discharge summaries.

Results: The final model in our framework was an ensemble that combined six single models, trained on both balanced and imbalanced datasets, through majority voting. It achieved a precision of 0.9866, a recall of 0.9862 and an F1 score of 0.9864. The majority of false positives and false negatives were related to the person category.

Discussion: Our study showed that an ensemble of models trained with three different NLP toolkits on balanced and imbalanced datasets can achieve good results even with a relatively small corpus.

Conclusion: Our end-to-end framework provides a robust solution for safely de-identifying clinical narrative corpora. It can be easily applied to any kind of clinical narrative document.
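The majority-voting and de-identification steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `majority_vote` and `redact`, the token-level label format, the strict-majority rule, and the `[LABEL]` placeholder style are all assumptions made for illustration. The abstract does not specify how votes are counted or how removed PHI is represented.

```python
from collections import Counter


def majority_vote(predictions, outside="O"):
    """Combine per-token labels from several models by strict majority.

    `predictions` holds one label sequence per model, all of equal length.
    Assumption: a label wins only if more than half of the models agree;
    otherwise the token falls back to the `outside` (non-PHI) label.
    """
    if not predictions:
        return []
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    threshold = len(predictions) / 2
    voted = []
    for labels in zip(*predictions):
        label, count = Counter(labels).most_common(1)[0]
        voted.append(label if count > threshold else outside)
    return voted


def redact(tokens, labels, outside="O"):
    """Replace every token labelled as PHI with a bracketed category tag."""
    return " ".join(
        tok if lab == outside else f"[{lab}]"
        for tok, lab in zip(tokens, labels)
    )


# Hypothetical example: three models labelling a five-token sentence.
tokens = ["Patient", "John", "Smith", "phoned", "clinic"]
model_outputs = [
    ["O", "PERSON", "PERSON", "O", "O"],
    ["O", "PERSON", "O", "O", "O"],
    ["O", "PERSON", "PERSON", "O", "PERSON"],
]
ensemble_labels = majority_vote(model_outputs)
deidentified = redact(tokens, ensemble_labels)
```

Under this rule an even split, such as three votes against three in a six-model ensemble, leaves the token unredacted. A recall-oriented deployment might instead redact on ties, since a missed PHI token is usually costlier than an over-redacted one.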


Summarisation of Electronic Health Records with Clinical Concept Guidance

Brief Hospital Course (BHC) summaries are succinct summaries of an entir...

Benchmarking Modern Named Entity Recognition Techniques for Free-text Health Record De-identification

Electronic Health Records (EHRs) have become the primary form of medical...

Improving Hospital Mortality Prediction with Medical Named Entities and Multimodal Learning

Clinical text provides essential information to estimate the acuity of a...

Med7: a transferable clinical natural language processing model for electronic health records

The field of clinical natural language processing has been advanced sign...

MASK: A flexible framework to facilitate de-identification of clinical texts

Medical health records and clinical summaries contain a vast amount of i...

An Empirical Study of UMLS Concept Extraction from Clinical Notes using Boolean Combination Ensembles

Our objective in this study is to investigate the behavior of Boolean op...

Automatic end-to-end De-identification: Is high accuracy the only metric?

De-identification of electronic health records (EHR) is a vital step tow...
