SemClinBr – a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks

The high volume of research on extracting patient information from electronic health records (EHRs) has increased the demand for annotated corpora, which are a valuable resource for both the development and the evaluation of natural language processing (NLP) algorithms. The absence of a multi-purpose clinical corpus for languages other than English, especially Brazilian Portuguese, is glaring and severely impedes scientific progress in biomedical NLP. In this study, we developed a semantically annotated corpus using clinical texts from multiple medical specialties, document types, and institutions. We present the following: (1) a survey listing common aspects and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated and can guide other annotation initiatives, (3) a web-based annotation tool centered on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluations of the annotations. The result of this work is SemClinBr, a corpus of 1,000 clinical notes labeled with 65,117 entities and 11,263 relations, which can support a variety of clinical NLP tasks and boost the secondary use of EHRs for the Portuguese language.


