Resources


Database Contributor Review

CARMEN-I: A resource of anonymized electronic health records in Spanish and Catalan for training and testing NLP tools

Eulalia Farre Maduell, Salvador Lima-Lopez, Santiago Andres Frid, Artur Conesa, Elisa Asensio, Antonio Lopez-Rueda, Helena Arino, Elena Calvo, Maria Jesús Bertran, Maria Angeles Marcos, Montserrat Nofre Maiz, Laura Tañá Velasco, Antonia Marti, Ricardo Farreres, Xavier Pastor, Xavier Borrat Frigola, Martin Krallinger

CARMEN-I is a Spanish corpus of 2,000 clinical records from Hospital Clínic, Barcelona. It covers COVID-19 patients and their comorbidities, and serves as a resource for training clinical NLP models and for researchers applying NLP to clinical documents.

de-identification anonymization clinical ner

Published: Nov. 2, 2023. Version: 1.0


Database Credentialed Access

Tasks 1 and 3 from Progress Note Understanding Suite of Tasks: SOAP Note Tagging and Problem List Summarization

Yanjun Gao, John Caskey, Timothy Miller, Brihat Sharma, Matthew Churpek, Dmitriy Dligach, Majid Afshar

We introduce a hierarchical annotation suite of tasks addressing clinical text understanding, reasoning and abstraction over evidence, and diagnosis summarization. One task is tagging major sections; the other is diagnosis generation.

Published: Sept. 30, 2022. Version: 1.0.0


Challenge Credentialed Access

Analysis of Clinical Text: Task 14 of SemEval 2015

Guergana Savova

This is the dataset for SemEval 2014 and 2015, Analysis of Clinical Text.

semeval nlp

Published: Dec. 28, 2014. Version: 2.0


Database Credentialed Access

MIMIC-IV-Note: Deidentified free-text clinical notes

Alistair Johnson, Tom Pollard, Steven Horng, Leo Anthony Celi, Roger Mark

Deidentified free-text clinical notes for patients in the MIMIC-IV Clinical Database.

mimic deidentification critical care electronic health record clinical notes natural language processing

Published: Jan. 6, 2023. Version: 2.2


Model Credentialed Access

EntityBERT: BERT-based Models Pretrained on MIMIC-III with or without Entity-centric Masking Strategy for the Clinical Domain

Chen Lin, Steven Bethard, Guergana Savova, Timothy Miller, Dmitriy Dligach

Models with a broad representation of biomedical terminology (PubMedBERT), further pretrained on the MIMIC-III corpus with or without a novel entity-centric masking strategy.

Published: March 17, 2022. Version: 1.0.1


Challenge Credentialed Access

BioNLP Workshop 2023 Shared Task 1A: Problem List Summarization

Yanjun Gao, Dmitriy Dligach, Timothy Miller, Majid Afshar

This is the data storage for BioNLP Workshop Shared Task 1A: Problem List Summarization.

bionlp clinical natural language processing electronic health record summarization

Published: Nov. 12, 2023. Version: 2.0.0


Model Credentialed Access

Characterization of Stigmatizing Language in Medical Records

Keith Harrigian, Ayah Zirikly, Brant Chee, Alya Ahmad, Anne Links, Somnath Saha, Mary Catherine Beach, Mark Dredze

A suite of classifiers for detecting three types of stigmatizing language in electronic medical records. Trained on MIMIC-IV discharge notes.

mimic clinical natural language processing large language models domain transfer bias stigmatizing language

Published: Nov. 6, 2023. Version: 1.0.0


Database Credentialed Access

RuMedNLI: A Russian Natural Language Inference Dataset For The Clinical Domain

Pavel Blinov, Aleksandr Nesterov, Galina Zubkova, Arina Reshetnikova, Vladimir Kokh, Chaitanya Shivade

RuMedNLI is the full Russian-language counterpart of the MedNLI dataset.

natural language inference recognizing textual entailment russian language

Published: April 1, 2022. Version: 1.0.0


Database Credentialed Access

Chest ImaGenome Dataset

Joy Wu, Nkechinyere Agu, Ismini Lourentzou, Arjun Sharma, Joseph Paguio, Jasper Seth Yao, Edward Christopher Dee, William Mitchell, Satyananda Kashyap, Andrea Giovannini, Leo Anthony Celi, Tanveer Syeda-Mahmood, Mehdi Moradi

The Chest ImaGenome dataset is a scene graph dataset with additional chronological comparison relations for chest X-rays. It is automatically derived from the MIMIC-CXR dataset. A manually annotated gold standard is also available for 500 patients.

multimodal chest x-ray radiology machine learning scene graph visual dialogue object detection semantic reasoning bounding box relation extraction knowledge graph explainability reasoning chest cxr visual question answering deep learning disease progression

Published: July 13, 2021. Version: 1.0.0