Resources


Database Restricted Access

EchoNext: A Dataset for Detecting Echocardiogram-Confirmed Structural Heart Disease from ECGs

Pierre Elias, Joshua Finer

EchoNext is a curated dataset of electrocardiograms (ECGs) paired with echocardiogram-confirmed structural heart disease labels, designed to support the development and validation of machine learning models.

clinical decision support heart failure artificial intelligence ecg health equity machine learning electrocardiogram deep learning ai model deployment population health transthoracic echocardiogram left ventricular dysfunction structural heart disease aortic stenosis cardiovascular screening digital health ai in healthcare valvular heart disease

Published: Aug. 5, 2025. Version: 1.0.0


Database Credentialed Access

MIMIC-IV-Ext Triage Instruction Corpus

Qingyang Shen, Quan Guo

MIMIC-IV-Ext Triage Instruction Corpus includes 9,629 ED triage cases organized by the five-level ESI, enabling LLMs to improve triage accuracy. It provides CSV data, generation prompts, expert validation samples, and SQL QC scripts.

clinical decision support nlp large language models machine learning emergency severity index emergency triage

Published: March 4, 2025. Version: 1.0.0


Database Open Access

Synthetic Mention Corpora for Disease Entity Recognition and Normalization

Kuleen Sasse, John David Osborne

We present the Synthetic Mention Corpora for Disease Entity Recognition and Normalization, containing 128000 disease mentions from the UMLS disorder group, generated by an LLM. This corpus aims to improve these tasks in biomedical and clinical texts.

nlp named entity recognition machine learning data augmentation entity normalization

Published: Feb. 3, 2025. Version: 1.0.0


Database Open Access

CGMacros: a scientific dataset for personalized nutrition and diet monitoring

Ricardo Gutierrez-Osuna, David Kerr, Bobak Mortazavi, Anurag Das

CGMacros contains information from two continuous glucose monitors (CGM), food macronutrients, food photographs, physical activity, and anonymized participant demographics, anthropometric measurements and health parameters.

diabetes continuous glucose monitors machine learning obesity postprandial glucose response food macronutrients metabolic models food photographs personalized nutrition

Published: Jan. 28, 2025. Version: 1.0.0


Database Contributor Review

COVID Data for Shared Learning (CDSL): A comprehensive, multimodal COVID-19 dataset from HM Hospitales

Álvaro Ritoré, Andreea M Oprescu, Alberto Estirado Bronchalo, Miguel Ángel Armengol de la Hoz

COVID Data for Shared Learning (CDSL) is a multimodal database comprising de-identified structured health data and radiological images from 4,479 patients with COVID-19, as a comprehensive toolkit for developing predictive models.

covid-19 multimodal database radiological images open data healthcare data machine learning and ai

Published: Oct. 25, 2024. Version: 1.0.0


Database Credentialed Access

MIMIC-IV

Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Brian Gow, Benjamin Moody, Steven Horng, Leo Anthony Celi, Roger Mark

Large database of de-identified health information from patients admitted to Beth Israel Deaconess Medical Center

critical care intensive care unit machine learning mimic

Published: Oct. 11, 2024. Version: 3.1


Database Open Access

VTaC: A Benchmark Dataset of Ventricular Tachycardia Alarms from ICU Monitors

Li-wei Lehman, Benjamin Moody, Lucas McCullum, Hasan Saeed, Harsh Deep, Diane Perry, Tristan Struja, Qiao Li, Gari Clifford, Roger Mark

VTaC is an annotated ventricular tachycardia (VT) arrhythmia alarm database containing over 5,000 waveform recordings with VT alarms from ICU monitors, with each alarm labeled as either true or false by at least two human expert annotators.

arrhythmia icu false alarms benchmark dataset ventricular tachycardia machine learning

Published: Oct. 1, 2024. Version: 1.0

Visualize waveforms

Database Credentialed Access

Comprehensive Polysomnography (CPS) Dataset: A Resource for Sleep-Related Arousal Research

Stefan Kraft, Andreas Theissler, Vera Wienhausen-Wilke, Philipp Walter, Gjergji Kasneci

This dataset includes polysomnographic sleep recordings from a study on sleep-related arousal diagnostics, featuring raw and derived data channels, annotated event types, and questionnaire data.

polysomnography sleep disorders machine learning in healthcare sleep arousal diagnostics pulse wave analysis

Published: Sept. 18, 2024. Version: 1.0.0


Database Credentialed Access

MIMIC-IV-ECG-Ext-ICD: Diagnostic labels for MIMIC-IV-ECG

Nils Strodthoff, Juan Miguel Lopez Alcaraz, Wilhelm Haverkamp

Dataset that links ECG records from MIMIC-IV-ECG to ED discharge and hospital discharge diagnoses, which enables to train general ECG prediction models based on clinical labels and facilitates the retrieval of further clinical metadata from MIMIC-IV.

electrocardiography machine learning mimic

Published: Aug. 30, 2024. Version: 1.0.1


Database Credentialed Access

EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images

Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, Jungwoo Oh, Lei JI, Eric Chang, Tackeun Kim, Edward Choi

We present EHRXQA, the first multi-modal EHR QA dataset combining structured patient records with aligned chest X-ray images. EHRXQA contains a comprehensive set of QA pairs covering image-related, table-related, and image+table-related questions.

question answering chest x-ray benchmark evaluation multi-modal question answering ehr question answering semantic parsing machine learning electronic health records deep learning visual question answering

Published: July 23, 2024. Version: 1.0.0