Resources


Database Credentialed Access

CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

Hyungyung Lee, Geon Choi, Jung Oh Lee, Hangyul Yoon, Hyuk Gi Hong, Edward Choi

CheXStruct is an automated pipeline that derives structured diagnostic reasoning steps from chest X-rays. CXReasonBench builds on this to evaluate whether models perform clinically grounded, multi-step reasoning beyond final diagnoses.

evaluation, chest x-ray, benchmark, structured chest x-ray qa, intermediate reasoning steps, structured reasoning, grounded reasoning, diagnostic reasoning, structured diagnostic pipeline

Published: Oct. 23, 2025. Version: 1.0.1


Database Credentialed Access

MIMIC-IV-Ext-MDS-ED: Multimodal Decision Support in the Emergency Department - a Benchmark Dataset for Diagnoses and Deterioration Prediction in Emergency Medicine

Juan Miguel Lopez Alcaraz, Nils Strodthoff

MIMIC-IV-Ext-MDS-ED is a dataset for benchmarking multimodal decision support in the emergency department. It features multimodal input (including ECG waveforms) and a comprehensive set of prediction targets (diagnoses and deterioration prediction).

emergency department, ecg, diagnoses prediction, deterioration prediction, benchmark, multimodal

Published: Sept. 12, 2024. Version: 1.0.0


Database Contributor Review

ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room

Mel Molina, Nikita Mehandru, Niloufar Golchini, Ahmed Alaa

The ER-REASON dataset is a longitudinal collection of 25,174 de-identified clinical notes for 3,437 patients admitted to the emergency room (ER) at a large academic medical center between March 1, 2022, and March 31, 2024.

Published: Oct. 23, 2025. Version: 1.0.0


Database Credentialed Access

FFA-IR: Towards an Explainable and Reliable Medical Report Generation Benchmark

Mingjie Li, Wenjia Cai, Rui Liu, Yuetian Weng, Tengfei Liu, Cong Wang, Xin Chen, Zhong Liu, Caineng Pan, Mengke Li, Yingfeng Zheng, Yizhi Liu, Flora Salim, Karin Verspoor, Xiaodan Liang, Xiaojun Chang

Benchmark dataset for report generation based on fundus fluorescein angiography images and reports.

fundus fluorescein angiography, medical report generation, vision and language, explainable and reliable evaluation

Published: Jan. 21, 2025. Version: 1.1.0


Database Credentialed Access

EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries

Sunjun Kweon, Jiyoun Kim, Heeyoung Kwak, Dongchul Cha, Hangyul Yoon, Kwang Hyun Kim, Jeewon Yang, Seunghyun Won, Edward Choi

An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries

Published: June 26, 2024. Version: 1.0.1


Database Credentialed Access

MIMIC-IV-ECHO-Ext-MIMICEchoQA: A Benchmark Dataset for Echocardiogram-Based Visual Question Answering

Rahul Thapa, Andrew Li, Qingyang Wu, Bryan He, Yuki Sahashi, Christina Binder-Rodriguez, Angela Zhang, David Ouyang, James Zou

We present MIMICEchoQA, a benchmark dataset for echocardiogram-based question answering, built from the publicly available MIMIC-IV-ECHO database.

Published: Oct. 7, 2025. Version: 1.0.0


Database Credentialed Access

CXR-Align: A Benchmark for CXR-Report Alignment with Negations

Hanbin Ko

CXR-Align is a benchmark dataset created to evaluate vision-language models' capability to interpret negations in chest X-ray (CXR) reports, featuring systematically modified reports from MIMIC-CXR.

Published: Aug. 21, 2025. Version: 1.0.0


Database Credentialed Access

ODD: A Benchmark Dataset for the NLP-based Opioid Related Aberrant Behavior Detection

Sunjae Kwon, Xun Wang, Weisong Liu, Emily Druhl, Minhee Sung, Joel Reisman, Wenjun Li, Robert Kerns, William Becker, Hong Yu

The Opioid-Related Aberrant Behaviors (ORABs) Detection Dataset (ODD) is a large, expert-annotated, multi-label classification benchmark dataset for the ORAB detection task.

substance use, natural language processing, opioid related aberrant behavior

Published: Jan. 11, 2024. Version: 1.0.0