Database Credentialed Access

MIMIC-IV-ECHO-Ext-MIMICEchoQA: A Benchmark Dataset for Echocardiogram-Based Visual Question Answering

Rahul Thapa Andrew Li Qingyang Wu Bryan He Yuki Sahashi Christina Binder-Rodriguez Angela Zhang David Ouyang James Zou

Published: Oct. 7, 2025. Version: 1.0.0


When using this resource, please cite:
Thapa, R., Li, A., Wu, Q., He, B., Sahashi, Y., Binder-Rodriguez, C., Zhang, A., Ouyang, D., & Zou, J. (2025). MIMIC-IV-ECHO-Ext-MIMICEchoQA: A Benchmark Dataset for Echocardiogram-Based Visual Question Answering (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/rndk-4s36

Additionally, please cite the original publication:

Thapa, Rahul, Andrew Li, Qingyang Wu, Bryan He, Yuki Sahashi, Christina Binder, Angela Zhang et al. "How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos?" arXiv preprint arXiv:2504.14391 (2025).

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.

Abstract

We present MIMICEchoQA, a benchmark dataset for echocardiogram-based question answering, built from the publicly available MIMIC-IV-ECHO database. Each echocardiographic study was paired with the closest discharge summary within a 7-day window, and the transthoracic echocardiography (TTE) and ECHO sections were extracted to serve as proxies for cardiologist-authored diagnostic reports. DICOM videos were converted to .mp4 format, and a language model was used to generate multi-turn, closed-ended Q/A pairs grounded in these reports. To ensure anatomical consistency, a view classification model was used to label each video by its echocardiographic view (e.g., A4C, A3C), enabling the filtering of questions referencing structures not visible in the corresponding video. All generated Q/A pairs were manually reviewed by two board-certified cardiologists to ensure clinical validity. The final dataset consists of 622 high-quality, view-consistent Q/A pairs aligned with real-world echocardiogram videos, offering a valuable resource for developing and evaluating models in echocardiographic visual question answering.


Background

Echocardiography is a widely used, non-invasive imaging modality for assessing cardiac structure, function, and hemodynamics [1]. It plays a central role in diagnosing cardiovascular diseases due to its accessibility and ability to provide real-time insights into heart function [2]. As echocardiographic data continues to accumulate in large hospital systems, there is growing interest in developing machine learning (ML) models to assist clinicians with interpretation. Most existing work has focused on traditional supervised tasks such as view classification, structure segmentation, or function quantification [3,4,5]. While valuable, these tasks often fall short of capturing the diverse clinically relevant queries that physicians routinely ask during patient care.

To bridge this gap, we introduce MIMICEchoQA, a clinician-validated question-answering benchmark constructed from the MIMIC-IV-ECHO dataset [6]. MIMICEchoQA focuses on core cardiology topics such as ejection fraction, valvular abnormalities, chamber size, and pericardial effusion—questions that reflect the actual diagnostic priorities of practicing cardiologists. Each question-answer pair is grounded in both echocardiogram videos and their corresponding diagnostic reports, and the dataset has been reviewed by clinical experts for medical validity and consistency.

Unlike prior benchmarks, MIMICEchoQA is specifically designed to evaluate the capabilities of medical vision-language models (VLMs)—models that integrate visual input (e.g., echo video) and textual data (e.g., reports, prompts) to perform clinically meaningful tasks. By providing a standardized, expert-reviewed set of echocardiogram-grounded Q/A examples, MIMICEchoQA aims to facilitate progress in multimodal AI systems for real-world cardiology applications.


Methods

MIMICEchoQA was constructed from the MIMIC-IV-ECHO dataset [6], which includes echocardiogram videos and metadata collected from patients at Beth Israel Deaconess Medical Center. To ensure that each echocardiographic study was accompanied by relevant clinical interpretation, we first linked each echocardiogram with its nearest discharge summary from MIMIC-IV-Note [7] within a ±7-day window. We retained only those discharge summaries that contained echo-specific sections or keywords such as “ECHO” or “TTE”, resulting in a pool of 1,200 unique patients with corresponding echocardiogram studies and relevant diagnostic reports.
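
As an illustration of this linkage step, the sketch below pairs each echo study with its temporally closest discharge summary and applies the keyword filter. The column names (subject_id, study_id, echo_time, charttime, text) are assumptions for illustration only; the actual pipeline is available in the GitHub repository [8].

import pandas as pd

# Hypothetical inputs (column names are assumptions, not the released schema):
# echo_df:  subject_id, study_id, echo_time          (one row per echocardiogram study)
# notes_df: subject_id, note_id, charttime, text     (discharge summaries)

def link_echo_to_note(echo_df, notes_df, window_days=7):
    """Pair each echo study with the closest discharge summary within ±window_days."""
    merged = echo_df.merge(notes_df, on="subject_id", how="inner")
    merged["gap"] = (merged["charttime"] - merged["echo_time"]).abs()
    merged = merged[merged["gap"] <= pd.Timedelta(days=window_days)]
    # Keep the temporally closest note per echo study.
    closest = merged.sort_values("gap").groupby("study_id", as_index=False).first()
    # Retain only notes containing echo-specific keywords.
    mask = closest["text"].str.upper().str.contains("ECHO|TTE", regex=True)
    return closest[mask]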

From each selected report, we extracted the echo-specific section text and used a securely hosted Qwen-2-72B-Instruct large language model (LLM) to generate multiple candidate question-answer (Q/A) pairs. The questions were designed to be closed-ended, clinically grounded, and aligned with core echocardiographic concepts such as ejection fraction, valvular severity, chamber size, and pericardial effusion.
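
The sketch below illustrates how such generation could be scripted against a locally hosted, OpenAI-compatible endpoint serving Qwen-2-72B-Instruct (e.g., via vLLM). The endpoint URL and prompt wording are illustrative and do not reproduce the exact prompt used to build the dataset.

from openai import OpenAI

# Assumes a securely hosted, OpenAI-compatible server exposing Qwen-2-72B-Instruct.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

PROMPT = (
    "You are a cardiologist. From the echocardiography report below, write "
    "closed-ended multiple-choice questions (options A-D) about ejection fraction, "
    "valvular severity, chamber size, and pericardial effusion. For each question, "
    "quote the exact report sentence that supports the correct answer.\n\nReport:\n{report}"
)

def generate_qa(report_text: str) -> str:
    """Return the model's raw Q/A candidates for one echo-specific report section."""
    response = client.chat.completions.create(
        model="Qwen/Qwen2-72B-Instruct",
        messages=[{"role": "user", "content": PROMPT.format(report=report_text)}],
        temperature=0.2,
    )
    return response.choices[0].message.content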

To ensure baseline quality, we applied a first-pass filtering step using the same LLM to eliminate noisy or underspecified questions. Specifically, we removed Q/A pairs that lacked clinical specificity, referenced ambiguous findings, or contained unclear answer formats. Questions with multiple-choice answers were standardized to follow clinically interpretable scales (e.g., Normal, Mild, Moderate, Severe).
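
As a simple illustration of the standardization step, the mapping below collapses free-text severity phrases onto the four canonical options. The specific phrase-to-label choices are assumptions for illustration, not the dataset's exact rules.

# Illustrative normalization of free-text severity phrases onto the fixed
# answer scale used in the benchmark (Normal / Mild / Moderate / Severe).
SEVERITY_MAP = {
    "no ": "Normal", "none": "Normal", "trivial": "Normal", "normal": "Normal",
    "mild": "Mild", "borderline": "Mild",
    "moderate": "Moderate",
    "severe": "Severe", "critical": "Severe",
}

def standardize_severity(phrase: str) -> str | None:
    """Map a raw severity phrase to one of the four canonical options, if possible."""
    for key, label in SEVERITY_MAP.items():
        if key in phrase.lower():
            return label
    return None  # Drop or flag questions whose answers cannot be standardized.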

From this filtered pool, we randomly sampled 1,000 unique video–question pairs and evenly divided them between two board-certified cardiologists. Each cardiologist independently reviewed 500 unique samples, evaluating each pair across three dimensions:

  1. Whether the question is clinically relevant and grounded in the diagnostic report,
  2. Whether the answer is correct based on the provided text,
  3. Whether the question is visually answerable from the assigned echocardiographic video (e.g., appropriate view for the anatomical structure mentioned).

After manual review, 622 video–question pairs met all criteria and were retained in the final benchmark. Each of these examples is linked to a unique echocardiogram video from a different patient, ensuring diversity and clinical coverage.

Since each reviewer was assigned a disjoint set of examples, there were no overlapping reviews, and thus no inter-reviewer inconsistencies to resolve.


Data Description

Each entry in MIMICEchoQA represents a single benchmark example and includes the following components:

  • Video: A transthoracic echocardiogram clip in .mp4 format, derived from the original DICOM files in MIMIC-IV-ECHO (a minimal conversion sketch follows this list). Each video is associated with a specific echocardiographic view (e.g., Apical Four-Chamber [A4C], Apical Three-Chamber [A3C]).
  • Question and Answer: A closed-ended, clinically relevant question grounded in the accompanying report and video, along with four answer choices (A–D). The correct answer is explicitly marked.
  • Anatomical Structure: The cardiac structure referenced in the question (e.g., Mitral Valve, Left Ventricle, Aortic Valve), enabling structured reasoning and view-aware analysis.
  • Report Context: The relevant sentence from the clinical report that supports the answer, along with the full impression section for reference.
  • Metadata: Includes the study ID, video filename, view label, and test split assignment.
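
For reference, the sketch below shows one way a multi-frame echo DICOM could be rendered to .mp4 using pydicom and imageio (with the imageio-ffmpeg backend). It is a minimal, illustrative conversion only; the released clips were produced by the project's own pipeline [8].

import numpy as np
import pydicom
import imageio.v2 as imageio

def dicom_to_mp4(dicom_path: str, mp4_path: str, fps: int = 30) -> None:
    """Render a multi-frame echo DICOM as an .mp4 clip (illustrative only)."""
    ds = pydicom.dcmread(dicom_path)
    frames = ds.pixel_array  # shape: (num_frames, H, W) or (num_frames, H, W, 3)
    if frames.ndim == 3:
        # Grayscale frames: replicate to RGB for the video writer.
        frames = np.stack([frames] * 3, axis=-1)
    # Scale to 8-bit if the source is stored at higher bit depth.
    frames = (255 * (frames / frames.max())).astype(np.uint8)
    # The true frame rate, if needed, can be read from the DICOM CineRate tag.
    imageio.mimwrite(mp4_path, list(frames), fps=fps)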

Each question-answer pair is tightly linked to both the visual content of the echo video and expert-authored clinical impressions, making this dataset suitable for evaluating models that integrate anatomical reasoning and clinical language understanding.

Dataset Statistics

  • Total entries: 622
  • Unique echo videos: 622
  • Unique patients: 622
  • Questions per video: 1 (1:1 mapping)
  • Total unique echocardiographic views: 48
  • Most common views:
    • Apical 4-Chamber (A4C): 72
    • Subcostal 4-Chamber: 52
    • Apical 3-Chamber (A3C): 51
    • PLAX Zoom Out: 49
    • Parasternal Long Axis (PLAX): 38
    • Doppler A5C: 33
    • PSAX at Level of Apex: 30
    • PSAX at Level of Papillary Muscles: 27
    • Apical 2-Chamber (A2C): 26
  • Total unique cardiac structures queried: 14
  • Most common structures:
    • Left Ventricle: 162
    • Aortic Valve: 132
    • Mitral Valve: 87
    • Pericardium: 82
    • Left Atrium: 38
    • Atrial Septum: 24
    • Right Ventricle: 23
    • Tricuspid Valve: 23
    • Aorta: 18
    • Pulmonary Artery: 12
    • Right Atrium: 11
    • Resting Segmental Wall Motion: 5
    • Pulmonic Valve: 4
    • Inferior Vena Cava (IVC): 1

This diversity in views and anatomical focus allows MIMICEchoQA to test a model’s ability to integrate view-specific visual reasoning with clinically grounded language understanding.

An example entry from the dataset is shown below:

{
  "messages_id": "5921e583-9df1-45d6-825f-4b3399b0f24a",
  "videos": ["mimic-iv-echo/0.1/files/p10/p10119872/s98097718/98097718_0047.mp4"],
  "question": "What is the severity of mitral stenosis?",
  "answer": "Normal",
  "correct_option": "A",
  "option_A": "Normal",
  "option_B": "Mild",
  "option_C": "Moderate",
  "option_D": "Severe",
  "study": "s98097718",
  "image": "98097718_0047",
  "structure": "Mitral Valve",
  "report": "impression: suboptimal image quality. small...",
  "exact_sentence": "Severe mitral annular calcification with trivial stenosis.",
  "view": "A3C",
  "split": "test"
}
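
Entries in this format can be read with standard JSON tooling. The sketch below assumes the benchmark is distributed as JSON Lines (one object per line) under a hypothetical filename; adjust the loading logic if the release uses a different layout.

import json
from pathlib import Path

def load_benchmark(path: str) -> list[dict]:
    """Read MIMICEchoQA entries, assuming one JSON object per line (JSON Lines)."""
    entries = []
    with Path(path).open() as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries

entries = load_benchmark("mimicechoqa.jsonl")  # hypothetical filename
example = entries[0]
print(example["question"], example["option_A"], example["correct_option"])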

Usage Notes

Dataset Utility

MIMICEchoQA is a benchmark dataset for evaluating echocardiogram-based visual question answering (VQA). It consists of 622 curated examples, each containing a transthoracic echocardiogram video and corresponding closed-ended Q/A pairs derived from clinical reports.

The dataset is intended solely for evaluation purposes and does not include a training set. Researchers developing echocardiographic VQA systems can use MIMICEchoQA to assess model performance in answering clinically grounded questions tied to specific cardiac ultrasound views. Each video is labeled with its echocardiographic view (e.g., A4C, A3C), enabling view-aware analysis and filtering.
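
A minimal evaluation sketch is shown below. It assumes a caller-supplied function predict_option(video_path, question, options) that returns one of "A"–"D"; this function is hypothetical and stands in for whatever vision-language model is being evaluated.

from collections import defaultdict

def evaluate(entries, predict_option):
    """Compute overall and per-view accuracy over MIMICEchoQA entries."""
    correct = 0
    per_view = defaultdict(lambda: [0, 0])  # view -> [num_correct, num_total]
    for e in entries:
        options = {k: e[f"option_{k}"] for k in "ABCD"}
        pred = predict_option(e["videos"][0], e["question"], options)
        hit = int(pred == e["correct_option"])
        correct += hit
        per_view[e["view"]][0] += hit
        per_view[e["view"]][1] += 1
    overall = correct / len(entries)
    view_acc = {view: c / n for view, (c, n) in per_view.items()}
    return overall, view_acc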

Known Limitations

While MIMICEchoQA has been carefully curated, a few limitations should be noted. The initial Q/A generation process relied on a language model, which may introduce occasional hallucinations or clinically implausible content. Additionally, echocardiographic view labels were assigned using an automated classification model, which, while effective, is not perfectly accurate. As a result, some videos may have been assigned incorrect views, potentially affecting the filtering of anatomically inconsistent questions. However, such instances are expected to be rare (below approximately 5%), since all questions and videos were manually reviewed by board-certified cardiologists to ensure clinical validity and alignment between question and video.

GitHub Repository for this Project

The dataset's creation code is accessible on GitHub [8].


Release Notes

This is version 1.0.0 of the MIMICEchoQA dataset. For any questions, feedback, or concerns regarding this benchmark, please feel free to contact us at rthapa84@stanford.edu. We appreciate your interest in the dataset and are happy to support your research efforts.


Ethics

This project builds upon the publicly available MIMIC-IV-ECHO and MIMIC-IV-Note datasets [6, 7], which are de-identified and distributed under strict data use agreements. All data used in this project are derived from patients who provided general research authorization, and the original datasets were reviewed and approved for credentialed distribution by the institutional review board (IRB) at Beth Israel Deaconess Medical Center and the Massachusetts Institute of Technology.

No additional patient data were collected or used beyond what is already available in the de-identified MIMIC-IV datasets. All derived content, including video-question pairs and associated metadata, adheres to the original de-identification protocols and does not contain any protected health information (PHI).

To generate the initial set of question-answer pairs, we used a securely hosted instance of Qwen-2-72B-Instruct, a large language model, deployed within a private, access-controlled environment. All automated processing was conducted within an institutionally managed compute environment to maintain data integrity and confidentiality.

This work complies with PhysioNet’s ethical data sharing standards, and all curation and development were carried out by credentialed researchers in accordance with the data usage license terms.


Acknowledgements

R.T. is supported by the Knight-Hennessy Scholars program at Stanford University. J.Z. is supported by funding from the Chan Zuckerberg Biohub.


Conflicts of Interest

The author(s) have no conflicts of interest to declare.


References

  1. Otto CM. Textbook of Clinical Echocardiography. 5th ed. Philadelphia: Elsevier; 2013.
  2. Madani A, Arnaout R, Mofrad M, Arnaout R. Fast and accurate view classification of echocardiograms using deep learning. NPJ digital medicine. 2018 Mar 21;1(1):6.
  3. Mortada MJ, Tomassini S, Anbar H, Morettini M, Burattini L, Sbrollini A. Segmentation of anatomical structures of the left heart from echocardiographic images using deep learning. Diagnostics. 2023 May 9;13(10):1683.
  4. Zhang J, Gajjala S, Agrawal P, Tison GH, Hallock LA, Beussink-Nelson L, Lassen MH, Fan E, Aras MA, Jordan C, Fleischmann KE. Fully automated echocardiogram interpretation in clinical practice: feasibility and diagnostic accuracy. Circulation. 2018 Oct 16;138(16):1623-35.
  5. Ouyang D, He B, Ghorbani A, Yuan N, Ebinger J, Langlotz CP, Heidenreich PA, Harrington RA, Liang DH, Ashley EA, Zou JY. Video-based AI for beat-to-beat assessment of cardiac function. Nature. 2020 Apr;580(7802):252-6.
  6. Gow B, Pollard T, Greenbaum N, Moody B, Johnson A, Herbst E, Waks J W, Eslami P, Chaudhari A, Carbonati T, Berkowitz S, Mark R, Horng S. MIMIC-IV-ECHO: Echocardiogram Matched Subset (version 0.1). PhysioNet. 2023. RRID:SCR_007345. Available from: https://doi.org/10.13026/ef48-v217
  7. Johnson A, Pollard T, Horng S, Celi L A, Mark R. MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2). PhysioNet. 2023. RRID:SCR_007345. Available from: https://doi.org/10.13026/1n74-ne17
  8. Zou Group. OpenBiomedVid [Internet]. GitHub; 2025. Available from: https://github.com/zou-group/OpenBiomedVid [Accessed 3 Oct 2025].

Parent Projects

MIMIC-IV-ECHO-Ext-MIMICEchoQA: A Benchmark Dataset for Echocardiogram-Based Visual Question Answering was derived from MIMIC-IV-ECHO [6] and MIMIC-IV-Note [7]. Please cite them when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
