Database Credentialed Access
MIMIC-Ext-CXR-QBA: A Structured, Tagged, and Localized Visual Question Answering Dataset with Question-Box-Answer Triplets and Scene Graphs for Chest X-ray Images
Philip Müller , Friederike Jungmann , Georgios Kaissis , Daniel Rueckert
Published: July 22, 2025. Version: 1.0.0
When using this resource, please cite:
Müller, P., Jungmann, F., Kaissis, G., & Rueckert, D. (2025). MIMIC-Ext-CXR-QBA: A Structured, Tagged, and Localized Visual Question Answering Dataset with Question-Box-Answer Triplets and Scene Graphs for Chest X-ray Images (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/8qmz-da41
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
Abstract
Visual Question Answering (VQA) enables flexible and context-dependent analysis of medical images, such as chest X-rays (CXRs), by allowing users to pose specific questions and receive nuanced answers. However, existing CXR VQA datasets are typically limited to short and simplistic answers, lack localization information (such as bounding boxes), and provide little structured metadata (e.g., hierarchical answer formats or tags like region and finding annotations). To address these limitations, we introduce MIMIC-Ext-CXR-QBA, a new large-scale CXR VQA dataset derived from MIMIC-CXR, comprising 42 million QA pairs, which provides multi-granular, hierarchical answers composed of full sentences in the style of radiology reports, as well as detailed bounding boxes and structured tags. Additionally, we provide scene graphs for each study, containing both region and observation nodes with bounding boxes, tags, and textual descriptions derived from the original radiology reports. We created the scene graphs using LLM-based information extraction, semantic mention mapping, and localization models before generating question-answer pairs based on the extracted information stored in these graphs. Using automatic quality assessments, we have selected 31,230,906 QA pairs intended for pre-training and 7,532,281 of these intended for fine-tuning VQA models, therefore providing, to the best of our knowledge, the largest and most sophisticated VQA dataset for CXRs to date.
Background
With the emergence of Large Language Models (LLMs) and Large Multimodal Models (LMMs), interactive and conversational tasks have become a common way to interpret medical images, including chest X-rays (CXR) [1-5]. One widely studied interactive task is Visual Question Answering (VQA), where a model is given an image and a corresponding textual question and is expected to generate an answer. Unlike conventional medical imaging approaches that produce fixed outputs—such as classification labels, bounding boxes, segmentation masks, or textual reports—VQA enables user-driven, context-dependent interpretations, allowing for more flexible insights.
VQA allows the formulation of a variety of problem types, ranging from simple yes/no questions to more complex free-text answers. However, training robust VQA models for medical applications necessitates high-quality, large-scale training datasets. Existing CXR VQA datasets [1], [6-9] suffer from several limitations: they often contain only short and simplistic answers, lack localization information (such as bounding boxes), and provide little structured metadata (e.g., hierarchical answer formats, region and finding annotations, or uncertainty estimates). Additionally, their relatively small size constrains their utility for pretraining and necessitates fine-tuning on limited data.
To address these challenges, we introduce a new large-scale CXR VQA dataset derived from MIMIC-CXR [10-12], consisting of 42,172,827 QA pairs. Using automatic quality assessment, we have selected 31,230,906 pairs intended for pre-training and 7,532,281 of these intended for fine-tuning.
Unlike prior datasets, each QA pair includes multi-granular, hierarchical answers composed of full, structured sentences in the style of radiology reports. Furthermore, our dataset provides detailed bounding boxes and additional structured tags (e.g., findings, anatomical regions, probability estimates), enhancing interpretability and facilitating the development of more advanced and transparent VQA models for medical imaging.
Methods
To construct our visual question-answering dataset from MIMIC-CXR [10-12], we employ three key steps: scene graph generation, question-answer generation, and quality assessment.
We first construct scene graphs using both LLM-based information extraction (from the reports) and semantic concept mapping. Extracted observations are then associated with bounding boxes provided by anatomical region localization models. These scene graphs provide a structured description of the study, including sentences (derived from the report) for individual observations. They serve as a data source for our question-answer generation, where we utilize both template-based answers and answers derived from the rewritten report sentences. Finally, we automatically assess the quality of question-answer pairs using LLM-based evaluations.
1. Scene Graph Construction
We construct the scene graphs in three major steps, namely a) region localization, b) information extraction, and c) construction with entity mapping.
Region Localization
The bounding boxes in our scene graphs (and the derived QA-pairs) are based on fine-grained anatomical structures, allowing us to localize associated findings very precisely.
We use the CXAS [13] model to predict segmentation masks of 158 anatomical structures in the chest. We apply CXAS to the 377,110 CXRs from MIMIC-CXR-JPG [12], [14,15] and postprocess the resulting masks using morphological transformations to remove noise. Additionally, we use the bounding boxes from the Chest ImaGenome [12], [16] dataset, which are available for 29 anatomical structures in most frontal images of MIMIC-CXR. Next, we derive a total of 257 localized anatomical structures based on combinations (e.g. intersections, unions, super bounding boxes, etc.) of the available masks and bounding boxes. Finally, we discard any masks or boxes that are too small and derive bounding boxes from the segmentation masks. Note that we define 53 further regions/structures that are either non-localized (e.g. interstitial) or for which we do not have bounding boxes, leading to a total of 310 structures/regions.
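For illustration, the sketch below shows how a bounding box can be derived from a binary segmentation mask and how two structures can be combined into a super bounding box. This is a simplified illustration of the postprocessing idea only; the function names, the minimum-size threshold, and the example masks are hypothetical, not the released pipeline code.

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray, min_pixels: int = 100):
    """Derive an (x1, y1, x2, y2) bounding box from a binary mask; None if too small."""
    ys, xs = np.nonzero(mask)
    if xs.size < min_pixels:  # discard masks that are too small
        return None
    return float(xs.min()), float(ys.min()), float(xs.max()), float(ys.max())

def super_bbox(box_a, box_b):
    """Smallest box enclosing two boxes (used for combined structures, e.g. 'lungs')."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

# Hypothetical lung masks on a 2048 x 2048 image:
lung_a = np.zeros((2048, 2048), dtype=bool)
lung_b = np.zeros((2048, 2048), dtype=bool)
lung_a[400:1650, 900:1600] = True
lung_b[400:1680, 150:900] = True
lungs_box = super_bbox(mask_to_bbox(lung_a), mask_to_bbox(lung_b))
```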
Information Extraction
We use the 227,827 free-text radiology reports provided by MIMIC-CXR as the main source of information for our scene graphs. Using the Llama 3.1 70B [17] model with few-shot prompting, we extract the relevant information in three steps.
First, we extract individual sentences from the reports, detect their sections (e.g. FINDINGS, IMPRESSION, INDICATION, …), discard sentences without relevant information, and merge sentences containing similar information (e.g. if findings are described in both the FINDINGS and IMPRESSION sections). To this end, each full report is passed to the LLM in a single step, and the LLM predicts the individually separated sentences as well as their sections and related sentences.
Next, we extract information about the INDICATION section and detect which FINDINGS or IMPRESSION sentences may provide information related to the indication. To this end, the extracted INDICATION sentences and a list of all FINDINGS and IMPRESSION sentences are passed to the LLM, which predicts the following (as a JSON structure):
- The INDICATION sentences rewritten as an indication summary (cleaned of the varying formatting found in reports).
- Patient info extracted from the INDICATION sentences, typically containing the patient's sex.
- The clinical indication, if any is given in the INDICATION sentences.
- The expected evaluation (i.e. what should be assessed using the CXR) as named in the INDICATION sentences.
- A list of FINDING/IMPRESSION sentences (their IDs) that may be used to answer the indication.
- A short answer ("answer_for_indication") that would be given to the indication / evaluation question, considering what is written in the FINDING/IMPRESSION sentences.
Finally, we extract individual observations described in the FINDING/IMPRESSION sentences. To this end, we pass each FINDING/IMPRESSION sentence individually to the LLM and let it predict JSON objects for the individual observations mentioned in the sentence (there may be no, one, or more observations per sentence). Each observation includes the following:
- name: short name of the observation (derived from the sentence)
- summary_sentence: textual description of the observation (derived from the sentence)
- entity: the associated finding entities
- regions: the associated anatomical regions
- probability: whether this is a positive or negative finding and how likely
- temporal, spread, …: modifiers of the finding entities
- change, change_sentence: type and description of (longitudinal) change mentioned in the report
- children: sub-observations, providing more details (same structure as top-level json)
The LLM is allowed to freely assign values to each of those fields but we provide few-shot examples and guidelines in the prompt, including examples of entities and rules for modifier assignment. For name and summary_sentence, we prompt the model to stay close to the original sentence, but it must remove any mentions of change and only keep the part relevant to the individual observation (if several observations are mentioned in one sentence).
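For illustration, the following sketch shows the kind of JSON object (written here as a Python literal) we expect the LLM to return for a single FINDING/IMPRESSION sentence. The field names follow the list above, while the concrete sentence and values are hypothetical.

```python
# Hypothetical LLM output for the sentence:
# "Small left pleural effusion, unchanged from prior."
extracted_observations = [
    {
        "name": "small left pleural effusion",
        "summary_sentence": "There is a small left pleural effusion.",
        "entity": ["pleural effusion"],          # associated finding entities
        "regions": ["left pleural space"],       # associated anatomical regions
        "probability": "positive",               # positive/negative and how likely
        "temporal": [],                          # modifiers of the finding entities
        "severity": ["small"],
        "spread": [],
        "change": "unchanged",                   # longitudinal change from the report
        "change_sentence": "The small left pleural effusion is unchanged from prior.",
        "children": [],                          # sub-observations (same structure)
    }
]
```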
Graph Construction and Entity Mapping
Given the extracted information from the reports and the computed bounding boxes, we now construct the final scene graph. To this end, we first map individual fields (entity, regions, probability, modifiers, change) to pre-defined sets of values, our reference definitions. This ensures high quality and consistency of the scene graphs and enables the mapping of observations to bounding boxes. The reference definitions are based on tags used in other datasets (including PadChest [18] and Chest ImaGenome [12], [16]) as well as SNOMED-CT [19] and have been verified by clinical experts. They include synonym lists, hierarchies (categories, …), and relationships. For more robust mapping, we utilize the BioLORD [20] model as a sentence transformer and identify the closest matching concept based on semantic embeddings.
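A minimal sketch of this semantic mapping step is shown below, assuming the sentence-transformers library. The model identifier and the toy synonym list are assumptions for illustration; they are not our exact reference definitions or configuration.

```python
from sentence_transformers import SentenceTransformer, util

# Toy reference definitions: canonical concept -> synonyms (the real lists are much larger).
reference_synonyms = {
    "pleural effusion": ["pleural effusion", "fluid in the pleural space"],
    "consolidation": ["consolidation", "airspace opacity"],
}

model = SentenceTransformer("FremyCompany/BioLORD-2023")  # assumed public BioLORD checkpoint

# Embed every synonym once and remember which concept it belongs to.
synonyms, concepts = [], []
for concept, names in reference_synonyms.items():
    synonyms.extend(names)
    concepts.extend([concept] * len(names))
synonym_emb = model.encode(synonyms, convert_to_tensor=True, normalize_embeddings=True)

def map_to_concept(mention: str) -> str:
    """Map a free-text mention to the closest reference concept via embedding similarity."""
    emb = model.encode(mention, convert_to_tensor=True, normalize_embeddings=True)
    return concepts[int(util.cos_sim(emb, synonym_emb)[0].argmax())]

print(map_to_concept("effusion on the left"))  # expected: "pleural effusion"
```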
Next, we simplify the region information (merging regions or picking more precise regions) and derive default regions from the finding entities if no regions are given. We merge similar observations and check the consistency between observations. We then add pre-defined negative default observations (if no contradicting observations are present) and assemble a graph of observation nodes.
Based on the mentioned regions, we associate bounding boxes with the observations where available. Additionally, we build a tree of all mentioned regions and fill in missing intermediate regions based on the reference data. This allows us to build a graph of region nodes relevant to the study.
We construct region-region edges based on the reference data, observation-region edges (located-at) based on the mentioned observation regions, observation-observation edges based on the parent-child structure of observations (and the child type predicted by the LLM), and observation-sentence relations based on the sentences each observation was derived from.
Finally, we attach the indication information extracted from the report. To this end, we build an additional observation node based on the extracted "answer_for_indication" and the associated finding sentences (and their observations).
2. Question-Answer Generation
We generate question-answer pairs following a template-based approach based on the information available in the scene graphs. However, wherever possible, we utilize the observation sentences (derived from the report sentences) to provide diverse and fine-grained answers that are directly grounded in the written reports.
We structure each answer hierarchically, following the structure of observations, i.e. with multiple individual “top-level” answers and optionally sub-answers. Each of the individual answers contains text (the answer itself), bounding boxes (wherever available), and additional information derived from the observations in the scene graph (regions, findings, modifiers, probability, …). Additionally, we categorize the answer parts into:
- main-answers: required to answer the question; there is always at least one main-answer per question.
- details: provide additional details for the main answer.
- related-information: does not directly answer the question, but may be related and provides context.
Main answers are either created from templates or derived from observations in the scene graph. All other answer types are always derived from observations in the graph.
We utilize different generation strategies to i) identify the observations relevant for the question, ii) fill question and main-answer templates based on the information in the scene graph, and iii) convert the identified observations into answers. We use the following four generation strategies (each with one or more different templates):
“Indication” Strategy
In this strategy, we use the extracted indication (if available) as the question. The answer starts with a main-answer based on the indication observation (i.e. the answer to the indication based on the finding sentences), while detail sub-answers are constructed based on all associated finding observations. We include this question if an indication observation is present in the scene graph.
“Abnormal“ Strategy
In this strategy, we generate questions about abnormalities. This includes descriptions of the full study or of specific categories of observations (e.g. devices), descriptions of only the abnormal findings, and yes/no questions about whether positive findings (overall or of specific categories) are present in the study.
Answers to description questions include all related observations as main answers. For yes/no questions, we first create a template-based main answer and then add the related observations as detail answers. We include each of these questions for most of the scene graphs, but ignore samples where we cannot guarantee correctness based on the scene graphs.
“Region“ Strategy
In this strategy, we generate questions about anatomical regions. This includes describing regions, answering yes/no questions about whether a region is abnormal, or describing specific aspects of regions (e.g. devices).
Answers to description questions include all region-related observations as main answers. For yes/no questions, we create a template-based main answer and then add the related observations as detail answers. Additionally, we provide "related-information" answers if there are aspects in other regions that might be related (e.g. parent/child regions, other lateralities, …). We always include these questions for a set of default regions (the lungs, the heart, …) and include questions about regions mentioned in observations, as well as their parent regions. Additionally, we randomly sample regions to ask about. Their sampling probabilities are computed based on how often they are associated with positive vs. negative findings, i.e. the more often a region is associated with positive findings and the less often it is associated with negative findings, the more often we sample it as a question. This ensures that we generate additional "negative" questions for regions that are only or mostly mentioned with positive findings.
"Finding" Strategy
In this strategy, we generate questions about specific findings (radiological findings, diseases, devices, …). This includes descriptions of findings, yes/no questions about the presence of findings, the location of findings, and the severity of findings.
Answers to description questions include all finding-related observations as main answers. For yes/no, location, and severity questions, we create a template-based main answer and add the related observations as detail answers. Additionally, we provide "related-information" answers if there are aspects that might be related (e.g. other findings in the same region, related findings). We always include these questions for a set of default findings and include questions about findings mentioned in observations (positive or negative), as well as their parent findings. Additionally, we randomly sample findings to ask about. Their sampling probabilities are computed based on how often they are mentioned positively vs. negatively (over all scene graphs), i.e. the more often a finding is mentioned positively and the less often it is mentioned negatively, the more often we sample it as a question. This ensures that we generate additional "negative" questions for findings that are only or mostly mentioned positively.
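The sampling probabilities used in the region and finding strategies can be thought of as in the following sketch. It illustrates the weighting idea only; the smoothing term and normalization are assumptions, not the exact formula used during generation.

```python
from collections import Counter

def sampling_weights(pos_counts: Counter, neg_counts: Counter, smoothing: float = 1.0) -> dict:
    """Weight each concept (region or finding) by how often it is mentioned positively
    vs. negatively over all scene graphs. Concepts that are mostly mentioned positively
    receive higher weights, so additional (mostly negative) questions are sampled for them."""
    weights = {}
    for concept in set(pos_counts) | set(neg_counts):
        pos = pos_counts[concept] + smoothing
        neg = neg_counts[concept] + smoothing
        weights[concept] = pos / (pos + neg)
    total = sum(weights.values())
    return {concept: w / total for concept, w in weights.items()}

# Hypothetical counts: pneumothorax is rarely mentioned negatively, effusion often is.
probs = sampling_weights(
    Counter({"pneumothorax": 900, "pleural effusion": 5_000}),
    Counter({"pneumothorax": 100, "pleural effusion": 8_000}),
)
# -> pneumothorax receives a higher sampling probability than pleural effusion
```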
3. Evaluation
We provide two types of quality evaluations for our dataset:
- (i) Automatic quality assessment and grading (described in Section 3a)
- (ii) Quantitative validation against expert annotations (described in Section 3b)
The automatic quality assessment and grading (i) is provided for every single QA-pair (sample) in the dataset and allows filtering the dataset by different quality criteria or grades, e.g. selecting samples for pre-training or fine-tuning (see below). This assessment is conducted completely automatically by tracking the extraction process (using rules) or using an LLM as a judge. The assessment criteria have been carefully designed and the process has been overseen by two trained radiologists.
Additionally, we provide a quantitative validation of our dataset against expert annotations (ii). More precisely, we conducted an analysis on a subset of our dataset, comparing finding entity tags and bounding boxes from our scene graphs to several hand-labeled expert annotations (publicly available for subsets of MIMIC-CXR). This evaluation assesses the correctness of the scene graph annotations, including the identification of radiological findings / diseases / devices and the localization of corresponding regions. This in turn validates the quality of the QA-pairs derived from these annotations, as answer texts are either template-based (using the tags from the scene graphs) or directly derived from report sentences, while all answer tags and bounding boxes are copied from the scene graph data. For more details on this evaluation, we refer to Section 3b.
3a. Automatic Quality Assessment and Grading
Automatic Scene Graph Quality Assessment
We assess the scene graph extraction quality using simple rules and by tracking issues during the extraction, mapping, and graph construction process. The tracked criteria, each of which determines the maximum rating a sample can receive, are: region_quality, entity_quality, sentence_name_quality, change_quality, issue_level, and localization_quality.
Automatic Question-Answer Quality Assessment
We evaluate the question-answer quality using Llama 3.1 8B with few-shot prompting and the following evaluation criteria (each criterion is evaluated independently):

Criterion | Evaluation Level | Context | Options (rating) |
---|---|---|---|
Entailment | Sub-answer | | |
Relevance | Sub-answer | | |
Completeness | Full answer | | |
Question clarity | Question | | |
Answer clarity | Sub-answer | | |
Final Quality Grading
Based on the QA quality and extraction quality, we compute an overall rating for each QA-pair, considering the minimum rating of the full scene graph and all answers of the current question in this sample.
Based on these ratings, we prepare two main datasets recommended for training:
- Pre-training grade: everything with grade B or better
- Fine-tuning grade: everything with grade A or better
In these datasets, we also exclude all non-frontal images (because the bounding box quality is generally low in these cases) and remove all studies without any frontal images. Note that we also exclude samples where the evaluation failed (mainly due to issues in the LLM-based evaluation). These samples (almost 20% of all samples) are not necessarily of bad quality, but we cannot guarantee their quality and therefore do not recommend using them for training. Using larger evaluation models may reduce the number of non-validated samples, but we decided not to optimize this further, as there is already a large and diverse set of rated samples. While we recommend using one of the two datasets, we also release the full dataset, including non-validated and lower-grade samples as well as non-frontal images.
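Conceptually, the final grade of a QA pair and the two training subsets can be derived as in the following sketch. The grade names follow the statistics table in the Data Description section; the function names and the handling of unrated samples are illustrative assumptions.

```python
GRADE_ORDER = ["D", "C", "B", "A", "A+", "A++"]  # worst to best; unrated samples are excluded

def overall_grade(scene_graph_grade: str, answer_grades: list) -> str:
    """Overall rating of a QA pair = minimum over the scene-graph rating and all answer ratings."""
    return min([scene_graph_grade, *answer_grades], key=GRADE_ORDER.index)

def is_pretraining_grade(grade: str) -> bool:
    return GRADE_ORDER.index(grade) >= GRADE_ORDER.index("B")   # grade B or better

def is_finetuning_grade(grade: str) -> bool:
    return GRADE_ORDER.index(grade) >= GRADE_ORDER.index("A")   # grade A or better

print(overall_grade("A+", ["A++", "B"]))  # -> "B": pre-training grade, but not fine-tuning grade
```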
3b. Quantitative Validation against Expert Annotations
While the automatic quality assessment in Section 3a provides grades for each QA-pair, which can be useful for filtering, it does not evaluate the correctness of finding/region tags. To address this, we conducted an analysis on a subset of our dataset, comparing finding entity tags and bounding boxes to hand-labeled expert annotations (available for subsets of MIMIC-CXR). We use Chest ImaGenome's scene graphs as a baseline for comparison.
First, we evaluate the plausibility of finding tags by comparing study-level labels derived from our scene graphs to two reference annotation sets: the radiologist annotations in MIMIC-CXR-JPG [12], [14, 15] v.2.1.0 with 13 CheXpert (CXP) [21] classes and the CXR-LT 2024 [12], [22, 23] gold-standard dataset (task 2 test set) with 12 additional rare, long-tail (LT) classes. Our approach (slightly) outperforms Chest ImaGenome, with strong improvements (20%) on long-tail classes, demonstrating the value of our fine-grained finding tags (237 classes) in capturing nuanced study details:
On the MIMIC-CXR-JPG radiologist annotations (CheXpert classes):

Classes | CXP-5 | CXP-7 | CXP-13 | Micro |
---|---|---|---|---|
Ours (scene graphs) | 0.80 | 0.81 | 0.69 | 0.71 |
Chest ImaGenome | 0.78 | 0.80 | 0.66 | 0.67 |

On the CXR-LT 2024 gold-standard annotations:

Classes | CXP-7 | CXP-13 | LT-only | CXR-LT | Micro |
---|---|---|---|---|---|
Ours (scene graphs) | 0.65 | 0.57 | 0.71 | 0.64 | 0.67 |
Chest ImaGenome | 0.65 | 0.56 | 0.59 | 0.58 | 0.64 |
To evaluate the accuracy of finding bounding boxes, we compare them with annotations from MS-CXR [12], [24, 25] (on 6 of the 8 classes with positive samples on all datasets) and REFLACX [12], [26, 27] (on 18 of the 29 classes with positive samples on all datasets). We compute study-level pixel masks for each finding as the union of all bounding boxes from positive observation nodes that contain the specific finding tag.
We calculate pixel-level Intersection-over-Union (IoU), Intersection-over-Prediction (IoP), and Intersection-over-Target (IoT) for each finding class, considering only image pairs with positive predictions and targets. Thresholding at 30% IoU/IoP/IoT, we micro-average the results, reporting the percentage of accurately localized finding-boxes.
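For reference, the three overlap metrics and the 30% thresholding can be computed as in the following sketch for a pair of binary study-level masks (NumPy arrays of equal shape); this is an illustration of the standard definitions, not the exact evaluation script.

```python
import numpy as np

def overlap_metrics(pred_mask: np.ndarray, target_mask: np.ndarray):
    """Pixel-level IoU, IoP (intersection over prediction), and IoT (intersection over target)."""
    inter = float(np.logical_and(pred_mask, target_mask).sum())
    union = float(np.logical_or(pred_mask, target_mask).sum())
    pred_area = float(pred_mask.sum())
    target_area = float(target_mask.sum())
    iou = inter / union if union > 0 else 0.0
    iop = inter / pred_area if pred_area > 0 else 0.0
    iot = inter / target_area if target_area > 0 else 0.0
    return iou, iop, iot

def accurately_localized(pred_mask, target_mask, threshold=0.3):
    """Boolean (IoU@30, IoP@30, IoT@30) flags, micro-averaged over finding-boxes in the report."""
    return tuple(metric >= threshold for metric in overlap_metrics(pred_mask, target_mask))
```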
On the IoU metric, our scene graphs perform slightly better than the ones from Chest ImaGenome.
The low IoP values indicate that bounding boxes are often too large, but high IoT values suggest that they generally cover the finding boxes well. This discrepancy arises because bounding boxes are derived from anatomical regions mentioned in reports, whereas hand-labeled annotations are more precise. Notably, our approach produces more precise boxes (higher IoP) than Chest ImaGenome, likely due to our large number of fine-grained region annotations (311 region classes).
MS-CXR:

Metric | IoU@30 | IoP@30 | IoT@30 |
---|---|---|---|
Ours (scene graphs) | 0.51 | 0.56 | 0.94 |
Chest ImaGenome | 0.45 | 0.48 | 0.98 |

REFLACX:

Metric | IoU@30 | IoP@30 | IoT@30 |
---|---|---|---|
Ours (scene graphs) | 0.45 | 0.54 | 0.87 |
Chest ImaGenome | 0.42 | 0.46 | 0.95 |
Data Description
Dataset Statistics
Number of samples and their quality:
| | Train | Val | Test | Total |
|---|---|---|---|---|
# Patients | 64,524 | 500 | 293 | 65,317 |
# Studies | 222,180 | 1,805 | 3,254 | 227,239 |
# QA pairs | 41,239,042 | 600,763 | 333,022 | 42,172,827 |
→ Fine-tuning grade | 7,378,344 | 58,486 | 95,451 | 7,532,281 |
→ Pre-training grade | 30,542,190 | 246,233 | 442,483 | 31,230,906 |
→ Rating A++ | 1,338,959 | 10,267 | 18,775 | 1,368,001 |
→ Rating A+ | 1,092,771 | 8,610 | 14,408 | 1,115,789 |
→ Rating A | 5,237,408 | 41,758 | 68,414 | 5,347,580 |
→ Rating B | 24,241,667 | 197,103 | 373,911 | 24,812,681 |
→ Rating C | 683,310 | 5,848 | 11,589 | 700,747 |
→ Rating D | 534,169 | 4,326 | 9,268 | 547,763 |
→ Unrated | 8,110,758 | 65,110 | 104,398 | 8,280,266 |
# Sub-answers | 88,876,344 | 717,117 | 1,321,739 | 90,915,200 |
Number of questions for each question type and strategy:
| Question Strategy (identifier) | Question Type (identifier) | Train | Val | Test | Total |
|---|---|---|---|---|---|
| Indication (indication) | Indication (A_indication) | 213,506 | 1,741 | 3,047 | 218,294 |
| Abnormal (abnormal) | | 11,652,463 | 94,095 | 168,826 | 11,915,384 |
| | describe_all (B01_describe_all) | 203,868 | 1,644 | 2,921 | 208,433 |
| | describe_abnormal (B02_describe_abnormal) | 203,868 | 1,644 | 2,921 | 208,433 |
| | is_abnormal (B03_is_abnormal) | 203,868 | 1,644 | 2,921 | 208,433 |
| | is_normal (B04_is_normal) | 203,868 | 1,644 | 2,921 | 208,433 |
| | describe_subcat (B08_describe_subcat) | 2,307,749 | 18,652 | 33,928 | 2,360,329 |
| | describe_abnormal_subcat (B09_describe_abnormal_subcat) | 2,307,749 | 18,652 | 33,928 | 2,360,329 |
| | is_abnormal_subcat (B10_is_abnormal_subcat) | 2,307,749 | 18,652 | 33,928 | 2,360,329 |
| | is_normal_subcat (B11_is_normal_subcat) | 1,630,944 | 13,152 | 23,368 | 1,667,464 |
| | describe_device (B12_describe_device) | 934,111 | 7,534 | 13,014 | 954,659 |
| | has_devices (B13_has_devices) | 934,111 | 7,534 | 13,014 | 954,659 |
| | describe_acquisition (B14_describe_acquisition) | 6,842 | 55 | 120 | 7,017 |
| | describe_imaging_artifacts (B15_describe_imaging_artifacts) | 203,868 | 1,644 | 2,921 | 208,433 |
| | has_imaging_artifacts (B16_has_imaging_artifacts) | 203,868 | 1,644 | 2,921 | 208,433 |
| Region (region_abnormal) | | 20,169,684 | 162,217 | 288,807 | 20,620,708 |
| | describe_region (C01_describe_region) | 2,768,270 | 22,031 | 36,846 | 2,827,147 |
| | describe_abnormal_region (C02_describe_abnormal_region) | 2,768,278 | 21,889 | 37,010 | 2,827,177 |
| | is_abnormal_region (C03_is_abnormal_region) | 2,773,041 | 22,049 | 36,976 | 2,832,066 |
| | is_normal_region (C04_is_normal_region) | 2,772,856 | 21,924 | 37,058 | 2,831,838 |
| | describe_region_device (C07_describe_region_device) | 4,543,059 | 37,021 | 70,203 | 4,650,283 |
| | has_region_device (C08_has_region_device) | 4,544,180 | 37,303 | 70,714 | 4,652,197 |
| Finding (finding) | | 9,203,389 | 74,969 | 140,083 | 9,418,441 |
| | describe_finding (D01_describe_finding) | 2,491,473 | 20,352 | 38,273 | 2,550,098 |
| | has_finding (D02_has_finding) | 2,492,466 | 20,395 | 38,355 | 2,551,216 |
| | where_is_finding (D03_where_is_finding) | 1,975,568 | 15,966 | 29,618 | 2,021,152 |
| | how_severe_is_finding (D04_how_severe_is_finding) | 1,297,647 | 10,364 | 17,602 | 1,325,613 |
| | describe_device (D05_describe_device) | 318,476 | 2,706 | 5,446 | 326,628 |
| | has_device (D06_has_device) | 313,717 | 2,587 | 5,374 | 321,678 |
| | where_is_device (D07_where_is_device) | 314,042 | 2,599 | 5,415 | 322,056 |
Files and Structure
Directory Structure
├── metadata (2.3 GB)
│ ├── patient_metadata.csv.gz
│ ├── study_metadata.csv.gz
│ ├── image_metadata.csv.gz
│ ├── question_metadata.csv.gz
│ ├── question_image_metadata.csv.gz
│ ├── answer_metadata.csv.gz
│ ├── answer_image_metadata.csv.gz
│ └── dataset_info.json
├── stats (4.5 GB)
│ └── ...
├── scene_data.zip (1.3 GB)
├── qa.zip (7.5 GB)
├── exports (12.4 GB)
│ └── ...
└── quality_mappings.csv (5 KB)
Metadata ("metadata" dir)
We provide metadata for all scene graphs and question-answer pairs in the `metadata` directory. The metadata is provided on different levels (patient, study, image, question, question-image, answer, and answer-image), with a corresponding number of rows per file. Each metadata file is provided in two redundant versions:
- `.csv.gz` (compressed CSV): for easy interpretation
- `.parquet`: for fast reading
These metadata files can be used to filter the dataset on different levels (patient, study, question, …) by different criteria. To this end, each file comes with unique IDs and additional metadata relevant to that level. An overview is provided below:
Metadata file | 1 row per | Index columns | Example metadata | Total # Rows |
---|---|---|---|---|
patient_metadata | patient | patient_id | | 65,317 |
study_metadata | study | patient_id, study_id | | 227,239 |
image_metadata | image | patient_id, study_id, image_id | | 376,175 |
question_metadata | question | patient_id, study_id, question_id | | 42,172,827 |
question_image_metadata | question-image pair | patient_id, study_id, question_id, image_id | | 70,045,778 |
answer_metadata | answer | patient_id, study_id, question_id, answer_id | | 90,915,200 |
answer_image_metadata | answer-image pair | patient_id, study_id, question_id, answer_id, image_id | | 151,539,450 |
Additionally, the `dataset_info.json` file describes the sets of possible values for the different tags of answers/observations, i.e. possible finding entity names, region names, finding categories and subcategories, answer types, modifiers, etc.
Statistics ("stats" dir)
We provide additional information and statistics about scene graphs and question-answer pairs in the `stats` directory. This includes aggregate statistics as well as observation-level (for scene graphs) and answer-level (for questions) information. It may, for example, be used for more advanced data filtering or to compute dataset characteristics without having to load individual scene graphs or QA samples (which would be much more expensive).
Aggregate statistics about scene graphs are named `study*.csv` and include (among others) the percentages of positive/negative observations for different regions, entities, and categories.
Observation-level information for scene graphs is provided in files named `all_obs*.csv` and includes (among others) information about positive/negative observations, entities, regions, and categories.
(Sub-)answer-level information for QA samples is provided in files named `all_ans*.csv` and includes (among others) information about positive/negative answers, entities, regions, and categories.
Scene Graph Format ("scene_data.zip")
All scene graphs (and related metadata) can be found in the `scene_data.zip` file, which contains a folder structure in the following format:
p1x/p1xxxxxxx/sxxxxxxxx.scene_graph.json
p1x/p1xxxxxxx/sxxxxxxxx.metadata.json
where `p1x` refers to the first 2 digits of the `subject_id`, `p1xxxxxxx` to the full `subject_id`, and `sxxxxxxxx` to the full `study_id`.
The `sxxxxxxxx.metadata.json` file contains study metadata as also provided in the `study_metadata.csv.gz` file.
The `sxxxxxxxx.scene_graph.json` file contains the scene graph in the following format:
{
"patient_id": "p1xxxxxxx", // see metadata
"study_id": "sxxxxxxxx", // see metadata
// original report sentences (= sentence nodes of scene graph)
"sentences": {
"S01": {
"sent_id": "S01",
"section": "FINDINGS",
"section_type": "FINDINGS",
"sentence": "No new focal consolidation."
},
... // more sentences
},
// keys for "observations"
"top_level_obs_ids": ["O01", "O02", ...],
// observations in the study (= observation nodes of scene graph)
"observations": {
"O01": {
"obs_id": "O01",
"name": "no focal consolidation",
"summary_sentence": "There is no focal consolidation.",
"child_type": null,
"child_level": 0,
"regions": [{"region": "lungs", "distances": []}],
"non_resolved_regions": [],
"laterality": "bilateral",
"default_regions": ["lungs"],
"obs_entities": ["consolidation"],
"obs_entities_parents": [],
"non_resolved_obs_entities": [],
"obs_categories": ["ANATOMICAL_FINDING", "DISEASE"],
"obs_subcategories": ["LUNG_FIELD", "PULMONARY_DISEASES", "INFECTION"],
"probability": "negative",
"certainty": "certain",
"positiveness": "neg",
"modifiers": {"temporal": [],"severity": [], "texture": [], "spread": ["focal"]},
"changes": ["no new"],
"change_sentence": "No new focal consolidation is visible.",
"from_report": true, // derived from report sentences or template-based?
"obs_quality": {...},
"localization": {
// one item for each image
"[image_id]": {
"image_id": "[image_id]",
"localization_reference_ids": ["lungs"],
// list of bboxes in (x_1, y_1, x_2, y_2) format in pixel coordinates
"bboxes": [[888.0, 370.0, 1610.0, 1642.0 ],
[136.0, 402.0, 898.0, 1678.0 ]],
"instance_mask_ids": ["lungs"],
"missing_localization": [],
"is_fallback": false,
"localization_quality": ...
},
},
},
... // more observations
},
// information related to indication section of report
"indication": {
"indication_summary": "Female with HIV, experiencing chest pain and dyspnea; should be evaluated for infiltrate and effusion.",
"patient_info": "Female, HIV-positive, with chest pain and dyspnea.",
"indication": "Chest pain and dyspnea.",
"evaluation": "Evaluate for infiltrate and effusion.",
"associated_sentence_ids": ["S05", ...],
"associated_obs_ids": ["O03", ...],
"answer_for_indication": {
// this has the same form as an observation node in "observations"
"obs_id": "OIND", // this ID is always the same
"name": "...",
...
}
},
// regions relevant for the study (= region nodes of scene graph)
"regions": {
"left lung": {
"region": "left lung",
"laterality": "left",
"localization": { ... } // same format as for observation nodes
"region_localization_quality": ...
},
... // more regions
},
// relations between observation and region nodes
"located_at_relations": [
{"region": "lungs", "observation_id": "O01", "distances": [], "where_specified": "direct"},
... // more relations
],
// relations between observation node pairs
"obs_relations": [
{"parent_observation_id": "O02", "child_observation_id":"O02.01", "child_type":"associated_with"},
... // more relations
],
// relations between observation and sentence nodes
"obs_sent_relations": [
{"observation_id": "O01", "sentence_id": "S01"},
... // more relations
],
// relations between region node pairs
"region_region_relations": [
{"region": "lungs", "related_region": "left lung", "relation_type": "sub_region"},
{"region": "left lung", "related_region": "right lung", "relation_type": "right"},
... // more relations
],
// quality levels for different aspects (larger = better)
"study_quality": {
...
},
// localization quality level per image-id (larger = better)
"study_img_localization_quality": {
...
}
}
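As an example of consuming this format, the following sketch loads a single scene-graph file and collects the bounding boxes of all positive observations. The file path is a placeholder following the folder structure above.

```python
import json

# Placeholder path (p1x / p1xxxxxxx / sxxxxxxxx follow the folder structure described above).
with open("p1x/p1xxxxxxx/sxxxxxxxx.scene_graph.json") as f:
    scene_graph = json.load(f)

positive_boxes = []
for obs_id, obs in scene_graph["observations"].items():
    if obs["positiveness"] != "pos":  # keep only positive observations
        continue
    for image_id, loc in obs.get("localization", {}).items():
        for bbox in loc["bboxes"]:  # (x_1, y_1, x_2, y_2) in pixel coordinates
            positive_boxes.append({"obs_id": obs_id, "name": obs["name"],
                                   "image_id": image_id, "bbox": bbox})
```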
Question-Answer Format ("qa.zip")
All question-answer data can be found in the `qa.zip` file, which contains a folder structure in the following format:
p1x/p1xxxxxxx/sxxxxxxxx.qa.json
where `p1x` refers to the first 2 digits of the `subject_id`, `p1xxxxxxx` to the full `subject_id`, and `sxxxxxxxx` to the full `study_id`.
Each of the `sxxxxxxxx.qa.json` files contains all question-answer pairs (and additional tags) for a single study in the following format:
{
"patient_id": "p1xxxxxxx", // see metadata
"study_id": "sxxxxxxxx", // see metadata
"questions": [
// -> one object per question-answer pair
{
"question_id": "xxxxxxxxxxxx", // see metadata
"question_type": "describe_all", // template used for generation
"question_strategy": "abnormal", // strategy used for generation
"variables": { ... }, // template variables used for generation
"obs_ids":["O01", ...], // observations (from scene graph) used in answer
"contains_report_answers": true/false, // any answers derived from report sentences?
"contains_template_answers": true/false, // any answers based on templates but not directly from sentences?
"extraction_quality": { ... },
"question_img_localization_quality": { ... },
"question": "Describe the given study.",
// list of sub-answers (top-level answers with their sub-answers)
"answers":[
{
"answer_id": "xxxxxxxxxxxx", // see metadata
"answer_type":"main_answer", // main_answer, details, or related_information
"answer_level": 0, // 0 for top-level, >0 for each child-level
"text": "There is no focal consolidation.", // this is the answer text
"name_tag": "No focal consolidation", // summary name of this sub-answer
"laterality": "bilateral",
"regions": ["lungs"],
"obs_entities": ["consolidation"],
"obs_entities_parents": [],
"obs_categories": ["ANATOMICAL_FINDING", "DISEASE"],
"obs_subcategories": ["LUNG_FIELD", "PULMONARY_DISEASES", "INFECTION"],
"certainty": "certain",
"positiveness": "neg",
// list of modifiers (tuples of modifier type and value)
"modifiers": [["spread", "focal"]],
"localization": {
// one item for each image
"[image_id]": {
"image_id": "[image_id]",
"localization_reference_ids": ["lungs"],
// list of bboxes in (x_1, y_1, x_2, y_2) format in pixel coordinates
"bboxes": [[888.0, 370.0, 1610.0, 1642.0 ],
[136.0, 402.0, 898.0, 1678.0 ]],
"instance_mask_ids": ["lungs"],
"missing_localization": [],
"is_fallback": false,
"localization_quality": ...
},
},
// contain child-answers if there are any (same object format as top-level answer)
"sub_answers": [],
"from_report": true/false, // derived from report sentences?
"extraction_quality": {...},
"answer_quality": {...},
},
... // more top-level answers
],
"question_quality": {...}
},
... // more questions
]}
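Analogously, the following sketch loads one QA file and flattens each question with its main-answer texts and bounding boxes into simple records; field names follow the example above and the path is again a placeholder.

```python
import json

with open("p1x/p1xxxxxxx/sxxxxxxxx.qa.json") as f:
    qa_data = json.load(f)

def iter_answers(answers, level=0):
    """Yield all (sub-)answers of a question, depth-first."""
    for ans in answers:
        yield level, ans
        yield from iter_answers(ans.get("sub_answers", []), level + 1)

records = []
for question in qa_data["questions"]:
    for level, ans in iter_answers(question["answers"]):
        if ans["answer_type"] != "main_answer":  # skip "details" and "related_information"
            continue
        records.append({
            "question": question["question"],
            "answer": ans["text"],
            "bboxes": {img_id: loc["bboxes"]
                       for img_id, loc in ans.get("localization", {}).items()},
        })
```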
Exports ("exports" dir)
Here we provide subsets of the full dataset.
We provide two full copies of the dataset:
- `A_frontal` (Fine-tuning grade): Only questions with a quality rating of A, A+, or A++, and only frontal images (7,532,281 QA-pairs, 3 GB). We recommend this subset for fine-tuning / instruction-tuning purposes.
- `B_frontal` (Pre-training grade): Only questions with a quality rating of B, A, A+, or A++, and only frontal images (31,230,906 QA-pairs, 9.4 GB). We recommend this subset for pre-training purposes. (This is a superset of `A_frontal`.)
Each of these subsets contains the `metadata` dir, `scene_data.zip`, and `qa.zip`.
Additionally, we provide filtered metadata files for further subsets of these. They are provided as sub-folders in the metadata dirs. We provide the following:
- `A_frontal/metadata/Ap`: quality rating of A+ or A++ (2,389,739 QA-pairs)
- `A_frontal/metadata/App`: quality rating of A++ (1,318,885 QA-pairs)
- `A_frontal/metadata/q1M`: random 1M-question subset with A, A+, or A++ ratings (1M QA-pairs)
- `A_frontal/metadata/Ap_q1M`: random 1M-question subset with A+ or A++ ratings (1M QA-pairs)
- `A_frontal/metadata/App_q1M`: random 1M-question subset with A++ ratings (1M QA-pairs)
- `B_frontal/metadata/q1M`: random 1M-question subset with B, A, A+, or A++ ratings (1M QA-pairs)
Quality mappings ("quality_mappings.csv")
This file provides mappings from the raw encodings of quality values (fields in the JSON files or columns in the metadata files) to their corresponding quality ratings.
Usage Notes
Dataset Utility
The dataset can be used for a variety of tasks, including:
- Fine-grained finding and pathology classification
- Pathology localization
- Fine-grained longitudinal analysis
- Structured and grounded radiology report generation
- Structured, grounded, and localized visual question answering
- Further derived datasets and tasks
Loading and Filtering the Data
All data can be loaded directly from the provided files using standard utilities (e.g. pandas, unzip, json-loaders).
A simple approach would be the following:
- Select the relevant samples by loading, merging, and filtering the metadata files (e.g. using pandas). Also consider the quality_mappings.csv file when filtering based on quality ratings.
- Load the relevant scene-graph / QA files per study (e.g. by first extracting the zip files using unzip and then loading the JSON files), as shown in the sketch below.
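A hedged end-to-end sketch of this workflow is shown below. The Parquet file names, the question_strategy column, and the zip-internal path layout are assumptions based on the descriptions above and should be checked against the actual files (and quality_mappings.csv) before use.

```python
import json
import zipfile
import pandas as pd

# 1) Load and merge metadata (the .parquet versions are faster to read than .csv.gz).
questions = pd.read_parquet("metadata/question_metadata.parquet")  # assumed file name
studies = pd.read_parquet("metadata/study_metadata.parquet")       # assumed file name
merged = questions.merge(studies, on=["patient_id", "study_id"], how="left")

# 2) Filter by any available criterion, e.g. by generation strategy (assumed column name).
selected = merged[merged["question_strategy"] == "finding"]

# 3) Load the corresponding QA files study by study (assuming the zip mirrors the
#    p1x/p1xxxxxxx/sxxxxxxxx.qa.json layout described above).
with zipfile.ZipFile("qa.zip") as qa_zip:
    for (patient_id, study_id), _ in selected.groupby(["patient_id", "study_id"]):
        path = f"{patient_id[:3]}/{patient_id}/{study_id}.qa.json"
        with qa_zip.open(path) as f:
            qa_data = json.load(f)
        # ... use qa_data["questions"] for the selected study
```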
Known Limitations
Our template-based questions and answers may limit variability and introduce grammatical errors, though this is mitigated by deriving some answers directly from report sentences and by our quality assessment.
Additionally, our approach focuses on individual studies and we do not include longitudinal or differential questions.
Finally, our dataset is a silver dataset—constructed using models, rules, and templates—without human annotations. As a result, it may contain errors introduced by these models or our approach and should be interpreted with caution.
Comparison with Other Datasets
Comparison with Scene Graphs Datasets
Compared to other scene graph datasets derived from MIMIC-CXR, our scene graphs provide bounding boxes for both observation and region nodes, rewritten sentences for each observation, and fine-grained finding and region classes.
| | Ours | Chest ImaGenome | RadGraph |
|---|---|---|---|
Bounding boxes | regions and observations (derived from fine-grained anatomical structures) | only for regions | none |
# Finding classes | 221 | 53 | no mapping |
# Region classes | 310 (257 with bboxes) | 29 | no mapping |
Rewritten sentences per observation | yes | no | no (but text spans are provided) |
Hierarchical observations | yes | no | no (but relationships provided) |
Longitudinal | no | yes | no |
Extraction method | Segmentation model + LLM + semantic concept matching | Object detector + rule-based | Relation extraction model |
Comparison with VQA Datasets
Compared to existing VQA datasets, our dataset provides three key benefits: a) it provides structured full-sentence answers instead of simple short answers, b) it provides bounding boxes for each (sub-)answer, and c) it is much larger than (most) other datasets. Note, however, that some existing VQA datasets use clinical annotators and may thus provide more reliable answers.
| | Ours | VQA-RAD | SLAKE | MIMIC-Ext-MIMIC-CXR-VQA | Medical-CXR-VQA dataset | CheXinstruct |
|---|---|---|---|---|---|---|
Bounding boxes | yes, per sub-answer | no | no | no | no | no |
Answer structure | multi-granular, hierarchical, full sentence answers with additional tags | short answers (no full sentences) | short answers (no full sentences) | short answers (no full sentences) | short answers (no full sentences) | short answers (no full sentences) |
# QA-Pairs | 32.6M | 3.5K (includes non-CXR) | 14K (includes non-CXR) | 377K | 780K | 8.5M |
Question construction | template-based | clinical annotators | clinical annotators | template-based (+ LLM-paraphrasing) | template-based | template-based |
Answer construction | mixture of template-based answers and answers derived from report sentences | clinical annotators | clinical annotators | template-based (+ LLM-paraphrasing) | template-based | template-based |
Answer data source | Our scene graphs (LLM + concept matching) | clinical annotators | clinical annotators | Chest ImaGenome | LLM-based extraction from reports | image annotations (depending on source dataset) |
Comparison with Grounded Radiology Report Datasets
Some other datasets provide grounded CXR descriptions. However, these only provide study-level descriptions instead of question-specific answers and bboxes, such that they cannot be used for QA tasks. Also, they do not provide structured class/tag annotations.
| | Ours | MedTrinity-25M | MAIRA-2 Dataset |
|---|---|---|---|
Bounding boxes | per sub-answer (question-specific) | ROIs (1 or few per study) | per observation |
Individual questions | yes | no | no |
Class/tag mappings | findings, regions, … | no | no |
Annotation scheme | LLM-based, conditioned on original report + automatic quality control | MLLM-based, partially conditioned on original report | unknown |
Ethics
The dataset is a derivative dataset of MIMIC-CXR and thus no new patient data was collected. The ethics approval of the dataset follows from that of the parent MIMIC dataset.
We confirm that all data processing, generation, and training were conducted entirely within a local and secure environment, ensuring data safety and privacy. This includes all usage of LLMs, localization, and embedding models, as well as all (vision-language) model training and evaluation. No data was sent to external APIs or processed by any third-party services.
Acknowledgements
This work was partially funded by ERC Grant Deep4MI (Grant No. 884622).
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
- Chen Z, Varma M, Xu J, Paschali M, Veen DV, Johnston A, et al (2024). "A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation". arXiv preprint. arXiv:2401.12208v2.
- Tu T, Azizi S, Driess D, Schaekermann M, Amin M, Chang PC, et al (2024). "Towards generalist biomedical AI". NEJM AI. 1(3). doi:10.1056/AIoa2300138.
- Müller P, Kaissis G, Rueckert D (2024). "ChEX: Interactive Localization and Region Description in Chest X-rays". In: Leonardis A, Ricci E, Roth S, Russakovsky O, Sattler T, Varol G, editors. Computer Vision – ECCV 2024. Cham: Springer Nature Switzerland. doi:10.1007/978-3-031-72664-4_6.
- Lee S, Youn J, Kim H, Kim M, Yoon SH (2025). "CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images". Eur Radiol. doi:10.1007/s00330-024-11339-6.
- Shaaban MA, Khan A, Yaqub M (2024). "MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis". arXiv preprint. arXiv:2403.15585.
- Lau JJ, Gayen S, Ben Abacha A, Demner-Fushman D (2018). "A dataset of clinically generated visual questions and answers about radiology images". Sci Data. 5(1). doi:10.1038/sdata.2018.251.
- Liu B, Zhan LM, Xu L, Ma L, Yang Y, Wu XM (2021). "Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering". In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). doi: 10.1109/ISBI48211.2021.9434010.
- Bae S, Kyung D, Ryu J, Cho E, Lee G, Kweon S, et al (2024). "EHRXQA: A multi-modal question answering dataset for electronic health records with chest x-ray images". In: Proceedings of the 37th International Conference on Neural Information Processing Systems.
- Hu X, Gu L, Kobayashi K, Liu L, Zhang M, Harada T, et al (2024). "Interpretable medical image visual question answering via multi-modal relationship graph learning". Medical Image Analysis. 97:103279. doi:10.1016/j.media.2024.103279.
- Johnson A, Pollard T, Mark R, Berkowitz S, Horng S (2024). "MIMIC-CXR Database (version 2.1.0)". PhysioNet. doi:10.13026/4jqj-jw95.
- Johnson AEW, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng C, et al (2019). "MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports". Sci Data. 6(1):317. doi:10.1038/s41597-019-0322-0.
- Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, et al (2000). "PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals". Circulation. 101(23). doi:10.1161/01.CIR.101.23.e215.
- Seibold C, Jaus A, Fink MA, Kim M, Reiß S, Herrmann K, et al (2023). "Accurate fine-grained segmentation of human anatomy in radiographs via volumetric pseudo-labeling". arXiv preprint. arXiv:2306.03934.
- Johnson A, Lungren M, Peng Y, Lu Z, Mark R, Berkowitz S, et al (2024). "MIMIC-CXR-JPG - chest radiographs with structured labels (version 2.1.0)". PhysioNet. doi:10.13026/jsn5-t979.
- Johnson AEW, Pollard TJ, Greenbaum NR, Lungren MP, Deng C, Peng Y, et al (2019). "MIMIC-CXR: A large publicly available database of labeled chest radiographs". arXiv preprint. arXiv:1901.07042.
- Wu J, Agu N, Lourentzou I, Sharma A, Paguio J, Yao JS, et al (2021). "Chest ImaGenome Dataset (version 1.0.0)". PhysioNet. doi:10.13026/wv01-y230.
- Grattafiori A, Dubey A, Jauhri A, Pandey A, Kadian A, Al-Dahle A, et al (2024). "The llama 3 herd of models". arXiv preprint. arXiv:2407.21783.
- Bustos A, Pertusa A, Salinas JM, de la Iglesia-Vayá M (2020). "Padchest: A large chest x-ray image dataset with multi-label annotated reports". Medical image analysis. 66:101797. doi:10.1016/j.media.2020.101797.
- SNOMED International (2023). "SNOMED CT". https://www.snomed.org.
- Remy F, Demuynck K, Demeester T (2024). "BioLORD-2023: semantic textual representations fusing large language models and clinical knowledge graph insights". Journal of the American Medical Informatics Association. 31(9):1844-55. doi:10.1093/jamia/ocae029.
- Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, et al (2019). "CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison". Proceedings of the AAAI Conference on Artificial Intelligence. 33(01):590–7. doi:10.1609/aaai.v33i01.3301590.
- Holste G, Lin M, Wang S, Zhou Y, Wei Y, Chen H, et al (2025). "CXR-LT: Multi-Label Long-Tailed Classification on Chest X-Rays". PhysioNet. doi:10.13026/RYJ9-X506.
- Holste G, Zhou Y, Wang S, Jaiswal A, Lin M, Zhuge S, et al (2024). "Towards long-tailed, multi-label disease classification from chest X-ray: Overview of the CXR-LT challenge". Medical Image Analysis. 97:103224. doi:10.1016/j.media.2024.103224.
- Boecking B, Usuyama N, Bannur S, Castro DC, Schwaighofer A, Hyland S, et al (2024). "MS-CXR: Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing". PhysioNet. doi:10.13026/9G2Z-JG61.
- Boecking B, Usuyama N, Bannur S, Castro DC, Schwaighofer A, Hyland S, et al (2022). "Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing". In: Avidan S, Brostow G, Cissé M, Farinella GM, Hassner T, editors. Computer Vision – ECCV 2022. Cham: Springer Nature Switzerland. doi:10.1007/978-3-031-20059-5_1.
- Lanfredi RB, Zhang M, Auffermann W, Chan J, Duong PA, Srikumar V, et al (2021). "REFLACX: Reports and eye-tracking data for localization of abnormalities in chest x-rays". PhysioNet. doi:10.13026/E0DJ-8498.
- Lanfredi RB, Zhang M, Auffermann WF, Chan J, Duong PAT, Srikumar V, et al (2022). "REFLACX, a dataset of reports and eye-tracking data for localization of abnormalities in chest x-rays". Sci Data. 9(1):350. doi:10.1038/s41597-022-01441-z.
Parent Projects
MIMIC-Ext-CXR-QBA was derived from MIMIC-CXR.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/8qmz-da41
DOI (latest version):
https://doi.org/10.13026/6193-he91
Topics:
chest x-rays
vqa
localization
scene graphs
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project