Database Credentialed Access

CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

Hyungyung Lee, Geon Choi, Jung Oh Lee, Hangyul Yoon, Hyuk Gi Hong, Edward Choi

Published: Oct. 15, 2025. Version: 1.0.0


When using this resource, please cite:
Lee, H., Choi, G., Lee, J. O., Yoon, H., Hong, H. G., & Choi, E. (2025). CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/z3dn-nh22

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.

Abstract

Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning.

To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning. The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to four visual inputs, and supports multi-path, multi-stage evaluation including visual grounding via anatomical region selection and diagnostic measurements.

This dataset is intended to serve as a standardized resource for developing, evaluating, and comparing vision-language models on clinically grounded reasoning tasks in chest X-rays.


Background

Large Vision-Language Models (LVLMs) have recently been applied to medical tasks such as report generation and visual question answering (VQA) [1–3]. Chest X-rays are commonly used for such evaluations due to their clinical importance and widespread availability.

However, existing benchmarks [4–7] mainly assess whether the model's final diagnostic output is correct, offering limited insight into how that conclusion was reached. This lack of reasoning-level evaluation poses a critical challenge in clinical AI, where transparency and interpretability are essential. Although some recent studies [8–11] include explanatory text or visual grounding, they still focus on outputs rather than the intermediate reasoning steps underlying diagnostic decisions. Without evaluating these steps, such as identifying anatomical structures, performing quantitative measurements, and applying clinical rules [12], it is difficult to determine whether a model genuinely understands medical images or relies on superficial correlations.

To address this gap, we introduce CheXStruct and CXReasonBench, two complementary resources for evaluating diagnostic reasoning from chest X-rays. CheXStruct provides structured reference reasoning by modeling clinically relevant intermediate steps, while CXReasonBench uses these structured outputs to benchmark LVLMs on their reasoning consistency and alignment with clinical logic. While prior structured datasets [13–15] provide bounding box annotations linking report-derived findings to image regions, they primarily focus on high-level findings and offer limited supervision for intermediate reasoning steps. Together, CheXStruct and CXReasonBench enable transparent, step-by-step evaluation of model reasoning beyond final diagnostic accuracy.


Methods

CheXStruct

CheXStruct is a fully automated pipeline designed to extract structured clinical information directly from chest X-rays. It performs anatomical segmentation, identifies anatomical landmarks, and derives diagnostic measurements. The pipeline then computes diagnostic indices, applies clinical thresholds, and conducts task-specific quality control based on expert-defined guidelines.

To guide this process, we defined a set of 12 diagnostic tasks in collaboration with clinical experts, categorized into two groups: radiological findings and image quality assessments.

  • Radiological Finding: The selected findings are diagnosable from chest X-rays alone, without requiring additional patient information such as clinical history or symptoms (unlike findings such as pneumonia). Tasks: cardiomegaly, mediastinal widening, carina angle, trachea deviation, aortic knob enlargement, ascending aorta enlargement, descending aorta enlargement, and descending aorta tortuous.
  • Image Quality Assessment: These tasks evaluate the technical adequacy of image acquisition, ensuring that the radiograph meets basic quality standards for accurate interpretation. Tasks: inclusion, inspiration, rotation, projection.

Structured Clinical Information Extraction

To enable structured diagnostic reasoning, CheXStruct extracts clinically meaningful information from raw chest X-ray images through a multi-step process.

Defining Task-Specific Criteria: Each of the 12 diagnostic tasks is grounded in clinical measurement guidelines, which fall into one of two categories:

  • Standardized, Quantifiable Criteria: For tasks with well-established criteria that do not depend on imaging metadata (e.g., pixel spacing, which maps pixel distances to real-world distances), we adopt standard clinical measurement rules. For instance, cardiomegaly is assessed using the cardiothoracic ratio (CTR), defined as the ratio of the maximum horizontal cardiac width to the thoracic width; as a ratio, it does not require absolute measurements (e.g., centimeters), which cannot be computed from the X-ray image alone (see the sketch after this list).
  • Expert-Defined Criteria: To ensure reliable evaluation in tasks lacking standardized criteria or involving measurement ambiguity (e.g., when diagnosis depends on absolute measurements derived from imaging metadata), we collaborated with clinical experts to define quantifiable, clinically meaningful standards that guide structured diagnostic reasoning. For instance, mediastinal widening is typically assessed based on whether the mediastinal width exceeds 8 cm, but such absolute measurements cannot be obtained directly from X-ray images without pixel spacing. To address this, we measure the mediastinal width and the lung width at the same axial level, and assess mediastinal widening based on their ratio, enabling consistent evaluation across images.
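
The sketch below illustrates the two ratio-based criteria described above. It is not the exact CheXStruct implementation: the landmark x-coordinates are hypothetical pixel values, and only the 0.5 CTR threshold for PA views (referenced later in this document) is used for labeling.

def cardiothoracic_ratio(cardiac_left_x, cardiac_right_x, thorax_left_x, thorax_right_x):
    """CTR = maximum horizontal cardiac width / thoracic width (both in pixels)."""
    cardiac_width = abs(cardiac_right_x - cardiac_left_x)
    thoracic_width = abs(thorax_right_x - thorax_left_x)
    return cardiac_width / thoracic_width

def mediastinal_to_lung_width_ratio(mediastinal_width_px, lung_width_px):
    """Ratio-based surrogate for the 8 cm rule, measured at the same axial level."""
    return mediastinal_width_px / lung_width_px

# Example with hypothetical landmark x-coordinates (pixels); a CTR above 0.5 on a PA view suggests cardiomegaly.
ctr = cardiothoracic_ratio(820, 1460, 540, 1760)
print(f"CTR = {ctr:.2f} -> {'Abnormal' if ctr > 0.5 else 'Normal'}")
print(f"Mediastinal-to-lung width ratio = {mediastinal_to_lung_width_ratio(310, 1220):.2f}")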

Segmenting Anatomical Structures: Each diagnostic task requires segmentation of specific anatomical regions. We use CXAS [16], a chest X-ray segmentation model trained on expert-curated data, to obtain the necessary segmentation masks. For example, cardiomegaly requires heart and lung masks.

Quality Control

CheXStruct enforces task-specific quality control (QC) to ensure that the extracted structured information is both anatomically valid and clinically reliable. These QC rules, developed in collaboration with clinical experts, automatically exclude low-quality samples from the benchmark.

Quality Criteria Definition: We define QC rules specific to each task’s anatomical structures and associated measurement criteria. For instance, for inspiration, we verify that the right posterior ribs are segmented in anatomically consistent positions, maintaining proper spatial order and spacing.

Threshold-Based Filtering: Using the defined QC rules, CheXStruct filters out samples that fail to meet task-specific standards. For instance, we discard samples with overlapping or disordered rib masks (inspiration). Diagnostic labels are assigned only to samples passing QC, based on expert-defined thresholds. Table 1 in the Data Description section shows the number of cases extracted by CheXStruct for each diagnostic task remaining after quality control, along with their corresponding diagnostic label distribution.
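
The following is a minimal sketch of the kind of rule-based check described above, not the actual CheXStruct QC code. It assumes each right posterior rib mask has been reduced to a hypothetical (y_top, y_bottom) vertical extent in pixels and verifies top-to-bottom ordering without overlap.

def ribs_pass_qc(rib_extents, min_gap_px=0):
    """rib_extents: list of (y_top, y_bottom) tuples ordered by rib number."""
    for (_, bottom_prev), (top_next, _) in zip(rib_extents, rib_extents[1:]):
        if top_next < bottom_prev + min_gap_px:   # next rib starts within the previous one
            return False                          # overlapping or disordered -> exclude sample
    return True

# Example: the third rib mask overlaps the second, so the sample would be discarded.
print(ribs_pass_qc([(100, 160), (180, 240), (230, 290)]))  # False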

CXReasonBench

Building on this structured data, we developed CXReasonBench, a multi-path, multi-stage benchmark for evaluating and guiding diagnostic reasoning. Each case is represented as a QA sequence that outlines a series of diagnostic reasoning steps. These sequences simulate two evaluation paths, direct reasoning and guided reasoning, each consisting of multiple stages such as criterion selection, visual region identification, quantitative measurement, and threshold-based decision-making. Together, these steps provide a structured framework for model assessment and form the basis for generating fine-grained, clinically grounded QA pairs.

Evaluation Pipeline

Initial Diagnostic Decision and Path Assignment: Each evaluation begins with a binary diagnostic question (e.g., Does this patient have cardiomegaly?). The model is given three options: Yes, No, or “I don’t know.” If the model selects Yes or No, it proceeds to Path 1; if it selects “I don’t know,” it is directed to Path 2. A minimal sketch of this routing is provided at the end of the Evaluation Pipeline description below.

Path 1: Direct Reasoning Process Evaluation

Path 1 evaluates whether the model’s diagnostic decision is supported by the sequence of clinically coherent reasoning steps outlined below. At each stage, an incorrect response terminates the evaluation, under the assumption that clinically valid reasoning cannot be inferred beyond that point. Refer to our paper [18] for stage-specific reference answers and detailed definitions for each task.

  • Stage 1. Diagnostic Criterion Selection: The model identifies the criterion used for the initial diagnostic question, allowing us to determine if its decision was based on clinically accepted knowledge. Failure to select a valid criterion prevents assessment of the diagnostic reasoning applied, rendering the rest of the reasoning process irrelevant.
  • Stage 1.5. Refined Criterion Adoption (Expert-Defined Criteria Only): For diagnostic tasks whose original criteria are difficult to apply to images directly (e.g., because they rely on imaging metadata) or are inherently ambiguous, the model is offered an expert-defined criterion. If the model accepts it, it proceeds to Stage 2; otherwise, it advances to Path 2.
  • Stage 2. Anatomical Structure Identification: The model selects all relevant anatomical regions or reference lines from highlighted chest X-ray images related to the criterion. If the model actually used the criterion for the initial diagnosis or genuinely accepted and understood an expert-defined criterion, it should have visually assessed the associated anatomical structures to apply the criterion correctly. This stage tests whether such visual grounding truly occurred.
  • Stage 3. Measurement or Recognition: The model applies the diagnostic criterion. In measurement-type tasks, it performs arithmetic computations based on anatomical measurements (e.g., calculating the CTR) and selects a value range (e.g., [0.50–0.52]; see the sketch after this list). In recognition-type tasks, it interprets anatomical changes (e.g., identifying tracheal deviation) and selects the appropriate label (e.g., “shifted to the left”). This stage ensures that the model explicitly applies the criterion to reach a decision; failure here indicates that the model misinterpreted the criterion or was unable to apply it as intended.
  • Stage 4. Final Decision: The model makes a final decision based on Stage 3. The evaluation of this decision depends on whether standardized diagnostic thresholds are available for the task. For tasks with standardized thresholds (e.g., CTR of 0.5 for cardiomegaly in PA view), no threshold is provided to the model. For other tasks, an expert-defined threshold is provided.
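
As a small illustration of the Stage 3 range-selection format for measurement-type tasks, the sketch below maps a computed index to a range option. The bin edges (width 0.02, starting at 0.30) are hypothetical and may differ from the options actually presented in the benchmark.

def value_range_option(value, lower=0.30, width=0.02):
    """Map a computed index (e.g., a CTR of 0.514) to a range option such as '[0.50-0.52]'."""
    lo = lower + width * int((value - lower) // width)
    return f"[{lo:.2f}-{lo + width:.2f}]"

print(value_range_option(0.514))  # [0.50-0.52]
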
Path 2: Guided Reasoning and Re-evaluation

Path 2 is triggered when the model either responds with “I don’t know” to the initial diagnostic question or rejects the expert-defined criterion in Stage 1.5. The goal here is to evaluate whether the model can follow structured guidance to develop its reasoning process and later apply this learned reasoning to a new case within the same diagnostic task. Evaluation is terminated upon failure at any stage, as each step is interdependent; failure at an early stage indicates insufficient foundational knowledge for diagnostic reasoning, which impacts subsequent stages outlined below. Refer to our paper [18] for reference answers and definitions for each task and stage.

  • Stage 1. Anatomical Structures Identification: The model is shown highlighted chest X-rays and asked to identify specific anatomical structures (e.g., “Which image shows the heart?”). This stage evaluates whether the model can visually recognize and distinguish relevant anatomical regions, an essential prerequisite for accurate diagnostic reasoning.
  • Stage 2. Guided Measurement or Recognition: The model receives detailed visual annotations (e.g., labeled landmarks, coordinates) along with instructions for a guided diagnostic assessment. For instance, in evaluating cardiomegaly, heart and thoracic widths are labeled, coordinates are given, and the cardiothoracic ratio calculation is explained. This stage tests the model’s ability to follow structured guidance and derive clinically meaningful conclusions from annotated images.
  • Stage 3. Final Decision: The model is provided with a diagnostic threshold and asked to make a final decision based on the result from Stage 2. This stage evaluates whether the model can apply the threshold appropriately to reach an accurate diagnostic conclusion.

Re-evaluated Path 1 after Guidance: If the model successfully completes guided reasoning, it is considered to have acquired the correct reasoning process. This path evaluates whether the model can internalize and independently apply that reasoning to a new case within the same diagnostic task. The model is re-evaluated using the same structure as Path 1, except Stage 1.5 is omitted, as the criterion has already been introduced during guidance.

  • Random Re-selection: A new case is randomly chosen from the set of cases for which the model either failed to reach the final stage in Path 1 or initially responded with “I don’t know,” excluding the case just used for guidance in Path 2.
  • Re-evaluation Pipeline: The model is asked the diagnostic question in the same format as the initial diagnostic decision. If the model responds incorrectly or with “I don’t know,” the evaluation ends, indicating that the model has not internalized the reasoning process or cannot generalize it to a new case. If the model answers correctly, it re-enters Path 1 and resumes the evaluation.
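
The sketch below summarizes the routing between the initial decision, Path 1, Path 2, and Re-evaluated Path 1 described above. It is not the official evaluation harness: ask_model, the per-stage question/answer pairs, and the re-selection of an unsolved case are placeholders to be supplied by the user, and the Stage 1.5 redirect to Path 2 is noted only in a comment.

def run_stages(stages, ask_model):
    """Evaluation terminates at the first incorrect stage response."""
    for stage in stages:
        if ask_model(stage["question"], stage.get("images")) != stage["answer"]:
            return False
    return True

def evaluate_case(case, unsolved_cases_same_task, ask_model):
    """Route one case through Path 1, Path 2, and the re-evaluated Path 1."""
    initial = ask_model(case["init_question"], case["image"])     # "Yes" / "No" / "I don't know"
    if initial != "I don't know":
        # Path 1: direct reasoning (a rejected expert-defined criterion in Stage 1.5
        # would also redirect to Path 2; omitted here for brevity).
        return run_stages(case["path1_stages"], ask_model)
    if not run_stages(case["path2_stages"], ask_model):           # Path 2: guided reasoning
        return False
    new_case = unsolved_cases_same_task[0]                        # placeholder for random re-selection
    if ask_model(new_case["init_question"], new_case["image"]) != new_case["init_answer"]:
        return False                                              # incorrect or "I don't know" ends evaluation
    return run_stages(new_case["re_path1_stages"], ask_model)     # Re-evaluated Path 1 (Stage 1.5 omitted)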

Benchmark Structure

Benchmark Format: The benchmark uses a multiple-choice format with both single-choice (e.g., one diagnostic criterion) and multi-choice questions (e.g., all relevant anatomical structures) depending on the stage.

  • Two-round format with “Need new option”: In certain stages, a two-round format is used. In the first round, the correct answer is intentionally excluded and the model is presented with a “Need new option” choice. If the model selects this option, a second round is triggered in which the correct answer is revealed. This format assesses whether the model can recognize an insufficient option set and appropriately defer its decision until a more complete one is presented (see the sketch after this list).
  • “None of the above” option: When the “Need new option” is not included, a “None of the above” option is provided instead. In this case, the correct answer is always included in the initial options. Selecting “None of the above” results in failure, and the model is expected to explain its reasoning.
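
A minimal sketch of the two-round interaction is given below. It is a simplified interpretation (exact string matching against the reference answers, no partial credit); consult the paper [18] and the GitHub repository [19] for the scoring actually used. ask_model is a user-supplied placeholder that returns the model's selected option string.

def run_two_round_item(sample, ask_model):
    """Administer a QA item from a two-round option format (question/answer are 2-element lists)."""
    questions, answers = sample["question"], sample["answer"]
    pred_round1 = ask_model(questions[0])
    # In two-round formats the reference round-1 answer includes "Need new option(s)";
    # a model that commits without deferring cannot be fully correct.
    if "Need new option" not in pred_round1:
        return False
    pred_round2 = ask_model(questions[1])
    # Exact string matching against both rounds' reference answers (a simplification).
    return pred_round1 == answers[0] and pred_round2 == answers[1]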

Data Description

CheXStruct

CheXStruct extracts structured information directly from chest X-rays. The table below shows the number of cases extracted by CheXStruct for each diagnostic task remaining after quality control, along with their corresponding diagnostic label distribution.

Table 1: Number of cases and diagnostic label distribution per diagnostic task after quality control.
Diagnostic Task Total Cases (Label Distribution)
cardiomegaly 184,169 (Normal: 112,329, Abnormal: 71,840)
mediastinal widening 174,905 (Normal: 118,092, Abnormal: 56,813)
trachea deviation 66,588 (Normal: 57,878, Deviated: 8,710)
carina angle 122,085 (Normal: 29,010, Abnormal: 93,075)
aortic knob enlargement 174,949 (Normal: 156,183, Abnormal: 18,766)
ascending aorta enlargement 75,009 (Normal: 71,126, Abnormal: 3,883)
descending aorta enlargement 106,634 (Normal: 106,274, Abnormal: 360)
descending aorta tortuous 120,330 (Normal: 116,420, Abnormal: 3,910)
inclusion 144,221 (Included: 132,594, Excluded: 11,627)
inspiration 148,222 (Good: 127,848, Poor: 20,374)
rotation 51,095 (Not Rotated: 41,155, Rotated: 9,940)
projection 92,938 (Small Overlap: 53,524, Large Overlap: 39,414)

CXReasonBench

CXReasonBench is constructed by applying CheXStruct to MIMIC-CXR-JPG [17]. We randomly sampled 100 cases for each of the 12 diagnostic tasks (1,200 in total) from the structured information extracted by CheXStruct, and a clinical expert manually reviewed all corresponding segmentation masks for quality assurance. From each case, we generate QA pairs across three distinct evaluation settings (Path 1, Path 2, and Re-evaluated Path 1), supporting multi-stage evaluation and resulting in 18,988 QA pairs in total: 8,044 in Path 1, 3,600 in Path 2, and 7,344 in Re-evaluated Path 1. Each QA pair is associated with up to four visual inputs (e.g., the original X-ray and overlaid segmentation masks).

Files and Structure

The dataset consists of two main components, with CSV files for structured labels and JSON and image files supporting the evaluation pipeline, organized into dedicated folders reflecting diagnostic tasks and evaluation paths:

  1. CheXStruct: Structured diagnostic outputs (labels, anatomical landmarks, measurements, and indices) extracted from chest X-ray images across multiple tasks. The data is stored in CSV files, with column descriptions in an accompanying JSON file.
  2. CXReasonBench: A multi-stage benchmark designed to evaluate diagnostic reasoning using the structured outputs from CheXStruct. It includes multi-stage QA files, segmentation-mask overlays, and point annotations for each diagnostic task.

Directory Structure

base/
├── CheXStruct/
│   ├── CheXStruct_column_descriptions.json
│   ├── global/
│   │   ├── abdominal_xray.csv
│   │   ├── mask_number.csv
│   │   └── window.csv
│   └── diagnostic_tasks/
│       ├── aortic_knob_enlargement.csv
│       ├── ascending_aorta_enlargement.csv
│       ├── cardiomegaly.csv
│       ├── ...
│       └── trachea_deviation.csv
└── CXReasonBench/
    ├── dx_by_dicoms.json
    ├── pnt_on_cxr.zip
    ├── segmask_bodypart.zip
    └── qa.zip

File Format and Contents

CheXStruct/

The global/ folder contains CSV files related to global filtering tasks used by the CheXStruct pipeline to exclude non-frontal or corrupted chest X-ray images.

The diagnostic_tasks/ folder contains CSV files corresponding to 12 diagnostic tasks.

In total, the CheXStruct dataset includes 15 CSV files, each containing structured outputs extracted by the CheXStruct pipeline.

CSV File Structure

Each CSV file shares common columns across all tasks:

  • image_file: Unique identifier for each chest X-ray image.
  • viewposition: Chest X-ray view position, either PA (posteroanterior) or AP (anteroposterior).
  • label: Binary indicator (1 or 0) denoting the presence or absence of the specific radiological finding or the status of the image quality assessment as determined by the CheXStruct pipeline.

In addition to these, each CSV contains task-specific columns representing:

  • Anatomical landmarks: Coordinates or features derived from segmentation masks (e.g., cardiac endpoints, thoracic endpoints).
  • Diagnostic measurements: Quantitative measurements calculated from anatomical landmarks (e.g., cardiac width, thoracic width).
  • Diagnostic indices: Computed values based on diagnostic measurements (e.g., cardiothoracic ratio).

Detailed descriptions of each CSV file’s columns are provided in CheXStruct_column_descriptions.json, which includes all column names and their definitions for each diagnostic task and global filtering file.
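
As a short example (assuming pandas is installed and the files have been downloaded into the working directory), one task file and the accompanying column descriptions can be loaded as follows; the task-specific columns vary by file.

import json
import pandas as pd

df = pd.read_csv("CheXStruct/diagnostic_tasks/cardiomegaly.csv")

with open("CheXStruct/CheXStruct_column_descriptions.json", "r", encoding="utf-8") as f:
    column_descriptions = json.load(f)   # column names and definitions per diagnostic task / global file

print(df[["image_file", "viewposition", "label"]].head())   # common columns
print(df["label"].value_counts())                           # binary label distribution (see the 'label' description above)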

CXReasonBench/
  • dx_by_dicoms.json: This file contains a dictionary listing DICOM IDs for each diagnostic task used in CXReasonBench. These DICOM IDs are consistent with those from the MIMIC-CXR-JPG dataset. Specifically, the file includes 100 sampled cases for each of the 12 diagnostic tasks (1,200 in total), derived from the structured clinical information extracted by the CheXStruct pipeline.
  • pnt_on_cxr.zip:
    • Contains chest X-ray images with overlaid anatomical landmarks used in Stage 2 of Path 2.
    • Internal folder structure:
    pnt_on_cxr/
    ├── cardiomegaly/
    │   ├── {dicom_id}.png
    │   └── ...
    ├── aortic_knob_enlargement/
    │   ├── {dicom_id}.png
    │   └── ...
    └── ...
    
  • segmask_bodypart.zip:
    • Contains chest X-ray images overlaid with anatomical segmentation masks. Used in Stage 2 of Path 1, the re-evaluation of Path 1, and Stage 1 of Path 2.
    • Internal folder structure:
    segmask_bodypart/
    ├── cardiomegaly/
    │   ├── heart/
    │   │   ├── {dicom_id}.png
    │   │   └── ...
    │   └── thoracic_width_heart/
    │       ├── {dicom_id}.png
    │       └── ...
    ├── aortic_knob_enlargement/
    │   ├── aortic_knob/
    │   │   ├── {dicom_id}.png
    │   │   └── ...
    │   └── ...
    └── ...
    
  • qa.zip:
    • Contains multi-stage diagnostic reasoning QA samples in JSON format, organized by:
    • diagnostic task: cardiomegaly/, carina_angle/, etc.
    • evaluation path: path1/, path2/, re-path1/
    • stage: init/, stage1/, ..., stage{N}/
    • option format:
      • basic/: The correct answer is always included in the initial options.
      • two-round/: Two-round format where the correct answer is intentionally excluded in the first round; selecting "Need new option" triggers a second round revealing the correct answer.
      • two-round_partial_inclusion/: Two-round multi-choice format; only some correct answers are included in the first round, requiring "Need new option" to access the full answer set in the second round.
      • two-round_none_included/: Two-round format; no correct answers are included in the first round, and the model must select "Need new option" to see the correct answer in the second round.

    Internal folder structure:

    qa/
    ├── {diagnostic_task}/
    │   ├── path1/
    │   │   ├── init/
    │   │   │   └── basic/
    │   │   │       ├── {dicom_id}.json
    │   │   │       └── ...
    │   │   ├── stage1/
    │   │   │   ├── basic/
    │   │   │   └── two-round/
    │   │   ├── [optional] stage1.5/
    │   │   │   └── basic/
    │   │   ├── stage2/
    │   │   │   ├── basic/
    │   │   │   ├── two-round_partial_inclusion/
    │   │   │   └── two-round_none_included/
    │   │   └── stage{N}/
    │   │       └── basic/
    │   ├── path2/
    │   │   └── stage{M}/
    │   │       └── basic/
    │   └── re-path1/
    │       ├── init/
    │       │   └── basic/
    │       ├── stage1/
    │       │   ├── basic/
    │       │   └── two-round/
    │       ├── stage2/
    │       │   ├── basic/
    │       │   ├── two-round_partial_inclusion/
    │       │   └── two-round_none_included/
    │       └── stage{L}/
    │           └── basic/
    └── ...
    

    Each JSON file contains a list of dictionaries (JSON objects). Each dictionary includes the following keys:

    • dicom: The unique DICOM identifier of the chest X-ray image, corresponding to the MIMIC-CXR-JPG dataset.
    • dx: The name of the diagnostic task that this QA sample pertains to.
    • inference_type: The type of evaluation pipeline stage the QA sample belongs to:
      • "reasoning" corresponds to Path 1, where the model performs direct diagnostic reasoning.
      • "guidance" corresponds to Path 2, where the model is guided through structured reasoning steps.
      • "review" corresponds to Re-evaluated Path 1, a re-evaluation after guided reasoning.
    • stage: The name of the evaluation stage within the path. This indicates which step of the diagnostic reasoning process the question evaluates.
    • question: The text of the question presented to the model at this stage; all answer options are embedded directly within the question text. For two-round formats, this field is a list containing one question per round.
    • answer: The correct (reference) answer to the question; for two-round formats, a list with one entry per round.
    • img_path: The relative file path(s) to the image(s) that accompany the question.

    Example instance:

    {
      "dicom": "0f12d882-1edf68c3-33914d6f-40fbb9e4-73b57d38",
      "dx": "cardiomegaly",
      "inference_type": "reasoning",
      "stage": "bodypart",
      "question": [
        "In the following images, either a segmentation mask of a specific body part or a reference line is shown. Based on the selected criterion, select all images that include the relevant body part or reference line required for applying that criterion. If only some of the required body parts or reference lines are present, select the corresponding image(s) and choose 'Need new options' to request the missing ones. If none of the relevant body parts or reference lines are present, select only 'Need new options'. Options: (a) 1st image, (b) 2nd image, (c) 3rd image, (d) 4th image, (e) Need new options",
        "Additional options are now provided. Select any additional image(s) that include the relevant body part or reference line required for applying the selected criterion. If you choose 'None of the above', please explain which body parts you used. Options: (a) 1st image, (b) 2nd image, (c) 3rd image, (d) 4th image, (e) None of the above"
      ],
      "answer": [
        "(b) 2nd image, (e) Need new options.",
        "(d) 4th image"
      ],
      "img_path": [
        [
          "/aortic_knob_enlargement/trachea/0f12d882-1edf68c3-33914d6f-40fbb9e4-73b57d38.png",
          "/cardiomegaly/thoracic_width_heart/0f12d882-1edf68c3-33914d6f-40fbb9e4-73b57d38.png",
          "/ascending_aorta_enlargement/borderline/0f12d882-1edf68c3-33914d6f-40fbb9e4-73b57d38.png",
          "/inclusion/lung_both/0f12d882-1edf68c3-33914d6f-40fbb9e4-73b57d38.png"
        ],
        [
          "/carina_angle/carina/0f12d882-1edf68c3-33914d6f-40fbb9e4-73b57d38.png",
          "/inspiration/midclavicularline/0f12d882-1edf68c3-33914d6f-40fbb9e4-73b57d38.png",
          "/mediastinal_widening/thoracic_width_mw/0f12d882-1edf68c3-33914d6f-40fbb9e4-73b57d38.png",
          "/cardiomegaly/heart/0f12d882-1edf68c3-33914d6f-40fbb9e4-73b57d38.png"
        ]
      ]
    }

Usage Notes

How to Use the Dataset

The dataset is designed to support the development and evaluation of vision-language models on clinically grounded diagnostic reasoning tasks in chest X-rays. Users can employ the data to:

  • Benchmark models on step-by-step diagnostic reasoning, not just final answers.
  • Evaluate the model's ability to handle multi-stage reasoning using the provided structured QA format.

The QA JSON files can be loaded into Python as follows:

import json

# Path to a single QA JSON file
file_path = "CXReasonBench/qa/{dx_name}/{evaluation_path}/{stage}/{option_format}/{dicom}.json"

# Open and load the JSON file; a file may hold a single QA dictionary or a list of them
with open(file_path, "r", encoding="utf-8") as file:
    qa_data = json.load(file)
samples = qa_data if isinstance(qa_data, list) else [qa_data]

# Extract relevant fields from the first QA sample
sample = samples[0]
question = sample["question"]   # Question text with embedded options (a list for two-round formats)
answer = sample["answer"]       # Reference answer (a list for two-round formats)
img_path = sample["img_path"]   # Relative path(s) to the associated image(s)

# Depending on the stage, img_path may be a single path, a flat list of paths,
# or a list of lists (one sub-list per round), so flatten it before joining
def flatten(paths):
    if isinstance(paths, str):
        yield paths
    else:
        for p in paths:
            yield from flatten(p)

# Generate paths for body part segmentation masks
img_path_lst_bodypart = [f"segmask_bodypart{p}" for p in flatten(img_path)]

# Generate paths for points on CXR
img_path_lst_pnt_on_cxr = [f"pnt_on_cxr{p}" for p in flatten(img_path)]

# Print example
print("Question:", question)
print("Answer:", answer)
print("Image path:", img_path)

For detailed information on the dataset and guidance on how to use it, please refer to the paper [18] and the GitHub repository [19].

Limitations & Future Direction

This dataset focuses on diagnostic tasks that are structurally inferable through anatomical segmentation and does not include findings that require pixel-level analysis or additional patient context. Future versions of the dataset will expand coverage to include a wider range of diagnostic tasks and imaging features, including those that require both structural and pixel-level analysis. Ultimately, we aim to broaden the benchmark to more comprehensively reflect real-world diagnostic reasoning.


Release Notes

1.0.0 - Initial Release


Ethics

The authors have no ethical concerns to declare. This study utilized the MIMIC-CXR-JPG dataset, which was collected under institutional review board (IRB) approval and has been fully de-identified in accordance with applicable regulations.


Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Chen Z, Varma M, Delbrouck JB, Paschali M, Blankemeier L, Van Veen D, et al. Chexagent: towards a foundation model for chest X-ray interpretation. arXiv [preprint]. 2024; arXiv:2401.12208.
  2. Deperrois N, Matsuo H, Ruiperez-Campillo S, Vandenhirtz M, Laguna S, Ryser A, et al. RadVLM: a multitask conversational vision-language model for radiology. arXiv [preprint]. 2025; arXiv:2502.03333.
  3. Lin T, Zhang W, Li S, Yuan Y, Yu B, Li H, et al. HealthGPT: a medical large vision-language model for unifying comprehension and generation via heterogeneous knowledge adaptation. arXiv [preprint]. 2025; arXiv:2502.09838.
  4. Ben Abacha A, Sarrouti M, Demner-Fushman D, Hasan S, Muller H. Overview of the VQA-Med task at ImageCLEF 2021: visual question answering and generation in the medical domain. In: Proceedings of the CLEF 2021 Conference and Labs of the Evaluation Forum – Working Notes. 2021.
  5. He X, Zhang Y, Mou L, Xing E, Xie P. PathVQA: 30,000+ questions for medical visual question answering. arXiv [preprint]. 2020; arXiv:2003.10286.
  6. Lau J, Gayen S, Ben Abacha A, Demner-Fushman D. A dataset of clinically generated visual questions and answers about radiology images. Sci Data. 2018; 5(1):1–10.
  7. Zhang X, Wu C, Zhao Z, Lin W, Zhang Y, Wang Y, et al. PMC-VQA: visual instruction tuning for medical visual question answering. arXiv [preprint]. 2023; arXiv:2305.10415.
  8. Bae S, Kyung D, Ryu J, Cho E, Lee G, Kweon S, et al. EHRXQA: a multi-modal question answering dataset for electronic health records with chest X-ray images. Adv Neural Inf Process Syst. 2023; 36:3867–3880.
  9. Hu X, Gu L, An Q, Zhang M, Liu L, Kobayashi K, et al. Expert knowledge-aware image difference graph representation learning for difference-aware medical visual question answering. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023; 4156–4165.
  10. Liu B, Zou K, Zhan L, Lu Z, Dong X, Chen Y, et al. Gemex: a large-scale, groundable, and explainable medical VQA benchmark for chest X-ray diagnosis. arXiv [preprint]. 2024; arXiv:2411.16778.
  11. Zuo Y, Qu S, Li Y, Chen Z, Zhu X, Hua E, et al. MedXpertQA: benchmarking expert-level medical reasoning and understanding. arXiv [preprint]. 2025; arXiv:2501.18362.
  12. Truszkiewicz K, Poręba R, Gac P. Radiological cardiothoracic ratio in evidence-based medicine. J Clin Med. 2021; 10(9):2016.
  13. Bannur S, Bouzid K, Castro D, Schwaighofer A, Thieme A, Bond-Taylor S, et al. MAIRA-2: grounded radiology report generation. arXiv [preprint]. 2024; arXiv:2406.04449.
  14. Castro D, Bustos A, Bannur S, Hyland S, Bouzid K, Wetscherek M, et al. PadChest-GR: a bilingual chest X-ray dataset for grounded radiology report generation. NEJM AI. 2025; 2(7):AIdbp2401120.
  15. Wu J, Agu N, Lourentzou I, Sharma A, Paguio J, Yao J, et al. Chest Imagenome dataset for clinical reasoning. arXiv [preprint]. 2021; arXiv:2108.00316.
  16. Seibold C, Jaus A, Fink M, Kim M, Reiß S, Herrmann K, et al. Accurate fine-grained segmentation of human anatomy in radiographs via volumetric pseudo-labeling. arXiv [preprint]. 2023; arXiv:2306.03934.
  17. Johnson A, Pollard T, Greenbaum N, Lungren M, Deng CY, Peng Y, et al. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv [preprint]. 2019; arXiv:1901.07042.
  18. Lee H, Choi G, Lee JO, Yoon H, Hong H, Choi E. CXReasonBench: a benchmark for evaluating structured diagnostic reasoning in chest X-rays. arXiv [preprint]. 2025; arXiv:2505.18087.
  19. CheXStruct & CXReasonBench GitHub Repository. Available from: https://github.com/ttumyche/CXReasonBench . Accessed 9 Oct 2025.

Parent Projects
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays was derived from MIMIC-CXR-JPG [17]. Please cite it when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research

