Database Credentialed Access

RadVLM Instruction Dataset

Nicolas Deperrois Hidetoshi Matsuo Samuel Ruiperez-Campillo Moritz Vandenhirtz Sonia Laguna Alain Ryser Koji Fujimoto Mizuho Nishio Thomas Sutter Julia Vogt Jonas Kluckert Thomas Frauenfelder Christian Bluethgen Farhad Nooralahzadeh Michael Krauthammer

Published: Sept. 25, 2025. Version: 1.0.0


When using this resource, please cite:
Deperrois, N., Matsuo, H., Ruiperez-Campillo, S., Vandenhirtz, M., Laguna, S., Ryser, A., Fujimoto, K., Nishio, M., Sutter, T., Vogt, J., Kluckert, J., Frauenfelder, T., Bluethgen, C., Nooralahzadeh, F., & Krauthammer, M. (2025). RadVLM Instruction Dataset (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/et5g-h222

Additionally, please cite the original publication:

Deperrois, N., Matsuo, H., Ruipérez-Campillo, S., Vandenhirtz, M., Laguna, S., Ryser, A., ... & Krauthammer, M. (2025). RadVLM: A Multitask Conversational Vision-Language Model for Radiology. arXiv preprint arXiv:2502.03333.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.

Abstract

We release the RadVLM instruction dataset, a large-scale resource used to train the RadVLM model on diverse radiology tasks. The dataset contains 1,115,021 image–instruction pairs spanning five task families: (i) report generation from frontal CXRs using filtered Findings/Impression text; (ii) abnormality classification for the standard 14 CheXpert labels; (iii) anatomy grounding; (iv) abnormality detection and grounding; and (v) phrase grounding from report sentences. To support interactive use, we include ~89k LLM-generated multi-turn, multi-task conversations (~3k with spatial grounding) derived from image-linked attributes (reports, labels, boxes). Creation involved curating datasets from public sources, excluding lateral views, removing prior-study references and other non-image context from reports, fusing multi-reader annotations, and harmonizing label and coordinate formats. The resource is intended for training CXR assistants across diverse radiology tasks and within a conversational format.


Background

The shortage of trained personnel for CXR interpretation has led to the exploration of automated systems to assist physicians. These systems are trained on CXR datasets that pair images with structured attributes, such as condition labels [1, 2] or free-text reports [3–6]. To broaden clinical utility, datasets should extend beyond classification or report generation to support questions about CXR technique, region-specific findings, localization of specific abnormalities, and definitions of medical terms, with queries posed flexibly and across multiple turns [7]. Inspired by recent advances in the general domain [8–10], developers of medical models have introduced datasets capturing multi-turn user-assistant interactions for diverse medical tasks [11–13].

Despite these advancements, there remains a need for instruction datasets designed specifically for CXR interpretation. In this direction, datasets used to train models such as CheXagent [14], RaDialog [15], or MAIRA-2 [16] were released, extending beyond report generation to tasks such as observation grounding and visual question answering. However, their capacity to handle diverse and complex user queries, or to respond accurately to multiple prompts within an arbitrary conversational framework, remains limited.

In this release, we introduce a compact, multitask conversational dataset specialized in CXR interpretation. To this end, we curate comprehensive CXR datasets, each featuring diverse modalities including free-text reports, abnormality labels, and visual coordinates, and organize them into a unified instruction dataset. It comprises single-turn image–instruction pairs for predefined tasks and image–conversation pairs designed for more flexible, multi-turn interactions. This dataset was used to train a vision-language architecture [10], resulting in the RadVLM model [17], which demonstrated state-of-the-art performance across diverse radiology tasks as well as strong conversational abilities.


Methods

Instruction dataset

A key step in the development of RadVLM is the construction of an instruction dataset. For this purpose, we first aggregate and process multiple publicly available datasets containing CXR images paired with various attributes, including free-text reports, categorical labels, and bounding boxes. From these sources, we generate a dataset of over 1 million instruction instances, each consisting of a frontal CXR image and a corresponding user–assistant interaction derived from the available attributes. These interactions take the form of either a single Q&A designed for a predefined task (single instructions) or a multi-turn exchange (conversations), following the design commonly used for general-domain VLMs [18, 19]. Note that the full instruction dataset described in the RadVLM publication [17] draws on additional source datasets [20, 21] that are not included in this release due to their data use agreement terms.

Free-text report generation

In alignment with existing CXR models, we aim to generate clinically coherent radiology reports from CXR images. To achieve this, we collect public datasets containing CXRs paired with anonymized free-text reports. Radiology reports often refer to and compare with prior X-ray examinations when discussing current findings; when report generation is trained on raw reports, these earlier images should be provided as part of the prompt [16, 23]. Because we focus on the analysis of a single CXR image, we instead used GPT-4o through a secure Azure OpenAI service, configured per the PhysioNet guidelines, to remove mentions of prior studies, consistent with recent work on report generation [6, 14].
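As an illustration, this filtering step can be sketched with the Azure OpenAI Python client as below; the prompt wording, deployment name, and decoding settings are illustrative assumptions, not the exact configuration used to build the dataset.

# Hedged sketch of the prior-study removal step; prompt text, deployment name,
# and credentials are placeholders, not the exact values used for this dataset.
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<azure-api-key>",
    api_version="2024-06-01",
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

FILTER_INSTRUCTION = (
    "You are given the Findings/Impression text of a chest X-ray report. "
    "Rewrite it so that all references to prior studies or comparisons with "
    "earlier examinations are removed, keeping all other findings unchanged."
)

def remove_prior_references(report_text: str) -> str:
    # Ask the chat model to return only the filtered report text.
    response = client.chat.completions.create(
        model="gpt-4o",  # Azure deployment name (assumption)
        messages=[
            {"role": "system", "content": FILTER_INSTRUCTION},
            {"role": "user", "content": report_text},
        ],
        temperature=0.0,  # deterministic rewriting
    )
    return response.choices[0].message.content.strip()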

We leverage the following dataset for the report generation task:

  • MIMIC-CXR [24], which contains 377,110 CXR images paired with free-text radiology reports describing findings and patient history. After filtering, we retain 230,980 image–text pairs for the training set.

For this task, the instructions are designed such that the user asks the assistant to generate a report for a given CXR, and the assistant responds with the filtered report corresponding to that CXR.
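For example, a single-turn report-generation datapoint can be assembled as in the sketch below; the question templates are illustrative and may differ from the phrasing pool actually used.

import random

# Illustrative question templates (assumption).
REPORT_PROMPTS = [
    "Please write a radiology report for this chest X-ray.",
    "Provide the findings for the given CXR.",
]

def make_report_instruction(image_path: str, filtered_report: str) -> dict:
    # Build one single-instruction datapoint in the LLaVA-style format
    # used throughout this dataset.
    question = random.choice(REPORT_PROMPTS)
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": f"<image>\n{question}"},
            {"from": "gpt", "value": filtered_report},
        ],
    }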

Abnormality classification

Another essential task an AI assistant should be capable of is identifying the presence of abnormalities on a CXR. While simpler than generating detailed, unstructured observations, this functionality serves as a quick overview for physicians, highlighting key findings before a more detailed analysis.

For this task, we collect CXR datasets paired with abnormality labels. These labels were extracted from the original textual reports via the CheXbert automatic labeling tool [25], which identifies whether each of 14 possible abnormalities is present, absent, or uncertain. In our setup, we only consider frontal images and—in line with previous work [6, 26]—treat “uncertain” labels as “absent.”

We use 237,912 pairs from MIMIC-CXR [24] (Table 1). The instructions are designed such that the user asks for the abnormalities present on the CXR, and the assistant answers by providing the list of abnormalities.
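A minimal sketch of how such a classification answer can be derived from CheXbert outputs is shown below; the answer phrasing is an assumption, and the numeric encoding (1 = present, 0 = absent, -1 = uncertain) follows the usual CheXpert convention.

CHEXPERT_LABELS = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
    "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia",
    "Pneumothorax", "Support Devices",
]

def labels_to_answer(chexbert_row: dict) -> str:
    # Keep only labels marked present (1); uncertain (-1) is treated as absent.
    present = [name for name in CHEXPERT_LABELS
               if chexbert_row.get(name) == 1 and name != "No Finding"]
    return ", ".join(present) if present else "No abnormality detected."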

Visual grounding

Detecting the location of specific anatomical regions or pathologies on a CXR is an important task for AI assistants. In addition to providing a textual description of the image, they should be able to spot where specific observations are located. This is usually done by predicting the bounding box coordinates of the top-left and bottom-right corners [x1, y1, x2, y2]. While classical object detectors [27, 28] tackle this task via specialized architectures, we integrate it with the other text-based tasks via next-token prediction, formatting coordinates as text enclosed in square brackets. Since input images are pre-processed to a fixed size by the vision encoder, we normalize coordinates by the original image dimensions, obtaining floating-point values between 0 and 1, following the approaches in [29–31].
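The conversion from pixel boxes to this textual format can be sketched as follows; the number of decimal places kept is an assumption.

def box_to_text(box, img_width, img_height, precision=2):
    # Normalize [x1, y1, x2, y2] pixel coordinates by the original image size
    # and render them as text for next-token prediction.
    x1, y1, x2, y2 = box
    norm = [round(x1 / img_width, precision), round(y1 / img_height, precision),
            round(x2 / img_width, precision), round(y2 / img_height, precision)]
    return f"[{norm[0]}, {norm[1]}, {norm[2]}, {norm[3]}]"

# Example: box_to_text([512, 300, 1024, 800], 2048, 2500) -> "[0.25, 0.12, 0.5, 0.32]"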

The following datasets, containing fine-grained X-ray annotations, were collected to design visual-grounding instructions:

  • Chest ImaGenome [32], derived from MIMIC-CXR, provides bounding box coordinates for 29 anatomical regions in frontal X-ray images. For the training data, we randomly select one region per image and create an anatomical region grounding instruction from it.
  • VinDr-CXR [33], which contains 18,000 frontal images, each manually annotated by three different radiologists. To merge their annotations, we fuse bounding boxes of the same pathology using weighted box fusion [34], similarly to [35] (see the sketch after this list). From this dataset, we design two tasks:
    1. Abnormality grounding: asking for the location of a specific abnormality.
    2. Abnormality detection: asking for the locations of all abnormalities, if any (Table 1).
  • MS-CXR [22], which complements MIMIC-CXR with bounding boxes paired with corresponding report sentences. From this dataset, we construct the “phrase grounding” task, where the user asks about the location of a specific sentence from a radiology report, and the assistant provides its associated bounding box coordinates.
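The reader-fusion step for VinDr-CXR mentioned in the list above can be sketched with an implementation of weighted boxes fusion [34] from the ensemble-boxes Python package; the IoU threshold and the unit confidence scores assigned to the manual annotations are assumptions.

from ensemble_boxes import weighted_boxes_fusion

def fuse_reader_boxes(reader_boxes, reader_labels, iou_thr=0.5):
    # reader_boxes: one list of [x1, y1, x2, y2] boxes per radiologist,
    # already normalized to [0, 1]; reader_labels: matching class ids.
    scores = [[1.0] * len(boxes) for boxes in reader_boxes]  # manual boxes: unit score
    fused_boxes, fused_scores, fused_labels = weighted_boxes_fusion(
        reader_boxes, scores, reader_labels, iou_thr=iou_thr
    )
    return fused_boxes, fused_labels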

Conversations

Fine-tuning a VLM on single instructions, as presented above, is useful to acquire maximal precision in specific tasks but does not suffice to build a robust, flexible, and conversational radiology assistant. First, in a real-life setting, we cannot assume that physicians will prompt the model with a limited set of instructions. Various question types are possible—asking about characteristics of a specific organ, the orientation of the X-ray, or the definition of certain medical terms. More importantly, interactions may span multiple Q&A rounds, sometimes referring back to earlier answers (for instance, asking about the location of a specific observation mentioned previously). The model must be tuned to sequentially connect visual concepts (textual observations, presence or absence of abnormalities, fine-grained information) throughout a single conversation.

To develop this capability in RadVLM, we constructed an instruction-tuning dataset that simulates a real-life multi-turn interaction between user and assistant, named the “conversation dataset.” Here, questions can appear in any order, and the assistant must adapt based on prior answers. Inspired by the vision–language models LLaVA [18] and LLaVA-Med [12], we prompt a larger text-only LLM (GPT-4o) to generate multi-turn conversations. The prompt includes a system message instructing the LLM to generate a dialogue between a user and an assistant, along with detailed CXR information (including the radiology report, abnormality labels, bounding box coordinates, view, and gender). By leveraging the provided CXR information, the assistant is prompted to respond as if it had direct visual access to the image.
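A hedged sketch of the prompt assembly is given below; the system message wording is illustrative and not the exact prompt used to generate the conversations. The resulting message list would be sent to the same kind of Azure OpenAI chat endpoint shown in the report-filtering sketch above.

SYSTEM_MSG = (
    "Generate a dialogue between a user and a radiology assistant about one "
    "chest X-ray. The assistant must answer as if it sees the image, using only "
    "the information provided. Return the dialogue as a JSON list of turns."
)

def build_conversation_request(report, labels, grounded_phrases=None,
                               view=None, gender=None):
    # Assemble the image-linked attributes that ground the generated dialogue.
    info = [f"Report: {report}", f"Abnormality labels: {', '.join(labels)}"]
    if grounded_phrases:  # only for the "grounded conversation" subset
        info.append(f"Phrases with bounding boxes: {grounded_phrases}")
    if view:
        info.append(f"View: {view}")
    if gender:
        info.append(f"Gender: {gender}")
    return [
        {"role": "system", "content": SYSTEM_MSG},
        {"role": "user", "content": "\n".join(info)},
    ]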

Following this process, we generate a total of 89k image–conversation pairs, including 86k standard conversations and 3k “grounded conversations” that include interactions aimed at localizing specific observations. For the grounded conversations, it is essential to supply pairs of textual observations and their corresponding bounding box coordinates in the prompt; we use the phrase-grounding datasets MS-CXR and PadChest-GR to collect phrases linked to precise image locations. Of these, only the 862 image–conversation pairs derived from MS-CXR are included in this release.


Data Description

The instruction dataset used to train RadVLM is downloadable as a JSON file containing a list of dictionaries.

Each dictionary is a datapoint following this structure:

{
    "image": "path/to/image.jpg",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\n<question>"
        },
        {
            "from": "gpt",
            "value": "<answer>"
        }
    ]
}

where "image" refers to the path of the image file and "conversations" contains the user-assistant instruction (single or multi-turn). This structure follows the LLaVA dataset format.

Images can be downloaded via the public releases of each corresponding dataset (available on PhysioNet) and should be organized following the path names provided in "image".
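For instance, the file can be loaded and the image paths resolved against a local image root as follows; the JSON file name and the root directory are placeholders.

import json
from pathlib import Path

IMAGE_ROOT = Path("/data/cxr_images")               # local root for the downloaded images (placeholder)

with open("radvlm_instruction_dataset.json") as f:  # file name is a placeholder
    dataset = json.load(f)

print(f"{len(dataset)} image-instruction datapoints")

# Flag datapoints whose images have not been downloaded/organized yet.
missing = [d["image"] for d in dataset if not (IMAGE_ROOT / d["image"]).exists()]
print(f"{len(missing)} datapoints reference images missing under {IMAGE_ROOT}")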

Descriptive Statistics

Overall, after excluding CheXpert- and PadChest-derived datapoints, the instruction dataset contains 719,675 image–instruction instances. The breakdown by dataset and task is as follows:

Table 1. Overview of the instruction dataset (PhysioNet)

Task                        Dataset source     Images × instructions per image
Report generation           MIMIC-CXR          230,980 × 1
Abnormality classification  MIMIC-CXR          237,912 × 1
Anatomical grounding        Chest ImaGenome    80,000 × 1
Abnormality grounding       VinDr-CXR          16,089 × 3
Abnormality detection       VinDr-CXR          15,000 × 2
Phrase grounding            MS-CXR             971 × 3
Conversation                MIMIC-CXR          86,155 × 1
Conversation (grounded)     MS-CXR             862 × 4
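Each row contributes (number of images) × (instructions per image) datapoints; the release total can be checked as follows.

counts = {
    "report_generation":          230_980 * 1,
    "abnormality_classification": 237_912 * 1,
    "anatomical_grounding":        80_000 * 1,
    "abnormality_grounding":       16_089 * 3,
    "abnormality_detection":       15_000 * 2,
    "phrase_grounding":               971 * 3,
    "conversation":                86_155 * 1,
    "conversation_grounded":          862 * 4,
}
assert sum(counts.values()) == 719_675  # matches the total reported above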


Usage Notes

This dataset is intended for researchers and developers working on medical vision-language models, particularly those focusing on:

  • Radiology report generation: Training models to generate structured radiology reports from CXRs.

  • Abnormality classification: Identifying and classifying abnormalities present in CXR images.

  • Visual grounding: Localizing anatomical structures and abnormalities within the image.

  • Conversational AI for radiology: Enabling AI models to engage in multi-turn, context-aware conversations about CXR findings.

This structure follows the LLaVA dataset format and can be used directly with the corresponding training scripts referenced in the RadVLM repository [36].

This dataset is a subset of the full RadVLM instruction dataset, which also includes datapoints derived from other sources such as CheXpert and PadChest. The absence of these datapoints may lead to differences in fine-tuned performance relative to the official RadVLM model. To include the other datasets, we invite users to follow the data creation pipeline described in [36].

A limitation of the current dataset is that report filtering and conversation generation rely on GPT-4o; although its outputs have been found to be of high quality according to automated metrics, they may contain errors and have not been manually assessed by experts, owing to the scale of the release.


Ethics

This dataset is derived from the public MIMIC-CXR-JPG, Chest ImaGenome, MS-CXR, and VinDr-CXR datasets. As such, the ethical considerations of these source datasets also apply to this release.

Special consideration was given to the use of a large language model API (GPT-4o), since the MIMIC-CXR terms of use prohibit sharing the data with third parties and require ensuring its electronic security (https://physionet.org/news/post/gpt-responsible-use). Specifically, we used a dedicated Azure OpenAI instance and opted out of human review, on the grounds that the processed data are sensitive with a low likelihood of harmful outputs and/or misuse, and that the MIMIC-CXR terms of use do not grant the right to let Microsoft process the data for abuse detection.


Acknowledgements

This work was supported as part of the Swiss AI Initiative by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID a02 on Alps, and by the LOOP Zurich as part of the application driver project supporting the LOOP Zurich Biomedical Informatics Platform (BMIP). ND and FN received research support from the Digitalization Initiative of the Zurich Higher Education Institutions (DIZH) Rapid Action Call, under the TRUST-RAD project. CB received research support from the Promedica Foundation, Chur, CH. TS is supported by grant #2021-911 of the Strategic Focal Area “Personalized Health and Related Technologies (PHRT)” of the ETH Domain (Swiss Federal Institutes of Technology). HM, MN and KF are supported by JSPS KAKENHI (Grant Number: 23KK0148). AR is supported by the StimuLoop grant #1-007811-002 and the Vontobel Foundation. MV and SL are supported by the Swiss State Secretariat for Education, Research, and Innovation (SERI) under contract number MB22.00047. MK is supported by the UZH Global Strategy and Partnerships Funding Scheme and a Research Partnership Grant with China, Japan, South Korea and the ASEAN region (RPG 072023 18).


Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Nishio M, Noguchi S, Matsuo H, Murakami T. Automatic classification between COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy on chest X-ray image: combination of data augmentation methods. Sci Rep. 2020;10:1–6.
  2. Homayounieh F, Digumarthy S, Ebrahimian S, Rueckel J, Hoppe BF, Sabel BO, et al. An artificial intelligence–based chest X-ray model on human nodule detection accuracy from a multicenter study. JAMA Netw Open. 2021;4:e2141096.
  3. Nooralahzadeh F, Perez Gonzalez N, Frauenfelder T, Fujimoto K, Krauthammer M. Progressive transformer-based generation of radiology reports. In: Moens MF, Huang X, Specia L, Yih SW-t, editors. Findings of the Association for Computational Linguistics: EMNLP 2021. Punta Cana: Association for Computational Linguistics; 2021. p. 2824–32.
  4. Yang S, Wu X, Ge S, Zheng Z, Zhou SK, Xiao L. Radiology report generation with a learned knowledge base and multi-modal alignment. Med Image Anal. 2023;86:102798.
  5. Hyland SL, Bannur S, Bouzid K, Castro DC, Ranjit M, Schwaighofer A, et al. MAIRA-1: a specialised large multimodal model for radiology report generation. arXiv preprint arXiv:2311.13668. 2023.
  6. Zambrano Chaves JM, Huang SC, Xu Y, Xu H, Usuyama N, Zhang S, et al. A clinically accessible small multimodal radiology model and evaluation metric for chest X-ray findings. Nat Commun. 2025;16:3108.
  7. Tu T, Palepu A, Schaekermann M, Saab K, Freyberg J, Tanno R, et al. Towards conversational diagnostic AI. arXiv preprint arXiv:2401.05654. 2024.
  8. OpenAI. ChatGPT can now see, hear, and speak. OpenAI Blog. 2024. Available from: https://openai.com/blog/chatgpt-can-now-see-hear-speak . Accessed 2024 Nov 26.
  9. Anthropic. Introducing the next generation of Claude. Anthropic News. 2024. Available from: https://www.anthropic.com/news/next-generation-claude . Accessed 2024 Nov 26.
  10. Li B, Zhang Y, Guo D, Zhang R, Li F, Zhang H, et al. LLaVA-OneVision: easy visual task transfer. arXiv preprint arXiv:2408.03326. 2024.
  11. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–80.
  12. Li C, Wong C, Zhang S, Usuyama N, Liu H, Yang J, et al. LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890. 2023.
  13. Sellergren A, Kazemzadeh S, Jaroensri T, Kiraly A, Traverse M, Kohlberger T, et al. MedGemma technical report. arXiv preprint arXiv:2507.05201. 2025.
  14. Chen Z, Varma M, Xu J, Paschali M, van Veen D, Johnston A, et al. A vision-language foundation model to enhance efficiency of chest X-ray interpretation. arXiv preprint arXiv:2401.12208. 2024.
  15. Pellegrini C, Özsoy E, Busam B, Navab N, Keicher M. RaDialog: a large vision-language model for radiology report generation and conversational assistance. arXiv preprint arXiv:2311.18681. 2023.
  16. Bannur S, Bouzid K, Castro DC, Schwaighofer A, Thieme A, Bond-Taylor S, et al. MAIRA-2: grounded radiology report generation. arXiv preprint arXiv:2406.04449. 2024.
  17. Deperrois N, Matsuo H, Ruipérez-Campillo S, Vandenhirtz M, Laguna S, Ryser A, et al. RadVLM: a multitask conversational vision-language model for radiology. arXiv preprint arXiv:2502.03333. 2025.
  18. Liu H, Li C, Wu Q, Lee YJ. Visual instruction tuning. arXiv preprint arXiv:2304.08485. 2023.
  19. Wang P, Bai S, Tan S, Wang S, Fan Z, Bai J, et al. Qwen2-VL: enhancing vision-language model perception of the world at any resolution. arXiv preprint arXiv:2409.12191. 2024.
  20. Bustos A, Pertusa A, Salinas JM, de la Iglesia-Vaya M. PadChest: a large chest X-ray image dataset with multi-label annotated reports. Med Image Anal. 2020;66:101797.
  21. Castro DC, Bustos A, Bannur S, Hyland SL, Bouzid K, Wetscherek MT, et al. PadChest-GR: a bilingual chest X-ray dataset for grounded radiology report generation. arXiv preprint arXiv:2411.05085. 2024.
  22. Boecking B, Usuyama N, Bannur S, Castro DC, Schwaighofer A, Hyland S, et al. Making the most of text semantics to improve biomedical vision–language processing. In: European Conference on Computer Vision. Cham: Springer; 2022. p. 1–21.
  23. Kim S, Nooralahzadeh F, Rohanian M, Fujimoto K, Nishio M, Sakamoto R, et al. Boosting radiology report generation by infusing comparison prior. In: Demner-Fushman D, Ananiadou S, Cohen K, editors. Proceedings of the 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks. Toronto: Association for Computational Linguistics; 2023. p. 50–61.
  24. Johnson AE, Pollard TJ, Greenbaum NR, Lungren MP, Deng CY, Peng Y, et al. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042. 2019.
  25. Smit A, Jain S, Rajpurkar P, Pareek A, Ng AY, Lungren MP. CheXbert: combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. arXiv preprint arXiv:2004.09167. 2020.
  26. Yang L, Xu S, Sellergren A, Kohlberger T, Zhou Y, Ktena I, et al. Advancing multimodal medical capabilities of Gemini. arXiv preprint arXiv:2405.03162. 2024.
  27. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137–49.
  28. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2016.
  29. You H, Zhang H, Gan Z, Du X, Zhang B, Wang Z, et al. Ferret: refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704. 2023.
  30. Park J, Kim S, Yoon B, Hyun J, Choi K. M4CXR: exploring multi-task potentials of multi-modal large language models for chest X-ray interpretation. arXiv preprint arXiv:2408.16213. 2024.
  31. Zhang H, You H, Dufter P, Zhang B, Chen C, Chen HY, et al. Ferret-v2: an improved baseline for referring and grounding with large language models. arXiv preprint arXiv:2404.07973. 2024.
  32. Wu JT, Agu NN, Lourentzou I, Sharma A, Paguio JA, Yao JS, et al. Chest ImaGenome dataset for clinical reasoning. arXiv preprint arXiv:2108.00316. 2021.
  33. Nguyen HQ, Lam K, Le LT, Pham HH, Tran DQ, Nguyen DB, et al. VinDr-CXR: an open dataset of chest X-rays with radiologists’ annotations. Sci Data. 2022;9:429.
  34. Solovyev R, Wang W, Gabruseva T. Weighted boxes fusion: ensembling boxes from different object detection models. Image Vis Comput. 2021;107:104117.
  35. Müller P, Kaissis G, Rueckert D. ChEX: interactive localization and region description in chest X-rays. arXiv preprint arXiv:2404.15770. 2024.
  36. RadVLM GitHub repository. Available from: https://github.com/uzh-dqbm-cmi/RadVLM.git . Accessed 2025 Sep 10.

Files