MS-CXR-T: Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing

Shruthi Bannur Stephanie Hyland Qianchu Liu Fernando Pérez-García Max Ilse Daniel Coelho de Castro Benedikt Boecking Harshita Sharma Kenza Bouzid Anton Schwaighofer Maria Teodora Wetscherek Hannah Richardson Tristan Naumann Javier Alvarez Valle Ozan Oktay

Published: March 17, 2023. Version: 1.0.0

When using this resource, please cite:
Bannur, S., Hyland, S., Liu, Q., Pérez-García, F., Ilse, M., Coelho de Castro, D., Boecking, B., Sharma, H., Bouzid, K., Schwaighofer, A., Wetscherek, M. T., Richardson, H., Naumann, T., Alvarez Valle, J., & Oktay, O. (2023). MS-CXR-T: Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing (version 1.0.0). PhysioNet.

Additionally, please cite the original publication:

Bannur, S., Hyland, S., Liu, Q., Perez-Garcia, F., Ilse, M., Castro, D. C., ... & Oktay, O. (2023). Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing. arXiv preprint arXiv:2301.04558.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract


MS-CXR-T is a multi-modal benchmark dataset for evaluating biomedical vision-language processing (VLP) models on two distinct temporal tasks in radiology: image classification and sentence similarity. The former comprises multi-image frontal chest X-rays with ground-truth labels (N=1326) across 5 findings, with classes corresponding to 3 states of disease progression for each finding: {'Improving', 'Stable', 'Worsening'}, expanding on the Chest ImaGenome progression dataset. The latter quantifies the temporal-semantic similarity of text embeddings extracted from pairs of sentences (N=361). The pairs can be either paraphrases or contradictions in terms of disease progression. The data for both tasks was manually annotated and reviewed by a board-certified radiologist. The dataset provides researchers an opportunity to evaluate both image and text models on these biomedical temporal tasks and reproduce experiments reported in the corresponding literature.


Background

Common benchmarks for biomedical VLP methods often focus on “static” tasks solvable with a single medical image, such as disease detection [1] or phrase grounding [2]. Temporal aspects, such as quantifying disease progression, have been largely overlooked in this line of research. Progress has been hindered by a lack of reproducibility, since prior work relies on private datasets [3], and by the lack of publicly available multi-modal benchmark datasets.

Here, we provide a general-purpose temporal benchmark based on the MIMIC-CXR v2 dataset [4]. We extend the previously published temporal image classification dataset, Chest ImaGenome progression [5], with radiologist-annotated examples, and introduce a novel sentence similarity dataset to support evaluation of the temporal sensitivity of text encoder models.

Methods


Here we describe the procedure for data collection and curation in detail for both benchmark datasets.

Temporal image classification

The MS-CXR-T temporal image classification dataset contains progression labels for five findings (Consolidation, Edema, Pleural Effusion, Pneumonia, and Pneumothorax) across three progression classes ('Improving', 'Stable', and 'Worsening'), their associated label quality and image pair IDs. Each image pair consists of a “current” image with its “prior” image, both from the same subject. The nearest prior image for a given image is determined by analysing the acquisition timestamps.
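
The pairing rule described above (each image is matched with the nearest earlier image of the same subject) could be sketched as follows. This is only an illustration under stated assumptions: the record layout, the identifiers, and the helper name `nearest_prior` are hypothetical and are not taken from the dataset or the authors' code.

```python
from datetime import datetime

# Hypothetical records: (subject_id, dicom_id, acquisition timestamp).
# The values are illustrative only.
images = [
    ("s1", "d1", datetime(2020, 1, 1, 9, 0)),
    ("s1", "d2", datetime(2020, 1, 3, 14, 30)),
    ("s1", "d3", datetime(2020, 1, 10, 8, 15)),
    ("s2", "d4", datetime(2020, 2, 1, 11, 0)),
]

def nearest_prior(images):
    """Map each dicom_id to the closest earlier dicom_id of the same subject.

    Images with no earlier image for that subject map to None.
    """
    by_subject = {}
    for subject, dicom, timestamp in images:
        by_subject.setdefault(subject, []).append((timestamp, dicom))

    pairs = {}
    for scans in by_subject.values():
        scans.sort()  # chronological order within one subject
        previous = None
        for _, dicom in scans:
            pairs[dicom] = previous
            previous = dicom
    return pairs

pairs = nearest_prior(images)
```

Images that map to `None` have no prior and therefore cannot form a (current, prior) pair.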

This benchmark builds on the publicly available Chest ImaGenome gold and silver standard datasets [5], where the latter provides progression labels automatically derived from radiology reports. We mapped the attribute “Pulmonary edema/hazy opacity” as used by the Chest ImaGenome dataset to “Edema” as used by CheXbert [6]. Other pathologies had direct analogues.

1) Promote silver standard examples to gold standard: We randomly selected a set of candidate studies that are part of the ImaGenome silver dataset, after excluding any studies that had been previously verified as part of the ImaGenome gold dataset. We then filtered out studies that had multiple different progression labels for a single pathology (e.g. left pleural effusion has increased, right pleural effusion remains stable).
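
As a rough illustration of the filtering step above, the sketch below drops every study in which one pathology carries more than one distinct progression label. The row layout and the helper name `drop_conflicting` are assumptions made for this example and do not reflect the actual Chest ImaGenome file format.

```python
from collections import defaultdict

# Hypothetical silver-standard rows: (study_id, finding, progression label).
rows = [
    ("st1", "Pleural Effusion", "worsening"),  # e.g. left effusion increased
    ("st1", "Pleural Effusion", "stable"),     # e.g. right effusion stable
    ("st2", "Edema", "improving"),
    ("st2", "Edema", "improving"),
    ("st3", "Pneumonia", "stable"),
]

def drop_conflicting(rows):
    """Remove all rows of studies with conflicting labels for one finding."""
    labels = defaultdict(set)
    for study, finding, label in rows:
        labels[(study, finding)].add(label)
    conflicting = {study for (study, _), seen in labels.items() if len(seen) > 1}
    return [row for row in rows if row[0] not in conflicting]

kept = drop_conflicting(rows)
```

Here study `st1` is removed because its pleural effusion is described as both worsening and stable, while the repeated but consistent labels of `st2` are kept.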

We conducted a review process of the selected candidates, asking a board-certified radiologist to either accept or reject the label. To inform their review of the labels, the radiologist was given access to the radiology report for the current image and the sentence from which the auto-generated label had been extracted. This mimics the method through which the ImaGenome gold dataset was created. A subset of the dataset was manually re-annotated by a second radiologist who was blinded to the labels, in order to measure the inter-observer agreement and label quality. For the study pairs where we collected only a single label, the corresponding label quality was indicated as “one_expert”.
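
One way to derive a label-quality value and a simple agreement rate from such a double-reading setup is sketched below. The value names come from the annotation schema of this dataset; reading 'multiple_experts' as "both readers agree" is an assumption of this sketch, and the annotations themselves are made up.

```python
from typing import Optional

def label_quality(first: str, second: Optional[str] = None) -> str:
    """Assign a label-quality value from one or two expert labels.

    Assumption: 'multiple_experts' means that both readers agreed.
    """
    if second is None:
        return "one_expert"
    return "multiple_experts" if first == second else "disagreement"

# Hypothetical annotations: (first reader, blinded second reader or None).
annotations = [
    ("stable", "stable"),
    ("worsening", "stable"),
    ("improving", None),
    ("worsening", "worsening"),
]

qualities = [label_quality(a, b) for a, b in annotations]

# Raw inter-observer agreement over the double-read subset only.
double_read = [(a, b) for a, b in annotations if b is not None]
agreement = sum(a == b for a, b in double_read) / len(double_read)
```

Raw agreement does not correct for chance; a chance-corrected statistic such as Cohen's kappa may be preferable when the class distribution is skewed.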

2) Curate examples based on image quality: After collecting our curated labels and labels from the ImaGenome gold dataset, we matched the report-based labels to specific image pairs, performing a second data curation step to create the image dataset. To ensure the diagnostic quality of all images in the dataset, if a study had multiple frontal scans, we performed a quality control step asking a radiologist to select the best image for each study.

Temporal sentence similarity

In this section, we describe the process of creating the MS-CXR-T temporal sentence similarity benchmark, which consists of pairs of paraphrase or contradiction sentences in terms of disease progression. We create this dataset using two different methods:

  1. “RadGraph”, where paraphrase and contradiction sentence pairs are discovered by analysing graph representations of sentences derived from RadGraph [7].
  2. “Swaps”, where paraphrases and contradictions are created by swapping out temporal keywords in the sentence.

1) Collect and filter candidate sentences: To create this dataset, we first collected a set of sentences from the MIMIC-CXR dataset [4], using the Stanza constituency parser [8] to extract individual sentences from reports. Using the CheXbert labeller [6], we filtered this set to sentences that described one of five pathologies: Consolidation, Edema, Pleural Effusion, Pneumonia or Pneumothorax. We then filtered to sentences which contained at least one mention of a temporal keyword, with high sensitivity.
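
A high-sensitivity keyword filter of the kind described above might look like the following sketch. The keyword list is a hypothetical example assembled for illustration; the list actually used to build the benchmark is not given here.

```python
import re

# Hypothetical temporal keywords; not the list used by the authors.
TEMPORAL_KEYWORDS = [
    "unchanged", "stable", "worsening", "worsened", "improving", "improved",
    "increase", "increased", "decreased", "resolution", "progression",
    "interval", "new",
]

# Whole-word, case-insensitive match on any keyword.
TEMPORAL_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, TEMPORAL_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)

def has_temporal_keyword(sentence: str) -> bool:
    """Return True if the sentence mentions at least one temporal keyword."""
    return TEMPORAL_PATTERN.search(sentence) is not None

sentences = [
    "Interval worsening of the right-sided pneumothorax.",
    "No pleural effusion or pneumothorax.",
    "Stable small-to-moderate right pleural effusion.",
]
candidates = [s for s in sentences if has_temporal_keyword(s)]
```

A broad list favours sensitivity over precision, which is acceptable at this stage because later steps filter and verify the candidates manually.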

2) Construct paraphrase and contradiction pairs: Using this sentence pool, paraphrase and contradiction pairs were constructed in two ways:

  1. “RadGraph”: We paired sentences from the sentence pool by matching on RadGraph [7] entities, relaxing the matching constraint only for temporal keywords and possible mentions of pathologies.
  2. “Swaps”: We swapped out temporal keywords in a sentence to create sentence pairs, choosing swap candidates from the top-5 masked token predictions from CXR-BERT-Specialized [2] provided they were temporal keywords.
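
The “Swaps” idea can be illustrated with a much simpler stand-in: instead of masked-token predictions from CXR-BERT-Specialized, the sketch below uses a small fixed lookup table of temporal keywords. The keyword groups, the helper name `swap_candidates`, and the provisional tags are all assumptions of this example; in the benchmark, every pair was verified and labelled by a radiologist.

```python
import re

# Hypothetical keyword groups; words in the same group are rough synonyms.
GROUPS = {
    "stable": ["unchanged", "stable"],
    "worsening": ["worsening", "progression"],
    "improving": ["improvement", "resolution"],
}
WORD_TO_GROUP = {word: group for group, words in GROUPS.items() for word in words}

def swap_candidates(sentence: str):
    """Return (swapped sentence, provisional tag) for each keyword swap."""
    candidates = []
    for match in re.finditer(r"[A-Za-z]+", sentence):
        word = match.group(0)
        group = WORD_TO_GROUP.get(word.lower())
        if group is None:
            continue  # not a temporal keyword
        for other_group, words in GROUPS.items():
            for replacement in words:
                if replacement == word.lower():
                    continue
                # Preserve the capitalisation of the replaced word.
                new_word = replacement.capitalize() if word[0].isupper() else replacement
                swapped = sentence[:match.start()] + new_word + sentence[match.end():]
                tag = "paraphrase" if other_group == group else "contradiction"
                candidates.append((swapped, tag))
    return candidates

swaps = swap_candidates("Interval worsening of the right-sided pneumothorax.")
```

A naive table also produces ungrammatical candidates such as "Interval unchanged of the right-sided pneumothorax.", which is one reason a manual filtering step is needed afterwards.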

3) Filter and label valid sentence pairs: After creating candidate sentence pairs, we manually filtered out sentence pairs with ambiguous differences in terms of disease progression. A board-certified radiologist then verified each candidate sentence pair and, if the pair was valid, annotated it as either a paraphrase or a contradiction. Sentence pairs were filtered out in the annotation process if:

  1. They were not clear paraphrases or contradictions.
  2. The sentences differed in meaning and this difference was not related to any temporal information.
  3. They were grammatically incorrect.

Data Description

Temporal image classification: The dataset consists of image pairs (DICOM IDs) with a single corresponding label (one of 'improving', 'stable', 'worsening') for each one of the five findings listed in Table 1. The class distribution for the image classification task in MS-CXR-T is shown in the same table below. Although the initial selection was performed randomly, the distribution of the dataset skews towards the stable and worsening classes. This could be explained by patients being more likely to get a chest X-ray when their condition is stable or deteriorating, as opposed to when there is an improvement in patient condition.

Table 1: Temporal image classification benchmark: Distribution of multi-image studies across different clinical findings, distribution of classes per finding (in the order of improving, stable, worsening), and number of subjects.
Findings | Number of samples | Class distribution | Number of subjects
Consolidation |  | 14% / 42% / 44% |
Edema |  | 31% / 26% / 43% |
Pleural effusion |  | 19% / 49% / 32% |
Pneumonia |  | 8% / 25% / 67% |
Pneumothorax |  | 15% / 55% / 30% |
Total |  | 18% / 40% / 42% |

Temporal sentence similarity: Table 2 shows a small set of examples from both subsets of the benchmark. As illustrated by the examples below, in the “Swaps” subset, only a small number of words can differ between sentence pairs, whereas the RadGraph subset permits variation in phrasing and syntax. The distribution of sentence pairs across the paraphrase and contradiction classes is shown in Table 3.

Table 2: Examples from the temporal sentence similarity benchmark.
Subset | Label | Sentence 1 | Sentence 2
Swaps | Paraphrase | Unchanged small-to-moderate right pleural effusion | Stable small-to-moderate right pleural effusion.
Swaps | Contradiction | Interval worsening of the right-sided pneumothorax. | Interval resolution of the right-sided pneumothorax.
RadGraph | Paraphrase | There has also been a slight increase in left basal consolidation. | There is slight interval progression of left basal consolidation.
RadGraph | Contradiction | Right mid and lower lung consolidations are unchanged. | There has been worsening of the consolidation involving the right mid and lower lung fields.

Table 3: Temporal sentence similarity benchmark: number of paraphrase and contradiction examples in the full dataset, and across the RadGraph and Swaps subsets.
Subset | Paraphrase pairs | Contradiction pairs | Total

Annotation schema

Table 4: Schema of the temporal image classification dataset
Variable | Description
study_id | A unique identifier for the radiology report written for the given current chest x-ray.
previous_study_id | A unique identifier for the radiology report written for the previous chest x-ray.
dicom_id | A unique DICOM identifier for the current chest x-ray.
previous_dicom_id | A unique DICOM identifier for the previous chest x-ray for the same subject.
subject_id | A unique identifier which specifies an individual patient.
{FINDING}_progression | Categorical class labels {'improving', 'stable', 'worsening'}, indicating the disease progression from previous scan to current scan. Labels are provided for each finding separately if applicable.
{FINDING}_label_quality | Label quality showing the agreement among multiple experts (N=2), {'multiple_experts', 'disagreement', 'one_expert'}. If a study is labelled only once, the corresponding value is set to 'one_expert'.

Table 5: Schema of the temporal sentence similarity dataset
Variable | Description
sentence_1 | The premise sentence
sentence_2 | The hypothesis sentence
category | The label of the sentence pair, decided based on the temporal semantics {'contradiction', 'paraphrase'}
subset_name | The type of sentence pair {'Swaps', 'RadGraph'}, indicating whether the hypothesis sentence is formed by keyword swaps or by graph representations of entities.

Folder structure

This project contains two files:

  • MS_CXR_T_temporal_image_classification_v1.0.0.csv: Label data for the temporal image classification task, across all five findings in a tabular format.
  • MS_CXR_T_temporal_sentence_similarity_v1.0.0.csv: Sentence similarity annotations for both subsets in a tabular format.
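
As a minimal sketch of how the image classification file might be read, the example below computes the class distribution for one finding from CSV text. The concrete column name `consolidation_progression` is a guess that follows the `{FINDING}_progression` pattern in Table 4, the inline sample rows are invented, and treating an empty cell as "no label for this finding" is an assumption; check the header of the released file before relying on any of these.

```python
import csv
import io
from collections import Counter

# Invented sample rows in the assumed layout; not real dataset content.
sample_csv = """subject_id,study_id,previous_study_id,dicom_id,previous_dicom_id,consolidation_progression,consolidation_label_quality
p1,s2,s1,d2,d1,worsening,one_expert
p2,s4,s3,d4,d3,stable,multiple_experts
p3,s6,s5,d6,d5,,
p4,s8,s7,d8,d7,worsening,one_expert
"""

def class_distribution(handle, column):
    """Return the fraction of each class in `column`, skipping empty cells."""
    counts = Counter(row[column] for row in csv.DictReader(handle) if row[column])
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}

dist = class_distribution(io.StringIO(sample_csv), "consolidation_progression")
```

To run this on the released file, the `io.StringIO` object would be replaced by an open handle to MS_CXR_T_temporal_image_classification_v1.0.0.csv.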

Usage Notes

We release the MS-CXR-T dataset to encourage reproducible evaluation of image and text models with respect to temporal capability, specifically for temporal image classification and temporal semantic similarity of sentences. In an associated paper [9], we evaluate current multi-modal models on these tasks.

Given the class imbalance in the temporal image classification task, we recommend using macro-statistics (such as macro-F1 or macro-accuracy) to enable a balanced assessment of performance across all three classes. We further recommend evaluating performance per pathology, again due to varied prevalence in the dataset, as some pathologies were found to be more challenging than others [9].
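
To make the recommendation concrete, the sketch below computes macro-F1 and macro-accuracy for the three progression classes in plain Python. It assumes that "macro-accuracy" means the unweighted mean of per-class recall; the labels and predictions are invented to show how a majority-class predictor is penalised.

```python
CLASSES = ("improving", "stable", "worsening")

def macro_scores(y_true, y_pred, classes=CLASSES):
    """Return (macro-F1, macro-accuracy) with every class weighted equally.

    Macro-accuracy is taken here to be the mean of per-class recall.
    """
    f1_scores, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
        recalls.append(recall)
    return sum(f1_scores) / len(classes), sum(recalls) / len(classes)

# Invented example: a model that always predicts "stable".
y_true = ["stable"] * 4 + ["worsening"] * 4 + ["improving"] * 2
y_pred = ["stable"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1, macro_accuracy = macro_scores(y_true, y_pred)
```

In this example the plain accuracy is 0.4, while macro-accuracy is 1/3 and macro-F1 is roughly 0.19, because two of the three classes are never predicted.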

The sentence similarity dataset can be used to evaluate the sensitivity of a text encoder, for example by computing cosine similarity between sentence pair embeddings and selecting a threshold above which pairs are considered paraphrases to report either accuracy or AUROC.
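
The evaluation described above could be sketched as follows, using plain Python so that it does not depend on any particular embedding library. The 2-D "embeddings" are invented stand-ins for real text encoder outputs, the threshold of 0.5 is arbitrary, and label 1 denotes a paraphrase while 0 denotes a contradiction.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def accuracy_at_threshold(scores, labels, threshold):
    """Accuracy when pairs scoring above the threshold count as paraphrases."""
    correct = sum((s > threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)

def auroc(scores, labels):
    """Probability that a paraphrase pair outscores a contradiction pair.

    Ties contribute one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented pairs: (embedding of sentence_1, embedding of sentence_2, label).
pairs = [
    ((1.0, 0.0), (1.0, 0.0), 1),
    ((1.0, 0.0), (0.0, 1.0), 0),
    ((1.0, 1.0), (1.0, 0.0), 1),
    ((1.0, 0.0), (-1.0, 0.0), 0),
    ((1.0, 0.2), (1.0, 0.0), 0),  # contradiction that still looks similar
]
scores = [cosine(u, v) for u, v, _ in pairs]
labels = [y for _, _, y in pairs]

acc = accuracy_at_threshold(scores, labels, threshold=0.5)
auc = auroc(scores, labels)
```

The last pair mimics an encoder that ignores temporal wording: the contradiction receives a high similarity score, which lowers both the accuracy and the AUROC.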


Ethics

MS-CXR-T is a research artifact of the corresponding work [9], “Learning to Exploit Temporal Structure for Biomedical Vision–Language Processing”. In this capacity, MS-CXR-T facilitates reproducibility and serves as an addition to the benchmarking landscape. The dataset is released with instances chosen from the public MIMIC-CXR v2 image-text dataset. As such, the ethical considerations of that project should be taken into account in addition to those provided below.

In concordance with existing research [10], the application of filters results in a dataset that is both slightly older (average age 64.99 vs 56.85 in all of MIMIC-CXR v2) and slightly less female (45.88% female vs 52.39% in all of MIMIC-CXR v2). The dataset particularly focuses on subjects where findings are seen in two consecutive images, which could potentially explain the slight shift in average age. While these are relatively small shifts and the primary intention of this dataset is to facilitate reproducibility as a benchmark, we have disclosed this both alongside the dataset and in the corresponding work.


Acknowledgements

The authors would like to thank Hannah Richardson for the guidance offered as part of the compliance review of the datasets used in this study, and Dr Maria Teodora Wetscherek for her clinical input and data annotations provided to this study. As MS-CXR-T builds on prior work, we would further like to acknowledge the contributors of the Chest ImaGenome dataset [5], the RadGraph tool and dataset [7], and the underlying MIMIC-CXR dataset [4].

Conflicts of Interest

The authors have no conflicts of interest to declare.

References


  1. Shih G, Wu CC, Halabi SS, Kohli MD, Prevedello LM, Cook TS, et al. Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia. Radiology: Artificial Intelligence. 2019 Jan;1(1):e180041.
  2. Boecking B, Usuyama N, Bannur S, Castro DC, Schwaighofer A, Hyland S, Wetscherek M, Naumann T, Nori A, Alvarez-Valle J, Poon H. Making the most of text semantics to improve biomedical vision–language processing. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI 2022 Oct 29 (pp. 1-21). Cham: Springer Nature Switzerland.
  3. Shamout FE, Shen Y, Wu N, Kaku A, Park J, Makino T, Jastrzębski S, Witowski J, Wang D, Zhang B, Dogra S. An artificial intelligence system for predicting the deterioration of COVID-19 patients in the emergency department. NPJ digital medicine. 2021 May 12;4(1):80.
  4. Johnson AE, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng CY, Mark RG, Horng S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data. 2019 Dec 12;6(1):317.
  5. Wu JT, Agu NN, Lourentzou I, Sharma A, Paguio JA, Yao JS, Dee EC, Mitchell W, Kashyap S, Giovannini A, Celi LA. Chest ImaGenome dataset for clinical reasoning. arXiv preprint arXiv:2108.00316. 2021 Jul 31.
  6. Smit A, Jain S, Rajpurkar P, Pareek A, Ng AY, Lungren M. Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020 Nov (pp. 1500-1519).
  7. Jain S, Agrawal A, Saporta A, Truong SQ, VinBrain V, Bui T, Chambon P, Zhang Y, Lungren MP, Ng AY, Langlotz CP. RadGraph: Extracting Clinical Entities and Relations from Radiology Reports.
  8. Zhang Y, Zhang Y, Qi P, Manning CD, Langlotz CP. Biomedical and clinical English model packages for the Stanza Python NLP library. Journal of the American Medical Informatics Association. 2021 Sep;28(9):1892-9.
  9. Bannur S, Hyland S, Liu Q, Perez-Garcia F, Ilse M, Castro DC, Boecking B, Sharma H, Bouzid K, Thieme A, Schwaighofer A. Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing. arXiv preprint arXiv:2301.04558. 2023 Jan 11.
  10. Weber GM, Adams WG, Bernstam EV, Bickel JP, Fox KP, Marsolo K, Raghavan VA, Turchin A, Zhou X, Murphy SN, Mandl KD. Biases introduced by filtering electronic health records for patients with "complete data". J Am Med Inform Assoc. 2017 Nov 1;24(6):1134-1141. doi: 10.1093/jamia/ocx071. PMID: 29016972; PMCID: PMC6080680.

Parent Projects
MS-CXR-T: Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing was derived from: Please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
