Database Restricted Access

Swiss-Mammo: A physician-written, synthetic dataset of German mammography reports

Daniel Reichenpfader Sandro von Däniken Harald Marcel Bonel

Published: April 30, 2025. Version: 1.0.0


When using this resource, please cite:
Reichenpfader, D., von Däniken, S., & Bonel, H. M. (2025). Swiss-Mammo: A physician-written, synthetic dataset of German mammography reports (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/mrg5-ja22

Additionally, please cite the original publication:

Reichenpfader D, Knupp J, von Däniken S, Gaio R, Dennstädt F, Cereghetti GM, Sander A, Hiltbrunner H, Nairz K, Denecke K. Enhancing BERT with Frame Semantics to Extract Clinically Relevant Information from German Mammography Reports: Algorithm Development and Validation. Journal of Medical Internet Research. 16/03/2025:68427 (forthcoming/in press)

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.

Abstract

This dataset, Swiss-Mammo, contains 28 manually constructed German mammography reports, each paired with an English translation. The reports are stratified across BI-RADS categories 0 through 6, with four reports per category. All reports were written by a radiology resident without the use of generative AI and independently reviewed by a senior radiologist specialized in breast imaging to ensure clinical plausibility and linguistic accuracy.

Swiss-Mammo was developed to support research on information extraction from radiology reports using large language models (LLMs) and frame semantics. The dataset provides a high-quality, controlled benchmark for evaluating systems designed to structure free-text radiological narratives. It is particularly suited for applications in natural language processing, clinical informatics, and decision support systems in breast imaging.


Background

Radiology reporting plays a critical role in clinical communication, diagnosis, and treatment planning. However, the continued use of unstructured, free-text formats introduces variability and ambiguity that can hinder downstream analysis and interoperability. Structured reporting has been proposed as a remedy, with the European Society of Radiology (ESR) advocating for its wider adoption due to the potential improvements in clarity, consistency, and quality of radiology communication [1]. Despite these benefits, structured reporting remains underutilized in clinical practice.

Large language models (LLMs) offer a promising solution to bridge the gap between unstructured reports and structured data by automatically extracting and normalizing relevant clinical information. However, rigorous evaluation of these models is often hampered by the lack of publicly available datasets with consistent annotation or stratification [2].

Swiss-Mammo addresses this need by providing a synthetic, stratified dataset of mammography reports annotated with BI-RADS categories. The dataset enables reproducible evaluation of LLM-based pipelines and supports the development and benchmarking of information extraction tools for radiological text. By offering high-quality synthetic data, it also helps mitigate privacy concerns typically associated with real patient data.


Methods

Report Creation and Review

Each report in Swiss-Mammo was manually written by a radiology resident with clinical training in breast imaging. Reports were designed to reflect the typical linguistic and clinical features of real-world German mammography narratives from a Swiss hospital. While not based on individual patient records, the reports draw upon commonly encountered imaging findings and follow a standardized mammography reporting structure. To ensure clinical realism and semantic accuracy, all reports were reviewed by a senior radiologist specialized in breast imaging.

The dataset comprises 28 reports, with four examples per BI-RADS category (0 through 6), stratified to support balanced model evaluation. Each report is available in both German (original) and English (translation).

Similarity Assessment

To evaluate how well the synthetic reports reflect the semantic structure of real mammography narratives, we conducted a cosine similarity analysis using sentence embeddings from the gte-Qwen2-1.5B-instruct model [3]. To this end, we compared the Swiss-Mammo corpus with three other datasets: a stratified set of 28 real reports sampled from a clinical mammography corpus of a Swiss hospital (dataset A), a separate stratified set of 210 real reports sampled from the same mammography corpus (dataset B), and a baseline of 21 synthetic reports generated using ChatGPT (dataset C).

Results of the similarity assessment (average cosine similarity)

              Swiss-Mammo   Dataset A   Dataset B   Dataset C
Swiss-Mammo        -           0.78        0.76        0.58
Dataset A         0.78          -          0.79        0.59
Dataset B         0.76         0.79         -          0.57
Dataset C         0.58         0.59        0.57         -

Measured against the real reports in dataset B, the Swiss-Mammo corpus reaches an average cosine similarity of 0.76, only slightly below the 0.79 obtained for the real reports in dataset A. In contrast, the ChatGPT-generated reports (dataset C) exhibit substantially lower similarity scores (< 0.6 against every other dataset), indicating reduced semantic alignment with real clinical narratives.

This analysis supports the utility of the synthetic reports for benchmarking and evaluation tasks, particularly when compared to generic LLM-generated text.
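The comparison between two corpora can be illustrated with a minimal sketch. It assumes that each report has already been turned into an embedding vector and that the corpus-level score is the mean cosine similarity over all cross-corpus report pairs; the exact aggregation used in the original analysis is not stated here, so this choice is an assumption. The function names and the toy vectors are hypothetical and serve only as illustration.

```python
import math
from itertools import product


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(corpus_a, corpus_b):
    """Average cosine similarity over all (a, b) pairs across two corpora.

    Assumption: averaging over all cross-corpus pairs; other aggregations
    (for example comparing corpus centroids) would give different values.
    """
    if not corpus_a or not corpus_b:
        raise ValueError("both corpora must contain at least one embedding")
    sims = [cosine_similarity(u, v) for u, v in product(corpus_a, corpus_b)]
    return sum(sims) / len(sims)


# Toy 2-dimensional vectors standing in for real sentence embeddings.
toy_corpus_a = [[1.0, 0.0], [0.0, 1.0]]
toy_corpus_b = [[1.0, 0.0]]
print(mean_pairwise_similarity(toy_corpus_a, toy_corpus_b))  # 0.5
```

In practice, the input vectors would come from an embedding model such as the one cited above rather than from hand-written lists.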

Report structure

Each report follows a consistent structure that includes the following sections (in German and English):

  1. Clinical Indication

  2. Imaging Findings

  3. Assessment (including BI-RADS category)

  4. Recommendation (if applicable)


Data Description

The dataset is distributed in two formats:

  1. Plain Text Files (.txt):
    Each report is saved as a UTF-8 encoded text file. Filenames follow the convention:
    SwissMammo_<ID>_BR<BI-RADS category>_<Language>.txt

    • ID: A zero-padded unique identifier (001 to 028)

    • BI-RADS category: Ranges from 0 to 6 (the higher category is used if the two breasts differ)

    • Language: DE for German or EN for English translation

    Example: SwissMammo_003_BR0_EN.txt

  2. CSV File (SwissMammo.csv):
    A tabular version is provided for easier use in data processing pipelines. The CSV file includes the following columns:

    • ID: Report identifier (e.g., 003)

    • BIRADS: BI-RADS category (integer from 0 to 6)

    • Text_de: Full text of the report in German

    • Text_en: English translation

All reports are fully synthetic, and any patient identifiers or dates are completely fictional.
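The filename convention described above can be parsed with a short helper. This is a sketch under the assumption that every filename matches the documented pattern exactly (three-digit ID, a single BI-RADS digit from 0 to 6, and DE or EN); the names FILENAME_PATTERN, ReportFile, and parse_report_filename are hypothetical and not part of the dataset.

```python
import re
from typing import NamedTuple

# Pattern derived from the documented convention:
# SwissMammo_<ID>_BR<BI-RADS category>_<Language>.txt
FILENAME_PATTERN = re.compile(r"SwissMammo_(\d{3})_BR([0-6])_(DE|EN)\.txt")


class ReportFile(NamedTuple):
    report_id: str  # zero-padded identifier, kept as a string
    birads: int     # BI-RADS category, 0 to 6
    language: str   # "DE" or "EN"


def parse_report_filename(filename: str) -> ReportFile:
    """Split a report filename into ID, BI-RADS category, and language."""
    match = FILENAME_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    report_id, birads, language = match.groups()
    return ReportFile(report_id, int(birads), language)


print(parse_report_filename("SwissMammo_003_BR0_EN.txt"))
# ReportFile(report_id='003', birads=0, language='EN')
```

Keeping the ID as a string preserves the zero padding, which makes it easier to match a German file with its English counterpart.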


Usage Notes

Applications

The Swiss-Mammo dataset is intended for a variety of research applications, including, but not limited to, benchmarking large language models for medical information extraction from mammography reports, training and evaluating NLP pipelines in radiology, and investigating semantic structuring of reports.
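As one possible benchmarking setup, the BI-RADS category extracted by a system can be compared with the category assigned to each report. The sketch below computes accuracy per category; because the dataset is stratified, averaging these values weights every category equally. The function name and the example labels are hypothetical and do not represent results from any real system.

```python
from collections import defaultdict


def per_category_accuracy(gold, predicted):
    """Return {category: fraction of reports predicted correctly}."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    totals = defaultdict(int)
    correct = defaultdict(int)
    for gold_label, predicted_label in zip(gold, predicted):
        totals[gold_label] += 1
        if gold_label == predicted_label:
            correct[gold_label] += 1
    return {cat: correct[cat] / totals[cat] for cat in sorted(totals)}


# Hypothetical labels for four reports, for illustration only.
gold_labels = [0, 0, 1, 1]
predicted_labels = [0, 1, 1, 1]
scores = per_category_accuracy(gold_labels, predicted_labels)
print(scores)                             # {0: 0.5, 1: 1.0}
print(sum(scores.values()) / len(scores))  # 0.75
```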

Limitations

The dataset is synthetic and does not represent actual clinical distributions or patient populations. While care has been taken to ensure realism, the reports do not capture the full complexity or variability of real clinical documentation (e.g., abbreviations or errors). Moreover, the report structure is specific to a Swiss university hospital and might not generalize to other institutional reporting standards. Furthermore, the translation to English was performed manually with the assistance of automated translation tools and has not undergone formal linguistic validation.

Tools and Code

The cosine similarity evaluation and embedding extraction were implemented using the sentence-transformers library. If you wish to reproduce or extend our analysis, we provide a corresponding .py file in the code folder.

Technical Notes

  • German umlauts (ä, ö, ü) and other diacritics are preserved in UTF-8 encoding. Users should ensure proper decoding when loading the .txt or .csv files.

  • Stratification means that the dataset does not reflect the true prevalence of BI-RADS categories in the general population.
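Both notes can be checked when loading the CSV file. The following sketch opens the file with an explicit UTF-8 encoding and counts the reports per BI-RADS category. It assumes a comma-delimited file with the column names listed in the Data Description; the delimiter and the helper names load_reports and count_by_birads are assumptions, not part of the dataset.

```python
import csv
from collections import Counter


def load_reports(path="SwissMammo.csv"):
    """Read the CSV into a list of dicts, decoding explicitly as UTF-8."""
    # newline="" lets the csv module handle line breaks inside report text.
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


def count_by_birads(rows):
    """Count reports per BI-RADS category (column BIRADS)."""
    return Counter(int(row["BIRADS"]) for row in rows)
```

Calling count_by_birads(load_reports()) on the distributed file should show the same number of reports for each category, which is a quick way to confirm that the file was read completely and that umlauts in Text_de were decoded correctly.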


Ethics

As this is a purely synthetic dataset, no real patient data was used, and there are no ethics concerns to declare.


Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. European Society of Radiology (ESR). ESR paper on structured reporting in radiology—update 2023. Insights Imaging. 2023 Nov 23;14(1):199.
  2. Reichenpfader D, Müller H, Denecke K. A scoping review of large language model based approaches for information extraction from radiology reports. npj Digit Med. 2024 Aug 24;7(1):1–12.
  3. Li Z, Zhang X, Zhang Y, Long D, Xie P, Zhang M. Towards General Text Embeddings with Multi-stage Contrastive Learning [Internet]. arXiv; 2023 [cited 2025 Apr 15]. Available from: http://arxiv.org/abs/2308.03281

Access

Access Policy:
Only registered users who sign the specified data use agreement can access the files.

License (for files):
PhysioNet Restricted Health Data License 1.5.0

Data Use Agreement:
PhysioNet Restricted Health Data Use Agreement 1.5.0
