Database Restricted Access
Swiss-Mammo: A physician-written, synthetic dataset of German mammography reports
Daniel Reichenpfader , Sandro von Däniken , Harald Marcel Bonel
Published: April 30, 2025. Version: 1.0.0
When using this resource, please cite:
Reichenpfader, D., von Däniken, S., & Bonel, H. M. (2025). Swiss-Mammo: A physician-written, synthetic dataset of German mammography reports (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/mrg5-ja22
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
Abstract
This dataset, Swiss-Mammo, contains 28 manually constructed German mammography reports, each paired with an English translation. The reports are stratified across BI-RADS categories 0 through 6, with four reports per category. All reports were written by a radiology resident without the use of generative AI and independently reviewed by a senior radiologist specialized in breast imaging to ensure clinical plausibility and linguistic accuracy.
Swiss-Mammo was developed to support research on information extraction from radiology reports using large language models (LLMs) and frame semantics. The dataset provides a high-quality, controlled benchmark for evaluating systems designed to structure free-text radiological narratives. It is particularly suited for applications in natural language processing, clinical informatics, and decision support systems in breast imaging.
Background
Radiology reporting plays a critical role in clinical communication, diagnosis, and treatment planning. However, the continued use of unstructured, free-text formats introduces variability and ambiguity that can hinder downstream analysis and interoperability. Structured reporting has been proposed as a remedy, with the European Society of Radiology (ESR) advocating for its wider adoption due to the potential improvements in clarity, consistency, and quality of radiology communication [1]. Despite these benefits, structured reporting remains underutilized in clinical practice.
Large language models (LLMs) offer a promising solution to bridge the gap between unstructured reports and structured data by automatically extracting and normalizing relevant clinical information. However, rigorous evaluation of these models is often hampered by the lack of publicly available datasets with consistent annotation or stratification [2].
Swiss-Mammo addresses this need by providing a synthetic, stratified dataset of mammography reports annotated with BI-RADS categories. The dataset enables reproducible evaluation of LLM-based pipelines and supports the development and benchmarking of information extraction tools for radiological text. By offering high-quality synthetic data, it also helps mitigate privacy concerns typically associated with real patient data.
Methods
Report Creation and Review
Each report in Swiss-Mammo was manually written by a radiology resident with clinical training in breast imaging. Reports were designed to reflect typical linguistic and clinical features observed in real-world German mammography narratives of a Swiss hospital. While not based on individual patient records, the reports draw upon commonly encountered imaging findings and follow a standardized mammography reporting structure. To ensure clinical realism and semantic accuracy, all reports underwent a review process by a senior radiologist specialized in breast imaging.
The dataset comprises 28 reports, with four examples per BI-RADS category (0 through 6), stratified to support balanced model evaluation. Each report is available in both German (original) and English (translation).
Similarity Assessment
To evaluate how well the synthetic reports reflect the semantic structure of real mammography narratives, we conducted a cosine similarity analysis using sentence embeddings from the gte-Qwen2-1.5B-instruct model [3]. We compared the Swiss-Mammo corpus to three other corpora: a stratified set of 28 real reports sampled from a clinical mammography corpus of a Swiss hospital (dataset A), a separate stratified set of 210 real reports sampled from the same corpus (dataset B), and a baseline of 21 synthetic reports generated using ChatGPT (dataset C). The table below lists the average cosine similarity for each pair of corpora.
| | Swiss-Mammo | Dataset A | Dataset B | Dataset C |
|---|---|---|---|---|
| Swiss-Mammo | - | 0.78 | 0.76 | 0.58 |
| Dataset A | 0.78 | - | 0.79 | 0.59 |
| Dataset B | 0.76 | 0.79 | - | 0.57 |
| Dataset C | 0.58 | 0.59 | 0.57 | - |
The Swiss-Mammo corpus is nearly as similar to the real reports of dataset B as the real reports of dataset A are (average cosine similarity 0.76 versus 0.79). In contrast, the ChatGPT-generated reports (dataset C) show substantially lower similarity to every other corpus (< 0.6), indicating reduced semantic alignment with real clinical narratives.
This analysis supports the utility of the synthetic reports for benchmarking and evaluation tasks, particularly when compared to generic LLM-generated text.
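The aggregation step of such a comparison can be sketched as follows. This is a minimal illustration, not the authors' script (which is provided in the code folder): it assumes the reported values are the mean cosine similarity over all cross-corpus pairs of report embeddings, and it uses toy two-dimensional vectors in place of real sentence embeddings. The function names `cosine_similarity` and `mean_pairwise_similarity` are hypothetical.

```python
import math
from itertools import product


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(corpus_a, corpus_b):
    """Average cosine similarity over every (a, b) pair across two corpora.

    Assumption: averaging over all cross-corpus pairs; the original
    analysis may aggregate differently.
    """
    scores = [cosine_similarity(u, v) for u, v in product(corpus_a, corpus_b)]
    return sum(scores) / len(scores)


# Toy vectors standing in for per-report sentence embeddings.
corpus_a = [[1.0, 0.0], [0.0, 1.0]]
corpus_b = [[1.0, 0.0]]

print(mean_pairwise_similarity(corpus_a, corpus_b))  # 0.5
```

In practice, each report would first be encoded into a vector with an embedding model, and the two lists passed to `mean_pairwise_similarity` would hold one vector per report.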
Report structure
Each report follows a consistent structure that includes the following sections (in German and English):
- Clinical Indication
- Imaging Findings
- Assessment (including BI-RADS category)
- Recommendation (if applicable)
Data Description
The dataset is distributed in two formats:
- Plain Text Files (.txt): Each report is saved as a UTF-8 encoded text file. Filenames follow the convention: SwissMammo_<ID>_BR<BI-RADS category>_<Language>.txt
  - ID: A zero-padded unique identifier (001 to 028)
  - BI-RADS: Category ranging from 0 to 6 (based on the higher category if both breasts differ)
  - Language: DE for German or EN for English translation
  - Example: SwissMammo_003_BR0_EN.txt
- CSV File (SwissMammo.csv): A tabular version is provided for easier use in data processing pipelines. The CSV file includes the following columns:
  - ID: Report identifier (e.g., 003)
  - BIRADS: BI-RADS category (integer from 0 to 6)
  - Text_de: Full text of the report in German
  - Text_en: English translation
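The filename convention above can be parsed with a regular expression. The following is a small sketch; the pattern is derived only from the convention as described here (three-digit ID, a single BI-RADS digit from 0 to 6, and a DE or EN language code), and the helper name `parse_report_filename` is hypothetical.

```python
import re

# Pattern assumed from the documented convention:
# SwissMammo_<ID>_BR<BI-RADS category>_<Language>.txt
FILENAME_RE = re.compile(
    r"^SwissMammo_(?P<id>\d{3})_BR(?P<birads>[0-6])_(?P<lang>DE|EN)\.txt$"
)


def parse_report_filename(name):
    """Split a report filename into ID, BI-RADS category, and language."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"Not a Swiss-Mammo report filename: {name!r}")
    return {
        "id": match["id"],                # kept as a string to preserve zero-padding
        "birads": int(match["birads"]),   # integer category, 0 to 6
        "language": match["lang"],        # "DE" or "EN"
    }


info = parse_report_filename("SwissMammo_003_BR0_EN.txt")
print(info)  # {'id': '003', 'birads': 0, 'language': 'EN'}
```

Keeping the ID as a string avoids losing the leading zeros, which matters when matching text files against the ID column of the CSV file.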
All reports are fully synthetic, and any patient identifiers or dates are completely fictional.
Usage Notes
Applications
The Swiss-Mammo dataset is intended for a variety of research applications, including, but not limited to, benchmarking large language models for medical information extraction from mammography reports, training and evaluating NLP pipelines in radiology, and investigating the semantic structuring of reports.
Limitations
The dataset is synthetic and does not represent actual clinical distributions or patient populations. While care has been taken to ensure realism, the reports do not capture the full complexity or variability of real clinical documentation (for example, abbreviations or errors). Moreover, the report structure is specific to a Swiss university hospital and may not generalize to other institutional reporting standards. Finally, the translation to English was performed manually with the assistance of automated translation tools and has not undergone formal linguistic validation.
Tools and Code
The cosine similarity evaluation and embedding extraction were implemented using the sentence-transformers library. If you wish to reproduce or extend our analysis, we provide a corresponding .py file in the code folder.
Technical Notes
- German umlauts (ä, ö, ü) and other diacritics are preserved in UTF-8 encoding. Users should ensure proper decoding when loading the .txt or .csv files.
- Stratification means that the dataset does not reflect the true prevalence of BI-RADS categories in the general population.
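One way to load the CSV file with explicit UTF-8 decoding is sketched below using Python's standard `csv` module. The sketch assumes a comma-delimited file with a header row containing the documented column names (ID, BIRADS, Text_de, Text_en); the helper names `load_reports` and `count_by_birads` are hypothetical, and the demo row is placeholder text, not dataset content.

```python
import csv
import os
import tempfile


def load_reports(path):
    """Read the CSV with explicit UTF-8 decoding so umlauts survive.

    "utf-8-sig" reads plain UTF-8 and also strips a byte-order mark
    if one is present. A comma delimiter is assumed.
    """
    with open(path, encoding="utf-8-sig", newline="") as handle:
        return list(csv.DictReader(handle))


def count_by_birads(rows):
    """Count reports per BI-RADS category to check the stratification."""
    counts = {}
    for row in rows:
        category = int(row["BIRADS"])
        counts[category] = counts.get(category, 0) + 1
    return counts


# Demo on a tiny placeholder file (NOT real dataset content).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["ID", "BIRADS", "Text_de", "Text_en"])
        writer.writerow(["001", "2", "Platzhalter für Befundtext", "placeholder report text"])
    rows = load_reports(demo_path)

print(rows[0]["Text_de"])  # Platzhalter für Befundtext
```

On the real file, `count_by_birads(load_reports("SwissMammo.csv"))` would be expected to show an equal number of reports in each category, reflecting the stratified design rather than clinical prevalence.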
Ethics
As this is a purely synthetic dataset, no real patient data was used, and there are no ethics concerns to declare.
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
- European Society of Radiology (ESR). ESR paper on structured reporting in radiology—update 2023. Insights Imaging. 2023 Nov 23;14(1):199.
- Reichenpfader D, Müller H, Denecke K. A scoping review of large language model based approaches for information extraction from radiology reports. npj Digit Med. 2024 Aug 24;7(1):1–12.
- Li Z, Zhang X, Zhang Y, Long D, Xie P, Zhang M. Towards General Text Embeddings with Multi-stage Contrastive Learning [Internet]. arXiv; 2023 [cited 2025 Apr 15]. Available from: http://arxiv.org/abs/2308.03281
Access
Access Policy:
Only registered users who sign the specified data use agreement can access the files.
License (for files):
PhysioNet Restricted Health Data License 1.5.0
Data Use Agreement:
PhysioNet Restricted Health Data Use Agreement 1.5.0
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/mrg5-ja22
DOI (latest version):
https://doi.org/10.13026/gqzb-fg27