Database Credentialed Access

RadCoref: Fine-tuning coreference resolution for different styles of clinical narratives

Yuxiang Liao Hantao Liu Irena Spasic

Published: Jan. 30, 2024. Version: 1.0.0

When using this resource, please cite:
Liao, Y., Liu, H., & Spasic, I. (2024). RadCoref: Fine-tuning coreference resolution for different styles of clinical narratives (version 1.0.0). PhysioNet.

Additionally, please cite the original publication:

Yuxiang Liao, Hantao Liu, Irena Spasić. (2023). "Fine-tuning Coreference Resolution for Different Styles of Clinical Narratives". Journal of Biomedical Informatics, 149.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


RadCoref is a small subset of MIMIC-CXR with manually annotated coreference mentions and clusters. The dataset was annotated by a panel of three cross-disciplinary experts with experience in clinical data processing, following the i2b2 annotation scheme with minimal modification. It consists of Findings and Impression sections extracted from full radiology reports, with 950, 25 and 200 section documents for training, validation, and testing, respectively. The training and validation sets were annotated by one annotator. The test set was annotated by two human annotators independently, and their results were merged manually by the third annotator. The dataset aims to support the task of coreference resolution on radiology reports. Given that MIMIC-CXR has already been de-identified, no protected health information (PHI) is included.


Radiology reports play a vital role in patient care as referring clinicians use them to determine an appropriate course of action. Narrative radiology reports vary excessively in their language, length and style, which may affect their clarity and hence the referring clinicians' decision-making [1]. These issues gave rise to an idea of structured reporting, which has a potential for improving the clarity of radiology reports. Automated structuring of narrative reports can facilitate extraction, storage and retrieval of information they describe [2]. Coreference Resolution (CR), which aims to explicitly link up all expressions that mention the same entity [3], is necessary to identify sentences that belong to topically cohesive observations of a radiology report.

Although a variety of CR models have been integrated into popular open-source natural language processing (NLP) tools, such as Stanford CoreNLP [4], AllenNLP [5] and spaCy [6], these NLP tools are generic and as such their performance does not necessarily transfer into specialized domains such as the clinical one [7]. To make CR more feasible in the clinical domain, we release the first dataset with coreference labels in radiology reports.


Data Annotation

The data were annotated by a panel of three cross-disciplinary experts with experience in clinical data processing. The annotation schema followed that of i2b2, a clinical-domain dataset with manual coreference annotation used in the fifth i2b2/VA challenge [8], with minimal modification for rapid and consistent annotation. Our dataset is based on the radiology reports from MIMIC-CXR [9]. We segmented the original reports into sections and retained only the sections with headings corresponding to "FINDINGS" and "IMPRESSION", which provided a total of 156,011 and 189,465 text snippets, respectively. Some reports, however, merged both sections into one, namely "FINDINGS AND IMPRESSION"; there were 10,589 such text snippets. In addition to these three sections listed under the title "FINAL REPORT", a few reports had extra sections outside the "FINAL REPORT", such as "PROVISIONAL FINDINGS IMPRESSION", which had 200 text snippets. Each text snippet was considered to be a section in our dataset. The sections were tokenized by spaCy [6], which produced 7,488,528 tokens for "FINDINGS", 9,094,320 tokens for "IMPRESSION", 508,272 tokens for "FINDINGS AND IMPRESSION", and 9,600 tokens for "PROVISIONAL FINDINGS IMPRESSION". On average, an individual section had 48 tokens.

During a pilot annotation experiment, we noticed that annotators commonly missed some coreferring mentions. Although employing multiple independent human annotators could alleviate this problem, false negatives would still accumulate as a result of the fatigue incurred by long annotation sessions. Therefore, we fine-tuned the Longdoc model [10] jointly on the OntoNotes [11] and i2b2 datasets and used it to automatically annotate labels, which were then passed to human annotators for curation. This approach improved both the efficiency and the accuracy of manual annotation. We also discovered that many text snippets did not contain coreference labels. We used the original Longdoc model to filter out snippets with zero automatically annotated coreference labels. A total of 89.5% of the original text snippets were removed, and the remaining ones were grouped according to the number of coreference clusters they contained.

For the test set, 100 findings sections and 100 impression sections were randomly sampled from each group within the sampling pool. Each section was annotated independently by two human annotators, and their results were merged manually by the third annotator. All annotators had access to the pre-annotated section. Any disagreements were resolved by discussion. The annotators were trained for 10 minutes to use the Brat Rapid Annotation Tool (BRAT), a web-based tool for manual text annotation [12], and to familiarise themselves with the annotation schema. The inter-annotator agreement, computed with the weighted Krippendorff's alpha proposed by Passonneau [13], was 0.79 between the two independent annotators, 0.87 between the first annotator and the merged result, and 0.8 between the second annotator and the merged result.

For the training and validation sets, we first sampled 500 sections using the same sampling strategy as the test set, of which half are findings sections and half are impression sections. The sections were annotated by one annotator. The training set consists of 475 sections and the validation set consists of 25 sections. We did not follow the traditional 8:2 splitting strategy since our experiment showed that providing 25 sections for validation could provide better results. 

Furthermore, we utilized active learning to annotate an additional 475 sections. We divided the data available for annotation into an annotated pool and an unannotated pool and split the model training process into multiple iterations. Each iteration used an updated annotated pool to train the same initial model based on Longdoc. Subsequently, the trained model selected the top-k samples expected to yield the highest performance improvement; these were annotated manually and added to the annotated pool for the next iteration. For the query strategy, we adapted the highest mention detection entropy proposed by Yuan et al. [23], which aims to identify the sample about which the model is currently least certain, based on the predicted probability. More details can be found in [19].
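The query step described above can be illustrated with a minimal sketch. It assumes that, for each unannotated section, the model exposes a list of predicted mention probabilities for its candidate spans, and that a section is scored by the mean binary entropy of those probabilities. The function names, the averaging choice and the example pool are illustrative assumptions, not the implementation used in [19] or [23].

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def mention_detection_entropy(span_probs: list[float]) -> float:
    """Mean entropy over candidate spans (averaging is an assumption of this sketch)."""
    if not span_probs:
        return 0.0
    return sum(binary_entropy(p) for p in span_probs) / len(span_probs)


def select_top_k(pool: dict[str, list[float]], k: int) -> list[str]:
    """Return the k section ids whose mention predictions are least certain."""
    ranked = sorted(
        pool.items(),
        key=lambda item: mention_detection_entropy(item[1]),
        reverse=True,
    )
    return [section_id for section_id, _ in ranked[:k]]


# Hypothetical unannotated pool: section id -> predicted mention probabilities.
pool = {
    "sec_a": [0.99, 0.01, 0.98],  # confident predictions, low entropy
    "sec_b": [0.5, 0.6, 0.4],     # uncertain predictions, high entropy
    "sec_c": [0.9, 0.2],
}
print(select_top_k(pool, 2))
```

Sections selected this way would be passed to the human annotator and then moved into the annotated pool before the next training iteration.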

Model Training and Evaluation

Different training strategies are used to fine-tune the model on domain-specific data. Joint training is a simple yet effective method that collates multiple datasets to create a new dataset and uses it to train a new model from scratch. This method can effectively improve the generalization ability of a CR model [14]. Alternatively, continued training utilizes a pre-trained model to initialize a model to be fine-tuned on a target dataset. It has been successfully used for the rapid transfer of CR models from one dataset to another [18].

In addition, we proposed an ensemble algorithm that aggregates and refines the results from multiple CR models. The key idea is that when models trained on different data using different methods make the same prediction, the prediction itself is more likely to be correct. Given the result of a CR model, $clusters = \{mention_1, \ldots, mention_i\}$, where $mention = \{token_1, \ldots, token_j\}$, the algorithm propagates the result of each CR model from cluster-level to token-level and computes the majority vote. The clusters are then refined based on the tokens that received the majority agreement. Details and a visualization of the algorithm can be found in [19]. CR models are most commonly evaluated using the average F1 score of three metrics based on links, mentions and entities, as proposed in the Conference on Computational Natural Language Learning (CoNLL) 2012 Shared Task [17]. We followed this convention to evaluate our models.
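The token-level voting idea can be sketched as follows. This is a deliberately simplified illustration, not the published algorithm in [19]: it assumes each model's output is a list of clusters, each cluster a list of inclusive (start, end) token spans, it votes only on whether a token lies inside some mention, and it refines one model's clusters by keeping the mentions whose tokens all received a strict majority. The type aliases and function names are hypothetical.

```python
from collections import Counter

Mention = tuple[int, int]        # inclusive (start, end) token indices
Clusters = list[list[Mention]]   # one model's output for one section


def covered_tokens(clusters: Clusters) -> set[int]:
    """Propagate a model's clusters to token level: tokens inside any mention."""
    tokens: set[int] = set()
    for cluster in clusters:
        for start, end in cluster:
            tokens.update(range(start, end + 1))
    return tokens


def majority_tokens(model_outputs: list[Clusters]) -> set[int]:
    """Tokens marked as part of a mention by more than half of the models."""
    votes: Counter[int] = Counter()
    for clusters in model_outputs:
        votes.update(covered_tokens(clusters))
    threshold = len(model_outputs) / 2
    return {token for token, count in votes.items() if count > threshold}


def refine(clusters: Clusters, agreed: set[int]) -> Clusters:
    """Keep mentions fully covered by agreed tokens; drop clusters left with < 2 mentions."""
    refined: Clusters = []
    for cluster in clusters:
        kept = [(s, e) for s, e in cluster if all(t in agreed for t in range(s, e + 1))]
        if len(kept) >= 2:
            refined.append(kept)
    return refined


# Hypothetical outputs of three models for the same section.
model_a = [[(0, 1), (5, 5)], [(8, 9), (12, 12)]]
model_b = [[(0, 1), (5, 5)]]
model_c = [[(0, 1), (5, 5), (12, 12)]]

agreed = majority_tokens([model_a, model_b, model_c])
print(sorted(agreed))
print(refine(model_a, agreed))
```

In this example, the span (8, 9) is proposed by only one model and is discarded, which leaves the second cluster of model_a with a single mention, so that cluster is removed as well.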

Zhang et al. [20] successfully transferred a general model of linguistic analysis to the clinical domain. They did so by adding automatically annotated MIMIC-III data [21] to the manually annotated English Web Treebank [22] to retrain the model. We followed this approach by using our ensemble algorithm to automatically annotate MIMIC-CXR data, adding it to OntoNotes and retraining the Longdoc model (Longdoc-silver). However, this model achieved only 61.9% CoNLL F1, which is worse than the original Longdoc model (F1=64.9%). This highlights the need for human-annotated data in this field.

Therefore, we fine-tuned three Longdoc models using our manually annotated datasets. All three models utilized continued training with Longdoc-silver as the initial model. The first model, Longdoc-random, was trained on 475 randomly sampled sections, achieving F1=79.8%. The second model, Longdoc-active, was trained in 13 iterations of training and annotation via active learning, which produced 475 training sections, and achieved F1=77.9%. The third model, Longdoc-merge, was trained on the merged 950 sections from random sampling and active learning, achieving F1=79.5%. Subsequently, we applied the ensemble algorithm to these three models, achieving F1=80.6%. Benchmarks using external models and different encoders can be found in our source publication [19].

A List of Models Involved and Training Details

  • The original Longdoc model [10]: Longdoc has three components: a Longformer-large [14] document encoder, a mention proposer and a mention cluster predictor. The mention proposer and the mention cluster predictor are two stacked feed-forward neural networks (FFNNs). The whole model was jointly fine-tuned on three general domain datasets: OntoNotes [11], PreCo [15], and LitBank [16]. We used this model to pre-annotate the raw section text to help us filter and organize sections for the selection of training, validation and test data.
  • Longdoc-i2b2: This is a Longdoc model fine-tuned jointly on the OntoNotes and i2b2 datasets. We used this model to pre-annotate text for curation by human annotators. The 500 randomly selected sections of the training and validation sets and the 200 sections of the test set were pre-annotated by this model.
  • Longdoc-silver: This is a Longdoc model fine-tuned jointly on the OntoNotes and silver MIMIC-CXR data. The silver data were automatically annotated by our ensemble algorithm with a subset of CR models that employ fundamentally different approaches, including a neural approach (the original Longdoc), a statistical machine learning approach (CoreNLP – Statistical [24]) and a traditional rule-based approach (CoreNLP – Deterministic [24]). We used this model as the initial model of continued training for Longdoc-random, Longdoc-active and Longdoc-merge.
  • Longdoc-random: This is a fine-tuned Longdoc model trained on 475 randomly sampled sections. It utilized continued training with Longdoc-silver as the initial model. This model was employed by the ensemble algorithm to create the inference data.
  • Longdoc-active: This is a fine-tuned Longdoc model trained with active learning and continued training. This model was trained in 13 iterations. Each iteration consists of a training-sampling-annotation process, except for the first iteration, which starts with sampling. In each iteration, the training used Longdoc-silver as the initial model. The training data comprised all sections sampled and annotated in previous iterations. Subsequently, the trained model was used to sample the sections with the highest entropy for a human annotator to annotate. This process produced 475 annotated sections as a byproduct. This model was employed by the ensemble algorithm to create the inference data.
  • Longdoc-merge: This is a fine-tuned Longdoc model trained on 950 sections from random sampling and active learning. It used continued training with Longdoc-silver as the initial model. This model was employed by the ensemble algorithm to create the inference data.

The hyperparameters for the above models were based on the original Longdoc model. The model encoder was frozen during fine-tuning. The mention proposer and the mention cluster predictor used Adam as the optimizer with an initial learning rate of 3e-4. The learning rate decayed linearly throughout fine-tuning. We used a batch size of 1 document, with a maximum of 100,000 training steps and a patience of 10 epochs for early stopping. The number of gradient updates per epoch equals the size of the training set. The OntoNotes dataset was downsampled to 1k per epoch for Longdoc-i2b2 and 0.5k for Longdoc-silver. The models were fine-tuned on Nvidia V100/P100 16G GPUs, requiring ~40 minutes for Longdoc-random and ~80 minutes for Longdoc-merge.
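As a rough illustration of the learning-rate schedule described above, the sketch below computes a linearly decayed rate from the stated initial value of 3e-4 and the stated maximum of 100,000 steps. It assumes the rate decays to zero exactly at the maximum step with no warm-up; the source does not state the end point of the decay, so this is an assumption, and the function name is hypothetical.

```python
def linear_decay_lr(step: int, initial_lr: float = 3e-4, max_steps: int = 100_000) -> float:
    """Learning rate after `step` updates, decaying linearly to 0 at `max_steps` (assumed)."""
    remaining = max(0.0, 1.0 - step / max_steps)
    return initial_lr * remaining


for step in (0, 25_000, 50_000, 100_000):
    print(step, linear_decay_lr(step))
```

With early stopping at a patience of 10 epochs, training would typically end well before the rate reaches zero under this assumption.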

Data Description

File Structure Overview
├── Inference_data
│   ├── findings
│   │   └── p10
│   │       └── p10000032
│   │           ├── s50414267.csv
│   │           └── s53189527.csv
│   ├── impression
│   ├── findings_and_impression
│   └── provisional_findings_impression
└── manual_data
    ├── train_950
    │   ├── conll
    │   │   └── train.conll
    │   └── longformer
    │       └── train.4096.jsonlines
    ├── train_475
    ├── dev
    └── test

Detailed File Structure Description 

  • manual_data: This folder contains the manually annotated data. It contains four subfolders corresponding to train/dev/test splits.
    • test: This folder contains the data for testing. It consists of 200 sections.
    • dev: This folder contains the data for validation. It consists of 25 sections.
    • train_475: This folder contains the data for training. It consists of 475 sections.
    • train_950: This folder contains the data for training. It consists of 950 sections, half of which are identical to train_475 and the other half of which were sampled via active learning.
    • There are two subfolders in the above folders, including conll and longformer.
      • The conll folder has a file suffixed with .conll, which stores all sections in the CoNLL format. Every section is identified as A_B, where A is the study_id of the original report in MIMIC-CXR and B is the heading to which this section belongs in the report; every section is tokenized and sentence-segmented by spaCy.
        • CoNLL is a tab-separated values format often used in NLP applications. In our dataset, every section is surrounded by the lines "#begin document (A_B); part 0" and "#end document". Every token is represented as one line, and sentences are separated by an empty line. Every token line has 12 or 13 columns separated by "\t". Column 1 is the section_id (A_B) to which the token belongs. Column 3 is the index of the token in its sentence. Column 4 is the token string. Column 13 stores the label of the coreference cluster to which the mention containing the token belongs. For a single-token mention, the label of the corresponding token is "(n)", where n is a number indicating the ID of a coreference cluster. For a mention with multiple tokens, labels are assigned to its start and end tokens, which are "(n" and "n)", respectively. Multiple labels, if present, are concatenated by "|". Mentions in the same cluster share the same ID. If a token has no label, its line has only 12 columns. The other columns are unused: column 2 is always "0" and columns 5-12 are always "_".
      • The longformer folder has a file suffixed with .jsonlines, with all sections stored in a JSON-based format. Each section is named A_B_0, where A is the study_id of the original reports in MIMIC-CXR and B is the section of a report to which this section belongs. Each section is processed using the script provided by Toshniwal et al. [10]. The file can be directly used for training the Longdoc model.
  • inference_data: This folder contains the inference results obtained by our ensemble algorithm with the fine-tuned models (Longdoc-random, Longdoc-active and Longdoc-merge). It contains four folders named after the report section to which their sections belong. Each folder has multiple sub-folders named according to the patient_id used in MIMIC-CXR. Each sub-folder contains multiple CSV files named according to the study_id used in MIMIC-CXR. Each file is a specific section of the report with that study_id in MIMIC-CXR, tokenized by spaCy, with the inference results assigned to the tokens.
    • findings: This folder contains 156,011 sections. 7,488,528 tokens are generated by spaCy, on average 48 tokens per section.
    • impression: This folder contains 189,465 sections. 9,094,320 tokens are generated by spaCy, on average 48 tokens per section.
    • findings_and_impression: This folder contains 10,589 sections. 508,272 tokens are generated by spaCy, on average 48 tokens per section.
    • provisional_findings_impression: This folder contains 200 sections. 9,600 tokens are generated by spaCy, on average 48 tokens per section.
    • The columns of the CSV file are "token", "sent_group", "coref_group", and "coref_group_conll".
      • "token" represents the token string extracted by spaCy.
      • "sent_group" indicates which sentence the token belongs to.
      • "coref_group" indicates the coreference cluster(s) to which the mention containing the token belongs. All tokens within a mention are assigned a label, whereas in "coref_group_conll" only the start and end tokens are assigned a label. Multiple labels can be assigned when a token belongs to multiple mentions. Example value: [0, 1].
      • "coref_group_conll" follows the CoNLL format (column 13) to indicate the coreference mention and the coreference cluster. Example value: ["(0)", "(1"].
        • "(n" and "n)" indicate the start and the end tokens of a mention, respectively, and "(n)" indicates a single-token mention, where "n" indicates the ID of the coreference cluster to which this mention belongs. Every label is double-quoted; multiple labels are separated by a comma and surrounded by square brackets. This is the literal structure of a Python list and can be easily parsed by using `ast.literal_eval()`. This column is more accurate than "coref_group" in determining the boundaries of a mention span, for example when two mentions of the same coreference cluster are adjacent in the text.
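The label conventions described above (column 13 of the .conll files and the "coref_group_conll" CSV column) can be decoded with a short sketch such as the following. It assumes that an unlabelled token is represented by a missing 13th column in the .conll file and by an empty list or empty string in the CSV, and that mentions of the same cluster do not cross each other. The function names and the example section id are hypothetical.

```python
import ast
from collections import defaultdict


def split_conll_cell(cell: str) -> list[str]:
    """Column 13 of a .conll token line, e.g. '(0)|(1' -> ['(0)', '(1']."""
    return cell.split("|") if cell else []


def conll_line_to_labels(line: str) -> list[str]:
    """Labels of one tab-separated token line; lines with 12 columns have no label."""
    columns = line.rstrip("\n").split("\t")
    return split_conll_cell(columns[12]) if len(columns) > 12 else []


def parse_csv_cell(cell: str) -> list[str]:
    """'coref_group_conll' CSV value, e.g. '["(0)", "(1"]' -> ['(0)', '(1']."""
    return list(ast.literal_eval(cell)) if cell.strip() else []


def labels_to_spans(token_labels: list[list[str]]) -> dict[int, list[tuple[int, int]]]:
    """Map cluster id -> inclusive (start, end) token spans of its mentions."""
    open_starts: dict[int, list[int]] = defaultdict(list)
    spans: dict[int, list[tuple[int, int]]] = defaultdict(list)
    for index, labels in enumerate(token_labels):
        for label in labels:
            cluster_id = int(label.strip("()"))
            if label.startswith("(") and label.endswith(")"):
                spans[cluster_id].append((index, index))   # single-token mention
            elif label.startswith("("):
                open_starts[cluster_id].append(index)      # mention start
            else:
                start = open_starts[cluster_id].pop()      # mention end
                spans[cluster_id].append((start, index))
    return dict(spans)


# Column-13 cells for six consecutive tokens ('' means the token has no label).
cells = ["(0", "", "0)", "", "(1)", "(0)"]
print(labels_to_spans([split_conll_cell(cell) for cell in cells]))

# A full token line with 13 columns; the section id is made up for illustration.
line = "\t".join(["50414267_findings", "0", "3", "effusion"] + ["_"] * 8 + ["(0)|(1"])
print(conll_line_to_labels(line))
print(parse_csv_cell('["(0)", "(1"]'))
```

Because both representations reduce to a list of labels per token, the same `labels_to_spans` helper can be applied to the manually annotated .conll files and to the inference CSV files.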

Usage Notes

The dataset is free to use provided that researchers adhere to the data use agreement. The data have been used to train a coreference model. Researchers may (1) use our manually annotated data to train coreference models that link up all expressions mentioning the same entity in radiology reports, and (2) use our inference data to identify sentences that belong to topically cohesive observations of a radiology report, which is essential for structured reporting of radiology narratives. The code and models of this work are available on GitHub at yxliao95/sr_coref [25].
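As an illustration of use case (2), the sketch below reads one inference CSV file and groups sentences by the coreference clusters they mention, using the "token", "sent_group" and "coref_group" columns. It assumes that "coref_group" holds a Python-style list literal such as [0, 1], that unlabelled tokens hold an empty list or an empty string, and that rows appear in document order. The function name and the sample rows are made up for illustration and are not taken from the dataset.

```python
import ast
import csv
import io
from collections import defaultdict


def sentences_by_cluster(csv_file) -> dict[int, list[str]]:
    """Map each coreference cluster id to the sentences that contain one of its mentions."""
    sentences: dict[str, list[str]] = defaultdict(list)   # sent_group -> tokens
    cluster_sentences: dict[int, set[str]] = defaultdict(set)
    for row in csv.DictReader(csv_file):
        sentence_id = row["sent_group"]
        sentences[sentence_id].append(row["token"])
        raw = row["coref_group"].strip()
        value = ast.literal_eval(raw) if raw else []
        cluster_ids = value if isinstance(value, list) else [value]
        for cluster_id in cluster_ids:
            cluster_sentences[cluster_id].add(sentence_id)
    # Iterating over `sentences` preserves the order in which sentences first appeared.
    return {
        cluster_id: [" ".join(sentences[s]) for s in sentences if s in members]
        for cluster_id, members in cluster_sentences.items()
    }


# Made-up sample in the documented column layout (not real MIMIC-CXR text).
sample = '''token,sent_group,coref_group,coref_group_conll
Small,0,[0],"[""(0""]"
effusion,0,[0],"[""0)""]"
.,0,[],[]
Lungs,1,[],[]
clear,1,[],[]
.,1,[],[]
It,2,[0],"[""(0)""]"
is,2,[],[]
stable,2,[],[]
.,2,[],[]
'''
print(sentences_by_cluster(io.StringIO(sample)))
```

For a real file, the `io.StringIO` object would be replaced by an opened CSV file from the inference data folder, for example `open(path, newline="")`.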

This work has several limitations. First, this study serves as an auxiliary study of structured reporting on MIMIC-CXR; we therefore opted to annotate individual sections of reports rather than complete reports to reduce the annotation complexity. Second, the data were annotated by cross-disciplinary experts with experience in clinical data processing; no label was assigned to coreference mentions or clusters about which we were uncertain. Third, we omitted complex coreference mentions that were difficult to define by clustering. For example, in "All these findings suggest an extensive right upper lung malignancy...", the mention "All these findings" may link to many individual mentions, yet those mentions should not be placed in the same cluster.

Release Notes

Version 1.0.0


The authors declare no ethics concerns.


This work is part of a PhD project funded by China Scholarship Council-Cardiff University Scholarship (CSC202108140022). The scholarship had been awarded to Y.L. The project is supervised by I.S. and H.L. 

Conflicts of Interest

No conflicts of interest to declare.


  1. Dhakshinamoorthy Ganeshan, Phuong-Anh Thi Duong, Linda Probyn, Leon Lenchik, Tatum A. McArthur, Michele Retrouvey, Emily H. Ghobadi, Stephane L. Desouches, David Pastel and Isaac R. Francis (2018). "Structured Reporting in Radiology". Academic Radiology. 25(1):66-73.
  2. Morteza Pourreza Shahri, Amir Tahmasebi, Bingyang Ye, Henghui Zhu, Javed Aslam and Timothy Ferris (2020). "An Ensemble Approach for Automatic Structuring of Radiology Reports". Association for Computational Linguistics.
  3. Pengcheng Lu and Massimo Poesio (2021). "Coreference Resolution for the Biomedical Domain: A Survey". Association for Computational Linguistics.
  4. Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard and David McClosky (2014). "The Stanford CoreNLP Natural Language Processing Toolkit". Association for Computational Linguistics.
  5. The Allen Institute for Artificial Intelligence (2017). "AllenNLP". (Accessed 17 Nov 2022).
  6. Explosion (2016). "SpaCy: Industrial-Strength Natural Language Processing". (Accessed 16 Nov 2022).
  7. Irina Temnikova, William A. Baumgartner Jr., Negacy D. Hailu, Ivelina Nikolova, Tony McEnery, Adam Kilgarriff, Galia Angelova and K. Bretonnel Cohen (2014). "Sublanguage Corpus Analysis Toolkit: A tool for assessing the representativeness and sublanguage characteristics of corpora". European Language Resources Association (ELRA).
  8. Ozlem Uzuner, Andreea Bodnari, Shuying Shen, Tyler Forbush, John Pestian and Brett R South (2012). "Evaluating the state of the art in coreference resolution for electronic medical records". Journal of the American Medical Informatics Association. 19(5):786-91 doi: 10.1136/amiajnl-2011-000784.
  9. Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Roger G. Mark and Steven Horng (2019). "MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports". Scientific Data. 6(1):317 doi: 10.1038/s41597-019-0322-0.
  10. Shubham Toshniwal, Patrick Xia, Sam Wiseman, Karen Livescu and Kevin Gimpel (2021). "On Generalization in Coreference Resolution". Association for Computational Linguistics.
  11. Author (2013). "OntoNotes Release 5.0". Linguistic Data Consortium.
  12. Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou and Jun’ichi Tsujii (2012). "brat: a Web-based Tool for NLP-Assisted Text Annotation". Association for Computational Linguistics.
  13. Rebecca J. Passonneau (2004). "Computing Reliability for Coreference Annotation". European Language Resources Association (ELRA).
  14. Iz Beltagy, Matthew E. Peters, Arman Cohan. (2020). "Longformer: The long-document transformer". arXiv:2004.05150.
  15. Hong Chen, Zhenhua Fan, Hao Lu, Alan Yuille and Shu Rong (2018). "PreCo: A Large-scale Dataset in Preschool Vocabulary for Coreference Resolution". Association for Computational Linguistics.
  16. David Bamman, Olivia Lewke and Anya Mansoor (2020). "An Annotated Dataset of Coreference in English Literature". European Language Resources Association.
  17. Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina and Yuchen Zhang (2012). "CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes". Association for Computational Linguistics.
  18. Patrick Xia, Benjamin Van Durme. (2021). "Moving on from OntoNotes: Coreference Resolution Model Transfer." Association for Computational Linguistics.
  19. Yuxiang Liao, Hantao Liu, Irena Spasić. (2023). "Fine-tuning Coreference Resolution for Different Styles of Clinical Narratives". Journal of Biomedical Informatics, 149.
  20. Yuhao Zhang, Yuhui Zhang, Peng Qi, Christopher D Manning, Curtis P Langlotz. (2021). "Biomedical and clinical English model packages for the Stanza Python NLP library". Journal of the American Medical Informatics Association, 28 (9), 1892-2189, 10.1093/jamia/ocab090
  21. Alistair E.W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, Roger G. Mark (2016). "MIMIC-III, a freely accessible critical care database". Scientific Data, 3(1), Article 160035.
  22. Ann Bies, Justin Mott, Colin Warner, Seth Kulick. (2012). "English Web Treebank, LDC2012T13". Linguistic Data Consortium, Philadelphia.
  23. Michelle Yuan, Patrick Xia, Chandler May, Benjamin Van Durme, Jordan Boyd-Graber. (2022). "Adapting Coreference Resolution Models through Active Learning". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, Association for Computational Linguistics, 1, 7533-7549. 10.18653/v1/2022.acl-long.519
  24. Stanford NLP Group. "Coreference Resolution". [Online]. Available from: [Accessed 22nd December 2023].
  25. Yuxiang Liao, Hantao Liu, Irena Spasić. "Fine-tuning coreference resolution for different styles of clinical narratives". [Online]. Available from: [Accessed 22nd December 2023].

Parent Projects
RadCoref: Fine-tuning coreference resolution for different styles of clinical narratives was derived from: Please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
