Challenge Credentialed Access

SNOMED CT Entity Linking Challenge

Will Hardman Mark Banks Rory Davidson Donna Truran Nindya Widita Ayuningtyas Hoa Ngo Alistair Johnson Tom Pollard

Published: Dec. 19, 2023. Version: 1.0.0

When using this resource, please cite:
Hardman, W., Banks, M., Davidson, R., Truran, D., Ayuningtyas, N. W., Ngo, H., Johnson, A., & Pollard, T. (2023). SNOMED CT Entity Linking Challenge (version 1.0.0). PhysioNet.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


This challenge, sponsored by SNOMED International, seeks to advance the development of Entity Linking models that operate on unstructured clinical texts. Participants in the challenge will train entity linking models using a subset of MIMIC-IV-Note discharge summaries that have been annotated with SNOMED CT concepts by a team of medical professionals. The full dataset (which comprises a training set and a test set) consists of approximately 75,000 annotations across nearly 300 discharge summaries. The competition is being hosted by DrivenData, whose platform will manage registration, code submission, and evaluation.


Our objective is to stimulate the development and sharing of new approaches for building Clinical Entity Linking models and to provide a new data resource to the machine learning for health community. Much of the world’s healthcare data is stored in free-text documents, typically clinical notes recorded in electronic health record systems. Extracting useful information from these notes has the potential to unlock new opportunities for analytics and research, in turn stimulating the development of new clinical interventions, treatment pathways, and better patient outcomes. 

One way to extract structured information from clinical free text is to first locate all references to clinical concepts (the "Named Entity Recognition" step) and then match each one to a well-defined concept in a knowledge base (the "Linking" step). This process is known as Clinical Entity Linking, or sometimes as Automated Clinical Coding. A good overview of the current state of automated clinical coding can be found in [1].
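The two steps can be illustrated with a deliberately naive sketch in Python. It is not a competitive approach and not the organisers' method: the "NER" step is plain string matching and the "Linking" step is an exact dictionary lookup. The names `TOY_KB`, `recognise_entities` and `link_entities` are hypothetical, and the three concept IDs are included for illustration only and should be checked against a current SNOMED CT release.

```python
import re

# Hypothetical mini knowledge base: surface form -> SNOMED CT concept ID.
# Illustrative entries only; verify IDs against a SNOMED CT release.
TOY_KB = {
    "myocardial infarction": 22298006,
    "hypertension": 38341003,
    "diabetes mellitus": 73211009,
}


def recognise_entities(text, surface_forms):
    """NER step (toy): return sorted (start, end) spans of known surface forms."""
    lowered = text.lower()
    spans = []
    for form in surface_forms:
        for match in re.finditer(re.escape(form), lowered):
            spans.append((match.start(), match.end()))
    return sorted(spans)


def link_entities(text, spans, kb):
    """Linking step (toy): map each span to a concept ID by exact lookup."""
    return [(start, end, kb[text[start:end].lower()]) for start, end in spans]


note = "History of hypertension and diabetes mellitus."
spans = recognise_entities(note, TOY_KB)
annotations = link_entities(note, spans, TOY_KB)
print(annotations)
```

A realistic system would replace both functions with learned components, for example a span detector for the first step and a candidate ranker over the terminology for the second.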

In this competition, the knowledge base to which named entities will be matched is SNOMED CT [2]. SNOMED CT is a clinical terminology containing over 360,000 medical concepts, such as specific medications and diseases. Participants will train entity linking models using a subset of MIMIC-IV-Note discharge summaries that have been annotated with SNOMED CT concepts by a team of medical professionals.  Participants might like to explore the following avenues:

  1. Whether recently released GPT-pattern large language models (LLMs) can play a role in the NER or Linking steps of Entity Linking.
  2. How the structural knowledge encoded in SNOMED CT can be used to improve the generation of concept embeddings or in concept disambiguation.

We hope that this challenge will support the clinical machine learning community in developing efficient and reliable tools for automating the coding of patient data, facilitating interoperability, decision support, and, ultimately, improving health outcomes for patients worldwide.


Participants in this challenge are expected to submit an entity linking model that can be run against a set of annotated discharge notes that have been withheld.

The competition is therefore being hosted by DrivenData, whose platform will manage registration, the code submission process, and the leaderboard [3]. The annotations and discharge notes are accessible only via PhysioNet [4].

Participants will need to sign up to DrivenData to participate in this challenge, which goes live on December 19th 2023 and closes on March 5th 2024 [5].

Participation is open to everyone - singly or in teams - except for the following:

  • Employees of SNOMED International
  • Members of the challenge's data annotation team
  • Employees of DrivenData
  • Employees of Veratai Ltd

Challenge submission site

Data Description

The SNOMED International team annotated nearly 300 discharge notes from MIMIC-IV-Note [6]. Of the 219K SNOMED CT concepts that annotators could have used (taken from the "body structure", "procedure" and "clinical finding" portions of the terminology), approximately 7,000 appear across these documents.

Documents to be annotated were randomly selected from the discharge summaries.  The team followed the process outlined in [7] and used the MedCATTrainer software [8] to mark the annotations.  Around 70% of the annotations were doubly-annotated (budgetary constraints prevented double-annotation of the remaining 30%) with all disagreements from the double-annotations being passed to a third annotator for a final vote.

The annotations have been divided into a training dataset (which participants will have access to) and a scoring dataset (withheld; against which submissions will be evaluated).

The training dataset comprises 204 documents containing 51,574 annotations, in which 5,336 distinct concepts appear. Note that some concepts appear in the test (scoring) dataset but not in the training dataset. Your models should appropriately leverage the SNOMED CT clinical terminology and the relationships contained therein to generalize to concepts not seen in the training dataset.

The files included are as follows:

  • train_annotations.csv: A comma-delimited text file with a header row and no string qualifiers. The fields are:
    • note_id: STRING. Identifies the discharge note to which the annotation belongs. This will match the corresponding note_id field in the MIMIC-IV-Note Discharge Notes dataset.
    • start: INT. The character start index of the annotation span in the document.
    • end: INT. The character end index of the annotation span in the document.
    • concept_id: INT. The corresponding SNOMED Concept ID (sctid).
  • mimic-iv_notes_training_set.csv: The relevant discharge notes extracted from MIMIC-IV-Note. The note_id and text fields are present.
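As a minimal sketch of how the two files relate, the Python snippet below joins annotations to notes on note_id and slices out the annotated text. It uses tiny made-up, in-memory stand-ins rather than the real files, and it assumes that `start` and `end` behave like a Python slice (that is, `end` is exclusive); the source does not state this, so it should be checked against the real data. The helper names `read_rows` and `attach_span_text` are hypothetical.

```python
import csv
import io


def read_rows(handle):
    """Read a CSV with a header row into a list of dicts."""
    return list(csv.DictReader(handle))


def attach_span_text(annotation_rows, note_rows):
    """Join annotations to notes on note_id and extract each annotated span.

    Assumption: text[start:end] (end exclusive) recovers the annotated span.
    """
    text_by_note = {row["note_id"]: row["text"] for row in note_rows}
    enriched = []
    for row in annotation_rows:
        start, end = int(row["start"]), int(row["end"])
        text = text_by_note[row["note_id"]]
        enriched.append({
            "note_id": row["note_id"],
            "start": start,
            "end": end,
            "concept_id": int(row["concept_id"]),
            "span_text": text[start:end],
        })
    return enriched


# Made-up stand-ins for mimic-iv_notes_training_set.csv and train_annotations.csv.
notes_file = io.StringIO('note_id,text\nnote-1,"Patient has hypertension."\n')
annotations_file = io.StringIO(
    "note_id,start,end,concept_id\nnote-1,12,24,38341003\n"
)

rows = attach_span_text(read_rows(annotations_file), read_rows(notes_file))
print(rows[0]["span_text"])
```

With the real files, the `io.StringIO` objects would be replaced by handles opened with `open(path, newline="")`.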


For this challenge, participants may use a variety of tokenizers, which leads to small variations in token start and end indices. To account for this variation, rather than using a metric like macro F1 across specific tokens, we instead use a character-level metric.

Performance for this challenge is evaluated according to a class macro-averaged character intersection-over-union (IoU), defined as follows for character classification predictions P and character ground truth classifications G:

$$\text{IoU}_{\text{class}} = \frac{P^{\text{char}}_{\text{class}} \cap G^{\text{char}}_{\text{class}}}{P^{\text{char}}_{\text{class}} \cup G^{\text{char}}_{\text{class}}}$$

$$\text{macro IoU} = \frac{\sum_{\text{classes} \in P \cup G} \text{IoU}_{\text{class}}}{N_{\text{classes} \in P \cup G}}$$

where $P^{\text{char}}_{\text{class}}$ is the set of characters in all predicted spans for a given class category, $G^{\text{char}}_{\text{class}}$ is the set of characters in all ground truth spans for a given class category, and $\text{classes} \in P \cup G$ is the set of categories present in either the ground truth or the predicted spans.

Note that the predicted concept ID must match exactly. Relationships between concepts are not taken into account for scoring purposes.
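The metric can be sketched in Python as follows. This is an illustrative reading of the definition, not the official scoring code. It assumes that the ratio is taken over the sizes of the character sets, that a character is identified by its note and its index (so spans in different notes never overlap), and that `end` is exclusive. The function names and the concept IDs 111, 222 and 333 are hypothetical placeholders.

```python
from collections import defaultdict


def char_sets_by_class(spans):
    """Group annotated characters by concept ID.

    spans: iterable of (note_id, start, end, concept_id), end exclusive.
    Each character is represented as a (note_id, index) pair.
    """
    chars = defaultdict(set)
    for note_id, start, end, concept_id in spans:
        if end <= start:
            continue  # skip empty spans so no class has an empty set
        chars[concept_id].update((note_id, i) for i in range(start, end))
    return chars


def macro_char_iou(predicted, ground_truth):
    """Class macro-averaged character IoU over classes in P or G."""
    p = char_sets_by_class(predicted)
    g = char_sets_by_class(ground_truth)
    classes = set(p) | set(g)
    if not classes:
        return 0.0
    total = 0.0
    for concept_id in classes:
        p_chars = p.get(concept_id, set())
        g_chars = g.get(concept_id, set())
        total += len(p_chars & g_chars) / len(p_chars | g_chars)
    return total / len(classes)


ground_truth = [("n1", 0, 10, 111), ("n1", 20, 30, 222)]
predicted = [("n1", 0, 5, 111), ("n1", 20, 30, 333)]
score = macro_char_iou(predicted, ground_truth)
print(score)
```

In this toy case, class 111 scores 5/10, while classes 222 and 333 each score 0 because the predicted concept ID does not match the ground truth even though the spans coincide. The macro average is therefore 0.5/3, which shows how a wrong concept on a correct span is penalised twice: once as a missed class and once as a spurious one.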

Release Notes

This dataset is release 1.0.0.

At present, this dataset contains only the training annotations. Once the Entity Linking challenge is complete, the scoring annotations will also be added to complete the dataset.

Beyond that, the authors may, in future, continue to update this dataset. Updates could include:

  1. Reviewing some of the annotation spans (a known issue is that occasionally annotators highlighted extra characters or missed characters from the span they intended to annotate).
  2. Adding meta-annotations.
  3. Adding additional notes in order to increase the number of distinct concepts present in the dataset.


The authors declare no ethics concerns. All members of the challenge team who worked with the discharge summaries, including annotators, completed training and credentialing requirements for data access.


The data annotation project and entity linking challenge have been funded by SNOMED International.

We thank the efforts of the data annotation team who worked on this project:

Vicky Hei Fung
Dr Ismat Mohd Sulaiman
Nindya Widita Ayuningtyas
Donna Truran
Hoa Ngo
Michael Bond
Mark Banks

Special thanks also to Michael Scanlan from the MIT Laboratory for Computational Physiology for managing the credentialing process for the project team and for the competition participants.

Conflicts of Interest

No conflicts of interest have been identified.


  1. Dong, H., Falis, M., Whiteley, W., Alex, B., Matterson, J., Ji, S., Chen, J., & Wu, H. (2022). Automated clinical coding: what, why, and where we are? npj Digital Medicine, 5(1).
  2. SNOMED CT website. [Accessed on 19 Dec 2023]
  3. DrivenData website. [Accessed on 19 Dec 2023]
  4. Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
  5. SNOMED Challenge Submission Page. [Accessed on 19 Dec 2023]
  6. Johnson, A., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2). PhysioNet.
  7. Healthcare Text Annotation Guidelines. (2023, October 4). Google Inc.
  8. Searle, T., Kraljevic, Z., Bendayan, R., Bean, D., & Dobson, R. (2019). MedCATTrainer: A Biomedical Free Text Annotation Interface with Active Learning and Research Use Case Specific Customisation.

Parent Projects
SNOMED CT Entity Linking Challenge was derived from: Please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
