
Analysis of Clinical Text: Task 14 of SemEval 2015

Guergana Savova

Published: Dec. 24, 2014. Version: 1.0


When using this resource, please cite:
Savova, G. (2014). Analysis of Clinical Text: Task 14 of SemEval 2015 (version 1.0). PhysioNet. https://doi.org/10.13026/xb00-c765.

Additionally, please cite the original publication:

Noémie Elhadad, Sameer Pradhan, Sharon Gorman, Suresh Manandhar, Wendy Chapman, Guergana Savova. SemEval-2015 Task 14: Analysis of Clinical Text. Proc. of the 9th International Workshop on Semantic Evaluation (SemEval 2015). June 2015, Denver, CO.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems, organized under the umbrella of SIGLEX, the Special Interest Group on the Lexicon of the Association for Computational Linguistics. This project describes Task 14 ("Analysis of Clinical Text") of the International Workshop on Semantic Evaluation 2015 (SemEval 2015). The purpose of Task 14 is to advance current research on natural language processing (NLP) methods for the clinical domain and to introduce clinical text processing to the broader NLP community. The task aims to combine supervised methods for text analysis with unsupervised approaches. More specifically, it combines supervised methods for entity/acronym/abbreviation recognition and mapping to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) with access to a larger corpus of clinical text for unsupervised techniques. The task also evaluates systems on template filling, which involves populating eight attributes of each identified disorder with their normalized values.


Objective

SemEval-2015 Task 14, Analysis of Clinical Text, is the newest iteration in a series of community challenges organized around named entity recognition for clinical texts [1,2]. The tasks leverage annotations from the Shared Annotated Resources (ShARe) corpus, which consists of clinical notes with annotated mentions of disorders, along with their normalization to a medical terminology and eight additional attributes [3]. The challenge has two subtasks:

  • Subtask 1: named entity recognition
  • Subtask 2: template slot filling
    • 2a: template slot filling given gold-standard disorder spans
    • 2b: end-to-end disorder span identification together with template slot filling

The purpose of the challenge is to identify advances in clinical named entity recognition and to establish the state of the art in disorder template slot filling. More specifically, the challenge combines supervised methods for entity/acronym/abbreviation recognition and mapping to UMLS CUIs with access to a larger corpus of unannotated clinical text for unsupervised techniques. It also evaluates systems on template filling, which involves populating eight attributes of each identified disorder with their normalized values.

Subtask 1: Disorder Identification

For Subtask 1, the goal is to recognize the span of a disorder mention in input clinical text and to normalize the disorder to a unique CUI in the UMLS/SNOMED-CT terminology. Here, the UMLS/SNOMED-CT terminology is defined as the set of CUIs in the UMLS restricted to concepts that are included in the SNOMED-CT terminology. Participants were free to use any publicly available resources, such as the UMLS, WordNet, and Wikipedia, as well as the large corpus of unannotated clinical notes.
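To make the expected input and output concrete, below is a minimal sketch of a dictionary-lookup baseline for this subtask. The lexicon, the example note, and the CUIs shown are illustrative assumptions, not taken from the official data; the sketch only shows the span-plus-CUI output the subtask asks for.

    # Minimal dictionary-lookup sketch for Subtask 1 (illustrative only;
    # the lexicon entries and CUIs below are made up, not official data).
    import re

    LEXICON = {
        "numbness": "C0000001",       # placeholder CUI
        "facial rash": "C0000002",    # placeholder CUI
        "schizophrenia": "C0000003",  # placeholder CUI
    }

    def find_disorders(text):
        """Return (start, end, mention, cui) tuples for lexicon matches."""
        results = []
        for surface, cui in LEXICON.items():
            for m in re.finditer(re.escape(surface), text, flags=re.IGNORECASE):
                results.append((m.start(), m.end(), m.group(0), cui))
        return sorted(results)

    note = "Patient has a facial rash but denies numbness."
    for start, end, mention, cui in find_disorders(note):
        print(f"{start}-{end}\t{mention}\t{cui}")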

Subtask 2: Disorder Slot Filling

This subtask focuses on identifying the normalized value of nine attributes: the CUI of the disorder (much as in Subtask 1), negation indicator, subject, uncertainty indicator, course, severity, conditional, generic indicator, and body location. We frame Subtask 2 as a slot-filling task: given a disorder mention in a clinical note (either provided as gold standard or identified automatically), identify the normalized value of each of the nine slots. Note that there are two aspects to slot filling: cues in the text and the normalized value. In this task, we focus on the normalized value and ignore cue detection. To understand the state of the art for this new task, we consider two settings. In both cases, given a disorder span, participants are asked to identify the nine attributes related to the disorder. In 2a, the gold-standard disorder span(s) are provided as input. In 2b, no gold-standard information is provided; systems must both recognize spans for disorder mentions and fill in the values of the nine attributes.
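As an illustration of the template this subtask populates, here is a sketch of the nine slots as a Python dataclass. The slot names follow the task description; the default values and the "CUI-less" convention shown are assumptions, and the exact normalized value sets are defined in the ShARe annotation guidelines.

    # Sketch of the nine-slot disorder template. Slot names follow the task
    # description; the defaults shown are assumptions, not official values.
    from dataclasses import dataclass

    @dataclass
    class DisorderTemplate:
        cui: str = "CUI-less"        # disorder CUI, as in Subtask 1
        negation: str = "no"
        subject: str = "patient"
        uncertainty: str = "no"
        course: str = "unmarked"
        severity: str = "unmarked"
        conditional: str = "false"
        generic: str = "false"
        body_location: str = "null"  # CUI of the body site, if any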


Participation

This year, we shift focus to the task of identifying a series of attributes describing a disorder mention. As in previous challenges, we use the ShARe corpus, and we introduce a new set of annotations for disorder attributes.

Data Access

Clinical data, even in de-identified form, are subject to various privacy controls. To access the annotations along with the associated clinical notes, participants must complete the PhysioNet credentialing process by signing a Data Use Agreement and completing a short online course in human subjects research.

Challenge Timeline

  • Trial data released: May 30, 2014
  • Training data ready: July 30, 2014
  • Evaluation period starts: December 5, 2014
  • Evaluation period ends: December 22, 2014
  • Paper submission due: January 30, 2015
  • Paper reviews due: February 28, 2015
  • Paper acceptance notification: March 5, 2015
  • Camera ready due: March 30, 2015
  • Register to participate in the SemEval workshop: May 10, 2015
  • SemEval workshop: June 4-5, 2015 (co-located with NAACL-2015 in Denver, Colorado)

Data Description

The dataset used is the ShARe corpus [3]. As a whole, it consists of 531 de-identified clinical notes (a mix of discharge summaries and radiology reports) selected from the MIMIC-II clinical database (version 2.5). The test set is a previously unseen set of clinical notes from the ShARe corpus. In addition to the ShARe corpus annotations, task participants were provided with a large set of unlabeled de-identified clinical notes, also from MIMIC-II (400,000+ notes). The ShARe corpus contains gold-standard annotations of disorder mentions and a set of attributes. We refer to the nine attributes as a disorder template. The annotation schema for the template was derived from the established Clinical Element Model [4]. The complete guidelines for the ShARe annotations are available on the ShARe website. Here, we provide a few examples to illustrate what each attribute captures (a short sketch after the list shows some of them as filled templates):

  • In the statement “patient denies numbness,” the disorder numbness has an associated negation attribute set to “yes.”
  • In the sentence “son has schizophrenia”, the disorder schizophrenia has a subject attribute set to “family member.”
  • The sentence “Evaluation of MI.” contains a disorder (MI) with the uncertainty attribute set to “yes”.
  • An example of disorder with a non-default course attribute can be found in the sentence “The cough got worse over the next two weeks.”, where its value is “worsened.”
  • The severity attribute is set to “slight” in “He has slight bleeding.”
  • In the sentence “Pt should come back if any rash occurs,” the disorder rash has a conditional attribute with value “true.”
  • In the sentence “Patient has a facial rash”, the body location associated with the disorder “facial rash” is “face” with CUI C0015450. Note that the body location does not have to be a substring of the disorder mention, even though in this example it is.
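Building on the DisorderTemplate sketch from the Subtask 2 section above, filled templates for a few of these sentences would look roughly as follows. Only non-default slots are set; the values mirror the bullets above, and the representation itself is an assumption.

    # Filled templates for some of the example sentences (illustrative).
    denies_numbness = DisorderTemplate(negation="yes")             # "patient denies numbness"
    son_schizophrenia = DisorderTemplate(subject="family member")  # "son has schizophrenia"
    evaluation_of_mi = DisorderTemplate(uncertainty="yes")         # "Evaluation of MI."
    facial_rash = DisorderTemplate(body_location="C0015450")       # body location "face"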

Evaluation

Evaluation for Subtask 1 is reported as an F1 score, which captures both the disorder span recognition and the CUI normalization steps. We compute two versions of the F1 score (a sketch of the matching logic follows the list):

  • Strict F-score: a predicted mention is considered a true positive if (i) the character span of the disorder is exactly the same as for the gold-standard mention; and (ii) the predicted CUI is correct. The predicted disorder is considered a false positive if the span is incorrect or the CUI is incorrect.
  • Relaxed F-score: a predicted mention is a true positive if (i) there is any word overlap between the predicted mention span and the gold-standard span (both in the case of contiguous and discontiguous spans); and (ii) the predicted CUI is correct. The predicted mention is a false positive if the span shares no words with the gold-standard span or the CUI is incorrect.
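As a rough sketch of the matching logic (the official evaluation script handles discontiguous spans and other edge cases; the data representations here are assumptions):

    # Sketch of strict vs. relaxed matching for Subtask 1 scoring.
    # A mention is modeled as a set of (start, end) character fragments plus
    # a CUI; word offsets are precomputed sets (an assumed representation).

    def strict_match(pred_spans, pred_cui, gold_spans, gold_cui):
        # Exact character spans and correct CUI.
        return pred_spans == gold_spans and pred_cui == gold_cui

    def relaxed_match(pred_words, pred_cui, gold_words, gold_cui):
        # Any word overlap and correct CUI.
        return bool(pred_words & gold_words) and pred_cui == gold_cui

    def f1(tp, fp, fn):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        return 2 * precision * recall / denom if denom else 0.0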

We introduce a variety of evaluation metrics that capture different aspects of the disorder template slot-filling task. Overall, for Subtask 2a, we report average unweighted accuracy, weighted accuracy, and per-slot weighted accuracy for each of the nine slots. For Subtask 2b, we report the same metrics and, in addition, relaxed F1 for span identification. For further details, please refer to the associated paper [1].
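The weighting scheme is defined in the associated paper [1] and is not reproduced here; as a simple point of reference, an unweighted per-slot accuracy can be sketched as:

    # Unweighted per-slot accuracy sketch for Subtask 2 (illustrative; the
    # weighted variants from the paper are not reproduced here).
    def per_slot_accuracy(pred_templates, gold_templates, slot):
        """Fraction of aligned disorders whose predicted value for `slot`
        matches the gold value; templates are dicts keyed by slot name."""
        pairs = list(zip(pred_templates, gold_templates))
        correct = sum(p[slot] == g[slot] for p, g in pairs)
        return correct / len(pairs) if pairs else 0.0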


Release Notes

The formal challenge is now complete, but the data remain available for those interested in exploring the tasks. SemEval-2015 was co-located with NAACL-HLT 2015, the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies, and with *SEM 2015, the Fourth Joint Conference on Lexical and Computational Semantics, in Denver, Colorado, USA. SemEval-2015 comprised 17 tasks evaluating various computational semantic systems.

Post Challenge Update

For Subtask 1 (disorder span detection and normalization), 16 teams participated. The best system yielded a strict F1-score of 75.7, with a precision of 78.3 and a recall of 73.2. For Subtask 2a (template slot filling given gold-standard disorder spans), six teams participated. The best system yielded a combined overall weighted accuracy for slot filling of 88.6. For Subtask 2b (disorder recognition and template slot filling), nine teams participated. The best system yielded a combined relaxed F1 (for span detection) and overall weighted accuracy of 80.8.


Acknowledgements

This work was supported by the Shared Annotated Resources (ShARe) project NIH R01 GM090187. We greatly appreciate the hard work of our program committee members and the ShARe annotators. We are very grateful to the PhysioNet team for making the MIMIC resource available to the community.


Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Noémie Elhadad, Sameer Pradhan, Sharon Gorman, Suresh Manandhar, Wendy Chapman, Guergana Savova. SemEval-2015 Task 14: Analysis of Clinical Text. Proc. of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, CO, June 2015. https://www.aclweb.org/anthology/S15-2051.pdf
  2. Semeval-2015: Task 14 website. https://alt.qcri.org/semeval2015/task14/ [Accessed: 23 Dec 2020]
  3. Shared Annotated Resources (ShARe) project website. https://healthnlp.hms.harvard.edu/ [Accessed: 23 Dec 2020]
  4. Clinical Element Model website. http://www.opencem.org/ [Accessed: 23 Dec 2020]

Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research

Discovery

DOI (version 1.0):
https://doi.org/10.13026/xb00-c765

DOI (latest version):
https://doi.org/10.13026/96re-n451

Topics:
semeval, nlp

Versions
  • 1.0 - Dec. 24, 2014
  • 2.0 - Dec. 28, 2014
