Analysis of Clinical Text: Task 14 of SemEval 2015
Published: Dec. 28, 2014. Version: 2.0
Pradhan, Sameer; Elhadad, Noemie; Chapman, Wendy; Manandhar, Suresh; Savova, Guergana. 2014. SemEval 2014 Task 7: Analysis of Clinical Text. Proc. of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 54-62. Dublin, Ireland. August 23-24, 2014. http://www.aclweb.org/anthology/S14-2007
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems, organized under the umbrella of SIGLEX, the Special Interest Group on the Lexicon of the Association for Computational Linguistics. This project describes the "Analysis of Clinical Text" task of the International Workshop on Semantic Evaluation 2014 and 2015 (SemEval 2014 and 2015) [2,3,4,5]. The purpose of the task is to advance research on natural language processing (NLP) methods used in the clinical domain, and to introduce clinical text processing to the broader NLP community. The task aims to combine supervised methods for text analysis with unsupervised approaches for entity/acronym/abbreviation recognition and mapping to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). It also evaluated systems on the task of template filling, which involves populating eight attributes of the identified disorders with their normalized values.
The Analysis of Clinical Text task was organized as a challenge within SemEval 2014 and 2015. It is the newest iteration in a series of community challenges organized around named entity recognition for clinical texts. The tasks leverage annotations from the Shared Annotated Resources (ShARe) corpus, which consists of MIMIC II v2.5 clinical notes with annotated disorder mentions, along with their normalization to a medical terminology and eight additional attributes. The challenge has two subtasks (Subtask 1 and Subtask 2):
- 1: named entity recognition
- 2: template slot filling
- 2a: template slot filling given gold-standard disorder spans
- 2b: end-to-end disorder span identification together with template slot filling
The purpose of the challenge is to identify advances in clinical named entity recognition and establish the state of the art in disorder template slot filling.
Subtask 1: Disorder Identification
For subtask 1, the goal is to recognize the span of a disorder mention in input clinical text and to normalize the disorder to a unique CUI in the UMLS/SNOMED-CT terminology. UMLS/SNOMED-CT terminology is defined as the set of CUIs in the UMLS, but restricted to concepts that are included in the SNOMED-CT terminology.
Subtask 2: Disorder Slot Filling
This subtask focuses on identifying the normalized values of nine attributes: the CUI of the disorder, negation indicator, subject, uncertainty indicator, course, severity, conditional, generic indicator, and body location. We describe Subtask 2 as a slot-filling task: given a disorder mention (either provided as gold standard or identified automatically) in a clinical note, identify the normalized value of each of the nine slots. Note that there are two aspects to slot filling: the cues in the text and the normalized value. In this subtask, we focus on the normalized value and ignore cue detection. To understand the state of the art for this new task, we considered two settings, 2a and 2b. In both, given a disorder span, participants are asked to identify the nine attributes related to the disorder. In 2a, the gold-standard disorder span(s) are provided as input. In 2b, no gold-standard information is provided; systems must recognize the spans of disorder mentions and fill in the values of the nine attributes.
The SemEval challenge was held in 2014 and 2015 and is described in [4,5].
Clinical data, even in its de-identified form, has various privacy controls in place. In order to access the associated clinical notes, participants must complete the PhysioNet credentialing process by signing a Data Use Agreement and completing a short online course in human subjects research.
The dataset used is the ShARe corpus [4,5]. As a whole, it consists of 531 de-identified clinical notes (a mix of discharge summaries and radiology reports) selected from the MIMIC-II clinical database (Version 2.5). The ShARe corpus contains gold-standard annotations of disorder mentions and a set of attributes. We refer to the nine attributes as a disorder template. The annotation schema for the template was derived from the established clinical element model. Here, we provide a few examples to illustrate what each attribute captures:
- In the statement “patient denies numbness,” the disorder numbness has an associated negation attribute set to “yes.”
- In the sentence “son has schizophrenia”, the disorder schizophrenia has a subject attribute set to “family member.”
- The sentence “Evaluation of MI.” contains a disorder (MI) with the uncertainty attribute set to “yes”.
- An example of disorder with a non-default course attribute can be found in the sentence “The cough got worse over the next two weeks.”, where its value is “worsened.”
- The severity attribute is set to “slight” in “He has slight bleeding.”
- In the sentence “Pt should come back if any rash occurs,” the disorder rash has a conditional attribute with value “true.”
- In the sentence “Patient has a facial rash”, the body location associated with the disorder “facial rash” is “face” with CUI C0015450. Note that the body location does not have to be a substring of the disorder mention, even though in this example it is.
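The examples above can be written down as a small data structure. The following is a minimal sketch, not the official ShARe annotation format: the class name `DisorderTemplate`, the field names, and the use of `None` for a slot with no explicit value are all assumptions made for illustration. Only the slot values quoted in the examples are taken from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DisorderTemplate:
    """Nine-slot template for one disorder mention (hypothetical layout)."""
    mention: str                          # disorder text as it appears in the note
    cui: Optional[str] = None             # UMLS CUI of the disorder
    negation: Optional[str] = None
    subject: Optional[str] = None
    uncertainty: Optional[str] = None
    course: Optional[str] = None
    severity: Optional[str] = None
    conditional: Optional[str] = None
    generic: Optional[str] = None
    body_location: Optional[str] = None   # CUI of the body location


def filled_slots(template):
    """Return only the slots that carry an explicit (non-None) value."""
    slots = vars(template).copy()
    slots.pop("mention")  # the mention text is not one of the nine slots
    return {name: value for name, value in slots.items() if value is not None}


# One template per example sentence; None here only means "not set in this sketch".
examples = [
    DisorderTemplate("numbness", negation="yes"),
    DisorderTemplate("schizophrenia", subject="family member"),
    DisorderTemplate("MI", uncertainty="yes"),
    DisorderTemplate("cough", course="worsened"),
    DisorderTemplate("bleeding", severity="slight"),
    DisorderTemplate("rash", conditional="true"),
    DisorderTemplate("facial rash", body_location="C0015450"),
]

for ex in examples:
    print(ex.mention, filled_slots(ex))
```

In the real corpus every slot has a normalized value, including a default when the text gives no cue; the associated papers [4,5] define those value sets.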
Evaluation for subtask 1 is reported as an F1 score, which captures both the disorder span recognition and the CUI normalization steps. We compute two versions of the F-score:
- Strict F-score: a predicted mention is considered a true positive if (i) the character span of the disorder is exactly the same as for the gold-standard mention; and (ii) the predicted CUI is correct. The predicted disorder is considered a false positive if the span is incorrect or the CUI is incorrect.
- Relaxed F-score: a predicted mention is a true positive if (i) there is any word overlap between the predicted mention span and the gold-standard span (both in the case of contiguous and discontiguous spans); and (ii) the predicted CUI is correct. The predicted mention is a false positive if the span shares no words with the gold-standard span or the CUI is incorrect.
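The two matching criteria can be sketched in a few lines of code. This is an illustrative approximation rather than the official scorer: the names `Mention`, `is_match`, and `f_score` are hypothetical, the CUIs are placeholders, word overlap is approximated by character-offset overlap, and each gold mention is assumed to be matched by at most one prediction.

```python
from typing import NamedTuple, Tuple


class Mention(NamedTuple):
    spans: Tuple[Tuple[int, int], ...]  # (start, end) character offsets, end exclusive
    cui: str


def spans_overlap(a, b):
    """True if any span of `a` shares at least one character with a span of `b`."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a.spans for s2, e2 in b.spans)


def is_match(pred, gold, strict):
    """Apply the strict or relaxed criterion; the CUI must be correct in both."""
    if pred.cui != gold.cui:
        return False
    if strict:
        return sorted(pred.spans) == sorted(gold.spans)
    return spans_overlap(pred, gold)


def f_score(preds, golds, strict=True):
    """Return (precision, recall, F1) with greedy one-to-one matching (an assumption)."""
    matched_gold = set()
    tp = 0
    for pred in preds:
        for i, gold in enumerate(golds):
            if i not in matched_gold and is_match(pred, gold, strict):
                matched_gold.add(i)
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(golds) if golds else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Placeholder CUIs; the first prediction covers only part of the gold span.
gold = [Mention(((0, 11),), "C0000001"), Mention(((20, 24),), "C0000002")]
pred = [Mention(((7, 11),), "C0000001"), Mention(((20, 24),), "C0000002")]

print(f_score(pred, gold, strict=True))
print(f_score(pred, gold, strict=False))
```

In this toy case the partial span counts as an error under the strict criterion but as a true positive under the relaxed one, which is why relaxed scores are never lower than strict scores for the same output.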
We introduce a variety of evaluation metrics, which capture different aspects of the task of disorder template slot filling. Overall, for subtask 2a, we report average unweighted accuracy, weighted accuracy, and per-slot weighted accuracy for each of the nine slots. For subtask 2b, we report the same metrics and, in addition, relaxed F for span identification. For further details, please refer to the associated papers [4,5].
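As a rough illustration of the unweighted case only, the sketch below computes per-slot accuracy and its plain average over the nine slots for gold-standard spans (the 2a setting). The function names, the slot keys, the placeholder CUIs, and the `"default"` placeholder value are assumptions; the weighting scheme used for the weighted metrics is defined in the papers [4,5] and is not reproduced here.

```python
SLOTS = ("cui", "negation", "subject", "uncertainty", "course",
         "severity", "conditional", "generic", "body_location")


def template(**values):
    """Build a slot dict; slots not given get the placeholder value 'default'."""
    return {slot: values.get(slot, "default") for slot in SLOTS}


def per_slot_accuracy(golds, preds):
    """Fraction of disorders for which each slot value is predicted correctly."""
    if not golds or len(golds) != len(preds):
        raise ValueError("need aligned, non-empty gold and predicted templates")
    n = len(golds)
    return {
        slot: sum(g[slot] == p[slot] for g, p in zip(golds, preds)) / n
        for slot in SLOTS
    }


def unweighted_accuracy(golds, preds):
    """Plain average of the per-slot accuracies (every slot counts equally)."""
    acc = per_slot_accuracy(golds, preds)
    return sum(acc.values()) / len(SLOTS)


# Two disorders with placeholder CUIs; the system misses one severity value.
golds = [template(cui="C0000001", negation="yes"),
         template(cui="C0000002", severity="slight")]
preds = [template(cui="C0000001", negation="yes"),
         template(cui="C0000002")]

print(per_slot_accuracy(golds, preds)["severity"])
print(unweighted_accuracy(golds, preds))
```

Because most slots take their default value for most disorders, an unweighted average like this one can look high even for a system that rarely predicts non-default values, which is one motivation for reporting weighted accuracy alongside it.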
The formal challenge was held at SemEval 2014 and 2015 and is now complete, but the data remains available for those interested in exploring the tasks. The original text can be obtained in the Files section for this project. The associated annotations are distributed through the hNLP Center. Please contact the Corresponding Author, Guergana Savova, for more information or to request the annotations.
Post Challenge Update
For full evaluation statistics, please refer to the papers [4,5].
SemEval 2014: A total of 21 teams competed in subtask A, and 18 of those also participated in subtask B. For subtask A, the best system had a strict F1-score of 81.3, with a precision of 84.3 and recall of 78.6. For subtask B, the same group had the best strict accuracy of 74.1.
SemEval 2015: For Subtask 1 (disorder span detection and normalization), 16 teams participated. The best system yielded a strict F1-score of 75.7, with a precision of 78.3 and recall of 73.2. For Subtask 2a (template slot filling given gold standard disorder spans), six teams participated. The best system yielded a combined overall weighted accuracy for slot filling of 88.6. For Subtask 2b (disorder recognition and template slot filling), nine teams participated. The best system yielded a combined relaxed F (for span detection) and overall weighted accuracy of 80.8.
This work was supported by the Shared Annotated Resources (ShARe) project funded by NIH R01GM090187 and R01GM114355. We greatly appreciate the hard work of our program committee members and the ShARe annotators. We are very grateful to the PhysioNet team for making the MIMIC resource available to the community.
Conflicts of Interest
The authors have no conflicts of interest to declare.
- Clinical Element Model website. http://www.opencem.org/ [Accessed: 23 Dec 2020]
- SemEval 2014: Task 7 Analysis of Clinical Text. https://alt.qcri.org/semeval2014/task7/ [Accessed: 23 Dec 2020]
- SemEval 2015: Task 14 Analysis of Clinical Text. https://alt.qcri.org/semeval2015/task14/ [Accessed: 23 Dec 2020]
- Pradhan, Sameer; Elhadad, Noemie; Chapman, Wendy; Manandhar, Suresh; Savova, Guergana. 2014. SemEval 2014 Task 7: Analysis of Clinical Text. Proc. of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 54-62. Dublin, Ireland. August 23-24, 2014. http://www.aclweb.org/anthology/S14-2007
- Elhadad, Noemie; Pradhan, Sameer; Lipsky-Gorman, Sharon; Manandhar, Suresh; Chapman, Wendy; Savova, Guergana. 2015. SemEval 2015 Task 14: Analysis of Clinical Text. Proc. of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Denver, CO. June 4, 2015. http://anthology.aclweb.org/S/S15/S15-2051.pdf
- Health Natural Language Processing (hNLP) Center website. http://center.healthnlp.org/ [Accessed: 30 Dec 2020]
Only PhysioNet credentialed users who sign the specified DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0