CUILESS2016

When referencing this data, please cite:

John D. Osborne, Matthew B. Neu, Maria I. Danila, Thamar Solorio and Steven J. Bethard. CUILESS2016: a clinical corpus applying compositional normalization of text mentions. Journal of Biomedical Semantics 2018 9:2. doi:10.1186/s13326-017-0173-6

Please also include the standard citation for PhysioNet:

Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101(23):e215-e220 [Circulation Electronic Pages; http://circ.ahajournals.org/cgi/content/full/101/23/e215]; 2000 (June 13).

Introduction

The Concept Unique Identifier (CUI)-less database contains a corpus of "CUI-less" concepts taken from SemEval2015 Task 14 that have been assigned CUIs. The annotation process allows assignment of CUIs from any Unified Medical Language System (UMLS) semantic group and compositional normalization using more than one CUI per disease entity. Concepts are mapped to SNOMED CT as represented in the September 2016 version found in UMLS 2016AB.

Data Collection

We annotated 5397 disorder mentions from the ShARe corpus to SNOMED CT that were previously normalized as “CUI-less” in the “SemEval-2015 Task 14” shared task because they lacked a pre-coordinated mapping. Unlike the previous normalization method, we do not restrict concept mappings to a particular set of the Unified Medical Language System (UMLS) semantic types and allow normalization to occur to multiple UMLS Concept Unique Identifiers (CUIs). We computed annotator agreement and assessed semantic coverage with this method. See the referenced paper for more details.

Files

The development (CUILESS_DEV.tar.gz) and training (CUILESS_TRAIN.tar.gz) data sets consist of pipe-delimited data files in the same format as SemEval2015 Task 14 (http://alt.qcri.org/semeval2015/task14/index.php?id=data-and-tools), shown below:

    report name|disorder-span|cui|Norm_NI|Cue_NI|Norm_SC|Cue_SC|Norm_UI|Cue_UI|
    Norm_CC|Cue_CC|Norm_SV|Cue_SV|Norm_CO|Cue_CO|Norm_GC|Cue_GC|Norm_BL|Cue_BL|
    Norm_DT|Norm_TE|Cue_TE

The only difference is that the "cui" column can consist of one or more CUIs separated by spaces.
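
As an illustration of the format above, here is a minimal Python sketch that splits one pipe-delimited record into named fields and turns the "cui" column into a list. The field list is copied from the format line above; the function name `parse_record` and the sample line (including its report name, span, and CUI values) are hypothetical placeholders, not taken from the corpus.

```python
from typing import Dict, List, Union

# Field names in the order given by the format description above.
FIELDS: List[str] = [
    "report name", "disorder-span", "cui",
    "Norm_NI", "Cue_NI", "Norm_SC", "Cue_SC", "Norm_UI", "Cue_UI",
    "Norm_CC", "Cue_CC", "Norm_SV", "Cue_SV", "Norm_CO", "Cue_CO",
    "Norm_GC", "Cue_GC", "Norm_BL", "Cue_BL",
    "Norm_DT", "Norm_TE", "Cue_TE",
]


def parse_record(line: str) -> Dict[str, Union[str, List[str]]]:
    """Split one pipe-delimited line into a dict keyed by field name.

    The "cui" value is returned as a list, because this corpus allows
    one or more space-separated CUIs in that column.
    """
    values = line.rstrip("\r\n").split("|")
    if len(values) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} fields, found {len(values)}"
        )
    record: Dict[str, Union[str, List[str]]] = dict(zip(FIELDS, values))
    record["cui"] = values[2].split()
    return record


# Hypothetical sample line: every value here is a made-up placeholder.
sample = "|".join(
    ["report_0001.txt", "10-25", "C0000001 C0000002"] + ["null"] * 19
)
parsed = parse_record(sample)
print(parsed["cui"])
```

A line whose field count does not match the format description raises a `ValueError`, so that unexpected variations in the real files are noticed rather than silently misaligned.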

Clinical text is not distributed in this corpus. Interested parties can download the original clinical text after signing the Data Usage Agreement as outlined at: http://alt.qcri.org/semeval2015/task14/index.php?id=data-and-tools. This dataset is free to download, provided that interested parties agree not to redistribute the data or attempt to re-identify patients in the dataset.

Contact

John D Osborne [josborne (at) uabmc (dot) edu]

    Name                                  Last modified     Size
    CUILESS_DEV.tar.gz                    2018-01-19 16:20  39K
    CUILESS_TRAIN.tar.gz                  2018-01-19 16:20  75K
    DOI                                   2018-08-29 16:32  19
    MD5SUMS                               2018-08-30 16:37  264
    SHA1SUMS                              2018-08-30 16:37  304
    SHA256SUMS                            2018-08-30 16:37  424
    semanticly_similiar_concept_sets.txt  2018-01-19 16:20  34K
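
The checksum files in the listing can be used to confirm that a download is intact. The following Python sketch checks files against SHA256SUMS. It assumes, without confirmation from this page, that SHA256SUMS follows the common `sha256sum` layout of one `<hex digest>  <file name>` pair per line and that it sits in the same directory as the downloaded files; the function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(sums_file: str) -> Dict[str, bool]:
    """Compare each listed file against its expected digest.

    Assumes each non-empty line is "<hex digest>  <file name>"
    (the usual sha256sum layout). Files are looked up next to the
    checksum file. Returns {file name: True if the digest matches}.
    """
    sums_path = Path(sums_file)
    results: Dict[str, bool] = {}
    for line in sums_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # "*" marks binary mode in sha256sum output
        target = sums_path.parent / name
        results[name] = (
            target.is_file() and sha256_of(target) == expected.lower()
        )
    return results
```

For example, `verify_checksums("SHA256SUMS")` run in the download directory would report a file as `False` if it is missing or its contents differ from the published digest.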

Updated Friday, 28 October 2016 at 16:58 EDT

PhysioNet is supported by the National Institute of General Medical Sciences (NIGMS) and the National Institute of Biomedical Imaging and Bioengineering (NIBIB) under NIH grant number 2R01GM104987-09.