Database Open Access
CUILESS2016
John Osborne, Maria Ioana Danila, Steven Bethard
Published: Jan. 24, 2018. Version: 1.0.0
New Database Added: CUILESS16 (Jan. 24, 2018, midnight)
The Concept Unique Identifier (CUI)-less database contains a corpus of "CUI-less" concepts taken from SemEval-2015 Task 14 that have since been assigned CUIs. The annotation process allows assignment of CUIs from any Unified Medical Language System (UMLS) semantic group, as well as compositional normalization using more than one CUI per disease entity. Concepts are mapped to SNOMED CT as represented in the September 2016 version found in UMLS 2016AB.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Data Collection
We annotated 5397 disorder mentions from the ShARe corpus to SNOMED CT that were previously normalized as “CUI-less” in the “SemEval-2015 Task 14” shared task because they lacked a pre-coordinated mapping. Unlike the previous normalization method, we do not restrict concept mappings to a particular set of the Unified Medical Language System (UMLS) semantic types and allow normalization to occur to multiple UMLS Concept Unique Identifiers (CUIs). We computed annotator agreement and assessed semantic coverage with this method. See the referenced paper for more details.
Files
The development (CUILESS_DEV.tar.gz) and training (CUILESS_TRAIN.tar.gz) data sets consist of pipe-delimited data files in the same format as SemEval-2015 Task 14 (http://alt.qcri.org/semeval2015/task14/index.php?id=data-and-tools), with the following columns:
report name|disorder-span|cui|Norm_NI|Cue_NI|Norm_SC|Cue_SC|Norm_UI|Cue_UI|Norm_CC|Cue_CC|Norm_SV|Cue_SV|Norm_CO|Cue_CO|Norm_GC|Cue_GC|Norm_BL|Cue_BL|Norm_DT|Norm_TE|Cue_TE
The only difference is that the "cui" column can contain one or more CUIs separated by spaces.
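As an illustration of the record layout described above, the following Python sketch splits one pipe-delimited line into named fields and turns the "cui" column into a list. It assumes that every line carries exactly the 22 columns listed above and that no field contains a literal pipe character. The function name `parse_record` and the sample values (file name, span, and CUIs) are hypothetical placeholders, not data from the corpus.

```python
from typing import Dict, List, Union

# Column names copied from the format line in this section.
COLUMNS: List[str] = (
    "report name|disorder-span|cui|Norm_NI|Cue_NI|Norm_SC|Cue_SC|Norm_UI|Cue_UI|"
    "Norm_CC|Cue_CC|Norm_SV|Cue_SV|Norm_CO|Cue_CO|Norm_GC|Cue_GC|Norm_BL|Cue_BL|"
    "Norm_DT|Norm_TE|Cue_TE"
).split("|")


def parse_record(line: str) -> Dict[str, Union[str, List[str]]]:
    """Parse one pipe-delimited line; the "cui" field becomes a list of CUIs."""
    fields = line.rstrip("\r\n").split("|")
    if len(fields) != len(COLUMNS):
        raise ValueError(
            f"expected {len(COLUMNS)} fields, found {len(fields)}"
        )
    record: Dict[str, Union[str, List[str]]] = {}
    for name, value in zip(COLUMNS, fields):
        record[name] = value.strip()
    # One or more CUIs separated by spaces (compositional normalization).
    record["cui"] = fields[2].split()
    return record


# Hypothetical sample line: placeholder file name, span, and CUIs.
SAMPLE_LINE = "|".join(
    ["example_report.txt", "10-25", "C0000001 C0000002"] + ["null"] * 19
)

if __name__ == "__main__":
    parsed = parse_record(SAMPLE_LINE)
    print(parsed["report name"], parsed["cui"])
```

Keeping the CUIs as a list, rather than a single string, makes it straightforward to treat single-CUI and multi-CUI annotations uniformly in downstream code.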
Clinical text is not distributed in this corpus. Interested parties can download the original clinical text after signing the Data Usage Agreement as outlined at: http://alt.qcri.org/semeval2015/task14/index.php?id=data-and-tools. This dataset is free to download, provided that the interested party agrees not to redistribute the data or attempt to re-identify patients in the dataset.
Contributors
John D. Osborne
Assistant Professor
Informatics Institute (General Medicine)
University of Alabama at Birmingham
Access
Access Policy:
Anyone can access the files, as long as they conform to the terms of the specified license.
License (for files):
Open Data Commons Attribution License v1.0
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/C2F66G
Files
Total uncompressed size: 148.3 KB.
Access the files
- Download the ZIP file (149.0 KB)
- Access the files using the Google Cloud Storage Browser here. Login with a Google account is required.
- Access the data using the Google Cloud command line tools (please refer to the gsutil documentation for guidance):
  gsutil -m -u YOUR_PROJECT_ID cp -r gs://cuiless16-1.0.0.physionet.org DESTINATION
- Download the files using your terminal:
  wget -r -N -c -np https://physionet.org/files/cuiless16/1.0.0/
- Download the files using AWS command line tools:
  aws s3 sync s3://physionet-open/cuiless16/1.0.0/ DESTINATION
Name | Size | Modified |
---|---|---|
CUILESS_DEV.tar.gz (download) | 39.0 KB | 2018-01-19 |
CUILESS_TRAIN.tar.gz (download) | 75.5 KB | 2018-01-19 |
SHA256SUMS.txt (download) | 272 B | 2019-02-20 |
semanticly_similiar_concept_sets.txt (download) | 33.6 KB | 2018-01-19 |
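
The listing above includes a SHA256SUMS.txt file, which can be used to check that the downloaded files are intact. The Python sketch below shows one way to do this. It assumes the checksum file follows the usual `sha256sum` layout of one `<hex digest>  <file name>` pair per line; that layout is an assumption and is not confirmed by this page. The function names `sha256_of` and `verify_checksums` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(
    directory: Path, sums_name: str = "SHA256SUMS.txt"
) -> Dict[str, bool]:
    """Compare each listed file against its expected digest.

    Assumes lines of the form "<hex digest>  <file name>" (sha256sum style).
    A missing file is reported as False rather than raising an error.
    """
    results: Dict[str, bool] = {}
    for line in (directory / sums_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        # sha256sum marks binary mode with a leading "*" on the file name.
        name = name.strip().lstrip("*")
        target = directory / name
        results[name] = (
            target.is_file() and sha256_of(target) == expected.lower()
        )
    return results


if __name__ == "__main__":
    # DESTINATION stands for the download directory used in the commands above.
    for file_name, ok in verify_checksums(Path("DESTINATION")).items():
        print(f"{file_name}: {'OK' if ok else 'MISMATCH or missing'}")
```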