Database Restricted Access
Gout Emergency Department Chief Complaint Corpora
John David Osborne, Tobias O'Leary, Amy Mudano, James Booth, Giovanna Rosas, Gurusai Sujitha Peramsetty, Anthony Knighton, Jeff Foster, Ken Saag, Maria Ioana Danila
Published: Oct. 19, 2020. Version: 1.0
When using this resource, please cite:
Osborne, J. D., O'Leary, T., Mudano, A., Booth, J., Rosas, G., Peramsetty, G. S., Knighton, A., Foster, J., Saag, K., & Danila, M. I. (2020). Gout Emergency Department Chief Complaint Corpora (version 1.0). PhysioNet. https://doi.org/10.13026/96v3-dw72.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
The Gout Emergency Department Chief Complaint Corpora (GED3C) consist of 2 corpora of free-text triage nurse chief complaints (up to 282 characters in length) collected from 2019 to 2020 at an academic medical center in the Deep South. The smaller corpus, "GOUT-CC-2019-CORPUS", consists of 300 chief complaints from 2019 selected by the presence of the keyword "gout". The larger corpus, "GOUT-CC-2020-CORPUS", contains 8037 chief complaints collected from a single month in 2020. No other selection criteria (gout or otherwise) were used to generate GOUT-CC-2020-CORPUS, making this corpus representative of medical conditions of interest in the underlying urban, Black-majority Emergency Department patient population.
We anticipate these corpora being useful in the development of Emergency Department alerting algorithms for gout. Both GOUT-CC-2019-CORPUS and GOUT-CC-2020-CORPUS were annotated with respect to predicted gout flare status as determined retrospectively by manual review of the chief complaint. A subset of patients with these chief complaints underwent chart review by rheumatologists to verify gout flare status, guided by the Gaffo criteria. These corpora may be useful to researchers at other institutions who want to develop alerting algorithms or validate existing ones at a second institution. They are also available for use in masked language model development. Their terse, abbreviation-rich content sets them apart from lengthier clinical text, and to our knowledge this is the only chief complaint corpus publicly available.
Gout affects over 9 million Americans and is the most common form of inflammatory arthritis in men, with a prevalence rate over 5%. The U.S. National Emergency Department Sample (NEDS) reports over 200,000 Emergency Department (ED) visits annually with gout as the primary diagnosis, accounting for 0.2% of all visits. Unfortunately, most patients will be discharged home, many without adequate follow-up care. Thus, providing continuity of care after the ED visit for an acute flare is the cornerstone for improving outcomes of patients with gout. The identification of gout in chief complaints could help bridge this gap, since chief complaints are typically entered at the outset of an ED visit, before structured data exists. This provides an opportunity for patient contact for gout clinical trial recruitment and/or specialty gout care teams, which motivated our creation of this data set. However, we also envision that these chief complaints may be useful as standalone short text for other studies, such as those focusing on emergency medicine, alerting, or diseases of interest well represented in chief complaints.
GOUT-CC-2019-CORPUS and GOUT-CC-2020-CORPUS were generated in 2019 and 2020 respectively from an academic medical center in the southern United States. While patient demographics for the two corpora were not calculated, the demographics of the ED population from which they were derived are known. In 2019, patients who visited the ED were 54% female and 46% male. The ED population was 55% Black, 40% White, 2% Hispanic, and 1% Asian. Age distribution was 5% between ages 1-20 years, 35% between ages 21-40 years, 35% between ages 41-60 years, 20% between ages 61-80 years, and 5% between ages 81-100 years. The demographics of these corpora make them of particular interest to researchers in the Deep South.
All data was processed and analyzed in 2020. Predicted gout flare status was determined from the chief complaint (CC) only and recorded in an Excel spreadsheet, whereas chart review was done using the Cerner PowerChart electronic health record (EHR) and tracked in Excel. The GOUT-CC-2019-CORPUS was generated using "gout" as a keyword search to identify 300 chief complaints from the EHR. This corpus was used to support the initial development of the algorithm, including oversampling of the rare gout flare class. Additionally, a distinct set of 8042 chief complaints from 2020 (CC-2020-CORPUS) was created. These chief complaints were selected without regard to any medical criteria, the only requirement being that they were created in the target month in 2020. Both corpora had the chief complaint annotated retrospectively to indicate whether the chief complaint (and only the chief complaint) was indicative of a gout flare, not indicative of a gout flare, or unknown with respect to gout flare. These results are stored in the "Predict" column of both corpora. GOUT-CC-2019-CORPUS was double-annotated by a practicing rheumatologist (MD) and a PhD informatician (JDO) to calculate annotator agreement, which was 0.825 (Cohen's Kappa). Thereafter, a full manual chart review was performed by one rheumatologist (MD) and a post-doctoral fellow (GR) to determine gout flare status for 197 of the 300 ED encounters. Annotator agreement was 0.774 (Cohen's Kappa) for chart review, and the results of this chart review are stored in the "Consensus" column.
While the initial CC-2020-CORPUS consisted of all chief complaints in the target month in 2020, 5 of the original 8042 chief complaints were removed. Four contained distinguishing information on suicide attempts, assaults, or descriptions of incidents that could allow identification of the patient by those familiar with the incident. One more was removed because the encounter could no longer be found in the EHR, leaving a corpus of 8037 chief complaints. As with the GOUT-CC-2019-CORPUS annotation process, two annotators (AM and JDO) screened for the presence of gout flares using the chief complaint (results stored in the "Predict" column), double-annotating a set of 300 mentions to compute an annotator agreement of 0.965 (Cohen's Kappa). Then the rheumatologist (MD) and the post-doctoral fellow (GR) performed an ED clinical note review on all patients with chief complaints identified as either indicative or unknown for gout, again guided by the criteria for a gout flare described by Gaffo et al., resulting in an annotator agreement of 0.856 (Cohen's Kappa). Additionally, a sample of 100 patients screened as negative for gout flare by chief complaint alone had their charts reviewed by the rheumatologist to verify negative gout flare status. These negative cases were selected by the presence of gout-related keywords to strengthen our confidence that no gout flares were missed. Results of the chart review for gout flare status are stored in the "Consensus" column.
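The agreement figures above are Cohen's Kappa values, which correct raw agreement for the agreement expected by chance. The following is a minimal sketch of that calculation, not the authors' own tooling; the function name and the toy labels are invented for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy labels (invented, not taken from the corpora).
annotator_1 = ["Y", "N", "N", "U", "Y", "N", "N", "N"]
annotator_2 = ["Y", "N", "N", "N", "Y", "N", "U", "N"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # → 0.529
```

Here the two annotators agree on 6 of 8 items (0.75), but because both label most items N, chance agreement is already 0.469, which is why kappa is much lower than raw agreement.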
De-identification was carried out by fine-tuning named entity recognition algorithms based on BERT and ALBERT to meet the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor specifications. Fine-tuning was done over 70 epochs with a batch size of 32, an initial learning rate of 0.1, and a hidden size of 256 for both the BERT base uncased model and the ALBERT base v2 model. A hand-written regular expression algorithm was also developed. Results from all 3 algorithms were consolidated based on span overlap into personal health information (PHI) tags. The 18 PHI tags correspond to those used in the I2B2 2014 Corpus, including an additional tag for age > 90. In addition to redacting all HIPAA Safe Harbor data elements, specific time information (not just dates) that could allow identification via cell phone tracking was also identified and removed. This software-based de-identification process was followed by manual review of all chief complaints by 2 students (GSP and AK) using BRAT software to ensure that no personal information remained. A synthetic data set for both corpora, suitable for training masked language models, was generated using BRATsynthetic, which substitutes random words suitable for the HIPAA Safe Harbor element being replaced.
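Consolidating the output of several taggers by span overlap can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: spans are taken to be (start, end, tag) character offsets with an exclusive end, a merged span keeps the tag of its earliest-starting member (the text does not say how tag conflicts were resolved), and the `[TAG]` placeholder, tag names and example sentence are invented.

```python
def consolidate_spans(*span_lists):
    """Merge overlapping (start, end, tag) spans from several taggers.

    Assumptions: end offsets are exclusive, and a merged span keeps the
    tag of its earliest-starting member.
    """
    spans = sorted(span for span_list in span_lists for span in span_list)
    merged = []
    for start, end, tag in spans:
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it and keep its tag.
            prev_start, prev_end, prev_tag = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_tag)
        else:
            merged.append((start, end, tag))
    return merged

def redact(text, spans):
    """Replace each span with a [TAG] placeholder (hypothetical format)."""
    # Work right to left so earlier offsets stay valid after each edit.
    for start, end, tag in sorted(spans, reverse=True):
        text = text[:start] + f"[{tag}]" + text[end:]
    return text

# Invented example; the three lists stand in for two models and a regex.
text = "pt seen by Dr Smith on 3/14 c/o L foot pain"
name, title, date = text.index("Smith"), text.index("Dr Smith"), text.index("3/14")
model_a = [(name, name + len("Smith"), "DOCTOR")]
model_b = [(title, title + len("Dr Smith"), "DOCTOR")]
regex = [(date, date + len("3/14"), "DATE")]

merged = consolidate_spans(model_a, model_b, regex)
print(redact(text, merged))  # → pt seen by [DOCTOR] on [DATE] c/o L foot pain
```

Taking the union of overlapping spans errs on the side of redacting more text, which is the conservative choice when the goal is to avoid leaking PHI.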
Data for this project was collected under IRB-300004156 and IRB-300001664; Improving Care for Gout in the Southeast Enhancing Gout Minority Patients Care and Participation in Gout Clinical Research. This de-identified data is approved for distribution per the U-BRITE Deidentified Translational Data Repository for Research and Education (IRB-300002212).
The corpus consists of 2 distinct data sets, GOUT-CC-2019-CORPUS and GOUT-CC-2020-CORPUS, each available in 4 different formats. Each data set has both a tab-separated (tsv) and a comma-separated (csv) version, and each of these comes in a SYNTHETIC and a REDACTED version. The REDACTED version has all personal information (PI), including personal health information (PHI), removed; the redacted text is replaced by 1 of the 18 HIPAA classes of protected information (e.g., medical record number, MRN), and any time-specific information is also removed. The SYNTHETIC version replaces the redacted text with a randomly chosen text string appropriate to that HIPAA or time class using the BRATsynthetic software.
Each row in the file represents a chief complaint and consists of 3 fields:
- The free-text "Chief Complaint" column consists of one or more (mostly improper) sentences written in the abbreviated, medical-acronym-rich English used by ED triage nurses in an urban academic medical center in the southern United States. PHI is identified by marker text that closes with ">>".
- The "Predict" column indicates if the complaint may be related to a gout flare. Values are yes (Y), no (N), unknown (U) or unmarked.
- The "Consensus" column indicates if the patient at the time of the ED visit was experiencing a gout flare as determined by chart review by a rheumatologist. Values are yes (Y), no (N), unknown (U) or unmarked.
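As a minimal sketch of reading one of these files, the snippet below parses the three fields with Python's standard `csv` module. It assumes the column headers appear exactly as named above and treats any value other than Y, N or U as unmarked; the inline rows are invented stand-ins for a real file, and `read_corpus` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

KNOWN_LABELS = {"Y", "N", "U"}

def read_corpus(handle, delimiter="\t"):
    """Yield (complaint, predict, consensus) from an open tsv or csv file.

    Labels outside Y/N/U are returned as None (assumed to mean unmarked).
    """
    for row in csv.DictReader(handle, delimiter=delimiter):
        predict = row["Predict"].strip()
        consensus = row["Consensus"].strip()
        yield (
            row["Chief Complaint"],
            predict if predict in KNOWN_LABELS else None,
            consensus if consensus in KNOWN_LABELS else None,
        )

# Invented rows; with a real file use open(path, newline="", encoding="utf-8").
toy_tsv = (
    "Chief Complaint\tPredict\tConsensus\n"
    "R great toe pain and swelling, hx gout\tY\tY\n"
    "SOB x2 days\tN\t\n"
    "L knee pain, unsure if gout\tU\tN\n"
)
rows = list(read_corpus(io.StringIO(toy_tsv)))
print(Counter(predict for _, predict, _ in rows))
```

For the csv versions, pass `delimiter=","`; the `csv` module handles quoted fields, so commas inside a chief complaint do not split the row.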
Data can be used directly. Possible applications of this project include:
- Early detection of gout in an ED setting, before structured data is entered into the system
- Development of ED alerts that require an additional (2nd institution) data source for algorithm validation
- The de-identified chief complaints themselves are also useful as a source of training data for clinical masked language modelling, which matters because chief complaint data is not currently found in the MIMIC data set. Additionally, the patient population has different demographic characteristics, and the text is quite different from other note types, being extremely abbreviation-rich and terse. For masked language model training, the SYNTHETIC rather than the REDACTED data sets are appropriate.
One limitation of this data set is that the number of gout-relevant chief complaints is small relative to the total number of chief complaints, resulting in an unbalanced data set. We suggest that institutions interested in developing gout alerts use this data set in conjunction with their own alert data to achieve higher performance.
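Besides adding more data, one common way to cope with this kind of imbalance is to weight classes inversely to their frequency during training. This is a general technique, not something the authors describe; the sketch below uses the usual "balanced" heuristic, weight = n_samples / (n_classes * class_count), with invented label proportions.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }

# Invented proportions, not the real label distribution of the corpora.
labels = ["N"] * 95 + ["Y"] * 5
weights = balanced_class_weights(labels)
print(weights["Y"])  # → 10.0
```

With these toy proportions, each rare positive example counts for 10.0 and each negative for roughly 0.53, so the two classes contribute equally to a weighted loss.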
This initial 1.0 release is the first complete release. A subsequent release with data from another institution may follow in 2021.
This data set was supported by funding from NIH grant 3P50AR60772-08S1 and an NVIDIA Corporation grant for a Titan XP GPU used for algorithm training.
Conflicts of Interest
The authors have no conflicts of interest to declare.
- Chen‐Xu, M., Yokose, C., Rai, S. K., Pillinger, M. H., & Choi, H. K. (2019). Contemporary prevalence of gout and hyperuricemia in the United States and decadal trends: the National Health and Nutrition Examination Survey, 2007–2016. Arthritis & Rheumatology, 71(6), 991-999.
- Singh, J. A., & Yu, S. (2016). Time trends, predictors, and outcome of emergency department use for gout: a nationwide US study. The Journal of rheumatology, 43(8), 1581-1588.
- Gaffo, A. L., Schumacher, H. R., Saag, K. G., Taylor, W. J., Dinnella, J., Outman, R., ... & Chou, C. T. (2012). Developing a provisional definition of flare in patients with established gout. Arthritis & Rheumatism, 64(5), 1508-1517.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
- Stubbs, A., Kotfila, C., & Uzuner, Ö. (2015). Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. Journal of biomedical informatics, 58, S11-S19.
- Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., & Tsujii, J. I. (2012, April). BRAT: a web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (pp. 102-107).
- BRATsynthetic Software, GitHub. https://github.com/uabnlp/BRATsynthetic
Only registered users who sign the specified data use agreement can access the files.
License (for files):
PhysioNet Restricted Health Data License 1.5.0
Data Use Agreement:
PhysioNet Restricted Health Data Use Agreement 1.5.0
gout nlp emergency department