
Phenotype Annotations for Patient Notes in the MIMIC-III Database

Edward Moseley, Leo Anthony Celi, Joy Wu, Franck Dernoncourt

Published: March 5, 2020. Version: 1.20.03

When using this resource, please cite:
Moseley, E., Celi, L. A., Wu, J., & Dernoncourt, F. (2020). Phenotype Annotations for Patient Notes in the MIMIC-III Database (version 1.20.03). PhysioNet.

Additionally, please cite the original publication:

Gehrmann S, Dernoncourt F, Li Y, Carlson ET, Wu JT, Welt J, et al. (2018) Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives. PLoS ONE 13(2): e0192360.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Abstract

A crucial step within secondary analysis of electronic health records (EHRs) is to identify the patient cohort under investigation. While EHRs contain medical billing codes that aim to represent the conditions and treatments patients may have, much of the information is only present in the patient notes. Therefore, it is critical to develop robust algorithms to infer patients' conditions and treatments from their written notes.

We introduce a dataset for patient phenotyping, a task defined as identifying whether a patient has a given phenotype (also referred to as an indication) based on their patient note. Patient notes from MIMIC-III, a dataset collected from the Intensive Care Units of a large tertiary care hospital in Boston, were manually annotated for the presence of several high-context phenotypes relevant to treatment and risk of re-hospitalization.

Each note has been annotated by two expert human annotators (one clinical researcher and one resident physician). Annotated phenotypes include treatment non-adherence, chronic pain, advanced/metastatic cancer, as well as 10 other phenotypes. This dataset can be utilized for academic and industrial research in medicine and computer science, particularly within the field of medical natural language processing. 


Background

As EHRs act to streamline the healthcare administration process, much of the data collected and stored in structured format may be those data most relevant to reimbursement and billing, and may not necessarily be those data which were most relevant during the clinical encounter. For example, a diabetic patient who does not adhere to an insulin treatment regimen and who thereafter presents to the hospital with symptoms indicating diabetic ketoacidosis (DKA) will be treated and considered administratively as an individual presenting with DKA, though that medical emergency may have been secondary to non-adherence to the initial treatment regimen in the setting of diabetes. In this instance, any retrospective study analyzing only the structured data from many similarly selected clinical encounters will necessarily underestimate the effect of treatment non-adherence on hospital admissions.

While certain high-context information may not be found in the structured EHR data, it may be accessible in patient notes, including nursing progress notes and discharge summaries, particularly through the use of natural language processing (NLP) technologies[1,2]. Given progress in NLP methods, we sought to address the issue of unstructured clinical text by defining and annotating clinical phenotypes in text which may otherwise be prohibitively difficult to discern in the structured data associated with the text entry. For this task, we chose the notes present in the publicly available MIMIC database[3].

Given the MIMIC database as substrate and policy initiatives to reduce unnecessary hospital readmissions, as well as the goal of providing structure to text, we elected to focus on patients who were frequently readmitted to the ICU[4]. Specifically, we defined a frequently readmitted patient as one who is admitted to the ICU more than three times in a single year. By defining our cohort in this way, we sought to capture those characteristics unique to the cohort in a manner which may yield actionable intelligence on interventions to assist this patient population.
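The cohort rule above (more than three ICU admissions in a single year) can be sketched in a few lines. This is an illustration only, not the authors' selection code: it assumes "a single year" means any rolling 365-day window, and the function name, parameter names, and toy admission times are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def frequent_icu_patients(stays, min_admissions=4, window_days=365):
    """Return subject IDs with at least `min_admissions` ICU admissions
    whose start times fall within some rolling `window_days`-day window.

    `stays` is an iterable of (subject_id, admission_time) pairs.
    "More than three" admissions is expressed as min_admissions=4.
    """
    by_subject = defaultdict(list)
    for subject_id, intime in stays:
        by_subject[subject_id].append(intime)

    window = timedelta(days=window_days)
    frequent = set()
    for subject_id, times in by_subject.items():
        times.sort()
        # Compare each admission with the one (min_admissions - 1) places later.
        for i in range(len(times) - min_admissions + 1):
            if times[i + min_admissions - 1] - times[i] <= window:
                frequent.add(subject_id)
                break
    return frequent


# Toy data (hypothetical): subject 1 has four admissions within one year,
# subject 2 has four admissions spread over several years,
# subject 3 has only three admissions.
example_stays = [
    (1, datetime(2150, 1, 1)), (1, datetime(2150, 3, 1)),
    (1, datetime(2150, 6, 1)), (1, datetime(2150, 9, 1)),
    (2, datetime(2150, 1, 1)), (2, datetime(2151, 2, 1)),
    (2, datetime(2152, 3, 1)), (2, datetime(2153, 4, 1)),
    (3, datetime(2150, 1, 1)), (3, datetime(2150, 2, 1)),
    (3, datetime(2150, 3, 1)),
]
result = frequent_icu_patients(example_stays)
```

A calendar-year interpretation would require grouping admissions by year instead of using a rolling window; the text does not say which was intended.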


Methods

Clinical researchers teamed with junior medical residents, in collaboration with more senior intensive care physicians, to carry out text annotation from August 2015 to October 2016.

Operators were grouped to facilitate the annotation of notes in duplicate, allowing for cases of disagreement between operators. The operators within each team were instructed to work independently on note annotation. Clinical texts were annotated in batches, which were time-stamped on their day of creation. When both operators in a team completed annotation of a batch, a new batch was created and transferred to them.

The operators formed two groups (group 1: co-authors ETM & JTW; group 2: co-authors JW & JF), each pairing one clinical researcher with one resident physician. All operators were first trained on the high-context phenotypes and their definitions by working through a number of notes together as a group. A total of 13 phenotypes were considered for annotation, and the label "unsure" was used to indicate that an operator wished to seek assistance from a more senior physician in determining the presence of a phenotype.
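Because each note was annotated independently by two operators, agreement between them can be quantified per phenotype. The sketch below uses Cohen's kappa as one common choice; the source does not state which agreement statistic, if any, was computed, and the function names and toy labels are hypothetical.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of binary (0/1) labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of notes on which both operators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each operator's marginal rate of positive labels.
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both operators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(labels_a, labels_b):
    """Indices of notes on which the two operators disagree."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


# Toy labels (hypothetical) for one phenotype over four notes.
operator_1 = [1, 1, 0, 0]
operator_2 = [1, 0, 0, 0]
kappa = cohen_kappa(operator_1, operator_2)
to_review = disagreements(operator_1, operator_2)
```

Notes listed in `to_review`, along with any note marked "unsure", would be natural candidates for review by a more senior physician.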

Data Description

We have created a dataset of patient notes, all in the English language, with a focus on frequently readmitted patients, labeled with 15 clinical patient phenotypes believed to be associated with risk of recurrent Intensive Care Unit (ICU) readmission per our domain experts (co-authors LAC, PAT, DAG) as well as the literature [5-7].

Each entry in this database consists of a MIMIC-III derived Subject Identifier ("SUBJECT_ID", integer), a Hospital Admission Identifier ("HADM_ID", integer), the index from the MIMIC-III v1.4 NOTEEVENTS table ("ROW_ID", integer), 15 Phenotypes (binary) including "None" and "Unsure", and Operator (string).

For most purposes, this table can be joined to other tables in MIMIC-III as if it were the NOTEEVENTS table.
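As a minimal sketch of such a join, the example below attaches note metadata and text to each annotation row by matching on ROW_ID. It uses small in-memory CSV strings rather than the real files. The identifier columns follow the description above, while the phenotype column names ("OBESITY", "NONE", "UNSURE"), the "OPERATOR" column name, and all values are assumptions made for illustration.

```python
import csv
import io

# Toy stand-ins for the annotation file and the NOTEEVENTS table.
ANNOTATIONS_CSV = """SUBJECT_ID,HADM_ID,ROW_ID,OBESITY,NONE,UNSURE,OPERATOR
10,100,1000,1,0,0,A
10,100,1000,1,0,0,B
11,101,1001,0,1,0,A
"""

NOTEEVENTS_CSV = """ROW_ID,SUBJECT_ID,HADM_ID,CATEGORY,TEXT
1000,10,100,Discharge summary,Toy note text one.
1001,11,101,Nursing,Toy note text two.
1002,12,102,Nursing,Unannotated toy note.
"""


def join_on_row_id(annotation_rows, note_rows):
    """Inner-join annotation rows to note rows on ROW_ID.

    A note annotated by two operators yields two joined rows.
    """
    notes_by_id = {row["ROW_ID"]: row for row in note_rows}
    joined = []
    for ann in annotation_rows:
        note = notes_by_id.get(ann["ROW_ID"])
        if note is None:
            continue  # annotation with no matching note
        merged = dict(ann)
        merged["CATEGORY"] = note["CATEGORY"]
        merged["TEXT"] = note["TEXT"]
        joined.append(merged)
    return joined


annotations = list(csv.DictReader(io.StringIO(ANNOTATIONS_CSV)))
notes = list(csv.DictReader(io.StringIO(NOTEEVENTS_CSV)))
joined = join_on_row_id(annotations, notes)
```

With the real files, the `io.StringIO` objects would be replaced by opened file handles, and the column names should be checked against the actual headers before joining.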

Phenotype definitions are as follows:

  • Advanced Cancer - Cancers with very high mortality (pancreatic, esophageal, stomach/gastric, biliary, anaplastic)
  • Advanced Heart Disease - Ejection fraction (EF) less than 30%, severe cardiomyopathy, severe aortic stenosis, any mention of heart transplant (considered for, set to receive, denied). 
  • Advanced Lung Disease - Pulmonary Function Test (PFT) results of Forced Expiratory Volume (FEV1) less than 50% of normal, or Forced Vital Capacity (FVC) less than 70%. Severe chronic obstructive pulmonary disease (COPD), which may be indicated by Gold Stage III-IV. Severe interstitial lung disease (ILD).
  • Alcohol Abuse - Recent alcohol abuse history which is an active problem at the time of admission, whether it is the primary cause of admission or not.
  • Chronic Neurological Dystrophies - Chronic central nervous system (CNS) or spinal cord diseases, including: multiple sclerosis (MS), amyotrophic lateral sclerosis (ALS), muscular dystrophies, myasthenia gravis, Parkinson's Disease, epilepsy, stroke and cerebrovascular accident (CVA) with residual deficits, and various neuromuscular diseases or dystrophies.
  • Chronic Pain Fibromyalgia - Any etiology of chronic pain (including fibromyalgia) requiring long-term opioid/narcotic medication to control.
  • Dementia - Alzheimer's and other forms of dementia mentioned in the text.
  • Depression - Diagnosis of depression, treatment of depression, presentation to the ICU with symptoms of depression including acts of self-harm or suicide.
  • Developmental Delay - Includes congenital, genetic and idiopathic disabilities. 
  • Non Adherence - Temporary or permanent discontinuation of a treatment, including pharmaceuticals or appointments, without consulting a physician prior to doing so. This includes skipping dialysis appointments or leaving the hospital against medical advice. A patient who sees a physician to discuss adverse events associated with a medication may or may not constitute non-adherence depending on whether or not the treatment was ceased without the physician's consultation.
  • None - True when no indication is apparent to the annotator.
  • Obesity - Any mention of obesity as a consideration in the healthcare encounter. Abdominal obesity is not sufficient.
  • Other Substance Abuse - Intravenous drug abuse, illicit drug use, accidental overdose of psychoactive or narcotic medications. Remote use of marijuana is not sufficient.
  • Schizophrenia and other Psychiatric Disorders - Psychiatric disorders in DSM-5 classification, including schizophrenia, bipolar and anxiety disorders. Does not include depression.
  • Unsure - Indicates ambiguity with regard to one or several of the other indications (including "None") on the part of the note annotator.
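The definitions of "None" and "Unsure" suggest a simple sanity check on each annotation row. The sketch below is an assumed check, not a rule the source says was enforced: it flags rows where "None" is set alongside a phenotype, and rows with no label at all. The upper-case column names are hypothetical and may differ from the actual file headers.

```python
# Hypothetical column names for the 13 annotated phenotypes.
PHENOTYPES = [
    "ADVANCED_CANCER",
    "ADVANCED_HEART_DISEASE",
    "ADVANCED_LUNG_DISEASE",
    "ALCOHOL_ABUSE",
    "CHRONIC_NEUROLOGICAL_DYSTROPHIES",
    "CHRONIC_PAIN_FIBROMYALGIA",
    "DEMENTIA",
    "DEPRESSION",
    "DEVELOPMENTAL_DELAY",
    "NON_ADHERENCE",
    "OBESITY",
    "OTHER_SUBSTANCE_ABUSE",
    "SCHIZOPHRENIA_AND_OTHER_PSYCHIATRIC_DISORDERS",
]


def check_row(row):
    """Return a list of problems found in one annotation row.

    `row` maps column names to 0/1 integers; missing columns count as 0.
    """
    positives = [p for p in PHENOTYPES if row.get(p, 0) == 1]
    none_flag = row.get("NONE", 0) == 1
    unsure_flag = row.get("UNSURE", 0) == 1

    problems = []
    if none_flag and positives:
        problems.append("NONE set together with: " + ", ".join(positives))
    if not none_flag and not unsure_flag and not positives:
        problems.append("no label set")
    return problems


# Toy rows (hypothetical values).
consistent_row = {"OBESITY": 1, "NONE": 0, "UNSURE": 0}
conflicting_row = {"OBESITY": 1, "NONE": 1, "UNSURE": 0}
consistent_problems = check_row(consistent_row)
conflicting_problems = check_row(conflicting_row)
```

Rows marked "Unsure" are deliberately not flagged, since the definition allows that label to coexist with any of the others.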

Usage Notes

As this corpus of annotated patient notes is made up of original healthcare data which contains protected health information (PHI) per the Health Insurance Portability and Accountability Act of 1996 (HIPAA)[8] and can be joined to the MIMIC-III database, those who wish to access these data must satisfy all requirements to access the data contained within MIMIC-III.

Release Notes

This release coincides with the presentation of this dataset at the Language Resources and Evaluation Conference in Marseille, France, May 11-16, 2020.


Acknowledgements

The authors would like to acknowledge Kai-ou Tang and William Labadie-Moseley for assistance in the development of a graphical user interface for text annotation. We would also like to thank Philips Healthcare, The Laboratory of Computational Physiology at The Massachusetts Institute of Technology, and staff at the Beth Israel Deaconess Medical Center, Boston, for supporting the MIMIC-III database, from which these data were derived.

Conflicts of Interest

None to report.


References

  1. Kangovi, S., Mitra, N., Grande, D., White, M. L., McCollum, S., Sellman, J., Shannon, R.P., and Long, J.A. (2014). Patient-centered community health worker intervention to improve posthospital outcomes: a randomized clinical trial. JAMA Internal Medicine, 174(4):535–543
  2. Kansagara, D., Ramsay, R.S., Labby, D., and Saha, S. (2012). Post-discharge intervention in vulnerable, chronically ill patients. Journal of Hospital Medicine, 7(2):124–130
  3. Kansagara, D., Englander, H., Salanitro, A., Kagen, D., Theobald, C., Freeman, M., and Kripalani, S. (2011). Risk prediction models for hospital readmission: a systematic review. Journal of the American Medical Association, 306(15):1688–1698
  4. Ryan, J., Hendler, J., and Bennett, K. P. (2015). Understanding Emergency Department 72-Hour Revisits Among Medicaid Patients Using Electronic Healthcare Records. Big Data, 3(4):238–248
  5. Moon, S., Liu, S., Scott, C.G., Samudrala, S., Abidian, M.M., Geske, J.B., Noseworthy, P.A., Shellum, J.L., Chaudhry, R., Ommen, S. R., Nishimura, R. A., Liu, H., and Arruda-Olson, A.M. (2019). Automated extraction of sudden cardiac death risk factors in hypertrophic cardiomyopathy patients by natural language processing. International Journal of Medical Informatics, 128:32–38
  6. Chan, A., Chien, I., Moseley, E., Salman, S., Kaminer Bourland, S., Lamas, D., Walling, A. M., Tulsky, J. A., and Lindvall, C. (2019). Deep learning algorithms to identify documentation of serious illness conversations during intensive care unit admissions. Palliative Medicine, 33(2):187–196
  7. Johnson, A.E., Pollard, T.J., Shen, L., Lehman, L.W., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L.A., and Mark, R.G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035
  8. Health insurance portability and accountability act of 1996. Public law, 104:191

Parent Projects
Phenotype Annotations for Patient Notes in the MIMIC-III Database was derived from: Please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
