Database Credentialed Access

Nosocomial Risk Datasets from MIMIC-III

Travis Goodwin

Published: Sept. 15, 2022. Version: 1.0

When using this resource, please cite:
Goodwin, T. (2022). Nosocomial Risk Datasets from MIMIC-III (version 1.0). PhysioNet.

Additionally, please cite the original publication:

Goodwin TR, Demner-Fushman D. A customizable deep learning model for nosocomial risk prediction from critical care notes with indirect supervision. Journal of the American Medical Informatics Association, Volume 27, Issue 4, April 2020, Pages 567–576.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Reliable longitudinal risk prediction for hospitalized patients is needed to provide quality care. Our goal is to foster the development of generalizable models capable of leveraging clinical notes to predict healthcare-associated diseases 24–96 hours in advance. We developed data to explore the problem of predicting the risk of hospital acquired (occurring ≥ 48 hours after admission) acute kidney injury, pressure injury, or anemia ≥ 24 hours before it is implicated by the patient's chart, labs, or notes. We relied on the MIMIC-III critical care database and extracted distinct positive and negative cohorts for each disease. We retrospectively determined the date-of-event using structured and unstructured criteria so that it may be used as a form of indirect supervision to train and evaluate automatic systems for predicting disease risk from clinical notes. This data was used as the experimental basis for the CANTRIP project.


Risk prediction from EHR data has received considerable attention over the last decade, with the majority of approaches predicting a specific outcome or single disease; however, a recent review of 107 risk prediction studies observed that most studies (a) relied on only a small list of predefined variables rather than leveraging the breadth of data in the EHR, and (b) neglected to consider longitudinal relationships in the data[4]. Moreover, very few studies involved clinical text in any capacity. We developed data for three common nosocomial diseases: hospital acquired acute kidney injury (HAAKI), hospital acquired pressure injury (HAPI), and hospital acquired anemia (HAA), each with its own training, validation, and testing cohorts.

Hospital acquired acute kidney injury (HAAKI)

Acute kidney injury (AKI) affects as many as 20% of all hospitalizations and is associated with increased mortality, end-stage renal disease, and chronic kidney disease[5,6]. Unfortunately, current criteria for AKI are primarily markers of established kidney damage or impaired function. As such, new approaches for earlier prediction of AKI before significant kidney damage is established could improve outcomes. This dataset is the first to our knowledge designed for predicting AKI or HAAKI using clinical notes.

Hospital acquired anemia (HAA)

A substantial number of hospital patients with normal HgB on admission become anemic during the course of their hospitalization, resulting in increased length of stay, higher hospital charges, and substantial risk of in-hospital mortality (by 51%–228%, depending on severity)[7]. It has been shown that, in critical care, phlebotomy is highly associated with changes in HgB and hematocrit; moreover, critical care patients average 40–70 mL of blood drawn daily, and every 50 mL of blood drawn increases their risk of moderate to severe HAA by 18%[8,9]. Consequently, the ability to automatically predict HAA would enable physicians to switch to small volume phlebotomy tubes, minimize blood loss from in-dwelling catheters, and reduce blood tests[9]. Although there has been some work on predicting anemias, such as classifying iron deficiency anemia using artificial neural networks[10] or predicting moderate to severe anemia for patients with ulcerative colitis using logistic regression[11], we were unable to find any datasets designed for developing automatic methods for predicting hospital acquired anemia, whether using structured or unstructured data.

Hospital acquired pressure injury (HAPI)

The development of pressure injuries (i.e., pressure ulcers or bed sores) can lead to several complications, including sepsis, cellulitis, osteomyelitis, pain, depression, and increased mortality (as high as 60% within 1 year of hospital discharge for older patients who develop a pressure ulcer during their stay)[12,13]. There have been no conclusive studies on the identification of pressure ulcer risk factors, nor were any of the existing risk-assessment scales developed specifically for use with intensive care unit (ICU) patients[14]. With this dataset, we enable data-driven approaches to reliably detect pressure ulcers in ICU patients without physician interaction or pre-specified feature extraction, allowing for potentially improved patient outcomes.


The data in this project is derived from MIMIC-III[1,2]. To account for irregular gaps in a patient's hospital visit, we adopt an abstract representation of the visit, which we call the clinical chronology. We represent the chronology as

  1. a discrete, discontiguous sequence of L snapshots, s_1, s_2, …, s_L, where each snapshot encodes the clinical observations documented in any clinical notes produced on the same (calendar) day, and

  2. a sequence of elapsed times, δ_1, δ_2, …, δ_L, such that δ_i encodes the number of hours between s_i and s_{i-1}, and δ_0 encodes the number of hours between hospital admission and the first clinical note.
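To make this representation concrete, a chronology can be held as paired lists of snapshots and elapsed-time gaps. This is a minimal sketch; the class and field names are illustrative, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Chronology:
    """One hospital visit: L clinical snapshots plus their elapsed-time gaps."""
    snapshots: List[List[str]]  # s_1..s_L: observation IDs per (calendar) day
    deltas: List[float]         # hours preceding each snapshot; deltas[0] is
                                # admission -> first clinical note

    def __post_init__(self) -> None:
        # Every snapshot is paired with exactly one elapsed-time value.
        assert len(self.snapshots) == len(self.deltas)

# Toy two-snapshot chronology using CUIs from the seed-concept table
chron = Chronology(
    snapshots=[["C0022660"], ["C0002871", "C0011127"]],
    deltas=[6.0, 24.0],  # 6 h admission -> note 1; 24 h between notes 1 and 2
)
```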

Natural language preprocessing

We extracted the set of observations from each clinical note in MIMIC-III using MetaMap Lite[15]. We then filtered out all observations that:

  1. were not affirmed, certain, present, and associated with the patient;
  2. occurred in a section corresponding to consults, family history, past medical history, or social history;
  3. had a UMLS[16] semantic type not corresponding to a medical problem, intervention, drug, or anatomic region; or
  4. belonged to InfoBot’s medical stop word list.

Further details are provided in [3].
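A minimal sketch of the four-part filter, assuming one simple dictionary per extracted observation. The field names, semantic-type labels, excluded sections, and stop-word entries below are placeholders, not MetaMap Lite's or InfoBot's actual output.

```python
# Placeholder vocabularies -- assumptions, not the actual lists used in [3].
MEDICAL_STOPWORDS = {"patient", "history"}          # stand-in for InfoBot's list
KEPT_SEMTYPES = {"problem", "intervention", "drug", "anatomy"}
EXCLUDED_SECTIONS = {"consults", "family history",
                     "past medical history", "social history"}

def keep_observation(obs: dict) -> bool:
    """Return True only if an observation survives all four exclusion criteria."""
    # 1. must be affirmed, certain, present, and associated with the patient
    if not (obs["affirmed"] and obs["certain"]
            and obs["present"] and obs["about_patient"]):
        return False
    # 2. must not occur in an excluded note section
    if obs["section"].lower() in EXCLUDED_SECTIONS:
        return False
    # 3. UMLS semantic type must map to problem/intervention/drug/anatomy
    if obs["semtype"] not in KEPT_SEMTYPES:
        return False
    # 4. must not be a medical stop word
    if obs["text"].lower() in MEDICAL_STOPWORDS:
        return False
    return True

obs = {"affirmed": True, "certain": True, "present": True, "about_patient": True,
       "section": "assessment", "semtype": "problem", "text": "anemia"}
```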

Determining the Date-of-Event

We determined the Date-of-Event (DOE) as the first date on which the disease is documented in a clinical note or evidenced by the patient's labs or chart. Specifically, for each disease, we defined one or more (a) seed concepts in the UMLS hierarchy, (b) lexical patterns, and (c) structured criteria using the laboratory, chart, and/or vital sign information in MIMIC. We determined the DOE as the first date on which (1) any observation extracted from a clinical note associated with that date descends from any of the UMLS seed concepts; (2) any observation or any text in the note contains any of the lexical patterns not immediately followed by a colon (to rule out structural matches, e.g., "bed sore: none"); or (3) the structured criteria are met.

Disease | UMLS seed CUI | Lexical patterns (regular expressions) | Structured criteria
HAAKI | C0022660 (Kidney Failure, Acute) | kidney failure, renal failure, kidney injury, renal injury, AKI | KDIGO[17]
HAPI | C0011127 (Pressure Ulcer) | bed sore, bed ulcer, pressure sore, pressure ulcer, decub* sore, decub* ulcer | NPUAP[18]
HAA | C0002871 (Anemia) | anemia, anaemia, HAA | WHO[19]

Abbreviations: AKI, acute kidney injury; CUI, concept unique identifier; HAA, hospital acquired anemia; HAAKI, hospital acquired acute kidney injury; HAPI, hospital acquired pressure injury; KDIGO, Kidney Disease Improving Global Outcomes; NPUAP, National Pressure Ulcer Advisory Panel; UMLS, Unified Medical Language System; WHO, World Health Organization.
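Criterion (2), that a lexical pattern only counts when not immediately followed by a colon, can be sketched with a regular-expression negative lookahead. The HAPI patterns come from the table above; expanding decub* to a \w* wildcard is an assumption.

```python
import re

# HAPI lexical patterns from the table; "decub*" is treated as a word-prefix
# wildcard here, which is an assumption about the original matching rules.
HAPI_PATTERNS = [r"bed sore", r"bed ulcer", r"pressure sore", r"pressure ulcer",
                 r"decub\w* sore", r"decub\w* ulcer"]

def lexical_match(text: str, patterns) -> bool:
    """True if any pattern occurs and is NOT immediately followed by a colon
    (so structural matches like "bed sore: none" are ruled out)."""
    return any(re.search(p + r"(?!\s*:)", text, re.IGNORECASE)
               for p in patterns)
```

For example, `lexical_match("Sacral pressure ulcer noted on exam", HAPI_PATTERNS)` matches, while the section heading `"bed sore: none"` does not.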

Encoding elapsed times

We encoded elapsed times using the sinusoidal representation proposed in [20] and further detailed in [3].
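A minimal sketch of that sinusoidal encoding applied to a scalar elapsed time. The dimensionality and the 10000 base are the transformer defaults from [20], assumed here rather than taken from the dataset's actual configuration.

```python
import math

def encode_elapsed_time(delta_hours: float, dim: int = 16) -> list:
    """Sinusoidal encoding of an elapsed time, in the style of [20].

    Each frequency contributes a (sin, cos) pair, so `dim` should be even.
    """
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))       # geometric frequency schedule
        enc.append(math.sin(delta_hours * freq))
        enc.append(math.cos(delta_hours * freq))
    return enc
```

A zero gap encodes to alternating 0/1 values, and nearby gaps yield nearby vectors, which is the property that lets the model reason about irregular time deltas.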

Creating positive and negative examples

To enable a model to be trained without manually quantifying the risk of disease for each snapshot in each patient’s chronology, we used the DOE as a form of indirect supervision to produce positive and negative examples. Specifically, for each positive admission (i.e., admissions with chronologies in which the patient eventually develops the disease) we created a labeled example by:

  1. Truncating each chronology to end at the last snapshot occurring 24–96 hours before the DOE;

  2. Defining the prediction window Δ as the elapsed time (in hours) between the final snapshot (after truncation) and the DOE; and

  3. Assigning the label y = 1.
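The truncation and labeling steps above can be sketched as follows, with snapshot times expressed in hours since admission; the function name and return shape are illustrative assumptions.

```python
def make_positive_example(snapshot_times, doe_hours):
    """Truncate a positive chronology at the last snapshot 24-96 h before the
    DOE; return (kept snapshot times, prediction window in hours, label 1)."""
    # Keep everything up to the last snapshot at least 24 h before the DOE.
    kept = [t for t in snapshot_times if doe_hours - t >= 24]
    if not kept or doe_hours - kept[-1] > 96:
        return None  # no snapshot falls inside the 24-96 h window
    delta = doe_hours - kept[-1]  # prediction window
    return kept, delta, 1

example = make_positive_example([0, 10, 40, 70], doe_hours=100)
```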

To create negative examples, we first grouped positive admissions into buckets based on demographic and admission information, including the patient's age, sex, and race, as well as their admitting ICU, source of admission (i.e., clinic, physician, transfer, or other), type of admission (i.e., elective, emergency, or urgent), Oxford Acute Severity of Illness Score[21], and type of insurance (i.e., government, private, Medicaid, Medicare, or self-pay). For each bucket b, we assumed the Time-to-Event (TTE, i.e., the number of hours elapsed from hospital admission to DOE) followed a Gamma prior distribution (i.e., TTE ~ Γ(k_b, θ_b)) and determined k_b and θ_b using maximum likelihood estimates over each positive example in the bucket. This allowed us to create labels for our negative examples by:

  1. Determining which bucket b each negative example belonged to;

  2. Sampling TTE′ ~ Γ(k_b, θ_b);

  3. Defining the DOE as either (a) the date obtained by projecting TTE′ forward from the date of hospital admission or (b) the discharge date, whichever occurred first;

  4. Truncating the chronology to end at the snapshot 24–96 hours before the DOE; and

  5. Defining Δ as the number of hours elapsed between the final snapshot (after truncation) and the DOE.

This process is illustrated and further detailed in [3].
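The per-bucket Gamma fitting and sampling can be sketched with standard-library tools. The closed-form MLE approximation below (Minka's) is an assumed stand-in for whichever estimator the authors actually used, and the function names are illustrative.

```python
import math
import random

def fit_gamma_mle(ttes):
    """Approximate Gamma MLE for (shape k, scale theta) over a bucket's
    positive Time-to-Event values, via Minka's closed-form approximation."""
    mean = sum(ttes) / len(ttes)
    # s = log(mean) - mean(log); strictly positive unless all values are equal
    s = math.log(mean) - sum(math.log(t) for t in ttes) / len(ttes)
    k = (3.0 - s + math.sqrt((s - 3.0) ** 2 + 24.0 * s)) / (12.0 * s)
    return k, mean / k  # theta chosen so k * theta equals the sample mean

def sample_negative_doe(k, theta, discharge_hours, rng):
    """Steps 2-3: sample TTE' ~ Gamma(k, theta) and cap it at discharge."""
    return min(rng.gammavariate(k, theta), discharge_hours)

k, theta = fit_gamma_mle([48.0, 72.0, 96.0, 120.0])
doe = sample_negative_doe(k, theta, 24.0, random.Random(0))
```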

Data Description

Chronology format

The chronology CSV files have the following format:


where [observations] is encoded as a space-separated list of observation IDs (e.g., UMLS CUIs), and [timestamp] is the chart time for that set of observations.
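Since the literal format specification did not survive on this page, the row layout below is purely a hypothetical illustration of parsing such a file; the column names and their order are assumptions.

```python
import csv
import io

# Hypothetical row layout -- the actual header did not survive extraction.
# Assumed: one row per snapshot, with an admission ID, a chart timestamp,
# and a space-separated list of observation IDs (UMLS CUIs).
sample = ("hadm_id,timestamp,observations\n"
          "100001,2101-10-20 09:00:00,C0022660 C0002871\n")

rows = list(csv.DictReader(io.StringIO(sample)))
observations = rows[0]["observations"].split(" ")
```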

Admission format

The admission CSV files follow the format:


where [timestamp] is the admission time for the associated hospital admission.

Label format:

The label CSV files have the following format:


where [timestamp] is the chart time of the label, and [label] is a zero or one indicating whether that timestamp corresponds to the date-of-event for the disease.

Usage Notes

This data can be used to train models for predicting nosocomial disease risk 24–96 hours in advance. An example toolkit for this is provided through CANTRIP on GitHub[3,22]. If used, please cite the associated manuscript.

Release Notes

Initial release version 1.0


This project relies exclusively on de-identified data from MIMIC-III and was approved by the NIH IRB.


This data was developed utilizing the computational resources of the NIH HPC Biowulf cluster[23] and was produced for the paper A customizable deep learning model for nosocomial risk prediction from critical care notes with indirect supervision published in the Journal of the American Medical Informatics Association[3].

Conflicts of Interest

The authors have no conflicts of interest to disclose.


  1. Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016; 3(1): 160035.
  2. Goldberger AL, Amaral LA, Glass L, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 2000; 101(23): E215–20.
  3. Goodwin TR, Demner-Fushman D. A customizable deep learning model for nosocomial risk prediction from critical care notes with indirect supervision. J Am Med Inform Assoc 2020; 27(4): 567–76.
  4. Magill SS, Edwards JR, Bamberg W, et al. Multistate point-prevalence survey of health care–associated infections. N Engl J Med 2014; 370(13): 1198–208.
  5. Silver SA, Long J, Zheng Y, et al. Cost of acute kidney injury in hospitalized patients. J Hosp Med 2017; 12(2): 70–6.
  6. Chertow GM, Burdick E, Honour M, et al. Acute kidney injury, mortality, length of stay, and costs in hospitalized patients. J Am Soc Nephrol 2005; 16(11): 3365–70.
  7. Henderson JM, Blackstone EH, Hixson ED, et al. Hospital-acquired anemia: prevalence, outcomes, and healthcare implications. J Hosp Med 2013; 8: 506–12.
  8. Thavendiranathan P, Bagai A, Ebidia A, et al. Do blood tests cause anemia in hospitalized patients? The effect of diagnostic phlebotomy on hemoglobin and hematocrit levels. J Gen Intern Med 2005; 20(6): 520–4.
  9. McEvoy MT, Shander A. Anemia, bleeding, and blood transfusion in the intensive care unit: causes, risks, costs, and new strategies. Am J Crit Care 2013; 22(6): eS1–13.
  10. Azarkhish I, Raoufy MR, Gharibzadeh S. Artificial intelligence models for predicting iron deficiency anemia and iron serum level based on accessible laboratory data. J Med Syst 2012; 36(3): 2057–61.
  11. Khan N, Patel D, Shah Y, et al. A novel model for predicting incident moderate to severe anemia and iron deficiency in patients with newly diagnosed ulcerative colitis. Dig Dis Sci 2017; 62(5): 1295–304.
  12. Brem H, Maggi J, Nierman D, et al. High cost of stage IV pressure ulcers. Am J Surg 2010; 200(4): 473–7.
  13. Thomas DR, Goode PS, Tarquine PH, et al. Hospital-acquired pressure ulcers and risk of death. J Am Geriatr Soc 1996; 44(12): 1435–40.
  14. Keller B, Wille J, van Ramshorst B, et al. Pressure ulcers in intensive care patients: a review of risks and prevention. Intensive Care Med 2002; 28(10): 1379–88.
  15. Demner-Fushman D, Rogers WJ, Aronson AR. MetaMap Lite: an evaluation of a new Java implementation of MetaMap. J Am Med Inform Assoc 2017; 24(4): 841–4.
  16. Lindberg DA, Humphreys BL, McCray AT. The Unified Medical Language System. Methods Inf Med 1993; 32: 281–91.
  17. Khwaja A. KDIGO clinical practice guidelines for acute kidney injury. Nephron Clin Pract 2012; 120(4): c179–84.
  18. Black J, Baharestani MM, Cuddigan J, et al. National Pressure Ulcer Advisory Panel's updated pressure ulcer staging system. Adv Skin Wound Care 2007; 20(5): 269–74.
  19. World Health Organization. Haemoglobin concentrations for the diagnosis of anaemia and assessment of severity. World Health Organization; 2011.
  20. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, et al., eds. Advances in Neural Information Processing Systems 30; December 4–9, 2017; Long Beach, CA: 5998–6008.
  21. Johnson AEW, Kramer AA, Clifford GD. A new severity of illness scale using a subset of Acute Physiology and Chronic Health Evaluation data elements shows comparable predictive accuracy. Crit Care Med 2013; 41(7): 1711–8.
  22. Goodwin TR. CANTRIP (version 1.0). GitHub. Retrieved August 12, 2022, from
  23. U.S. Department of Health and Human Services. NIH HPC Systems. National Institutes of Health. Retrieved August 12, 2022, from

Parent Projects
Nosocomial Risk Datasets from MIMIC-III was derived from MIMIC-III. Please cite it when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research

Corresponding Author
You must be logged in to view the contact information.