Software Credentialed Access

Code for generating the HAIM multimodal dataset of MIMIC-IV clinical data and x-rays

Luis R Soenksen Yu Ma Cynthia Zeng Leonard David Jean Boussioux Kimberly Villalobos Carballo Liangyuan Na Holly Wiberg Michael Li Ignacio Fuentes Dimitris Bertsimas

Published: Aug. 23, 2022. Version: 1.0.1

When using this resource, please cite:
Soenksen, L. R., Ma, Y., Zeng, C., Boussioux, L. D. J., Villalobos Carballo, K., Na, L., Wiberg, H., Li, M., Fuentes, I., & Bertsimas, D. (2022). Code for generating the HAIM multimodal dataset of MIMIC-IV clinical data and x-rays (version 1.0.1). PhysioNet.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Abstract

This resource generates a multimodal combination of the MIMIC-IV v1.0.0 and MIMIC Chest X-ray (MIMIC-CXR-JPG) v2.0.0 databases, filtered to include only patients who have had at least one chest X-ray, with the goal of validating multimodal predictive analytics in healthcare operations. The dataset generated by this code contains 34,540 individual patient files in the form of "pickle" Python object structures, which cover a total of 7,279 hospitalization stays involving 6,485 unique patients. Code to extract feature embeddings and the list of pre-processed features are also included in this repository.


Background

As described in Soenksen et al 2022 [3], the MIMIC datasets can be used to test multimodal machine learning systems. To generate a multimodal dataset, our project uses the Medical Information Mart for Intensive Care (MIMIC)-IV v1.0 [1] resource, which contains de-identified records of 383,220 individual patients admitted to the intensive care unit (ICU) or emergency department (ED) of Beth Israel Deaconess Medical Center (BIDMC), in combination with the MIMIC Chest X-ray (MIMIC-CXR-JPG) database v2.0.0 [2], which contains 377,110 radiology images with free-text reports representing 227,835 medical imaging events that can be matched to corresponding patients included in MIMIC-IV v1.0.

We combined MIMIC-IV v1.0 [1] and MIMIC-CXR-JPG v2.0.0 [2] into a unified multimodal dataset, which we identify as HAIM-MIMIC-MM in Soenksen et al 2022 [3], based on matched patient, admission, and imaging-study identifiers (i.e., subject_id, stay_id, study_id from the MIMIC-IV and MIMIC-CXR-JPG databases). We used this multimodal dataset to assist in the systematic evaluation of improvements in predictive value from multi-modality in canonical artificial intelligence models for healthcare. The file format produced by the present code includes structured patient information, time-series data, medical images, and unstructured text notes for each patient.

Building this combination of MIMIC-IV and MIMIC-CXR-JPG into independent patient files for use with the Holistic Artificial Intelligence in Medicine (HAIM) framework presented in Soenksen et al 2022 [3] requires credentialed access to MIMIC-IV v1.0 [1] and MIMIC-CXR-JPG v2.0.0 [2]. A GitHub repository describing the use of this multimodal combination database as a canonical example to train multimodal artificial intelligence models for clinical use and healthcare operations can be found online [4].

Software Description

The multimodal clinical database used in Soenksen et al 2022 [3] contains N=34,537 samples, spanning 7,279 unique hospitalizations and 6,485 patients. This database contains 4 distinct data modalities (i.e., tabular data, time-series information, text notes, and X-ray images).

Every patient file in this multimodal dataset includes information extracted from the following fields in MIMIC-IV v1.0 [1] and MIMIC-CXR-JPG v2.0.0 [2]: admissions, demographics, transfers, core, diagnoses_icd, drgcodes, emar, emar_detail, hcpcsevents, labevents, microbiologyevents, poe, poe_detail, prescriptions, procedures_icd, services, procedureevents, outputevents, inputevents, icustays, datetimeevents, chartevents, cxr, imcxr, noteevents, dsnotes, ecgnotes, echonotes, radnotes. We have created sample Jupyter notebooks and Python files to showcase how this structure is generated from MIMIC-IV v1.0 [1] and MIMIC-CXR-JPG v2.0.0 [2].

Storing each patient as an individual file in pickle format provides several advantages for training artificial intelligence and machine learning models on this multimodal dataset. For instance, Python's native support for pickle files allows for fast loading, while the individualized patient files make it easier to feed patient samples that meet selected criteria into training algorithms from standard open-source machine learning libraries written in Python.

The code provided processes all the individual patient files and saves them locally as "pickle" Python object structures for ease of processing in subsequent sampling and modeling tasks. The final file structure should be organized in folders of 1000 files each, where each file is named haim-ID.pkl, and the mapping between haim-ID and the MIMIC patient IDs is recorded in the file "haim_mimiciv_key_ids.csv".
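
The layout described above can be sketched as follows. This is only an illustration under stated assumptions: the folder naming (FolderNN, borrowed from the sample folders Folder00 to Folder04), the assignment of files to folders in ID order, the exact file-name pattern, and the column names of the key file (haim_id, subject_id, stay_id, study_id) are all hypothetical and should be checked against the files actually produced by the code. The CSV rows are made-up placeholder values.

```python
import csv
import io
from pathlib import Path

FILES_PER_FOLDER = 1000  # folder size stated in the text above


def patient_file_path(root, haim_id):
    """Return the assumed path of the pickle file for an integer haim_id."""
    # Assumed folder naming: Folder00, Folder01, ... filled in ID order.
    folder = f"Folder{haim_id // FILES_PER_FOLDER:02d}"
    # Assumed file naming: the bare ID followed by ".pkl".
    return Path(root) / folder / f"{haim_id}.pkl"


def load_key_ids(csv_text):
    """Map haim_id -> row dict from key-file text (column names are assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row["haim_id"]): row for row in reader}


# Tiny in-memory stand-in for haim_mimiciv_key_ids.csv (placeholder values).
example_csv = (
    "haim_id,subject_id,stay_id,study_id\n"
    "0,101,301,401\n"
    "1234,102,302,402\n"
)
key = load_key_ids(example_csv)
path = patient_file_path("Sample_Multimodal_Patient_Files", 1234)
print(path.as_posix(), key[1234]["subject_id"])
```

Keeping the key file as the single lookup table means a cohort can be selected by MIMIC identifiers first and only the matching pickle files opened afterwards.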

The definition of the data structure for all patient files in relation to the individualized data in each pickle file is as follows:

Patient class structure

class Patient_ICU(object):
    def __init__(self, admissions, demographics, transfers, core,
        diagnoses_icd, drgcodes, emar, emar_detail, hcpcsevents,
        labevents, microbiologyevents, poe, poe_detail,
        prescriptions, procedures_icd, services, procedureevents,
        outputevents, inputevents, icustays, datetimeevents,
        chartevents, cxr, imcxr, noteevents, dsnotes, ecgnotes,
        echonotes, radnotes):

        ## CORE
        self.admissions = admissions
        self.demographics = demographics
        self.transfers = transfers
        self.core = core

        ## HOSP
        self.diagnoses_icd = diagnoses_icd
        self.drgcodes = drgcodes
        self.emar = emar
        self.emar_detail = emar_detail
        self.hcpcsevents = hcpcsevents
        self.labevents = labevents
        self.microbiologyevents = microbiologyevents
        self.poe = poe
        self.poe_detail = poe_detail
        self.prescriptions = prescriptions
        self.procedures_icd = procedures_icd
        self.services = services

        ## ICU
        self.procedureevents = procedureevents
        self.outputevents = outputevents
        self.inputevents = inputevents
        self.icustays = icustays
        self.datetimeevents = datetimeevents
        self.chartevents = chartevents

        ## CXR
        self.cxr = cxr
        self.imcxr = imcxr

        ## NOTES
        self.noteevents = noteevents
        self.dsnotes = dsnotes
        self.ecgnotes = ecgnotes
        self.echonotes = echonotes
        self.radnotes = radnotes
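
As a rough illustration of how an object with this structure might be inspected after loading, the sketch below groups the attribute names by the same headings used in the class and counts the entries held under each one. It uses a `types.SimpleNamespace` stand-in with made-up list values rather than a real patient file; `MODALITY_GROUPS` and `summarize_patient` are hypothetical helpers, not part of the released code, and the real attributes may be tables rather than lists.

```python
import pickle
from types import SimpleNamespace

# Attribute names grouped by the section comments of the class above.
MODALITY_GROUPS = {
    "CORE": ["admissions", "demographics", "transfers", "core"],
    "HOSP": ["diagnoses_icd", "drgcodes", "emar", "emar_detail",
             "hcpcsevents", "labevents", "microbiologyevents", "poe",
             "poe_detail", "prescriptions", "procedures_icd", "services"],
    "ICU": ["procedureevents", "outputevents", "inputevents", "icustays",
            "datetimeevents", "chartevents"],
    "CXR": ["cxr", "imcxr"],
    "NOTES": ["noteevents", "dsnotes", "ecgnotes", "echonotes", "radnotes"],
}


def summarize_patient(patient):
    """Return {group: {attribute: len(value)}} for the attributes present."""
    summary = {}
    for group, names in MODALITY_GROUPS.items():
        summary[group] = {
            name: len(getattr(patient, name))
            for name in names
            if hasattr(patient, name)
        }
    return summary


# Stand-in object with placeholder values; a real file holds a Patient_ICU.
stand_in = SimpleNamespace(
    admissions=[{"row": 1}], cxr=[{}, {}], imcxr=["img0", "img1"], radnotes=[]
)
# Round trip through pickle, mimicking saving and reloading a patient file.
restored = pickle.loads(pickle.dumps(stand_in))
summary = summarize_patient(restored)
print(summary["CXR"])
```

When unpickling a real patient file, Python needs the Patient_ICU class to be importable under the same module name it had when the file was written, so the class definition should be imported or defined before calling `pickle.load`.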

In the specific context of Soenksen et al 2022 [3], the selection of "pickle files" for per-patient data format was done to provide an interface with common machine learning and artificial intelligence modeling techniques which heavily rely on Python to conduct computational experiments.

In addition to the code to generate the multimodal patient files, we include the extracted HAIM embeddings on such files for convenience. We hope this format allows a wide audience for more direct access to this merged dataset.

Technical Implementation

Our code processes the multimodal combination of MIMIC-IV v1.0.0 [1] and MIMIC Chest X-ray (MIMIC-CXR-JPG) v2.0.0 [2] by first importing all MIMIC-IV tables together with the compressed JPG images from the MIMIC-CXR-JPG database, both of which need to be downloaded locally via credentialed access on PhysioNet. Both data sources have previously been independently de-identified by deleting all protected health information (PHI), following the US Health Insurance Portability and Accountability Act of 1996 (HIPAA) Safe Harbor requirements.

After getting access from PhysioNet, our code unifies registries on MIMIC-IV v1.0.0 [1] and MIMIC Chest X-ray (MIMIC-CXR-JPG) v2.0.0 [2] based on matched patient, admission, and imaging-study identifiers (i.e., subject_id, stay_id, study_id). We have created a HAIM GitHub repository [4] for collaborative code development on our HAIM framework, testing, and reproduction of the results presented in Soenksen et al 2022 [3]. We welcome code contributions from all users, and we encourage discussion of the data via the GitHub issues.
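
The identifier-based matching described above can be pictured with the minimal sketch below. It is not the authors' implementation: it links records on subject_id only, uses plain dictionaries instead of the MIMIC tables, and the rows are made-up placeholders. The function name `match_studies_to_stays` is hypothetical. The filter mirrors the stated inclusion rule that only patients with at least one chest X-ray are kept.

```python
from collections import defaultdict


def match_studies_to_stays(stays, studies):
    """Attach imaging study IDs to stays that share the same subject_id."""
    studies_by_subject = defaultdict(list)
    for study in studies:
        studies_by_subject[study["subject_id"]].append(study["study_id"])

    matched = []
    for stay in stays:
        study_ids = studies_by_subject.get(stay["subject_id"], [])
        if study_ids:  # keep only patients with at least one chest X-ray
            matched.append({**stay, "study_ids": sorted(study_ids)})
    return matched


# Placeholder rows standing in for MIMIC-IV stays and MIMIC-CXR-JPG studies.
stays = [
    {"subject_id": 101, "stay_id": 301},
    {"subject_id": 102, "stay_id": 302},
]
studies = [
    {"subject_id": 101, "study_id": 402},
    {"subject_id": 101, "study_id": 401},
]
matched = match_studies_to_stays(stays, studies)
print(matched)
```

A fuller version would presumably also account for the timing of each imaging study relative to the stay; that step is left out here because the text does not describe it.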

Installation and Requirements

As specified in Soenksen et al 2022 [3], the individual multimodal HAIM patient files were generated on a computer with 8 cores and 32 GB of available random access memory (RAM). A minimum of 20 GB of RAM is usually required for this processing task. All required installations for this project are specified in the "env.yaml" file within the "env" folder. A sample of 5 folders with previously generated multimodal patient files (Folder00 to Folder04) is included in the "Sample_Multimodal_Patient_Files" repository to facilitate testing and validation of the merged dataset, in the form of individual patient files, along with the multimodal machine learning techniques presented in Soenksen et al 2022 [3].

Usage Notes

Three Jupyter notebook files demonstrate:

  1. The generation of the merged HAIM-MIMIC-MM dataset ("1_Generate_HAIM-MIMIC-MM.ipynb")
  2. The generation of embeddings based on the individual patient files from HAIM-MIMIC-MM ("Generate_Embeddings_from_Pickle_Files.ipynb"), and
  3. Sample utilization of such embeddings for the creation of a predictive task using machine learning ("3_Use_Embedding_for_Prediction.ipynb")

All other code needed to evaluate this multimodal database and reproduce the conclusions of its companion modeling work in Soenksen et al 2022 [3] is available in our GitHub repository [4].

Release Notes

Version 1.0.0: First release of the software and sample data.

Version 1.0.1: Updated release of the software and sample data.


Ethics

The authors declare no ethics concerns.


Acknowledgements

We thank the PhysioNet team from the MIT Laboratory for Computational Physiology for providing our researchers with credentialed access to the MIMIC-IV v1.0.0 [1] and MIMIC Chest X-ray (MIMIC-CXR-JPG) v2.0.0 [2] datasets and for their support in guiding multimodal data interrogation and consolidation. We especially thank Leo A. Celi and Sicheng Hao for their support in reviewing the HAIM data, as well as the Harvard TH Chan School of Public Health, Harvard Medical School, the Institute for Medical Engineering and Science at MIT, and the Beth Israel Deaconess Medical Center for their continued support of this work. We thank MIT Supercloud services for their support and help in setting up a workspace as well as for offering technical advice throughout the project.

Conflicts of Interest

The authors declare no competing interests.


References

  1. Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2021). MIMIC-IV (version 1.0). PhysioNet.
  2. Johnson, A., Lungren, M., Peng, Y., Lu, Z., Mark, R., Berkowitz, S., & Horng, S. (2019). MIMIC-CXR-JPG - chest radiographs with structured labels (version 2.0.0). PhysioNet.
  3. Soenksen, L.R., Ma, Y., Zeng, C., Boussioux, L.D., Carballo, K.V., Na, L., Wiberg, H.M., Li, M.L., Fuentes, I. and Bertsimas, D., 2022. Integrated multimodal artificial intelligence framework for healthcare applications. arXiv preprint arXiv:2202.12998.
  4. Soenksen, L.R., Ma, Y., Zeng, C., Boussioux, L.D., Carballo, K.V., Na, L., Wiberg, H.M., Li, M.L., Fuentes, I. and Bertsimas, D., Holistic Artificial Intelligence in Medicine, (2022), GitHub repository

Parent Projects
Code for generating the HAIM multimodal dataset of MIMIC-IV clinical data and x-rays was derived from: Please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
