Database Credentialed Access
AIPatient KG: MIMIC-III and CORAL Electronic Health Records based Patient Knowledge Graph
Published: April 15, 2025. Version: 1.0.0
When using this resource, please cite:
Fan, L., & Yu, H. (2025). AIPatient KG: MIMIC-III and CORAL Electronic Health Records based Patient Knowledge Graph (version 1.0.0). PhysioNet. https://doi.org/10.13026/vjrq-9328.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
This study integrates the MIMIC-III and CORAL electronic health records into knowledge graphs to enhance their utility for advanced medical analysis and decision-making. MIMIC-III contains comprehensive data from over 40,000 patients, while CORAL focuses on oncology-specific information from 40 patients, aiding in complex medical reasoning. We used a Large Language Model (LLM)-based Named Entity Recognition (NER) approach to extract relevant medical information from these datasets, had the results independently verified by domain experts, and constructed the AIPatient and CORAL Knowledge Graphs in Neo4j. These graphs support the AIPatient system, which simulates patient interactions for advanced decision support. Additionally, we introduce MIMIC-III and CORAL question-answering sets, created to evaluate system performance in terms of accuracy, robustness, and stability.
Background
MIMIC-III (Medical Information Mart for Intensive Care III) is a comprehensive database that contains de-identified health-related data for over 40,000 patients from Beth Israel Deaconess Medical Center in Boston [1]. This dataset provides a wide range of information, including demographics, vital signs, laboratory tests, and medications. Similarly, the CORAL (expert-Curated medical Oncology Reports to Advance Language model inference) dataset focuses on a specialized subset of medical data, including de-identified records from 20 breast cancer and 20 pancreatic cancer patients collected at the University of California, San Francisco between 2012 and 2022 [2]. Both electronic health record (EHR) datasets provide valuable resources for diverse healthcare research and analysis.
Integrating knowledge graphs with EHRs such as MIMIC-III and CORAL enhances the utility of this data for complex medical reasoning and decision support [3]. By organizing and integrating information through a network of interconnected entities representing symptoms, diagnoses, and treatments, knowledge graphs enhance data interoperability and accessibility.
Recent advancements in technologies like Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) further amplify the benefits of using knowledge graphs in healthcare [4, 5]. RAG, by leveraging the retrieval of information during the generation process, can use the structured data provided by knowledge graphs to produce more accurate and contextually relevant responses. This is particularly useful in clinical settings where decision-making often relies on vast amounts of disparate data that must be synthesized quickly and accurately.
Here, we describe the AIPatient Knowledge Graph (AIPatient KG), which is designed to support AIPatient, an advanced simulated patient system powered by Reasoning RAG and a multi-agent framework [6]. In addition to the AIPatient KG, we also introduce the AIPatient Knowledge Graph-CORAL (AIPatient KG-CORAL), which is tailored for out-of-distribution (OOD) analysis.
Methods
To construct the dataset, we first restrict the current analysis scope to patients aged 18 years and above and exclude cases of parturition, elective procedures, severe car accidents, or instances where the patient lost the ability to communicate or interact with medical care personnel. These cases are excluded because they involve clinical complexities that fall outside the intended focus of AIPatient’s patient-provider interaction model. Patients unable to communicate effectively are incapable of providing disease-related information to medical professionals. Similarly, including patients under 18 would introduce unique clinical and developmental complexities requiring a separate, pediatric-focused model. Cases like parturition, elective procedures, and severe car accidents also involve highly specialized care and specific clinical pathways that add complexity beyond the system’s current scope.
Given the suitable patient population, we implemented a stratified sampling approach to select 56 patients from the MIMIC-III database, ensuring a representative distribution across major diagnostic categories. The stratification was based on the International Classification of Diseases, Ninth Revision (ICD-9) diagnostic codes, which were grouped into clinically relevant categories. These categories included infectious and parasitic diseases, neoplasms, endocrine and metabolic disorders, diseases of the circulatory and respiratory systems, and several other major disease classes. The rationale behind this stratification was to capture a diverse set of patient profiles that reflect a broad spectrum of medical conditions encountered in a clinical setting.
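The stratified selection described above can be sketched as follows. This is a minimal illustration, not the authors' sampling code: the function name `stratified_sample`, the proportional allocation rule, and the toy cohort are all assumptions, since the text only states that sampling was stratified by grouped ICD-9 categories.

```python
import random
from collections import defaultdict

def stratified_sample(patients, n_total, seed=0):
    """Pick about n_total patients, giving each category a share
    proportional to its size (at least one per non-empty category).

    `patients` is a list of (patient_id, category) pairs. Proportional
    allocation is an assumption; the source does not state the rule used.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for patient_id, category in patients:
        by_category[category].append(patient_id)

    total = len(patients)
    sample = {}
    for category, ids in sorted(by_category.items()):
        k = max(1, round(n_total * len(ids) / total))
        k = min(k, len(ids))  # never request more than the category holds
        sample[category] = rng.sample(ids, k)
    return sample

# Hypothetical cohort: 100 patients across three illustrative categories.
cohort = (
    [(i, "circulatory") for i in range(50)]
    + [(i, "respiratory") for i in range(50, 80)]
    + [(i, "neoplasms") for i in range(80, 100)]
)
picked = stratified_sample(cohort, n_total=10)
sizes = {category: len(ids) for category, ids in picked.items()}
```

Because each category's share is rounded independently, the total can differ slightly from `n_total` when the proportions are not whole numbers.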
To construct the AIPatient KG, we first extracted patients’ symptoms, medical history, allergies, social history and family history from the discharge summaries using an LLM-based NER approach. Two medical domain experts then independently evaluated and consolidated the NER results. We then combined the NER results with structured data from the tables to construct the knowledge graph in Neo4j [7]. A similar process was used to construct the AIPatient KG-CORAL from the entire CORAL patient population.
Table 1 shows the NER F1 results of AIPatient KG by entity category. We observe the highest knowledge base validity for the GPT-4 Turbo model, with an overall F1 of 0.89. Within the Claude family, Claude-3.5-sonnet performs best, with an overall F1 of 0.74. We note that the GPT-family models specifically excelled in extracting Allergies, where the older Claude models (Claude-3-haiku and Claude-3-sonnet) struggled. Based on these results, we use the GPT-4 Turbo model to construct the final version of AIPatient KG.
Table 1: NER F1 Results by Entity Categories

Entity Categories | claude-3 haiku | claude-3 sonnet | claude-3.5 sonnet | gpt-3.5 turbo | gpt-4o | gpt-4 turbo |
---|---|---|---|---|---|---|
Symptom Group | 0.69 | 0.70 | 0.78 | 0.72 | 0.75 | 0.90 |
Medical History | 0.69 | 0.69 | 0.89 | 0.87 | 0.96 | 0.98 |
Allergies | 0.68 | 0.69 | 0.70 | 0.69 | 0.74 | 0.87 |
Family and Social History Group | 0.71 | 0.71 | 0.75 | 0.71 | 0.74 | 0.91 |
Overall | 0.69 | 0.70 | 0.74 | 0.71 | 0.75 | 0.89 |

Notes: (1) Symptom Group includes symptoms, duration, intensity and frequency. (2) Family and Social History Group includes Family History (family members and their medical history) and Social History.
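For readers unfamiliar with the metric, the sketch below shows how an F1 score can be computed for one entity category by comparing extracted entities with an expert-consolidated reference. It assumes exact-match, set-based scoring; the source does not state the matching criterion, and the function name `entity_f1` and the example entities are illustrative only.

```python
def entity_f1(predicted, gold):
    """F1 between extracted entities and reference entities.

    Assumes exact string matching on de-duplicated entities; this is an
    illustrative choice, not necessarily the criterion used for Table 1.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Made-up example: two of three extracted symptoms match the reference.
gold = {"chest pain", "shortness of breath", "nausea"}
predicted = {"chest pain", "nausea", "fever"}
score = entity_f1(predicted, gold)  # precision = recall = 2/3
```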
To evaluate the downstream tasks of AIPatient, we developed two medical question-answering datasets, one for MIMIC-III and one for CORAL patients, consisting of 524 and 423 questions respectively. These questions focus on medical entities and relationships within the records. An example of a symptom-related question is: “what is the duration of the symptom ‘chest pain’?”
To construct these datasets, we first identified high-priority clinical concepts by reviewing MIMIC-III and CORAL’s structured and unstructured data elements, ensuring coverage across diverse patient presentations. Three expert medical doctors collaborated to formulate and validate the questions, ensuring that they adhered to standard clinical documentation practices. Each question was linked to specific data fields or textual sources within the electronic health records, allowing for systematic evaluation of AIPatient’s information retrieval and reasoning capabilities. The answer generation process involved extracting ground truth responses directly from patient records, with validation from an independent medical doctor to ensure clinical accuracy and consistency.
Data Description
MIMIC.backup and CORAL.backup contain the knowledge graph snapshots that can be used to recreate the AIPatient KG and AIPatient KG-CORAL in Neo4j. Tables 2-5 list the nodes and relationships of the two graphs, together with their counts. Note that the Vital node and the HAS_VITAL relationship are specific to the AIPatient KG and do not appear in AIPatient KG-CORAL. Additionally, the Patient node in the AIPatient KG-CORAL knowledge graph only contains Gender, Age, and Ethnicity, and the Admission node only contains Admission_Type. Table 6 lists the columns in the Question-and-Answer Dataset.

Table 2: Nodes in AIPatient KG

Node Label | Properties | Counts |
---|---|---|
Patient | SUBJECT_ID, GENDER, AGE, ETHNICITY, RELIGION, MARITAL_STATUS | 56 |
Admission | HADM_ID, DURATION, ADMISSION_TYPE, ADMISSION_LOCATION, DISCHARGE_LOCATION, INSURANCE | 56 |
Symptom | name | 240 |
Duration | name | 41 |
Intensity | name | 19 |
Frequency | name | 7 |
History | name | 364 |
Vital | LABEL, VALUE | 156 |
Allergy | name | 51 |
SocialHistory | name | 213 |
FamilyMember | description | 20 |
FamilyMedicalHistory | name | 46 |

Table 3: Relationships in AIPatient KG

Relationship | Source | Target | Counts |
---|---|---|---|
HAS_ADMISSION | Patient | Admission | 56 |
HAS_MEDICAL_HISTORY | Patient | History | 451 |
HAS_FAMILY_MEMBER | Patient | FamilyMember | 55 |
HAS_SYMPTOM | Admission | Symptom | 309 |
HAS_SOCIAL_HISTORY | Admission | SocialHistory | 221 |
HAS_VITAL | Admission | Vital | 193 |
HAS_ALLERGY | Admission | Allergy | 61 |
HAS_NOSYMPTOM | Admission | Symptom | 41 |
HAS_DURATION | Symptom | Duration | 65 |
HAS_INTENSITY | Symptom | Intensity | 22 |
HAS_FREQUENCY | Symptom | Frequency | 7 |
HAS_MEDICAL_HISTORY | FamilyMember | FamilyMedicalHistory | 55 |

Table 4: Nodes in AIPatient KG-CORAL

Node Label | Properties | Counts |
---|---|---|
Patient | SUBJECT_ID, GENDER, AGE, ETHNICITY, RELIGION, MARITAL_STATUS | 40 |
Admission | HADM_ID, DURATION, ADMISSION_TYPE, ADMISSION_LOCATION, DISCHARGE_LOCATION, INSURANCE | 40 |
Symptom | name | 71 |
Intensity | name | 2 |
Frequency | name | 4 |
History | name | 299 |
Allergy | name | 2 |
SocialHistory | name | 182 |
FamilyMember | description | 27 |
FamilyMedicalHistory | name | 53 |

Table 5: Relationships in AIPatient KG-CORAL

Relationship | Source | Target | Counts |
---|---|---|---|
HAS_ADMISSION | Patient | Admission | 40 |
HAS_MEDICAL_HISTORY | Patient | History | 347 |
HAS_FAMILY_MEMBER | Patient | FamilyMember | 89 |
HAS_SYMPTOM | Admission | Symptom | 25 |
HAS_SOCIAL_HISTORY | Admission | SocialHistory | 225 |
HAS_ALLERGY | Admission | Allergy | 4 |
HAS_NOSYMPTOM | Admission | Symptom | 60 |
HAS_INTENSITY | Symptom | Intensity | 3 |
HAS_FREQUENCY | Symptom | Frequency | 4 |
HAS_MEDICAL_HISTORY | FamilyMember | FamilyMedicalHistory | 109 |

Table 6: Columns in the Question-and-Answer Dataset

Variable | Description |
---|---|
SUBJECT_ID | patient identifier |
HADM_ID | admission identifier; in AIPatient KG-CORAL, equal to SUBJECT_ID |
Question | expert-curated question |
Question Category | question category, includes patient, admission, symptoms, vitals, medical history, allergies, family and social history |
Correct Answer | correct answers to the question |
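
As an illustration of how these columns might be used, the sketch below computes accuracy per question category. The column names `Question Category` and `Correct Answer` come from the table above; the `Model Answer` field, the case-insensitive exact-match rule, and the example rows are assumptions made for this sketch and are not part of the released dataset.

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Share of correctly answered questions within each category.

    Each record is a dict with the dataset columns plus a hypothetical
    "Model Answer" field. Answers are compared case-insensitively after
    trimming whitespace, which is a simplifying assumption.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in records:
        category = row["Question Category"]
        totals[category] += 1
        expected = row["Correct Answer"].strip().lower()
        answered = row["Model Answer"].strip().lower()
        if answered == expected:
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}

# Made-up rows for illustration only; they are not taken from the dataset.
records = [
    {"Question Category": "symptoms", "Correct Answer": "2 days", "Model Answer": "2 days"},
    {"Question Category": "symptoms", "Correct Answer": "severe", "Model Answer": "mild"},
    {"Question Category": "allergies", "Correct Answer": "Penicillin", "Model Answer": "penicillin "},
]
scores = accuracy_by_category(records)
```

Free-text answers will often need a more tolerant comparison than exact matching, so the scoring rule here should be treated as a placeholder.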
To enhance accessibility and usability, we have extracted all relevant data from the AIPatient KG and AIPatient KG-CORAL and saved it in JSON format. This JSON representation gives researchers a convenient way to view and access the extracted information without requiring specialized database tools; the Python script for extracting and storing the data is available in AIPatient's GitHub repository [8]. However, for optimal usability and full functionality, the knowledge graph is hosted within a Neo4j environment. Neo4j provides an interactive and efficient platform for querying complex relationships within the data, allowing researchers to explore structured connections between medical entities, patient histories, and extracted information more dynamically than static JSON files allow.
Usage Notes
The AIPatient KG enables advanced applications in AI-driven medical reasoning, clinical decision support, and medical education. By structuring patient data into a Neo4j-hosted knowledge graph, it facilitates efficient querying of symptoms, diagnoses, and treatments, supporting the development of AI models for automated medical question-answering and Retrieval-Augmented Generation (RAG). It is particularly valuable for training LLMs in clinical contexts and simulating patient interactions for medical trainees. However, its focus on critical care and oncology data may limit generalizability to broader patient populations, and inherent EHR inconsistencies pose challenges for AI applications.
To recreate the AIPatient KG or AIPatient KG-CORAL in Neo4j, first open the Neo4j Aura login page in a browser and enter your credentials. Then click Create Instance and wait until the new instance appears. Once the instance (for example, “Instance01”) is available, click the three dots at its upper right, select “Inspect”, then “Restore from backup file”, and drop in the file with the “.backup” extension to recreate the knowledge graph. Database creation can take up to 5 minutes.
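Once an instance has been restored, the graph can be queried from Python. The following is a minimal sketch using the official `neo4j` Python driver. The node labels, relationship types, and property names are taken from the node and relationship tables in the Data Description; whether the restored database uses exactly these names and casing, and whether `SUBJECT_ID` is stored as an integer or a string, are assumptions to verify against your instance. The function name `fetch_symptoms` is hypothetical.

```python
# Cypher query: symptoms recorded for one patient's admissions, with an
# optional duration. Labels and properties follow the schema tables; their
# exact spelling in the restored database is an assumption.
SYMPTOM_DURATION_QUERY = """
MATCH (p:Patient {SUBJECT_ID: $subject_id})-[:HAS_ADMISSION]->(a:Admission)
MATCH (a)-[:HAS_SYMPTOM]->(s:Symptom)
OPTIONAL MATCH (s)-[:HAS_DURATION]->(d:Duration)
RETURN s.name AS symptom, d.name AS duration
"""

def fetch_symptoms(uri, user, password, subject_id):
    """Return a list of {"symptom": ..., "duration": ...} dicts."""
    # Imported here so the module loads even without the driver installed;
    # running the query requires `pip install neo4j`.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            result = session.run(SYMPTOM_DURATION_QUERY, subject_id=subject_id)
            return [record.data() for record in result]
    finally:
        driver.close()
```

A call would look like `fetch_symptoms(uri, user, password, subject_id)`, where the connection URI and credentials are the ones shown for your own Aura instance.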
The data collection, cleaning, and analysis code is available in the GitHub repository [8]. The repository includes scripts for extracting structured and unstructured data, preprocessing patient records, and constructing the AIPatient Knowledge Graph in Neo4j. It also contains the code for the ablation experiments, the system performance evaluation, and the web-based demo used in the research paper [6, 8].
Ethics
MIMIC-III data were accessed through PhysioNet, which required a data use agreement. LLMs were used in compliance with PhysioNet standards: to protect data privacy, the GPT models were accessed through Azure and the Claude models through Amazon Bedrock.
Acknowledgements
The authors acknowledge the Rackham Graduate Student Research Grant (L.F.). The authors acknowledge the help from Dr. Libby Hemphill for the initial data acquisition and computing resource support.
Conflicts of Interest
H.Y.: Employment as a data scientist and stockholder in Amazon.com, Inc., unrelated to this work. All other authors declare they have no competing interests.
References
1. Johnson, A., Pollard, T., Shen, L. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016). https://doi.org/10.1038/sdata.2016.35
2. Sushil, M., Kennedy, V., Mandair, D., Miao, B., Zack, T., & Butte, A. (2024). CORAL: expert-Curated medical Oncology Reports to Advance Language model inference (version 1.0). PhysioNet. https://doi.org/10.13026/v69y-xa45.
3. Aldughayfiq B, Ashfaq F, Jhanjhi NZ, Humayun M. Capturing Semantic Relationships in Electronic Health Records Using Knowledge Graphs: An Implementation Using MIMIC III Dataset and GraphDB. Healthcare (Basel). 2023 Jun 15;11(12):1762. doi: 10.3390/healthcare11121762. PMID: 37372880; PMCID: PMC10297905.
4. Ng, K. K. Y., Matsuba, I., & Zhang, P. C. (2024). RAG in Health Care: A Novel Framework for Improving Communication and Decision-Making by Addressing LLM Limitations. Journal Name, Volume(Issue). https://orcid.org/0009-0002-2848-0571, https://orcid.org/0009-0002-2848-0571, https://orcid.org/0000-0001-7981-4303.
5. Clusmann, J., Kolbinger, F.R., Muti, H.S. et al. The future landscape of large language models in medicine. Commun Med 3, 141 (2023). https://doi.org/10.1038/s43856-023-00370-1
6. Yu, H., Zhou, J., Li, L., Chen, S., Gallifant, J., Shi, A., Li, X., Hua, W., Jin, M., Chen, G., Zhou, Y., Li, Z., Gupte, T., Chen, M.-L., Azizi, Z., Zhang, Y., Assimes, T. L., Ma, X., Bitterman, D. S., Lu, L., & Fan, L. (2024). AIPatient: Simulating Patients with EHRs and LLM Powered Agentic Workflow. arXiv:2409.18924 [cs.CL]. https://doi.org/10.48550/arXiv.2409.18924
7. Neo4j Inc. Neo4j Graph Database [Software]. Version 5.0. 2024. Available from: https://neo4j.com/
8. Yu H. AIPatient: Simulating Patients with EHRs and LLM Powered Agentic Workflow [Internet]. GitHub repository. 2025 [cited 2025 Mar 9]. Available from: https://github.com/huiziy/AIPatient/tree/main
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/vjrq-9328
DOI (latest version):
https://doi.org/10.13026/jxcs-fq11
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project