Name: Asclepius-R : Clinical Large Language Model Built On MIMIC-III Discharge Summaries
Published: Jan. 30, 2024
License: https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

Model Credentialed Access

Sunjun Kweon , Junu Kim , Jiyoun Kim , Sujeong Im , Eunbyeol Cho , Seongsu Bae , Jungwoo Oh , Gyubok Lee , Jong Hak Moon , Seng Chan You , Seungjin Baek , Chang Hoon Han , Yoon Bin Jung , Yohan Jo , Edward Choi

Published: Jan. 30, 2024. Version: 1.0.1

When using this resource, please cite:
Kweon, S., Kim, J., Kim, J., Im, S., Cho, E., Bae, S., Oh, J., Lee, G., Moon, J. H., You, S. C., Baek, S., Han, C. H., Jung, Y. B., Jo, Y., & Choi, E. (2024). Asclepius-R : Clinical Large Language Model Built On MIMIC-III Discharge Summaries (version 1.0.1). PhysioNet. https://doi.org/10.13026/s5rz-1j65.

Additionally, please cite the original publication:

Kweon, S., Kim, J., Kim, J., Im, S., Cho, E., Bae, S., Oh, J., Lee, G., Moon, J. H., You, S. C., Baek, S., Han, C. H., Jung, Y. B., Jo, Y., & Choi, E. (2023). Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes (arXiv:2309.00237). arXiv.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

The development of large language models tailored for handling patients’ clinical notes is often hindered by the limited accessibility and usability of these notes due to strict privacy regulations. To address these challenges, we first create synthetic large-scale clinical notes using publicly available case reports extracted from biomedical literature. We then use these synthetic notes to train our specialized clinical large language model, Asclepius. While Asclepius is trained on synthetic data, we assess its potential performance in real-world applications by evaluating it using real clinical notes. We benchmark Asclepius against several other large language models, including GPT-3.5-turbo and other open-source alternatives. To further validate our approach using synthetic notes, we also compare Asclepius with its variants trained on real clinical notes. Our findings convincingly demonstrate that synthetic clinical notes can serve as viable substitutes for real ones when constructing high-performing clinical language models. This conclusion is supported by detailed evaluations conducted by both GPT-4 and medical professionals. All resources used in the development of Asclepius, including weights, code, and data, are made publicly accessible for future research. Specifically, this repository contains Asclepius-R, a variant of Asclepius that was trained on MIMIC-III discharge summaries. All other resources are also publicly accessible.

Background

Clinical notes are a valuable source of patient-specific information. They can be analyzed using Natural Language Processing (NLP) techniques to aid medical professionals' decision-making processes. Large Language Models (LLMs) like OpenAI's GPT series excel in analyzing these notes. However, health organizations face two key challenges when using API-based external LLMs. First, transmitting sensitive patient information to these models raises privacy and security concerns. Hospitals must take careful steps to de-identify clinical notes and establish secure transmission protocols. Second, health organizations often prefer a model tailored to their unique needs, and relying on an external provider limits their autonomy over the LLM.

To overcome these challenges, a clinical LLM is needed that can operate securely offline while still offering the robust capabilities of online LLMs. Developing such a model requires a specific training dataset composed of instruction-answer pairs from real clinical notes. However, acquiring these notes is challenging due to privacy regulations, and creating a clinical instruction set is either labor-intensive or involves privacy and security challenges.

To tackle such challenges, we developed Asclepius, a clinical LLM designed using synthetic clinical notes and corresponding instruction-answer pairs. These synthetic notes are generated from PMC-Patients, which contains anonymized case reports from PubMed Central, a publicly available biomedical literature archive. The use of synthetic notes allows us to leverage advanced online LLMs, share resources and models as open source, and ensure clinical accuracy through the involvement of medical professionals.

However, the practicality of a model trained with synthetic clinical notes could be limited if there's a significant performance gap compared to models trained with real clinical notes. To evaluate this, we introduce Asclepius-R, a version of Asclepius trained with 57k MIMIC-III discharge summaries. We compare Asclepius-R-7B and Asclepius-R-13B, which have been pre-trained and fine-tuned on these real notes, with Asclepius-7B and Asclepius-13B, which were trained on synthetic notes.

Specifically, this repository contains Asclepius-R; Asclepius itself is accessible via GitHub [1] and Hugging Face [2].

Model Description

Data Preprocessing

We trained Asclepius-R with MIMIC-III real clinical notes, minimally filtering them with two conditions: 1) notes where the category is 'Discharge Summary', and 2) notes that are selected for use in evaluation datasets. Note that the tokenizer used is identical to that of the original LLaMA [5].
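The first filtering condition can be illustrated with a short sketch. It assumes the standard MIMIC-III `NOTEEVENTS.csv` layout, in which the `CATEGORY` column holds the value `Discharge summary`; the function name `select_discharge_summaries` is hypothetical, and the sample rows are made up rather than taken from MIMIC-III. The second condition is not shown, because the evaluation-set note lists are not described here.

```python
import csv
import io

def select_discharge_summaries(csv_file):
    """Return NOTEEVENTS rows whose CATEGORY is 'Discharge summary'.

    Hypothetical helper: assumes the file has a CATEGORY column,
    as in the standard MIMIC-III NOTEEVENTS.csv.
    """
    reader = csv.DictReader(csv_file)
    # Compare case-insensitively and ignore stray whitespace in the field.
    return [row for row in reader
            if row["CATEGORY"].strip().lower() == "discharge summary"]

# Tiny made-up sample with only the columns used here (not real data).
sample = io.StringIO(
    "ROW_ID,CATEGORY,TEXT\n"
    "1,Discharge summary,Example note A\n"
    "2,Nursing,Example note B\n"
    "3,Discharge summary,Example note C\n"
)
kept = select_discharge_summaries(sample)
print([row["ROW_ID"] for row in kept])
```

With the real file, the same function could be called on `open("NOTEEVENTS.csv", newline="")`.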

Clinical Instruction Generation

We created a clinical language model for diverse healthcare tasks using a specific instruction-answer dataset.
We started from popular clinical NLP tasks [3] and refined the list through consultations with professionals, resulting in eight tasks: Named Entity Recognition, Relation Extraction, Temporal Information Extraction, Coreference Resolution, Question Answering, Abbreviation Expansion, Summarization, and Paraphrasing. Using GPT-3.5-turbo and MIMIC-III clinical notes, we developed instruction-answer pairs for these tasks via a three-step process:

1. We began with five expert-verified examples per task as starting data.
2. We input these examples with the notes into GPT-3.5-turbo for task-related instruction creation, diversifying content using a bootstrapping method [4].
3. These instructions, in turn, prompted the model to produce the answers.

As a result, we generated 57k high-quality clinical instruction-answer pairs. Note that all of the above steps were performed on a HIPAA-certified platform (Azure GPT). In the first step, clinicians verified the quality of the seed examples. They were also involved in the prompt tuning process to generate clinically realistic questions and answers.
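The three-step process above can be sketched as prompt construction only, with no model call. This is a minimal illustration under stated assumptions, not the prompts actually used: the function names, the prompt wording, and the parameter `k` are all hypothetical. The bootstrapping idea is represented by sampling in-context examples from the seed examples together with previously generated instructions.

```python
import random

def build_instruction_prompt(task, seed_examples, note, generated_pool, k=3, rng=None):
    """Step 2 (sketch): ask a model for a new instruction for `task`.

    In-context examples are drawn from the expert-verified seeds plus
    instructions generated earlier (bootstrapping). Wording is hypothetical.
    """
    rng = rng or random.Random(0)
    pool = list(seed_examples) + list(generated_pool)
    shots = rng.sample(pool, min(k, len(pool)))
    lines = [f"Task: {task}", "Example instructions:"]
    lines += [f"- {shot}" for shot in shots]
    lines += ["", "[Note Begin]", note, "[Note End]", "",
              "Write one new instruction for this note."]
    return "\n".join(lines)

def build_answer_prompt(note, instruction):
    """Step 3 (sketch): ask the model to answer a generated instruction."""
    return "\n".join(["[Note Begin]", note, "[Note End]", "",
                      "Instruction: " + instruction, "Answer:"])

# Step 1 (sketch): seed examples; this one is a made-up placeholder.
seeds = ["Summarize the hospital course."]
p = build_instruction_prompt("Summarization", seeds, "NOTE TEXT", [])
print(p)
print(build_answer_prompt("NOTE TEXT", "Summarize the hospital course."))
```

Each returned string would then be sent to the model, and any new instruction appended to `generated_pool` before the next round.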

Model Training

Recently, the efficacy of fine-tuning foundation models like LLaMA [5] on instruction datasets was highlighted. Motivated by this, we developed a clinical large language model based on LLaMA. Considering that language models trained on general-domain text struggle with the peculiarities of clinical text [6], we pre-trained LLaMA-7B and LLaMA-13B on notes [7], followed by fine-tuning with clinical instructions. This resulted in two models, Asclepius-R-7B and Asclepius-R-13B.

File Description

Ascleipus-R-13B: This directory contains the Asclepius-R-13B model in Hugging Face Transformers format.
- *.json files describe the model configurations
- *.bin files contain the model parameters
- Instructions on how to use these files are described in the "Usage Notes" section.
Ascleipus-R-7B: This directory contains the Asclepius-R-7B model in Hugging Face Transformers format.
README.md: This file describes how to reproduce Asclepius-R.
mimiciii_discharge.csv: This file contains note-instruction-answer pairs used for the model training.

Examples

This model accepts clinical notes and instructions as inputs, and then generates corresponding answers. The table below presents examples of instructions from the dataset, along with responses generated by Asclepius-R-13B. Since the notes corresponding to the question-answer pairs originate from MIMIC-III, we are unable to display them here.

Question	Asclepius-R-13B Answer
What does 'CAD' stand for in the patient's discharge diagnosis?	'CAD' stands for coronary artery disease in the patient's discharge diagnosis.
What anticoagulant medication was resumed after the initial bleed and at what PTT goal was it initially set?	Heparin was resumed after the initial bleed with a close monitoring of PTT, and the initial goal PTT was 40.
Based on the sensitivities listed in the discharge summary, would repeat culture and sensitivity testing be warranted if third generation cephalosporins were used for future serious infections in this patient?	Yes, repeat culture and sensitivity testing may be warranted for future serious infections if third generation cephalosporins were used in this patient, as isolates that are initially susceptible may become resistant within three to four days after initiation of therapy.

Technical Implementation

Asclepius-R-7B and Asclepius-R-13B were pre-trained and fine-tuned on 8 A100 80GB GPUs in a local environment. Pre-training ran for one epoch and fine-tuning for three epochs, with a maximum sequence length of 2048. For the 7B model, pre-training took roughly 3.5 hours and fine-tuning roughly 5 hours; for the 13B model, pre-training took roughly 4 hours and fine-tuning roughly 7.5 hours. For detailed information on the configuration and hyperparameters used for training, please refer to the README file.
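As a rough worked example, the reported wall-clock times can be converted to GPU-hours. This sketch assumes all eight GPUs were occupied for the full duration of each stage, which is not stated above; the variable names are illustrative.

```python
NUM_GPUS = 8  # 8 A100 GPUs, as reported above

# Approximate wall-clock hours per stage, as reported above.
hours = {
    "7B": {"pretrain": 3.5, "finetune": 5.0},
    "13B": {"pretrain": 4.0, "finetune": 7.5},
}

# GPU-hours = total wall-clock hours x number of GPUs
# (assumes every GPU is busy for the whole run).
gpu_hours = {name: sum(stages.values()) * NUM_GPUS
             for name, stages in hours.items()}
print(gpu_hours)  # {'7B': 68.0, '13B': 92.0}
```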

Installation and Requirements

conda create -n asclepius python=3.9 -y
conda activate asclepius
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
pip install pandarallel pandas jupyter numpy datasets sentencepiece openai wandb accelerate transformers==4.32.0

Usage Notes

The instruction-answer pairs used to train Asclepius-R are saved in `mimiciii_discharge.csv`.

Column Name	Description
ROW_ID	Same as ROW_ID in MIMIC-III NOTEEVENTS.csv
Question	Instruction generated by GPT-3.5
Answer	GPT-3.5 answer corresponding to the question
Task	Category of the task, one of the eight tasks mentioned previously
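Because the table above lists `ROW_ID` rather than the note text, the pairs may need to be joined with `NOTEEVENTS.csv` to recover each note. The sketch below assumes that `mimiciii_discharge.csv` holds exactly the four columns listed and that the note body is in the `TEXT` column of `NOTEEVENTS.csv`; the function name `attach_notes` is hypothetical and the sample rows are made up.

```python
import csv
import io

def attach_notes(pairs_file, notes_file):
    """Join instruction-answer pairs with their notes on ROW_ID (sketch)."""
    # Map ROW_ID -> note text from the NOTEEVENTS-style file.
    notes = {row["ROW_ID"]: row["TEXT"] for row in csv.DictReader(notes_file)}
    triples = []
    for row in csv.DictReader(pairs_file):
        note = notes.get(row["ROW_ID"])
        if note is None:
            continue  # skip pairs whose note is not in the notes file
        triples.append({"note": note, "question": row["Question"],
                        "answer": row["Answer"], "task": row["Task"]})
    return triples

# Made-up stand-ins for the two CSV files (not real MIMIC-III data).
pairs = io.StringIO(
    "ROW_ID,Question,Answer,Task\n"
    "10,What does 'CAD' stand for?,Coronary artery disease.,Abbreviation Expansion\n"
    "99,Unmatched question,Unmatched answer,Question Answering\n"
)
notes = io.StringIO(
    "ROW_ID,CATEGORY,TEXT\n"
    "10,Discharge summary,Example note mentioning CAD\n"
)
triples = attach_notes(pairs, notes)
print(triples)
```

Each resulting dictionary supplies the `note` and `question` values expected by the prompt template.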

To use the model, set up the above conda environment, and download the model checkpoints.

Then, change {DOWNLOADED_PATH} in the code below to the model checkpoint path.

prompt = """You are an intelligent clinical languge model.
Below is a snippet of patient's discharge summary and a following instruction from healthcare professional.
Write a response that appropriately completes the instruction.
The response should provide the accurate answer to the instruction, while being concise.

[Discharge Summary Begin]
{note}
[Discharge Summary End]

[Instruction Begin]
{question}
[Instruction End] 
"""

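from transformers import AutoTokenizer, AutoModelForCausalLM

# Replace the placeholder string with the path to the downloaded checkpoint directory.
DOWNLOADED_PATH = "{DOWNLOADED_PATH}"

# Load the tokenizer (slow SentencePiece version) and the model weights.
tokenizer = AutoTokenizer.from_pretrained(DOWNLOADED_PATH, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(DOWNLOADED_PATH)

note = "This is a sample note"
question = "What is the diagnosis?"

# Fill the prompt template, tokenize it, and generate a response.
model_input = prompt.format(note=note, question=question)
input_ids = tokenizer(model_input, return_tensors="pt").input_ids
output = model.generate(input_ids)
print(tokenizer.decode(output[0]))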
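With decoder-only models such as LLaMA, `generate` returns the prompt tokens followed by the newly generated tokens, so the decoded text starts with the prompt. The sketch below shows one way to keep only the new tokens. It uses plain lists so it runs without a model, and the helper name `new_tokens_only` is hypothetical.

```python
def new_tokens_only(output_ids, input_ids):
    """Drop the echoed prompt tokens from a generated sequence (sketch)."""
    prompt_len = len(input_ids)
    return output_ids[prompt_len:]

# Made-up token ids standing in for a tokenized prompt and a model output.
prompt_ids = [101, 7, 8, 9]
generated = prompt_ids + [42, 43, 2]
print(new_tokens_only(generated, prompt_ids))  # [42, 43, 2]
```

With the tensors from the usage code, the equivalent slice would be `output[0][input_ids.shape[1]:]`, which can then be passed to `tokenizer.decode`.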

Release Notes

1.0.0 - Initial Release

1.0.1 - Instruction Typo

Ethics

N/A

Conflicts of Interest

The authors have no conflicts of interest to declare.

References

1. Asclepius GitHub [Online]. Available from: https://github.com/starmpcc/Asclepius [Accessed Nov 2023]
2. Asclepius Hugging Face [Online]. Available from: https://huggingface.co/starmpcc [Accessed Nov 2023]
3. Wu, H., Wang, M., Wu, J., Francis, F., Chang, Y. H., Shavick, A., ... & Dobson, R. J. (2022). A survey on clinical natural language processing in the United Kingdom from 2007 to 2022. NPJ digital medicine, 5(1), 186.
4. Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., & Hajishirzi, H. (2023, July). Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 13484-13508). Toronto, Canada: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.754
5. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., ... & Lample, G. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
6. Laparra, E., Bethard, S., & Miller, T. A. (2020). Rethinking domain adaptation for machine learning over clinical language. JAMIA open, 3(2), 146-150.
7. Alsentzer, E., Murphy, J., Boag, W., Weng, W. H., Jindi, D., Naumann, T., & McDermott, M. (2019, June). Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop (pp. 72-78).