Model Credentialed Access

Me-LLaMA: Foundation Large Language Models for Medical Applications

Qianqian Xie Qingyu Chen Aokun Chen Cheng Peng Yan Hu Fongci Lin Xueqing Peng Jimin Huang Jeffrey Zhang Vipina Keloth Xinyu Zhou Huan He Lucila Ohno-Machado Yonghui Wu Hua Xu Jiang Bian

Published: June 5, 2024. Version: 1.0.0

When using this resource, please cite:
Xie, Q., Chen, Q., Chen, A., Peng, C., Hu, Y., Lin, F., Peng, X., Huang, J., Zhang, J., Keloth, V., Zhou, X., He, H., Ohno-Machado, L., Wu, Y., Xu, H., & Bian, J. (2024). Me-LLaMA: Foundation Large Language Models for Medical Applications (version 1.0.0). PhysioNet.

Additionally, please cite the original publication:

Xie, Q., Chen, Q., Chen, A., Peng, C., Hu, Y., Lin, F., ... & Bian, J. (2024). Me LLaMA: Foundation Large Language Models for Medical Applications. arXiv preprint arXiv:2402.12749.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Recent advancements in large language models (LLMs) such as ChatGPT and LLaMA have hinted at their potential to revolutionize medical applications, yet their application in clinical settings often reveals limitations due to a lack of specialized training on medical-specific data. In response to this challenge, this study introduces Me-LLaMA, a medical LLM family that includes the foundation models Me-LLaMA 13B and 70B, along with their chat-enhanced versions Me-LLaMA 13B-chat and 70B-chat, developed through continual pre-training and instruction tuning of LLaMA2 using large medical datasets. Our methodology leverages a comprehensive domain-specific data suite: a large-scale continual pre-training dataset with 129B tokens, an instruction tuning dataset with 214k samples, and a new medical evaluation benchmark (MIBE) covering six critical medical tasks with 12 datasets. Our extensive evaluation using the MIBE shows that Me-LLaMA models achieve overall better performance than existing open-source medical LLMs in zero-shot, few-shot, and supervised learning settings. With task-specific instruction tuning, Me-LLaMA models outperform ChatGPT on 7 out of 8 datasets and GPT-4 on 5 out of 8 datasets. In addition, we investigated the catastrophic forgetting problem, and our results show that Me-LLaMA models outperform other open-source medical LLMs in mitigating this issue. Me-LLaMA is one of the largest open-source medical foundation LLMs that use both biomedical and clinical data. It exhibits superior performance across both general and medical tasks compared to other open-source medical LLMs, rendering it an attractive choice for medical AI applications.


Large Language Models (LLMs) have recently shown remarkable performance on a variety of natural language processing (NLP) tasks in different domains. However, general-purpose models have shown limitations on domain-specific tasks, due to idiosyncrasies in text and language that vary from domain to domain. Thus, there have been efforts to develop LLMs for use in specific domains, such as the biomedical and clinical domains. However, many of the most powerful models, such as Med-PaLM 2 [1] and GPT-4 [2], are closed-source, which limits their usability for research and their customizability to these domains. Recent open-source models specifically for the clinical domain include GatorTronGPT [3] and Clinical-LLaMA [4]. GatorTronGPT was trained from scratch with model sizes of 5B and 20B parameters, but was not instruction tuned and thus comparatively lacks conversational capabilities. Clinical-LLaMA was trained on classification tasks, and only features a 7B parameter model. Our model Me-LLaMA has also been trained using clinical texts and a multitude of tasks, and features model sizes of 13B and 70B. Our model is thus the largest and most flexible of these open-source clinical LLMs. Our evaluation scripts are available at our GitHub page [5], and the datasets are available at our HuggingFace page [6]. Details on the model, including training and evaluation, can be found in our paper [7].

Model Description

The Me-LLaMA family consists of two foundation models, Me-LLaMA 13B and 70B, and their chat-enhanced counterparts, Me-LLaMA 13B-chat and Me-LLaMA 70B-chat, designed for superior chat and instruction-following ability. Me-LLaMA 13B and 70B were continually pretrained from the base LLaMA 2 13B and 70B models [8] with the addition of biomedical, clinical, and general domain data. The chat versions were developed by further instruction tuning of their respective foundation models with comprehensive medical instruction tuning data. The pretraining data consists of biomedical, clinical, and general domain data in a 15:1:4 ratio, which helps maintain a strong focus on the medical domain while also incorporating a broad spectrum of general knowledge and mitigating catastrophic forgetting.

We developed Me-LLaMA through continual pre-training and instruction tuning of LLaMA2, using 129B tokens and 214K instruction tuning samples from the general, biomedical, and clinical domains. The 129B-token pretraining dataset is composed of biomedical literature, clinical notes, and general domain data in a 15:1:4 ratio. The biomedical portion integrates a vast collection of biomedical literature from PubMed Central and PubMed Abstracts, sourced from the Pile dataset [9]. The clinical notes include de-identified free-text clinical notes from MIMIC-III [10], MIMIC-IV [11], and MIMIC-CXR [12]. Our general domain data uses a subset of the RedPajama [13] dataset, a replication of LLaMA’s pretraining data.
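To make the 15:1:4 mixture concrete, the approximate token budget per domain can be worked out directly. The following is a minimal arithmetic sketch; it assumes the ratio applies exactly to token counts, and the variable names are illustrative rather than taken from the Me-LLaMA code base.

```python
# Approximate token budget per domain, assuming the 15:1:4 ratio
# applies exactly to the 129B-token pre-training corpus.
TOTAL_TOKENS_B = 129  # billions of tokens, as stated in the text
RATIO = {"biomedical": 15, "clinical": 1, "general": 4}

ratio_sum = sum(RATIO.values())
tokens_per_domain_b = {
    domain: TOTAL_TOKENS_B * share / ratio_sum
    for domain, share in RATIO.items()
}

for domain, tokens_b in tokens_per_domain_b.items():
    print(f"{domain}: about {tokens_b:.2f}B tokens")
```

Under this assumption the biomedical portion accounts for three quarters of the corpus, with clinical notes forming the smallest share.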

The instruction tuning dataset is again sourced from the general, biomedical, and clinical domains. The general domain consists of the Alpaca [14], Dolly [15], and ShareGPT [16] datasets. The biomedical portion comes from HealthCareMagic [17], Icliniq [17], MedInstruct [18], Medical Flash Cards [19], MEDIQA [20], MedicationQA [21], LiveQA [22], WikiDocPatient [19], Guideline QA, Pubmed Central, Pubmed [23], and the UMLS Knowledge graph [24]. The clinical domain texts are from MIMIC-III [10] and MIMIC-IV [11].

The evaluation data are composed of 12 datasets with a range of tasks. Specifically, we used the PubMedQA [25], MedQA [26], MedMCQA [27], the EmrQA [28] question-answering (QA) datasets, the 2010 i2b2 [29] named entity recognition (NER) dataset, the 2013 DDI [30] relation extraction dataset, the HoC [31] and MTSample [32] classification (CF) datasets, the PubMed [33] and MIMIC-CXR [12] text summarization (TS) datasets, and the BioNLI [34] and MedNLI [35] natural language inference (NLI) datasets. The performance of Me-LLaMA was evaluated on zero-shot, few-shot, and supervised fine-tuning settings.

We found that the Me-LLaMA 13B model surpassed the similar-sized medical foundation model PMC-LLaMA 13B on 11 out of 12 datasets and outperformed the general foundation model LLaMA2 13B on 10 out of 12 datasets, the exceptions being DDI and HoC. Moreover, the Me-LLaMA 13B model was competitive with LLaMA2 70B and Meditron 70B, which have significantly more parameters, on 8 out of 12 datasets (PubMedQA, EmrQA, 2010 i2b2, MTSample, PubMed, MIMIC-CXR, BioNLI, and MedNLI). As for the 70B models, Me-LLaMA 70B outperformed LLaMA2 70B and Meditron 70B on 9 out of 12 datasets (all except MedMCQA, 2010 i2b2, and MIMIC-CXR).

In the zero-shot setting, Me-LLaMA models outperformed ChatGPT on 5 of the 8 datasets without privacy concerns, but outperformed GPT-4 on only 1. With task-specific instruction tuning, Me-LLaMA models surpassed ChatGPT on 7 and GPT-4 on 5 of the 8 datasets. Notably, the Me-LLaMA models are significantly smaller: 13B or 70B parameters versus at least 175B for ChatGPT and GPT-4. Despite this size discrepancy, the Me-LLaMA models showed strong supervised learning and in-context learning ability across a broad spectrum of medical tasks, underscoring their efficiency and potential in the field.

Included in this repository are four models:

  • Me-LLaMA 13B: The Me-LLaMA model initialized and continually pretrained from LLaMA 2 13B.
  • Me-LLaMA 70B: The Me-LLaMA model initialized and continually pretrained from LLaMA 2 70B.
  • Me-LLaMA 13B chat: This model was initialized from Me-LLaMA 13B. It was further instruction tuned on a variety of general, biomedical, and clinical datasets.
  • Me-LLaMA 70B chat: This model was initialized from Me-LLaMA 70B. It was further instruction tuned on a variety of general, biomedical, and clinical datasets.

Each of the models contains several files, which are standard with the transformers library [36].

  • config.json: Information about the model
  • model-x-of-y.safetensors: model weights
  • generation_config.json: Settings for text generation
  • special_tokens_map.json: Special tokens that are used in training
  • tokenizer.json: Mapping from indices to tokens.
  • tokenizer_config.json: Configuration file for the tokenizer.
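Before loading a model, it can help to confirm that a download is complete. The sketch below checks a local directory against the file list above and reads config.json. The helper name inspect_model_dir is hypothetical, the tokenizer configuration file is assumed to use the lowercase name that the transformers library writes, and the weight shards are assumed to follow the model-x-of-y.safetensors pattern.

```python
import json
from pathlib import Path

# File names from the list above. The lowercase "tokenizer_config.json"
# is an assumption based on what the transformers library writes.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
]


def inspect_model_dir(model_dir):
    """Return (missing_files, weight_shards, config) for a local model directory."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # Weight shards are matched by pattern rather than by exact name.
    shards = sorted(p.name for p in root.glob("model-*-of-*.safetensors"))
    config_path = root / "config.json"
    config = json.loads(config_path.read_text()) if config_path.is_file() else {}
    return missing, shards, config
```

Calling inspect_model_dir("FOLDER_PATH_TO_MODEL") returns the names of any missing files, the weight shards found, and the parsed configuration as a dictionary.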

Technical Implementation

We trained the foundation models on next-token prediction using the AdamW optimizer for one epoch on 160 A100 80GB GPUs from the University of Florida’s HiPerGator AI supercomputer. We used hyperparameters of β1 = 0.9, β2 = 0.95, a weight decay of 1e-5, a learning rate of 8e-6, a warmup ratio of 0.05 for the cosine learning rate scheduler, bf16 precision, and 16-step gradient accumulation. We used DeepSpeed [37] for model parallelism. For instruction fine-tuning, we used LoRA-based parameter-efficient fine-tuning [38] for 3 epochs on 8 H100 GPUs, with a learning rate of 1e-5, a weight decay of 1e-5, and a warmup ratio of 0.01.
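As an illustration of how a warmup ratio and a cosine scheduler interact, the sketch below computes the learning rate at a given step. It assumes a common formulation (linear warmup from zero to the peak rate, then half-cosine decay to zero); the exact schedule used in training is not specified here, and the function name and the total step count in the usage note are hypothetical.

```python
import math

PEAK_LR = 8e-6        # pre-training learning rate reported above
WARMUP_RATIO = 0.05   # pre-training warmup ratio reported above


def lr_at_step(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Linear warmup followed by cosine decay to zero (assumed schedule shape)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, with a hypothetical run of 1,000 optimizer steps, the rate rises during roughly the first 5% of steps, peaks at 8e-6, and decays to zero by the final step.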

Installation and Requirements

Before proceeding, ensure you've downloaded the Me-LLaMA model files to a local directory. This directory path will be referred to as "FOLDER_PATH_TO_MODEL" in the code snippets. Ensure the torch and transformers libraries are installed in your environment.

For generating text with the Me-LLaMA model in a local setup, use the pipeline as follows, specifying the path to your local model:

from transformers import pipeline

# Replace "FOLDER_PATH_TO_MODEL" with the actual path to your local model directory
pipe = pipeline("text-generation", model="FOLDER_PATH_TO_MODEL")

# Generate text and print the first returned sequence
generated_text = pipe("The medical condition is characterized by", num_return_sequences=1)
print(generated_text[0]["generated_text"])

For more control, such as fine-tuning or custom inference, load the tokenizer and model directly using the local path:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model from the local model directory
# Replace "FOLDER_PATH_TO_MODEL" with the actual path to your local model directory
tokenizer = AutoTokenizer.from_pretrained("FOLDER_PATH_TO_MODEL")
model = AutoModelForCausalLM.from_pretrained("FOLDER_PATH_TO_MODEL")

# Tokenize text
input_ids = tokenizer("[INPUT SENTENCE]", return_tensors="pt").input_ids

# Generate output
generated_tokens = model.generate(input_ids, max_length=50)  # Adjust max_length as necessary

# Decode and print the generated text
generated_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
print(generated_text)

Usage Notes

When utilizing Me-LLaMA models, particularly for fine-tuning and inference, it is important to consider the computational resources required. As large language models, the Me-LLaMA models are versatile and can be fine-tuned for a wide array of NLP tasks, including named-entity recognition, sequence classification, and question answering. For detailed implementation guidance and code examples, we recommend consulting the documentation of the transformers library [36].

  • GPU Requirements: Fine-tuning Me-LLaMA models necessitates substantial computational power. We advise using a GPU with at least 24GB of memory. This specification ensures that the models can be fine-tuned with adequate batch sizes and sequence lengths to optimize performance without compromising speed or accuracy.
  • Batch Size and Sequence Length: For effective fine-tuning, a balance between batch size and sequence length is crucial. With a 24GB GPU, you can comfortably fine-tune the models using techniques such as LoRA, which allows for parameter-efficient training while maintaining the model's integrity and performance.
  • LoRA for Fine-Tuning: Leveraging LoRA (Low-Rank Adaptation) is highly recommended for fine-tuning Me-LLaMA models. LoRA enables the adaptation of pre-trained models with minimal additional parameters, making it an efficient method for customizing Me-LLaMA models to specific tasks or datasets.
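To give a sense of why LoRA is parameter-efficient, the sketch below counts the parameters that LoRA adds to one weight matrix and to a whole model. All concrete numbers are assumptions for illustration: the hidden size (5120) and layer count (40) are taken from the public LLaMA 2 13B configuration and should be checked against config.json, while the rank of 8 and the choice to adapt two projection matrices per layer are hypothetical and not the settings used to train Me-LLaMA.

```python
def lora_added_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


HIDDEN_SIZE = 5120      # assumed; check "hidden_size" in config.json
NUM_LAYERS = 40         # assumed; check "num_hidden_layers" in config.json
RANK = 8                # hypothetical LoRA rank
ADAPTED_PER_LAYER = 2   # hypothetical: e.g. query and value projections

per_matrix = lora_added_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
full_matrix = HIDDEN_SIZE * HIDDEN_SIZE
total_added = per_matrix * ADAPTED_PER_LAYER * NUM_LAYERS

print(f"Added per adapted matrix: {per_matrix:,}")
print(f"Fraction of the full matrix: {per_matrix / full_matrix:.4%}")
print(f"Total trainable LoRA parameters: {total_added:,}")
```

Under these assumptions, each adapter trains well under 1% of the parameters of the matrix it adapts, which is what keeps optimizer state and gradient memory small during fine-tuning.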

We also provide all prompt templates for our evaluation datasets:

Task Name | Description | Input Format
--- | --- | ---
PubMedQA | Answer biomedical questions using the provided abstract with answers limited to yes, no, or maybe. | INPUT: {Text} CONTEXT: {Text} OUTPUT:
MedQA | Simulate taking the US Medical Licensing Examination. Answer a multiple-choice question based on medical knowledge and current practices. | Question: {text} Options: {text} Answer:
MedMCQA | Answer real-world medical entrance exam questions by selecting the correct answer from multiple choices based on clinical and basic medical science. | Question: {text} Options: {text} Answer:
EmrQA | Extract the relevant text segment from a given medical context that directly answers an open-ended question. | Context: {text} Answer:
2012 i2b2 | Identify clinically relevant entities in a sentence from clinical narrative notes and mark them with HTML tags. | Input Text: {text} Output Text:
2013 DDI | Predict relationships between two drug entities within a sentence. Identify types of drug-drug interactions. | INPUT: {text} OUTPUT:
HoC | Decide which of the Hallmarks of Cancer topics an article's abstract relates to. Articles may relate to multiple topics. | INPUT: {text} OUTPUT:
MTSample | Determine the medical specialty or domain of a medical transcription from a list of 40 options. | INPUT: {text} OUTPUT:
PubMedSum | Summarize a biomedical literature piece in six sentences. | INPUT: {text} OUTPUT:
MIMIC-CXR | Derive the impression from findings in a radiology report. | INPUT: {text} OUTPUT:
BioNLI | Classify the relationship between a given premise and hypothesis as entailment or contradiction. | INPUT: {text} OUTPUT:
MedNLI | Classify the relationship between a given premise and hypothesis as entailment, contradiction, or neutral. | INPUT: {text} OUTPUT:

All evaluation datasets are available on our HuggingFace page [6].
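The prompt templates listed above can be filled programmatically. The following is a minimal sketch with a hypothetical helper, build_prompt. The template strings are copied from the input formats above, but the placeholder names, the single-space separators between fields, and the omission of the task description from the prompt are assumptions; consult the evaluation scripts [5] for the exact formatting.

```python
# Templates copied from the "Input Format" entries above. Placeholder names
# and the single-space separators between fields are assumptions.
PROMPT_TEMPLATES = {
    "PubMedQA": "INPUT: {question} CONTEXT: {context} OUTPUT:",
    "MedQA": "Question: {question} Options: {options} Answer:",
    "EmrQA": "Context: {context} Answer:",
    "MedNLI": "INPUT: {text} OUTPUT:",
}


def build_prompt(task, **fields):
    """Fill the template for `task`; raises KeyError on an unknown task or missing field."""
    return PROMPT_TEMPLATES[task].format(**fields)


example = build_prompt(
    "MedQA",
    question="Which vitamin deficiency causes scurvy?",
    options="A. Vitamin A B. Vitamin C C. Vitamin D",
)
print(example)
```

The resulting string can be passed to the pipeline or to tokenizer and model.generate as in the snippets under Installation and Requirements.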

Release Notes

The current and first release is version 1.0.0.


LLMs have been shown to be susceptible to leakage of sensitive information. Users should not attempt to use these models to produce such leakage. These models were trained using MIMIC-III and MIMIC-IV data, and thus exist under the same IRB and are subject to the same regulations. The output of generative AI is synthetic and should not be used to deceive; it may be inaccurate and should not be treated as factual without validation.

The Me-LLaMA models contained within this project are research tools intended for use in computational linguistics and medicine. They are not intended to replace the expertise of healthcare professionals, and should not be used as diagnostic or clinical decision-making tools without appropriate validation and regulatory approval. In no event shall the authors, contributors, or copyright holders be liable for any claim, damages, or other liability, whether in an action of contract, tort, or otherwise, arising from, out of, or in connection with the software or the use or other dealings in the software.

Me-LLaMA is a derivative model developed from the Llama 2 architecture, which was released on July 18, 2023. Me-LLaMA adheres to the terms and conditions set forth in the Llama 2 Community License (included in the project folder).



This work received support from the National Institutes of Health (NIH), NIH National Center for Advancing Translational Sciences (NCATS), Patient-Centered Outcomes Research Institute (PCORI) and NIH National Center for Chronic Disease Prevention and Health Promotion (NCCDPHP) under grant numbers: 1RF1AG072799, 1R01AG078154, R01AG073435, R01LM013519, RF1AG084178, R01AG083039, R01CA284646, R01AI172875, R01AG080991, R01AG080624, 1K99LM01402, NIH/NCATS UL1 TR001427, PCORI RI-FLORIDA-01-PS1, 1U18DP006512-01, PCORI ME-2018C3-14754. We express our sincere appreciation to the creators of datasets such as MIMIC, the Pile, and RedPajama for making these valuable resources available to the research community. We extend our gratitude to the UF Research Computing team, under the leadership of Dr. Erik Deumens, for their generous provision of computational resources through the UF HiPerGator AI cluster.

Conflicts of Interest

The authors have no conflicts of interest to declare.


  1. Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Hou L, Clark K, Pfohl S, Cole-Lewis H, Neal D, Schaekermann M. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617. 2023 May 16.
  2. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S, Avila R. Gpt-4 technical report. arXiv preprint arXiv:2303.08774. 2023 Mar 15.
  3. Peng C, Yang X, Chen A, Smith KE, PourNejatian N, Costa AB, Martin C, Flores MG, Zhang Y, Magoc T, Lipori G. A Study of Generative Large Language Model for Medical Research and Healthcare. arXiv preprint arXiv:2305.13523. 2023 May 22.
  4. Gema A, Daines L, Minervini P, Alex B. Parameter-efficient fine-tuning of LLaMA for the clinical domain. arXiv preprint arXiv:2307.03042. 2023 Jul 6.
  5. Me LLaMA Github repository. 2024 [cited 2024 May 12]. Available from:
  6. Yale BIDS Xu Lab Huggingface page. 2024 [cited 2024 May 12]. Available from:
  7. Xie Q, Chen Q, Chen A, Peng C, Hu Y, Lin F, Peng X, Huang J, Zhang J, Keloth V, He H. Me LLaMA: Foundation Large Language Models for Medical Applications. arXiv preprint arXiv:2402.12749. 2024 Feb 20.
  8. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, Bashlykov N, Batra S, Bhargava P, Bhosale S, Bikel D. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. 2023 Jul 18.
  9. Gao, L. et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027 (2020).
  10. Johnson, A. E. et al. MIMIC-III, a freely accessible critical care database. Scientific data 3, 1–9 (2016).
  11. Johnson, A. et al. MIMIC-IV. PhysioNet. Available online at: https://physionet.org/content/mimiciv/1.0/ (accessed August 23, 2021) (2020).
  12. Johnson, A. E. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data 6, 317 (2019).
  13. Together Computer. RedPajama: an open dataset for training large language models (2023).
  14. Taori, R. et al. Stanford alpaca: An instruction-following LLaMA model. (2023)
  15. Conover, M. et al. Free dolly: Introducing the world’s first truly open instruction-tuned llm (2023)
  16. Zheng, L. et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685 (2023)
  17. Li, Y. et al. Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (LLaMA) using medical domain knowledge. Cureus 15 (2023)
  18. Zhang, X. et al. AlpaCare: Instruction-tuned large language models for medical application. arXiv preprint arXiv:2310.14558 (2023)
  19. Han, T. et al. MedAlpaca–an open-source collection of medical conversational ai models and training data. arXiv preprint arXiv:2304.08247 (2023)
  20. Ben Abacha, A., Shivade, C. & Demner-Fushman, D. Overview of the mediqa 2019 shared task on textual inference, question entailment and question answering. In ACL-BioNLP 2019 (2019)
  21. Ben Abacha, A. et al. Bridging the gap between consumers’ medication questions and trusted answers. In MEDINFO 2019 (2019)
  22. Abacha, A. B., Agichtein, E., Pinter, Y. & Demner-Fushman, D. Overview of the medical question answering task at trec 2017 liveqa. In TREC, 1–12 (2017)
  23. Yu, B., Li, Y. & Wang, J. Detecting causal language use in science findings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4664–4674 (2019)
  24. Wu, C., Zhang, X., Zhang, Y., Wang, Y. & Xie, W. Pmc-llama: Further finetuning LLaMA on medical papers. arXiv preprint arXiv:2304.14454 (2023)
  25. Jin Q, Dhingra B, Liu Z, Cohen WW, Lu X. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146. 2019 Sep 13.
  26. Zhang X, Wu J, He Z, Liu X, Su Y. Medical exam question answering with large-scale reading comprehension. In Proceedings of the AAAI conference on artificial intelligence 2018 Apr 27 (Vol. 32, No. 1).
  27. Pal A, Umapathi LK, Sankarasubbu M. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning 2022 Apr 6 (pp. 248-260). PMLR.
  28. Pampari A, Raghavan P, Liang J, Peng J. emrqa: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732. 2018 Sep 3.
  29. Sun W, Rumshisky A, Uzuner O. Evaluating temporal relations in clinical text: 2012 i2b2 challenge. Journal of the American Medical Informatics Association. 2013 Sep 1;20(5):806-13.
  30. Herrero-Zazo M, Segura-Bedmar I, Martínez P, Declerck T. The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions. Journal of biomedical informatics. 2013 Oct 1;46(5):914-20.
  31. Baker S, Silins I, Guo Y, Ali I, Högberg J, Stenius U, Korhonen A. Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics. 2016 Feb 1;32(3):432-40.
  32. Kaggle - Clinical Text Classification. July 15, 2020 [cited 2024 May 12]. Available from:
  33. Cohan, A. et al. A discourse-aware attention model for abstractive summarization of long documents. arXiv preprint arXiv:1804.05685 (2018).
  34. Bastan, M., Surdeanu, M. & Balasubramanian, N. Bionli: Generating a biomedical nli dataset using lexico-semantic constraints for adversarial examples. arXiv preprint arXiv:2210.14814 (2022).
  35. Romanov, A. & Shivade, C. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752 (2018).
  36. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., & Rush, A. M. (2020). Transformers: State-of-the-Art Natural Language Processing [Conference paper]. 38–45.
  37. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20, Tutorial).
  38. Hu, E. J. et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (2022).

Parent Projects
Me-LLaMA: Foundation Large Language Models for Medical Applications was derived from: Please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
