Database Credentialed Access

MIMIC-Ext-DrugDetection

Fabrice Harel-Canada, Nanyun Peng, David Goodman, Ruby Romero, Allan Nguyen, Brandon Moghanian, Anabel Salimian

Published: Sept. 25, 2025. Version: 1.0.0


When using this resource, please cite:
Harel-Canada, F., Peng, N., Goodman, D., Romero, R., Nguyen, A., Moghanian, B., & Salimian, A. (2025). MIMIC-Ext-DrugDetection (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/0kyx-r485

Additionally, please cite the original publication:

Harel-Canada, F., Salimian, A., Moghanian, B., Clingan, S., Nguyen, A., Avra, T., Poimboeuf, M., Romero, R., Funnell, A., Petousis, P., Shin, M., Peng, N., Shover, C. L., & Goodman-Meza, D. (2025). Enhancing substance use detection in clinical notes with large language models [Preprint]. https://doi.org/10.21203/rs.3.rs-6615981/v1

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.

Abstract

This project shares a large, annotated drug detection dataset created from MIMIC-III/IV discharge summaries. The dataset addresses the challenge of identifying substance use behaviors in Electronic Health Records (EHRs), where critical details are often embedded in unstructured notes that require contextual interpretation. The primary aim is to support systematic substance use surveillance. The data consists of medical notes tokenized into sentences and annotated for eight substance use categories: heroin, cocaine, methamphetamine, illicit use of prescription opioids, illicit use of benzodiazepines, cannabis, injection drug use (IDU), and general drug use. The dataset was used to evaluate the performance of various large language models (LLMs) for detecting these substance use categories, demonstrating that LLMs, particularly a fine-tuned model, can significantly enhance detection accuracy and show promise for clinical decision support and research.


Background

The detailed nuances of substance use are predominantly documented within free-text notes in EHRs, making this critical information largely inaccessible for broader usage trend monitoring and policy-making by researchers, hospital administrators, and public health agencies. Natural Language Processing (NLP) offers a viable solution to extract actionable insights from this unstructured text data [1,2]. Recent advancements with Large Language Models (LLMs) show promise in this area [3,4], but their application to substance use detection in EHRs is under-explored. Prior studies have demonstrated the utility of NLP and machine learning for identifying people who inject drugs in EHRs [5] and for classifying substances involved in overdose deaths [6].

This dataset was created to evaluate contemporary NLP models in identifying substance use from unstructured EHR text, aiming to improve patient care and inform public health strategies by transforming detailed clinical notes into analyzable data, as described in [7]. The motivation for sharing this resource is to support systematic substance use surveillance and catalyze community-driven improvements in clinical NLP.


Methods

A retrospective study was conducted using de-identified, publicly available data from the MIMIC-III [8] (2001-2012) and MIMIC-IV [9] (2008-2019) datasets, which consist of EHRs from patients admitted to Beth Israel Deaconess Medical Center. The University of California, Los Angeles Institutional Review Board (IRB) determined this study to be exempt from IRB oversight as it involved only de-identified, publicly available data. The task was framed as a multi-label text classification problem to capture concurrent substance use. Eight substance classes were targeted: heroin, cocaine, methamphetamine, illicit use of prescription opioids, illicit use of benzodiazepines, cannabis, IDU, and general drug use (Any).

  • Data Collection & Annotation: 1,151 notes containing keywords relevant to the drug classes were identified. Five team members were trained to recognize explicit and nuanced mentions of substance use based on a pre-specified annotator guide (Appendix A in the paper). Annotators highlighted text spans and classified them. Inter-annotator agreement was ensured (kappa > 0.80) before proceeding to single-annotation, followed by a final review of all annotations.
  • Data Processing: Because full medical notes are long, span-level annotations were used, and the task was reframed as multi-label sentence classification. Annotated medical notes were tokenized into sentences, yielding a dataset of 274,602 rows, 3,948 of which contain drug mentions. Class-balanced splits were created, allocating approximately 10% of instances to TRAIN, 10% to VALIDATION, and 80% to TEST.
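The span-to-sentence reframing described above can be sketched as follows. This is an illustrative stand-in, not the project's actual preprocessing code: the regex sentence splitter and the `note_to_rows` helper are simplified assumptions, and real annotations come from the trained annotators rather than a hand-built mapping.

```python
import re

import pandas as pd

# The eight label columns of the released dataset.
DRUG_CLASSES = [
    "heroin", "cocaine", "methamphetamine", "benzodiazepine",
    "rx_opioid_misuse", "cannabis", "injection_drug_use", "general_drug_use",
]

def split_sentences(note: str) -> list[str]:
    # Naive splitter on sentence-final punctuation followed by whitespace;
    # a stand-in for whatever tokenizer the project actually used.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", note) if s.strip()]

def note_to_rows(doc_id: str, note: str,
                 annotations: dict[int, set[str]]) -> pd.DataFrame:
    # annotations maps a sentence index to the drug classes annotated in it.
    rows = []
    for i, sentence in enumerate(split_sentences(note)):
        labels = annotations.get(i, set())
        row = {"doc_id": doc_id, "sentence_id": i, "text": sentence}
        for cls in DRUG_CLASSES:
            row[cls] = int(cls in labels)
        if labels:
            # general_drug_use is positive whenever any specific class is.
            row["general_drug_use"] = 1
        rows.append(row)
    return pd.DataFrame(rows)

note = "Patient reports heroin use. Denies cocaine use. Follow up in clinic."
df = note_to_rows("doc-001", note, {0: {"heroin"}})
print(df[["sentence_id", "heroin", "general_drug_use"]])
```

Each sentence becomes one row with an eight-dimensional binary label vector, matching the multi-label sentence classification framing above.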

Data Description

The data is derived from MIMIC-III and MIMIC-IV discharge summaries. It consists of medical notes that have been tokenized into individual sentences. Each sentence is treated as a data instance.

  • File Structure & Format: The data is stored in CSV (Comma Separated Values) files. Each row in a CSV file represents a sentence extracted from a medical note. The dataset is structured for a multi-label sentence classification task. The output for each sentence is a binary vector indicating the presence or absence of eight substance use categories.
  • File Content: The CSV files contain sentences from medical notes. The columns are:
    • doc_id: An identifier for the original document (note) from which the sentence was extracted.
    • sentence_id: An identifier for the sentence within the document.
    • text: The text of the sentence.
    • heroin: Binary label (0 or 1) indicating the presence of heroin use.
    • cocaine: Binary label (0 or 1) indicating the presence of cocaine use.
    • methamphetamine: Binary label (0 or 1) indicating the presence of methamphetamine use.
    • benzodiazepine: Binary label (0 or 1) indicating the illicit use of benzodiazepines.
    • rx_opioid_misuse: Binary label (0 or 1) indicating the illicit use of prescription opioids.
    • cannabis: Binary label (0 or 1) indicating the presence of cannabis use.
    • injection_drug_use: Binary label (0 or 1) indicating illicit injection drug use.
    • general_drug_use: Binary label (0 or 1) indicating illicit general drug use (if any of the above are present or if non-specific drug use is mentioned).
  • Summary Statistics: The dataset is split into TRAIN, VALIDATION, and TEST sets. The counts for each drug class, instances with any drug mention ("Any"), instances with no drug mentions ("None"), and total instances per split are as follows:
    • TRAIN Split: Total 804 instances.
      • Heroin: 93, Cocaine: 65, Methamphetamine (Meth.): 9, Benzodiazepine (Benzo.): 26, Prescription Opioids Misuse (Rx. Opioids): 13, Cannabis: 13, IDU: 128, Any: 402, None: 402.
    • VALIDATION Split: Total 806 instances.
      • Heroin: 94, Cocaine: 66, Meth.: 9, Benzo.: 26, Rx. Opioids: 14, Cannabis: 14, IDU: 128, Any: 403, None: 403.
    • TEST Split: Total 6443 instances.
      • Heroin: 749, Cocaine: 528, Meth.: 72, Benzo.: 232, Rx. Opioids: 122, Cannabis: 121, IDU: 1041, Any: 3143, None: 3300.
    • TOTAL Dataset: Total 8053 instances.
      • Heroin: 936, Cocaine: 659, Meth.: 90, Benzo.: 284, Rx. Opioids: 149, Cannabis: 148, IDU: 1297, Any: 3948, None: 4105.
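Given the column layout documented above, a split can be loaded with pandas and its eight-dimensional label vectors extracted as in this minimal sketch. The two-row CSV embedded here is synthetic and stands in for one of the distributed split files:

```python
import io

import pandas as pd

LABELS = [
    "heroin", "cocaine", "methamphetamine", "benzodiazepine",
    "rx_opioid_misuse", "cannabis", "injection_drug_use", "general_drug_use",
]

# Synthetic stand-in for a split CSV; real splits follow the same schema.
csv_text = """doc_id,sentence_id,text,heroin,cocaine,methamphetamine,benzodiazepine,rx_opioid_misuse,cannabis,injection_drug_use,general_drug_use
d1,0,"History of heroin and cocaine use.",1,1,0,0,0,0,0,1
d1,1,"No acute distress.",0,0,0,0,0,0,0,0
"""

df = pd.read_csv(io.StringIO(csv_text))

y = df[LABELS].to_numpy()              # one binary target vector per sentence
per_class_counts = df[LABELS].sum()    # positives per drug class
any_count = int((df["general_drug_use"] == 1).sum())  # "Any" rows
print(per_class_counts.to_dict(), any_count)
```

Summing the label columns per split, as above, is how the per-class counts listed under Summary Statistics can be reproduced from the released files.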

Usage Notes

This dataset is intended for developing and benchmarking NLP models, particularly LLMs, that detect substance use in clinical text.

  • (1) How the data has already been used: This dataset, its creation, annotation process, and initial benchmarking are detailed in the preprint, "Enhancing Substance Use Detection in Clinical Notes with Large Language Models" [7]. In this work, the data was used to evaluate a range of NLP models (BERT-style encoders [10] and GPT-style decoders/LLMs [11]) in zero-shot, few-shot, and fine-tuning configurations for detecting the eight specified substance use categories. The study found that a fine-tuned LLM, Llama-DrugDetector-70B, achieved near-perfect F1-scores (≥0.95) for most individual substances and strong scores for complex tasks like prescription opioid misuse (F1=0.815) and polysubstance use (F1=0.917). These findings demonstrate that LLMs significantly enhance detection capabilities.
  • (2) The reuse potential of the dataset: The dataset can be used for:
    • Further research into NLP methods for substance use detection.
    • Developing clinical decision support tools.
    • Public health surveillance to monitor drug use trends.
    • Training and evaluating new and existing language models on a specialized clinical task.
    • Investigating methods to handle class imbalance and nuanced language in clinical texts.
    • The open-sourcing of models (like Llama-DrugDetector, as mentioned in [7]) and benchmarks based on this dataset aims to catalyze community-driven improvements.
  • (3) Known limitations that users should be aware of when using the resource: These are discussed in detail in [7] and include:
    • The data is from a single medical center in Boston (MIMIC-III/IV), spanning 2001-2019, which may limit generalizability as drug use patterns evolve and vary regionally. Mentions of fentanyl, for instance, were infrequent in this dataset.
    • The analysis focused on short medical notes (sentences) extracted from larger profiles, which could lead to loss of critical context during sentence tokenization, potentially affecting interpretation in complex cases. Other relevant medical documentation (e.g., lab results, toxicology reports) may have been copied and pasted into the notes as well.
    • The dataset reflects the terminology and documentation practices of the source institution during the specified period.
    • Class imbalance exists, with some substances like methamphetamine being less frequent than heroin or cocaine.
  • (4) Any complementary code or datasets that might be of interest to the user community:
    • The preprint [7] mentions the fine-tuned models, Llama-DrugDetector-8B and Llama-DrugDetector-70B, which are open-sourced and available on Hugging Face (e.g., [12]).
    • A Python package and GitHub repository, drugdetector [13], provides a wrapper for performant zero-shot drug detection using LLMs, including the fine-tuned models mentioned in the preprint. This can be useful for researchers looking to apply similar detection methods.
    • The original MIMIC-III and MIMIC-IV datasets are complementary resources, though access requires specific credentials and training [8,9].
    • The annotation guide and prompt templates used in the study are available as appendices in [7].
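For users benchmarking models against the F1-scores reported in [7], per-class and macro F1 in this multi-label setting can be computed with scikit-learn. The arrays below are toy stand-ins for gold labels and model predictions, not outputs of any released model:

```python
import numpy as np
from sklearn.metrics import f1_score

LABELS = [
    "heroin", "cocaine", "methamphetamine", "benzodiazepine",
    "rx_opioid_misuse", "cannabis", "injection_drug_use", "general_drug_use",
]

# Toy gold labels and predictions: rows are sentences, columns the 8 classes.
y_true = np.array([[1, 0, 0, 0, 0, 0, 1, 1],
                   [0, 1, 0, 0, 0, 0, 0, 1],
                   [0, 0, 0, 0, 0, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 0, 0, 1, 1],
                   [0, 0, 0, 0, 0, 0, 0, 1],
                   [0, 0, 0, 0, 0, 0, 0, 0]])

# zero_division=0 keeps rare classes with no positives from raising warnings,
# relevant here given the class imbalance noted above.
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
for name, score in zip(LABELS, per_class):
    print(f"{name}: {score:.3f}")
print(f"macro-F1: {macro:.3f}")
```

Per-class F1 (`average=None`) surfaces the rare classes that macro-averaging can hide, which matters for low-frequency labels such as methamphetamine.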

Release Notes

v1.0.0 - Initial release matching [7].


Ethics

This analysis involved only de-identified, publicly available data from the MIMIC dataset. The study protocol was reviewed by the UCLA IRB, which determined the study to be exempt from IRB oversight. The data used (MIMIC-III and MIMIC-IV) are comprehensive, publicly available repositories of de-identified EHRs. Investigators wishing to use the MIMIC-Ext-DrugDetection data (derived from MIMIC) must adhere to MIMIC’s licensing terms, including completing the required CITI Data or Specimens Only Research training. The project aims to enhance substance use detection for improved patient care and public health surveillance, which are key benefits. Risks associated with misuse of such detection technologies are mitigated by the use of de-identified data and by the focus on research and surveillance applications rather than direct, unvalidated patient interventions.


Acknowledgements

CLS was supported by a grant from the National Institutes of Health and the National Institute on Drug Abuse (K01-DA050771). DGM was supported by a grant from the National Institutes of Health and the National Institute on Drug Abuse (K08-DA048163-03). All authors, except NP, were supported by a grant from the National Institutes of Health and the National Institute on Drug Abuse (R01-DA57630). The funders had no role in the design, conduct, or decision to publish the manuscript detailing this dataset [7].


Conflicts of Interest

The author(s) have no conflicts of interest to declare.


References

  1. Velupillai S, Suominen H, Liakata M, Roberts A, Shah AD, Morley K, et al. Using clinical natural language processing for health outcomes research: overview and actionable suggestions for future advances. J Biomed Inform. 2018;88:11-19.
  2. Meystre SM, Savova GK, Kipper-Schuler KC, Hurdle JF. Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inform. 2008;17(01):128-144.
  3. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930-1940.
  4. Liang EN, Pei S, Staibano P, van der Woerd B. Clinical applications of large language models in medicine and surgery: a scoping review. J Int Med Res. 2025;53(7):3000605251347556. doi:10.1177/03000605251347556.
  5. Goodman-Meza D, Tang A, Aryanfar B, Vazquez S, Gordon AJ, Goto M, et al. Natural language processing and machine learning to identify people who inject drugs in electronic health records. Open Forum Infect Dis. 2022;9(9):ofac471.
  6. Goodman-Meza D, Shover CL, Medina JA, Tang AB, Shoptaw S, Bui AA. Development and validation of machine models using natural language processing to classify substances involved in overdose deaths. JAMA Netw Open. 2022;5(8):e2225593.
  7. Harel-Canada F, Salimian A, Moghanian B, Clingan S, Nguyen A, Avra T, et al. Enhancing substance use detection in clinical notes with large language models. Res Sq. 2025;rs.3.rs-6615981. doi:10.21203/rs.3.rs-6615981/v1.
  8. Johnson AE, Pollard TJ, Shen L, Lehman LWH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3(1):160035.
  9. Johnson AE, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data. 2023;10(1):1.
  10. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long and Short Papers). 2019. p. 4171-4186.
  11. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877-1901.
  12. Fabriceyhc/Llama-DrugDetector-70B. Available from: https://huggingface.co/fabriceyhc/Llama-DrugDetector-70B [Accessed 27 August 2025]
  13. Drugdetector. Available from: https://pypi.org/project/drugdetector/ [Accessed 27 August 2025]

Parent Projects
MIMIC-Ext-DrugDetection was derived from MIMIC-III and MIMIC-IV [8,9]; please cite them when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
