Database Credentialed Access

MIMIC-IV-Ext-22MCTS: A 22 Millions-Event Temporal Clinical Time-Series Dataset with Relative Timestamp

Jing Wang Xing Niu Tong Zhang Jie Shen Juyong Kim Jeremy Weiss

Published: Sept. 29, 2025. Version: 1.0.0


When using this resource, please cite:
Wang, J., Niu, X., Zhang, T., Shen, J., Kim, J., & Weiss, J. (2025). MIMIC-IV-Ext-22MCTS: A 22 Millions-Event Temporal Clinical Time-Series Dataset with Relative Timestamp (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/dkj6-r828

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.

Abstract

Clinical risk prediction based on machine learning algorithms plays a vital role in modern healthcare. A crucial component in developing a reliable prediction model is a high-quality dataset of time-series clinical events. In this work, we release such a dataset, consisting of 22,588,586 clinical time-series events, which we term MIMIC-IV-Ext-22MCTS. Our source data are discharge summaries selected from the well-known yet unstructured MIMIC-IV-Note dataset. We extract clinical events as short text spans from the discharge summaries, together with the timestamps of these events as temporal information, using contextual retrieval and Llama-3.1-8B.


Background

Forecasting clinical risk using Electronic Health Record (EHR) data is a cornerstone of modern precision medicine. It supports early identification of deteriorating patients, enhances hospital resource allocation, improves healthcare quality, and helps inform and refine clinical practice guidelines [1,2]. Clinical risk prediction models rely on an accurate representation of the temporal sequence of clinical events, such as the time order of diagnoses, treatments, medications, vital signs, and lab results, often referred to as a clinical trajectory.

The most complete clinical trajectories are stored in free-text clinical notes, such as the discharge summaries of MIMIC-IV. Time-stamped clinical events are crucial for modeling disease progression, identifying risk windows, and understanding causality in clinical decision-making [3,4]. However, extracting structured clinical events with timestamps from such narratives remains a challenging task, primarily because of the significant domain expertise and considerable manual effort required. Specifically, it demands not only medical knowledge to identify meaningful events but also contextual reasoning to infer event timing, which is often implicit or vague in narrative documentation [5]. This annotation bottleneck has historically limited the availability of large-scale, high-quality temporal datasets in healthcare.

Recent advances in Large Language Models (LLMs) such as GPT-4 [6] have demonstrated strong performance across diverse tasks [7]. In the medical domain, LLMs have been applied for tasks such as discharge summary generation, medication extraction, and cohort identification. However, most current applications are limited by simplistic, ad-hoc prompting strategies that fail to capture medical nuance and lack robust timestamp reasoning [8].

In particular, temporal extraction and reasoning (determining when a clinical event occurred) remains an under-explored and error-prone task for LLMs. Without medically informed prompts and time-aware supervision, naïve applications of LLMs often produce hallucinated or misaligned timestamps, which compromise the downstream utility of the generated data [9].

Our contributions. To address this gap, we propose a structured framework for temporal annotation of clinical events using retrieval and an LLM, with medically grounded prompting and consistency checks. We apply this framework to the publicly available MIMIC-IV-Note dataset and extract a comprehensive set of clinical events and their associated relative timestamps from each discharge summary. Anchoring each timeline at the admission event (t = 0), we generate a large-scale temporal time-series dataset: MIMIC-IV-Ext-22MCTS.

This dataset enables researchers to:

  1. Reconstruct and analyze patient trajectories over time.
  2. Study disease progression patterns and treatment timelines.
  3. Build machine learning models for risk prediction, next event prediction, healthcare supply chain optimization, and temporal reasoning in clinical settings.

Each clinical event is paired with a time (in hours) relative to the admission timestamp, and also mapped to a discrete time bin to support classification-style modeling tasks. These temporal structures capture not only what happened, but also when and in what order, which are key elements for understanding clinical causality, treatment outcomes, and decision points [10,11].

By releasing MIMIC-IV-Ext-22MCTS, we aim to:

  1. Lower the barrier to entry for temporal modeling in clinical machine learning.
  2. Facilitate benchmarking and reproducibility for event prediction tasks.
  3. Accelerate the development of LLMs in medicine through high-quality, time-structured datasets.

The goal of releasing this resource is to catalyze further research in temporal modeling, help develop clinically reliable AI systems, and improve patient care outcomes.


Methods

To address these challenges, we propose an end-to-end framework that produces reliable annotations of time-series clinical events and their timestamps for each discharge summary, based on retrieval and Llama-3.1-8B [12]. The discharge summaries come from the MIMIC-IV-Note dataset [13]. The concise description of each patient's hospital course from the MIMIC-IV-Ext-BHC dataset [14] serves as the query. The framework works as follows:
  • Each discharge summary is split into a list of chunks of 5 tokens; the preceding 5 tokens and the following 5 tokens of each chunk are treated as its context.
  • Contextual BM25, with the brief hospital course as the query, retrieves the top 100 chunks most likely to contain clinical events.
  • We use the BGE-Large-en model [15] to embed the query (the brief hospital course) and the contextual chunks, compute a similarity score between the query and each chunk, and retrieve the chunks whose score exceeds 0.75. The threshold of 0.75 is based on empirical evaluation.
  • The chunks retrieved by contextual BM25 and by semantic search are merged, with duplicates removed.
  • We design a prompt that instructs the large language model, Llama-3.1-8B, to identify chunks containing clinical events and to estimate the events' relative timestamps.
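The chunking step in the first bullet can be sketched as follows. This is an illustrative snippet under one reading of the description (whitespace tokenization, non-overlapping 5-token windows), not the authors' released code; the function name is our own:

```python
def chunk_with_context(text, chunk_size=5, context_size=5):
    """Split a note into fixed-size token chunks, each paired with a
    contextual window of the preceding and following tokens."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    chunks = []
    for start in range(0, len(tokens), chunk_size):
        core = tokens[start:start + chunk_size]
        before = tokens[max(0, start - context_size):start]
        after = tokens[start + chunk_size:start + chunk_size + context_size]
        chunks.append({
            "chunk": " ".join(core),                      # scored / annotated unit
            "context": " ".join(before + core + after),   # fed to retrieval
        })
    return chunks
```

The `context` field, rather than the bare chunk, would then be what contextual BM25 and the BGE embeddings score against the brief-hospital-course query.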

Effectiveness of the prompt

To evaluate the effectiveness of our prompt, we applied it to GPT-4 on 10 clinical case reports and compared the generated results with human annotations [16]. The results show strong consistency between the LLM guided by our prompt and the human annotations. Specifically, Table 1 [16] shows that GPT-4, when guided by our prompt, generates between 20 and 297 event-timestamp pairs per document, with an average of 46 pairs. On average, GPT-4 annotations include six distinct timestamp values, matching the count produced by the human expert. As illustrated in Figure 4 [16], GPT-4 with our prompt achieves a 91.2% event concordance rate. Moreover, 50–75% of the matched events have relative timestamps identical to those manually annotated, and 70–85% of the events have timestamp errors of less than 24 hours.

Table 1. Descriptive statistics of the manual and LLM annotations.

  Statistic        Manual        GPT-4
  Events           32 [14, 70]   46 [20, 297]
  Distinct Times   6 [2, 13]     6 [1, 16]

Data Description

The dataset consists of 22,588,586 clinical events with associated timestamps, extracted from 267,284 discharge summaries in the MIMIC-IV-Note dataset. Each summary has at least 1 clinical event-timestamp pair and at most 244; the average is 84 annotations per summary. Clinical events average 3 tokens in length, with a maximum of 299 tokens. The timestamp, expressed in hours, is negative for historical events that occurred before hospital admission and positive for events after admission. Historical events account for 36.99% of the data, events during admission for 51.19%, and future (post-discharge) events for 11.80%.

There are four columns in the table: Hadm_id, Event, Time, and Time_bin. The column Hadm_id is a unique identifier for each discharge summary, so that each summary (MIMIC-IV-Note) and its associated query (MIMIC-IV-Ext-BHC) can be easily cross-referenced. The column Event contains the clinical event as text, and the column Time gives the timestamp at which the event occurred. Time_bin maps the continuous temporal annotation to a discrete category: each timestamp is assigned the integer index of the predefined interval (bin) it falls into. The intervals are:

  • Bin 0: (-∞, -60)
  • Bin 1: [-60, -30)
  • Bin 2: [-30, -15)
  • Bin 3: [-15, 0)
  • Bin 4: [0, 15)
  • Bin 5: [15, 30)
  • Bin 6: [30, 60)
  • Bin 7: [60, 120)
  • Bin 8: [120, ∞)

For example, a timestamp of -120 falls into Bin 0 (since it is less than -60), so its Time_bin value is 0. Similarly, a timestamp of 45 falls into Bin 6 (since 30 ≤ 45 < 60), so its Time_bin value is 6.
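The binning rule above can be reproduced with a short helper. This is a minimal sketch of our own (the function name and the use of `bisect` are not part of the released dataset); it maps a relative timestamp in hours to the Time_bin indices defined above:

```python
import bisect

# Left edges of Bins 1-8; any timestamp below -60 falls into Bin 0.
BIN_EDGES = [-60, -30, -15, 0, 15, 30, 60, 120]

def time_to_bin(hours):
    """Map a relative timestamp (hours from admission) to its Time_bin index (0-8)."""
    # bisect_right counts the edges <= hours, which is exactly the bin index
    # for half-open intervals of the form [edge_i, edge_{i+1}).
    return bisect.bisect_right(BIN_EDGES, hours)
```

For instance, `time_to_bin(-120)` returns 0 and `time_to_bin(45)` returns 6, matching the worked examples above.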

Brief summary

The released CSV file provides structured annotations of clinical events and their associated timestamps extracted from discharge summaries in MIMIC-IV-Note. Each row represents one clinical event-timestamp pair.
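As an illustration of working with this row format, the following sketch groups rows into per-admission trajectories using only the standard library. The file path is a placeholder for the released CSV; the column names follow the description above:

```python
import csv
from collections import defaultdict

def load_trajectories(path):
    """Group (event, time_in_hours, time_bin) tuples by Hadm_id and
    sort each admission's events by relative time."""
    trajectories = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            trajectories[row["Hadm_id"]].append(
                (row["Event"], float(row["Time"]), int(row["Time_bin"]))
            )
    for events in trajectories.values():
        events.sort(key=lambda e: e[1])  # chronological order within admission
    return trajectories

# Usage (placeholder file name):
# trajectories = load_trajectories("mimic_iv_ext_22mcts.csv")
```

Sorting by the Time column reconstructs each patient's timeline, with negative times (pre-admission history) appearing before the admission anchor at t = 0.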

Statistics of the dataset

Total discharge summaries: 267,284
Total records (event-timestamp pairs): 22,588,586
Min / Max events per summary: 1 / 244
Average events per summary: 84
Average tokens per event: 3
Max tokens per event: 299
Temporal distribution of events: 36.99% before admission (historical); 51.19% during admission; 11.80% after discharge

Usage Notes

Our dataset can be used for fine-tuning models for downstream healthcare tasks, such as clinical trial matching [17]. For example, we fine-tuned BERT [18] to classify the causal relationship between two clinical events (reason, consequence, or no correlation). The fine-tuned BERT achieved significant improvements in question answering and clinical trial matching: a 10% improvement on PubMedQA and a 3% improvement on TREC 2021 and TREC 2022 [19]. The related code is publicly available [20].

Our time-series dataset can also be used to fine-tune Transformer-based decoders, such as GPT-2, for question answering. Experimental results demonstrate that the fine-tuned GPT-2 produces more clinically oriented outputs, closer to a chemotherapy context.

Limitations

While MIMIC-IV-Ext-22MCTS provides a large-scale and structured resource for temporal clinical event modeling, users should be aware of the following limitations:

  1. Inconsistencies may exist between the generated annotations and the original discharge summaries.
  2. The LLM may hallucinate events that are not actually present in the original note or introduce content from unrelated summaries due to model generalization errors.
  3. Timestamp estimation may be imprecise.
  4. No Ground Truth Labels. The dataset does not include gold-standard or manually validated event-time labels for all records. Therefore, it is more suitable for pretraining, weak supervision, or semi-supervised learning tasks rather than strict evaluation benchmarks.
  5. Event granularity and overlap. Multiple similar events (e.g., "diagnosed with pneumonia" and "pneumonia confirmed") may be annotated separately depending on phrasing.

Recommendation: Users should apply caution when using this dataset for high-stakes clinical modeling or evaluation tasks, and consider validating subsets of the data with human experts when possible.


Release Notes

This is the initial release of MIMIC-IV-Ext-22MCTS. This first version aims to provide a valuable resource for researchers and practitioners in natural language processing and clinical documentation. Future updates may include additional records, further preprocessing improvements, and expanded metadata.


Ethics

This study utilized data from the publicly available Medical Information Mart for Intensive Care (MIMIC) database. The use of MIMIC data for research purposes is governed by the Health Insurance Portability and Accountability Act (HIPAA) in the United States, and researchers are required to adhere to strict ethical guidelines when accessing and using this data.

All data generation and large language model (LLM) inference were performed entirely within the NIH Secure High Performance Computing (HPC) environment. Specifically:

  1. The LLaMA-3.1-8B model was downloaded and executed locally in a firewalled, access-controlled, non-networked HPC cluster.
  2. No patient data or model outputs were transmitted to or processed on any third-party cloud or external server.
  3. All scripts, data, and checkpoints remain stored on NIH-approved secure storage with restricted user access.

This setup ensures full compliance with institutional data security policies and best practices for handling sensitive health data. We confirm that the project builds upon previous de-identified datasets (e.g., MIMIC-IV-Note). The project complies with all necessary data use agreements and approvals.


Acknowledgements

This research was supported by the Division of Intramural Research of the National Library of Medicine (NLM), National Institutes of Health. This work utilized the computational resources of the NIH HPC Biowulf cluster.


Conflicts of Interest

None to declare.


References

  1. Obermeyer Z, Emanuel EJ. Predicting the future: big data, machine learning, and clinical medicine. N Engl J Med. 2016;375(13):1216–9.
  2. Miotto R, Li L, Kidd BA, Dudley JT. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci Rep. 2016;6:26094.
  3. Zhou SM, Dube K, Isaac J, Rahman A, Smith M, Williams J. Causal discovery from observational data in healthcare: a review. J Biomed Inform. 2007;40(6):749–60.
  4. Ghassemi M, Naumann T, Schulam P, Beam AL, Chen IY, Ranganath R. A review of challenges and opportunities in machine learning for health. JAMA. 2020;323(14):1399–404.
  5. Sun W, Rumshisky A, Uzuner Ö. Temporal reasoning over clinical text: the state of the art. J Am Med Inform Assoc. 2013;20(5):814–9.
  6. OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774. 2023.
  7. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620(7973):172–80.
  8. Xie S, Wang Y, Chen Z, Li X, Chen K, Ren X, et al. Prompting pitfalls: why ad-hoc LLM prompting fails in clinical settings. arXiv preprint arXiv:2401.01234. 2024.
  9. Murugesan GK, McCrumb D, Aboian M, Verma T, Soni R, Memon F, et al. AI-generated annotations dataset for diverse cancer radiology collections in NCI image data commons. Sci Data. 2024;11(1):1165.
  10. Suresh H, Guttag JV. A framework for understanding unintended consequences of machine learning. Commun ACM. 2020;63(11):62–71.
  11. Harutyunyan H, Khachatrian H, Kale DC, Ver Steeg G, Galstyan A. Multitask learning and benchmarking with clinical time series data. Sci Data. 2019;6(1):96.
  12. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. 2023.
  13. Johnson A, Pollard T, Horng S, Celi LA, Mark RG. MIMIC-IV-Note: Deidentified free-text clinical notes [dataset]. PhysioNet. 2023.
  14. Aali A, Van Veen D, Arefeen Y, Hom J, Bluethgen C, Reis EP, et al. MIMIC-IV-Ext-BHC: Labeled clinical notes dataset for hospital course summarization [dataset]. PhysioNet. 2025.
  15. Xiao S, Liu Z, Zhang P, Muennighoff N. C-Pack: Packaged resources to advance general Chinese embedding. arXiv preprint arXiv:2309.07597. 2023.
  16. Wang J, Weiss JC. A large-language model framework for relative timeline extraction from PubMed case reports. AMIA Annu Symp Proc. 2025.
  17. Jin Q, Wang Z, Floudas CS, Chen F, Gong C, Bracken-Clarke D, et al. Matching patients to clinical trials with large language models. Nat Commun. 2025;15(1):9074.
  18. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proc 2019 Conf North Am Chapter Assoc Comput Linguist Hum Lang Technol (NAACL); 2019. p. 4171–86.
  19. Wang J, Niu X, Kim J, Shen J, Zhang T, Weiss JC. 22MCTS: A 22 millions-event temporal clinical time-series dataset with relative timestamp for risk prediction. arXiv preprint arXiv:2505.00827. 2025.
  20. Wang J. MIMIC-IV-Ext-22MCTS: Temporal Clinical Time-Series Dataset [code]. Available from: https://github.com/JingWang-RU/MIMIC-IV-Ext-22MCTS-Temporal-Clinical-Time-Series-Dataset [Accessed 12 Sep 2025].

Parent Projects
MIMIC-IV-Ext-22MCTS: A 22 Millions-Event Temporal Clinical Time-Series Dataset with Relative Timestamp was derived from parent projects; please cite them when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research

