Database Credentialed Access

Insulin4RL: Real-Time Insulin Infusions For Offline Reinforcement Learning

Thomas Frost, Steve Harris

Published: June 15, 2026. Version: 1.0.0


When using this resource, please cite:
Frost, T., & Harris, S. (2026). Insulin4RL: Real-Time Insulin Infusions For Offline Reinforcement Learning (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/swen-q904

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.

Abstract

Offline reinforcement learning (RL) holds promise for improving healthcare decision-making. However, most existing approaches rely on discretising patient trajectories into strictly regular time intervals. This is done to improve compatibility with neural networks and standard Markov Decision Process (MDP) frameworks, but in doing so, it introduces artefacts of missingness and risks out-of-distribution errors at deployment by training on fictitious representations of the data. Additionally, recent research suggests that retrospective evaluation using binned data can lead to significantly biased estimates of model performance.

We present Insulin4RL, the first freely available dataset designed to encourage RL research on healthcare data with naturally irregular time intervals – both for inputs and labelled decisions. Each label corresponds to a decision regarding the insulin infusion rate, inferred based on proximity to recent blood glucose measurements. This yields over 375,000 labelled decisions from 12,214 patients in the Medical Information Mart for Intensive Care (MIMIC-IV). Clinical rules are applied to robustly identify clinician decisions and subsequent patient states, ensuring immediate compatibility with standard RL methods. Because these decisions occur at irregular intervals, the resulting trajectories can be treated as a semi-Markov Decision Process (SMDP).

The associated state information consists of a sequence of retrospective medical events – including laboratory results, nutritional data, and medication changes – represented as tuples (f, v, t) capturing feature, value, and time data. This event-based representation allows direct processing of irregularly sampled data using embedding networks, eliminating the need for discretisation, imputation, or artificially regularised time steps. We hope this approach will encourage research in healthcare RL using data that aligns more closely with real-world practice, using inpatient glycaemic control as a benchmarked example.


Background

Both high blood glucose (hyperglycaemia) and low blood glucose (hypoglycaemia) are independently associated with increasing morbidity and mortality in critically ill patients [1, 2]. The current standard of care involves titrating intravenous insulin infusions to keep blood glucose levels within a pre-specified 'normal' range. Yet the optimisation of blood glucose control in the Intensive Care Unit (ICU) remains a contested area, with a lack of clarity from randomised clinical trials around exact targets for different patient subgroups [2, 3].

Recent advancements in offline reinforcement learning (ORL) now offer a promising pathway for developing data-driven decision support tools by learning optimal policies from retrospective Electronic Health Record (EHR) datasets, such as MIMIC-IV [4, 5]. However, reinforcement learning algorithms are conventionally based around regular intervals between decisions. The precise application of these algorithms to the naturally sporadic decisions of a clinical setting is therefore not well-defined. To address this, most healthcare ORL research relies on temporal resampling (or 'binning') of clinical time-series data into fixed, uniform time intervals (e.g., 1-hour or 4-hour windows) [6, 7].

Recent evidence from in silico trials [6] suggests that binning of healthcare data for offline reinforcement learning introduces several critical issues:

  • Counterfactual Artefacts: Aggregating data over broad windows can mask rapidly changing physiology and even induce causal inversion, where the recorded sequence of clinical events contradicts the ground-truth physical interactions.

  • Performance Degradation: Models trained on resampled data can suffer a performance collapse—up to 60% in some simulated environments—when deployed back into naturally irregular clinical settings.

  • Evaluation Bias: Standard retrospective evaluation methods (Off-Policy Evaluation) conducted on binned data may become biased and significantly overestimate model returns, potentially masking risk to patients.

It is clear that more research is required into ORL methods applied to naturally irregular healthcare data. However, the process of preparing such data is non-trivial and no openly available healthcare datasets currently exist for the purposes of ORL with naturally irregular data.

Insulin4RL is intended to fill this gap. Unlike traditional datasets that rely on fixed-interval binning, Insulin4RL provides both input data and clinical decisions that are naturally irregular, for the purposes of optimising insulin infusions in the ICU. By allowing ML models to account for the variable durations between independent clinical decisions, we hope to mitigate some of the risks described above. The data is provided both as a .parquet dataframe, and as .safetensors in the conventional (state, action, reward, done) format familiar to reinforcement learning researchers.

By providing a high-fidelity benchmark dataset for inpatient glycaemic control, we hope to encourage research into the following areas:

  1. ORL using irregular EHR data: Designing methods for training clinical ORL models that expect and handle irregular intervals between clinical decisions and input features.

  2. Accurate off-policy evaluation: Benchmarking evaluation methods using raw, un-binned data to improve the accuracy of predicting real-world performance for these models.

  3. Improved insulin management for ICU patients: Contributing to the development of personalised insulin titration policies optimised for long-term patient-centred outcomes like mortality and length-of-stay.


Methods

Data Source and Cohort Selection

The dataset is derived from the MIMIC-IV (v3.1) clinical database [5], which contains de-identified data for patients admitted to the Beth Israel Deaconess Medical Center between 2008 and 2022. Cohort inclusion criteria are as follows:

  1. Patients must have received at least one insulin infusion during their ICU stay.
  2. Episodes must contain at least one blood glucose measurement associated with an insulin titration decision.

Each episode begins 24 hours before the insulin infusion begins, and terminates 24 hours after the infusion is stopped.

Input features are currently limited to laboratory results, intravenous drug infusions/boluses, steroids, nutrition, and demographic information. Demographic information (age, sex, race/ethnicity, admission type, insurance status, marital status, comorbidities) is also extracted into metadata/demographics.parquet. Comorbidities are identified using ICD-9 and ICD-10 codes according to Quan et al. [8].

Event-Based State Representation

In contrast to traditional hourly binning, Insulin4RL uses an event-driven representation based on the Medical Event Data Standard (MEDS). The state at any given decision epoch is represented as a sequence of medical events. Each event is stored as a tuple (f, v, t), where:

  • f represents the integer feature code (e.g., Sodium level, Heart Rate).

  • v represents the numerical value associated with that feature.

  • t represents the time of the event relative to the current decision (e.g., 5 minutes ago, 30 minutes ago).

These sequences are processed using a context window (defaulting to the most recent 400 events over the past 7 days) to capture the longitudinal history of the patient. This allows the model to capture trends and temporal dependencies without any imputation or aggregation artefacts. For drugs and nutrition, each event represents a change in the infusion rate or a one-off bolus administration.
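As a minimal sketch of the context window described above, the snippet below filters a list of (f, v, t) events to the past 7 days and keeps the 400 most recent. The function name build_context, the newest-first ordering, and the feature codes in the usage note are illustrative assumptions, not part of the released code.

```python
from typing import List, Tuple

# (feature code f, value v, minutes before the current decision t)
Event = Tuple[int, float, float]

MAX_EVENTS = 400                 # default context length stated in the text
MAX_AGE_MINUTES = 7 * 24 * 60    # 7-day look-back, in minutes


def build_context(events: List[Event],
                  max_events: int = MAX_EVENTS,
                  max_age_minutes: float = MAX_AGE_MINUTES) -> List[Event]:
    """Return the most recent events inside the look-back window, newest first.

    Hypothetical helper: the ordering convention is an assumption.
    """
    # Keep events that happened at or before the decision and within the window.
    in_window = [e for e in events if 0 <= e[2] <= max_age_minutes]
    # A smaller t means a more recent event, so sort ascending by t and truncate.
    in_window.sort(key=lambda e: e[2])
    return in_window[:max_events]
```

For example, with hypothetical feature codes 1 and 2, build_context([(1, 140.0, 5.0), (2, 7.2, 30.0), (1, 138.0, 20000.0)]) drops the third event because it is older than 7 days (10,080 minutes).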

Action Space and Decision Labelling

Decisions are labelled based on when a model is likely to be queried for dosing advice. In a clinical setting, insulin titration is typically reactive to blood glucose measurements. We therefore define a decision point at each blood glucose measurement within the episode; the intervals between these measurements vary stochastically.

Specifically, we aggregate glucose measurements into 5-minute windows and take the latest value within that window. We then examine the period of time after the blood glucose check (30 minutes, or the next blood glucose check, whichever is sooner) and identify the following mutually exclusive actions:

  • Maintain: Binary indicator; insulin rate staying the same (<0.25 units/hr change between adjacent rates).

  • Stop: Binary indicator; insulin rate changing from any rate >0.1 units/hour to <0.1 units/hour.
  • Change: Binary indicator; insulin rate changed by >0.25 units/hour and not stopped.

    • Delta Change: The numerical difference (in units/hr) between the previous and current infusion rates.

Very rarely, insulin may be changed (but not stopped) by ≥ 5.5 units/hour in a single adjustment – these large changes are treated as outlier events and excluded from the set of valid labelled decisions.

If multiple insulin changes occur within the eligible period, the final rate is used to derive the labels.
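The labelling rules above can be sketched as a small function. This is an illustrative reading of the stated thresholds, not the authors' implementation: the name label_action is hypothetical, and two points the text leaves open are resolved by assumption (a stop takes precedence over maintain, and a change of exactly 0.25 units/hour is treated as a change).

```python
from typing import Dict, Optional

MAINTAIN_TOL = 0.25    # units/hour: smaller changes count as "maintain"
STOP_RATE = 0.1        # units/hour: threshold used to detect a stop
OUTLIER_DELTA = 5.5    # units/hour: larger non-stop changes are discarded


def label_action(old_rate: float, new_rate: float) -> Optional[Dict[str, float]]:
    """Map a pair of infusion rates to mutually exclusive action flags.

    Returns None for outlier changes that would be removed from the dataset.
    """
    delta = new_rate - old_rate
    flags = {"maintain": 0, "stop": 0, "change": 0, "delta_change": delta}
    # Assumption: check "stop" first so it wins over "maintain" at low rates.
    if old_rate > STOP_RATE and new_rate < STOP_RATE:
        flags["stop"] = 1
        return flags
    if abs(delta) < MAINTAIN_TOL:
        flags["maintain"] = 1
        return flags
    # Large changes that are not stops are treated as outliers.
    if abs(delta) >= OUTLIER_DELTA:
        return None
    # Assumption: a change of exactly 0.25 units/hour lands here.
    flags["change"] = 1
    return flags
```

Under these assumptions, label_action(3.0, 0.0) is labelled as a stop, label_action(2.0, 4.0) as a change with a delta of 2.0 units/hour, and label_action(1.0, 7.0) is discarded as an outlier.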

Outcome and Reward Labelling

We provide multiple potential signals for rewards:

  1. Glycaemic Targets: The current blood glucose and next blood glucose levels.

  2. Mortality Indicators: The survival flags for 1, 3, 7, 14, and 28 days. These are given relative to the decision, as well as relative to the final state of the trajectory (suffixed with '-final').
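The dataset does not prescribe a reward function, so the following is only one hypothetical way to combine the two signals. It assumes blood glucose is expressed in mmol/L, reuses the <4 and ≥10 mmol/L thresholds from the cohort statistics in this document, and uses arbitrary penalty weights chosen purely for illustration.

```python
from typing import Optional

HYPO_MMOL = 4.0     # hypoglycaemia threshold (<4 mmol/L)
HYPER_MMOL = 10.0   # hyperglycaemia threshold (>=10 mmol/L)


def sketch_reward(next_bm: Optional[float],
                  is_done: bool,
                  alive_28d_final: int) -> float:
    """Hypothetical reward: glycaemic penalty plus a terminal survival term.

    The weights (-1.0, -0.5, +/-1.0) are illustrative assumptions only.
    """
    reward = 0.0
    if next_bm is not None:
        if next_bm < HYPO_MMOL:
            reward -= 1.0      # penalise hypoglycaemia most heavily
        elif next_bm >= HYPER_MMOL:
            reward -= 0.5      # milder penalty for hyperglycaemia
    if is_done:
        # Terminal term taken from the 28-day survival flag of the final state.
        reward += 1.0 if alive_28d_final == 1 else -1.0
    return reward
```

Any real study would need to justify its own weights and decide whether intermediate glycaemic terms, terminal mortality terms, or both are appropriate.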

Other Data Cleaning

  • Antibiotic administration is converted to binary events (presence/absence of a dose).
  • Physiological values outside the 0.1st–99.9th percentile range are removed.
  • Drug infusions or boluses above the 99.5th percentile are clipped to the 99.5th percentile.
  • Drug infusions are consolidated using a sweep line algorithm, to give the net infusion rate for each drug as experienced by the patient at any point in time.
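To illustrate the sweep line idea mentioned in the last bullet, the sketch below merges overlapping infusion intervals for a single drug into a timeline of net-rate change points. It is a generic sweep line under assumed inputs (tuples of start time, end time, and rate); the function name and data layout are hypothetical and may differ from the released pipeline.

```python
from typing import List, Tuple

Interval = Tuple[float, float, float]  # (start, end, rate) for one drug


def net_infusion_rate(intervals: List[Interval]) -> List[Tuple[float, float]]:
    """Return (time, net_rate) change points for overlapping infusions."""
    deltas = []
    for start, end, rate in intervals:
        if end <= start:
            continue                   # skip empty or inverted intervals
        deltas.append((start, rate))   # rate switches on at the start
        deltas.append((end, -rate))    # and off again at the end
    deltas.sort(key=lambda d: d[0])

    timeline: List[Tuple[float, float]] = []
    net = 0.0
    i = 0
    while i < len(deltas):
        t = deltas[i][0]
        # Apply every rate change that happens at the same instant.
        while i < len(deltas) and deltas[i][0] == t:
            net += deltas[i][1]
            i += 1
        net = round(net, 9)            # suppress floating-point residue
        # Record a change point only when the net rate actually changes.
        if not timeline or timeline[-1][1] != net:
            timeline.append((t, net))
    return timeline
```

Each returned pair gives the net rate that applies from that time until the next change point, which matches the event semantics described earlier (one event per change in infusion rate).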

Data Description

The Insulin4RL dataset is compressed into a single data.tar.gz file of approximately 970MB. When fully decompressed, the data/insulin4rl folder should be approximately 4.2GB.

The dataset is provided in two primary formats: an all_data.parquet DataFrame for analysis, and *.safetensors files for reinforcement learning. The accompanying tutorial_notebook.ipynb notebook demonstrates both formats. Please note that the states are pre-standardised in the .safetensors files (using log-transformation), but are not standardised in the all_data.parquet file.

All data is automatically divided into training, validation, and test sets (according to patient ID). There is approximately an 80/10/10 split, which is stratified to ensure roughly equal proportions of gender and mortality in each data segment.
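A patient-level stratified split of this kind can be sketched as follows. This is not the authors' splitting code: the function name, the seed handling, and the use of a (sex, mortality) tuple as the stratum key are assumptions made for illustration. The point it demonstrates is that assignment happens per patient ID, so all episodes from one patient fall into the same segment.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, Tuple


def split_patients(strata: Dict[int, Hashable],
                   seed: int = 0,
                   fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1)
                   ) -> Dict[int, str]:
    """Assign each patient ID to train/val/test within its stratum."""
    by_stratum = defaultdict(list)
    for pid in sorted(strata):            # sorted for reproducibility
        by_stratum[strata[pid]].append(pid)

    rng = random.Random(seed)
    assignment: Dict[int, str] = {}
    for key in sorted(by_stratum, key=repr):
        ids = by_stratum[key]
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        for i, pid in enumerate(ids):
            if i < n_train:
                assignment[pid] = "train"
            elif i < n_train + n_val:
                assignment[pid] = "val"
            else:
                assignment[pid] = "test"  # remainder of the stratum
    return assignment
```

Because each stratum is split separately, the proportions of each stratum are roughly preserved across the three segments.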

We also provide five "metadata" files: demographics.parquet (which contains demographic information for each patient); integer encodings for each input feature in feature_mapping.yaml; statistics for normalisation/standardisation of data, contained in feature_stats.yaml; a list of features for each labelled decision in label_features.yaml; and the thresholds used for removing/clipping outliers, in outlier_thresholds.yaml.

Available Features

The following are columns in the all_data.parquet file.

  • Unique identifiers
    • data_segment - whether the data is train/val/test.
    • subject_id - the unique MIMIC-IV subject_id for this patient.
    • label_id - the unique ID for this labelled decision.
    • label_id_next - the unique ID for the next labelled decision (if it exists).
    • episode_num - the unique ID for this episode.
    • step_num - the step number for this labelled transition, starting from 1.
    • labeltime - the MIMIC-IV timestamp for this labelled decision (rounded to the nearest 5 minutes).
    • labeltime_next - the MIMIC-IV timestamp for the next labelled decision (if it exists).
  • Temporal context
    • steps_per_episode - the number of labelled decisions for this episode.
    • steps_remaining - the number of future labelled decisions remaining in this episode.
    • minutes_remaining - the minutes remaining until the final labelled decision in this episode.
    • is_done - whether this decision is the final decision of the episode.
  • Physiological information
    • current_bm - the current blood glucose that has triggered this decision.
    • prev_bm - the blood glucose at the previous decision (if it exists).
    • next_bm - the blood glucose at the next decision (if it exists).
    • time_since_prev_bm - the time (in minutes) since the previous decision (if it exists).
    • time_until_next_bm - the time (in minutes) until the next decision (if it exists).
  • Insulin information
    • insulin_changetime - the MIMIC-IV timestamp for any changes made to the insulin.
    • insulin_old_rate - the insulin infusion rate before the labelled decision.
    • insulin_new_rate - the insulin infusion rate after the labelled decision.
    • insulin_maintain - (Binary) whether the insulin was left unchanged. 1 = maintained, 0 = not maintained.
    • insulin_change - (Binary) whether the insulin was changed but not stopped. 1 = changed, 0 = not changed.
    • insulin_stop - (Binary) whether the insulin was stopped. 1 = stopped, 0 = not stopped.
    • insulin_delta_change - the change in insulin rate (insulin_new_rate - insulin_old_rate).
    • (The above four actions are repeated using _prev and _next suffixes, for the previous and next labelled decisions.)
  • Mortality outcomes
    • 1/3/7/14/28-day-alive - whether the patient is alive (1) or dead (0) at x days into the future, relative to this decision.
    • 1/3/7/14/28-day-alive-final - whether the patient is alive (1) or dead (0) at x days into the future, relative to the final state of the trajectory.
  • Input states
    • feature - the integer feature code of a past medical event.
    • time - the timestamp of a past medical event, in minutes relative to the current labelled decision (i.e., t=0 is now, and t=60 is 60 minutes in the past).
    • value - the value of a past medical event.
    • (The above inputs are repeated using _next suffixes, for the input state of the next labelled decision.)

The provided Jupyter notebook (tutorial_notebook.ipynb) shows how these have been partitioned into state/action/reward/done/info safetensors files.
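As a rough illustration of how the identifier and timing columns could support SMDP-style training, the sketch below links each row to its successor via label_id_next and computes a duration-aware discount from time_until_next_bm. The column names come from the list above; the helper names, the per-hour discount parameter, and the exponent-based discounting convention are assumptions rather than the notebook's method.

```python
from typing import List


def smdp_discount(minutes_elapsed: float, gamma_per_hour: float = 0.99) -> float:
    """Discount over a variable-length step: gamma ** (elapsed hours)."""
    return gamma_per_hour ** (minutes_elapsed / 60.0)


def link_transitions(rows: List[dict], gamma_per_hour: float = 0.99) -> List[dict]:
    """Pair each labelled decision with its successor, if one exists."""
    by_id = {row["label_id"]: row for row in rows}
    transitions = []
    for row in rows:
        nxt = by_id.get(row.get("label_id_next"))
        # Treat a missing successor as terminal, as well as an explicit is_done.
        done = bool(row["is_done"]) or nxt is None
        transitions.append({
            "label_id": row["label_id"],
            "next_label_id": None if done else nxt["label_id"],
            "done": done,
            # No bootstrapping from a terminal step, so its discount is zero.
            "discount": 0.0 if done else smdp_discount(
                row["time_until_next_bm"], gamma_per_hour),
        })
    return transitions
```

With a per-hour discount of 0.9, a decision followed 120 minutes later by the next one receives a discount of 0.9 squared, whereas a fixed-interval MDP would apply the same factor to every step regardless of elapsed time.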

Cohort Statistics

The Insulin4RL dataset contains clinical trajectories for 12,209 unique patients, which results in 13,652 episodes and 376,310 labelled insulin decisions. The cohort represents a diverse ICU population requiring intravenous insulin therapy. Approximately 92.4% of patients were limited to a single episode, with 5.6% of patients having two episodes and 1.9% having three or more episodes. 

| Per-episode | Duration | Total decisions | Hypoglycaemic (<4 mmol/L) events | Hyperglycaemic (≥10 mmol/L) events |
|---|---|---|---|---|
| Median | 28 hours | 17 | 0 | 1.0 |
| Mean | 42 hours | 28 | 0.4 | 5.8 |
| Min | n/a | 1 | 0 | 0 |
| Max | 35 days | 598 | 17 | 198 |

91.4% of patients survived (as defined by survival 28 days after end of their final episode). 92.5% of patients had a single episode in the dataset. A further 5.6% had two episodes. 1.9% had three or more episodes.


Usage Notes

We have provided a Jupyter notebook (tutorial_notebook.ipynb) to give a demonstration of the data and how it could be used. The primary intention is for offline reinforcement learning research, to develop better approaches for handling naturally irregular healthcare data, including off-policy evaluation methodology.

To reproduce the dataset, follow the instructions at the GitHub repo (https://github.com/tdgfrost/insulin4rl). Please note that reproducing the dataset from scratch may result in a different set of patients in each segment, even with a fixed random seed. To avoid this, we recommend creating the insulin4rl folder and placing the provided patient_ids within it, which the code will use instead.


Release Notes

1.0.0 - First release


Ethics

The authors declare no ethics concerns.


Acknowledgements

TF was funded by the Engineering and Physical Sciences Research Council as part of the UK Research and Innovation Centre for Doctoral Training in AI for Healthcare (grant EP/S021612/1). SH supported the research as a researcher based at the National Institute for Health and Care Research (NIHR) University College London Hospitals (UCLH) Biomedical Research Centre (BRC). Funders played no role in study design, data collection and analysis, decision to publish, or writing. 


Conflicts of Interest

The authors declare no competing financial interests.


References

  1. NICE-SUGAR Study Investigators. (2012). Hypoglycemia and risk of death in critically ill patients. New England Journal of Medicine, 367(12), 1108-1118.
  2. NICE-SUGAR Study Investigators. (2009). Intensive versus conventional glucose control in critically ill patients. New England Journal of Medicine, 360(13), 1283-1297.
  3. Bohé, J., Abidi, H., Brunot, V., Klich, A., Klouche, K., Sedillot, N., Tchenio, X., Quenot, J. P., Roudaut, J. B., Mottard, N., Thiollière, F., Dellamonica, J., Wallet, F., Souweine, B., Lautrette, A., Preiser, J. C., Timsit, J. F., Vacheron, C. H., Ait Hssain, A., Maucort-Boulch, D., … CONTROLe INdividualisé de la Glycémie (CONTROLING) Study Group (2021). Individualised versus conventional glucose control in critically-ill patients: the CONTROLING study-a randomized clinical trial. Intensive care medicine, 47(11), 1271–1283.
  4. Jayaraman, P., Desman, J., Sabounchi, M., Nadkarni, G. N., & Sakhuja, A. (2024). A primer on reinforcement learning in medicine for clinicians. NPJ digital medicine, 7(1), 337.
  5. Johnson, A. E. W., Bulgarelli, L., Shen, L., Gayles, A., Shammout, A., Horng, S., Pollard, T. J., Hao, S., Moody, B., Gow, B., Lehman, L. H., Celi, L. A., & Mark, R. G. (2023). MIMIC-IV, a freely accessible electronic health record dataset. Scientific data, 10(1), 1.
  6. Frost, T., Vaidya, H., & Harris, S. (2026). The hidden risks of temporal resampling in clinical reinforcement learning. arXiv preprint arXiv:2602.06603.
  7. Sun, Y., & Tang, S. (2025). Exploring time-step size in reinforcement learning for sepsis treatment. RLC 2025 Workshop on Practical Insights into Reinforcement Learning for Real Systems. arXiv preprint arXiv:2511.20913.
  8. Quan, H., Sundararajan, V., Halfon, P., Fong, A., Burnand, B., Luthi, J. C., ... & Ghali, W. A. (2005). Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Medical care, 43(11), 1130-1139.

Parent Projects
Insulin4RL: Real-Time Insulin Infusions For Offline Reinforcement Learning was derived from: Please cite them when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
