Database Credentialed Access
MIMIC-IV-Ext-MedicalBench: Evaluating Large Language Models Towards Improved Medical Concept Extraction
Zhichao Yang , Gregory Lyng , Sanjit Batra , Robert Tillman
Published: March 23, 2026. Version: 1.0.0
When using this resource, please cite:
Yang, Z., Lyng, G., Batra, S., & Tillman, R. (2026). MIMIC-IV-Ext-MedicalBench: Evaluating Large Language Models Towards Improved Medical Concept Extraction (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/j98m-g356
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
Abstract
Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts, such as diagnoses, are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts and provide limited coverage of cases in which medically relevant concepts must be inferred. We present MedicalBench, a new benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates concept extraction as a verification task over medical note–concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by dual medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with human assessments. Annotators provide sentence-level evidence spans and concise medical rationales. In total, the dataset contains 405 high-quality examples, covering a broad range of ICD-10 chapters. By providing ground-truth evidence and confusable alternatives, MedicalBench enables rigorous evaluation of not only whether a model can extract the correct concept, but also why: it rewards solutions that highlight relevant evidence and reject plausible-but-incorrect diagnoses and procedures.
Background
Medical concept extraction aims to convert medically meaningful information in narrative notes (e.g., discharge summaries, progress notes, radiology/oncology reports) into structured concepts that can support downstream medical research and applications such as cohort identification, outcome prediction, and automated documentation [1]. Despite steady progress, concept extraction from real-world medical notes remains challenging because medically relevant information is frequently implicit, fragmented, and context-dependent [2, 3]. Notes often describe findings, measurements, or interventions (e.g., “low hemoglobin” or “BMI 37”) that imply diagnoses such as anemia or obesity without explicitly naming them, and prior work suggests that under-documentation of conditions in structured fields is common [4].
A growing line of work reframes concept extraction as a verification and grounding problem: given a candidate concept, models must determine whether it is actually supported by the note and identify the evidence spans that justify the decision [5]. While recent evidence-annotated datasets have highlighted the importance of grounding, they often emphasize explicitly stated concepts and provide limited coverage of cases that require medical inference or involve semantically confusable alternatives [6, 7].
To address these gaps, we introduce MedicalBench, a benchmark for evidence-grounded medical concept verification in realistic hospital discharge summaries. MedicalBench formulates the task as: given a medical note and a candidate medical concept, predict whether the concept is supported by the note and, if supported, extract the relevant evidence spans and provide a brief justification. The benchmark is intentionally designed to stress-test models on (i) implicit documentation, (ii) hard negatives that are semantically close to true concepts, and (iii) cases where strong LLMs disagree with expert adjudication. By releasing a carefully annotated set of note–concept pairs with gold labels, evidence spans, and rationales, MedicalBench aims to enable more faithful evaluation of medical NLP systems and to support the development of models that are both accurate and well-grounded in medical notes.
Methods
MedicalBench is constructed from de-identified discharge summaries and structured billing codes in MIMIC-IV and MIMIC-IV-Note [8, 9, 10]. For each hospital admission (hadm_id), we extract the discharge summary and map it to all associated ICD-10 diagnosis and procedure codes, which serve as candidate medical concepts along with their textual descriptions and hierarchical metadata.
To identify challenging cases where LLMs may disagree with human experts, we performed a two-stage triage process. In the first stage, a non-reasoning LLM (GPT-4o-mini) was prompted with the discharge summary and the candidate medical concepts and asked to classify each note–concept pair into one of three categories: Explicit (the concept is explicitly supported by text spans in the note), Implicit (the concept is supported but requires clinical inference, e.g., a low hemoglobin level implying anemia), or Unrelated (the concept is not supported). Each pair was prompted three times, and the final answer was determined by majority vote. In the second stage, pairs whose first-stage judgment disagreed with the assigned MIMIC-IV codes were re-evaluated using a reasoning LLM (o3). We retained only those cases where both LLMs judged the concept as Unrelated even though the concept (ICD-10 code) was assigned in MIMIC-IV for that admission. This subset (putative positives) represents candidate false negatives for LLMs and forms the basis for challenging positive examples.
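The first-stage majority vote described above can be sketched as follows. This is a minimal illustration, not the released pipeline: `classify` stands in for the actual GPT-4o-mini prompt, and the function name is hypothetical.

```python
from collections import Counter

CATEGORIES = ("Explicit", "Implicit", "Unrelated")

def triage_pair(note: str, concept: str, classify, n_votes: int = 3) -> str:
    """Label a note-concept pair by majority vote over repeated LLM calls.

    `classify` is any callable returning one of CATEGORIES for a
    (note, concept) pair; here it is a stand-in for the LLM prompt.
    """
    votes = [classify(note, concept) for _ in range(n_votes)]
    # Counter.most_common(1) returns the single most frequent label.
    label, _ = Counter(votes).most_common(1)[0]
    return label
```

With three votes, a 2-of-3 agreement decides the category; ties cannot occur for a binary disagreement, and for three-way splits the first-seen label wins, which is one reason an odd vote count is convenient.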
To construct difficult negative examples, we designed two complementary sampling strategies.
- Prevalence-weighted negatives: For each discharge summary, we sampled candidate medical concepts according to their prevalence in MIMIC-IV. To characterize their difficulty, we additionally recorded the number of hops in the ICD hierarchy between each sampled negative and the closest related concept.
- Semantically similar negatives: To increase semantic confusion, we embedded each medical concept name and sampled the most similar but unrelated concepts. For example, Obesity and BMI 30–39 both capture obesity-related conditions but belong to different categories in the ICD hierarchy (disease vs. BMI classification). These hard negatives are designed to test fine-grained reasoning about whether a concept is truly documented in the note.
Together, these strategies minimize the presence of trivial negatives and enrich the dataset with medically challenging contrasts.
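The semantically similar negative strategy can be sketched as a nearest-neighbor search over concept-name embeddings, excluding concepts actually assigned to the admission. This is an illustrative sketch under assumed inputs (precomputed embedding vectors and code lists); the function name is hypothetical and the released pipeline may differ.

```python
import numpy as np

def hardest_negative(target_vec: np.ndarray,
                     candidate_vecs: np.ndarray,
                     candidate_codes: list[str],
                     assigned_codes: set[str]) -> str:
    """Return the candidate code whose name embedding is most
    cosine-similar to the target concept, excluding codes that are
    actually assigned to the admission (those would be positives)."""
    # Cosine similarity between the target and every candidate row.
    sims = candidate_vecs @ target_vec / (
        np.linalg.norm(candidate_vecs, axis=1) * np.linalg.norm(target_vec)
    )
    # Mask out concepts that are truly documented for this admission.
    for i, code in enumerate(candidate_codes):
        if code in assigned_codes:
            sims[i] = -np.inf
    return candidate_codes[int(np.argmax(sims))]
```

Given an embedding of "Obesity", this would surface a close-but-unrelated neighbor such as a BMI classification code rather than a random, trivially distinguishable concept.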
We recruited 9 independent expert annotators with medical training. For each note and concept pair, two annotators independently reviewed the discharge summary and labeled the concept as Related or Unrelated. For Related pairs, annotators additionally highlighted sentence-level evidence spans (recorded as character offsets) and provided a short clinical justification.
No annotator disagreement was included without adjudication. If the initial labels disagreed (Related vs. Unrelated), the example was flagged for resolution and reviewed to determine a single final label; pairs that could not be resolved to a clear final decision were excluded from the release. The final dataset therefore contains only examples with a verified gold label and, where applicable, accompanying evidence.
This release contains 167 gold positives and 238 gold negatives, emphasizing implicit documentation, semantic ambiguity, and LLM–human disagreement, and includes concept metadata, gold labels and evidence spans.
Data Description
data/ is the directory containing our annotations and the preprocessing script that transforms them into an inference-friendly format:
- dataset_raw.csv: CSV file containing the raw data used in the project, including our evidence annotations and reasoning summaries (see the data schema below for details).
- mimic_data.py: Python script for preprocessing the data. It uses the annotation file and the MIMIC-IV discharge notes to map each evidence span index to its actual text snippet, and outputs a CSV containing the full discharge note and extracted evidence. This output file is used for inference in the next step.
- create_data.sh: A bash script example to run mimic_data.py.
inference/ is the directory containing code for inference tasks:
- run_sample.py: Python script that runs a baseline model for inference. It saves the predictions in CSV format, zips them, and uploads the archive to the benchmark website for evaluation. Update the function renew_model_gpt_azureml with your Azure subscription details.
- infer.sh: A bash script example to run run_sample.py.
- outputs/: Directory intended for storing inference outputs. An example is created when running GPT-4.1 as the base model.
Data Schema of dataset_raw.csv:
- task_id
  - Type: Object (string)
  - Description: Unique identifier for each case, combining patient and hospital admission IDs.
  - Example: 12479866_23707166
- code
  - Type: Object (string)
  - Description: ICD-10-CM or ICD-10-PCS code assigned to the medical note. Represents a diagnosis or procedure.
  - Example: T79.7XXA
- desc
  - Type: Object (string)
  - Description: Human-readable description corresponding to the code.
  - Example: “Traumatic subcutaneous emphysema, initial encounter”
- gold
  - Type: Object (boolean)
  - Description: Whether the medical note documents the code.
  - Example: True
- reasoning_summary
  - Type: Object (string)
  - Description: A concise explanation of why this code is supported by the medical note.
  - Example: “Submandibular and right neck lymphadenopathy are documented”
- evidence_spans
  - Type: Object (list of (int, int) pairs)
  - Description: A list of character index pairs (start, end) indicating where supporting evidence appears in the note text.
  - Example: [(8012, 8047), (6350, 6379)]
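Resolving the evidence_spans offsets against the reconstructed note text is what produces the evidence column in the next schema. A minimal sketch, assuming the spans are stored in the CSV cell as a Python-literal list of (start, end) pairs (the helper name is hypothetical):

```python
import ast

def extract_evidence(note_text: str, spans_field: str) -> list[str]:
    """Map evidence_spans character offsets to the actual text snippets.

    spans_field is the raw CSV cell, e.g. "[(8012, 8047), (6350, 6379)]".
    """
    # ast.literal_eval safely parses the literal list of (start, end) tuples.
    spans = ast.literal_eval(spans_field)
    return [note_text[start:end] for start, end in spans]
```

Because the offsets index into the full reconstructed note, the text field must be rebuilt from MIMIC-IV-Note (via mimic_data.py) before the spans can be resolved.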
Data Schema after running mimic_data.py (additional):
- text
  - Type: Object (string, long free text)
  - Description: Full medical note describing the patient’s case, including admission reason, history, procedures, hospital course, discharge summary, and follow-up.
  - Examples (truncated): “Patient: Adult female with no known allergies, admitted under Orthopaedics for gunshot wound (GSW)...”; “Colonoscopy Procedure: A colonoscopy was performed via a natural opening...”
- evidence
  - Type: Object (list of string, short free text)
  - Description: A list of evidence snippets extracted from the note text.
  - Example: ["Extreme Fatigue", "Increased Thirst"]
Usage Notes
To prepare the data for inference, update the parameters in data/create_data.sh and run the shell script, which invokes mimic_data.py. This reconstructs the text field from MIMIC-IV-Note.
To run inference, update the parameters in inference/infer.sh and run the script.
Evaluations
For this dataset, there are two complementary evaluations: (A) medical concept extraction and (B) sentence-level evidence retrieval.
A. Medical concept extraction
Given a medical note and a candidate medical concept, the system checks if the note documents the concept. We compute Precision, Recall, and F1 over all cases:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Here, a true positive (TP) is a case where the submission predicts True and the ground truth label gold is True. False positive (FP) predicts True when gold=False, and false negative (FN) predicts False when gold=True. We report micro-averaged scores across all cases.
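The micro-averaged scoring above can be computed directly from the paired prediction and gold lists. A minimal sketch (the function name is ours, not part of the released evaluation code):

```python
def micro_prf(preds: list[bool], golds: list[bool]) -> tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over all note-concept pairs."""
    tp = sum(p and g for p, g in zip(preds, golds))       # predicted True, gold True
    fp = sum(p and not g for p, g in zip(preds, golds))   # predicted True, gold False
    fn = sum(not p and g for p, g in zip(preds, golds))   # predicted False, gold True
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Micro-averaging pools TP/FP/FN counts across all cases before computing the ratios, so every note–concept pair contributes equally regardless of how many pairs a given note has.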
B. Sentence-Level Evidence Retrieval
For the subset of cases where human annotators provided evidence sentences, we evaluate how well a system retrieves those sentences. We report Recall at the sentence level:
Sentence Recall = |Predicted ∩ Gold| / |Gold|
Gold evidence is a set of sentence identifiers curated by human annotators. Predicted evidence is the set returned by the system for the same case. We macro-average recall across all cases with gold evidence. If a system does not return any sentences for a case that has gold evidence, recall for that case is 0.
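The macro-averaged sentence recall can be sketched as follows, assuming each case is represented as a pair of sentence-identifier sets (the function name is illustrative):

```python
def macro_sentence_recall(cases: list[tuple[set[str], set[str]]]) -> float:
    """Macro-average sentence recall over cases with gold evidence.

    Each case is (predicted_sentence_ids, gold_sentence_ids). Cases with
    no gold evidence are skipped; an empty prediction scores 0 for that case.
    """
    scored = [len(pred & gold) / len(gold) for pred, gold in cases if gold]
    return sum(scored) / len(scored) if scored else 0.0
```

Macro-averaging (per-case recall first, then the mean) weights each case equally, so a case with many gold sentences cannot dominate one with a single gold sentence.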
Intended Usage
This dataset is intended as an evaluation benchmark for medical concept extraction and evidence grounding, not for large-scale model training or fine-tuning: it is relatively small by design and has been carefully curated to emphasize challenging, high-quality cases rather than comprehensive coverage of clinical conditions. It is intended for research and evaluation purposes only and is not suitable for direct clinical or billing use. Predictions made by models evaluated on this benchmark should not be used in real-world clinical settings without extensive validation.
Release Notes
Version 1.0.0: Initial public release.
Ethics
This project is derived from the de-identified clinical data in the MIMIC-IV and MIMIC-IV-Note repositories, which are publicly released for research via PhysioNet under a credentialed access process and a Data Use Agreement (DUA). Accordingly, MedicalBench follows the same access policy, license, and DUA requirements as MIMIC-IV and MIMIC-IV-Note; only credentialed users who have completed the required training and signed the PhysioNet DUA may access the underlying notes or any derived files distributed under this credentialed health data framework. The annotations we added (evidence and explanations) contain no patient-identifying information, only medical insights.
In creating and releasing MedicalBench, we have taken care to ensure patient privacy is fully preserved. The benefits of this dataset are to advance clinical NLP research – by improving automated coding systems, we aim to assist clinicians and medical coders, reduce coding errors, and enhance the utility of electronic health records for patient care and research. Potential risks are minimal since the data are de-identified; however, users of the dataset should avoid any attempt to re-identify individuals. We also caution that models trained on this data should be evaluated thoroughly: an erroneous code prediction in a real clinical setting could have billing or treatment implications, so improving model interpretability and accuracy (the goal of this dataset) is crucial before any clinical deployment. By providing a resource that encourages transparent reasoning, we hope to mitigate the risk of “black-box” AI errors in healthcare. All researchers using MedicalBench must comply with the usage policies of MIMIC-IV and PhysioNet, ensuring ethical handling of the data.
Acknowledgements
We thank the team of annotators for their work in reviewing notes and concepts. Their clinical expertise and careful attention were instrumental in creating this dataset. We are also grateful to the MIT Laboratory for Computational Physiology for maintaining the MIMIC-IV database, and to the PhysioNet team for providing a platform to share this resource with the community. We also acknowledge helpful discussions with colleagues and domain experts that improved the design of the dataset.
Conflicts of Interest
The author(s) have no conflicts of interest to declare.
References
- Mullenbach J, Wiegreffe S, Duke J, Sun J, Eisenstein J. Explainable prediction of medical codes from clinical text [Internet]. arXiv. 2018 Feb 15 [cited 2026 Feb 27]. Available from: https://arxiv.org/abs/1802.05695
- Soroush A, Glicksberg BS, Zimlichman E, Barash Y, Freeman R, Charney AW, et al. Large language models are poor medical coders—benchmarking of medical code querying. NEJM AI. 2024 Apr 25;1(5):AIdbp2300040.
- Perera S, Mendes P, Sheth A, Thirunarayan K, Alex A, Heid C, et al. Implicit entity recognition in clinical documents. In: Palmer M, Boleda G, Rosso P, editors. Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics; 2015 Jun; Denver (CO). Stroudsburg (PA): Association for Computational Linguistics; 2015. p. 228-238. doi:10.18653/v1/S15-1028.
- Yang Z, Batra SS, Stremmel J, Halperin E. Surpassing GPT-4 medical coding with a two-stage approach [Internet]. arXiv. 2023 Nov 22 [cited 2026 Feb 27]. Available from: https://arxiv.org/abs/2311.13735
- Dong H, Falis M, Whiteley W, Alex B, Matterson J, Ji S, et al. Automated clinical coding: what, why, and where we are? NPJ Digit Med. 2022 Oct 22;5(1):159.
- Cheng H, Jafari R, Russell A, Klopfer R, Lu E, Striner B, et al. MDACE: MIMIC documents annotated with code evidence. In: Rogers A, Boyd-Graber J, Okazaki N, editors. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2023 Jul; Toronto (ON), Canada. Stroudsburg (PA): Association for Computational Linguistics; 2023. p. 7534-7550. doi:10.18653/v1/2023.acl-long.416.
- Beckh K, Studeny E, Gannamaneni SS, Antweiler D, Rueping S. The anatomy of evidence: an investigation into explainable ICD coding. In: Che W, Nabende J, Shutova E, Pilehvar MT, editors. Findings of the Association for Computational Linguistics: ACL 2025; 2025 Jul; Vienna, Austria. Stroudsburg (PA): Association for Computational Linguistics; 2025. p. 16840-16851. doi:10.18653/v1/2025.findings-acl.864.
- Johnson A, Bulgarelli L, Pollard T, Gow B, Moody B, Horng S, Celi LA, Mark R. MIMIC-IV (version 3.1). PhysioNet; 2024. Available from: https://doi.org/10.13026/kpb9-mt58.
- Johnson A, Pollard T, Horng S, Celi L, Mark R. MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2). PhysioNet; 2023. Available from: https://doi.org/10.13026/1n74-ne17.
- Johnson A, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, Pollard TJ, Hao S, Moody B, Gow B, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data. 2023;10(1):1.
Parent Projects
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/j98m-g356
DOI (latest version):
https://doi.org/10.13026/0k22-xc18