Database Credentialed Access
MIMIC-IV-Ext-PE: Pulmonary Embolism Labels for CT Pulmonary Angiography Radiology Reports
Barbara Lam, Omid Jafari, Peiqi Wang, Iuliia Kovalenko, Steven Horng, Ang Li, Shengling Ma
Published: March 23, 2026. Version: 1.0.0
When using this resource, please cite:
Lam, B., Jafari, O., Wang, P., Kovalenko, I., Horng, S., Li, A., & Ma, S. (2026). MIMIC-IV-Ext-PE: Pulmonary Embolism Labels for CT Pulmonary Angiography Radiology Reports (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/5qq1-zv65
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
Abstract
Pulmonary embolism (PE) is a leading cause of preventable in-hospital mortality. Advances in diagnosis, risk stratification, and prevention can improve outcomes. Large, publicly available datasets are needed to move research forward but are lacking in the field of hemostasis and thrombosis. In this study, we added PE labels to computed tomography pulmonary angiography (CTPA) radiology reports in MIMIC-IV. We used regular expressions (RegEx) to identify CTPA radiology reports (n=19,942) and extracted sentences containing PE-related words (“snippets”). Two physicians manually reviewed these snippets, referring to the full report as needed, and labeled each report as PE positive or negative. Positive labels included any acute PE (n=1,591). Acute PEs that involved only subsegmental arteries were labeled as subsegmental. Negative labels included chronic PE, equivocal findings, and no PE (n=18,351). Using these labels as a gold standard, we then compared the performance of a finetuned transformer model with that of diagnosis codes in classifying the reports as PE positive or negative.
Background
Pulmonary embolism (PE) is a leading cause of preventable in-hospital mortality [1]. Early detection and treatment can reduce the risk of death, but diagnosis can be challenging because the typical symptoms of tachycardia, shortness of breath, and chest pain are nonspecific. Much work has been done to identify patients at risk of developing PE and tailor treatment for those who have been diagnosed with PE, but better risk assessment models are needed [2]. Research groups are exploring the use of machine learning techniques to improve PE detection and treatment [3]. External validation of these models in a different healthcare system or dataset is a critical step in advancing the field, but there are few large publicly available datasets with PE labels.
The Radiological Society of North America (RSNA) Pulmonary Embolism CT Dataset includes over 12,000 CT studies, shared as DICOM images with metadata tags, labeled with PE type [4]. INSPECT includes 23,248 CT scans labeled with PE type, along with radiology reports and longitudinal electronic health record (EHR) data such as demographics, diagnoses, procedures, vitals, and medications [5]. This work seeks to add another multimodal dataset to the public domain by labeling CTPA reports in MIMIC-IV [6], thus linking PE cases to data unique to the MIMIC dataset, including EHR data, echocardiograms, chest x-rays, and electrocardiograms.
Identifying PE in medical charts is not only important for furthering research in thromboembolism but also for public health monitoring. The majority of hospital-acquired PE are thought to be preventable, and reporting the incidence of PE has become an important hospital quality metric [1, 7]. Historically, PE diagnoses were identified by manual chart review, which was labor intensive and difficult to scale. Attempts to use International Classification of Diseases (ICD) diagnosis codes revealed poor predictive value, especially in the outpatient and emergency room settings [8, 9].
Natural language processing (NLP) potentially offers a more accurate and automated method for phenotyping large datasets for quality review and clinical research [10]. Based on a recent systematic review of NLP methods for PE identification, very few research groups have completed an external validation of their work, and none that we know of has attempted to use a transformer language model [11]. Transformer language models represent the most recent iteration of NLP and are changing the landscape of numerous fields, including healthcare [12]. Maghsoudi et al. previously customized the transformer language model known as Bio_ClinicalBERT by finetuning it to identify venous thromboembolism (VTE), including deep vein thrombosis (DVT) and PE, in a cohort of 800 cancer patients with 3,000 notes [13]. This finetuned model is referred to as the VTE-BERT model from here forward.
In this study, two physicians labeled all available CTPA reports in MIMIC-IV with PE outcomes. Using this as a gold standard, we then compared the performance of VTE-BERT to diagnosis codes in their ability to classify reports as PE positive or negative.
Methods
Identification of CTPA reports
We identified CTPA radiology reports through a combination of querying, RegEx, and manual review. Our step-by-step process is described below, and our code is shared in Supplementary File 1 and on GitHub [14]. All relevant terminology is listed in Supplementary Table 1.
- Identify all notes of subtype “RR” (radiology report)
- Segment reports into “History”, “Indication”, “Procedure”, “Examination”, “Study”, and “Technique” sections
- Include reports that contain terminology related to CTPA in “Procedure”, “Examination”, “Study”, or “Technique” sections
- Include reports that contain terminology related to PE in “History” or “Indication” sections
- Include reports that contain a separate section with a heading related to CTPA
- Manual review
Two physicians (BDL and IK) confirmed that each radiology report described a CTPA. Any type of imaging study that also included a CT angiogram of the chest was included. For example, imaging studies such as CT angiograms of the entire torso or CT angiograms of the cardio-vasculature were included if a CT angiogram of the chest was described in the technique section. Ventilation-perfusion scans were not included within the scope of this study.
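The automated part of this process could be sketched roughly as follows. This is a minimal illustration, not the authors' code (which is in Supplementary File 1 and on GitHub [14]). The term lists are hypothetical stand-ins for Supplementary Table 1, the function names are invented for this example, section headers are assumed to be upper-case words followed by a colon at the start of a line, and the three inclusion rules are treated as alternatives, which the text does not state explicitly.

```python
import re

# Hypothetical term lists; the actual terminology is in Supplementary Table 1.
CTPA_TERMS = re.compile(
    r"\b(?:cta\s+chest|ctpa|ct\s+pulmonary\s+angiogra\w*"
    r"|ct\s+angiogra\w*\s+(?:of\s+the\s+)?chest)\b",
    re.IGNORECASE,
)
PE_TERMS = re.compile(r"\b(?:pulmonary\s+embol\w*|PE)\b", re.IGNORECASE)

# Assumption: a section header is an upper-case phrase ending in a colon
# at the start of a line, e.g. "TECHNIQUE:".
SECTION_HEADER = re.compile(r"^[ \t]*([A-Z][A-Z ]+?)[ \t]*:", re.MULTILINE)


def segment_report(text: str) -> dict[str, str]:
    """Split a report into sections keyed by lower-cased header name."""
    sections: dict[str, str] = {}
    matches = list(SECTION_HEADER.finditer(text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        name = match.group(1).strip().lower()
        body = text[match.end():end].strip()
        sections[name] = (sections.get(name, "") + " " + body).strip()
    return sections


def is_candidate_ctpa(text: str) -> bool:
    """Flag a report for manual review if any inclusion rule matches."""
    sections = segment_report(text)
    ctpa_in_protocol = any(
        CTPA_TERMS.search(sections.get(name, ""))
        for name in ("procedure", "examination", "study", "technique")
    )
    pe_in_indication = any(
        PE_TERMS.search(sections.get(name, ""))
        for name in ("history", "indication")
    )
    ctpa_heading = any(CTPA_TERMS.search(name) for name in sections)
    return ctpa_in_protocol or pe_in_indication or ctpa_heading
```

Reports flagged this way would still go through the manual confirmation described above; the filter only narrows the set of candidates.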
Text processing
CTPA radiology reports were further preprocessed using a RegEx algorithm that identified PE keywords, isolated relevant sentence(s), and merged them into a final note file (Supplementary Table 2). These sentences were used as input to VTE-BERT, which was asked to determine whether the compiled note described an acute PE (PE positive) or not (PE negative). If no sentence was isolated by the preprocessing algorithm for evaluation by VTE-BERT, the classification was labeled as negative (no PE). The performance of VTE-BERT was measured using sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).
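A rough sketch of this preprocessing step is shown below. The keyword pattern is a hypothetical placeholder for the terms in Supplementary Table 2, the sentence splitter is deliberately naive, and `model_predict` is an invented callable standing in for VTE-BERT; none of these names come from the authors' code.

```python
import re
from typing import Callable

# Hypothetical keyword pattern; the actual keywords are in Supplementary Table 2.
PE_KEYWORDS = re.compile(
    r"\b(?:pulmonary\s+embol\w*|PE|filling\s+defects?)\b", re.IGNORECASE
)
# Naive splitter: break after ., ! or ? followed by whitespace.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def extract_snippet(report_text: str) -> str:
    """Return the keyword-containing sentences merged into one string."""
    normalized = " ".join(report_text.split())  # collapse line breaks
    sentences = SENTENCE_SPLIT.split(normalized)
    return " ".join(s for s in sentences if PE_KEYWORDS.search(s))


def classify_report(report_text: str, model_predict: Callable[[str], str]) -> str:
    """Classify a report, defaulting to negative when no snippet is found."""
    snippet = extract_snippet(report_text)
    if not snippet:
        return "negative"  # rule stated in the text: no snippet means no PE
    return model_predict(snippet)
```

The negative default mirrors the rule described above for reports in which no sentence was isolated.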
VTE-BERT model
VTE-BERT was developed by finetuning the Bio_ClinicalBERT transformer model using split-sample cross-validation on positive and negative VTE keyword-containing sentences. The sentences derive from the clinical notes (progress notes, discharge summaries, and radiology reports) of adult patients with active cancer from the Harris Health System. VTE-BERT was customized to identify both DVT and PE events in written text and has been externally validated in a Veterans Affairs cohort of adult patients with cancer. Further details of the model’s development, testing, and validation processes are described in separate abstracts [13, 15]. This study focuses on applying VTE-BERT to a general cohort of adult patients to identify PE events in MIMIC-IV.
Manual physician review
The isolated sentence and VTE-BERT prediction were reviewed by one physician (BDL). The second physician (IK) reviewed the isolated sentence only and was blinded to the VTE-BERT output. The physicians reviewed the entire radiology report if there was no isolated sentence. Both physicians used the following criteria as the gold standard for categorizing PE:
- Positive PE
  - Acute = Acute PE or a mix of acute and chronic PE
  - Subsegmental = Acute PE in subsegmental arteries only
- Negative PE
  - Chronic = Chronic PE, PE of unclear chronicity, or PE similar to last scan
  - Equivocal = Equivocal findings such as motion artifact versus PE
  - Negative = No PE, imaging suboptimal for identifying PE, or no description of the pulmonary arteries
All conflicts were discussed until agreement was reached for every report. To assess interrater agreement, we calculated Cohen’s kappa (k) after grouping the 5 labels into binary outcomes of positive and negative PE.
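As an illustration of the agreement calculation, the sketch below collapses the five labels into binary outcomes and computes Cohen's kappa from first principles. The function names are invented for this example, and the implementation is a generic one rather than the authors' code.

```python
from collections import Counter

# Binary grouping taken from the criteria above.
POSITIVE_LABELS = {"Acute", "Subsegmental"}


def to_binary(label: str) -> int:
    """Map a five-level label to 1 (PE positive) or 0 (PE negative)."""
    return 1 if label in POSITIVE_LABELS else 0


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters need the same non-zero number of labels.")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa([to_binary(x) for x in labels_a], [to_binary(x) for x in labels_b])` would give the binary agreement for two hypothetical lists of per-report labels.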
Diagnosis code assessment
We also assessed the accuracy of using ICD codes (Supplementary File 2) to identify PE cases in a subgroup analysis. The ICD codes associated with radiology reports are available only for hospitalizations with billable discharges. Therefore, we assessed the performance of ICD codes only on radiology reports that had an associated hospital stay identification number. ICD-9 and ICD-10 codes related to PE were included; codes describing septic, cement, or fat emboli were excluded. A patient could have multiple CTPAs during a single hospital stay; in that case, only one CTPA report was included for comparison, and a report showing an acute PE was preferentially selected. All analyses were conducted using Python 3.11.
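The per-stay selection rule could be sketched as below. The function name is hypothetical, the rows are assumed to be dictionaries keyed by the dataset's column names, and the tie-breaking choice when no report in a stay is positive (keep the first one encountered) is an assumption, since the text does not specify it.

```python
POSITIVE_LABELS = {"Acute", "Subsegmental"}


def select_one_report_per_stay(reports: list[dict]) -> list[dict]:
    """Keep one report per hadm_id, preferring a report that shows acute PE."""
    chosen: dict[str, dict] = {}
    for report in reports:
        hadm_id = report.get("hadm_id")
        if not hadm_id:
            continue  # no associated hospital stay, so no ICD codes to compare
        current = chosen.get(hadm_id)
        is_positive = report["Type_of_PE"] in POSITIVE_LABELS
        if current is None:
            chosen[hadm_id] = report
        elif is_positive and current["Type_of_PE"] not in POSITIVE_LABELS:
            chosen[hadm_id] = report  # a positive report replaces a negative one
    return list(chosen.values())
```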
Data Description
Dataset characteristics
Of 2,321,355 radiology reports in MIMIC-IV, we identified 21,948 reports as likely CTPAs (Figure 1). After manual review, we confirmed 19,942 distinct CTPA reports from 15,875 patients. The median age was 63 years, and more than half of the patients identified as female (56.7%) and white (58.8%) (Table 1).
Two physicians reviewed the 19,942 CTPA reports. Their initial labels disagreed for 504 reports (k = 0.82). After reviewing these 504 reports together, the two reviewers resolved all disagreements. This manual abstraction process (gold standard) identified 1,591 reports describing acute PE (of which 233 involved subsegmental arteries only) and 18,351 reports describing negative PE (of which 345 described chronic PE and 104 described equivocal findings).
The MIMIC-IV-Ext-PE.csv file includes the following columns:
- note_id: the unique identifier for the CTPA note
- subject_id: the unique identifier for an individual patient
- hadm_id: the unique identifier for a patient hospitalization
- text: the full text of the CTPA radiology report
- Type_of_PE: Acute, Subsegmental, Chronic, Equivocal, or Negative
- charttime: the time at which the note was charted
- storetime: the time at which the note was stored in the database
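A minimal sketch of reading the file with the Python standard library follows. It assumes that the `Type_of_PE` values are spelled exactly as listed above; `pe_positive` is a derived field invented for this example, as are the function names.

```python
import csv
from collections import Counter
from typing import Iterable

# Assumption: label strings in the file match the spellings listed above.
POSITIVE_TYPES = {"Acute", "Subsegmental"}


def load_labels(lines: Iterable[str]) -> list[dict]:
    """Parse CSV lines and add a derived boolean 'pe_positive' field."""
    rows = []
    for row in csv.DictReader(lines):
        row["pe_positive"] = row["Type_of_PE"] in POSITIVE_TYPES
        rows.append(row)
    return rows


def count_by_type(rows: list[dict]) -> Counter:
    """Tally reports per Type_of_PE label."""
    return Counter(row["Type_of_PE"] for row in rows)


# Usage (requires credentialed access to the file):
# with open("MIMIC-IV-Ext-PE.csv", newline="", encoding="utf-8") as handle:
#     rows = load_labels(handle)
#     print(count_by_type(rows))
```

Opening the file with `newline=""` matters here because the `text` column contains multi-line reports, which the `csv` module can only reassemble if line endings are passed through unchanged.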
VTE-BERT performance
Our preprocessing algorithm isolated relevant PE-related sentences from 18,748 reports. Among the remaining 1,194 reports, in which no relevant keywords were identified, only one described an acute PE. The VTE-BERT model demonstrated a sensitivity of 0.92 (95% CI: 0.91-0.94) and a PPV of 0.88 (95% CI: 0.86-0.89) (Table 2). The most common error was predicting a report to be positive when it described chronic PE findings only.
Diagnosis code performance
Among the 19,942 CTPAs identified, 12,355 were associated with 11,990 unique hospital stays (365 represented multiple images from the same stay). When comparing the inpatient discharge ICD codes to the physician-adjudicated gold standard, we found that 308 reports were incorrectly labeled by ICD code: 61 reports described an acute PE that had no relevant ICD code associated; and of those with an ICD code indicating acute PE, 108 reports described chronic PE only, 115 were negative for PE, and 24 were deemed equivocal findings. Of the 1,276 reports that were correctly identified as PE positive by ICD code, 169 described PE involving the subsegmental arteries only. Four of these 169 reports had an ICD code that specified PE in the subsegmental arteries only. Overall, ICD codes demonstrated a sensitivity of 0.95 (95% CI: 0.94-0.96) and a PPV of 0.84 (95% CI: 0.82-0.86) for identifying PE in inpatient visits (Table 2).
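The point estimates above can be reproduced from the reported counts, as in the following worked example. The confidence interval method is not stated in the text; the Wilson score interval used here is an assumption chosen for illustration, and the function name is invented.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (assumed method, see lead-in)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Counts taken from the paragraph above.
true_pos = 1276            # acute PE on report and a PE ICD code
false_neg = 61             # acute PE on report, no relevant ICD code
false_pos = 108 + 115 + 24 # PE ICD code but chronic, negative, or equivocal report

sensitivity = true_pos / (true_pos + false_neg)
ppv = true_pos / (true_pos + false_pos)

print(f"sensitivity = {sensitivity:.3f}")  # prints 0.954
print(f"PPV = {ppv:.3f}")                  # prints 0.838
print(wilson_interval(true_pos, true_pos + false_neg))
print(wilson_interval(true_pos, true_pos + false_pos))
```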
Usage Notes
In this dataset, we extracted CTPA radiology reports from MIMIC-IV and labeled them with the PE finding to create MIMIC-IV-Ext-PE. Each record is identified by several unique identifiers that can link this data to other MIMIC-IV datasets. Our work adds a large dataset for PE to the literature, which is critical for the expansion of machine learning research in hemostasis and thrombosis [5]. Information on optimal PE risk stratification, diagnosis, and treatment lies in various types of data. Datasets such as MIMIC-IV-Ext-PE enable research into multimodal approaches to PE management by linking PE cases to longitudinal EHR data, clinical notes, echocardiograms, chest x-rays, and electrocardiograms [6, 17]. They also enable researchers to easily test the performance of their models on external data, an important validation step prior to clinical deployment. Of note, PE labels in this dataset are based on the radiology report only, which may not reflect how the case was ultimately treated (e.g., an equivocal finding on the radiology report may have been treated as an acute PE, with anticoagulation).
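As a small illustration of linking by identifier, the sketch below joins labeled reports to rows from another table on `hadm_id`. The function name is hypothetical, and the second table is assumed to be a list of dictionaries that carry a `hadm_id` field, such as rows read from the MIMIC-IV admissions table.

```python
def link_reports_to_admissions(reports: list[dict], admissions: list[dict]) -> list[dict]:
    """Inner-join labeled reports to admission rows on hadm_id."""
    by_stay = {adm["hadm_id"]: adm for adm in admissions}
    linked = []
    for report in reports:
        admission = by_stay.get(report.get("hadm_id"))
        if admission is not None:
            # Report fields take precedence where keys overlap (e.g. subject_id).
            linked.append({**admission, **report})
    return linked
```

Reports without a hospital stay identifier are dropped by this join, so analyses built on it cover only the subset of CTPAs tied to a hospitalization.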
Limitations
ICD codes demonstrated a sensitivity of 95.4% and a PPV of 83.8% for identifying PE in the chart, but in this dataset they could be applied only to radiology reports associated with an inpatient stay; 7,587 CTPAs could not be evaluated using diagnosis codes. As reported previously in the literature, ICD codes have poor predictive value for identifying PE, especially in the emergency room and outpatient settings [8, 9]. Furthermore, only four hospital stays had an ICD code that specified PE in the subsegmental arteries only. This level of classification is important for research given the controversy around the clinical relevance of subsegmental PE [18, 19].
One limitation of our evaluation of ICD codes, however, is the lack of insight into the patient’s clinical presentation at the time of the radiology study. For example, a patient with equivocal findings on the radiology report but significant symptoms may have ultimately been treated for PE and therefore labeled with an ICD diagnosis code of PE. We also excluded other types of imaging studies that can diagnose PE, such as ventilation-perfusion scans. Future work can be done to label reports of other imaging types and investigate which patients were treated with anticoagulation as further validation of the PE diagnosis on the patient or hospital stay level. This may also better capture the range of patients who are diagnosed with PE; our focus on CTPAs only limits representation of patients who are unlikely to receive CTPAs such as pregnant patients and patients with contrast allergies.
Our PE labels were confirmed by dual physician adjudication. It is possible that the physician reviewer could be biased by the VTE-BERT prediction, particularly in cases where the findings were equivocal. We attempted to minimize this bias by blinding the second physician reviewer to the VTE-BERT prediction. However, there can still be human errors in manual adjudication and subjectivity in interpreting radiology reports. We utilized the label of equivocal findings to flag the CTPA reports with less definitive language. We invite others to replicate our work and iterate on the final dataset, which can undergo continued improvements in the public space.
Release Notes
Version 1.0.0: Initial public release of the dataset.
Ethics
We derived this dataset from MIMIC-IV, and no new patient data was collected.
Acknowledgements
This material is based upon work supported by the Google Cloud Research Credits Program with the award GCP19980904.
Conflicts of Interest
The authors have no conflicts of interest to report.
References
1. Beckman MG, Hooper WC, Critchley SE, Ortel TL. Venous thromboembolism: a public health concern. Am J Prev Med. 2010;38(4 Suppl):S495-501.
2. Becattini C, Agnelli G. Acute treatment of venous thromboembolism. Blood. 2020;135(5):305-16.
3. Chiasakul T, Lam BD, McNichol M, Robertson W, Rosovsky RP, Lake L, et al. Artificial intelligence in the prediction of venous thromboembolism: A systematic review and pooled analysis. Eur J Haematol. 2023;111(6):951-62.
4. Colak E, Kitamura FC, Hobbs SB, Wu CC, Lungren MP, Prevedello LM, et al. The RSNA Pulmonary Embolism CT Dataset. Radiol Artif Intell. 2021;3(2):e200254.
5. Huang S-C, Huo Z, Steinberg E, Chiang C-C, Lungren MP, Langlotz CP, et al. INSPECT: A multimodal dataset for pulmonary embolism diagnosis and prognosis. arXiv preprint arXiv:2311.10798; 2023. Available from: https://arxiv.org/abs/2311.10798 Accessed October 29, 2025.
6. Johnson A, Bulgarelli L, Pollard T, Gow B, Moody B, Horng S, Celi LA, Mark R. MIMIC-IV (version 3.1). PhysioNet. 2024. RRID:SCR_007345. Available from: https://doi.org/10.13026/kpb9-mt58
7. Raskob GE, Silverstein R, Bratzler DW, Heit JA, White RH. Surveillance for deep vein thrombosis and pulmonary embolism: recommendations from a national workshop. Am J Prev Med. 2010;38(4 Suppl):S502-9.
8. Zhan C, Battles J, Chiang YP, Hunt D. The validity of ICD-9-CM codes in identifying postoperative deep vein thrombosis and pulmonary embolism. Jt Comm J Qual Patient Saf. 2007;33(6):326-31.
9. Fang MC, Fan D, Sung SH, Witt DM, Schmelzer JR, Steinhubl SR, et al. Validity of Using Inpatient and Outpatient Administrative Codes to Identify Acute Venous Thromboembolism: The CVRN VTE Study. Med Care. 2017;55(12):e137-e43.
10. Wendelboe A, Saber I, Dvorak J, Adamski A, Feland N, Reyes N, et al. Exploring the Applicability of Using Natural Language Processing to Support Nationwide Venous Thromboembolism Surveillance: Model Evaluation Study. JMIR Bioinform Biotech. 2022;3(1).
11. Lam BD, Chrysafi P, Chiasakul T, Khosla H, Karagkouni D, McNichol M, et al. Machine learning natural language processing for identifying venous thromboembolism: Systematic review and meta-analysis. Blood Adv. 2024.
12. Omiye JA, Gui H, Rezaei SJ, Zou J, Daneshjou R. Large Language Models in Medicine: The Potentials and Pitfalls: A Narrative Review. Ann Intern Med. 2024;177(2):210-20.
13. Maghsoudi A, Zhou E, Guffey D, Ma S, Xiao X, Peng B, et al. A Transformer Natural Language Processing Algorithm for Cancer Associated Thrombosis Phenotype. Blood; 2023. p. 1267.
14. Lam BD. MIMIC CTA report processing code [Internet]. GitHub; 2025. Available from: https://github.com/barbaralam/mimicpe Accessed December 6, 2025.
15. Li A, Jafari O, Ma S, Maghsoudi A, Lam BD, Ryu J, et al. Optimized multi-class VTE-BERT large language model for prediction of cancer associated thrombosis phenotype. Blood; 2024.
16. Higashiya K, Ford J, Yoon HC. Variation in Positivity Rates of Computed Tomography Pulmonary Angiograms for the Evaluation of Acute Pulmonary Embolism Among Emergency Department Physicians. Perm J. 2022;26(1):58-63.
17. Cahan N, Klang E, Marom EM, Soffer S, Barash Y, Burshtein E, et al. Multimodal fusion models for pulmonary embolism mortality prediction. Sci Rep. 2023;13(1):7544.
18. Yoo HH, Nunes-Nogueira VS, Fortes Villas Boas PJ. Anticoagulant treatment for subsegmental pulmonary embolism. Cochrane Database Syst Rev. 2020;2(2):CD010222.
19. Wiener RS, Schwartz LM, Woloshin S. Time trends in pulmonary embolism in the United States: evidence of overdiagnosis. Arch Intern Med. 2011;171(9):831-7.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/5qq1-zv65
DOI (latest version):
https://doi.org/10.13026/f77d-vr35