Database Credentialed Access
Annotated Social Determinants of Health Dataset for Adverse Pregnancy Outcomes
Nidhi Soley, MaKhaila Bentil, Jash Shah, Masoud Rouhizadeh, Casey Taylor
Published: Aug. 4, 2025. Version: 1.0.0
When using this resource, please cite:
Soley, N., Bentil, M., Shah, J., Rouhizadeh, M., & Taylor, C. (2025). Annotated Social Determinants of Health Dataset for Adverse Pregnancy Outcomes (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/qk2y-wx30
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
Abstract
This project presents an annotated dataset derived from MIMIC-III and MIMIC-IV discharge summaries, focusing on three key Social Determinants of Health (SDoH) factors (social support, occupation, and substance use) and their association with adverse pregnancy outcomes. Using a combination of manual annotation and Natural Language Processing (NLP) techniques, we developed and validated multiple models (rule-based, Word2Vec, and Clinical BERT) to automate the extraction of these features. The resulting de-identified dataset, along with our code scripts for data preprocessing, model development, and validation, is made available through this PhysioNet project.
Background
Adverse pregnancy outcomes such as preterm birth, low birth weight, and preeclampsia pose significant public health challenges in the US [1–4]. Social determinants of health (SDoH), which include environmental and societal conditions, are critical for maternal and infant health [5,6]. However, SDoH information is often embedded in unstructured EHR notes, making manual extraction both resource-intensive and costly [7]. Recent advances in NLP—using approaches like Clinical BERT, Word2Vec, and rule-based methods—offer promising avenues for automating SDoH extraction [8–15]. Despite this progress, many studies are limited by single-dataset validation and a lack of linkage between extracted features and clinical outcomes. Our project addresses these gaps by comparing multiple NLP strategies on MIMIC-III discharge summaries and validating the results on MIMIC-IV data, ultimately translating unstructured clinical text into actionable social risk profiles for maternal health.
Methods
Data Extraction & Inclusion Criteria:
Our study used discharge notes from two critical care databases: MIMIC-III for model training and testing, and MIMIC-IV for model validation. From MIMIC-III, discharge notes were selected for female patients meeting the following inclusion criteria: (1) ICD-9 codes indicative of pregnancy (normal pregnancies: 650–659; complications: 630–639, 660–669), (2) a non-empty "Social History" section, and (3) an available discharge summary for the most recent pregnancy when a patient had more than one. For validation, notes were selected from MIMIC-IV using identical criteria.
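As an illustration of criterion (1), the ICD-9 ranges above can be expressed as a small lookup. This is a sketch, not the authors' extraction code: the function name `classify_icd9` is hypothetical, and it assumes that codes are stored as strings whose first three characters give the ICD-9 category, with or without a decimal point.

```python
from typing import Optional

# Three-digit ICD-9 category ranges quoted in the inclusion criteria.
NORMAL_RANGE = range(650, 660)                            # 650-659
COMPLICATION_RANGES = (range(630, 640), range(660, 670))  # 630-639, 660-669


def classify_icd9(code: str) -> Optional[str]:
    """Return 'normal', 'complication', or None for codes outside the criteria.

    Assumption: the first three characters of the code are the ICD-9 category.
    """
    prefix = code.strip().replace(".", "")[:3]
    # V and E codes (and malformed input) are not three decimal digits.
    if len(prefix) < 3 or not prefix.isdecimal():
        return None
    category = int(prefix)
    if category in NORMAL_RANGE:
        return "normal"
    if any(category in r for r in COMPLICATION_RANGES):
        return "complication"
    return None
```

Codes outside the listed ranges, including the 640–649 block, return `None` here because the inclusion criteria above do not mention them.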
Annotation Protocol & Agreement Analysis
- MIMIC-III Annotations: Manually annotated by a single annotator following predefined guidelines.
- MIMIC-IV Annotations: Independently annotated by three annotators, with each note reviewed by at least two annotators.
- Inter-Annotator Agreement: Measured using Cohen's kappa scores to assess consistency across annotators. The agreement metrics and kappa score calculations are reported in the archived paper.
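For reference, Cohen's kappa for a pair of annotators compares the observed agreement with the agreement expected by chance. The following is a minimal pure-Python sketch on made-up labels; it is not the authors' script, and the function name `cohens_kappa` is an assumption. The agreement values for this dataset are those reported in the archived paper.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Proportion of items on which both annotators chose the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up binary labels for eight notes (not taken from the dataset).
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 0, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

With three annotators and two reviews per note, one plausible approach is to compute kappa per annotator pair on the notes they share; whether the authors did this is not stated here.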
Annotation Categories:
- Social Support: Annotated as present (1) when the note mentioned cohabitation or strong familial support, and absent (0) otherwise.
- Occupation: Marked as present (1) for evidence of active employment; absent (0) if not mentioned or noted as unemployed.
- Substance Use: Coded as present (1) when current or past substance use was identified; absent (0) if denied or undocumented.
Model Development:
We implemented and compared three NLP approaches on MIMIC-III:
- Rule-Based Approach: Utilizing keyword processors to extract explicit mentions.
- Word2Vec Embeddings: Combined with machine learning classifiers to capture semantic relationships.
- Clinical BERT: Fine-tuned for clinical text to capture context-sensitive features.
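To make the rule-based idea concrete, here is a minimal keyword-matching sketch using Python's standard `re` module as a stand-in for a keyword processor. The keyword lists and the names `KEYWORDS` and `rule_based_labels` are illustrative assumptions, not the authors' lexicon or code, and the sketch does no negation handling (a phrase such as "denies alcohol use" would still match "alcohol").

```python
import re
from typing import Dict, List, Pattern

# Hypothetical keyword lists, one per SDoH category.
KEYWORDS: Dict[str, List[str]] = {
    "social_support": ["lives with", "husband", "partner", "supportive family"],
    "occupation": ["works as", "works at", "employed"],
    "substance_use": ["tobacco", "alcohol", "cocaine", "smokes"],
}


def compile_patterns(keywords: Dict[str, List[str]]) -> Dict[str, Pattern[str]]:
    """Build one case-insensitive, word-bounded pattern per category."""
    return {
        label: re.compile(
            r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
            re.IGNORECASE,
        )
        for label, terms in keywords.items()
    }


PATTERNS = compile_patterns(KEYWORDS)


def rule_based_labels(text: str) -> Dict[str, int]:
    """Return 1 for each category with at least one keyword match, else 0."""
    return {label: int(bool(pattern.search(text))) for label, pattern in PATTERNS.items()}


# Synthetic sentence, not taken from MIMIC.
example = rule_based_labels("Lives with her husband. Works as a nurse.")
```

The word boundaries keep "employed" from matching inside "unemployed", which matters because the annotation guidelines treat unemployment as absent (0).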
The optimal model for each SDoH category was determined based on performance metrics (accuracy, precision, recall, F1-score) and then applied to the validation set (MIMIC-IV).
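The four metrics named above can be computed from the confusion-matrix counts for a binary label. The sketch below uses made-up predictions and a hypothetical helper `binary_metrics`; it shows the standard definitions only and does not reproduce any result from the paper.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for binary labels (1 = present)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    # Undefined ratios (no predicted or no actual positives) are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up gold labels and predictions for eight notes.
metrics = binary_metrics([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0])
```

Because the classes are imbalanced in this dataset, precision, recall, and F1 are generally more informative than accuracy alone.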
Code and Scripts:
All code for data preprocessing, model development, and validation is hosted in our GitHub repository to support full reproducibility.
Data Description
The dataset consists of de-identified, manually annotated discharge summaries from MIMIC-III and MIMIC-IV. It includes the following key fields:
- subject_id: Anonymized identifier linking to the original discharge summaries.
- text: Discharge summary note text.
- social_support: Binary indicator (1/0) based on the presence of supportive relationships.
- occupation: Binary indicator (1/0) representing employment status.
- substance_use: Binary indicator (1/0) for evidence of substance use.
- complication: Label indicating adverse pregnancy outcomes (1: complicated, 0: normal).
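A short loader can check that a file carries the fields listed above before analysis. This sketch assumes the annotations are distributed as CSV with exactly these column names and with the binary fields stored as `0` or `1`; the file format, the function name `load_annotations`, and the sample rows (synthetic text, not MIMIC content) are assumptions.

```python
import csv
import io
from typing import Dict, List, TextIO

EXPECTED_COLUMNS = [
    "subject_id", "text", "social_support",
    "occupation", "substance_use", "complication",
]
BINARY_COLUMNS = EXPECTED_COLUMNS[2:]


def load_annotations(handle: TextIO) -> List[Dict[str, object]]:
    """Read annotation rows and convert the binary fields to int."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows: List[Dict[str, object]] = []
    # Data rows start on line 2 of the file (line 1 is the header).
    for line_no, row in enumerate(reader, start=2):
        record: Dict[str, object] = dict(row)
        for column in BINARY_COLUMNS:
            if row[column] not in {"0", "1"}:
                raise ValueError(f"line {line_no}: {column}={row[column]!r} is not 0/1")
            record[column] = int(row[column])
        rows.append(record)
    return rows


# Synthetic two-row sample standing in for the real file.
SAMPLE = (
    "subject_id,text,social_support,occupation,substance_use,complication\n"
    '1,"Synthetic note: lives with partner, works as a teacher.",1,1,0,1\n'
    '2,"Synthetic note: denies tobacco or alcohol use.",0,0,0,0\n'
)
rows = load_annotations(io.StringIO(SAMPLE))
```

For a file on disk, the same function can be called on `open(path, newline="", encoding="utf-8")`, where `path` is wherever the downloaded file was saved.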
Descriptive Analysis:
After applying inclusion and exclusion criteria, 86 discharge summaries were selected from MIMIC-III, and 171 from MIMIC-IV. In MIMIC-III, 78 summaries were associated with patients experiencing pregnancy complications, and fewer than 10 represented normal pregnancies. The MIMIC-IV dataset included 154 complication cases and 17 normal pregnancies. Across the annotated samples, we observed meaningful differences in the frequency of reported SDoH. In MIMIC-III, 47.7% of discharge summaries mentioned social support, 22.1% included occupation-related information, and 39.5% contained documentation of substance use. In comparison, the MIMIC-IV annotations showed that 50.3% of records included social support, 24.0% mentioned occupation, and 27.5% mentioned substance use.
Dataset Statistics:
SDoH Category | Description | MIMIC-III (n = 86) | MIMIC-IV (n = 171)
Social Support | Number of summaries with identified social support | 41 | 86
Occupation | Number of summaries with occupation details | 19 | 41
Substance Use | Number of summaries indicating substance use | 34 | 47
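The percentages quoted in the descriptive analysis follow directly from the counts in the table. The short check below recomputes them; the variable names are illustrative, and the counts and totals are the ones stated above.

```python
# Totals and per-category counts as stated in the table above.
TOTALS = {"MIMIC-III": 86, "MIMIC-IV": 171}
COUNTS = {
    "MIMIC-III": {"social_support": 41, "occupation": 19, "substance_use": 34},
    "MIMIC-IV": {"social_support": 86, "occupation": 41, "substance_use": 47},
}


def prevalence(count: int, total: int) -> float:
    """Percentage of summaries mentioning a category, rounded to one decimal."""
    return round(100 * count / total, 1)


PREVALENCE = {
    dataset: {category: prevalence(n, TOTALS[dataset]) for category, n in categories.items()}
    for dataset, categories in COUNTS.items()
}

for dataset, categories in PREVALENCE.items():
    for category, pct in categories.items():
        print(f"{dataset:9s} {category:15s} {pct:5.1f}%")
```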
Three annotators (NS, MB, JS) annotated the notes, with each note examined by two annotators. Inter-annotator agreement was measured using Cohen's kappa. Disagreements were resolved through consensus discussion among the annotators to reach a unified decision, ensuring annotation consistency and quality.
For detailed dataset statistics and further distribution information, please refer to the accompanying documentation in the repository.
Usage Notes
This dataset was developed to facilitate research in extracting SDoH from clinical notes using NLP. The associated paper provides detailed documentation on annotation guidelines, inter-annotator agreement analysis, and dataset characteristics. The dataset includes annotations for social support, occupation, and substance use, extracted from MIMIC-III and MIMIC-IV discharge summaries.
To ensure data consistency, missing and NA values have been consolidated under a single category for each SDoH variable. Users should carefully review the dataset structure and annotation guidelines before applying this dataset in their research.
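When combining this dataset with other annotations, a similar consolidation step may be needed on the other source. The sketch below assumes that missing and NA values map to the absent category (0), which is consistent with the annotation categories describing undocumented information as absent, but that mapping is an assumption and should be checked against the annotation guidelines. The names `MISSING_TOKENS` and `consolidate_label` are hypothetical.

```python
from typing import Optional

# Spellings treated as missing; this list is an assumption, not from the dataset.
MISSING_TOKENS = {"", "na", "n/a", "nan", "none"}


def consolidate_label(raw: Optional[str]) -> int:
    """Map a raw annotation value to 0/1, folding missing values into 0.

    Assumption: missing or NA is treated the same as 'absent'.
    """
    if raw is None:
        return 0
    token = str(raw).strip().lower()
    if token in MISSING_TOKENS:
        return 0
    if token in {"1", "1.0"}:
        return 1
    if token in {"0", "0.0"}:
        return 0
    # Fail loudly on unexpected values instead of guessing.
    raise ValueError(f"unexpected annotation value: {raw!r}")
```

Raising on unknown values keeps silent mislabelling out of downstream models.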
Important Considerations:
- The dataset has been de-identified to comply with MIMIC data access policies.
- The annotation definitions follow a strict protocol, which may not align with broader interpretations.
- Researchers are encouraged to refer to the archived paper for details on annotation methodology, agreement metrics, and validation procedures.
For complete documentation on data preprocessing, model development, and validation, visit our GitHub repository.
Release Notes
Version 1.0.0 – Initial Release
- Public release of the annotated MIMIC-III and MIMIC-IV dataset.
- Code scripts for data extraction, annotation, and model development/validation.
- Comprehensive documentation and usage instructions.
Ethics
This project builds upon the publicly available MIMIC-III and MIMIC-IV datasets, both of which consist of de-identified health information collected under Institutional Review Board (IRB) protocols approved by the Massachusetts Institute of Technology (MIT) and Beth Israel Deaconess Medical Center. Access to these datasets is credentialed and governed to protect patient confidentiality.
The secondary analysis conducted in this project utilized only de-identified data and complies with all relevant data use agreements. In addition, this study received approval from the IRB at the Johns Hopkins University School of Medicine (IRB Protocol IRB00467867).
Conflicts of Interest
The authors declare no conflicts of interest. This work was conducted independently without any financial or personal relationships that could inappropriately influence the project outcomes.
References
1. Walker SL, et al. Examining the relationship between social determinants of health and adverse pregnancy outcomes in Black women. Am J Perinatol 2023 Jul 21; doi:10.1055/s-0043-1771256.
2. Osterman M, Hamilton B, Martin JA, Driscoll AK, Valenzuela CP. Births: final data for 2020. Natl Vital Stat Rep 2021;70(17):1–50.
3. Yee LM, Miller EC, Greenland P. Mitigating the long-term health risks of adverse pregnancy outcomes. JAMA 2022;327(05):421–422.
4. Kramer MS. The epidemiology of adverse pregnancy outcomes: an overview. J Nutr 2003;133(5 Suppl 2):1592S–1596S.
5. US Office of Disease Prevention and Health Promotion. Social determinants of health. 2022. Available from: https://www.healthypeople.gov/2020/topics-objectives/topic/social-determinants-of-health?topicid=39.
6. World Health Organization. A conceptual framework for action on the social determinants of health. 2007. Available from: https://apps.who.int/iris/handle/10665/44489.
7. Lituiev DS, Lacar B, Pak S, Abramowitsch PL, De Marchis EH, Peterson TA. Automatic extraction of social determinants of health from medical notes of chronic lower back pain patients. J Am Med Inform Assoc 2023;30(8):1438–1447. doi:10.1093/jamia/ocad054.
8. Feller DJ, Bear Don’t Walk OJ IV, Zucker J, et al. Detecting social and behavioral determinants of health with structured and free-text clinical data. Appl Clin Inform 2020;11(1):172–181.
9. Lituiev DS, Lacar B, Pak S, Abramowitsch P, De Marchis E, Peterson T. Automatic extraction of social determinants of health from medical notes of chronic lower back pain patients. medRxiv 2022.03.04.22271541; doi:10.1101/2022.03.04.22271541.
10. Han S, Zhang RF, Shi L, et al. Classifying social determinants of health from unstructured electronic health records using deep learning-based natural language processing. J Biomed Inform 2022;127:103984.
11. Stemerman R, Arguello J, Brice J, et al. Identification of social determinants of health using multi-label classification of electronic health record clinical notes. JAMIA Open 2021;4(3):ooaa069.
12. Lybarger K, Yetisgen M, Uzuner Ö. The 2022 n2c2/UW shared task on extracting social determinants of health. J Am Med Inform Assoc 2023; ocad012. doi:10.1093/jamia/ocad012.
13. Richie R, Ruiz VM, Han S, Shi L, Tsui FR. Extracting social determinants of health events with transformer-based multitask, multilabel named entity recognition. J Am Med Inform Assoc 2023;30(8):1379–1388. doi:10.1093/jamia/ocad046.
14. Patra BG, Sharma MM, Vekaria V, et al. Extracting social determinants of health from electronic health records using natural language processing: a systematic review. J Am Med Inform Assoc 2021;28(12):2716–2727. doi:10.1093/jamia/ocab170.
15. Girardi G, Longo M, Bremer AA. Social determinants of health in pregnant individuals from underrepresented, understudied, and underreported populations in the United States. Int J Equity Health 2023;22(1):186. doi:10.1186/s12939-023-01963-x.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/qk2y-wx30
DOI (latest version):
https://doi.org/10.13026/eahw-b275
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project