Characterization of Stigmatizing Language in Medical Records

Keith Harrigian, Ayah Zirikly, Brant Chee, Alya Ahmad, Anne Links, Somnath Saha, Mary Catherine Beach, Mark Dredze

Published: Nov. 6, 2023. Version: 1.0.0

When using this resource, please cite:
Harrigian, K., Zirikly, A., Chee, B., Ahmad, A., Links, A., Saha, S., Beach, M. C., & Dredze, M. (2023). Characterization of Stigmatizing Language in Medical Records (version 1.0.0). PhysioNet.

Additionally, please cite the original publication:

Harrigian, K., Zirikly, A., Chee, B., Ahmad, A., Links, A. R., Saha, S., … Dredze, M. (2023). Characterization of Stigmatizing Language in Medical Records. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL).

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Widespread disparities in clinical outcomes exist between different demographic groups in the United States. A new line of work in medical sociology has demonstrated that physicians often use stigmatizing language in the electronic medical records of patients from certain groups, such as black patients, which may exacerbate disparities. The first step to addressing the presence of stigmatizing language in medical records is identifying it and characterizing its impact. Towards this end, we release a suite of neural and non-neural classifiers trained in a supervised manner to recognize three types of stigmatizing language found in discharge notes from the MIMIC-IV dataset. We also release the set of 5,043 annotations from 4,710 notes (4,259 patients) used to train and evaluate our models. These resources provide a foundation for NLP researchers to contribute timely insights to a problem domain brought to the forefront by recent legislation regarding clinical documentation transparency.


Widespread and well-documented disparities in healthcare outcomes between demographic groups exist within the United States. The sources of these disparities are diverse and complex, with numerous interacting factors contributing to worse outcomes for minority patients [1,2]. One source of disparities may stem from latent biases of healthcare providers. Multiple studies have highlighted the tendency for providers to prescribe different treatment plans to black patients compared to white patients despite having similar clinical dispositions. Elevated implicit bias scores have been associated with these decisions and have been further linked with decreased levels of patient-provider communication. A major challenge with these biases is that they are invoked unconsciously.

Prior studies of stigmatizing language in clinical notes have relied on qualitative methods or refrained from analyzing computational nuances of the problem domain [3]. [4] was the first to use machine learning to analyze stigmatizing language in medical records. The authors identified sentences with possible bias using a manually-curated word list and then annotated whether each match was positive, negative, or out-of-context. A logistic regression classifier trained on a bag-of-words representation of the text achieved good performance (F1 of 0.935). Unfortunately, the authors did not provide a baseline to indicate how valuable context around the seed terms is for classification. They were also unable to release their model due to it being trained on private, identifiable patient data.

In [5], we introduce models and insights which significantly expand the foundation established by [4]. Specifically, we make three main contributions:

  1. We demonstrate that characterizing stigmatizing language in medical records can be thought of as a word-sense-disambiguation task more than a sequence classification task.
  2. We show that semantic representations of terms which typically "anchor" stigmatizing language in medical records do not encode sex or racial group information.
  3. We highlight linguistic nuances that arise disproportionately across different patient populations which can limit model generalization.

We release our models (and annotations used to train them) under credentialed access for use by the larger research community. We have opted to keep our labels and models behind a gate for a few reasons. First, although we do not expect our training procedure to encode sensitive information regarding the MIMIC dataset, the risk is nonzero and worth respecting. Furthermore, if we release models in the future which do allow end-users to extract sensitive information, existing end-users will be able to acquire them with minimal additional overhead.

Model Description

We provide a mixture of contextual (e.g., BERT) and non-contextual (e.g., bag of words) models.

  • Majority Overall - This is a count based classifier which outputs the majority label seen at training time.
  • Majority Per Anchor - This is a count based classifier which outputs the majority label for each anchor seen at training time.
  • Logistic Regression (Context Only TF-IDF) - This is a statistical classifier which uses 10 words to the left and 10 words to the right of a stigmatizing language anchor to make predictions.
  • Logistic Regression (Context + Anchor TF-IDF) - This is a statistical classifier which uses 10 words to the left and 10 words to the right of a stigmatizing language anchor, in addition to the anchor itself, to make predictions.
  • Base BERT - This is a BERT encoder with a linear output layer stacked on top. The BERT encoder was pretrained on web data (bert-base-uncased) [11]. The linear output layer ingests a mean-pooled representation of the embeddings for each token that makes up the stigmatizing anchor.
  • Clinical BERT - This is a BERT encoder with a linear output layer stacked on top. The BERT encoder was pretrained on the MIMIC-III dataset, in addition to other biomedical text [12]. The linear output layer ingests a mean-pooled representation of the embeddings for each token that makes up the stigmatizing anchor.
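As a rough illustration of the two count-based baselines listed above, the sketch below implements a per-anchor majority classifier in plain Python. The class name, the toy anchors, and the fallback to the overall majority label for unseen anchors are assumptions made for this example; they are not taken from the released code.

```python
from collections import Counter, defaultdict


class MajorityPerAnchor:
    """Toy count-based baseline: predicts the most common training label per anchor."""

    def fit(self, anchors, labels):
        self.overall_ = Counter(labels)
        self.per_anchor_ = defaultdict(Counter)
        for anchor, label in zip(anchors, labels):
            self.per_anchor_[anchor][label] += 1
        return self

    def predict(self, anchors):
        # Assumption: anchors unseen at training time fall back to the
        # overall majority label (the "Majority Overall" behavior).
        fallback = self.overall_.most_common(1)[0][0]
        return [
            self.per_anchor_[a].most_common(1)[0][0] if a in self.per_anchor_ else fallback
            for a in anchors
        ]


# Toy training data (illustrative anchors and labels only).
train_anchors = ["adamant", "adamant", "adamant", "claims", "claims", "insists"]
train_labels = ["difficult", "difficult", "disbelief", "disbelief", "disbelief", "disbelief"]

model = MajorityPerAnchor().fit(train_anchors, train_labels)
print(model.predict(["adamant", "claims", "unseen"]))
```

Because these baselines ignore the surrounding context entirely, they give a useful lower bound on how much of each task can be solved from the anchor alone.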

Each class of model has three classifiers associated with it, one for each stigmatizing language classification task.
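To make the BERT output layer described above more concrete, the following sketch mean-pools the embeddings of the anchor tokens and passes the pooled vector through a linear layer. It uses plain Python lists with toy 2-dimensional embeddings and hand-picked weights; the function names are hypothetical and the real models operate on encoder outputs in PyTorch.

```python
def mean_pool(token_embeddings, anchor_positions):
    """Average the embedding vectors of the tokens that make up the anchor."""
    selected = [token_embeddings[i] for i in anchor_positions]
    dim = len(selected[0])
    return [sum(vec[d] for vec in selected) / len(selected) for d in range(dim)]


def linear_layer(features, weights, bias):
    """Compute one logit per class: weights[c] . features + bias[c]."""
    return [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, bias)]


# Toy 2-dimensional "encoder outputs" for a 4-token input; the anchor is
# assumed to span tokens 1 and 2 (e.g., a keyword split into two tokens).
embeddings = [[0.0, 1.0], [2.0, 4.0], [4.0, 0.0], [1.0, 1.0]]
pooled = mean_pool(embeddings, anchor_positions=[1, 2])

# Three output classes, so three weight rows (values chosen arbitrarily).
logits = linear_layer(pooled, weights=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], bias=[0.0, 0.0, 0.0])
print(pooled, logits)
```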

Task | Keyword Category | Classes | Description
Credibility and Obstinacy | Adamant | Difficult, Disbelief, Exclude | Insinuation of doubt regarding a patient's testimony or describes the patient as obstinate.
Compliance | Compliance | Negative, Neutral, Positive | Patient does not appear to follow medical advice.
Descriptors | Other | Negative, Neutral, Positive, Exclude | Evaluates descriptions of patient behavior and demeanor.

Anchors (Stigmatizing Keywords)

Prior work has shown that stigmatizing language in medical records is often anchored by a set of seemingly innocuous trigger words [3]. We follow the lead of [4], building a two-stage system which first identifies candidate instances of stigmatizing language based on keywords and then classifies the nature of the language into one of the classes described above.

We take the union of keywords from [3] and [4] as our initial keyword set. We then add appearance-related terms that were identified during a qualitative exploration of the data (e.g., disheveled, unkempt, poorly-groomed). Our final list contains the following 89 keywords.

  "adamant": [
  "compliance": [
  "other": [
    "drug seeking",
    "narcotic seeking",
    "poorly groomed",
    "secondary gain",
    "well groomed",
Annotation Procedure

To train the models, we manually curated a set of 5,043 annotations (4,710 unique notes; 4,259 unique patients). Broadly, the annotation procedure consisted of two phases. During the first phase, we ran a search for instances of the stigmatizing keywords. Then, during the second phase, annotators labeled a random sample of the identified instances. A complete overview of the annotation protocol is provided in [5].

Search: Using regular expressions, we search the MIMIC-IV discharge note dataset for instances of the stigmatizing keywords enumerated above. We cache the starting character index, ending character index, and 10 words to the left and right of the matched span (based on whitespace). Duplicate text instances (left context, keyword, right context) are removed.
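The search step can be sketched with the standard `re` module. This is a minimal illustration under our own assumptions (case-insensitive matching on word boundaries, a hypothetical `search_keywords` helper, and de-duplication within a single text rather than across the whole dataset); the toolkit's actual search logic may differ.

```python
import re


def search_keywords(text, keywords, window=10):
    """Return unique keyword matches with character offsets and word context."""
    # Longer keywords first so multi-word phrases are preferred at a given position.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in ordered) + r")\b", re.IGNORECASE)
    seen, matches = set(), []
    for m in pattern.finditer(text):
        # Context is defined by whitespace tokenization, as in the description above.
        left = text[: m.start()].split()[-window:]
        right = text[m.end():].split()[:window]
        key = (" ".join(left), m.group(0).lower(), " ".join(right))
        if key in seen:  # drop duplicate (left context, keyword, right context) triples
            continue
        seen.add(key)
        matches.append({"start": m.start(), "end": m.end(), "keyword": m.group(0),
                        "left": left, "right": right})
    return matches


note = "Despite my best advice, the patient remains adamant about leaving the hospital today."
for hit in search_keywords(note, ["adamant", "drug seeking"]):
    print(hit)
```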

Sampling and Annotation: We undertake three rounds of annotation. Each keyword category (task) is completed separately.

  1. For each keyword in a given keyword category, we randomly select 30 instances from the pool of matches (or all available instances if there are fewer than 30). If fewer than 500 total samples have been drawn, we randomly select additional instances from the pool of matches for the keyword category until 500 instances have been sampled. Two annotators independently label the instances and then meet to resolve differences via discussion. The resolved labels were used for training the classifiers.
  2. The same sampling procedure from above is applied to select an additional set of instances. However, having observed high inter-annotator agreement (Cohen's Kappa > 0.75, see Figure 2 in [5]) during the first round, only one annotator labels this sample of instances. The single annotator's labels were used for training the classifiers.
  3. At this point in time, we identified additional verb tenses and parts of speech for our keyword list. We updated the candidate pool of matches to reflect these additional linguistic variations. We then randomly sampled 50 instances for each additional keyword (or all available instances if there were fewer than 50). A single annotator labeled this sample of instances. The single annotator's labels were used for training the classifiers.
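The per-keyword sampling with a top-up to a minimum total, as used in the first two rounds, might look roughly like the sketch below. The function name, the fixed seed, and the representation of the candidate pool as a dictionary of instance ids are assumptions for illustration only.

```python
import random


def sample_round(pool, per_keyword=30, minimum_total=500, seed=0):
    """pool maps keyword -> list of candidate instance ids for one keyword category."""
    rng = random.Random(seed)
    chosen = []
    for keyword in sorted(pool):
        instances = pool[keyword]
        # Take per_keyword instances, or everything if fewer are available.
        chosen.extend(rng.sample(instances, min(per_keyword, len(instances))))
    # Top up from the remaining candidates until the minimum is reached
    # (or the pool is exhausted).
    if len(chosen) < minimum_total:
        taken = set(chosen)
        remaining = [i for insts in pool.values() for i in insts if i not in taken]
        extra = min(minimum_total - len(chosen), len(remaining))
        chosen.extend(rng.sample(remaining, extra))
    return chosen


# Toy pool: two frequent keywords and one rare keyword (ids are arbitrary).
toy_pool = {"adamant": list(range(0, 400)), "claims": list(range(400, 1000)), "insists": [1000, 1001]}
sample = sample_round(toy_pool)
print(len(sample), len(set(sample)))
```

Sampling a fixed number per keyword first keeps rare keywords represented, while the top-up step lets frequent keywords fill the remaining budget.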

Annotators were provided up to 10 words to the left and right of the stigmatizing keyword, as well as the keyword itself, when assigning labels. Pilot annotation runs suggested that this context size was sufficient for allowing the annotator to appropriately label the vast majority of instances, while also not increasing cognitive overhead by showing them the entire note. Three annotators were responsible for labeling all data used in our study -- one clinician and two research assistants. Each annotator's original label, as well as the resolved label, is provided in the annotation file that we distribute.
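Since each annotator's original label is distributed alongside the resolved label, agreement statistics such as the Cohen's Kappa mentioned for the first annotation round can be recomputed for doubly-annotated instances. The sketch below is a plain-Python implementation of the standard formula on toy labels; it is not the authors' evaluation code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy labels from two annotators on the same six instances.
annotator_1 = ["negative", "negative", "neutral", "positive", "negative", "neutral"]
annotator_2 = ["negative", "negative", "neutral", "positive", "neutral", "neutral"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```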

The final distribution of labels used for training the models is enumerated below.

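Label Distribution

Task | Class Label | # Spans | # Notes | # Patients
Credibility & Obstinacy | Difficult | 526 | 510 | 496
Credibility & Obstinacy | Disbelief | 609 | 593 | 583
Credibility & Obstinacy | Exclude | 115 | 114 | 114
Compliance | Negative | 893 | 867 | 833
Compliance | Neutral | 439 | 433 | 426
Compliance | Positive | 271 | 266 | 262
Descriptors | Negative | 1221 | 1137 | 1046
Descriptors | Neutral | 96 | 95 | 95
Descriptors | Positive | 377 | 373 | 369
Descriptors | Exclude | 496 | 482 | 480

Choosing a Model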

Each classifier should only be applied to candidate instances of stigmatizing language containing anchors (keywords) associated with the classifier's respective task. To learn more about the task structure and output classes, please read Table 4 in [5].

If compute allows, we recommend using one of the BERT models due to their improved accuracy over the non-contextual models. The Base BERT and Clinical BERT models perform roughly equivalently. Future work is needed to determine whether clinical knowledge is necessary to characterize stigmatizing language in medical records. All model training procedures and performance estimates are listed in full within [5].

Technical Implementation

The logistic regression, non-contextual models use scikit-learn [6] for data transformations and classifier training. For the TF-IDF representations, we use an $\ell_2$ row-wise norm. As a classifier, we use multinomial logistic regression optimized using lbfgs [7]. We balance class weights and perform a grid search over the following $\ell_2$ regularization parameters: 0.01, 0.03, 0.1, 0.3, 1, 3, 5, 10. The model which maximizes macro F1-score in each training split’s associated development set is chosen for application on the test set.
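To give a sense of what balancing class weights does on skewed label distributions, the sketch below applies the common "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is the formula scikit-learn uses for `class_weight="balanced"`. Whether the released models were trained with exactly this option is an assumption, and the counts used here are the full-dataset span counts for the Credibility & Obstinacy task rather than the counts in any particular training split.

```python
def balanced_class_weights(class_counts):
    """weight[c] = n_samples / (n_classes * count[c])."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {c: n_samples / (n_classes * k) for c, k in class_counts.items()}


# Span counts for the Credibility & Obstinacy task from the label distribution table.
weights = balanced_class_weights({"difficult": 526, "disbelief": 609, "exclude": 115})
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

Under this heuristic the rare Exclude class receives a much larger weight than the two frequent classes, so errors on it contribute more to the training loss.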

We use Hugging Face’s transformers library [8] to initialize all BERT models and fine-tune them using code written in PyTorch [9]. We train all models using a batch size of 16, a fixed learning rate of 5e-5, a dropout probability of 0.1, and class-balanced cross-entropy loss. As an optimizer, we use AdamW [10]. We evaluate the model every 50 updates and save the model which maximizes macro F1-score on the training split’s associated development data. Due to compute limitations in our HIPAA-compliant environment (i.e., limited GPU access), we do an initial exploration of the $\ell_2$ regularization strength on one split of the data for each classification task. We find the regularization strength to have minimal effect on performance for decay values of 1e-5, 1e-4, and 1e-3; we set a decay weight of 1e-5 for all remaining experiments.
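Both model families are selected by macro F1-score on development data. For readers unfamiliar with the metric, here is a small plain-Python sketch that computes it as the unweighted mean of per-class F1 scores; the labels are toy values and the function is illustrative rather than the evaluation code used for the published results.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["negative", "negative", "negative", "neutral", "positive", "positive"]
y_pred = ["negative", "negative", "neutral", "neutral", "positive", "negative"]
print(round(macro_f1(y_true, y_pred), 3))
```

Because every class contributes equally regardless of its frequency, macro F1 penalizes models that perform well only on the majority class.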

All models were trained in a HIPAA-compliant remote computing environment secured with OS-level group permissions. We used servers outfitted with NVIDIA Tesla M60 GPUs (2 x 8 GB VRAM) and Intel Xeon E5-3698 CPUs (2.20 GHz).

Installation and Requirements

Code (Training and Inference)

To interact with our models and annotations, we strongly recommend using our complementary Python package [13]. The repository will allow you to download these resources directly from PhysioNet (after completing the appropriate usage agreement). We cannot guarantee our models will behave appropriately outside of this toolkit. Note that we have opted to keep our code in a public GitHub repository to facilitate community interaction (e.g., pull-requests) and expedite hot-fixes.

Compute Environment

Our models and supporting code were developed and tested using Python 3.10. We cannot guarantee that other versions of Python will support the entirety of our models or codebase. That said, we expect the majority of functionality to be preserved as long as you are using Python >= 3.7.

We strongly recommend using a virtual environment manager (e.g., conda) when working with this codebase. This will help limit unintended consequences that arise due to e.g., dependency upgrades.

Package Installation

Once your environment has been created and activated, you can install the stigma toolkit with a single command (executed from the root of the repository):

pip install -e .

This command will install all external dependencies, as well as the stigma package itself. It is extremely important to keep the -e (editable install) flag, as it will ensure default data and model paths are preserved.

Acquiring Resources

Once the stigma python package has been installed, you can download resources for model training and inference.


To replicate our experiments or train new models, you will need access to the MIMIC-IV [14] and MIMIC-IV-Notes [15] datasets (v2.2). Both of these resources are hosted on PhysioNet and require completion of IRB-related training.

Once you have completed the credentialing process, you can easily acquire the minimally necessary data resources using our utility script ./scripts/acquire/ You will be asked for your PhysioNet username and password. Files will be downloaded to data/resources/datasets/mimic-iv/.

Labels and Models

Once you have completed our data usage agreement, you can use our utility script ./scripts/acquire/ to download the pretrained models and annotations. Models will be downloaded to data/resources/models/, while annotations will be downloaded to data/resources/annotations/.

If you have downloaded the MIMIC-IV dataset, you can create an augmented annotated dataset for training new models using scripts/acquire/

python scripts/acquire/ \
    --annotations_dir data/resources/annotations/ \
    --keywords data/resources/keywords/keywords.json \
    --load_chunksize 1000 \
    --load_window_size 10

If you'd only like to run the search procedure for notes in the annotated dataset, you can do so by adding the --load_annotations_only flag to the command above.


To validate that data and models were downloaded correctly and the package was installed appropriately, you can make use of our small test suite.

  • pytest -v -Wignore tests/ Ensures we are able to load the MIMIC-IV dataset and annotations as expected.
  • pytest -v -Wignore tests/ Ensures we are able to load default models and arrive at expected predictions.

Usage Notes


For a quick introduction to our API, we recommend exploring the quickstart notebook in our code repo [13]. We have abstracted most of the codebase into a few modules to make interacting with the pretrained models easy.

## Import API Modules
from stigma import StigmaSearch
from stigma import StigmaBaselineModel, StigmaBertModel

## Examples of Clinical Notes
examples = [
    """
    Despite my best advice, the patient remains adamant about leaving the hospital today. 
    Social services is aware of the situation.
    """,
    """
    The patient claims they have remained sober since their last visit, though I smelled
    alcohol on their clothing.
    """
]

## Initialize Keyword Search Wrapper
search_tool = StigmaSearch(context_size=10)

## Run Keyword Search
search_results =

## Prepare Inputs for the Model
example_ids, example_keywords, example_text = search_tool.format_for_model(search_results=search_results,

## Initialize Model Wrapper
model = StigmaBertModel(model="mimic-iv-discharge_clinical-bert",

## Run Prediction Procedure
predictions = model.predict(text=example_text,

Model Identifiers

Each classifier can be uniquely identified by its model class and task (a.k.a. keyword category). The non-contextual models can be loaded using the StigmaBaselineModel class, while the contextual models can be loaded using the StigmaBertModel class. The following strings may be specified as the model argument to initialize your classifier.

  • "mimic-iv-discharge_majority_overall" - Majority Overall

  • "mimic-iv-discharge_majority_keyword" - Majority Per Anchor

  • "mimic-iv-discharge_logistic-regression_context" - Logistic Regression (Context Only)

  • "mimic-iv-discharge_logistic-regression_keyword-context" - Logistic Regression (Context + Anchors)

  • "mimic-iv-discharge_base-bert" - Fine-tuned with Base BERT encoder

  • "mimic-iv-discharge_clinical-bert" - Fine-tuned with Clinical BERT encoder
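If you select models programmatically, a small lookup like the one below can guard against pairing an identifier with the wrong wrapper class. The helper `wrapper_class_name` is hypothetical and only returns the class name as a string; it does not import or call the stigma package.

```python
# Identifiers listed above, grouped by the wrapper class that loads them.
BASELINE_IDENTIFIERS = {
    "mimic-iv-discharge_majority_overall",
    "mimic-iv-discharge_majority_keyword",
    "mimic-iv-discharge_logistic-regression_context",
    "mimic-iv-discharge_logistic-regression_keyword-context",
}
BERT_IDENTIFIERS = {
    "mimic-iv-discharge_base-bert",
    "mimic-iv-discharge_clinical-bert",
}


def wrapper_class_name(model_identifier):
    """Return the name of the wrapper class expected for a model identifier."""
    if model_identifier in BASELINE_IDENTIFIERS:
        return "StigmaBaselineModel"
    if model_identifier in BERT_IDENTIFIERS:
        return "StigmaBertModel"
    raise ValueError(f"Unknown model identifier: {model_identifier}")


print(wrapper_class_name("mimic-iv-discharge_clinical-bert"))
```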

Other Functionalities

Although the API shown above should be sufficient for most purposes, this repository contains a substantial amount of additional code which some users may find helpful. This includes scripts which may be used to reproduce our published results. The bash files contained in jobs/ showcase most of the functionalities. Please see the README file for more information about each set of commands.

Data Files

Three types of resources are contained within this PhysioNet project -- annotations, keywords, and models. These resource types are identified by their file structure. We provide a high-level overview of each resource type immediately below, and provide a file-by-file overview in the Repository Structure section thereafter.


Keywords used to identify candidate instances of stigmatizing language can be found in data/resources/keywords/keywords.json. There are 89 unique keywords (see full list above in the Model Description). The structure of the keyword file is shown below. The primary JSON keys ("keyword_category_X") delineate each type of stigmatizing language (task). Their respective values are a list of keywords associated with the type of stigmatizing language.
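A file with this layout can be read using the standard `json` module. The sketch below parses a toy in-memory stand-in with the three category keys ("adamant", "compliance", "other") and one illustrative keyword each; the keywords shown are placeholders, not the real 89-keyword list, and for the real file you would call `json.load` on the path given above.

```python
import json

# Toy stand-in for keywords.json: category -> list of keywords.
# The keywords here are placeholders chosen for illustration.
toy_keywords_json = """
{
  "adamant": ["adamant"],
  "compliance": ["compliance"],
  "other": ["unkempt"]
}
"""

keywords = json.loads(toy_keywords_json)

# Invert the mapping so each keyword points to its task (keyword category).
keyword_to_category = {kw: cat for cat, kws in keywords.items() for kw in kws}
print(keyword_to_category)
```

The inverted mapping is convenient when routing a matched keyword to the classifier for its task.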


Annotations used to train the models released in this project can be found in data/resources/annotations/annotations.csv. There are a total of 5,043 annotations (4,710 unique notes; 4,259 unique patients). The file contains the following fields:

  • encounter_note_id (str): This aligns with the "note_id" field from MIMIC-IV.
  • keyword_category (str): Indicates which task the annotation is associated with. Either "adamant", "compliance", or "other".
  • start (int): Starting character in the cleaned and normalized version of the associated MIMIC-IV note. See the clean_excel_text and normalize_excel_text functions in the stigma.text_utils module.
  • end (int): Ending character in the cleaned and normalized version of the associated MIMIC-IV note. See the clean_excel_text and normalize_excel_text functions in the stigma.text_utils module.
  • annotator_X (str): Label assigned by annotator X. Annotator 1 is a clinician, while Annotators 2 and 3 are non-clinician research assistants. Not all instances have multiple annotators.
  • label (str): Label assigned after discussion by annotators.
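As a hedged example of working with these fields, the sketch below parses a toy CSV with the standard `csv` module, casts the offsets to integers, and tallies resolved labels per task. The exact column names for the annotator fields (`annotator_1`, `annotator_2`, `annotator_3`), the use of empty strings for missing annotators, and the row values are all assumptions for illustration.

```python
import csv
import io
from collections import Counter

# Toy stand-in for annotations.csv (ids, offsets, and labels are made up).
toy_csv = io.StringIO(
    "encounter_note_id,keyword_category,start,end,annotator_1,annotator_2,annotator_3,label\n"
    "note-a,adamant,120,127,disbelief,difficult,,difficult\n"
    "note-b,compliance,45,55,negative,,,negative\n"
    "note-c,other,300,307,,neutral,,neutral\n"
)

rows = []
for row in csv.DictReader(toy_csv):
    # Character offsets refer to the cleaned and normalized note text.
    row["start"], row["end"] = int(row["start"]), int(row["end"])
    rows.append(row)

# Resolved label counts per task, and notes labeled by more than one annotator.
label_counts = Counter((r["keyword_category"], r["label"]) for r in rows)
multi_annotated = [r["encounter_note_id"] for r in rows
                   if sum(bool(r[f"annotator_{i}"]) for i in (1, 2, 3)) > 1]
print(label_counts, multi_annotated)
```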

Archives containing our models can be found in data/resources/models/. There are four archives. To learn about the individual files contained within each archive, please review the Repository Structure section below.

  • mimic-iv-discharge_baseline-majority/: Contains the Majority Overall models.
  • mimic-iv-discharge_baseline-statistical/: Contains the Majority Per Anchor, Logistic Regression (Context Only TF-IDF), Logistic Regression (Context + Anchor TF-IDF) models.
  • mimic-iv-discharge_base-bert/: Contains the Base BERT models.
  • mimic-iv-discharge_clinical-bert/: Contains the Clinical BERT models.

The easiest way to interact with our models is through the aforementioned API. As long as expected paths are maintained, you should be able to easily load models using straightforward keyword arguments.

Repository Structure

Upon downloading all resource data in your data/resources/ folder, you can expect the file structure enumerated in data/resources/structure.txt and replicated below. The annotations and keywords directories each include a single file which is described in detail above. Underneath the directory tree, we describe each file and path in the models subdirectory. 

├── annotations
│   └── annotations.csv
├── keywords
│   └── keywords.json
└── models
    ├── mimic-iv-discharge_base-bert
    │   ├── adamant_fold-0
    │   │   ├── checkpoint-250
    │   │   │   ├── init.pth
    │   │   │   └── model.pth
    │   │   ├── best_model.json
    │   │   ├── summary.json
    │   │   └── targets.json
    │   ├── compliance_fold-0
    │   │   ├── checkpoint-100
    │   │   │   ├── init.pth
    │   │   │   └── model.pth
    │   │   ├── best_model.json
    │   │   ├── summary.json
    │   │   └── targets.json
    │   └── other_fold-0
    │       ├── checkpoint-250
    │       │   ├── init.pth
    │       │   └── model.pth
    │       ├── best_model.json
    │       ├── summary.json
    │       └── targets.json
    ├── mimic-iv-discharge_clinical-bert
    │   ├── adamant_fold-0
    │   │   ├── checkpoint-50
    │   │   │   ├── init.pth
    │   │   │   └── model.pth
    │   │   ├── best_model.json
    │   │   ├── summary.json
    │   │   └── targets.json
    │   ├── compliance_fold-0
    │   │   ├── checkpoint-400
    │   │   │   ├── init.pth
    │   │   │   └── model.pth
    │   │   ├── best_model.json
    │   │   ├── summary.json
    │   │   └── targets.json
    │   └── other_fold-0
    │       ├── checkpoint-350
    │       │   ├── init.pth
    │       │   └── model.pth
    │       ├── best_model.json
    │       ├── summary.json
    │       └── targets.json
    ├── mimic-iv-discharge_baseline-majority
    │   ├── keyword
    │   │   └── majority
    │   │       ├── adamant_fold-0
    │   │       │   ├── classifier.joblib
    │   │       │   ├── preprocessor.joblib
    │   │       │   └── targets.txt
    │   │       ├── compliance_fold-0
    │   │       │   ├── classifier.joblib
    │   │       │   ├── preprocessor.joblib
    │   │       │   └── targets.txt
    │   │       └── other_fold-0
    │   │           ├── classifier.joblib
    │   │           ├── preprocessor.joblib
    │   │           └── targets.txt
    │   ├── model_settings.json
    │   ├── preprocessing.params.joblib
    │   └── vocabulary.joblib
    ├── mimic-iv-discharge_baseline-statistical
    │   ├── keyword
    │   │   └── linear
    │   │       ├── adamant_fold-0
    │   │       │   ├── classifier.joblib
    │   │       │   ├── preprocessor.joblib
    │   │       │   └── targets.txt
    │   │       ├── compliance_fold-0
    │   │       │   ├── classifier.joblib
    │   │       │   ├── preprocessor.joblib
    │   │       │   └── targets.txt
    │   │       └── other_fold-0
    │   │           ├── classifier.joblib
    │   │           ├── preprocessor.joblib
    │   │           └── targets.txt
    │   ├── keyword_tokens_tfidf
    │   │   └── linear
    │   │       ├── adamant_fold-0
    │   │       │   ├── classifier.joblib
    │   │       │   ├── preprocessor.joblib
    │   │       │   └── targets.txt
    │   │       ├── compliance_fold-0
    │   │       │   ├── classifier.joblib
    │   │       │   ├── preprocessor.joblib
    │   │       │   └── targets.txt
    │   │       └── other_fold-0
    │   │           ├── classifier.joblib
    │   │           ├── preprocessor.joblib
    │   │           └── targets.txt
    │   ├── tokens_tfidf
    │   │   └── linear
    │   │       ├── adamant_fold-0
    │   │       │   ├── classifier.joblib
    │   │       │   ├── preprocessor.joblib
    │   │       │   └── targets.txt
    │   │       ├── compliance_fold-0
    │   │       │   ├── classifier.joblib
    │   │       │   ├── preprocessor.joblib
    │   │       │   └── targets.txt
    │   │       └── other_fold-0
    │   │           ├── classifier.joblib
    │   │           ├── preprocessor.joblib
    │   │           └── targets.txt
    │   ├── model_settings.json
    │   ├── preprocessing.params.joblib
    │   └── vocabulary.joblib

Non-Contextual (Non-BERT) Models

The baseline, non-BERT models are found in the following directories:

  • mimic-iv-discharge_baseline-majority/keyword/majority/: The Majority Overall models.
  • mimic-iv-discharge_baseline-statistical/keyword/linear/: The Majority Per Anchor models.
  • mimic-iv-discharge_baseline-statistical/tokens_tfidf/linear/: The Context Only TF-IDF Logistic Regression models.
  • mimic-iv-discharge_baseline-statistical/keyword_tokens_tfidf/linear/: The Context + Anchor TF-IDF Logistic Regression Models.

Under the first folder level (e.g., mimic-iv-discharge_baseline-statistical/) are three files. To see how they are generated, we recommend reviewing the scripts/model/train/ script in our GitHub repository [13].

  • model_settings.json: A copy of the configuration used for training the models within the directory. Includes the feature set and any associated arguments (e.g., row-wise norm).
  • preprocessing.params.joblib: Parameters used for preprocessing the text data prior to training the models contained within the directory. Also includes (if applicable) any stigma.text_utils.PhraseLearner objects (i.e., transforms to merge separate tokens into phrases).
  • vocabulary.joblib: A list of vocabulary terms identified in the training dataset. Not explicitly necessary for downstream processing, as this is replicated within the preprocessor.

Under the final level (e.g., mimic-iv-discharge_baseline-statistical/keyword/linear/) are folders for the Credibility and Obstinacy (adamant_fold-0/), Compliance (compliance_fold-0/), and Descriptors (other_fold-0/) tasks. The fold-X suffix is an artifact of the general training process which lends support for cross validation. Because these models were trained on a single split of data, there is only one folder (with fold-0 as a suffix) for each task. Within each of these task folders are three files:

  • classifier.joblib: Either an instance of a scikit-learn linear_model.LogisticRegression classifier or an internal stigma.model.baseline.ConditionalMajorityClassifier.
  • preprocessor.joblib: An instance of stigma.model.util.FeaturePreprocessor with any learned transformations (e.g., TF-IDF term weights).
  • targets.txt: A newline delimited list of classes which align with the output probability array from the classifier.

Contextual (BERT) Models

The BERT Base and Clinical BERT task models are provided in the mimic-iv-discharge_base-bert/ and mimic-iv-discharge_clinical-bert/ folders, respectively. Within each of these folders, you will find three subfolders: adamant_fold-0/, compliance_fold-0/, and other_fold-0/. They correspond to models for the Credibility & Obstinacy, Compliance, and Descriptors tasks, respectively. If you train new models, the "fold-X" suffix will indicate the cross-validation iteration X upon which the model is trained. The models released here were trained on a single train/validation/test split, but maintain the suffix-naming pattern for consistency.

Within each of the task subfolders (e.g., compliance_fold-0/), you will find a checkpoint-X/ folder. The number in place of the X indicates which training update was identified as maximizing validation set performance (macro F1-score). For example, checkpoint-250 indicates that the model which had completed 250 gradient updates was selected as being the "optimal" model for the validation data. During training, we evaluate each task model on the validation data every 50 gradient updates. We only update the "optimal" model checkpoint if we achieve a greater than 1% increase in macro-F1 score on the training run's associated validation set.
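The checkpoint selection rule described above can be sketched as a simple loop over development-set scores. Here we assume the "greater than 1% increase" is an absolute gain of 0.01 in macro F1 (a relative gain is another possible reading), and the function name and score history are made up for the example.

```python
def select_checkpoint(dev_scores, eval_every=50, min_gain=0.01):
    """dev_scores[i] is the development macro F1 after (i + 1) * eval_every updates."""
    best_step, best_score = None, float("-inf")
    for i, score in enumerate(dev_scores):
        # Only replace the stored checkpoint when the gain exceeds the threshold
        # (assumed here to be an absolute difference in macro F1).
        if score > best_score + min_gain:
            best_step, best_score = (i + 1) * eval_every, score
    return best_step, best_score


# Toy history: evaluations at 50, 100, 150, 200, 250, and 300 updates.
history = [0.52, 0.61, 0.615, 0.66, 0.668, 0.64]
print(select_checkpoint(history))
```

In this toy history the evaluation at 250 updates scores slightly higher than the one at 200, but the gain is below the threshold, so the checkpoint from 200 updates is kept.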

The checkpoint-X/ folders contain two files, both of which can be loaded with  torch.load(...) . To see this process in action, please look at the  StigmaBertModel._initialize_model  method in our API.

  • init.pth: This is a dictionary which contains the training and validation loss history up to checkpoint-X, as well as the parameters passed to the model class object and the training parameters (e.g., number of training epochs, optimization metric, batch size).
  • model.pth: This contains the model weights (encoder and classification layer). These are passed to the model's load_state_dict method.

Within each of the task subfolders, you will also find three JSON files. These are not necessary for loading the models, but rather exist as artifacts of the training process. To learn more, we recommend checking out the scripts/model/train/ script in our GitHub repository [13].

  • best_model.json: The data in this file indicates the step at which the model maximized its validation set performance (e.g., macro F1 score). The initial "0" key is a legacy artifact from prior multi-task modeling experiments. The value in the "steps" field will correspond with the cached checkpoints folder.
  • summary.json: Very similar to init.pth files. This contains the training and validation loss over updates, training and validation performance (e.g., macro F1, accuracy) over updates, the parameters used to initialize the model class, and the parameters used for the training procedure (e.g., batch size, learning rate).
  • targets.json: This specifies the mapping from task name to the index of the classification head in our model, as well as the mapping from class label to output dimension index. In the example block below, we see that the Credibility & Obstinacy classification head is the first in the list of possible classification heads (again, an artifact of multi-task learning support). We also see that the probabilities output by the classification head will represent the Disbelief, Difficult, and Exclude classes in order.

{
  "task_2_id": {
    "adamant": 0
  },
  "task_class_2_id": {
    "adamant": {
      "disbelief": 0,
      "difficult": 1,
      "exclude": 2
    }
  }
}

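To show how the class-to-index mapping in targets.json can be used, the sketch below decodes a probability vector into a class label. The `decode` helper and the probability values are hypothetical; the mapping itself mirrors the Credibility & Obstinacy example from the targets.json description.

```python
import json

# Mapping in the same layout as targets.json for the "adamant" task.
targets = json.loads("""
{
  "task_2_id": {"adamant": 0},
  "task_class_2_id": {"adamant": {"disbelief": 0, "difficult": 1, "exclude": 2}}
}
""")


def decode(probabilities, task, targets):
    """Map the highest-probability output dimension back to its class label."""
    class_2_id = targets["task_class_2_id"][task]
    id_2_class = {index: label for label, index in class_2_id.items()}
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return id_2_class[best]


# Made-up probabilities ordered as (disbelief, difficult, exclude).
print(decode([0.2, 0.7, 0.1], "adamant", targets))
```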

IRB Approval and Security

Our datasets were collected from real patients, contain protected health information (PHI), and are subject to HIPAA regulations. As a result, we took the utmost care to maintain data integrity and privacy. First, we obtained IRB approval to access and process the data. Second, we obtained permission and approval for all applications and libraries used to process the data. Third, data storage and computational experimentation was done on IRB-approved platforms.


Data: Our relatively small dataset size limits our analysis, especially with the use of language models. Furthermore, the label distribution is skewed across the different specialties (domains), which affects model performance, robustness, and generalizability. The differences in distribution might result from how the data was collected, which was done without regard to the anchor words, or from the nature of the domain and/or the language used by medical providers in that specialty. Furthermore, the time frame from which the data was sampled might manifest biases that differ from those of other time frames. Finally, our datasets are only representative of a small number of specialties from two medical institutions. Patient populations and providers may vary greatly across medical fields and additional institutions.

Task: The formulation of the labels for our task imposes limitations and challenges. Stigmatizing language is subjective and can vary between the perspective of the patient and that of the medical provider. As a result, we are aware that our medical experts' annotations might impose a bias. Additionally, the negative connotations of language might be ambiguous and can change depending on a medical expert's identity, background, and specialty, which creates a bias that is hard to mitigate.

Computational Resources: We only used IRB-approved servers to access the dataset and perform the experiments. Because these platforms had limited computational capacity and lacked the specifications required to build more complex neural models, we were not able to include more recent language models in our experiments that might have yielded better performance. In the future, we hope to have access to machines that support more recent and state-of-the-art models.


We thank Yahan Li for adding CUDA accelerated training and inference.

This work was supported by the National Institute on Minority Health and Health Disparities under grant number R01 MD017048. The content is solely the responsibility of the authors and does not necessarily represent the official views of NIMHD, NIH, or Johns Hopkins University. 

Conflicts of Interest

We have no conflicts of interest to declare.


  1. Nazer LH, Zatarah R, Waldrip S, Ke JX, Moukheiber M, Khanna AK, Hicklen RS, Moukheiber L, Moukheiber D, Ma H, Mathur P. Bias in artificial intelligence algorithms and recommendations for mitigation. PLOS Digital Health. 2023 Jun 22;2(6):e0000278.
  2. Holmes Fee C, Hicklen RS, Jean S, Abu Hussein N, Moukheiber L, de Lota MF, Moukheiber M, Moukheiber D, Anthony Celi L, Dankwa-Mullan I. Strategies and solutions to address Digital Determinants of Health (DDOH) across underinvested communities. PLOS Digital Health. 2023 Oct 12;2(10):e0000314.
  3. Park J, Saha S, Chee B, Taylor J, Beach MC. Physician use of stigmatizing language in patient medical records. JAMA Network Open. 2021 Jul 1;4(7):e2117052.
  4. Sun M, Oliwa T, Peek ME, Tung EL. Negative Patient Descriptors: Documenting Racial Bias In The Electronic Health Record. Health Affairs. 2022 Feb 1;41(2):203-11.
  5. Harrigian K, Zirikly A, Chee B, Ahmad A, Links AR, Saha S, Beach MC, Dredze M. Characterization of Stigmatizing Language in Medical Records. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL). 2023 Jul 9;2:312-329.
  6. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011 Nov 1;12:2825-30.
  7. Zhu C, Byrd RH, Lu P, Nocedal J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS). 1997 Dec 1;23(4):550-60.
  8. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Davison J. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020 Oct (pp. 38-45).
  9. Imambi S, Prakash KB, Kanagachidambaresan GR. PyTorch. Programming with TensorFlow: Solution for Edge Computing Applications. 2021:87-104.
  10. Loshchilov I, Hutter F. Decoupled Weight Decay Regularization. In International Conference on Learning Representations 2018 Sep 27.
  11. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018 Oct 11.
  12. Alsentzer E, Murphy JR, Boag W, Weng WH, Jin D, Naumann T, Redmond WA, McDermott MB. Publicly Available Clinical BERT Embeddings. NAACL HLT 2019. 2019 Jun 7:72.
  13. GitHub Repository [Accessed 9/25/2023]
  14. Johnson A, Bulgarelli L, Pollard T, Horng S, Celi L A, Mark R. MIMIC-IV (version 2.2). PhysioNet. 2023. Available from:
  15. Johnson A, Pollard T, Horng S, Celi L A, Mark R. MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2). PhysioNet. 2023. Available from:

Parent Projects
Characterization of Stigmatizing Language in Medical Records was derived from: Please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
