Software Open Access

Transformer-DeID: Deidentification of free-text clinical notes with transformers

Callandra Moore, Lucas Bulgarelli, Tom Pollard, Alistair Johnson

Published: Nov. 2, 2023. Version: 1.0.0


When using this resource, please cite:
Moore, C., Bulgarelli, L., Pollard, T., & Johnson, A. (2023). Transformer-DeID: Deidentification of free-text clinical notes with transformers (version 1.0.0). PhysioNet. https://doi.org/10.13026/7dj5-7x85.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

Sharing patient data such as clinical notes among clinicians and researchers is fundamental to advancements in clinical practice and biomedical research. Prior to distribution, it is often necessary and legally mandated to remove identifiers such as names, contact details, identification numbers, and dates to safeguard patient privacy. This process is known as deidentification. Here we provide a neural network model for the removal of patient identifiers from clinical text based upon Bidirectional Encoder Representations from Transformers (BERT) and two variations of BERT, namely Robustly Optimized BERT Approach (RoBERTa) and Distilled BERT (DistilBERT). The models demonstrated excellent performance on a publicly available benchmark and are made freely available to maximize reuse.


Background

The advent of large, open-access text corpora and the resurgence of neural networks have driven advances in state-of-the-art model performance in natural language processing [1-3]. However, progress has been hindered in the clinical text domain, primarily due to barriers in sharing clinical data. While many reasons for the lack of data sharing exist, chief among them is the risk of revealing patient information.

In the United States, the Health Insurance Portability and Accountability Act (HIPAA) provides a prescriptive framework for the safe and secure sharing of health information. HIPAA permits sharing of non-individually identifiable data through one of two mechanisms: expert statistical review or the "Safe Harbor" provision. The latter provision provides a set of identifiers which must be removed in order to consider a dataset "deidentified." Examples of these identifiers include patient names, medical record numbers, dates, and ages over 89 years [4]. Thus, deidentification of free-text clinical notes can be considered a specific form of named entity recognition (NER), where the entities correspond to protected health information (PHI). The i2b2 2014 challenge (now referred to as n2c2 2014) provided a benchmark dataset for deidentification [5].


Software Description

Model execution code is written in Python 3.10 and primarily relies on the pytorch [6] and transformers [7] libraries. A Transformer class provides functions to load model weights and apply the model to free text. Entry points allow the model to be trained directly from the command line.
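For example, once the fine-tuned weights are downloaded, a model can be applied to free text directly through the transformers library. The snippet below is a minimal sketch rather than the package's own interface; the model directory path and base tokenizer name are hypothetical.

# Minimal sketch (not the package's own API): load fine-tuned weights and tag a note.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_dir = "transformer_models/bert_deid"  # hypothetical path; point at a directory from this release
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(model_dir)

tagger = pipeline("token-classification", model=model, tokenizer=tokenizer,
                  aggregation_strategy="simple")
print(tagger("Mr. Smith was seen on 2010-03-15 at the downtown clinic."))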

The transformer_models directory contains fine-tuned weights for the three models trained over 100 epochs. Each of these directories consists of config.json (general model information), pytorch_model.bin (PyTorch dump of trained model weights), and training_args.bin (training arguments).

The tests directory contains unit tests used during development.

The transformer_deid directory contains the files used to train the transformer-based models for deidentification. These include the main file used to fine-tune a model from the command line (train.py); utilities to load data from the standoff format described below (load_data.py) and to tokenize and split text for each model (tokenization.py); classes to facilitate manipulation of entity labels (label.py) and annotated datasets (data.py); and a map from common entity types (e.g., "patient name" or "city") to those used in training (e.g., "name" or "location") (label.json).
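To illustrate, the mapping in label.json collapses source-specific entity types into the seven categories used during training. The keys and values below are illustrative, not a verbatim copy of the shipped file.

# Illustrative excerpt of the kind of mapping label.json encodes (keys/values may differ).
LABEL_MAP = {
    "patient": "name",
    "doctor": "name",
    "city": "location",
    "hospital": "location",
    "medicalrecord": "ID",
    "phone": "contact",
    "date": "date",
    "age": "age",
}

def harmonize(entity_type: str) -> str:
    # Map a dataset-specific entity type onto one of the training categories.
    return LABEL_MAP[entity_type.lower()]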

Finally, utilities are provided to convert from various existing dataset formats (e.g., the i2b2 2014 dataset) to the standoff format (convert_data_to_gs.py) and to establish a conda environment with which to run the package (environment.yml).


Technical Implementation

The algorithm takes advantage of pre-trained models, which use deep neural networks to generate vector representations of each token in the vocabulary of a large dataset through masking or other unsupervised learning objectives [8]. These pre-trained models can then be fine-tuned on a smaller dataset to improve their performance on specific tasks and in specific contexts. In particular, Transformer-DeID uses three transformer models built on an encoder with multi-head self-attention, all available through the HuggingFace transformers library: BERT, DistilBERT, and RoBERTa.

The bidirectional encoder representations from transformers (BERT) model has been trained using masked language modeling and next sentence prediction on the BookCorpus and English Wikipedia texts [8]. The architecture consists of L identical layers applied sequentially, where each layer is a single transformer block [9]. The outputs of the final layer are passed to a single linear fully connected layer with C outputs, where C is the number of classes. A detailed description of the model is provided in [10].
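Conceptually, this token-classification head is a single linear layer applied to the encoder's final hidden states. The sketch below is a simplified stand-in for what the transformers library's token-classification models implement internally; class and variable names are chosen for exposition.

# Simplified sketch of the architecture described above; in practice the package
# relies on the transformers library's token-classification models.
import torch.nn as nn
from transformers import BertModel

class BertTokenTagger(nn.Module):
    def __init__(self, num_labels: int):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-cased")   # L stacked transformer blocks
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)  # C outputs per token

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden)   # logits of shape (batch, sequence_length, C)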

DistilBERT is a distilled version of BERT trained on the same corpora as BERT; in addition to masked language modeling, it is pretrained with a distillation loss and a cosine embedding loss so that it closely matches BERT's output probabilities and hidden states [11].

RoBERTa, the robustly optimized BERT pre-training approach released by the University of Washington and Facebook, makes several improvements on BERT, including dynamic masking, a modified input format, and large-batch training [12].

Each of these models was fine-tuned on the i2b2 2014 training dataset [13] using the HuggingFace named-entity recognition toolkit. We use the training data defined by the i2b2 2014 competition. The training data are parsed from the original eXtensible Markup Language (XML) by convert_data_to_gs.py into text files and CSV files of the associated annotations, which include the entity type and start and stop indices. The source entity types (e.g., "patient" or "medicalrecord") are mapped to one of seven entity types ("name," "date," "age," "location," "ID," "contact," or "profession") using the mapping provided in label.json. The annotations (the entity type and start and stop indices) are organized in Label objects defined in label.py. Because all three models can label at most 512 tokens at a time, preprocessing splits each document into chunks of at most 512 tokens. Text is tokenized and encoded using the HuggingFace tokenizers used in the models' pre-training, and each token is labeled with its entity type (listed above, or "O" for non-PHI tokens) by tokenization.py. These encodings are organized in the DeidDataset class (defined in data.py), a child class of a PyTorch dataset.
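A sketch of the chunking and label-alignment step is given below; the actual interface of tokenization.py may differ. It relies on the fast tokenizers' character offset mappings to assign each sub-word token the label of the annotation span that contains it.

# Illustrative sketch of 512-token chunking and label alignment; tokenization.py
# may implement this differently.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def encode_with_labels(text, annotations, max_length=512):
    # annotations: list of (start, stop, entity_type) character spans for PHI entities.
    enc = tokenizer(text, return_offsets_mapping=True, truncation=True,
                    max_length=max_length, return_overflowing_tokens=True)
    chunk_labels = []
    for offsets in enc["offset_mapping"]:          # one offset list per 512-token chunk
        labels = []
        for start, stop in offsets:
            if start == stop:                      # special tokens such as [CLS] and [SEP]
                labels.append("O")
                continue
            label = "O"
            for a_start, a_stop, a_type in annotations:
                if start >= a_start and stop <= a_stop:
                    label = a_type
                    break
            labels.append(label)
        chunk_labels.append(labels)
    return enc, chunk_labels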

We trained each of the NER models for 100 epochs with a batch size of 8, a 768-dimensional hidden layer, and a learning rate of 5e-05. We evaluated the three models using the test dataset defined by the i2b2 2014 competition [13]. Our evaluation was conducted token-wise rather than entity-wise (e.g., labeling "John" and "Smith" as separate entities is valid) and on binary labels (i.e., a label is correct if the model marks it as PHI, regardless of the specific PHI type identified). Tokenization was conducted using the WordPunctTokenizer from the Python Natural Language Toolkit module [14]. Using this evaluation, we obtained F1-scores of 0.904, 0.905, and 0.924 and recall of 0.843, 0.844, and 0.869 for the BERT, DistilBERT, and RoBERTa models, respectively.
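The token-wise binary scoring can be reproduced along the following lines. This is a sketch rather than the project's evaluation script, and the helper names are illustrative.

# Illustrative token-wise binary evaluation; helper names are not from the package.
from nltk.tokenize import WordPunctTokenizer

def binary_token_labels(text, phi_spans):
    # Label each WordPunct token 1 if it lies inside any PHI character span, else 0.
    return [int(any(start >= s and end <= e for s, e in phi_spans))
            for start, end in WordPunctTokenizer().span_tokenize(text)]

def binary_f1(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0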


Installation and Requirements

The package requires Python 3.10 or higher with pip available for installing dependencies. We highly recommend creating a virtual environment to simplify the installation process. The environment.yml file allows for creation of a conda environment with the necessary dependencies. Trained models are available at [15].
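For example, from the repository root, the environment can be created with:

conda env create -f environment.yml

followed by conda activate with the environment name defined in that file.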


Usage Notes

Data must be in CSV stand-off format: a subfolder (txt/) contains the documents as individual text files, with the document identifier as the file stem and .txt as the extension. Another subfolder (ann/) contains a set of CSV files with the annotations, with the same document identifier as the file stem and .gs as the extension. These annotations must include a row for each protected entity listing the PHI type, the start and stop indices of the entity, and the entity itself. The document identifier for each document is a unique label used within the DeidDataset class to identify each clinical note. The tests/data subfolder contains an example of documents stored in this format.
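An illustrative layout is sketched below; the file name and the exact annotation column order are hypothetical, and the tests/data subfolder is the authoritative example.

<dataset path>/
    txt/
        note-01.txt   (free text of the clinical note)
    ann/
        note-01.gs    (CSV rows, one per PHI entity: its type, start and stop character indices, and the entity text)

A row in note-01.gs might therefore read along the lines of "date, 23, 33, 2010-03-15", with the column order shown here being illustrative only.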

To run from the repository directory, run

python transformer_deid/train.py -m <model_architecture> -i <dataset path> -o <output path> -e <number of epochs>
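For example, a hypothetical invocation (argument values shown here are illustrative, not prescribed by the package) might be:

python transformer_deid/train.py -m bert -i tests/data -o ./output -e 100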

For more information, see [16].


Release Notes

v 0.1.0


Ethics

Data used in this study were deidentified and acquired through n2c2. IRB approval was therefore not required.


Acknowledgements

We would like to thank the organizers of the i2b2 challenges for the creation of extremely useful benchmark datasets. Permission to share our model was explicitly provided by the organizers of the i2b2 2014 challenge.


Conflicts of Interest

None to declare.


References

  1. Halevy, Alon, Peter Norvig, and Fernando Pereira. "The Unreasonable Effectiveness of Data." IEEE Intelligent Systems 24, no. 2 (2009): 8-12. doi: 10.1109/MIS.2009.36. Available at: https://ieeexplore.ieee.org/document/4804817. [Accessed 17/10/2023]
  2. Russakovsky, Olga, et al. "ImageNet Large Scale Visual Recognition Challenge." International Journal of Computer Vision, vol. 115, no. 3, 2015, pp. 211-252. Available at: https://arxiv.org/abs/1409.0575. [Accessed 17/10/2023]
  3. Darji, Harshil, et al. "German BERT Model for Legal Named Entity Recognition." Proceedings of the 15th International Conference on Agents and Artificial Intelligence, 2023. SCITEPRESS - Science and Technology Publications, doi:10.5220/0011749400003393. Available at: https://arxiv.org/abs/2303.05388. [Accessed 17/10/2023]
  4. "Health Insurance Portability and Accountability Act." Public Law 104-191, 1996. Available at: https://www.congress.gov/bill/104th-congress/house-bill/3103. [Accessed 17/10/2023]
  5. Stubbs, Amber, Christopher Kotfila, and Özlem Uzuner. "Automated Systems for the De-identification of Longitudinal Clinical Narratives: Overview of 2014 i2b2/UTHealth Shared Task Track 1." Journal of Biomedical Informatics, vol. 58, 2015, pp. S11-S19. doi:10.1016/j.jbi.2015.06.007. Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4989908/. [Accessed 17/10/2023]
  6. Paszke, Adam, et al. "PyTorch: An Imperative Style, High-Performance Deep Learning Library." In Advances in Neural Information Processing Systems, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett. Vol. 32. 2019. Available at: https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf. [Accessed 17/10/2023]
  7. Wolf, Thomas, et al. "Transformers: State-of-the-Art Natural Language Processing." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38-45. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.emnlp-demos.6. Available at: https://aclanthology.org/2020.emnlp-demos.6. [Accessed 17/10/2023]
  8. Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171-4186. Available at: https://aclanthology.org/N19-1423/. [Accessed 17/10/2023]
  9. Vaswani, Ashish, et al. "Attention is All You Need." Advances in Neural Information Processing Systems, edited by I. Guyon et al., vol. 30, Curran Associates, Inc., 2017. Available at: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. [Accessed 17/10/2023]
  10. Johnson, Alistair E W et al. “Deidentification of free-text medical records using pre-trained bidirectional transformers.” Proceedings of the ACM Conference on Health, Inference, and Learning vol. 2020 (2020): 214-221. doi:10.1145/3368555.3384455. Available at: https://dl.acm.org/doi/10.1145/3368555.3384455. [Accessed 17/10/2023]
  11. Sanh, Victor, et al. "DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter." 2020, arXiv preprint arXiv:1910.01108. Available at: https://arxiv.org/abs/1910.01108?context=cs. [Accessed 17/10/2023]
  12. Liu, Yinhan, et al. "RoBERTa: A Robustly Optimized BERT Pretraining Approach." 2019, arXiv preprint arXiv:1907.11692. Available at: https://arxiv.org/abs/1907.11692. [Accessed 17/10/2023]
  13. Stubbs, Amber and Özlem Uzuner. "Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus." Journal of Biomedical Informatics 58 (2015): S20-S29. Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/. [Accessed 17/10/2023]
  14. Bird, Steven, Edward Loper, and Ewan Klein. Natural Language Processing with Python. O'Reilly Media Inc, 2009. Available at: https://www.nltk.org/book/. [Accessed 17/10/2023]
  15. “Kind Lab (SickKids Knowledge in Data Lab).” Hugging Face. Available at: https://huggingface.co/KindLab. [Accessed 17/10/2023]
  16. Moore, Callandra, et al. Transformer-DeID, version 1.0.0, 2023. Available at: https://github.com/kind-lab/transformer-deid. [Accessed 17/10/2023]

Access

Access Policy:
Anyone can access the files, as long as they conform to the terms of the specified license.

License (for files):
MIT License


Files

Total uncompressed size: 1.1 GB.

Name Size Modified
tests
transformer_deid
transformer_models
CITATION.cff 560 B 2023-09-08
LICENSE.txt 1.1 KB 2023-10-27
README.md 1.2 KB 2023-09-08
SHA256SUMS.txt 3.4 KB 2023-11-02
convert_data_to_gs.py 13.0 KB 2023-09-08
convert_to_xml.py 2.8 KB 2023-09-08
environment.yml 278 B 2023-09-08
eval_each_epoch.py 5.7 KB 2023-09-08
setup.py 494 B 2023-09-08
training_notebook_runbook.md 786 B 2023-09-08