Model Credentialed Access
EntityBERT: BERT-based Models Pretrained on MIMIC-III with or without Entity-centric Masking Strategy for the Clinical Domain
Chen Lin , Steven Bethard , Guergana Savova , Timothy Miller , Dmitriy Dligach
Published: March 17, 2022. Version: 1.0.1
When using this resource, please cite:
Lin, C., Bethard, S., Savova, G., Miller, T., & Dligach, D. (2022). EntityBERT: BERT-based Models Pretrained on MIMIC-III with or without Entity-centric Masking Strategy for the Clinical Domain (version 1.0.1). PhysioNet. https://doi.org/10.13026/e7kt-q579.
Lin, Chen, Timothy Miller, Dmitriy Dligach, Steven Bethard, and Guergana Savova. "EntityBERT: Entity-centric Masking Strategy for Model Pretraining for the Clinical Domain." In BioNLP 2021.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Transformer-based neural language models have led to breakthroughs for a variety of natural language processing (NLP) tasks. However, most models are pretrained on general domain data. We propose a methodology to produce a model focused on the clinical domain: continued pretraining of a model with a broad representation of biomedical terminology (PubMedBERT) on a clinical corpus (MIMIC-III) along with a novel entity-centric masking strategy to infuse domain knowledge in the learning process.
We curated the MIMIC-III corpus by annotating events (including diseases/disorders, signs/symptoms, medications, anatomical sites, and procedures) and time expressions (e.g. "yesterday", "this weekend", or an example date such as "02/31/2028") with special markers. Marked events and time expressions are randomly chosen, together with other words, at fixed rates to be masked when training the entity-centric masked language model. The models are therefore infused with clinical entity information and are well suited to entity-related clinical NLP tasks.
Transformer-based neural language models, such as BERT, XLNet, and BART, have achieved breakthrough performance on a variety of natural language processing (NLP) tasks. However, most models are pretrained on general-domain data. Many efforts have been made to continue pretraining general-domain models on a specific domain to improve model performance. Yet in the biomedical/clinical domain, continued pretraining from generic language models is inferior to domain-specific pretraining from scratch, because a large number of domain-specific terms are not covered by the general-domain vocabulary. A clinical-domain language model pretrained from scratch would derive an in-domain vocabulary, so that many biomedical terms, such as diseases, signs/symptoms, medications, anatomical sites, and procedures, would be represented in their original forms instead of being broken into sub-word pieces.
PubMedBERT is one such model, pretrained from scratch on biomedical text. Even though the vocabulary of PubMedBERT is transferable to the clinical domain, the language of clinical text is quite different from the language used in the biomedical literature, so one still needs to pretrain a language model specific to the clinical domain. In addition, clinical entities should be represented deeper in the model itself, not only at the vocabulary level.
Therefore, we propose a methodology to produce a model focused on the clinical domain: continued pretraining of a model with a broad representation of biomedical terminology (PubMedBERT) on a clinical corpus (MIMIC-III) along with a novel entity-centric masking strategy to infuse domain knowledge in the learning process. Through this entity-masking strategy, clinical entities, such as diseases, signs/symptoms, medications, anatomical sites, procedures, and time expressions, are deeply represented in the model and are helpful for many downstream clinical fine-tuning tasks.
Bidirectional Encoder Representations from Transformers (BERT) is a pretrained large neural language model based on transformers to represent input text with true bidirectionality. The pretraining tasks of BERT include a Masked Language Model (MLM) and Next Sentence Prediction. The applications of BERT include but are not limited to: question answering, abstract summarization, text classification, conversational response generation, relation detection, word sense disambiguation, natural language inference, sentiment classification, etc.
PubMedBERT is a BERT-based large language model pretrained from scratch using abstracts from PubMed and full-text articles from PubMed Central. This model achieves state-of-the-art performance on several biomedical NLP tasks, as shown on the Biomedical Language Understanding and Reasoning Benchmark (BLURB). Our study shows that the PubMedBERT vocabulary keeps 30% more biomedical words intact, rather than breaking them into word pieces, than the BERT vocabulary does.
MIMIC-BIG and MIMIC-SMALL:
We process the MIMIC corpus sentence by sentence, and discard sentences that have fewer than two entities. Entities are clinical terms such as diseases, signs/symptoms, medications, anatomical sites, procedures, and time expressions. The resulting set (MIMIC-BIG) has 15.6 million sentences, 728.6 million words. In another setting, from the pool of sentences with at least one entity, we sample a smaller set (MIMIC-SMALL), resulting in 4.6 million sentences and 125.1 million words.
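The corpus construction described above can be illustrated with a minimal Python sketch. The data structure (a sentence paired with its list of entity mentions), the function names, and the use of uniform random sampling for the smaller set are all assumptions made for illustration; the text does not say how the MIMIC-SMALL sample was drawn.

```python
import random
from typing import List, Sequence, Tuple

# A sentence paired with the entity mentions found in it (hypothetical structure).
AnnotatedSentence = Tuple[str, List[str]]


def filter_by_entity_count(
    sentences: Sequence[AnnotatedSentence], min_entities: int
) -> List[AnnotatedSentence]:
    """Keep only sentences that contain at least `min_entities` entities."""
    return [s for s in sentences if len(s[1]) >= min_entities]


def sample_subset(
    sentences: Sequence[AnnotatedSentence], k: int, seed: int = 0
) -> List[AnnotatedSentence]:
    """Draw up to `k` sentences uniformly at random, without replacement (assumed)."""
    rng = random.Random(seed)
    return rng.sample(list(sentences), min(k, len(sentences)))


# Toy corpus standing in for entity-annotated MIMIC sentences.
corpus = [
    ("Patient denies chest pain since yesterday .", ["chest pain", "yesterday"]),
    ("Started aspirin .", ["aspirin"]),
    ("Thank you .", []),
]

big = filter_by_entity_count(corpus, min_entities=2)   # MIMIC-BIG-style filter
pool = filter_by_entity_count(corpus, min_entities=1)  # pool for the smaller set
small = sample_subset(pool, k=1)                       # MIMIC-SMALL-style sample
```

In this toy corpus, only the first sentence passes the two-entity filter, while the first two sentences form the pool from which the smaller set is sampled.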
Entity-centric masking strategy:
We used an open-source clinical NLP package, Apache cTAKES, to annotate events and time expressions in the MIMIC-III corpus, and XML-style markers to mark those entities in an input sequence. All XML-style markers were added to the vocabulary and mapped to unique IDs. Then, within each input sequence, 40% of the entities and 12% of the non-entity words are randomly chosen for corruption, following the same corruption strategy that BERT uses: 80% of the chosen tokens are replaced by the special masking token "[MASK]", 10% are replaced by a random word, and 10% keep the original word. We refer to this masking strategy as entity-centric masking.
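As an illustration of the marking step, the following Python sketch wraps annotated spans in XML-style markers. The marker strings ("<e>", "</e>", "<t>", "</t>"), the span format, and the function name are hypothetical; the markers actually used in the released vocabulary are not listed here. The sketch also replaces a time expression by its time class, as the description of the masking strategy states.

```python
from typing import List, Optional, Tuple

# (start_token, end_token_exclusive, kind, time_class): hypothetical span format.
Span = Tuple[int, int, str, Optional[str]]

# Hypothetical marker strings; the released models may use different ones.
MARKERS = {"event": ("<e>", "</e>"), "timex": ("<t>", "</t>")}


def mark_entities(tokens: List[str], spans: List[Span]) -> List[str]:
    """Wrap non-overlapping entity spans in markers.

    Time expressions are replaced by their time class when one is given.
    """
    out: List[str] = []
    pos = 0
    for start, end, kind, time_class in sorted(spans, key=lambda s: s[0]):
        out.extend(tokens[pos:start])          # copy text before the entity
        open_tag, close_tag = MARKERS[kind]
        if kind == "timex" and time_class is not None:
            body = [time_class]                # e.g. "yesterday" -> "date"
        else:
            body = tokens[start:end]
        out.extend([open_tag, *body, close_tag])
        pos = end
    out.extend(tokens[pos:])                   # copy the remaining text
    return out


tokens = ["chest", "pain", "since", "yesterday"]
spans = [(0, 2, "event", None), (3, 4, "timex", "date")]
marked = mark_entities(tokens, spans)
```

With the Hugging Face transformers library, marker strings of this kind can be registered through the tokenizer's add_tokens method, followed by resize_token_embeddings on the model, so that each marker maps to its own ID.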
We continued pretraining the PubMedBERT base version (with all terms in lower case) on the MIMIC-BIG and MIMIC-SMALL corpora, with and without the entity-centric masking strategy, and released three models:
- PubMedBERT with continued pretraining on MIMIC-SMALL with the entity-centric masking strategy
- PubMedBERT with continued pretraining on MIMIC-BIG with the entity-centric masking strategy
- PubMedBERT with continued pretraining on MIMIC-BIG without the entity-centric masking strategy
Conventional BERT-style MLM randomly chooses 15% of the input tokens (a token is a string of contiguous characters between two spaces, or between a space and punctuation marks) for corruption, among which 80% are replaced by a special token "[MASK]", 10% are left unchanged, and 10% are randomly replaced by a token from the vocabulary. The language model is trained to reconstruct the masked tokens.
We propose an entity-centric masking strategy. We process the MIMIC-III corpus with the sentence detection, tokenization, and temporal modules of Apache cTAKES (an open-source natural language processing system for extracting information from the clinical free text of electronic medical records) to identify all entities (events and time expressions) in the corpus. Events are recognized by a cTAKES event annotator. Event types include diseases/disorders, signs/symptoms, medications, anatomical sites, and procedures. Time expressions are recognized by a cTAKES timex annotator. Time classes include date, time, duration, quantifier, pre- and post-expressions (prepostexp, e.g. "preoperative", "post-exposure", "post-surgery", "prenatal", "pre-prandial"), and set. Special XML tags are inserted into the text sequence to mark the position of identified entities. Time expressions are replaced by their time class for better generalizability. All special XML tags and time class tokens are added to the PubMedBERT vocabulary so that they can be recognized. Then, within each sequence block, 40% of the entities and 12% of the non-entity tokens are randomly chosen for corruption. The chosen entities and tokens follow the same corruption ratio as BERT, i.e. 80% of them are replaced by a special token "[MASK]", 10% are left unchanged, and 10% are randomly replaced by a token from the vocabulary. We refer to this masking strategy as entity-centric masking.
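The selection and corruption steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each entity is chosen independently with probability 0.40 (rather than selecting an exact 40% of entities), every token of a chosen entity is then corrupted independently, and the function and parameter names are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "[MASK]"


def corrupt(token: str, vocab: Sequence[str], rng: random.Random) -> str:
    """BERT-style corruption: 80% [MASK], 10% unchanged, 10% random vocabulary token."""
    r = rng.random()
    if r < 0.8:
        return MASK
    if r < 0.9:
        return token
    return rng.choice(vocab)


def entity_centric_mask(
    tokens: List[str],
    entity_spans: List[Tuple[int, int]],  # [start, end) token offsets of entities
    vocab: Sequence[str],
    rng: random.Random,
    p_entity: float = 0.40,
    p_other: float = 0.12,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    in_entity = set()
    chosen = set()
    for start, end in entity_spans:
        span = range(start, end)
        in_entity.update(span)
        if rng.random() < p_entity:  # assumption: the entity is chosen as a whole
            chosen.update(span)
    for i in range(len(tokens)):
        if i not in in_entity and rng.random() < p_other:
            chosen.add(i)

    corrupted = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i in sorted(chosen):
        labels[i] = tokens[i]  # the model is trained to recover the original token
        corrupted[i] = corrupt(tokens[i], vocab, rng)
    return corrupted, labels


tokens = ["pt", "has", "chest", "pain", "since", "yesterday"]
corrupted, labels = entity_centric_mask(
    tokens, [(2, 4), (5, 6)], vocab=["fever", "cough"], rng=random.Random(0)
)
```

Setting p_entity and p_other to the same value reduces the sketch to uniform token masking, which makes the difference from conventional BERT-style masking easy to see: entity tokens are selected at more than three times the rate of other tokens.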
We did not use the Next Sentence Prediction (NSP) task in our pretraining. We used an NVIDIA Titan RTX GPU cluster to pretrain the models with the Hugging Face transformers library, version 2.10.0. For pretraining, we set:
- the max steps: 200k
- the block size: 100
- per_gpu_train_batch_size: 82
Installation and Requirements
The user needs to install the Hugging Face transformers package (https://huggingface.co/), a popular Python library that provides pretrained models for a variety of natural language processing tasks. Our models were pretrained with Hugging Face transformers 2.10.0 and Python 3.6.8.
To work with the models we provide, please set the "--model_name_or_path" parameter to the path that leads to one of the downloaded models, e.g. "--model_name_or_path PubmedBERTbase-MimicBig-EntityBERT"
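Outside the command-line scripts, a downloaded model directory can also be loaded directly in Python. The sketch below is a minimal example that assumes the directory contains the usual Hugging Face configuration, vocabulary, and weight files; the helper name load_entitybert is hypothetical, and the directory name is taken from the example above.

```python
import os


def load_entitybert(model_dir: str):
    """Load the tokenizer and encoder weights from a downloaded model directory."""
    if not os.path.isdir(model_dir):
        raise FileNotFoundError(f"model directory not found: {model_dir}")
    # Imported lazily so the path check works even if transformers is not installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModel.from_pretrained(model_dir)
    return tokenizer, model


# Example usage (requires the downloaded files and the transformers package):
# tokenizer, model = load_entitybert("PubmedBERTbase-MimicBig-EntityBERT")
```

For fine-tuning on a classification task, a task-specific class such as AutoModelForSequenceClassification can be substituted for AutoModel.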
Current Usage: Our models have been used in three clinical tasks: cross-domain negation detection, document time relation classification, and temporal relation extraction. All results are reported in Lin et al. (BioNLP 2021).
Reuse Potential: Our models are pretrained using the Hugging Face transformers library, and can be fine-tuned or continually pretrained using the Hugging Face transformers API and command line scripts.
Known limitations: Our models are pretrained with a clinical entity-centric masking strategy and are well suited to entity-centric clinical tasks such as negation detection, document time relation classification, and temporal relation extraction. The models may not be suitable for non-entity-centric tasks or for tasks outside the clinical domain.
The statement is unchanged from the previous version of the project
The study was funded by R01LM10090, R01GM114355, U24CA248010 and UG3CA243120 from the United States National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
The authors would also like to acknowledge Boston Children's Hospital's High-Performance Computing Resources BCH HPC Cluster Enkefalos 2 (E2) made available for conducting the research reported in this publication. Software used in the project was installed and configured by BioGrids.
Conflicts of Interest
The authors declare no ethics concerns.
- Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
- Yang, Zhilin, et al. "XLNet: Generalized autoregressive pretraining for language understanding." Advances in neural information processing systems 32 (2019).
- Lewis, Mike, et al. "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension." arXiv preprint arXiv:1910.13461 (2019).
- Gu, Yu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. "Domain-specific language model pretraining for biomedical natural language processing." arXiv preprint arXiv:2007.15779 (2020).
- Johnson, Alistair EW, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. "MIMIC-III, a freely accessible critical care database." Scientific data 3, no. 1 (2016): 1-9.
- Savova, Guergana K., et al. "Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications." Journal of the American Medical Informatics Association 17.5 (2010): 507-513.
- Dligach, Dmitriy, et al. "Neural temporal relation extraction." Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. 2017.
- Wolf, Thomas, et al. "HuggingFace's Transformers: State-of-the-art natural language processing." arXiv preprint arXiv:1910.03771 (2019).
- Lin, Chen, Timothy Miller, Dmitriy Dligach, Steven Bethard, and Guergana Savova. "EntityBERT: Entity-centric Masking Strategy for Model Pretraining for the Clinical Domain." In BioNLP 2021.
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
CITI Data or Specimens Only Research
- be a credentialed user
- complete required training:
- CITI Data or Specimens Only Research
- sign the data use agreement for the project