Model Credentialed Access
Transformer models trained on MIMIC-III to generate synthetic patient notes
Published: May 27, 2020. Version: 1.0.0
When using this resource, please cite:
Amin-Nejad, A., Ive, J., & Velupillai, S. (2020). Transformer models trained on MIMIC-III to generate synthetic patient notes (version 1.0.0). PhysioNet. https://doi.org/10.13026/m34x-fq90.
Amin-Nejad, A., Ive, J., Velupillai, S (2020). Exploring Transformer Text Generation for Medical Dataset Augmentation. Proceedings of The 12th Language Resources and Evaluation Conference. European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.578
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Natural Language Processing can help to unlock knowledge in the vast troves of unstructured clinical data that are collected during patient care. Patient confidentiality presents a barrier to the sharing and analysis of such data, however, meaning that only small, fragmented and sequestered datasets are available for research. To help side-step this roadblock, we explore the use of Transformer models for the generation of synthetic notes. We demonstrate how models trained on notes from the MIMIC-III clinical database can be used to generate synthetic data with potential to support downstream research studies. We release these trained models to the research community to stimulate further research in this area.
Natural Language Processing (NLP) has enormous potential to advance many aspects of healthcare by facilitating the analysis of unstructured text. However, a key obstacle to the development of more powerful NLP methods in the clinical domain is a lack of accessible data. This, coupled with the fact that state-of-the-art neural models are well known to require very large volumes of data in order to learn general and meaningful patterns, means that progress is hindered in this area. Data access is usually restricted due to the constraints on sharing personal medical information for confidentiality reasons, be they legal or ethical in nature.
In the machine learning community, similar problems are typically solved by using artificially generated data to augment or perhaps even replace an original dataset, e.g. in image processing. However, similar approaches to data augmentation are not easily applied to NLP. With language being inherently more complex than other domains, it is difficult to programmatically modify a sentence or document without altering its meaning and coherence. Natural Language Generation (NLG) can provide a more sophisticated approach to solving this problem and has already done so, e.g. in machine translation with the technique known as back-translation. With newer, more capable NLG models - utilising the Transformer architecture - we posit that this general idea can now be extended beyond machine translation to longer passages of clinical text.
The replacement or augmentation of real-world training data with artificial training data remains understudied, particularly in the medical domain. Manual approaches to augmentation are costly and unscalable. Transformer models show promise, but since most research focuses on shorter sentence-level texts, it is not clear whether these models can form sufficiently long-range dependencies to be useful as a substitute for genuine training data. We therefore believe that applying NLG approaches to medical text for augmentation purposes is a worthwhile task.
We build on the approaches of [7,8] in generating complex, hierarchical passages of text using a Transformer-based approach. We experiment with two Transformer architectures: the original vanilla architecture, which achieved state-of-the-art machine-translation results, and the more recent GPT-2, composed of a stack of Transformer decoders, which has achieved state-of-the-art question-answering, language modelling and common-sense reasoning results. We use this artificial data in two clinically relevant downstream NLP tasks (unplanned readmission prediction and phenotype classification) to assess its utility both as a standalone dataset and as part of an augmented dataset alongside the original samples. Our ultimate aim is to ascertain whether Transformer models can generate new samples of text that are useful for data augmentation purposes - particularly in low-resource medical scenarios.
As mentioned, we model the generation of text as a seq2seq problem. Whilst language models can be used standalone to generate text, we generally prefer to use conditional language models, e.g. seq2seq. These usually consist of two architectures in an encoder-decoder format, where a source sequence is encoded into a latent space before being decoded to the target sequence. The vanilla Transformer follows this paradigm, with 6 encoder and 6 decoder layers, whilst GPT-2 consists only of Transformer decoder layers.
We use the vanilla transformer implementation from the tensor2tensor library and we train each of our models for 3-4 epochs using a batch size of 4096 tokens and 4 Tesla K80 GPU chips, where each chip has 12GB of RAM and two chips make up one GPU. We do this using the transformer_base hyperparameters provided by tensor2tensor. Since the vanilla transformer model follows the encoder-decoder paradigm, it is already suited towards this task and needs no modification.
We use the TensorFlow GPT-2 implementation directly from the OpenAI repository. We fine-tune the pretrained GPT-2 small model (12 decoder layers). We choose to focus only on the small model because its smaller parameter count makes training quicker and cheaper. We use the fine-tuning scripts provided by nshepperd. We train the model for 60,000 steps using a batch size of 2 samples (2048 tokens) and 1 Tesla K80 GPU chip.
Since GPT-2 consists of only a stack of decoders, without any encoders, the problem needs modification for seq2seq tasks. We therefore follow the approach of the original GPT-2 authors, who demonstrate that seq2seq tasks can be modelled by introducing a special token that helps the model infer the desired task. Within this framework, we fine-tune the GPT-2 model on example pairs of the format `context data = target note`. At inference time, we condition the model with a prompt of `context data =` so that it generates the target note.
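The pairing scheme described above can be sketched in a few lines. This is a minimal illustration rather than the released fine-tuning code: the function names and the exact whitespace around the `=` separator are assumptions made for the example.

```python
# Hypothetical helpers illustrating the "context data = target note" format.
# The single space on each side of "=" is an assumption, not taken from the
# released scripts.

def make_prompt(context: str) -> str:
    """Inference-time prompt: the context followed by the '=' task token."""
    return f"{context.strip()} ="


def make_training_example(context: str, target_note: str) -> str:
    """Fine-tuning example: the prompt followed by the target note."""
    return f"{make_prompt(context)} {target_note.strip()}"
```

Because every training example begins with exactly the string used as the prompt at inference time, the model sees the same conditioning pattern in both phases.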
We use Electronic Health Records (EHR) from the publicly available MIMIC-III database, a large de-identified database for critical care hospital admissions at the Beth Israel Deaconess Medical Center, Boston MA. The version used for this research is the latest version (v1.4) which comprises over 58,000 hospital admissions for 38,645 adults and 7,875 neonates spanning June 2001 - October 2012.
We are primarily concerned with the NOTEEVENTS table which is a comprehensive collection of patient notes written by doctors, nurses and other healthcare professionals. We focus solely on Discharge Summaries, which provide a rich overview of patient stays in the ICU. The MIMIC-III database contains data for neonates and adult patients (defined as being >= 15 years of age). For the purposes of this research, we focus only on adult patients. After removing neonates, we are left with 55,404 discharge summaries for 37,400 unique patients.
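The selection step can be sketched as below. This is an illustration under stated assumptions, not the authors' preprocessing code: the dictionary keys (`subject_id`, `category`, `chartdate`), the category label `"discharge summary"` and the helper names are hypothetical stand-ins for however the NOTEEVENTS rows and patient dates of birth are loaded.

```python
from datetime import date

ADULT_MIN_AGE = 15  # the text defines adults as >= 15 years of age


def age_in_years(dob: date, on: date) -> int:
    """Whole years elapsed between a date of birth and a reference date."""
    # Subtract one year if the birthday has not yet occurred in the year of `on`.
    return on.year - dob.year - ((on.month, on.day) < (dob.month, dob.day))


def adult_discharge_summaries(notes, dob_by_subject):
    """Keep only discharge summaries written when the patient was an adult.

    `notes` is an iterable of dicts with the assumed keys 'subject_id',
    'category' and 'chartdate'; `dob_by_subject` maps subject_id -> date.
    """
    kept = []
    for note in notes:
        if note["category"].strip().lower() != "discharge summary":
            continue
        age = age_in_years(dob_by_subject[note["subject_id"]], note["chartdate"])
        if age >= ADULT_MIN_AGE:
            kept.append(note)
    return kept
```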
We treat our text generation task as a conditional language modelling problem. More specifically, we model the task as a sequence-to-sequence (seq2seq) problem where we generate discharge summaries (output sequence) conditioned on some input representing key information regarding the patient and their ICU stay (input sequence), generating each summary at the full-note level. This structured input is intended to represent what a doctor knows and pays most attention to at the time they write the patient's discharge summary. It is left entirely to the attention mechanism of the model to ascertain which portions of the input are relevant to which portions of the output. We believe that this is a viable approach given the advanced Transformer architecture we are using.
In order to extract the relevant content from a patient's history, we explore the rest of the MIMIC-III dataset. Drawing on the approach of , we experiment with various configurations of the following context data classes in addition to a hint representing the first 10 tokens of the note. We settle on using all of the classes of data below in the order shown:
- Demographic data (G.A.E.): This is static data which is found at the subject level. We extract gender and ethnicity, and compute the age at the time of the note using the date of birth of the patient and the date of the note.
- Diagnoses (D): Intuitively, one can assume that diagnoses are a key element regarding a subject's stay in the ICU and would be extremely pertinent for writing the discharge summary. We include all International Classification of Diseases, Ninth Revision (ICD-9) codes for diagnoses pertaining to a patient's hospital admission ordered by priority with the highest priority items first.
- Procedures (P): Similar to diagnoses, procedures are also a key element of a subject's stay in the ICU. Again these are ICD-9 procedures but are instead ranked in the order in which they were performed.
- Medications (M): Medications prescribed to the patient within a 24hr context window prior to discharge are included as context data. We include the name of the drug, the strength and the units.
- Microbiology Tests (T): Nosocomial infections are those which are contracted during a hospital admission and have a prevalence of 15%. We include the results of tests for these infections within a 72hr context window, including the location of the test on the subject and the list of organisms detected at that location (if any).
- Laboratory Tests (L): Lastly, we also include lab tests measuring normal bodily functions within a 24hr context window. We extract the name of the test, the value, its unit of measurement, and if available the flag saying whether or not this value is abnormal.
A Backus Naur representation of this is as follows:
<Context> ::= <Hint><Demographic><DiagnosisList><ProcedureList><MedList><MicrobioList><LabList>
<Hint> ::= first-10-tokens-of-note "<H>"
<Demographic> ::= <Gender><Age><Ethnicity>
<Gender> ::= "M" | "F" "<G>"
<Age> ::= age-in-years "<A>"
<Ethnicity> ::= "white" | "black" | "hispanic" | "asian" | "other" "<E>"
<DiagnosisList> ::= <Diagnosis> "<D>" | <Diagnosis> <Delim> <DiagnosisList>
<Diagnosis> ::= ICD9-diagnosis-text
<Delim> ::= "|"
<ProcedureList> ::= <Procedure> "<P>" | <Procedure> <Delim> <ProcedureList>
<Procedure> ::= ICD9-procedure-text
<MedList> ::= <Medication> "<M>" | <Medication> <Delim> <MedList>
<Medication> ::= drug-name "," drug-strength "," unit-of-measurement
<MicrobioList> ::= <MicrobioTest> "<T>" | <MicrobioTest> <Delim> <MicrobioList>
<MicrobioTest> ::= test-location ":" <OrganismList>
<OrganismList> ::= <Organism> | <Organism> "," <OrganismList>
<Organism> ::= organism-detected-from-test | "none"
<LabList> ::= <Lab> "<L>" | <Lab> <Delim> <LabList>
<Lab> ::= lab-result-name "," lab-result-value "," unit-of-measurement <LabFlag>
<LabFlag> ::= ", abnormal" | ""
An example instantiation of this Backus Naur representation can be seen below. This will form the input to our model at both training and inference time.
First ten tokens ... <H> M <G> 65 <A> white <E> other pulmonary embolism and infarction | acute kidney failure, unspecified | diarrhea | hypotension, unspecified <D> other endoscopy of small intestine | gastroenterostomy without gastrectomy <P> warfarin , 1mg Tablet | polysaccharide iron complex , 150MG | bisacodyl , 10MG SUPP | milk of magnesia , 30ML UDCUP <M> blood culture : None | urine : staphylococcus species | mrsa screen : None | blood culture : None <T> Calcium, Total , 10.0 , mg/dL | Bicarbonate , 25 , mEq/L | Hematocrit , 28.7 , % , abnormal <L>
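One way to assemble such an input line from already-extracted fields is sketched below. The function name, its parameters and the spacing around the `|` delimiter and class tags are assumptions modelled on the example above; the released preprocessing may differ. The sketch also assumes every list contains at least one item, as the grammar requires.

```python
# Hypothetical serialiser for the context format: each data class is followed
# by its tag (<H>, <G>, <A>, <E>, <D>, <P>, <M>, <T>, <L>) and list items are
# separated by " | ". Items are expected to be pre-formatted strings.

def build_context(hint, gender, age, ethnicity, diagnoses, procedures,
                  medications, microbiology, labs):
    """Return the single-line model input for one discharge summary."""

    def tagged(items, tag):
        # Join the items of one data class and close the class with its tag.
        return f"{' | '.join(items)} {tag}"

    parts = [
        f"{hint} <H>",
        f"{gender} <G>",
        f"{age} <A>",
        f"{ethnicity} <E>",
        tagged(diagnoses, "<D>"),
        tagged(procedures, "<P>"),
        tagged(medications, "<M>"),
        tagged(microbiology, "<T>"),
        tagged(labs, "<L>"),
    ]
    return " ".join(parts)
```

Keeping the classes in a fixed order with a closing tag per class means the model can learn where each kind of information ends without any positional metadata.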
Installation and Requirements
To run inference or fine-tune the models, replicate the environment (Python 3.x) using:
conda env create -f environment.yml
For optimal performance use a machine with a GPU and make it visible using:
Generating a note
First create an `input.txt` file in the root directory containing your input sequence on one line (no newline characters). Then run either of the `gpt2.sh` bash files to create an `output.txt` file in the root directory representing the discharge summary. The hyperparameters are hardcoded in these files and can be modified at this stage.
An example discharge summary output is provided as follows:
admission date : [ 2143/8/14 ] discharge date : [ 2143/8/23 ] <PAR> <PAR> date of birth : [ 2077/4/24 ] sex : m <PAR> <PAR> service : surgery <PAR> <PAR> allergies : <PAR> patient recorded as having no known allergies to drugs <PAR> <PAR> attending : [ first name3 ( lf ) 148 ] <PAR> chief complaint : <PAR> abdominal pain <PAR> <PAR> major surgical or invasive procedure : <PAR> none <PAR> <PAR> history of present illness : <PAR> mr [ known lastname ] is a 68 year old male with a history of <PAR> hypertension , hyperlipidemia , and recent <PAR> abdominal pain who presents with abdominal pain and <PAR> abdominal pain he was in his usual state of health until <PAR> approximately 1 week ago when he developed nausea and <PAR> vomiting he was brought to the ed where he was <PAR> found to have a lipase of 19 and a lipase of <PAR> 19 he was given 2 liters of normal saline and was <PAR> transferred to the [ hospital1 18 ] for further management <PAR> <PAR> past medical history : <PAR> hypertension <PAR> hyperlipidemia <PAR> <PAR> social history : <PAR> lives with wife <PAR> <PAR> family history : <PAR> non - contributory <PAR> <PAR> physical exam : <PAR> vs : t : 97.8 bp : 124/70 hr : 80 r : 18 o2sats : 100 % ra <PAR> gen : nad , a & o x 3 <PAR> heent : perrl , eomi , perrl , anicteric sclera , op clear <PAR> neck : supple , no lymphadenopathy , no lymphadenopathy <PAR> cv : rrr , no m / r / g <PAR> lungs : ctab <PAR> abd : soft , nt , nd <PAR> ext : no c / c / c / e <PAR> <PAR> pertinent results : <PAR> [ 2143/8/
`<PAR>` indicates a newline character. This output is typical of the discharge summaries found in the MIMIC-III database, beginning with demographic data, allergies, chief complaints, etc. in the correct order.
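For reading or post-processing a generated note, the `<PAR>` markers can be turned back into line breaks. The helper below is a small illustrative sketch; its name and the decision to trim whitespace around each line are assumptions rather than part of the released scripts.

```python
def restore_newlines(generated: str, marker: str = "<PAR>") -> str:
    """Replace each newline marker with a real line break.

    Whitespace surrounding each resulting line is trimmed, so two
    consecutive markers yield one blank line.
    """
    lines = [segment.strip() for segment in generated.split(marker)]
    return "\n".join(lines)
```

For example, `restore_newlines(open("output.txt").read())` would give a note laid out with one section heading per line, assuming `output.txt` holds a single generated summary.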
The models are intended for inference: an input sequence (as detailed in the technical implementation) is provided on the command line and an output sequence is returned. However, they can also be further trained or fine-tuned on other tasks. Scripts showing how to run inference on the models are provided.
- Input and output sequences are both limited to 512 tokens (words, punctuation marks, etc.).
- Outputs can sometimes be quite repetitive. This is a common problem with seq2seq models, especially when generating multi-sentence texts .
- The output sequence is very sensitive to the contents of the input sequence. The more detailed the input sequence, the more detailed the output sequence, and vice versa.
- The transformer model is deterministic (when the hyperparameters are held constant) whereas the GPT-2 model is not necessarily. GPT-2's level of determinism can itself be modified with the `temperature` hyperparameter.
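The effect of the `temperature` hyperparameter mentioned in the last point can be illustrated with a generic temperature-scaled softmax. This is a sketch of the general sampling technique, not code from the GPT-2 repository; the function name and the example logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature.

    Lower temperatures sharpen the distribution (sampling becomes close to
    deterministic); higher temperatures flatten it (sampling is more varied).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With logits such as `[2.0, 1.0, 0.0]`, a temperature of 0.1 puts almost all of the probability mass on the first token, whereas a temperature of 10 gives a nearly uniform distribution.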
Evaluation in Downstream Tasks
We evaluated synthetic data generated by these models on two downstream tasks: readmission prediction and phenotype classification.
Unplanned Readmission Prediction
For this task, we attempt to reproduce the work of Rajkomar et al. (2018), who perform a suite of clinically relevant tasks (mortality prediction, 30-day unplanned readmission, prolonged length of stay and final discharge diagnoses) using EHR data from two hospitals in the US. They use the entire data from the EHR, whereas we focus solely on the discharge summaries. The authors report AUC scores of 0.93-0.94, 0.75-0.76, 0.85-0.86 and 0.90 respectively for those tasks. Since our only data point is, by definition, at the end of the subject's stay, most of these tasks become either unfeasible or trivial. In our view, the only remaining relevant task is 30-day unplanned readmission prediction, which is incidentally also the hardest task judging by the reported AUC scores.
We label each discharge summary as either positive or negative depending on whether the patient then has a readmission within 30 days. Since we are only concerned with unplanned readmissions, as these are the only ones where there is clinical value in predicting their occurrence, we filter for only the EMERGENCY and URGENT admission types (ignoring ELECTIVE and NEWBORN). We then perform the classification using two different models: BERT (Bidirectional Encoder Representations from Transformers) and a variant of BERT, termed BioBERT. BERT is a recent Transformer-based architecture which has achieved state-of-the-art results across numerous NLP tasks, whilst BioBERT is a version of BERT pretrained on biomedical corpora, demonstrating state-of-the-art results (including significant improvements over BERT) on biomedical text mining tasks. We use BioBERT v1.1 (+ PubMed 1M), which takes an already pretrained BERT-base and trains it for a further 1M steps on the 4.5B word PubMed corpus.
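The labelling rule can be sketched as follows. This is an illustration under assumptions, not the authors' code: the dictionary keys are hypothetical, and the sketch interprets the admission-type filter as applying to the readmission (a stay is positive only if a later EMERGENCY or URGENT admission starts within 30 days of its discharge).

```python
from datetime import datetime, timedelta

UNPLANNED_TYPES = {"EMERGENCY", "URGENT"}  # ELECTIVE and NEWBORN are ignored
WINDOW = timedelta(days=30)


def label_readmissions(admissions):
    """Map each hadm_id to 1 if an unplanned readmission follows within 30 days.

    `admissions` is a list of dicts with the assumed keys 'subject_id',
    'hadm_id', 'admittime', 'dischtime' and 'admission_type'.
    """
    by_subject = {}
    for adm in admissions:
        by_subject.setdefault(adm["subject_id"], []).append(adm)

    labels = {}
    for stays in by_subject.values():
        stays.sort(key=lambda a: a["admittime"])  # chronological per patient
        for i, stay in enumerate(stays):
            labels[stay["hadm_id"]] = int(any(
                later["admission_type"] in UNPLANNED_TYPES
                and timedelta(0) <= later["admittime"] - stay["dischtime"] <= WINDOW
                for later in stays[i + 1:]
            ))
    return labels
```

Each discharge summary would then inherit the label of the hospital admission it belongs to.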
We train both types of classification models on the synthetic discharge summaries from both of our text generation models (Transformer & GPT-2), alone as well as combined with the original discharge summaries. We use the original discharge summaries with no augmentation as a baseline. We achieve significantly better performance at the 95% confidence level on this task when using the Transformer generation model to augment our original dataset. Interestingly, however, this benefit only manifests itself when classifying with the BioBERT model, not the BERT model.
Our phenotype classification task is borrowed from  and is the same task conducted by . We model this as a multilabel classification task where subjects are categorised as demonstrating up to 13 different phenotypes ranging from the likes of Obesity and Alcohol Abuse to Advanced Cancer and Depression. The dataset is a carefully curated subset of 1610 discharge summaries from MIMIC-III with the annotations made by a panel of medical professionals.
Again, we perform the classification using the BERT and BioBERT models and for each of our text generation models, we compare the performance of our synthetic data as input to the models both standalone and combined with the original data using the original data with no augmentation as a baseline. In this task, our augmented data performs in line with our baseline with no significant differences in performance.
The work of Julia Ive is part-funded by EPSRC Healtex Feasibility Funding (Towards Shareable Data in Clinical Natural Language Processing: Generating Synthetic Electronic Health Records). Sumithra Velupillai is part-funded by the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, and by the Medical Research Council (MRC) Mental Health Data Pathfinder Award to King's College London. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.
Conflicts of Interest
There are no conflicts of interest.
- Johnson, A., Pollard, T., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L., & Mark, R. (2016). MIMIC-III, a freely accessible critical care database. Scientific data, 3, 160035.
- Gehrmann, S., Dernoncourt, F., Li, Y., Carlson, E., Wu, J., Welt, J., Foote Jr, J., Moseley, E., Grant, D., Tyler, P., & others (2018). Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives. PloS one, 13(2), e0192360.
- Wang, Z., Ive, J., Velupillai, S., & Specia, L. (2019). Is artificial data useful for biomedical Natural Language Processing algorithms? In Proceedings of the 18th BioNLP Workshop and Shared Task (pp. 240–249).
- Sutskever, I., Vinyals, O., & Le, Q. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104–3112).
- Esteva, A., Robicquet, A., Ramsundar, B., Kuleshov, V., DePristo, M., Chou, K., ... & Dean, J. (2019). A guide to deep learning in healthcare. Nature medicine, 25(1), 24-29.
- Chapman, W., Nadkarni, P., Hirschman, L., D'Avolio, L., Savova, G., & Uzuner, O. (2011). Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. Journal of the American Medical Informatics Association, 18(5), 540-543.
- Bachman, P. (2016). An architecture for deep, hierarchical generative models. In Advances in Neural Information Processing Systems (pp. 4826-4834).
- Sennrich, R., Haddow, B., & Birch, A. (2016). Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 86–96). Association for Computational Linguistics.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).
- Suominen, H., Zhou, L., Hanlen, L., & Ferraro, G. (2015). Benchmarking clinical speech recognition and information extraction: new data, methods, and evaluations. JMIR medical informatics, 3(2), e19.
- Liu, P. (2018). Learning to write notes in electronic health records. arXiv preprint arXiv:1808.02622.
- Melamud, O., & Shivade, C. (2019). Towards Automatic Generation of Shareable Synthetic Clinical Notes Using Neural Language Models. In Proceedings of the 2nd Clinical Natural Language Processing Workshop (pp. 35–45).
- Lakew, S. M., Cettolo, M., & Federico, M. (2018). A Comparison of Transformer and Recurrent Neural Networks on Multilingual Neural Machine Translation. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 641–652). Association for Computational Linguistics.
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
- See, A., Liu, P. J., & Manning, C. D. (2017, July). Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1073-1083).
- Amin-Nejad, A., Ive, J., Velupillai, S (2020). Exploring Transformer Text Generation for Medical Dataset Augmentation. Proceedings of The 12th Language Resources and Evaluation Conference. European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.578
- Sydnor, E., & Perl, T. (2011). Hospital epidemiology and infection control in acute-care settings. Clinical microbiology reviews, 24(1), 141–173.
- Rajkomar, A., Oren, E., Chen, K., Dai, A., Hajaj, N., Hardt, M., Liu, P., Liu, X., Marcus, J., Sun, M., & others (2018). Scalable and accurate deep learning with electronic health records. NPJ Digital Medicine, 1(1), 18.
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Association for Computational Linguistics.
- Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C., & Kang, J. (2019). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746.
Only credentialed users who sign the specified DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0