Database Contributor Review
BRATECA (Brazilian Tertiary Care Dataset): a Clinical Information Dataset for the Portuguese Language
Henrique Dias, Ana Helena Dias Pereira dos Ulbrich
Published: July 14, 2022. Version: 1.1
When using this resource, please cite:
Dias, H., & Ulbrich, A. H. D. P. d. (2022). BRATECA (Brazilian Tertiary Care Dataset): a Clinical Information Dataset for the Portuguese Language (version 1.1). PhysioNet. https://doi.org/10.13026/ay8n-qf21.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Computational medicine research requires clinical data for training and testing purposes, so the development of datasets composed of real hospital data is of utmost importance in this field. Most such data collections are in the English language, were collected in anglophone countries, and do not reflect other clinical realities, which increases the importance of national datasets for projects that hope to positively impact public health. This paper presents a new Brazilian Clinical Dataset containing over 70,000 admissions from 10 hospitals in two Brazilian states, composed of a total of over 2.5 million free-text clinical notes alongside data pertaining to patient descriptors, prescription information, and exam results. This data was collected, organized, and deidentified, and is being distributed via credentialed access for use by the research community. In presenting the dataset, we explore its structure, population, and potential benefits for clinical AI tasks.
One of the most widely used clinical datasets is MIMIC [1,2]. It has several versions, and its most current iteration, MIMIC-IV, is separated into 6 modules: core, hosp, icu, ed, cxr, and note [3,4]. Another example is the United Kingdom's National Health Service's (NHS) comprehensive dataset collection. The data is collected in order to support the analysis of specific policies of interest as well as the effects of particular policy initiatives, and it is separated into several different datasets, each with a different focus and different kinds of data. A more task-focused example of an English-language clinical dataset can be found in the National NLP Clinical Challenges (n2c2) datasets. These challenges have been proposed since 2006, starting with the i2b2 project, n2c2's predecessor. These two series of challenges have presented datasets for a variety of tasks, such as deidentification, obesity prediction, coreference, temporal relations, heart disease, clinical semantic textual similarity, and family history extraction.
However, these are English-language datasets extracted from hospitals in certain anglophone countries, and do not conform to the clinical realities of Brazil. It is thus important to gather national data for local research projects which may be able to positively impact Brazilian public health. The development of national clinical resources has started in earnest in recent years, with work such as SemClinBR, a dataset with 1000 clinical notes annotated with over 65,000 entities and over 11,000 relations. The dataset was manually annotated and may be used for a variety of tasks, such as clinical named entity recognition and negation detection. It bears more resemblance to the n2c2 challenge datasets than to MIMIC.
BioBERTpt is a fine-tuned BERT model trained on clinical EHR texts as well as texts from the biomedical literature [7, 8]. It has three versions, each trained with a different corpus. The first was trained with more than 2 million clinical notes from Brazilian hospitals collected between 2002 and 2018. The second was trained with titles and abstracts from Portuguese biomedical scientific papers published in PubMed and Scielo. The third was trained on a combination of both corpora. The clinical note corpus does not seem to have been made available after its use in training the models.
The literature also covers a Brazilian healthcare image dataset, the labeled chest X-ray dataset BRAX. Although it is not a language resource, that dataset is nonetheless an example of a Brazilian healthcare dataset, and it is similar to MIMIC's CXR (chest X-ray), except that the images are not complemented by text-based healthcare resources like MIMIC's. Another example is a dataset composed of nearly 4 million tweets and about 18,000 news articles related to COVID-19 in Brazil. It has a different domain from the other datasets presented thus far and so has a different overall purpose, being more focused on public discourse and sentiment about public health issues than on clinical information.
BRATECA is an edited and reorganized version of the Institute for Artificial Intelligence in Healthcare's own internal Brazilian tertiary care information database and is intended to be a public edition for use in machine learning research. The data was collected during the course of a research project undertaken with 10 hospitals in two Brazilian states. All data sharing was approved by each hospital participating in that research.
When extracting data for the public dataset, the prescription tables - which contained ward information - were the best way to ascertain that only adult patients from the desired wards were extracted from the database. As a result, only admissions with an associated prescription were included. The public dataset is therefore centered around its prescription tables.
Only admissions which both began and ended during a delimited time period of nine months were extracted. This time period was defined as sometime between 2020 and 2021. The exact period will not be specified so as to further enhance patient privacy.
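The two inclusion criteria above (an associated prescription, and an admission that both begins and ends inside the extraction window) can be sketched as a simple filter. This is only an illustration and not the project's SQL extraction scripts: the window dates, the field names, and the sample rows are all invented, since the real nine-month period is deliberately undisclosed, and the ward and age checks are omitted.

```python
from datetime import date

# Hypothetical window: the real nine-month period is not disclosed.
WINDOW_START = date(2020, 6, 1)
WINDOW_END = date(2021, 2, 28)


def select_admissions(admissions, prescribed_admission_ids):
    """Keep admissions that have a prescription and lie fully inside the window."""
    selected = []
    for adm in admissions:
        has_prescription = adm["admission_id"] in prescribed_admission_ids
        inside_window = WINDOW_START <= adm["start"] and adm["end"] <= WINDOW_END
        if has_prescription and inside_window:
            selected.append(adm)
    return selected


# Invented sample rows (field names are assumptions, not the real schema).
admissions = [
    {"admission_id": 1, "start": date(2020, 7, 1), "end": date(2020, 7, 9)},
    {"admission_id": 2, "start": date(2020, 5, 20), "end": date(2020, 6, 3)},  # begins before the window
    {"admission_id": 3, "start": date(2020, 9, 1), "end": date(2020, 9, 4)},   # has no prescription
]
prescribed_admission_ids = {1, 2}

kept_ids = [a["admission_id"] for a in select_admissions(admissions, prescribed_admission_ids)]
```

In this sketch only admission 1 satisfies both criteria.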
IDs were extracted for the resulting patients and used to gather related data. This data was then assembled into 5 separate but interconnected tables: Admission, Exam, Clinical Note, Prescription, and Prescription Item. The SQL scripts used to extract the data are available on the project's GitHub page.
Though most columns in the datasets provide the exact information present in the original database, some had to be modified to further protect patients' sensitive information and attempt to prevent reverse engineering of identities from the provided data. All names in BRATECA's free-text notes were deidentified using state-of-the-art deep learning methods (Bi-LSTM-CRF). Two corpora and three language models were evaluated on a Named Entity Recognition (NER) task focused on person names to determine which combination delivered the best performance.
The experiments revealed that using domain-specific corpora (focused on deidentification of clinical notes) and a contextualized embedding stacked with word embeddings achieved the best results: an F-measure of 0.94 and Recall of 0.95. In our case, this process was supplemented with manual review by the project team to remove all identifiers.
Dates present in the free-text notes were also removed, using regular expressions rather than NER. The date removal script is available on the project's GitHub page. Furthermore, all dates not part of free-text notes were shifted randomly 5 to 10 years forward. Dates referring to the same admission were shifted the same number of days forward (i.e. if admission "1" of the Admission dataset was shifted 100 days forward, all dates of all entries in the other 4 datasets which refer to admission "1" in their Admission ID field were shifted 100 days forward as well). This step ensured timeline coherence within the same admission. Note that, in order to deidentify patients more thoroughly, multiple admissions of the same patient may not be in chronological order and do not maintain any temporal relation to one another.
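The per-admission date shift described above can be illustrated with a minimal sketch. It is not the project's actual script: the field names, the sample rows, and the choice to express "5 to 10 years" as 5 × 365 to 10 × 365 days are assumptions made for the example.

```python
import random
from datetime import date, timedelta

# Assumed bounds: "5 to 10 years" expressed in days.
MIN_SHIFT_DAYS = 5 * 365
MAX_SHIFT_DAYS = 10 * 365


def make_offsets(admission_ids, rng):
    """Draw one random forward offset per admission."""
    return {
        adm_id: timedelta(days=rng.randint(MIN_SHIFT_DAYS, MAX_SHIFT_DAYS))
        for adm_id in admission_ids
    }


def shift_rows(rows, date_fields, offsets):
    """Return copies of rows with each date field moved by its admission's offset."""
    shifted = []
    for row in rows:
        new_row = dict(row)
        for field in date_fields:
            new_row[field] = row[field] + offsets[row["admission_id"]]
        shifted.append(new_row)
    return shifted


# Invented sample rows from two hypothetical tables.
admissions = [
    {"admission_id": 1, "admitted": date(2020, 8, 1)},
    {"admission_id": 2, "admitted": date(2020, 8, 5)},
]
exams = [{"admission_id": 1, "exam_date": date(2020, 8, 3)}]

rng = random.Random(0)  # seeded only to make the example repeatable
offsets = make_offsets([1, 2], rng)

shifted_admissions = shift_rows(admissions, ["admitted"], offsets)
shifted_exams = shift_rows(exams, ["exam_date"], offsets)

# Intervals inside one admission are preserved by the shared offset.
gap_before = exams[0]["exam_date"] - admissions[0]["admitted"]
gap_after = shifted_exams[0]["exam_date"] - shifted_admissions[0]["admitted"]
```

Because every table reuses the same offset for a given admission, the two-day gap between admission and exam is unchanged. Different admissions draw independent offsets, so their relative order is not guaranteed to survive.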
All internal database IDs, such as those for Patient ID or Admission ID, were also deidentified. Each was assigned a random numerical ID, congruent between datasets (i.e. if Admission ID "123456" is assigned the new ID "789" in the Admission dataset, the Admission ID "123456" was also assigned "789" whenever it appeared in the other 4 datasets).
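One way to obtain IDs that are random yet congruent across datasets is to build a single mapping per ID type and apply it to every table. The sketch below assumes that approach. The function names, field names, and sample rows are hypothetical, and the source does not state how the random IDs were actually drawn.

```python
import random


def build_id_map(original_ids, rng):
    """Assign each distinct original ID a unique random numeric ID."""
    unique_ids = sorted(set(original_ids))
    # Sampling without replacement guarantees that no two IDs collide.
    new_ids = rng.sample(range(1, 10 * len(unique_ids) + 1), len(unique_ids))
    return dict(zip(unique_ids, new_ids))


def remap(rows, field, id_map):
    """Return copies of rows with the given ID field replaced via the shared map."""
    return [{**row, field: id_map[row[field]]} for row in rows]


# Invented sample rows from two hypothetical tables.
admission_rows = [{"admission_id": "123456"}, {"admission_id": "654321"}]
note_rows = [
    {"admission_id": "123456", "note": "..."},
    {"admission_id": "123456", "note": "..."},
]

rng = random.Random(42)  # seeded only to make the example repeatable
id_map = build_id_map([r["admission_id"] for r in admission_rows], rng)

new_admissions = remap(admission_rows, "admission_id", id_map)
new_notes = remap(note_rows, "admission_id", id_map)
```

Reusing the same `id_map` for every table is what keeps references intact: both notes still point at the same, now anonymized, admission.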
Finally, generic ward labels were generated from the actual names of the wards of the hospitals from which the information was collected. Ward names were replaced with these labels in order to better prevent hospital identification while maintaining some of the more relevant information. The labels were generated with the help of an active healthcare professional.
BRATECA is composed of 5 files in CSV (Comma Separated Values) format. These tables are as follows:
- Admission, a dataset of every individual admission, which includes patient demographic data;
- Exam, a dataset of exams and their respective results performed for each admission;
- Prescription, a dataset of prescription headers, which includes information such as patient/admission ID for the patient/admission which received the prescription, pharmacy assessments, prescription date, expiration date, ward information, whether the prescription includes special medication such as controlled substances, intravenously administered drugs (IV drugs), and antibiotics;
- Prescription Item, a dataset of prescribed medications which includes details of each prescribed medication, including name, dosage, and information on how the medication is to be administered, with each entry of this dataset being directly related to a prescription header in the Prescription dataset; and
- Clinical Note, a dataset of free-text clinical notes on details of the patient's stay and treatment.
All datasets have IDs that are used for identification of relations between entries in each file. These are:
- Hospital ID, the identification for the hospital from which the raw data was collected;
- Patient ID, the ID for a given patient in the database;
- Admission ID, the ID for the patient's admissions, of which a single patient might have many; and
- Prescription ID, specific to the Prescription and Prescription Item datasets, which identifies prescription items as belonging to specific prescriptions.
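As an illustration of how these IDs link the files, the sketch below follows Prescription ID from Prescription Item rows to their prescription headers, then groups the items by Admission ID. The column names and the sample rows (including the drug names) are invented for the example and may not match the real CSV headers.

```python
import csv
import io
from collections import defaultdict

# Invented sample data; real column names may differ.
PRESCRIPTION_CSV = """prescription_id,admission_id,patient_id,hospital_id
10,1,100,1
11,1,100,1
12,2,101,1
"""

ITEM_CSV = """prescription_id,drug_name
10,dipirona
10,omeprazol
12,heparina
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def items_by_admission(prescriptions, items):
    """Group prescribed drug names by the admission their prescription belongs to."""
    admission_of = {p["prescription_id"]: p["admission_id"] for p in prescriptions}
    grouped = defaultdict(list)
    for item in items:
        grouped[admission_of[item["prescription_id"]]].append(item["drug_name"])
    return dict(grouped)


result = items_by_admission(read_rows(PRESCRIPTION_CSV), read_rows(ITEM_CSV))
```

The same pattern, a lookup keyed on the shared ID, extends to the Exam and Clinical Note files through Admission ID. With the real files, `read_rows` would be replaced by reading each CSV from disk.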
Researchers with access to the original database have already published several papers using the information which is to be released in BRATECA. For example, we have used state-of-the-art methods to develop algorithms to identify and remove names from Portuguese-language clinical texts. These methods were used to deidentify all free-text notes made available as part of BRATECA. The dataset has also been used to develop a prescription outlier detection system for use in hospital pharmacy services.
Details of the dataset will be presented at LREC 2022, the annual conference of the European Language Resources Association.
Version 1.1: The access requirements were updated. Access to the dataset now requires approval by the data contributors.
Version 0.9: This version of the dataset includes an incomplete version of the clinical notes (B1_ClinicalNote_demo.csv). The full set of notes will be released in the future, to allow for additional review and quality checks to be completed.
BRATECA has been deidentified according to the Health Insurance Portability and Accountability Act (HIPAA) standards using structured data cleansing and date shifting. The NoHarm.ai system, developed by the Institute for Artificial Intelligence in Healthcare, gathers no identifiable information from patients.
The data used in the experiments we conducted for this article came from a research project developed with several hospitals in Brazil. Also, all data sharing was approved by each hospital participating in that research. Ethical approval to use the hospitals' datasets in this research was granted by the National Research Ethics Committee under the number 46652521.9.0000.5530.
We gratefully acknowledge partial financial support by CNPq under project 25/2020, CAPES, the Institute of Artificial Intelligence in Healthcare, and the FCT under project UIDB/00057/2020 (Portugal).
Conflicts of Interest
The authors declare no conflicts of interest.
- Johnson, A., Pollard, T., & Mark, R. (2016). MIMIC-III Clinical Database (version 1.4). PhysioNet. https://doi.org/10.13026/C2XW26.
- Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 160035.
- Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2021). MIMIC-IV (version 1.0). PhysioNet. https://doi.org/10.13026/s6n6-xd98.
- Johnson, A., Bulgarelli, L., Pollard, T., Celi, L. A., Mark, R., & Horng, S. (2021). MIMIC-IV-ED (version 1.0). PhysioNet. https://doi.org/10.13026/77z6-9w59.
- N2C2 website. https://n2c2.dbmi.hms.harvard.edu/ [Accessed: 1 May 2022]
- e Oliveira, L. E. S., Peters, A. C., da Silva, A. M. P., Gebeluca, C. P., Gumiel, Y. B., Cintho, L. M. M., Carvalho, D. R., Hasan, S. A., & Moro, C. M. C. (2020). SemClinBr: a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks. CoRR, abs/2001.10071.
- Elisa Terumi Rubel Schneider, João Vitor Andrioli de Souza, Julien Knafou, Lucas Emanuel Silva e Oliveira, Jenny Copara, Yohan Bonescki Gumiel, Lucas Ferro Antunes de Oliveira, Emerson Cabrera Paraiso, Douglas Teodoro, and Cláudia Maria Cabral Moro Barra. 2020. BioBERTpt - A Portuguese Neural Language Model for Clinical Named Entity Recognition. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, pages 65–72, Online. Association for Computational Linguistics.
- BioBERTpt on GitHub: https://github.com/HAILab-PUCPR/BioBERTpt [Accessed: 1 May 2022]
- Reis, E. P., Paiva, J., Bueno da Silva, M. C., Sousa Ribeiro, G. A., Fornasiero Paiva, V., Bulgarelli, L., Lee, H., dos Santos, P. V., Brito, V., Amaral, L., Beraldo, G., Haidar Filho, J. N., Teles, G., Szarf, G., Pollard, T., Johnson, A., Celi, L. A., & Amaro, E. (2022). BRAX, a Brazilian labeled chest X-ray dataset (version 1.0.0). PhysioNet. https://doi.org/10.13026/ae9a-f727.
- de Melo, T., & Figueiredo, C. (2020). A first public dataset from Brazilian twitter and news on COVID-19 in Portuguese. Data in brief, 32, 106179. https://doi.org/10.1016/j.dib.2020.106179
- Institute for Artificial Intelligence in Healthcare website: https://instituto.noharm.ai/
- BRATECA on GitHub: https://github.com/noharm-ai/brateca [Accessed: 1 May 2022]
- D. P. dos Santos, H., D. P. S. Ulbrich, A. H., and Vieira, R. (2021). Evaluation of a prescription outlier detection system in hospital’s pharmacy services. In 12th International Workshop on Biomedical and Health Informatics (BHI).
- LREC 2022 Conference Website: https://lrec2022.lrec-conf.org/ [Accessed: 1 May 2022]
Only credentialed users who sign the DUA can access the files. In addition, users must have individual studies reviewed by the contributor.
License (for files):
PhysioNet Contributor Review Health Data License 1.5.0
Data Use Agreement:
PhysioNet Contributor Review Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Topics:
prescriptions, exams, tertiary care, clinical notes, natural language processing
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- submit a request to the authors to use the data for your project