Database Credentialed Access
Paediatric Intensive Care database
Published: Dec. 1, 2019. Version: 1.0.0
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
The Paediatric Intensive Care (PIC) database is a large paediatric-specific, single-center, bilingual database comprising information relating to children admitted to critical care units at a large children’s hospital in China. The database is de-identified and includes vital sign measurements, medications, laboratory measurements, fluid balance recordings, diagnostic codes, demographic information, and more. The data are publicly available after credentialing, which includes completion of a training course on research with human subjects and signing of a data use agreement mandating responsible handling of the data and adherence to the principle of collaborative research. The PIC database builds upon the success of the widely used MIMIC (Medical Information Mart for Intensive Care) database, and extends the approach into the field of paediatric critical care. The database has many unique characteristics which support academic and industrial research including the development of machine learning algorithms, clinical decision support tools, quality improvement initiatives, and international data sharing.
Intensive care units (ICUs) provide care for severely ill patients who require invasive life-saving treatment. Large amounts of data are routinely collected for these critically ill patients. The widely used MIMIC (Medical Information Mart for Intensive Care) critical care database has been developed for more than a decade and contains comprehensive clinical data of patients admitted to the Beth Israel Deaconess Medical Center in Boston, Massachusetts [1,2]. For many years, it has been the only freely accessible database in critical care medicine, and it supports a broad range of research areas, as evidenced by many publications. Building upon the success of MIMIC-III, the Laboratory of Computational Physiology partnered with Philips Healthcare to publish a multi-center intensive care unit database called the eICU Collaborative Research Database.
Importantly, all of these databases primarily focus on adult critically ill patients. While MIMIC-III contains neonatal patients, the Beth Israel Deaconess Medical Center (BIDMC) does not have a paediatric ICU, and as such comprehensive paediatric patient data are not available in it. Children are not just small adults: they are admitted to the ICU for a different set of disease states and potential developmental issues. Children have significantly different responses to therapies and different trajectories of recovery when compared with adult patients in critical care. Furthermore, there are significant age-related changes in both illness and treatment linked to the child's developmental stage. Therefore, all clinical evidence should also be validated in children before being applied in paediatrics.
Here, we report the release of the Paediatric Intensive Care (PIC) database, a freely accessible paediatric-specific critical care database. Like the widely used MIMIC database, the PIC database integrates de-identified, comprehensive clinical data of paediatric patients admitted to the Children’s Hospital of Zhejiang University School of Medicine and makes it internationally accessible to researchers under a data use agreement.
The PIC database is notable for the following reasons:
- It is publicly and freely available to researchers worldwide;
- It encompasses a very large population of paediatric patients and spans almost a decade;
- It contains high temporal resolution and high-fidelity clinical data;
- It follows the structure of MIMIC-III, allowing researchers to leverage their experience with MIMIC-III;
- It is the first freely accessible English-Chinese bilingual clinical database.
More information on the database is provided on the online documentation page.
The PIC database was populated with data that had been acquired during routine hospital care at the Children’s Hospital, Zhejiang University School of Medicine. This children’s hospital, with more than 1900 beds, is the largest comprehensive paediatric medical centre in Zhejiang Province and the Chinese National Clinical Research Centre of Child Health. There are more than 3 million outpatient and emergency visits per year in this children-only medical centre. It also accepts critically ill paediatric patients who are referred from lower-level hospitals across 11 cities of Zhejiang Province. It has 119 critical care beds in 5 intensive care units: general ICU, paediatric ICU (PICU), surgical ICU (SICU), cardiac ICU (CICU) and neonatal ICU (NICU). The clinical data of patients admitted to any of the ICUs between 2010 and 2018 were used to construct the PIC database. This project was approved by the Institutional Review Board of the Children’s Hospital, Zhejiang University School of Medicine (Hangzhou, China). The requirement for individual patient consent was waived because the project did not impact clinical care, and all protected health information was de-identified.
Data were extracted and downloaded from several information systems in the hospital, including the following:
- Hospital electronic medical record system
- Laboratory information system
- Computerized physician order entry system
- Nursing information system
- Anaesthesia information management system
- Reporting systems of different examination departments (radiology, ultrasound, ECG, pathology, etc.)
The core database schema followed the classic MIMIC-III database with some adjustments to adapt it for local Chinese data. This will help researchers who have experience with MIMIC to easily understand and utilize the PIC and to compare it with the MIMIC database.
After collecting the original data, we performed post-processing and integrated medical records to ensure that each patient had relatively complete data. Since the time spans of the data in each table from the different systems are inconsistent, we used SUBJECT_ID and HADM_ID, which appear in the ADMISSIONS, PATIENTS, and ICUSTAYS tables, for data integration and filtering. In parallel, we carefully checked the collected data for impossible entries (for example, a visit date with the year 1900), erroneous characters (for example, Chinese characters appearing in numeric fields), and extreme outliers, and removed or corrected these data.
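The kinds of checks described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the field names `EVENT_TIME` and `VALUE`, the cut-off year of 1950, and the helper names are hypothetical and are not taken from the actual PIC processing pipeline, which the source does not describe in detail.

```python
from datetime import datetime

# Assumed cut-off; the source only cites the year 1900 as an impossible entry.
MIN_PLAUSIBLE_YEAR = 1950


def is_plausible_date(text):
    """Return True if text parses as a timestamp on or after the cut-off year."""
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return False
    return parsed.year >= MIN_PLAUSIBLE_YEAR


def is_numeric(text):
    """Return True if text can be parsed as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def clean_rows(rows):
    """Split rows into those passing both checks and those failing either one."""
    kept, rejected = [], []
    for row in rows:
        if is_plausible_date(row["EVENT_TIME"]) and is_numeric(row["VALUE"]):
            kept.append(row)
        else:
            rejected.append(row)
    return kept, rejected


# Toy rows (hypothetical): one valid, one with an impossible year,
# one with Chinese text in a numeric field.
rows = [
    {"HADM_ID": "1", "EVENT_TIME": "2015-03-02 08:30:00", "VALUE": "36.8"},
    {"HADM_ID": "2", "EVENT_TIME": "1900-01-01 00:00:00", "VALUE": "37.1"},
    {"HADM_ID": "3", "EVENT_TIME": "2016-07-19 14:05:00", "VALUE": "正常"},
]
kept, rejected = clean_rows(rows)
print([row["HADM_ID"] for row in kept])
```

Keeping the rejected rows in a separate list, rather than silently dropping them, makes it possible to review what was removed and to correct entries where that is feasible.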
Following the MIMIC-III database schema, structured clinical data were collected from different systems. These data include patient demographics, medications, fluid balances, comprehensive laboratory results, and microbiological information (tests performed and sensitivities), and they cover the patients’ entire hospital stay, not only periods in the ICU. Similar to MIMIC-III, intermittent vital signs documented and validated by nursing staff are available. The PIC database also includes frequent vital signs collected during surgery from the anaesthesia information management system (one value every 5 minutes). To help researchers distinguish these vital sign data from the more common nurse-documented measurements, we created a new table named SURGERY_VITAL_SIGNS. Unfortunately, high-resolution vital signs (minute-to-minute) from critical care monitors are not available outside of surgery.
The largest challenge in making this database widely usable worldwide is the language barrier. All the lab test items, medication names, examinations and diagnoses were recorded in Chinese in the original information systems. To address this problem, the PIC database provides English and Chinese bilingual dictionary tables. For terms with widely used standard codes, such as the ICD-10 codes for diagnosis, the English terms were based on the corresponding English version of the standard codes. However, the local ICD-10 version is an extended version with 7 characters compared to the standard 5-character ICD-10 code, and therefore, the Chinese diagnosis is more specific than the English diagnosis. Lab tests associated with LOINC codes were translated using the English version of the corresponding LOINC code. Other clinical terms without a standard code were translated to English and reviewed manually by the authors. We keep the original Chinese terms in all the tables alongside the corresponding English terms, allowing researchers to identify any language ambiguities which may still exist. The suffix _cn is added to column names to indicate that the data will be in Chinese. Such a bilingual Chinese-English clinical dictionary table may also be used to support other international data sharing projects.
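As a small illustration of the naming convention, the sketch below pairs each column carrying the _cn suffix with its English counterpart. The helper `pair_bilingual_columns` and the example header are hypothetical; the match is case-insensitive because the exact capitalisation of column names in the released files is an assumption here.

```python
def pair_bilingual_columns(columns):
    """Map each English column name to its Chinese counterpart.

    A Chinese column is recognised by a `_cn` suffix (case-insensitive) and is
    paired with the column whose name is identical apart from that suffix.
    """
    by_lower = {name.lower(): name for name in columns}
    pairs = {}
    for lower, original in by_lower.items():
        if lower.endswith("_cn"):
            english = by_lower.get(lower[: -len("_cn")])
            if english is not None:
                pairs[english] = original
    return pairs


# Hypothetical header of a dictionary table.
header = ["ITEMID", "LABEL", "LABEL_CN", "CATEGORY"]
pairs = pair_bilingual_columns(header)
print(pairs)
```

Listing the pairs in this way makes it easy to display the English and Chinese terms side by side when checking a translation.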
The free-text clinical documents and reports were also recorded in Chinese. Translating a large volume of narrative clinical documents and reports is not feasible, and automatic de-identification tools and algorithms for Chinese clinical documents are not available either. We therefore could not release the original clinical documents at this stage. However, much important clinical information is embedded in the narrative of clinical documents, so common symptoms were extracted from clinical progress notes and discharge summaries using Natural Language Processing (NLP). These symptom terms and their negation status (a present symptom is marked “+” and an absent symptom “-”) are combined with the document recording time and published as the EMR_SYMPTOMS table in the first release of the PIC database. For many examinations there is only a record of the event, and the content of the associated report is unavailable. We will try to standardize most examination results in the future and publish them in the next release of the database.
Given the limited accuracy of NLP, we remind researchers to use these data with caution. To evaluate the accuracy of the NLP extraction, we randomly selected 100 notes including admission notes, discharge notes, progress notes, transfer notes, and so on. We extracted symptoms from these notes using NLP, resulting in a set of 3410 symptoms. Two authors independently checked the extracted results. The average accuracy was 91.9% [91.3%, 92.4%]. The three most common sources of NLP errors were as follows. First, longer Chinese phrases which did not have a natural word boundary were incorrectly segmented. Second, the sentiment detection would incorrectly classify a symptom as positive or negative. Finally, sentences which documented symptoms for family members were misinterpreted as referring to the patient.
Data were fully de-identified by removing all identifiers stipulated in the United States Health Insurance Portability and Accountability Act (HIPAA). HIPAA stipulates that all protected health information (PHI) must be removed, where PHI includes characteristics that could uniquely identify the individual (patient name, address, telephone numbers, and so on). When creating the dataset, patient IDs including SUBJECT_ID, HADM_ID and ICUSTAY_ID were randomly assigned a unique number, and the original identifier was not retained. As a result, the identifiers in the PIC cannot be linked back to the original, identifiable data, and as such the dataset can be considered anonymized. In addition, all dates were shifted to the future by a random patient-specific offset (50-100 years), resulting in stays that occur sometime between 2060 and 2120. All dates for a single patient were shifted by the same constant. As such, intervals within a single patient were retained and the approach ensured accurate chronology of a patient's critical care stay. Time of day, day of the week, and approximate seasonality (we divided 12 months into four seasons: spring: March, April, May; summer: June, July, August; autumn: September, October, November; winter: December, January, February) were conserved during the date transformation. The ethnicity of patients from ethnic minorities with very small populations was removed and replaced with "Other". As all patients admitted to this children’s hospital must be under the age of 18, we did not need to adjust the dates of birth of patients older than 90 years to hide their real age, as is done in MIMIC.
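One way to obtain a shift with these properties is sketched below. It is an assumption about how such an offset could be constructed, not the procedure actually used for the PIC database: the function names are hypothetical, and the offset is drawn as a whole number of weeks close to a whole number of years, so that the time of day and the day of the week are kept exactly and the season is kept approximately.

```python
import random
from datetime import datetime, timedelta

DAYS_PER_YEAR = 365.2425  # mean length of a Gregorian year


def draw_offset(rng, min_years=50, max_years=100):
    """Draw one patient-specific offset as a whole number of weeks.

    A whole number of weeks keeps the day of the week, and a shift within a
    few days of a whole number of years keeps the approximate season.
    """
    years = rng.randint(min_years, max_years)
    weeks = round(years * DAYS_PER_YEAR / 7)
    return timedelta(weeks=weeks)


def shift_patient_dates(timestamps, offset):
    """Apply the same offset to every timestamp belonging to one patient."""
    return [timestamp + offset for timestamp in timestamps]


rng = random.Random(0)  # fixed seed only so that the example is reproducible
offset = draw_offset(rng)

# Two hypothetical events of the same patient, one week and some hours apart.
original = [datetime(2015, 3, 2, 8, 30), datetime(2015, 3, 9, 17, 45)]
shifted = shift_patient_dates(original, offset)

for before, after in zip(original, shifted):
    print(before.strftime("%A %H:%M"), "->", after.strftime("%A %H:%M"))
```

Because the same offset is added to every timestamp of a patient, the interval between any two events is unchanged. Dates close to a season boundary may fall into the neighbouring season after the shift, which is why the seasonality is only approximately conserved in this sketch.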
Data available in the PIC database include laboratory measurements, charted observations during a patient’s stay, structured symptoms extracted from notes, and vital signs recorded while a patient was present in the operating room. The PIC consists of 16 tables that are linked by unique identifiers. Tables with a ‘D’ prefix are dictionary tables and provide a definition of the identifiers. For example, each row of LABEVENTS is associated with an ITEMID that represents the concept of the measurement, but it does not contain the actual name of the measurement. By joining LABEVENTS and D_LABITEMS on the ITEMID, the concept represented by a given ITEMID can be identified. Notably, in the PIC database, the ITEMID is a character field, as opposed to an integer (as it is in MIMIC).
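The join between an event table and its dictionary table can be sketched with an in-memory SQLite database as follows. Only the table names and ITEMID come from the description above; the other columns (`LABEL`, `VALUE`), the item codes and the values are invented for illustration, and the real tables contain more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- ITEMID is stored as text, matching the character field used in PIC.
    CREATE TABLE D_LABITEMS (ITEMID TEXT PRIMARY KEY, LABEL TEXT);
    CREATE TABLE LABEVENTS (SUBJECT_ID INTEGER, ITEMID TEXT, VALUE TEXT);

    -- Invented dictionary entries and measurements.
    INSERT INTO D_LABITEMS VALUES ('A001', 'Example lab test A');
    INSERT INTO D_LABITEMS VALUES ('A002', 'Example lab test B');
    INSERT INTO LABEVENTS VALUES (1, 'A001', '4.2');
    INSERT INTO LABEVENTS VALUES (1, 'A002', '138');
    INSERT INTO LABEVENTS VALUES (2, 'A001', '3.9');
    """
)

# Attach the human-readable label to every measurement via ITEMID.
labelled_rows = conn.execute(
    """
    SELECT le.SUBJECT_ID, le.ITEMID, d.LABEL, le.VALUE
    FROM LABEVENTS AS le
    JOIN D_LABITEMS AS d ON d.ITEMID = le.ITEMID
    ORDER BY le.SUBJECT_ID, le.ITEMID
    """
).fetchall()
conn.close()

for row in labelled_rows:
    print(row)
```

Because ITEMID is a character field, queries that compare it with a literal should quote the value (for example `'A001'`), unlike in MIMIC, where integer literals are used.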
Briefly, there are three tables used to define and track patient hospitalization: ADMISSIONS; PATIENTS; and ICUSTAYS. The three identifiers described earlier (SUBJECT_ID, HADM_ID, ICUSTAY_ID) are present in all three of the above tables. The other three tables are dictionaries for defining disease codes and items appearing in the PIC database. The remaining tables contain data related to patient care during the hospital stay, such as demographics, laboratory test results, medications, symptoms, vital sign measurements, and mortality. All the tables are distributed as a collection of compressed comma separated value (CSV) files that can be loaded into many relational database systems or by general purpose software analysis packages.
The online documentation on the PIC website describes each table in greater detail.
The PIC is provided as a collection of compressed comma separated value (CSV) files, along with example scripts to help with importing the data into the MySQL and PostgreSQL database systems.
To build a local MySQL database of PIC, please download all the compressed CSV files and the SQL script files of this project. Decompress the files using an appropriate tool (*nix users could use gzip, while Windows users could use 7zip). Then create a schema named "pic" in your local MySQL database and run the define.sql script to create all the tables and load the data. After that, run index.sql and constraints.sql in MySQL to create indexes and constraints on these tables.
To build a local PostgreSQL database, please download all of the compressed CSV files. Create a schema named "pic" in your PostgreSQL database and set it as the default search path. If you have the command line tool gzip, then you do not need to decompress the files, and can run the 1-create-tables-load-data-gzip.sql file to load the data into a PostgreSQL database. If you do not have gzip, you will need to decompress the files to CSV and run the 1-create-tables-load-data-csv.sql script. After loading the data, you can use 2-index.sql and 3-constraints.sql to create indices and constraints.
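For a quick look at a single table without setting up a database server, a compressed CSV file can also be read directly with general-purpose tools. The sketch below uses only the Python standard library. The file name `DEMO_TABLE.csv.gz`, its contents, and the UTF-8 encoding are assumptions made so that the example runs on its own; they do not describe the released files.

```python
import csv
import gzip
import os
import tempfile


def read_gzipped_csv(path, encoding="utf-8"):
    """Read a gzip-compressed CSV file into a list of dictionaries, one per row."""
    with gzip.open(path, mode="rt", encoding=encoding, newline="") as handle:
        return list(csv.DictReader(handle))


# Build a tiny stand-in file so that the example runs without the real data.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "DEMO_TABLE.csv.gz")
with gzip.open(demo_path, mode="wt", encoding="utf-8", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["SUBJECT_ID", "HADM_ID"])
    writer.writerow(["1", "100001"])
    writer.writerow(["2", "100002"])

demo_rows = read_gzipped_csv(demo_path)
print(len(demo_rows), demo_rows[0]["HADM_ID"])
```

Reading a whole table into memory is only practical for the smaller tables; for large event tables, iterating over the `csv.DictReader` row by row or loading the data into a database as described above is likely to be more suitable.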
As the database contains detailed information regarding the clinical care of patients, it must be treated with appropriate care and respect. Following the MIMIC protocol, researchers are required to formally request access via a process documented on the PIC website. There are two key steps that must be completed before access is granted:
- the researcher must complete a recognized course in protecting human research participants that includes HIPAA requirements or obtain a GCP certification from a local institute in China.
- the researcher must sign a data use agreement, which outlines appropriate data usage and security standards, and forbids efforts to identify individual patients.
Researchers who have already been approved to access MIMIC-III will be familiar with this process and can easily get started with the PIC. Approval requires approximately three working days. Once an application has been approved, the researcher will receive emails containing instructions for downloading the database.
PIC v1.0 was released on 26 November 2019. As this was the first release of the database, no changes are listed here.
This work was supported by the National Natural Science Foundation of China, the Chinese State’s Key Project of R&D Plan [2016YFC0901905, 2016YFC0901703] and the Zhejiang Provincial Program for the Cultivation of High-level Innovative Health Talents [2016-6].
We also want to thank the MIMIC project; many of the protocols in this project were based on it.
Conflicts of Interest
The authors declare no competing financial interests.
- Saeed, M. et al. Multiparameter intelligent monitoring in intensive care II: A public-access intensive care unit database. in Critical Care Medicine (2011). doi:10.1097/CCM.0b013e31820a92c6
- Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data (2016). doi:10.1038/sdata.2016.35
- Pollard, T. J. et al. The eICU collaborative research database, a freely available multi-center database for critical care research. Sci. Data (2018). doi:10.1038/sdata.2018.178
- Czaja, A. S. Children Are Not Just Little Adults. Pediatric Critical Care Medicine (2016). doi:10.1097/PCC.0000000000000597
Only credentialed users who sign the specified DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0