Software Open Access
Digital Data Real-Time Ingestion Utility
Shivangi Kewalramani, Hayden Caldwell, Larisa Tereshchenko
Published: March 25, 2026. Version: 1.0.0
When using this resource, please cite:
Kewalramani, S., Caldwell, H., & Tereshchenko, L. (2026). Digital Data Real-Time Ingestion Utility (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/jt51-cs74
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
Abstract
We present a Python-based utility for real-time ingestion of digital physiological signals and study enrollment metadata into an Oracle database. The application combines filesystem event monitoring, a GUI data entry form, and thin-mode Oracle connectivity to ensure timely and accurate data capture at the point of file creation. The system is designed for research assistants to enter metadata for digital signal recordings (e.g., ECG, PCG, and other physiological datasets), with automated database field assignments (e.g., study_date, study_id) and compliance with pre-defined enumerations and constraints. The utility follows best practices and aligns with high institutional and regulatory standards for digital data quality and integrity, ensuring robust, reproducible, and traceable research data collection.
Key features include real-time linkage between digital physiological recordings and their corresponding metadata, ensuring synchronized and traceable data entry. The use of the Oracle thin client enables lightweight and efficient database integration, while the GUI-based control of data input minimizes human errors during metadata collection. All entries are securely timestamped to maintain a comprehensive audit trail and support full accountability. From a security standpoint, database credentials should be stored in environment variables or .env files with restricted access, user accounts should be limited to INSERT privileges, and read-only database views are recommended for downstream data consumers to preserve data integrity and confidentiality.
Background
Accurate and timely collection of metadata at the time of recording is essential for longitudinal clinical research. Manual metadata collection introduces risks of inconsistency and omission, compromising data quality and traceability. This utility addresses these limitations by integrating a file watcher and controlled GUI input dialog to minimize human error, enforce coding standards, and synchronize demographic and clinical metadata with digital physiological recordings in real time. Developed within the Institution's clinical research environment, the software adheres to institutional standards for digital data management and integrity—emphasizing completeness, consistency, and audit traceability in compliance with digital data governance principles.
This work builds upon established PhysioNet resources such as the MIMIC-IV electronic health record database [1] and the MIT-BIH ECG Compression Test Database [2], extending their focus on structured physiological data toward real-time metadata ingestion and audit-ready data capture.
Software Description
The Digital Data Real-Time Ingestion Utility is a Python-based application that automates the linkage between digital physiological recordings and their corresponding metadata entries. It integrates three major components: a Database Layer, an Application Layer, and a Monitoring Layer. The Database Layer consists of an Oracle database with a target table (ds_ecg_t1) for structured metadata ingestion. The Application Layer provides a Python GUI built with tkinter and tkcalendar for controlled and standardized data entry. The Monitoring Layer implements real-time folder monitoring using the watchdog package, which automatically triggers a data input dialog whenever a new file is created. Each completed ingestion event links one captured digital file to one validated, timestamped, and user-attributed metadata entry, enabling traceable, reproducible clinical research datasets.
Technical Implementation
Architecture Overview
The system architecture comprises an Oracle [3] institutional database serving as the core data repository, with the target table ds_ecg_t1 designed for structured metadata ingestion. Certain fields, such as study_id and study_date, are automatically assigned within the database to ensure consistency and traceability. The application stack is developed in Python and incorporates a graphical user interface (GUI) built with tkinter and tkcalendar for controlled metadata entry. Real-time file monitoring is implemented through the watchdog package, enabling automatic detection and processing of newly created files. Database communication is handled using the oracledb driver in thin mode, which eliminates the need for a local Oracle client installation. The deployment process is streamlined via a Windows batch file (.cmd) that opens the monitored directory and launches the Python script for seamless operation.
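The thin-mode connection described above can be sketched as follows. This is a minimal illustration, not the utility's actual code: the helper names build_dsn and open_connection, the default listener port 1521, and the example host and service values in the usage note are assumptions. Only the oracledb driver and its thin mode come from the description above.

```python
def build_dsn(host: str, service_name: str, port: int = 1521) -> str:
    """Return an Easy Connect string of the form host:port/service_name."""
    if not host or not service_name:
        raise ValueError("host and service_name must be non-empty")
    return f"{host}:{port}/{service_name}"


def open_connection(user: str, password: str, host: str, service_name: str):
    """Open a database connection.

    python-oracledb runs in thin mode unless oracledb.init_oracle_client()
    has been called, so no Oracle client libraries are needed here.
    """
    import oracledb  # imported lazily so build_dsn works without the driver

    return oracledb.connect(
        user=user,
        password=password,
        dsn=build_dsn(host, service_name),
    )
```

For example, build_dsn("dbhost.example.org", "RESEARCH") yields dbhost.example.org:1521/RESEARCH, where both arguments are placeholder values.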
Database Table Definition
CREATE TABLE ds_ecg_t1 (
    study_id   NUMBER GENERATED ALWAYS AS IDENTITY (START WITH 10001 INCREMENT BY 1) PRIMARY KEY,
    name       VARCHAR2(200) NOT NULL,
    mrn        VARCHAR2(6) NOT NULL,
    dob        DATE NOT NULL,
    age        NUMBER(5,2) NOT NULL,
    sex        VARCHAR2(10) NOT NULL,
    race_eth   VARCHAR2(10) NOT NULL,
    shd_hist   NUMBER(1) NOT NULL,
    as_hist    NUMBER(1) NOT NULL,
    shd_eko    NUMBER(1) NOT NULL,
    file_path  VARCHAR2(1024),
    study_date TIMESTAMP(6) DEFAULT SYSTIMESTAMP NOT NULL,
    usr_info   VARCHAR2(64)
);
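A parameterized INSERT against this table might look like the sketch below. The column list is taken from the definition above; study_id and study_date are omitted because the database assigns them. The names INSERT_COLUMNS, build_insert, and insert_record are hypothetical, and the sketch is not the utility's actual insertion code.

```python
# Columns supplied by the application. study_id (identity column) and
# study_date (DEFAULT SYSTIMESTAMP) are left to the database.
INSERT_COLUMNS = (
    "name", "mrn", "dob", "age", "sex", "race_eth",
    "shd_hist", "as_hist", "shd_eko", "file_path", "usr_info",
)


def build_insert(record: dict) -> tuple:
    """Return (sql, binds) for one row, using named bind variables."""
    missing = [c for c in INSERT_COLUMNS if c not in record]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    columns = ", ".join(INSERT_COLUMNS)
    placeholders = ", ".join(f":{c}" for c in INSERT_COLUMNS)
    sql = f"INSERT INTO ds_ecg_t1 ({columns}) VALUES ({placeholders})"
    binds = {c: record[c] for c in INSERT_COLUMNS}
    return sql, binds


def insert_record(connection, record: dict) -> None:
    """Execute the insert on an open connection, then commit."""
    sql, binds = build_insert(record)
    with connection.cursor() as cursor:
        cursor.execute(sql, binds)
    connection.commit()
```

Bind variables keep user-entered values out of the SQL text, which avoids quoting problems and SQL injection from free-text fields such as name.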
Data Handling and Integrity Controls
The system supports data integrity through automated and traceable metadata handling. The patient's age is calculated automatically from the date of birth (DOB) and the current date, while the study date is assigned by Oracle through the SYSTIMESTAMP column default, so the timestamp reflects the database server clock and not the workstation clock. User information is captured from the operating-system login name via getpass.getuser(), which associates each data entry with a specific operator. Because getpass.getuser() consults the LOGNAME, USER, LNAME, and USERNAME environment variables before falling back to the system account database, this value serves as attribution, not authentication. Together, these mechanisms ensure that every inserted record is timestamped and user-attributed, supporting accountability, reproducibility, and compliance with institutional data integrity standards.
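The age calculation and operator capture described above can be illustrated with a short sketch. The function names compute_age and current_operator are hypothetical, and the conversion of days to years using 365.25 days per year is an assumption; the utility may compute age differently.

```python
import getpass
from datetime import date
from typing import Optional


def compute_age(dob: date, today: Optional[date] = None) -> float:
    """Age in years, rounded to two decimals to fit the NUMBER(5,2) column."""
    today = today or date.today()
    if dob > today:
        raise ValueError("date of birth is in the future")
    # Assumption: one year is taken as 365.25 days.
    return round((today - dob).days / 365.25, 2)


def current_operator() -> str:
    """Login name of the person running the application."""
    return getpass.getuser()
```

With this convention, a date of birth of 2000-01-01 evaluated on 2020-01-01 spans 7305 days, which gives an age of 20.0.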
Installation and Requirements
The system requires Python 3.10 or later and the packages oracledb, watchdog [4], and tkcalendar [5], which can be installed with the command pip install oracledb watchdog tkcalendar. Users must have an Oracle database account with INSERT privileges on the ds_ecg_t1 table and read/write permissions for the monitored directory. The configuration parameters DB_USERNAME, DB_PASSWORD, DB_HOST, DB_SERVICENAME, and WATCHED_FOLDER must be set before execution. The application is launched using a Windows batch file containing the commands @echo off, explorer " ", and python " ", which lets users start the system in a single step that opens the monitored folder and starts the application console.
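Since the abstract recommends keeping credentials in environment variables, one way to read the configuration parameters listed above is sketched here. Settings and load_settings are hypothetical names, and the utility itself may instead set these values directly in the script.

```python
import os
from dataclasses import dataclass

REQUIRED_KEYS = (
    "DB_USERNAME", "DB_PASSWORD", "DB_HOST",
    "DB_SERVICENAME", "WATCHED_FOLDER",
)


@dataclass(frozen=True)
class Settings:
    db_username: str
    db_password: str
    db_host: str
    db_servicename: str
    watched_folder: str


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Report every missing or empty key at once, not only the first.
    missing = [k for k in REQUIRED_KEYS if not env.get(k)]
    if missing:
        raise RuntimeError("missing configuration: " + ", ".join(missing))
    return Settings(**{k.lower(): env[k] for k in REQUIRED_KEYS})
```

Failing at startup with a list of missing keys is easier to diagnose than a later ORA- error caused by an empty username or service name.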
Getting Started
- Install Python 3.10 or later.
- Clone or download this repository.
- Install dependencies: pip install -r requirements.txt
- Open DB-GitHub.py and set the following configuration values: DB_USERNAME, DB_PASSWORD, DB_HOST, DB_SERVICENAME, BASE_FOLDER.
- Ensure the Oracle target table exists and that your user has INSERT privileges.
- Run the launcher: double-click DB_entry-GH.bat, or run python DB-GitHub.py
- A new watch folder will be created automatically inside BASE_FOLDER.
- Place a physiological signal file into the generated folder.
- Complete the metadata form when prompted.
- Verify successful insertion in the console output.
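The automatic creation of a watch folder inside BASE_FOLDER, mentioned in the steps above, could be done along these lines. The function name create_watch_folder and the timestamp-based folder name are assumptions for illustration; the naming scheme used by the utility is not documented here.

```python
from datetime import datetime
from pathlib import Path


def create_watch_folder(base_folder, now=None) -> Path:
    """Create and return a new subfolder of base_folder."""
    now = now or datetime.now()
    # Assumption: the folder is named after its creation time.
    folder = Path(base_folder) / now.strftime("watch_%Y%m%d_%H%M%S")
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```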
Usage Notes
The research assistant workflow begins by launching the system through the provided batch file and verifying the console output, which should display the monitored folder path (e.g., "Monitoring folder: "). Once confirmed, the user can add or receive a new physiological data file, such as an ECG or PCG recording, into the monitored directory. Upon detection, a graphical user interface (GUI) window automatically appears, prompting the user to enter demographic and clinical metadata. After completing the form, the user can press OK to confirm and insert the data into the database or select Cancel to skip the insertion. Successful entries are confirmed in the console output. This automated workflow ensures that each digital signal recording is linked with accurate metadata, thereby enhancing traceability, auditability, and compliance with institutional data integrity standards.
In case of issues:
- No dialog appears: Verify that the watched folder path is correctly specified and accessible.
- ORA- errors: Check database credentials, service name, and network connection.
- Constraint violations: Ensure dropdown selections match the allowed database values.
- Module not found (e.g., tkcalendar): Install the missing package using pip install tkcalendar.
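Constraint violations of the kind listed above can often be caught before the insert is attempted. The sketch below checks a record against the column sizes in the ds_ecg_t1 definition. The name validate_record is hypothetical, and treating the NUMBER(1) fields as 0/1 flags is an assumption; the allowed dropdown values themselves are not reproduced here.

```python
# Maximum lengths of the required text columns in ds_ecg_t1.
MAX_LENGTHS = {"name": 200, "mrn": 6, "sex": 10, "race_eth": 10}
# Assumption: the NUMBER(1) history fields are used as 0/1 flags.
FLAG_FIELDS = ("shd_hist", "as_hist", "shd_eko")


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field, limit in MAX_LENGTHS.items():
        value = record.get(field)
        if not value:
            problems.append(f"{field} is required")
        # Lengths are compared in characters; this matches the column's
        # byte limit only for single-byte text.
        elif len(value) > limit:
            problems.append(f"{field} exceeds {limit} characters")
    for field in FLAG_FIELDS:
        if record.get(field) not in (0, 1):
            problems.append(f"{field} must be 0 or 1")
    return problems
```

Showing these messages in the form lets the operator correct an entry without decoding an ORA- error message.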
Usage Workflow
The typical workflow for using the Digital Data Real-Time Ingestion Utility is as follows:
- Start the ingestion utility by running DB_entry-GH.bat or executing python DB-GitHub.py.
- The application automatically creates a watch folder inside the directory specified by BASE_FOLDER.
- The system continuously monitors this folder for newly added physiological signal files.
- When a new signal file is detected, a GUI metadata entry dialog appears prompting the research assistant to enter patient and study information.
- After the metadata form is completed and submitted, the application:
  - Associates the signal file with the entered metadata
  - Generates a new study identifier
  - Inserts the metadata and file reference into the Oracle database.
- The database insertion status is displayed in the console output, confirming successful ingestion.
This workflow ensures that physiological signal recordings are immediately paired with validated metadata, improving traceability and data integrity for clinical research studies.
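The monitoring step of this workflow can be sketched with the watchdog package as follows. This is an illustrative outline under stated assumptions, not the code in DB-GitHub.py: the names is_new_recording and watch_folder, the rule for skipping hidden or temporary files, and the on_new_file callback are all hypothetical.

```python
import time
from pathlib import Path


def is_new_recording(src_path: str, is_directory: bool) -> bool:
    """Decide whether a 'created' event should open the metadata form."""
    name = Path(src_path).name
    # Assumption: directories and hidden or temporary files are skipped.
    return not is_directory and not name.startswith((".", "~"))


def watch_folder(folder: str, on_new_file) -> None:
    """Call on_new_file(path) for each new file until interrupted."""
    # Imported lazily so is_new_recording can be used without watchdog.
    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    class NewFileHandler(FileSystemEventHandler):
        def on_created(self, event):
            if is_new_recording(event.src_path, event.is_directory):
                on_new_file(event.src_path)

    observer = Observer()
    observer.schedule(NewFileHandler(), folder, recursive=False)
    observer.start()
    print(f"Monitoring folder: {folder}")
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        pass
    finally:
        observer.stop()
        observer.join()
```

Note that watchdog invokes handlers on a background thread, whereas tkinter windows are generally expected to run on the main thread, so a real implementation may need to pass the detected path to the GUI through a queue.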
Repository File List
- DB-GitHub.py: Main Python application for folder monitoring, GUI metadata entry, and Oracle database insertion.
- DB_entry-GH.bat: Windows batch launcher that opens the monitored folder and starts the Python application.
- README.md: Project overview, installation instructions, configuration details, usage workflow, and troubleshooting notes.
- requirements.txt: Python package dependencies and versions required to run the software.
Release Notes
Version: 1.0.0. Initial public release.
Ethics
The study protocol (including database development) was approved by the Institutional Review Board (IRB).
Acknowledgements
We thank the Cleveland Clinic Main Campus for institutional support in the design and testing of this digital data ingestion utility. This tool reflects our collective effort to uphold the highest standards in digital clinical data quality, integrity, and reproducibility across all research initiatives.
Conflicts of Interest
The authors declare no conflicts of interest related to the development, implementation, or publication of this software.
References
- Johnson AEW, Bulgarelli L, Pollard TJ, Horng S, Celi LA, Mark RG. MIMIC-IV (version 2.2). PhysioNet. Available from: https://physionet.org/content/mimiciv/
- Moody GB, Mark RG. MIT-BIH ECG Compression Test Database. PhysioNet. Available from: https://physionet.org/content/cdb/
- Oracle Corporation. Oracle Database version 19c. Available from: https://www.oracle.com/database/
- Barham T. Watchdog version 4.0.0. Python Package Index. Available from: https://pypi.org/project/watchdog/
- Tarkkanen A. Tkcalendar version 1.6.1. Python Package Index. Available from: https://pypi.org/project/tkcalendar/
Access
Access Policy:
Anyone can access the files, as long as they conform to the terms of the specified license.
License (for files):
MIT License
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/jt51-cs74
DOI (latest version):
https://doi.org/10.13026/gmhb-9z50
Programming Languages:
Topics:
empty database
reliability
bedside digital signal collection
quality control
schema-only database
real-time recording
Project Website:
https://github.com/Tereshchenkolab/Digital_Data_Real_Time_Ingestion_Utility
Files
Total uncompressed size: 28.9 KB.
Access the files
- Download the ZIP file (14.0 KB)
- Download the files using your terminal: wget -r -N -c -np https://physionet.org/files/digital-data-ingestion-utility/1.0.0/
| Name | Size | Modified |
|---|---|---|
| DB-GitHub.cpython-311.pyc | 11.9 KB | 2026-03-20 |