
DrugEHRQA: A Question Answering Dataset on Structured and Unstructured Electronic Health Records For Medicine Related Queries

Jayetri Bardhan Anthony Colas Kirk Roberts Daisy Zhe Wang

Published: April 12, 2022. Version: 1.0.0

When using this resource, please cite:
Bardhan, J., Colas, A., Roberts, K., & Wang, D. Z. (2022). DrugEHRQA: A Question Answering Dataset on Structured and Unstructured Electronic Health Records For Medicine Related Queries (version 1.0.0). PhysioNet.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Abstract

Electronic Health Records (EHRs) contain patient records stored in structured tables as well as in unstructured clinical notes. The information in structured and unstructured EHR records is not strictly disjoint: information may be duplicated, contradictory, or provide additional context between these sources. This presents a rich opportunity to study question answering (QA) models that combine reasoning over both structured and unstructured data. This work presents DrugEHRQA, the first QA dataset containing question-answer pairs from both the structured tables and the unstructured notes of MIMIC-III, a publicly available EHR database. We are releasing a QA dataset over MIMIC-III tables through PhysioNet, containing 41,417 triplets, each consisting of a natural language question, its corresponding SQL query, and the answer retrieved from the MIMIC-III tables. We also generated a QA dataset on the unstructured clinical notes of MIMIC-III, which can be found in the n2c2 repository. The two datasets are combined to generate a multimodal QA dataset (DrugEHRQA), which contains question-answer pairs from both structured and unstructured data of MIMIC-III. The DrugEHRQA dataset consists of medication-related queries and contains over 70,000 question-answer pairs. Our goal is to provide a benchmark dataset for multimodal QA systems and to open up new avenues of research in improving question answering over structured EHR data by using context from unstructured clinical data.


Background

Electronic Health Records (EHRs) are digitized records of patients’ medical histories. They can help doctors make better diagnoses and help patients obtain answers to health-related queries. The structured relational database of MIMIC-III [1-2] has multiple tables that store patients’ medical data. The unstructured data, on the other hand, are notes entered by clinicians that contain a detailed description of each patient’s visit, past medical history, problems, symptoms, and more. To benefit from both EHR tables and text, a multimodal QA dataset on EHRs is needed.

Existing QA datasets on EHRs such as emrQA [3] and CliniQG4QA [4] use unstructured clinical notes to retrieve answers, whereas datasets such as MIMICSQL [5] and emrKBQA [6] use the structured MIMIC-III [1-2] tables for QA. We present DrugEHRQA, the first QA dataset that uses both the structured tables and the unstructured clinical notes of an EHR to answer questions. The answers from one source (structured or unstructured) can provide context for, and improve QA over, the other source.


Methods

The dataset was generated automatically using a novel template-based strategy. We then manually sampled 500 queries and verified the sampled answers by hand. We used the following steps to annotate the QA pairs from the discharge summaries of MIMIC-III:

Annotation of question templates: We annotated nine types of templates for drug-related questions.

Answer retrieval from unstructured data: The 2018 Adverse Drug Event (ADE) and Medication Extraction Challenge dataset [7], available in the n2c2 repository [8], contains annotations for 505 patient discharge summaries. We used these annotations to extract all drug attributes from the 505 discharge summaries. Some of these attributes are Strength-Drug, Form-Drug, Route-Drug, Dosage-Drug, Frequency-Drug and Reason-Drug. We used each of these drug attributes, together with the medicine names, to generate nine types of natural language question templates. For example, the relation Strength-Drug was used to generate the question template ‘What is the drug strength of |drug| prescribed to the patient with admission id |hadm_id|’, where hadm_id refers to the admission id of the patient. The medicines and drug attributes from the 505 annotation files are slot-filled into the placeholders of the question templates to generate the question-answer pairs. Because of the n2c2 data licensing requirements, we submitted this QA dataset on the clinical notes of MIMIC-III to the n2c2 repository [8].

Answer extraction from MIMIC-III tables: Answers are extracted from the MIMIC-III [1-2] tables by taking the admission ids, drug names and problems (or reasons) used in the data generation process for unstructured data and filling them into the |hadm_id|, |drug| and |problem| slots of the natural language and SQL query templates. This slot-filling process generates the SQL queries that retrieve answers from the structured MIMIC-III database. We used the PRESCRIPTIONS, DIAGNOSES_ICD and D_ICD_DIAGNOSES tables of MIMIC-III for the dataset generation. We then paraphrased the natural language questions to obtain three paraphrases for every template.
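The slot-filling step described above can be illustrated with a minimal Python sketch. The template strings, the fill_slots and build_triplet helpers, and the toy rows are assumptions made purely for illustration; they are not the authors' actual templates or data. Only the PRESCRIPTIONS table name and the |drug| and |hadm_id| placeholders come from the description above, and the column names HADM_ID, DRUG and PROD_STRENGTH follow the MIMIC-III PRESCRIPTIONS schema.

```python
import sqlite3

# Hypothetical templates; the dataset's real templates may differ.
NL_TEMPLATE = ("What is the drug strength of |drug| prescribed to the "
               "patient with admission id |hadm_id|")
SQL_TEMPLATE = ("SELECT PROD_STRENGTH FROM PRESCRIPTIONS "
                "WHERE HADM_ID = |hadm_id| AND DRUG = '|drug|'")


def fill_slots(template, slots):
    """Replace each |name| placeholder with its slot value."""
    for name, value in slots.items():
        template = template.replace(f"|{name}|", str(value))
    return template


def build_triplet(conn, slots):
    """Return (question, sql, answers) for one set of slot values."""
    question = fill_slots(NL_TEMPLATE, slots)
    sql = fill_slots(SQL_TEMPLATE, slots)
    answers = [row[0] for row in conn.execute(sql)]
    return question, sql, answers


# Toy in-memory stand-in for the PRESCRIPTIONS table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PRESCRIPTIONS (HADM_ID INTEGER, DRUG TEXT, PROD_STRENGTH TEXT)"
)
conn.executemany(
    "INSERT INTO PRESCRIPTIONS VALUES (?, ?, ?)",
    [
        (1, "DRUG_A", "10mg Tablet"),
        (1, "DRUG_B", "5mg Tablet"),
        (2, "DRUG_A", "20mg Tablet"),
    ],
)

question, sql, answers = build_triplet(conn, {"drug": "DRUG_A", "hadm_id": 1})
print(question)
print(sql)
print(answers)
```

Plain string substitution mirrors the template idea and keeps the generated SQL text available as part of the triplet. For untrusted input, parameterized queries would be the safer choice.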

Selecting multimodal answers: We developed a rule-based method that generates multimodal answers from the answers retrieved from the structured and unstructured data sources of MIMIC-III. We manually checked 500 samples of the dataset to verify the results.

Data Description

The DrugEHRQA dataset for QA over the structured MIMIC-III database is organized into nine templates. Each entry contains a natural language question, its corresponding SQL query and the answer retrieved from the MIMIC-III tables. The data is stored in comma-separated values (CSV) format. It contains medicine-related questions about, for example, drug dosage, route, form of the medicine and reasons for taking it. Each question template is stored in a separate CSV file. The files have the following headers:

  • NL_Question: The natural language question.
  • SQL_Query: The SQL query used to retrieve the answer from MIMIC-III.
  • Answer_Structured: The answer retrieved from MIMIC-III database.
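As a minimal sketch of how one of these files might be read, the following Python function loads a template CSV and checks that the three documented headers are present. The function name load_triplets, the UTF-8 encoding and the example path in the comment are assumptions, not part of the dataset documentation.

```python
import csv

# Headers documented for the structured DrugEHRQA CSV files.
EXPECTED_HEADERS = ["NL_Question", "SQL_Query", "Answer_Structured"]


def load_triplets(path):
    """Read one template CSV and return its rows as a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        found = reader.fieldnames or []
        missing = [h for h in EXPECTED_HEADERS if h not in found]
        if missing:
            raise ValueError(f"{path} is missing columns: {missing}")
        return list(reader)


# Example call (hypothetical path, assuming the zip was extracted there):
# rows = load_triplets("structured_queries/query5.csv")
# print(rows[0]["NL_Question"])
```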

Example QA triplet

The following is an example of a QA triplet from the DrugEHRQA dataset over MIMIC-III's structured data:

Natural Language Question: What is the dosage of PREDNISONE prescribed to the patient with admission ID 103142

SQL Query:


Answer from MIMIC-III tables: (60MG,50MG)
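A multi-valued answer such as the one above can be split into individual values with a few lines of Python. This sketch assumes that answers are serialized as a parenthesized, comma-separated string, as in this one example; other files may use a different layout, and values that themselves contain commas would be split incorrectly. The helper name parse_answer is hypothetical.

```python
def parse_answer(raw):
    """Split an answer string such as "(60MG,50MG)" into its values."""
    text = raw.strip()
    # Drop one pair of surrounding parentheses, if present.
    if text.startswith("(") and text.endswith(")"):
        text = text[1:-1]
    # Split on commas and discard empty fragments.
    return [part.strip() for part in text.split(",") if part.strip()]


print(parse_answer("(60MG,50MG)"))  # ['60MG', '50MG']
```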

The PREDNISONE example retrieves its answer from the PRESCRIPTIONS table of MIMIC-III. Each template (CSV file) corresponds to a drug-related entity or attribute, as listed below:

  • query1.csv: drug entities
  • query2.csv: Strength-Drug
  • query3.csv: Form-Drug
  • query4.csv: Route-Drug
  • query5.csv: Dosage-Drug
  • query6.csv: Frequency/Duration-Drug
  • query7.csv and query8.csv: Reason-Drug
  • query9.csv: Dosage-Drug, Reason-Drug

Our generated dataset contains SQL queries with varying levels of difficulty: easy, medium, hard, and very hard (nested SQL queries).

Distribution of questions based on their difficulty level

Difficulty level               Easy    Medium    Hard     Very Hard
Percentage (%) of questions    1.30    32.06     33.32    33.32

Usage Notes

The DrugEHRQA dataset aims to open up new avenues of research in multimodal QA over EHRs. The data in structured and unstructured EHR records may be the same or different, or one source may add context to the other. This dataset can therefore be used to improve QA on structured data using evidence from unstructured data, and vice versa. The dataset is, however, limited to medicine-related queries for QA over MIMIC-III.

To access the DrugEHRQA dataset for the structured records of MIMIC-III, download the zip file from PhysioNet and store it in the /structured_queries directory. Because the QA dataset on the MIMIC-III discharge summaries was generated with the help of drug attributes extracted from [7], which is available in the n2c2 repository [8], licensing requirements led us to submit our QA dataset on unstructured data to the n2c2 repository. Download the QA dataset containing answers from the discharge summaries of MIMIC-III from there. Then use the two Python scripts from the associated GitHub repository [9]: the first joins the two datasets, and the second automatically generates the selected multimodal answer.
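The general idea of joining the two sources and comparing their answers can be sketched as follows. This is not the code from the repository [9], and it does not reproduce the authors' selection rules. The question-keyed input layout, the normalization and the relation labels are all hypothetical choices made for illustration, and the sample values are made up.

```python
def normalize(answer_values):
    """Lower-case values and drop spaces so '60MG' and '60 mg' match."""
    return {v.lower().replace(" ", "") for v in answer_values}


def join_and_compare(structured, unstructured):
    """Join two {question: [answers]} mappings and label how they relate."""
    merged = []
    for question in sorted(set(structured) | set(unstructured)):
        s = normalize(structured.get(question, []))
        u = normalize(unstructured.get(question, []))
        if not s or not u:
            relation = "single-source"   # only one source has an answer
        elif s == u:
            relation = "same"            # both sources agree
        elif s & u:
            relation = "overlapping"     # one source adds extra values
        else:
            relation = "different"       # the sources disagree
        merged.append({
            "question": question,
            "structured": sorted(s),
            "unstructured": sorted(u),
            "relation": relation,
        })
    return merged


# Made-up inputs keyed by a question identifier.
structured = {"q1": ["60MG", "50MG"], "q2": ["PO"]}
unstructured = {"q1": ["60 mg"], "q2": ["po"], "q3": ["daily"]}
for row in join_and_compare(structured, unstructured):
    print(row["question"], row["relation"])
```

A label of this kind makes it easy to inspect where the two sources agree, where they conflict and where one adds context to the other, before applying any answer-selection rule.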


Ethics

The authors declare no ethics concerns.

Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Johnson, A., Pollard, T., & Mark, R. (2016). MIMIC-III Clinical Database (version 1.4). PhysioNet.
  2. Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 160035.
  3. Pampari, A., Raghavan, P., Liang, J., & Peng, J. (2018). emrQA: A large corpus for question answering on electronic medical records. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 2357–2368).
  4. Yue, X., Zhang, X. F., Yao, Z., Lin, S., & Sun, H. (2020). CliniQG4QA: Generating diverse questions for domain adaptation of clinical question answering. arXiv preprint arXiv:2010.16021.
  5. Wang, P., Shi, T., & Reddy, C. K. (2020). Text-to-SQL generation for question answering on electronic medical records. In Proceedings of The Web Conference 2020 (pp. 350–361).
  6. Raghavan, P., Liang, J. J., Mahajan, D., Chandra, R., & Szolovits, P. (2021). emrKBQA: A clinical knowledge-base question answering dataset. In Proceedings of the 20th Workshop on Biomedical Language Processing (pp. 64–73).
  7. Henry, S., Buchan, K., Filannino, M., Stubbs, A., & Uzuner, O. (2020). 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. Journal of the American Medical Informatics Association, 27(1), 3–12.
  8. n2c2 repository. Website.
  9. Bardhan, J., Colas, A., Roberts, K., & Wang, D. (2021). Code repository for the DrugEHRQA project. Website.

Parent Projects
DrugEHRQA: A Question Answering Dataset on Structured and Unstructured Electronic Health Records For Medicine Related Queries was derived from: Please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
