Database Credentialed Access

MedNLI for Shared Task at ACL BioNLP 2019

Chaitanya Shivade

Published: April 17, 2019. Version: 1.0.0

When using this resource, please cite:
Shivade, C. (2019). MedNLI for Shared Task at ACL BioNLP 2019 (version 1.0.0). PhysioNet.

Additionally, please cite the original publication:

Romanov, A., & Shivade, C. Lessons from Natural Language Inference in the Clinical Domain. In Proceedings of EMNLP 2018.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Natural language inference (NLI) is the task of determining whether a given hypothesis can be inferred from a given premise. Also known as Recognizing Textual Entailment (RTE), this task has long been popular among researchers. However, almost all datasets for this task focus on open-domain data such as news texts, blogs, and so on. To address this gap, the MedNLI dataset was created for language inference in the medical domain. MedNLI is a derived dataset with data sourced from MIMIC-III v1.4. To stimulate research on this problem, the MEDIQA shared task has been organized at BioNLP 2019. The dataset provided here is a test set of 405 premise-hypothesis pairs for the NLI challenge at the MEDIQA shared task. Participants in the shared task are expected to use the MedNLI data to develop their models, and this dataset will be used as an unseen test set for scoring each participant's submission.


Natural language inference has been extremely popular among NLP researchers in the past few years. The Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015) is a large, high-quality dataset that serves as a benchmark for evaluating NLI systems. However, it is restricted to a single text genre (Flickr image captions) and consists mostly of short, simple sentences. The MultiNLI corpus (Williams et al., 2018) addressed this limitation by introducing NLI data from multiple genres (e.g. fiction, travel). However, inferences in specialized domains such as medicine are more nuanced and require specialized knowledge. Owing to high annotation costs and barriers to data access, the clinical NLP community lacks large labeled datasets for training modern data-intensive models on end-to-end tasks such as NLI.


This dataset was created following the same annotation protocol as MedNLI. Sentences from the Past Medical History section of clinical notes from MIMIC-III were segmented out using a simple rule-based program. Clinicians were then shown a premise sentence and asked to generate three sentences: (1) a hypothesis that is definitely true about the patient given the premise, (2) a hypothesis that is definitely false about the patient given the premise, and (3) a hypothesis that may be true about the patient given the premise. The inter-annotator agreement for MedNLI was a Cohen's kappa of 0.78 on a subset of 500 premise-hypothesis pairs. Additional details, such as the exact annotation prompt, can be found in (Romanov and Shivade, 2018).
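For readers unfamiliar with the agreement statistic quoted above, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance. The following is a minimal sketch only: the function name and the toy labels are invented for illustration and are not the MedNLI annotations or the authors' code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels, invented for illustration (assumption, not real annotations).
annotator_1 = ["entailment", "neutral", "contradiction", "neutral"]
annotator_2 = ["entailment", "neutral", "contradiction", "entailment"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 3 of 4 items (observed agreement 0.75) while chance agreement is 5/16, so kappa is 7/11, roughly 0.64.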

Data Description

This test set consists of 405 premise-hypothesis pairs curated by the same clinicians who created the original MedNLI dataset. It can be viewed as an additional test set for the MedNLI data, created for the BioNLP 2019 shared task. The premises in this dataset do not overlap with the premises in MedNLI. Participants interested in the shared task should register on the AIcrowd platform and follow the instructions for submitting a system. Each run will be evaluated by the shared task's evaluation script, with accuracy as the performance metric.
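Accuracy here is simply the fraction of premise-hypothesis pairs whose predicted label matches the gold label. The sketch below illustrates the metric only; it is not the official evaluation script, and the function name and toy labels are assumptions made for this example.

```python
def accuracy(gold_labels, predicted_labels):
    """Fraction of predictions that exactly match the gold labels."""
    if len(gold_labels) != len(predicted_labels) or not gold_labels:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


# Toy labels, invented for illustration (the real gold labels are redacted).
gold = ["entailment", "contradiction", "neutral", "neutral"]
pred = ["entailment", "contradiction", "neutral", "entailment"]
score = accuracy(gold, pred)  # 3 of 4 correct
```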

Usage Notes

The clinical notes from the NOTEEVENTS table of MIMIC-III (v1.4) are the source of the premise statements in this dataset. More specifically, each note was segmented into sections, and sentences from the "past medical history" section were randomly sampled. The dataset is in JSON Lines format and follows exactly the same format as the SNLI and MultiNLI datasets. Each record of this test set is a JSON line with the following fields:

  1. gold_label - entailment, contradiction, or neutral (redacted since this is a test set)
  2. sentence1 - the premise statement
  3. sentence2 - the hypothesis statement
  4. sentence1_parse - the constituency parse of the premise using the Stanford parser
  5. sentence2_parse - the constituency parse of the hypothesis using the Stanford parser
  6. sentence1_binary_parse - the binary parse of the premise using the Stanford parser
  7. sentence2_binary_parse - the binary parse of the hypothesis using the Stanford parser

A sample record from the training set is shown below:

{"sentence1": "Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4.", "pairID": "23eb94b8-66c7-11e7-a8dc-f45c89b91419", "sentence1_parse": "(ROOT (S (NP (NNPS Labs)) (VP (VBD were) (ADJP (JJ notable) (PP (IN for) (NP (NP (NP (NN Cr) (CD 1.7)) (PRN (-LRB- -LRB-) (NP (NP (NN baseline) (CD 0.5)) (PP (IN per) (NP (JJ old) (NNS records)))) (-RRB- -RRB-))) (CC and) (NP (NN lactate) (CD 2.4)))))) (. .)))", "sentence1_binary_parse": "( Labs ( ( were ( notable ( for ( ( ( ( Cr 1.7 ) ( -LRB- ( ( ( baseline 0.5 ) ( per ( old records ) ) ) -RRB- ) ) ) and ) ( lactate 2.4 ) ) ) ) ) . ) )", "sentence2": " Patient has elevated Cr", "sentence2_parse": "(ROOT (S (NP (NN Patient)) (VP (VBZ has) (NP (JJ elevated) (NN Cr)))))", "sentence2_binary_parse": "( Patient ( has ( elevated Cr ) ) )", "gold_label": "entailment"}

The goal of the task is to classify a given premise-hypothesis pair into one of three classes: entailment, contradiction, or neutral.
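Records in the format described above can be read one line at a time with Python's standard json module. The sketch below is illustrative: the helper names are hypothetical, the in-memory record is a shortened version of the sample record, and the file name in the comment is an assumption rather than the actual name of the released file.

```python
import io
import json


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def tokens_from_binary_parse(binary_parse):
    """Drop the bracket symbols and return the remaining tokens in order."""
    return [tok for tok in binary_parse.split() if tok not in ("(", ")")]


# In-memory stand-in for the data file. With the real data, pass an open
# file instead, e.g. open("test.jsonl", encoding="utf-8") (hypothetical name).
sample = io.StringIO(
    '{"sentence1": "Labs were notable for Cr 1.7.", '
    '"sentence2": " Patient has elevated Cr", '
    '"sentence2_binary_parse": "( Patient ( has ( elevated Cr ) ) )", '
    '"gold_label": "entailment"}\n'
)
records = list(read_jsonl(sample))
hypothesis_tokens = tokens_from_binary_parse(
    records[0]["sentence2_binary_parse"]
)
```

Stripping the brackets from the binary parse is one simple way to recover a tokenized sentence. Escaped bracket tokens such as -LRB- and -RRB- are kept as ordinary tokens by this helper.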



We would like to thank Adam Coy and Chanida Thammachart for their help in curating this dataset. We would also like to thank Vandana Mukherjee for supporting this project.

Conflicts of Interest



References

  1. Romanov, A., & Shivade, C. (2018). Lessons from Natural Language Inference in the Clinical Domain. In Proceedings of EMNLP 2018.


Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
