MedNLI — A Natural Language Inference Dataset For The Clinical Domain

When referencing MIMIC-III, please cite the following publication:

MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35. Available from: http://www.nature.com/articles/sdata201635

Please also include the standard PhysioNet citation:

Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. "PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals," Circulation 101(23):e215-e220 [Circulation Electronic Pages; http://circ.ahajournals.org/content/101/23/e215.full]; 2000 (June 13).

Natural Language Inference (NLI) is one of the critical tasks for understanding natural language. The objective of NLI is to determine whether a given hypothesis can be inferred from a given premise. NLI systems have made significant progress over the years, and the task has gained popularity since the release of datasets such as the Stanford Natural Language Inference (SNLI) corpus (Bowman et al. 2015) and Multi-NLI (Nangia et al. 2017).

We present MedNLI, a dataset for natural language inference in the clinical domain that is analogous to SNLI. As the source of premise sentences, we used MIMIC-III. More specifically, to minimize the risks to patient privacy, we worked only with clinical notes of deceased patients. The clinicians on our team suggested that the Past Medical History is the most informative section of a clinical note, from which useful inferences can be drawn about the patient.

Therefore, we segmented these notes into sections using a simple rule-based program that captures the formatting of the section headers. We extracted the Past Medical History section and used a sentence splitter from LingPipe, trained on biomedical articles, to get a pool of candidate premises. We then randomly sampled sentences from these candidates and presented them to the clinicians for annotation. The exact prompt shown to the clinicians for the annotation task is as follows.
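
The extraction steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' program: the header pattern, all function names, and the regex-based sentence splitter (a crude stand-in for the trained LingPipe splitter) are assumptions made for the example.

```python
import random
import re

# Assumption: a section header is a capitalized phrase at the start of a
# line, followed by a colon (e.g. "Past Medical History:").
HEADER_RE = re.compile(r"^([A-Z][A-Za-z /]+):[ \t]*", re.MULTILINE)


def split_sections(note):
    """Map each lower-cased section header to the text that follows it."""
    matches = list(HEADER_RE.finditer(note))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs until the next header, or to the end of the note.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note)
        sections[match.group(1).strip().lower()] = note[match.end():end].strip()
    return sections


def split_sentences(text):
    """Naive splitter on sentence-final punctuation (stand-in for LingPipe)."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def candidate_premises(notes, section="past medical history"):
    """Collect every sentence of the chosen section across all notes."""
    pool = []
    for note in notes:
        pool.extend(split_sentences(split_sections(note).get(section, "")))
    return pool


def sample_premises(pool, k, seed=0):
    """Randomly sample up to k candidate premises for annotation."""
    return random.Random(seed).sample(pool, min(k, len(pool)))
```

A real clinical note would need a more robust header pattern and sentence splitter; the sketch only shows how the stages fit together.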

You will be shown a sentence from Past Medical History section of a de-identified clinical note. Using only this sentence, your knowledge about the field of medicine, and common sense:

  • Write one alternate sentence that is definitely a true description of the patient. Example, for the sentence "Patient has type II diabetes" you could write "Patient suffers from a chronic condition"
  • Write one alternate sentence that might be a true description of the patient. Example, for the sentence "Patient has type II diabetes" you could write "Patient has hypertension"
  • Write one sentence that is definitely a false description of the patient. Example, for the sentence "Patient has type II diabetes" you could write "The patient's reports indicate consistently normal insulin levels"
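
Because MedNLI is modeled on SNLI, the three prompts correspond naturally to SNLI's three labels: entailment, neutral, and contradiction. The sketch below shows one way a single annotation could be turned into three premise-hypothesis pairs. The `NLIPair` and `pairs_from_annotation` names and the exact label strings are assumptions for illustration, not a description of the released data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NLIPair:
    premise: str
    hypothesis: str
    label: str  # assumed label strings, following SNLI conventions


def pairs_from_annotation(premise: str,
                          definitely_true: str,
                          maybe_true: str,
                          definitely_false: str) -> List[NLIPair]:
    """Turn one clinician annotation into three labeled NLI pairs."""
    return [
        NLIPair(premise, definitely_true, "entailment"),
        NLIPair(premise, maybe_true, "neutral"),
        NLIPair(premise, definitely_false, "contradiction"),
    ]


# Usage with the example sentences from the prompt.
pairs = pairs_from_annotation(
    "Patient has type II diabetes",
    "Patient suffers from a chronic condition",
    "Patient has hypertension",
    "The patient's reports indicate consistently normal insulin levels",
)
for pair in pairs:
    print(pair.label, "|", pair.hypothesis)
```
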
References

  • Romanov, Alexey and Shivade, Chaitanya. "Lessons from Natural Language Inference in the Clinical Domain." Proceedings of EMNLP (2018).

Derived Data

The data associated with this repository is available here: https://physionet.org/works/MIMICIIIDerivedDataRepository/files/approved/mednli/

Contribution

Contributed on 2017-11-14 by Chaitanya Shivade

Source Controlled Code

Source Controlled Code Location: https://github.com/jgc128/mednli

Name             Last modified     Size
mednli_code.zip  2017-11-14 16:58  51K

    Updated Friday, 28 October 2016 at 16:58 EDT
