Name: Discharge Me: BioNLP ACL'24 Shared Task on Streamlining Discharge Documentation
Published: April 12, 2024
License: https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

Challenge Credentialed Access

Justin Xu

Published: April 12, 2024. Version: 1.3

When using this resource, please cite:
Xu, J. (2024). Discharge Me: BioNLP ACL'24 Shared Task on Streamlining Discharge Documentation (version 1.3). PhysioNet. https://doi.org/10.13026/0zf5-fx50.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

"Discharge Me!", part of the BioNLP workshop co-located with ACL 2024, seeks to alleviate the significant burden on clinicians who dedicate substantial time to crafting detailed discharge notes in the EHR. Participants in the task will explore approaches to generating the "Brief Hospital Course" and "Discharge Instructions" sections of the discharge summary using subsets of MIMIC-IV-Note and MIMIC-IV-ED compiled by the task organizers. The full dataset (comprising defined training, validation, phase I testing, and phase II testing sets) consists of 109,168 emergency department admissions. The competition is being hosted on the Codabench platform, which will manage team registration, results submission, and score evaluation.

Objective

Our objective is to encourage the development of new systems for the generation of discharge summaries and to disseminate preliminary findings to the medical natural language processing community.

Clinicians play a crucial role in documenting patient progress, but the creation of concise yet comprehensive hospital course summaries and discharge instructions often demands a significant amount of time. This contributes to clinician burnout and creates operational inefficiencies within hospital workflows. By streamlining the generation of these sections, we can help enhance the accuracy and completeness of clinical documentation.

Participants are given a dataset based on MIMIC-IV which includes 109,168 admissions from the Emergency Department (ED), split into training, validation, and test sets. Each admission includes chief complaints and diagnosis codes (either ICD-9 or ICD-10) documented by the ED, at least one radiology report, and a discharge summary with both "Brief Hospital Course" and "Discharge Instructions" sections. The goal is to generate these two critical sections in discharge summaries based on other inputs.

We hope that this challenge will bolster the efforts of the clinical natural language processing community in developing effective solutions for the generation of discharge summary sections. We believe this task could form a solid foundation for future work on generating the entire discharge summary including the other sections, which would significantly help reduce the time clinicians spend on administrative tasks, ultimately improving patient care quality.

Participation

Please visit the Codabench competition page to register for this shared task. Codabench [1] is the platform that we will use throughout the challenge, and an account is required to officially join the competition. All submissions and leaderboards will be available on that platform. Please direct any questions about the competition to the Codabench discussion forum. Deadlines and further participation information are available on the Shared Task Website below.

All participants will be invited to submit a paper describing their solution to be included in the Proceedings of the 23rd Workshop on Biomedical Natural Language Processing (BioNLP) at ACL 2024. If you do not wish to write a paper, you must at least provide a thorough description of your system which will be included in the overview paper for this task. Otherwise, your submission (and reported scores) will not be taken into account.

Rules

Participants must comply with the PhysioNet Credentialed Health Data Use Agreement when using the data.
Participants may use any additional data to train (or pre-train) their systems. However, all data used for the submission must be in some way available to other researchers.
Participants may involve existing models trained on proprietary data in their systems. However, these models must also be accessible to other researchers in some capacity.
If participants employ LLMs, please ensure that the team clearly notes the expected outputs by the models or the prompting strategies used so that results can be reproduced. However, please note that sending data via an API to a third party is a violation of the DUA. Please consult the informational note provided by PhysioNet for further detail.
All submissions must be made through the Codabench competition page.

Shared Task Website: https://stanford-aimi.github.io/discharge-me/

Data Description

The dataset for this task is created from MIMIC-IV's submodules MIMIC-IV-Note [3] and MIMIC-IV-ED [4]. In order to download the data, you must have a PhysioNet [2] account with signed agreements for both datasets.

The dataset has been split into a training (68,785 samples), a validation (14,719 samples), a phase I testing (14,702 samples), and a phase II testing (10,962 samples) dataset. The phase II testing dataset will serve as the final test set that will be released on April 12th (Friday), 2024. All datasets and tables are derived from the MIMIC-IV submodules.

Code to re-create the data splits is available on Colab.
Participants are free to use all or part of the provided dataset to develop their systems. However, submissions on Codabench will be evaluated on the entirety of the testing datasets.

Discharge summaries are split into various sections and written under a variety of headings. However, each note in the dataset for this task includes a "Brief Hospital Course" and a "Discharge Instructions" section. The "Brief Hospital Course" section is usually located in the middle of the discharge summary, following information about patient history and treatments received during the current admission. The "Discharge Instructions" section is generally located at the end of the note as one of the last sections.

Each admission is defined by a unique hadm_id and is associated with a corresponding discharge summary and at least one radiology report. Most admissions in the dataset will have only one corresponding ED stay. However, a select few admissions may have more than one ED stay (i.e., multiple stay_id values). Each stay_id can have multiple ICD diagnoses, but will only have one chief complaint. Participants may use online resources for descriptions and details about ICD codes.
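The relationships described above (one discharge summary per hadm_id, one or more radiology reports per hadm_id, one or more stay_id per hadm_id, one chief complaint and possibly several ICD diagnoses per stay_id) can be sketched with pandas. This is only an illustration on toy in-memory tables: the table variable names, the toy values, and every column name other than hadm_id and stay_id are assumptions, so check them against the actual files before reusing the code.

```python
import pandas as pd

# Toy stand-ins for the real tables. Column names other than hadm_id and
# stay_id (text, chiefcomplaint, icd_code) are assumptions for illustration.
discharge = pd.DataFrame({"hadm_id": [100, 200], "text": ["summary A", "summary B"]})
radiology = pd.DataFrame({"hadm_id": [100, 100, 200], "text": ["CXR", "CT head", "CXR"]})
edstays = pd.DataFrame({"hadm_id": [100, 200, 200], "stay_id": [1, 2, 3]})
triage = pd.DataFrame({"stay_id": [1, 2, 3],
                       "chiefcomplaint": ["chest pain", "fall", "dizziness"]})
diagnosis = pd.DataFrame({"stay_id": [1, 1, 2, 3],
                          "icd_code": ["I209", "R079", "W19", "R42"]})

# Radiology reports are keyed directly by admission.
reports = radiology.groupby("hadm_id")["text"].agg(list)

# Chief complaints and diagnoses are keyed by ED stay, so map each
# stay_id to its hadm_id first, then collect the values per admission.
complaints = (triage.merge(edstays, on="stay_id")
              .groupby("hadm_id")["chiefcomplaint"].agg(list))
codes = (diagnosis.merge(edstays, on="stay_id")
         .groupby("hadm_id")["icd_code"].agg(list))

# One row per admission, with list-valued input columns.
inputs = pd.concat(
    [reports.rename("radiology_reports"),
     complaints.rename("chief_complaints"),
     codes.rename("icd_codes")],
    axis=1,
)

# Attach the inputs to the discharge summaries by hadm_id.
dataset = discharge.merge(inputs, left_on="hadm_id", right_index=True)
print(dataset)
```

Collecting values into lists keeps exactly one row per hadm_id, which avoids duplicating discharge summaries for the few admissions that have several ED stays.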

Special Note:

If you are using pandas to read the .csv.gz tables, please ensure you set keep_default_na=False. For instance:

pd.read_csv('discharge_target.csv.gz', keep_default_na=False)

Otherwise, pandas will automatically convert certain strings, such as in cases where the discharge instruction is 'NA' or 'N/A', into the float NaN.
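The difference can be seen on a small in-memory example. The sketch below uses a made-up three-row CSV rather than the real discharge_target.csv.gz, and the column name discharge_instructions is an assumption for illustration.

```python
import io

import pandas as pd

# Made-up CSV content; the column names are assumptions for illustration.
csv_text = (
    "hadm_id,discharge_instructions\n"
    "1,NA\n"
    "2,N/A\n"
    "3,Follow up with your PCP\n"
)

# Default behaviour: 'NA' and 'N/A' are treated as missing values.
default = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False the strings are kept exactly as written.
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default["discharge_instructions"].isna().sum())  # number of values parsed as NaN
print(literal["discharge_instructions"].tolist())      # original strings preserved
```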

Dataset Statistics

The complete dataset contains the following items:

Item	Total Count	Training	Validation	Phase I Testing	Phase II Testing
Admissions	109,168	68,785	14,719	14,702	10,962
Discharge Summaries	109,168	68,785	14,719	14,702	10,962
Radiology Reports	409,359	259,304	54,650	54,797	40,608
ED Stays & Chief Complaints	109,403	68,936	14,751	14,731	10,985
ED Diagnoses	218,376	138,112	29,086	29,414	21,764

Dataset Schemas

For consistency and ease-of-use, the schemas of the data tables have been kept the same as the ones originally provided in MIMIC-IV and its submodules. An additional table in discharge_target.csv.gz is provided, which includes extracted "Brief Hospital Course" and "Discharge Instructions" sections from the discharge summaries.

Evaluation

The evaluation metrics for this task are based on textual similarity and factual correctness of the generated text. Specifically, the following 8 metrics will be considered:

BLEU-4 [5]
ROUGE-1, -2, -L [6]
BERTScore [7]
METEOR [8]
AlignScore [9]
MEDCON [10]

Additionally, the submissions from the top-performing teams will be reviewed by clinicians at the end of the competition.

There will be two separate leaderboards on the Codabench competition page. One will be dedicated to the scores from the initial phase I testing dataset, and one will be dedicated to the scores from the phase II testing dataset, which will be released on April 12th (Friday), 2024.

Submissions will first be scored on their performance for the two target sections separately. For $N$ test set samples, we define the score for a given measure $m$ as:

$$s_m = \frac{1}{2} \left( \frac{1}{N}\sum_{i=1}^{N} g(BHI_i) + \frac{1}{N}\sum_{i=1}^{N} g(DI_i) \right)$$

where $g(BHI_i)$ is the measure calculated on the brief hospital course section for observation $i$ and $g(DI_i)$ is the measure calculated on the discharge instructions section for the same observation. Finally, the overall score is calculated as:

$$\text{Overall} = \frac{1}{M}\sum_{m=1}^{M} s_m$$

where $M$ is the number of measures evaluated (which is defined as 8 above). All scoring calculations will be done on Codabench with a Python 3.9 environment. The evaluation scripts are available on GitHub for reference.
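As a rough illustration of the two formulas above, the following sketch averages made-up per-sample values. It is not the official evaluation script: the function names, the measure names used as dictionary keys, and the toy numbers are all assumptions, and the real metrics would be computed by the scripts referenced above.

```python
from statistics import mean


def measure_score(bhi_scores, di_scores):
    """s_m: mean of the two per-section averages for one measure."""
    if len(bhi_scores) != len(di_scores):
        raise ValueError("both sections must cover the same N samples")
    return 0.5 * (mean(bhi_scores) + mean(di_scores))


def overall_score(per_measure):
    """Overall: unweighted mean of s_m over all M measures.

    per_measure maps a measure name to a pair
    (per-sample scores on Brief Hospital Course,
     per-sample scores on Discharge Instructions).
    """
    return mean(measure_score(bhi, di) for bhi, di in per_measure.values())


# Toy values for N = 2 samples and M = 2 measures (the task uses M = 8).
toy = {
    "bleu": ([0.2, 0.4], [0.6, 0.8]),
    "rouge1": ([0.5, 0.5], [0.7, 0.9]),
}

print(measure_score(*toy["bleu"]))
print(overall_score(toy))
```

Because each section is averaged separately before the two are combined, both sections carry equal weight in $s_m$, and every measure carries equal weight in the overall score.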

For specific submission instructions and details on evaluation, please visit the Codabench competition page.

Release Notes

Version 1.3 - April 12th (Friday), 2024

Phase II testing dataset released.

Version 1.2 - March 1st (Friday), 2024

Samples with target sections of fewer than 10 words were removed from the training and validation datasets.

Version 1.1 - February 20th (Tuesday), 2024

Samples with target sections of fewer than 10 words were removed from the phase I testing and phase II testing datasets.

Version 1.0 - February 6th (Tuesday), 2024

Original dataset released.

Currently, the dataset only contains the samples in the training, validation, and phase I testing dataset. The phase II testing dataset will be released on April 12th (Friday), 2024.

Additionally, the organizers may further update this dataset throughout the shared task to address issues raised by the participants.

Ethics

All members of the organizing team have completed the required training and are credentialed users of MIMIC-IV.

Acknowledgements

Special thanks to Alistair Johnson and the PhysioNet team from the MIT Laboratory for Computational Physiology for managing the credentialing process and for hosting the data for this shared task.

Task Organizers:

Justin Xu
Jean-Benoit Delbrouck
Andrew Johnston
Louis Blankemeier
Curtis Langlotz

Conflicts of Interest

Nothing to declare.

References

[1] Zhen Xu, Sergio Escalera, Adrien Pavão, Magali Richard, Wei-Wei Tu, Quanming Yao, Huan Zhao, & Isabelle Guyon (2022). Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform. Patterns, 3(7), 100543.
[2] Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
[3] Johnson, A., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2). PhysioNet. https://doi.org/10.13026/1n74-ne17.
[4] Johnson, A., Bulgarelli, L., Pollard, T., Celi, L. A., Mark, R., & Horng, S. (2023). MIMIC-IV-ED (version 2.2). PhysioNet. https://doi.org/10.13026/5ntk-km72.
[5] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
[6] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
[7] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating Text Generation with BERT,” arXiv.org, 2019. https://arxiv.org/abs/1904.09675.
[8] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
[9] Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. 2023. AlignScore: Evaluating Factual Consistency with A Unified Alignment Function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11328–11348, Toronto, Canada. Association for Computational Linguistics.
[10] W. Yim, Y. Fu, A. Ben Abacha, N. Snider, T. Lin, and M. Yetisgen, “Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation,” Scientific Data, vol. 10, no. 1, p. 586, Sep. 2023, doi: https://doi.org/10.1038/s41597-023-02487-3.