Database Credentialed Access

CHIFIR: Cytology and Histopathology Invasive Fungal Infection Reports

Vlada Rozova, Anna Khanina, Jasmine Teng, Joanne Teh, Leon Worth, Monica Slavin, Karin Thursky, Karin Verspoor

Published: Nov. 2, 2023. Version: 1.0.1

When using this resource, please cite:
Rozova, V., Khanina, A., Teng, J., Teh, J., Worth, L., Slavin, M., Thursky, K., & Verspoor, K. (2023). CHIFIR: Cytology and Histopathology Invasive Fungal Infection Reports (version 1.0.1). PhysioNet.

Additionally, please cite the original publication:

Rozova V, Khanina A, Teng JC, S K Teh J, Worth LJ, Slavin MA, et al. Detecting evidence of invasive fungal infections in cytology and histopathology reports enriched with concept-level annotations. Journal of Biomedical Informatics. 2023:104293.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract


Surveillance of invasive fungal infection (IFI) in clinical settings is a laborious process requiring a detailed review of patient medical history. One of the key sources of clinical information is cytology and histopathology reports: pathologist-produced free-text reports outlining the macroscopic and microscopic structure of a specimen. The data was generated to facilitate the development of an automated tool for the detection of IFI. The Cytology and Histopathology Invasive Fungal Infection Reports (CHIFIR) corpus contains 283 de-identified reports annotated by infectious diseases physicians for terminology relevant to the IFI diagnosis. These include IFI-specific concepts and certainty cues, as well as relations between them. We release the annotation schema and the original reports along with the corresponding annotation files. We anticipate the CHIFIR corpus to be useful in the development and validation of named entity recognition and relation extraction methods with a focus on clinical data. Such methods can be instrumental in processing other types of clinical documentation (radiology reports, clinical notes, nursing notes) with various downstream tasks in mind.

Background


Invasive fungal infections (IFIs) are rare but serious infections most commonly affecting immunocompromised and critically ill patients. IFIs are associated with increased morbidity, mortality, and management costs. Routine IFI surveillance is needed in healthcare facilities to allow timely detection of infection outbreaks, identify new and emerging risks for IFI, evaluate infection prevention and prophylaxis interventions, and enable benchmarking between facilities.

The lack of a single diagnostic test means that traditional surveillance approaches involve examining multiple sources of clinical information, among which cytology and histopathology reports are a key data element [6]. These patient investigations are evaluated against the consensus definitions of IFI from the European Organization for Research and Treatment of Cancer and the Mycoses Study Group (EORTC/MSG) to detect an episode of infection [1].

Our team aims to design and implement the Invasive Fungal Infection Surveillance (IFIS) tool for automated routine surveillance of IFI, employing a combination of natural language processing (NLP) and machine learning techniques. The CHIFIR corpus was created to support the development of an NLP-based classifier to enable the automated processing of cytology and histopathology reports as a critical component of the IFIS platform.

Methods


Study population and design

Patients receiving treatment for haematological malignancy and/or bone marrow transplant at two hospitals in Melbourne, Australia, The Royal Melbourne Hospital and Peter MacCallum Cancer Centre, between March 2010 and March 2021 were included in this study. We extracted cytology and histopathology reports from the electronic medical records and local pathology databases collected over the given period. Text reports from both initial and follow-up examinations were included. The reports were all written in English.

To ensure a balanced representation of reports positive and negative for IFI, we selected a subset of reports based on the following procedure:

  • Only reports relating to the following sample types were considered: bronchial washings, broncho-alveolar lavage fluid, and tissue (e.g., lung).
  • All reports from patients who fulfilled the EORTC/MSG mycology criteria for IFI were included in the study. Reports that contained a clear indication of IFI were considered positive cases. Case-finding of positive reports was enhanced by reviewing a list of patients with IFI curated by infectious diseases (ID) physicians from ongoing manual surveillance across both institutions.
  • A set of randomly selected reports from patients without the IFI diagnosis was included as negative controls.

Report de-identification

Dates were shifted within the text and then removed along with all protected health information, names, initials, and contact details of health professionals and healthcare facilities. The identifying information was replaced with a sequence of “XXXX” of the same length.
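
Length-preserving masking of this kind can be sketched as follows. This is an illustration only, not the de-identification pipeline used for CHIFIR: the function name `mask_pattern` and the regular expression are hypothetical, and real de-identification requires far more than pattern matching. The point of the sketch is that replacing each identifier with the same number of "X" characters keeps every character offset in the document unchanged.

```python
import re


def mask_pattern(text: str, pattern: str, mask_char: str = "X") -> str:
    """Replace every regex match with an equal-length run of mask_char.

    Hypothetical helper: the pattern passed in is an assumption for
    illustration and does not reflect the rules used by the authors.
    """
    return re.sub(pattern, lambda m: mask_char * len(m.group(0)), text)


# Toy sentence invented for this example (not taken from the corpus).
example = "Reported by Dr Smith on 12/03/2015."
masked = mask_pattern(example, r"Dr \w+|\d{2}/\d{2}/\d{4}")

print(masked)
print(len(masked) == len(example))
```

Because the masked text has the same length as the input, character spans computed on the de-identified text stay aligned with the text itself.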

Entity and relation annotation

We annotated the data at the level of individual concepts directly related to the diagnosis of IFI. The annotation schema was developed iteratively with the intent to capture as much information as possible while maintaining relatively high granularity.

We used the brat rapid annotation tool [2] to perform manual annotation. The process involved two ID physicians (authors JCT and JSKT) independently annotating a subset of reports, followed by a meeting to resolve disparities and update the annotation schema to document consensus. This step was repeated several times until agreement on the structure of the schema, concept definitions, and rules for relations was reached. The annotators then applied the final guidelines to the whole corpus of reports. A final consensus meeting was held to resolve any differences in interpretation, mostly around the level of certainty pertaining to IFI-related findings. The final judgement on discrepancies was made by senior ID physicians (authors KAT, LJW, MAS), and the annotations were consolidated into a single set of annotation files.

We adopted the following list of concepts related to the IFI diagnosis:

  • ClinicalQuery: clinical query of IFI indicating that the presence of an IFI is suspected; most commonly recorded in the “CLINICAL NOTES” section of the report.
  • FungalDescriptor: descriptors for the presence of fungal organisms.
  • Fungus: mentions of specific fungal organisms or syndromes.
  • Invasiveness: descriptors for the depth and degree of fungal invasion into tissues.
  • Stain: histological stains used to visualise fungal elements.

We allowed overlapping spans between a FungalDescriptor and a Fungus, and between two FungalDescriptors.

When not indicated explicitly in the “SPECIMEN” section of the report, we also tagged the location of the specimen:

  • SampleType: specification of the sampled organ, site, or tissue source.

In addition to these, we considered terms expressing three levels of certainty:

  • Positive: affirmative expression.
  • Equivocal: expression of uncertainty.
  • Negative: negating expression.

These certainty cues were captured only when pertaining to a FungalDescriptor, Fungus, Invasiveness, or Stain concept. They were linked to these target concepts via relations:

  • positive-rel: links a Positive expression to a FungalDescriptor, Fungus, Invasiveness, or Stain entity it is affirming.
  • equivocal-rel: links an Equivocal expression to a FungalDescriptor, Fungus, Invasiveness, or Stain entity about which it is conveying uncertainty.
  • negative-rel: links a Negative expression to a FungalDescriptor, Fungus, Invasiveness, or Stain entity it is negating.

In addition, we introduced several other relations to capture how concepts are connected to each other:

  • fungal-description-rel: links a FungalDescriptor to the Fungus it refers to.
  • invasiveness-rel: links a FungalDescriptor or a Fungus to the Invasiveness it displays.
  • fungus-stain-rel: links a Stain to the FungalDescriptor or Fungus it visualises.
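
The relation types listed above can be written down as a small lookup table, which is convenient for sanity-checking parsed annotations. The sketch below is a hedged illustration: the names `RELATION_SCHEMA` and `is_valid_relation` are hypothetical, and the argument order (first argument, then second argument) is an assumption read off the wording of the list. The released “annotation.conf” file remains the authoritative definition.

```python
# Concepts that a certainty cue (Positive, Equivocal, Negative) may attach to.
CERTAINTY_TARGETS = {"FungalDescriptor", "Fungus", "Invasiveness", "Stain"}

# relation type -> (allowed first-argument types, allowed second-argument types)
# ASSUMPTION: argument order follows the prose description, not annotation.conf.
RELATION_SCHEMA = {
    "positive-rel": ({"Positive"}, CERTAINTY_TARGETS),
    "equivocal-rel": ({"Equivocal"}, CERTAINTY_TARGETS),
    "negative-rel": ({"Negative"}, CERTAINTY_TARGETS),
    "fungal-description-rel": ({"FungalDescriptor"}, {"Fungus"}),
    "invasiveness-rel": ({"FungalDescriptor", "Fungus"}, {"Invasiveness"}),
    "fungus-stain-rel": ({"Stain"}, {"FungalDescriptor", "Fungus"}),
}


def is_valid_relation(rel_type: str, arg1_type: str, arg2_type: str) -> bool:
    """Return True if the relation type and argument types match the table."""
    if rel_type not in RELATION_SCHEMA:
        return False
    allowed_first, allowed_second = RELATION_SCHEMA[rel_type]
    return arg1_type in allowed_first and arg2_type in allowed_second


print(is_valid_relation("negative-rel", "Negative", "Fungus"))
print(is_valid_relation("negative-rel", "Negative", "SampleType"))
```

If the argument order in the released files turns out to be reversed for some relation type, only the corresponding tuple in the table needs to be swapped.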

Data Description

File overview

  • The folder “reports” contains the original de-identified reports in the .txt file format.
  • The folder “annotations” contains brat annotation files in the .ann file format.
  • The file “annotation.conf” is a configuration file for brat that defines concept categories and relations.
  • The file “chifir_metadata.csv” indicates whether a report is positive or negative for IFI and which data split it belongs to.

Each text document has a corresponding annotation file with the same base file name. File names are assigned following the format “ptX_rY”, where X is the patient ID and Y is the report number for a given patient. For example, the text of the third report belonging to patient #35 can be located under “pt35_r3.txt” with a corresponding annotation file named “pt35_r3.ann”. Reports belonging to the same patient are numbered chronologically based on the date of specimen collection (this information is not disclosed).
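
The naming convention can be handled with a short helper. This is a minimal sketch under stated assumptions: the function names are hypothetical, patient IDs and report numbers are assumed to be plain integers (as in the “pt35_r3” example), and the folder name “annotations” is taken from the file overview above.

```python
import re
from pathlib import Path

# Matches base names of the form "ptX_rY", e.g. "pt35_r3".
NAME_RE = re.compile(r"^pt(\d+)_r(\d+)$")


def parse_report_name(filename: str) -> tuple:
    """Return (patient_id, report_number) parsed from a CHIFIR file name."""
    stem = Path(filename).stem
    match = NAME_RE.match(stem)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename!r}")
    return int(match.group(1)), int(match.group(2))


def annotation_path(report_path: str, ann_folder: str = "annotations") -> Path:
    """Return the path of the .ann file that shares the report's base name."""
    return Path(ann_folder) / (Path(report_path).stem + ".ann")


print(parse_report_name("pt35_r3.txt"))
print(annotation_path("reports/pt35_r3.txt").as_posix())
```

Sorting reports by the returned tuple groups them by patient and, within a patient, in the chronological order described above.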

Annotation files are produced by brat in the brat standoff format (see [3] for a detailed overview). Information about entities and relations is recorded in the annotation file, one item per line:

  • Each entity annotation contains a unique entity ID, its assigned concept category (e.g., FungalDescriptor or Fungus), and the span of characters containing the entity mention. For example, "T1    SampleType 202 206    skin" describes an entity "skin" with a unique ID T1 and a concept category SampleType, starting at character offset 202 and ending at offset 206.
  • Each relation annotation contains a unique relation ID followed by the type of relation (e.g., positive-rel or invasiveness-rel) and its arguments. For example, "R2    positive-rel Arg1:T3 Arg2:T1" describes a relation between entities T3 and T1 with a unique ID R2 and a relation type positive-rel, directed from its first argument, T3, to its second argument, T1.
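
A minimal reader for the two line types described above might look like the sketch below. It is an illustration rather than an official parser: the class and function names are hypothetical, fields are assumed to be tab-separated as in the brat standoff format (the examples above render the tabs as spaces), and only entity ("T") and binary relation ("R") lines are handled. Discontinuous spans, which brat writes as offset pairs separated by ";", are kept as a list of (start, end) tuples.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    ann_id: str
    label: str
    spans: list  # list of (start, end) character offsets
    text: str


class Relation(NamedTuple):
    ann_id: str
    label: str
    arg1: str
    arg2: str


def parse_ann(content: str):
    """Parse entity and relation lines from the text of a .ann file."""
    entities, relations = {}, {}
    for line in content.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # Second field: "<label> <start> <end>[;<start> <end>...]"
            label, _, offsets = fields[1].partition(" ")
            spans = [tuple(int(x) for x in part.split())
                     for part in offsets.split(";")]
            text = fields[2] if len(fields) > 2 else ""
            entities[ann_id] = Entity(ann_id, label, spans, text)
        elif ann_id.startswith("R"):
            # Second field: "<label> Arg1:<id> Arg2:<id>"
            label, *args = fields[1].split()
            roles = dict(arg.split(":", 1) for arg in args)
            relations[ann_id] = Relation(ann_id, label,
                                         roles["Arg1"], roles["Arg2"])
    return entities, relations


# The two example lines from the description above, with tab separators.
sample = "T1\tSampleType 202 206\tskin\nR2\tpositive-rel Arg1:T3 Arg2:T1\n"
entities, relations = parse_ann(sample)
print(entities["T1"])
print(relations["R2"])
```

Keying both dictionaries by annotation ID makes it straightforward to resolve a relation's arguments back to the entities they refer to.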

Summary statistics

  • The total number of reports: 283.
  • The total number of patients: 201.
  • The number of reports per patient varies from 1 to 6 with a median of 1 report per patient.
  • The average document length is 1353 characters.
  • The majority of reports are semi-structured with headers indicating sections such as clinical notes, macroscopic and microscopic descriptions, and diagnosis; however, such organisation is not consistent across the corpus.
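
Statistics of this kind can be recomputed from the released files with a few lines of standard-library Python. The sketch below is not the authors' code: the function names are hypothetical, the folder name “reports” follows the file overview, and UTF-8 encoding of the text files is an assumption.

```python
from collections import Counter
from pathlib import Path
from statistics import mean, median


def load_reports(folder: str = "reports") -> dict:
    """Map each base file name (e.g. "pt35_r3") to the report text.

    ASSUMPTION: files are UTF-8 encoded.
    """
    return {p.stem: p.read_text(encoding="utf-8")
            for p in sorted(Path(folder).glob("*.txt"))}


def summarise(reports: dict) -> dict:
    """Compute corpus-level counts from a {base_name: text} mapping."""
    # The patient prefix is the part of "ptX_rY" before the underscore.
    per_patient = Counter(name.split("_")[0] for name in reports)
    return {
        "n_reports": len(reports),
        "n_patients": len(per_patient),
        "min_reports_per_patient": min(per_patient.values()),
        "max_reports_per_patient": max(per_patient.values()),
        "median_reports_per_patient": median(per_patient.values()),
        "mean_length_chars": mean(len(text) for text in reports.values()),
    }


# Toy data invented for illustration; with access to the corpus one would
# call summarise(load_reports("reports")) instead.
toy = {"pt1_r1": "abcd", "pt1_r2": "ab", "pt2_r1": "abcdef"}
print(summarise(toy))
```

Document length is measured here in characters, matching the unit used in the list above.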

Usage Notes

The CHIFIR corpus was used to develop an NLP-based classifier to detect evidence of IFI in cytology and histopathology reports. The results of automated information extraction and report classification were recently published in our paper [4]. The code used to produce the summary statistics above and the results presented in the paper can be found in our GitHub repository [5]. Researchers interested in validating their results against ours should refer to “chifir_metadata.csv” for details on how the data was split in the original study.
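
To reuse the original data split, the rows of “chifir_metadata.csv” can be grouped by whichever column records the split. The sketch below makes no claim about the actual column names: `group_rows` is a hypothetical helper, and the header used in the toy example ("report_id", "label", "split") is invented for illustration. Inspect the real header before relying on any column name.

```python
import csv
import io
from collections import defaultdict


def group_rows(csv_text: str, key_column: str) -> dict:
    """Group CSV rows (as dicts) by the value found in key_column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    if key_column not in fieldnames:
        raise KeyError(f"{key_column!r} not in header: {fieldnames}")
    groups = defaultdict(list)
    for row in reader:
        groups[row[key_column]].append(row)
    return dict(groups)


# HYPOTHETICAL header and rows; the real file may use different names.
toy_csv = (
    "report_id,label,split\n"
    "pt1_r1,positive,train\n"
    "pt2_r1,negative,test\n"
    "pt3_r1,negative,train\n"
)
by_split = group_rows(toy_csv, "split")
print({name: len(rows) for name, rows in by_split.items()})
```

With the real file, the text would come from something like `Path("chifir_metadata.csv").read_text()`, and `key_column` would be set to the split column as it appears in the header.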

It is important to note that the CHIFIR corpus was derived from two hospitals within a small geographic area. Collecting more reports positive for IFI was not feasible due to the rarity of the condition. Hence, the number of terms annotated for each category may not be large enough to train (or even fine-tune) a deep-learning NER model. That said, there is a limited number of publicly available clinical text documents, particularly from the Australian context. We thus hope to provide other researchers working on a similar problem with an opportunity to conduct external validation of their methodologies and investigate domain adaptation techniques. We further anticipate the CHIFIR dataset to be useful to researchers developing new concept and relation extraction methods with a focus on clinical data. Such methods can be instrumental in processing other types of clinical documentation (radiology reports, clinical notes, nursing notes) with different downstream tasks in mind. 

Release Notes

Version 1.0.1: added “chifir_metadata.csv”, which contains metadata for users who want to make a direct comparison with the results published in [4].

Ethics


Ethics approval was granted by Peter MacCallum Cancer Centre (HREC Reference No: HREC/74975/PMCC), and governance approval was granted by Royal Melbourne Hospital (SSA/69640/MH-2020).

Acknowledgements


The authors would like to acknowledge Dr ShioYen Tio for providing access to the database of patients with confirmed mould infections. The authors would also like to thank Dr Christopher Angel for the input on the structure and components of the annotation schema. This work was supported by the Australian National Health and Medical Research Council (NHMRC) Project Grant APP1156426.

Conflicts of Interest

The authors declare no conflicts of interest.

References


  1. Donnelly JP, Chen SC, Kauffman CA, Steinbach WJ, Baddley JW, Verweij PE, et al. Revision and Update of the Consensus Definitions of Invasive Fungal Disease From the European Organization for Research and Treatment of Cancer and the Mycoses Study Group Education and Research Consortium. Clin Infect Dis. 2020;71(6):1367-76.
  2. Stenetorp P, Pyysalo S, Topić G, Ohta T, Ananiadou S, Tsujii J. brat: a Web-based Tool for NLP-Assisted Text Annotation. April 2012; Avignon, France: Association for Computational Linguistics.
  3. [Accessed 10/16/2023]
  4. Rozova V, Khanina A, Teng JC, S K Teh J, Worth LJ, Slavin MA, et al. Detecting evidence of invasive fungal infections in cytology and histopathology reports enriched with concept-level annotations. Journal of Biomedical Informatics. 2023:104293.
  5. [Accessed 10/16/2023]
  6. Kontoyiannis DP, Marr KA, Park BJ, Alexander BD, Anaissie EJ, Walsh TJ, et al. Prospective surveillance for invasive fungal infections in hematopoietic stem cell transplant recipients, 2001-2006: overview of the Transplant-Associated Infection Surveillance Network (TRANSNET) Database. Clin Infect Dis. 2010;50(8):1091-100.


Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
