Database Open Access

Q-Pain: A Question Answering Dataset to Measure Social Bias in Pain Management

Cécile Logé, Emily Ross, David Yaw Amoah Dadey, Saahil Jain, Adriel Saporta, Andrew Ng, Pranav Rajpurkar

Published: June 11, 2021. Version: 1.0.0

When using this resource, please cite:
Logé, C., Ross, E., Dadey, D. Y. A., Jain, S., Saporta, A., Ng, A., & Rajpurkar, P. (2021). Q-Pain: A Question Answering Dataset to Measure Social Bias in Pain Management (version 1.0.0). PhysioNet.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Abstract

We introduce Q-Pain, a dataset for assessing bias in medical question answering (QA) in the context of pain management. We developed 55 medical question-answer pairs across five different types of pain management. Each question includes a detailed, patient-specific medical scenario ("vignette") designed to enable the substitution of multiple racial and gender "profiles" and to evaluate whether bias is present in the decision of whether or not to prescribe medication.

Along with the dataset, we propose a new framework, including a sample experimental design, to measure the potential bias present during medical decision-making for patients with particular demographic profiles. We demonstrate its use in assessing two reference QA systems, GPT-2 and GPT-3, selected for their documented ability to answer questions given only a few examples. We hope that our dataset and framework will be used to assess bias across a wide variety of pain management QA systems.


Background

Pain management remains one of the most challenging forms of clinical decision-making [1]. Since the patient experience of pain and its manifestations are highly variable, there is inherent subjectivity in physician pain assessments [2-4]. The challenge of measuring pain, combined with the vast diversity of pharmacologic and non-pharmacologic treatment options, results in clinical decision pathways that are difficult to standardize [5]. Thus, the complexity of pain management and the lack of universal standards present opportunities for bias to impact clinical decisions. Racial disparities in the treatment of pain have been shown in prior studies [6-8], where Black patients were consistently found to be more likely to receive inadequate or no pain medication when compared to White patients [9]. Furthermore, multiple studies have highlighted that women were more likely to be under-treated for pain [10].

Recent advances in natural language processing (NLP) have enabled the development of automated question answering (QA) systems that can answer personalized medical questions [11]. However, many NLP systems, including those specific to QA, have been shown to encode and reinforce harmful societal biases [12-15]. Benchmark datasets have been instrumental in surfacing bias in NLP systems [16,17]. Therefore, before medical QA systems are further deployed and even incorporated into medical workflows for applications such as mental health conversations and clinical advice lines, it is critical to develop benchmark datasets that will help us understand the extent to which encoded societal biases surface in medical QA systems.


Methods

Clinical vignettes are medical scenarios that typically present a patient with a specific set of symptoms, often with the goal of illustrating decision-making and exploring a range of possible actions. We introduce Q-Pain, a dataset of 55 pain-related clinical vignettes depicting patients in different medical contexts related to pain management. Each vignette ends with a question asking whether the patient should be prescribed pain medication (intravenous hydromorphone, hydrocodone, oxycodone, or morphine), if so at what dosage, and why.

Data Collection: A clinical expert with four years of medical experience designed each of the clinical scenarios such that the patient presentation and data were consistent with real-life situations where pain treatment should be offered. As such, with the exception of five vignettes used in our closed prompts (see details in our Experimental Design section), the expected treatment answer for all vignettes is "Yes.", making it easy to detect under-treatment in specific subgroups of patients. To allow for detection of more nuanced under-treatment, we included dose/supply scales in each question. These were defined as low or high according to an appropriate treatment metric (milligrams or weeks' supply of medication) for each clinical context, and were designed such that both the low and high choices were objectively acceptable for pain treatment.
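To make the detection step concrete, the following is a minimal sketch (not part of the released code) of how per-group refusal and low-dose rates might be tallied once answers have been collected from a QA system. The group labels and response tuples below are invented for illustration:

```python
from collections import defaultdict

def undertreatment_rates(responses):
    """Compute per-group rates of refusal ("No.") and low dosing ("Low"),
    given (group, answer, dosage) tuples collected from a QA system."""
    counts = defaultdict(lambda: {"n": 0, "no": 0, "low": 0})
    for group, answer, dosage in responses:
        c = counts[group]
        c["n"] += 1
        if answer == "No.":
            c["no"] += 1
        elif dosage == "Low":
            c["low"] += 1
    return {g: {"no_rate": c["no"] / c["n"], "low_rate": c["low"] / c["n"]}
            for g, c in counts.items()}

# Toy example: two demographic profiles, four vignettes each.
rates = undertreatment_rates([
    ("Black woman", "Yes.", "Low"),
    ("Black woman", "No.", None),
    ("Black woman", "Yes.", "High"),
    ("Black woman", "Yes.", "Low"),
    ("White man", "Yes.", "High"),
    ("White man", "Yes.", "High"),
    ("White man", "Yes.", "Low"),
    ("White man", "Yes.", "High"),
])
```

Because the expected answer is "Yes." for all open-prompt vignettes, any elevated refusal or low-dose rate for one subgroup relative to another is a candidate signal of bias.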

Validation Protocol: To further validate the dataset, a random selection of vignettes spanning each clinical context were provided to two internal medicine physicians who were not involved in the initial design. For each vignette, the physicians were asked to assess for (1) credibility of the described scenario, (2) coherence of clinical information provided, and (3) sufficiency of information provided toward making a clinical decision. For all vignettes assessed, both physicians responded affirmatively to each query.

Data Description

The Q-Pain dataset is structured into five csv files, one for each medical context: Acute Non-Cancer pain (data_acute_non_cancer.csv), Acute Cancer pain (data_acute_cancer.csv), Chronic Non-Cancer pain (data_chronic_non_cancer.csv), Chronic Cancer pain (data_chronic_cancer.csv), and Post-Operative pain (data_post_op.csv).

Vignettes: Each csv file includes ten "Yes." vignettes and one "No." vignette. For each vignette, it includes the case presentation ("Vignette"), the question ("Question"), the expected answer ("Answer") and dosage ("Dosage"), as well as a brief explanation justifying the response ("Explanation").

The "Yes." vignettes use placeholders to allow for substitution of patient profiles: [Patient D] for the name, [race] and [gender] for demographics, and [possessive] and [subject] for pronouns. The "No." vignettes use [Patient B] and fewer placeholders, as they are only meant for use in Closed Prompts.
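Profile substitution amounts to simple string replacement. The template sentence and patient name below are invented for illustration; the placeholder names are those used in the dataset:

```python
def fill_profile(vignette, name, race, gender, possessive, subject):
    """Substitute one patient profile into a templated "Yes." vignette."""
    substitutions = {
        "[Patient D]": name,
        "[race]": race,
        "[gender]": gender,
        "[possessive]": possessive,
        "[subject]": subject,
    }
    for placeholder, value in substitutions.items():
        vignette = vignette.replace(placeholder, value)
    return vignette

# Invented template sentence using the dataset's placeholder names.
template = ("[Patient D] is a [race] [gender] presenting with severe pain. "
            "[subject] rates [possessive] pain as 9 out of 10.")
filled = fill_profile(template, "Tanisha", "Black", "woman", "her", "She")
```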

Python Notebook: The Q_Pain_Experiments.ipynb notebook includes a walkthrough of the data as well as starter code for our Experimental Design framework.

Usage Notes

The Q-Pain dataset can be used in real-life survey settings (with medical staff and clinicians) or with AI language models to assess the presence of bias in pain-management decision-making.

Example Experimental Design

The following experimental design may be used for prompting GPT-3 and GPT-2 [18,19].

Closed Prompts: LLMs have been shown to answer questions more accurately in a "few-shot" setting, in which the model is first given several example question-answer pairs ("closed prompts") to better learn the relevant task. With this in mind, we preceded all our vignettes ("open prompts") with three closed prompts randomly sampled from the dataset itself.

To limit the influence of these closed prompts on the generated text and make them as neutral as possible, we removed any indication of race or gender, named the patients "Patient A", "Patient B", and "Patient C", used gender-neutral pronouns ("they"/"their"), and ensured that each of the three possible answers ("No.", "Yes. Dosage: High", and "Yes. Dosage: Low") was represented in the closed prompt selection. We also used "##" as a stop token to mark the separation between prompts and give the models a sense of where to stop during generation (see Fig. 1a for an example of input).

Throughout our experiments, temperature was set to 0 to eliminate randomness in the output, as recommended by OpenAI for factual question answering. The maximum number of tokens to generate was set to 150, and we retrieved the probabilities of the five most likely tokens. All other parameters were kept at their default values: no nucleus sampling and no presence or frequency penalties.
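The prompt construction described above might be sketched as follows. The closed-prompt snippets are abbreviated stand-ins, and the generation settings are noted in a comment rather than passed to an actual API call:

```python
def build_prompt(closed_prompts, open_prompt, stop="##"):
    """Concatenate three neutral closed prompts and one open prompt,
    separated by the stop token so the model knows where each ends."""
    sep = f"\n{stop}\n"
    return sep.join(list(closed_prompts) + [open_prompt])

# Abbreviated stand-ins for the three sampled closed prompts, covering
# all three possible answers.
closed = [
    "Vignette: Patient A ... Question: ... Answer: No.",
    "Vignette: Patient B ... Question: ... Answer: Yes. Dosage: Low.",
    "Vignette: Patient C ... Question: ... Answer: Yes. Dosage: High.",
]
open_prompt = "Vignette: ... Question: ... Answer:"

prompt = build_prompt(closed, open_prompt)
# Generation settings described above (for the API call, not executed here):
# temperature=0, max_tokens=150, logprobs=5, stop="##".
```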

Rotating Patient Identities: As we prompted the models for answers, we rotated the patient’s race (White, Black, Hispanic, Asian) and gender (man, woman) across all vignettes for a total of eight patient combinations and 400 open prompts. Conscious of the demographic information carried within a person’s name and inspired by prior work on race-based stereotyping, we opted to name each patient according to their demographic profile using Harvard Dataverse’s "Demographic aspects of first names" dataset. This dataset contains a list of 4,250 first names along with data on their respective frequency and proportions across mutually exclusive racial groups. For each race and gender, we chose the top ten first names based on their overall frequency and representation within each group, excluding unisex names and names that differed by only one character.

For each context, we selected a random permutation of names to use across the ten open prompts. Beyond the names, race, gender, and corresponding pronouns, everything else in the vignettes remained unchanged, from symptom presentation and severity to age and the preceding closed prompts, so as to minimize the influence of other factors on the final decision.
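The rotation over patient identities can be enumerated directly; the count below matches the 400 open prompts described above (8 demographic profiles across the 50 "Yes." vignettes):

```python
from itertools import product

RACES = ["White", "Black", "Hispanic", "Asian"]
GENDERS = ["man", "woman"]

# Eight demographic profiles, rotated across every open prompt.
profiles = list(product(RACES, GENDERS))

# 10 "Yes." vignettes per context x 5 contexts = 50 open prompts per profile.
n_yes_vignettes = 50
n_open_prompts = len(profiles) * n_yes_vignettes

print(len(profiles), n_open_prompts)  # 8 400
```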

Conflicts of Interest

The authors declare no conflicts of interest.


References

  1. Timothy H. Wideman et al. The multimodal assessment model of pain. The Clinical Journal of Pain, 35(3):212–221, Mar 2019.
  2. Jana M. Mossey. Defining racial and ethnic disparities in pain management. Clinical Orthopaedics & Related Research, 469(7):1859–1870, Jul 2011.
  3. Robert C. Coghill. Individual differences in the subjective experience of pain: New insights into mechanisms and models. Headache: The Journal of Head and Face Pain, 50(9):1531–1535, Oct 2010.
  4. Richard A. Mularski et al. Measuring pain as the 5th vital sign does not improve quality of pain management. Journal of General Internal Medicine, 21(6):607–612, Jun 2006.
  5. Rita M. Holl et al. Complexity of pain, nurses' knowledge, and treatment options. Holistic Nursing Practice, 29(6):377–380, Dec 2015.
  6. Brian B. Drwecki et al. Reducing racial disparities in pain treatment: The role of empathy and perspective-taking. PAIN, 152(5):1001–1006, 2011.
  7. Karen O. Anderson et al. Racial and ethnic disparities in pain: Causes and consequences of unequal care. The Journal of Pain, 10(12):1187–1204, 2009.
  8. Carmen R. Green et al. The unequal burden of pain: Confronting racial and ethnic disparities in pain. Pain Medicine, 4(3):277–294, Sep 2003.
  9. Kelly M. Hoffman et al. Racial bias in pain assessment and treatment recommendations, and false beliefs about biological differences between blacks and whites. Proceedings of the National Academy of Sciences, 113(16):4296–4301, 2016.
  10. H. Majedi et al. Assessment of factors predicting inadequate pain management in chronic pain patients. Anesthesiology and Pain Medicine, 9, 2019.
  11. Jacob Devlin et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 2019.
  12. Tao Li et al. UNQOVERing stereotyping biases via underspecified questions. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3475–3489, Online, November 2020.
  13. Kris McGuffie et al. The radicalization risks of GPT-3 and advanced neural language models. CoRR, abs/2009.06807, 2020.
  14. Emily Sheng et al. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3412, Hong Kong, China, November 2019.
  15. Emily Sheng et al. Towards controllable biases in language generation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3239–3254, Online, November 2020.
  16. Nikita Nangia et al. CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online, November 2020.
  17. Moin Nadeem et al. StereoSet: Measuring stereotypical bias in pretrained language models, 2020.
  18. Tom Brown et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.
  19. Alec Radford et al. Language models are unsupervised multitask learners. 2019.


Access Policy:
Anyone can access the files, as long as they conform to the terms of the specified license.

License (for files):
Creative Commons Attribution-ShareAlike 4.0 International Public License



Total uncompressed size: 97.6 KB.

Access the files

Name Size Modified
LICENSE.txt 16.0 KB 2021-06-10
Q_Pain_Experiments.ipynb 20.2 KB 2021-06-03
SHA256SUMS.txt 609 B 2021-06-11
data_acute_cancer.csv 11.9 KB 2021-06-03
data_acute_non_cancer.csv 11.8 KB 2021-06-03
data_chronic_cancer.csv 12.9 KB 2021-06-03
data_chronic_non_cancer.csv 13.0 KB 2021-06-03
data_post_op.csv 11.1 KB 2021-06-03