Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias

Shan Chen, Jack Gallifant, Mingye Gao, Pedro Moreira, Nikolaj Munch, Ajay Muthukkumar, Arvind Rajan, Jaya Kolluri, Amelia Fiske, Janna Hastings, Hugo Aerts, Brian Anthony, Leo Anthony Celi, William G. La Cava, Danielle S. Bitterman*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference article in proceeding; Academic; peer-review

Abstract

Large language models (LLMs) are increasingly essential in processing natural languages, yet their application is frequently compromised by biases and inaccuracies originating in their training data. In this study, we introduce Cross-Care, the first benchmark framework dedicated to assessing biases and real-world knowledge in LLMs, specifically focusing on the representation of disease prevalence across diverse demographic groups. We systematically evaluate how demographic biases embedded in pre-training corpora like ThePile influence the outputs of LLMs. We expose and quantify discrepancies by juxtaposing these biases against actual disease prevalences in various U.S. demographic groups. Our results highlight substantial misalignment between LLM representation of disease prevalence and real disease prevalence rates across demographic subgroups, indicating a pronounced risk of bias propagation and a lack of real-world grounding for medical applications of LLMs. Furthermore, we observe that various alignment methods minimally resolve inconsistencies in the models' representation of disease prevalence across different languages. For further exploration and analysis, we make all data and a data visualization tool available at: www.crosscare.net.
Original language: English
Title of host publication: 38th Conference on Neural Information Processing Systems
Subtitle of host publication: NeurIPS 2024
Editors: A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, C. Zhang
Place of Publication: Vancouver
Publisher: Neural Information Processing Systems Foundation
Volume: 37
ISBN (Print): 10495258
Publication status: Published - 1 Jan 2024
Event: 38th Conference on Neural Information Processing Systems, NeurIPS 2024 - Vancouver, Canada
Duration: 10 Dec 2024 to 15 Dec 2024
Conference number: 38
https://neurips.cc/Conferences/2024

Publication series

Series: Advances in Neural Information Processing Systems
ISSN: 1049-5258

Conference

Conference: 38th Conference on Neural Information Processing Systems, NeurIPS 2024
Abbreviated title: NeurIPS 2024
Country/Territory: Canada
City: Vancouver
Period: 10/12/24 to 15/12/24