Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

Jack Gallifant, Shan Chen, Pedro Moreira, Nikolaj Munch, Mingye Gao, Jackson Pond, Hugo Aerts, Leo Anthony Celi, Thomas Hartvigsen, Danielle S. Bitterman*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceeding › Academic › peer-review

Abstract

Medical knowledge is context-dependent and requires consistent reasoning across various natural language expressions of semantically equivalent phrases. This is particularly crucial for drug names, where patients often use brand names like Advil or Tylenol instead of their generic equivalents. To study this, we create a new robustness dataset, RABBITS, to evaluate performance differences on medical benchmarks after swapping brand and generic drug names using physician expert annotations. We assess both open-source and API-based LLMs on MedQA and MedMCQA, revealing a consistent performance drop of 1-10%. Furthermore, we identify a potential source of this fragility as the contamination of test data in widely used pre-training datasets.
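The evaluation idea in the abstract (swap brand and generic names, then compare benchmark accuracy before and after) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the two-entry mapping and the helper names `swap_drug_names`, `accuracy`, and `accuracy_drop` are assumptions made for illustration, and the real RABBITS mapping comes from physician expert annotations rather than a hand-written dictionary.

```python
import re

# Hypothetical brand -> generic pairs, for illustration only. The RABBITS
# dataset itself relies on physician expert annotations.
BRAND_TO_GENERIC = {
    "Advil": "ibuprofen",
    "Tylenol": "acetaminophen",
}


def swap_drug_names(text, mapping):
    """Replace each whole-word key of `mapping` found in `text` with its value."""
    if not mapping:
        return text
    # Longest names first, so a longer name wins over a shorter prefix.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        flags=re.IGNORECASE,
    )
    lookup = {name.lower(): value for name, value in mapping.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


def accuracy(predictions, answers):
    """Fraction of predictions that match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have equal length")
    if not answers:
        raise ValueError("cannot score an empty benchmark")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def accuracy_drop(original_preds, swapped_preds, answers):
    """Accuracy on the original questions minus accuracy after the swap."""
    return accuracy(original_preds, answers) - accuracy(swapped_preds, answers)
```

In use, a model would be queried once on each original question and once on its swapped counterpart, and `accuracy_drop` would be computed over the two sets of predictions. Inverting the dictionary gives the generic-to-brand direction.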
Original language: English
Title of host publication: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Publisher: Association for Computational Linguistics (ACL)
Pages: 12448-12465
Number of pages: 18
ISBN (Electronic): 9798891761681
Publication status: Published - 1 Jan 2024
Event: 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024 - Hybrid, Miami, United States
Duration: 12 Nov 2024 to 16 Nov 2024
https://2024.emnlp.org/

Publication series

Series: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP

Conference

Conference: 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
Abbreviated title: EMNLP 2024
Country/Territory: United States
City: Miami
Period: 12/11/24 to 16/11/24
