Abstract
Medical knowledge is context-dependent and requires models to reason consistently across semantically equivalent natural-language expressions. This is particularly crucial for drug names, where patients often use brand names like Advil or Tylenol instead of their generic equivalents. To study this, we create a new robustness dataset, RABBITS, which uses physician expert annotations to swap brand and generic drug names in medical benchmarks and evaluate the resulting performance differences. We assess both open-source and API-based LLMs on MedQA and MedMCQA, revealing a consistent performance drop of 1-10%. Furthermore, we identify a potential source of this fragility: contamination of test data in widely used pre-training datasets.
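The swap-and-measure evaluation the abstract describes can be sketched in a few lines of Python. This is a minimal illustration only: the drug-name pairs, question fields, and model interface below are hypothetical placeholders, not the actual RABBITS annotations or evaluation harness.

```python
import re

# Hypothetical brand -> generic pairs; RABBITS uses physician expert annotations.
BRAND_TO_GENERIC = {
    "Advil": "ibuprofen",
    "Tylenol": "acetaminophen",
}

def swap_drug_names(text: str, mapping: dict[str, str]) -> str:
    """Replace whole-word brand names with their generic equivalents."""
    for brand, generic in mapping.items():
        text = re.sub(rf"\b{re.escape(brand)}\b", generic, text, flags=re.IGNORECASE)
    return text

def accuracy_drop(model, questions: list[dict]) -> float:
    """Accuracy on the original questions minus accuracy after the name swap.

    Each question is assumed to be a dict with 'text' and 'answer' keys;
    `model` is any callable mapping a question string to a predicted answer.
    """
    orig = sum(model(q["text"]) == q["answer"] for q in questions)
    swapped = sum(
        model(swap_drug_names(q["text"], BRAND_TO_GENERIC)) == q["answer"]
        for q in questions
    )
    return (orig - swapped) / len(questions)
```

A positive return value means the model answers less accurately once brand names are replaced, the pattern behind the 1-10% drops reported above; matching on word boundaries keeps the swap from corrupting substrings of longer tokens.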
| Original language | English |
| --- | --- |
| Title of host publication | EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024 |
| Editors | Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 12448-12465 |
| Number of pages | 18 |
| ISBN (Electronic) | 9798891761681 |
| Publication status | Published - 1 Jan 2024 |
| Event | 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024 - Hybrid, Miami, United States. Duration: 12 Nov 2024 → 16 Nov 2024. https://2024.emnlp.org/ |
Publication series

| Series | Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP |
| --- | --- |
Conference

| Conference | 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024 |
| --- | --- |
| Abbreviated title | EMNLP 2024 |
| Country/Territory | United States |
| City | Miami |
| Period | 12/11/24 → 16/11/24 |
| Internet address | https://2024.emnlp.org/ |