BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study

Andrea Cozzi*, Katja Pinker, Andri Hidber, Tianyu Zhang, Luca Bonomo, Roberto Lo Gullo, Blake Christianson, Marco Curti, Stefania Rizzo, Filippo Del Grande, Ritse M. Mann, Simone Schiaffino

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Background: The performance of publicly available large language models (LLMs) remains unclear for complex clinical tasks.

Purpose: To evaluate the agreement between human readers and LLMs for Breast Imaging Reporting and Data System (BI-RADS) categories assigned based on breast imaging reports written in three languages and to assess the impact of discordant category assignments on clinical management.

Materials and Methods: This retrospective study included reports for women who underwent MRI, mammography, and/or US for breast cancer screening or diagnostic purposes at three referral centers. Reports with findings categorized as BI-RADS 1–5 and written in Italian, English, or Dutch were collected between January 2000 and October 2023. Board-certified breast radiologists and the LLMs GPT-3.5 and GPT-4 (OpenAI) and Bard, now called Gemini (Google), assigned BI-RADS categories using only the findings described by the original radiologists. Agreement between human readers and LLMs for BI-RADS categories was assessed using the Gwet agreement coefficient (AC1 value). Frequencies were calculated for changes in BI-RADS category assignments that would affect clinical management (ie, BI-RADS 0 vs BI-RADS 1 or 2 vs BI-RADS 3 vs BI-RADS 4 or 5) and compared using the McNemar test.

Results: Across 2400 reports, agreement between the original and reviewing radiologists was almost perfect (AC1 = 0.91), while agreement between the original radiologists and GPT-4, GPT-3.5, and Bard was moderate (AC1 = 0.52, 0.48, and 0.42, respectively). Across human readers and LLMs, differences were observed in the frequency of BI-RADS category upgrades or downgrades that would result in changed clinical management (118 of 2400 [4.9%] for human readers, 611 of 2400 [25.5%] for Bard, 573 of 2400 [23.9%] for GPT-3.5, and 435 of 2400 [18.1%] for GPT-4; P < .001) and that would negatively impact clinical management (37 of 2400 [1.5%] for human readers, 435 of 2400 [18.1%] for Bard, 344 of 2400 [14.3%] for GPT-3.5, and 255 of 2400 [10.6%] for GPT-4; P < .001).

Conclusion: LLMs achieved moderate agreement with human reader–assigned BI-RADS categories across reports written in three languages but also yielded a high percentage of discordant BI-RADS categories that would negatively impact clinical management.
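The two statistics named in Materials and Methods are straightforward to reproduce. Below is a minimal Python sketch (not the authors' code; the function names and toy data are hypothetical) of Gwet's first-order agreement coefficient (AC1) for two raters assigning nominal categories, and of the continuity-corrected McNemar chi-square test computed from the discordant cells of a paired 2×2 table.

```python
import numpy as np
from scipy.stats import chi2

def gwet_ac1(rater_a, rater_b, categories):
    """Gwet's AC1 for two raters assigning the same items to nominal categories."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    # Observed proportion of agreement
    pa = np.mean(a == b)
    # Proportion of assignments per category, averaged over the two raters
    pi = np.array([((a == k).mean() + (b == k).mean()) / 2 for k in categories])
    # Gwet's chance-agreement probability
    pe = np.sum(pi * (1 - pi)) / (len(categories) - 1)
    return (pa - pe) / (1 - pe)

def mcnemar_test(b, c):
    """Continuity-corrected McNemar chi-square test from the two
    discordant cell counts (b, c) of a paired 2x2 table."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)

# Toy example only: hypothetical BI-RADS categories (0-5) for 8 reports
original = [4, 2, 3, 1, 5, 2, 0, 3]   # original radiologist
llm      = [4, 3, 3, 1, 4, 2, 3, 3]   # LLM-assigned
print(f"AC1 = {gwet_ac1(original, llm, categories=range(6)):.2f}")

# Paired comparison of management-changing discordance rates; b and c
# are toy counts of reports discordant in one direction only
stat, p = mcnemar_test(b=12, c=30)
print(f"McNemar chi2 = {stat:.1f}, P = {p:.3g}")
```

Note that AC1, unlike Cohen's kappa, weights chance agreement by how unevenly the categories are used, which makes it more stable when one BI-RADS category dominates a report sample.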
Original language: English
Article number: e232133
Journal: Radiology
Volume: 311
Issue number: 1
DOI: 10.1148/radiol.232133
Publication status: Published - 1 Apr 2024

