Evaluating the ChatGPT family of models for biomedical reasoning and classification

Shan Chen; Yingya Li; Sheng Lu; Hoang Van; Hugo J. W. L. Aerts; Guergana K. Savova; Danielle S. Bitterman

doi:10.1093/jamia/ocad256

Corresponding author: Danielle S. Bitterman

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Objective: Large language models (LLMs) have shown impressive ability in biomedical question-answering, but have not been adequately investigated for more specific biomedical applications. This study investigates the ChatGPT family of models (GPT-3.5, GPT-4) in biomedical tasks beyond question-answering.

Materials and Methods: We evaluated model performance with 11 122 samples for two fundamental tasks in the biomedical domain: classification (n = 8676) and reasoning (n = 2446). The first task involves classifying health advice in scientific literature, while the second task is detecting causal relations in biomedical literature. We used 20% of the dataset for prompt development, including zero- and few-shot settings with and without chain-of-thought (CoT). We then evaluated the best prompts from each setting on the remaining dataset, comparing them to models using simple features (bag-of-words [BoW] with logistic regression) and fine-tuned BioBERT models.

Results: Fine-tuning BioBERT produced the best classification (F1: 0.800-0.902) and reasoning (F1: 0.851) results. Among LLM approaches, few-shot CoT achieved the best classification (F1: 0.671-0.770) and reasoning (F1: 0.682) results, comparable to the BoW model (F1: 0.602-0.753 and 0.675 for classification and reasoning, respectively). It took 78 h to obtain the best LLM results, compared to 0.078 and 0.008 h for the top-performing BioBERT and BoW models, respectively.

Discussion: The simple BoW model performed similarly to the most complex LLM prompting. Prompt engineering required significant investment.

Conclusion: Despite the excitement around viral ChatGPT, fine-tuning for two fundamental biomedical natural language processing tasks remained the best strategy.
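The time figures in the abstract are easier to compare as ratios. The short Python sketch below is illustrative only and is not taken from the paper: it checks that the two task sizes sum to the reported total and computes how many times longer the best LLM result took than each baseline. The constant and function names are hypothetical, and the abstract does not state exactly what each hour figure covers, so the ratios should be read as rough orders of magnitude.

```python
# Figures quoted in the abstract; all names below are illustrative.
N_CLASSIFICATION = 8676   # health-advice classification samples
N_REASONING = 2446        # causal-relation detection samples
N_TOTAL = 11122           # total samples reported

HOURS_LLM = 78.0          # best LLM result (few-shot CoT)
HOURS_BIOBERT = 0.078     # top-performing fine-tuned BioBERT
HOURS_BOW = 0.008         # top-performing BoW + logistic regression


def slowdown(slow_hours: float, fast_hours: float) -> float:
    """Return how many times longer the slower approach took."""
    return slow_hours / fast_hours


# The two task sizes should add up to the reported total.
assert N_CLASSIFICATION + N_REASONING == N_TOTAL

print("LLM vs BioBERT:", round(slowdown(HOURS_LLM, HOURS_BIOBERT)), "x")
print("LLM vs BoW:", round(slowdown(HOURS_LLM, HOURS_BOW)), "x")
print("BioBERT time in minutes:", round(HOURS_BIOBERT * 60, 2))
```

On these figures the best LLM result took roughly 1000 times as long as fine-tuned BioBERT and roughly 9750 times as long as the BoW baseline.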

Original language	English
Article number	ocad256
Pages (from-to)	940-948
Number of pages	9
Journal	Journal of the American Medical Informatics Association
Volume	31
Issue number	4
Early online date	22 Jan 2024
DOIs	https://doi.org/10.1093/jamia/ocad256
Publication status	Published - 1 Apr 2024

Keywords

natural language processing
ChatGPT
biomedical research
classification
reasoning

Embargoed Document

Full Text
Final published version, 1.7 MB
Licence: Taverne
Embargo ends: 22/07/24

Cite this

@article{064ab7391a864e9bb2125b1241b91b4c,
  title = "Evaluating the ChatGPT family of models for biomedical reasoning and classification",
  abstract = "Objective: Large language models (LLMs) have shown impressive ability in biomedical question-answering, but have not been adequately investigated for more specific biomedical applications. This study investigates the ChatGPT family of models (GPT-3.5, GPT-4) in biomedical tasks beyond question-answering. Materials and Methods: We evaluated model performance with 11 122 samples for two fundamental tasks in the biomedical domain: classification (n = 8676) and reasoning (n = 2446). The first task involves classifying health advice in scientific literature, while the second task is detecting causal relations in biomedical literature. We used 20% of the dataset for prompt development, including zero- and few-shot settings with and without chain-of-thought (CoT). We then evaluated the best prompts from each setting on the remaining dataset, comparing them to models using simple features (BoW with logistic regression) and fine-tuned BioBERT models. Results: Fine-tuning BioBERT produced the best classification (F1: 0.800-0.902) and reasoning (F1: 0.851) results. Among LLM approaches, few-shot CoT achieved the best classification (F1: 0.671-0.770) and reasoning (F1: 0.682) results, comparable to the BoW model (F1: 0.602-0.753 and 0.675 for classification and reasoning, respectively). It took 78 h to obtain the best LLM results, compared to 0.078 and 0.008 h for the top-performing BioBERT and BoW models, respectively. Discussion: The simple BoW model performed similarly to the most complex LLM prompting. Prompt engineering required significant investment. Conclusion: Despite the excitement around viral ChatGPT, fine-tuning for two fundamental biomedical natural language processing tasks remained the best strategy.",
  keywords = "natural language processing, ChatGPT, biomedical research, classification, reasoning",
  author = "Chen, Shan and Li, Yingya and Lu, Sheng and Van, Hoang and Aerts, {Hugo J. W. L.} and Savova, {Guergana K.} and Bitterman, {Danielle S.}",
  year = "2024",
  month = apr,
  day = "1",
  doi = "10.1093/jamia/ocad256",
  language = "English",
  volume = "31",
  number = "4",
  pages = "940--948",
  journal = "Journal of the American Medical Informatics Association",
  issn = "1067-5027",
  publisher = "Oxford University Press",
}

TY  - JOUR
T1  - Evaluating the ChatGPT family of models for biomedical reasoning and classification
AU  - Chen, Shan
AU  - Li, Yingya
AU  - Lu, Sheng
AU  - Van, Hoang
AU  - Aerts, Hugo J. W. L.
AU  - Savova, Guergana K.
AU  - Bitterman, Danielle S.
PY  - 2024/4/1
Y1  - 2024/4/1
AB  - Objective: Large language models (LLMs) have shown impressive ability in biomedical question-answering, but have not been adequately investigated for more specific biomedical applications. This study investigates the ChatGPT family of models (GPT-3.5, GPT-4) in biomedical tasks beyond question-answering. Materials and Methods: We evaluated model performance with 11 122 samples for two fundamental tasks in the biomedical domain: classification (n = 8676) and reasoning (n = 2446). The first task involves classifying health advice in scientific literature, while the second task is detecting causal relations in biomedical literature. We used 20% of the dataset for prompt development, including zero- and few-shot settings with and without chain-of-thought (CoT). We then evaluated the best prompts from each setting on the remaining dataset, comparing them to models using simple features (BoW with logistic regression) and fine-tuned BioBERT models. Results: Fine-tuning BioBERT produced the best classification (F1: 0.800-0.902) and reasoning (F1: 0.851) results. Among LLM approaches, few-shot CoT achieved the best classification (F1: 0.671-0.770) and reasoning (F1: 0.682) results, comparable to the BoW model (F1: 0.602-0.753 and 0.675 for classification and reasoning, respectively). It took 78 h to obtain the best LLM results, compared to 0.078 and 0.008 h for the top-performing BioBERT and BoW models, respectively. Discussion: The simple BoW model performed similarly to the most complex LLM prompting. Prompt engineering required significant investment. Conclusion: Despite the excitement around viral ChatGPT, fine-tuning for two fundamental biomedical natural language processing tasks remained the best strategy.
KW  - natural language processing
KW  - ChatGPT
KW  - biomedical research
KW  - classification
KW  - reasoning
U2  - 10.1093/jamia/ocad256
DO  - 10.1093/jamia/ocad256
M3  - Article
SN  - 1067-5027
VL  - 31
SP  - 940
EP  - 948
JO  - Journal of the American Medical Informatics Association
JF  - Journal of the American Medical Informatics Association
IS  - 4
M1  - ocad256
ER  - 