TY - JOUR
T1 - Assessing the possibility of using large language models in ocular surface diseases
AU - Ling, Qian
AU - Xu, Zi-Song
AU - Zeng, Yan-Mei
AU - Hong, Qi
AU - Qian, Xian-Zhe
AU - Hu, Jin-Yu
AU - Pei, Chong-Gang
AU - Wei, Hong
AU - Zou, Jie
AU - Chen, Cheng
AU - Wang, Xiao-Yu
AU - Chen, Xu
AU - Wu, Zhen-Kai
AU - Shao, Yi
PY - 2025/1/18
Y1 - 2025/1/18
N2 - AIM: To assess the possibility of using large language models (LLMs) in ocular surface diseases by testing the accuracy of five LLMs (ChatGPT-4, ChatGPT-3.5, Claude 2, PaLM2, and SenseNova) in answering specialized questions on ocular surface diseases. METHODS: A group of experienced ophthalmology professors was asked to develop a 100-question single-choice exam on ocular surface diseases, designed to assess the performance of LLMs and human participants in answering ophthalmology specialty exam questions. The exam included questions on the following topics: keratitis (20 questions); keratoconus, keratomalacia, corneal dystrophy, corneal degeneration, erosive corneal ulcers, and corneal lesions associated with systemic diseases (20 questions); conjunctivitis (20 questions); trachoma, pterygium, and conjunctival tumor diseases (20 questions); and dry eye disease (20 questions). The total score of each LLM was then calculated, and the mean score, mean correlation, variance, and confidence of the LLMs were compared. RESULTS: GPT-4 exhibited the highest performance among the LLMs. When the average scores of the LLM group were compared with those of the four human groups (chief physician, attending physician, regular trainee, and graduate student), every LLM except ChatGPT-4 had a total score lower than that of the graduate student group, which had the lowest score among the human groups. Both ChatGPT-4 and PaLM2 were more likely to give exact and correct answers, with very little chance of an incorrect answer. ChatGPT-4 showed higher credibility when answering questions, with a success rate of 59%, but gave a wrong answer 28% of the time. CONCLUSION: The GPT-4 model exhibits excellent performance in both answer relevance and confidence. PaLM2 shows a positive correlation (up to 0.8) in terms of answer accuracy during the exam. In terms of answer confidence, PaLM2 is second only to GPT-4 and surpasses Claude 2, SenseNova, and GPT-3.5. Although ocular surface disease is a highly specialized discipline, GPT-4 still exhibits superior performance, suggesting that it has enormous potential for application in this field and may become a valuable resource for medical students and clinicians in the future.
AB - AIM: To assess the possibility of using large language models (LLMs) in ocular surface diseases by testing the accuracy of five LLMs (ChatGPT-4, ChatGPT-3.5, Claude 2, PaLM2, and SenseNova) in answering specialized questions on ocular surface diseases. METHODS: A group of experienced ophthalmology professors was asked to develop a 100-question single-choice exam on ocular surface diseases, designed to assess the performance of LLMs and human participants in answering ophthalmology specialty exam questions. The exam included questions on the following topics: keratitis (20 questions); keratoconus, keratomalacia, corneal dystrophy, corneal degeneration, erosive corneal ulcers, and corneal lesions associated with systemic diseases (20 questions); conjunctivitis (20 questions); trachoma, pterygium, and conjunctival tumor diseases (20 questions); and dry eye disease (20 questions). The total score of each LLM was then calculated, and the mean score, mean correlation, variance, and confidence of the LLMs were compared. RESULTS: GPT-4 exhibited the highest performance among the LLMs. When the average scores of the LLM group were compared with those of the four human groups (chief physician, attending physician, regular trainee, and graduate student), every LLM except ChatGPT-4 had a total score lower than that of the graduate student group, which had the lowest score among the human groups. Both ChatGPT-4 and PaLM2 were more likely to give exact and correct answers, with very little chance of an incorrect answer. ChatGPT-4 showed higher credibility when answering questions, with a success rate of 59%, but gave a wrong answer 28% of the time. CONCLUSION: The GPT-4 model exhibits excellent performance in both answer relevance and confidence. PaLM2 shows a positive correlation (up to 0.8) in terms of answer accuracy during the exam. In terms of answer confidence, PaLM2 is second only to GPT-4 and surpasses Claude 2, SenseNova, and GPT-3.5. Although ocular surface disease is a highly specialized discipline, GPT-4 still exhibits superior performance, suggesting that it has enormous potential for application in this field and may become a valuable resource for medical students and clinicians in the future.
KW - ChatGPT-4.0
KW - ChatGPT-3.5
KW - large language models
KW - ocular surface diseases
KW - ARTIFICIAL-INTELLIGENCE
KW - CARE
U2 - 10.18240/ijo.2025.01.01
DO - 10.18240/ijo.2025.01.01
M3 - Article
SN - 2222-3959
VL - 18
SP - 1
EP - 8
JO - International Journal of Ophthalmology
JF - International Journal of Ophthalmology
IS - 1
ER -