Am J Ophthalmol. 2023 Oct;254:141-149.
Article in English | MEDLINE | ID: mdl-37339728

ABSTRACT

PURPOSE: To investigate the ability of generative artificial intelligence models to answer ophthalmology board-style questions.

DESIGN: Experimental study.

METHODS: This study evaluated 3 large language models (LLMs) with chat interfaces, Bing Chat (Microsoft) and ChatGPT 3.5 and 4.0 (OpenAI), using 250 questions from the Basic Science and Clinical Science Self-Assessment Program. Whereas ChatGPT is trained on information last updated in 2021, Bing Chat incorporates a more recently indexed internet search to generate its answers. Performance was compared with that of human respondents. Questions were categorized by complexity and patient care phase, and instances of information fabrication or nonlogical reasoning were documented.

MAIN OUTCOME MEASURES: The primary outcome was response accuracy. Secondary outcomes were performance in question subcategories and hallucination frequency.

RESULTS: Human respondents had an average accuracy of 72.2%. ChatGPT-3.5 scored the lowest (58.8%), whereas ChatGPT-4.0 (71.6%) and Bing Chat (71.2%) performed comparably. ChatGPT-4.0 excelled in workup-type questions (odds ratio [OR] 3.89; 95% CI, 1.19-14.73; P = .03) compared with diagnostic questions, but struggled with image interpretation (OR 0.14; 95% CI, 0.05-0.33; P < .01) compared with single-step reasoning questions. Relative to single-step questions, Bing Chat also had difficulty with image interpretation (OR 0.18; 95% CI, 0.08-0.44; P < .01) and multi-step reasoning (OR 0.30; 95% CI, 0.11-0.84; P = .02). ChatGPT-3.5 had the highest rate of hallucinations and nonlogical reasoning (42.4%), followed by Bing Chat (25.6%) and ChatGPT-4.0 (18.0%).

CONCLUSIONS: LLMs (particularly ChatGPT-4.0 and Bing Chat) can perform comparably to human respondents on questions from the Basic Science and Clinical Science Self-Assessment Program. The frequency of hallucinations and nonlogical reasoning suggests room for improvement in the performance of conversational agents in the medical domain.
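For readers unfamiliar with the statistics reported above, the following is a minimal Python sketch of how response accuracy and an odds ratio with a Wald 95% confidence interval can be computed from a 2x2 table of correct/incorrect counts in two question categories. The counts and function name are hypothetical placeholders for illustration only, not data or code from the study.

import math

def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    # 2x2 table: a/b = correct/incorrect in category 1,
    #            c/d = correct/incorrect in category 2.
    or_ = (a * d) / (b * c)
    se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of log(OR)
    lower = math.exp(math.log(or_) - z * se_log_or)
    upper = math.exp(math.log(or_) + z * se_log_or)
    return or_, lower, upper

# Illustrative counts (not from the paper): workup vs. diagnostic questions.
correct_1, incorrect_1 = 18, 4
correct_2, incorrect_2 = 60, 52

accuracy_1 = correct_1 / (correct_1 + incorrect_1)
or_, lower, upper = odds_ratio_wald_ci(correct_1, incorrect_1, correct_2, incorrect_2)
print(f"Category 1 accuracy: {accuracy_1:.1%}")
print(f"OR = {or_:.2f}; 95% CI, {lower:.2f}-{upper:.2f}")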


Subject(s)
Artificial Intelligence , Ophthalmology , Humans , Language , Hallucinations/diagnosis , Internet