
Performance of domestic and international large language models in question banks of clinical laboratory medicine / 临床检验杂志

Yuechang LIU; Ziru CHEN; Ming YANG; Chen FU; Tao ZENG.

Chinese Journal of Clinical Laboratory Science; (12): 941-944, 2023.

Article in Chinese | WPRIM | ID: wpr-1019111

ABSTRACT

Objective: To explore the performance of domestic and international large language models (LLMs) on question banks for clinical examination knowledge. Methods: The performance of six domestic or international LLMs was assessed on a question bank of 330 questions for intermediate-level clinical medical laboratory technology. Differences in accuracy and consistency among the LLMs were evaluated using chi-square tests, Fisher's exact tests and logistic regression. Results: The accuracy rates of the four English LLMs, with 95% confidence intervals (95% CI), were 0.56 (95% CI: 0.527-0.601) for ChatGPT, 0.61 (95% CI: 0.572-0.644) for BingAI, 0.64 (95% CI: 0.607-0.678) for Claude and 0.80 (95% CI: 0.767-0.833) for GPT-4, while Xinghuo and Tiangong yielded accuracy rates of 0.52 (95% CI: 0.479-0.561) and 0.45 (95% CI: 0.408-0.482) respectively. Using ChatGPT as the reference model, the odds ratios (OR) of a correct answer for BingAI, Claude and GPT-4 were 1.272 (95% CI: 1.020-1.588), 1.397 (95% CI: 1.119-1.743) and 3.270 (95% CI: 1.904-2.729) respectively. The differences in performance were statistically significant (P < 0.05) for all three models. In terms of consistency, Tiangong and BingAI showed poor consistency, while GPT-4 performed better. Conclusion: Among the six LLMs, GPT-4 demonstrated the highest overall accuracy and consistency in each question category.
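
The relationship between the accuracy rates and the odds ratios in the Results can be illustrated with a short calculation. The sketch below is not the authors' analysis: it computes crude (unadjusted) odds ratios directly from the rounded accuracy rates quoted in the abstract, with ChatGPT as the reference. The function and variable names are hypothetical. The study itself reports using logistic regression, so these values only approximate the published ones (for Claude the crude figure comes out close to the reported 1.397).

```python
# Illustrative sketch only: crude odds ratios from the rounded accuracy
# rates quoted in the abstract. The study used logistic regression, so
# its reported odds ratios need not match these exactly.

def odds(p: float) -> float:
    """Convert a proportion of correct answers into odds of a correct answer."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return p / (1.0 - p)


def odds_ratio(p_model: float, p_reference: float) -> float:
    """Crude odds ratio of a correct answer for a model versus a reference."""
    return odds(p_model) / odds(p_reference)


# Accuracy rates as stated in the abstract (rounded to two decimals there).
REPORTED_ACCURACY = {
    "ChatGPT": 0.56,
    "BingAI": 0.61,
    "Claude": 0.64,
    "GPT-4": 0.80,
}

# Crude odds ratio of each model against ChatGPT, the reference model.
CRUDE_OR = {
    name: odds_ratio(acc, REPORTED_ACCURACY["ChatGPT"])
    for name, acc in REPORTED_ACCURACY.items()
    if name != "ChatGPT"
}

if __name__ == "__main__":
    for name, value in CRUDE_OR.items():
        print(f"{name}: crude OR vs ChatGPT = {value:.3f}")
```

An odds ratio above 1 means the model has higher odds of a correct answer than ChatGPT. Because the inputs are rounded to two decimals, small discrepancies from the published odds ratios are expected.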
