Accuracy of Large Language Models for Infective Endocarditis Prophylaxis in Dental Procedures.
Rewthamrongsris, Paak; Burapacheep, Jirayu; Trachoo, Vorapat; Porntaveetus, Thantrira.
Affiliations
  • Rewthamrongsris P; Department of Anatomy, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand.
  • Burapacheep J; Stanford University, Stanford, California, USA.
  • Trachoo V; Department of Oral and Maxillofacial Surgery, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand.
  • Porntaveetus T; Center of Excellence in Genomics and Precision Dentistry, Clinical Research Center, Geriatric Dentistry and Special Patients Care International Program, Department of Physiology, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand. Electronic address: thantrira.p@chula.ac.th.
Int Dent J; 2024 Oct 11.
Article in En | MEDLINE | ID: mdl-39395898
ABSTRACT

PURPOSE:

Infective endocarditis (IE) is a serious, life-threatening condition requiring antibiotic prophylaxis for high-risk individuals undergoing invasive dental procedures. As large language models (LLMs) are rapidly adopted by dental professionals for their efficiency and accessibility, assessing their accuracy in answering critical questions about antibiotic prophylaxis for IE prevention is crucial.

METHODS:

Twenty-eight true/false questions based on the 2021 American Heart Association (AHA) guidelines for IE were posed to seven popular LLMs. Each model underwent five independent runs per question under two prompt strategies: with a pre-prompt framing the model as an experienced dentist, and without a pre-prompt. Inter-model comparisons utilised the Kruskal-Wallis test, followed by post-hoc pairwise comparisons using Prism 10 software.
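A minimal Python sketch of this kind of evaluation protocol is shown below. The model identifiers, the query_model() helper, and the pre-prompt wording are placeholders (the abstract does not describe the study's actual harness), and the unit of analysis for the Kruskal-Wallis test is assumed here to be per-question accuracy over the five runs.

    # Hypothetical sketch of the evaluation protocol; query_model(),
    # the model list, and the pre-prompt text are assumptions.
    from scipy.stats import kruskal

    MODELS = ["gpt-4o", "gemini-1.5-pro", "claude-3-opus"]  # assumed identifiers
    PRE_PROMPT = "You are an experienced dentist."          # assumed wording
    N_RUNS = 5

    def query_model(model: str, prompt: str) -> bool:
        """Placeholder for an API call returning the model's true/false answer."""
        raise NotImplementedError

    def per_question_accuracy(model, questions, answers, use_pre_prompt):
        """Fraction of the five runs answered correctly, per question."""
        accs = []
        for q, truth in zip(questions, answers):
            prompt = f"{PRE_PROMPT}\n{q}" if use_pre_prompt else q
            correct = sum(query_model(model, prompt) == truth
                          for _ in range(N_RUNS))
            accs.append(correct / N_RUNS)
        return accs

    # Inter-model comparison with the Kruskal-Wallis test:
    # samples = [per_question_accuracy(m, questions, answers, True)
    #            for m in MODELS]
    # stat, p = kruskal(*samples)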

RESULTS:

Significant differences in accuracy were observed among the LLMs. All LLMs had narrower confidence intervals with a pre-prompt, and all except Claude 3 Opus showed improved performance. GPT-4o had the highest accuracy (80% with a pre-prompt, 78.57% without), followed by Gemini 1.5 Pro (78.57% and 77.86%) and Claude 3 Opus (75.71% and 77.14%). Gemini 1.5 Flash had the lowest accuracy (68.57% and 63.57%). Without a pre-prompt, Gemini 1.5 Flash was significantly less accurate than Claude 3 Opus, Gemini 1.5 Pro, and GPT-4o. With a pre-prompt, Gemini 1.5 Flash and Claude 3.5 Sonnet were significantly less accurate than Gemini 1.5 Pro and GPT-4o. None of the LLMs met commonly used accuracy benchmarks. Across the five runs, all models gave a mixture of correct and incorrect answers to the same questions, except Claude 3.5 Sonnet with a pre-prompt, which answered eight questions incorrectly in all five runs.
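For orientation, an accuracy such as GPT-4o's 80% with a pre-prompt corresponds to 112 correct responses out of 140 trials (28 questions x 5 runs). The abstract does not state which interval method the study used, so the Wilson score interval below is only one illustrative way such a confidence interval might be computed.

    import math

    def wilson_ci(correct: int, total: int, z: float = 1.96):
        """95% Wilson score interval for a binomial proportion."""
        p = correct / total
        denom = 1 + z**2 / total
        centre = (p + z**2 / (2 * total)) / denom
        half = z * math.sqrt(p * (1 - p) / total
                             + z**2 / (4 * total**2)) / denom
        return centre - half, centre + half

    # GPT-4o with a pre-prompt: 112/140 correct (illustrative only)
    print(wilson_ci(112, 140))  # roughly (0.726, 0.858)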

CONCLUSION:

LLMs such as GPT-4o show promise for retrieving AHA IE guideline information, achieving up to 80% accuracy. However, complex medical questions may still pose a challenge. Pre-prompts offer a potential solution, and domain-specific training is essential for optimising LLM performance in healthcare, especially as models with larger token limits emerge.

Full text: 1 Collections: 01-international Database: MEDLINE Language: En Journal: Int Dent J / Int. dent. j / International dental journal Publication year: 2024 Document type: Article Country of affiliation: Thailand Country of publication: United Kingdom