ChatGPT-3.5 Versus Google Bard: Which Large Language Model Responds Best to Commonly Asked Pregnancy Questions?
Khromchenko, Keren; Shaikh, Sameeha; Singh, Meghana; Vurture, Gregory; Rana, Rima A; Baum, Jonathan D.
Affiliation
  • Khromchenko K; Obstetrics and Gynecology, Hackensack Meridian Jersey Shore University Medical Center, Neptune, USA.
  • Shaikh S; Obstetrics and Gynecology, Hackensack Meridian School of Medicine, Nutley, USA.
  • Singh M; Obstetrics and Gynecology, Hackensack Meridian School of Medicine, Nutley, USA.
  • Vurture G; Obstetrics and Gynecology, Hackensack Meridian Jersey Shore University Medical Center, Neptune, USA.
  • Rana RA; Obstetrics and Gynecology, Hackensack Meridian Jersey Shore University Medical Center, Neptune, USA.
  • Baum JD; Obstetrics and Gynecology, Hackensack Meridian Jersey Shore University Medical Center, Neptune, USA.
Cureus; 16(7): e65543, 2024 Jul.
Article in En | MEDLINE | ID: mdl-39188430
ABSTRACT
Large language models (LLMs) have been widely used to provide information in many fields, including obstetrics and gynecology. Which model performs best in answering commonly asked pregnancy questions is unknown. A qualitative analysis of Chat Generative Pre-trained Transformer Version 3.5 (ChatGPT-3.5) (OpenAI, Inc., San Francisco, California, United States) and Bard, recently renamed Google Gemini (Google LLC, Mountain View, California, United States), was performed in August of 2023. Each LLM was queried on 12 commonly asked pregnancy questions and asked to provide its references. The co-authors reviewed and graded the responses and references for both LLMs individually and then as a group to formulate a consensus. Query responses were graded as "acceptable" or "not acceptable" based on correctness and completeness in comparison to American College of Obstetricians and Gynecologists (ACOG) publications, PubMed-indexed evidence, and clinical experience. References were classified as "verified," "broken," "irrelevant," "non-existent," or "no references." Grades of "acceptable" were given to 58% of ChatGPT-3.5 responses (seven out of 12) and 83% of Bard responses (10 out of 12). ChatGPT-3.5 had issues with 100% of its references, whereas Bard had discrepancies in 8% of its references (one out of 12). When ChatGPT-3.5 responses from May 2023 and August 2023 were compared, the proportion of "acceptable" responses changed from 50% to 58%. Bard answered more questions correctly than ChatGPT-3.5 when queried on a small sample of commonly asked pregnancy questions. ChatGPT-3.5 performed poorly in reference verification. The overall performance of ChatGPT-3.5 remained stable over time, with approximately one-half of responses being "acceptable" in both May and August of 2023. Both LLMs need further evaluation and vetting before being accepted as accurate and reliable sources of information for pregnant women.
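As a rough illustration of the grading tally described in the abstract, the Python sketch below computes acceptability rates from consensus grades. The grade lists, the acceptability_rate helper, and the labels are hypothetical, chosen only to reproduce the reported counts (seven of 12 acceptable for ChatGPT-3.5, 10 of 12 for Bard); they are not the authors' data or code.

    from collections import Counter

    QUESTIONS = 12  # number of commonly asked pregnancy questions in the study

    def acceptability_rate(grades):
        # Hypothetical helper: share of consensus grades marked "acceptable".
        return Counter(grades)["acceptable"] / len(grades)

    # Placeholder consensus grades chosen only to match the reported tallies.
    chatgpt_grades = ["acceptable"] * 7 + ["not acceptable"] * 5
    bard_grades = ["acceptable"] * 10 + ["not acceptable"] * 2
    assert len(chatgpt_grades) == len(bard_grades) == QUESTIONS

    print(f"ChatGPT-3.5: {acceptability_rate(chatgpt_grades):.0%}")  # 58%
    print(f"Bard:        {acceptability_rate(bard_grades):.0%}")     # 83%

The same tallying approach would extend to the five reference classes ("verified," "broken," "irrelevant," "non-existent," "no references") by counting each label with the same Counter.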
Full text: 1 Collections: 01-international Database: MEDLINE Language: En Journal: Cureus Publication year: 2024 Document type: Article Country of affiliation: United States Country of publication: United States