ABSTRACT
AIM: To assess the clinical reasoning capabilities of two large language models, ChatGPT-4 and Claude-2.0, compared with those of neonatal nurses in neonatal care scenarios. DESIGN: A cross-sectional comparative study using a survey instrument comprising six neonatal intensive care unit clinical scenarios. PARTICIPANTS: 32 neonatal intensive care nurses with 5-10 years of experience working in the neonatal intensive care units of three medical centers. METHODS: Participants responded to six written clinical scenarios. In parallel, ChatGPT-4 and Claude-2.0 were asked to provide initial assessments and treatment recommendations for the same scenarios. The responses from ChatGPT-4 and Claude-2.0 were then scored by certified neonatal nurse practitioners for accuracy, completeness, and response time. RESULTS: Both models demonstrated clinical reasoning capabilities in neonatal care, with Claude-2.0 significantly outperforming ChatGPT-4 in clinical accuracy and speed. However, limitations were identified across the cases in diagnostic precision, treatment specificity, and response lag. CONCLUSIONS: Although promising, the current limitations reinforce the need for substantial refinement before ChatGPT-4 and Claude-2.0 can be considered for integration into clinical practice. Further validation of these tools is essential to safely leverage this artificial intelligence technology for enhancing clinical decision-making. IMPACT: The study provides an understanding of the reasoning accuracy of new artificial intelligence models in neonatal clinical care. The current accuracy gaps of ChatGPT-4 and Claude-2.0 must be addressed before clinical use.