1.
Int J Med Inform; 168: 104897, 2022 Dec.
Article in English | MEDLINE | ID: mdl-36306653

ABSTRACT

BACKGROUND: The burden on healthcare systems is mounting continuously owing to population growth and aging, overuse of medical services, and the recent COVID-19 pandemic. This overload is also reducing healthcare quality and outcomes. One solution gaining momentum is the integration of intelligent self-assessment tools, known as symptom checkers, into healthcare providers' systems. To the best of our knowledge, no study so far has investigated the data-gathering capabilities of these tools, which are a crucial resource for simulating doctors' skills in medical interviews.

OBJECTIVES: The goal of this study was to evaluate the data-gathering function of currently available chatbot symptom checkers.

METHODS: We evaluated 8 symptom checkers using 28 clinical vignettes from the repository of MSD Manual case studies. The mean number of predefined pertinent findings per case was 31.8 ± 6.8. The vignettes were entered into the platforms by 3 medical students who simulated the role of the patient. For each conversation, we recorded the number of pertinent findings retrieved and the number of questions asked. We then calculated the recall rate (pertinent findings retrieved out of all predefined pertinent findings) and the efficiency rate (pertinent findings retrieved out of the number of questions asked) of data gathering, and compared them between the platforms.

RESULTS: The overall recall rate across all symptom checkers was 0.32 (2,280/7,112; 95% CI 0.31-0.33) for all pertinent findings, 0.37 (1,110/2,992; 95% CI 0.35-0.39) for present findings, and 0.28 (1,140/4,120; 95% CI 0.26-0.29) for absent findings. Among the symptom checkers, the Kahun platform had the highest recall rate, at 0.51 (450/889; 95% CI 0.47-0.54). Out of 4,877 questions asked overall, 2,280 findings were gathered, yielding an efficiency rate of 0.46 (95% CI 0.45-0.48) across all platforms. Kahun was the most efficient tool, at 0.74 (95% CI 0.70-0.77), without a statistically significant difference from Your.MD, at 0.69 (95% CI 0.65-0.73).

CONCLUSION: The data-gathering performance of currently available symptom checkers is questionable. Among the tools evaluated, Kahun demonstrated the best overall performance.
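The recall and efficiency rates reported above are simple proportions with 95% confidence intervals. The following is a minimal Python sketch of that computation, assuming a Wald (normal-approximation) interval; the abstract does not state which interval the authors actually used.

import math

def proportion_ci(successes, total, z=1.96):
    """Point estimate and Wald 95% CI for a proportion."""
    p = successes / total
    se = math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Counts taken from the abstract above.
r, r_lo, r_hi = proportion_ci(2280, 7112)  # recall: findings retrieved / all predefined findings
e, e_lo, e_hi = proportion_ci(2280, 4877)  # efficiency: findings retrieved / questions asked
print(f"recall     {r:.2f} (95% CI {r_lo:.2f}-{r_hi:.2f})")  # 0.32 (0.31-0.33), as reported
print(f"efficiency {e:.2f} (95% CI {e_lo:.2f}-{e_hi:.2f})")  # CI 0.45-0.48 as reported; the point estimate rounds to 0.47 here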


Subject(s)
COVID-19, Humans, COVID-19/diagnosis, COVID-19/epidemiology, Pandemics, Quality of Health Care, Software
2.
JMIR Med Inform; 9(11): e32507, 2021 Nov 30.
Article in English | MEDLINE | ID: mdl-34672262

ABSTRACT

BACKGROUND: Diagnostic decision support systems (DDSS) are computer programs aimed at improving health care by supporting clinicians in diagnostic decision-making. Previous studies of DDSS have demonstrated their ability to enhance clinicians' diagnostic skills, prevent diagnostic errors, and reduce hospitalization costs. Despite these potential benefits, their use in clinical practice remains limited, emphasizing the need for new and improved products.

OBJECTIVE: The aim of this study was to conduct a preliminary analysis of the diagnostic performance of "Kahun," a new artificial intelligence-driven diagnostic tool.

METHODS: Diagnostic performance was evaluated based on the program's ability to "solve" clinical cases from United States Medical Licensing Examination Step 2 Clinical Skills board exam simulations drawn from the case banks of 3 leading preparation companies. Each case included 3 expected differential diagnoses. The cases were entered into the Kahun platform by 3 blinded junior physicians. For each case, the presence and rank of the correct diagnoses within the generated differential diagnosis list were recorded. Diagnostic performance was measured in two ways: first, as diagnostic sensitivity, and second, as case-specific success rates representing diagnostic comprehensiveness.

RESULTS: The study included 91 clinical cases with 78 different chief complaints and a mean of 38 (SD 8) findings per case. The total number of expected diagnoses was 272, of which 174 were distinct (some appeared in more than one case). Of the 272 expected diagnoses, 231 (87.5%; 95% CI 76-99) were suggested within the top 20 listed diagnoses, 209 (76.8%; 95% CI 66-87) within the top 10, and 168 (61.8%; 95% CI 52-71) within the top 5. The median rank of correct diagnoses was 3 (IQR 2-6). In 62 of the 91 cases (68%; 95% CI 59-78), all 3 expected diagnoses were suggested within the top 20 listed diagnoses; in 44 (48%; 95% CI 38-59), within the top 10; and in 24 (26%; 95% CI 17-35), within the top 5. In 87 of the 91 cases (96%; 95% CI 91-100), at least 2 of the 3 expected diagnoses were suggested within the top 20; in 78 (86%; 95% CI 79-93), within the top 10; and in 61 (67%; 95% CI 57-77), within the top 5.

CONCLUSIONS: The diagnostic support tool evaluated in this study demonstrated good diagnostic accuracy and comprehensiveness, along with the ability to manage a wide range of clinical findings.
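The two metrics described above, diagnostic sensitivity (how often an expected diagnosis appears within the top k of the generated differential list) and case-specific success rates (how many of a case's 3 expected diagnoses appear within the top k), can be sketched in Python as follows. The rank values here are hypothetical toy data, not the study data.

from statistics import median

def top_k_sensitivity(ranks, k):
    """Fraction of expected diagnoses ranked within the top k.
    `ranks` holds each diagnosis's position in the generated
    differential list, or None if it was never suggested."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)

def case_success(case_ranks, k, min_hits):
    """True if at least `min_hits` of a case's expected diagnoses
    rank within the top k of the generated list."""
    return sum(1 for r in case_ranks if r is not None and r <= k) >= min_hits

ranks = [1, 3, 7, 12, 25, None]  # hypothetical example
for k in (5, 10, 20):
    print(f"top-{k} sensitivity: {top_k_sensitivity(ranks, k):.2f}")
print("median rank:", median(r for r in ranks if r is not None))
print("case solved (>=2 of 3 in top 10):", case_success([2, 9, 30], k=10, min_hits=2))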
