1.
JMIR Med Educ ; 9: e50514, 2023 Sep 19.
Article in English | MEDLINE | ID: mdl-37725411

ABSTRACT

BACKGROUND: Large language model (LLM)-based chatbots are evolving at an unprecedented pace with the release of ChatGPT, specifically GPT-3.5, and its successor, GPT-4. Their general-purpose and language-generation capabilities have advanced to the point of performing strongly on various educational examination benchmarks, including medical knowledge tests. Comparing the performance of these 2 LLMs to that of Family Medicine residents on a multiple-choice medical knowledge test can provide insights into their potential as medical education tools.

OBJECTIVE: This study aimed to quantitatively and qualitatively compare the performance of GPT-3.5, GPT-4, and Family Medicine residents on a multiple-choice medical knowledge test appropriate for the level of a Family Medicine resident.

METHODS: An official University of Toronto Department of Family and Community Medicine Progress Test consisting of multiple-choice questions was inputted into GPT-3.5 and GPT-4. The artificial intelligence chatbots' responses were manually reviewed to determine the selected answer, response length, response time, whether a rationale was provided for the outputted response, and the root cause of all incorrect responses (classified into arithmetic, logical, and information errors). The performance of the artificial intelligence chatbots was compared against a cohort of Family Medicine residents who concurrently attempted the test.

RESULTS: GPT-4 performed significantly better than GPT-3.5 (difference 25.0%, 95% CI 16.3%-32.8%; McNemar test: P<.001); it answered 89/108 (82.4%) questions correctly, while GPT-3.5 answered 62/108 (57.4%) correctly. Furthermore, GPT-4 scored higher across all 11 categories of Family Medicine knowledge. GPT-4 provided a rationale for why other multiple-choice options were not chosen in 86.1% (n=93) of its responses, compared with 16.7% (n=18) for GPT-3.5. Qualitatively, for both GPT-3.5 and GPT-4 responses, logical errors were the most common and arithmetic errors the least common. The average performance of Family Medicine residents was 56.9% (95% CI 56.2%-57.6%). The performance of GPT-3.5 was similar to that of the average Family Medicine resident (P=.16), while the performance of GPT-4 exceeded that of the top-performing Family Medicine resident (P<.001).

CONCLUSIONS: GPT-4 significantly outperforms both GPT-3.5 and Family Medicine residents on a multiple-choice medical knowledge test designed for Family Medicine residents. GPT-4 provides a logical rationale for its answer choice, ruling out the other options efficiently and with concise justification. Its high accuracy and advanced reasoning capabilities support potential applications in medical education, including the creation of examination questions and scenarios as well as serving as a resource for medical knowledge or information on community services.
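The paired comparison reported in this abstract (a McNemar test on the two chatbots' per-question correctness) can be illustrated with a minimal sketch. The answer vectors below are hypothetical placeholders, not the study's data, and the function is a generic continuity-corrected implementation rather than the authors' analysis code.

# Minimal sketch, assuming paired True/False correctness per question for two models.
from scipy.stats import chi2

def mcnemar_test(correct_a, correct_b):
    """Continuity-corrected McNemar's test on paired right/wrong outcomes."""
    # Discordant pairs: b = A right / B wrong, c = A wrong / B right.
    b = sum(1 for a, o in zip(correct_a, correct_b) if a and not o)
    c = sum(1 for a, o in zip(correct_a, correct_b) if not a and o)
    stat = (abs(b - c) - 1) ** 2 / (b + c)   # chi-square statistic with 1 df
    return stat, chi2.sf(stat, df=1)         # survival function gives the p-value

# Toy example with 10 paired questions (True = answered correctly); fabricated data.
gpt4  = [True, True, True, False, True, True, True, True, False, True]
gpt35 = [True, False, True, False, False, True, False, True, False, True]
stat, p = mcnemar_test(gpt4, gpt35)
print(f"McNemar chi2 = {stat:.2f}, p = {p:.3f}")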

2.
JMIR Med Educ ; 9: e41953, 2023 Jul 27.
Article in English | MEDLINE | ID: mdl-37498660

ABSTRACT

BACKGROUND: Field notes, a form used to document resident-preceptor feedback on clinical encounters, are widely adopted across Canadian medical residency training programs to record residents' performance. This process generates a sizeable cumulative collection of feedback text that is difficult for medical education faculty to navigate. Because sentiment analysis is a subfield of text mining that can efficiently synthesize the polarity of a text collection, it may serve as an innovative solution.

OBJECTIVE: This study aimed to examine the feasibility and utility of sentiment analysis of medical resident field notes using 3 popular sentiment lexicons.

METHODS: We used a retrospective cohort design, curating text data from University of Toronto medical resident field notes gathered over 2 years (July 2019 to June 2021). Lexicon-based sentiment analysis was applied using 3 standardized dictionaries, modified by removing ambiguous words as determined by a medical subject matter expert. Our modified lexicons assigned a sentiment score to words in the text data, and we aggregated the word-level scores into a document-level polarity score. Agreement between dictionaries was assessed, and the document-level polarity was correlated with the preceptor's overall rating of the clinical encounter under assessment.

RESULTS: Across the 3 original dictionaries, approximately a third of the labeled words in our field note corpus were deemed ambiguous and were removed to create the modified dictionaries. Across the 3 modified dictionaries, the mean sentiment of the "Strengths" section of the field notes was mildly positive, while that of the "Areas of Improvement" section was slightly less positive. We observed reasonable agreement between dictionaries for sentiment scores in both field note sections. Overall, the proportion of positively labeled documents increased, and the proportion of negatively labeled documents decreased, with higher overall preceptor ratings.

CONCLUSIONS: Applying sentiment analysis to systematically analyze field notes is feasible. However, the applicability of existing lexicons in the medical setting is limited, even after the removal of ambiguous words, warranting the creation of new dictionaries specific to the medical education context. Additionally, aspect-based sentiment analysis may be applied to capture the more nuanced structure of the texts when identifying sentiments. Ultimately, this will allow more robust inferences for discovering opportunities to improve resident teaching curricula.
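The lexicon-based pipeline described in this abstract (word-level scores from a dictionary with ambiguous terms removed, aggregated to a document-level polarity) can be sketched roughly as below. The tiny lexicon, the ambiguous-word list, and the sample field-note sentence are all invented for illustration; the study's actual dictionaries and preprocessing are not reproduced here.

# Minimal sketch, assuming a word-level sentiment dictionary and plain-text field notes.
import re

lexicon = {"excellent": 1, "clear": 1, "thorough": 1, "confident": 1,
           "unclear": -1, "missed": -1, "disorganized": -1, "incomplete": -1,
           "negative": -1, "patient": 0}
# Words a medical subject matter expert might flag as ambiguous in clinical text.
ambiguous = {"patient", "negative"}
modified_lexicon = {w: s for w, s in lexicon.items() if w not in ambiguous}

def document_polarity(text):
    """Sum word-level sentiment scores and normalize by the number of scored words."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [modified_lexicon[w] for w in words if w in modified_lexicon]
    return sum(scores) / len(scores) if scores else 0.0

note = "Thorough history taking and clear plan, but missed the negative family history."
print(document_polarity(note))  # mildly positive once the ambiguous word is excluded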
