Results 1 - 15 of 15
1.
Acad Med ; 99(2): 192-197, 2024 Feb 01.
Article in English | MEDLINE | ID: mdl-37934828

ABSTRACT

PURPOSE: In late 2022 and early 2023, reports that ChatGPT could pass the United States Medical Licensing Examination (USMLE) generated considerable excitement, and media coverage suggested that ChatGPT possesses credible medical knowledge. This report analyzes the extent to which an artificial intelligence (AI) agent's performance on publicly available USMLE sample items can generalize to performance on an actual USMLE examination, using ChatGPT as an illustration. METHOD: As in earlier investigations, analyses were based on publicly available USMLE sample items. Each item was submitted to ChatGPT (version 3.5) three times to evaluate response stability. Responses were scored following rules that match operational practice, and a preliminary analysis explored the characteristics of items that ChatGPT answered correctly. The study was conducted between February and March 2023. RESULTS: For the full sample of items, ChatGPT scored above 60% correct except on one replication for Step 3. Response success varied across replications for 76 items (20%). There was a modest correspondence with item difficulty: ChatGPT was more likely to answer correctly those items that examinees found easier. ChatGPT performed significantly worse (P < .001) on items relating to practice-based learning. CONCLUSIONS: Achieving 60% accuracy is only an approximate indicator of meeting the passing standard, and a formal comparison would require statistical adjustments. Hence, this assessment can only suggest consistency with the passing standards for Step 1 and Step 2 Clinical Knowledge, and extrapolating that inference to Step 3 is further limited by variation in item difficulty and by the exclusion of the simulation component of Step 3 from the evaluation, limitations that would apply to any AI system evaluated on the Step 3 sample items. Notably, large language model responses vary appreciably across repeated submissions of the same question, underscoring the need for expert validation before such models are used as a learning tool.


Subject(s)
Artificial Intelligence , Knowledge , Humans , Computer Simulation , Language , Learning
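A minimal Python sketch of the replication-stability analysis described in the abstract above, assuming a hypothetical file with one row per sample item, columns rep1-rep3 holding scored responses (1 = correct, 0 = incorrect) from the three ChatGPT submissions, and an examinee-based p-value per item; none of these names come from the article.

import pandas as pd

items = pd.read_csv("chatgpt_usmle_items.csv")   # hypothetical item-level file
reps = items[["rep1", "rep2", "rep3"]]

# Percent correct on each of the three replications
print(reps.mean().round(3))

# Share of items whose outcome varied across replications (the abstract reports 20%)
inconsistent = reps.nunique(axis=1) > 1
print(f"Inconsistent items: {inconsistent.mean():.1%}")

# Correspondence with item difficulty: items ChatGPT answered correctly more often
# should tend to be the ones examinees also found easier (higher p-value)
items["n_correct"] = reps.sum(axis=1)
print(items[["n_correct", "p_value"]].corr(method="spearman"))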
2.
Appl Psychol Meas ; 47(1): 34-47, 2023 Jan.
Article in English | MEDLINE | ID: mdl-36425288

ABSTRACT

In equating practice, outliers among the anchor items can degrade equating accuracy and threaten the validity of test scores. The stability of anchor item performance should therefore be evaluated before equating is conducted. This study used simulation to investigate the performance of the t-test method in detecting outliers and compared it with other outlier detection methods, including the logit difference method with 0.5 and 0.3 as cutoff values and the robust z statistic with 2.7 as the cutoff value. The factors investigated included sample size, proportion of outliers, direction of item difficulty drift, and group difference. Across all simulated conditions, the t-test method outperformed the other methods in terms of sensitivity in flagging true outliers, bias of the estimated translation constant, and root mean square error of examinee ability estimates.
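The flagging statistics compared in this abstract can be sketched compactly; the formulas below follow common equating practice rather than the article itself, and the t-type statistic in particular is only one plausible formulation.

import numpy as np

def logit_difference_flags(b_old, b_new, cutoff=0.5):
    # Flag anchor items whose difficulty shifted by more than `cutoff` logits (0.5 or 0.3)
    return np.abs(np.asarray(b_new) - np.asarray(b_old)) > cutoff

def robust_z_flags(b_old, b_new, cutoff=2.7):
    # Robust z: difficulty differences centered at the median and scaled by 0.74 * IQR,
    # so a few drifting items do not distort the scale
    d = np.asarray(b_new) - np.asarray(b_old)
    iqr = np.subtract(*np.percentile(d, [75, 25]))
    z = (d - np.median(d)) / (0.74 * iqr)
    return np.abs(z) > cutoff

def t_flags(b_old, b_new, se_old, se_new, cutoff=1.96):
    # A t-type statistic: difficulty difference divided by its pooled standard error
    # (the article's exact formulation may differ)
    t = (np.asarray(b_new) - np.asarray(b_old)) / np.sqrt(
        np.asarray(se_new) ** 2 + np.asarray(se_old) ** 2)
    return np.abs(t) > cutoff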

3.
Appl Psychol Meas ; 46(6): 529-547, 2022 Sep.
Article in English | MEDLINE | ID: mdl-35991825

ABSTRACT

In common-item equating, item outliers may reduce the accuracy of equating results and carry significant ramifications for the validity of test score interpretations. Common-item equating should therefore include a screening process to flag outlying items and exclude them from the common item set before equating is conducted. This simulation study demonstrated that the sampling variance associated with item response theory (IRT) item parameter estimates can help detect outliers among the common items under the 2PL and 3PL IRT models. The results showed that the proposed sampling variance statistic (SV) outperformed the traditional displacement method with cutoff values of 0.3 and 0.5 across a variety of evaluation criteria. Given these favorable results, item outlier detection statistics based on estimated sampling variability warrant further consideration in both research and practice.

5.
Acad Med ; 96(9): 1324-1331, 2021 09 01.
Article in English | MEDLINE | ID: mdl-34133345

ABSTRACT

PURPOSE: The United States Medical Licensing Examination (USMLE) sequence and the Accreditation Council for Graduate Medical Education (ACGME) milestones represent 2 major components along the continuum of assessment from undergraduate through graduate medical education. This study examines associations between USMLE Step 1 and Step 2 Clinical Knowledge (CK) scores and ACGME emergency medicine (EM) milestone ratings. METHOD: In February 2019, subject matter experts (SMEs) provided judgments of expected associations for each combination of Step examination and EM subcompetency. The resulting sets of subcompetencies with expected strong and weak associations were selected for convergent and discriminant validity analysis, respectively. National-level data for 2013-2018 were provided; the final sample included 6,618 EM residents from 158 training programs. Empirical bivariate correlations between milestone ratings and Step scores were calculated and then compared with the SMEs' judgments. Multilevel regression analyses were conducted on the selected subcompetencies, with milestone ratings as the dependent variable and Step 1 score, Step 2 CK score, and cohort year as independent variables. RESULTS: Regression results showed small but statistically significant positive relationships between Step 2 CK score and the subcompetencies (regression coefficients ranged from 0.02 [95% confidence interval (CI), 0.01-0.03] to 0.12 [95% CI, 0.11-0.13]; all P < .05), with the degree of association matching the SMEs' judgments for 7 of the 9 selected subcompetencies. For example, a 1 standard deviation increase in Step 2 CK score predicted a 0.12 increase in MK-01 milestone rating after controlling for Step 1 score. Step 1 score showed a small statistically significant effect only for the MK-01 subcompetency (regression coefficient = 0.06 [95% CI, 0.05-0.07], P < .05). CONCLUSIONS: These results provide incremental validity evidence supporting the uses of Step 1 and Step 2 CK scores and EM milestone ratings.


Subject(s)
Clinical Competence/statistics & numerical data , Education, Medical, Graduate/statistics & numerical data , Educational Measurement/statistics & numerical data , Emergency Medicine/statistics & numerical data , Internship and Residency/statistics & numerical data , Accreditation , Adult , Educational Measurement/methods , Emergency Medicine/education , Female , Humans , Licensure, Medical , Male , Middle Aged , Multilevel Analysis , Regression Analysis , Reproducibility of Results , United States , Young Adult
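A hedged sketch of the multilevel model described in the abstract above, using statsmodels: milestone rating regressed on standardized Step 1 and Step 2 CK scores and cohort year, with a random intercept for training program. The file and column names are illustrative assumptions, not the study's data.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("em_milestones.csv")            # hypothetical resident-level file
df["step1_z"] = (df["step1"] - df["step1"].mean()) / df["step1"].std()
df["step2ck_z"] = (df["step2ck"] - df["step2ck"].mean()) / df["step2ck"].std()

model = smf.mixedlm(
    "milestone_rating ~ step1_z + step2ck_z + C(cohort_year)",
    data=df,
    groups=df["program_id"],                     # residents nested within programs
)
result = model.fit()
print(result.summary())                          # the Step 2 CK coefficient corresponds to the
                                                 # 0.02-0.12 range reported in the abstract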
6.
Acad Med ; 96(6): 876-884, 2021 06 01.
Article in English | MEDLINE | ID: mdl-33711841

ABSTRACT

PURPOSE: To examine whether milestone ratings submitted by program directors working with clinical competency committees (CCCs) differ by gender for internal medicine (IM) residents, and whether women and men with similar milestone ratings perform comparably on subsequent in-training and certification examinations. METHOD: This national retrospective study examined end-of-year medical knowledge (MK) and patient care (PC) milestone ratings and IM In-Training Examination (IM-ITE) and IM Certification Examination (IM-CE) scores for 2 cohorts (2014-2017, 2015-2018) of U.S. IM residents at ACGME-accredited programs. It included 20,098/21,440 (94%) residents, with 9,424 women (47%) and 10,674 men (53%). Descriptive statistics and differential prediction techniques using hierarchical linear models were applied. RESULTS: For MK milestone ratings in PGY-1, men and women showed no statistically significant difference at a significance level of .01 (P = .02). In PGY-2 and PGY-3, men received statistically higher average MK ratings than women (P = .002 and P < .001, respectively). In contrast, men and women received equivalent average PC ratings in each PGY (P = .47, P = .72, and P = .80 for PGY-1, PGY-2, and PGY-3, respectively). Among residents with similar MK or PC ratings in PGY-1 and PGY-2, men slightly outperformed women on the IM-ITE, by about 1.7 and 1.5 percentage points, respectively, after adjusting for covariates. For PGY-3 ratings, women and men with similar milestone ratings performed equivalently on the IM-CE. CONCLUSIONS: Milestone ratings were largely similar for women and men. Generally, women and men with similar MK or PC milestone ratings performed similarly on future examinations. Although there were small differences favoring men on earlier examinations, these differences disappeared by the final training year, and it is questionable whether they are educationally or clinically meaningful. The findings suggest that the milestone ratings generated by program directors and CCCs assessing residents are fair and unbiased.


Subject(s)
Clinical Competence , Educational Measurement , Internal Medicine/education , Sexism , Adult , Certification , Female , Humans , Internship and Residency , Male , Retrospective Studies , Sex Factors , United States
7.
Teach Learn Med ; 33(4): 366-381, 2021.
Article in English | MEDLINE | ID: mdl-33356583

ABSTRACT

Phenomenon: Schools are considering the optimal timing of Step 1 of the United States Medical Licensing Examination (USMLE). Two primary reasons for moving Step 1 after the core clerkships are to promote deeper, more integrated basic science learning in clinical contexts and to better prepare students for the increasingly clinical focus of Step 1. Positioning Step 1 after the core clerkships leverages a major national assessment to drive learning, encouraging students to deepen their basic science knowledge while in the clinical setting. Previous studies demonstrated small increases in Step 1 scores, reductions in failure rates, and similar Step 2 Clinical Knowledge scores when Step 1 was taken after the clerkships. Some schools that have moved Step 1 reported declines in clinical subject examination (CSE) performance. This may be due to shortened pre-clerkship curricula, the absence of the Step 1 study period for knowledge consolidation, or exposure to fewer National Board of Medical Examiners-style questions prior to taking CSEs. This multi-institutional study aimed to determine whether student performance on CSEs was affected by moving Step 1 after the core clerkships. Approach: CSE scores for students from eight schools that moved Step 1 after the core clerkships between 2012 and 2016 were analyzed in a pre-post format. Hierarchical linear modeling was used to quantify the effect of the curriculum on CSE performance. Additional analyses determined whether clerkship order affected CSE performance and whether the curricular change increased the number of students scoring in the lowest percentiles (defined as below the national fifth percentile). Findings: After moving Step 1 to after the clerkships, these eight schools collectively demonstrated statistically significant lower performance on four CSEs (Medicine, Neurology, Pediatrics, and Surgery) but not on Obstetrics/Gynecology or Psychiatry. Comparing performance in the three years before and after the Step 1 change, differences across all clerkships ranged from 0.3 to -2.0 points, with an average difference of -1.1 points. CSE performance in clerkships taken early in the sequence was more affected by the curricular change, and differences gradually disappeared with subsequent examinations. Medicine and Neurology showed the largest average differences between curricular groups when taken early in the clinical year. Finally, there was a slightly higher chance of scoring below the national fifth percentile on four of the CSEs (Medicine, Neurology, Pediatrics, and Psychiatry) for the cohort taking Step 1 after the clerkships. Insights: Moving Step 1 after the core clerkships had a small impact on CSE scores overall, with decreased scores on exams early in the clerkship sequence and an increased number of students below the fifth percentile. Score differences have minor effects on clerkship grades, but overall the size of the effect is unlikely to be educationally meaningful. Schools can use a variety of mitigation strategies to address CSE performance and Step 1 preparation in the clerkship phase.


Subject(s)
Clinical Clerkship , Students, Medical , Child , Clinical Competence , Curriculum , Educational Measurement , Humans , Licensure, Medical , United States
8.
Acad Med ; 95(9): 1388-1395, 2020 09.
Article in English | MEDLINE | ID: mdl-32271224

ABSTRACT

PURPOSE: To assess the correlations between United States Medical Licensing Examination (USMLE) performance, American College of Physicians Internal Medicine In-Training Examination (IM-ITE) performance, American Board of Internal Medicine Internal Medicine Certification Exam (IM-CE) performance, and other medical knowledge and demographic variables. METHOD: The study included 9,676 postgraduate year (PGY)-1, 11,424 PGY-2, and 10,239 PGY-3 internal medicine (IM) residents from any Accreditation Council for Graduate Medical Education-accredited IM residency program who took the IM-ITE (2014 or 2015) and the IM-CE (2015-2018). USMLE scores, IM-ITE percent correct scores, and IM-CE scores were analyzed using multiple linear regression, and IM-CE pass/fail status was analyzed using multiple logistic regression, controlling for USMLE Step 1, Step 2 Clinical Knowledge, and Step 3 scores; averaged medical knowledge milestones; age at IM-ITE; gender; and medical school location (United States or Canada vs international). RESULTS: All variables were significant predictors of passing the IM-CE, with IM-ITE scores showing the strongest association and USMLE Step scores being the next strongest predictors. Prediction curves for the probability of passing the IM-CE based solely on IM-ITE score for each PGY show that residents must score higher on the IM-ITE with each subsequent administration to maintain the same estimated probability of passing the IM-CE. CONCLUSIONS: The findings from this study should support residents and program directors in their efforts to more precisely identify and evaluate knowledge gaps for both personal learning and program improvement. While no individual USMLE Step score was as strongly predictive of IM-CE score as the IM-ITE score, the combined relative contribution of all 3 USMLE Step scores was of a magnitude similar to that of the IM-ITE score.


Subject(s)
Educational Measurement/methods , Internal Medicine/education , Internship and Residency , Licensure, Medical , Specialty Boards , Accreditation , Clinical Competence , Humans , United States
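A hedged sketch of the logistic-regression prediction curve described above: first-attempt IM-CE pass/fail modeled on IM-ITE percent correct, USMLE Step scores, and covariates, then the probability of passing traced across IM-ITE scores for a PGY-3 resident with the other predictors held at typical values. Variable names are illustrative, not the article's.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("im_ite_cohort.csv")            # hypothetical resident-level file

logit = smf.logit(
    "passed_ce ~ im_ite_pct + step1 + step2ck + step3 + mk_milestone"
    " + age_at_ite + C(gender) + C(school_location) + C(pgy)",
    data=df,
).fit()

# Prediction curve: probability of passing as a function of IM-ITE score for PGY-3,
# other predictors fixed at their means or most common categories
grid = pd.DataFrame({
    "im_ite_pct": np.linspace(40, 100, 61),
    "step1": df["step1"].mean(),
    "step2ck": df["step2ck"].mean(),
    "step3": df["step3"].mean(),
    "mk_milestone": df["mk_milestone"].mean(),
    "age_at_ite": df["age_at_ite"].mean(),
    "gender": df["gender"].mode()[0],
    "school_location": df["school_location"].mode()[0],
    "pgy": 3,
})
print(grid.assign(p_pass=logit.predict(grid))[["im_ite_pct", "p_pass"]])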
9.
Acad Med ; 95(1): 111-121, 2020 01.
Article in English | MEDLINE | ID: mdl-31365399

ABSTRACT

PURPOSE: To investigate the effect of a change in United States Medical Licensing Examination Step 1 timing on Step 2 Clinical Knowledge (CK) scores, the effect of lag time on Step 2 CK performance, and the relationship of incoming Medical College Admission Test (MCAT) score to Step 2 CK performance before and after the change. METHOD: Data from four schools that moved Step 1 after the core clerkships between academic years 2008-2009 and 2017-2018 were analyzed. Standard t tests were used to examine the change in Step 2 CK scores before and after the move. Tests of differences in proportions were used to evaluate whether Step 2 CK failure rates differed between curricular change groups. Linear regressions were used to examine the relationships between Step 2 CK performance, lag time and incoming MCAT score, and curricular change group. RESULTS: Step 2 CK performance did not change significantly (P = .20). Failure rates remained highly consistent (pre change: 1.83%; post change: 1.79%). The regressions indicated that lag time had a significant effect on Step 2 CK performance, with scores declining as lag time increased, and there were small but significant interaction effects involving MCAT scores. Students with lower incoming MCAT scores tended to perform better on Step 2 CK when Step 1 was taken after the clerkships. CONCLUSIONS: Moving Step 1 after the core clerkships appears to have had no significant impact on Step 2 CK scores or failure rates, supporting the argument that such a change is noninferior to the traditional model. Students with lower MCAT scores benefit most from the change.


Subject(s)
Clinical Clerkship/statistics & numerical data , Clinical Competence/statistics & numerical data , Licensure, Medical/trends , Academic Failure/trends , College Admission Test/statistics & numerical data , Curriculum/standards , Curriculum/trends , Female , Humans , Knowledge , Licensure, Medical/statistics & numerical data , Linear Models , Male , Students, Medical/classification , Students, Medical/statistics & numerical data , United States/epidemiology
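Two of the analyses mentioned above lend themselves to short sketches: a two-proportion test of Step 2 CK failure rates before and after the change, and a linear regression with MCAT and lag-time terms. Counts, file name, and the exact model specification are illustrative assumptions rather than the study's own.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.proportion import proportions_ztest

# Failure-rate comparison (pre vs post change); replace with real counts
failures = [55, 52]                              # hypothetical numbers of failures
attempts = [3000, 2900]                          # hypothetical numbers of examinees
z, p = proportions_ztest(failures, attempts)
print(f"z = {z:.2f}, p = {p:.3f}")

# Step 2 CK score regressed on lag time, MCAT, and curricular group, with an
# MCAT-by-group interaction (the published model may be specified differently)
df = pd.read_csv("step2ck_timing.csv")           # hypothetical student-level file
ols = smf.ols("step2ck ~ lag_months + mcat * C(post_change)", data=df).fit()
print(ols.summary())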
10.
Acad Med ; 94(7): 925-926, 2019 07.
Article in English | MEDLINE | ID: mdl-31241573

Subject(s)
Licensure , United States
11.
Acad Med ; 94(3): 371-377, 2019 03.
Article in English | MEDLINE | ID: mdl-30211755

ABSTRACT

PURPOSE: Schools undergoing curricular reform are reconsidering the optimal timing of Step 1. This study provides a psychometric investigation of how moving United States Medical Licensing Examination Step 1 from after the basic science curriculum to after the core clerkships affects Step 1 scores. METHOD: Data from four schools that recently moved the examination were analyzed in a pre-post format using examinee scores from three years before and after the change. The sample included scores from 2008 through 2016. Several confounders were addressed, including rising national scores and potential differences in cohort ability, using deviation scores and analysis of covariance (ANCOVA) controlling for Medical College Admission Test (MCAT) scores. A resampling procedure compared the study schools' score changes with those of similar schools over the same period. RESULTS: The ANCOVA indicated that post-change Step 1 scores were higher than pre-change scores (adjusted difference = 2.67; 95% confidence interval: 1.50-3.83, P < .001; effect size = 0.14) after adjusting for MCAT scores and rising national averages. The average score increase in the study schools was larger than the changes seen in similar schools. Failure rates also decreased, from 2.87% (n = 48) pre change to 0.39% (n = 6) post change (P < .001). CONCLUSIONS: Results suggest that moving Step 1 after the core clerkships yielded a small increase in scores and a reduction in failure rates. Although these small increases are unlikely to represent meaningful knowledge gains, this demonstration of "noninferiority" may allow schools to implement significant curricular reforms.


Subject(s)
Clinical Clerkship , College Admission Test , Canada , Humans , Licensure, Medical , Psychometrics , United States
12.
Clin J Am Soc Nephrol ; 13(5): 710-717, 2018 05 07.
Article in English | MEDLINE | ID: mdl-29490975

ABSTRACT

BACKGROUND AND OBJECTIVES: Medical specialty and subspecialty fellowship programs administer subject-specific in-training examinations to give fellows preparing for subsequent board certification feedback about their level of medical knowledge. This study evaluated the association between the American Society of Nephrology In-Training Examination and the American Board of Internal Medicine Nephrology Certification Examination in terms of scores and passing status. DESIGN, SETTING, PARTICIPANTS, & MEASUREMENTS: The study included 1,684 nephrology fellows who completed the American Society of Nephrology In-Training Examination in their second year of fellowship training between 2009 and 2014. Regression analysis examined the association between In-Training Examination scores and first-time Nephrology Certification Examination scores, as well as passing status, relative to other standardized assessments. RESULTS: The cohort comprised primarily men (62%) and international medical school graduates (62%), and fellows had an average age of 32 years at the time of first taking the Nephrology Certification Examination. An overwhelming majority (89%) passed the Nephrology Certification Examination on their first attempt. In-Training Examination scores showed the strongest association with first-time Nephrology Certification Examination scores, accounting for approximately 50% of the total explained variance in the model. Each SD increase in In-Training Examination score was associated with a difference of 30 units (95% confidence interval, 27 to 33) in certification performance. In-Training Examination scores were also significantly associated with passing the Nephrology Certification Examination on the first attempt (odds ratio, 3.46 per SD difference in the In-Training Examination; 95% confidence interval, 2.68 to 4.54). An In-Training Examination threshold of 375, approximately 1 SD below the mean, yielded a positive predictive value of 0.92 and a negative predictive value of 0.50. CONCLUSIONS: American Society of Nephrology In-Training Examination performance is significantly associated with American Board of Internal Medicine Nephrology Certification Examination score and passing status.


Subject(s)
Certification , Educational Measurement , Nephrology/education , Adult , Female , Humans , Internal Medicine , Male
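The threshold analysis reported above (positive and negative predictive values at an In-Training Examination cutoff of 375) reduces to a few lines; the arrays below are illustrative placeholders, not the study's data.

import numpy as np

ite_scores = np.array([420, 380, 360, 410, 330, 395, 370, 450])   # hypothetical
passed_cert = np.array([1, 1, 1, 1, 0, 1, 0, 1])                  # hypothetical

predicted_pass = ite_scores >= 375
ppv = passed_cert[predicted_pass].mean()          # P(pass certification | ITE >= 375)
npv = (1 - passed_cert[~predicted_pass]).mean()   # P(fail certification | ITE < 375)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")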
13.
J Vet Med Educ ; 45(3): 381-387, 2018.
Article in English | MEDLINE | ID: mdl-29393767

ABSTRACT

Individuals who want to become licensed veterinarians in North America must complete several qualifying steps, including obtaining a passing score on the North American Veterinary Licensing Examination (NAVLE). Given the high-stakes nature of the NAVLE, it is essential to provide evidence supporting the validity of the reported test scores. One important way to assess validity is to evaluate the degree to which scores are affected by the allotted testing time, which, if inadequate, can hinder examinees from demonstrating their true level of proficiency. We used item response data from the November-December 2014 and April 2015 NAVLE administrations (n = 5,292) to conduct timing analyses comparing performance across several examinee subgroups. Our results provide evidence that the allotted time was sufficient for most examinees, thereby supporting the current time limits. For the relatively few examinees who may have been affected, results suggest the cause is not a bias in the test but rather poor pacing behavior combined with knowledge deficits.


Subject(s)
Educational Measurement , Licensure , Animals , Canada , Education, Veterinary , Humans , Reproducibility of Results , Time Factors , United States
14.
Acad Med ; 93(4): 636-641, 2018 04.
Article in English | MEDLINE | ID: mdl-29028636

ABSTRACT

PURPOSE: Increasing criticism of maintenance of certification (MOC) examinations has prompted certifying boards to explore alternative assessment formats. The purpose of this study was to examine the effect of allowing test takers to access reference material while completing their MOC Part III standardized examination. METHOD: Item response data were obtained from 546 physicians who completed a medical subspecialty MOC examination between 2013 and 2016. To investigate whether accessing references was related to better performance, an analysis of covariance was conducted on the MOC examination scores, with references (access or no access) as the between-groups factor and scores from the physicians' initial certification examination as a covariate. Descriptive analyses investigated how the new option of accessing references influenced time management during the test day. RESULTS: Physicians scored significantly higher when references were allowed (mean = 534.44, standard error = 6.83) than when they were not (mean = 472.75, standard error = 4.87), F(1, 543) = 60.18, P < .001, ω² = 0.09. However, accessing references affected pacing behavior; physicians were 13.47 times more likely to finish with less than a minute of test time remaining per section when reference material was accessible. CONCLUSIONS: Permitting references was associated with higher performance but also with tighter pacing, raising questions about whether the time limits remain sufficient. Implications for allowing references are discussed, including physician time management, impact on the construct assessed by the test, and the importance of providing validity evidence for all test design decisions.


Subject(s)
Attitude of Health Personnel , Physicians , Specialty Boards , Analysis of Variance , Certification , Clinical Competence , Education, Medical, Continuing , Humans , Time Factors , United States
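A hedged sketch of the analysis of covariance described above: MOC examination score modeled on reference access (the between-groups factor) with initial certification score as the covariate. The file and column names are assumptions for illustration.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("moc_reference_study.csv")      # hypothetical examinee-level file
model = smf.ols("moc_score ~ C(reference_access) + initial_cert_score", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))           # ANCOVA table: group effect adjusted for
                                                 # prior certification performance
print(model.params)                              # adjusted group difference in MOC score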
15.
Educ Psychol Meas ; 75(4): 610-633, 2015 Aug.
Article in English | MEDLINE | ID: mdl-29795835

ABSTRACT

In educational testing, differential item functioning (DIF) statistics must be accurately estimated to ensure that the appropriate items are flagged for inspection or removal. This study showed how using the Rasch model to estimate DIF may introduce considerable bias in the results when there are large group differences in ability (impact) and the data follow a three-parameter logistic model. With large group ability differences, difficult non-DIF items appeared to favor the focal group and easy non-DIF items appeared to favor the reference group. Correspondingly, the effect sizes for DIF items were biased. These effects were mitigated when data were coded as missing for item-examinee encounters in which the person measure was considerably lower than the item location. The results are explained by illustrating how the item response function becomes differentially distorted by guessing depending on the groups' ability distributions. In terms of practical implications, the results suggest that measurement practitioners should not trust DIF estimates from the Rasch model when there is a large difference in ability and examinees can potentially answer items correctly by guessing, unless data from examinees poorly matched to the item difficulty are coded as missing.
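The mechanism described above can be illustrated with a toy calculation: an item with guessing (3PL) and no true DIF looks easier for a lower-ability group when its difficulty is inferred Rasch-style from each group's success rate. The parameter values are arbitrary, and the "implied difficulty" is a crude back-of-envelope proxy for a full Rasch calibration, which would show the same direction of distortion.

import numpy as np

def p_3pl(theta, a, b, c):
    # Three-parameter logistic item response function
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

rng = np.random.default_rng(0)
theta_ref = rng.normal(0.0, 1.0, 50_000)         # reference group ability
theta_foc = rng.normal(-1.0, 1.0, 50_000)        # focal group, 1 SD lower (impact)

a, b, c = 1.0, 2.0, 0.2                          # one hard item, 20% guessing, no DIF
p_ref = p_3pl(theta_ref, a, b, c).mean()
p_foc = p_3pl(theta_foc, a, b, c).mean()

# Rasch-style difficulty implied by each group's observed success rate: guessing props up
# the focal group's performance on this hard item, so it appears spuriously easier for them
b_ref = theta_ref.mean() - np.log(p_ref / (1 - p_ref))
b_foc = theta_foc.mean() - np.log(p_foc / (1 - p_foc))
print(f"implied difficulty: reference {b_ref:.2f}, focal {b_foc:.2f}")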
