1.
Teach Learn Med; 28(2): 166-73, 2016.
Article in English | MEDLINE | ID: mdl-26849247

ABSTRACT

CONSTRUCT: Automatic item generation (AIG) is an alternative method for producing large numbers of test items; it integrates cognitive modeling with computer technology to systematically generate multiple-choice questions (MCQs). The purpose of our study is to describe and validate a method for generating plausible but incorrect distractors. Initial applications of AIG demonstrated its effectiveness in producing test items. However, expert review of the initial items identified a key limitation: the generation of implausible incorrect options, or distractors, might limit the applicability of the items in real testing situations.
BACKGROUND: Medical educators require test items in large quantities to support the continual assessment of student knowledge. Traditional item development processes are time-consuming and resource intensive. Studies have validated the quality of generated items through content-expert review. However, no study has yet documented how generated items perform in a test administration, and no study has yet validated AIG through student responses to generated test items.
APPROACH: To validate our refined AIG method for generating plausible distractors, we collected psychometric evidence from a field test of the generated test items. A three-step process was used to generate test items in the area of jaundice. At least 455 Canadian and international medical graduates responded to each of the 13 generated items embedded in a high-stakes exam administration. Item difficulty, discrimination, and index-of-discrimination estimates were calculated for the correct option as well as for each distractor.
RESULTS: Item analysis results for the correct options suggest that the generated items measured candidate performance across a range of ability levels while providing a consistent level of discrimination for each item. Results for the distractors reveal that the generated items differentiated low- from high-performing candidates.
CONCLUSIONS: Previous research on AIG highlighted how this item development method can be used to produce high-quality stems and correct options for MCQ exams. The purpose of the current study was to describe, illustrate, and evaluate a method for modeling plausible but incorrect options. The evidence provided in this study demonstrates that AIG can produce psychometrically sound test items. More important, by adapting the distractors to match the unique features presented in the stem and correct option, the automated generation of MCQs has the potential to produce plausible distractors and yield large numbers of high-quality items for medical education.
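The item analysis described above computes difficulty and discrimination estimates for the correct option and for each distractor. The abstract does not specify the exact formulas used, so the sketch below applies the common classical definitions (proportion choosing an option, and the point-biserial correlation between option choice and the rest score); all names and data structures are illustrative assumptions, not the study's actual analysis code.

```python
# Minimal sketch of option-level item analysis: difficulty and point-biserial
# discrimination for the keyed option and each distractor. The statistics used in
# the study are not specified here; these are the standard classical definitions,
# and all variable names are illustrative.
import numpy as np

def option_analysis(responses: np.ndarray, key: np.ndarray):
    """responses: (n_candidates, n_items) array of chosen option labels (e.g. 'A'..'E').
    key: (n_items,) array holding the correct option label for each item."""
    scored = (responses == key).astype(float)   # 1 = correct, 0 = incorrect
    total = scored.sum(axis=1)                  # candidate total scores
    stats = []
    for j in range(responses.shape[1]):
        rest = total - scored[:, j]             # total score excluding item j
        item_stats = {"difficulty": scored[:, j].mean()}
        for opt in np.unique(responses[:, j]):
            chose = (responses[:, j] == opt).astype(float)
            # Point-biserial: correlation between choosing this option and rest score.
            r = np.corrcoef(chose, rest)[0, 1] if chose.std() > 0 else np.nan
            item_stats[f"option_{opt}_pbis"] = r
        stats.append(item_stats)
    return stats
```

Under these definitions, a well-functioning correct option shows a positive point-biserial, while a plausible distractor shows a negative one, reflecting the separation of low- from high-performing candidates that the abstract reports.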


Subject(s)
Computer-Assisted Instruction/methods; Education, Medical, Undergraduate/methods; Educational Measurement/methods; Quality Improvement; Automation; Humans; Jaundice/diagnosis; Jaundice/therapy; Models, Educational; Psychometrics
2.
Article in English | MEDLINE | ID: mdl-26883811

ABSTRACT

PURPOSE: The aim of this research was to compare different methods of calibrating the multiple-choice question (MCQ) and clinical decision making (CDM) components of the Medical Council of Canada's Qualifying Examination Part I (MCCQEI) based on item response theory (IRT).
METHODS: Our data consisted of test results from 8,213 first-time applicants to the MCCQEI in the spring and fall 2010 and 2011 test administrations. The data set contained several thousand multiple-choice items and several hundred CDM cases. Four dichotomous calibrations were run using BILOG-MG 3.0. All three mixed-item-format calibrations (dichotomous MCQ responses and polytomous CDM case scores) were conducted using PARSCALE 4.
RESULTS: The 2-PL model had identical numbers of items with chi-square values at or below a Type I error rate of 0.01 (83/3,499, or 0.02). In all three polytomous models, whether the MCQs were anchored or run concurrently with the CDM cases, the results suggest very poor fit. All IRT abilities estimated from the dichotomous calibration designs correlated very highly with each other. IRT-based pass-fail rates were extremely similar, not only across calibration designs and methods but also with the actual decisions reported to candidates. The largest difference in pass rates was 4.78%, which occurred between the mixed-format concurrent 2-PL graded response model (pass rate = 80.43%) and the dichotomous anchored 1-PL calibration (pass rate = 85.21%).
CONCLUSION: Simpler calibration designs with dichotomized items should be implemented, as the dichotomous calibrations provided a better fit to the item response matrix than the more complex polytomous calibrations.
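The comparison in the results rests on correlating IRT ability estimates across calibration designs and comparing pass-fail rates at the examination cut score. The sketch below illustrates that kind of comparison on simulated abilities; the two theta vectors, their relationship, and the cut score are illustrative assumptions, not the study's data or its calibration software.

```python
# Hedged sketch: comparing IRT ability estimates and pass-fail decisions from two
# calibration designs. The arrays and cut score below are illustrative stand-ins,
# not the study's actual estimates or passing standard.
import numpy as np

rng = np.random.default_rng(0)
theta_2pl = rng.normal(0.0, 1.0, size=8213)                # abilities from a 2-PL calibration
theta_1pl = 0.95 * theta_2pl + rng.normal(0, 0.2, 8213)    # abilities from a 1-PL calibration

cut_score = -0.85                                          # illustrative pass-fail cut on theta

corr = np.corrcoef(theta_2pl, theta_1pl)[0, 1]
pass_rate_2pl = np.mean(theta_2pl >= cut_score)
pass_rate_1pl = np.mean(theta_1pl >= cut_score)
agreement = np.mean((theta_2pl >= cut_score) == (theta_1pl >= cut_score))

print(f"correlation of abilities: {corr:.3f}")
print(f"pass rates: {pass_rate_2pl:.2%} vs {pass_rate_1pl:.2%} "
      f"(difference {abs(pass_rate_2pl - pass_rate_1pl):.2%})")
print(f"pass-fail decision agreement: {agreement:.2%}")
```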


Subject(s)
Educational Measurement/standards; Licensure, Medical/standards; Calibration; Canada; Choice Behavior; Humans; Models, Theoretical
3.
Eval Health Prof; 39(1): 100-13, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26377072

ABSTRACT

We present a framework for technology-enhanced scoring of bilingual clinical decision-making (CDM) questions using an open-source scoring technology and evaluate the strength of the proposed framework using operational data from the Medical Council of Canada Qualifying Examination. Candidates' responses to six write-in CDM questions were used to develop a three-stage automated scoring framework. In Stage 1, linguistic features were extracted from the CDM responses. In Stage 2, supervised machine learning techniques were employed to develop the scoring models. In Stage 3, responses to six English and French CDM questions were scored using the scoring models from Stage 2. Of the 8,007 English and French CDM responses, 7,643 were accurately scored, an agreement rate of 95.4% between human and computer scoring. This represents a 5.4% improvement over the agreement rate between human raters. Our framework yielded scores similar to those of expert physician markers and could be used for clinical competency assessment.
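The three stages above map naturally onto a feature-extraction-plus-classifier pipeline. The abstract does not name the specific linguistic features or learning algorithm, so the sketch below substitutes TF-IDF features and logistic regression from scikit-learn as stand-ins; the training responses and scores are invented purely for illustration.

```python
# Hedged sketch of the three-stage scoring pipeline described in the abstract.
# The actual linguistic features and learning algorithm used in the study are not
# specified here; TF-IDF features and logistic regression are stand-ins, and the
# tiny data set is purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stages 1 and 2: extract linguistic features from human-scored responses and fit a model.
train_responses = [
    "order serum bilirubin and liver enzymes",
    "reassure the patient and discharge",
    "request abdominal ultrasound and bilirubin",
    "prescribe antibiotics without investigation",
]
train_scores = [1, 0, 1, 0]          # human-assigned scores (1 = credit, 0 = no credit)

scoring_model = Pipeline([
    ("features", TfidfVectorizer(ngram_range=(1, 2))),
    ("classifier", LogisticRegression(max_iter=1000)),
])
scoring_model.fit(train_responses, train_scores)

# Stage 3: score new, unseen write-in responses with the trained model.
new_responses = ["check bilirubin level and liver function tests"]
print(scoring_model.predict(new_responses))
```

In an operational setting the predicted scores would then be compared against human ratings, as the abstract's 95.4% agreement figure illustrates.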


Subject(s)
Clinical Competence; Educational Measurement/methods; Educational Measurement/standards; Electronic Data Processing/standards; Translating; Canada; Clinical Decision-Making; Humans; Licensure, Medical; Reproducibility of Results
4.
Med Educ; 48(10): 950-62, 2014 Oct.
Article in English | MEDLINE | ID: mdl-25200016

ABSTRACT

CONTEXT: Constructed-response tasks, which range from short-answer tests to essay questions, are included in assessments of medical knowledge because they allow educators to measure students' ability to think, reason, solve complex problems, communicate and collaborate through their use of writing. However, constructed-response tasks are also costly to administer and challenging to score because they rely on human raters. One alternative to the manual scoring process is to integrate computer technology with writing assessment. The process of scoring written responses using computer programs is known as 'automated essay scoring' (AES).
METHODS: An AES system builds a scoring model by extracting linguistic features from responses to a constructed-response prompt that have been pre-scored by human raters and then, using machine learning algorithms, maps those linguistic features to the human scores so that the computer can classify (i.e. score or grade) the responses of a new group of students. The accuracy of the score classification can be evaluated using different measures of agreement.
RESULTS: Automated essay scoring provides a method for scoring constructed-response tests that complements the current use of selected-response testing in medical education. The method can serve medical educators by providing the summative scores required for high-stakes testing. It can also serve medical students by providing them with detailed feedback as part of a formative assessment process.
CONCLUSIONS: Automated essay scoring systems yield scores that consistently agree with those of human raters at a level as high as, if not higher than, the agreement among human raters themselves. The approach offers medical educators many benefits for scoring constructed-response tasks, such as improving the consistency of scoring, reducing the time required for scoring and reporting, minimising the costs of scoring, and providing students with immediate feedback on constructed-response tasks.
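The METHODS section notes that score classification accuracy is evaluated with measures of agreement. The abstract does not name particular statistics; exact agreement, adjacent agreement, and quadratically weighted kappa are common choices, and the sketch below computes them on invented score vectors.

```python
# Hedged sketch of evaluating machine-human score agreement for an AES system.
# The abstract does not name particular agreement statistics; exact agreement,
# adjacent agreement, and quadratically weighted kappa are common choices.
# The score vectors are invented for illustration.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human_scores   = np.array([3, 2, 4, 1, 3, 2, 4, 3, 1, 2])   # ratings on a 0-4 rubric
machine_scores = np.array([3, 2, 3, 1, 3, 2, 4, 4, 1, 2])   # AES-assigned scores

exact_agreement = np.mean(human_scores == machine_scores)
adjacent_agreement = np.mean(np.abs(human_scores - machine_scores) <= 1)
qwk = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")

print(f"exact agreement:          {exact_agreement:.2%}")
print(f"adjacent agreement:       {adjacent_agreement:.2%}")
print(f"quadratic weighted kappa: {qwk:.3f}")
```

Comparing these statistics against the corresponding human-human values is what supports the conclusion that AES agreement is as high as, if not higher than, agreement among human raters.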


Subject(s)
Computer-Assisted Instruction/trends; Education, Medical/methods; Education, Medical/trends; Educational Measurement/methods; Software; Clinical Competence; Humans; Writing
5.
Eval Health Prof; 33(1): 96-108, 2010 Mar.
Article in English | MEDLINE | ID: mdl-20042416

ABSTRACT

Item disclosure is one of the most serious threats to the validity of high-stakes examinations, and identifying examinees who may have had unauthorized access to material is an important step in ensuring the integrity of an examination. A procedure was developed to identify examinees who potentially had unauthorized prior access to examination content. A standardized difference score is created by comparing an examinee's ability estimate on potentially exposed items with the ability estimate on unexposed items; outliers in this distribution are then flagged for further review. The steps of the procedure are described, followed by an example of its application. In addition, the use of the procedure is supported by the results of a simulation that models unauthorized access to examination material.
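The procedure compares two ability estimates per examinee and flags outlying differences. The exact standardization and flagging threshold are not given in the abstract, so the sketch below divides the difference by the combined standard errors and flags standardized values above 3; these choices, and all variable names, are assumptions for illustration only.

```python
# Hedged sketch of flagging examinees whose ability estimate on potentially
# exposed items is unusually higher than on unexposed items. The standardization
# and cutoff used in the study are not given here; dividing by the combined
# standard errors and flagging values above 3 are illustrative assumptions.
import numpy as np

def flag_examinees(theta_exposed, se_exposed, theta_unexposed, se_unexposed, cutoff=3.0):
    """Return indices of examinees whose standardized difference exceeds the cutoff."""
    theta_exposed = np.asarray(theta_exposed, dtype=float)
    theta_unexposed = np.asarray(theta_unexposed, dtype=float)
    se_exposed = np.asarray(se_exposed, dtype=float)
    se_unexposed = np.asarray(se_unexposed, dtype=float)

    # Standardized difference between the two ability estimates per examinee.
    z = (theta_exposed - theta_unexposed) / np.sqrt(se_exposed**2 + se_unexposed**2)

    # Flag only unusually *higher* performance on the potentially exposed items.
    return np.where(z > cutoff)[0], z

flagged, z_scores = flag_examinees(
    theta_exposed=[0.2, 1.9, -0.3], se_exposed=[0.3, 0.3, 0.3],
    theta_unexposed=[0.1, 0.2, -0.2], se_unexposed=[0.3, 0.3, 0.3],
)
print(flagged, np.round(z_scores, 2))   # only the second examinee exceeds the cutoff
```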


Subject(s)
Clinical Competence/standards; Educational Measurement/standards; Health Occupations/ethics; Specialty Boards/standards; Analysis of Variance; Canada; Clinical Competence/statistics & numerical data; Deception; Educational Measurement/statistics & numerical data; Educational Status; Feasibility Studies; Health Occupations/education; Humans; Monte Carlo Method; Psychometrics; Regression Analysis; Specialty Boards/statistics & numerical data