ABSTRACT
A multiscale criterion-referenced test featuring two presumably equivalent forms (A and B) was administered to 1,667 Head Start children at each of four points over an academic year. Using a randomly equivalent groups design, three equating methods were applied: common-item IRT equating with concurrent calibration, linear transformation, and equipercentile transformation. The methods were compared by examining mean score differences, weighted mean squared differences, and Kolmogorov's D statistic for each subscale. The results indicated that, over time, the IRT equating method and the conventional equating methods exhibited different patterns of discrepancy between the two test forms. IRT equating yielded marginally smaller form-to-form mean score differences and slightly fewer distributional discrepancies between Forms A and B than either linear or equipercentile equating. However, the results were mixed, indicating that further studies are needed to clarify the relative merits and weaknesses of each approach.
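The two conventional methods and the comparison statistic named above can be illustrated with a minimal sketch. This is not the study's actual procedure: the function names, the simulated score distributions, and the unsmoothed equipercentile mapping are assumptions for illustration only.

```python
import numpy as np

def linear_equate(scores_b, mean_a, sd_a):
    """Linear equating: map Form B scores onto Form A's scale by
    matching means and standard deviations (random-groups design)."""
    return mean_a + sd_a * (scores_b - scores_b.mean()) / scores_b.std()

def equipercentile_equate(scores_b, scores_a):
    """Equipercentile equating: map each Form B score to the Form A
    score with the same percentile rank (no smoothing, illustrative)."""
    ranks = np.array([np.mean(scores_b <= x) for x in scores_b])
    return np.quantile(scores_a, ranks)

def kolmogorov_d(x, y):
    """Kolmogorov's D: maximum absolute difference between two
    empirical CDFs, used to compare the equated distributions."""
    grid = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.abs(cdf_x - cdf_y).max()

rng = np.random.default_rng(0)
form_a = rng.normal(50, 10, 800)   # simulated Form A scores (hypothetical)
form_b = rng.normal(48, 11, 800)   # simulated Form B scores (hypothetical)

b_lin = linear_equate(form_b, form_a.mean(), form_a.std())
b_eqp = equipercentile_equate(form_b, form_a)
print(kolmogorov_d(form_a, b_lin), kolmogorov_d(form_a, b_eqp))
```

By construction, linear equating matches only the first two moments of the Form A distribution, while equipercentile equating matches the whole distribution, which is why D is typically smaller for the latter.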
Subjects
Educational Measurement/statistics & numerical data, Educational Measurement/standards, Child, Preschool, Cohort Studies, Early Intervention, Educational/methods, Early Intervention, Educational/standards, Early Intervention, Educational/statistics & numerical data, Educational Measurement/methods, Female, Humans, Longitudinal Studies, Male, Philadelphia

ABSTRACT
Educators need accurate assessments of preschool cognitive growth to guide curriculum design, evaluation, and timely modification of their instructional programs, but available tests do not provide content breadth or growth sensitivity over brief intervals. This article details evidence for a multiform, multiscale test criterion-referenced to national standards for alphabet knowledge, vocabulary, listening comprehension, and mathematics, developed in field trials with 3,433 Head Start children aged 3 to 5½ years. The test enables repeated assessments (20-30 min per time point) over a school year. Each subscale is calibrated to yield scaled scores based on item response theory and Bayesian estimation of ability. Multilevel modeling shows that nearly all score variation is associated with child performance rather than examiner performance, and individual growth-curve modeling demonstrates the high sensitivity of the scores to child growth, controlling for age, sex, prior schooling, and language and special-needs status.
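The Bayesian ability estimation mentioned above can be sketched as an expected a posteriori (EAP) estimate under a simple Rasch model. This is a hedged illustration, not the test's actual scoring algorithm: the standard-normal prior, the Rasch form, the quadrature grid, and the item difficulties below are all assumptions.

```python
import numpy as np

def eap_ability(responses, difficulties, n_quad=61):
    """EAP ability estimate under a Rasch model with an assumed N(0,1)
    prior, computed by numerical quadrature over a theta grid.
    responses: 0/1 item scores; difficulties: calibrated item parameters."""
    theta = np.linspace(-4, 4, n_quad)                 # quadrature points
    prior = np.exp(-0.5 * theta**2)                    # N(0,1) prior (unnormalized)
    # Rasch probability of a correct response at each quadrature point
    p = 1 / (1 + np.exp(-(theta[:, None] - difficulties[None, :])))
    like = np.prod(np.where(responses[None, :] == 1, p, 1 - p), axis=1)
    post = prior * like                                # unnormalized posterior
    return np.sum(theta * post) / np.sum(post)         # posterior mean

# Hypothetical calibrated difficulties and one child's responses
b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
resp = np.array([1, 1, 1, 0, 0])
print(eap_ability(resp, b))
```

A design property worth noting: because EAP is a posterior mean, estimates are pulled toward the prior mean, which keeps scaled scores finite even for all-correct or all-incorrect response patterns.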