Results 1 - 18 of 18
1.
Behav Res Methods ; 56(3): 2260-2272, 2024 Mar.
Article in English | MEDLINE | ID: mdl-37341912

ABSTRACT

Surveys often add reverse-coded questions to flag respondents who give insufficient effort responses (IERs), but they often wrongly assume that all respondents answer all questions with full effort. By contrast, this study expanded the mixture model for IERs and ran a simulation via LatentGOLD to show the harmful consequences of ignoring IERs to positively and negatively worded questions: lower test reliability and biased, less accurate slope and intercept parameters. We demonstrated the model's practical application with two public data sets: Machiavellianism (five-point scale) and self-reported depression (four-point scale).
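As a rough, self-contained illustration of the attenuation described above (not the authors' LatentGOLD mixture model; the sample size, loadings, and the 15% careless rate are all invented), the following numpy sketch mixes a latent class of random responders into a mixed-worded Likert scale and shows Cronbach's alpha dropping when IERs are ignored:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, p_ier = 1000, 10, 0.15          # respondents, items, careless rate
neg = np.arange(k) % 2 == 1           # every other item negatively worded
theta = rng.normal(size=n)            # trait scores

# Attentive responding: 5-point Likert driven by theta,
# with negatively worded items keyed in reverse.
signed = np.where(neg, -theta[:, None], theta[:, None])
attentive = np.clip(np.round(3 + 1.2 * signed + rng.normal(0, 1, (n, k))), 1, 5)

# IER class: uniform random responses regardless of wording.
careless = rng.integers(1, 6, (n, k)).astype(float)
is_ier = rng.random(n) < p_ier
data = np.where(is_ier[:, None], careless, attentive)

def cronbach_alpha(x):
    """Cronbach's alpha for an items-in-columns matrix."""
    m = x.shape[1]
    return m / (m - 1) * (1 - x.var(0, ddof=1).sum() / x.sum(1).var(ddof=1))

recoded = np.where(neg, 6 - data, data)          # reverse-code NW items
clean = np.where(neg, 6 - attentive, attentive)  # no-IER benchmark
print(f"alpha, attentive only : {cronbach_alpha(clean):.3f}")
print(f"alpha, IER ignored    : {cronbach_alpha(recoded):.3f}")  # attenuated
```

In the full mixture model, class membership is estimated jointly with the item parameters rather than being known in advance, as it is in this toy setup.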


Subjects
Reproducibility of Results, Humans, Surveys and Questionnaires, Self Report, Bias
2.
Behav Res Methods ; 2023 Nov 02.
Article in English | MEDLINE | ID: mdl-37919615

ABSTRACT

Performance assessments increasingly utilize onscreen or internet-based technology to collect human ratings. One of the benefits of onscreen ratings is the automatic recording of rating times along with the ratings. Considering rating times as an additional data source can provide a more detailed picture of the rating process and improve the psychometric quality of the assessment outcomes. However, currently available models for analyzing performance assessments do not incorporate rating times. The present research aims to fill this gap and advance a joint modeling approach, the "hierarchical facets model for ratings and rating times" (HFM-RT). The model includes two examinee parameters (ability and time intensity) and three rater parameters (severity, centrality, and speed). The HFM-RT successfully recovered examinee and rater parameters in a simulation study and yielded superior reliability indices. A real-data analysis of English essay ratings collected in a high-stakes assessment context revealed that raters exhibited considerably different speed measures, spent more time on high-quality than on low-quality essays, and tended to rate essays faster with increasing severity. However, due to the significant heterogeneity of examinees' writing proficiency, the improvement in the assessment's reliability under the HFM-RT was not salient in the real-data example. The discussion focuses on the advantages of accounting for rating times as a source of information in rating quality studies and highlights perspectives from the HFM-RT for future research on rater cognition.
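The time side of such a joint model is commonly a lognormal layer in which log rating times decompose into the essay's time intensity minus the rater's speed; the sketch below uses that standard van der Linden-style formulation with invented parameter values, as an assumption about how the HFM-RT's time component might look rather than its exact specification:

```python
import numpy as np

rng = np.random.default_rng(7)
n_examinees, n_raters = 500, 8

# Lognormal response-time layer:
# log T_ij = tau_i (essay time intensity) - phi_j (rater speed) + noise.
tau = rng.normal(4.0, 0.3, n_examinees)   # time intensity (log seconds)
phi = rng.normal(0.0, 0.4, n_raters)      # rater speed (higher = faster)
sigma = 0.25                              # residual SD on the log scale

log_t = tau[:, None] - phi[None, :] + rng.normal(0, sigma, (n_examinees, n_raters))
print("median rating time (s):", np.median(np.exp(log_t)).round(1))

# Crude moment-based recovery: the overall mean minus each column mean of
# log T estimates the centered rater speeds; a joint model does this via MCMC.
phi_hat = log_t.mean() - log_t.mean(axis=0)
print("speed recovery r:", np.corrcoef(phi, phi_hat)[0, 1].round(3))
```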

3.
Appl Psychol Meas ; 47(3): 221-236, 2023 May.
Article in English | MEDLINE | ID: mdl-37113521

ABSTRACT

A variety of approaches have been presented for assessing desirable responding in self-report measures. Among them, the overclaiming technique asks respondents to rate their familiarity with a large set of real and nonexistent items (foils). The application of signal detection formulas to the endorsement rates of real items and foils yields indices of (a) knowledge accuracy and (b) knowledge bias. This overclaiming technique reflects both cognitive ability and personality. Here, we develop an alternative measurement model based on multidimensional item response theory (MIRT). We report three studies demonstrating this new model's capacity to analyze overclaiming data. First, a simulation study illustrates that MIRT and signal detection theory yield comparable indices of accuracy and bias, although MIRT provides important additional information. Two empirical examples, one based on mathematical terms and one based on Chinese idioms, are then elaborated. Together, they demonstrate the utility of this new approach for group comparisons and item selection. The implications of this research are illustrated and discussed.
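The signal detection indices mentioned above are typically computed as d' (knowledge accuracy) and c (knowledge bias) from the endorsement rates of real items and foils; a minimal sketch with an ad hoc continuity correction and made-up counts:

```python
import numpy as np
from scipy.stats import norm

def overclaiming_indices(real_claims, foil_claims, n_real, n_foils):
    """Signal-detection accuracy (d') and bias (c) from claim counts.

    real_claims: real items claimed as familiar, out of n_real
    foil_claims: nonexistent items (foils) claimed, out of n_foils
    Rates are shrunk away from 0/1 so the z-transform stays finite.
    """
    hit = (real_claims + 0.5) / (n_real + 1)
    fa = (foil_claims + 0.5) / (n_foils + 1)
    d_prime = norm.ppf(hit) - norm.ppf(fa)        # knowledge accuracy
    c = -0.5 * (norm.ppf(hit) + norm.ppf(fa))     # knowledge bias
    return d_prime, c

# A knowledgeable but boastful respondent: claims 13/15 real items, 4/5 foils;
# the strongly negative c reflects liberal (overclaiming) responding.
print(overclaiming_indices(13, 4, 15, 5))
```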

4.
Appl Psychol Meas ; 47(1): 19-33, 2023 Jan.
Article in English | MEDLINE | ID: mdl-36425284

ABSTRACT

In traditional test models, test items are independent, and test-takers slowly and thoughtfully respond to each test item. However, some test items have a common stimulus (dependent test items in a testlet), and sometimes test-takers lack motivation, knowledge, or time (speededness), so they perform rapid guessing (RG). Ignoring the dependence in responses to testlet items can negatively bias standard errors of measurement, and ignoring RG by fitting a simpler item response theory (IRT) model can bias the results. Because computer-based testing captures response times for testlet items, we propose a mixture testlet IRT model with item responses and response times to model RG behaviors in computer-based testlet items. Two simulation studies with Markov chain Monte Carlo estimation using the JAGS program showed (a) good recovery of the item and person parameters in this new model and (b) the harmful consequences of ignoring RG (biased parameter estimates: overestimated item difficulties, underestimated time intensities, underestimated respondent latent speed parameters, and overestimated precision of respondent latent estimates). The application of IRT models with and without RG to data from a computer-based language test showed parameter differences resembling those in the simulations.
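A minimal sketch of why ignoring RG distorts item statistics, using a crude fixed response-time threshold to flag rapid guesses (the 3-second cutoff, the .25 chance level, and all other settings are assumptions, and no testlet dependence is simulated here):

```python
import numpy as np

rng = np.random.default_rng(42)
n, items, p_rg = 2000, 20, 0.10
theta = rng.normal(size=n)
b = rng.normal(size=items)               # item difficulties

solution = rng.random((n, items)) < 1 / (1 + np.exp(-(theta[:, None] - b)))
guess = rng.random((n, items)) < 0.25    # chance level on 4-option items
rg = rng.random((n, items)) < p_rg       # which responses are rapid guesses
resp = np.where(rg, guess, solution)

# Rapid guesses are also fast; flag them with a (hedged) 3-second threshold.
rt = np.where(rg, rng.uniform(0.5, 3.0, (n, items)),
                  rng.lognormal(3.0, 0.5, (n, items)))
flagged = rt < 3.0

# Naive proportion correct (RG ignored) vs after excluding flagged cells:
# mixing in chance responses drags easy items toward 0.25, so they look harder.
p_naive = resp.mean(axis=0)
p_clean = np.nanmean(np.where(flagged, np.nan, resp), axis=0)
print(np.round(p_naive - p_clean, 3))
```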

5.
Educ Psychol Meas ; 82(4): 757-781, 2022 Aug.
Article in English | MEDLINE | ID: mdl-35754620

ABSTRACT

Performance assessments heavily rely on human ratings. These ratings are typically subject to various forms of error and bias, threatening the assessment outcomes' validity and fairness. Differential rater functioning (DRF) is a special kind of threat to fairness manifesting itself in unwanted interactions between raters and performance- or construct-irrelevant factors (e.g., examinee gender, rater experience, or time of rating). Most DRF studies have focused on whether raters show differential severity toward known groups of examinees. This study expands the DRF framework and investigates the more complex case of dual DRF effects, where DRF is simultaneously present in rater severity and centrality. Adopting a facets modeling approach, we propose the dual DRF model (DDRFM) for detecting and measuring these effects. In two simulation studies, we found that dual DRF effects (a) negatively affected measurement quality and (b) can reliably be detected and compensated under the DDRFM. Using sample data from a large-scale writing assessment (N = 1,323), we demonstrate the practical measurement consequences of the dual DRF effects. Findings have implications for researchers and practitioners assessing the psychometric quality of ratings.
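One way to operationalize the two rater effects is a rating-scale facets formulation in which severity shifts the location and a centrality multiplier stretches or compresses the category thresholds; the parameterization below is our assumption for illustration, not necessarily the DDRFM's:

```python
import numpy as np

rng = np.random.default_rng(3)
cats = 5
tau = np.array([-1.5, -0.5, 0.5, 1.5])   # ordered category thresholds

def rating_probs(theta, severity, centrality):
    """Category probabilities for a rating-scale facets model where
    centrality > 1 spreads thresholds (more middle ratings), < 1 compresses."""
    eta = theta - severity - centrality * tau          # step attractiveness
    logits = np.concatenate(([0.0], np.cumsum(eta)))   # cumulative step sums
    p = np.exp(logits - logits.max())
    return p / p.sum()

theta = rng.normal(size=300)
for sev, cen in [(0.5, 1.8), (-0.5, 0.6)]:             # two rater profiles
    ratings = [rng.choice(cats, p=rating_probs(t, sev, cen)) for t in theta]
    print(f"severity={sev:+.1f} centrality={cen:.1f} ->",
          np.bincount(ratings, minlength=cats))
```

Dual DRF would make sev and cen depend on a ratee attribute; detection then asks whether those rater-by-group interactions are nonzero.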

6.
Behav Res Methods ; 54(6): 2750-2764, 2022 12.
Article in English | MEDLINE | ID: mdl-35018607

ABSTRACT

A rater's overall impression of a ratee's essay (or other assessment) can influence ratings on multiple criteria to yield excessively similar ratings (halo effect). However, existing analytic methods fail to identify whether similar ratings stem from homogeneous criteria (true halo) or rater bias (illusory halo). Hence, we introduce and test a mixture Rasch facets model for halo effects (MRFM-H) that distinguishes true from illusory halo effects in order to classify raters as normal or halo raters. In a simulation study, when raters assessed enough ratees, the MRFM-H accurately identified halo raters. Also, more rating criteria increased classification accuracy. A simpler model that ignored halo effects biased the parameters for evaluation criteria and for rater severity, but not those for ratee assessments. Application of the MRFM-H to three empirical datasets showed that (a) experienced raters were subject to illusory halo effects, (b) illusory halo effects were less likely with greater numbers of criteria, and (c) more informative survey responses were more distinguishable from less informative responses.
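The true-versus-illusory distinction can be illustrated without the full mixture model: below, a hypothetical "halo rater" blends criterion-specific quality with a single overall impression, inflating inter-criterion correlations beyond the true overlap of the criteria (all weights invented):

```python
import numpy as np

rng = np.random.default_rng(11)
n, k = 1000, 4                           # ratees, rating criteria
rho = 0.4                                # true halo: modest criterion overlap
cov = np.full((k, k), rho) + (1 - rho) * np.eye(k)
quality = rng.multivariate_normal(np.zeros(k), cov, size=n)

# Normal rater: rates each criterion from its own quality (plus noise).
normal = quality + rng.normal(0, 0.5, (n, k))

# Halo rater: substitutes a global impression for criterion-specific quality.
w = 0.8                                  # weight on the global impression
halo = (w * quality.mean(1, keepdims=True) + (1 - w) * quality
        + rng.normal(0, 0.5, (n, k)))

off = ~np.eye(k, dtype=bool)             # off-diagonal correlations only
print("mean r, normal rater:", np.corrcoef(normal.T)[off].mean().round(2))
print("mean r, halo rater  :", np.corrcoef(halo.T)[off].mean().round(2))
```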


Subjects
Effect Modifier, Epidemiologic, Humans
7.
Multivariate Behav Res ; 57(2-3): 208-222, 2022.
Article in English | MEDLINE | ID: mdl-33001710

ABSTRACT

A combination of positively and negatively worded items (termed a mixed-format design) has been widely adopted in personality and attitude assessments. While advocates claim that the inclusion of positively and negatively worded items will encourage respondents to process the items more carefully and avoid response preference, others have reported that negatively worded (NW) items may induce a nuisance factor and contaminate scale scores. The present study examined the extent of the impact of the NW-item feature and further investigated whether a mixed-format design could effectively control acquiescence and the preference for extreme response options, using two datasets (Attitude toward Peace Walls and International Personality Item Pool). A proposed multidimensional item response model was implemented to simultaneously estimate the impact of the item feature and response preference. The results suggested that NW items had an impact on item responses and that affirmative preference was negligible, regardless of the proportion of NW items in a scale. However, participants' extremity preference was large in both balanced and imbalanced mixed-format designs. We conclude that the impact of the NW-item feature is not negligible in a mixed-format scale, which exhibits good control of acquiescence but not of extremity preference.
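A toy sketch of the extremity preference at issue, under one assumed mechanism (a person-level multiplier that stretches the latent response before categorization; this is not the proposed MIRT model, just an illustration of how extremity preference concentrates responses in the endpoint categories):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 1000, 12
theta = rng.normal(size=n)
extremity = rng.normal(0, 0.8, size=n)    # person-level extremity preference

# Latent continuous response; a positive extremity multiplier stretches it,
# pushing the observed category toward 1 or 5.
latent = theta[:, None] + rng.normal(0, 1, (n, k))
scaled = latent * np.exp(extremity)[:, None]
obs = np.digitize(scaled, [-1.5, -0.5, 0.5, 1.5]) + 1   # 5-point categories

def extreme_share(m):
    """Proportion of responses in the endpoint categories 1 and 5."""
    return ((m == 1) | (m == 5)).mean().round(2)

ers_high = extremity > 0.5
print("extreme-response share, high-ERS persons:", extreme_share(obs[ers_high]))
print("extreme-response share, other persons   :", extreme_share(obs[~ers_high]))
```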


Subjects
Psychometrics, Humans, Psychometrics/methods
8.
Front Psychol ; 11: 570365, 2020.
Article in English | MEDLINE | ID: mdl-33101139

ABSTRACT

Many test-takers do not carefully answer every test question; instead, they sometimes answer quickly without thoughtful consideration (rapid guessing, RG). Researchers have not modeled RG when assessing student learning with cognitive diagnostic models (CDMs) to personalize feedback on a set of fine-grained skills (or attributes). Therefore, this study proposes to enhance cognitive diagnosis by modeling RG via an advanced CDM with item responses and response times. This study tests the parameter recovery of this new CDM with a series of simulations via Markov chain Monte Carlo methods in JAGS. Also, this study tests the degree to which the standard and proposed CDMs fit the student response data from the Programme for International Student Assessment (PISA) 2015 computer-based mathematics test. The new CDM outperformed the simpler CDM that ignored RG: it showed less bias and greater precision for both item and person estimates, and greater classification accuracy of test results. Meanwhile, the empirical study showed different levels of student RG across test items and confirmed the findings of the simulations.
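A hedged generating sketch of the idea: a DINA kernel (our assumption; the abstract does not name the specific CDM) produces effortful responses, and an RG mixture overwrites a fraction of them with chance-level answers:

```python
import numpy as np

rng = np.random.default_rng(5)
n, items, attrs = 1000, 15, 3
Q = rng.integers(0, 2, (items, attrs))            # Q-matrix (item x attribute)
Q[Q.sum(1) == 0, 0] = 1                           # every item needs >= 1 attr
alpha = rng.integers(0, 2, (n, attrs))            # mastery profiles
slip, guess = 0.1, 0.2

# DINA: correct with prob 1 - slip iff all required attributes are mastered.
eta = (alpha @ Q.T == Q.sum(1)).astype(int)       # ideal response pattern
p_correct = np.where(eta == 1, 1 - slip, guess)
effortful = rng.random((n, items)) < p_correct

# Rapid guessing overrides the cognitive process on ~8% of responses
# (rate and 0.25 chance level are invented for the demo).
rg = rng.random((n, items)) < 0.08
resp = np.where(rg, rng.random((n, items)) < 0.25, effortful)
print("accuracy with vs without RG:",
      resp.mean().round(3), effortful.mean().round(3))
```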

9.
Behav Res Methods ; 52(1): 23-35, 2020 02.
Article in English | MEDLINE | ID: mdl-30706348

ABSTRACT

Likert or rating scales may elicit an extreme response style (ERS), which means that responses to scales do not reflect the ability that is meant to be measured. Research has shown that the presence of ERS could lead to biased scores and thus influence the accuracy of differential item functioning (DIF) detection. In this study, a new method under the multiple-indicators multiple-causes (MIMIC) framework is proposed as a means to eliminate the impact of ERS in DIF detection. The findings from a series of simulations showed that a difference in ERS between groups caused inflated false-positive rates and deflated true-positive rates in DIF detection when ERS was not taken into account. The modified MIMIC model, as compared to conventional MIMIC, logistic discriminant function analysis, ordinal logistic regression, and their extensions, could control false-positive rates across situations and yielded trustworthy true-positive rates. An empirical example from a study of Chinese marital resilience was analyzed to demonstrate the proposed model.
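The MIMIC logic can be sketched with a linear toy model: indicators load on a latent trait that differs between groups (impact), and a nonzero direct group-to-item path constitutes uniform DIF; all loadings and effects below are invented, and the paper's ERS adjustment is not reproduced:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3000
group = np.repeat([0, 1], n // 2)
eta = 0.3 * group + rng.normal(size=n)      # latent trait; groups differ (impact)

# MIMIC structure: each indicator loads on eta; a direct effect of group on
# an indicator (item 0, beta = 0.5) is uniform DIF on that item.
loadings = np.array([0.8, 0.7, 0.9, 0.6])
beta = np.array([0.5, 0.0, 0.0, 0.0])
y = eta[:, None] * loadings + group[:, None] * beta + rng.normal(0, 0.6, (n, 4))

# Item 0's observed group gap exceeds what impact alone (via eta) explains.
gap = y[group == 1].mean(0) - y[group == 0].mean(0)
expected = loadings * 0.3                   # gap attributable to impact only
print(np.round(gap - expected, 2))          # ~0.5 for item 0, ~0 elsewhere
```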


Subjects
Logistic Models, Data Collection
10.
Br J Educ Psychol ; 90 Suppl 1: 224-239, 2020 Jun.
Article in English | MEDLINE | ID: mdl-31556972

ABSTRACT

BACKGROUND: Most bullying incidents occur in the presence of bystanders, yet few bystanders choose to intervene. Therefore, the development of a valid instrument to measure individuals' willingness to intervene in bullying is warranted. AIMS: This study aimed to develop and validate a self-reported willingness to intervene in bullying scale (WIBS) for secondary school students. SAMPLES: Two samples of junior high school students in Taiwan (N = 553; N = 950) were collected for scale revision and scale validation, respectively. METHODS: This study examined whether 'perceived severity of bullying' and 'self-efficacy of intervention' were important attributes of the willingness to intervene in bullying. The partial credit model (PCM) and the model with internal restriction of item difficulty (MIRID) were utilized to fit the data. RESULTS: The WIBS had good model-data fit with both the PCM and the MIRID, and the MIRID suggested that 'perceived severity of bullying' and 'self-efficacy of intervention' are important components of the willingness to intervene in bullying scenarios, although the latter component had greater weight than the former. Moreover, the willingness to intervene was related to pro-victim attitudes and self-reported defending behaviours. CONCLUSIONS: Students' willingness to intervene in school bullying situations could be explained by their self-efficacy in stopping bullying and their perceived severity of bullying incidents. Therefore, educators and researchers should attempt to raise students' self-efficacy regarding intervention and their perceived severity of all kinds of bullying to promote their willingness to intervene in school bullying situations.
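For reference, the partial credit model's category probabilities take the standard Masters (1982) form; a short sketch with hypothetical step difficulties (not the WIBS item parameters):

```python
import numpy as np

def pcm_probs(theta, deltas):
    """Partial credit model: P(X = k | theta) for k = 0..m, where deltas
    are the m step difficulties of one polytomous item (Masters, 1982)."""
    steps = np.concatenate(([0.0], np.cumsum(theta - np.asarray(deltas))))
    expd = np.exp(steps - steps.max())            # stabilized exponentials
    return expd / expd.sum()

# A 4-category willingness item with ordered (invented) step difficulties.
for theta in (-1.0, 0.0, 1.5):
    print(theta, np.round(pcm_probs(theta, [-0.8, 0.2, 1.1]), 3))
```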


Subjects
Adolescent Behavior/psychology, Bullying/psychology, Psychometrics/instrumentation, Psychometrics/standards, Self Efficacy, Social Interaction, Social Perception, Students/psychology, Adolescent, Female, Humans, Male, Taiwan
11.
Appl Psychol Meas ; 42(8): 613-629, 2018 Nov.
Article in English | MEDLINE | ID: mdl-30559570

ABSTRACT

Differential item functioning (DIF) makes test scores incomparable and substantially threatens test validity. Although conventional approaches, such as the logistic regression (LR) and Mantel-Haenszel (MH) methods, have worked well, they are vulnerable to high percentages of DIF items in a test and to missing data. This study developed a simple but effective method to detect DIF using the odds ratio (OR) of two groups' responses to a studied item. The OR method uses all available information from examinees' responses, and it can eliminate the potential influence of bias in the total scores. Through a series of simulation studies in which the DIF pattern, impact, sample size (equal/unequal), purification procedure (with/without), percentage of DIF items, and proportion of missing data were manipulated, the performance of the OR method was evaluated and compared with the LR and MH methods. The results showed that the OR method without a purification procedure outperformed the LR and MH methods in controlling false positive rates and yielding high true positive rates when tests had a high percentage of DIF items favoring the same group. In addition, only the OR method was feasible when tests adopted the item matrix sampling design. The effectiveness of the OR method was illustrated with an empirical example.
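The core quantity is the odds ratio of correct responses between the two groups on the studied item; a minimal sketch with the usual Woolf standard error and a Haldane correction (the 1.96 flagging rule is our assumption, and the paper's purification details are omitted):

```python
import numpy as np

def item_log_or(ref_correct, ref_wrong, foc_correct, foc_wrong):
    """Log odds ratio of a correct response (reference vs focal group)
    with the standard Woolf SE; +0.5 guards against empty cells (Haldane)."""
    a, b, c, d = (x + 0.5 for x in (ref_correct, ref_wrong,
                                    foc_correct, foc_wrong))
    log_or = np.log(a * d / (b * c))
    se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se

# Flag an item as DIF if |log OR| / SE exceeds 1.96 (assumed decision rule).
log_or, se = item_log_or(420, 80, 350, 150)
z = log_or / se
print(f"log OR = {log_or:.2f}, z = {z:.2f}, DIF flagged: {abs(z) > 1.96}")
```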

12.
Front Psychol ; 9: 1302, 2018.
Article in English | MEDLINE | ID: mdl-30100891

ABSTRACT

Conventional differential item functioning (DIF) approaches such as logistic regression (LR) often assume unidimensionality of a scale and match participants in the reference and focal groups based on total scores. However, many educational and psychological assessments are multidimensional by design, and a matching variable based on total scores that does not reflect the test structure may not be good practice for DIF detection with multidimensional items. We propose the use of all subscores of a scale in LR and compare its performance with alternative matching methods, including the use of the total score and of individual subscores. We focused on a uniform DIF situation in which 250, 500, or 1,000 participants in each group answered 21 items reflecting two dimensions, with the 21st item as the studied item. Five factors were manipulated in the study: (a) the test structure, (b) the number of cross-loaded items, (c) group differences in latent abilities, (d) the magnitude of DIF, and (e) group sample size. The results showed that, when the studied item measured a single domain, the conventional LR incorporating total scores as a matching variable yielded inflated false positive rates (FPRs) when the two groups differed in one latent ability. The situation worsened when one group had a higher ability in one domain and a lower ability in the other. The LR using a single subscore as the matching variable performed well in terms of FPRs and true positive rates (TPRs) when the two groups did not differ in any latent ability or differed in only one, but it yielded inflated FPRs when the two groups differed in both latent abilities. The proposed LR using two subscores yielded well-controlled FPRs across all conditions and yielded the highest TPRs. When the studied item measured two domains, the use of either the total score or two subscores worked well in the control of FPRs and yielded similar TPRs across conditions, whereas the use of a single subscore resulted in inflated FPRs when the two groups differed in one or two latent abilities. In conclusion, we recommend the use of multiple subscores to match subjects in DIF detection for multidimensional data.
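A hedged sketch of the proposed matching strategy: regress the studied item on both subscores plus group membership and test the group term (for brevity, the two latent abilities stand in for observed subscores, and all effect sizes are invented):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 1000
group = np.repeat([0, 1], n // 2)                  # reference / focal
sub1 = rng.normal(group * 0.5, 1)                  # groups differ on ability 1
sub2 = rng.normal(size=n)                          # no group difference

# Studied item loads on dimension 1, with uniform DIF of 0.6 against the
# focal group built into the generating model.
logit = 1.2 * sub1 - 0.6 * group
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Match on both subscores (the proposed approach), then test the group term.
X = sm.add_constant(np.column_stack([sub1, sub2, group]))
fit = sm.Logit(y, X).fit(disp=0)
print("group coefficient:", fit.params[3].round(2),
      "p =", fit.pvalues[3].round(4))              # significant -> uniform DIF
```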

13.
14.
Front Psychol ; 8: 1143, 2017.
Article in English | MEDLINE | ID: mdl-28736542

ABSTRACT

Extreme response styles (ERS) are prevalent in Likert- or rating-type data, but previous research has not adequately addressed their impact on differential item functioning (DIF) assessments. This study aimed to fill this knowledge gap and examined their influence on the performance of logistic regression (LR) approaches to DIF detection, including ordinal logistic regression (OLR) and logistic discriminant function analysis (LDFA). Results indicated that both the standard OLR and LDFA yielded severely inflated false positive rates as the magnitude of the difference in ERS between two groups increased. This study proposed a class of modified LR approaches to eliminate the ERS effect on DIF assessment. These proposed modifications showed satisfactory control of false positive rates when no DIF items existed and yielded better control of false positive rates and more accurate true positive rates under DIF conditions than the conventional LR approaches did. In conclusion, the proposed modifications are recommended in survey research when there are multiple groups or cultural groups.
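For context, baseline LDFA amounts to a logistic regression of group membership on the matching score and the studied item's response; the sketch below shows the standard (unmodified) version under a no-DIF, no-ERS-difference condition, with all data-generating settings invented (the paper's ERS-adjusting modifications are not reproduced here):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(13)
n = 1200
theta = rng.normal(size=n)
group = (rng.random(n) < 0.5).astype(int)

# 5-point responses to 10 items via a crude graded mechanism (no DIF, no ERS
# difference between groups in this toy condition).
items = np.clip(np.round(3 + theta[:, None] + rng.normal(0, 1, (n, 10))), 1, 5)
total = items.sum(1)
studied = items[:, 0]

# LDFA: regress group on the matching total score and the studied item's
# response; a significant item term (beyond the total) would indicate DIF.
X = sm.add_constant(np.column_stack([total, studied]))
fit = sm.Logit(group, X).fit(disp=0)
print("item term p-value:", fit.pvalues[2].round(3))   # no DIF -> usually > .05
```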

15.
Multivariate Behav Res ; 52(3): 391-402, 2017.
Article in English | MEDLINE | ID: mdl-28328280

ABSTRACT

Multifaceted data are very common in the human sciences. For example, test takers' responses to essay items are marked by raters. If multifaceted data are analyzed with standard facets models, it is assumed there is no interaction between facets. In reality, an interaction between facets can occur, referred to as differential facet functioning. A special case of differential facet functioning is the interaction between ratees and raters, referred to as differential rater functioning (DRF). In existing DRF studies, the group membership of ratees is known, such as gender or ethnicity. However, DRF may occur when the group membership is unknown (latent) and thus has to be estimated from data. To solve this problem, in this study, we developed a new mixture facets model to assess DRF when the group membership is latent and we provided two empirical examples to demonstrate its applications. A series of simulations were also conducted to evaluate the performance of the new model in the DRF assessment in the Bayesian framework. Results supported the use of the mixture facets model because all parameters were recovered fairly well, and the more data there were, the better the parameter recovery.
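A generating sketch of DRF with latent ratee classes (dichotomous pass/fail ratings for simplicity; the class proportion, severities, and 0.8 DRF effect are all invented), showing the pass-rate gap that only the DRF raters produce:

```python
import numpy as np

rng = np.random.default_rng(4)
n_ratees, n_raters = 400, 10
latent = (rng.random(n_ratees) < 0.3).astype(float)   # unknown ratee class
theta = rng.normal(size=n_ratees)
sev = rng.normal(0, 0.5, n_raters)
drf = np.where(np.arange(n_raters) < 3, 0.8, 0.0)     # 3 raters show DRF

# Rasch facets formulation with extra severity toward class-1 ratees
# for the DRF raters only.
logit = theta[:, None] - sev[None, :] - drf[None, :] * latent[:, None]
x = rng.random((n_ratees, n_raters)) < 1 / (1 + np.exp(-logit))

print("pass rates (class 1 vs 0), DRF raters    :",
      x[latent == 1][:, :3].mean().round(2), x[latent == 0][:, :3].mean().round(2))
print("pass rates (class 1 vs 0), non-DRF raters:",
      x[latent == 1][:, 3:].mean().round(2), x[latent == 0][:, 3:].mean().round(2))
```

The mixture facets model reverses this process: it estimates the class memberships, severities, and DRF effects jointly from the observed ratings.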


Subjects
Bayes Theorem, Models, Statistical, Computer Simulation, Data Interpretation, Statistical, Humans, Language Tests, Students, Writing
16.
Appl Psychol Meas ; 41(8): 600-613, 2017 Nov.
Article in English | MEDLINE | ID: mdl-29881107

ABSTRACT

There is re-emerging interest in adopting forced-choice items to address the issue of response bias in Likert-type items for noncognitive latent traits. Multidimensional pairwise comparison (MPC) items are commonly used forced-choice items. However, few studies have been aimed at developing item response theory models for MPC items owing to the challenges associated with ipsativity. Acknowledging that the absolute scales of latent traits are not identifiable in ipsative tests, this study developed a Rasch ipsative model for MPC items that has desirable measurement properties, yields a single utility value for each statement, and allows for comparing psychological differentiation between and within individuals. The simulation results showed a good parameter recovery for the new model with existing computer programs. This article provides an empirical example of an ipsative test on work style and behaviors.
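The pairwise kernel of such a model is of the Bradley-Terry/Rasch type: the probability of endorsing one statement over another depends only on the difference of their utilities. A minimal sketch with invented utilities (in the full ipsative model, a statement's utility would derive from the respondent's latent traits):

```python
import numpy as np

rng = np.random.default_rng(21)
utilities = np.array([1.0, 0.4, -0.2, -1.2])       # one utility per statement

def p_prefer(i, j, u):
    """Rasch/Bradley-Terry pairwise choice: P(pick statement i over j)."""
    return 1 / (1 + np.exp(-(u[i] - u[j])))

# Simulate one respondent's forced choices over all statement pairs.
pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
choices = {(i, j): int(rng.random() < p_prefer(i, j, utilities))
           for (i, j) in pairs}
print(choices)   # 1 means statement i was preferred over statement j
```

Because only utility differences enter the choice probabilities, the absolute scale of the utilities is not identified, which is the ipsativity issue the abstract notes.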

17.
Educ Psychol Meas ; 75(1): 157-178, 2015 Feb.
Article in English | MEDLINE | ID: mdl-29795817

ABSTRACT

Many scales contain both positively and negatively worded items. Reverse recoding of negatively worded items might not be enough for them to function as positively worded items do. In this study, we commented on the drawbacks of existing approaches to the wording effect in mixed-format scales and used bi-factor item response theory (IRT) models to test the assumption of reverse coding and evaluate the magnitude of the wording effect. The parameters of the bi-factor IRT models can be estimated with existing computer programs. Two empirical examples, from the Program for International Student Assessment and the Trends in International Mathematics and Science Study, were given to demonstrate the advantages of the bi-factor approach over traditional ones. It was found that the wording effect in these two data sets was substantial and that ignoring it resulted in overestimated test reliability and biased person measures.
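A toy numpy demonstration of the overestimation claim (a linear factor sketch under invented loadings, not the bi-factor IRT model itself): the wording variance shared by NW items counts as "reliable" variance in coefficient alpha, so alpha exceeds the reliability of the sum as a measure of the trait alone:

```python
import numpy as np

rng = np.random.default_rng(17)
n, k = 20000, 10
neg = np.arange(k) >= 6                    # last four items negatively worded
trait = rng.normal(size=n)
wording = rng.normal(size=n)               # nuisance factor on NW items only
lg, lw, err = 0.7, 0.5, 0.6                # loadings and error SD (invented)

y = lg * np.where(neg, -trait[:, None], trait[:, None]) \
    + lw * neg * wording[:, None] + rng.normal(0, err, (n, k))
y = np.where(neg, -y, y)                   # naive reverse coding only

def cronbach_alpha(x):
    m = x.shape[1]
    return m / (m - 1) * (1 - x.var(0, ddof=1).sum() / x.sum(1).var(ddof=1))

# True reliability of the sum as a trait measure: (lg * k)^2 / var(sum),
# since each recoded item contributes lg * trait to the sum.
print("alpha (wording ignored) :", round(cronbach_alpha(y), 3))
print("reliability w.r.t. trait:",
      round((lg * k) ** 2 / y.sum(1).var(ddof=1), 3))
```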

18.
Appl Psychol Meas ; 39(5): 406-425, 2015 Jul.
Article in English | MEDLINE | ID: mdl-29881016

ABSTRACT

It is common in educational and psychological tests or social surveys that the same statement is judged on multiple scales. These multiple responses are linked by the same statement, which may cause local dependence. Considering the way a statement is judged on multiple scales, a new class of item response theory (IRT) models is developed to account for the nonrecursive carry-over effect, in which a response can be affected only by its preceding response rather than by a subsequent response. The parameters of the models can be estimated with the freeware WinBUGS. Two simulation studies were conducted to evaluate the parameter recovery of the new models and the consequences of model misspecification. Results showed that the parameters of the new models were recovered fairly well; fitting unnecessarily complicated models to data that did not have the carry-over effect did little harm to parameter estimation; and ignoring the carry-over effect by fitting standard IRT models yielded biased estimates for the item parameters, the correlation between latent traits, and the test reliability. Two empirical examples with parallel design and sequential design are provided to demonstrate the implications and applications of the new models.
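A minimal generating sketch of the nonrecursive carry-over idea: the same statement is answered on two scales in order, and endorsing it on the first scale shifts the logit of the second response (the gamma shift is our assumed parameterization of the general mechanism):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5000
theta1, theta2 = rng.multivariate_normal([0, 0], [[1, .5], [.5, 1]], n).T
b1, b2, gamma = 0.0, 0.2, 0.8             # difficulties; carry-over strength

# Scale 1 is answered first; scale 2's logit shifts by gamma if the same
# statement was just endorsed on scale 1 (nonrecursive: only 1 -> 2).
y1 = rng.random(n) < 1 / (1 + np.exp(-(theta1 - b1)))
y2 = rng.random(n) < 1 / (1 + np.exp(-(theta2 - b2 + gamma * y1)))

# The carry-over inflates the association between the two responses beyond
# what the latent correlation of 0.5 alone would produce.
print("P(y2=1 | y1=1):", y2[y1].mean().round(3))
print("P(y2=1 | y1=0):", y2[~y1].mean().round(3))
```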
