Search | VHL Regional Portal

Recognize the Value of the Sum Score, Psychometrics' Greatest Accomplishment.

Sijtsma, Klaas; Ellis, Jules L; Borsboom, Denny.

Psychometrika ; 89(1): 84-117, 2024 Mar.

Article in English | MEDLINE | ID: mdl-38627311

ABSTRACT

The sum score on a psychological test is, and should continue to be, a tool central in psychometric practice. This position runs counter to several psychometricians' belief that the sum score represents a pre-scientific conception that must be abandoned from psychometrics in favor of latent variables. First, we reiterate that the sum score stochastically orders the latent variable in a wide variety of much-used item response models. In fact, item response theory provides a mathematically based justification for the ordinal use of the sum score. Second, because discussions about the sum score often involve its reliability and estimation methods as well, we show that, based on very general assumptions, classical test theory provides a family of lower bounds several of which are close to the true reliability under reasonable conditions. Finally, we argue that eventually sum scores derive their value from the degree to which they enable predicting practically relevant events and behaviors. None of our discussion is meant to discredit modern measurement models; they have their own merits unattainable for classical test theory, but the latter model provides impressive contributions to psychometrics based on very few assumptions that seem to have become obscured in the past few decades. Their generality and practical usefulness add to the accomplishments of more recent approaches.

Subject(s)

Psychometrics , Psychometrics/methods , Humans , Reproducibility of Results , Models, Statistical

Proof of Reliability Convergence to 1 at Rate of Spearman-Brown Formula for Random Test Forms and Irrespective of Item Pool Dimensionality.

Ellis, Jules L; Sijtsma, Klaas.

Psychometrika ; 2024 Mar 12.

Article in English | MEDLINE | ID: mdl-38472632

ABSTRACT

It is shown that the psychometric test reliability, based on any true-score model with randomly sampled items and uncorrelated errors, converges to 1 as the test length goes to infinity, with probability 1, assuming some general regularity conditions. The asymptotic rate of convergence is given by the Spearman-Brown formula, and for this it is not needed that the items are parallel, or latent unidimensional, or even finite dimensional. Simulations with the 2-parameter logistic item response theory model reveal that the reliability of short multidimensional tests can be positively biased, meaning that applying the Spearman-Brown formula in these cases would lead to overprediction of the reliability that results from lengthening a test. However, test constructors of short tests generally aim for short tests that measure just one attribute, so that the bias problem may have little practical relevance. For short unidimensional tests under the 2-parameter logistic model reliability is almost unbiased, meaning that application of the Spearman-Brown formula in these cases of greater practical utility leads to predictions that are approximately unbiased.

A Test to Distinguish Monotone Homogeneity from Monotone Multifactor Models.

Ellis, Jules L; Sijtsma, Klaas.

Psychometrika ; 88(2): 387-412, 2023 06.

Article in English | MEDLINE | ID: mdl-36933110

ABSTRACT

The goodness-of-fit of the unidimensional monotone latent variable model can be assessed using the empirical conditions of nonnegative correlations (Mokken in A theory and procedure of scale-analysis, Mouton, The Hague, 1971), manifest monotonicity (Junker in Ann Stat 21:1359-1378, 1993), multivariate total positivity of order 2 (Bartolucci and Forcina in Ann Stat 28:1206-1218, 2000), and nonnegative partial correlations (Ellis in Psychometrika 79:303-316, 2014). We show that multidimensional monotone factor models with independent factors also imply these empirical conditions; therefore, the conditions are insensitive to multidimensionality. Conditional association (Rosenbaum in Psychometrika 49(3):425-435, 1984) can detect multidimensionality, but tests of it (De Gooijer and Yuan in Comput Stat Data Anal 55:34-44, 2011) are usually not feasible for realistic numbers of items. The only existing feasible test procedures that can reveal multidimensionality are Rosenbaum's (Psychometrika 49(3):425-435, 1984) Case 2 and Case 5, which test the covariance of two items or two subtests conditionally on the unweighted sum of the other items. We improve this procedure by conditioning on a weighted sum of the other items. The weights are estimated in a training sample from a linear regression analysis. Simulations show that the Type I error rate is under control and that, for large samples, the power is higher if one dimension is more important than the other or if there is a third dimension. In small samples and with two equally important dimensions, using the unweighted sum yields greater power.

Subject(s)

Models, Theoretical , Psychometrics/methods , Regression Analysis , Linear Models

A Test Can Have Multiple Reliabilities.

Ellis, Jules L.

Psychometrika ; 86(4): 869-876, 2021 12.

Article in English | MEDLINE | ID: mdl-34498211

ABSTRACT

It is argued that the generalizability theory interpretation of coefficient alpha is important. In this interpretation, alpha is a slightly biased but consistent estimate for the coefficient of generalizability in a subjects x items design where both subjects and items are randomly sampled. This interpretation is based on the "domain sampling" true scores. It is argued that these true scores have a more solid empirical basis than the true scores of Lord and Novick (1968), which are based on "stochastic subjects" (Holland, 1990), while only a single observation is available for each within-subject distribution. Therefore, the generalizability interpretation of coefficient alpha is to be preferred, unless the true scores can be defined by a latent variable model that has undisputed empirical validity for the test and that is sufficiently restrictive to entail a consistent estimate of the reliability-as, for example, McDonald's omega. If this model implies that the items are essentially tau-equivalent, both the generalizability and the reliability interpretation of alpha can be defensible.

Subject(s)

Models, Theoretical , Humans , Psychometrics , Reproducibility of Results

A Simple Model to Determine the Efficient Duration of Exams.

Ellis, Jules L.

Educ Psychol Meas ; 81(3): 549-568, 2021 Jun.

Article in English | MEDLINE | ID: mdl-33994563

ABSTRACT

This study develops a theoretical model for the costs of an exam as a function of its duration. Two kind of costs are distinguished: (1) the costs of measurement errors and (2) the costs of the measurement. Both costs are expressed in time of the student. Based on a classical test theory model, enriched with assumptions on the context, the costs of the exam can be expressed as a function of various parameters, including the duration of the exam. It is shown that these costs can be minimized in time. Applied in a real example with reliability .80, the outcome is that the optimal exam time would be much shorter and would have reliability .675. The consequences of the model are investigated and discussed. One of the consequences is that optimal exam duration depends on the study load of the course, all other things being equal. It is argued that it is worthwhile to investigate empirically how much time students spend on preparing for resits. Six variants of the model are distinguished, which differ in their weights of the errors and in the way grades affect how much time students study for the resit.

Primary care functioning scale showed validity and reliability in patients with chronic conditions: a psychometric study.

Postma, Simone A E; Schers, Henk; Ellis, Jules L; van Boven, Kees; Napel, Huib Ten; Stappers, Hugo; Olde Hartman, Tim C; Gerritsen, Debby L.

J Clin Epidemiol ; 125: 130-137, 2020 09.

Article in English | MEDLINE | ID: mdl-32479791

ABSTRACT

OBJECTIVES: We evaluated the psychometric properties of a newly developed self-report questionnaire that aims for a more person-centered approach in primary care for patients with chronic conditions, the Primary Care Functioning Scale (PCFS). STUDY DESIGN AND SETTING: To test the measurement properties of the PCFS, we asked patients with diabetes, cardiovascular disease, and chronic pulmonary disease to complete the PCFS questionnaire. The PCFS is entirely based on the International Classification of Functioning, Disability, and Health (ICF), consisting of 52 ICF-related items covering body functions, activities and participation, environmental factors, and personal factors. We analyzed three hypotheses representing different item sets of the 34 ICF-related items that assess the level of functioning (body functions, activities, and participation). We tested for unidimensionality, differential item functioning, reliability, and criterion-related validity. RESULTS: Five hundred and eighty-two patients completed the questionnaire. The total scores of the polytomous and dichotomized items from the overall set 'body functions, activities and participation' demonstrated unidimensionality, good reliability (>0.80), and stability over time without bias from background variables. CONCLUSION: In sum, the PCFS can be used as a valid and reliable instrument to measure functioning in patients with chronic morbidity in primary care.

Subject(s)

Chronic Disease/psychology , Primary Health Care/methods , Psychometrics/methods , Activities of Daily Living , Aged , Female , Humans , Male , Patient-Centered Care , Reproducibility of Results , Self Report

Gaining power in multiple testing of interval hypotheses via conditionalization.

Ellis, Jules L; Pecanka, Jakub; Goeman, Jelle J.

Biostatistics ; 21(2): e65-e79, 2020 04 01.

Article in English | MEDLINE | ID: mdl-30247521

ABSTRACT

In this article, we introduce a novel procedure for improving power of multiple testing procedures (MTPs) of interval hypotheses. When testing interval hypotheses the null hypothesis $P$-values tend to be stochastically larger than standard uniform if the true parameter is in the interior of the null hypothesis. The new procedure starts with a set of $P$-values and discards those with values above a certain pre-selected threshold, while the rest are corrected (scaled-up) by the value of the threshold. Subsequently, a chosen family-wise error rate (FWER) or false discovery rate MTP is applied to the set of corrected $P$-values only. We prove the general validity of this procedure under independence of $P$-values, and for the special case of the Bonferroni method, we formulate several sufficient conditions for the control of the FWER. It is demonstrated that this "filtering" of $P$-values can yield considerable gains of power.

Subject(s)

Biostatistics/methods , Data Interpretation, Statistical , Models, Statistical , Benchmarking , Computer Simulation , Humans , Neuropsychological Tests/statistics & numerical data , Psychometrics/statistics & numerical data

An inequality for correlations in unidimensional monotone latent variable models for binary variables.

Ellis, Jules L.

Psychometrika ; 79(2): 303-16, 2014 Apr.

Article in English | MEDLINE | ID: mdl-24659373

ABSTRACT

It is shown that a unidimensional monotone latent variable model for binary items implies a restriction on the relative sizes of item correlations: The negative logarithm of the correlations satisfies the triangle inequality. This inequality is not implied by the condition that the correlations are nonnegative, the criterion that coefficient H exceeds 0.30, or manifest monotonicity. The inequality implies both a lower bound and an upper bound for each correlation between two items, based on the correlations of those two items with every possible third item. It is discussed how this can be used in Mokken's (A theory and procedure of scale-analysis, Mouton, The Hague, 1971) scale analysis.

Subject(s)

Psychometrics/methods , Statistics as Topic/methods , Humans

Probability interpretations of intraclass reliabilities.

Ellis, Jules L.

Stat Med ; 32(26): 4596-608, 2013 Nov 20.

Article in English | MEDLINE | ID: mdl-23703932

ABSTRACT

Research where many organizations are rated by different samples of individuals such as clients, patients, or employees frequently uses reliabilities computed from intraclass correlations. Consumers of statistical information, such as patients and policy makers, may not have sufficient background for deciding which levels of reliability are acceptable. It is shown that the reliability is related to various probabilities that may be easier to understand, for example, the proportion of organizations that will be classed significantly above (or below) the mean and the probability that an organization is classed correctly given that it is classed significantly above (or below) the mean. One can view these probabilities as the amount of information of the classification and the correctness of the classification. These probabilities have an inverse relationship: given a reliability, one can 'buy' correctness at the cost of informativeness and conversely. This article discusses how this can be used to make judgments about the required level of reliabilities.

Subject(s)

Models, Statistical , Reproducibility of Results , Health Workforce , Humans

10.

A standard for test reliability in group research.

Ellis, Jules L.

Behav Res Methods ; 45(1): 16-24, 2013 Mar.

Article in English | MEDLINE | ID: mdl-22736454

ABSTRACT

Many authors adhere to the rule that test reliabilities should be at least .70 or .80 in group research. This article introduces a new standard according to which reliabilities can be evaluated. This standard is based on the costs or time of the experiment and of administering the test. For example, if test administration costs are 7 % of the total experimental costs, the efficient value of the reliability is .93. If the actual reliability of a test is equal to this efficient reliability, the test size maximizes the statistical power of the experiment, given the costs. As a standard in experimental research, it is proposed that the reliability of the dependent variable be close to the efficient reliability. Adhering to this standard will enhance the statistical power and reduce the costs of experiments.

Subject(s)

Behavioral Research/economics , Behavioral Research/standards , Research Design/standards , Research/economics , Budgets , Cost Control , Group Processes , Humans , Psychometrics/standards , Reproducibility of Results , Sample Size , Surveys and Questionnaires/economics , Surveys and Questionnaires/standards

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL