ABSTRACT
We introduce a new class of heteroscedastic partially linear models (PLMs) with skew-normal distribution. Maximum likelihood estimation of the model parameters via the ECM (Expectation/Conditional Maximization) algorithm and influence diagnostics for the new model are investigated. In addition, a likelihood ratio test for assessing the homogeneity of the scale parameter is presented. Simulation studies are carried out to assess the performance of the ECM algorithm and of the likelihood ratio test statistic for homogeneity of variance. A study of misspecification of the structure function is also considered. Finally, the new heteroscedastic PLM is applied to a real data set on ragweed pollen concentration to show that it provides a better fit than the classic homoscedastic PLM. We hope that the proposed model may attract applications in different areas of knowledge.
ABSTRACT
In this paper, we propose and derive a new regression model for response variables defined on the open unit interval. By reparameterizing the unit generalized half-normal distribution, we can interpret its location parameter as a quantile of the distribution. In addition, we can evaluate the effects of the explanatory variables on the conditional quantiles of the response variable, as an alternative to the Kumaraswamy quantile regression model. The suitability of our proposal is demonstrated with two simulated examples and two real applications. For these data sets, the fits obtained with the proposed regression model are compared with those provided by a Kumaraswamy regression model.
ABSTRACT
Most item response theory (IRT) models for dichotomous responses are based on probit or logit link functions, which assume a symmetric relationship between the probability of a correct response and the latent traits of individuals taking a test. This assumption restricts the use of those models to the case in which all items behave symmetrically. On the other hand, the asymmetric models proposed in the literature require that all the items in a test behave asymmetrically. This assumption is inappropriate for the great majority of tests, which are, in general, composed of both symmetric and asymmetric items. Furthermore, a straightforward extension of the existing models in the literature would require a prior selection of the items' symmetry/asymmetry status. This paper proposes a Bayesian IRT model that accounts for symmetric and asymmetric items in a flexible but parsimonious way. This is achieved by assigning a finite mixture prior to the skewness parameter, with one of the mixture components being a point mass at zero. This allows for analyses under both model selection and model averaging approaches. Asymmetric item curves are designed through the centred skew normal distribution, which has a particularly appealing parametrization in terms of parameter interpretation and computational efficiency. An efficient Markov chain Monte Carlo algorithm is proposed to perform Bayesian inference and its performance is investigated in some simulated examples. Finally, the proposed methodology is applied to a data set from a large-scale educational exam in Brazil.
Subject(s)
Algorithms , Humans , Bayes Theorem , Markov Chains , Monte Carlo Method
ABSTRACT
Bimodal data sets are very common in different areas of knowledge. Crude birth rate data, fish length data, egg diameter data, and the eruption and interruption times of the Old Faithful geyser are examples of this type of data. In this paper, a new class of symmetric density functions for modeling bimodal data such as these is presented. Starting from density functions with support on [0, +∞), symmetry is obtained by reflecting the density function onto the negative semi-axis and normalizing accordingly. In this way, if the primitive density function is unimodal, then the resulting density will be bimodal. We introduce asymmetry parameters and study their behavior, in particular the values of the modes and some other statistical quantities of interest. The cases of densities generated by the Gamma, Weibull, Log-normal, and Birnbaum-Saunders densities, among others, are studied. Statistical inference is performed from a classical perspective. A small simulation study is carried out to evaluate the benefits and limitations of the new proposal. In addition, an application to a data set on fetal weight in grams, obtained through ultrasound in a sample of 500 units, is presented; the results show the great usefulness of the model in practical situations.
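The reflection construction described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a Gamma base density with hypothetical shape and scale values, and it omits the asymmetry parameters that the paper introduces. If g is a density on [0, +∞), then f(x) = g(|x|)/2 is a symmetric density on the real line, and it has two modes whenever g is unimodal with its mode away from zero.

```python
import math


def gamma_pdf(x, shape, scale=1.0):
    """Gamma density on [0, inf), used here as the base ("primitive") density g."""
    if x <= 0:
        return 0.0  # adequate for shape > 1, the unimodal case illustrated here
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)


def reflected_pdf(x, base_pdf):
    """Symmetric density f(x) = g(|x|) / 2 obtained by reflecting g about zero."""
    return 0.5 * base_pdf(abs(x))


def trapezoid(func, lo, hi, steps=20000):
    """Plain trapezoidal rule, enough to check that a density integrates to one."""
    h = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, steps):
        total += func(lo + i * h)
    return total * h


# Hypothetical parameters: Gamma(shape=3, scale=1) has its mode at (shape - 1) * scale = 2.
def f(x):
    return reflected_pdf(x, lambda t: gamma_pdf(t, shape=3.0))


area = trapezoid(f, -40.0, 40.0)
print(round(area, 4))     # total mass of the reflected density
print(f(-2.0) == f(2.0))  # symmetry about zero
print(f(2.0) > f(0.0))    # the density dips at zero, so there are two modes
```

Halving the base density is the normalization step: the reflected curve covers both half-lines, so each side carries half of the total mass.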
ABSTRACT
In several applications, the assumption of normality is violated by data with some level of skewness, and this skewness affects the estimation of the mean. The class of skew-normal distributions is considered, given its flexibility for modeling data through an asymmetry parameter. In this paper, we consider two estimation methods for the location parameter (µ) in the skew-normal setting, where the coefficient of variation and the skewness parameter are known. Specifically, the least squares estimator (LSE) and the best unbiased estimator (BUE) for µ are considered. The properties of the BUE (which dominates the LSE) are explored using classic theorems of information theory, which provide a way to measure the uncertainty of location parameter estimates. Specifically, inequalities based on the convexity property make it possible to obtain lower and upper bounds for the differential entropy and the Fisher information. Some simulations illustrate the behavior of the differential entropy and Fisher information bounds.
ABSTRACT
OBJECTIVE: To validate the performance of the Fetal Medicine Foundation (FMF) 4.0 calculator adapted to the Mexican population. MATERIALS AND METHODS: Cohort study of singleton pregnancies, according to the competing risks model for preeclampsia, in a fetal medicine center in Mexico City. The a priori risk was calculated from the clinical history. Mean arterial pressure, mean uterine artery pulsatility index and pregnancy-associated plasma protein A were measured at 11 to 14 weeks of gestation with standardized methodology. The value of each marker was transformed into multiples of the median adapted to the local population. The multivariate normal distribution and Bayes' theorem were applied to obtain individual posttest probabilities, which were used as classifiers for the area under the receiver operating characteristic curve. RESULTS: The incidence of preeclampsia was 5.0% (54/1078). The area under the receiver operating characteristic curve was 0.784 (0.712; 0.856) for preeclampsia at less than 37 weeks and 0.807 (0.762; 0.852) for global preeclampsia. CONCLUSIONS: The FMF 4.0 calculator adapted to the Mexican population proved valid. Although its performance was lower than expected for preeclampsia at less than 37 weeks, its performance for global preeclampsia was satisfactory. The development of a local calculator is justified.
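As a rough illustration of how individual posttest probabilities can serve as classifiers, the sketch below computes the area under the ROC curve as the proportion of case-control pairs in which the case has the higher probability, with ties counted as one half. The function name and the probabilities are hypothetical and are not data from the study; the competing risks model itself is not reproduced here.

```python
def auc_from_scores(case_scores, control_scores):
    """AUC as the probability that a random case outranks a random control."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5  # ties count as half a correctly ordered pair
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical posttest probabilities for women with and without preeclampsia.
cases = [0.9, 0.8, 0.4]
controls = [0.5, 0.3, 0.2, 0.1]
auc = auc_from_scores(cases, controls)
print(round(auc, 4))  # 11 of the 12 pairs are ordered correctly -> 0.9167
```

The pairwise count is quadratic in the sample size, which is acceptable for a sketch; a rank-based formula would be the usual choice for a cohort of about a thousand pregnancies.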
ABSTRACT
Receiver operating characteristic (ROC) and predictiveness curves are graphical tools to study the discriminative and predictive power of a continuous-valued marker in a binary outcome. In this paper, a copula-based construction of the joint density of the marker and the outcome is developed for plotting and analyzing both curves. The methodology only requires a copula function, the marginal distribution of the marker, and the prevalence rate for the model to be characterized. The adoption of the Gaussian copula and the customization of the margin for the marker are proposed for such characterization. The computation of both curves is numerically more feasible than methods that attempt to obtain one curve in terms of the other. Estimation is carried out using maximum likelihood and resampling-based methods. Randomized quantile residuals from each conditional distribution are employed for both assessing the adequacy of the model and identifying outliers. The performance of the estimators of both curves and their underlying quantities is evaluated in simulation studies that assume different dependence structures and sample sizes. The methods are illustrated with an analysis of the level of progesterone receptor gene expression for the diagnosis and prediction of estrogen receptor-positive breast cancer.
Subject(s)
Models, Statistical , Biomarkers/analysis , Computer Simulation , Humans , Normal Distribution , ROC Curve
ABSTRACT
Modeling is an important statistical tool in forest science, especially in forest planning, where it is used to predict forest yield and assortments, for instance. This paper evaluated the accuracy of bivariate and generalized linear mixed modeling in representing the trunk taper of Pinus taeda L. and estimating its assortments. To compose the fitting data, 558 trees from plantations located in the southern region of Santa Catarina, Brazil, were scaled. Initially, the bivariate normality of the data was evaluated, and the bivariate standard normal distribution was fitted. Six generalized linear mixed models were fitted for the bivariate representation of diameter and height along the trunk. Afterwards, statistical indices and, as a complement, bivariate residual plots were obtained to verify the quality of the fitted models. Even with the application of the Box-Cox transformation, the results indicate the non-normality of the variables, but the transformation improved the model fit by 50%. The ordinal and exponential models obtained the best statistics for height representation, with the Akaike Information Criterion (AIC) value being reduced from 16,430.13 to 5,686.78 when considering the normal distribution. When the assortment predictions were evaluated, there were large discrepancies between the estimated values (246 logs for sawmill and 120 logs for veneers) and the observed ones (881 logs for sawmill and 628 logs for veneers), which corresponds to a 75% underestimation of total logs per hectare. Thus, generalized linear mixed modeling improved the representation of trunk taper, whereas bivariate modeling was not efficient in predicting assortment production.
Subject(s)
Pinus taeda , Pinus , Brazil , Forests , Linear Models , Trees
ABSTRACT
Response variables in medical sciences are often bounded, e.g. proportions, rates or fractions of incidence of some disease. In this work, we are interested in studying whether some characteristics of the population, e.g. sex and race, can explain the incidence rate of colorectal cancer. To accommodate such responses, we propose a new class of regression models for bounded responses, based on a new distribution on the open unit interval that includes an additional parameter to make the distribution more flexible. The proposal is to obtain a compound power normal distribution as a base distribution with a quantile transformation of another family of distributions with the same support, and then to study some properties of the new family. In addition, the new family is extended to regression models as an alternative to existing regression models for a unit interval response. We also present inferential procedures based on Bayesian methodology; specifically, a Metropolis-Hastings algorithm is used to obtain the Bayesian estimates of the parameters. An application to real data is considered to illustrate the use of the new family.
Subject(s)
Colorectal Neoplasms , Bayes Theorem , Colorectal Neoplasms/epidemiology , Humans , Incidence , Normal Distribution
ABSTRACT
The Birnbaum-Saunders distribution is a widely studied model with diverse applications. Its origins are in the modeling of lifetimes associated with material fatigue. By using a motivating example, we show that, even when lifetime data related to fatigue are modeled, the Birnbaum-Saunders distribution can be unsuitable for fitting these data in the distribution tails. Based on the nice properties of the Birnbaum-Saunders model, in this work we use a modified skew-normal distribution to construct such a model. This allows us to obtain flexibility in skewness and kurtosis, which is controlled by a shape parameter. We provide a mathematical characterization of this new type of Birnbaum-Saunders distribution, and its statistical characterization is then derived by using the maximum likelihood method, including the associated information matrices. In order to improve the inferential performance, we correct the bias of the corresponding estimators, which is supported by a simulation study. To conclude our investigation, we revisit the motivating example based on fatigue life data to show the good agreement between the new type of Birnbaum-Saunders distribution proposed in this work and the data, and we report its potential applications.
ABSTRACT
This paper focuses on studying a truncated positive version of the power-normal (PN) model considered in Durrans (1992). The truncation point is considered to be zero so that the resulting model is an extension of the half normal distribution. Some probabilistic properties are studied for the proposed model along with maximum likelihood and moments estimation. The model is fitted to two real datasets and compared with alternative models for positive data. Results indicate good performance of the proposed model.
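For orientation, the density involved can be sketched in standardized form. The sketch assumes the usual power-normal parametrization with distribution function $\Phi(x)^{\alpha}$, $\alpha > 0$, where $\phi$ and $\Phi$ are the standard normal density and distribution function. Location and scale parameters are omitted, and the notation is illustrative rather than taken from the paper. Truncating at zero and renormalizing gives

```latex
f(x;\alpha) = \frac{\alpha\,\phi(x)\,\Phi(x)^{\alpha-1}}{1 - 2^{-\alpha}}, \qquad x > 0,
```

because the mass removed by the truncation is $\Phi(0)^{\alpha} = 2^{-\alpha}$. Setting $\alpha = 1$ yields $f(x) = 2\phi(x)$, the half-normal density, which is consistent with the model being an extension of the half-normal distribution.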
ABSTRACT
This paper aims to evaluate techniques for correcting the chi-square (χ2) test as applied to confirmatory factor analysis (CFA) models in non-normal samples. In a simulated and exploratory approach, distinct distributions were measured in terms of multivariate kurtosis. In most of the situations examined, the tests evaluated tended to produce differing corrections of the χ2, CFI and RMSEA values in similar contexts. In conclusion, among the tests evaluated, the use of the following is suggested: the Elliptical test with Reweighted Least Squares (Elliptical Theory); the Heterogeneous Kurtosis test with Reweighted Least Squares (Heterogeneous Kurtosis Theory); and the Satorra-Bentler Scaled test with Maximum Likelihood (for distributions with excess univariate skewness and/or kurtosis). However, because of its correction factor, the Satorra-Bentler Scaled test can accept moderately misspecified models in the presence of extreme kurtosis.
Subject(s)
Chi-Square Distribution , Reproducibility of Results , Factor Analysis, Statistical
ABSTRACT
The aim of this study was to determine the probable monthly rainfall for the state of Mato Grosso do Sul at the 75% probability level and to study its spatial distribution in relation to the state's different biomes. Rainfall data from 32 stations (sites) in Mato Grosso do Sul were collected for the period 1954-2013. The mean was calculated for each of the 384 monthly rainfall time series, each with at least 30 years of observations. The Kolmogorov-Smirnov goodness-of-fit test was applied to the monthly rainfall time series to check the fit of the data to the normal distribution. Probable rainfall was estimated at 75% probability using the normal probability distribution, and ordinary kriging was then used to interpolate the data spatially. Based on the estimated probable monthly rainfall, Mato Grosso do Sul has three distinct periods, with rainfall associated with the different biomes: a rainy season (November to March, when the highest rainfall occurs in the Cerrado (savanna) biome), a dry season (June to August, when the highest rainfall occurs in the Atlantic Forest biome) and transition periods (April and May, and September and October).
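The estimation step can be sketched as follows. This is a minimal illustration under a stated assumption: "probable rainfall at 75% probability" is read here as the amount equalled or exceeded with probability 0.75, that is, the 25th percentile of the fitted normal distribution. The abstract does not spell out this definition, and the function name and sample values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def probable_rainfall(monthly_totals, exceedance_prob=0.75):
    """Rainfall equalled or exceeded with the given probability under a fitted normal.

    Assumption: "probable rainfall at 75%" is the amount exceeded in 75% of
    years, i.e. the (1 - 0.75) quantile of the fitted distribution.
    """
    fitted = NormalDist(mean(monthly_totals), stdev(monthly_totals))
    return fitted.inv_cdf(1.0 - exceedance_prob)


# Hypothetical January totals (mm) for one station; real series span 30+ years.
january = [80.0, 100.0, 120.0]
p75 = probable_rainfall(january)
print(round(p75, 2))  # mean 100, sd 20 -> 100 - 0.6745 * 20 = 86.51
```

Repeating this for each station and month gives one value per series, and those values are what the spatial interpolation step would take as input.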
Subject(s)
Rain , Ecosystem , Atmospheric Precipitation , Sampling Studies , Sample Size
ABSTRACT
The likelihood ratio test (LRT) for independence between two sets of variables makes it possible to identify whether there is a dependency relationship between them. The aim of this study was to calculate the type I error rate and power of the LRT for independence between two sets of variables under multivariate normal distributions, in scenarios formed by combining 16 sample sizes, 40 combinations of the number of variables in the two sets, and nine degrees of correlation between the variables (for power). The type I error rate and power were calculated in 640 and 5,760 scenarios, respectively. The performance of the LRT was evaluated by Monte Carlo computer simulation, using 2,000 simulations in each scenario. When the number of variables was large (24), the LRT controlled the type I error rate and showed high power (close to 100%) for sample sizes greater than 100. For small sample sizes (25, 30 and 50), the test performed well (expected type I error rate and high power) provided that the number of variables did not exceed 12.
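For reference, the standard form of this test (Wilks' likelihood ratio statistic) can be written as follows. The abstract does not state whether a small-sample correction was applied, so the plain asymptotic version is shown, and the notation is illustrative. With $n$ observations, $p$ and $q$ variables in the two sets, and the maximum likelihood sample covariance matrix $S$ partitioned into blocks $S_{11}$, $S_{12}$, $S_{21}$ and $S_{22}$:

```latex
\Lambda = \frac{|S|}{|S_{11}|\,|S_{22}|}, \qquad
-2\ln\lambda = -n \ln \Lambda \;\overset{a}{\sim}\; \chi^2_{pq}
\quad \text{under } H_0\colon \Sigma_{12} = 0 .
```

The scenario counts in the abstract follow from the design: 16 sample sizes × 40 variable combinations = 640 scenarios for the type I error rate, and 640 × 9 correlation levels = 5,760 scenarios for power.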
Subject(s)
Ricinus , Likelihood Functions , Multivariate Analysis , Data Interpretation, Statistical
ABSTRACT
Identifying the probability distribution function that represents monthly rainfall is relevant to agricultural planning, particularly with regard to crop establishment. The aim of this work was to determine which probability distribution (exponential, gamma or normal) best fits the monthly rainfall data of 14 sites in the state of Mato Grosso do Sul. Rainfall data from 14 stations (sites) in Mato Grosso do Sul were obtained from the National Water Agency (ANA) database and cover the period 1975-2013. The Kolmogorov-Smirnov test was applied to each of the 168 monthly rainfall time series to assess the fit to the exponential, gamma and normal probability distributions. The normal distribution gave the best fit to the monthly rainfall series of Mato Grosso do Sul and can be used to estimate monthly rainfall, especially in the rainy season months (October to March). The exponential distribution can be used to estimate monthly rainfall in the driest months of the year (May to September). We therefore recommend that these distributions be used in future research aimed at estimating probable rainfall for Mato Grosso do Sul.
Subject(s)
Rain , Agriculture
ABSTRACT
Student's t test rests on two assumptions: first, that the data are normally distributed and, second, that the samples are independent. It allows samples of size N ≤ 30 to be compared and/or the difference between the sample means to be established. For N > 30, the mathematical and statistical analysis of the test is frequently downplayed, with nonparametric tests being used instead, when the t test has sufficient statistical power.
ABSTRACT
The use of nonparametric tests is recommended when the data to be analysed do not meet the assumptions of normality and homoscedasticity. However, it is common to assume that the data are normal, or to use goodness-of-fit tests that are not appropriate for the sample size at hand. In many cases this leads to the use of statistical tests that do not match the actual distribution of the data and, consequently, to erroneous conclusions. This study therefore analyses the detection power of five goodness-of-fit tests (Kolmogorov-Smirnov, Kolmogorov-Smirnov-Lilliefors, Shapiro-Wilk, Anderson-Darling and Jarque-Bera) in symmetric distributions, with six sample sizes between 30 and 1000 participants generated by Monte Carlo simulation. The results show a generalized conservative tendency as the sample size increases. With regard to sample size, the tests with the best power to detect non-normality are Kolmogorov-Smirnov-Lilliefors and Anderson-Darling for small samples, Kolmogorov-Smirnov for medium-sized samples (200 participants) and Shapiro-Wilk for samples of more than 500 participants. In addition, the classic Kolmogorov-Smirnov test is considered absolutely ineffective regardless of sample size.
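The kind of Monte Carlo power study described here can be sketched with one of the five tests. The example below is an illustration under stated assumptions, not the study's design: it implements only the Jarque-Bera test with its asymptotic chi-square(2) p-value, uses a uniform distribution as a hypothetical symmetric alternative, and picks the sample size, number of replications and seed arbitrarily.

```python
import math
import random


def jarque_bera_pvalue(sample):
    """Jarque-Bera statistic and its asymptotic chi-square(2) p-value."""
    n = len(sample)
    m = sum(sample) / n
    m2 = sum((x - m) ** 2 for x in sample) / n
    m3 = sum((x - m) ** 3 for x in sample) / n
    m4 = sum((x - m) ** 4 for x in sample) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # The chi-square(2) survival function has the closed form exp(-x / 2).
    return jb, math.exp(-jb / 2.0)


def rejection_rate(draw, n, reps=500, alpha=0.05, seed=1):
    """Share of simulated samples of size n in which normality is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        if jarque_bera_pvalue(sample)[1] < alpha:
            rejections += 1
    return rejections / reps


# Type I error rate under normal data, and power against a symmetric alternative.
size_normal = rejection_rate(lambda rng: rng.gauss(0.0, 1.0), n=200)
power_uniform = rejection_rate(lambda rng: rng.uniform(-1.0, 1.0), n=200)
print(size_normal, power_uniform)
```

Calling `rejection_rate` over a grid of sample sizes and alternative distributions would give the kind of power table such a study reports; the other four tests would need their own implementations or a statistics library.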
Subject(s)
Statistics, Nonparametric , Sample Size
ABSTRACT
In this research article, we propose a class of models for positive and zero responses by means of a zero-augmented mixed regression model. Under this class, we are particularly interested in studying positive responses whose distribution accommodates skewness. At the same time, responses can be zero, and therefore, we justify the use of a zero-augmented mixture model. We model the mean of the positive response in a logarithmic scale and the mixture probability in a logit scale, both as a function of fixed and random effects. Moreover, the random effects link the two random components through their joint distribution and incorporate within-subject correlation because of the repeated measurements and between-subject heterogeneity. A Markov chain Monte Carlo algorithm is tailored to obtain Bayesian posterior distributions of the unknown quantities of interest, and Bayesian case-deletion influence diagnostics based on the q-divergence measure is performed. We apply the proposed method to a dataset from a 24 hour dietary recall study conducted in the city of São Paulo and present a simulation study to evaluate the performance of the proposed methods.
Subject(s)
Diet/statistics & numerical data , Models, Statistical , Algorithms , Bayes Theorem , Brazil , Computer Simulation , Humans , Likelihood Functions , Linear Models , Markov Chains , Mental Recall , Monte Carlo Method , Poisson Distribution
ABSTRACT
Among the tests used to show differences between means, Student's t test is the most characteristic. Its basic algebraic structure is the difference between two means divided by their dispersion. From it, the p value and the 95% confidence interval of the mean difference can be estimated. An essential requirement is that the variable whose mean is calculated be normally distributed. Student's t test is used to compare two unrelated means (a comparison between two maneuvers), which is known as the t test for independent samples. It is also used to compare two related means (a comparison before and after a maneuver in a single group), which is called the paired t test. When more than two means are compared (three or more dependent means, or three or more independent means), the analysis is performed with an ANOVA (analysis of variance).
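The algebraic structure described above, a difference of means divided by its dispersion, can be sketched for both variants of the test. This is a minimal illustration: the independent-samples version assumes equal variances (pooled estimate), the function names and measurements are hypothetical, and the p value is left out because it requires the t distribution, which the Python standard library does not provide.

```python
import math
from statistics import mean, stdev, variance


def independent_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two independent samples."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled * (1.0 / na + 1.0 / nb))
    return (mean(a) - mean(b)) / se, na + nb - 2


def paired_t(before, after):
    """t statistic and degrees of freedom for paired (before/after) measurements."""
    diffs = [y - x for x, y in zip(before, after)]
    se = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / se, len(diffs) - 1


# Hypothetical measurements.
t_ind, df_ind = independent_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
t_pair, df_pair = paired_t([10, 12, 9, 11], [12, 13, 11, 12])
print(round(t_ind, 3), df_ind)    # -1.0 8
print(round(t_pair, 3), df_pair)  # 5.196 3
```

The paired version works on within-subject differences, so it removes between-subject variability from the denominator; this is why the same data can give a much larger t statistic when the pairing is taken into account.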