Results 1 - 17 of 17
1.
PLoS One ; 17(6): e0269369, 2022.
Article in English | MEDLINE | ID: mdl-35709188

ABSTRACT

Recently there have been tremendous efforts to develop statistical procedures that identify subgroups of patients for which certain treatments are effective. This article focuses on the selection of prognostic and predictive genetic biomarkers based on a relatively large number of candidate Single Nucleotide Polymorphisms (SNPs). We consider models which include prognostic markers as main effects and predictive markers as interaction effects with treatment. We compare different high-dimensional selection approaches including adaptive lasso, a Bayesian adaptive version of the Sorted L-One Penalized Estimator (SLOBE) and a modified version of the Bayesian Information Criterion (mBIC2). These are compared with classical multiple testing procedures for individual markers. Having identified predictive markers, we consider several different approaches to specifying subgroups susceptible to treatment. Our main conclusion is that selection based on mBIC2 and SLOBE has predictive performance similar to that of the adaptive lasso while including substantially fewer biomarkers.


Subject(s)
Genomics , Polymorphism, Single Nucleotide , Bayes Theorem , Biomarkers , Genetic Markers , Humans , Prognosis
2.
Quant Finance ; 22(2): 349-366, 2022.
Article in English | MEDLINE | ID: mdl-35465255

ABSTRACT

Index tracking and hedge fund replication aim at cloning the return time series properties of a given benchmark, either by using only a subset of its original constituents or by using a set of risk factors. In this paper, we propose a model that relies on the Sorted ℓ1 Penalized Estimator, called SLOPE, for index tracking and hedge fund replication. We show that SLOPE is capable not only of providing sparsity but also of forming groups among assets depending on their partial correlation with the index or the hedge fund return time series. The grouping structure can then be exploited to create individual investment strategies that allow building portfolios with a smaller number of active positions but still comparable tracking properties. Considering equity index data and hedge fund returns, we discuss the real-world properties of SLOPE-based approaches with respect to state-of-the-art approaches.

3.
Genetics ; 217(3), 2021 03 31.
Article in English | MEDLINE | ID: mdl-33789342

ABSTRACT

Ghost quantitative trait loci (QTL) are false discoveries in QTL mapping that arise from the "accumulation" of polygenic effects uniformly distributed over the genome. The locations on the chromosome that are strongly correlated with the total of the polygenic effects depend on a specific sample correlation structure determined by the genotypes at all loci. The problem is particularly severe when the same genotypes are used to study multiple QTL, e.g. using recombinant inbred lines or studying the expression QTL. In this case, the ghost QTL phenomenon can lead to false hotspots, where multiple QTL show apparent linkage to the same locus. We illustrate the problem using the classic backcross design and suggest that it can be solved by the application of the extended mixed effect model, where the random effects are allowed to have a nonzero mean. We provide formulas for estimating the thresholds for the corresponding t-test statistics and use them in the stepwise selection strategy, which allows for a simultaneous detection of several QTL. Extensive simulation studies illustrate that our approach eliminates ghost QTL/false hotspots, while preserving a high power of true QTL detection.


Subject(s)
Crosses, Genetic , Models, Genetic , Multifactorial Inheritance , Quantitative Trait Loci , Animals , Breeding/methods , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Plants/genetics
4.
Stat Med ; 39(22): 2901-2920, 2020 09 30.
Article in English | MEDLINE | ID: mdl-32478905

ABSTRACT

Human health is strongly associated with a person's lifestyle and levels of physical activity. Therefore, characterization of daily human activity is an important task. Accelerometers have been used to obtain precise measurements of body acceleration. Wearable accelerometers collect data as a three-dimensional time series with frequencies up to 100 Hz. Using such an accelerometry signal, we are able to classify different types of physical activity. In our work, we present a novel procedure for physical activity classification based on the raw accelerometry signal. Our proposal is based on the spherical representation of the data. We classify four activity types: resting, upper body activities (sitting), upper body activities (standing), and lower body activities. The classifier is constructed using decision trees with extracted features consisting of spherical coordinates summary statistics, moving averages of the radius and the angles, radius variance, and spherical variance. The classification accuracy of our method has been tested on data collected on a sample of 47 elderly individuals who performed a series of activities in laboratory settings. The achieved classification accuracy is over 90% when the subject-specific data are used and 84% when the group data are used. The main contributor to the classification accuracy is the angular part of the collected signal, especially spherical variance. To the best of our knowledge, spherical variance has never been previously used in the analysis of the raw accelerometry data. Its major advantage over other angular measures is its invariance to the accelerometer location shifts.


Subject(s)
Accelerometry , Algorithms , Aged , Exercise , Human Activities , Humans
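The spherical features named in the abstract above can be illustrated with a minimal Python sketch. It assumes the textbook definition of spherical variance (one minus the length of the mean unit vector); the article's exact definition, windowing and feature set are not given in the abstract, and the function names are hypothetical.

```python
import math

def to_spherical(x, y, z):
    """Convert one triaxial sample to (radius, polar angle, azimuth)."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0, 0.0
    # Clamp guards against rounding pushing the ratio just outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, z / r)))
    phi = math.atan2(y, x)
    return r, theta, phi

def spherical_variance(samples):
    """Assumed definition: 1 - |mean of unit direction vectors|.

    Returns 0 when all samples point the same way and approaches 1
    when directions are spread evenly. Zero-length samples are skipped.
    """
    units = []
    for x, y, z in samples:
        r = math.sqrt(x * x + y * y + z * z)
        if r > 0.0:
            units.append((x / r, y / r, z / r))
    if not units:
        raise ValueError("need at least one non-zero sample")
    n = len(units)
    mx = sum(u[0] for u in units) / n
    my = sum(u[1] for u in units) / n
    mz = sum(u[2] for u in units) / n
    return 1.0 - math.sqrt(mx * mx + my * my + mz * mz)
```

Because only directions enter the computation, rescaling every sample leaves the value unchanged, which is one reason an angular summary can be less sensitive to signal magnitude than radius-based features.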
5.
J Am Stat Assoc ; 114(525): 419-433, 2019.
Article in English | MEDLINE | ID: mdl-31217649

ABSTRACT

Sorted L-One Penalized Estimation (SLOPE, Bogdan et al., 2013, 2015) is a relatively new convex optimization procedure which allows for adaptive selection of regressors under sparse high dimensional designs. Here we extend the idea of SLOPE to deal with the situation when one aims at selecting whole groups of explanatory variables instead of single regressors. Such groups can be formed by clustering strongly correlated predictors or groups of dummy variables corresponding to different levels of the same qualitative predictor. We formulate the respective convex optimization problem, gSLOPE (group SLOPE), and propose an efficient algorithm for its solution. We also define a notion of the group false discovery rate (gFDR) and provide a choice of the sequence of tuning parameters for gSLOPE so that gFDR is provably controlled at a prespecified level if the groups of variables are orthogonal to each other. Moreover, we prove that the resulting procedure adapts to unknown sparsity and is asymptotically minimax with respect to the estimation of the proportions of variance of the response variable explained by regressors from different groups. We also provide a method for the choice of the regularizing sequence when variables in different groups are not orthogonal but statistically independent and illustrate its good properties with computer simulations. Finally, we illustrate the advantages of gSLOPE in the context of Genome Wide Association Studies. R package grpSLOPE with an implementation of our method is available on CRAN.

6.
Orphanet J Rare Dis ; 13(1): 77, 2018 05 11.
Article in English | MEDLINE | ID: mdl-29751809

ABSTRACT

BACKGROUND: IDeAl (Integrated designs and analysis of small population clinical trials) is an EU funded project developing new statistical design and analysis methodologies for clinical trials in small population groups. Here we provide an overview of IDeAl findings and give recommendations to applied researchers. METHOD: The description of the findings is broken down by the nine scientific IDeAl work packages and summarizes results from the project's more than 60 publications to date in peer reviewed journals. In addition, we applied text mining to evaluate the publications and the IDeAl work packages' output in relation to the design and analysis terms derived from the IRDiRC task force report on small population clinical trials. RESULTS: The results are summarized, describing the developments from an applied viewpoint. The main results presented here are 33 practical recommendations drawn from the work, giving researchers comprehensive guidance on the improved methodology. In particular, the findings will help design and analyse efficient clinical trials in rare diseases with a limited number of patients available. We developed a network representation relating the hot topics developed by the IRDiRC task force on small population clinical trials to IDeAl's work as well as relating important methodologies by IDeAl's definition necessary to consider in design and analysis of small-population clinical trials. These network representations establish a new perspective on design and analysis of small-population clinical trials. CONCLUSION: IDeAl has provided a huge number of options to refine the statistical methodology for small-population clinical trials from various perspectives. A total of 33 recommendations developed and related to the work packages help the researcher to design small-population clinical trials. The route to improvements is displayed in the IDeAl network, representing important statistical methodological skills necessary for the design and analysis of small-population clinical trials. The methods are ready for use.


Subject(s)
Rare Diseases , Clinical Trials as Topic , Data Interpretation, Statistical , Humans , Research Design
7.
Genet Epidemiol ; 41(6): 555-566, 2017 09.
Article in English | MEDLINE | ID: mdl-28657151

ABSTRACT

In genome-wide association studies (GWAS) genetic loci that influence complex traits are localized by inspecting associations between genotypes of genetic markers and the values of the trait of interest. On the other hand, admixture mapping, which is performed in case of populations consisting of a recent mix of two ancestral groups, relies on the ancestry information at each locus (locus-specific ancestry). Recently it has been proposed to jointly model genotype and locus-specific ancestry within the framework of single marker tests. Here, we extend this approach for population-based GWAS in the direction of multimarker models. A modified version of the Bayesian information criterion is developed for building a multilocus model that accounts for the differential correlation structure due to linkage disequilibrium (LD) and admixture LD. Simulation studies and a real data example illustrate the advantages of this new approach compared to single-marker analysis or modern model selection strategies based on separately analyzing genotype and ancestry data, as well as to single-marker analysis combining genotypic and ancestry information. Depending on the signal strength, our procedure automatically chooses whether genotypic or locus-specific ancestry markers are added to the model. This results in a good compromise between the power to detect causal mutations and the precision of their localization. The proposed method has been implemented in R and is available at http://www.math.uni.wroc.pl/~mbogdan/admixtures/.


Subject(s)
Genealogy and Heraldry , Genetics, Population , Genome-Wide Association Study , Algorithms , Computer Simulation , Humans , Linkage Disequilibrium/genetics , Models, Genetic , Multivariate Analysis , Phenotype , Polymorphism, Single Nucleotide/genetics , Women's Health
8.
Genetics ; 205(1): 61-75, 2017 01.
Article in English | MEDLINE | ID: mdl-27784720

ABSTRACT

With the rise of both the number and the complexity of traits of interest, control of the false discovery rate (FDR) in genetic association studies has become an increasingly appealing and accepted target for multiple comparison adjustment. While a number of robust FDR-controlling strategies exist, the nature of this error rate is intimately tied to the precise way in which discoveries are counted, and the performance of FDR-controlling procedures is satisfactory only if there is a one-to-one correspondence between what scientists describe as unique discoveries and the number of rejected hypotheses. The presence of linkage disequilibrium between markers in genome-wide association studies (GWAS) often leads researchers to consider the signal associated to multiple neighboring SNPs as indicating the existence of a single genomic locus with possible influence on the phenotype. This a posteriori aggregation of rejected hypotheses results in inflation of the relevant FDR. We propose a novel approach to FDR control that is based on prescreening to identify the level of resolution of distinct hypotheses. We show how FDR-controlling strategies can be adapted to account for this initial selection both with theoretical results and simulations that mimic the dependence structure to be expected in GWAS. We demonstrate that our approach is versatile and useful when the data are analyzed using both tests based on single markers and multiple regression. We provide an R package that allows practitioners to apply our procedure on standard GWAS format data, and illustrate its performance on lipid traits in the North Finland Birth Cohort 66 cohort study.


Subject(s)
Genetic Association Studies/methods , Genome-Wide Association Study/methods , Models, Genetic , Cohort Studies , False Positive Reactions , Genetic Predisposition to Disease , Genome, Human , Genomics/methods , Humans , Linear Models , Linkage Disequilibrium , Polymorphism, Single Nucleotide , Predictive Value of Tests
9.
Ann Appl Stat ; 9(3): 1103-1140, 2015.
Article in English | MEDLINE | ID: mdl-26709357

ABSTRACT

We introduce a new estimator for the vector of coefficients β in the linear model y = Xβ + z, where X has dimensions n × p with p possibly larger than n. SLOPE, short for Sorted L-One Penalized Estimation, is the solution to $\min_{b \in \mathbb{R}^p} \tfrac{1}{2}\|y - Xb\|_{\ell_2}^2 + \lambda_1 |b|_{(1)} + \lambda_2 |b|_{(2)} + \dots + \lambda_p |b|_{(p)}$, where λ1 ≥ λ2 ≥ … ≥ λp ≥ 0 and $|b|_{(1)} \ge |b|_{(2)} \ge \dots \ge |b|_{(p)}$ are the decreasing absolute values of the entries of b. This is a convex program and we demonstrate a solution algorithm whose computational complexity is roughly comparable to that of classical ℓ1 procedures such as the Lasso. Here, the regularizer is a sorted ℓ1 norm, which penalizes the regression coefficients according to their rank: the higher the rank (that is, the stronger the signal), the larger the penalty. This is similar to the Benjamini and Hochberg [J. Roy. Statist. Soc. Ser. B 57 (1995) 289-300] procedure (BH), which compares more significant p-values with more stringent thresholds. One notable choice of the sequence {λi} is given by the BH critical values $\lambda_{BH}(i) = z(1 - i \cdot q/(2p))$, where q ∈ (0, 1) and z(α) is the quantile of a standard normal distribution. SLOPE aims to provide finite sample guarantees on the selected model; of special interest is the false discovery rate (FDR), defined as the expected proportion of irrelevant regressors among all selected predictors. Under orthogonal designs, SLOPE with λBH provably controls FDR at level q. Moreover, it also appears to have appreciable inferential properties under more general designs X while having substantial power, as demonstrated in a series of experiments running on both simulated and real data.
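The sorted ℓ1 penalty and the BH-type sequence described in this abstract can be sketched in a few lines of Python. This is an illustration only: it evaluates the penalty and the objective for a given b and does not implement the solution algorithm from the article. The function names are hypothetical, and the sequence uses λ_i = z(1 − i·q/(2p)) with z the standard normal quantile function.

```python
from statistics import NormalDist

def bh_lambda(p, q):
    """BH-type sequence lambda_i = z(1 - i*q/(2p)) for i = 1..p (nonincreasing)."""
    nd = NormalDist()
    return [nd.inv_cdf(1 - i * q / (2 * p)) for i in range(1, p + 1)]

def sorted_l1_norm(b, lam):
    """Pair the largest |b_j| with the largest lambda, the second with the second, etc."""
    abs_sorted = sorted((abs(v) for v in b), reverse=True)
    return sum(l * a for l, a in zip(lam, abs_sorted))

def slope_objective(y, X, b, lam):
    """0.5 * ||y - Xb||^2 plus the sorted-l1 penalty; X is a list of rows."""
    resid = [yi - sum(xij * bj for xij, bj in zip(row, b))
             for yi, row in zip(y, X)]
    return 0.5 * sum(r * r for r in resid) + sorted_l1_norm(b, lam)
```

With all λ_i equal, the penalty reduces to the ordinary ℓ1 norm times that constant, so the Lasso objective is recovered as a special case.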

10.
Stat Appl Genet Mol Biol ; 13(1): 83-104, 2014 Feb.
Article in English | MEDLINE | ID: mdl-24413217

ABSTRACT

To locate multiple interacting quantitative trait loci (QTL) influencing a trait of interest within experimental populations, methods such as Cockerham's model are usually applied. Within this framework, interactions are understood as the part of the joint effect of several genes which cannot be explained as the sum of their additive effects. However, if a change in the phenotype (such as disease) is caused by Boolean combinations of genotypes of several QTLs, Cockerham's approach is often unable to identify them properly. To detect such interactions more efficiently, we propose a logic regression framework. Even though a larger number of models has to be considered with the logic regression approach (requiring more stringent multiple testing correction), the efficient representation of higher order logic interactions in logic regression models leads to a significant increase of power to detect such interactions as compared to Cockerham's approach. The increase in power is demonstrated analytically for a simple two-way interaction model and illustrated in more complex settings with a simulation study and real data analysis.


Subject(s)
Epistasis, Genetic , Genetic Association Studies , Algorithms , Animals , Computer Simulation , Gallstones/genetics , Linear Models , Male , Mice , Mice, 129 Strain , Models, Genetic , Quantitative Trait Loci
11.
Stat Appl Genet Mol Biol ; 11(4): Article 2, 2012 May 18.
Article in English | MEDLINE | ID: mdl-22628351

ABSTRACT

The problem of locating quantitative trait loci (QTL) for experimental populations can be approached by multiple regression analysis. In this context variable selection using a modification of the Bayesian Information Criterion (mBIC) has been well established in the past. In this article a memetic algorithm (MA) is introduced to find the model which minimizes the selection criterion. Apart from mBIC also a second modification (mBIC2) is considered, which has the property of controlling the false discovery rate. Given the Bayesian nature of our selection criteria, we are not only interested in finding the best model, but also in computing marker posterior probabilities using all models visited by MA. In a simulation study MA (with mBIC and mBIC2) is compared with a parallel genetic algorithm (PGA) which has been previously suggested for QTL mapping. It turns out that MA in combination with mBIC2 performs best, where determining QTL positions based on marker posterior probabilities yields even better results than using the best model selected by MA. Finally we consider a real data set from the literature and show that MA can also be extended to multiple interval mapping, which potentially increases the precision with which the exact location of QTLs can be estimated.


Subject(s)
Algorithms , Bayes Theorem , Chromosome Mapping/methods , Quantitative Trait Loci , Animals , Humans , Models, Genetic , Regression Analysis
12.
Stat Appl Genet Mol Biol ; 9: Article 26, 2010.
Article in English | MEDLINE | ID: mdl-20597852

ABSTRACT

We consider the problem of locating multiple interacting quantitative trait loci (QTL) influencing traits measured in counts. In many applications the distribution of the count variable has a spike at zero. Zero-inflated generalized Poisson regression (ZIGPR) allows for an additional probability mass at zero and hence an improvement in the detection of significant loci. Classical model selection criteria often overestimate the QTL number. Therefore, modified versions of the Bayesian Information Criterion (mBIC and EBIC) were successfully used for QTL mapping. We apply these criteria based on ZIGPR as well as simpler models. An extensive simulation study shows their good power detecting QTL while controlling the false discovery rate. We illustrate how the inability of the Poisson distribution to account for over-dispersion leads to an overestimation of the QTL number and hence strongly discourages its application for identifying factors influencing count data. The proposed method is used to analyze the mice gallstone data of Lyons et al. (2003). Our results suggest the existence of a novel QTL on chromosome 4 interacting with another QTL previously identified on chromosome 5. We provide the corresponding code in R.


Subject(s)
Quantitative Trait Loci , Animals , Bayes Theorem , Chromosomes , Gallstones , Humans , Mice , Poisson Distribution , Probability
13.
Biometrics ; 64(4): 1162-9, 2008 Dec.
Article in English | MEDLINE | ID: mdl-18266892

ABSTRACT

SUMMARY: The modified version of the Bayesian Information Criterion (mBIC) is a relatively simple model selection procedure that can be used when locating multiple interacting quantitative trait loci (QTL). Our earlier work demonstrated the statistical properties of mBIC for situations where the average genetic map interval is at least 5 cM. In this work mBIC is adapted to genome searches based on a dense map and, more importantly, to the situation where consecutive QTL and interactions are located by multiple interval mapping. Easy-to-use formulas for the extended mBIC are given. A simulation study, as well as the analysis of real data, confirms the good properties of the extended mBIC.


Subject(s)
Bayes Theorem , Chromosome Mapping/methods , Quantitative Trait Loci , Chromosome Mapping/statistics & numerical data , Computer Simulation , Genomics
14.
Genetics ; 176(3): 1845-54, 2007 Jul.
Article in English | MEDLINE | ID: mdl-17507685

ABSTRACT

In previous work, a modified version of the Bayesian information criterion (mBIC) was proposed to locate multiple interacting quantitative trait loci (QTL). Simulation studies and real data analysis demonstrate good properties of the mBIC in situations where the error distribution is approximately normal. However, as with other standard techniques of QTL mapping, the performance of the mBIC strongly deteriorates when the trait distribution is heavy tailed or when the data contain a significant proportion of outliers. In the present article, we propose a suitable robust version of the mBIC that is based on ranks. We investigate the properties of the resulting method on the basis of theoretical calculations, computer simulations, and a real data analysis. Our simulation results show that for the sample sizes typically used in QTL mapping, the methods based on ranks are almost as efficient as standard techniques when the data are normal and are much better when the data come from some heavy-tailed distribution or include a proportion of outliers.


Subject(s)
Models, Genetic , Quantitative Trait Loci , Bayes Theorem , Computer Simulation , Methods , Reproducibility of Results
15.
Genetics ; 173(3): 1693-703, 2006 Jul.
Article in English | MEDLINE | ID: mdl-16624924

ABSTRACT

A modified version (mBIC) of the Bayesian Information Criterion (BIC) has been previously proposed for backcross designs to locate multiple interacting quantitative trait loci. In this article, we extend the method to intercross designs. We also propose two modifications of the mBIC. First we investigate a two-stage procedure in the spirit of empirical Bayes methods involving an adaptive (i.e., data-based) choice of the penalty. The purpose of the second modification is to increase the power of detecting epistasis effects at loci where main effects have already been detected. We investigate the proposed methods by computer simulations under a wide range of realistic genetic models, with nonequidistant marker spacings and missing data. In the case of large intermarker distances we use imputations according to Haley and Knott regression to reduce the distance between searched positions to not more than 10 cM. Haley and Knott regression is also used to handle missing data. The simulation study as well as real data analyses demonstrates good properties of the proposed method of QTL detection.


Subject(s)
Crosses, Genetic , Quantitative Trait Loci , Algorithms , Animals , Bayes Theorem , Computer Simulation , Drosophila/genetics , Epistasis, Genetic , Genetic Markers , Mice , Models, Genetic , Regression Analysis
16.
Genetics ; 167(2): 989-99, 2004 Jun.
Article in English | MEDLINE | ID: mdl-15238547

ABSTRACT

The problem of locating multiple interacting quantitative trait loci (QTL) can be addressed as a multiple regression problem, with marker genotypes being the regressor variables. An important and difficult part in fitting such a regression model is the estimation of the QTL number and respective interactions. Among the many model selection criteria that can be used to estimate the number of regressor variables, none are used to estimate the number of interactions. Our simulations demonstrate that epistatic terms appearing in a model without the related main effects cause the standard model selection criteria to have a strong tendency to overestimate the number of interactions, and so the QTL number. With this as our motivation we investigate the behavior of the Schwarz Bayesian information criterion (BIC) by explaining the phenomenon of the overestimation and proposing a novel modification of BIC that allows the detection of main effects and pairwise interactions in a backcross population. Results of an extensive simulation study demonstrate that our modified version of BIC performs very well in practice. Our methodology can be extended to general populations and higher-order interactions.


Subject(s)
Bayes Theorem , Models, Genetic , Quantitative Trait Loci , Computer Simulation , Regression Analysis
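The idea in the abstract above, a BIC with extra penalties that grow with the number of candidate main effects and interactions, can be sketched as follows. The penalty form and the constant 2.2 reflect how mBIC is commonly quoted in the literature; both are assumptions here, since the abstract does not state the formula, and should be checked against the article. Function and parameter names are hypothetical.

```python
import math

def bic(n, rss, k):
    """Schwarz BIC for a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss) + k * math.log(n)

def mbic(n, rss, p, r, m, c=2.2):
    """Sketch of a modified BIC (assumed form, see lead-in).

    n   : sample size
    rss : residual sum of squares of the fitted model
    p   : number of main effects in the model
    r   : number of pairwise interactions in the model
    m   : number of candidate markers
    c   : tuning constant (assumed value)
    """
    n_e = m * (m - 1) / 2          # number of candidate pairwise interactions
    l, u = m / c, n_e / c
    if l <= 1 or u <= 1:
        raise ValueError("too few markers for the extra penalty to be defined")
    return (n * math.log(rss) + (p + r) * math.log(n)
            + 2 * p * math.log(l - 1) + 2 * r * math.log(u - 1))
```

Under this form an interaction term is penalized more heavily than a main effect whenever there are more candidate interactions than candidate markers, which counteracts the overestimation of interactions that the abstract describes.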
17.
Bioinformatics ; 20(6): 881-7, 2004 Apr 12.
Article in English | MEDLINE | ID: mdl-14751984

ABSTRACT

MOTIVATION: Pairwise local sequence alignment is commonly used to search databases for sequences related to some query sequence. Alignments are obtained using a scoring matrix that takes into account the different frequencies of occurrence of the various types of amino acid substitutions. Software like BLAST provides the user with a set of scoring matrices to choose from, and in the literature it is sometimes recommended to try several scoring matrices on the sequences of interest. The significance of an alignment is usually assessed by looking at E-values and p-values. While sequence lengths and database sizes enter the standard calculations of significance, it is much less common to take the use of several scoring matrices on the same sequences into account. Altschul proposed corrections of the p-value that account for the simultaneous use of an infinite number of PAM matrices. Here we consider the more realistic situation where the user may choose from a finite set of popular PAM and BLOSUM matrices, in particular the ones available in BLAST. It turns out that the significance of a result can be considerably overestimated if a set of substitution matrices is used in an alignment problem and the most significant alignment is then quoted. RESULTS: Based on extensive simulations, we study the multiple testing problem that occurs when several scoring matrices for local sequence alignment are used. We consider a simple Bonferroni correction of the p-values and investigate its accuracy. Finally, we propose a more accurate correction based on extreme value distributions fitted to the maximum of the normalized scores obtained from different scoring matrices. For various sets of matrices we provide correction factors which can be easily applied to adjust p- and E-values reported by software packages.


Subject(s)
Models, Genetic , Models, Statistical , Proteins/chemistry , Proteins/genetics , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Data Interpretation, Statistical , Databases, Protein , Reproducibility of Results , Sensitivity and Specificity
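The simple Bonferroni correction mentioned in this abstract can be sketched in Python. This shows only the baseline adjustment for quoting the best of k scoring matrices; the extreme-value-based correction factors proposed in the article are not reproduced. The E-value to p-value conversion uses the standard relation p = 1 − exp(−E), and the function names are hypothetical.

```python
import math

def evalue_to_pvalue(e):
    """Standard conversion p = 1 - exp(-E); expm1 keeps precision for small E."""
    return -math.expm1(-e)

def bonferroni_best_of_k(p_values):
    """Adjusted p-value when the most significant of k results is reported."""
    if not p_values:
        raise ValueError("need at least one p-value")
    k = len(p_values)
    return min(1.0, k * min(p_values))
```

For example, if three matrices give p-values 0.01, 0.2 and 0.5, the reported 0.01 becomes 0.03 after adjustment. Scores from different matrices on the same sequences are strongly dependent, so a Bonferroni adjustment of this kind tends to be conservative.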