1.
Stat Appl Genet Mol Biol ; 14(5): 413-28, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26461845

ABSTRACT

In co-expression analyses of gene expression data, it is often of interest to interpret clusters of co-expressed genes with respect to a set of external information, such as a potentially incomplete list of functional properties for which a subset of genes may be annotated. Based on the framework of finite mixture models, we propose a model selection criterion that takes into account such external gene annotations, providing an efficient tool for selecting a relevant number of clusters and clustering model. This criterion, called the integrated completed annotated likelihood (ICAL), is defined by adding an entropy term to a penalized likelihood to measure the concordance between a clustering partition and the external annotation information. The ICAL leads to the choice of a model that is more easily interpretable with respect to the known functional gene annotations. We illustrate the interest of this model selection criterion in conjunction with Gaussian mixture models on simulated gene expression data and on real RNA-seq data.


Subject(s)
Molecular Sequence Annotation , Algorithms , Cluster Analysis , Data Interpretation, Statistical , Gene Expression , Gene Expression Profiling , Models, Genetic , Sequence Analysis, RNA
2.
Bioinformatics ; 31(9): 1420-7, 2015 May 01.
Article in English | MEDLINE | ID: mdl-25563332

ABSTRACT

MOTIVATION: In recent years, gene expression studies have increasingly made use of high-throughput sequencing technology. In turn, research concerning the appropriate statistical methods for the analysis of digital gene expression (DGE) has flourished, primarily in the context of normalization and differential analysis. RESULTS: In this work, we focus on the question of clustering DGE profiles as a means to discover groups of co-expressed genes. We propose a Poisson mixture model using a rigorous framework for parameter estimation as well as the choice of the appropriate number of clusters. We illustrate co-expression analyses using our approach on two real RNA-seq datasets. A set of simulation studies also compares the performance of the proposed model with that of several related approaches developed to cluster RNA-seq or serial analysis of gene expression data. AVAILABILITY AND IMPLEMENTATION: The proposed method is implemented in the open-source R package HTSCluster, available on CRAN. CONTACT: andrea.rau@jouy.inra.fr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, RNA/methods , Animals , Cell Line , Cluster Analysis , Drosophila melanogaster/embryology , Drosophila melanogaster/genetics , Humans , Liver/metabolism , Models, Statistical , Poisson Distribution
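
To make the clustering idea in the abstract above concrete, here is a minimal sketch of EM for a plain mixture of independent Poisson distributions, in standard-library Python. It is an illustration under stated assumptions, not the parameterization or the code of HTSCluster: the function names, the initialisation from randomly chosen rows, and the fixed number of iterations are all hypothetical, and the selection of the number of clusters is not shown.

```python
import math
import random


def poisson_logpmf(y, lam):
    # log P(Y = y) for a Poisson variable with mean lam > 0
    return y * math.log(lam) - lam - math.lgamma(y + 1)


def fit_poisson_mixture(counts, n_clusters, n_iter=100, seed=0):
    """EM for a mixture of independent Poissons (hypothetical helper).

    counts[i][j] is the count of gene i in condition j.
    Returns mixing proportions, cluster means and hard labels.
    """
    rng = random.Random(seed)
    n, d = len(counts), len(counts[0])
    # Assumption: initialise cluster means from randomly chosen rows,
    # shifted by 0.5 so that no mean is exactly zero.
    centers = rng.sample(range(n), n_clusters)
    lam = [[counts[i][j] + 0.5 for j in range(d)] for i in centers]
    pi = [1.0 / n_clusters] * n_clusters

    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each gene,
        # computed on the log scale for numerical stability.
        resp = []
        for y in counts:
            logp = [
                math.log(pi[k])
                + sum(poisson_logpmf(y[j], lam[k][j]) for j in range(d))
                for k in range(n_clusters)
            ]
            m = max(logp)
            w = [math.exp(v - m) for v in logp]
            total = sum(w)
            resp.append([v / total for v in w])

        # M-step: weighted proportions and weighted mean counts.
        for k in range(n_clusters):
            nk = sum(r[k] for r in resp)
            pi[k] = max(nk / n, 1e-12)
            for j in range(d):
                weighted = sum(resp[i][k] * counts[i][j] for i in range(n))
                lam[k][j] = max(weighted / max(nk, 1e-12), 1e-8)

    labels = [max(range(n_clusters), key=lambda k: r[k]) for r in resp]
    return pi, lam, labels
```

Calling `fit_poisson_mixture(counts, 2)` on a small matrix with two clearly different expression profiles should assign the two groups of rows to different labels; on real data, several random starts would normally be compared.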
3.
J Soc Fr Statistique (2009) ; 155(2): 57-71, 2014.
Article in English | MEDLINE | ID: mdl-25279246

ABSTRACT

We compare two major approaches to variable selection in clustering: model selection and regularization. Based on previous results, we select the method of Maugis et al. (2009b), which modified the method of Raftery and Dean (2006), as a current state-of-the-art model selection method. We select the method of Witten and Tibshirani (2010) as a current state-of-the-art regularization method. We compared the methods by simulation in terms of their accuracy in both classification and variable selection. In the first simulation experiment all the variables were conditionally independent given cluster membership. We found that variable selection (of either kind) yielded substantial gains in classification accuracy when the clusters were well separated, but few gains when the clusters were close together. We found that the two variable selection methods had comparable classification accuracy, but that the model selection approach had substantially better accuracy in selecting variables. In our second simulation experiment, there were correlations among the variables given the cluster memberships. We found that the model selection approach was substantially more accurate in terms of both classification and variable selection than the regularization approach, and that both gave more accurate classifications than K-means without variable selection. However, the model selection approach cannot be used in very high-dimensional settings.

4.
Bioinformatics ; 29(17): 2146-52, 2013 Sep 01.
Article in English | MEDLINE | ID: mdl-23821648

ABSTRACT

MOTIVATION: RNA sequencing is now widely performed to study differential expression among experimental conditions. As tests are performed on a large number of genes, stringent false-discovery rate control is required at the expense of detection power. Ad hoc filtering techniques are regularly used to moderate this correction by removing genes with low signal, with little attention paid to their impact on downstream analyses. RESULTS: We propose a data-driven method based on the Jaccard similarity index to calculate a filtering threshold for replicated RNA sequencing data. In comparisons with alternative data filters regularly used in practice, we demonstrate the effectiveness of our proposed method to correctly filter lowly expressed genes, leading to increased detection power for moderately to highly expressed genes. Interestingly, this data-driven threshold varies among experiments, highlighting the interest of the method proposed here. AVAILABILITY: The proposed filtering method is implemented in the R package HTSFilter available on Bioconductor.


Subject(s)
Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, RNA/methods , Animals , Humans , Mice
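
One plausible reading of the Jaccard-based filter in the abstract above is sketched below in standard-library Python. For each candidate threshold, every sample is reduced to a present/absent vector, the Jaccard index is averaged over pairs of replicates from the same condition, and the threshold with the highest average is kept. The function names, the strict `>` comparison, the averaging over pairs and the grid of candidates are assumptions made for illustration; this is not the HTSFilter implementation, and normalization of the counts is not shown.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two equal-length boolean vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def choose_threshold(counts, conditions, candidates):
    """Pick the candidate threshold with the best replicate agreement.

    counts[g][s] is the count of gene g in sample s and
    conditions[s] is the condition label of sample s.
    Returns (threshold, mean Jaccard index). Hypothetical helper.
    """
    pairs = [
        (i, j)
        for i, j in combinations(range(len(conditions)), 2)
        if conditions[i] == conditions[j]
    ]
    if not pairs:
        raise ValueError("at least one pair of replicates is required")

    best_s, best_score = None, -1.0
    for s in candidates:
        scores = []
        for i, j in pairs:
            # A gene counts as expressed in a sample if it exceeds s.
            a = [row[i] > s for row in counts]
            b = [row[j] > s for row in counts]
            scores.append(jaccard(a, b))
        score = sum(scores) / len(scores)
        if score > best_score:  # ties keep the smallest candidate seen first
            best_s, best_score = s, score
    return best_s, best_score
```

Genes whose counts never exceed the returned threshold would then be filtered out before differential analysis; because the threshold is computed from the data, it can differ from one experiment to the next, as the abstract notes.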
5.
J Comput Graph Stat ; 9(2): 332-353, 2010 Jun 01.
Article in English | MEDLINE | ID: mdl-20953302

ABSTRACT

Model-based clustering consists of fitting a mixture model to data and identifying each cluster with one of its components. Multivariate normal distributions are typically used. The number of clusters is usually determined from the data, often using BIC. In practice, however, individual clusters can be poorly fitted by Gaussian distributions, and in that case model-based clustering tends to represent one non-Gaussian cluster by a mixture of two or more Gaussian distributions. If the number of mixture components is interpreted as the number of clusters, this can lead to overestimation of the number of clusters. This is because BIC selects the number of mixture components needed to provide a good approximation to the density, rather than the number of clusters as such. We propose first selecting the total number of Gaussian mixture components, K, using BIC and then combining them hierarchically according to an entropy criterion. This yields a unique soft clustering for each number of clusters less than or equal to K. These clusterings can be compared on substantive grounds, and we also describe an automatic way of selecting the number of clusters via a piecewise linear regression fit to the rescaled entropy plot. We illustrate the method with simulated data and a flow cytometry dataset. Supplemental Materials are available on the journal Web site and described at the end of the paper.
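
The hierarchical combination step described above can be sketched as follows, in standard-library Python. The sketch assumes the mixture has already been fitted and that `tau[i][k]` holds the posterior probability that observation i belongs to component k; at each step it merges the pair of components whose merging gives the lowest entropy of the resulting soft clustering. The function names are hypothetical, the greedy search over all pairs is an illustrative choice, and the piecewise linear regression used to pick the final number of clusters is not shown.

```python
import math


def entropy(tau):
    """Entropy of a soft clustering: -sum_i sum_k tau_ik * log(tau_ik)."""
    return -sum(t * math.log(t) for row in tau for t in row if t > 0)


def merge_columns(tau, a, b):
    """Soft clustering obtained by merging components a and b."""
    merged = []
    for row in tau:
        new_row = [t for k, t in enumerate(row) if k not in (a, b)]
        new_row.append(row[a] + row[b])  # merged component goes last
        merged.append(new_row)
    return merged


def combine_hierarchically(tau):
    """Greedy merging from K components down to 1 (hypothetical helper).

    Returns a list of (number of clusters, entropy, tau) tuples,
    one per step, starting with the unmerged solution.
    """
    path = [(len(tau[0]), entropy(tau), tau)]
    while len(tau[0]) > 1:
        k = len(tau[0])
        candidates = [
            merge_columns(tau, a, b)
            for a in range(k)
            for b in range(a + 1, k)
        ]
        # Lowest entropy after merging = largest decrease in entropy.
        tau = min(candidates, key=entropy)
        path.append((len(tau[0]), entropy(tau), tau))
    return path
```

Plotting the entropy values in the returned path against the number of clusters gives the kind of entropy plot the abstract refers to; two components that overlap heavily are merged early because doing so removes most of the classification uncertainty.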

6.
Biometrics ; 65(3): 701-9, 2009 Sep.
Article in English | MEDLINE | ID: mdl-19210744

ABSTRACT

This article is concerned with variable selection for cluster analysis. The problem is regarded as a model selection problem in the model-based cluster analysis context. A model generalizing the model of Raftery and Dean (2006, Journal of the American Statistical Association 101, 168-178) is proposed to specify the role of each variable. This model does not need any prior assumptions about the linear link between the selected and discarded variables. Models are compared with Bayesian information criterion. Variable role is obtained through an algorithm embedding two backward stepwise algorithms for variable selection for clustering and linear regression. The model identifiability is established and the consistency of the resulting criterion is proved under regularity conditions. Numerical experiments on simulated datasets and a genomic application highlight the interest of the procedure.


Subject(s)
Biometry/methods , Clinical Trials as Topic , Cluster Analysis , Data Interpretation, Statistical , Effect Modifier, Epidemiologic , Models, Statistical , Regression Analysis , Computer Simulation , Normal Distribution , Proportional Hazards Models
7.
Lifetime Data Anal ; 12(4): 481-504, 2006 Dec.
Article in English | MEDLINE | ID: mdl-17021959

ABSTRACT

A simple competing risk distribution is proposed as a possible alternative to the Weibull distribution in lifetime analysis. This distribution corresponds to the minimum of an exponential and a Weibull lifetime. Our motivation is to take account of both accidental and aging failures in lifetime data analysis. First, the main characteristics of this distribution are presented. Then, the estimation of its parameters is considered through maximum likelihood and Bayesian inference. In particular, the existence of a unique consistent root of the likelihood equations is proved. Decision tests to choose among an exponential, a Weibull and this competing risk distribution are presented. Finally, this alternative model is compared to the Weibull model in numerical experiments on both real and simulated data sets, especially in an industrial context.


Subject(s)
Aging , Models, Statistical , Algorithms , Bayes Theorem , Biometry , Humans , Life Expectancy , Likelihood Functions , Mortality , Risk , Time Factors
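
As a small illustration of the distribution described in the abstract above, the sketch below computes the survival and hazard functions of the minimum of an exponential and a Weibull lifetime and draws samples from it. It assumes the two lifetimes are independent, so that the survival function is the product of the two survival functions, and it assumes a rate parameter `lam` for the exponential part and a (shape, scale) parameterization for the Weibull part; these choices and the function names are illustrative and may differ from the notation of the paper.

```python
import math
import random


def survival(t, lam, shape, scale):
    """P(T > t) for T = min(X, Y), with X ~ Exp(rate lam) and
    Y ~ Weibull(shape, scale), assumed independent."""
    return math.exp(-lam * t - (t / scale) ** shape)


def hazard(t, lam, shape, scale):
    """Hazard rate for t > 0: a constant (accidental) term plus
    a Weibull (aging) term."""
    return lam + (shape / scale) * (t / scale) ** (shape - 1)


def sample(rng, lam, shape, scale):
    """Draw one lifetime as the minimum of the two competing risks."""
    x = rng.expovariate(lam)
    # random.weibullvariate takes the scale first, then the shape.
    y = rng.weibullvariate(scale, shape)
    return min(x, y)
```

With a shape parameter greater than 1, the hazard starts at the constant accidental rate `lam` and increases with age, which matches the motivation of accounting for both accidental and aging failures.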
8.
IEEE Trans Pattern Anal Mach Intell ; 28(4): 544-54, 2006 Apr.
Article in English | MEDLINE | ID: mdl-16566504

ABSTRACT

This paper is concerned with the selection of a generative model for supervised classification. Classical criteria for model selection assess the fit of a model rather than its ability to produce a low classification error rate. A new criterion, the Bayesian Entropy Criterion (BEC), is proposed. This criterion takes into account the decisional purpose of a model by minimizing the integrated classification entropy. It provides an interesting alternative to the cross-validated error rate which is computationally expensive. The asymptotic behavior of the BEC criterion is presented. Numerical experiments on both simulated and real data sets show that BEC performs better than the BIC criterion to select a model minimizing the classification error rate and provides analogous performance to the cross-validated error rate.


Subject(s)
Algorithms , Artificial Intelligence , Cluster Analysis , Information Storage and Retrieval/methods , Models, Statistical , Pattern Recognition, Automated/methods , Computer Simulation