1.
Bioinformatics ; 34(16): 2773-2780, 2018 08 15.
Article in English | MEDLINE | ID: mdl-29547902

ABSTRACT

Motivation: Large scale genome-wide association studies (GWAS) are tools of choice for discovering associations between genotypes and phenotypes. To date, many studies rely on univariate statistical tests for association between the phenotype and each assayed single nucleotide polymorphism (SNP). However, interaction between SNPs, namely epistasis, must be considered when tackling the complexity of underlying biological mechanisms. Epistasis analysis at large scale entails a prohibitive computational burden when addressing the detection of more than two interacting SNPs. In this paper, we introduce a stochastic causal graph-based method, SMMB, to analyze epistatic patterns in GWAS data. Results: We present the Stochastic Multiple Markov Blanket algorithm (SMMB), which combines an ensemble stochastic strategy inspired by random forests with Bayesian Markov blanket-based methods. We compared SMMB with three other recent algorithms using both simulated and real datasets. Our method outperforms the other compared methods for a majority of simulated cases of 2-way and 3-way epistasis patterns (especially in scenarios where minor allele frequencies of causal SNPs are low). Our approach performs similarly to two other compared methods for large real datasets, in terms of power, and runs faster. Availability and implementation: Parallel version available on https://ls2n.fr/listelogicielsequipe/DUKe/128/. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Epistasis, Genetic , Genome-Wide Association Study/methods , Polymorphism, Single Nucleotide , Bayes Theorem , Humans
2.
BMC Bioinformatics ; 19(1): 106, 2018 03 27.
Article in English | MEDLINE | ID: mdl-29587628

ABSTRACT

BACKGROUND: Genome-wide association studies (GWASs) have been widely used to discover the genetic basis of complex phenotypes. However, standard single-SNP GWASs suffer from lack of power. In particular, they do not directly account for linkage disequilibrium, that is, the dependences between SNPs (Single Nucleotide Polymorphisms). RESULTS: We present a comparative study of two multilocus GWAS strategies within a random forest-based framework. The first method, T-Trees, was designed by Botta and collaborators (Botta et al., PLoS ONE 9(4):e93379, 2014). We designed the other method, which is an innovative hybrid method combining T-Trees with the modeling of linkage disequilibrium. Linkage disequilibrium is modeled through a collection of tree-shaped Bayesian networks with latent variables, following our earlier work (Mourad et al., BMC Bioinformatics 12(1):16, 2011). We compared the two methods on both simulated and real data. For dominant and additive genetic models, in each of the simulated conditions, the hybrid approach always performs slightly better than T-Trees. We assessed predictive power through the standard ROC technique on 14 real datasets. For 10 of the 14 datasets analyzed, the already high predictive power observed for T-Trees (0.910-0.946) can still be increased by up to 0.030. We also assessed whether the distributions of SNPs' scores obtained from T-Trees and the hybrid approach differed. Finally, we thoroughly analyzed the intersections of the top 100 SNPs output by any two or all three methods amongst T-Trees, the hybrid approach, and the single-SNP method. CONCLUSIONS: The sophistication of T-Trees through finer linkage disequilibrium modeling is shown to be beneficial. The distributions of SNPs' scores generated by T-Trees and the hybrid approach are shown to be statistically different, which suggests complementarity of the methods. In particular, for 12 of the 14 real datasets, the distribution tail of highest SNPs' scores shows larger values for the hybrid approach. The hybrid approach thus pinpoints more interesting SNPs than T-Trees does, which can be provided as a short list of prioritized SNPs for further analysis by biologists. Finally, among the 211 top-100 SNPs jointly detected by the single-SNP method, T-Trees and the hybrid approach over the 14 datasets, we identified 72 and 38 SNPs present in the top 25 and top 10, respectively, of each method.
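
The abstract above reports predictive power measured with the standard ROC technique. As a minimal sketch of what such a figure means, the function below computes the area under the ROC curve as the probability that a case is scored above a control. The function name and the toy scores are assumptions made for illustration; this is not the evaluation code used in the study.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve by pairwise comparison of cases and controls.

    The value equals the probability that a randomly chosen case (label 1)
    receives a higher score than a randomly chosen control (label 0).
    Tied scores count as half a win.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


if __name__ == "__main__":
    # Toy data (hypothetical): two cases, two controls.
    print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

The pairwise form is quadratic in the number of individuals, which is acceptable for a sketch; a rank-based computation would be preferable on large cohorts.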


Subject(s)
Algorithms , Genetic Loci , Genome-Wide Association Study , Linkage Disequilibrium/genetics , Models, Genetic , Bayes Theorem , Chromosomes, Human, Pair 22/genetics , Computer Simulation , Databases, Genetic , Humans , Phenotype , Polymorphism, Single Nucleotide/genetics
3.
Front Genet ; 6: 285, 2015.
Article in English | MEDLINE | ID: mdl-26442103

ABSTRACT

During the past decade, findings of genome-wide association studies (GWAS) improved our knowledge and understanding of disease genetics. To date, thousands of SNPs have been associated with diseases and other complex traits. Statistical analysis typically looks for association between a phenotype and a SNP taken individually via single-locus tests. However, geneticists admit this is an oversimplified approach to tackle the complexity of underlying biological mechanisms. Interaction between SNPs, namely epistasis, must be considered. Unfortunately, epistasis detection gives rise to analytic challenges since analyzing every SNP combination is at present impractical at a genome-wide scale. In this review, we will present the main strategies recently proposed to detect epistatic interactions, along with their operating principle. Some of these methods are exhaustive, such as multifactor dimensionality reduction, likelihood ratio-based tests or receiver operating characteristic curve analysis; some are non-exhaustive, such as machine learning techniques (random forests, Bayesian networks) or combinatorial optimization approaches (ant colony optimization, computational evolution system).
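
To give a sense of why analyzing every SNP combination is impractical at genome-wide scale, the sketch below counts the unordered SNP sets of a given interaction order. The function name and the panel size of 500,000 SNPs are illustrative assumptions, not figures taken from the review.

```python
from math import comb


def count_snp_combinations(n_snps: int, order: int) -> int:
    """Number of distinct unordered SNP sets of size `order` among `n_snps` SNPs."""
    return comb(n_snps, order)


if __name__ == "__main__":
    n_snps = 500_000  # assumed panel size, for illustration only
    for order in (2, 3):
        total = count_snp_combinations(n_snps, order)
        print(f"{order}-way combinations: {total:,}")
```

Under this assumption there are about 1.25 x 10^11 pairs and about 2.1 x 10^16 triples, so each step up in interaction order multiplies the number of tests by several orders of magnitude.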

4.
Brief Bioinform ; 13(1): 20-33, 2012 Jan.
Article in English | MEDLINE | ID: mdl-21450805

ABSTRACT

Probabilistic graphical models have been widely recognized as a powerful formalism in the bioinformatics field, especially in gene expression studies and linkage analysis. Although less well known in association genetics, many successful methods have recently emerged to dissect the genetic architecture of complex diseases. In this review article, we cover the applications of these models to the population association studies' context, such as linkage disequilibrium modeling, fine mapping and candidate gene studies, and genome-scale association studies. Significant breakthroughs of the corresponding methods are highlighted, but emphasis is also given to their current limitations, in particular, to the issue of scalability. Finally, we give promising directions for future research in this field.


Subject(s)
Computational Biology/methods , Genetic Association Studies/methods , Models, Statistical , Animals , Genetic Linkage , Genome , Humans , Linkage Disequilibrium , Models, Genetic
5.
PLoS One ; 6(12): e27320, 2011.
Article in English | MEDLINE | ID: mdl-22174739

ABSTRACT

The study of linkage disequilibrium is a major issue in statistical genetics, as it plays a fundamental role in gene mapping and helps us to learn more about human history. The complex structure of linkage disequilibrium makes its exploratory data analysis essential yet challenging. Visualization methods, such as the triangular heat map implemented in Haploview, provide simple and useful tools to help understand complex genetic patterns, but remain insufficient to fully describe them. Probabilistic graphical models have been widely recognized as a powerful formalism allowing a concise and accurate modeling of dependences between variables. In this paper, we propose a method for short-range, long-range and chromosome-wide linkage disequilibrium visualization using forests of hierarchical latent class models. Thanks to its hierarchical nature, our method is shown to provide a compact view of both pairwise and multilocus linkage disequilibrium spatial structures for the geneticist. In addition, a multilocus linkage disequilibrium measure has been designed to evaluate linkage disequilibrium in hierarchy clusters. To learn the proposed model, a new scalable algorithm is presented. It constrains the dependence scope, relying on physical positions, and is able to deal with more than one hundred thousand single nucleotide polymorphisms. The proposed algorithm is fast and does not require phased genotypic data.


Subject(s)
Genetic Loci/genetics , Linkage Disequilibrium/genetics , Models, Genetic , Chromosomes, Human, Pair 1/genetics , Databases, Genetic , Genetics, Population , Humans , Polymorphism, Single Nucleotide
6.
BMC Bioinformatics ; 12: 16, 2011 Jan 12.
Article in English | MEDLINE | ID: mdl-21226914

ABSTRACT

BACKGROUND: Discovering the genetic basis of common genetic diseases in the human genome represents a public health issue. However, the dimensionality of the genetic data (up to 1 million genetic markers) and its complexity make the statistical analysis a challenging task. RESULTS: We present an accurate modeling of dependences between genetic markers, based on a forest of hierarchical latent class models which is a particular class of probabilistic graphical models. This model offers an adapted framework to deal with the fuzzy nature of linkage disequilibrium blocks. In addition, the data dimensionality can be reduced through the latent variables of the model which synthesize the information borne by genetic markers. In order to tackle the learning of both forest structure and probability distributions, a generic algorithm has been proposed. A first implementation of our algorithm has been shown to be tractable on benchmarks describing 10^5 variables for 2000 individuals. CONCLUSIONS: The forest of hierarchical latent class models offers several advantages for genome-wide association studies: accurate modeling of linkage disequilibrium, flexible data dimensionality reduction and biological meaning borne by latent variables.


Subject(s)
Genetic Diseases, Inborn/genetics , Genome-Wide Association Study , Linkage Disequilibrium , Models, Statistical , Algorithms , Bayes Theorem , Genetic Markers , Genome, Human , Humans , Polymorphism, Single Nucleotide
7.
J Bioinform Comput Biol ; 7(5): 833-52, 2009 Oct.
Article in English | MEDLINE | ID: mdl-19785048

ABSTRACT

Although the quality of high-throughput genotyping techniques keeps improving, missing data remain fairly common. Studies have shown that even a low percentage of missing SNPs is detrimental to the reliability of downstream analyses such as SNP-disease association tests. This paper investigates the potential for improving the accuracy of an SNP inference method based on the algorithm formerly designed by Roberts and co-workers (NPUTE, 2007). This initial algorithm performs a single scan of an SNP array, inferring missing SNPs in the context of sliding windows. We first designed a variant, KNNWinOpti, which fully exploits backward and forward dependencies between the overlapping windows and thus restores the genuine dependency of inference upon scanning direction. Our major contribution, the SNPShuttle algorithm, therefore iterates bi-directional scanning to predict SNP values with more confidence. We ran simulations on realistic benchmarks built from the high resolution map of mouse strains published by the Perlegen Project. For each of the 20 mouse chromosomes, and for missing data percentages ranging from 5% to 30%, SNPShuttle always improved on KNNWinOpti's already high accuracy.
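
The idea of inferring missing SNPs within sliding windows can be sketched as follows. This is a deliberately simplified nearest-neighbour illustration under stated assumptions (binary alleles, one strain per row, a symmetric window, a single scan). It is not the NPUTE, KNNWinOpti or SNPShuttle algorithm, and all names and parameters are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence

# Rows are strains, columns are SNPs, None marks a missing value.
Genotypes = List[List[Optional[int]]]


def window_distance(a: Sequence[Optional[int]],
                    b: Sequence[Optional[int]],
                    cols: Sequence[int]) -> int:
    """Count mismatches between two strains over window columns known in both."""
    return sum(
        1 for c in cols
        if a[c] is not None and b[c] is not None and a[c] != b[c]
    )


def impute_window_knn(data: Genotypes, half_width: int = 2, k: int = 1) -> Genotypes:
    """Fill each missing value by majority vote among the k most similar strains.

    Similarity is measured only on the SNPs surrounding the missing position,
    so the neighbourhood changes as the window slides along the array.
    The input is left unchanged; a filled copy is returned.
    """
    result = [row[:] for row in data]
    n_snps = len(data[0]) if data else 0
    for j in range(n_snps):
        lo, hi = max(0, j - half_width), min(n_snps, j + half_width + 1)
        cols = [c for c in range(lo, hi) if c != j]
        for i, row in enumerate(data):
            if row[j] is not None:
                continue
            # Only strains with an observed allele at SNP j can act as donors.
            donors = [other for m, other in enumerate(data)
                      if m != i and other[j] is not None]
            if not donors:
                continue  # nothing to infer from; leave the value missing
            donors.sort(key=lambda other: window_distance(row, other, cols))
            votes = Counter(other[j] for other in donors[:k])
            result[i][j] = votes.most_common(1)[0][0]
    return result
```

Distances are computed on the original observations rather than on previously imputed values, which keeps this single scan independent of direction; the abstract's point is precisely that more elaborate schemes must handle that dependency explicitly.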


Subject(s)
Algorithms , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Polymorphism, Single Nucleotide , Animals , Chromosomes/genetics , Computational Biology , Haplotypes , Mice
8.
Biosystems ; 98(3): 149-59, 2009 Dec.
Article in English | MEDLINE | ID: mdl-19446002

ABSTRACT

The modelling of gene regulatory networks (GRNs) has classically been addressed through very different approaches. Among others, extensions of Thomas's asynchronous Boolean approach have been proposed to better fit the dynamics of biological systems: genes may reach different discrete expression levels, depending on the states of other genes, called the regulators; thus, activations and inhibitions are triggered conditionally on the proper expression levels of these regulators. In contrast, some fine-grained propositions have focused on the molecular level, modelling the evolution of biological compound concentrations through differential equation systems. Both approaches are limited. The first one leads to an oversimplification of the system, whereas the second is incapable of tackling large GRNs. In this context, hybrid paradigms that mix discrete and continuous features underlying distinct biological properties achieve significant advances for investigating biological properties. One of these hybrid formalisms proposes to focus, within a GRN abstraction, on the time delay to pass from a gene expression level to the next. Until now, no research work has attempted to benefit from the modelling of a GRN by differential equations by converting it into a multi-valued logical formalism of Thomas, with the aim of performing biological applications. This paper fills this gap by describing a whole pipelined process which orchestrates the following stages: (i) model conversion from a piece-wise affine differential equation (PADE) modelling scheme into a discrete model with focal points, (ii) characterization of subgraphs through a graph simplification phase which is based on probabilistic criteria, (iii) conversion of the subgraphs into parametric linear hybrid automata, (iv) analysis of dynamical properties (e.g. cyclic behaviours) using hybrid model-checking techniques. The present work is the outcome of a methodological investigation launched to cope with the GRN responsible for the reaction of the Escherichia coli bacterium to carbon starvation. As expected, we retrieve a remarkable cycle already exhibited by a previous analysis of the PADE model. Above all, hybrid model-checking enables us to infer temporal properties, whose biological significance is then discussed.
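
The discrete, asynchronous dynamics mentioned in the abstract above can be illustrated with a small sketch. In a Thomas-style multi-valued logical model, each gene has a target level determined by its regulators, and only one gene at a time moves one level toward its target. The two-gene network below is a hypothetical toy, not the E. coli carbon starvation network, and the function names are assumptions.

```python
from typing import Callable, List, Tuple

State = Tuple[int, ...]


def async_successors(state: State,
                     targets: List[Callable[[State], int]]) -> List[State]:
    """Asynchronous successors: one gene at a time moves one level toward its target."""
    successors = []
    for gene, target in enumerate(targets):
        goal = target(state)
        if goal == state[gene]:
            continue  # this gene is already at its target level
        step = 1 if goal > state[gene] else -1
        nxt = list(state)
        nxt[gene] += step
        successors.append(tuple(nxt))
    return successors


# Hypothetical two-gene negative feedback loop: x activates y, y inhibits x.
toy_targets = [
    lambda s: 0 if s[1] >= 1 else 1,  # x is expressed only while y is off
    lambda s: 1 if s[0] >= 1 else 0,  # y is expressed only while x is on
]

if __name__ == "__main__":
    state = (0, 0)
    for _ in range(4):
        nxt = async_successors(state, toy_targets)
        print(state, "->", nxt)
        state = nxt[0]
```

In this toy network every state has exactly one successor and the trajectory returns to its starting state after four steps, a simple instance of the cyclic behaviour that the model-checking stage is meant to detect in larger networks.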


Subject(s)
Gene Regulatory Networks , Models, Theoretical , Automation , Carbon/metabolism , Escherichia coli/metabolism
9.
Nucleic Acids Res ; 36(10): 3332-40, 2008 Jun.
Article in English | MEDLINE | ID: mdl-18440978

ABSTRACT

This article compares 32 bacterial genomes with respect to their potential for high transcription. The sigma70 promoter has been widely studied for the Escherichia coli model and a consensus is known. Since transcriptional regulations are known to compensate for promoter weakness (i.e. when the promoter similarity with regard to the consensus is rather low), predicting functional promoters is a hard task. Instead, the research work presented here comes within the scope of investigating potentially high ORF expression, in relation to three criteria: (i) high similarity to the sigma70 consensus (namely, the consensus variant appropriate for each genome), (ii) transcription strength reinforcement through a supplementary binding site, the upstream promoter (UP) element, and (iii) enhancement through an optimal Shine-Dalgarno (SD) sequence. We show that in the AT-rich Firmicutes' genomes, frequencies of potentially strong sigma70-like promoters are exceptionally high. In addition, though they contain a low number of strong promoters (SPs), some genomes may show a high proportion of promoters harbouring a UP element. Putative SPs of lesser quality are more frequently associated with a UP element than putative SPs of better quality. A meaningful difference is statistically ascertained when comparing bacterial genomes with similarly AT-rich genomes generated at random; the difference is the highest for Firmicutes. Comparing some Firmicutes genomes with similarly AT-rich Proteobacteria genomes, we confirm the Firmicutes specificity. We show that this specificity is explained neither by AT-bias nor by genome size bias; neither does it originate in the abundance of optimal SD sequences, a typical and significant feature of Firmicutes more thoroughly analysed in our study.
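
Criterion (i) above, similarity to the sigma70 consensus, can be illustrated with a simple match count. The sketch assumes the commonly cited E. coli hexamers TTGACA (-35 box) and TATAAT (-10 box) and ignores spacer length; the study itself uses a consensus variant appropriate for each genome, and its actual scoring scheme is not given in the abstract, so the names and the scoring rule here are assumptions.

```python
# Commonly cited E. coli sigma70 consensus hexamers (assumed here for illustration).
ECOLI_MINUS35 = "TTGACA"
ECOLI_MINUS10 = "TATAAT"


def count_matches(site: str, consensus: str) -> int:
    """Number of positions at which a candidate site agrees with the consensus."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(a == b for a, b in zip(site.upper(), consensus.upper()))


def promoter_similarity(minus35: str, minus10: str,
                        consensus35: str = ECOLI_MINUS35,
                        consensus10: str = ECOLI_MINUS10) -> int:
    """Total matches over both boxes; 12 means a perfect match to both hexamers."""
    return count_matches(minus35, consensus35) + count_matches(minus10, consensus10)


if __name__ == "__main__":
    # Hypothetical candidate with one mismatch in each box.
    print(promoter_similarity("TTGACT", "TAAAAT"))
```

Passing other hexamers as `consensus35` and `consensus10` is one way to express a genome-specific consensus variant in this toy setting.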


Subject(s)
DNA-Directed RNA Polymerases/metabolism , Genome, Bacterial , Promoter Regions, Genetic , Sigma Factor/metabolism , Transcription, Genetic , AT Rich Sequence , Base Sequence , Computational Biology , Consensus Sequence , Data Interpretation, Statistical , Enhancer Elements, Genetic , Escherichia coli/genetics , Genomics , Open Reading Frames , Thermotoga maritima/genetics