Results 1 - 7 of 7
1.
BMC Bioinformatics ; 7 Suppl 1: S5, 2006 Mar 20.
Article in English | MEDLINE | ID: mdl-16723008

ABSTRACT

BACKGROUND: We have recently introduced a predictive framework for studying gene transcriptional regulation in simpler organisms using a novel supervised learning algorithm called GeneClass. GeneClass is motivated by the hypothesis that in model organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular microarray experiment based on the presence of binding site subsequences ("motifs") in the gene's regulatory region and the expression levels of regulators such as transcription factors in the experiment ("parents"). GeneClass formulates the learning task as a classification problem--predicting +1 and -1 labels corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. Using the Adaboost algorithm, GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree. METHODS: In the current work, we introduce a new, robust version of the GeneClass algorithm that increases stability and computational efficiency, yielding a more scalable and reliable predictive model. The improved stability of the prediction tree enables us to introduce a detailed post-processing framework for biological interpretation, including individual and group target gene analysis to reveal condition-specific regulation programs and to suggest signaling pathways. Robust GeneClass uses a novel stabilized variant of boosting that allows a set of correlated features, rather than single features, to be included at nodes of the tree; in this way, biologically important features that are correlated with the single best feature are retained rather than decorrelated and lost in the next round of boosting. 
Other computational developments include fast matrix computation of the loss function for all features, allowing scalability to large datasets, and the use of abstaining weak rules, which results in a shallower and more interpretable tree. We also show how to incorporate genome-wide protein-DNA binding data from ChIP-chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data. RESULTS: Using the improved scalability of Robust GeneClass, we present larger-scale experiments on a yeast environmental stress dataset, training and testing on all genes and using a comprehensive set of potential regulators. We demonstrate the improved stability of the features in the learned prediction tree, and we show the utility of the post-processing framework by analyzing two groups of genes in yeast--the protein chaperones and a set of putative targets of the Nrg1 and Nrg2 transcription factors--and suggesting novel hypotheses about their transcriptional and post-transcriptional regulation. Detailed results and Robust GeneClass source code are available for download from http://www.cs.columbia.edu/compbio/robust-geneclass.
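
The boosting step named in the abstract above can be illustrated with a minimal sketch. It shows plain AdaBoost with single-feature threshold rules on +1/-1 labels; it is not the alternating decision tree or the stabilized boosting variant the authors describe. The function names, the two toy features (a motif indicator and a regulator state), and the data values are all hypothetical.

```python
import math

def train_adaboost(X, y, rounds):
    """Plain AdaBoost with threshold stumps; X holds feature rows, y holds +1/-1 labels."""
    n = len(X)
    weights = [1.0 / n] * n
    model = []  # each entry: (alpha, feature index, threshold, sign)
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for j in range(len(X[0])):
            for thr in sorted({row[j] for row in X}):
                for sign in (1, -1):
                    preds = [sign if row[j] >= thr else -sign for row in X]
                    err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign, preds)
        err, j, thr, sign, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, thr, sign))
        # Misclassified examples gain weight, correctly classified ones lose it.
        weights = [w * math.exp(-alpha * p * t) for w, p, t in zip(weights, preds, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, row):
    """Sign of the weighted vote of all stumps."""
    score = sum(alpha * (sign if row[j] >= thr else -sign)
                for alpha, j, thr, sign in model)
    return 1 if score >= 0 else -1

# Hypothetical toy data: [motif present (0/1), regulator state (-1/+1)].
# A gene-experiment pair is labeled +1 only when both features are "on".
TOY_X = [[1, 1], [1, -1], [0, 1], [0, -1], [1, 1], [0, -1]]
TOY_Y = [1, -1, -1, -1, 1, -1]

if __name__ == "__main__":
    toy_model = train_adaboost(TOY_X, TOY_Y, 3)
    print([predict(toy_model, row) for row in TOY_X])
```

No single stump separates this toy set, but the weighted vote of three stumps does, which is the basic effect boosting relies on.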


Subjects
Computational Biology/methods , Gene Expression Profiling/methods , Gene Expression Regulation , Algorithms , Amino Acid Motifs , Binding Sites , Statistical Data Interpretation , Protein Databases , Fungal Proteins/chemistry , Heat-Shock Proteins/metabolism , Molecular Chaperones/chemistry , Oligonucleotide Array Sequence Analysis/methods , Saccharomyces cerevisiae/metabolism
2.
Phys Rev E Stat Nonlin Soft Matter Phys ; 71(4 Pt 2): 046117, 2005 Apr.
Article in English | MEDLINE | ID: mdl-15903736

ABSTRACT

Exploiting recent developments in information theory, we propose, illustrate, and validate a principled information-theoretic algorithm for module discovery and the resulting measure of network modularity. This measure is an order parameter (a dimensionless number between 0 and 1). Comparison is made with other approaches to module discovery and to quantifying network modularity (using Monte Carlo generated Erdős-like modular networks). Finally, the network information bottleneck (NIB) algorithm is applied to a number of real world networks, including the "social" network of co-authors at the 2004 APS March Meeting.


Subjects
Biophysics/methods , Systems Theory , Algorithms , Bacterial Physiological Phenomena , Computer Simulation , Escherichia coli/physiology , Statistical Models , Monte Carlo Method , Computer Neural Networks , Systems Analysis , Genetic Transcription
3.
Phys Rev E Stat Nonlin Soft Matter Phys ; 71(1 Pt 2): 016110, 2005 Jan.
Article in English | MEDLINE | ID: mdl-15697661

ABSTRACT

We present a graph embedding space (i.e., a set of measures on graphs) for performing statistical analyses of networks. Key improvements over existing approaches include discovery of "motif hubs" (multiple overlapping significant subgraphs), computational efficiency relative to subgraph census, and flexibility (the method is easily generalizable to weighted and signed graphs). The embedding space is based on scalars, functionals of the adjacency matrix representing the network. Scalars are global, involving all nodes; although they can be related to subgraph enumeration, there is not a one-to-one mapping between scalars and subgraphs. Improvements in network randomization and significance testing--we learn the distribution rather than assuming Gaussianity--are also presented. The resulting algorithm establishes a systematic approach to the identification of the most significant scalars and suggests machine-learning techniques for network classification.
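
As a rough illustration of a scalar that is a functional of the adjacency matrix, the sketch below computes the trace of a matrix power, which counts closed walks. Whether the authors' scalars take exactly this form is an assumption; the abstract says only that they are functionals of the adjacency matrix. The function names and the example graph are hypothetical.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def closed_walk_scalar(adj, length):
    """Trace of adj**length, i.e. the number of closed walks of that length."""
    power = adj
    for _ in range(length - 1):
        power = matmul(power, adj)
    return sum(power[i][i] for i in range(len(power)))

# Hypothetical undirected graph: triangle 0-1-2 plus a pendant edge 2-3.
ADJ = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

# In a simple undirected graph each triangle contributes 6 closed 3-walks.
TRIANGLES = closed_walk_scalar(ADJ, 3) // 6

if __name__ == "__main__":
    print(closed_walk_scalar(ADJ, 2), closed_walk_scalar(ADJ, 3), TRIANGLES)
```

The scalar involves every node at once, yet for length 3 it reduces to a triangle count. For longer walks the same trace mixes contributions from several different subgraphs, consistent with the abstract's remark that scalars and subgraphs do not map one-to-one.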


Subjects
Computational Biology/methods , Computer Neural Networks , Algorithms , Artificial Intelligence , Computer Simulation , Escherichia coli/physiology , Normal Distribution , Saccharomyces cerevisiae/physiology
4.
Proc Natl Acad Sci U S A ; 102(9): 3192-7, 2005 Mar 01.
Article in English | MEDLINE | ID: mdl-15728374

ABSTRACT

Naturally occurring networks exhibit quantitative features revealing underlying growth mechanisms. Numerous network mechanisms have recently been proposed to reproduce specific properties such as degree distributions or clustering coefficients. We present a method for inferring the mechanism most accurately capturing a given network topology, exploiting discriminative tools from machine learning. The Drosophila melanogaster protein network is confidently and robustly (to noise and training data subsampling) classified as a duplication-mutation-complementation network over preferential attachment, small-world, and a duplication-mutation mechanism without complementation. Systematic classification, rather than statistical study of specific properties, provides a discriminative approach to understand the design of complex networks.


Subjects
Drosophila Proteins/metabolism , Algorithms , Animals , Drosophila melanogaster , Protein Binding
5.
Article in English | MEDLINE | ID: mdl-17044183

ABSTRACT

Our goal is to cluster genes into transcriptional modules--sets of genes where similarity in expression is explained by common regulatory mechanisms at the transcriptional level. We want to learn modules from both time series gene expression data and genome-wide motif data that are now readily available for organisms such as S. cerevisiae as a result of prior computational studies or experimental results. We present a generative probabilistic model for combining regulatory sequence and time series expression data to cluster genes into coherent transcriptional modules. Starting with a set of motifs representing known or putative regulatory elements (transcription factor binding sites) and the counts of occurrences of these motifs in each gene's promoter region, together with a time series expression profile for each gene, the learning algorithm uses expectation maximization to learn module assignments based on both types of data. We also present a technique based on the Jensen-Shannon entropy contributions of motifs in the learned model for associating the most significant motifs with each module. Thus, the algorithm gives a global approach for associating sets of regulatory elements with "modules" of genes with similar time series expression profiles. The model for expression data exploits our prior belief of smooth dependence on time by using statistical splines and is suitable for typical time course data sets with relatively few experiments. Moreover, the model is sufficiently interpretable that we can understand how both sequence data and expression data contribute to the cluster assignments, and how to interpolate between the two data sources. We present experimental results on the yeast cell cycle to validate our method and find that our combined expression and motif clustering algorithm discovers modules with both coherent expression and similar motif patterns, including binding motifs associated with known cell cycle transcription factors.
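
For readers unfamiliar with the Jensen-Shannon quantity mentioned above, here is a minimal sketch of the standard Jensen-Shannon divergence between two discrete distributions, in bits. The abstract does not say how per-motif contributions are derived from it, so this shows only the textbook definition; the function names and the motif frequencies are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def jensen_shannon(p, q):
    """Entropy of the 50/50 mixture minus the mean entropy of p and q."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(mixture) - (entropy(p) + entropy(q)) / 2

if __name__ == "__main__":
    # Hypothetical motif frequencies in one module versus the background.
    module_freqs = [0.7, 0.2, 0.1]
    background_freqs = [0.3, 0.4, 0.3]
    print(jensen_shannon(module_freqs, background_freqs))
```

The divergence is 0 for identical distributions and reaches 1 bit when the two distributions share no support, so larger values indicate motif profiles that differ more from the reference.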


Subjects
Algorithms , Artificial Intelligence , Gene Expression Profiling/methods , Gene Expression Regulation/physiology , Multigene Family/physiology , Oligonucleotide Array Sequence Analysis/methods , Transcription Factors/metabolism , Computer Simulation , Genetic Models , Statistical Models , Automated Pattern Recognition/methods , DNA Sequence Analysis/methods , Time Factors
6.
BMC Bioinformatics ; 5: 181, 2004 Nov 22.
Article in English | MEDLINE | ID: mdl-15555081

ABSTRACT

BACKGROUND: Recent genomic and bioinformatic advances have motivated the development of numerous network models intending to describe graphs of biological, technological, and sociological origin. In most cases the success of a model has been evaluated by how well it reproduces a few key features of the real-world data, such as degree distributions, mean geodesic lengths, and clustering coefficients. Often pairs of models can reproduce these features with indistinguishable fidelity despite being generated by vastly different mechanisms. In such cases, these few target features are insufficient to distinguish which of the different models best describes real world networks of interest; moreover, it is not clear a priori that any of the presently-existing algorithms for network generation offers a predictive description of the networks inspiring them. RESULTS: We present a method to assess systematically which of a set of proposed network generation algorithms gives the most accurate description of a given biological network. To derive discriminative classifiers, we construct a mapping from the set of all graphs to a high-dimensional (in principle infinite-dimensional) "word space". This map defines an input space for classification schemes which allow us to state unambiguously which models are most descriptive of a given network of interest. Our training sets include networks generated from 17 models either drawn from the literature or introduced in this work. We show that different duplication-mutation schemes best describe the E. coli genetic network, the S. cerevisiae protein interaction network, and the C. elegans neuronal network, out of a set of network models including a linear preferential attachment model and a small-world model. CONCLUSIONS: Our method is a first step towards systematizing network models and assessing their predictability, and we anticipate its usefulness for a number of communities.


Subjects
Computational Biology/methods , Biological Models , Computer Neural Networks , Animals , Caenorhabditis elegans/physiology , Escherichia coli K12/genetics , Genetic Models , Neurological Models , Nerve Net/physiology , Protein Interaction Mapping , Saccharomyces cerevisiae/physiology , Saccharomyces cerevisiae Proteins/metabolism
7.
Bioinformatics ; 20 Suppl 1: i232-40, 2004 Aug 04.
Article in English | MEDLINE | ID: mdl-15262804

ABSTRACT

MOTIVATION: Studying gene regulatory mechanisms in simple model organisms through analysis of high-throughput genomic data has emerged as a central problem in computational biology. Most approaches in the literature have focused either on finding a few strong regulatory patterns or on learning descriptive models from training data. However, these approaches are not yet adequate for making accurate predictions about which genes will be up- or down-regulated in new or held-out experiments. By introducing a predictive methodology for this problem, we can use powerful tools from machine learning and assess the statistical significance of our predictions. RESULTS: We present a novel classification-based method for learning to predict gene regulatory response. Our approach is motivated by the hypothesis that in simple organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular experiment based on (1) the presence of binding site subsequences ('motifs') in the gene's regulatory region and (2) the expression levels of regulators such as transcription factors in the experiment ('parents'). Thus, our learning task integrates two qualitatively different data sources: genome-wide cDNA microarray data across multiple perturbation and mutant experiments along with motif profile data from regulatory sequences. We convert the regression task of predicting real-valued gene expression measurements to a classification task of predicting +1 and -1 labels, corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. The learning algorithm employed is boosting with a margin-based generalization of decision trees, alternating decision trees. This large-margin classifier is sufficiently flexible to allow complex logical functions, yet sufficiently simple to give insight into the combinatorial mechanisms of gene regulation. 
We observe encouraging prediction accuracy on experiments based on the Gasch S. cerevisiae dataset, and we show that we can accurately predict up- and down-regulation on held-out experiments. We also show how to extract significant regulators, motifs and motif-regulator pairs from the learned models for various stress responses. Our method thus provides predictive hypotheses, suggests biological experiments, and provides interpretable insight into the structure of genetic regulatory networks. AVAILABILITY: The MLJava package is available upon request to the authors. Supplementary: Additional results are available from http://www.cs.columbia.edu/compbio/geneclass
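
The conversion from real-valued expression measurements to +1 and -1 labels described in this abstract could look roughly like the sketch below. The symmetric threshold, its value, and the choice to leave in-noise measurements unlabeled are all assumptions made for illustration; the abstract does not give the noise model. The function name and the sample values are hypothetical.

```python
def discretize(log_ratios, noise_threshold):
    """Map expression log-ratios to +1 / -1, or None when inside the noise band."""
    labels = []
    for value in log_ratios:
        if value > noise_threshold:
            labels.append(1)       # up-regulated beyond the assumed noise level
        elif value < -noise_threshold:
            labels.append(-1)      # down-regulated beyond the assumed noise level
        else:
            labels.append(None)    # indistinguishable from noise: left unlabeled
    return labels

if __name__ == "__main__":
    # Hypothetical log-ratios for one gene across four experiments.
    print(discretize([1.5, -2.0, 0.3, -0.4], noise_threshold=1.0))
```

Only the labeled measurements would then serve as training examples for a classifier in this sketch.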


Subjects
Chromosome Mapping/methods , Gene Expression Regulation/physiology , Genetic Models , Proteome/metabolism , Transcriptional Regulatory Elements/genetics , Signal Transduction/genetics , Transcription Factors/genetics , Binding Sites , Computer Simulation , Protein Binding , Saccharomyces cerevisiae Proteins/physiology , DNA Sequence Analysis/methods , Transcriptional Activation/physiology