Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 27
Filtrar
Mais filtros










Base de dados
Intervalo de ano de publicação
1.
Proc Natl Acad Sci U S A ; 110(26): 10872-7, 2013 Jun 25.
Artigo em Inglês | MEDLINE | ID: mdl-23757502

RESUMO

Biological process enrichment is a widely used metric for evaluating the quality of multiprotein modules. In this study, we examine possible optimization criteria for detecting homologous multiprotein modules and quantify their effects on biological process enrichment. We find that modularity, linear density, and module size are the most important criteria considered, complementary to each other, and that graph theoretical attributes account for 36% of the variance in biological process enrichment. Variations in protein interaction similarity within module pairs have only minor effects on biological process enrichment. As random modules increase in size, both biological process enrichment and modularity tend to improve, although modularity does not show this upward trend in modules with size at most 50 proteins. To adjust for these trends, we recommend a size correction based on random sampling of modules when using biological process enrichment or other attributes to evaluate module boundaries. Characteristics of homologous multiprotein modules optimized for each of the optimization criteria are examined.


Assuntos
Complexos Multiproteicos/química , Algoritmos , Animais , Fenômenos Biológicos , Bases de Dados de Proteínas , Proteínas de Drosophila/química , Proteínas de Drosophila/genética , Drosophila melanogaster/química , Drosophila melanogaster/genética , Humanos , Complexos Multiproteicos/genética , Mapas de Interação de Proteínas , Homologia Estrutural de Proteína , Biologia de Sistemas
2.
J Comput Biol ; 20(3): 249-57, 2013 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-23286509

RESUMO

Since the first emergence of protein-protein interaction networks more than a decade ago, they have been viewed as static scaffolds of the signaling-regulatory events taking place in cells, and their analysis has been mainly confined to topological aspects. Recently, functional models of these networks have been suggested, ranging from Boolean to constraint-based methods. However, learning such models from large-scale data remains a formidable task, and most modeling approaches rely on extensive human curation. Here we provide a generic approach to learning Boolean models automatically from data. We apply our approach to growth and inflammatory signaling systems in humans and show how the learning phase can improve the fit of the model to experimental data, remove spurious interactions, and lead to better understanding of the system at hand.


Assuntos
Modelos Biológicos , Transdução de Sinais , Algoritmos , Receptores ErbB/metabolismo , Humanos , Interleucina-1/metabolismo
3.
J Comput Biol ; 19(9): 998-1014, 2012 Sep.
Artigo em Inglês | MEDLINE | ID: mdl-22897201

RESUMO

Pedigree graphs, or family trees, are typically constructed by an expensive process of examining genealogical records to determine which pairs of individuals are parent and child. New methods to automate this process take as input genetic data from a set of extant individuals and reconstruct ancestral individuals. There is a great need to evaluate the quality of these methods by comparing the estimated pedigree to the true pedigree. In this article, we consider two main pedigree comparison problems. The first is the pedigree isomorphism problem, for which we present a linear-time algorithm for leaf-labeled pedigrees. The second is the pedigree edit distance problem, for which we present (1) several algorithms that are fast and exact in various special cases, and (2) a general, randomized heuristic algorithm. In the negative direction, we first prove that the pedigree isomorphism problem is as hard as the general graph isomorphism problem, and that the sub-pedigree isomorphism problem is NP-hard. We then show that the pedigree edit distance problem is APX-hard in general and NP-hard on leaf-labeled pedigrees. We use simulated pedigrees to compare our edit-distance algorithms to each other as well as to a branch-and-bound algorithm that always finds an optimal solution.


Assuntos
Algoritmos , Simulação por Computador , Modelos Genéticos , Linhagem , Inteligência Artificial , Humanos
4.
Artigo em Inglês | MEDLINE | ID: mdl-21968956

RESUMO

Detecting essential multiprotein modules that change infrequently during evolution is a challenging algorithmic task that is important for understanding the structure, function, and evolution of the biological cell. In this paper, we define a measure of modularity for interactomes and present a linear-time algorithm, Produles, for detecting multiprotein modularity conserved during evolution that improves on the running time of previous algorithms for related problems and offers desirable theoretical guarantees. We present a biologically motivated graph theoretic set of evaluation measures complementary to previous evaluation measures, demonstrate that Produles exhibits good performance by all measures, and describe certain recurrent anomalies in the performance of previous algorithms that are not detected by previous measures. Consideration of the newly defined measures and algorithm performance on these measures leads to useful insights on the nature of interactomics data and the goals of previous and current algorithms. Through randomization experiments, we demonstrate that conserved modularity is a defining characteristic of interactomes. Computational experiments on current experimentally derived interactomes for Homo sapiens and Drosophila melanogaster, combining results across algorithms, show that nearly 10 percent of current interactome proteins participate in multiprotein modules with good evidence in the protein interaction data of being conserved between human and Drosophila.


Assuntos
Algoritmos , Biologia Computacional/métodos , Evolução Molecular , Mapeamento de Interação de Proteínas/métodos , Animais , Drosophila melanogaster , Humanos , Mapas de Interação de Proteínas , Proteínas/química , Proteínas/genética , Proteínas/metabolismo , Reprodutibilidade dos Testes , Análise de Sequência de Proteína , Homologia de Sequência de Aminoácidos
5.
J Comput Biol ; 18(11): 1481-93, 2011 Nov.
Artigo em Inglês | MEDLINE | ID: mdl-22035331

RESUMO

Can we find the family trees, or pedigrees, that relate the haplotypes of a group of individuals? Collecting the genealogical information for how individuals are related is a very time-consuming and expensive process. Methods for automating the construction of pedigrees could stream-line this process. While constructing single-generation families is relatively easy given whole genome data, reconstructing multi-generational, possibly inbred, pedigrees is much more challenging. This article addresses the important question of reconstructing monogamous, regular pedigrees, where pedigrees are regular when individuals mate only with other individuals at the same generation. This article introduces two multi-generational pedigree reconstruction methods: one for inbreeding relationships and one for outbreeding relationships. In contrast to previous methods that focused on the independent estimation of relationship distances between every pair of typed individuals, here we present methods that aim at the reconstruction of the entire pedigree. We show that both our methods out-perform the state-of-the-art and that the outbreeding method is capable of reconstructing pedigrees at least six generations back in time with high accuracy. The two programs are available at http://cop.icsi.berkeley.edu/cop/.


Assuntos
Simulação por Computador , Modelos Genéticos , Linhagem , Algoritmos , Consanguinidade , Haplótipos , Humanos , Distribuição de Poisson , Recombinação Genética , Análise de Sequência de DNA
6.
PLoS One ; 5(10): e13367, 2010 Oct 19.
Artigo em Inglês | MEDLINE | ID: mdl-20976054

RESUMO

BACKGROUND: Molecular studies of the human disease transcriptome typically involve a search for genes whose expression is significantly dysregulated in sick individuals compared to healthy controls. Recent studies have found that only a small number of the genes in human disease-related pathways show consistent dysregulation in sick individuals. However, those studies found that some pathway genes are affected in most sick individuals, but genes can differ among individuals. While a pathway is usually defined as a set of genes known to share a specific function, pathway boundaries are frequently difficult to assign, and methods that rely on such definition cannot discover novel pathways. Protein interaction networks can potentially be used to overcome these problems. METHODOLOGY/PRINCIPAL FINDINGS: We present DEGAS (DysrEgulated Gene set Analysis via Subnetworks), a method for identifying connected gene subnetworks significantly enriched for genes that are dysregulated in specimens of a disease. We applied DEGAS to seven human diseases and obtained statistically significant results that appear to home in on compact pathways enriched with hallmarks of the diseases. In Parkinson's disease, we provide novel evidence for involvement of mRNA splicing, cell proliferation, and the 14-3-3 complex in the disease progression. DEGAS is available as part of the MATISSE software package (http://acgt.cs.tau.ac.il/matisse). CONCLUSIONS/SIGNIFICANCE: The subnetworks identified by DEGAS can provide a signature of the disease potentially useful for diagnosis, pinpoint possible pathways affected by the disease, and suggest targets for drug intervention.


Assuntos
Doença/genética , Regulação da Expressão Gênica , Algoritmos , Perfilação da Expressão Gênica , Humanos , Regulação para Cima
7.
J Comput Biol ; 17(3): 237-52, 2010 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-20377443

RESUMO

In the network querying problem, one is given a protein complex or pathway of species A and a protein-protein interaction network of species B; the goal is to identify subnetworks of B that are similar to the query in terms of sequence, topology, or both. Existing approaches mostly depend on knowledge of the interaction topology of the query in the network of species A; however, in practice, this topology is often not known. To address this problem, we develop a topology-free querying algorithm, which we call Torque. Given a query, represented as a set of proteins, Torque seeks a matching set of proteins that are sequence-similar to the query proteins and span a connected region of the network, while allowing both insertions and deletions. The algorithm uses alternatively dynamic programming and integer linear programming for the search task. We test Torque with queries from yeast, fly, and human, where we compare it to the QNet topology-based approach, and with queries from less studied species, where only topology-free algorithms apply. Torque detects many more matches than QNet, while giving results that are highly functionally coherent.


Assuntos
Mapeamento de Interação de Proteínas/métodos , Proteínas/metabolismo , Algoritmos , Animais , Biologia Computacional , Drosophila/metabolismo , Humanos , Ligação Proteica , Saccharomyces cerevisiae/metabolismo , Fatores de Tempo
8.
J Comput Biol ; 17(3): 269-80, 2010 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-20377445

RESUMO

Despite the desirable information contained in complex pedigree data sets, analysis methods struggle to efficiently process these data. The attractiveness of pedigree data is their power for detecting rare variants, particularly in comparison with studies of unrelated individuals. In addition, rather than assuming individuals in a study are unrelated, knowledge of their relationships can avoid spurious results due to confounding population structure effects. However, a major challenge for applying pedigree methods is difficulty in handling complex pedigrees having multiple founding lineages, inbreeding, and half-sibling relationships. A key ingredient in association studies is imputation and inference of haplotypes from genotype data. Existing haplotype inference methods either do not efficiently scale to complex pedigrees or are of limited accuracy. In this article, we present algorithms for efficient haplotype inference and imputation in complex pedigrees. Our method, PhyloPed, leverages the perfect phylogeny model, resulting in an efficient method with high accuracy. PhyloPed effectively combines the founder haplotype information from different lineages and is immune to inaccuracies in prior information about the founders. In addition, we demonstrate that inference of missing data, using PhyloPed, can substantially improve disease association. For Online Supplementary Material, see www.liebertonline.com.


Assuntos
Haplótipos/genética , Linhagem , Simulação por Computador , Doença/genética , Ligação Genética , Genótipo , Humanos , Filogenia , Curva ROC , Recombinação Genética/genética , Fatores de Tempo
9.
PLoS Genet ; 5(12): e1000782, 2009 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-20041197

RESUMO

This work demonstrates how gene association studies can be analyzed to map a global landscape of genetic interactions among protein complexes and pathways. Despite the immense potential of gene association studies, they have been challenging to analyze because most traits are complex, involving the combined effect of mutations at many different genes. Due to lack of statistical power, only the strongest single markers are typically identified. Here, we present an integrative approach that greatly increases power through marker clustering and projection of marker interactions within and across protein complexes. Applied to a recent gene association study in yeast, this approach identifies 2,023 genetic interactions which map to 208 functional interactions among protein complexes. We show that such interactions are analogous to interactions derived through reverse genetic screens and that they provide coverage in areas not yet tested by reverse genetic analysis. This work has the potential to transform gene association studies, by elevating the analysis from the level of individual markers to global maps of genetic interactions. As proof of principle, we use synthetic genetic screens to confirm numerous novel genetic interactions for the INO80 chromatin remodeling complex.


Assuntos
Genoma Fúngico/genética , Estudo de Associação Genômica Ampla , Complexos Multiproteicos/genética , Proteínas de Saccharomyces cerevisiae/genética , Saccharomyces cerevisiae/genética , Análise por Conglomerados , Redes Reguladoras de Genes , Marcadores Genéticos , Ligação Proteica/genética , Saccharomyces cerevisiae/metabolismo , Proteínas de Saccharomyces cerevisiae/metabolismo
10.
Nucleic Acids Res ; 37(Web Server issue): W106-8, 2009 Jul.
Artigo em Inglês | MEDLINE | ID: mdl-19491310

RESUMO

TORQUE is a tool for cross-species querying of protein-protein interaction networks. It aims to answer the following question: given a set of proteins constituting a known complex or a pathway in one species, can a similar complex or pathway be found in the protein network of another species? To this end, Torque seeks a matching set of proteins that are sequence similar to the query proteins and span a connected region of the target network, while allowing for both insertions and deletions. Unlike existing approaches, TORQUE does not require knowledge of the interconnections among the query proteins. It can handle large queries of up to 25 proteins. The Torque web server is freely available for use at http://www.cs.tau.ac.il/~bnet/torque.html.


Assuntos
Mapeamento de Interação de Proteínas , Software , Internet , Análise de Sequência de Proteína
11.
Am J Hum Genet ; 83(6): 675-83, 2008 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-19026399

RESUMO

The central questions asked in whole-genome association studies are how to locate associated regions in the genome and how to estimate the significance of these findings. Researchers usually do this by testing each SNP separately for association and then applying a suitable correction for multiple-hypothesis testing. However, SNPs are correlated by the unobserved genealogy of the population, and a more powerful statistical methodology would attempt to take this genealogy into account. Leveraging the genealogy in association studies is challenging, however, because the inference of the genealogy from the genotypes is a computationally intensive task, in particular when recombination is modeled, as in ancestral recombination graphs. Furthermore, if large numbers of genealogies are imputed from the genotypes, the power of the study might decrease if these imputed genealogies create an additional multiple-hypothesis testing burden. Indeed, we show in this paper that several existing methods that aim to address this problem suffer either from low power or from a very high false-positive rate; their performance is generally not better than the standard approach of separate testing of SNPs. We suggest a new genealogy-based approach, CAMP (coalescent-based association mapping), that takes into account the trade-off between the complexity of the genealogy and the power lost due to the additional multiple hypotheses. Our experiments show that CAMP yields a significant increase in power relative to that of previous methods and that it can more accurately locate the associated region.


Assuntos
Mapeamento Cromossômico/métodos , Predisposição Genética para Doença , Genoma Humano , Estudo de Associação Genômica Ampla , Polimorfismo de Nucleotídeo Único , Algoritmos , Estudos de Casos e Controles , Distribuição de Qui-Quadrado , Simulação por Computador , Genética Populacional , Genótipo , Haplótipos , Humanos , Modelos Genéticos , Mutação , Linhagem , Filogenia , Recombinação Genética
12.
Mol Syst Biol ; 4: 162, 2008.
Artigo em Inglês | MEDLINE | ID: mdl-18319721

RESUMO

Analysis of expression quantitative trait loci (eQTLs) is an emerging technique in which individuals are genotyped across a panel of genetic markers and, simultaneously, phenotyped using DNA microarrays. Because of the spacing of markers and linkage disequilibrium, each marker may be near many genes making it difficult to finely map which of these genes are the causal factors responsible for the observed changes in the downstream expression. To address this challenge, we present an efficient method for prioritizing candidate genes at a locus. This approach, called 'eQTL electrical diagrams' (eQED), integrates eQTLs with protein interaction networks by modeling the two data sets as a wiring diagram of current sources and resistors. eQED achieved a 79% accuracy in recovering a reference set of regulator-target pairs in yeast, which is significantly higher than the performance of three competing methods. eQED also annotates 368 protein-protein interactions with their directionality of information flow with an accuracy of approximately 75%.


Assuntos
Biologia Computacional/métodos , Proteínas/metabolismo , Locos de Características Quantitativas , Biologia de Sistemas/métodos , Animais , Proteínas Fúngicas/metabolismo , Genótipo , Humanos , Desequilíbrio de Ligação , Modelos Genéticos , Modelos Estatísticos , Análise de Sequência com Séries de Oligonucleotídeos , Fenótipo , Mapeamento de Interação de Proteínas , Reprodutibilidade dos Testes
13.
Am J Hum Genet ; 81(5): 895-905, 2007 Nov.
Artigo em Inglês | MEDLINE | ID: mdl-17924333

RESUMO

Population stratification can be a serious obstacle in the analysis of genomewide association studies. We propose a method for evaluating the significance of association scores in whole-genome cohorts with stratification. Our approach is a randomization test akin to a standard permutation test. It conditions on the genotype matrix and thus takes into account not only the population structure but also the complex linkage disequilibrium structure of the genome. As we show in simulation experiments, our method achieves higher power and significantly better control over false-positive rates than do existing methods. In addition, it can be easily applied to whole-genome association studies.


Assuntos
Predisposição Genética para Doença , Genoma Humano , Modelos Genéticos , Algoritmos , Bases de Dados Genéticas , Humanos , Polimorfismo de Nucleotídeo Único/genética
14.
J Comput Biol ; 14(7): 892-907, 2007 Sep.
Artigo em Inglês | MEDLINE | ID: mdl-17803369

RESUMO

We present a method that compares the protein interaction networks of two species to detect functionally similar (conserved) protein modules between them. The method is based on an algorithm we developed to identify matching subgraphs between two graphs. Unlike previous network comparison methods, our algorithm has provable guarantees on correctness and efficiency. Our algorithm framework also admits quite general criteria that define when two subgraphs match and constitute a conserved module. We apply our method to pairwise comparisons of the yeast protein network with the human, fruit fly and nematode worm protein networks, using a lenient criterion based on connectedness and matching edges, coupled with a clustering heuristic. In evaluations of the detected conserved modules against reference yeast protein complexes, our method performs competitively with and sometimes better than two previous network comparison methods. Further, under some conditions (proper homolog and species selection), our method performs better than a popular single-species clustering method. Beyond these evaluations, we discuss the biology of a couple of conserved modules detected by our method. We demonstrate the utility of network comparison for transferring annotations from yeast proteins to human ones, and validate the predicted annotations. Supplemental text is available at www.cs.berkeley.edu/ approximately nmani/M-and-S/supplement.pdf.


Assuntos
Algoritmos , Mapeamento de Interação de Proteínas/métodos , Proteínas , Bases de Dados de Proteínas , Humanos , Modelos Biológicos , Proteínas/genética , Proteínas/metabolismo , Proteínas de Saccharomyces cerevisiae/genética , Proteínas de Saccharomyces cerevisiae/metabolismo
15.
Bioinformatics ; 23(22): 3048-55, 2007 Nov 15.
Artigo em Inglês | MEDLINE | ID: mdl-17895275

RESUMO

MOTIVATION: The search for genetic variants that are linked to complex diseases such as cancer, Parkinson's;, or Alzheimer's; disease, may lead to better treatments. Since haplotypes can serve as proxies for hidden variants, one method of finding the linked variants is to look for case-control associations between the haplotypes and disease. Finding these associations requires a high-quality estimation of the haplotype frequencies in the population. To this end, we present, HaploPool, a method of estimating haplotype frequencies from blocks of consecutive SNPs. RESULTS: HaploPool leverages the efficiency of DNA pools and estimates the population haplotype frequencies from pools of disjoint sets, each containing two or three unrelated individuals. We study the trade-off between pooling efficiency and accuracy of haplotype frequency estimates. For a fixed genotyping budget, HaploPool performs favorably on pools of two individuals as compared with a state-of-the-art non-pooled phasing method, PHASE. Of independent interest, HaploPool can be used to phase non-pooled genotype data with an accuracy approaching that of PHASE. We compared our algorithm to three programs that estimate haplotype frequencies from pooled data. HaploPool is an order of magnitude more efficient (at least six times faster), and considerably more accurate than previous methods. In contrast to previous methods, HaploPool performs well with missing data, genotyping errors and long haplotype blocks (of between 5 and 25 SNPs).


Assuntos
DNA/genética , Frequência do Gene/genética , Pool Gênico , Variação Genética/genética , Haplótipos/genética , Modelos Genéticos , Análise de Sequência de DNA/métodos , Simulação por Computador , Filogenia
16.
J Comput Biol ; 13(2): 133-44, 2006 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-16597231

RESUMO

The interpretation of large-scale protein network data depends on our ability to identify significant substructures in the data, a computationally intensive task. Here we adapt and extend efficient techniques for finding paths and trees in graphs to the problem of identifying pathways in protein interaction networks. We present linear-time algorithms for finding paths and trees in networks under several biologically motivated constraints. We apply our methodology to search for protein pathways in the yeast protein-protein interaction network. We demonstrate that our algorithm is capable of reconstructing known signaling pathways and identifying functionally enriched paths and trees in an unsupervised manner. The algorithm is very efficient, computing optimal paths of length 8 within minutes and paths of length 10 in about three hours.


Assuntos
Algoritmos , Mapeamento de Interação de Proteínas , Proteínas de Saccharomyces cerevisiae/química , Proteínas de Saccharomyces cerevisiae/metabolismo , Transdução de Sinais , Simulação por Computador , Modelos Biológicos , Ligação Proteica , Proteoma/metabolismo , Saccharomyces cerevisiae/metabolismo
17.
J Comput Biol ; 12(6): 835-46, 2005.
Artigo em Inglês | MEDLINE | ID: mdl-16108720

RESUMO

Mounting evidence shows that many protein complexes are conserved in evolution. Here we use conservation to find complexes that are common to the yeast S. cerevisiae and the bacteria H. pylori. Our analysis combines protein interaction data that are available for each of the two species and orthology information based on protein sequence comparison. We develop a detailed probabilistic model for protein complexes in a single species and a model for the conservation of complexes between two species. Using these models, one can recast the question of finding conserved complexes as a problem of searching for heavy subgraphs in an edge- and node-weighted graph, whose nodes are orthologous protein pairs. We tested this approach on the data currently available for yeast and bacteria and detected 11 significantly conserved complexes. Several of these complexes match very well with prior experimental knowledge on complexes in yeast only and serve for validation of our methodology. The complexes suggest new functions for a variety of uncharacterized proteins. By identifying a conserved complex whose yeast proteins function predominantly in the nuclear pore complex, we propose that the corresponding bacterial proteins function as a coherent cellular membrane transport system. We also compare our results to two alternative methods for detecting complexes and demonstrate that our methodology obtains a much higher specificity.


Assuntos
Proteínas de Bactérias/metabolismo , Helicobacter pylori/metabolismo , Mapeamento de Interação de Proteínas , Proteínas de Saccharomyces cerevisiae/metabolismo , Saccharomyces cerevisiae/metabolismo , Bases de Dados de Proteínas , Complexos Multiproteicos , Alinhamento de Sequência
18.
Proc Natl Acad Sci U S A ; 102(6): 1974-9, 2005 Feb 08.
Artigo em Inglês | MEDLINE | ID: mdl-15687504

RESUMO

To elucidate cellular machinery on a global scale, we performed a multiple comparison of the recently available protein-protein interaction networks of Caenorhabditis elegans, Drosophila melanogaster, and Saccharomyces cerevisiae. This comparison integrated protein interaction and sequence information to reveal 71 network regions that were conserved across all three species and many exclusive to the metazoans. We used this conservation, and found statistically significant support for 4,645 previously undescribed protein functions and 2,609 previously undescribed protein interactions. We tested 60 interaction predictions for yeast by two-hybrid analysis, confirming approximately half of these. Significantly, many of the predicted functions and interactions would not have been identified from sequence similarity alone, demonstrating that network comparisons provide essential biological information beyond what is gleaned from the genome.


Assuntos
Proteínas de Caenorhabditis elegans/metabolismo , Proteínas de Drosophila/metabolismo , Proteínas de Saccharomyces cerevisiae/metabolismo , Sequência de Aminoácidos , Animais , Proteínas de Caenorhabditis elegans/genética , Bases de Dados de Ácidos Nucleicos , Proteínas de Drosophila/genética , Proteínas de Saccharomyces cerevisiae/genética , Alinhamento de Sequência , Homologia de Sequência do Ácido Nucleico , Técnicas do Sistema de Duplo-Híbrido
19.
J Comput Biol ; 11(2-3): 476-92, 2004.
Artigo em Inglês | MEDLINE | ID: mdl-15285903

RESUMO

We study a design and optimization problem that occurs, for example, when single nucleotide polymorphisms (SNPs) are to be genotyped using a universal DNA tag array. The problem of optimizing the universal array to avoid disruptive cross-hybridization between universal components of the system was addressed in previous work. Cross-hybridization can, however, also occur assay specifically, due to unwanted complementarity involving assay-specific components. Here we examine the problem of identifying the most economic experimental configuration of the assay-specific components that avoids cross-hybridization. Our formalization translates this problem into the problem of covering the vertices of one side of a bipartite graph by a minimum number of balanced subgraphs of maximum degree 1. We show that the general problem is NP-complete. However, in the real biological setting, the vertices that need to be covered have degrees bounded by d. We exploit this restriction and develop an O(d)-approximation algorithm for the problem. We also give an O(d)-approximation for a variant of the problem in which the covering subgraphs are required to be vertex disjoint. In addition, we propose a stochastic model for the input data and use it to prove a lower bound on the cover size. We complement our theoretical analysis by implementing two heuristic approaches and testing their performance on synthetic data as well as on simulated SNP data.


Assuntos
Biologia Computacional/métodos , Análise de Sequência com Séries de Oligonucleotídeos/estatística & dados numéricos , Algoritmos , Biologia Computacional/estatística & dados numéricos , Genótipo , Reação em Cadeia da Polimerase , Polimorfismo de Nucleotídeo Único
20.
J Bioinform Comput Biol ; 2(1): 127-54, 2004 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-15272436

RESUMO

The complexity of the global organization and internal structure of motifs in higher eukaryotic organisms raises significant challenges for motif detection techniques. To achieve successful de novo motif detection, it is necessary to model the complex dependencies within and among motifs and to incorporate biological prior knowledge. In this paper, we present LOGOS, an integrated LOcal and GlObal motif Sequence model for biopolymer sequences, which provides a principled framework for developing, modularizing, extending and computing expressive motif models for complex biopolymer sequence analysis. LOGOS consists of two interacting submodels: HMDM, a local alignment model capturing biological prior knowledge and positional dependency within the motif local structure; and HMM, a global motif distribution model modeling frequencies and dependencies of motif occurrences. Model parameters can be fit using training motifs within an empirical Bayesian framework. A variational EM algorithm is developed for de novo motif detection. LOGOS improves over existing models that ignore biological priors and dependencies in motif structures and motif occurrences, and demonstrates superior performance on both semi-realistic test data and cis-regulatory sequences from yeast and Drosophila genomes with regard to sensitivity, specificity, flexibility and extensibility.


Assuntos
Algoritmos , Motivos de Aminoácidos , DNA/química , Proteínas/química , Alinhamento de Sequência/métodos , Análise de Sequência de DNA/métodos , Software , Teorema de Bayes , Modelos Biológicos , Modelos Estatísticos , Homologia de Sequência do Ácido Nucleico
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA
...