Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 45
Filtrar
Mais filtros










Base de dados
Intervalo de ano de publicação
1.
Bull Math Biol ; 85(3): 21, 2023 02 13.
Artigo em Inglês | MEDLINE | ID: mdl-36780044

RESUMO

The study of native motifs of RNA secondary structures helps us better understand the formation and eventually the functions of these molecules. Commonly known structural motifs include helices, hairpin loops, bulges, interior loops, exterior loops and multiloops. However, enumerative results and generating algorithms taking into account the joint distribution of these motifs are sparse. In this paper, we present progress on deriving such distributions employing a tree-bijection of RNA secondary structures obtained by Schmitt and Waterman and a novel rake decomposition of plane trees. The key feature of the latter is that the derived components encode motifs of the RNA secondary structures without pseudoknots associated with the plane trees very well. As an application, we present an algorithm (RakeSamp) generating uniformly random secondary structures without pseudoknots that satisfy fine motif specifications on the length and degree of various types of loops as well as helices.


Assuntos
Conceitos Matemáticos , RNA , RNA/química , Conformação de Ácido Nucleico , Modelos Biológicos , Algoritmos
2.
IEEE Trans Inf Theory ; 67(6): 3287-3294, 2021 Jun.
Artigo em Inglês | MEDLINE | ID: mdl-34257466

RESUMO

Levenshtein edit distance has played a central role-both past and present-in sequence alignment in particular and biological database similarity search in general. We start our review with a history of dynamic programming algorithms for computing Levenshtein distance and sequence alignments. Following, we describe how those algorithms led to heuristics employed in the most widely used software in bioinformatics, BLAST, a program to search DNA and protein databases for evolutionarily relevant similarities. More recently, the advent of modern genomic sequencing and the volume of data it generates has resulted in a return to the problem of local alignment. We conclude with how the mathematical formulation of Levenshtein distance as a metric made possible additional optimizations to similarity search in biological contexts. These modern optimizations are built around the low metric entropy and fractional dimensionality of biological databases, enabling orders of magnitude acceleration of biological similarity search.

3.
J Comput Biol ; 28(3): 248-256, 2021 03.
Artigo em Inglês | MEDLINE | ID: mdl-33275493

RESUMO

COVID-19 is an infectious disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The viral genome is considered to be relatively stable and the mutations that have been observed and reported thus far are mainly focused on the coding region. This article provides evidence that macrolevel pandemic dynamics, such as social distancing, modulate the genomic evolution of SARS-CoV-2. This view complements the prevalent paradigm that microlevel observables control macrolevel parameters such as death rates and infection patterns. First, we observe differences in mutational signals for geospatially separated populations such as the prevalence of A23404G in CA versus NY and WA. We show that the feedback between macrolevel dynamics and the viral population can be captured employing a transfer entropy framework. Second, we observe complex interactions within mutational clades. Namely, when C14408T first appeared in the viral population, the frequency of A23404G spiked in the subsequent week. Third, we identify a noncoding mutation, G29540A, within the segment between the coding gene of the N protein and the ORF10 gene, which is largely confined to NY (>95%). These observations indicate that macrolevel sociobehavioral measures have an impact on the viral genomics and may be useful for the dashboard-like tracking of its evolution. Finally, despite the fact that SARS-CoV-2 is a genetically robust organism, our findings suggest that we are dealing with a high degree of adaptability. Owing to its ample spread, mutations of unusual form are observed and a high complexity of mutational interaction is exhibited.


Assuntos
COVID-19/virologia , Evolução Molecular , Genoma Viral , SARS-CoV-2/genética , COVID-19/epidemiologia , COVID-19/transmissão , Biologia Computacional , Frequência do Gene , Comportamentos Relacionados com a Saúde , Política de Saúde , Humanos , Modelos Genéticos , Mutação , Pandemias , Filogenia , Distanciamento Físico , SARS-CoV-2/patogenicidade , SARS-CoV-2/fisiologia , Glicoproteína da Espícula de Coronavírus/genética
4.
Genome Biol ; 20(1): 144, 2019 07 25.
Artigo em Inglês | MEDLINE | ID: mdl-31345254

RESUMO

BACKGROUND: Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. RESULTS: Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. CONCLUSION: The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.


Assuntos
Análise de Sequência , Benchmarking , Transferência Genética Horizontal , Internet , Filogenia , Sequências Reguladoras de Ácido Nucleico , Alinhamento de Sequência , Análise de Sequência de Proteína , Software
5.
Bioinformatics ; 35(22): 4596-4606, 2019 11 01.
Artigo em Inglês | MEDLINE | ID: mdl-30993316

RESUMO

MOTIVATION: Detecting sequences containing repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for various types of repeat detection tasks. An efficient generic method for detecting most types of repetitive sequences is still desirable. Inspired by the excellent properties and successful applications of the D2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic D2R that can efficiently discriminate sequences with or without repetitive regions. RESULTS: Using the statistic, we developed an algorithm of linear time and space complexity for detecting most types of repetitive sequences in multiple scenarios, including finding candidate clustered regularly interspaced short palindromic repeats regions from bacterial genomic or metagenomics sequences. Simulation and real data experiments show that the method works well on both assembled sequences and unassembled short reads. AVAILABILITY AND IMPLEMENTATION: The codes are available at https://github.com/XuegongLab/D2R_codes under GPL 3.0 license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Repetições Palindrômicas Curtas Agrupadas e Regularmente Espaçadas , Genômica , Algoritmos , Genoma Bacteriano , Metagenômica
6.
Bioinformatics ; 34(4): 617-624, 2018 02 15.
Artigo em Inglês | MEDLINE | ID: mdl-29040382

RESUMO

Motivation: Capturing association patterns in gene expression levels under different conditions or time points is important for inferring gene regulatory interactions. In practice, temporal changes in gene expression may result in complex association patterns that require more sophisticated detection methods than simple correlation measures. For instance, the effect of regulation may lead to time-lagged associations and interactions local to a subset of samples. Furthermore, expression profiles of interest may not be aligned or directly comparable (e.g. gene expression profiles from two species). Results: We propose a count statistic for measuring association between pairs of gene expression profiles consisting of ordered samples (e.g. time-course), where correlation may only exist locally in subsequences separated by a position shift. The statistic is simple and fast to compute, and we illustrate its use in two applications. In a cross-species comparison of developmental gene expression levels, we show our method not only measures association of gene expressions between the two species, but also provides alignment between different developmental stages. In the second application, we applied our statistic to expression profiles from two distinct phenotypic conditions, where the samples in each profile are ordered by the associated phenotypic values. The detected associations can be useful in building correspondence between gene association networks under different phenotypes. On the theoretical side, we provide asymptotic distributions of the statistic for different regions of the parameter space and test its power on simulated data. Availability and implementation: The code used to perform the analysis is available as part of the Supplementary Material. Contact: msw@usc.edu or hhuang@stat.berkeley.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Assuntos
Perfilação da Expressão Gênica/métodos , Regulação da Expressão Gênica , Redes Reguladoras de Genes , Software , Algoritmos , Biologia Computacional/métodos , Fenótipo , Análise de Sequência de RNA/métodos
7.
Nucleic Acids Res ; 45(W1): W554-W559, 2017 07 03.
Artigo em Inglês | MEDLINE | ID: mdl-28472388

RESUMO

Alignment-free genome and metagenome comparisons are increasingly important with the development of next generation sequencing (NGS) technologies. Recently developed state-of-the-art k-mer based alignment-free dissimilarity measures including CVTree, $d_2^*$ and $d_2^S$ are more computationally expensive than measures based solely on the k-mer frequencies. Here, we report a standalone software, aCcelerated Alignment-FrEe sequence analysis (CAFE), for efficient calculation of 28 alignment-free dissimilarity measures. CAFE allows for both assembled genome sequences and unassembled NGS shotgun reads as input, and wraps the output in a standard PHYLIP format. In downstream analyses, CAFE can also be used to visualize the pairwise dissimilarity measures, including dendrograms, heatmap, principal coordinate analysis and network display. CAFE serves as a general k-mer based alignment-free analysis platform for studying the relationships among genomes and metagenomes, and is freely available at https://github.com/younglululu/CAFE.


Assuntos
Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Software , Animais , Genoma Microbiano , Internet , Metagenômica , Primatas/genética , Alinhamento de Sequência , Vertebrados/genética
8.
Proc Natl Acad Sci U S A ; 111(46): 16371-6, 2014 Nov 18.
Artigo em Inglês | MEDLINE | ID: mdl-25288767

RESUMO

With the advent of high-throughput technologies making large-scale gene expression data readily available, developing appropriate computational tools to process these data and distill insights into systems biology has been an important part of the "big data" challenge. Gene coexpression is one of the earliest techniques developed that is still widely in use for functional annotation, pathway analysis, and, most importantly, the reconstruction of gene regulatory networks, based on gene expression data. However, most coexpression measures do not specifically account for local features in expression profiles. For example, it is very likely that the patterns of gene association may change or only exist in a subset of the samples, especially when the samples are pooled from a range of experiments. We propose two new gene coexpression statistics based on counting local patterns of gene expression ranks to take into account the potentially diverse nature of gene interactions. In particular, one of our statistics is designed for time-course data with local dependence structures, such as time series coupled over a subregion of the time domain. We provide asymptotic analysis of their distributions and power, and evaluate their performance against a wide range of existing coexpression measures on simulated and real data. Our new statistics are fast to compute, robust against outliers, and show comparable and often better general performance.


Assuntos
Biologia Computacional/estatística & dados numéricos , Perfilação da Expressão Gênica/estatística & dados numéricos , Redes Reguladoras de Genes , Algoritmos , Arabidopsis/genética , Arabidopsis/metabolismo , Proteínas de Arabidopsis/biossíntese , Proteínas de Arabidopsis/genética , Proteínas de Ciclo Celular/biossíntese , Proteínas de Ciclo Celular/genética , Biologia Computacional/métodos , Simulação por Computador , Regulação Fúngica da Expressão Gênica , Regulação da Expressão Gênica de Plantas , Genes Fúngicos , Genes de Plantas , Modelos Genéticos , Método de Monte Carlo , Saccharomyces cerevisiae/citologia , Saccharomyces cerevisiae/genética , Proteínas de Saccharomyces cerevisiae/biossíntese , Proteínas de Saccharomyces cerevisiae/genética , Fatores de Tempo
9.
Brief Bioinform ; 15(3): 343-53, 2014 May.
Artigo em Inglês | MEDLINE | ID: mdl-24064230

RESUMO

With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data.


Assuntos
Biologia Computacional/métodos , Análise de Sequência/métodos , Algoritmos , Biologia Computacional/tendências , Genômica/métodos , Genômica/estatística & dados numéricos , Sequenciamento de Nucleotídeos em Larga Escala , Cadeias de Markov , Modelos Estatísticos , Alinhamento de Sequência , Análise de Sequência/estatística & dados numéricos
10.
J Comput Biol ; 20(7): 471-85, 2013 Jul.
Artigo em Inglês | MEDLINE | ID: mdl-23829649

RESUMO

Local alignment-free sequence comparison arises in the context of identifying similar segments of sequences that may not be alignable in the traditional sense. We propose a randomized approximation algorithm that is both accurate and efficient. We show that under D2 and its important variant [Formula: see text] as the similarity measure, local alignment-free comparison between a pair of sequences can be formulated as the problem of finding the maximum bichromatic dot product between two sets of points in high dimensions. We introduce a geometric framework that reduces this problem to that of finding the bichromatic closest pair (BCP), allowing the properties of the underlying metric to be leveraged. Local alignment-free sequence comparison can be solved by making a quadratic number of alignment-free substring comparisons. We show both theoretically and through empirical results on simulated data that our approximation algorithm requires a subquadratic number of such comparisons and trades only a small amount of accuracy to achieve this efficiency. Therefore, our algorithm can extend the current usage of alignment-free-based methods and can also be regarded as a substitute for local alignment algorithms in many biological studies.


Assuntos
Algoritmos , Interpretação Estatística de Dados , Alinhamento de Sequência , Análise de Sequência de DNA/métodos , Análise de Sequência de Proteína/métodos , Simulação por Computador , Humanos , Software
11.
J Comput Biol ; 19(6): 839-54, 2012 Jun.
Artigo em Inglês | MEDLINE | ID: mdl-22697250

RESUMO

Next generation sequencing (NGS) technologies are now widely used in many biological studies. In NGS, sequence reads are randomly sampled from the genome sequence of interest. Most computational approaches for NGS data first map the reads to the genome and then analyze the data based on the mapped reads. Since many organisms have unknown genome sequences and many reads cannot be uniquely mapped to the genomes even if the genome sequences are known, alternative analytical methods are needed for the study of NGS data. Here we suggest using word patterns to analyze NGS data. Word pattern counting (the study of the probabilistic distribution of the number of occurrences of word patterns in one or multiple long sequences) has played an important role in molecular sequence analysis. However, no studies are available on the distribution of the number of occurrences of word patterns in NGS reads. In this article, we build probabilistic models for the background sequence and the sampling process of the sequence reads from the genome. Based on the models, we provide normal and compound Poisson approximations for the number of occurrences of word patterns from the sequence reads, with bounds on the approximation error. The main challenge is to consider the randomness in generating the long background sequence, as well as in the sampling of the reads using NGS. We show the accuracy of these approximations under a variety of conditions for different patterns with various characteristics. Under realistic assumptions, the compound Poisson approximation seems to outperform the normal approximation in most situations. These approximate distributions can be used to evaluate the statistical significance of the occurrence of patterns from NGS data. The theory and the computational algorithm for calculating the approximate distributions are then used to analyze ChIP-Seq data using transcription factor GABP. Software is available online (www-rcf.usc.edu/∼fsun/Programs/NGS_motif_power/NGS_motif_power.html). In addition, Supplementary Material can be found online (www.liebertonline.com/cmb).


Assuntos
Algoritmos , Mapeamento Cromossômico/estatística & dados numéricos , Análise de Sequência de DNA/estatística & dados numéricos , Software , Mapeamento Cromossômico/métodos , Fator de Transcrição de Proteínas de Ligação GA/genética , Genoma Humano , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Modelos Estatísticos , Distribuição de Poisson , Análise de Sequência de DNA/métodos , Transativadores/genética
12.
J Theor Biol ; 284(1): 106-16, 2011 Sep 07.
Artigo em Inglês | MEDLINE | ID: mdl-21723298

RESUMO

Alignment-free sequence comparison is widely used for comparing gene regulatory regions and for identifying horizontally transferred genes. Recent studies on the power of a widely used alignment-free comparison statistic D2 and its variants D*2 and D(s)2 showed that their power approximates a limit smaller than 1 as the sequence length tends to infinity under a pattern transfer model. We develop new alignment-free statistics based on D2, D*2 and D(s)2 by comparing local sequence pairs and then summing over all the local sequence pairs of certain length. We show that the new statistics are much more powerful than the corresponding statistics and the power tends to 1 as the sequence length tends to infinity under the pattern transfer model.


Assuntos
Sequências Reguladoras de Ácido Nucleico/genética , Análise de Sequência de DNA/métodos , Algoritmos , Animais , Interpretação Estatística de Dados , Drosophila/genética , Evolução Molecular , HIV-1/genética , Modelos Estatísticos , Alinhamento de Sequência , Homologia de Sequência do Ácido Nucleico
13.
PLoS Comput Biol ; 7(6): e1001106, 2011 Jun.
Artigo em Inglês | MEDLINE | ID: mdl-21698123

RESUMO

The rapid accumulation of biological networks poses new challenges and calls for powerful integrative analysis tools. Most existing methods capable of simultaneously analyzing a large number of networks were primarily designed for unweighted networks, and cannot easily be extended to weighted networks. However, it is known that transforming weighted into unweighted networks by dichotomizing the edges of weighted networks with a threshold generally leads to information loss. We have developed a novel, tensor-based computational framework for mining recurrent heavy subgraphs in a large set of massive weighted networks. Specifically, we formulate the recurrent heavy subgraph identification problem as a heavy 3D subtensor discovery problem with sparse constraints. We describe an effective approach to solving this problem by designing a multi-stage, convex relaxation protocol, and a non-uniform edge sampling technique. We applied our method to 130 co-expression networks, and identified 11,394 recurrent heavy subgraphs, grouped into 2,810 families. We demonstrated that the identified subgraphs represent meaningful biological modules by validating against a large set of compiled biological knowledge bases. We also showed that the likelihood for a heavy subgraph to be meaningful increases significantly with its recurrence in multiple networks, highlighting the importance of the integrative approach to biological network analysis. Moreover, our approach based on weighted graphs detects many patterns that would be overlooked using unweighted graphs. In addition, we identified a large number of modules that occur predominately under specific phenotypes. This analysis resulted in a genome-wide mapping of gene network modules onto the phenome. Finally, by comparing module activities across many datasets, we discovered high-order dynamic cooperativeness in protein complex networks and transcriptional regulatory networks.


Assuntos
Biologia Computacional/métodos , Bases de Dados Genéticas , Redes Reguladoras de Genes , Modelos Genéticos , Processamento de Sinais Assistido por Computador , Algoritmos , Bases de Dados de Proteínas , Expressão Gênica , Regulação da Expressão Gênica , Humanos , Análise de Sequência com Séries de Oligonucleotídeos , Fenótipo , Proteínas
14.
J Comput Biol ; 18(5): 677-91, 2011 May.
Artigo em Inglês | MEDLINE | ID: mdl-21554016

RESUMO

Sequence alignment depends on the scoring function that defines similarity between pairs of letters. For local alignment, the computational algorithm searches for the most similar segments in the sequences according to the scoring function. The choice of this scoring function is important for correctly detecting segments of interest. We formulate sequence alignment as a hypothesis testing problem, and conduct extensive simulation experiments to study the relationship between the scoring function and the distribution of aligned pairs within the aligned segment under this framework. We cut through the many ways to construct scoring functions and showed that any scoring function with negative expectation used in local alignment corresponds to a hypothesis test between the background distribution of sequence letters and a statistical distribution of letter pairs determined by the scoring function. The results indicate that the log-likelihood ratio scoring function is statistically most powerful and has the highest accuracy for detecting the segments of interest that are defined by the statistical distribution of aligned letter pairs.


Assuntos
Algoritmos , Alinhamento de Sequência , Funções Verossimilhança , Matemática , Modelos Teóricos
15.
J Comput Biol ; 17(11): 1467-90, 2010 Nov.
Artigo em Inglês | MEDLINE | ID: mdl-20973742

RESUMO

Rapid methods for alignment-free sequence comparison make large-scale comparisons between sequences increasingly feasible. Here we study the power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2*, which uses centralized counts, and D2S, which is a self-standardized version, both from a theoretical viewpoint and numerically, providing an easy to use program. The power is assessed under two alternative hidden Markov models; the first one assumes that the two sequences share a common motif, whereas the second model is a pattern transfer model; the null model is that the two sequences are composed of independent and identically distributed letters and they are independent. Under the first alternative model, the means of the tuple counts in the individual sequences change, whereas under the second alternative model, the marginal means are the same as under the null model. Using the limit distributions of the count statistics under the null and the alternative models, we find that generally, asymptotically D2S has the largest power, followed by D2*, whereas the power of D2 can even be zero in some cases. In contrast, even for sequences of length 140,000 bp, in simulations D2* generally has the largest power. Under the first alternative model of a shared motif, the power of D2*approaches 100% when sufficiently many motifs are shared, and we recommend the use of D2* for such practical applications. Under the second alternative model of pattern transfer,the power for all three count statistics does not increase with sequence length when the sequence is sufficiently long, and hence none of the three statistics under consideration canbe recommended in such a situation. We illustrate the approach on 323 transcription factor binding motifs with length at most 10 from JASPAR CORE (October 12, 2009 version),verifying that D2* is generally more powerful than D2. The program to calculate the power of D2, D2* and D2S can be downloaded from http://meta.cmb.usc.edu/d2. Supplementary Material is available at www.liebertonline.com/cmb.


Assuntos
Sequência de Bases , Biologia Computacional/métodos , Modelos Estatísticos , Alinhamento de Sequência/métodos , Análise de Sequência de DNA/métodos , Algoritmos , Matemática , Software
16.
Proc Natl Acad Sci U S A ; 107(24): 10848-53, 2010 Jun 15.
Artigo em Inglês | MEDLINE | ID: mdl-20534489

RESUMO

Variation in genome structure is an important source of human genetic polymorphism: It affects a large proportion of the genome and has a variety of phenotypic consequences relevant to health and disease. In spite of this, human genome structure variation is incompletely characterized due to a lack of approaches for discovering a broad range of structural variants in a global, comprehensive fashion. We addressed this gap with Optical Mapping, a high-throughput, high-resolution single-molecule system for studying genome structure. We used Optical Mapping to create genome-wide restriction maps of a complete hydatidiform mole and three lymphoblast-derived cell lines, and we validated the approach by demonstrating a strong concordance with existing methods. We also describe thousands of new variants with sizes ranging from kb to Mb.


Assuntos
Genoma Humano , Mapeamento por Restrição Óptica/métodos , Algoritmos , Linhagem Celular , Linhagem Celular Tumoral , Feminino , Variação Genética , Estudo de Associação Genômica Ampla , Humanos , Mola Hidatiforme/genética , Linfócitos/metabolismo , Mapeamento por Restrição Óptica/estatística & dados numéricos , Gravidez , Neoplasias Uterinas/genética
17.
J Comput Biol ; 17(4): 581-92, 2010 Apr.
Artigo em Inglês | MEDLINE | ID: mdl-20426691

RESUMO

The identification of binding sites of transcription factors (TF) and other regulatory regions, referred to as motifs, located in a set of molecular sequences is of fundamental importance in genomic research. Many computational and experimental approaches have been developed to locate motifs. The set of sequences of interest can be concatenated to form a long sequence of length n. One of the successful approaches for motif discovery is to identify statistically over- or under-represented patterns in this long sequence. A pattern refers to a fixed word W over the alphabet. In the example of interest, W is a word in the set of patterns of the motif. Despite extensive studies on motif discovery, no studies have been carried out on the power of detecting statistically over- or under-represented patterns Here we address the issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif. Let N(W)(n) be the number of possibly overlapping occurrences of a pattern W in the sequence that contains instances of a known motif; such a sequence is modeled here by a Hidden Markov Model (HMM). First, efficient computational methods for calculating the mean and variance of N(W)(n) are developed. Second, efficient computational methods for calculating parameters involved in the normal approximation of N(W)(n) for frequent patterns and compound Poisson approximation of N(W)(n) for rare patterns are developed. Third, an easy to use web program is developed to calculate the power of detecting patterns and the program is used to study the power of detection in several interesting biological examples.


Assuntos
Cadeias de Markov , Reconhecimento Automatizado de Padrão/métodos , Análise de Sequência de DNA/métodos , Composição de Bases/genética , Sequência de Bases , Ilhas de CpG/genética , Internet , Análise Numérica Assistida por Computador , Distribuição de Poisson
18.
BMC Bioinformatics ; 11 Suppl 1: S62, 2010 Jan 18.
Artigo em Inglês | MEDLINE | ID: mdl-20122238

RESUMO

BACKGROUND: Complex human diseases are often caused by multiple mutations, each of which contributes only a minor effect to the disease phenotype. To study the basis for these complex phenotypes, we developed a network-based approach to identify coexpression modules specifically activated in particular phenotypes. We integrated these modules, protein-protein interaction data, Gene Ontology annotations, and our database of gene-phenotype associations derived from literature to predict novel human gene-phenotype associations. Our systematic predictions provide us with the opportunity to perform a global analysis of human gene pleiotropy and its underlying regulatory mechanisms. RESULTS: We applied this method to 338 microarray datasets, covering 178 phenotype classes, and identified 193,145 phenotype-specific coexpression modules. We trained random forest classifiers for each phenotype and predicted a total of 6,558 gene-phenotype associations. We showed that 40.9% genes are pleiotropic, highlighting that pleiotropy is more prevalent than previously expected. We collected 77 ChIP-chip datasets studying 69 transcription factors binding over 16,000 targets under various phenotypic conditions. Utilizing this unique data source, we confirmed that dynamic transcriptional regulation is an important force driving the formation of phenotype specific gene modules. CONCLUSION: We created a genome-wide gene to phenotype mapping that has many potential implications, including providing potential new drug targets and uncovering the basis for human disease phenotypes. Our analysis of these phenotype-specific coexpression modules reveals a high prevalence of gene pleiotropy, and suggests that phenotype-specific transcription factor binding may contribute to phenotypic diversity. All resources from our study are made freely available on our online Phenotype Prediction Database.


Assuntos
Biologia Computacional/métodos , Genoma , Fenótipo , Bases de Dados Genéticas , Perfilação da Expressão Gênica
19.
J Comput Sci Technol ; 25(1): 3-9, 2010 Jan 01.
Artigo em Inglês | MEDLINE | ID: mdl-22121326

RESUMO

New generation sequencing systems are changing how molecular biology is practiced. The widely promoted $1000 genome will be a reality with attendant changes for healthcare, including personalized medicine. More broadly the genomes of many new organisms with large samplings from populations will be commonplace. What is less appreciated is the explosive demands on computation, both for CPU cycles and storage as well as the need for new computational methods. In this article we will survey some of these developments and demands.

20.
J Comput Biol ; 16(12): 1615-34, 2009 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-20001252

RESUMO

Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D(2) statistic, relies on the comparison of the k-tuple content for both sequences. Although it has been known for some years that the D(2) statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D(2) word count statistic, which we call D(2)(S) and D(2)(*). For D(2)(S), which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed, when sequence lengths tend to infinity, and not dominated by the noise in the individual sequences. The second statistic, D(2)(*), outperforms D(2)(S) in terms of power for detecting the relatedness between the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of D(2)(*), we cannot provide a closed form for power calculations.


Assuntos
Biologia Computacional/métodos , Modelos Estatísticos , Homologia de Sequência do Ácido Nucleico , Composição de Bases/genética , Alinhamento de Sequência
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA
...