Results 1 - 12 of 12
1.
Bioinformatics ; 32(12): i209-i215, 2016 Jun 15.
Article in English | MEDLINE | ID: mdl-27307619

ABSTRACT

MOTIVATION: Transposable elements (TEs) and repetitive DNA make up a sizable fraction of eukaryotic genomes, and their annotation is crucial to the study of the structure, organization, and evolution of any newly sequenced genome. Although RepeatMasker and nHMMER are useful for identifying these repeats, they require a pre-compiled repeat library, which is not always available. De novo identification tools such as Recon, RepeatScout or RepeatGluer serve to identify TEs purely from sequence content, but are either limited by runtimes that prohibit whole-genome use or degrade in quality in the presence of substitutions that disrupt the sequence patterns. RESULTS: phRAIDER is a de novo TE identification tool that addresses the issues of excessive runtime without sacrificing sensitivity as compared to competing tools. The underlying model is a new definition of elementary repeats that incorporates the PatternHunter spaced seed model, allowing for greater sensitivity in the presence of genomic substitutions. As compared with the premier tool in the literature, RepeatScout, phRAIDER shows an average 10× speedup on any single human chromosome and has the ability to process the whole human genome in just over three hours. Here we discuss the tool, the theoretical model underlying the tool, and the results demonstrating its effectiveness. AVAILABILITY AND IMPLEMENTATION: phRAIDER is an open source tool available from https://github.com/karroje/phRAIDER. CONTACT: karroje@miamiOH.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology; DNA Transposable Elements; Gene Library; Genome, Human; Genomics; Humans; Sequence Analysis, DNA
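Note on entry 1: the abstract credits the spaced seed model for its tolerance of substitutions. The sketch below only illustrates that general idea and is not phRAIDER's code. The seed strings, function names and index layout are assumptions chosen for the example. A seed is a string of 1s (positions that must match) and 0s (positions that are ignored), so a substitution under a 0 does not destroy the hit.

```python
def seed_key(window: str, seed: str) -> str:
    """Keep only the characters under the '1' positions of the seed."""
    return "".join(c for c, m in zip(window, seed) if m == "1")


def spaced_seed_hits(s: str, t: str, seed: str) -> list[tuple[int, int]]:
    """Return all (i, j) such that s[i:i+k] and t[j:j+k] agree at every
    '1' position of the seed, where k = len(seed)."""
    k = len(seed)
    # Index every window of s by its masked key (hypothetical layout).
    index: dict[str, list[int]] = {}
    for i in range(len(s) - k + 1):
        index.setdefault(seed_key(s[i:i + k], seed), []).append(i)
    hits = []
    for j in range(len(t) - k + 1):
        for i in index.get(seed_key(t[j:j + k], seed), []):
            hits.append((i, j))
    return hits


# The two sequences differ at the third base only.
spaced = spaced_seed_hits("ACGT", "ACTT", "1101")      # third base ignored
contiguous = spaced_seed_hits("ACGT", "ACTT", "1111")  # exact 4-mer required
```

With the spaced seed the pair is still reported as a hit, whereas the contiguous seed of the same length misses it.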
2.
Genomics ; 104(3): 157-62, 2014 Sep.
Article in English | MEDLINE | ID: mdl-25087770

ABSTRACT

BACKGROUND: mRNA polyadenylation, the addition of a poly(A) tail to the 3'-end of pre-mRNA, is a process critical to gene expression and regulation in eukaryotes. To understand the molecular mechanisms governing polyadenylation and other relevant biological processes, it is important to identify these poly(A) tails accurately in transcriptome sequencing data and differentiate them from artificial adapter sequences added in the sequencing process. But the annotation of these tails is complicated by the presence of sequencing errors and post-transcriptional modifications. While determining that a tail is present in a given transcript fragment is straightforward, these obfuscations make the problem of boundary identification a challenge; conventional seed-and-extend algorithms struggle to accurately identify these poly(A) tail end-points. Further, all existing tools that we are aware of focus exclusively on the trimming of poly(A) tails, failing to provide the detailed information needed for studying the polyadenylation process. RESULTS: We have created SCOPE++, an open-source tool for finding the precise border of poly(A) tails and other homopolymers in raw mRNA sequence reads. Based on a Hidden Markov Model (HMM) approach, SCOPE++ accurately identifies specific homopolymer sequences in error-prone EST/cDNA data or RNA-Seq data at a speed appropriate for large sequence sets. CONCLUSIONS: We demonstrate that our tool can precisely identify poly(A) tails with near perfect accuracy at the speed required for high-throughput applications, providing a valuable resource for polyadenylation research.


Subjects
Polyadenylation; RNA, Messenger/chemistry; Sequence Analysis, RNA/methods; Software; RNA 3' Polyadenylation Signals; RNA, Messenger/metabolism; Transcriptome
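Note on entry 2: the abstract says SCOPE++ uses an HMM to locate tail boundaries but gives no model details. The sketch below is a generic two-state, left-to-right HMM decoded with the Viterbi algorithm. It is not the SCOPE++ model, and the state layout and all probabilities are assumptions made for illustration. It shows why an HMM can place a boundary despite sequencing errors: one non-A base inside the tail costs less than leaving the tail state.

```python
import math


def polya_tail_start(read: str, p_a_tail: float = 0.9, p_enter: float = 0.01) -> int:
    """Return the index where the most likely poly(A) tail begins,
    or len(read) if the best path never enters the tail state."""
    if not read:
        return 0
    lg = math.log
    neg_inf = float("-inf")
    # State 0 = transcript body (uniform bases), state 1 = tail (mostly A).
    emit = [
        lambda c: lg(0.25),
        lambda c: lg(p_a_tail) if c == "A" else lg((1.0 - p_a_tail) / 3.0),
    ]
    # Left-to-right topology: the tail state cannot return to the body.
    trans = [[lg(1.0 - p_enter), lg(p_enter)], [neg_inf, 0.0]]
    score = [lg(1.0 - p_enter) + emit[0](read[0]), lg(p_enter) + emit[1](read[0])]
    back = []
    for c in read[1:]:
        new_score, ptr = [], []
        for s in (0, 1):
            best = max((0, 1), key=lambda p: score[p] + trans[p][s])
            new_score.append(score[best] + trans[best][s] + emit[s](c))
            ptr.append(best)
        score = new_score
        back.append(ptr)
    # Trace the best path backwards from the better final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path.index(1) if 1 in path else len(read)


body = "GCTAGGCATCG"
clean = polya_tail_start(body + "A" * 20)
noisy = polya_tail_start(body + "AAAAAAGAAAAAAAAAAAAA")  # one error in the tail
```

In both calls the decoded tail starts right after the body, because the single G inside the noisy tail is absorbed as an emission error rather than treated as a boundary.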
3.
Bioinformatics ; 30(6): 887-8, 2014 Mar 15.
Article in English | MEDLINE | ID: mdl-24215021

ABSTRACT

SUMMARY: Palindromic sequences, or inverted repeats (IRs), in DNA sequences are involved in important biological processes such as DNA-protein binding, DNA replication and DNA transposition. Development of bioinformatics tools that are capable of accurately detecting perfect IRs can enable genome-wide studies of IR patterns in both prokaryotes and eukaryotes. In contrast to conventional string-comparison approaches, we propose a novel algorithm that uses a cumulative score system based on a prime number representation of nucleotide bases. We then implemented this algorithm as a MATLAB-based program for perfect IR detection. In comparison with other existing tools, our program demonstrates a high accuracy in detecting nested and overlapping IRs. AVAILABILITY AND IMPLEMENTATION: The source code is freely available at http://bioinfolab.miamioh.edu/bioinfolab/palindrome.php. CONTACT: liangc@miamioh.edu or karroje@miamioh.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Repetitive Sequences, Nucleic Acid; Sequence Analysis, DNA/methods; Algorithms; Genome; Software
4.
Am J Primatol ; 76(5): 460-71, 2014 May.
Article in English | MEDLINE | ID: mdl-24166824

ABSTRACT

The attribution of goal-directed behavior to observations of primate foraging and ranging requires that simpler explanations for observed behavior patterns be eliminated. Computer-generated simulations of non-goal-directed foraging behavior can be used as null models for higher complexity cognitive foraging, and can provide quantifiable data against which to compare the observed behavioral patterns in wild primates. In this paper, we compare the results of two variations of computer simulated null models with observed foraging behavior of wild spider monkeys (Ateles belzebuth). One model simulates monkeys searching using a modified random-walk model in which monkeys alternate 100-m steps with turn angles derived from observed behavior. The second model constrains travel to an observed route system derived from observations of wild spider monkeys. Simulated monkeys in each model searched among increasing densities of feeding trees ranging from 10 to 1,000. We compared travel distance, travel directness, and accuracy of starting direction for each feeding tree discovered for the two models. We then compared these results with those derived from observations of wild spider monkeys. Route-model monkeys traveled shorter distances and more directly to feeding trees than did randomly foraging monkeys, and discovered trees in the direction they started more often. Observed spider monkeys outperformed simulated monkeys from both models in all variables, allowing us to reject the null hypothesis that observed foraging and ranging behavior could be explained by non-goal-directed travel.


Subjects
Appetitive Behavior; Atelinae/psychology; Animals; Behavior, Animal; Computer Simulation; Ecuador; Feeding Behavior; Female; Locomotion; Trees
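Note on entry 4: the first null model in the abstract alternates 100-m steps with turn angles drawn from observed behavior. A minimal sketch of such a walk follows. The turn-angle list, the random seed and the directness measure (straight-line distance divided by distance traveled) are illustrative assumptions, not the study's data or code.

```python
import math
import random


def random_walk(n_steps: int, turn_angles_deg: list[float],
                step_m: float = 100.0, seed: int = 0) -> list[tuple[float, float]]:
    """Simulate a walk of fixed-length steps; before each step the heading
    changes by a turn angle sampled from the supplied empirical list."""
    rng = random.Random(seed)
    x = y = 0.0
    heading = rng.uniform(0.0, 2.0 * math.pi)
    path = [(x, y)]
    for _ in range(n_steps):
        heading += math.radians(rng.choice(turn_angles_deg))
        x += step_m * math.cos(heading)
        y += step_m * math.sin(heading)
        path.append((x, y))
    return path


def directness(path: list[tuple[float, float]]) -> float:
    """Straight-line distance from start to end divided by distance traveled."""
    traveled = sum(math.hypot(bx - ax, by - ay)
                   for (ax, ay), (bx, by) in zip(path, path[1:]))
    (sx, sy), (ex, ey) = path[0], path[-1]
    return math.hypot(ex - sx, ey - sy) / traveled if traveled else 0.0


# Hypothetical turn angles in degrees; a real model would sample observed ones.
walk = random_walk(50, [-90.0, -30.0, 0.0, 30.0, 90.0], seed=1)
walk_directness = directness(walk)
```

A walk with all turn angles equal to zero has a directness of 1, which gives a quick sanity check on the simulation before comparing it with observed travel paths.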
5.
BMC Biotechnol ; 12: 16, 2012 May 03.
Article in English | MEDLINE | ID: mdl-22554190

ABSTRACT

BACKGROUND: Expressed Sequence Tag (EST) sequences are widely used in applications such as genome annotation, gene discovery and gene expression studies. However, some GenBank dbEST sequences have proven to be "unclean". Identification of cDNA termini/ends and their structures in raw ESTs not only facilitates data quality control and accurate delineation of transcription ends, but also furthers our understanding of the potential sources of data abnormalities/errors present in the wet-lab procedures for cDNA library construction. RESULTS: After analyzing a total of 309,976 raw Pinus taeda ESTs, we uncovered many distinct variations of cDNA termini, some of which prove to be good indicators of wet-lab artifacts, and characterized each raw EST by its cDNA terminus structure patterns. In contrast to the expected patterns, many ESTs displayed complex and/or abnormal patterns that represent potential wet-lab errors such as: a failure of one or both of the restriction enzymes to cut the plasmid vector; a failure of the restriction enzymes to cut the vector at the correct positions; the insertion of two cDNA inserts into a single vector; the insertion of multiple and/or concatenated adapters/linkers; the presence of 3'-end terminal structures in designated 5'-end sequences or vice versa; and so on. With a close examination of these artifacts, many problematic ESTs that have been deposited into public databases by conventional bioinformatics pipelines or tools could be cleaned or filtered by our methodology. We developed a software tool for Abnormality Filtering and Sequence Trimming for ESTs (AFST, http://code.google.com/p/afst/) using a pattern analysis approach. To compare AFST with other pipelines that submitted ESTs into dbEST, we reprocessed 230,783 Pinus taeda and 38,709 Arachis hypogaea GenBank ESTs. We found 7.4% of Pinus taeda and 29.2% of Arachis hypogaea GenBank ESTs are "unclean" or abnormal, all of which could be cleaned or filtered by AFST.
CONCLUSIONS: cDNA terminal pattern analysis, as implemented in the AFST software tool, can be utilized to reveal wet-lab errors such as restriction enzyme cutting abnormalities and chimeric EST sequences, detect various data abnormalities embedded in existing Sanger EST datasets, improve the accuracy of identifying and extracting bona fide cDNA inserts from raw ESTs, and therefore greatly benefit downstream EST-based applications.


Subjects
DNA Restriction Enzymes/metabolism; DNA, Complementary/metabolism; Expressed Sequence Tags; Gene Library; Computational Biology; DNA, Complementary/genetics; Databases, Genetic; Genes, Plant; Genetic Vectors/genetics; Genetic Vectors/metabolism; Internet; Pinus taeda/genetics; Software
6.
BMC Evol Biol ; 9: 89, 2009 May 05.
Article in English | MEDLINE | ID: mdl-19416516

ABSTRACT

BACKGROUND: The rate at which neutral (non-functional) bases undergo substitution is highly dependent on their location within a genome. However, it is not clear how fast these location-dependent rates change, or to what extent the substitution rate patterns are conserved between lineages. To address this question, which is critical not only for understanding the substitution process but also for evaluating phylogenetic footprinting algorithms, we examine ancestral repeats: a predominantly neutral dataset with a significantly higher genomic density than other datasets commonly used to study substitution rate variation. Using this repeat data, we measure the extent to which orthologous ancestral repeat sequences exhibit similar substitution patterns in separate mammalian lineages, allowing us to ascertain how well local substitution rates have been preserved across species. RESULTS: We calculated substitution rates for each ancestral repeat in each of three independent mammalian lineages (primate - from human/macaque alignments, rodent - from mouse/rat alignments, and laurasiatheria - from dog/cow alignments). We then measured the correlation of local substitution rates among these lineages. Overall we found the correlations between lineages to be statistically significant, but too weak to have much predictive power (r² < 5%). These correlations were found to be primarily driven by regional effects at the scale of several hundred kb or larger. A few repeat classes (e.g. 7SK, Charlie8, and MER121) also exhibited stronger conservation of rate patterns, likely due to the effect of repeat-specific purifying selection. These classes should be excluded when estimating local neutral substitution rates. CONCLUSION: Although local neutral substitution rates have some correlations among mammalian species, these correlations have little predictive power on the scale of individual repeats. This indicates that local substitution rates have changed significantly among the lineages we have studied, and are likely to have changed even more for more diverged lineages. The correlations that do persist are too weak to be responsible for many of the highly conserved elements found by phylogenetic footprinting algorithms, leading us to conclude that such elements must be conserved due to selective forces.


Subjects
DNA Mutational Analysis; Evolution, Molecular; Genome; Mammals/genetics; Algorithms; Animals; Conserved Sequence; Humans; Mice; Models, Genetic; Phylogeny; Sequence Alignment
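Note on entry 6: the abstract summarizes predictive power as r², the squared Pearson correlation between per-repeat substitution rates in two lineages. The sketch below shows how that statistic is computed. The function name and the toy rate vectors are assumptions and do not come from the study.

```python
def r_squared(xs: list[float], ys: list[float]) -> float:
    """Squared Pearson correlation between two equal-length rate vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Toy per-repeat rates for two lineages (made-up numbers).
lineage_a = [1.0, 2.0, 3.0, 4.0]
lineage_b_linear = [2.0, 4.0, 6.0, 8.0]     # perfectly predictable from lineage_a
lineage_b_unrelated = [1.0, -1.0, -1.0, 1.0]  # no linear relationship

high = r_squared(lineage_a, lineage_b_linear)
low = r_squared(lineage_a, lineage_b_unrelated)
```

An r² below 5% means that less than a twentieth of the rate variance in one lineage is explained linearly by the other. This is why a correlation can be statistically significant on a large dataset and still have little predictive value for individual repeats.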
7.
Genome Biol ; 9(4): R76, 2008 Apr 30.
Article in English | MEDLINE | ID: mdl-18447906

ABSTRACT

BACKGROUND: The evolutionary distance between human and macaque is particularly attractive for investigating local variation in neutral substitution rates, because substitutions can be inferred more reliably than in comparisons with rodents and are less influenced by the effects of current and ancient diversity than in comparisons with closer primates. Here we investigate the human-macaque neutral substitution rate as a function of a number of genomic parameters. RESULTS: Using regression analyses we find that male mutation bias, male (but not female) recombination rate, distance to telomeres and substitution rates computed from orthologous regions in mouse-rat and dog-cow comparisons are prominent predictors of the neutral rate. Additionally, we demonstrate that the previously observed biphasic relationship between neutral rate and GC content can be accounted for by properly combining rates at CpG and non-CpG sites. Finally, we find the neutral rate to be negatively correlated with the densities of several classes of computationally predicted functional elements, and less so with the densities of certain classes of experimentally verified functional elements. CONCLUSION: Our results suggest that while female recombination may be mainly responsible for driving evolution in GC content, male recombination may be mutagenic, and that other mutagenic mechanisms acting near telomeres, and mechanisms whose effects are shared across mammalian genomes, play significant roles. We also have evidence that the nonlinear increase in rates at high GC levels may be largely due to hyper-mutability of CpG dinucleotides. Finally, our results suggest that the performance of conservation-based prediction methods can be improved by accounting for neutral rates.


Subjects
Mutation; Recombination, Genetic; Animals; Base Composition; Female; Humans; Kinetics; Macaca; Male; Regression Analysis; Sex Factors; Telomere
8.
Nucleic Acids Res ; 35(Database issue): D55-60, 2007 Jan.
Article in English | MEDLINE | ID: mdl-17099229

ABSTRACT

The Pseudogene.org knowledgebase serves as a comprehensive repository for pseudogene annotation. The definition of a pseudogene varies within the literature, resulting in significantly different approaches to the problem of identification. Consequently, it is difficult to maintain a consistent collection of pseudogenes in the detail necessary for their effective use. Our database is designed to address this issue. It integrates a variety of heterogeneous resources and supports a subset structure that highlights specific groups of pseudogenes that are of interest to the research community. Tools are provided for the comparison of sets and the creation of layered set unions, enabling researchers to derive a current 'consensus' set of pseudogenes. Additional features include versatile search, the capacity for robust interaction with other databases, the ability to reconstruct older versions of the database (accounting for changing genome builds) and an underlying object-oriented interface designed for researchers with a minimal knowledge of programming. At the present time, the database contains more than 100,000 pseudogenes spanning 64 prokaryote and 11 eukaryote genomes, including a collection of human annotations compiled from 16 sources.


Subjects
Databases, Genetic; Pseudogenes; Humans; Internet; Software; User-Computer Interface
9.
Bioinformatics ; 22(12): 1437-9, 2006 Jun 15.
Article in English | MEDLINE | ID: mdl-16574694

ABSTRACT

MOTIVATION: Mammalian genomes contain many 'genomic fossils', i.e. pseudogenes. These are disabled copies of functional genes that have been retained in the genome by gene duplication or retrotransposition events. Pseudogenes are important resources in understanding the evolutionary history of genes and genomes. RESULTS: We have developed a homology-based computational pipeline ('PseudoPipe') that can search a mammalian genome and identify pseudogene sequences in a comprehensive and consistent manner. The key steps in the pipeline involve using BLAST to rapidly cross-reference potential 'parent' proteins against the intergenic regions of the genome and then processing the resulting 'raw hits', i.e. eliminating redundant ones, clustering together neighbors, and associating and aligning clusters with a unique parent. Finally, pseudogenes are classified based on a combination of criteria including homology, intron-exon structure, and existence of stop codons and frameshifts.


Subjects
Computational Biology/methods; Algorithms; Animals; Automation; Evolution, Molecular; Genome; Humans; Models, Genetic; Pseudogenes; Reproducibility of Results; Software
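Note on entry 9: the abstract describes post-processing raw BLAST hits by eliminating redundant ones and clustering neighbors. One simple way to picture that step is an interval merge over hit coordinates, sketched below. This is not PseudoPipe's code. The tuple layout, the `max_gap` parameter and its values are assumptions for illustration.

```python
def cluster_hits(hits: list[tuple[int, int]], max_gap: int = 0) -> list[tuple[int, int]]:
    """Merge (start, end) hits on one strand of one chromosome.

    Exact duplicates are dropped first; hits that overlap or lie within
    max_gap bases of the current cluster are merged into it."""
    clusters: list[list[int]] = []
    for start, end in sorted(set(hits)):
        if clusters and start <= clusters[-1][1] + max_gap:
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            clusters.append([start, end])
    return [(s, e) for s, e in clusters]


raw_hits = [(100, 200), (100, 200), (150, 260), (300, 350), (1000, 1100)]
strict = cluster_hits(raw_hits)              # merge overlapping hits only
loose = cluster_hits(raw_hits, max_gap=50)   # also merge hits up to 50 bases apart
```

Sorting first keeps the merge to a single pass, so the cost is dominated by the sort. A real pipeline would also carry the parent protein and alignment score with each hit.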
10.
Genome Res ; 16(2): 271-81, 2006 Feb.
Article in English | MEDLINE | ID: mdl-16365382

ABSTRACT

A recent development in microarray research entails the unbiased coverage, or tiling, of genomic DNA for the large-scale identification of transcribed sequences and regulatory elements. A central issue in designing tiling arrays is that of arriving at a single-copy tile path, as significant sequence cross-hybridization can result from the presence of non-unique probes on the array. Due to the fragmentation of genomic DNA caused by the widespread distribution of repetitive elements, the problem of obtaining adequate sequence coverage increases with the sizes of subsequence tiles that are to be included in the design. This becomes increasingly problematic when considering complex eukaryotic genomes that contain many thousands of interspersed repeats. The general problem of sequence tiling can be framed as finding an optimal partitioning of non-repetitive subsequences over a prescribed range of tile sizes, on a DNA sequence comprising repetitive and non-repetitive regions. Exact solutions to the tiling problem become computationally infeasible when applied to large genomes, but successive optimizations are developed that allow their practical implementation. These include an efficient method for determining the degree of similarity of many oligonucleotide sequences over large genomes, and two algorithms for finding an optimal tile path composed of longer sequence tiles. The first algorithm, a dynamic programming approach, finds an optimal tiling in linear time and space; the second applies a heuristic search to reduce the space complexity to a constant requirement. A Web resource has also been developed, accessible at http://tiling.gersteinlab.org, to generate optimal tile paths from user-provided DNA sequences.


Subjects
Algorithms; Gene Expression Profiling; Genome, Human; Interspersed Repetitive Sequences; Oligonucleotide Array Sequence Analysis; Animals; Gene Expression Profiling/methods; Gene Expression Profiling/standards; Genome, Human/genetics; Humans; Interspersed Repetitive Sequences/genetics; Oligonucleotide Array Sequence Analysis/methods; Oligonucleotide Array Sequence Analysis/standards; Reproducibility of Results; Sensitivity and Specificity
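Note on entry 10: the abstract frames tiling as partitioning the non-repetitive parts of a sequence into tiles within a prescribed size range, and mentions a dynamic programming solution. The sketch below is one possible formulation and not the published algorithm. It assumes the objective is to maximize the number of covered bases, and it takes a boolean repeat mask as input. For a fixed tile-size range the running time grows linearly with sequence length.

```python
def best_tiling(is_repeat: list[bool], lo: int, hi: int) -> tuple[int, list[tuple[int, int]]]:
    """Choose non-overlapping tiles of length lo..hi that avoid repeat
    positions and cover as many bases as possible.

    Returns (covered_bases, tiles) with tiles as half-open (start, end)."""
    n = len(is_repeat)
    best = [0] * (n + 1)    # best[i]: max coverage within the first i bases
    choice = [0] * (n + 1)  # length of the tile ending at i, or 0 if none
    run = 0                 # length of the repeat-free run ending at i
    for i in range(1, n + 1):
        run = 0 if is_repeat[i - 1] else run + 1
        best[i] = best[i - 1]
        for length in range(lo, min(hi, run) + 1):
            if best[i - length] + length > best[i]:
                best[i] = best[i - length] + length
                choice[i] = length
    # Walk back through the choices to recover the tile path.
    tiles = []
    i = n
    while i > 0:
        if choice[i]:
            tiles.append((i - choice[i], i))
            i -= choice[i]
        else:
            i -= 1
    return best[n], tiles[::-1]


# 'R' marks repetitive bases, 'N' non-repetitive ones (toy mask).
mask = [c == "R" for c in "NNNNRRNNNNNNN"]
covered, tiles = best_tiling(mask, lo=3, hi=5)
```

In this toy mask every non-repetitive base can be covered, since the 4-base run fits one tile and the 7-base run splits into tiles of 4 and 3. A run shorter than `lo` would be left uncovered.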
11.
J Mol Biol ; 349(1): 27-45, 2005 May 27.
Article in English | MEDLINE | ID: mdl-15876366

ABSTRACT

Pseudogenes are inheritable genetic elements formally defined by two properties: their similarity to functioning genes and their presumed lack of activity. However, their precise characterization, particularly with respect to the latter quality, has proven elusive. An opportunity to explore this issue arises from the recent emergence of tiling-microarray data showing that intergenic regions (containing pseudogenes) are transcribed to a great degree. Here we focus on the transcriptional activity of pseudogenes on human chromosome 22. First, we integrated several sets of annotation to define a unified list of 525 pseudogenes on the chromosome. To characterize these further, we developed a comprehensive list of genomic features based on conservation in related organisms, expression evidence, and the presence of upstream regulatory sites. Of the 525 unified pseudogenes we could confidently classify 154 as processed and 49 as duplicated. Using data from tiling microarrays, especially from recent high-resolution oligonucleotide arrays, we found some evidence that up to a fifth of the 525 pseudogenes are potentially transcribed. Expressed sequence tags (EST) comparison further validated a number of these, and overall we found 17 pseudogenes with strong support for transcription. In particular, one of the pseudogenes with both EST and microarray evidence for transcription turned out to be a duplicated pseudogene in the cat eye syndrome critical region. Although we could not identify a meaningful number of transcription factor-binding sites (based on chromatin immunoprecipitation-chip data) near pseudogenes, we did find that approximately 12% of the pseudogenes had upstream CpG islands. Finally, analysis of corresponding syntenic regions in the mouse, rat and chimp genomes indicates, as previously suggested, that pseudogenes are less conserved than genes, but more preserved than the intergenic background (all notation is available from http://www.pseudogene.org).


Subjects
Chromosomes, Human, Pair 22; Pseudogenes/physiology; Transcription, Genetic/physiology; Animals; Base Sequence; Binding Sites; Chromosome Mapping; Expressed Sequence Tags; Humans; Mice; Molecular Sequence Data; Oligonucleotide Array Sequence Analysis; Pan troglodytes/genetics; Point Mutation; Rats; Transcription Factors/metabolism
12.
Nucleic Acids Res ; 32(1): 328-37, 2004.
Article in English | MEDLINE | ID: mdl-14724320

ABSTRACT

Biological networks are a topic of great current interest, particularly with the publication of a number of large genome-wide interaction datasets. They are globally characterized by a variety of graph-theoretic statistics, such as the degree distribution, clustering coefficient, characteristic path length and diameter. Moreover, real protein networks are quite complex and can often be divided into many sub-networks through systematic selection of different nodes and edges. For instance, proteins can be sub-divided by expression level, length, amino-acid composition, solubility, secondary structure and function. A challenging research question is to compare the topologies of sub-networks, looking for global differences associated with different types of proteins. TopNet is an automated web tool designed to address this question, calculating and comparing topological characteristics for different sub-networks derived from any given protein network. It provides reasonable solutions to the calculation of network statistics for sub-networks embedded within a larger network and gives simplified views of a sub-network of interest, allowing one to navigate through it. After constructing TopNet, we applied it to the interaction networks and protein classes currently available for yeast. We were able to find a number of potential biological correlations. In particular, we found that soluble proteins had more interactions than membrane proteins. Moreover, amongst soluble proteins, those that were highly expressed, had many polar amino acids, and had many alpha helices, tended to have the most interaction partners. Interestingly, TopNet also turned up some systematic biases in the current yeast interaction network: on average, proteins with a known functional classification had many more interaction partners than those without. This phenomenon may reflect the incompleteness of the experimentally determined yeast interaction network.


Subjects
Proteins/chemistry; Proteins/metabolism; Software; Algorithms; Computational Biology; Databases, Protein; Genomics; Internet; Molecular Weight; Protein Binding; Protein Structure, Secondary; Proteins/genetics; Proteomics; RNA, Messenger/analysis; Saccharomyces cerevisiae Proteins/chemistry; Saccharomyces cerevisiae Proteins/genetics; Saccharomyces cerevisiae Proteins/metabolism; Solubility
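Note on entry 12: the abstract names degree distribution and clustering coefficient among the statistics compared across sub-networks. The sketch below computes both for a small undirected graph and for a sub-network induced by a node subset. It is not TopNet's code. The function names, the toy edge list and the choice of an induced subgraph are assumptions for illustration.

```python
from collections import Counter


def build_graph(edges: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Undirected adjacency sets from an edge list."""
    adj: dict[str, set[str]] = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def induced_subgraph(adj: dict[str, set[str]], nodes: set[str]) -> dict[str, set[str]]:
    """Keep only the chosen nodes and the edges between them."""
    return {u: adj[u] & nodes for u in nodes if u in adj}


def degree_distribution(adj: dict[str, set[str]]) -> Counter:
    """Map each degree value to the number of nodes that have it."""
    return Counter(len(nbrs) for nbrs in adj.values())


def clustering_coefficient(adj: dict[str, set[str]], node: str) -> float:
    """Fraction of pairs of the node's neighbors that are themselves linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))


# Toy network: a triangle A-B-C with one extra partner D attached to A.
network = build_graph([("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")])
whole = clustering_coefficient(network, "A")
sub = clustering_coefficient(induced_subgraph(network, {"A", "B", "C"}), "A")
```

The same protein can score differently in the whole network and in a sub-network, as node A does here. That difference is the kind of comparison the abstract describes.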