1.
J Bioinform Comput Biol; 10(2): 1241005, 2012 Apr.
Article in English | MEDLINE | ID: mdl-22809341

ABSTRACT

The new generation of short-read sequencing technologies requires reliable measures of data quality. Such measures are especially important for variant calling. In the particular case of SNP calling, a great number of false-positive SNPs (FP-SNPs) may be obtained, so one needs to distinguish putative SNPs from sequencing or other errors. We found that not only the probability of a sequencing error (i.e. the quality value) is important for identifying an FP-SNP, but also the conditional probability of "correcting" this error (the "second best call" probability, conditional on that of the first call). Surprisingly, around 80% of mismatches can be "corrected" with this second call. Another way to reduce the rate of FP-SNPs is to retrieve DNA motifs that seem to be prone to sequencing errors, and to attach a corresponding conditional quality value to these motifs. We have developed several measures to distinguish between sequencing errors and candidate SNPs, based on a base call's nucleotide context and its mismatch type. In addition, we suggest a simple method to correct the majority of mismatches, based on the conditional probability of their "second" best intensity call. To each mismatch we attach a corresponding second-call confidence (quality value) of being corrected.
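The idea of a "second best call" conditional on the first call being wrong can be illustrated with a small sketch. This is not the authors' method: it assumes, purely for illustration, that call probabilities are proportional to the four raw channel intensities, and the names `second_call_summary` and `phred` are hypothetical.

```python
import math


def second_call_summary(intensities):
    """Rank base calls from raw intensities.

    Toy assumption: the probability of each base is proportional to its
    intensity. Real base callers use more elaborate models.
    """
    total = sum(intensities.values())
    probs = {base: value / total for base, value in intensities.items()}
    ranked = sorted(probs, key=probs.get, reverse=True)
    first, second = ranked[0], ranked[1]
    p_first_wrong = 1.0 - probs[first]
    # P(second call is the true base | first call is wrong)
    p_second = probs[second] / p_first_wrong if p_first_wrong > 0 else 0.0
    return first, second, p_second


def phred(p_error):
    """Convert an error probability into a Phred-scaled quality value."""
    # Clamp to avoid log10(0) when the error probability is exactly zero.
    return -10.0 * math.log10(max(p_error, 1e-10))


# Hypothetical intensities for one sequencing cycle.
first, second, p_second = second_call_summary({"A": 60, "C": 30, "G": 6, "T": 4})
# Quality value attached to the "corrected" (second) call.
second_call_quality = phred(1.0 - p_second)
```

In this toy cycle the first call is A and the second call is C. If A turns out to be a mismatch, C accounts for most of the remaining probability mass, which is the situation in which "correcting" with the second call is plausible.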


Subjects
Sequence Analysis, DNA/methods; Algorithms; Nucleotide Motifs; Polymorphism, Single Nucleotide; Research Design
2.
Biosystems; 91(1): 183-94, 2008 Jan.
Article in English | MEDLINE | ID: mdl-18029086

ABSTRACT

In this paper we analyse the efficiency of two methods, rescaled range analysis and detrended fluctuation analysis, in distinguishing between coding DNA, regulatory DNA and non-coding non-regulatory DNA of Drosophila melanogaster. Both methods were used to estimate the degree of sequential dependence (or persistence) among nucleotides. We found that these three types of DNA can be discriminated by both methods, although rescaled range analysis performs slightly better than detrended fluctuation analysis. On average, non-coding, non-regulatory DNA has the highest degree of sequential persistence. Coding DNA could be characterised as being anti-persistent, which is in line with earlier findings of latent periodicity. Regulatory regions are shown to possess intermediate sequential dependency. Together with other available methods, rescaled range and detrended fluctuation analysis on the basis of a combined purine/pyrimidine and weak/strong classification of the nucleotides are useful tools for refined structural and functional segmentation of DNA.
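Rescaled range (R/S) analysis can be sketched in a few lines. The version below is a minimal illustration and not the implementation used in the paper: it encodes the sequence by purine/pyrimidine class only (the combined weak/strong classification is not modelled), uses non-overlapping windows, and estimates the Hurst exponent as the least-squares slope of log(R/S) against log(window size). All function names are hypothetical.

```python
import math
import random


def encode_purine_pyrimidine(seq):
    """Map purines (A, G) to +1 and pyrimidines (C, T) to -1."""
    return [1.0 if base in "AG" else -1.0 for base in seq.upper() if base in "ACGT"]


def rescaled_range(chunk):
    """Return R/S for one window, or None if the window is constant."""
    n = len(chunk)
    mean = sum(chunk) / n
    cumulative = lowest = highest = squares = 0.0
    for value in chunk:
        deviation = value - mean
        cumulative += deviation
        lowest = min(lowest, cumulative)
        highest = max(highest, cumulative)
        squares += deviation * deviation
    std = math.sqrt(squares / n)
    return (highest - lowest) / std if std > 0 else None


def hurst_exponent(series, window_sizes):
    """Slope of log(mean R/S) versus log(window size)."""
    xs, ys = [], []
    for n in window_sizes:
        values = []
        for start in range(0, len(series) - n + 1, n):
            rs = rescaled_range(series[start:start + n])
            if rs is not None:
                values.append(rs)
        if values:
            xs.append(math.log(n))
            ys.append(math.log(sum(values) / len(values)))
    if len(xs) < 2:
        raise ValueError("need at least two usable window sizes")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Usage on a synthetic, uncorrelated sequence (not real genomic data).
rng = random.Random(1)
random_seq = "".join(rng.choice("ACGT") for _ in range(4096))
h_random = hurst_exponent(encode_purine_pyrimidine(random_seq), [16, 32, 64, 128, 256, 512])
```

Values of the exponent above 0.5 are conventionally read as persistence and values below 0.5 as anti-persistence. A strictly alternating purine/pyrimidine sequence gives a slope of 0 with this sketch, the extreme anti-persistent case.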


Subjects
Open Reading Frames/genetics; RNA, Untranslated/genetics; Regulatory Sequences, Nucleic Acid/genetics; Animals; Computers; DNA/analysis; DNA/genetics; Genome/genetics; Humans
3.
J Bioinform Comput Biol; 4(2): 425-41, 2006 Apr.
Article in English | MEDLINE | ID: mdl-16819793

ABSTRACT

One of the main goals of analysing DNA sequences is to understand the temporal and positional information that specifies gene expression. An important step in this process is the recognition of gene expression regulatory elements. Experimental procedures for this are slow and costly. In this paper we present a computational unsupervised algorithm that facilitates the process by statistically identifying the most likely regions within a putative regulatory sequence. A probabilistic technique is presented, based on the approximation of regulatory DNA with a Markov chain, for the location of putative transcription factor binding sites in a single stretch of DNA. To this end, we developed a procedure to approximate the order of the Markov model for a given DNA sequence that circumvents some of the prohibitive assumptions underlying Markov modeling. Application of the algorithm to data from 55 genes in five species shows the high sensitivity of this Markov search algorithm. Our algorithm does not require any prior knowledge in the form of description or cross-genomic comparison; it is context sensitive and takes DNA heterogeneity into account.
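The general idea of approximating a sequence with a Markov chain and then looking for windows the chain explains poorly can be sketched as follows. This is an illustration under stated assumptions and not the published algorithm: the order is fixed by the caller (the paper's order-approximation procedure is not reproduced), the model is fitted to the same sequence it scores, the input is assumed to contain only A, C, G and T, and all function names are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def fit_markov(seq, order=1, pseudocount=1.0):
    """Estimate transition probabilities P(base | previous `order` bases)."""
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, pseudocount))
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values())
        model[context] = {base: count / total for base, count in row.items()}
    return model


def window_log_likelihood(seq, model, start, width, order=1):
    """Sum of log transition probabilities over one window."""
    uniform = 1.0 / len(ALPHABET)
    total = 0.0
    for i in range(max(start, order), start + width):
        row = model.get(seq[i - order:i])
        # Unseen contexts fall back to a uniform distribution.
        total += math.log(row[seq[i]] if row else uniform)
    return total


def least_expected_windows(seq, width, order=1, top=3):
    """Return the `top` windows with the lowest log-likelihood as (score, start)."""
    model = fit_markov(seq, order)
    scores = [
        (window_log_likelihood(seq, model, start, width, order), start)
        for start in range(order, len(seq) - width + 1)
    ]
    return sorted(scores)[:top]


# Synthetic example: a G-rich insert inside an otherwise regular background.
toy = "AC" * 50 + "GGGGGGGG" + "AC" * 50
candidates = least_expected_windows(toy, width=8)
```

In the synthetic example the lowest-scoring windows overlap the inserted G run, because its transitions are rare under the chain fitted to the whole sequence. Whether low or high likelihood marks a candidate site depends on how the model is trained, which this sketch does not settle.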


Subjects
Chromosome Mapping/methods; DNA/genetics; Regulatory Sequences, Nucleic Acid/genetics; Sequence Alignment/methods; Sequence Analysis, DNA/methods; Transcription Factors/genetics; Artificial Intelligence; Binding Sites; Computer Simulation; DNA/chemistry; Markov Chains; Models, Genetic; Models, Statistical; Pattern Recognition, Automated; Protein Binding; Transcription Factors/chemistry
4.
J Bioinform Comput Biol; 4(2): 523-36, 2006 Apr.
Article in English | MEDLINE | ID: mdl-16819800

ABSTRACT

Identifying regions of DNA with extreme statistical characteristics is an important aspect of the structural analysis of complete genomes. Linguistic methods, mainly based on estimating word frequency, can be used for this as they allow for the delineation of regions of low complexity. Low complexity may be due to biased nucleotide composition, tandem or dispersed repeats, palindrome-hairpin structures, or a combination of these features. We developed software tools in which various numerical measures of text complexity are implemented, including combinatorial and linguistic ones. We also added a Hurst exponent estimate to the software to measure dependencies in DNA sequences. By applying these tools to various functional genomic regions, we demonstrate that the complexity of introns and regulatory regions is lower than that of coding regions, whilst their Hurst exponent is larger. Further analysis of promoter sequences revealed that the lower complexity of these regions is associated with long-range correlations caused by transcription factor binding sites.
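One common word-frequency measure of this kind is linguistic complexity: for each word length k, the number of distinct k-mers observed in a window divided by the maximum number possible, multiplied over several k. The sketch below uses that definition as an assumption; the abstract does not say which exact measures the software implements, and the function names are hypothetical.

```python
def linguistic_complexity(window, max_k=3):
    """Product over k of (distinct k-mers observed) / (maximum possible)."""
    score = 1.0
    for k in range(1, max_k + 1):
        n_words = len(window) - k + 1
        if n_words <= 0:
            break
        observed = len({window[i:i + k] for i in range(n_words)})
        # At most 4**k distinct words exist, and at most n_words fit in the window.
        possible = min(4 ** k, n_words)
        score *= observed / possible
    return score


def complexity_profile(seq, width, step=1, max_k=3):
    """Sliding-window complexity as a list of (start, score) pairs."""
    return [
        (start, linguistic_complexity(seq[start:start + width], max_k))
        for start in range(0, len(seq) - width + 1, step)
    ]


# Synthetic example: a varied 10-mer followed by a homopolymer run.
profile = complexity_profile("ACGTTGCAAG" + "A" * 10, width=10)
lowest_start = min(profile, key=lambda item: item[1])[0]
```

Windows dominated by repeats or biased composition score close to 0, while windows that use the available vocabulary fully score close to 1, so minima of the profile delineate candidate low-complexity regions.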


Subjects
Algorithms; Chromosome Mapping/methods; Models, Genetic; Sequence Alignment/methods; Sequence Analysis, DNA/methods; Base Sequence; Computer Simulation; Data Interpretation, Statistical; Entropy; Models, Statistical; Molecular Sequence Data; Software
5.
BMC Bioinformatics; 6: 109, 2005 Apr 27.
Article in English | MEDLINE | ID: mdl-15857505

ABSTRACT

BACKGROUND: This paper addresses the problem of recognising DNA cis-regulatory modules which are located far from genes. Experimental procedures for this are slow and costly, and computational methods are hard, because they lack positional information. RESULTS: We present a novel statistical method, the "fluffy-tail test", to recognise regulatory DNA. We exploit one of the basic informational properties of regulatory DNA: abundance of over-represented transcription factor binding site (TFBS) motifs, although we do not look for specific TFBS motifs, per se. Though over-representation of TFBS motifs in regulatory DNA has been intensively exploited by many algorithms, it is still a difficult problem to distinguish regulatory from other genomic DNA. CONCLUSION: We show that, in the data used, our method is able to distinguish cis-regulatory modules by exploiting statistical differences between the probability distributions of similar words in regulatory and other DNA. The potential application of our method includes annotation of new genomic sequences and motif discovery.
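The abstract does not give the definition of the fluffy-tail test, so the following is only a loose illustration of the underlying idea of comparing counts of similar words against a shuffled control. It is not the published test. The assumptions are: "similar" means within one mismatch (Hamming distance), the statistic is the largest similar-word count in the sequence, and the reference distribution comes from composition-preserving shuffles. All names and parameter values are hypothetical.

```python
import random
from statistics import mean, pstdev


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def max_similar_word_count(seq, k=6, max_mismatches=1):
    """Largest number of words similar to any single word (itself included)."""
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return max(
        sum(1 for other in words if hamming(word, other) <= max_mismatches)
        for word in words
    )


def shuffle_z_score(seq, k=6, max_mismatches=1, n_shuffles=20, seed=0):
    """Z-score of the observed statistic against shuffled copies of seq."""
    rng = random.Random(seed)
    observed = max_similar_word_count(seq, k, max_mismatches)
    letters = list(seq)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves base composition, destroys word order
        null.append(max_similar_word_count("".join(letters), k, max_mismatches))
    spread = pstdev(null)
    if spread == 0:
        return float("inf") if observed > null[0] else 0.0
    return (observed - mean(null)) / spread


# Synthetic example: ten copies of a hypothetical motif in a random background.
rng = random.Random(42)
bases = [rng.choice("ACGT") for _ in range(180)]
for copy in range(10):
    bases[18 * copy:18 * copy + 6] = "GATTAC"
planted = "".join(bases)
z = shuffle_z_score(planted)
```

A sequence with an over-represented family of similar words yields a statistic well above that of its shuffled copies, without any specific motif having been supplied in advance. The pairwise comparison is quadratic in sequence length, so this sketch is only practical for short sequences.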


Subjects
Computational Biology/methods; DNA/chemistry; Drosophila melanogaster/genetics; Genome; Sequence Analysis, DNA; Algorithms; Amino Acid Motifs; Animals; Base Sequence; Binding Sites; Cell Nucleus/metabolism; Chromatin/metabolism; Cluster Analysis; Databases, Genetic; Genes, Insect; Genes, Regulator; Genomics; Models, Statistical; Molecular Sequence Data; Transcription, Genetic
...