2.
Nucleic Acids Res ; 26(20): 4748-57, 1998 Oct 15.
Article in English | MEDLINE | ID: mdl-9753745

ABSTRACT

Prediction of splice site selection and efficiency from sequence inspection is of fundamental interest (testing the current knowledge of requisite sequence features) and practical importance (genome annotation, design of mutant or transgenic organisms). In plants, the dominant variables affecting splice site selection and efficiency include the degree of matching to the extended splice site consensus and the local gradient of U- and G+C-composition (introns being U-rich and exons G+C-rich). We present a novel method for splice site prediction, which was particularly trained for maize and Arabidopsis thaliana. The method extends our previous algorithm based on logitlinear models by considering three variables simultaneously: intrinsic splice site strength, local optimality and fit with respect to the overall splice pattern prediction. We show that the method considerably improves prediction specificity without compromising the high degree of sensitivity required in gene prediction algorithms. Applications to gene identification are illustrated for Arabidopsis and suggest that successful methods must combine scoring for splice sites, coding potential and similarity with potential homologs in non-trivial ways. A WWW version of the SplicePredictor program is available at http://gnomic.stanford.edu/volker/SplicePredictor.html/


Subjects
Arabidopsis/genetics; Genes, Plant/genetics; RNA Splicing/genetics; Algorithms; Computer Simulation; DNA, Plant/genetics; Genome, Plant; Internet; Introns; Untranslated Regions; Zea mays/genetics
3.
Bioinformatics ; 14(3): 232-43, 1998.
Article in English | MEDLINE | ID: mdl-9614266

ABSTRACT

MOTIVATION: We developed GeneGenerator because of the need for a tool to predict gene structure without knowing in advance how to score potential exons and introns in order to obtain the best results, pertinent in particular to less well-studied organisms for which suitable training sets are small. GeneGenerator is a very flexible algorithm which for a given genomic sequence generates a number of feasible gene structures satisfying user-defined constraints. The specific implementation described in detail requires minimum scoring for translation start and donor and acceptor splice sites according to previously trained logitlinear models. In addition, potential exons and introns are required to exceed specified minimal lengths and threshold scores for coding or non-coding potential derived as log-likelihood ratios of appropriate Markov sequence models. RESULTS: A database of 46 non-redundant genomic sequences from maize is used for illustration. It is shown that the correct gene structures do not always maximize the considered target function. However, in most cases, the correct or nearly correct structures are found in a small set of high-scoring structures. A critical review of the generated structures sometimes allows the choices to be narrowed by considering additional variables such as predicted splice site strength or local optimality of splice site scores. Summary statistics for prediction accuracy over all 46 maize genes are derived under cross-validation and non-cross-validation training conditions for the Markov sequence models. The algorithm achieved exon sensitivity of 0.81 and specificity of 0.75 on an independent set of 14 novel maize genomic segments. AVAILABILITY: GeneGenerator runs under Borland-Pascal 7.0 using MS-DOS and C on UNIX workstations. The source code is available upon request. CONTACT: jkleffe@euler.grumed.fu-berlin.de
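
The abstract mentions coding potential scores derived as log-likelihood ratios of Markov sequence models. The following is a minimal sketch of that general idea, not the GeneGenerator implementation: the model order, the add-one smoothing, the function names `train_markov` and `log_likelihood_ratio`, and the toy training sequences are all assumptions made for illustration. Real coding models are usually frame-dependent, which this sketch ignores.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=1, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(base | previous `order` bases) with additive smoothing.

    Returns a function prob(context, base). Hypothetical helper.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0

    def prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(alphabet)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood_ratio(seq, coding, noncoding, order=1):
    """Sum of log(P_coding / P_noncoding) over all transitions in seq.

    Positive values favor the coding model, negative the non-coding one.
    """
    return sum(
        math.log(coding(seq[i - order:i], seq[i]) /
                 noncoding(seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )

# Toy training data (invented): G+C-rich "exon", T-rich "intron".
coding_model = train_markov(["GCGCGCGCGC"])
noncoding_model = train_markov(["TTTTATTTTT"])
```

A candidate exon would then pass a filter of the kind described when `log_likelihood_ratio(candidate, coding_model, noncoding_model)` exceeds a chosen threshold; the threshold itself is a tuning parameter not given in the abstract.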


Subjects
Algorithms; Genes, Plant/genetics; Sequence Analysis, DNA/methods; Software; Zea mays/genetics; Computational Biology/methods; DNA-Binding Proteins/genetics; Exons; Glucosyltransferases/genetics; Introns; Leucine Zippers; Logistic Models; Markov Chains; Models, Genetic; Plant Proteins; Software Validation; Transcription Factors/genetics
4.
J Mol Biol ; 276(1): 85-104, 1998 Feb 13.
Article in English | MEDLINE | ID: mdl-9514728

ABSTRACT

Heterologous introns are often inaccurately or inefficiently processed in higher plants. The precise features that distinguish the process of pre-mRNA splicing in plants from splicing in yeast and mammals are unclear. One contributing factor is the prominent base compositional contrast between U-rich plant introns and flanking G+C-rich exons. Inclusion of this contrast factor in recently developed statistical methods for splice site prediction from sequence inspection significantly improved prediction accuracy. We applied the prediction tools to re-analyze experimental data on splice site selection and splicing efficiency for native and more than 170 mutated plant introns. In almost all cases, the experimentally determined preferred sites correspond to the highest scoring sites predicted by the model. In native genes, about 90% of splice sites are the locally highest scoring sites within the bounds of the flanking exon and intron. We propose that, in most cases, local context (about 50 bases upstream and downstream from a potential intron end) is sufficient to account for intrinsic splice site strength, and that competition for trans-acting factors determines splice site selection in vivo. We suggest that computer-aided splice site prediction can be a powerful tool for experimental design and interpretation.
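
The notion of a "locally highest scoring site" can be illustrated with a short sketch. This is not the authors' procedure: the abstract bounds the comparison by the flanking exon and intron, whereas the sketch below uses a fixed window as a simplifying assumption, and the function name, window size, and scores are invented for illustration.

```python
def locally_optimal_sites(sites, window=50):
    """Return positions whose score is strictly higher than that of every
    other candidate site within `window` bases.

    sites: list of (position, score) pairs for candidate splice sites.
    The fixed window is an assumed stand-in for exon/intron bounds.
    """
    result = []
    for pos, score in sites:
        rivals = [s for p, s in sites if p != pos and abs(p - pos) <= window]
        if all(score > r for r in rivals):
            result.append(pos)
    return result

# Invented candidate donor sites: (position, model score).
candidates = [(100, 0.4), (120, 0.9), (160, 0.5), (300, 0.2)]
best = locally_optimal_sites(candidates)
```

In this toy input, the site at 120 outscores its neighbors at 100 and 160, and the site at 300 has no competitor within the window, so both are reported as locally optimal.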


Subjects
RNA Precursors/chemistry; RNA Splicing; RNA, Plant/chemistry; Animals; Arabidopsis/genetics; Base Composition; Base Sequence; Exons/genetics; Genes, Synthetic; Introns/genetics; Mammals/genetics; Models, Chemical; Pisum sativum/genetics; RNA/genetics; RNA Precursors/genetics; RNA, Plant/genetics; Species Specificity; Transgenes; Zea mays/genetics
5.
Nucleic Acids Res ; 24(23): 4709-18, 1996 Dec 01.
Article in English | MEDLINE | ID: mdl-8972857

ABSTRACT

Pre-mRNA splicing in plants, while generally similar to the processes in vertebrates and yeast, is thought to involve plant specific cis-acting elements. Both monocot and dicot introns are typically strongly enriched in U nucleotides, and AU- or U-rich segments are thought to be involved in intron recognition, splice site selection, and splicing efficiency. We have applied logitlinear models to find optimal combinations of splice site variables for the purpose of separating true splice sites from a large excess of potential sites. It is shown that plant splice site prediction from sequence inspection is greatly improved when compositional contrast between exons and introns is considered in addition to degree of matching to the splice site consensus (signal quality). The best model involves subclassification of splice sites according to the identity of the base immediately upstream of the GU and AG signals and gives substantial performance gains compared with conventional profile methods.
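
As a rough illustration of combining signal quality with compositional contrast, the sketch below assumes that a logitlinear model means the log-odds of a true site are linear in the predictor variables. The weights, the function names, and the definition of contrast as a difference in U fraction are all hypothetical; the abstract gives neither the fitted coefficients nor the exact variables, and the subclassification by the base preceding GU or AG is not modeled here.

```python
import math

def u_fraction(seq):
    """Fraction of U (or T, for DNA input) in a sequence; 0.0 if empty."""
    if not seq:
        return 0.0
    s = seq.upper().replace("T", "U")
    return s.count("U") / len(s)

def donor_site_probability(exon_flank, intron_flank, signal_score,
                           weights=(-4.0, 1.0, 6.0)):
    """Logistic score for a candidate donor site.

    exon_flank / intron_flank: sequence upstream / downstream of the site.
    signal_score: match to the splice site consensus (any numeric scale).
    weights: (intercept, signal weight, contrast weight), invented values.
    """
    b0, b_signal, b_contrast = weights
    # Plant introns are U-rich relative to exons, so a positive contrast
    # (more U downstream than upstream) should favor a donor site.
    contrast = u_fraction(intron_flank) - u_fraction(exon_flank)
    z = b0 + b_signal * signal_score + b_contrast * contrast
    return 1.0 / (1.0 + math.exp(-z))

p_contrast = donor_site_probability("GCGCGGCC", "UUUAUUUU", 2.0)
p_flat = donor_site_probability("GCGCGGCC", "GCGCGGCC", 2.0)
```

With the same signal score, the candidate flanked by a U-rich downstream region receives the higher probability, which is the qualitative effect the abstract attributes to compositional contrast.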


Subjects
Linear Models; RNA Precursors/chemistry; RNA Splicing; RNA, Messenger/chemistry; RNA, Plant/chemistry; Algorithms; Arabidopsis/genetics; Exons; Introns; Molecular Sequence Data; Zea mays/genetics
6.
Comput Appl Biosci ; 12(2): 119-27, 1996 Apr.
Article in English | MEDLINE | ID: mdl-8744774

ABSTRACT

SCL (Sequence Class Library) is a class library written in the C++ programming language. Designed using object-oriented programming principles, SCL consists of classes of objects performing tasks typically needed for analyzing DNA or protein sequences. Among them are very flexible sequence classes, classes accessing databases in various formats, classes managing collections of sequences, as well as classes performing higher-level tasks like calculating a pairwise sequence alignment. SCL also includes classes that provide general programming support, like a dynamically growing array, sets, matrices, strings, classes performing file input/output, and utilities for error handling. By providing these components, SCL fosters an explorative programming style: experimenting with algorithms and alternative implementations is encouraged rather than punished. A description of SCL's overall structure as well as an overview of its classes is given. Important aspects of the work with SCL are discussed in the context of a sample program.


Subjects
Programming Languages; Sequence Analysis/methods; Algorithms; Databases, Factual; Evaluation Studies as Topic; Sequence Alignment/methods; Sequence Alignment/statistics & numerical data; Sequence Analysis/statistics & numerical data
7.
Comput Chem ; 20(1): 123-33, 1996 Mar.
Article in English | MEDLINE | ID: mdl-16749185

ABSTRACT

We have explored the performance of the GeneMark gene identification method using cross-validation over learning samples of E. coli DNA sequences. The computations gave more accurate estimations of the error rates in comparison with previous results when a sample of non-coding regions was derived from GenBank sequences with many true coding regions unannotated. The error rate components have been classified and delineated. It was shown that the method performs differently on class I, II and III genes. The most frequent errors come from misinterpreting the coding potential of the complementary sequence in the same frame. The effects of stop-codons present in alternative frames were also studied to understand better the main factors contributing to GeneMark performance.


Subjects
Genes/genetics; Computational Biology; Escherichia coli/genetics; Open Reading Frames/genetics; Statistics as Topic
8.
Comput Appl Biosci ; 11(4): 449-55, 1995 Aug.
Article in English | MEDLINE | ID: mdl-8521055

ABSTRACT

DNASTAT is a collection of Pascal routines for researchers who develop their own application programs for statistical analysis of DNA and protein sequences. Dynamic and file-based data structures allow users to process sets of sequences by simple loop control without limitations on the number of sequences and their individual sizes. This frees the programmer from potentially error-prone tasks like dynamic memory allocation and controlling array sizes. Sequences can be stored in databases along with biological and statistical attributes. Individual sequences can be accessed by column name and row number as with spread-sheets. DNASTAT allows large sets of sequences to be processed using a PC with standard configuration. Its small size, simplicity and free availability make it attractive to students of mathematical biology. Use of DNASTAT is illustrated by two sample programs that generate a database of coding regions from the GenBank entry of the tobacco chloroplast genome. A version of DNASTAT written in ANSI-C for PCs and Unix workstations is also available.


Subjects
DNA/genetics; Programming Languages; Proteins/genetics; Sequence Analysis, DNA/methods; Sequence Analysis/methods; Base Sequence; Data Interpretation, Statistical; Databases, Factual; Genome, Plant; Molecular Sequence Data; Plants, Toxic; Software; Nicotiana/genetics
9.
Comput Appl Biosci ; 9(3): 275-83, 1993 Jun.
Article in English | MEDLINE | ID: mdl-8324628

ABSTRACT

A method was previously developed for computation of pattern probabilities in random sequences under Markov chain models. We extend this method to the calculation of the joint distribution for two patterns. An application yields the distribution of the right choice measure for expressivity and how significance bounds depend on sequence length. These bounds are used to show that the choice of pyrimidine in codon position 3 of Escherichia coli genes deviates considerably from a general Markov process model for coding regions. We also derive some statistical evidence that this significant deviation is limited to codon position 3.


Subjects
Base Sequence; Codon/genetics; Markov Chains; Models, Molecular; Algorithms; Escherichia coli/genetics; Gene Expression Regulation/genetics; Mathematical Computing; Probability; Random Allocation; Software; Stochastic Processes
10.
Comput Appl Biosci ; 8(5): 433-41, 1992 Oct.
Article in English | MEDLINE | ID: mdl-1422876

ABSTRACT

An exact expression for the variance of random frequency that a given word has in text generated by a Markov chain is presented. The result is applied to periodic Markov chains, which describe the protein-coding DNA sequences better than simple Markov chains. A new solution to the problem of word overlap is proposed. It was found that the expected frequency and overlapping properties determine most of the variance. The expectation and variance of counts for triplets are compared with experimental counts in Escherichia coli coding sequences.


Subjects
Base Sequence; Markov Chains; Models, Statistical; Algorithms; Models, Molecular; Stochastic Processes
11.
Comput Appl Biosci ; 6(4): 347-53, 1990 Oct.
Article in English | MEDLINE | ID: mdl-2257495

ABSTRACT

Observed patterns in macromolecular sequences are often considered as words and compared with their probabilities of occurring in random sequences. Calculation of these probabilities, however, often lacks rigour. We have developed an algorithm for exact computation of such probabilities for stochastic sequences that follow a Markov chain model. The method is applicable to the case that a random sequence contains one out of two given patterns P and Q, or both simultaneously. Another application yields the probability function P(x) that a sequence contains pattern P exactly x times. An application to patterns that include wild-card characters yields probabilities for homonucleotide clusters of a given length. We prove the probability of multiple runs of single nucleotides in the SV40 genome to be in accordance with the dinucleotide composition of the sequence, although it is in conflict with mononucleotide composition.
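
One way to compute the probability function P(x) exactly is a dynamic program over a pattern-matching automaton. The sketch below is such a construction under stated assumptions, and is not claimed to be the algorithm of the paper: it assumes a first-order Markov chain given by an initial distribution `init` and transition table `trans`, counts overlapping occurrences, and uses hypothetical function names.

```python
from collections import defaultdict

def build_automaton(pattern, alphabet):
    """delta[k][c] = length of the longest prefix of `pattern` that is a
    suffix of pattern[:k] + c (a KMP-style matching automaton)."""
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for k in range(m + 1):
        for c in alphabet:
            s = pattern[:k] + c
            j = min(m, len(s))
            while j > 0 and pattern[:j] != s[-j:]:
                j -= 1
            delta[k][c] = j
    return delta

def occurrence_distribution(pattern, length, init, trans):
    """Exact P(x) that a sequence of `length` >= 1 bases contains `pattern`
    exactly x times (overlaps counted), under a first-order Markov chain."""
    alphabet = sorted(init)
    m = len(pattern)
    delta = build_automaton(pattern, alphabet)
    # State: (last base, automaton state, occurrences so far) -> probability.
    dist = defaultdict(float)
    for c in alphabet:
        k = delta[0][c]
        dist[(c, k, 1 if k == m else 0)] += init[c]
    for _ in range(length - 1):
        nxt = defaultdict(float)
        for (a, k, n), p in dist.items():
            for c in alphabet:
                k2 = delta[k][c]
                nxt[(c, k2, n + (1 if k2 == m else 0))] += p * trans[a][c]
        dist = nxt
    result = defaultdict(float)
    for (_, _, n), p in dist.items():
        result[n] += p
    return dict(result)

# Uniform, independent bases expressed as a Markov chain (toy parameters).
bases = "ACGT"
init = {b: 0.25 for b in bases}
trans = {a: {b: 0.25 for b in bases} for a in bases}
px = occurrence_distribution("AA", 3, init, trans)
```

For the toy uniform model and a sequence of length 3, only AAA contains AA twice, and six sequences contain it exactly once, so `px` assigns 1/64 to x = 2 and 6/64 to x = 1. Replacing `trans` with dinucleotide-derived transition probabilities would give the kind of comparison the abstract describes for homonucleotide runs.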


Subjects
Algorithms; DNA/chemistry; Markov Chains; Base Sequence; Microcomputers; Programming Languages