1.
Science ; 291(5507): 1304-51, 2001 02 16.
Article in English | MEDLINE | ID: mdl-11181995

ABSTRACT

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies, a whole-genome assembly and a regional chromosome assembly, were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history.
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.


Subjects
Human Genome , Human Genome Project , DNA Sequence Analysis , Algorithms , Animals , Chromosome Banding , Chromosome Mapping , Bacterial Artificial Chromosomes , Computational Biology , Consensus Sequence , CpG Islands , Intergenic DNA , Factual Databases , Molecular Evolution , Exons , Female , Gene Duplication , Genes , Genetic Variation , Humans , Introns , Male , Phenotype , Physical Chromosome Mapping , Single Nucleotide Polymorphism , Proteins/genetics , Proteins/physiology , Pseudogenes , Nucleic Acid Repetitive Sequences , Retroelements , DNA Sequence Analysis/methods , Species Specificity
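
The abstract for entry 1 mentions shredding the public data into 550-bp segments at 2.9-fold coverage. A minimal sketch of that idea follows, assuming evenly spaced, overlapping tiles; the function name `shred` and the even-spacing rule are illustrative assumptions, not the procedure the authors actually used.

```python
def shred(sequence: str, segment_len: int = 550, coverage: float = 2.9) -> list[str]:
    """Cut `sequence` into overlapping fixed-length segments.

    Assumption: consecutive segments start `segment_len / coverage` bases
    apart, so the segments together contain about `coverage` times as many
    bases as the input.
    """
    if len(sequence) <= segment_len:
        return [sequence]
    step = segment_len / coverage
    last_start = len(sequence) - segment_len
    starts = []
    position = 0.0
    while position < last_start:
        starts.append(int(position))
        position += step
    starts.append(last_start)  # make sure the tail of the sequence is covered
    return [sequence[s:s + segment_len] for s in starts]
```

Dividing the total number of shredded bases by the input length gives the achieved coverage, which approaches the requested value for inputs much longer than one segment.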
2.
Mol Diagn ; 6(4): 243-52, 2001 Dec.
Article in English | MEDLINE | ID: mdl-11774190

ABSTRACT

The approach of whole-genome shotgun sequencing coupled with the availability of computational algorithms to facilitate the assembly, gene prediction, and functional annotation of entire genomes has sparked a revolution in our understanding of the biology of free-living organisms. More than 40 bacterial genomes have been sequenced to date, of which several are important human pathogens. The capacity to sequence and assemble entire genomes of bacteria, pathogenic protozoans, and fungi in a rapid and cost-effective way has energized every aspect of microbial science. Comparative genome analysis allows us to dissect the evolutionary forces at work and provides insights into adaptations of microbes to their unique ecological niches. Factors that shape host-pathogen interactions and their outcomes include genetic polymorphisms in the microbial pathogen and host, both of which can impact on microbial virulence or host immune responses to infection. The availability of the genome sequence of entire organisms, together with the use of high-throughput sequence-based genomic technologies to define microbial and host physiological states, provides the unparalleled opportunity to better define clinical outcomes in the field of infectious diseases. There is one overarching lesson: completion of the genomic sequence of any species answers many questions, while at the same time it invites totally new questions.


Subjects
Communicable Diseases/genetics , Genome , Animals , Bacterial Infections/genetics , Communicable Diseases/diagnosis , Bacterial Genome , Fungal Genome , Protozoan Genome , Humans , Mycoses/genetics , Protozoan Infections/genetics
8.
Pac Symp Biocomput ; : 217-27, 1998.
Article in English | MEDLINE | ID: mdl-9697184

ABSTRACT

This paper presents a computer system for analyzing and annotating large-scale genomic sequences. The core of the system is a multiple-gene structure identification program, which predicts the most "probable" gene structures based on the given evidence, including pattern recognition, EST and protein homology information. A graphics-based user interface provides an environment which allows the user to interactively control the evidence to be used in the gene identification process. To overcome the computational bottleneck in the database similarity search used in the gene identification process, we have developed an effective way to partition a database into a set of sub-databases of "related" sequences, and reduced the search problem on a large database to a signature identification problem and a search problem on a much smaller sub-database. This reduces the number of sequences to be searched from N to O(√N) on average, and hence greatly reduces the search time, where N is the number of sequences in the original database. The system provides the user with the ability to facilitate and modify the analysis and modeling in real time.


Subjects
Base Sequence , Computer Graphics , DNA/chemistry , DNA/genetics , Factual Databases , Genome , Genetic Models , Computer Simulation , Exons , Expressed Sequence Tags , Automated Pattern Recognition , Software
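
The two-stage search described in entry 8 (match a signature first, then search only one sub-database) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the k-mer signatures, the Jaccard similarity, and the names `build_index` and `two_stage_search` do not come from the paper, and the grouping of sequences into sub-databases is taken as given.

```python
def kmers(seq: str, k: int = 3) -> set[str]:
    """All length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_index(groups: list[list[str]]) -> list[tuple[set[str], list[str]]]:
    """Pair each sub-database with a signature: the union of its members' k-mers."""
    return [(set().union(*(kmers(s) for s in group)), group) for group in groups]


def two_stage_search(query: str, index: list[tuple[set[str], list[str]]]):
    """Return (best_hit, number_of_comparisons)."""
    q = kmers(query)
    # Stage 1: one comparison per sub-database signature.
    signature, members = max(index, key=lambda entry: len(q & entry[0]))
    # Stage 2: compare only against members of the selected sub-database.
    best = max(members, key=lambda s: jaccard(q, kmers(s)))
    return best, len(index) + len(members)
```

With N sequences split into about √N groups of about √N members each, the comparison count is roughly 2√N instead of N, which is the scaling the abstract reports.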
10.
Article in English | MEDLINE | ID: mdl-9322060

ABSTRACT

Computational methods for gene identification in genomic sequences typically have two phases: coding region prediction and gene parsing. While there are many effective methods for predicting coding regions (exons), parsing the predicted exons into proper gene structures, to a large extent, remains an unsolved problem. This paper presents an algorithm for inferring gene structures from predicted exon candidates, based on Expressed Sequence Tags (ESTs) and biological intuition/rules. The algorithm first finds all the related ESTs in the EST database (dbEST) for each predicted exon, and infers the boundaries of one or a series of genes based on the available EST information and biological rules. Then it constructs gene models within each pair of gene boundaries, that are most consistent with the EST information. By exploiting EST information and biological rules, the algorithm can (1) model complicated multiple gene structures, including embedded genes, (2) identify falsely-predicted exons and locate missed exons, and (3) make more accurate exon boundary predictions. The algorithm has been implemented and tested on long genomic sequences with a number of genes. Test results show that very accurate (predicted) gene models can be expected when related ESTs exist for the predicted exons.


Subjects
Algorithms , Gene Expression , Genetic Techniques , Human Genome , DNA/genetics , Factual Databases , Exons , Humans , Genetic Models , Software
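
Entry 10 describes inferring gene boundaries from the ESTs that match each predicted exon. One simple way to picture this, offered only as a sketch, is to treat exons that share an EST as belonging to the same gene. The function name and input layout below are hypothetical, and the "biological rules" and embedded-gene handling mentioned in the abstract are not modeled.

```python
def group_exons_by_est(exons):
    """Group predicted exons into genes when they share at least one EST.

    exons: list of (start, end, set_of_est_ids).
    Returns a sorted list of (gene_start, gene_end, sorted_exon_spans).
    Exons with no EST hits end up as single-exon genes.
    """
    parent = list(range(len(exons)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_exon_with_est = {}
    for i, (_, _, ests) in enumerate(exons):
        for est in ests:
            if est in first_exon_with_est:
                parent[find(i)] = find(first_exon_with_est[est])
            else:
                first_exon_with_est[est] = i

    genes = {}
    for i, (start, end, _) in enumerate(exons):
        genes.setdefault(find(i), []).append((start, end))

    return sorted(
        (min(s for s, _ in spans), max(e for _, e in spans), sorted(spans))
        for spans in genes.values()
    )
```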
11.
Comput Chem ; 20(1): 135-40, 1996 Mar.
Article in English | MEDLINE | ID: mdl-8867844

ABSTRACT

Detection of RNA polymerase II promoters and polyadenylation sites helps to locate gene boundaries and can enhance accurate gene recognition and modeling in genomic DNA sequence. We describe a system which can be used to detect polyadenylation sites and thus delineate the 3' boundary of a gene, and discuss improvements to a system first described in Matis et al. (1995) [Matis S., Shah M., Mural R. J. & Uberbacher E.C. (1995) Proc. First Wrld Conf. Computat. Med., Public Hlth, Biotechnol. (Wrld Sci.) (in press).], which predicts a large subset of RNA polymerase II promoters. The promoter system used statistical matrices and distance information as inputs for a neural network which was trained to provide initial promoter recognition. The output of the network was further refined by applying rules which use the gene context information predicted by GRAIL. We have reconstructed the rule-based system which uses gene context information and significantly improved the sensitivity and selectivity of promoter detection.


Subjects
Genetic Promoter Regions/genetics , RNA Polymerase II/analysis , Algorithms , DNA/chemistry , Factual Databases , Gene Expression/genetics , Humans , Computer Neural Networks , RNA Polymerase II/genetics , Messenger RNA/genetics , Sequence Analysis , TATA Box/genetics
12.
J Comput Biol ; 3(3): 333-44, 1996.
Article in English | MEDLINE | ID: mdl-8891953

ABSTRACT

Insertion and deletion (indel) sequencing errors in DNA coding regions disrupt DNA-to-protein translation frames, and hence make most frame-sensitive coding recognition approaches fail. This paper extends the authors' previous work on indel detection and "correction" algorithms, and presents a more effective algorithm for localizing indels that appear in DNA coding regions and "correcting" the located indels by inserting or deleting DNA bases. The algorithm localizes indels by discovering changes of the preferred translation frames within presumed coding regions, and then "corrects" them to restore a consistent translation frame within each coding region. An iterative strategy is exploited to repeatedly localize and "correct" indels until no more indels can be found. Test results have shown that this improved algorithm can detect and "correct" more indels while not worsening the rate of introduction of false indels when compared to the authors' previous work.


Subjects
Algorithms , DNA Sequence Analysis/methods , DNA Transposable Elements , Humans , Sequence Deletion
14.
Comput Appl Biosci ; 11(2): 117-24, 1995 Apr.
Article in English | MEDLINE | ID: mdl-7620982

ABSTRACT

This paper presents an algorithm for detecting and 'correcting' sequencing errors that occur in DNA coding regions. The types of sequencing errors addressed are insertions and deletions (indels) of DNA bases. The goal is to provide a capability which makes single-pass or low-redundancy sequence data more informative, reducing the need for high-redundancy sequencing for gene identification and characterization purposes. This would permit improved sequencing efficiency and reduce genome sequencing costs. The algorithm detects sequencing errors by discovering changes in the statistically preferred reading frame within a putative coding region and then inserts a number of 'neutral' bases at a perceived reading frame transition point to make the putative exon candidate frame consistent. We have implemented the algorithm as a front-end subsystem of the GRAIL DNA sequence analysis system to construct a version which is very error tolerant and also intend to use this as a testbed for further development of sequencing error-correction technology. Preliminary test results have shown the usefulness of this algorithm and also exhibited some of its weaknesses, providing possible directions for further improvement. On a test set consisting of 68 human DNA sequences with 1% randomly generated indels in coding regions, the algorithm detected and corrected 76% of the indels. The average distance between the position of an indel and the predicted one was 9.4 bases. With this subsystem in place, GRAIL correctly predicted 89% of the coding messages with 10% false message on the 'corrected' sequences, compared to 69% correctly predicted coding messages and 11% falsely predicted messages on the 'corrupted' sequences using the standard GRAIL II method (version 1.2). (ABSTRACT TRUNCATED AT 250 WORDS)


Subjects
DNA Sequence Analysis/standards , Software , Algorithms , Exons , Humans , Protein Biosynthesis , DNA Sequence Analysis/methods
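
The frame-restoration step in entry 14 (inserting 'neutral' bases where the preferred reading frame changes) can be sketched as follows. This is a simplified illustration under stated assumptions: the preferred frame of each window is supplied by the caller rather than computed statistically, a frame is defined as codon start position modulo 3, and the function names and the neutral base "N" are hypothetical choices rather than details from the paper.

```python
def find_transitions(window_frames):
    """Report where the preferred frame changes between consecutive windows.

    window_frames: list of (window_start, preferred_frame) in sequence order.
    Returns a list of (position, frame_before, frame_after).
    """
    transitions = []
    for (_, prev_frame), (start, frame) in zip(window_frames, window_frames[1:]):
        if frame != prev_frame:
            transitions.append((start, prev_frame, frame))
    return transitions


def neutral_bases_needed(frame_before: int, frame_after: int) -> int:
    """Number of bases to insert so downstream codons return to frame_before."""
    return (frame_before - frame_after) % 3


def restore_frame(sequence: str, position: int, frame_before: int,
                  frame_after: int, neutral: str = "N") -> str:
    """Insert neutral bases at `position` to make the reading frame consistent."""
    count = neutral_bases_needed(frame_before, frame_after)
    return sequence[:position] + neutral * count + sequence[position:]
```

Under this convention a single-base deletion moves downstream codons from frame f to (f - 1) mod 3 and is repaired with one inserted base, while a single-base insertion moves them to (f + 1) mod 3 and is repaired with two, so the stray base and the two neutral bases together occupy one full codon.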
16.
Article in English | MEDLINE | ID: mdl-7584472

ABSTRACT

An important open problem in molecular biology is how to use computational methods to understand the structure and function of proteins given only their primary sequences. We describe and evaluate an original machine-learning approach to classifying protein sequences according to their structural folding class. Our work is novel in several respects: we use a set of protein classes that previously have not been used for classifying primary sequences, and we use a unique set of attributes to represent protein sequences to the learners. We evaluate our approach by measuring its ability to correctly classify proteins that were not in its training set. We compare our input representation to a commonly used input representation (amino acid composition) and show that our approach more accurately classifies proteins that have very limited homology to the sequences on which the systems are trained.


Subjects
Amino Acid Sequence , Protein Folding , Protein Secondary Structure , Proteins/chemistry , Algorithms , Factual Databases , Decision Trees , Proteins/metabolism , Amino Acid Sequence Homology
17.
Comput Appl Biosci ; 10(6): 613-23, 1994 Dec.
Article in English | MEDLINE | ID: mdl-7704660

ABSTRACT

This paper presents a computationally efficient algorithm, the Gene Assembly Program III (GAP III), for constructing gene models from a set of accurately-predicted 'exons'. The input to the algorithm is a set of clusters of exon candidates, generated by a new version of the GRAIL coding region recognition system. The exon candidates of a cluster differ in their presumed edges and occasionally in their reading frames. Each exon candidate has a numerical score representing its 'probability' of being an actual exon. GAP III uses a dynamic programming algorithm to construct a gene model, complete or partial, by optimizing a predefined objective function. The optimal gene models constructed by GAP III correspond very well with the structures of genes which have been determined experimentally and reported in the Genome Sequence Database (GSDB). On a test set of 137 human and mouse DNA sequences consisting of 954 true exons, GAP III constructed 137 gene models using 892 exons, among which 859 (859/954 = 90%) are true exons and 33 (33/892 = 3%) are false positive. Among the 859 true positives, 635 (74%) match the actual exons exactly, and 838 (98%) have at least one edge correct. GAP III is computationally efficient. If we use E and C to represent the total number of exon candidates in all clusters and the number of clusters, respectively, the running time of GAP III is proportional to (E x C).


Subjects
Algorithms , Exons , Genetic Models , Software , Animals , Humans , Mice , Software Design
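
Entry 17 describes building a gene model from clusters of scored exon candidates by dynamic programming. The sketch below shows the general shape of such a dynamic program, not GAP III itself: the objective (sum of candidate scores), the compatibility rule (at most one candidate per cluster, no overlap, clusters in order), and the function name are all assumptions, reading frames are ignored, and this naive version does not attempt to reach the E × C running time reported in the abstract.

```python
def best_gene_model(clusters):
    """Pick non-overlapping exon candidates, at most one per cluster, maximizing total score.

    clusters: list of clusters in 5'-to-3' order; each cluster is a list of
              (start, end, score) candidates with positive scores.
    Returns (model, total_score), where model is a list of chosen candidates.
    """
    candidates = [
        (cluster_index, start, end, score)
        for cluster_index, cluster in enumerate(clusters)
        for (start, end, score) in cluster
    ]
    if not candidates:
        return [], 0.0

    best = [0.0] * len(candidates)      # best[i]: top score of a model ending in candidate i
    previous = [None] * len(candidates)  # back-pointers for reconstructing the model
    for i, (ci, si, _, score_i) in enumerate(candidates):
        best[i] = score_i
        for j in range(i):
            cj, _, ej, _ = candidates[j]
            # Compatible if j is from an earlier cluster and ends before i starts.
            if cj < ci and ej < si and best[j] + score_i > best[i]:
                best[i] = best[j] + score_i
                previous[i] = j

    i = max(range(len(candidates)), key=best.__getitem__)
    total = best[i]
    model = []
    while i is not None:
        model.append(candidates[i][1:])
        i = previous[i]
    return model[::-1], total
```

In this formulation a lower-scoring candidate can be preferred when the top-scoring one in its cluster overlaps a neighboring exon, which is one way a gene-model step can improve on picking the best candidate per cluster independently.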
19.
Article in English | MEDLINE | ID: mdl-7584416

ABSTRACT

A new version of the GRAIL system (Uberbacher and Mural, 1991; Mural et al., 1992; Uberbacher et al., 1993), called GRAIL II, has recently been developed (Xu et al., 1994). GRAIL II is a hybrid AI system that supports a number of DNA sequence analysis tools including protein-coding region recognition, PolyA site and transcription promoter recognition, gene model construction, translation to protein, and DNA/protein database searching capabilities. This paper presents the core of GRAIL II, the coding exon recognition and gene model construction algorithms. The exon recognition algorithm recognizes coding exons by combining coding feature analysis and edge signal (acceptor/donor/translation-start sites) detection. Unlike the original GRAIL system (Uberbacher and Mural, 1991; Mural et al., 1992), this algorithm uses variable-length windows tailored to each potential exon candidate, making its performance almost exon length-independent. In this algorithm, the recognition process is divided into four steps. Initially a large number of possible coding exon candidates are generated. Then a rule-based prescreening algorithm eliminates the majority of the improbable candidates. As the kernel of the recognition algorithm, three neural networks are trained to evaluate the remaining candidates. The outputs of the neural networks are then divided into clusters of candidates, corresponding to presumed exons. The algorithm makes its final prediction by picking the best candidate from each cluster. The gene construction algorithm (Xu, Mural and Uberbacher, 1994) uses a dynamic programming approach to build gene models by using as input the clusters predicted by the exon recognition algorithm. Extensive testing has been done on these two algorithms. (ABSTRACT TRUNCATED AT 250 WORDS)


Subjects
DNA/analysis , Exons , Computer Neural Networks , Algorithms , Humans , Software
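
The last step in entry 19, picking the best candidate from each cluster, can be sketched in a few lines. The clustering rule used here (candidates whose spans overlap belong to the same cluster) is an assumption made for illustration, and the scores stand in for the neural network outputs; neither the rule nor the function name comes from the abstract.

```python
def pick_best_per_cluster(candidates):
    """Cluster overlapping exon candidates and keep the top-scoring one of each cluster.

    candidates: list of (start, end, score).
    Returns the chosen candidates in order of position.
    """
    clusters = []
    cluster_end = None
    for candidate in sorted(candidates):
        start, end, _ = candidate
        if cluster_end is not None and start <= cluster_end:
            clusters[-1].append(candidate)       # overlaps the current cluster
            cluster_end = max(cluster_end, end)
        else:
            clusters.append([candidate])         # starts a new cluster
            cluster_end = end
    return [max(cluster, key=lambda c: c[2]) for cluster in clusters]
```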
20.
Genet Eng (N Y) ; 16: 241-53, 1994.
Article in English | MEDLINE | ID: mdl-7765200

ABSTRACT

We have described an improved neural network system for recognizing protein coding regions (exons) in human genomic DNA sequences. This coding region recognition system is part of a new version of GRAIL, GRAIL II, and represents a significant improvement over the coding recognition performance of the previous GRAIL system. GRAIL II divides the process of locating exons into four steps. It first generates an exon candidate pool consisting of all possible (translation start-donor), (acceptor-donor), and (acceptor-translation stop) pairs within all open reading frames of the test sequence. The vast majority of these exon candidates are eliminated from consideration by applying a set of heuristic rules. After reducing the size of the candidate pool, GRAIL II uses three trained neural networks to evaluate the coding potential and accuracy of the edges of starting exon, internal exon and terminal exon candidates. These networks output a set of overlapping candidates for each exon which differ by their scores and position of their edges. Multiple candidates for a given exon are grouped into a cluster based on their locations relative to candidates corresponding to other exons, and the highest scoring candidate for each cluster is used as the "best" prediction of the corresponding exon. Unlike the previous GRAIL version, GRAIL II uses variable-length windows to evaluate exon candidates and its performance is nearly independent of exon length. In addition to several strong indicators of coding potential, the system uses several other types of information including scores for splice junctions, GC composition, and the properties of the regions adjacent to an exon candidate, to aid in the discrimination process. On a large set of sequences from Genbank (3), GRAIL II located 93% of all exons regardless of size with a false positive rate of 12%. 
Among the true positives, 62% match the actual exons exactly (the exons edges are correct to the base), and 93% match at least one edge correctly. These statistics are further improved, especially the false positive rate and accuracy of the edges, through a process of gene model construction by the Gene Assembly Program (GAP III) (4) module of GRAIL II, which uses the scored exon candidates as input and constructs optimal gene models. The gene modeling system will be described elsewhere.


Subjects
Exons , Computer Neural Networks , Amino Acid Sequence , Artificial Intelligence , Human Genome , Humans , Molecular Sequence Data
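
The first step in entry 20, generating a pool of (acceptor-donor) pairs within open reading frames, might look roughly like the sketch below for internal exons. The details are assumptions: acceptors are taken as any "AG" dinucleotide and donors as any "GT", a span counts as open if at least one of its three frames has no stop codon, and the function names and `min_len` parameter are hypothetical. Starting and terminal exon candidates, and the heuristic rules that prune the pool, are not covered.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_open_frame(segment: str) -> bool:
    """True if at least one of the three reading frames contains no stop codon."""
    for frame in range(3):
        codons = (segment[i:i + 3] for i in range(frame, len(segment) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            return True
    return False


def internal_exon_candidates(seq: str, min_len: int = 3):
    """Enumerate (start, end) spans lying between an AG acceptor and a GT donor.

    `start` is the first base after the AG; `end` is the index of the G in GT,
    so the candidate exon is seq[start:end].
    """
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    return [
        (a, d)
        for a in acceptors
        for d in donors
        if d - a >= min_len and has_open_frame(seq[a:d])
    ]
```

Because every acceptor is paired with every downstream donor, the pool grows quickly with sequence length, which is consistent with the abstract's remark that the vast majority of candidates are discarded before scoring.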