Search | VHL Regional Portal

Design of a compartmentalized shotgun assembler for the human genome.

Huson, D H; Reinert, K; Kravitz, S A; Remington, K A; Delcher, A L; Dew, I M; Flanigan, M; Halpern, A L; Lai, Z; Mobarry, C M; Sutton, G G; Myers, E W.

Bioinformatics ; 17 Suppl 1: S132-9, 2001.

Article in English | MEDLINE | ID: mdl-11473002

ABSTRACT

Two different strategies for determining the human genome are currently being pursued: one is the "clone-by-clone" approach, employed by the publicly funded project, and the other is the "whole genome shotgun assembler" approach, favored by researchers at Celera Genomics. An interim strategy employed at Celera, called compartmentalized shotgun assembly, makes use of preliminary data produced by both approaches. In this paper we describe the design, implementation and operation of the "compartmentalized shotgun assembler".

Subject(s)

Cloning, Molecular/methods , Genome, Human , Chromosomes, Artificial, Bacterial/genetics , Computational Biology , Databases, Nucleic Acid , Humans , Sequence Analysis, DNA/statistics & numerical data , Software

A whole-genome assembly of Drosophila.

Myers, E W; Sutton, G G; Delcher, A L; Dew, I M; Fasulo, D P; Flanigan, M J; Kravitz, S A; Mobarry, C M; Reinert, K H; Remington, K A; Anson, E L; Bolanos, R A; Chou, H H; Jordan, C M; Halpern, A L; Lonardi, S; Beasley, E M; Brandon, R C; Chen, L; Dunn, P J; Lai, Z; Liang, Y; Nusskern, D R; Zhan, M; Zhang, Q; Zheng, X; Rubin, G M; Adams, M D; Venter, J C.

Science ; 287(5461): 2196-204, 2000 Mar 24.

Article in English | MEDLINE | ID: mdl-10731133

ABSTRACT

We report on the quality of a whole-genome assembly of Drosophila melanogaster and the nature of the computer algorithms that accomplished it. Three independent external data sources essentially agree with and support the assembly's sequence and ordering of contigs across the euchromatic portion of the genome. In addition, there are isolated contigs that we believe represent nonrepetitive pockets within the heterochromatin of the centromeres. Comparison with a previously sequenced 2.9-megabase region indicates that sequencing accuracy within nonrepetitive segments is greater than 99.99% without manual curation. As such, this initial reconstruction of the Drosophila sequence should be of substantial value to the scientific community.

Subject(s)

Computational Biology , Drosophila melanogaster/genetics , Genome , Sequence Analysis, DNA , Algorithms , Animals , Chromatin/genetics , Contig Mapping , Euchromatin , Genes, Insect , Heterochromatin/genetics , Molecular Sequence Data , Physical Chromosome Mapping , Repetitive Sequences, Nucleic Acid , Sequence Tagged Sites

Improved microbial gene identification with GLIMMER.

Delcher, A L; Harmon, D; Kasif, S; White, O; Salzberg, S L.

Nucleic Acids Res ; 27(23): 4636-41, 1999 Dec 01.

Article in English | MEDLINE | ID: mdl-10556321

ABSTRACT

The GLIMMER system for microbial gene identification finds approximately 97-98% of all genes in a genome when compared with published annotation. This paper reports on two new results: (i) significant technical improvements to GLIMMER that improve its accuracy still further, and (ii) a comprehensive evaluation that demonstrates that the accuracy of the system is likely to be higher than previously recognized. A significant proportion of the genes missed by the system appear to be hypothetical proteins whose existence is only supported by the predictions of other programs. When the analysis is restricted to genes that have significant homology to genes in other organisms, GLIMMER misses <1% of known genes.

Subject(s)

Genes, Bacterial , Genetic Techniques/standards , Algorithms , Markov Chains , Models, Genetic

Interpolated Markov models for eukaryotic gene finding.

Salzberg, S L; Pertea, M; Delcher, A L; Gardner, M J; Tettelin, H.

Genomics ; 59(1): 24-31, 1999 Jul 01.

Article in English | MEDLINE | ID: mdl-10395796

ABSTRACT

Computational gene finding research has emphasized the development of gene finders for bacterial and human DNA. This has left genome projects for some small eukaryotes without a system that addresses their needs. This paper reports on a new system, GlimmerM, that was developed to find genes in the malaria parasite Plasmodium falciparum. Because the gene density in P. falciparum is relatively high, the system design was based on a successful bacterial gene finder, Glimmer. The system was augmented with specially trained modules to find splice sites and was trained on all available data from the P. falciparum genome. Although a precise evaluation of its accuracy is impossible at this time, laboratory tests (using RT-PCR) on a small selection of predicted genes confirmed all of those predictions. With the rapid progress in sequencing the genome of P. falciparum, the availability of this new gene finder will greatly facilitate the annotation process.

Subject(s)

Genes, Protozoan/genetics , Markov Chains , Algorithms , Alternative Splicing , Animals , Chromosomes/genetics , Databases, Factual , Gene Expression , Genome, Protozoan , Internet , Plasmodium falciparum/genetics , Reproducibility of Results , Reverse Transcriptase Polymerase Chain Reaction , Sequence Alignment

Alignment of whole genomes.

Delcher, A L; Kasif, S; Fleischmann, R D; Peterson, J; White, O; Salzberg, S L.

Nucleic Acids Res ; 27(11): 2369-76, 1999 Jun 01.

Article in English | MEDLINE | ID: mdl-10325427

ABSTRACT

A new system for aligning whole genome sequences is described. Using an efficient data structure called a suffix tree, the system is able to rapidly align sequences containing millions of nucleotides. Its use is demonstrated on two strains of Mycobacterium tuberculosis, on two less similar species of Mycoplasma bacteria and on two syntenic sequences from human chromosome 12 and mouse chromosome 6. In each case it found an alignment of the input sequences, using between 30 s and 2 min of computation time. From the system output, information on single nucleotide changes, translocations and homologous genes can easily be extracted. Use of the algorithm should facilitate analysis of syntenic chromosomal regions, strain-to-strain comparisons, evolutionary comparisons and genomic duplications.

Subject(s)

Algorithms , Genome, Bacterial , Mycoplasma/genetics , Sequence Alignment/methods , Animals , Base Sequence , DNA , Humans , Mice , Molecular Sequence Data

Microbial gene identification using interpolated Markov models.

Salzberg, S L; Delcher, A L; Kasif, S; White, O.

Nucleic Acids Res ; 26(2): 544-8, 1998 Jan 15.

Article in English | MEDLINE | ID: mdl-9421513

ABSTRACT

This paper describes a new system, GLIMMER, for finding genes in microbial genomes. In a series of tests on Haemophilus influenzae, Helicobacter pylori and other complete microbial genomes, this system has proven to be very accurate at locating virtually all the genes in these sequences, outperforming previous methods. A conservative estimate based on experiments on H. pylori and H. influenzae is that the system finds >97% of all genes. GLIMMER uses interpolated Markov models (IMMs) as a framework for capturing dependencies between nearby nucleotides in a DNA sequence. An IMM-based method makes predictions based on a variable context; i.e., a variable-length oligomer in a DNA sequence. The context used by GLIMMER changes depending on the local composition of the sequence. As a result, GLIMMER is more flexible and more powerful than fixed-order Markov methods, which have previously been the primary content-based technique for finding genes in microbial DNA.
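
The idea of a variable-length context can be illustrated with a minimal sketch. This is not GLIMMER's actual weighting scheme, which the abstract does not specify; the function names, the `pseudo` parameter, and the count-based weight `total / (total + pseudo)` are assumptions chosen for illustration. The sketch blends predictions from contexts of increasing length, trusting a longer context more the more often it was seen in training.

```python
from collections import defaultdict

BASES = "ACGT"

def train_counts(seqs, max_order):
    """Count how often each base follows each context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in seqs:
        for i, base in enumerate(seq):
            for k in range(max_order + 1):
                if i - k < 0:
                    break  # not enough preceding bases for this order
                counts[seq[i - k:i]][base] += 1
    return counts

def imm_prob(counts, context, base, pseudo=4.0):
    """Interpolated probability of `base` following `context`.

    Hypothetical weighting (an assumption, not the published method):
    a context seen `total` times gets weight total / (total + pseudo).
    """
    prob = 1.0 / len(BASES)  # uniform fallback below order 0
    # Walk from the empty context up to the full context.
    for k in range(len(context) + 1):
        ctx = context[len(context) - k:]
        seen = counts.get(ctx)
        total = sum(seen.values()) if seen else 0
        if total == 0:
            break  # longer contexts were never observed either
        lam = total / (total + pseudo)
        prob = lam * (seen.get(base, 0) / total) + (1 - lam) * prob
    return prob

counts = train_counts(["ACGTACGTACGT"], max_order=2)
p_g = imm_prob(counts, "AC", "G")
p_t = imm_prob(counts, "AC", "T")
```

In this toy training sequence "AC" is always followed by "G", so the interpolated estimate favors "G" while still leaving some probability for the other bases, which is the behavior a fixed-order model with sparse counts would lack.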

Subject(s)

DNA, Bacterial/analysis , Markov Chains , Algorithms , Base Sequence , DNA, Bacterial/chemistry , Haemophilus influenzae/genetics , Helicobacter pylori/genetics , Open Reading Frames , Sensitivity and Specificity , Sequence Alignment , Software

A decision tree system for finding genes in DNA.

Salzberg, S; Delcher, A L; Fasman, K H; Henderson, J.

J Comput Biol ; 5(4): 667-80, 1998.

Article in English | MEDLINE | ID: mdl-10072083

ABSTRACT

MORGAN is an integrated system for finding genes in vertebrate DNA sequences. MORGAN uses a variety of techniques to accomplish this task, the most distinctive of which is a decision tree classifier. The decision tree system is combined with new methods for identifying start codons, donor sites, and acceptor sites, and these are brought together in a frame-sensitive dynamic programming algorithm that finds the optimal segmentation of a DNA sequence into coding and noncoding regions (exons and introns). The optimal segmentation is dependent on a separate scoring function that takes a subsequence and assigns to it a score reflecting the probability that the sequence is an exon. The scoring functions in MORGAN are sets of decision trees that are combined to give a probability estimate. Experimental results on a database of 570 vertebrate DNA sequences show that MORGAN has excellent performance by many different measures. On a separate test set, it achieves an overall accuracy of 95%, with a correlation coefficient of 0.78, and a sensitivity and specificity for coding bases of 83% and 79%. In addition, MORGAN identifies 58% of coding exons exactly; i.e., both the beginning and end of the coding regions are predicted correctly. This paper describes the MORGAN system, including its decision tree routines and the algorithms for site recognition, and its performance on a benchmark database of vertebrate DNA.
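
The per-base measures quoted above can be computed as in the following sketch. The abstract does not give the formulas, so the definitions here are assumptions: specificity is taken as TP / (TP + FP), the convention commonly used in gene-finding evaluations, and the correlation coefficient is taken as the Matthews correlation. The function name and the 'C'/'N' labels (coding/noncoding) are hypothetical.

```python
import math

def coding_base_metrics(true_labels, pred_labels):
    """Per-base accuracy measures for coding ('C') vs. noncoding ('N') labels."""
    tp = fp = tn = fn = 0
    for t, p in zip(true_labels, pred_labels):
        if t == "C" and p == "C":
            tp += 1
        elif t == "N" and p == "C":
            fp += 1
        elif t == "N" and p == "N":
            tn += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # Fraction of true coding bases that were predicted as coding.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Assumed gene-finding convention: fraction of predicted coding
        # bases that are truly coding.
        "specificity": tp / (tp + fp) if tp + fp else float("nan"),
        "correlation": (tp * tn - fp * fn) / denom if denom else float("nan"),
    }

metrics = coding_base_metrics("CCCCNNNNNN", "CCCNCNNNNN")
```

Reporting the correlation alongside accuracy matters because noncoding bases usually dominate vertebrate sequence, so a predictor can score high accuracy while still missing much of the coding sequence.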

Subject(s)

Algorithms , DNA/genetics , Decision Trees , Genes , DNA/classification , Decision Support Techniques , Markov Chains

Protein secondary structure modelling with probabilistic networks.

Delcher, A L; Kasif, S; Goldberg, H R; Hsu, W H.

Proc Int Conf Intell Syst Mol Biol ; 1: 109-17, 1993.

Article in English | MEDLINE | ID: mdl-7584325

ABSTRACT

In this paper we study the performance of probabilistic networks in the context of protein sequence analysis in molecular biology. Specifically, we report the results of our initial experiments applying this framework to the problem of protein secondary structure prediction. One of the main advantages of the probabilistic approach we describe here is our ability to perform detailed experiments where we can experiment with different models. We can easily perform local substitutions (mutations) and measure (probabilistically) their effect on the global structure. Window-based methods do not support such experimentation as readily. Our method is efficient both during training and during prediction, which is important in order to be able to perform many experiments with different networks. We believe that probabilistic methods are comparable to other methods in prediction quality. In addition, the predictions generated by our methods have precise quantitative semantics which is not shared by other classification methods. Specifically, all the causal and statistical independence assumptions are made explicit in our networks thereby allowing biologists to study and experiment with different causal models in a convenient manner.

Subject(s)

Models, Molecular , Protein Structure, Secondary , Algorithms , Bayes Theorem , Decision Trees , Markov Chains , Models, Genetic , Mutation , Neural Networks, Computer , Reproducibility of Results
