Search | VHL Regional Portal

1.

Error, noise and bias in de novo transcriptome assemblies.

Freedman, Adam H; Clamp, Michele; Sackton, Timothy B.

Mol Ecol Resour ; 21(1): 18-29, 2021 Jan.

Article in English | MEDLINE | ID: mdl-32180366

ABSTRACT

De novo transcriptome assembly is a powerful tool, and has been widely used over the last decade for making evolutionary inferences. However, it relies on two implicit assumptions: that the assembled transcriptome is an unbiased representation of the underlying expressed transcriptome, and that expression estimates from the assembly are good, if noisy approximations of the relative abundance of expressed transcripts. Using publicly available data for model organisms, we demonstrate that, across assembly algorithms and data sets, these assumptions are consistently violated. Bias exists at the nucleotide level, with genotyping error rates ranging from 30% to 83%. As a result, diversity is underestimated in transcriptome assemblies, with consistent underestimation of heterozygosity in all but the most inbred samples. Even at the gene level, expression estimates show wide deviations from map-to-reference estimates, and positive bias at lower expression levels. Standard filtering of transcriptome assemblies improves the robustness of gene expression estimates but leads to the loss of a meaningful number of protein-coding genes, including many that are highly expressed. We demonstrate a computational method, length-rescaled CPM, to partly alleviate noise and bias in expression estimates. Researchers should consider ways to minimize the impact of bias in transcriptome assemblies.
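
The abstract names length-rescaled CPM but does not define it, so the following is only a minimal sketch of ordinary counts-per-million (CPM), the quantity that the method builds on. The function name and the toy gene counts are hypothetical, and the length-rescaling step itself is deliberately not reproduced here.

```python
def counts_per_million(counts):
    """Scale raw read counts so that they sum to one million.

    `counts` maps a gene (or transcript) identifier to its raw read count.
    This is plain CPM only; the paper's length rescaling is not shown.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute CPM from zero total counts")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Hypothetical toy counts, for illustration only.
toy_counts = {"geneA": 10, "geneB": 30, "geneC": 60}
toy_cpm = counts_per_million(toy_counts)
```

Because CPM normalizes only for sequencing depth, any correction for transcript length has to be applied as a separate step.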

Subject(s)

Bias , Gene Expression Profiling , Transcriptome , Algorithms

2.

Whole-Genome Analyses Resolve the Phylogeny of Flightless Birds (Palaeognathae) in the Presence of an Empirical Anomaly Zone.

Cloutier, Alison; Sackton, Timothy B; Grayson, Phil; Clamp, Michele; Baker, Allan J; Edwards, Scott V.

Syst Biol ; 68(6): 937-955, 2019 11 01.

Article in English | MEDLINE | ID: mdl-31135914

ABSTRACT

Palaeognathae represent one of the two basal lineages in modern birds, and comprise the volant (flighted) tinamous and the flightless ratites. Resolving palaeognath phylogenetic relationships has historically proved difficult, and short internal branches separating major palaeognath lineages in previous molecular phylogenies suggest that extensive incomplete lineage sorting (ILS) might have accompanied a rapid ancient divergence. Here, we investigate palaeognath relationships using genome-wide data sets of three types of noncoding nuclear markers, together totaling 20,850 loci and over 41 million base pairs of aligned sequence data. We recover a fully resolved topology placing rheas as the sister to kiwi and emu + cassowary that is congruent across marker types for two species tree methods (MP-EST and ASTRAL-II). This topology is corroborated by patterns of insertions for 4274 CR1 retroelements identified from multispecies whole-genome screening, and is robustly supported by phylogenomic subsampling analyses, with MP-EST demonstrating particularly consistent performance across subsampling replicates as compared to ASTRAL. In contrast, analyses of concatenated data supermatrices recover rheas as the sister to all other nonostrich palaeognaths, an alternative that lacks retroelement support and shows inconsistent behavior under subsampling approaches. While statistically supporting the species tree topology, conflicting patterns of retroelement insertions also occur and imply high amounts of ILS across short successive internal branches, consistent with observed patterns of gene tree heterogeneity. Coalescent simulations and topology tests indicate that the majority of observed topological incongruence among gene trees is consistent with coalescent variation rather than arising from gene tree estimation error alone, and estimated branch lengths for short successive internodes in the inferred species tree fall within the theoretical range encompassing the anomaly zone. Distributions of empirical gene trees confirm that the most common gene tree topology for each marker type differs from the species tree, signifying the existence of an empirical anomaly zone in palaeognaths.

Subject(s)

Genome/genetics , Palaeognathae/classification , Palaeognathae/genetics , Phylogeny , Animals , Genomics

3.

Convergent regulatory evolution and loss of flight in paleognathous birds.

Sackton, Timothy B; Grayson, Phil; Cloutier, Alison; Hu, Zhirui; Liu, Jun S; Wheeler, Nicole E; Gardner, Paul P; Clarke, Julia A; Baker, Allan J; Clamp, Michele; Edwards, Scott V.

Science ; 364(6435): 74-78, 2019 04 05.

Article in English | MEDLINE | ID: mdl-30948549

ABSTRACT

A core question in evolutionary biology is whether convergent phenotypic evolution is driven by convergent molecular changes in proteins or regulatory regions. We combined phylogenomic, developmental, and epigenomic analysis of 11 new genomes of paleognathous birds, including an extinct moa, to show that convergent evolution of regulatory regions, more so than protein-coding genes, is prevalent among developmental pathways associated with independent losses of flight. A Bayesian analysis of 284,001 conserved noncoding elements, 60,665 of which are corroborated as enhancers by open chromatin states during development, identified 2355 independent accelerations along lineages of flightless paleognaths, with functional consequences for driving gene expression in the developing forelimb. Our results suggest that the genomic landscape associated with morphological convergence in ratites has a substantial shared regulatory component.

Subject(s)

Biological Evolution , Epigenesis, Genetic , Evolution, Molecular , Flight, Animal , Palaeognathae/anatomy & histology , Palaeognathae/genetics , Animals , Bayes Theorem , Chromatin/metabolism , Conserved Sequence , Enhancer Elements, Genetic , Epigenomics , Exons/genetics , Extinction, Biological , Forelimb/anatomy & histology , Palaeognathae/physiology , Phenotype , Phylogeny

4.

Metatranscriptomics profile of the gill microbial community during Bathymodiolus azoricus aquarium acclimatization at atmospheric pressure.

Barros, Inês; Froufe, Hugo; Marnellos, George; Egas, Conceição; Delaney, Jennifer; Clamp, Michele; Santos, Ricardo Serrão; Bettencourt, Raul.

AIMS Microbiol ; 4(2): 240-260, 2018.

Article in English | MEDLINE | ID: mdl-31294213

ABSTRACT

BACKGROUND: The deep-sea mussels Bathymodiolus azoricus (Bivalvia: Mytilidae) are the dominant macrofauna subsisting at the hydrothermal vents site Menez Gwen in the Mid-Atlantic Ridge (MAR). Their adaptive success in such challenging environments is largely due to their gill symbiotic association with chemosynthetic bacteria. We examined the response of vent mussels as they adapt to sea-level environmental conditions, through an assessment of the relative abundance of host-symbiont related RNA transcripts to better understand how the gill microbiome may drive host-symbiont interactions in vent mussels during hypothetical venting inactivity. RESULTS: The metatranscriptome of B. azoricus was sequenced from gill tissues sampled at different time-points during a five-week acclimatization experiment, using Next-Generation-Sequencing. After Illumina sequencing, a total of 181,985,262 paired-end reads of 150 bp were generated with an average of 16,544,115 reads per sample. Metatranscriptome analysis confirmed that experimental acclimatization in aquaria accounted for global gill transcript variation. Additionally, the analysis of 16S and 18S rRNA sequences data allowed for a comprehensive characterization of host-symbiont interactions, which included the gradual loss of gill endosymbionts and signaling pathways, associated with stress responses and energy metabolism, under experimental acclimatization. Dominant active transcripts were assigned to the following KEGG categories: "Ribosome", "Oxidative phosphorylation" and "Chaperones and folding catalysts" suggesting specific metabolic responses to physiological adaptations in aquarium environment. CONCLUSIONS: Gill metagenomics analyses highlighted microbial diversity shifts and a clear pattern of varying mRNA transcript abundancies and expression during acclimatization to aquarium conditions which indicate change in bacterial community activity. This approach holds potential for the discovery of new host-symbiont associations, evidencing new functional transcripts and a clearer picture of methane metabolism during loss of endosymbionts. Towards the end of acclimatization, we observed trends in three major functional subsystems, as evidenced by an increment of transcripts related to genetic information processes; the decrease of chaperone and folding catalysts and oxidative phosphorylation transcripts; but no change in transcripts of gluconeogenesis and co-factors-vitamins.

5.

A high-resolution map of human evolutionary constraint using 29 mammals.

Lindblad-Toh, Kerstin; Garber, Manuel; Zuk, Or; Lin, Michael F; Parker, Brian J; Washietl, Stefan; Kheradpour, Pouya; Ernst, Jason; Jordan, Gregory; Mauceli, Evan; Ward, Lucas D; Lowe, Craig B; Holloway, Alisha K; Clamp, Michele; Gnerre, Sante; Alföldi, Jessica; Beal, Kathryn; Chang, Jean; Clawson, Hiram; Cuff, James; Di Palma, Federica; Fitzgerald, Stephen; Flicek, Paul; Guttman, Mitchell; Hubisz, Melissa J; Jaffe, David B; Jungreis, Irwin; Kent, W James; Kostka, Dennis; Lara, Marcia; Martins, Andre L; Massingham, Tim; Moltke, Ida; Raney, Brian J; Rasmussen, Matthew D; Robinson, Jim; Stark, Alexander; Vilella, Albert J; Wen, Jiayu; Xie, Xiaohui; Zody, Michael C; Baldwin, Jen; Bloom, Toby; Chin, Chee Whye; Heiman, Dave; Nicol, Robert; Nusbaum, Chad; Young, Sarah; Wilkinson, Jane; Worley, Kim C.

Nature ; 478(7370): 476-82, 2011 Oct 12.

Article in English | MEDLINE | ID: mdl-21993624

ABSTRACT

The comparison of related genomes has emerged as a powerful lens for genome interpretation. Here we report the sequencing and comparative analysis of 29 eutherian genomes. We confirm that at least 5.5% of the human genome has undergone purifying selection, and locate constrained elements covering ~4.2% of the genome. We use evolutionary signatures and comparisons with experimental data sets to suggest candidate functions for ~60% of constrained bases. These elements reveal a small number of new coding exons, candidate stop codon readthrough events and over 10,000 regions of overlapping synonymous constraint within protein-coding exons. We find 220 candidate RNA structural families, and nearly a million elements overlapping potential promoter, enhancer and insulator regions. We report specific amino acid residues that have undergone positive selection, 280,000 non-coding elements exapted from mobile elements and more than 1,000 primate- and human-accelerated elements. Overlap with disease-associated variants indicates that our findings will be relevant for studies of human biology, health and disease.

Subject(s)

Evolution, Molecular , Genome, Human/genetics , Genome/genetics , Mammals/genetics , Animals , Disease , Exons/genetics , Genomics , Health , Humans , Molecular Sequence Annotation , Phylogeny , RNA/classification , RNA/genetics , Selection, Genetic/genetics , Sequence Alignment , Sequence Analysis, DNA

6.

Three periods of regulatory innovation during vertebrate evolution.

Lowe, Craig B; Kellis, Manolis; Siepel, Adam; Raney, Brian J; Clamp, Michele; Salama, Sofie R; Kingsley, David M; Lindblad-Toh, Kerstin; Haussler, David.

Science ; 333(6045): 1019-24, 2011 Aug 19.

Article in English | MEDLINE | ID: mdl-21852499

ABSTRACT

The gain, loss, and modification of gene regulatory elements may underlie a substantial proportion of phenotypic changes on animal lineages. To investigate the gain of regulatory elements throughout vertebrate evolution, we identified genome-wide sets of putative regulatory regions for five vertebrates, including humans. These putative regulatory regions are conserved nonexonic elements (CNEEs), which are evolutionarily conserved yet do not overlap any coding or noncoding mature transcript. We then inferred the branch on which each CNEE came under selective constraint. Our analysis identified three extended periods in the evolution of gene regulatory elements. Early vertebrate evolution was characterized by regulatory gains near transcription factors and developmental genes, but this trend was replaced by innovations near extracellular signaling genes, and then innovations near posttranslational protein modifiers.

Subject(s)

Biological Evolution , Conserved Sequence , Evolution, Molecular , Regulatory Elements, Transcriptional , Regulatory Sequences, Nucleic Acid , Vertebrates/genetics , Animals , Cattle , DNA, Intergenic/genetics , Gene Expression Regulation , Genes, Developmental , Genome , Humans , Markov Chains , Mice , Oryzias/genetics , Phylogeny , Protein Processing, Post-Translational/genetics , Selection, Genetic , Sequence Alignment , Smegmamorpha/genetics , Transcription Factors/genetics

7.

Identifying novel constrained elements by exploiting biased substitution patterns.

Garber, Manuel; Guttman, Mitchell; Clamp, Michele; Zody, Michael C; Friedman, Nir; Xie, Xiaohui.

Bioinformatics ; 25(12): i54-62, 2009 Jun 15.

Article in English | MEDLINE | ID: mdl-19478016

ABSTRACT

MOTIVATION: Comparing the genomes from closely related species provides a powerful tool to identify functional elements in a reference genome. Many methods have been developed to identify conserved sequences across species; however, existing methods only model conservation as a decrease in the rate of mutation and have ignored selection acting on the pattern of mutations. RESULTS: We present a new approach that takes advantage of deeply sequenced clades to identify evolutionary selection by uncovering not only signatures of rate-based conservation but also substitution patterns characteristic of sequence undergoing natural selection. We describe a new statistical method for modeling biased nucleotide substitutions, a learning algorithm for inferring site-specific substitution biases directly from sequence alignments and a hidden Markov model for detecting constrained elements characterized by biased substitutions. We show that the new approach can identify significantly more degenerate constrained sequences than rate-based methods. Applying it to the ENCODE regions, we identify as much as 10.2% of these regions are under selection. AVAILABILITY: The algorithms are implemented in a Java software package, called SiPhy, freely available at http://www.broadinstitute.org/science/software/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
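
To make the idea of segmenting an alignment with a hidden Markov model concrete, the sketch below runs Viterbi decoding over a generic two-state model (neutral versus constrained). It is not SiPhy's model: the binary observations (0 for a variable column, 1 for a conserved column) and all probabilities are hypothetical values chosen for illustration, whereas the abstract describes a model built on biased substitution patterns.

```python
import math


def viterbi(obs, start, trans, emit):
    """Return the most probable state path for a discrete HMM (log space)."""
    n_states = len(start)
    # Score of the best path ending in each state after the first observation.
    score = [math.log(start[s]) + math.log(emit[s][obs[0]]) for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + math.log(trans[p][s]))
            new_score.append(score[best_prev]
                             + math.log(trans[best_prev][s])
                             + math.log(emit[s][o]))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical parameters: state 0 = neutral, state 1 = constrained.
start = [0.9, 0.1]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.7, 0.3],   # neutral: mostly variable columns
        [0.1, 0.9]]   # constrained: mostly conserved columns
columns = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
path = viterbi(columns, start, trans, emit)
```

With these toy parameters the run of six conserved columns is long enough to outweigh the cost of switching states twice, so it is decoded as a single constrained element.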

Subject(s)

Algorithms , Genomics/methods , Sequence Alignment/methods , Base Sequence , Evolution, Molecular , Software

8.

Jalview Version 2--a multiple sequence alignment editor and analysis workbench.

Waterhouse, Andrew M; Procter, James B; Martin, David M A; Clamp, Michèle; Barton, Geoffrey J.

Bioinformatics ; 25(9): 1189-91, 2009 May 01.

Article in English | MEDLINE | ID: mdl-19151095

ABSTRACT

Jalview Version 2 is a system for interactive WYSIWYG editing, analysis and annotation of multiple sequence alignments. Core features include keyboard and mouse-based editing, multiple views and alignment overviews, and linked structure display with Jmol. Jalview 2 is available in two forms: a lightweight Java applet for use in web applications, and a powerful desktop application that employs web services for sequence alignment, secondary structure prediction and the retrieval of alignments, sequences, annotation and structures from public databases and any DAS 1.53 compliant sequence or annotation server. AVAILABILITY: The Jalview 2 Desktop application and JalviewLite applet are made freely available under the GPL, and can be downloaded from www.jalview.org.

Subject(s)

Computational Biology/methods , Proteins/chemistry , Sequence Alignment/methods , Software , Databases, Protein , Sequence Analysis, Protein

9.

Initial sequence and comparative analysis of the cat genome.

Pontius, Joan U; Mullikin, James C; Smith, Douglas R; Lindblad-Toh, Kerstin; Gnerre, Sante; Clamp, Michele; Chang, Jean; Stephens, Robert; Neelam, Beena; Volfovsky, Natalia; Schäffer, Alejandro A; Agarwala, Richa; Narfström, Kristina; Murphy, William J; Giger, Urs; Roca, Alfred L; Antunes, Agostinho; Menotti-Raymond, Marilyn; Yuhki, Naoya; Pecon-Slattery, Jill; Johnson, Warren E; Bourque, Guillaume; Tesler, Glenn; O'Brien, Stephen J.

Genome Res ; 17(11): 1675-89, 2007 Nov.

Article in English | MEDLINE | ID: mdl-17975172

ABSTRACT

The genome sequence (1.9-fold coverage) of an inbred Abyssinian domestic cat was assembled, mapped, and annotated with a comparative approach that involved cross-reference to annotated genome assemblies of six mammals (human, chimpanzee, mouse, rat, dog, and cow). The results resolved chromosomal positions for 663,480 contigs, 20,285 putative feline gene orthologs, and 133,499 conserved sequence blocks (CSBs). Additional annotated features include repetitive elements, endogenous retroviral sequences, nuclear mitochondrial (numt) sequences, micro-RNAs, and evolutionary breakpoints that suggest historic balancing of translocation and inversion incidences in distinct mammalian lineages. Large numbers of single nucleotide polymorphisms (SNPs), deletion insertion polymorphisms (DIPs), and short tandem repeats (STRs), suitable for linkage or association studies were characterized in the context of long stretches of chromosome homozygosity. In spite of the light coverage capturing approximately 65% of euchromatin sequence from the cat genome, these comparative insights shed new light on the tempo and mode of gene/genome evolution in mammals, promise several research applications for the cat, and also illustrate that a comparative approach using more deeply covered mammals provides an informative, preliminary annotation of a light (1.9-fold) coverage mammal genome sequence.

Subject(s)

Cats/genetics , Genome , Genomics , Animals , Dogs , Humans , Mice , MicroRNAs , Microsatellite Repeats , Models, Genetic , Polymorphism, Single Nucleotide , Rats , Repetitive Sequences, Nucleic Acid

10.

Distinguishing protein-coding and noncoding genes in the human genome.

Clamp, Michele; Fry, Ben; Kamal, Mike; Xie, Xiaohui; Cuff, James; Lin, Michael F; Kellis, Manolis; Lindblad-Toh, Kerstin; Lander, Eric S.

Proc Natl Acad Sci U S A ; 104(49): 19428-33, 2007 Dec 04.

Article in English | MEDLINE | ID: mdl-18040051

ABSTRACT

Although the Human Genome Project was completed 4 years ago, the catalog of human protein-coding genes remains a matter of controversy. Current catalogs list a total of approximately 24,500 putative protein-coding genes. It is broadly suspected that a large fraction of these entries are functionally meaningless ORFs present by chance in RNA transcripts, because they show no evidence of evolutionary conservation with mouse or dog. However, there is currently no scientific justification for excluding ORFs simply because they fail to show evolutionary conservation: the alternative hypothesis is that most of these ORFs are actually valid human genes that reflect gene innovation in the primate lineage or gene loss in the other lineages. Here, we reject this hypothesis by carefully analyzing the nonconserved ORFs, specifically their properties in other primates. We show that the vast majority of these ORFs are random occurrences. The analysis yields, as a by-product, a major revision of the current human catalogs, cutting the number of protein-coding genes to approximately 20,500. Specifically, it suggests that nonconserved ORFs should be added to the human gene catalog only if there is clear evidence of an encoded protein. It also provides a principled methodology for evaluating future proposed additions to the human gene catalog. Finally, the results indicate that there has been relatively little true innovation in mammalian protein-coding genes.

Subject(s)

Genetic Code , Genome, Human/genetics , Genomics , Open Reading Frames/genetics , Proteins/genetics , Animals , Base Sequence , DNA Transposable Elements/genetics , Dogs , Genes/genetics , Humans , Mice , Molecular Sequence Data , Pseudogenes/genetics , Sequence Analysis, DNA

11.

Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome.

Margulies, Elliott H; Cooper, Gregory M; Asimenos, George; Thomas, Daryl J; Dewey, Colin N; Siepel, Adam; Birney, Ewan; Keefe, Damian; Schwartz, Ariel S; Hou, Minmei; Taylor, James; Nikolaev, Sergey; Montoya-Burgos, Juan I; Löytynoja, Ari; Whelan, Simon; Pardi, Fabio; Massingham, Tim; Brown, James B; Bickel, Peter; Holmes, Ian; Mullikin, James C; Ureta-Vidal, Abel; Paten, Benedict; Stone, Eric A; Rosenbloom, Kate R; Kent, W James; Bouffard, Gerard G; Guan, Xiaobin; Hansen, Nancy F; Idol, Jacquelyn R; Maduro, Valerie V B; Maskeri, Baishali; McDowell, Jennifer C; Park, Morgan; Thomas, Pamela J; Young, Alice C; Blakesley, Robert W; Muzny, Donna M; Sodergren, Erica; Wheeler, David A; Worley, Kim C; Jiang, Huaiyang; Weinstock, George M; Gibbs, Richard A; Graves, Tina; Fulton, Robert; Mardis, Elaine R; Wilson, Richard K; Clamp, Michele; Cuff, James.

Genome Res ; 17(6): 760-74, 2007 Jun.

Article in English | MEDLINE | ID: mdl-17567995

ABSTRACT

A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.

Subject(s)

Evolution, Molecular , Genome, Human , Mammals/genetics , Open Reading Frames , Phylogeny , Sequence Alignment , Animals , Human Genome Project , Humans

12.

Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences.

Mikkelsen, Tarjei S; Wakefield, Matthew J; Aken, Bronwen; Amemiya, Chris T; Chang, Jean L; Duke, Shannon; Garber, Manuel; Gentles, Andrew J; Goodstadt, Leo; Heger, Andreas; Jurka, Jerzy; Kamal, Michael; Mauceli, Evan; Searle, Stephen M J; Sharpe, Ted; Baker, Michelle L; Batzer, Mark A; Benos, Panayiotis V; Belov, Katherine; Clamp, Michele; Cook, April; Cuff, James; Das, Radhika; Davidow, Lance; Deakin, Janine E; Fazzari, Melissa J; Glass, Jacob L; Grabherr, Manfred; Greally, John M; Gu, Wanjun; Hore, Timothy A; Huttley, Gavin A; Kleber, Michael; Jirtle, Randy L; Koina, Edda; Lee, Jeannie T; Mahony, Shaun; Marra, Marco A; Miller, Robert D; Nicholls, Robert D; Oda, Mayumi; Papenfuss, Anthony T; Parra, Zuly E; Pollock, David D; Ray, David A; Schein, Jacqueline E; Speed, Terence P; Thompson, Katherine; VandeBerg, John L; Wade, Claire M.

Nature ; 447(7141): 167-77, 2007 May 10.

Article in English | MEDLINE | ID: mdl-17495919

ABSTRACT

We report a high-quality draft of the genome sequence of the grey, short-tailed opossum (Monodelphis domestica). As the first metatherian ('marsupial') species to be sequenced, the opossum provides a unique perspective on the organization and evolution of mammalian genomes. Distinctive features of the opossum chromosomes provide support for recent theories about genome evolution and function, including a strong influence of biased gene conversion on nucleotide sequence composition, and a relationship between chromosomal characteristics and X chromosome inactivation. Comparison of opossum and eutherian genomes also reveals a sharp difference in evolutionary innovation between protein-coding and non-coding functional elements. True innovation in protein-coding genes seems to be relatively rare, with lineage-specific differences being largely due to diversification and rapid turnover in gene families involved in environmental interactions. In contrast, about 20% of eutherian conserved non-coding elements (CNEs) are recent inventions that postdate the divergence of Eutheria and Metatheria. A substantial proportion of these eutherian-specific CNEs arose from sequence inserted by transposable elements, pointing to transposons as a major creative force in the evolution of mammalian gene regulation.

Subject(s)

Evolution, Molecular , Genome/genetics , Genomics , Opossums/genetics , Animals , Base Composition , Conserved Sequence/genetics , DNA Transposable Elements/genetics , Humans , Polymorphism, Single Nucleotide/genetics , Protein Biosynthesis , Synteny/genetics , X Chromosome Inactivation/genetics

13.

Genome sequence, comparative analysis and haplotype structure of the domestic dog.

Lindblad-Toh, Kerstin; Wade, Claire M; Mikkelsen, Tarjei S; Karlsson, Elinor K; Jaffe, David B; Kamal, Michael; Clamp, Michele; Chang, Jean L; Kulbokas, Edward J; Zody, Michael C; Mauceli, Evan; Xie, Xiaohui; Breen, Matthew; Wayne, Robert K; Ostrander, Elaine A; Ponting, Chris P; Galibert, Francis; Smith, Douglas R; DeJong, Pieter J; Kirkness, Ewen; Alvarez, Pablo; Biagi, Tara; Brockman, William; Butler, Jonathan; Chin, Chee-Wye; Cook, April; Cuff, James; Daly, Mark J; DeCaprio, David; Gnerre, Sante; Grabherr, Manfred; Kellis, Manolis; Kleber, Michael; Bardeleben, Carolyne; Goodstadt, Leo; Heger, Andreas; Hitte, Christophe; Kim, Lisa; Koepfli, Klaus-Peter; Parker, Heidi G; Pollinger, John P; Searle, Stephen M J; Sutter, Nathan B; Thomas, Rachael; Webber, Caleb; Baldwin, Jennifer; Abebe, Adal; Abouelleil, Amr; Aftuck, Lynne; Ait-Zahra, Mostafa.

Nature ; 438(7069): 803-19, 2005 Dec 08.

Article in English | MEDLINE | ID: mdl-16341006

ABSTRACT

Here we report a high-quality draft genome sequence of the domestic dog (Canis familiaris), together with a dense map of single nucleotide polymorphisms (SNPs) across breeds. The dog is of particular interest because it provides important evolutionary information and because existing breeds show great phenotypic diversity for morphological, physiological and behavioural traits. We use sequence comparison with the primate and rodent lineages to shed light on the structure and evolution of genomes and genes. Notably, the majority of the most highly conserved non-coding sequences in mammalian genomes are clustered near a small subset of genes with important roles in development. Analysis of SNPs reveals long-range haplotypes across the entire dog genome, and defines the nature of genetic diversity within and across breeds. The current SNP map now makes it possible for genome-wide association studies to identify genes responsible for diseases and traits, with important consequences for human and companion animal health.

Subject(s)

Dogs/genetics , Evolution, Molecular , Genome/genetics , Genomics , Haplotypes/genetics , Animals , Conserved Sequence/genetics , Dog Diseases/genetics , Dogs/classification , Female , Humans , Hybridization, Genetic , Male , Mice , Mutagenesis/genetics , Polymorphism, Single Nucleotide/genetics , Rats , Short Interspersed Nucleotide Elements/genetics , Synteny/genetics

14.

An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing.

Margulies, Elliott H; Vinson, Jade P; Miller, Webb; Jaffe, David B; Lindblad-Toh, Kerstin; Chang, Jean L; Green, Eric D; Lander, Eric S; Mullikin, James C; Clamp, Michele.

Proc Natl Acad Sci U S A ; 102(13): 4795-800, 2005 Mar 29.

Article in English | MEDLINE | ID: mdl-15778292

ABSTRACT

With the recent completion of a high-quality sequence of the human genome, the challenge is now to understand the functional elements that it encodes. Comparative genomic analysis offers a powerful approach for finding such elements by identifying sequences that have been highly conserved during evolution. Here, we propose an initial strategy for detecting such regions by generating low-redundancy sequence from a collection of 16 eutherian mammals, beyond the 7 for which genome sequence data are already available. We show that such sequence can be accurately aligned to the human genome and used to identify most of the highly conserved regions. Although not a long-term substitute for generating high-quality genomic sequences from many mammalian species, this strategy represents a practical initial approach for rapidly annotating the most evolutionarily conserved sequences in the human genome, providing a key resource for the systematic study of human genome function.

Subject(s)

Conserved Sequence/genetics , Genome, Human , Genomics/methods , Mammals/genetics , Sequence Analysis, DNA/methods , Animals , Base Sequence , Computational Biology , Humans , Phylogeny , Sequence Alignment

15.

The Ensembl core software libraries.

Stabenau, Arne; McVicker, Graham; Melsopp, Craig; Proctor, Glenn; Clamp, Michele; Birney, Ewan.

Genome Res ; 14(5): 929-33, 2004 May.

Article in English | MEDLINE | ID: mdl-15123588

ABSTRACT

Systems for managing genomic data must store a vast quantity of information. Ensembl stores these data in several MySQL databases. The core software libraries provide a practical and effective means for programmers to access these data. By encapsulating the underlying database structure, the libraries present end users with a simple, abstract interface to a complex data model. Programs that use the libraries rather than SQL to access the data are unaffected by most schema changes. The architecture of the core software libraries, the schema, and the factors influencing their design are described. All code and data are freely available.
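
The encapsulation described in this abstract can be illustrated with a small adaptor class. This is a sketch in Python using an in-memory SQLite database, not the Ensembl Perl API or its MySQL schema; the `GeneAdaptor` class, the `gene` table and its column names are all hypothetical.

```python
import sqlite3


class GeneAdaptor:
    """Hypothetical adaptor: the only code that knows table and column names."""

    def __init__(self, conn):
        self._conn = conn

    def fetch_by_name(self, name):
        row = self._conn.execute(
            "SELECT name, seq_start, seq_end FROM gene WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            return None
        # Callers see a plain record, never the underlying schema.
        return {"name": row[0], "start": row[1], "end": row[2]}


def demo_gene_length():
    """Build a toy database and compute a gene length through the adaptor."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (name TEXT, seq_start INTEGER, seq_end INTEGER)")
    conn.execute("INSERT INTO gene VALUES ('geneA', 100, 900)")
    adaptor = GeneAdaptor(conn)
    gene = adaptor.fetch_by_name("geneA")
    missing = adaptor.fetch_by_name("no_such_gene")
    conn.close()
    return gene["end"] - gene["start"] + 1, missing
```

If a column were renamed, only the query inside `fetch_by_name` would need to change; code such as `demo_gene_length` that works with the returned record would be unaffected, which is the property the abstract attributes to programs that use the libraries rather than SQL.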

Subject(s)

Computational Biology , Software , Animals , Databases, Genetic , Humans , Software Design

16.

The Ensembl analysis pipeline.

Potter, Simon C; Clarke, Laura; Curwen, Val; Keenan, Stephen; Mongin, Emmanuel; Searle, Stephen M J; Stabenau, Arne; Storey, Roy; Clamp, Michele.

Genome Res ; 14(5): 934-41, 2004 May.

Article in English | MEDLINE | ID: mdl-15123589

ABSTRACT

The Ensembl pipeline is an extension to the Ensembl system which allows automated annotation of genomic sequence. The software comprises two parts. First, there is a set of Perl modules ("Runnables" and "RunnableDBs") which are 'wrappers' for a variety of commonly used analysis tools. These retrieve sequence data from a relational database, run the analysis, and write the results back to the database. They inherit from a common interface, which simplifies the writing of new wrapper modules. On top of this sits a job submission system (the "RuleManager") which allows efficient and reliable submission of large numbers of jobs to a compute farm. Here we describe the fundamental software components of the pipeline, and we also highlight some features of the Sanger installation which were necessary to enable the pipeline to scale to whole-genome analysis.
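
The wrapper pattern described here (retrieve input, run the analysis, write the results back, all behind a common interface) can be sketched as follows. The sketch is in Python rather than Perl, a dictionary stands in for the relational database, and the class names, method names and GC-content analysis are hypothetical illustrations rather than the pipeline's own modules.

```python
from abc import ABC, abstractmethod


class Runnable(ABC):
    """Hypothetical common interface shared by every analysis wrapper."""

    def __init__(self, store):
        self.store = store  # stand-in for the relational database

    @abstractmethod
    def fetch_input(self, key):
        """Retrieve the data needed for one job."""

    @abstractmethod
    def run(self, data):
        """Run the wrapped analysis and return its result."""

    @abstractmethod
    def write_output(self, key, result):
        """Write the result back to the store."""

    def execute(self, key):
        # The fixed three-step life cycle that every wrapper follows.
        data = self.fetch_input(key)
        result = self.run(data)
        self.write_output(key, result)


class GCContent(Runnable):
    """Toy analysis: fraction of G and C bases in a sequence."""

    def fetch_input(self, key):
        return self.store["sequences"][key]

    def run(self, seq):
        return (seq.count("G") + seq.count("C")) / len(seq)

    def write_output(self, key, result):
        self.store["results"][key] = result


store = {"sequences": {"contig1": "GGCCAATT"}, "results": {}}
GCContent(store).execute("contig1")
```

Because the life cycle lives in the base class, a new wrapper only has to supply the three methods, and a job submission layer could call `execute` on any wrapper without knowing which analysis it runs.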

Subject(s)

Computational Biology/methods , Base Sequence/genetics , DNA/genetics , Databases, Genetic/standards , Programming Languages , Proteins/classification , Software , Software Design

17.

The Ensembl automatic gene annotation system.

Curwen, Val; Eyras, Eduardo; Andrews, T Daniel; Clarke, Laura; Mongin, Emmanuel; Searle, Steven M J; Clamp, Michele.

Genome Res ; 14(5): 942-50, 2004 May.

Article in English | MEDLINE | ID: mdl-15123590

ABSTRACT

As more genomes are sequenced, there is an increasing need for automated first-pass annotation which allows timely access to important genomic information. The Ensembl gene-building system enables fast automated annotation of eukaryotic genomes. It annotates genes based on evidence derived from known protein, cDNA, and EST sequences. The gene-building system rests on top of the core Ensembl (MySQL) database schema and Perl Application Programming Interface (API), and the data generated are accessible through the Ensembl genome browser (http://www.ensembl.org). To date, the Ensembl predicted gene sets are available for the A. gambiae, C. briggsae, zebrafish, mouse, rat, and human genomes and have been heavily relied upon in the publication of the human, mouse, rat, and A. gambiae genome sequence analysis. Here we describe in detail the gene-building system and the algorithms involved. All code and data are freely available from http://www.ensembl.org.

Subject(s)

Automation , Computational Biology/methods , Genes/physiology , Animals , Anopheles/genetics , Caenorhabditis/genetics , DNA/genetics , DNA, Helminth/genetics , Expressed Sequence Tags , Gene Dosage , Genes, Helminth/physiology , Genes, Insect/physiology , Genome , Genome, Human , Helminth Proteins/genetics , Humans , Insect Proteins/genetics , Mice , Predictive Value of Tests , Proteins/genetics , Pseudogenes/genetics , Rats , Sequence Alignment/methods , Sequence Homology, Amino Acid , Software , Tandem Repeat Sequences/genetics , Untranslated Regions/genetics

18.

The otter annotation system.

Searle, Stephen M J; Gilbert, James; Iyer, Vivek; Clamp, Michele.

Genome Res ; 14(5): 963-70, 2004 May.

Article in English | MEDLINE | ID: mdl-15123593

ABSTRACT

With the completion of the human genome sequence and genome sequence available for other vertebrate genomes, the task of manual annotation at the large genome scale has become a priority. Possibly even more important is the requirement to curate and improve this annotation in the light of future data. For this to be possible, there is a need for tools to access and manage the annotation. Ensembl provides an excellent means for storing gene structures, genome features, and sequence, but it does not support the extra textual data necessary for manual annotation. We have extended Ensembl to create the Otter manual annotation system. This comprises a relational database schema for storing the manual annotation data, an application-programming interface (API) to access it, an extensible markup language (XML) format to allow transfer of the data, and a server to allow multiuser/multimachine access to the data. We have also written a data-adaptor plugin for the Apollo Browser/Editor to enable it to utilize an Otter server. The Otter database is currently used by the Vertebrate Genome Annotation (VEGA) site (http://vega.sanger.ac.uk), which provides access to manually curated human chromosomes. Support is also being developed for using the AceDB annotation editor, FMap, via a Perl wrapper called Lace. The Human and Vertebrate Annotation (HAVANA) group annotators at the Sanger center are using this to annotate human chromosomes 1 and 20.

Subject(s)

Software , Computational Biology/methods , Databases, Genetic , Genes/physiology , Genome, Human , Humans , Online Systems

19.

ESTGenes: alternative splicing from ESTs in Ensembl.

Eyras, Eduardo; Caccamo, Mario; Curwen, Val; Clamp, Michele.

Genome Res ; 14(5): 976-87, 2004 May.

Article in English | MEDLINE | ID: mdl-15123595

ABSTRACT

We describe a novel algorithm for deriving the minimal set of nonredundant transcripts compatible with the splicing structure of a set of ESTs mapped on a genome. Sets of ESTs with compatible splicing are represented by a special type of graph. We describe the algorithms for building the graphs and for deriving the minimal set of transcripts from the graphs that are compatible with the evidence. These algorithms are part of the Ensembl automatic gene annotation system, and its results, using ESTs, are provided at www.ensembl.org as ESTgenes for the mosquito, Caenorhabditis briggsae, C. elegans, zebrafish, human, mouse, and rat genomes. Here we also report on the results of this method applied to the human and mouse genomes.

Subject(s)

Alternative Splicing/genetics , Expressed Sequence Tags , Software , Animals , Caenorhabditis/genetics , Caenorhabditis elegans/genetics , Computational Biology , Culicidae/genetics , DNA, Helminth/genetics , Genes , Genes, Helminth , Genes, Insect , Humans , Mice , Predictive Value of Tests , Rats , Reproducibility of Results , Transcription, Genetic , Zebrafish/genetics

20.

GeneWise and Genomewise.

Birney, Ewan; Clamp, Michele; Durbin, Richard.

Genome Res ; 14(5): 988-95, 2004 May.

Article in English | MEDLINE | ID: mdl-15123596

ABSTRACT

We present two algorithms in this paper: GeneWise, which predicts gene structure using similar protein sequences, and Genomewise, which provides a gene structure final parse across cDNA- and EST-defined spliced structure. Both algorithms are heavily used by the Ensembl annotation system. The GeneWise algorithm was developed from a principled combination of hidden Markov models (HMMs). Both algorithms are highly accurate and can provide both accurate and complete gene structures when used with the correct evidence.

Subject(s)

Software , 3' Flanking Region , 5' Flanking Region , Algorithms , Computational Biology/methods , DNA, Complementary , Models, Theoretical , Predictive Value of Tests , Research Design
