Results 1 - 17 of 17
1.
Genome Biol ; 24(1): 223, 2023 10 05.
Article in English | MEDLINE | ID: mdl-37798615

ABSTRACT

Crop pangenomes made from individual cultivar assemblies promise easy access to conserved genes, but genome content variability and inconsistent identifiers hamper their exploration. To address this, we define pangenes, which summarize a species' coding potential and link back to original annotations. The protocol get_pangenes performs whole genome alignments (WGA) to call syntenic gene models based on coordinate overlaps. A benchmark with small and large plant genomes shows that pangenes recapitulate phylogeny-based orthologies and produce complete soft-core gene sets. Moreover, WGAs support lift-over and help confirm gene presence-absence variation. Source code and documentation: https://github.com/Ensembl/plant-scripts.
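
The idea of calling gene models as matching "based on coordinate overlaps" can be illustrated with a minimal Python sketch. It assumes a gene model from one cultivar has already been lifted over onto the coordinates of another through a whole genome alignment. The names `Interval`, `overlap_fraction` and `same_pangene`, and the 0.5 threshold, are hypothetical and are not taken from get_pangenes itself.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A gene model span; coordinates are 1-based and inclusive."""
    chrom: str
    start: int
    end: int


def overlap_fraction(a: Interval, b: Interval) -> float:
    """Return the overlap length divided by the length of the shorter interval."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    shorter = min(a.end - a.start + 1, b.end - b.start + 1)
    return shared / shorter


def same_pangene(lifted: Interval, annotated: Interval,
                 min_fraction: float = 0.5) -> bool:
    """Hypothetical rule: two models match when they overlap enough."""
    return overlap_fraction(lifted, annotated) >= min_fraction
```

Normalizing by the shorter interval is one possible design choice: it lets a short, fragmented model match a longer one that fully contains it.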


Subject(s)
Genome, Plant , Software
2.
F1000Res ; 11, 2022.
Article in English | MEDLINE | ID: mdl-35811804

ABSTRACT

In this opinion article, we discuss the formatting of files from (plant) genotyping studies, in particular the formatting of (meta-) data in Variant Call Format (VCF) files. The flexibility of the VCF format specification facilitates its use as a generic interchange format across domains but can lead to inconsistency between files in the presentation of metadata. To enable fully autonomous machine actionable data flow, generic elements need to be further specified. We strongly support the merits of the FAIR principles and see the need to facilitate them also through technical implementation specifications. VCF files are an established standard for the exchange and publication of genotyping data. Other data formats are also used to capture variant call data (for example, the HapMap format and the gVCF format), but none currently have the reach of VCF. In VCF, only the sites of variation are described, whereas in gVCF, all positions are listed, and confidence values are also provided. For the sake of simplicity, we will only discuss VCF and our recommendations for its use. However, the part of the VCF standard relating to metadata (as opposed to the actual variant calls) defines a syntactic format but no vocabulary, unique identifier or recommended content. In practice, often only sparse (if any) descriptive metadata is included. When descriptive metadata is provided, proprietary metadata fields are frequently added that have not been agreed upon within the community which may limit long-term and comprehensive interoperability. To address this, we propose recommendations for supplying and encoding metadata, focusing on use cases from the plant sciences. We expect there to be overlap, but also divergence, with the needs of other domains.
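
As a rough illustration of the point that VCF metadata has a defined syntax but no agreed vocabulary, the sketch below reads the `##key=value` meta-information lines from the top of a VCF file and reports which keys from a checklist are absent. The function names and the checklist are assumptions made for this example; they are not the recommendations proposed in the article.

```python
from collections import defaultdict


def read_vcf_meta(lines):
    """Collect '##key=value' meta-information lines until the first other line."""
    meta = defaultdict(list)
    for line in lines:
        if not line.startswith("##"):
            break  # the '#CHROM' header line or a data line ends the metadata
        key, _, value = line[2:].rstrip("\n").partition("=")
        meta[key].append(value)
    return dict(meta)


def missing_keys(meta, required):
    """Return the required keys that the metadata does not provide."""
    return [key for key in required if key not in meta]
```

A file could then be screened with, for example, `missing_keys(read_vcf_meta(open(path)), ["fileformat", "reference"])`, where `path` and the two keys are placeholders chosen by the user.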


Subject(s)
Metadata , Software , Genotype
3.
Methods Mol Biol ; 2443: 27-55, 2022.
Article in English | MEDLINE | ID: mdl-35037199

ABSTRACT

Ensembl Plants ( http://plants.ensembl.org ) offers genome-scale information for plants, with four releases per year. As of release 47 (April 2020) it features 79 species and includes genome sequence, gene models, and functional annotation. Comparative analyses help reconstruct the evolutionary history of gene families, genomes, and components of polyploid genomes. Some species have gene expression baseline reports or variation across genotypes. While the data can be accessed through the Ensembl genome browser, here we review specifically how our plant genomes can be interrogated programmatically and the data downloaded in bulk. These access routes are generally consistent across Ensembl for other non-plant species, including plant pathogens, pests, and pollinators.


Subject(s)
Databases, Genetic , Genomics , Genome, Plant , Molecular Sequence Annotation , Plants/genetics , Software
4.
Nucleic Acids Res ; 50(D1): D996-D1003, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34791415

ABSTRACT

Ensembl Genomes (https://www.ensemblgenomes.org) provides access to non-vertebrate genomes and analysis complementing vertebrate resources developed by the Ensembl project (https://www.ensembl.org). The two resources collectively present genome annotation through a consistent set of interfaces spanning the tree of life presenting genome sequence, annotation, variation, transcriptomic data and comparative analysis. Here, we present our largest increase in plant, metazoan and fungal genomes since the project's inception creating one of the world's most comprehensive genomic resources and describe our efforts to reduce genome redundancy in our Bacteria portal. We detail our new efforts in gene annotation, our emerging support for pangenome analysis, our efforts to accelerate data dissemination through the Ensembl Rapid Release resource and our new AlphaFold visualization. Finally, we present details of our future plans including updates on our integration with Ensembl, and how we plan to improve our support for the microbial research community. Software and data are made available without restriction via our website, online tools platform and programmatic interfaces (available under an Apache 2.0 license). Data updates are synchronised with Ensembl's release cycle.


Subject(s)
Databases, Genetic , Genomics , Internet , Software , Animals , Computational Biology , Genome, Bacterial/genetics , Genome, Fungal/genetics , Genome, Plant/genetics , Plants/classification , Plants/genetics , Vertebrates/classification , Vertebrates/genetics
5.
Plant Genome ; 14(3): e20143, 2021 11.
Article in English | MEDLINE | ID: mdl-34562304

ABSTRACT

The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis, or pangenome exploration. Although homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here, we benchmarked a two-step approach, where repeats were first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, with the k-mer-based Repeat Detector (Red) and two repeat libraries (REdat, last updated in 2013, and nrTEplants, curated for this work). Custom libraries produced by RepeatModeler were also tested. We obtained repeated genome fractions that matched those reported in the literature but with shorter repeated elements than those produced directly by sequence homology. Inspection of the masked regions that overlapped genes revealed no preference for specific protein domains. Most Red-masked sequences could be successfully classified by sequence similarity, with the complete protocol taking less than 2 h on a desktop Linux box. A guide to curating your own repeat libraries and the scripts for masking and annotating plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.
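
The first step of the two-step approach, calling repeats by k-mer counting, can be sketched as follows. This is a toy illustration only: it is not the algorithm implemented in Red, and the function name `soft_mask_repeats` and the default values of `k` and `min_count` are assumptions for this example.

```python
from collections import Counter


def soft_mask_repeats(seq, k=4, min_count=3):
    """Lowercase every position covered by a k-mer seen at least min_count times."""
    seq = seq.upper()
    n_kmers = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_kmers))
    masked = [False] * len(seq)
    for i in range(n_kmers):
        if counts[seq[i:i + k]] >= min_count:
            for j in range(i, i + k):
                masked[j] = True
    return "".join(c.lower() if m else c for c, m in zip(seq, masked))
```

The masked (lowercase) stretches would then be compared against a curated repeat library in the second, annotation step, which this sketch does not cover.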


Subject(s)
Genome, Plant , Repetitive Sequences, Nucleic Acid
6.
Nucleic Acids Res ; 49(D1): D1452-D1463, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33170273

ABSTRACT

Gramene (http://www.gramene.org), a knowledgebase founded on comparative functional analyses of genomic and pathway data for model plants and major crops, supports agricultural researchers worldwide. The resource is committed to open access and reproducible science based on the FAIR data principles. Since the last NAR update, we made nine releases; doubled the genome portal's content; expanded curated genes, pathways and expression sets; and implemented the Domain Informational Vocabulary Extraction (DIVE) algorithm for extracting gene function information from publications. The current release, #63 (October 2020), hosts 93 reference genomes, with over 3.9 million genes in 122 947 families with orthologous and paralogous classifications. Plant Reactome portrays pathway networks using a combination of manual biocuration in rice (320 reference pathways) and orthology-based projections to 106 species. The Reactome platform facilitates comparison between reference and projected pathways, gene expression analyses and overlays of gene-gene interactions. Gramene integrates ontology-based protein structure-function annotation; information on genetic, epigenetic, expression, and phenotypic diversity; and gene functional annotations extracted from plant-focused journals using DIVE. We train plant researchers in biocuration of genes and pathways; host curated maize gene structures as tracks in the maize genome browser; and integrate curated rice genes and pathways in the Plant Reactome.


Subject(s)
Databases, Genetic , Gene Expression Regulation, Plant , Genome, Plant , Genomics/methods , Plant Proteins/genetics , Plants/genetics , Crops, Agricultural , DNA Transposable Elements , Gene Duplication , Gene Ontology , Gene Regulatory Networks , Internet , Knowledge Bases , Metabolic Networks and Pathways , Molecular Sequence Annotation , Oryza/genetics , Oryza/metabolism , Plant Proteins/metabolism , Plants/classification , Plants/metabolism , Polyploidy , Protein Interaction Mapping , Software , Zea mays/genetics , Zea mays/metabolism
7.
Elife ; 9, 2020 03 24.
Article in English | MEDLINE | ID: mdl-32208137

ABSTRACT

Understanding the function of genes within staple crops will accelerate crop improvement by allowing targeted breeding approaches. Despite their importance, a lack of genomic information and resources has hindered the functional characterisation of genes in major crops. The recent release of high-quality reference sequences for these crops underpins a suite of genetic and genomic resources that support basic research and breeding. For wheat, these include gene model annotations, expression atlases and gene networks that provide information about putative function. Sequenced mutant populations, improved transformation protocols and structured natural populations provide rapid methods to study gene function directly. We highlight a case study exemplifying how to integrate these resources. This review provides a helpful guide for plant scientists, especially those expanding into crop research, to capitalise on the discoveries made in Arabidopsis and other plants. This will accelerate the improvement of crops of vital importance for food and nutrition security.


Subject(s)
Arabidopsis/genetics , Crops, Agricultural/genetics , Genome, Plant/genetics , Triticum/genetics , Genomics/methods , Molecular Sequence Annotation/methods , Plant Breeding/methods , Polyploidy
8.
Genes (Basel) ; 11(3), 2020 03 06.
Article in English | MEDLINE | ID: mdl-32155892

ABSTRACT

Sunflower germplasm collections are valuable resources for broadening the genetic base of commercial hybrids and ameliorating the risk of climate events. Currently, the most studied sunflower pre-breeding collections worldwide belong to INTA (Argentina), INRA (France), and USDA-UBC (United States of America-Canada). In this work, we assess the amount and distribution of genetic diversity (GD) available within and between these collections to estimate the distribution pattern of global diversity. A mixed genotyping strategy was implemented, by combining proprietary genotyping-by-sequencing data with public whole-genome-sequencing data, to generate an integrative 11,834-common single nucleotide polymorphism matrix including the three breeding collections. In general, the GD estimates obtained were moderate. An analysis of molecular variance provided evidence of population structure between breeding collections. However, the optimal number of subpopulations, studied via discriminant analysis of principal components (K = 12), the Bayesian STRUCTURE algorithm (K = 6) and distance-based methods (K = 9), remains unclear, since no single unifying characteristic is apparent for any of the inferred groups. Different overall patterns of linkage disequilibrium (LD) were observed across chromosomes, with Chr10, Chr17, Chr5, and Chr2 showing the highest LD. This work represents the largest and most comprehensive inter-breeding collection analysis of genomic diversity for cultivated sunflower conducted to date.


Subject(s)
Helianthus/genetics , Linkage Disequilibrium , Polymorphism, Genetic , Seed Bank , Chromosomes, Plant/genetics , Plant Breeding/methods
9.
Nucleic Acids Res ; 48(D1): D689-D695, 2020 01 08.
Article in English | MEDLINE | ID: mdl-31598706

ABSTRACT

Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the context of the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of interfaces to genomic data across the tree of life, including reference genome sequence, gene models, transcriptional data, genetic variation and comparative analysis. Data may be accessed via our website, online tools platform and programmatic interfaces, with updates made four times per year (in synchrony with Ensembl). Here, we provide an overview of Ensembl Genomes, with a focus on recent developments. These include the continued growth, more robust and reproducible sets of orthologues and paralogues, and enriched views of gene expression and gene function in plants. Finally, we report on our continued deeper integration with the Ensembl project, which forms a key part of our future strategy for dealing with the increasing quantity of available genome-scale data across the tree of life.


Subject(s)
Computational Biology/methods , Databases, Genetic , Genetic Variation , Genome, Bacterial , Genome, Fungal , Genome, Plant , Algorithms , Animals , Caenorhabditis elegans/genetics , Genomics , Internet , Molecular Sequence Annotation , Phenotype , Plants/genetics , Reference Values , Software , User-Computer Interface
10.
Nature ; 563(7730): 197-202, 2018 11.
Article in English | MEDLINE | ID: mdl-30356220

ABSTRACT

As the first line of defence against pathogens, cells mount an innate immune response, which varies widely from cell to cell. The response must be potent but carefully controlled to avoid self-damage. How these constraints have shaped the evolution of innate immunity remains poorly understood. Here we characterize the innate immune response's transcriptional divergence between species and variability in expression among cells. Using bulk and single-cell transcriptomics in fibroblasts and mononuclear phagocytes from different species, challenged with immune stimuli, we map the architecture of the innate immune response. Transcriptionally diverging genes, including those that encode cytokines and chemokines, vary across cells and have distinct promoter structures. Conversely, genes that are involved in the regulation of this response, such as those that encode transcription factors and kinases, are conserved between species and display low cell-to-cell variability in expression. We suggest that this expression pattern, which is observed across species and conditions, has evolved as a mechanism for fine-tuned regulation to achieve an effective but balanced response.


Subject(s)
Cells/metabolism , Evolution, Molecular , Immunity, Innate/genetics , Immunity, Innate/immunology , Organ Specificity/genetics , Species Specificity , Transcription, Genetic/genetics , Animals , Cells/cytology , Cytokines/genetics , Humans , Promoter Regions, Genetic/genetics
11.
Nucleic Acids Res ; 46(D1): D1181-D1189, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29165610

ABSTRACT

Gramene (http://www.gramene.org) is a knowledgebase for comparative functional analysis in major crops and model plant species. The current release, #54, includes over 1.7 million genes from 44 reference genomes, most of which were organized into 62,367 gene families through orthologous and paralogous gene classification, whole-genome alignments, and synteny. Additional gene annotations include ontology-based protein structure and function; genetic, epigenetic, and phenotypic diversity; and pathway associations. Gramene's Plant Reactome provides a knowledgebase of cellular-level plant pathway networks. Specifically, it uses curated rice reference pathways to derive pathway projections for an additional 66 species based on gene orthology, and facilitates display of gene expression, gene-gene interactions, and user-defined omics data in the context of these pathways. As a community portal, Gramene integrates best-of-class software and infrastructure components including the Ensembl genome browser, Reactome pathway browser, and Expression Atlas widgets, and undergoes periodic data and software upgrades. Via powerful, intuitive search interfaces, users can easily query across various portals and interactively analyze search results by clicking on diverse features such as genomic context, highly augmented gene trees, gene expression anatomograms, associated pathways, and external informatics resources. All data in Gramene are accessible through both visual and programmatic interfaces.


Subject(s)
Databases, Genetic , Gene Expression Regulation, Plant , Genomics/methods , Knowledge Bases , Plants/genetics , Epigenesis, Genetic , Gene Ontology , Genetic Research , Genetic Variation , Genome, Plant , Metabolic Networks and Pathways/genetics , Molecular Sequence Annotation , Plants/metabolism , Software , User-Computer Interface
12.
Nucleic Acids Res ; 46(D1): D802-D808, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29092050

ABSTRACT

Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of programmatic and interactive interfaces to a rich range of data including genome sequence, gene models, transcript sequence, genetic variation, and comparative analysis. This paper provides an update to the previous publications about the resource, with a focus on recent developments and expansions. These include the incorporation of almost 20 000 additional genome sequences and over 35 000 tracks of RNA-Seq data, which have been aligned to genomic sequence and made available for visualization. Other advances since 2015 include the release of the database in Resource Description Framework (RDF) format, a large increase in community-derived curation, a new high-performance protein sequence search, additional cross-references, improved annotation of non-protein-coding genes, and the launch of pre-release and archival sites. Collectively, these changes are part of a continuing response to the increasing quantity of publicly-available genome-scale data, and the consequent need to archive, integrate, annotate and disseminate these using automated, scalable methods.


Subject(s)
Archaea/genetics , Bacteria/genetics , Databases, Genetic , Databases, Protein , Eukaryota/genetics , Genomics , Amino Acid Sequence , Animals , Base Sequence , Data Mining , Forecasting , Genome , Molecular Sequence Annotation , RNA/genetics , User-Computer Interface
13.
Bioinformatics ; 28(7): 983-90, 2012 Apr 01.
Article in English | MEDLINE | ID: mdl-22328785

ABSTRACT

MOTIVATION: MicroRNAs (miRNAs) are short sequences that negatively regulate gene expression. The current understanding of miRNA and their corresponding mRNA targets is primarily based on prediction programs. This study addresses the potential of a coordinated action of miRNAs to manipulate the human pathways. Specifically, we investigate the effectiveness of disrupting the topology of human pathway graphs through a regulation by miRNAs. RESULTS: From a set of miRNA candidates that is associated with a pathway, an exhaustive search for all possible doubles and triplets (coined miR-Duo, miR-Trios) is performed. The impact of each miR-combination on the connectivity of the pathway graph was quantified. About 170 human pathways were tested, and the miR-Duos and miR-Trios were scored for their ability to disrupt these pathway graphs. We show that 75% of all pathways are effectively disconnected by a small number of pathway-specific miR-Trios. Only 15% of the human pathways are resistant to fragmentation by miR-Duos or miR-Trios. Significantly, the combination of the most effective miR-Trios is unique. Thus, a specific regulation of a pathway within the cell is guaranteed. The impact of the selected miR-Duo/Trios on various diseases is discussed. CONCLUSIONS: The methodology presented shows that the synthesis of the topology of a network with a detailed understanding of the miRNAs' regulation is useful in exposing critical nodes of the network. We propose the miR-Trio approach as a basis for rationally designed perturbation experiments. CONTACT: michall@cc.huji.ac.il SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
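
The exhaustive search over miRNA pairs and triplets can be sketched in Python as below. The abstract does not state how the impact on connectivity was quantified, so this example assumes a simple stand-in score: the number of connected components that remain after removing every pathway node targeted by the chosen miRNAs. The function names and the input layout (`edges` as node pairs, `targets` as a mapping from miRNA name to targeted nodes) are hypothetical.

```python
from itertools import combinations


def count_components(edges, removed):
    """Count connected components among the nodes that survive removal."""
    adj = {}
    for u, v in edges:
        for node in (u, v):
            if node not in removed:
                adj.setdefault(node, set())
        if u not in removed and v not in removed:
            adj[u].add(v)
            adj[v].add(u)
    seen = set()
    components = 0
    for node in adj:
        if node in seen:
            continue
        components += 1
        seen.add(node)
        stack = [node]
        while stack:  # depth-first traversal of one component
            current = stack.pop()
            for neighbour in adj[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components


def best_combinations(edges, targets, size):
    """Score every miRNA combination of the given size; return the top score and its combos."""
    scored = []
    for combo in combinations(sorted(targets), size):
        removed = set().union(*(targets[mirna] for mirna in combo))
        scored.append((count_components(edges, removed), combo))
    if not scored:
        return 0, []
    best = max(score for score, _ in scored)
    return best, [combo for score, combo in scored if score == best]
```

Calling `best_combinations(edges, targets, 2)` and `best_combinations(edges, targets, 3)` would correspond to the miR-Duo and miR-Trio searches, respectively, under the scoring assumption stated above.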


Subject(s)
Computational Biology/methods , Gene Regulatory Networks , MicroRNAs/genetics , Algorithms , Gene Expression , Humans , RNA, Messenger/genetics
14.
Bioinformatics ; 26(18): i482-8, 2010 Sep 15.
Article in English | MEDLINE | ID: mdl-20823311

ABSTRACT

MOTIVATION: Animal toxins operate by binding to receptors and ion channels. These proteins are short and vary in sequence, structure and function. Sporadic discoveries have also revealed endogenous toxin-like proteins in non-venomous organisms. Viral proteins are the largest group of quickly evolving proteomes. We tested the hypothesis that toxin-like proteins exist in viruses and that they act to modulate functions of their hosts. RESULTS: We updated and improved a classifier for compact proteins resembling short animal toxins that is based on a machine-learning method. We applied it in a large-scale setting to identify toxin-like proteins among short viral proteins. Among the approximately 26 000 representatives of such short proteins, 510 sequences were positively identified. We focused on the 19 highest scoring proteins. Among them, we identified conotoxin-like proteins, growth factors receptor-like proteins and anti-bacterial peptides. Our predictor was shown to enhance annotation inference for many 'uncharacterized' proteins. We conclude that our protocol can expose toxin-like proteins in unexplored niches including metagenomics data and enhance the systematic discovery of novel cell modulators for drug development. AVAILABILITY: ClanTox is available at http://www.clantox.cs.huji.ac.il.


Subject(s)
Viral Proteins/physiology , Viral Proteins/toxicity , Algorithms , Amino Acid Sequence , Animals , Artificial Intelligence , Base Sequence , Conotoxins/chemistry , Metagenomics , Molecular Sequence Data , Protein Folding , Protein Structure, Tertiary , Proteome , Toxins, Biological/physiology , Toxins, Biological/toxicity , Viral Proteins/chemistry
15.
Bioinformatics ; 26(15): 1920-1, 2010 Aug 01.
Article in English | MEDLINE | ID: mdl-20529892

ABSTRACT

SUMMARY: The miRror application provides insights on microRNA (miRNA) regulation. It is based on the notion of a combinatorial regulation by an ensemble of miRNAs or genes. miRror integrates predictions from a dozen miRNA resources that are based on complementary algorithms into a unified statistical framework. For a set of miRNAs as input, the online tool provides a ranked list of targets, based on a set of resources selected by the user, according to their significance of being coordinately regulated. Symmetrically, a set of genes can be used as input to suggest a set of miRNAs. The user can restrict the analysis to the preferred tissue or cell line. miRror is suitable for analyzing results from miRNA profiling, proteomics and gene expression arrays. AVAILABILITY: http://www.proto.cs.huji.ac.il/mirror


Subject(s)
Computational Biology/methods , Internet , MicroRNAs/genetics , Algorithms , Gene Expression Profiling/methods , Gene Expression Regulation , MicroRNAs/metabolism
16.
BMC Genomics ; 10: 593, 2009 Dec 10.
Article in English | MEDLINE | ID: mdl-20003297

ABSTRACT

BACKGROUND: The complete proteome of the starlet sea anemone, Nematostella vectensis, provides insights into gene invention dating back to the Cnidarian-Bilaterian ancestor. With the addition of the complete proteomes of Hydra magnipapillata and Monosiga brevicollis, the investigation of proteins having unique features in early metazoan life has become practical. We focused on the properties and the evolutionary trends of tandem repeat (TR) sequences in Cnidaria proteomes. RESULTS: We found that 11-16% of N. vectensis proteins contain tandem repeats. Most TRs cover 150 amino acid segments that are comprised of basic units of 5-20 amino acids. In total, the N. vectensis proteome has about 3300 unique TR-units, but only a small fraction of them are shared with H. magnipapillata, M. brevicollis, or mammalian proteomes. The overall abundance of these TRs stands out relative to that of 14 proteomes representing the diversity among eukaryotes and within the metazoan world. TR-units are characterized by a unique composition of amino acids, with cysteine and histidine being over-represented. Structurally, most TR-segments are associated with coiled and disordered regions. Interestingly, 80% of the TR-segments can be read in more than one open reading frame. For over 100 of them, translation of the alternative frames would result in long proteins. Most domain families that are characterized as repeats in eukaryotes are found in the TR-proteomes from Nematostella and Hydra. CONCLUSIONS: While most TR-proteins have originated from prediction tools and are still awaiting experimental validations, supportive evidence exists for hundreds of TR-units in Nematostella. The existence of TR-proteins in early metazoan life may have served as a robust mode for novel genes with previously overlooked structural and functional characteristics.


Subject(s)
Proteome/genetics , Sea Anemones/genetics , Tandem Repeat Sequences , Animals , Comparative Genomic Hybridization , Computational Biology , DNA, Complementary/genetics , Evolution, Molecular , Open Reading Frames , Phylogeny , Sequence Analysis, DNA
17.
Nucleic Acids Res ; 37(Web Server issue): W363-8, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19429697

ABSTRACT

Toxins are detected in sporadic species along the evolutionary tree of the animal kingdom. Venomous animals include scorpions, snakes, bees, wasps, frogs and numerous animals living in the sea such as the stonefish, snail, jellyfish, hydra and more. Interestingly, proteins that share a common scaffold with animal toxins also exist in non-venomous species. However, due to their short length and primary sequence diversity, these toxin-like proteins remain undetected by classical search engines and genome annotation tools. We construct a toxin classification machine and web server called ClanTox (Classifier of Animal Toxins) that is based on the extraction of sequence-driven features from the primary protein sequence followed by the application of a classification system trained on known animal toxins. For a given input list of sequences, from venomous or non-venomous settings, the ClanTox system predicts whether each sequence is toxin-like. ClanTox provides a ranked list of positively predicted candidates according to statistical confidence. For each protein, additional information is presented including the presence of a signal peptide, the number of cysteine residues and the associated functional annotations. ClanTox is a discovery-prediction tool for a relatively overlooked niche of toxin-like cell modulators, many of which are therapeutic agent candidates. The ClanTox web server is freely accessible at http://www.clantox.cs.huji.ac.il.


Subject(s)
Software , Toxins, Biological/classification , Animals , Cysteine/chemistry , Proteomics , Sequence Analysis, Protein , Toxins, Biological/chemistry , User-Computer Interface