Results 1 - 20 of 965
1.
Proc Natl Acad Sci U S A ; 121(23): e2403750121, 2024 Jun 04.
Article in English | MEDLINE | ID: mdl-38805269

ABSTRACT

Haplotype-resolved genome assemblies were produced for Chasselas and Ugni Blanc, two heterozygous Vitis vinifera cultivars, by combining high-fidelity long-read sequencing and high-throughput chromosome conformation capture (Hi-C). The telomere-to-telomere full coverage of the chromosomes allowed us to assemble separately the two haplo-genomes of both cultivars and revealed structural variations between the two haplotypes of a given cultivar. The deletions/insertions, inversions, translocations, and duplications provide insight into the evolutionary history and parental relationship among grape varieties. Integration of de novo single long-read sequencing of full-length transcript isoforms (Iso-Seq) yielded a highly improved genome annotation. Given its higher contiguity and the robustness of the Iso-Seq-based annotation, the Chasselas assembly meets the standard to become the annotated reference genome for V. vinifera. Building on these resources, we developed VitExpress, an open interactive transcriptomic platform that provides a genome browser and integrated web tools for expression profiling, and a set of statistical tools (StatTools) for the identification of highly correlated genes. Implementation of the correlation finder tool for MybA1, a major regulator of the anthocyanin pathway, identified candidate genes associated with anthocyanin metabolism, whose expression patterns were experimentally validated as discriminating between black and white grapes. These resources and innovative tools for mining genome-related data are anticipated to foster advances in several areas of grapevine research.
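
The idea behind a correlation finder can be illustrated with a minimal sketch. This is not VitExpress code: the abstract does not say which correlation measure or threshold the tool uses, so Pearson correlation and the `min_r` cutoff are assumptions, and the gene names other than MybA1 and all expression values are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # flat profile: correlation is undefined, treated as 0 here
    return cov / sqrt(var_x * var_y)

def top_correlated(expression, query, min_r=0.9):
    """Return (gene, r) pairs with r >= min_r against the query gene, strongest first."""
    target = expression[query]
    hits = [(gene, pearson(target, profile))
            for gene, profile in expression.items() if gene != query]
    return sorted((h for h in hits if h[1] >= min_r), key=lambda h: -h[1])

# Toy expression matrix: gene -> expression across four samples (invented values).
expression = {
    "MybA1": [0.1, 0.2, 5.0, 9.0],
    "geneA": [0.2, 0.4, 10.1, 18.2],  # hypothetical gene that rises with MybA1
    "geneB": [7.0, 6.5, 1.0, 0.3],    # hypothetical gene that falls as MybA1 rises
    "geneC": [3.0, 3.0, 3.0, 3.0],    # hypothetical gene with a flat profile
}
hits = top_correlated(expression, "MybA1")
```

Only positively correlated genes pass the cutoff in this sketch; a real tool might also report strong negative correlations.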


Subject(s)
Genome, Plant , Haplotypes , Transcriptome , Vitis , Vitis/genetics , Haplotypes/genetics , Transcriptome/genetics , Molecular Sequence Annotation/methods , Gene Expression Profiling/methods , Software
2.
Methods Mol Biol ; 2802: 33-55, 2024.
Article in English | MEDLINE | ID: mdl-38819555

ABSTRACT

The identification of orthologous genes is relevant for comparative genomics, phylogenetic analysis, and functional annotation. There are many computational tools for the prediction of orthologous groups as well as web-based resources that offer orthology datasets for download and online analysis. This chapter presents a simple and practical guide to the process of orthologous group prediction, using a dataset of 10 prokaryotic proteomes as example. The orthology methods covered are OrthoMCL, COGtriangles, OrthoFinder2, and OMA. The authors compare the number of orthologous groups predicted by these various methods, and present a brief workflow for the functional annotation and reconstruction of phylogenies from inferred single-copy orthologous genes. The chapter also demonstrates how to explore two orthology databases: eggNOG6 and OrthoDB.


Subject(s)
Genomics , Phylogeny , Genomics/methods , Computational Biology/methods , Software , Prokaryotic Cells/metabolism , Databases, Genetic , Molecular Sequence Annotation/methods , Multigene Family , Genome, Bacterial
3.
Methods Mol Biol ; 2802: 473-514, 2024.
Article in English | MEDLINE | ID: mdl-38819569

ABSTRACT

Genome sequencing quality, in terms of both read length and accuracy, is constantly improving. By combining long-read sequencing technologies with various scaffolding techniques, chromosome-level genome assemblies are now achievable at an affordable price for non-model organisms. Insects represent an exciting taxon for studying the genomic underpinnings of evolutionary innovations, due to ancient origins, immense species-richness, and broad phenotypic diversity. Here we summarize some of the most important methods for carrying out a comparative genomics study on insects. We describe available tools and offer concrete tips on all stages of such an endeavor from DNA extraction through genome sequencing, annotation, and several evolutionary analyses. Along the way we describe important insect-specific aspects, such as DNA extraction difficulties or gene families that are particularly difficult to annotate, and offer solutions. We describe results from several examples of comparative genomics analyses on insects to illustrate the fascinating questions that can now be addressed in this new age of genomics research.


Subject(s)
Evolution, Molecular , Genome, Insect , Genomics , Insecta , Animals , Insecta/genetics , Genomics/methods , Molecular Sequence Annotation/methods , Phylogeny , Sequence Analysis, DNA/methods
4.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38706315

ABSTRACT

To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than 15000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, albeit at a low rate in comparison to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge with poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on big unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small and annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the current prediction of protein domain annotations. Results are significantly better than state-of-the-art for protein families classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repository. Full source code and data are available at https://github.com/sinc-lab/llm4pfam.
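
The second, supervised stage of such a transfer-learning protocol can be sketched as follows. This is an illustration only, not the llm4pfam pipeline: it assumes the embeddings have already been produced by a pretrained protein language model, and it swaps in a nearest-centroid classifier for the machine learning architectures evaluated in the paper. The family labels and the three-dimensional vectors are made up.

```python
from math import dist

def fit_centroids(embeddings, labels):
    """Average the embedding vectors of each family into one centroid per family."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for k, value in enumerate(vec):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, vec):
    """Assign the family whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))

# Toy stand-ins for precomputed sequence embeddings (real ones have hundreds of dimensions).
train_embeddings = [
    [0.9, 0.1, 0.0],
    [1.0, 0.0, 0.1],
    [0.0, 0.9, 1.0],
    [0.1, 1.0, 0.9],
]
train_labels = ["family_A", "family_A", "family_B", "family_B"]  # hypothetical labels

centroids = fit_centroids(train_embeddings, train_labels)
prediction_1 = predict(centroids, [0.8, 0.2, 0.1])
prediction_2 = predict(centroids, [0.2, 0.7, 0.8])
```

Because the language model is frozen, only this small classifier is fitted on the annotated data, which is what makes the approach workable for sparsely populated families.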


Subject(s)
Databases, Protein , Proteins , Proteins/chemistry , Molecular Sequence Annotation/methods , Computational Biology/methods , Machine Learning
5.
Curr Protoc ; 4(5): e1046, 2024 May.
Article in English | MEDLINE | ID: mdl-38717471

ABSTRACT

Whole-genome sequencing is widely used to investigate population genomic variation in organisms of interest. Assorted tools have been independently developed to call variants from short-read sequencing data aligned to a reference genome, including single nucleotide polymorphisms (SNPs) and structural variations (SVs). We developed SNP-SVant, an integrated, flexible, and computationally efficient bioinformatic workflow that predicts high-confidence SNPs and SVs in organisms without benchmarked variants, which are traditionally used for distinguishing sequencing errors from real variants. In the absence of these benchmarked datasets, we leverage multiple rounds of statistical recalibration to increase the precision of variant prediction. The SNP-SVant workflow is flexible, with user options to trade off accuracy for sensitivity. The workflow predicts SNPs and small insertions and deletions using the Genome Analysis ToolKit (GATK) and predicts SVs using the Genome Rearrangement IDentification Software Suite (GRIDSS), and it culminates in variant annotation using custom scripts. A key utility of SNP-SVant is its scalability. Variant calling is a computationally expensive procedure, and thus, SNP-SVant uses a workflow management system with intermediary checkpoint steps to ensure efficient use of resources by minimizing redundant computations and omitting steps where dependent files are available. SNP-SVant also provides metrics to assess the quality of called variants and converts between VCF and aligned FASTA format outputs to ensure compatibility with downstream tools to calculate selection statistics, which are commonplace in population genomics studies. By accounting for both small and large structural variants, users of this workflow can obtain a wide-ranging view of genomic alterations in an organism of interest.
Overall, this workflow advances our capabilities in assessing the functional consequences of different types of genomic alterations, ultimately improving our ability to associate genotypes with phenotypes. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC.
Basic Protocol: Predicting single nucleotide polymorphisms and structural variations
Support Protocol 1: Downloading publicly available sequencing data
Support Protocol 2: Visualizing variant loci using Integrated Genome Viewer
Support Protocol 3: Converting between VCF and aligned FASTA formats


Subject(s)
Polymorphism, Single Nucleotide , Software , Workflow , Polymorphism, Single Nucleotide/genetics , Computational Biology/methods , Genomics/methods , Molecular Sequence Annotation/methods , Whole Genome Sequencing/methods
6.
PLoS Biol ; 22(5): e3002405, 2024 May.
Article in English | MEDLINE | ID: mdl-38713717

ABSTRACT

We report a new visualization tool for analysis of whole-genome assembly-assembly alignments, the Comparative Genome Viewer (CGV) (https://ncbi.nlm.nih.gov/genome/cgv/). CGV visualizes pairwise same-species and cross-species alignments provided by National Center for Biotechnology Information (NCBI) using assembly alignment algorithms developed by us and others. Researchers can examine large structural differences spanning chromosomes, such as inversions or translocations. Users can also navigate to regions of interest, where they can detect and analyze smaller-scale deletions and rearrangements within specific chromosome or gene regions. RefSeq or user-provided gene annotation is displayed where available. CGV currently provides approximately 800 alignments from over 350 animal, plant, and fungal species. CGV and related NCBI viewers are undergoing active development to further meet needs of the research community in comparative genome visualization.


Subject(s)
Genome , Software , Animals , Genome/genetics , Sequence Alignment/methods , Genomics/methods , Algorithms , United States , Humans , Eukaryota/genetics , Databases, Genetic , National Library of Medicine (U.S.) , Molecular Sequence Annotation/methods
7.
Bioinformatics ; 40(6), 2024 Jun 03.
Article in English | MEDLINE | ID: mdl-38775729

ABSTRACT

MOTIVATION: Today, we know the function of only a small fraction of the protein sequences predicted from genomic data. This problem is even more salient for bacteria, which represent some of the most phylogenetically and metabolically diverse taxa on Earth. This low rate of bacterial gene annotation is compounded by the fact that most function prediction algorithms have focused on eukaryotes, and conventional annotation approaches rely on the presence of similar sequences in existing databases. However, often there are no such sequences for novel bacterial proteins. Thus, we need improved gene function prediction methods tailored for bacteria. Recently, transformer-based language models, adopted from the natural language processing field, have been used to obtain new representations of proteins, to replace amino acid sequences. These representations, referred to as protein embeddings, have shown promise for improving annotation of eukaryotes, but there have been only limited applications on bacterial genomes. RESULTS: To predict gene functions in bacteria, we developed SAFPred, a novel synteny-aware gene function prediction tool based on protein embeddings from state-of-the-art protein language models. SAFPred also leverages the unique operon structure of bacteria through conserved synteny. SAFPred outperformed both conventional sequence-based annotation methods and state-of-the-art methods on multiple bacterial species, including for distant homolog detection, where the sequence similarity to the proteins in the training set was as low as 40%. Using SAFPred to identify gene functions across diverse enterococci, of which some species are major clinical threats, we identified 11 previously unrecognized putative novel toxins, with potential significance to human and animal health. AVAILABILITY AND IMPLEMENTATION: https://github.com/AbeelLab/safpred.


Subject(s)
Algorithms , Bacterial Proteins , Genome, Bacterial , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Software , Bacteria/genetics , Synteny , Computational Biology/methods , Molecular Sequence Annotation/methods
8.
PLoS One ; 19(5): e0304164, 2024.
Article in English | MEDLINE | ID: mdl-38805426

ABSTRACT

Engineered plasmids have been workhorses of recombinant DNA technology for nearly half a century. Plasmids are used to clone DNA sequences encoding new genetic parts and to reprogram cells by combining these parts in new ways. Historically, many genetic parts on plasmids were copied and reused without routinely checking their DNA sequences. With the widespread use of high-throughput DNA sequencing technologies, we now know that plasmids often contain variants of common genetic parts that differ slightly from their canonical sequences. Because the exact provenance of a genetic part on a particular plasmid is usually unknown, it is difficult to determine whether these differences arose due to mutations during plasmid construction and propagation or due to intentional editing by researchers. In either case, it is important to understand how the sequence changes alter the properties of the genetic part. We analyzed the sequences of over 50,000 engineered plasmids using depositor metadata and a metric inspired by the natural language processing field. We detected 217 uncatalogued genetic part variants that were especially widespread or were likely the result of convergent evolution or engineering. Several of these uncatalogued variants are known mutants of plasmid origins of replication or antibiotic resistance genes that are missing from current annotation databases. However, most are uncharacterized, and 3/5 of the plasmids we analyzed contained at least one of the uncatalogued variants. Our results include a list of genetic parts to prioritize for refining engineered plasmid annotation pipelines, highlight widespread variants of parts that warrant further investigation to see whether they have altered characteristics, and suggest cases where unintentional evolution of plasmid parts may be affecting the reliability and reproducibility of science.


Subject(s)
Genetic Engineering , Plasmids , Plasmids/genetics , Genetic Engineering/methods , High-Throughput Nucleotide Sequencing/methods , Molecular Sequence Annotation/methods , Mutation , Base Sequence , Sequence Analysis, DNA/methods
9.
Methods Mol Biol ; 2802: 165-187, 2024.
Article in English | MEDLINE | ID: mdl-38819560

ABSTRACT

Newly sequenced genomes are being added to the tree of life at an unprecedentedly fast pace. A large proportion of such new genomes are phylogenetically close to previously sequenced and annotated genomes. In other cases, whole clades of closely related species or strains ought to be annotated simultaneously. Often, in subsequent studies, differences between the closely related species or strains are the focus of research, while the shared gene structures prevail. We here review methods for comparative structural genome annotation. The reviewed methods include classical approaches such as the alignment of protein sequences or protein profiles against the genome and comparative gene prediction methods that exploit a genome alignment to annotate either a single target genome or all input genomes simultaneously. We discuss how the methods depend on the phylogenetic placement of genomes, give advice on the choice of methods, and examine the consistency between gene structure annotations in an example. Furthermore, we provide practical advice on genome annotation in general.


Subject(s)
Genomics , Molecular Sequence Annotation , Phylogeny , Molecular Sequence Annotation/methods , Genomics/methods , Computational Biology/methods , Genome/genetics , Sequence Alignment/methods , Software
10.
Nat Genet ; 56(5): 767-777, 2024 May.
Article in English | MEDLINE | ID: mdl-38689000

ABSTRACT

We develop a method, SBayesRC, that integrates genome-wide association study (GWAS) summary statistics with functional genomic annotations to improve polygenic prediction of complex traits. Our method is scalable to whole-genome variant analysis and refines signals from functional annotations by allowing them to affect both causal variant probability and causal effect distribution. We analyze 50 complex traits and diseases using ∼7 million common single-nucleotide polymorphisms (SNPs) and 96 annotations. SBayesRC improves prediction accuracy by 14% in European ancestry and up to 34% in cross-ancestry prediction compared to the baseline method SBayesR, which does not use annotations, and outperforms other methods, including LDpred2, LDpred-funct, MegaPRS, PolyPred-S and PRS-CSx. Investigation of factors affecting prediction accuracy identifies a significant interaction between SNP density and annotation information, suggesting whole-genome sequence variants with annotations may further improve prediction. Functional partitioning analysis highlights a major contribution of evolutionary constrained regions to prediction accuracy and the largest per-SNP contribution from nonsynonymous SNPs.


Subject(s)
Genome-Wide Association Study , Molecular Sequence Annotation , Multifactorial Inheritance , Polymorphism, Single Nucleotide , Multifactorial Inheritance/genetics , Genome-Wide Association Study/methods , Humans , Molecular Sequence Annotation/methods , Genomics/methods , Genome, Human , Models, Genetic
11.
Bioinformatics ; 40(4), 2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38640488

ABSTRACT

MOTIVATION: The ENCODE project generated a large collection of eCLIP-seq RNA binding protein (RBP) profiling data with accompanying RNA-seq transcriptomes of shRNA knockdown of RBPs. These data could have utility in understanding the functional impact of genetic variants; however, their potential has not been fully exploited. We implement INCA (Integrative annotation scores of variants for impact on RBP activities) as a multi-step genetic variant scoring approach that leverages the ENCODE RBP data together with ClinVar and integrates multiple computational approaches to aggregate evidence. RESULTS: INCA evaluates variant impacts on RBP activities by leveraging genotypic differences in cell lines used for eCLIP-seq. We show that INCA provides critical specificity, beyond generic scoring for RBP binding disruption, for candidate variants and their linkage-disequilibrium partners. As a result, it can, on average, augment scoring of 46.2% of the candidate variants beyond generic scoring for RBP binding disruption and aid in variant prioritization for follow-up analysis. AVAILABILITY AND IMPLEMENTATION: INCA is implemented in R and is available at https://github.com/keleslab/INCA.


Subject(s)
RNA-Binding Proteins , Humans , RNA-Binding Proteins/metabolism , RNA-Binding Proteins/genetics , Software , Genetic Variation , Computational Biology/methods , Molecular Sequence Annotation/methods
12.
BMC Bioinformatics ; 25(1): 165, 2024 Apr 25.
Article in English | MEDLINE | ID: mdl-38664627

ABSTRACT

BACKGROUND: The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological challenge by annotating protein sequences based on embeddings. RESULTS: Central to our contribution is the soft alignment algorithm, drawing from traditional protein alignment but leveraging embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. This method not only surpasses pooled embedding-based models in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, our approach provides transparent, BLAST-like alignment visualizations, combining traditional biological research with AI advancements to elevate protein annotation through embedding-based analysis while ensuring interpretability. Tests using the Virus Orthologous Groups and ViralZone protein databases indicated that the novel soft alignment approach recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect. CONCLUSION: The embeddings approach shows the great potential of LLMs for enhancing protein sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference in molecular biology.
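
The general idea of scoring an alignment by per-residue embedding similarity rather than a substitution matrix can be sketched as below. This is a hedged illustration, not the authors' soft alignment algorithm: it uses a plain Smith-Waterman local alignment with a linear gap penalty, and the `offset` and `gap` values, the cosine similarity measure, and the toy one-hot "embeddings" are all assumptions.

```python
from math import sqrt

STOP, DIAG, UP, LEFT = 0, 1, 2, 3

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def soft_local_align(emb_a, emb_b, offset=0.5, gap=0.5):
    """Local alignment where the match score is cosine(embedding_i, embedding_j) - offset.

    emb_a, emb_b: lists of per-residue embedding vectors.
    Returns (best_score, list of aligned (i, j) residue index pairs).
    """
    n, m = len(emb_a), len(emb_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    ptr = [[STOP] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = cosine(emb_a[i - 1], emb_b[j - 1]) - offset
            candidates = [
                (0.0, STOP),                      # start a new local alignment
                (score[i - 1][j - 1] + s, DIAG),  # align residue i with residue j
                (score[i - 1][j] - gap, UP),      # gap in sequence b
                (score[i][j - 1] - gap, LEFT),    # gap in sequence a
            ]
            score[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the local alignment starts.
    pairs = []
    i, j = best_pos
    while ptr[i][j] != STOP:
        if ptr[i][j] == DIAG:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif ptr[i][j] == UP:
            i -= 1
        else:
            j -= 1
    return best, pairs[::-1]

# Toy per-residue embeddings: p, q, r are shared between the two sequences.
p, q, r = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]
x, y = [0, 0, 0, 1], [0, 0, 0, -1]
best_score, aligned = soft_local_align([x, p, q, r], [p, q, r, y])
```

The returned index pairs are what make the result traceable residue by residue, in the spirit of the BLAST-like visualizations the abstract describes.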


Subject(s)
Algorithms , Molecular Sequence Annotation , Sequence Alignment , Molecular Sequence Annotation/methods , Sequence Alignment/methods , Viral Proteins/genetics , Viral Proteins/chemistry , Genes, Viral , Databases, Protein , Computational Biology/methods , Amino Acid Sequence
13.
Nat Methods ; 21(5): 793-797, 2024 May.
Article in English | MEDLINE | ID: mdl-38509328

ABSTRACT

SQANTI3 is a tool designed for the quality control, curation and annotation of long-read transcript models obtained with third-generation sequencing technologies. Leveraging its annotation framework, SQANTI3 calculates quality descriptors of transcript models, junctions and transcript ends. With this information, potential artifacts can be identified and replaced with reliable sequences. Furthermore, the integrated functional annotation feature enables subsequent functional iso-transcriptomics analyses.


Subject(s)
Molecular Sequence Annotation , Transcriptome , Humans , Molecular Sequence Annotation/methods , Software , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Protein Isoforms/genetics , High-Throughput Nucleotide Sequencing/methods
14.
Article in English | MEDLINE | ID: mdl-38442065

ABSTRACT

Rapid advances in single-cell chromatin accessibility sequencing (scCAS) technologies have enabled the characterization of epigenomic heterogeneity and increased the demand for automatic annotation of cell types. However, there are few computational methods tailored for cell type annotation in scCAS data and the existing methods perform poorly for differentiating and imbalanced cell types. Here, we propose CASCADE, a novel annotation method based on simulation- and denoising-based strategies. With comprehensive experiments on a number of scCAS datasets, we showed that CASCADE can effectively distinguish the patterns of different cell types and mitigate the effect of high noise levels, and thus achieve significantly better annotation performance for differentiating and imbalanced cell types. Besides, we performed model ablation experiments to show the contribution of modules in CASCADE and conducted extensive experiments to demonstrate the robustness of CASCADE to batch effect, imbalance degree, data sparsity, and number of cell types. Moreover, CASCADE significantly outperformed baseline methods for accurately annotating the cell types in newly sequenced data. We anticipate that CASCADE will greatly assist with characterizing cell heterogeneity in scCAS data analysis.


Subject(s)
Chromatin , Computational Biology , Single-Cell Analysis , Chromatin/genetics , Chromatin/metabolism , Chromatin/chemistry , Single-Cell Analysis/methods , Humans , Computational Biology/methods , Algorithms , Molecular Sequence Annotation/methods , Sequence Analysis, DNA/methods
15.
J Mol Biol ; 436(4): 168416, 2024 Feb 15.
Article in English | MEDLINE | ID: mdl-38143020

ABSTRACT

Neuropeptides not only work through the nervous system; some of them also work peripherally to regulate numerous physiological processes. They are important in the regulation of numerous physiological processes including growth, reproduction, social behavior, inflammation, fluid homeostasis, cardiovascular function, and energy homeostasis. The various roles of neuropeptides make them promising candidates for prospective therapeutics of different diseases. NeuroPep has now been updated to version 2.0 and holds 11,417 unique neuropeptide entries, nearly double the number in the first version of NeuroPep. When available, we collected information about the receptor for each neuropeptide entry and predicted the 3D structures of those neuropeptides without known experimental structure using AlphaFold2 or APPTEST according to the peptide sequence length. In addition, DeepNeuropePred and NeuroPred-PLM, two neuropeptide prediction tools developed by us recently, were also integrated into NeuroPep 2.0 to facilitate the identification of new neuropeptides. NeuroPep 2.0 is freely accessible at https://isyslab.info/NeuroPepV2/.


Subject(s)
Databases, Protein , Molecular Sequence Annotation , Neuropeptides , Amino Acid Sequence , Neuropeptides/chemistry , Molecular Sequence Annotation/methods
16.
J Biol Chem ; 299(9): 105130, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37543366

ABSTRACT

Long noncoding RNAs (lncRNAs) are increasingly being recognized as modulators in various biological processes. However, due to their low expression, their systematic characterization is difficult to determine. Here, we performed transcript annotation by a newly developed computational pipeline, termed RNA-seq and small RNA-seq combined strategy (RSCS), in a wide variety of cellular contexts. Thousands of high-confidence potential novel transcripts were identified by the RSCS, and the reliability of the transcriptome was verified by analysis of transcript structure, base composition, and sequence complexity. Evidenced by the length comparison, the frequency of the core promoter and the polyadenylation signal motifs, and the locations of transcription start and end sites, the transcripts appear to be full length. Furthermore, taking advantage of our strategy, we identified a large number of endogenous retrovirus-associated lncRNAs, and a novel endogenous retrovirus-lncRNA that was functionally involved in control of Yap1 expression and essential for early embryogenesis was identified. In summary, the RSCS can generate a more complete and precise transcriptome, and our findings greatly expanded the transcriptome annotation for the mammalian community.


Subject(s)
Molecular Sequence Annotation , RNA, Long Noncoding , RNA-Seq , Animals , Embryonic Development/genetics , Mammals/embryology , Mammals/genetics , Molecular Sequence Annotation/methods , Promoter Regions, Genetic/genetics , Reproducibility of Results , Retroviridae/genetics , RNA, Long Noncoding/genetics , RNA-Seq/methods , Transcription Initiation Site , Transcriptome/genetics , YAP-Signaling Proteins/genetics , YAP-Signaling Proteins/metabolism
17.
Genome Biol ; 24(1): 135, 2023 Jun 08.
Article in English | MEDLINE | ID: mdl-37291671

ABSTRACT

BACKGROUND: In every living species, the function of a protein depends on its organization of structural domains, and the length of a protein is a direct reflection of this. Because every species evolved under different evolutionary pressures, the protein length distribution, much like other genomic features, is expected to vary across species but has so far been scarcely studied. RESULTS: Here we evaluate this diversity by comparing protein length distribution across 2326 species (1688 bacteria, 153 archaea, and 485 eukaryotes). We find that proteins tend to be on average slightly longer in eukaryotes than in bacteria or archaea, but that the variation of length distribution across species is low, especially compared to the variation of other genomic features (genome size, number of proteins, gene length, GC content, isoelectric points of proteins). Moreover, most cases of atypical protein length distribution appear to be due to artifactual gene annotation, suggesting the actual variation of protein length distribution across species is even smaller. CONCLUSIONS: These results open the way for developing a genome annotation quality metric based on protein length distribution to complement conventional quality measures. Overall, our findings show that protein length distribution between living species is more uniform than previously thought. Furthermore, we also provide evidence for a universal selection on protein length, yet its mechanism and fitness effect remain intriguing open questions.
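
A per-proteome length summary of the kind such a comparison starts from can be computed with a short script. This is a minimal sketch under stated assumptions, not the authors' analysis code: the input is assumed to be a protein FASTA file (here replaced by a tiny inline example), and the choice of summary statistics is illustrative.

```python
from statistics import mean, median

def read_fasta_lengths(lines):
    """Return the sequence length of every record in FASTA-formatted lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line.rstrip("*"))  # ignore a trailing stop-codon marker
    if current is not None:
        lengths.append(current)
    return lengths

def length_summary(lengths):
    """Basic descriptors of a protein length distribution."""
    return {
        "n": len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }

# Inline toy proteome (invented sequences); in practice pass an open file handle.
toy_fasta = """>p1
MKT
AYI
>p2
MSTNPKPQRK*
>p3
MA
""".splitlines()

lengths = read_fasta_lengths(toy_fasta)
summary = length_summary(lengths)
```

Comparing such summaries across proteomes is one simple way to flag an annotation whose length distribution looks atypical.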


Subject(s)
Molecular Sequence Annotation , Proteins , Sequence Analysis, Protein , Amino Acid Sequence , Molecular Sequence Annotation/methods , Proteins/chemistry , Proteins/classification , Proteome , Sequence Analysis, Protein/methods , Eukaryota , Bacteria , Archaea
18.
Science ; 380(6643): eabn3107, 2023 Apr 28.
Article in English | MEDLINE | ID: mdl-37104600

ABSTRACT

Annotating coding genes and inferring orthologs are two classical challenges in genomics and evolutionary biology that have traditionally been approached separately, limiting scalability. We present TOGA (Tool to infer Orthologs from Genome Alignments), a method that integrates structural gene annotation and orthology inference. TOGA implements a different paradigm to infer orthologous loci, improves ortholog detection and annotation of conserved genes compared with state-of-the-art methods, and handles even highly fragmented assemblies. TOGA scales to hundreds of genomes, which we demonstrate by applying it to 488 placental mammal and 501 bird assemblies, creating the largest comparative gene resources so far. Additionally, TOGA detects gene losses, enables selection screens, and automatically provides a superior measure of mammalian genome quality. TOGA is a powerful and scalable method to annotate and compare genes in the genomic era.


Subject(s)
Eutheria , Genomics , Molecular Sequence Annotation , Animals , Female , Mice , Eutheria/genetics , Genome , Genomics/methods , Molecular Sequence Annotation/methods , Birds/genetics
19.
Science ; 379(6639): 1358-1363, 2023 Mar 31.
Article in English | MEDLINE | ID: mdl-36996195

ABSTRACT

Enzyme function annotation is a fundamental challenge, and numerous computational tools have been developed. However, most of these tools cannot accurately predict functional annotations, such as enzyme commission (EC) number, for less-studied proteins or those with previously uncharacterized functions or multiple activities. We present a machine learning algorithm named CLEAN (contrastive learning-enabled enzyme annotation) to assign EC numbers to enzymes with better accuracy, reliability, and sensitivity compared with the state-of-the-art tool BLASTp. The contrastive learning framework empowers CLEAN to confidently (i) annotate understudied enzymes, (ii) correct mislabeled enzymes, and (iii) identify promiscuous enzymes with two or more EC numbers, functions that we demonstrate by systematic in silico and in vitro experiments. We anticipate that this tool will be widely used for predicting the functions of uncharacterized enzymes, thereby advancing many fields, such as genomics, synthetic biology, and biocatalysis.


Subject(s)
Enzymes , Machine Learning , Molecular Sequence Annotation , Proteins , Sequence Analysis, Protein , Algorithms , Computational Biology , Enzymes/chemistry , Genomics , Proteins/chemistry , Reproducibility of Results , Molecular Sequence Annotation/methods , Sequence Analysis, Protein/methods , Biocatalysis
20.
Sci Rep ; 13(1): 1417, 2023 Jan 25.
Article in English | MEDLINE | ID: mdl-36697464

ABSTRACT

We report here a new application, CustomProteinSearch (CusProSe), whose purpose is to help users to search for proteins of interest based on their domain composition. The application is customizable. It consists of two independent tools, IterHMMBuild and ProSeCDA. IterHMMBuild allows the iterative construction of Hidden Markov Model (HMM) profiles for conserved domains of selected protein sequences, while ProSeCDA scans a proteome of interest against an HMM profile database, and annotates identified proteins using user-defined rules. CusProSe was successfully used to identify, in fungal genomes, genes encoding key enzyme families involved in secondary metabolism, such as polyketide synthases (PKS), non-ribosomal peptide synthetases (NRPS), hybrid PKS-NRPS and dimethylallyl tryptophan synthases (DMATS), as well as to characterize distinct terpene synthase (TS) sub-families. The highly configurable characteristics of this application make it a generic tool, which allows the user to refine the function of predicted proteins and to extend detection to new enzyme families, and which may also be applied to biological systems other than fungi and to proteins other than those involved in secondary metabolism.
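
Rule-based annotation from domain composition can be illustrated with a small sketch. This does not reproduce ProSeCDA's rule syntax, which the abstract does not describe: the rule format (label, required domains, forbidden domains), the first-match policy, the domain abbreviations, and the protein names are all assumptions made for the example, and the domain hits are taken as already computed by an HMM scan.

```python
def annotate(domains_by_protein, rules):
    """Label each protein with the first rule whose conditions it satisfies.

    rules: list of (label, required_domains, forbidden_domains) tuples,
    checked in order; proteins matching no rule are left "unclassified".
    """
    annotations = {}
    for protein, domains in domains_by_protein.items():
        found = set(domains)
        annotations[protein] = "unclassified"
        for label, required, forbidden in rules:
            if required <= found and not (forbidden & found):
                annotations[protein] = label
                break
    return annotations

# Hypothetical rules: KS/AT stand for PKS core domains, A/C for NRPS core domains.
rules = [
    ("hybrid PKS-NRPS", {"KS", "AT", "A", "C"}, set()),
    ("PKS", {"KS", "AT"}, {"A", "C"}),
    ("NRPS", {"A", "C"}, {"KS"}),
]

# Hypothetical domain hits per protein, as an HMM profile scan might report them.
domain_hits = {
    "prot1": ["KS", "AT", "ACP"],
    "prot2": ["A", "PCP", "C"],
    "prot3": ["KS", "AT", "ACP", "C", "A", "PCP"],
    "prot4": ["ACP"],
}
labels = annotate(domain_hits, rules)
```

Putting the most specific rule first keeps the hybrid enzymes from being claimed by the broader PKS or NRPS rules.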


Subject(s)
Fungi , Molecular Sequence Annotation , Secondary Metabolism , Software , Amino Acid Sequence , Molecular Sequence Annotation/methods , Peptide Synthases/genetics , Polyketide Synthases/genetics , Secondary Metabolism/genetics , Fungi/enzymology , Fungi/genetics , Tryptophan Synthase/genetics , Conserved Sequence/genetics