Search | VHL Regional Portal

1.

Scoring alignments by embedding vector similarity.

Ashrafzadeh, Sepehr; Golding, G Brian; Ilie, Silvana; Ilie, Lucian.

Brief Bioinform ; 25(3)2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38695119

ABSTRACT

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.

Subject(s)

Algorithms , Computational Biology , Sequence Alignment , Sequence Alignment/methods , Computational Biology/methods , Software , Sequence Analysis, Protein/methods , Amino Acid Sequence , Proteins/chemistry , Proteins/genetics , Deep Learning , Databases, Protein
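
The E-score defined in the abstract above is the cosine similarity between two residue embedding vectors. A minimal sketch of that formula follows; the function name `e_score` and the toy three-dimensional vectors are illustrative assumptions, whereas real residue embeddings (for example from ProtT5) have many more dimensions and come from a trained model.

```python
import math


def e_score(u, v):
    """Cosine similarity between two residue embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors invented for illustration only (not real model output).
emb_a = [1.0, 2.0, 0.0]
emb_b = [2.0, 4.0, 0.0]   # points in the same direction as emb_a
emb_c = [-2.0, 1.0, 0.0]  # orthogonal to emb_a

print(e_score(emb_a, emb_b))  # close to 1: same direction
print(e_score(emb_a, emb_c))  # close to 0: orthogonal
```

Because the score depends only on the angle between the vectors, two occurrences of the same amino acid can score differently when their contextual embeddings differ, which is the context dependence the abstract contrasts with fixed matrices.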

2.

PanDepth, an ultrafast and efficient genomic tool for coverage calculation.

Yu, Huiyang; Shi, Chunmei; He, Weiming; Li, Feng; Ouyang, Bo.

Brief Bioinform ; 25(3)2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38701418

ABSTRACT

Coverage quantification is required in many sequencing datasets within the field of genomics research. However, most existing tools fail to provide comprehensive statistical results and exhibit limited performance gains from multithreading. Here, we present PanDepth, an ultra-fast and efficient tool for calculating coverage and depth from sequencing alignments. PanDepth outperforms other tools in computation time and memory efficiency for both BAM and CRAM-format alignment files from sequencing data, regardless of read length. It employs chromosome parallel computation and optimized data structures, resulting in ultrafast computation speeds and memory efficiency. It accepts sorted or unsorted BAM and CRAM-format alignment files as well as GTF, GFF and BED-formatted interval files or a specific window size. When provided with a reference genome sequence and the option to enable GC content calculation, PanDepth includes GC content statistics, enhancing the accuracy and reliability of copy number variation analysis. Overall, PanDepth is a powerful tool that accelerates scientific discovery in genomics research.

Subject(s)

Genomics , Software , Genomics/methods , Humans , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Base Composition , DNA Copy Number Variations , Computational Biology/methods , Algorithms , Sequence Alignment/methods
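
To make the quantities in the abstract above concrete, the following sketch computes per-base depth, breadth of coverage, and mean depth for one region from a list of alignment intervals. It is a plain-Python illustration under stated assumptions (0-based, half-open intervals; hypothetical function names), not PanDepth's implementation, and it does not read BAM or CRAM files.

```python
def depth_profile(region_length, alignments):
    """Per-base depth for 0-based, half-open (start, end) alignment intervals."""
    diff = [0] * (region_length + 1)
    for start, end in alignments:
        # Clip each interval to the region before recording it.
        start = max(start, 0)
        end = min(end, region_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    depth, running = [], 0
    for delta in diff[:region_length]:
        running += delta  # prefix sum turns interval boundaries into depth
        depth.append(running)
    return depth


def summarize(depth):
    """Fraction of positions covered at least once, and mean depth."""
    covered = sum(1 for d in depth if d > 0)
    return {"coverage": covered / len(depth), "mean_depth": sum(depth) / len(depth)}


# Three hypothetical read alignments on a 10 bp region.
reads = [(0, 4), (2, 6), (8, 10)]
profile = depth_profile(10, reads)
stats = summarize(profile)
print(profile)
print(stats)
```

The difference-array approach touches each alignment once and each position once, which is why per-region statistics of this kind can be computed independently, for example one chromosome per worker.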

3.

Classification and Identification of Non-canonical Base Pairs and Structural Motifs.

Sarrazin-Gendron, Roman; Waldispühl, Jérôme; Reinharz, Vladimir.

Methods Mol Biol ; 2726: 143-168, 2024.

Article in English | MEDLINE | ID: mdl-38780731

ABSTRACT

The 3D structures of many ribonucleic acid (RNA) loops are characterized by highly organized networks of non-canonical interactions. Multiple computational methods have been developed to annotate structures with those interactions or automatically identify recurrent interaction networks. By contrast, the reverse problem that aims to retrieve the geometry of a loop from its sequence or ensemble of interactions remains much less explored. In this chapter, we will describe how to retrieve and build families of conserved structural motifs using their underlying network of non-canonical interactions. Then, we will show how to assign sequence alignments to those families and use the software BayesPairing to build statistical models of structural motifs with their associated sequence alignments. From this model, we will apply BayesPairing to identify in new sequences regions where those loop geometries can occur.

Subject(s)

Base Pairing , Computational Biology , RNA , Software , Computational Biology/methods , RNA/chemistry , RNA/genetics , Nucleic Acid Conformation , Sequence Alignment/methods , Algorithms , Nucleotide Motifs , Bayes Theorem , Models, Molecular

4.

LocARNA 2.0: Versatile Simultaneous Alignment and Folding of RNAs.

Will, Sebastian.

Methods Mol Biol ; 2726: 235-254, 2024.

Article in English | MEDLINE | ID: mdl-38780734

ABSTRACT

Generating accurate alignments of non-coding RNA sequences is indispensable in the quest for understanding RNA function. Nevertheless, aligning RNAs remains a challenging computational task. In the twilight-zone of RNA sequences with low sequence similarity, sequence homologies and compatible, favorable (a priori unknown) structures can be inferred only in dependency of each other. Thus, simultaneous alignment and folding (SA&F) remains the gold-standard of comparative RNA analysis, even if this method is computationally highly demanding. This text introduces the recent release 2.0 of the software package LocARNA, focusing on its practical application. The package enables versatile, fast and accurate analysis of multiple RNAs. For this purpose, it implements SA&F algorithms in a specific, lightweight flavor that makes them routinely applicable in large scale. Its high performance is achieved by combining ensemble-based sparsification of the structure space and banding strategies. Probabilistic banding strongly improves the performance of LocARNA 2.0 even over previous releases, while simplifying its effective use. Enabling flexible application to various use cases, LocARNA provides tools to globally and locally compare, cluster, and multiply align RNAs based on optimization and probabilistic variants of SA&F, which optionally integrate prior knowledge, expressible by anchor and structure constraints.

Subject(s)

Algorithms , Computational Biology , RNA Folding , RNA , Software , RNA/genetics , RNA/chemistry , Computational Biology/methods , Nucleic Acid Conformation , Sequence Alignment/methods , Sequence Analysis, RNA/methods

5.

Evolutionary Structure Conservation and Covariance Scores.

Eggenhofer, Florian; Höner Zu Siederdissen, Christian.

Methods Mol Biol ; 2726: 255-284, 2024.

Article in English | MEDLINE | ID: mdl-38780735

ABSTRACT

Effective homology search for non-coding RNAs is frequently not possible via sequence similarity alone. Current methods leverage evolutionary information like structure conservation or covariance scores to identify homologs in organisms that are phylogenetically more distant. In this chapter, we introduce the theoretical background of evolutionary structure conservation and covariance score, and we show hands-on how current methods in the field are applied on example datasets.

Subject(s)

Computational Biology , Evolution, Molecular , Computational Biology/methods , Phylogeny , Algorithms , RNA, Untranslated/genetics , Conserved Sequence , Humans , Animals , Software , Sequence Alignment/methods

6.

The NCBI Comparative Genome Viewer (CGV) is an interactive visualization tool for the analysis of whole-genome eukaryotic alignments.

Rangwala, Sanjida H; Rudnev, Dmitry V; Ananiev, Victor V; Oh, Dong-Ha; Asztalos, Andrea; Benica, Barrett; Borodin, Evgeny A; Bouk, Nathan; Evgeniev, Vladislav I; Kodali, Vamsi K; Lotov, Vadim; Mozes, Eyal; Omelchenko, Marina V; Savkina, Sofya; Sukharnikov, Ekaterina; Virothaisakun, Joël; Murphy, Terence D; Pruitt, Kim D; Schneider, Valerie A.

PLoS Biol ; 22(5): e3002405, 2024 May.

Article in English | MEDLINE | ID: mdl-38713717

ABSTRACT

We report a new visualization tool for analysis of whole-genome assembly-assembly alignments, the Comparative Genome Viewer (CGV) (https://ncbi.nlm.nih.gov/genome/cgv/). CGV visualizes pairwise same-species and cross-species alignments provided by the National Center for Biotechnology Information (NCBI) using assembly alignment algorithms developed by us and others. Researchers can examine large structural differences spanning chromosomes, such as inversions or translocations. Users can also navigate to regions of interest, where they can detect and analyze smaller-scale deletions and rearrangements within specific chromosome or gene regions. RefSeq or user-provided gene annotation is displayed where available. CGV currently provides approximately 800 alignments from over 350 animal, plant, and fungal species. CGV and related NCBI viewers are undergoing active development to further meet the needs of the research community in comparative genome visualization.

Subject(s)

Genome , Software , Animals , Genome/genetics , Sequence Alignment/methods , Genomics/methods , Algorithms , United States , Humans , Eukaryota/genetics , Databases, Genetic , National Library of Medicine (U.S.) , Molecular Sequence Annotation/methods

7.

CAREx: context-aware read extension of paired-end sequencing data.

Kallenborn, Felix; Schmidt, Bertil.

BMC Bioinformatics ; 25(1): 186, 2024 May 10.

Article in English | MEDLINE | ID: mdl-38730374

ABSTRACT

BACKGROUND: Commonly used next generation sequencing machines typically produce large amounts of short reads of a few hundred base-pairs in length. However, many downstream applications would generally benefit from longer reads. RESULTS: We present CAREx, an algorithm for the generation of pseudo-long reads from paired-end short-read Illumina data based on the concept of repeatedly computing multiple-sequence-alignments to extend a read until its partner is found. Our performance evaluation on both simulated data and real data shows that CAREx is able to connect significantly more read pairs (up to 99% for simulated data) and to produce more error-free pseudo-long reads than previous approaches. When used prior to assembly it can achieve superior de novo assembly results. Furthermore, the GPU-accelerated version of CAREx exhibits the fastest execution times among all tested tools. CONCLUSION: CAREx is a new MSA-based algorithm and software for producing pseudo-long reads from paired-end short read data. It outperforms other state-of-the-art programs in terms of (i) percentage of connected read pairs, (ii) reduction of error rates of filled gaps, (iii) runtime, and (iv) downstream analysis using de novo assembly. CAREx is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CAREx.

Subject(s)

Algorithms , High-Throughput Nucleotide Sequencing , Software , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Humans , Sequence Alignment/methods

8.

Accelerating spliced alignment of long RNA sequencing reads using parallel maximal exact match retrieval.

Wang, Rongxing; Zhang, Yanju.

Comput Biol Med ; 175: 108542, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38714048

ABSTRACT

The genomics landscape has undergone a revolutionary transformation with the emergence of third-generation sequencing technologies. Fueled by the exponential surge in sequencing data, there is an urgent demand for accurate and rapid algorithms to effectively handle this burgeoning influx. Under such circumstances, we developed a parallelized, yet accuracy-lossless algorithm for maximal exact match (MEM) retrieval to strategically address the computational bottleneck of uLTRA, a leading spliced alignment algorithm known for its precision in handling long RNA sequencing (RNA-seq) reads. The design of the algorithm incorporates a multi-threaded strategy, enabling the concurrent processing of multiple reads. Additionally, we implemented the serialization of the index required for MEM retrieval to facilitate its reuse, resulting in accelerated startup for practical tasks. Extensive experiments demonstrate that our parallel algorithm achieves significant improvements in runtime, speedup, throughput, and memory usage. When applied to the largest human dataset, the algorithm achieves an impressive speedup of 10.78×, significantly improving throughput on a large scale. Moreover, the integration of the parallel MEM retrieval algorithm into the uLTRA pipeline introduces a dual-layered parallel capability, consistently yielding a speedup of 4.99× compared to the multi-process and single-threaded execution of uLTRA. The thorough analysis of experimental results underscores the adept utilization of parallel processing capabilities and its advantageous performance in handling large datasets. This study provides a showcase of parallelized strategies for MEM retrieval within the context of spliced alignment algorithm, effectively facilitating the process of RNA-seq data analysis. The code is available at https://github.com/RongxingWong/AcceleratingSplicedAlignment.

Subject(s)

Algorithms , Sequence Analysis, RNA , Humans , Sequence Analysis, RNA/methods , RNA Splicing , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , Software
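
The abstract above describes retrieving maximal exact matches (MEMs) for many reads concurrently. The sketch below shows what a MEM is and how per-read work can be handed to a thread pool. It is a brute-force illustration with hypothetical function names, not the authors' index-based algorithm, and it assumes short toy sequences.

```python
from concurrent.futures import ThreadPoolExecutor


def find_mems(read, ref, min_len):
    """Return (read_pos, ref_pos, length) for every maximal exact match."""
    mems = []
    for i in range(len(read)):
        for j in range(len(ref)):
            if read[i] != ref[j]:
                continue
            # Skip matches that could still be extended to the left.
            if i > 0 and j > 0 and read[i - 1] == ref[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(read) and j + length < len(ref)
                   and read[i + length] == ref[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


def find_mems_parallel(reads, ref, min_len, workers=4):
    """Process each read independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: find_mems(r, ref, min_len), reads))


reference = "ACGTACGT"
reads = ["CGTA", "TTTT"]
results = find_mems_parallel(reads, reference, min_len=3)
print(results)
```

Reads are independent of one another, so the per-read loop parallelizes without shared state. In standard CPython, threads do not speed up CPU-bound pure-Python code, so this shows only the structure of the approach rather than a performance gain.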

9.

Parsnp 2.0: scalable core-genome alignment for massive microbial datasets.

Kille, Bryce; Nute, Michael G; Huang, Victor; Kim, Eddie; Phillippy, Adam M; Treangen, Todd J.

Bioinformatics ; 40(5)2024 May 02.

Article in English | MEDLINE | ID: mdl-38724243

ABSTRACT

MOTIVATION: Since 2016, the number of microbial species with available reference genomes in NCBI has more than tripled. Multiple genome alignment, the process of identifying nucleotides across multiple genomes which share a common ancestor, is used as the input to numerous downstream comparative analysis methods. Parsnp is one of the few multiple genome alignment methods able to scale to the current era of genomic data; however, there has been no major release since its initial release in 2014. RESULTS: To address this gap, we developed Parsnp v2, which significantly improves on its original release. Parsnp v2 provides users with more control over executions of the program, allowing Parsnp to be better tailored for different use-cases. We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes which are then combined into a final alignment. The partitioning option can reduce memory usage by over 4× and reduce runtime by over 2×, all while maintaining a precise core-genome alignment. The partitioning workflow is also less susceptible to complications caused by assembly artifacts and minor variation, as alignment anchors only need to be conserved within their partition and not across the entire input set. We highlight the performance on datasets involving thousands of bacterial and viral genomes. AVAILABILITY AND IMPLEMENTATION: Parsnp v2 is available at https://github.com/marbl/parsnp.

Subject(s)

Genome, Bacterial , Sequence Alignment , Software , Sequence Alignment/methods , Genomics/methods , Algorithms
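
The partitioning workflow in the abstract above (split the input, align each partition in parallel, then combine) can be outlined as follows. Everything here is a hypothetical stand-in: `align_partition` is a placeholder that only records its members, and the merge step is a simple concatenation, so this sketch shows the control flow only and says nothing about how Parsnp aligns or merges.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("partition size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def align_partition(genomes):
    """Placeholder for a per-partition alignment (assumption, not Parsnp)."""
    return {"members": list(genomes)}


def run_partitioned(genomes, size, workers=2):
    parts = partition(genomes, size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(align_partition, parts))
    # Placeholder combine step: concatenate members in partition order.
    merged = [name for result in partial for name in result["members"]]
    return parts, merged


genomes = [f"genome_{i}" for i in range(5)]
parts, merged = run_partitioned(genomes, size=2)
print(parts)
print(merged)
```

Each worker only sees its own partition, which mirrors the abstract's point that anchors need to be conserved within a partition rather than across the whole input set.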

10.

AGO, a Framework for the Reconstruction of Ancestral Syntenies and Gene Orders.

Cribbie, Evan P; Doerr, Daniel; Chauve, Cedric.

Methods Mol Biol ; 2802: 247-265, 2024.

Article in English | MEDLINE | ID: mdl-38819563

ABSTRACT

Reconstructing ancestral gene orders from the genome data of extant species is an important problem in comparative and evolutionary genomics. In a phylogenomics setting that accounts for gene family evolution through gene duplication and gene loss, the reconstruction of ancestral gene orders involves several steps, including multiple sequence alignment, the inference of reconciled gene trees, and the inference of ancestral syntenies and gene adjacencies. For each of the steps of such a process, several methods can be used and implemented using a growing corpus of, often parameterized, tools; in practice, interfacing such tools into an ancestral gene order reconstruction pipeline is far from trivial. This chapter introduces AGO, a Python-based framework for creating ancestral gene order reconstruction pipelines in which different bioinformatics tools can be interfaced and parameterized. The authors illustrate the features of AGO by reconstructing ancestral gene orders for the X chromosome of three ancestral Anopheles species using three different pipelines. AGO is freely available at https://github.com/cchauve/AGO-pipeline.

Subject(s)

Evolution, Molecular , Gene Order , Genomics , Phylogeny , Software , Animals , Genomics/methods , Computational Biology/methods , Synteny/genetics , Anopheles/genetics , X Chromosome/genetics , Sequence Alignment/methods

11.

Comparative Genome Annotation.

Nachtweide, Stefanie; Romoth, Lars; Stanke, Mario.

Methods Mol Biol ; 2802: 165-187, 2024.

Article in English | MEDLINE | ID: mdl-38819560

ABSTRACT

Newly sequenced genomes are being added to the tree of life at an unprecedentedly fast pace. A large proportion of such new genomes are phylogenetically close to previously sequenced and annotated genomes. In other cases, whole clades of closely related species or strains ought to be annotated simultaneously. Often, in subsequent studies, differences between the closely related species or strains are the focus of research, while the shared gene structures prevail. We here review methods for comparative structural genome annotation. The reviewed methods include classical approaches such as the alignment of protein sequences or protein profiles against the genome and comparative gene prediction methods that exploit a genome alignment to annotate either a single target genome or all input genomes simultaneously. We discuss how the methods depend on the phylogenetic placement of genomes, give advice on the choice of methods, and examine the consistency between gene structure annotations in an example. Furthermore, we provide practical advice on genome annotation in general.

Subject(s)

Genomics , Molecular Sequence Annotation , Phylogeny , Molecular Sequence Annotation/methods , Genomics/methods , Computational Biology/methods , Genome/genetics , Sequence Alignment/methods , Software

12.

A hepatitis B virus (HBV) sequence variation graph improves alignment and sample-specific consensus sequence construction.

Duchen, Dylan; Clipman, Steven J; Vergara, Candelaria; Thio, Chloe L; Thomas, David L; Duggal, Priya; Wojcik, Genevieve L.

PLoS One ; 19(4): e0301069, 2024.

Article in English | MEDLINE | ID: mdl-38669259

ABSTRACT

Nearly 300 million individuals live with chronic hepatitis B virus (HBV) infection (CHB), for which no curative therapy is available. As viral diversity is associated with pathogenesis and immunological control of infection, improved methods to characterize this diversity could aid drug development efforts. Conventionally, viral sequencing data are mapped/aligned to a reference genome, and only the aligned sequences are retained for analysis. Thus, reference selection is critical, yet selecting the most representative reference a priori remains difficult. We investigate an alternative pangenome approach which can combine multiple reference sequences into a graph which can be used during alignment. Using simulated short-read sequencing data generated from publicly available HBV genomes and real sequencing data from an individual living with CHB, we demonstrate alignment to a phylogenetically representative 'genome graph' can improve alignment, avoid issues of reference ambiguity, and facilitate the construction of sample-specific consensus sequences more genetically similar to the individual's infection. Graph-based methods can, therefore, improve efforts to characterize the genetics of viral pathogens, including HBV, and have broader implications in host-pathogen research.

Subject(s)

Consensus Sequence , Genome, Viral , Hepatitis B virus , Hepatitis B virus/genetics , Humans , Consensus Sequence/genetics , Phylogeny , Sequence Alignment/methods , Genetic Variation , Hepatitis B, Chronic/virology , DNA, Viral/genetics , Sequence Analysis, DNA/methods

13.

A phylogenetic method linking nucleotide substitution rates to rates of continuous trait evolution.

Gemmell, Patrick; Sackton, Timothy B; Edwards, Scott V; Liu, Jun S.

PLoS Comput Biol ; 20(4): e1011995, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38656999

ABSTRACT

Genomes contain conserved non-coding sequences that perform important biological functions, such as gene regulation. We present a phylogenetic method, PhyloAcc-C, that associates nucleotide substitution rates with changes in a continuous trait of interest. The method takes as input a multiple sequence alignment of conserved elements, continuous trait data observed in extant species, and a background phylogeny and substitution process. Gibbs sampling is used to assign rate categories (background, conserved, accelerated) to lineages and explore whether the assigned rate categories are associated with increases or decreases in the rate of trait evolution. We test our method using simulations and then illustrate its application using mammalian body size and lifespan data previously analyzed with respect to protein coding genes. Like other studies, we find processes such as tumor suppression, telomere maintenance, and p53 regulation to be related to changes in longevity and body size. In addition, we also find that skeletal genes, and developmental processes, such as sprouting angiogenesis, are relevant.

Subject(s)

Evolution, Molecular , Models, Genetic , Phylogeny , Animals , Longevity/genetics , Humans , Computational Biology/methods , Computer Simulation , Body Size/genetics , Nucleotides/genetics , Sequence Alignment/methods

14.

Effect of tokenization on transformers for biological sequences.

Dotan, Edo; Jaschek, Gal; Pupko, Tal; Belinkov, Yonatan.

Bioinformatics ; 40(4)2024 Mar 29.

Article in English | MEDLINE | ID: mdl-38608190

ABSTRACT

MOTIVATION: Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English, and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families. RESULTS: We demonstrate that applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data. AVAILABILITY AND IMPLEMENTATION: Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.

Subject(s)

Algorithms , Computational Biology , Deep Learning , Natural Language Processing , Computational Biology/methods , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods
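
The abstract above contrasts the trivial tokenizer (one token per character) with tokenizers that use multi-character tokens and therefore shorten the input. The sketch below illustrates that effect on a toy protein string. The greedy longest-match tokenizer and the four-entry vocabulary are stand-ins invented for illustration; they are not the trained tokenizers evaluated in the study.

```python
def char_tokenize(seq):
    """Trivial tokenizer: every character is its own token."""
    return list(seq)


def greedy_tokenize(seq, vocab):
    """Greedy longest-match segmentation; falls back to single characters."""
    max_len = max((len(token) for token in vocab), default=1)
    tokens, i = [], 0
    while i < len(seq):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(seq) - i), 0, -1):
            piece = seq[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary; a real one would be learned from a sequence corpus.
vocab = {"MK", "KV", "LLA", "GG"}
seq = "MKVLLAGGA"

print(char_tokenize(seq))           # 9 single-character tokens
print(greedy_tokenize(seq, vocab))  # fewer, longer tokens
```

On this toy input the token count drops from nine to five, which is the kind of input-length reduction the abstract reports at much larger scale with learned vocabularies.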

15.

VirusPredictor: XGBoost-based software to predict virus-related sequences in human data.

Liu, Guangchen; Chen, Xun; Luan, Yihui; Li, Dawei.

Bioinformatics ; 40(4)2024 Mar 29.

Article in English | MEDLINE | ID: mdl-38597887

ABSTRACT

MOTIVATION: Discovering disease causative pathogens, particularly viruses without reference genomes, poses a technical challenge as they are often unidentifiable through sequence alignment. Machine learning prediction of patient high-throughput sequences unmappable to human and pathogen genomes may reveal sequences originating from uncharacterized viruses. Currently, there is a lack of software specifically designed for accurately predicting such viral sequences in human data. RESULTS: We developed a fast XGBoost method and software VirusPredictor leveraging an in-house viral genome database. Our two-step XGBoost models first classify each query sequence into one of three groups: infectious virus, endogenous retrovirus (ERV) or non-ERV human. The prediction accuracies increased as the sequences became longer, i.e. 0.76, 0.93, and 0.98 for 150-350 (Illumina short reads), 850-950 (Sanger sequencing data), and 2000-5000 bp sequences, respectively. Then, sequences predicted to be from infectious viruses are further classified into one of six virus taxonomic subgroups, and the accuracies increased from 0.92 to >0.98 when query sequences increased from 150-350 to >850 bp. The results suggest that Illumina short reads should be de novo assembled into contigs (e.g. ~1000 bp or longer) before prediction whenever possible. We applied VirusPredictor to multiple real genomic and metagenomic datasets and obtained high accuracies. VirusPredictor, a user-friendly open-source Python software, is useful for predicting the origins of patients' unmappable sequences. This study is the first to classify ERVs in infectious viral sequence prediction. This is also the first study combining virus sub-group predictions. AVAILABILITY AND IMPLEMENTATION: www.dllab.org/software/VirusPredictor.html.

Subject(s)

Genome, Viral , Software , Humans , Viruses/genetics , Sequence Analysis, DNA/methods , Sequence Alignment/methods , Machine Learning

16.

Improvements in viral gene annotation using large language models and soft alignments.

Harrigan, William L; Ferrell, Barbra D; Wommack, K Eric; Polson, Shawn W; Schreiber, Zachary D; Belcaid, Mahdi.

BMC Bioinformatics ; 25(1): 165, 2024 Apr 25.

Article in English | MEDLINE | ID: mdl-38664627

ABSTRACT

BACKGROUND: The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological challenge by annotating protein sequences based on embeddings. RESULTS: Central to our contribution is the soft alignment algorithm, drawing from traditional protein alignment but leveraging embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. This method not only surpasses pooled embedding-based models in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, our approach provides transparent, BLAST-like alignment visualizations, combining traditional biological research with AI advancements to elevate protein annotation through embedding-based analysis while ensuring interpretability. Tests using the Virus Orthologous Groups and ViralZone protein databases indicated that the novel soft alignment approach recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect. CONCLUSION: The embeddings approach shows the great potential of LLMs for enhancing protein sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference in molecular biology.

Subject(s)

Algorithms , Molecular Sequence Annotation , Sequence Alignment , Molecular Sequence Annotation/methods , Sequence Alignment/methods , Viral Proteins/genetics , Viral Proteins/chemistry , Genes, Viral , Databases, Protein , Computational Biology/methods , Amino Acid Sequence
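
The abstract above describes replacing a conventional scoring matrix with embedding similarity at the amino acid level. One way to picture this is a dynamic-programming alignment whose match score is the cosine similarity of two per-residue embeddings. The sketch below uses a Needleman-Wunsch-style global recurrence with a linear gap penalty, chosen here for brevity; the paper's soft alignment algorithm may differ, and the function names, gap value, and two-dimensional toy embeddings are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def global_align_score(emb_x, emb_y, gap=-0.5):
    """Best global alignment score with cosine similarity as the match score."""
    n, m = len(emb_x), len(emb_y)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + cosine(emb_x[i - 1], emb_y[j - 1]),  # align pair
                dp[i - 1][j] + gap,                                      # gap in y
                dp[i][j - 1] + gap,                                      # gap in x
            )
    return dp[n][m]


# Toy per-residue embeddings, invented for illustration only.
seq_x = [[1.0, 0.0], [0.0, 1.0]]
seq_y = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

print(global_align_score(seq_x, seq_x))
print(global_align_score(seq_x, seq_y))
```

Keeping the full table `dp` also allows a traceback that reports which residues were paired, which is what makes residue-level alignments easier to inspect than a single pooled-embedding distance.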

17.

Large-scale structure-informed multiple sequence alignment of proteins with SIMSApiper.

Crauwels, Charlotte; Heidig, Sophie-Luise; Díaz, Adrián; Vranken, Wim F.

Bioinformatics ; 40(5)2024 May 02.

Article in English | MEDLINE | ID: mdl-38648741

ABSTRACT

SUMMARY: SIMSApiper is a Nextflow pipeline that creates reliable, structure-informed MSAs of thousands of protein sequences faster than standard structure-based alignment methods. Structural information can be provided by the user or collected by the pipeline from online resources. Parallelization with sequence identity-based subsets can be activated to significantly speed up the alignment process. Finally, the number of gaps in the final alignment can be reduced by leveraging the position of conserved secondary structure elements. AVAILABILITY AND IMPLEMENTATION: The pipeline is implemented using Nextflow, Python3, and Bash. It is publicly available on github.com/Bio2Byte/simsapiper.

Subject(s)

Proteins , Sequence Alignment , Sequence Analysis, Protein , Software , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence , Computational Biology/methods , Databases, Protein

18.

KmerAperture: Retaining k-mer synteny for alignment-free extraction of core and accessory differences between bacterial genomes.

Moore, Matthew P; Laager, Mirjam; Ribeca, Paolo; Didelot, Xavier.

PLoS Genet ; 20(4): e1011184, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38683871

ABSTRACT

By decomposing genome sequences into k-mers, it is possible to estimate genome differences without alignment. Techniques such as k-mer minimisers, for example MinHash, have been developed and are often accurate approximations of distances based on full k-mer sets. These and other alignment-free methods avoid the large temporal and computational expense of alignment. However, these k-mer set comparisons are not entirely accurate within-species and can be completely inaccurate within-lineage. This is due, in part, to their inability to distinguish core polymorphism from accessory differences. Here we present a new approach, KmerAperture, which uses information on the k-mer relative genomic positions to determine the type of polymorphism causing differences in k-mer presence and absence between pairs of genomes. Single SNPs are expected to result in k unique contiguous k-mers per genome. On the other hand, contiguous series > k may be caused by accessory differences of length S-k+1, when the start and end of the sequence are contiguous with homologous sequence. Alternatively, they may be caused by multiple SNPs within k bp from each other and KmerAperture can determine whether that is the case. To demonstrate use cases KmerAperture was benchmarked using datasets including a very low diversity simulated population with accessory content independent from the number of SNPs, a simulated population where SNPs are spatially dense, a moderately diverse real cluster of genomes (Escherichia coli ST1193) with a large accessory genome and a low diversity real genome cluster (Salmonella Typhimurium ST34). We show that KmerAperture can accurately distinguish both core and accessory sequence diversity without alignment, outperforming other k-mer based tools.

Subject(s)

Genome, Bacterial , Polymorphism, Single Nucleotide , Polymorphism, Single Nucleotide/genetics , Synteny , Genomics/methods , Algorithms , Escherichia coli/genetics , Software , Sequence Alignment/methods , Phylogeny

19.

RecGraph: recombination-aware alignment of sequences to variation graphs.

Avila Cartes, Jorge; Bonizzoni, Paola; Ciccolella, Simone; Della Vedova, Gianluca; Denti, Luca; Didelot, Xavier; Monti, Davide Cesare; Pirola, Yuri.

Bioinformatics ; 40(5)2024 May 02.

Article in English | MEDLINE | ID: mdl-38676570

ABSTRACT

MOTIVATION: Bacterial genomes present more variability than human genomes, which requires important adjustments in computational tools that are developed for human data. In particular, bacteria exhibit a mosaic structure due to homologous recombinations, but this fact is not sufficiently captured by standard read mappers that align against linear reference genomes. The recent introduction of pangenomics provides some insights in that context, as a pangenome graph can represent the variability within a species. However, the concept of sequence-to-graph alignment that captures the presence of recombinations has not been previously investigated. RESULTS: In this paper, we present the extension of the notion of sequence-to-graph alignment to a variation graph that incorporates recombinations, so that the latter are explicitly represented and evaluated in an alignment. Moreover, we present a dynamic programming approach for the special case where there is at most one recombination; we implement this case as RecGraph. From a modelling point of view, a recombination corresponds to identifying a new path of the variation graph, where the new path is composed of two halves, each extracted from an original path, possibly joined by a new arc. Our experiments show that RecGraph accurately aligns simulated recombinant bacterial sequences that have at most one recombination, providing evidence for the presence of recombination events. AVAILABILITY AND IMPLEMENTATION: Our implementation is open source and available at https://github.com/AlgoLab/RecGraph.
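The idea of a recombinant path built from a prefix of one path and a suffix of another can be illustrated with a deliberately reduced toy model. This is not RecGraph's dynamic programming: it assumes every path is a plain string of the same length as the query, scores by mismatches only (no gaps, no graph), and uses a hypothetical `recomb_penalty` parameter to charge for switching paths.

```python
from itertools import permutations


def best_single_recombination(query, paths, recomb_penalty=1):
    """Toy search for the best alignment with at most one recombination.

    Returns (cost, first_path, second_path, breakpoint): query[:breakpoint]
    is compared with first_path and query[breakpoint:] with second_path.
    second_path is None when no recombination is used.
    """
    n = len(query)
    prefix = {}  # prefix[name][b] = mismatches between query[:b] and path[:b]
    for name, seq in paths.items():
        if len(seq) != n:
            raise ValueError("toy model: every path must have the query's length")
        counts = [0]
        for q, s in zip(query, seq):
            counts.append(counts[-1] + (q != s))
        prefix[name] = counts

    best = None
    for name in paths:  # candidates without recombination
        candidate = (prefix[name][n], name, None, n)
        if best is None or candidate[0] < best[0]:
            best = candidate
    for first, second in permutations(paths, 2):  # one switch between two paths
        for bp in range(1, n):
            suffix_cost = prefix[second][n] - prefix[second][bp]
            cost = prefix[first][bp] + suffix_cost + recomb_penalty
            if cost < best[0]:
                best = (cost, first, second, bp)
    return best


PATHS = {"p1": "AAAAAAAAAA", "p2": "CCCCCCCCCC"}  # toy paths (assumption)
print(best_single_recombination("AAAACCCCCC", PATHS))
```

In this toy case the mosaic query is explained best by switching from `p1` to `p2` at position 4, at the cost of the penalty alone.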

Subject(s)

Algorithms , Genome, Bacterial , Recombination, Genetic , Sequence Alignment , Sequence Alignment/methods , Humans , Software , Sequence Analysis, DNA/methods , Genomics/methods

20.

Characteristic Attribute Organization System (CAOS): Identifying Classification Rules Based on Phylogenetically Organized Sequences.

Ramanan, Vivek; Sarkar, Indra Neil.

Methods Mol Biol ; 2744: 335-345, 2024.

Article in English | MEDLINE | ID: mdl-38683329

ABSTRACT

Classification is a technique that labels subjects based on the characteristics of the data. It often involves using previously learned information from preexisting data, drawn from the same distribution or data type, to make an informed decision for each given subject. The method presented here, the Characteristic Attribute Organization System (CAOS), uses a character-based approach to molecular sequence classification. Using a set of aligned sequences (either nucleotide or amino acid) and a maximum parsimony tree, CAOS generates classification rules for the sequences based on tree structure and provides more interpretable results than other classification or sequence analysis protocols. The code is accessible at https://github.com/JuliaHealth/CAOS.jl/.
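A character-based rule of the kind described above can be sketched for a single split of a tree into two groups of aligned sequences. This is an illustration only and not the CAOS.jl API: the function names, the definition of a diagnostic character used here (a state shared by every sequence in one group and absent from the other), and the toy alignment are all assumptions.

```python
def diagnostic_characters(group, other_group):
    """Return (position, state) pairs fixed in `group` and absent from `other_group`."""
    rules = []
    for pos in range(len(group[0])):
        states = {seq[pos] for seq in group}
        other_states = {seq[pos] for seq in other_group}
        if len(states) == 1 and not (states & other_states):
            rules.append((pos, next(iter(states))))
    return rules


def classify(seq, rules_a, rules_b):
    """Assign `seq` to the group whose rules it matches more often (None on a tie)."""
    score_a = sum(seq[pos] == state for pos, state in rules_a)
    score_b = sum(seq[pos] == state for pos, state in rules_b)
    if score_a == score_b:
        return None
    return "A" if score_a > score_b else "B"


GROUP_A = ["ACGTA", "ACGTC"]  # toy aligned sequences (assumption)
GROUP_B = ["ATGAA", "ATGAC"]
RULES_A = diagnostic_characters(GROUP_A, GROUP_B)
RULES_B = diagnostic_characters(GROUP_B, GROUP_A)
print(RULES_A, RULES_B)
print(classify("ACGTG", RULES_A, RULES_B))
```

Because each rule is a position and a state, the reason for a classification can be read directly from the matching rules, which is the sense in which such output is interpretable.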

Subject(s)

Phylogeny , Software , Computational Biology/methods , Algorithms , Sequence Alignment/methods
