Search | VHL Regional Portal

Deep embedding and alignment of protein sequences.

Llinares-López, Felipe; Berthet, Quentin; Blondel, Mathieu; Teboul, Olivier; Vert, Jean-Philippe.

Nat Methods ; 20(1): 104-111, 2023 01.

Article in English | MEDLINE | ID: mdl-36522501

ABSTRACT

Protein sequence alignment is a key component of most bioinformatics pipelines to study the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here we leverage recent advances in deep learning for language modeling and differentiable programming to propose DEDAL (deep embedding and differentiable alignment), a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. Once trained, we show that DEDAL improves by up to two- or threefold the alignment correctness over existing methods on remote homologs and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks relying on sequence alignment in structural and functional genomics.

Subject(s)

Algorithms , Proteins , Amino Acid Sequence , Proteins/genetics , Proteins/chemistry , Sequence Alignment , Genomics

DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer.

Baid, Gunjan; Cook, Daniel E; Shafin, Kishwar; Yun, Taedong; Llinares-López, Felipe; Berthet, Quentin; Belyaeva, Anastasiya; Töpfer, Armin; Wenger, Aaron M; Rowell, William J; Yang, Howard; Kolesnikov, Alexey; Ammar, Waleed; Vert, Jean-Philippe; Vaswani, Ashish; McLean, Cory Y; Nattestad, Maria; Chang, Pi-Chuan; Carroll, Andrew.

Nat Biotechnol ; 41(2): 232-238, 2023 02.

Article in English | MEDLINE | ID: mdl-36050551

ABSTRACT

Circular consensus sequencing with Pacific Biosciences (PacBio) technology generates long (10-25 kilobases), accurate 'HiFi' reads by combining serial observations of a DNA molecule into a consensus sequence. The standard approach to consensus generation, pbccs, uses a hidden Markov model. We introduce DeepConsensus, which uses an alignment-based loss to train a gap-aware transformer-encoder for sequence correction. Compared to pbccs, DeepConsensus reduces read errors by 42%. This increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27% and at Q40 by 90%. With two SMRT Cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9 megabases (Mb) to 17.2 Mb), increase gene completeness (94% to 97%), reduce the false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (Q43 to Q45) and reduce variant-calling errors by 24%. DeepConsensus models could be trained to the general problem of analyzing the alignment of other types of sequences, such as unique molecular identifiers or genome assemblies.

Subject(s)

High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA

CASMAP: detection of statistically significant combinations of SNPs in association mapping.

Llinares-López, Felipe; Papaxanthos, Laetitia; Roqueiro, Damian; Bodenham, Dean; Borgwardt, Karsten.

Bioinformatics ; 35(15): 2680-2682, 2019 08 01.

Article in English | MEDLINE | ID: mdl-30541062

ABSTRACT

SUMMARY: Combinatorial association mapping aims to assess the statistical association of higher-order interactions of genetic markers with a phenotype of interest. This article presents combinatorial association mapping (CASMAP), a software package that leverages recent advances in significant pattern mining to overcome the statistical and computational challenges that have hindered combinatorial association mapping. CASMAP can be used to perform region-based association studies and to detect higher-order epistatic interactions of genetic variants. Most importantly, unlike other existing significant pattern mining-based tools, CASMAP allows for the correction of categorical covariates such as age or gender, making it suitable for genome-wide association studies. AVAILABILITY AND IMPLEMENTATION: The R and Python packages can be downloaded from our GitHub repository http://github.com/BorgwardtLab/CASMAP. The R package is also available on CRAN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Polymorphism, Single Nucleotide , Genome-Wide Association Study , Phenotype , Software

graphkernels: R and Python packages for graph comparison.

Sugiyama, Mahito; Ghisu, M Elisabetta; Llinares-López, Felipe; Borgwardt, Karsten.

Bioinformatics ; 34(3): 530-532, 2018 02 01.

Article in English | MEDLINE | ID: mdl-29028902

ABSTRACT

Summary: Measuring the similarity of graphs is a fundamental step in the analysis of graph-structured data, which is omnipresent in computational biology. Graph kernels have been proposed as a powerful and efficient approach to this problem of graph comparison. Here we provide graphkernels, the first R and Python graph kernel libraries including baseline kernels such as label histogram based kernels, classic graph kernels such as random walk based kernels, and the state-of-the-art Weisfeiler-Lehman graph kernel. The core of all graph kernels is implemented in C++ for efficiency. Using the kernel matrices computed by the package, we can easily perform tasks such as classification, regression and clustering on graph-structured samples. Availability and implementation: The R and Python packages including source code are available at https://CRAN.R-project.org/package=graphkernels and https://pypi.python.org/pypi/graphkernels. Contact: mahito@nii.ac.jp or elisabetta.ghisu@bsse.ethz.ch. Supplementary information: Supplementary data are available online at Bioinformatics.

Subject(s)

Computational Biology/methods , Software

Genome-wide genetic heterogeneity discovery with categorical covariates.

Llinares-López, Felipe; Papaxanthos, Laetitia; Bodenham, Dean; Roqueiro, Damian; Borgwardt, Karsten.

Bioinformatics ; 33(12): 1820-1828, 2017 Jun 15.

Article in English | MEDLINE | ID: mdl-28200033

ABSTRACT

MOTIVATION: Genetic heterogeneity is the phenomenon that distinct genetic variants may give rise to the same phenotype. The recently introduced algorithm Fast Automatic Interval Search (FAIS) enables the genome-wide search of candidate regions for genetic heterogeneity in the form of any contiguous sequence of variants, and achieves high computational efficiency and statistical power. Although FAIS can test all possible genomic regions for association with a phenotype, a key limitation is its inability to correct for confounders such as gender or population structure, which may lead to numerous false-positive associations. RESULTS: We propose FastCMH, a method that overcomes this problem by properly accounting for categorical confounders, while still retaining statistical power and computational efficiency. Experiments comparing FastCMH with FAIS and multiple kinds of burden tests on simulated data, as well as on human and Arabidopsis samples, demonstrate that FastCMH can drastically reduce genomic inflation and discover associations that are missed by standard burden tests. AVAILABILITY AND IMPLEMENTATION: An R package fastcmh is available on CRAN and the source code can be found at: https://www.bsse.ethz.ch/mlcb/research/bioinformatics-and-computational-biology/fastcmh.html. CONTACT: felipe.llinares@bsse.ethz.ch. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Genetic Heterogeneity , Genomics/methods , Software , Algorithms , Arabidopsis/genetics , Female , Genetics, Population/methods , Humans , Male

Genome-wide detection of intervals of genetic heterogeneity associated with complex traits.

Llinares-López, Felipe; Grimm, Dominik G; Bodenham, Dean A; Gieraths, Udo; Sugiyama, Mahito; Rowan, Beth; Borgwardt, Karsten.

Bioinformatics ; 31(12): i240-9, 2015 Jun 15.

Article in English | MEDLINE | ID: mdl-26072488

ABSTRACT

MOTIVATION: Genetic heterogeneity, the fact that several sequence variants give rise to the same phenotype, is a phenomenon that is of the utmost interest in the analysis of complex phenotypes. Current approaches for finding regions in the genome that exhibit genetic heterogeneity suffer from at least one of two shortcomings: (i) they require the definition of an exact interval in the genome that is to be tested for genetic heterogeneity, potentially missing intervals of high relevance, or (ii) they suffer from an enormous multiple hypothesis testing problem due to the large number of potential candidate intervals being tested, which results in either many false positives or a lack of power to detect true intervals. RESULTS: Here, we present an approach that overcomes both problems: it allows one to automatically find all contiguous sequences of single nucleotide polymorphisms in the genome that are jointly associated with the phenotype. It also solves both the inherent computational efficiency problem and the statistical problem of multiple hypothesis testing, which are both caused by the huge number of candidate intervals. We demonstrate on Arabidopsis thaliana genome-wide association study data that our approach can discover regions that exhibit genetic heterogeneity and would be missed by single-locus mapping. CONCLUSIONS: Our novel approach can contribute to the genome-wide discovery of intervals that are involved in the genetic heterogeneity underlying complex phenotypes. AVAILABILITY AND IMPLEMENTATION: The code can be obtained at: http://www.bsse.ethz.ch/mlcb/research/bioinformatics-and-computational-biology/sis.html.

Subject(s)

Genetic Heterogeneity , Genome-Wide Association Study/methods , Polymorphism, Single Nucleotide , Algorithms , Arabidopsis/genetics , Phenotype
