1.
Bioinformatics ; 37(11): 1506-1514, 2021 Jul 12.
Article in English | MEDLINE | ID: mdl-30726875

ABSTRACT

MOTIVATION: Most evolutionary analyses are based on pre-estimated multiple sequence alignments. Wong et al. established the existence of an uncertainty induced by multiple sequence alignment when reconstructing phylogenies. They were able to show that in many cases different aligners produce different phylogenies, with no simple objective criterion sufficient to distinguish among these alternatives. RESULTS: We demonstrate that incorporating MSA-induced uncertainty into bootstrap sampling can significantly increase the correlation between clade correctness and its corresponding bootstrap value. Our procedure involves concatenating several alternative multiple sequence alignments of the same sequences, produced using different commonly used aligners. We then draw bootstrap replicates while favoring columns from the aligners whose output is most unique among those concatenated. We named this concatenation and bootstrapping method Weighted Partial Super Bootstrap (wpSBOOT). We show on three simulated datasets of 16, 32 and 64 tips that our method improves the predictive power of bootstrap values. As a benchmark, we also used an empirical collection of 853 one-to-one orthologous genes from seven yeast species and found that wpSBOOT significantly improves the capacity to discriminate between topologically correct and incorrect trees. Bootstrap values of wpSBOOT are comparable to similar readouts estimated using a single method. However, for trees reduced at 50% and 95% bootstrap thresholds, wpSBOOT yields the lowest Type I error (fewer false positives). AVAILABILITY AND IMPLEMENTATION: The automated generation of replicates has been implemented in the T-Coffee package, which is available as open-source freeware from www.tcoffee.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
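The concatenate-then-resample idea in this abstract can be illustrated with a minimal sketch. It is not the T-Coffee implementation: the abstract does not state how aligner uniqueness is measured, so the weighting below (one plus the share of an aligner's columns that no other aligner produced) is an assumption, as are the function names. Columns are also compared by their characters only, which a real implementation would replace with residue indices.

```python
import random

def columns(alignment):
    """Return the columns of an alignment given as a list of equal-length rows."""
    return [tuple(row[i] for row in alignment) for i in range(len(alignment[0]))]

def uniqueness_weights(alignments):
    """ASSUMED weighting: 1 + share of an aligner's columns found in no other aligner."""
    col_sets = {name: set(columns(aln)) for name, aln in alignments.items()}
    weights = {}
    for name, cols in col_sets.items():
        others = set().union(*(c for n, c in col_sets.items() if n != name))
        weights[name] = 1.0 + len(cols - others) / len(cols)
    return weights

def weighted_bootstrap(alignments, n_columns, seed=0):
    """Draw one replicate from the concatenated columns, favoring unique aligners."""
    rng = random.Random(seed)
    weights = uniqueness_weights(alignments)
    pool, pool_weights = [], []
    for name, aln in alignments.items():
        for col in columns(aln):
            pool.append(col)
            pool_weights.append(weights[name])
    sampled = rng.choices(pool, weights=pool_weights, k=n_columns)
    # Turn the sampled columns back into one row per sequence.
    return ["".join(col[r] for col in sampled) for r in range(len(sampled[0]))]

# Two hypothetical aligners applied to the same two sequences.
toy = {"aligner_a": ["AC-", "ACG"], "aligner_b": ["A-C", "ACG"]}
replicate = weighted_bootstrap(toy, n_columns=5)
```

Each replicate would then be passed to a tree-building program, and clade support counted across replicates as in a regular bootstrap.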

2.
Methods Mol Biol ; 2231: 89-97, 2021.
Article in English | MEDLINE | ID: mdl-33289888

ABSTRACT

Many fields of biology rely on the inference of accurate multiple sequence alignments (MSA) of biological sequences. Unfortunately, the problem of assembling an MSA is NP-complete, which limits computation to approximate solutions based on heuristics. The progressive algorithm is one of the most popular frameworks for the computation of MSAs. It involves pre-clustering the sequences and aligning them starting with the most similar ones. The scalability of this framework is limited, especially with respect to accuracy. We present here an alternative approach named the regressive algorithm. In this framework, sequences are first clustered and then aligned starting with the most distantly related ones. This approach has been shown to greatly improve accuracy during scale-up, especially on datasets featuring 10,000 sequences or more. Another benefit is the possibility to integrate third-party clustering methods and third-party MSA aligners. The regressive algorithm has been tested on up to 1.5 million sequences, and its implementation is available in the T-Coffee package.


Subject(s)
Computational Biology/methods , Sequence Alignment/methods , Software , Algorithms , Cluster Analysis , Computational Biology/instrumentation , Sequence Alignment/instrumentation
4.
Nat Biotechnol ; 37(12): 1466-1470, 2019 Dec.
Article in English | MEDLINE | ID: mdl-31792410

ABSTRACT

Multiple sequence alignments (MSAs) are used for structural [1,2] and evolutionary predictions [1,2], but the complexity of aligning large datasets requires the use of approximate solutions [3], including the progressive algorithm [4]. Progressive MSA methods start by aligning the most similar sequences and subsequently incorporate the remaining sequences, from leaf to root, based on a guide tree. Their accuracy declines substantially as the number of sequences is scaled up [5]. We introduce a regressive algorithm that enables MSA of up to 1.4 million sequences on a standard workstation and substantially improves accuracy on datasets larger than 10,000 sequences. Our regressive algorithm works the other way around from the progressive algorithm and begins by aligning the most dissimilar sequences. It uses an efficient divide-and-conquer strategy to run third-party alignment methods in linear time, regardless of their original complexity. Our approach will enable analyses of extremely large genomic datasets such as the recently announced Earth BioGenome Project, which comprises 1.5 million eukaryotic genomes [6].
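The difference in direction between the two frameworks can be shown on a toy guide tree. The sketch below only lists the order in which groups of sequences would be visited: leaf to root for a progressive method, root to leaf for a regressive one. It performs no alignment, and it omits how the published method chooses sequences within each group and calls third-party aligners; the tree encoding and function names are assumptions made for illustration.

```python
def leaves(tree):
    """Leaf names under a node; a leaf is a string, an internal node a 2-tuple."""
    return [tree] if isinstance(tree, str) else [x for child in tree for x in leaves(child)]

def progressive_order(tree):
    """Post-order: the most similar sequences (deepest nodes) are handled first."""
    if isinstance(tree, str):
        return []
    left, right = tree
    return progressive_order(left) + progressive_order(right) + [leaves(tree)]

def regressive_order(tree):
    """Level-order from the root: the most dissimilar groups are handled first."""
    order, queue = [], [tree]
    while queue:
        node = queue.pop(0)
        if isinstance(node, str):
            continue
        order.append(leaves(node))
        queue.extend(node)
    return order

guide_tree = (("A", "B"), ("C", ("D", "E")))
print(progressive_order(guide_tree))
print(regressive_order(guide_tree))
```

In the progressive listing the root group comes last, while in the regressive listing it comes first, which is the reversal the abstract describes.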


Subject(s)
Algorithms , Sequence Alignment/methods , Databases, Genetic , Eukaryota/genetics , Genomics/methods , Regression Analysis
5.
Methods Mol Biol ; 1910: 723-745, 2019.
Article in English | MEDLINE | ID: mdl-31278683

ABSTRACT

Biological, clinical, and pharmacological research now often involves analyses of genomes, transcriptomes, proteomes, and interactomes, within and between individuals and across species. Due to large volumes, the analysis and integration of data generated by such high-throughput technologies have become computationally intensive, and analysis can no longer happen on a typical desktop computer. In this chapter we show how to describe and execute the same analysis using a number of workflow systems and how these follow different approaches to tackle execution and reproducibility issues. We show how any researcher can create a reusable and reproducible bioinformatics pipeline that can be deployed and run anywhere. We show how to create a scalable, reusable, and shareable workflow using four different workflow engines, each of which can be run in parallel: the Common Workflow Language (CWL), Guix Workflow Language (GWL), Snakemake, and Nextflow. We show how to bundle a number of tools used in evolutionary biology by using Debian, GNU Guix, and Bioconda software distributions, along with the use of container systems, such as Docker, GNU Guix, and Singularity. Together these distributions represent the overall majority of software packages relevant for biology, including PAML, Muscle, MAFFT, MrBayes, and BLAST. By bundling software in lightweight containers, they can be deployed on a desktop, in the cloud, and, increasingly, on compute clusters. By bundling software through these public software distributions, and by creating reproducible and shareable pipelines using these workflow engines, bioinformaticians not only spend less time reinventing the wheel but also get closer to the ideal of making science reproducible. The examples in this chapter allow a quick comparison of different solutions.


Subject(s)
Computational Biology , Genomics , Big Data , Biological Evolution , Cloud Computing , Computational Biology/methods , Data Analysis , Genomics/methods , Humans , Reproducibility of Results , Software , Workflow
6.
Syst Biol ; 67(6): 997-1009, 2018 Nov 01.
Article in English | MEDLINE | ID: mdl-30295908

ABSTRACT

Phylogenetic reconstructions are essential in genomics data analyses and depend on accurate multiple sequence alignment (MSA) models. We show that all currently available large-scale progressive multiple alignment methods are numerically unstable when dealing with amino-acid sequences. They produce significantly different output when the sequence input order is changed. We used the HOMFAM protein sequences dataset to show that on datasets larger than 100 sequences, this instability affects on average 21.5% of the aligned residues. The resulting Maximum Likelihood (ML) trees estimated from these MSAs are equally unstable, with over 38% of the branches being sensitive to the sequence input order. We established that about two-thirds of this uncertainty stems from the unordered nature of child nodes within the guide trees used to estimate MSAs. To quantify this uncertainty we developed unistrap, a novel approach that estimates the combined effect of alignment uncertainty and site sampling on phylogenetic tree branch supports. Compared with the regular bootstrap procedure, unistrap provides branch support estimates that take into account a larger fraction of the parameters impacting tree instability when processing datasets containing a large number of sequences.
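One simple way to quantify how much two alignments of the same sequences disagree, for instance after shuffling the input order, is to compare their sets of aligned residue pairs. The sketch below is an assumed illustrative measure, not the metric used in the paper and not the unistrap procedure; function names and the toy data are hypothetical.

```python
from itertools import combinations

def aligned_pairs(alignment):
    """alignment: dict of name -> gapped row. Returns the set of aligned residue pairs."""
    names = sorted(alignment)
    length = len(alignment[names[0]])
    position = {name: 0 for name in names}  # next ungapped residue index per sequence
    pairs = set()
    for i in range(length):
        present = []
        for name in names:
            if alignment[name][i] != "-":
                present.append((name, position[name]))
                position[name] += 1
        # Every two residues sharing a column form one aligned pair.
        pairs.update(combinations(present, 2))
    return pairs

def pair_instability(aln_a, aln_b):
    """Share of aligned residue pairs that the two alignments do not have in common."""
    pairs_a, pairs_b = aligned_pairs(aln_a), aligned_pairs(aln_b)
    union = pairs_a | pairs_b
    return 1 - len(pairs_a & pairs_b) / len(union) if union else 0.0

run_1 = {"s1": "AC-G", "s2": "ACTG"}
run_2 = {"s1": "A-CG", "s2": "ACTG"}  # same sequences, different gap placement
print(pair_instability(run_1, run_2))
```

A value of 0 means the two runs align exactly the same residues; larger values mean the alignment depends more strongly on the run.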


Subject(s)
Classification/methods , Models, Genetic , Phylogeny , Proteins/genetics , Proteins/chemistry , Sequence Alignment , Software , Uncertainty
8.
PeerJ ; 3: e1273, 2015.
Article in English | MEDLINE | ID: mdl-26421241

ABSTRACT

Genomic pipelines consist of several pieces of third party software and, because of their experimental nature, frequent changes and updates are commonly necessary thus raising serious deployment and reproducibility issues. Docker containers are emerging as a possible solution for many of these problems, as they allow the packaging of pipelines in an isolated and self-contained manner. This makes it easy to distribute and execute pipelines in a portable manner across a wide range of computing platforms. Thus, the question that arises is to what extent the use of Docker containers might affect the performance of these pipelines. Here we address this question and conclude that Docker containers have only a minor impact on the performance of common genomic pipelines, which is negligible when the executed jobs are long in terms of computational time.

9.
Nucleic Acids Res ; 43(W1): W3-6, 2015 Jul 01.
Article in English | MEDLINE | ID: mdl-25855806

ABSTRACT

This article introduces the Transitive Consistency Score (TCS) web server, a service making it possible to estimate the local reliability of protein multiple sequence alignments (MSAs) using the TCS index. The evaluation can be used to identify the aligned positions most likely to contain structurally analogous residues and also most likely to support an accurate phylogenetic reconstruction. The TCS scoring scheme has been shown to be an accurate predictor of structural alignment correctness among commonly used methods. It has also been shown to outperform common filtering schemes like Gblocks or trimAl when doing MSA post-processing prior to phylogenetic tree reconstruction. The web server is available from http://tcoffee.crg.cat/tcs.


Subject(s)
Phylogeny , Sequence Alignment/methods , Sequence Analysis, Protein , Software , Algorithms , Internet
10.
Nucleic Acids Res ; 42(Web Server issue): W356-60, 2014 Jul.
Article in English | MEDLINE | ID: mdl-24972831

ABSTRACT

This article introduces the SARA-Coffee web server, a service allowing the online computation of 3D-structure-based multiple RNA sequence alignments. The server makes it possible to combine sequences with and without known 3D structures. Given a set of sequences, SARA-Coffee outputs a multiple sequence alignment along with a reliability index for every sequence, column and aligned residue. SARA-Coffee combines SARA, a pairwise structural RNA aligner, with the R-Coffee multiple RNA aligner in a way that has been shown to improve alignment accuracy over most sequence aligners when enough structural data is available. The server can be accessed from http://tcoffee.crg.cat/apps/tcoffee/do:saracoffee.


Subject(s)
RNA/chemistry , Sequence Alignment/methods , Sequence Analysis, RNA/methods , Software , Algorithms , Internet , Nucleic Acid Conformation
11.
Mol Biol Evol ; 31(6): 1625-37, 2014 Jun.
Article in English | MEDLINE | ID: mdl-24694831

ABSTRACT

Multiple sequence alignment (MSA) is a key modeling procedure when analyzing biological sequences. Homology and evolutionary modeling are the most common applications of MSAs. Both are known to be sensitive to the underlying MSA accuracy. In this work, we show how this problem can be partly overcome using the transitive consistency score (TCS), an extended version of the T-Coffee scoring scheme. Using this local evaluation function, we show that one can identify the most reliable portions of an MSA, as judged from BAliBASE and PREFAB structure-based reference alignments. We also show how this measure can be used to improve phylogenetic tree reconstruction using both an established simulated data set and a novel empirical yeast data set. For this purpose, we describe a novel lossless alternative to site filtering that involves overweighting the trustworthy columns. Our approach relies on the T-Coffee framework; it uses libraries of pairwise alignments to evaluate any third party MSA. Pairwise projections can be produced using fast or slow methods, thus allowing a trade-off between speed and accuracy. We compared TCS with Heads-or-Tails, GUIDANCE, Gblocks, and trimAl and found it to lead to significantly better estimates of structural accuracy and more accurate phylogenetic trees. The software is available from www.tcoffee.org/Projects/tcs.
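The lossless alternative to site filtering mentioned in this abstract can be sketched as column replication: rather than deleting unreliable columns, every column is kept and the trustworthy ones are repeated. The per-column integer scores, the cap of nine copies, and the rule of keeping at least one copy of every column are assumptions made for this illustration; the abstract does not specify them.

```python
def overweight_columns(alignment, column_scores, max_copies=9):
    """Repeat each column according to its reliability score (hypothetical 0..9 scale).

    alignment: list of equal-length gapped rows.
    column_scores: one integer score per column.
    """
    if any(len(row) != len(column_scores) for row in alignment):
        raise ValueError("need exactly one score per alignment column")
    out = [[] for _ in alignment]
    for i, score in enumerate(column_scores):
        # ASSUMPTION: at least one copy is kept, so no column is filtered out.
        copies = max(1, min(max_copies, score))
        for r, row in enumerate(alignment):
            out[r].append(row[i] * copies)
    return ["".join(parts) for parts in out]

weighted = overweight_columns(["AC-", "ACG"], [3, 0, 1])
print(weighted)  # ['AAAC-', 'AAACG']
```

The weighted alignment can be given to any tree-building program unchanged, since it is still an ordinary alignment in which reliable sites simply count more.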


Subject(s)
Phylogeny , Sequence Alignment/methods , Software , Models, Molecular , Reproducibility of Results , Sequence Homology
12.
Methods Mol Biol ; 1079: 117-29, 2014.
Article in English | MEDLINE | ID: mdl-24170398

ABSTRACT

T-Coffee, for Tree-based consistency objective function for alignment evaluation, is a versatile multiple sequence alignment (MSA) method suitable for aligning virtually any type of biological sequences. T-Coffee provides more than a simple sequence aligner; rather it is a framework in which alternative alignment methods and/or extra information (i.e., structural, evolutionary, or experimental information) can be combined to reach more accurate and more meaningful MSAs. T-Coffee can be used either by running input data via the Web server ( http://tcoffee.crg.cat/apps/tcoffee/index.html ) or by downloading the T-Coffee package. Here, we present how the package can be used in its command line mode to carry out the most common tasks and multiply align proteins, DNA, and RNA sequences. This chapter places particular emphasis on the description of T-Coffee special flavors, also called "modes," designed to address particular biological problems.


Subject(s)
Computational Biology/methods , Sequence Alignment/methods , Amino Acid Sequence , DNA/genetics , Internet , Molecular Sequence Data , Proteins/chemistry , RNA/genetics
13.
Nucleic Acids Res ; 41(Web Server issue): W358-62, 2013 Jul.
Article in English | MEDLINE | ID: mdl-23716642

ABSTRACT

This article introduces the T-RMSD web server (tree based on root-mean-square deviation), a service allowing the online computation of structure-based protein classification. It has been developed to address the relation between structural and functional similarity in proteins, and it allows a fine-grained structural clustering of a given protein family or group of structurally related proteins using distance RMSD (dRMSD) variations. These distances are computed between all pairs of equivalent residues, as defined by the ungapped columns within a given multiple sequence alignment. Using these generated distance matrices (one per equivalent position), T-RMSD produces a structural tree with support values for each cluster node, reminiscent of bootstrap values. These values, associated with the tree topology, allow a quantitative estimate of structural distances between proteins or groups of proteins defined by the tree topology. The clusters thus defined have been shown to be structurally and functionally informative. The T-RMSD web server is a free website open to all users and available at http://tcoffee.crg.cat/apps/tcoffee/do:trmsd.


Subject(s)
Protein Conformation , Proteins/classification , Software , Algorithms , Cluster Analysis , Internet
14.
BMC Bioinformatics ; 13 Suppl 4: S1, 2012 Mar 28.
Article in English | MEDLINE | ID: mdl-22536955

ABSTRACT

BACKGROUND: Transmembrane proteins (TMPs) constitute about 20-30% of all protein coding genes. The relative lack of experimental structures has so far made it hard to develop specific alignment methods, and the current state of the art (PRALINE™) only manages to recapitulate 50% of the positions in the reference alignments available from the BAliBASE2-ref7. METHODS: We show how homology extension can be adapted and combined with a consistency based approach in order to significantly improve the multiple sequence alignment of alpha-helical TMPs. TM-Coffee is a special mode of PSI-Coffee able to efficiently align TMPs, while using a reduced reference database for homology extension. RESULTS: Our benchmarking on BAliBASE2-ref7 alpha-helical TMPs shows a significant improvement over the most accurate methods such as MSAProbs, Kalign, PROMALS, MAFFT, ProbCons and PRALINE™. We also estimated the influence of the database used for homology extension and show that highly non-redundant UniRef databases can be used to obtain similar results at a significantly reduced computational cost over full protein databases. TM-Coffee is part of the T-Coffee package; a web server is also available from http://tcoffee.crg.cat/tmcoffee, and the freeware open-source code can be downloaded from http://www.tcoffee.org/Packages/Stable/Latest.


Subject(s)
Drosophila melanogaster/chemistry , Drosophila/chemistry , Membrane Proteins/chemistry , Sequence Alignment , Software , Algorithms , Animals , Databases, Protein , Drosophila/metabolism , Drosophila melanogaster/metabolism
15.
Bioinformatics ; 28(1): 130-1, 2012 Jan 01.
Article in English | MEDLINE | ID: mdl-22053077

ABSTRACT

SUMMARY: AMPA is a web application for assessing the antimicrobial domains of proteins, with a focus on the design of new antimicrobial drugs. The application provides fast discovery of antimicrobial patterns in proteins that can be used to develop new peptide-based drugs against pathogens. Results are shown in a user-friendly graphical interface and can be downloaded as raw data for later examination. AVAILABILITY: AMPA is freely available on the web at http://tcoffee.crg.cat/apps/ampa. The source code is also available on the web. CONTACT: marc.torrent@upf.edu; david.andreu@upf.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Anti-Infective Agents/chemistry , Peptides/chemistry , Antimicrobial Cationic Peptides/chemistry , Internet , Programming Languages , Protein Structure, Tertiary , Software
16.
Nat Protoc ; 6(11): 1669-82, 2011 Nov.
Article in English | MEDLINE | ID: mdl-21979275

ABSTRACT

T-Coffee (Tree-based consistency objective function for alignment evaluation) is a versatile multiple sequence alignment (MSA) method suitable for aligning most types of biological sequences. The main strength of T-Coffee is its ability to combine third party aligners and to integrate structural (or homology) information when building MSAs. The series of protocols presented here show how the package can be used to multiply align proteins, RNA and DNA sequences. The protein section shows how users can select the most suitable T-Coffee mode for their data set. Detailed protocols include T-Coffee, the default mode; M-Coffee, a meta version able to combine several third party aligners into one; PSI (position-specific iterated)-Coffee, the homology extended mode suitable for remote homologs; and Expresso, the structure-based multiple aligner. We then also show how the T-RMSD (tree based on root mean square deviation) option can be used to produce a functionally informative structure-based clustering. RNA alignment procedures are described for using R-Coffee, a mode able to use predicted RNA secondary structures when aligning RNA sequences. DNA alignments are illustrated with Pro-Coffee, a multiple aligner specific to promoter regions. We also present some of the many reformatting utilities bundled with T-Coffee. The package is an open-source freeware available from http://www.tcoffee.org/.
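As a rough illustration of how the modes listed in this abstract are selected on the command line, the sketch below builds a `t_coffee` invocation for a given flavor without running it. The `-mode` values are recalled from the T-Coffee documentation rather than taken from this record, so they should be treated as assumptions and checked against the installed version; the file name and the helper function are hypothetical.

```python
import shlex

# Flavor name -> value passed to -mode (ASSUMED from the T-Coffee documentation;
# None means the default mode, which takes no -mode option).
MODES = {
    "T-Coffee": None,
    "M-Coffee": "mcoffee",
    "PSI-Coffee": "psicoffee",
    "Expresso": "expresso",
    "R-Coffee": "rcoffee",
    "Pro-Coffee": "procoffee",
}

def tcoffee_command(fasta_path, flavor="T-Coffee"):
    """Return the argument list for aligning fasta_path with the chosen flavor."""
    if flavor not in MODES:
        raise ValueError(f"unknown flavor: {flavor}")
    command = ["t_coffee", fasta_path]
    mode = MODES[flavor]
    if mode is not None:
        command += ["-mode", mode]
    return command

# Print the command line instead of executing it.
print(shlex.join(tcoffee_command("proteins.fasta", "M-Coffee")))
```

The returned list can be passed to `subprocess.run` once the package is installed; keeping command construction separate from execution makes the chosen mode easy to log and review.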


Subject(s)
DNA/chemistry , Nucleic Acid Conformation , Proteins/chemistry , RNA/chemistry , Sequence Alignment/methods , Algorithms , Amino Acid Sequence , Base Sequence , Models, Molecular , Molecular Sequence Data , Software
17.
Nucleic Acids Res ; 39(Web Server issue): W13-7, 2011 Jul.
Article in English | MEDLINE | ID: mdl-21558174

ABSTRACT

This article introduces a new interface for T-Coffee, a consistency-based multiple sequence alignment program. This interface provides easy and intuitive access to the most popular functions of the package. These include the default T-Coffee mode for protein and nucleic acid sequences, the M-Coffee mode that allows combining the output of any other aligners, and template-based modes of T-Coffee that deliver high accuracy alignments while using structural or homology derived templates. The three available template modes are Expresso for the alignment of proteins with known 3D structures, R-Coffee to align RNA sequences with conserved secondary structures and PSI-Coffee to accurately align distantly related sequences using homology extension. The new server benefits from recent improvements of the T-Coffee algorithm and can align up to 150 sequences as long as 10,000 residues. It is available from both http://www.tcoffee.org and its main mirror http://tcoffee.crg.cat.


Subject(s)
Sequence Alignment/methods , Sequence Analysis, Protein , Sequence Analysis, RNA , Software , Internet , Nucleic Acid Conformation , Protein Conformation , RNA/chemistry
18.
Bioinformatics ; 26(15): 1903-4, 2010 Aug 01.
Article in English | MEDLINE | ID: mdl-20605929

ABSTRACT

SUMMARY: We present the first parallel implementation of the T-Coffee consistency-based multiple aligner. We benchmark it on the Amazon Elastic Cloud (EC2) and show that the parallelization procedure is reasonably effective. We also conclude that for a web server with moderate usage (10K hits/month) the cloud provides a cost-effective alternative to in-house deployment. AVAILABILITY: T-Coffee is a freeware open source package available from http://www.tcoffee.org/homepage.html


Subject(s)
Algorithms , Sequence Alignment/methods , Software , Internet