Search | VHL Regional Portal

1.

The k-Robinson-Foulds Dissimilarity Measures for Comparison of Labeled Trees.

Khayatian, Elahe; Valiente, Gabriel; Zhang, Louxin.

J Comput Biol ; 31(4): 328-344, 2024 04.

Article in English | MEDLINE | ID: mdl-38271573

ABSTRACT

Understanding the mutational history of tumor cells is a critical endeavor in unraveling the mechanisms that drive the onset and progression of cancer. Modeling tumor cell evolution with labeled trees motivates researchers to develop different measures to compare labeled trees. Although the Robinson-Foulds (RF) distance is widely used for comparing species trees, its applicability to labeled trees reveals certain limitations. This study introduces the k-RF dissimilarity measures, tailored to address the challenges of labeled tree comparison. The RF distance is succinctly expressed as n-RF in the space of labeled trees with n nodes. Like the RF distance, the k-RF is a pseudometric for multiset-labeled trees and becomes a metric in the space of 1-labeled trees. By setting k to a small value, the k-RF dissimilarity can capture analogous local regions in two labeled trees of different sizes or with different labels.

Subject(s)

Algorithms , Humans , Neoplasms/genetics , Mutation , Computational Biology/methods , Phylogeny

2.

The Landscape of Virus-Host Protein-Protein Interaction Databases.

Valiente, Gabriel.

Front Microbiol ; 13: 827742, 2022.

Article in English | MEDLINE | ID: mdl-35910656

ABSTRACT

Knowledge of virus-host interactomes has advanced exponentially in the last decade by the use of high-throughput screening technologies to obtain a more comprehensive landscape of virus-host protein-protein interactions. In this article, we present a systematic review of the available virus-host protein-protein interaction database resources. The resources covered in this review are both generic virus-host protein-protein interaction databases and databases of protein-protein interactions for a specific virus or for those viruses that infect a particular host. The databases are reviewed on the basis of the specificity for a particular virus or host, the number of virus-host protein-protein interactions included, and the functionality in terms of browse, search, visualization, and download. Further, we also analyze the overlap of the databases, that is, the number of virus-host protein-protein interactions shared by the various databases, as well as the structure of the virus-host protein-protein interaction network, across viruses and hosts.

3.

The Generalized Robinson-Foulds Distance for Phylogenetic Trees.

Llabrés, Mercè; Rosselló, Francesc; Valiente, Gabriel.

J Comput Biol ; 28(12): 1181-1195, 2021 12.

Article in English | MEDLINE | ID: mdl-34714118

ABSTRACT

The Robinson-Foulds (RF) distance, one of the most widely used metrics for comparing phylogenetic trees, has the advantage of being intuitive, with a natural interpretation in terms of common splits, and it can be computed in linear time, but it has a very low resolution, and it may become trivial for phylogenetic trees with overlapping taxa, that is, phylogenetic trees that share some but not all of their leaf labels. In this article, we study the properties of the Generalized Robinson-Foulds (GRF) distance, a recently proposed metric for comparing any structures that can be described by multisets of multisets of labels, when applied to rooted phylogenetic trees with overlapping taxa, which are described by sets of clusters, that is, by sets of sets of labels. We show that the GRF distance has a very high resolution, it can also be computed in linear time, and it is not (uniformly) equivalent to the RF distance.

Subject(s)

Classification/methods , Computational Biology/methods , Algorithms , Models, Genetic , Phylogeny
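
The abstract of record 3 describes rooted phylogenetic trees by their sets of clusters and compares them through the classical RF distance. The following is a minimal sketch of that classical RF distance only, not of the GRF distance, whose definition is not given in the abstract. The nested-tuple tree encoding and the function names `clusters` and `rf_distance` are assumptions made for illustration, and this naive version does not achieve the linear running time mentioned in the abstract.

```python
def clusters(tree):
    """Collect the clusters of a rooted tree.

    The tree is a nested tuple whose leaves are strings (an illustrative
    encoding, not one taken from the paper). A cluster is the set of leaf
    labels below a node; singletons and the root cluster are included.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            leaf_set = frozenset([node])
        else:
            leaf_set = frozenset().union(*(leaves(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    leaves(tree)
    return found


def rf_distance(t1, t2):
    """Size of the symmetric difference of the two cluster sets."""
    return len(clusters(t1) ^ clusters(t2))


t1 = (("a", "b"), ("c", "d"))
t2 = (("a", "c"), ("b", "d"))
print(rf_distance(t1, t2))
```

Some authors halve this count or normalize it by the number of clusters; the unnormalized symmetric difference is used here only to keep the sketch short.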

4.

Alignment of virus-host protein-protein interaction networks by integer linear programming: SARS-CoV-2.

Llabrés, Mercè; Valiente, Gabriel.

PLoS One ; 15(12): e0236304, 2020.

Article in English | MEDLINE | ID: mdl-33284827

ABSTRACT

MOTIVATION: Besides socio-economic issues, the coronavirus pandemic COVID-19, the infectious disease caused by the newly discovered coronavirus SARS-CoV-2, has had a deep impact on the scientific community, which has considerably increased its efforts to discover the infection strategies of the new virus. Among the extensive and crucial research that has been carried out in recent months, the analysis of the virus-host relationship plays an important role in drug discovery. Virus-host protein-protein interactions are the active agents in virus replication, and the analysis of virus-host protein-protein interaction networks is fundamental to the study of the virus-host relationship. RESULTS: We have adapted and implemented a recent integer linear programming model for protein-protein interaction network alignment to virus-host networks, and obtained a consensus alignment of the SARS-CoV-1 and SARS-CoV-2 virus-host protein-protein interaction networks. Despite the lack of shared human proteins in these virus-host networks, and the low number of preserved virus-host interactions, the consensus alignment revealed aligned human proteins that share a function related to viral infection, as well as human proteins of high functional similarity that interact with SARS-CoV-1 and SARS-CoV-2 proteins, whose alignment would preserve these virus-host interactions.

Subject(s)

Host Microbial Interactions/physiology , Protein Interaction Maps/physiology , SARS-CoV-2/metabolism , COVID-19/virology , Coronavirus/metabolism , Coronavirus Infections/virology , Humans , Models, Theoretical , Pandemics , Pneumonia, Viral/virology , Programming, Linear , Protein Binding/physiology , Proteins/metabolism , Spike Glycoprotein, Coronavirus/metabolism , Virus Replication/physiology

5.

Alignment of biological networks by integer linear programming: virus-host protein-protein interaction networks.

Llabrés, Mercè; Riera, Gabriel; Rosselló, Francesc; Valiente, Gabriel.

BMC Bioinformatics ; 21(Suppl 6): 434, 2020 Nov 18.

Article in English | MEDLINE | ID: mdl-33203352

ABSTRACT

BACKGROUND: The alignment of protein-protein interaction networks was recently formulated as an integer quadratic programming problem, along with a linearization that can be solved by integer linear programming software tools. However, the resulting integer linear program has a huge number of variables and constraints, rendering it of no practical use. RESULTS: We present a compact integer linear programming reformulation of the protein-protein interaction network alignment problem, which can be solved using state-of-the-art mathematical modeling and integer linear programming software tools, along with empirical results showing that small biological networks, such as virus-host protein-protein interaction networks, can be aligned in a reasonable amount of time on a personal computer and the resulting alignments are structurally coherent and biologically meaningful. CONCLUSIONS: The implementation of the integer linear programming reformulation using current mathematical modeling and integer linear programming software tools provided biologically meaningful alignments of virus-host protein-protein interaction networks.

Subject(s)

Programming, Linear , Protein Interaction Maps , Software , Algorithms , Models, Theoretical

6.

AligNet: alignment of protein-protein interaction networks.

Alcalá, Adrià; Alberich, Ricardo; Llabrés, Mercè; Rosselló, Francesc; Valiente, Gabriel.

BMC Bioinformatics ; 21(Suppl 6): 265, 2020 Nov 18.

Article in English | MEDLINE | ID: mdl-33203353

ABSTRACT

BACKGROUND: All molecular functions and biological processes are carried out by groups of proteins that interact with each other. Metaproteomic data continuously generates new proteins whose molecular functions and relations must be discovered. A widely accepted structure for modeling functional relations between proteins is the protein-protein interaction network (PPIN), and the analysis and alignment of PPINs have become a key ingredient in the study and prediction of protein-protein interactions, protein function, and evolutionarily conserved assembly pathways of protein complexes. Several PPIN aligners have been proposed, but attaining the right balance between network topology and biological information is one of the most difficult and key points in the design of any PPIN alignment algorithm. RESULTS: Motivated by the challenge of well-balanced and efficient algorithms, we have designed and implemented AligNet, a parameter-free pairwise PPIN alignment algorithm aimed at bridging the gap between topologically efficient and biologically meaningful matchings. A comparison of the results obtained with AligNet and with the best aligners shows that AligNet indeed achieves a good balance between topological and biological matching. CONCLUSION: In this paper we present AligNet, a new pairwise global PPIN aligner that produces biologically meaningful alignments, by achieving a good balance between structural matching and protein function conservation, and more efficient computations than state-of-the-art tools.

Subject(s)

Protein Interaction Mapping , Protein Interaction Maps , Proteins , Algorithms , Biological Evolution , Proteins/metabolism

7.

A balance index for phylogenetic trees based on rooted quartets.

Coronado, Tomás M; Mir, Arnau; Rosselló, Francesc; Valiente, Gabriel.

J Math Biol ; 79(3): 1105-1148, 2019 08.

Article in English | MEDLINE | ID: mdl-31209515

ABSTRACT

We define a new balance index for rooted phylogenetic trees based on the symmetry of the evolutive history of every set of 4 leaves. This index makes sense for multifurcating trees and it can be computed in time linear in the number of leaves. We determine its maximum and minimum values for arbitrary and bifurcating trees, and we provide exact formulas for its expected value and variance on bifurcating trees under Ford's α-model and Aldous' β-model and on arbitrary trees under the α-γ-model.

Subject(s)

Algorithms , Biological Evolution , Mathematical Concepts , Models, Biological , Phylogeny , Animals , Humans

8.

Unbiased Taxonomic Annotation of Metagenomic Samples.

Fosso, Bruno; Pesole, Graziano; Rosselló, Francesc; Valiente, Gabriel.

J Comput Biol ; 25(3): 348-360, 2018 03.

Article in English | MEDLINE | ID: mdl-29028181

ABSTRACT

The classification of reads from a metagenomic sample using a reference taxonomy is usually based on first mapping the reads to the reference sequences and then classifying each read at a node under the lowest common ancestor of the candidate sequences in the reference taxonomy with the least classification error. However, this taxonomic annotation can be biased by an imbalanced taxonomy and also by the presence of multiple nodes in the taxonomy with the least classification error for a given read. In this article, we show that the Rand index is a better indicator of classification error than the often used area under the receiver operating characteristic (ROC) curve and F-measure for both balanced and imbalanced reference taxonomies, and we also address the second source of bias by reducing the taxonomic annotation problem for a whole metagenomic sample to a set cover problem, for which a logarithmic approximation can be obtained in linear time and an exact solution can be obtained by integer linear programming. Experimental results with a proof-of-concept implementation of the set cover approach to taxonomic annotation in a next release of the TANGO software show that the set cover approach further reduces ambiguity in the taxonomic annotation obtained with TANGO without distorting the relative abundance profile of the metagenomic sample.

Subject(s)

DNA Barcoding, Taxonomic/methods , Metagenome , Phylogeny , Software , DNA Barcoding, Taxonomic/standards , Humans , Microbiota
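
The abstract of record 8 reduces taxonomic annotation of a whole sample to a set cover problem with a logarithmic approximation. One standard way to obtain a logarithmic approximation for set cover is the greedy heuristic, sketched below. This is an illustration under stated assumptions, not the TANGO implementation: the function name `greedy_set_cover`, the read identifiers, and the mapping from candidate taxonomy nodes to the reads they could annotate are all hypothetical, and this naive loop does not attain the linear running time mentioned in the abstract.

```python
def greedy_set_cover(universe, candidates):
    """Greedy set cover.

    universe   -- iterable of items to cover (here: hypothetical read ids)
    candidates -- dict mapping a candidate name (here: a hypothetical
                  taxonomy node) to the set of items it covers
    Returns the chosen candidate names in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the candidate that covers the most still-uncovered items.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("candidates do not cover the universe")
        chosen.append(best)
        uncovered -= gained
    return chosen


reads = {"r1", "r2", "r3", "r4"}
nodes = {"A": {"r1", "r2", "r3"}, "B": {"r3", "r4"}, "C": {"r4"}}
print(greedy_set_cover(reads, nodes))
```

The abstract also mentions an exact solution by integer linear programming; that variant is not sketched here.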

9.

Complexity and Dynamics of the Winemaking Bacterial Communities in Berries, Musts, and Wines from Apulian Grape Cultivars through Time and Space.

Marzano, Marinella; Fosso, Bruno; Manzari, Caterina; Grieco, Francesco; Intranuovo, Marianna; Cozzi, Giuseppe; Mulè, Giuseppina; Scioscia, Gaetano; Valiente, Gabriel; Tullo, Apollonia; Sbisà, Elisabetta; Pesole, Graziano; Santamaria, Monica.

PLoS One ; 11(6): e0157383, 2016.

Article in English | MEDLINE | ID: mdl-27299312

ABSTRACT

Currently, there is very little information available regarding the microbiome associated with the wine production chain. Here, we used an amplicon sequencing approach based on high-throughput sequencing (HTS) to obtain a comprehensive assessment of the bacterial community associated with the production of three Apulian red wines, from grape to final product. The relationships among grape variety, the microbial community, and fermentation were investigated. Moreover, the winery microbiota was evaluated compared to the autochthonous species in vineyards that persist until the end of the winemaking process. The analysis highlighted the remarkable dynamics within the microbial communities during fermentation. A common microbial core shared among the examined wine varieties was observed, and the unique taxonomic signature of each wine appellation was revealed. New species belonging to the genus Halomonas were also reported. This study demonstrates the potential of this metagenomic approach, supported by optimized protocols, for identifying the biodiversity of the wine supply chain. The developed experimental pipeline offers new prospects for other research fields in which a comprehensive view of microbial community complexity and dynamics is desirable.

Subject(s)

Bacteria/genetics , Fungi/genetics , Vitis/microbiology , Wine/microbiology , Bacteria/classification , Bacteria/isolation & purification , Fermentation , Fruit/microbiology , Fungi/classification , Fungi/isolation & purification , High-Throughput Screening Assays , Metagenomics , Microbiota

10.

BioMaS: a modular pipeline for Bioinformatic analysis of Metagenomic AmpliconS.

Fosso, Bruno; Santamaria, Monica; Marzano, Marinella; Alonso-Alemany, Daniel; Valiente, Gabriel; Donvito, Giacinto; Monaco, Alfonso; Notarangelo, Pasquale; Pesole, Graziano.

BMC Bioinformatics ; 16: 203, 2015 Jul 01.

Article in English | MEDLINE | ID: mdl-26130132

ABSTRACT

BACKGROUND: Substantial advances in microbiology, molecular evolution and biodiversity have been made in recent years thanks to Metagenomics, which makes it possible to unveil the composition and functions of mixed microbial communities in any environmental niche. If the investigation is aimed only at the microbiome taxonomic structure, a target-based metagenomic approach, here also referred to as Meta-barcoding, is generally applied. This approach commonly involves the selective amplification of a species-specific genetic marker (DNA meta-barcode) in the whole taxonomic range of interest and the exploration of its taxon-related variants through High-Throughput Sequencing (HTS) technologies. Access to proper computational systems for the large-scale bioinformatic analysis of HTS data currently represents one of the major challenges in advanced Meta-barcoding projects. RESULTS: BioMaS (Bioinformatic analysis of Metagenomic AmpliconS) is a new bioinformatic pipeline designed to support biomolecular researchers involved in taxonomic studies of environmental microbial communities by a completely automated workflow, comprehensive of all the fundamental steps, from raw sequence data upload and cleaning to final taxonomic identification, that are absolutely required in an appropriately designed Meta-barcoding HTS-based experiment. In its current version, BioMaS allows the analysis of both bacterial and fungal environments starting directly from the raw sequencing data from either Roche 454 or Illumina HTS platforms, following two alternative paths, respectively. BioMaS is implemented as a public web service available at https://recasgateway.ba.infn.it/ and is also available in Galaxy at http://galaxy.cloud.ba.infn.it:8080 (only for Illumina data). CONCLUSION: BioMaS is a user-friendly pipeline for Meta-barcoding HTS data analysis specifically designed for users without particular computing skills. A comparative benchmark, carried out by using a simulated dataset suitably designed to broadly represent the currently known bacterial and fungal world, showed that BioMaS outperforms QIIME and MOTHUR in terms of extent and accuracy of deep taxonomic sequence assignments.

Subject(s)

Bacteria/genetics , Computational Biology/methods , Fungi/genetics , High-Throughput Nucleotide Sequencing/methods , Metagenomics , Software , Biodiversity

11.

The comparison of tree-sibling time consistent phylogenetic networks is graph isomorphism-complete.

Cardona, Gabriel; Llabrés, Mercè; Rosselló, Francesc; Valiente, Gabriel.

ScientificWorldJournal ; 2014: 254279, 2014.

Article in English | MEDLINE | ID: mdl-24982934

ABSTRACT

Several polynomial time computable metrics on the class of semibinary tree-sibling time consistent phylogenetic networks are available in the literature; in particular, the problem of deciding if two networks of this kind are isomorphic is in P. In this paper, we show that if we remove the semibinarity condition, then the problem becomes much harder. More precisely, we prove that the isomorphism problem for generic tree-sibling time consistent phylogenetic networks is polynomially equivalent to the graph isomorphism problem. Since the latter is believed not to belong to P, the chances are that it is impossible to define a metric on the class of all tree-sibling time consistent phylogenetic networks that can be computed in polynomial time.

Subject(s)

Algorithms , Phylogeny , Computational Biology , Humans

12.

Further steps in TANGO: improved taxonomic assignment in metagenomics.

Alonso-Alemany, Daniel; Barré, Aurélien; Beretta, Stefano; Bonizzoni, Paola; Nikolski, Macha; Valiente, Gabriel.

Bioinformatics ; 30(1): 17-23, 2014 Jan 01.

Article in English | MEDLINE | ID: mdl-23645816

ABSTRACT

MOTIVATION: TANGO is one of the most accurate tools for the taxonomic assignment of sequence reads. However, because of the differences in the taxonomy structures, performing a taxonomic assignment on different reference taxonomies will produce divergent results. RESULTS: We have improved the TANGO pipeline to be able to perform the taxonomic assignment of a metagenomic sample using alternative reference taxonomies, coming from different sources. We highlight the novel pre-processing step, necessary to accomplish this task, and describe the improvements in the assignment process. We present the new TANGO pipeline in detail, and, finally, we show its performance on four real metagenomic datasets and also on synthetic datasets. AVAILABILITY: The new version of TANGO, including implementation improvements and novel developments to perform the assignment on different reference taxonomies, is freely available at http://sourceforge.net/projects/taxoassignment/.

Subject(s)

Metagenomics/methods , Software , Algorithms , Metagenomics/classification

13.

Bioinformatics approaches and tools for metagenomic analysis. Editorial.

Valiente, Gabriel; Pesole, Graziano.

Brief Bioinform ; 13(6): 645, 2012 Nov.

Article in English | MEDLINE | ID: mdl-23175747

Subject(s)

Computational Biology , Metagenome , Metagenomics

14.

Reference databases for taxonomic assignment in metagenomics.

Santamaria, Monica; Fosso, Bruno; Consiglio, Arianna; De Caro, Giorgio; Grillo, Giorgio; Licciulli, Flavio; Liuni, Sabino; Marzano, Marinella; Alonso-Alemany, Daniel; Valiente, Gabriel; Pesole, Graziano.

Brief Bioinform ; 13(6): 682-95, 2012 Nov.

Article in English | MEDLINE | ID: mdl-22786784

ABSTRACT

Metagenomics is providing unprecedented access to environmental microbial diversity. The amplicon-based metagenomics approach involves the PCR-targeted sequencing of a genetic locus that meets several requirements. Namely, it must be ubiquitous in the taxonomic range of interest, variable enough to discriminate between different species but flanked by highly conserved sequences, and of suitable size to be sequenced through next-generation platforms. The internal transcribed spacers 1 and 2 (ITS1 and ITS2) of the ribosomal DNA operon and one or more hyper-variable regions of the 16S ribosomal RNA gene are typically used to identify fungal and bacterial species, respectively. In this context, reliable reference databases and taxonomies are crucial to assign amplicon sequence reads to the correct phylogenetic ranks. Several resources provide consistent phylogenetic classification of publicly available 16S ribosomal DNA sequences, whereas the state of ribosomal internal transcribed spacers reference databases is notably less advanced. In this review, we aim to give an overview of existing reference resources for both types of markers, highlighting strengths and possible shortcomings of their use for metagenomics purposes. Moreover, we present a new database, ITSoneDB, of well annotated and phylogenetically classified ITS1 sequences to be used as a reference collection in metagenomic studies of environmental fungal communities. ITSoneDB is available for download and browsing at http://itsonedb.ba.itb.cnr.it/.

Subject(s)

Databases, Genetic , Metagenomics/methods , Algorithms , Fungi/classification , Fungi/genetics , Genes, rRNA , RNA, Ribosomal, 16S/genetics , RNA, Ribosomal, 16S/metabolism

15.

Computational challenges of sequence classification in microbiomic data.

Ribeca, Paolo; Valiente, Gabriel.

Brief Bioinform ; 12(6): 614-25, 2011 Nov.

Article in English | MEDLINE | ID: mdl-21504986

ABSTRACT

Next-generation sequencing technologies have opened up an unprecedented opportunity for microbiology by enabling the culture-independent genetic study of complex microbial communities, which were so far largely unknown. The analysis of metagenomic data is challenging: potentially, one is faced with a sample containing a mixture of many different bacterial species, whose genomes have not necessarily been sequenced beforehand. In the simpler case of the analysis of 16S ribosomal RNA metagenomic data, for which databases of reference sequences are known, we survey the computational challenges to be solved in order to be able to characterize and quantify a sample. In particular, we examine two aspects: how the necessary adoption of new tools geared towards high-throughput analysis impacts the quality of the results, and how well various established methods perform when assigning sequence reads to microbial species, with and without taking taxonomic information into account.

Subject(s)

Metagenomics/methods , Archaea/classification , Archaea/genetics , Bacteria/classification , Bacteria/genetics , DNA, Bacterial/chemistry , Metagenome , RNA, Ribosomal, 16S/chemistry

16.

Flexible taxonomic assignment of ambiguous sequencing reads.

Clemente, José C; Jansson, Jesper; Valiente, Gabriel.

BMC Bioinformatics ; 12: 8, 2011 Jan 07.

Article in English | MEDLINE | ID: mdl-21211059

ABSTRACT

BACKGROUND: To characterize the diversity of bacterial populations in metagenomic studies, sequencing reads need to be accurately assigned to taxonomic units in a given reference taxonomy. Reads that cannot be reliably assigned to a unique leaf in the taxonomy (ambiguous reads) are typically assigned to the lowest common ancestor of the set of species that match them. This introduces a potentially severe error in the estimation of bacteria present in the sample due to false positives, since all species in the subtree rooted at the ancestor are implicitly assigned to the read even though many of them may not match it. RESULTS: We present a method that maps each read to a node in the taxonomy that minimizes a penalty score while balancing the relevance of precision and recall in the assignment through a parameter q. This mapping can be obtained in time linear in the number of matching sequences, because LCA queries to the reference taxonomy take constant time. When applied to six different metagenomic datasets, our algorithm produces different taxonomic distributions depending on whether coverage or precision is maximized. Including information on the quality of the reads reduces the number of unassigned reads but increases the number of ambiguous reads, stressing the relevance of our method. Finally, two measures of performance are described and results with a set of artificially generated datasets are discussed. CONCLUSIONS: The assignment strategy of sequencing reads introduced in this paper is a versatile and quick method to study bacterial communities. The bacterial composition of the analyzed samples can vary significantly depending on how ambiguous reads are assigned, which in turn depends on the value of the q parameter. Validation of our results in an artificial dataset confirms that a combination of values of q produces the most accurate results.

Subject(s)

Bacteria/classification , Computational Biology/methods , Metagenomics , Sequence Analysis, DNA/methods , Algorithms , Bacteria/genetics , DNA, Bacterial/genetics

17.

Comparison of galled trees.

Cardona, Gabriel; Llabrés, Mercè; Rosselló, Francesc; Valiente, Gabriel.

IEEE/ACM Trans Comput Biol Bioinform ; 8(2): 410-27, 2011.

Article in English | MEDLINE | ID: mdl-20660951

ABSTRACT

Galled trees, directed acyclic graphs that model evolutionary histories with isolated hybridization events, have become very popular due to both their biological significance and the existence of polynomial-time algorithms for their reconstruction. In this paper, we establish to what extent several distance measures for the comparison of evolutionary networks are metrics for galled trees, and hence, when they can be safely used to evaluate galled tree reconstruction methods.

Subject(s)

Phylogeny , Computational Biology/methods , Evolution, Molecular , Gene Expression Profiling/methods , Hybridization, Genetic , Models, Genetic

18.

Characterization of phylogenetic networks with NetTest.

Arenas, Miguel; Patricio, Mateus; Posada, David; Valiente, Gabriel.

BMC Bioinformatics ; 11: 268, 2010 May 20.

Article in English | MEDLINE | ID: mdl-20487540

ABSTRACT

BACKGROUND: Typical evolutionary events like recombination, hybridization or gene transfer make it necessary to use phylogenetic networks to properly depict the evolution of DNA and protein sequences. Although several theoretical classes have been proposed to characterize these networks, they make stringent assumptions that will likely not be met by the evolutionary process. We have recently shown that the complexity of simulated networks is a function of the population recombination rate, and that at moderate and large recombination rates the resulting networks cannot be categorized. However, we do not know whether these results extend to networks estimated from real data. RESULTS: We introduce a web server for the categorization of explicit phylogenetic networks, including the most relevant theoretical classes developed so far. Using this tool, we analyzed statistical parsimony phylogenetic networks estimated from approximately 5,000 DNA alignments, obtained from the NCBI PopSet and Polymorphix databases. The level of characterization was correlated to nucleotide diversity, and a high proportion of the networks derived from these data sets could be formally characterized. CONCLUSIONS: We have developed a public web server, NetTest (freely available from the software section at http://darwin.uvigo.es), to formally characterize the complexity of phylogenetic networks. Using NetTest we found that most statistical parsimony networks estimated with the program TCS could be assigned to a known network class. The level of network characterization was correlated to nucleotide diversity and dependent upon the intra/interspecific levels, although no significant differences were detected among genes. More research on the properties of phylogenetic networks is clearly needed.

Subject(s)

Phylogeny , Software , Databases, Genetic , Evolution, Molecular , Hybridization, Genetic

19.

An optimized TOPS+ comparison method for enhanced TOPS models.

Veeramalai, Mallika; Gilbert, David; Valiente, Gabriel.

BMC Bioinformatics ; 11: 138, 2010 Mar 17.

Article in English | MEDLINE | ID: mdl-20236520

ABSTRACT

BACKGROUND: Although methods based on highly abstract descriptions of protein structures, such as VAST and TOPS, can perform very fast protein structure comparison, the results can lack a high degree of biological significance. Previously we have discussed the basic mechanisms of our novel method for structure comparison based on our TOPS+ model (Topological descriptions of Protein Structures Enhanced with Ligand Information). In this paper we show how these results can be significantly improved using parameter optimization, and we call the resulting optimized TOPS+ method the advanced TOPS+ comparison method, or advTOPS+. RESULTS: We have developed a TOPS+ string model as an improvement to the TOPS graph model by considering loops as secondary structure elements (SSEs) in addition to helices and strands, representing ligands as first class objects, and describing interactions between SSEs, and SSEs and ligands, by incoming and outgoing arcs, annotating SSEs with the interaction direction and type. Benchmarking results of an all-against-all pairwise comparison using a large dataset of 2,620 non-redundant structures from the PDB40 dataset demonstrate the biological significance, in terms of SCOP classification at the superfamily level, of our TOPS+ comparison method. CONCLUSIONS: Our advanced TOPS+ comparison shows better performance on the PDB40 dataset compared to our basic TOPS+ method, giving 90% accuracy for SCOP alpha+beta; a 6% increase in accuracy compared to the TOPS and basic TOPS+ methods. It also outperforms the TOPS, basic TOPS+ and SSAP comparison methods on the Chew-Kedem dataset, achieving 98% accuracy. SOFTWARE AVAILABILITY: The TOPS+ comparison server is available at http://balabio.dcs.gla.ac.uk/mallika/WebTOPS/.

Subject(s)

Computational Biology/methods , Proteins/chemistry , Software , Algorithms , Databases, Protein , Ligands , Models, Molecular , Protein Conformation , Protein Folding

20.

Accurate taxonomic assignment of short pyrosequencing reads.

Clemente, José C; Jansson, Jesper; Valiente, Gabriel.

Pac Symp Biocomput ; : 3-9, 2010.

Article in English | MEDLINE | ID: mdl-19908352

ABSTRACT

Ambiguities in the taxonomy-dependent assignment of pyrosequencing reads are usually resolved by mapping each read to the lowest common ancestor in a reference taxonomy of all those sequences that match the read. This conservative approach has the drawback of mapping a read to a possibly large clade that may also contain many sequences not matching the read. A more accurate taxonomic assignment of short reads can be made by mapping each read to the node in the reference taxonomy that provides the best precision and recall. We show that given a suffix array for the sequences in the reference taxonomy, a short read can be mapped to the node of the reference taxonomy with the best combined value of precision and recall in time linear in the size of the taxonomy subtree rooted at the lowest common ancestor of the matching sequences. An accurate taxonomic assignment of short reads can thus be made with about the same efficiency as when mapping each read to the lowest common ancestor of all matching sequences in a reference taxonomy. We demonstrate the effectiveness of our approach on several metagenomic datasets of marine and gut microbiota.

Subject(s)

Bacteria/classification , Bacteria/genetics , High-Throughput Nucleotide Sequencing/methods , Animals , Computational Biology , Digestive System/microbiology , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Metagenome/genetics , Metagenomics/methods , Metagenomics/statistics & numerical data , Phylogeny , RNA, Bacterial/genetics , RNA, Ribosomal, 16S/genetics , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Sequence Analysis, RNA/methods , Sequence Analysis, RNA/statistics & numerical data
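
The abstract of record 20 contrasts assignment to the lowest common ancestor with assignment to the node that gives the best combined precision and recall. The sketch below illustrates the idea on a toy taxonomy. It assumes the F-measure as the combined value, which the abstract does not specify, and it recomputes leaf sets at every node, so it is not the linear-time method of the paper. The names `leaves_below` and `best_node` and the child-list encoding of the taxonomy are hypothetical.

```python
def leaves_below(children, node):
    """Set of leaves in the subtree rooted at node."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    result = set()
    for kid in kids:
        result |= leaves_below(children, kid)
    return result


def best_node(children, root, matches):
    """Node with the highest F-measure for the matching leaves.

    precision = matching leaves below the node / all leaves below the node
    recall    = matching leaves below the node / all matching leaves
    The F-measure is an assumed stand-in for the paper's combined score.
    """
    matches = set(matches)
    best, best_score = None, -1.0
    stack = [root]
    while stack:
        node = stack.pop()
        below = leaves_below(children, node)
        hit = len(below & matches)
        if hit:
            precision = hit / len(below)
            recall = hit / len(matches)
            score = 2 * precision * recall / (precision + recall)
            if score > best_score:
                best, best_score = node, score
        stack.extend(children.get(node, []))
    return best, best_score


# Toy taxonomy: the read matches leaves a, b and d, so its LCA is the root.
taxonomy = {"root": ["X", "Y"], "X": ["a", "b"], "Y": ["d", "e", "f", "g"]}
print(best_node(taxonomy, "root", {"a", "b", "d"}))
```

In this toy case the lowest common ancestor of the matches is the root, which also contains three non-matching leaves, whereas the node X trades a lower recall for full precision and scores higher under the assumed F-measure.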
