1.
Brief Bioinform ; 21(5): 1596-1608, 2020 09 25.
Article in English | MEDLINE | ID: mdl-32978619

ABSTRACT

Bacterial proteins dubbed virulence factors (VFs) are a highly diverse group of sequences, whose only obvious commonality is the very property of being, more or less directly, involved in virulence. It is therefore tempting to speculate whether their prediction, based on direct sequence similarity (seqsim) to known VFs, could be enhanced or even replaced by using machine-learning methods. Specifically, when trained on a large and diverse set of VFs, such may be able to detect putative, non-trivial characteristics shared by otherwise unrelated VF families and therefore better predict novel VFs with insignificant similarity to each individual family. We therefore first reassess the performance of dimer-based Support Vector Machines, as used in the widely used MP3 method, in light of seqsim-only and seqsim/dimer-hybrid classifiers. We then repeat the analysis with a novel, considerably more diverse data set, also addressing the important problem of negative data selection. Finally, we move on to the real-world use case of proteome-wide VF prediction, outlining different approaches to estimating specificity in this scenario. We find that direct seqsim is of unparalleled importance and therefore should always be exploited. Further, we observe strikingly low correlations between different feature and classifier types when ranking proteins by VF likeness. We therefore propose a 'best of each world' approach to prioritize proteins for experimental testing, focussing on the top predictions of each classifier. Further, classifiers for individual VF families should be developed.
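As an illustration only, the two ideas named in this abstract can be sketched in a few lines of Python. The sketch assumes that "dimer" features mean the relative frequencies of adjacent amino acid pairs (400 values per protein) and that the "best of each world" proposal amounts to taking the union of each classifier's top-ranked proteins. The function names, the `top_n` parameter and the example scores are hypothetical and are not taken from the paper or from MP3.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids, in a fixed order.
DIMERS = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dimer_composition(sequence):
    """Return the relative frequency of each adjacent residue pair.

    Pairs containing non-standard letters are skipped (an assumption).
    """
    counts = dict.fromkeys(DIMERS, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIMERS)
    return [counts[d] / total for d in DIMERS]


def best_of_each_world(scores_by_classifier, top_n):
    """Union of the top_n highest-scoring proteins of every classifier.

    scores_by_classifier maps a classifier name to {protein_id: score}.
    The result keeps the order in which proteins were first selected.
    """
    selected = []
    seen = set()
    for scores in scores_by_classifier.values():
        ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
        for protein in ranked:
            if protein not in seen:
                seen.add(protein)
                selected.append(protein)
    return selected


if __name__ == "__main__":
    features = dimer_composition("MKTAYIAKQR")
    print(len(features))  # one value per dimer
    scores = {
        "seqsim": {"p1": 0.9, "p2": 0.1, "p3": 0.5},
        "dimer_svm": {"p1": 0.2, "p2": 0.8, "p3": 0.3},
    }
    print(best_of_each_world(scores, top_n=1))
```

Taking a union rather than averaging scores reflects the low correlation between classifiers reported above: a protein ranked highly by only one classifier is still kept.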


Subject(s)
Bacteria/pathogenicity , Bacterial Proteins/metabolism , Support Vector Machine , Virulence Factors/metabolism , Algorithms , Amino Acid Sequence , Bacterial Proteins/chemistry , Datasets as Topic , Dimerization , Proteome , Virulence Factors/chemistry
2.
Bioinformatics ; 36(1): 81-89, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31298694

ABSTRACT

MOTIVATION: We expect novel pathogens to arise due to their fast-paced evolution, and new species to be discovered thanks to advances in DNA sequencing and metagenomics. Moreover, recent developments in synthetic biology raise concerns that some strains of bacteria could be modified for malicious purposes. Traditional approaches to open-view pathogen detection depend on databases of known organisms, which limits their performance on unknown, unrecognized and unmapped sequences. In contrast, machine learning methods can infer pathogenic phenotypes from single NGS reads, even though the biological context is unavailable. RESULTS: We present DeePaC, a Deep Learning Approach to Pathogenicity Classification. It includes a flexible framework allowing easy evaluation of neural architectures with reverse-complement parameter sharing. We show that convolutional neural networks and LSTMs outperform the state-of-the-art based on both sequence homology and machine learning. Combining a deep learning approach with integrating the predictions for both mates in a read pair results in cutting the error rate almost in half in comparison to the previous state-of-the-art. AVAILABILITY AND IMPLEMENTATION: The code and the models are available at: https://gitlab.com/rki_bioinformatics/DeePaC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
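The following Python sketch illustrates, under stated assumptions, what combining predictions over both strands and both mates of a read pair could look like. It is not DeePaC's implementation: DeePaC shares parameters between strands inside the network, whereas this stand-in simply averages the outputs of an arbitrary scoring function. The names `predict`, `strand_invariant_score` and `read_pair_score` are hypothetical, and averaging is an assumed way of "integrating" the two mates.

```python
# Translation table for complementing DNA bases; N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(read):
    """Return the reverse complement of a DNA read."""
    return read.upper().translate(COMPLEMENT)[::-1]


def strand_invariant_score(read, predict):
    """Average a model's score over both orientations of a read.

    predict is any callable mapping a read string to a float score.
    """
    return 0.5 * (predict(read) + predict(reverse_complement(read)))


def read_pair_score(mate1, mate2, predict):
    """Combine the scores of both mates of a read pair by averaging."""
    return 0.5 * (
        strand_invariant_score(mate1, predict)
        + strand_invariant_score(mate2, predict)
    )


if __name__ == "__main__":
    # Toy stand-in for a trained classifier: the fraction of 'A' bases.
    def toy_predict(read):
        return read.count("A") / len(read)

    print(reverse_complement("AACG"))
    print(read_pair_score("AACG", "TTTT", toy_predict))
```

By construction, the averaged score is the same whichever strand a read was sequenced from, which is the property that reverse-complement parameter sharing builds into the model itself.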


Subject(s)
Neural Networks, Computer , DNA , Deep Learning , Sequence Analysis, DNA
3.
Sci Rep ; 7: 39194, 2017 01 04.
Article in English | MEDLINE | ID: mdl-28051068

ABSTRACT

The reliable detection of novel bacterial pathogens from next-generation sequencing data is a key challenge for microbial diagnostics. Current computational tools usually rely on sequence similarity and often fail to detect novel species when closely related genomes are unavailable or missing from the reference database. Here we present the machine learning based approach PaPrBaG (Pathogenicity Prediction for Bacterial Genomes). PaPrBaG overcomes genetic divergence by training on a wide range of species with known pathogenicity phenotype. To that end we compiled a comprehensive list of pathogenic and non-pathogenic bacteria with human host, using various genome metadata in conjunction with a rule-based protocol. A detailed comparative study reveals that PaPrBaG has several advantages over sequence similarity approaches. Most importantly, it always provides a prediction whereas other approaches discard a large number of sequencing reads with low similarity to currently known reference genomes. Furthermore, PaPrBaG remains reliable even at very low genomic coverages. Combining PaPrBaG with existing approaches further improves prediction results.


Subject(s)
Bacteria/isolation & purification , Bacterial Infections/etiology , Computational Biology/methods , High-Throughput Nucleotide Sequencing , Machine Learning , Sequence Analysis, DNA/methods , Bacteria/genetics , Humans
4.
Brief Bioinform ; 16(6): 1045-56, 2015 Nov.
Article in English | MEDLINE | ID: mdl-25900849

ABSTRACT

There is a growing interest in the mechanisms and the prediction of how flexible peptides bind proteins, often in a highly selective and conserved manner. While both existing small-molecule docking methods and custom protocols can be used, even short peptides make difficult targets owing to their high torsional flexibility. Any benchmarking should therefore start with those. We compiled a meta-data set of 47 complexes with peptides up to five residues, based on 11 related studies from the past decade. Although their highly varying strategies and constraints preclude direct, quantitative comparisons, we still provide a comprehensive overview of the reported results, using a simple yet stringent measure: the quality of the top-scoring peptide pose. Using the entire data set, this is augmented by our own benchmark of AutoDock Vina, a freely available, fast and widely used docking tool. It particularly addresses non-expert users and was therefore implemented in a highly integrated manner. Guidelines addressing important issues such as the amount of sampling required for result reproducibility are so far lacking. Using peptide docking as an example, this is the first study to address these issues in detail. Finally, to encourage further, standardized benchmarking efforts, the compiled data set is made available in an accessible, transparent and extendable manner.


Subject(s)
Peptides/chemistry , Molecular Docking Simulation , Reproducibility of Results
5.
Nucleic Acids Res ; 42(Database issue): D240-5, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24270792

ABSTRACT

Gene3D (http://gene3d.biochem.ucl.ac.uk) is a database of protein domain structure annotations for protein sequences. Domains are predicted using a library of profile HMMs from 2738 CATH superfamilies. Gene3D assigns domain annotations to Ensembl and UniProt sequence sets including >6000 cellular genomes and >20 million unique protein sequences. This represents an increase of 45% in the number of protein sequences since our last publication. Thanks to improvements in the underlying data and pipeline, we see large increases in the domain coverage of sequences. We have expanded this coverage by integrating Pfam and SUPERFAMILY domain annotations, and we now resolve domain overlaps to provide highly comprehensive composite multi-domain architectures. To make these data more accessible for comparative genome analyses, we have developed novel search algorithms for searching genomes to identify related multi-domain architectures. In addition to providing domain family annotations, we have now developed a pipeline for 3D homology modelling of domains in Gene3D. This has been applied to the human genome and will be rolled out to other major organisms over the next year.


Subject(s)
Databases, Protein , Molecular Sequence Annotation , Protein Structure, Tertiary , Genome , Genomics , Internet , Models, Molecular , Protein Structure, Tertiary/genetics , Sequence Analysis, Protein
6.
BMC Bioinformatics ; 14 Suppl 3: S5, 2013.
Article in English | MEDLINE | ID: mdl-23514456

ABSTRACT

Here we assessed the use of domain families for predicting the functions of whole proteins. These 'functional families' (FunFams) were derived using a protocol that combines sequence clustering with supervised cluster evaluation, relying on available high-quality Gene Ontology (GO) annotation data in the latter step. In essence, the protocol groups domain sequences belonging to the same superfamily into families based on the GO annotations of their parent proteins. An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. All target proteins were first submitted to domain superfamily assignment, followed by FunFam assignment and, eventually, function assignment. The latter included an integration step for multi-domain target proteins. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.


Subject(s)
Protein Structure, Tertiary , Proteins/physiology , Cluster Analysis , Databases, Protein , Molecular Sequence Annotation , Proteins/classification , Proteins/genetics , Sequence Analysis, Protein , Vocabulary, Controlled
7.
Nat Methods ; 10(3): 221-7, 2013 Mar.
Article in English | MEDLINE | ID: mdl-23353650

ABSTRACT

Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.


Subject(s)
Computational Biology/methods , Molecular Biology/methods , Molecular Sequence Annotation , Proteins/physiology , Algorithms , Animals , Databases, Protein , Exoribonucleases/classification , Exoribonucleases/genetics , Exoribonucleases/physiology , Forecasting , Humans , Proteins/chemistry , Proteins/classification , Proteins/genetics , Species Specificity
8.
Nucleic Acids Res ; 41(Database issue): D490-8, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23203873

ABSTRACT

CATH version 3.5 (Class, Architecture, Topology, Homology, available at http://www.cathdb.info/) contains 173 536 domains, 2626 homologous superfamilies and 1313 fold groups. When focusing on structural genomics (SG) structures, we observe that the number of new folds for CATH v3.5 is slightly less than for previous releases, and this observation suggests that we may now know the majority of folds that are easily accessible to structure determination. We have improved the accuracy of our functional family (FunFams) sub-classification method and the CATH sequence domain search facility has been extended to provide FunFam annotations for each domain. The CATH website has been redesigned. We have improved the display of functional data and of conserved sequence features associated with FunFams within each CATH superfamily.


Subject(s)
Databases, Protein , Protein Structure, Tertiary , Genomics , Internet , Molecular Sequence Annotation , Protein Folding , Proteins/chemistry , Proteins/classification , Proteins/genetics , Sequence Alignment , Sequence Analysis, Protein , Structural Homology, Protein
9.
Nucleic Acids Res ; 40(Database issue): D465-71, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22139938

ABSTRACT

Gene3D (http://gene3d.biochem.ucl.ac.uk) is a comprehensive database of protein domain assignments for sequences from the major sequence databases. Domains are directly mapped from structures in the CATH database or predicted using a library of representative profile HMMs derived from CATH superfamilies. As previously described, Gene3D integrates many other protein family and function databases. These facilitate complex associations of molecular function, structure and evolution. Gene3D now includes a domain functional family (FunFam) level below the homologous superfamily level assignments. Additions have also been made to the interaction data. More significantly, to help with the visualization and interpretation of multi-genome scale data sets, we have developed a new, revamped website. Searching has been simplified with more sophisticated filtering of results, along with new tools based on Cytoscape Web, for visualizing protein-protein interaction networks, differences in domain composition between genomes and the taxonomic distribution of individual superfamilies.


Subject(s)
Databases, Protein , Molecular Sequence Annotation , Protein Interaction Maps , Protein Structure, Tertiary , Genomics , Proteins/chemistry , Proteins/classification , Proteins/genetics
10.
Nucleic Acids Res ; 39(Database issue): D420-6, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21097779

ABSTRACT

CATH version 3.3 (class, architecture, topology, homology) contains 128,688 domains, 2386 homologous superfamilies and 1233 fold groups, and reflects a major focus on classifying structural genomics (SG) structures and transmembrane proteins, both of which are likely to add structural novelty to the database and therefore increase the coverage of protein fold space within CATH. For CATH version 3.4 we have significantly improved the presentation of sequence information and associated functional information for CATH superfamilies. The CATH superfamily pages now reflect both the functional and structural diversity within the superfamily and include structural alignments of close and distant relatives within the superfamily, annotated with functional information and details of conserved residues. A significantly more efficient search function for CATH has been established by implementing the search server Solr (http://lucene.apache.org/solr/). The CATH v3.4 webpages have been built using the Catalyst web framework.


Subject(s)
Databases, Protein , Protein Structure, Tertiary , Phylogeny , Protein Folding , Proteins/chemistry , Proteins/classification
11.
Nucleic Acids Res ; 38(3): 720-37, 2010 Jan.
Article in English | MEDLINE | ID: mdl-19923231

ABSTRACT

GeMMA (Genome Modelling and Model Annotation) is a new approach to automatic functional subfamily classification within families and superfamilies of protein sequences. A major advantage of GeMMA is its ability to subclassify very large and diverse superfamilies with tens of thousands of members, without the need for an initial multiple sequence alignment. Its performance is shown to be comparable to the established high-performance method SCI-PHY. GeMMA follows an agglomerative clustering protocol that uses existing software for sensitive and accurate multiple sequence alignment and profile-profile comparison. The produced subfamilies are shown to be equivalent in quality whether whole protein sequences are used or just the sequences of component predicted structural domains. A faster, heuristic version of GeMMA that also uses distributed computing is shown to maintain the performance levels of the original implementation. The use of GeMMA to increase the functional annotation coverage of functionally diverse Pfam families is demonstrated. It is further shown how GeMMA clusters can help to predict the impact of experimentally determining a protein domain structure on comparative protein modelling coverage, in the context of structural genomics.


Subject(s)
Algorithms , Protein Structure, Tertiary , Benchmarking , Classification/methods , Models, Chemical , Proteins/classification , Sequence Analysis, Protein
12.
Trends Biotechnol ; 27(4): 210-9, 2009 Apr.
Article in English | MEDLINE | ID: mdl-19251332

ABSTRACT

Advances in experimental and computational methods have quietly ushered in a new era in protein function annotation. This 'age of multiplicity' is marked by the notion that only the use of multiple tools, multiple evidence and considering the multiple aspects of function can give us the broad picture that 21st century biology will need to link and alter micro- and macroscopic phenotypes. It might also help us to undo past mistakes by removing errors from our databases and prevent us from producing more. On the downside, multiplicity is often confusing. We therefore systematically review methods and resources for automated protein function prediction, looking at individual (biochemical) and contextual (network) functions, respectively.


Subject(s)
Computational Biology/methods , Proteins/physiology , Artificial Intelligence , Databases, Protein , Pattern Recognition, Automated , Phylogeny , Proteins/chemistry
13.
J Mol Biol ; 387(2): 416-30, 2009 Mar 27.
Article in English | MEDLINE | ID: mdl-19135455

ABSTRACT

Divergence in function of homologous proteins is based on both sequence and structural changes. Overall enzyme function has been reported to diverge earlier (50% sequence identity) than overall structure (35%). We herein study the functional conservation of enzymes and non-enzyme sequences using the protein domain families in CATH-Gene3D. Despite the rapid increase in sequence data since the last comprehensive study by Tian and Skolnick, our findings suggest that generic thresholds of 40% and 60% aligned sequence identity are still sufficient to safely inherit third-level and full Enzyme Commission numbers, respectively. This increases to 50% and 70% on the domain level, unless the multi-domain architecture matches. Assignments from the Kyoto Encyclopedia of Genes and Genomes and the Munich Information Center for Protein Sequences Functional Catalogue seem to be less conserved with sequence, probably due to a more pathway-centric view: 80% domain sequence identity is required for safe function transfer. Comparing domains (more pairwise relationships) and the use of family-specific thresholds (varying evolutionary speeds) yields the highest coverage rates when transferring functions to model proteomes. An average twofold increase in enzyme annotations is seen for 523 proteomes in Gene3D. As simple 'rules of thumb', sequence identity thresholds do not require a bioinformatics background. We will provide and update this information with future releases of CATH-Gene3D.
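The generic thresholds quoted in this abstract can be written down as a small lookup, shown here as a hedged Python sketch. The percentages (40% and 60% for whole proteins, 50% and 70% at the domain level) come from the abstract. Treating them as inclusive cut-offs, reverting to 40% and 60% when the multi-domain architecture matches, and the function names are assumptions made for illustration.

```python
def ec_digits_transferable(identity_percent, domain_level=False,
                           architecture_matches=False):
    """Return how many Enzyme Commission digits may be inherited (0, 3 or 4).

    identity_percent is the aligned sequence identity in percent.
    The stricter domain-level thresholds apply only when the comparison
    is between domains and the multi-domain architecture does not match.
    """
    if not 0 <= identity_percent <= 100:
        raise ValueError("identity_percent must be between 0 and 100")
    stricter = domain_level and not architecture_matches
    third_level, full = (50, 70) if stricter else (40, 60)
    if identity_percent >= full:
        return 4
    if identity_percent >= third_level:
        return 3
    return 0


def truncate_ec(ec_number, digits):
    """Trim a four-part EC number to the given number of digits.

    Returns None when nothing may be transferred.
    """
    if digits == 0:
        return None
    parts = ec_number.split(".")
    return ".".join(parts[:digits] + ["-"] * (4 - digits))


if __name__ == "__main__":
    digits = ec_digits_transferable(45)
    print(truncate_ec("3.4.21.4", digits))
```

The family-specific thresholds that the abstract reports as giving the highest coverage are not modelled here; this sketch covers only the generic rule of thumb.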


Subject(s)
Proteins/chemistry , Proteins/metabolism , Sequence Analysis, Protein , Amino Acid Sequence , Enzymes/chemistry , Enzymes/metabolism , Genome/genetics , Models, Biological , Multigene Family , Protein Structure, Tertiary , Proteome/chemistry , Proteome/metabolism , Sequence Homology, Amino Acid