Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 10 de 10
Filter
Add more filters










Publication year range
1.
Nat Biotechnol ; 40(6): 932-937, 2022 06.
Article in English | MEDLINE | ID: mdl-35190689

ABSTRACT

Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.


Subject(s)
Deep Learning , Amino Acid Sequence , Databases, Protein , Humans , Molecular Sequence Annotation , Proteome/metabolism , Proteomics
2.
Nat Biotechnol ; 39(6): 691-696, 2021 06.
Article in English | MEDLINE | ID: mdl-33574611

ABSTRACT

Modern experimental technologies can assay large numbers of biological sequences, but engineered protein libraries rarely exceed the sequence diversity of natural protein families. Machine learning (ML) models trained directly on experimental data without biophysical modeling provide one route to accessing the full potential diversity of engineered proteins. Here we apply deep learning to design highly diverse adeno-associated virus 2 (AAV2) capsid protein variants that remain viable for packaging of a DNA payload. Focusing on a 28-amino acid segment, we generated 201,426 variants of the AAV2 wild-type (WT) sequence yielding 110,689 viable engineered capsids, 57,348 of which surpass the average diversity of natural AAV serotype sequences, with 12-29 mutations across this region. Even when trained on limited data, deep neural network models accurately predict capsid viability across diverse variants. This approach unlocks vast areas of functional but previously unreachable sequence space, with many potential applications for the generation of improved viral vectors and protein therapeutics.


Subject(s)
Capsid Proteins/genetics , Dependovirus/genetics , Machine Learning , Genetic Vectors , HeLa Cells , Humans
3.
PLoS Comput Biol ; 9(6): e1003087, 2013.
Article in English | MEDLINE | ID: mdl-23754939

ABSTRACT

The protein kinases are a large family of enzymes that play fundamental roles in propagating signals within the cell. Because of the high degree of binding site similarity shared among protein kinases, designing drug compounds with high specificity among the kinases has proven difficult. However, computational approaches to comparing the 3-dimensional geometry and physicochemical properties of key binding site residue positions have been shown to be informative of inhibitor selectivity. The Combinatorial Clustering Of Residue Position Subsets (ccorps) method, introduced here, provides a semi-supervised learning approach for identifying structural features that are correlated with a given set of annotation labels. Here, ccorps is applied to the problem of identifying structural features of the kinase atp binding site that are informative of inhibitor binding. ccorps is demonstrated to make perfect or near-perfect predictions for the binding affinity profile of 8 of the 38 kinase inhibitors studied, while only having overall poor predictive ability for 1 of the 38 compounds. Additionally, ccorps is shown to identify shared structural features across phylogenetically diverse groups of kinases that are correlated with binding affinity for particular inhibitors; such instances of structural similarity among phylogenetically diverse kinases are also shown to not be rare among kinases. Finally, these function-specific structural features may serve as potential starting points for the development of highly specific kinase inhibitors.


Subject(s)
Protein Kinases/chemistry , Cluster Analysis , Humans , Models, Theoretical , Proteome , Support Vector Machine
4.
Bioinformatics ; 27(15): 2161-2, 2011 Aug 01.
Article in English | MEDLINE | ID: mdl-21659320

ABSTRACT

SUMMARY: The LabelHash server and tools are designed for large-scale substructure comparison. The main use is to predict the function of unknown proteins. Given a set of (putative) functional residues, LabelHash finds all occurrences of matching substructures in the entire Protein Data Bank, along with a statistical significance estimate and known functional annotations for each match. The results can be downloaded for further analysis in any molecular viewer. For Chimera, there is a plugin to facilitate this process. AVAILABILITY: The web site is free and open to all users with no login requirements at http://labelhash.kavrakilab.org


Subject(s)
Databases, Protein , Internet , Molecular Sequence Annotation/methods , Proteins/metabolism , Algorithms , Amino Acid Motifs , Computational Biology/methods , Structure-Activity Relationship , User-Computer Interface
5.
BMC Bioinformatics ; 11: 555, 2010 Nov 11.
Article in English | MEDLINE | ID: mdl-21070651

ABSTRACT

BACKGROUND: There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity. RESULTS: We present LabelHash, a novel algorithm for matching substructural motifs to large collections of protein structures. The algorithm consists of two phases. In the first phase the proteins are preprocessed in a fashion that allows for instant lookup of partial matches to any motif. In the second phase, partial matches for a given motif are expanded to complete matches. The general applicability of the algorithm is demonstrated with three different case studies. First, we show that we can accurately identify members of the enolase superfamily with a single motif. Next, we demonstrate how LabelHash can complement SOIPPA, an algorithm for motif identification and pairwise substructure alignment. Finally, a large collection of Catalytic Site Atlas motifs is used to benchmark the performance of the algorithm. LabelHash runs very efficiently in parallel; matching a motif against all proteins in the 95% sequence identity filtered non-redundant Protein Data Bank typically takes no more than a few minutes. The LabelHash algorithm is available through a web server and as a suite of standalone programs at http://labelhash.kavrakilab.org. The output of the LabelHash algorithm can be further analyzed with Chimera through a plugin that we developed for this purpose. CONCLUSIONS: LabelHash is an efficient, versatile algorithm for large-scale substructure matching. When LabelHash is running in parallel, motifs can typically be matched against the entire PDB on the order of minutes. The algorithm is able to identify functional homologs beyond the twilight zone of sequence identity and even beyond fold similarity. The three case studies presented in this paper illustrate the versatility of the algorithm.


Subject(s)
Algorithms , Proteins/chemistry , Computational Biology/methods , Databases, Protein , Protein Conformation , Proteins/metabolism , Structure-Activity Relationship
6.
BMC Bioinformatics ; 11: 242, 2010 May 11.
Article in English | MEDLINE | ID: mdl-20459833

ABSTRACT

BACKGROUND: Structural variations caused by a wide range of physico-chemical and biological sources directly influence the function of a protein. For enzymatic proteins, the structure and chemistry of the catalytic binding site residues can be loosely defined as a substructure of the protein. Comparative analysis of drug-receptor substructures across and within species has been used for lead evaluation. Substructure-level similarity between the binding sites of functionally similar proteins has also been used to identify instances of convergent evolution among proteins. In functionally homologous protein families, shared chemistry and geometry at catalytic sites provide a common, local point of comparison among proteins that may differ significantly at the sequence, fold, or domain topology levels. RESULTS: This paper describes two key results that can be used separately or in combination for protein function analysis. The Family-wise Analysis of SubStructural Templates (FASST) method uses all-against-all substructure comparison to determine Substructural Clusters (SCs). SCs characterize the binding site substructural variation within a protein family. In this paper we focus on examples of automatically determined SCs that can be linked to phylogenetic distance between family members, segregation by conformation, and organization by homology among convergent protein lineages. The Motif Ensemble Statistical Hypothesis (MESH) framework constructs a representative motif for each protein cluster among the SCs determined by FASST to build motif ensembles that are shown through a series of function prediction experiments to improve the function prediction power of existing motifs. CONCLUSIONS: FASST contributes a critical feedback and assessment step to existing binding site substructure identification methods and can be used for the thorough investigation of structure-function relationships. The application of MESH allows for an automated, statistically rigorous procedure for incorporating structural variation data into protein function prediction pipelines. Our work provides an unbiased, automated assessment of the structural variability of identified binding site substructures among protein structure families and a technique for exploring the relation of substructural variation to protein function. As available proteomic data continues to expand, the techniques proposed will be indispensable for the large-scale analysis and interpretation of structural data.


Subject(s)
Enzymes/chemistry , Proteins/chemistry , Proteomics/methods , Amino Acid Motifs , Binding Sites , Databases, Protein , Enzymes/metabolism , Models, Molecular , Protein Conformation , Protein Folding , Proteins/metabolism , Sequence Analysis, Protein
7.
Article in English | MEDLINE | ID: mdl-17951837

ABSTRACT

The study of disease often hinges on the biological function of proteins, but determining protein function is a difficult experimental process. To minimize duplicated effort, algorithms for function prediction seek characteristics indicative of possible protein function. One approach is to identify substructural matches of geometric and chemical similarity between motifs representing known active sites and target protein structures with unknown function. In earlier work, statistically significant matches of certain effective motifs have identified functionally related active sites. Effective motifs must be carefully designed to maintain similarity to functionally related sites (sensitivity) and avoid incidental similarities to functionally unrelated protein geometry (specificity). Existing motif design techniques use the geometry of a single protein structure. Poor selection of this structure can limit motif effectiveness if the selected functional site lacks similarity to functionally related sites. To address this problem, this paper presents composite motifs, which combine structures of functionally related active sites to potentially increase sensitivity. Our experimentation compares the effectiveness of composite motifs with simple motifs designed from single protein structures. On six distinct families of functionally related proteins, leave-one-out testing showed that composite motifs had sensitivity comparable to the most sensitive of all simple motifs and specificity comparable to the average simple motif. On our data set, we observed that composite motifs simultaneously capture variations in active site conformation, diminish the problem of selecting motif structures, and enable the fusion of protein structures from diverse data sources.


Subject(s)
Models, Chemical , Models, Molecular , Proteins/chemistry , Proteins/ultrastructure , Sequence Analysis, Protein/methods , Amino Acid Motifs , Binding Sites , Computer Simulation , Multiprotein Complexes/chemistry , Multiprotein Complexes/metabolism , Multiprotein Complexes/ultrastructure , Protein Binding , Protein Conformation , Protein Folding , Surface Properties
8.
J Comput Biol ; 14(6): 791-816, 2007.
Article in English | MEDLINE | ID: mdl-17691895

ABSTRACT

The development of new and effective drugs is strongly affected by the need to identify drug targets and to reduce side effects. Resolving these issues depends partially on a thorough understanding of the biological function of proteins. Unfortunately, the experimental determination of protein function is expensive and time consuming. To support and accelerate the determination of protein functions, algorithms for function prediction are designed to gather evidence indicating functional similarity with well studied proteins. One such approach is the MASH pipeline, described in the first half of this paper. MASH identifies matches of geometric and chemical similarity between motifs, representing known functional sites, and substructures of functionally uncharacterized proteins (targets). Observations from several research groups concur that statistically significant matches can indicate functionally related active sites. One major subproblem is the design of effective motifs, which have many matches to functionally related targets (sensitive motifs), and few matches to functionally unrelated targets (specific motifs). Current techniques select and combine structural, physical, and evolutionary properties to generate motifs that mirror functional characteristics in active sites. This approach ignores incidental similarities that may occur with functionally unrelated proteins. To address this problem, we have developed Geometric Sieving (GS), a parallel distributed algorithm that efficiently refines motifs, designed by existing methods, into optimized motifs with maximal geometric and chemical dissimilarity from all known protein structures. In exhaustive comparison of all possible motifs based on the active sites of 10 well-studied proteins, we observed that optimized motifs were among the most sensitive and specific.


Subject(s)
Algorithms , Computational Biology , Proteins/chemistry , Proteins/metabolism , Amino Acid Motifs , Binding Sites , Databases, Protein , Models, Molecular , Protein Structure, Tertiary , Software
9.
J Bioinform Comput Biol ; 5(2a): 353-82, 2007 Apr.
Article in English | MEDLINE | ID: mdl-17589966

ABSTRACT

Algorithms for geometric and chemical comparison of protein substructure can be useful for many applications in protein function prediction. These motif matching algorithms identify matches of geometric and chemical similarity between well-studied functional sites, motifs, and substructures of functionally uncharacterized proteins, targets. For the purpose of function prediction, the accuracy of motif matching algorithms can be evaluated with the number of statistically significant matches to functionally related proteins, true positives (TPs), and the number of statistically insignificant matches to functionally unrelated proteins, false positives (FPs). Our earlier work developed cavity-aware motifs which use motif points to represent functionally significant atoms and C-spheres to represent functionally significant volumes. We observed that cavity-aware motifs match significantly fewer FPs than matches containing only motif points. We also observed that high-impact C-spheres, which significantly contribute to the reduction of FPs, can be isolated automatically with a technique we call Cavity Scaling. This paper extends our earlier work by demonstrating that C-spheres can be used to accelerate point-based geometric and chemical comparison algorithms, maintaining accuracy while reducing runtime. We also demonstrate that the placement of C-spheres can significantly affect the number of TPs and FPs identified by a cavity-aware motif. While the optimal placement of C-spheres remains a difficult open problem, we compared two logical placement strategies to better understand C-sphere placement.


Subject(s)
Algorithms , Amino Acid Motifs , Models, Chemical , Pattern Recognition, Automated/methods , Proteins/chemistry , Sequence Analysis, Protein/methods , Amino Acid Sequence , Computer Simulation , Molecular Sequence Data , Structure-Activity Relationship
10.
Article in English | MEDLINE | ID: mdl-17369649

ABSTRACT

Determining the function of proteins is a problem with immense practical impact on the identification of inhibition targets and the causes of side effects. Unfortunately, experimental determination of protein function is expensive and time consuming. For this reason, algorithms for computational function prediction have been developed to focus and accelerate this effort. These algorithms are comparison techniques which identify matches of geometric and chemical similarity between motifs, representing known functional sites, and substructures of functionally uncharacterized proteins (targets). Matches of statistically significant geometric and chemical similarity can identify targets with active sites cognate to the matching motif. Unfortunately statistically significant matches can include false positive matches to functionally unrelated proteins. We target this problem by presenting Cavity Aware Match Augmentation (CAMA), a technique which uses C-spheres to represent active clefts which must remain vacant for ligand binding. CAMA rejects matches to targets without similar binding volumes. On 18 sample motifs, we observed that introducing C-spheres eliminated 80% of false positive matches and maintained 87% of true positive matches found with identical motifs lacking C-spheres. Analyzing a range of C-sphere positions and sizes, we observed that some high-impact C- spheres eliminate more false positive matches than others. High-impact C-spheres can be detected with a geometric analysis we call Cavity Scaling, permitting us to refine our initial cavity-aware motifs to contain only high-impact C-spheres. In the absence of expert knowledge, Cavity Scaling can guide the design of cavity-aware motifs to eliminate many false positive matches.


Subject(s)
Computational Biology/methods , Proteomics/methods , Algorithms , Amino Acid Motifs , Animals , Carbon/chemistry , Computer Simulation , Databases, Protein , False Positive Reactions , Models, Molecular , Models, Statistical , Models, Theoretical , Molecular Conformation , Protein Conformation , Reproducibility of Results
SELECTION OF CITATIONS
SEARCH DETAIL
...