Search | VHL Regional Portal

Protein language models can capture protein quaternary state.

Avraham, Orly; Tsaban, Tomer; Ben-Aharon, Ziv; Tsaban, Linoy; Schueler-Furman, Ora.

BMC Bioinformatics ; 24(1): 433, 2023 Nov 14.

Article in English | MEDLINE | ID: mdl-37964216

ABSTRACT

BACKGROUND: Determining a protein's quaternary state, i.e. the number of monomers in a functional unit, is a critical step in protein characterization. Many proteins form multimers for their activity, and over 50% are estimated to naturally form homomultimers. Experimental quaternary state determination can be challenging and require extensive work. To complement these efforts, a number of computational tools have been developed for quaternary state prediction, often utilizing experimentally validated structural information. Recently, dramatic advances have been made in the field of deep learning for predicting protein structure and other characteristics. Protein language models, such as ESM-2, that apply computational natural-language models to proteins successfully capture secondary structure, protein cell localization and other characteristics, from a single sequence. Here we hypothesize that information about the protein quaternary state may be contained within protein sequences as well, allowing us to benefit from these novel approaches in the context of quaternary state prediction. RESULTS: We generated ESM-2 embeddings for a large dataset of proteins with quaternary state labels from the curated QSbio dataset. We trained a model for quaternary state classification and assessed it on a non-overlapping set of distinct folds (ECOD family level). Our model, named QUEEN (QUaternary state prediction using dEEp learNing), performs worse than approaches that include information from solved crystal structures. However, it successfully learned to distinguish multimers from monomers, and predicts the specific quaternary state with moderate success, better than simple sequence similarity-based annotation transfer. Our results demonstrate that complex, quaternary state related information is included in such embeddings. CONCLUSIONS: QUEEN is the first to investigate the power of embeddings for the prediction of the quaternary state of proteins. As such, it lays out strengths as well as limitations of a sequence-based protein language model approach, compared to structure-based approaches. Since it does not require any structural information and is fast, we anticipate that it will be of wide use both for in-depth investigation of specific systems, as well as for studies of large sets of protein sequences. A simple colab implementation is available at: https://colab. RESEARCH: google.com/github/Furman-Lab/QUEEN/blob/main/QUEEN_prediction_notebook.ipynb .

Subject(s)

Language , Proteins , Proteins/chemistry , Amino Acid Sequence , Protein Structure, Secondary , Protein Transport

Shared Cancer Dataset Analysis Identifies and Predicts the Quantitative Effects of Pan-Cancer Somatic Driver Variants.

Landau, Jakob; Tsaban, Linoy; Yaacov, Adar; Ben Cohen, Gil; Rosenberg, Shai.

Cancer Res ; 83(1): 74-88, 2023 01 04.

Article in English | MEDLINE | ID: mdl-36264175

ABSTRACT

Driver mutations endow tumors with selective advantages and produce an array of pathogenic effects. Determining the function of somatic variants is important for understanding cancer biology and identifying optimal therapies. Here, we compiled a shared dataset from several cancer genomic databases. Two measures were applied to 535 cancer genes based on observed and expected frequencies of driver variants as derived from cancer-specific rates of somatic mutagenesis. The first measure comprised a binary classifier based on a binomial test; the second was tumor variant amplitude (TVA), a continuous measure representing the selective advantage of individual variants. TVA outperformed all other computational tools in terms of its correlation with experimentally derived functional scores of cancer mutations. TVA also highly correlated with drug response, overall survival, and other clinical implications in relevant cancer genes. This study demonstrates how a selective advantage measure based on a large cancer dataset significantly impacts our understanding of the spectral effect of driver variants in cancer. The impact of this information will increase as cancer treatment becomes more precise and personalized to tumor-specific mutations. SIGNIFICANCE: A new selective advantage estimation assists in oncogenic driver identification and relative effect measurements, enabling better prognostication, therapy selection, and prioritization.

Subject(s)

Computational Biology , Neoplasms , Humans , Mutation , Neoplasms/genetics , Oncogenes

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL