1.
Proc Mach Learn Res ; 89: 3449-3458, 2019 Apr.
Article in English | MEDLINE | ID: mdl-31497776

ABSTRACT

Covariate shift is a prevalent setting for supervised learning in the wild when the training and test data are drawn from different time periods, different but related domains, or via different sampling strategies. This paper addresses a transfer learning setting with covariate shift between source and target domains. Most existing methods for correcting covariate shift exploit density ratios of the features to reweight the source-domain data, and when the features are high-dimensional, the estimated density ratios may suffer from large estimation variance, leading to poor prediction performance. In this work, we investigate the dependence of covariate shift correction performance on the dimensionality of the features, and propose a correction method that finds a low-dimensional representation of the features, which takes into account the features relevant to the target Y, and exploits the density ratio of this representation for importance reweighting. We discuss the factors affecting the performance of our method and demonstrate its capabilities on both pseudo-real and real-world data.
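
The density-ratio reweighting mentioned in the abstract can be illustrated with a minimal sketch. It is a generic importance-weighting example, not the paper's low-dimensional method: the source and target densities are assumed to be known 1-D Gaussians (N(0,1) and N(1,1)), and all function names are hypothetical. In practice the ratio must be estimated from data, which is where the variance problem described above arises.

```python
import math
import random

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def importance_weights(xs, source_pdf, target_pdf):
    """Density ratio w(x) = p_target(x) / p_source(x) for each source sample."""
    return [target_pdf(x) / source_pdf(x) for x in xs]

def reweighted_mean(values, weights):
    """Self-normalized importance-weighted average of values."""
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

# Assumption for illustration: both densities are known exactly.
random.seed(0)
source_pdf = lambda x: gaussian_pdf(x, 0.0, 1.0)
target_pdf = lambda x: gaussian_pdf(x, 1.0, 1.0)

# Samples come from the source domain only.
xs = [random.gauss(0.0, 1.0) for _ in range(20000)]
weights = importance_weights(xs, source_pdf, target_pdf)

# The reweighted source samples estimate the target-domain mean (true value 1.0).
estimate = reweighted_mean(xs, weights)
print(estimate)
```

The same weights could multiply per-example losses when fitting a model on source data, so that the fitted model targets the test distribution.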

2.
Proc Mach Learn Res ; 89: 3487-3496, 2019 Apr.
Article in English | MEDLINE | ID: mdl-31497777

ABSTRACT

A key problem in domain adaptation is determining what to transfer across different domains. We propose a data-driven method to represent these changes across multiple source domains and perform unsupervised domain adaptation. We assume that the joint distributions follow a specific generating process and have a small number of identifiable changing parameters, and develop a data-driven method to identify the changing parameters by learning low-dimensional representations of the changing class-conditional distributions across multiple source domains. The learned low-dimensional representations enable us to reconstruct the target-domain joint distribution from unlabeled target-domain data, and further enable predicting the labels in the target domain. We demonstrate the efficacy of this method by conducting experiments on synthetic and real datasets.

3.
J Comput Biol ; 24(6): 501-514, 2017 Jun.
Article in English | MEDLINE | ID: mdl-28128642

ABSTRACT

Disease-causing pathogens such as viruses introduce their proteins into the host cells in which they interact with the host's proteins, enabling the virus to replicate inside the host. These interactions between pathogen and host proteins are key to understanding infectious diseases. Often multiple diseases involve phylogenetically related or biologically similar pathogens. Here we present a multitask learning method to jointly model interactions between human proteins and three different but related viruses: Hepatitis C, Ebola virus, and Influenza A. Our multitask matrix completion-based model uses a shared low-rank structure in addition to a task-specific sparse structure to incorporate the various interactions. We obtain between 7 and 39 percentage points improvement in predictive performance over prior state-of-the-art models. We show how our model's parameters can be interpreted to reveal both general and specific interaction-relevant characteristics of the viruses. Our code is available online.


Subject(s)
Algorithms , Computational Biology/methods , Host-Pathogen Interactions , Protein Interaction Maps , Proteins/metabolism , Databases, Protein , Ebolavirus/metabolism , Hepacivirus/metabolism , Humans , Influenza A virus/metabolism , Models, Molecular , Protein Conformation
4.
Pac Symp Biocomput ; : 318-29, 2015.
Article in English | MEDLINE | ID: mdl-25592592

ABSTRACT

The availability of high-quality physical interaction datasets is a prerequisite for system-level analysis of interactomes and supervised models to predict protein-protein interactions (PPIs). One source is literature-curated PPI databases in which pairwise associations of proteins published in the scientific literature are deposited. However, PPIs may not be clearly labelled as physical interactions, which affects the quality of the entire dataset. In order to obtain a high-quality gold standard dataset for PPIs between human immunodeficiency virus (HIV-1) and its human host, we adopted a crowd-sourcing approach. We collected expert opinions and utilized an expectation-maximization based approach to estimate expert labeling quality. These estimates are used to infer the probability of a reported PPI actually being a direct physical interaction given the set of expert opinions. The effectiveness of our approach is demonstrated through synthetic data experiments, and a high-quality physical interaction network between HIV and human proteins is obtained. Since many literature-curated databases suffer from similar challenges, the framework described herein could be utilized in refining other databases. The curated data is available at http://www.cs.bilkent.edu.tr/~oznur.tastan/supp/psb2015/.


Subject(s)
Databases, Protein/statistics & numerical data , Protein Interaction Maps , Computational Biology , Crowdsourcing , Expert Testimony , HIV-1/pathogenicity , HIV-1/physiology , Host-Pathogen Interactions , Human Immunodeficiency Virus Proteins/physiology , Humans , Knowledge Discovery , Likelihood Functions , Models, Statistical , Systems Analysis
5.
Proteins ; 79(4): 1061-78, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21268112

ABSTRACT

We introduce a new approach to learning statistical models from multiple sequence alignments (MSA) of proteins. Our method, called GREMLIN (Generative REgularized ModeLs of proteINs), learns an undirected probabilistic graphical model of the amino acid composition within the MSA. The resulting model encodes both the position-specific conservation statistics and the correlated mutation statistics between sequential and long-range pairs of residues. Existing techniques for learning graphical models from MSA either make strong, and often inappropriate assumptions about the conditional independencies within the MSA (e.g., Hidden Markov Models), or else use suboptimal algorithms to learn the parameters of the model. In contrast, GREMLIN makes no a priori assumptions about the conditional independencies within the MSA. We formulate and solve a convex optimization problem, thus guaranteeing that we find a globally optimal model at convergence. The resulting model is also generative, allowing for the design of new protein sequences that have the same statistical properties as those in the MSA. We perform a detailed analysis of covariation statistics on the extensively studied WW and PDZ domains and show that our method out-performs an existing algorithm for learning undirected probabilistic graphical models from MSA. We then apply our approach to 71 additional families from the PFAM database and demonstrate that the resulting models significantly out-perform Hidden Markov Models in terms of predictive accuracy.


Subject(s)
Models, Chemical , Protein Folding , Proteins/chemistry , Sequence Alignment/methods , Amino Acid Sequence , Area Under Curve , Computational Biology , Computer Graphics , Computer Simulation , Markov Chains , Models, Molecular , Models, Statistical , PDZ Domains , Sequence Analysis, Protein , Structure-Activity Relationship
6.
Bioinformatics ; 26(18): i645-52, 2010 Sep 15.
Article in English | MEDLINE | ID: mdl-20823334

ABSTRACT

MOTIVATION: Protein-protein interactions (PPIs) are critical for virtually every biological function. Recently, researchers suggested using supervised learning for the task of classifying pairs of proteins as interacting or not. However, its performance is largely restricted by the availability of truly interacting proteins (labeled). Meanwhile, there exist a considerable number of protein pairs where an association appears between two partners, but there is not enough experimental evidence to support it as a direct interaction (partially labeled). RESULTS: We propose a semi-supervised multi-task framework for predicting PPIs from not only labeled, but also partially labeled reference sets. The basic idea is to perform multi-task learning on a supervised classification task and a semi-supervised auxiliary task. The supervised classifier trains a multi-layer perceptron network for PPI predictions from labeled examples. The semi-supervised auxiliary task shares network layers of the supervised classifier and trains with partially labeled examples. Semi-supervision could be utilized in multiple ways. We tried three approaches in this article: (i) classification (to distinguish partial positives from negatives); (ii) ranking (to rate partial positives as more likely than negatives); (iii) embedding (to make data clusters get similar labels). We applied this framework to improve the identification of interacting pairs between HIV-1 and human proteins. Our method improved upon the state-of-the-art method for this task, indicating the benefits of semi-supervised multi-task learning using auxiliary information. AVAILABILITY: http://www.cs.cmu.edu/~qyj/HIVsemi.


Subject(s)
Artificial Intelligence , Computational Biology/methods , HIV-1/physiology , Human Immunodeficiency Virus Proteins/metabolism , Protein Interaction Mapping/methods , Proteins/metabolism , Algorithms , Data Interpretation, Statistical , Humans , Models, Statistical
7.
BMC Bioinformatics ; 11 Suppl 1: S57, 2010 Jan 18.
Article in English | MEDLINE | ID: mdl-20122232

ABSTRACT

BACKGROUND: Biological processes in cells are carried out by means of protein-protein interactions. Determining whether a pair of proteins interacts by wet-lab experiments is resource-intensive; only about 38,000 interactions, out of a few hundred thousand expected interactions, are known today. Active machine learning can guide the selection of pairs of proteins for future experimental characterization in order to accelerate accurate prediction of the human protein interactome. RESULTS: Random forest (RF) has previously been shown to be effective for predicting protein-protein interactions. Here, four different active learning algorithms have been devised for selection of protein pairs to be used to train the RF. With labels of as few as 500 protein pairs selected using any of the four active learning methods described here, the classifier achieved a higher F-score (harmonic mean of Precision and Recall) than with 3000 randomly chosen protein pairs. The F-score of predicted interactions is shown to increase by about 15% with active learning in comparison to that with random selection of data. CONCLUSION: Active learning algorithms enable learning more accurate classifiers with much less labelled data and prove to be useful in applications where manual annotation of data is formidable. Active learning techniques demonstrated here can also be applied to other proteomics applications such as protein structure prediction and classification.
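
The F-score used above is defined in the abstract as the harmonic mean of Precision and Recall. A minimal sketch of that formula follows; the function name and the zero-division convention are assumptions for illustration and are not taken from the paper.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the balanced F-score)."""
    if precision + recall == 0:
        # Convention assumed here: define the score as 0 when both inputs are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean is pulled toward the smaller of the two values.
print(f_score(0.5, 0.5))
print(f_score(1.0, 0.5))
```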


Subject(s)
Algorithms , Proteins/chemistry , Proteomics/methods , Databases, Protein , Humans , Protein Conformation , Proteins/classification
8.
BMC Bioinformatics ; 11 Suppl 1: S58, 2010 Jan 18.
Article in English | MEDLINE | ID: mdl-20122233

ABSTRACT

BACKGROUND: About 30% of genes code for membrane proteins, which are involved in a wide variety of crucial biological functions. Despite their importance, experimentally determined structures correspond to only about 1.7% of protein structures deposited in the Protein Data Bank due to the difficulty in crystallizing membrane proteins. Algorithms that can identify proteins whose high-resolution structure can aid in predicting the structure of many previously unresolved proteins are therefore of potentially high value. Active machine learning is a supervised machine learning approach which is suitable for this domain where there are a large number of sequences but only very few have known corresponding structures. In essence, active learning seeks to identify proteins whose structure, if revealed experimentally, is maximally predictive of others. RESULTS: An active learning approach is presented for selection of a minimal set of proteins whose structures can aid in the determination of transmembrane helices for the remaining proteins. TMpro, an algorithm for high accuracy TM helix prediction we previously developed, is coupled with active learning. We show that with a well-designed selection procedure, high accuracy can be achieved with only a few proteins. TMpro, trained with a single protein, achieved an F-score of 94% on benchmark evaluation and 91% on the MPtopo dataset, which correspond to the state-of-the-art accuracies on TM helix prediction that are usually achieved by training with over 100 training proteins. CONCLUSION: Active learning is suitable for bioinformatics applications, where manually characterized data are not a comprehensive representation of all possible data, and in fact can be a very sparse subset thereof. It aids in selection of data instances which, when characterized experimentally, can improve the accuracy of computational characterization of remaining raw data. The results presented here also demonstrate that the feature extraction method of TMpro is well designed, achieving a very good separation between TM and non-TM segments.


Subject(s)
Algorithms , Artificial Intelligence , Membrane Proteins/chemistry , Protein Structure, Secondary , Databases, Protein , Protein Folding , Sequence Analysis, Protein
9.
Pac Symp Biocomput ; : 516-27, 2009.
Article in English | MEDLINE | ID: mdl-19209727

ABSTRACT

Human immunodeficiency virus-1 (HIV-1) in acquired immune deficiency syndrome (AIDS) relies on human host cell proteins in virtually every aspect of its life cycle. Knowledge of the set of interacting human and viral proteins would greatly contribute to our understanding of the mechanisms of infection and subsequently to the design of new therapeutic approaches. This work is the first attempt to predict the global set of interactions between HIV-1 and human host cellular proteins. We propose a supervised learning framework, where multiple information data sources are utilized, including co-occurrence of functional motifs and their interaction domains and protein classes, gene ontology annotations, posttranslational modifications, tissue distributions and gene expression profiles, topological properties of the human protein in the interaction network and the similarity of HIV-1 proteins to human proteins' known binding partners. We trained and tested a Random Forest (RF) classifier with this extensive feature set. The model's predictions achieved an average Mean Average Precision (MAP) score of 23%. Among the predicted interactions was, for example, the pair of HIV-1 protein tat and human vitamin D receptor. This interaction had recently been independently validated experimentally. The rank-ordered lists of predicted interacting pairs are a rich source for generating biological hypotheses. Amongst the novel predictions, transcription regulator activity, immune system process and macromolecular complex were the most significant molecular function, process and cellular compartment, respectively. Supplementary material is available at URL www.cs.cmu.edu/~oznur/hiv/hivPPI.html


Subject(s)
HIV-1/physiology , HIV-1/pathogenicity , Human Immunodeficiency Virus Proteins/physiology , Protein Interaction Mapping/statistics & numerical data , Artificial Intelligence , Biometry , Databases, Protein , HIV Infections/genetics , HIV Infections/physiopathology , HIV Infections/virology , HIV-1/genetics , Host-Pathogen Interactions/genetics , Host-Pathogen Interactions/physiology , Human Immunodeficiency Virus Proteins/genetics , Humans , Ligands , Models, Biological , Protein Binding , RNA, Small Interfering/genetics
10.
Proteins ; 58(4): 955-70, 2005 Mar 01.
Article in English | MEDLINE | ID: mdl-15645499

ABSTRACT

The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify due to extreme diversity among its members. Previous comparisons of BLAST, k-nearest neighbor (k-NN), hidden Markov model (HMM) and support vector machine (SVM) using alignment-based features have suggested that classifiers at the complexity of SVM are needed to attain high accuracy. Here, analogous to document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e. short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained an accuracy of 93.0% and 92.4% in level I and level II subfamily classification respectively, while SVM has a reported accuracy of 88.4% and 86.3%. This is a 39.7% and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, our method can be generalized to other protein families by applying it to the superfamily of nuclear receptors with 94.5%, 97.8% and 93.6% accuracy in family, level I and level II subfamily classification respectively.
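
The residual-error reductions quoted above follow from the reported accuracies. As a worked check, assuming "residual error" means 100% minus accuracy (the abstract does not spell out the definition), the arithmetic can be sketched as follows; the function name is hypothetical.

```python
def residual_error_reduction(baseline_acc, new_acc):
    """Percentage reduction in residual error (100 - accuracy) relative to the baseline."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err

# Accuracies (in percent) taken from the abstract: SVM baseline vs. Naive Bayes.
level_1 = residual_error_reduction(88.4, 93.0)  # (11.6 - 7.0) / 11.6
level_2 = residual_error_reduction(86.3, 92.4)  # (13.7 - 7.6) / 13.7
print(round(level_1, 1), round(level_2, 1))  # → 39.7 44.5
```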


Subject(s)
Biotechnology/methods , Computational Biology/methods , Proteins/chemistry , Proteomics/methods , Receptors, G-Protein-Coupled/chemistry , Algorithms , Animals , Bayes Theorem , Cell Membrane/metabolism , Cell Nucleus/metabolism , Databases, Protein , Decision Trees , Genome, Human , Humans , Markov Chains , Models, Biological , Models, Statistical , Molecular Sequence Data , Protein Structure, Tertiary , Reproducibility of Results , Sequence Alignment , Sequence Analysis, Protein , Software , Terminology as Topic