Search | VHL Regional Portal

Manifold Learning for Multivariate Variable-Length Sequences With an Application to Similarity Search.

Ho, Shen-Shyang; Dai, Peng; Rudzicz, Frank.

IEEE Trans Neural Netw Learn Syst ; 27(6): 1333-44, 2016 06.

Article in English | MEDLINE | ID: mdl-25781959

ABSTRACT

Multivariate variable-length sequence data are becoming ubiquitous with the technological advancement in mobile devices and sensor networks. Such data are difficult to compare, visualize, and analyze due to the nonmetric nature of data sequence similarity measures. In this paper, we propose a general manifold learning framework for arbitrary-length multivariate data sequences driven by similarity/distance (parameter) learning in both the original data sequence space and the learned manifold. Our proposed algorithm transforms the data sequences in a nonmetric data sequence space into feature vectors in a manifold that preserves the data sequence space structure. In particular, the feature vectors in the manifold representing similar data sequences remain close to one another and far from the feature points corresponding to dissimilar data sequences. To achieve this objective, we assume a semisupervised setting where we have knowledge about whether some of the data sequences are similar or dissimilar, called the instance-level constraints. Using this information, one learns the similarity measure for the data sequence space and the distance measures for the manifold. Moreover, we describe an approach to handle the similarity search problem given user-defined instance-level constraints in the learned manifold using a consensus voting scheme. Experimental results on both synthetic data and real tropical cyclone sequence data are presented to demonstrate the feasibility of our manifold learning framework and the robustness of performing similarity search in the learned manifold.
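To make the "nonmetric similarity measure" point concrete, the sketch below computes a dynamic time warping (DTW) cost between two multivariate sequences of different lengths. DTW is an assumption chosen for illustration only: the abstract does not name the similarity measure the authors use. DTW is a common example of a sequence measure that is not guaranteed to satisfy the triangle inequality. The function name `dtw_distance` is hypothetical.

```python
import math

def dtw_distance(a, b):
    """DTW cost between two sequences of equal-dimension points.

    Illustrative stand-in for a nonmetric sequence similarity measure;
    not the measure learned in the paper.
    """
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])  # Euclidean point distance
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats its point
                cost[i][j - 1],      # b advances, a repeats its point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# Two 2-D sequences of different lengths that trace the same path.
short_seq = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
long_seq = [(0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
print(dtw_distance(short_seq, long_seq))
```

Because the two example sequences differ only by a repeated point, the warping absorbs the length difference, which is the property that makes such measures useful for variable-length data.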

ML-Tree: a tree-structure-based approach to multilabel learning.

Wu, Qingyao; Ye, Yunming; Zhang, Haijun; Chow, Tommy W S; Ho, Shen-Shyang.

IEEE Trans Neural Netw Learn Syst ; 26(3): 430-43, 2015 Mar.

Article in English | MEDLINE | ID: mdl-25546863

ABSTRACT

Multilabel learning aims to predict labels of unseen instances by learning from training samples that are associated with a set of known labels. In this paper, we propose to use a hierarchical tree model for multilabel learning, and to develop the ML-Tree algorithm for finding the tree structure. ML-Tree considers a tree as a hierarchy of data and constructs the tree using the induction of one-against-all SVM classifiers at each node to recursively partition the data into child nodes. For each node, we define a predictive label vector to represent the predictive label transmission in the tree model for multilabel prediction and automatic discovery of the label relationships. If two labels co-occur frequently as predictive labels at leaf nodes, these labels are presumed to be related. The amount of predictive label co-occurrence provides an estimate of the label relationships. We examine the ML-Tree method on 11 real data sets from different domains and compare it with six well-established multilabel learning algorithms. The performance of these approaches is evaluated by 16 commonly used measures. We also conduct Friedman and Nemenyi tests to assess the statistical significance of the differences in performance. Experimental results demonstrate the effectiveness of our method.
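The co-occurrence idea in this abstract can be sketched as a simple pair count over leaf nodes. This is a minimal illustration, assuming each leaf has already been reduced to a set of predictive labels; the abstract does not specify how the predictive label vector is computed or thresholded, and the function name and example labels are hypothetical.

```python
from collections import Counter
from itertools import combinations

def label_cooccurrence(leaf_label_sets):
    """Count how often each unordered label pair appears together at a leaf."""
    counts = Counter()
    for labels in leaf_label_sets:
        # Sort so that each pair has one canonical (a, b) ordering.
        for pair in combinations(sorted(set(labels)), 2):
            counts[pair] += 1
    return counts

# Hypothetical predictive label sets at four leaf nodes.
leaves = [
    {"beach", "sea"},
    {"beach", "sea", "sunset"},
    {"mountain"},
    {"sea", "sunset"},
]
cooccurrence = label_cooccurrence(leaves)
for pair, count in cooccurrence.most_common():
    print(pair, count)
```

Pairs with higher counts would be read as more strongly related labels; any normalization of these counts is left out here because the abstract does not describe one.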

Semi-supervised multi-label collective classification ensemble for functional genomics.

Wu, Qingyao; Ye, Yunming; Ho, Shen-Shyang; Zhou, Shuigeng.

BMC Genomics ; 15 Suppl 9: S17, 2014.

Article in English | MEDLINE | ID: mdl-25521242

ABSTRACT

BACKGROUND: With the rapid accumulation of proteomic and genomic datasets in terms of genome-scale features and interaction networks through high-throughput experimental techniques, the process of manually predicting the functional properties of proteins has become increasingly cumbersome, and computational methods to automate this annotation task are urgently needed. Most approaches to predicting functional properties of proteins require either identifying a reliable set of labeled proteins with attribute features similar to those of the unannotated proteins, or learning from a fully labeled protein interaction network with a large amount of labeled data. However, acquiring such labels can be very difficult in practice, especially for multi-label protein function prediction problems. Learning with only a few labeled data can lead to poor performance, as only limited supervision knowledge can be obtained from similar proteins or from connections between them. To effectively annotate proteins even when labeled data are scarce, it is important to take advantage of all data sources that are available in this problem setting, including interaction networks, attribute feature information, correlations of functional labels, and unlabeled data. RESULTS: In this paper, we show that the underlying nature of predicting functional properties of proteins using various data sources of relational data is a typical collective classification (CC) problem in machine learning. The protein functional prediction task with limited annotation is then cast into a semi-supervised multi-label collective classification (SMCC) framework. As such, we propose a novel generative-model-based SMCC algorithm, called GM-SMCC, to effectively compute the label probability distributions of unannotated protein instances and predict their functional properties. To further boost the predictive performance, we extend the method in an ensemble manner, called EGM-SMCC, by utilizing multiple heterogeneous networks with various latent linkages constructed to explicitly model the relationships among the nodes, so as to effectively propagate the supervision knowledge from labeled to unlabeled nodes. CONCLUSION: Experimental results on a yeast gene dataset predicting the functions and localization of proteins demonstrate the effectiveness of the proposed method. In the comparison, we find that the proposed algorithms perform better than the other compared algorithms.

Subject(s)

Artificial Intelligence , Genomics/methods , Algorithms , Molecular Sequence Annotation , Probability , Protein Interaction Mapping , Yeasts/genetics , Yeasts/metabolism

Collective prediction of protein functions from protein-protein interaction networks.

Wu, Qingyao; Ye, Yunming; Ng, Michael K; Ho, Shen-Shyang; Shi, Ruichao.

BMC Bioinformatics ; 15 Suppl 2: S9, 2014.

Article in English | MEDLINE | ID: mdl-24564855

ABSTRACT

BACKGROUND: Automated assignment of functions to unknown proteins is one of the most important tasks in computational biology. The development of experimental methods for genome-scale analysis of molecular interaction networks offers new ways to infer protein function from protein-protein interaction (PPI) network data. Existing techniques for collective classification (CC) usually increase accuracy for network data, wherein instances are interlinked with each other, using a large amount of labeled data for training. However, labeled data are time-consuming and expensive to obtain. On the other hand, one can easily obtain a large amount of unlabeled data. Thus, more sophisticated methods are needed to exploit the unlabeled data to increase accuracy for protein function prediction. RESULTS: In this paper, we propose an effective Markov-chain-based CC algorithm (ICAM) to tackle the label deficiency problem in CC for interrelated proteins from PPI networks. Our idea is to model the problem using two distinct Markov chain classifiers to make separate predictions with regard to attribute features from protein data and relational features from relational information. The ICAM learning algorithm combines the results of the two classifiers to compute the ranks of labels to indicate the importance of a set of labels to an instance, and uses an ICA framework to iteratively refine the learning models for improving performance of protein function prediction from PPI networks when labeled data are scarce. CONCLUSION: Experimental results on the real-world Yeast protein-protein interaction datasets show that our proposed ICAM method is better than the other ICA-type methods given limited labeled training data. This approach can serve as a valuable tool for the study of protein function prediction from PPI networks.

Subject(s)

Algorithms , Protein Interaction Mapping/methods , Proteins/physiology , Markov Chains

A martingale framework for detecting changes in data streams by testing exchangeability.

Ho, Shen-Shyang; Wechsler, Harry.

IEEE Trans Pattern Anal Mach Intell ; 32(12): 2113-27, 2010 Dec.

Article in English | MEDLINE | ID: mdl-20975112

ABSTRACT

In a data streaming setting, data points are observed sequentially. The data-generating model may change as the data are streaming. In this paper, we propose detecting this change in data streams by testing the exchangeability property of the observed data. Our martingale approach is an efficient, nonparametric, one-pass algorithm that is effective on the classification, cluster, and regression data-generating models. Experimental results show the feasibility and effectiveness of the martingale methodology in detecting changes in the data-generating model for time-varying data streams. Moreover, we also show that: 1) an adaptive support vector machine (SVM) utilizing the martingale methodology compares favorably against an adaptive SVM utilizing a sliding window, and 2) a multiple martingale video-shot change detector compares favorably against standard shot-change detection algorithms.
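The general idea of testing exchangeability with a martingale can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a randomized power martingale built from p-values, a strangeness score equal to the distance from the mean of the points seen so far (an unlabeled, one-dimensional case), and the parameter values `epsilon=0.92` and `threshold=20.0`. The function name, the strangeness measure, and the parameter values are all assumptions. For clarity the sketch recomputes every strangeness score at each step, so it is not one-pass.

```python
import random

def detect_changes(stream, epsilon=0.92, threshold=20.0, seed=0):
    """Return the indices at which the martingale exceeds the threshold."""
    rng = random.Random(seed)
    window = []        # points observed since the last alarm
    martingale = 1.0
    alarms = []
    for t, x in enumerate(stream):
        window.append(x)
        mean = sum(window) / len(window)
        # Assumed strangeness: distance from the mean of the current window.
        scores = [abs(v - mean) for v in window]
        current = scores[-1]
        greater = sum(1 for s in scores if s > current)
        equal = sum(1 for s in scores if s == current)
        # Randomized p-value; roughly uniform while the data are exchangeable.
        p = (greater + rng.random() * equal) / len(window)
        p = max(p, 1e-12)  # guard against a zero p-value
        # Power martingale update: small p-values make the value grow.
        martingale *= epsilon * p ** (epsilon - 1.0)
        if martingale > threshold:
            alarms.append(t)
            window = []       # restart the test after an alarm
            martingale = 1.0
    return alarms

# Synthetic stream: the mean shifts from 0 to 10 at index 100.
data_rng = random.Random(1)
stream = [data_rng.gauss(0.0, 1.0) for _ in range(100)]
stream += [data_rng.gauss(10.0, 1.0) for _ in range(100)]
alarms = detect_changes(stream)
print(alarms)
```

Under exchangeability the martingale exceeds a threshold λ with probability at most 1/λ (Doob's maximal inequality), so the threshold trades detection delay against the false-alarm rate.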

Query by transduction.

Ho, Shen-Shyang; Wechsler, Harry.

IEEE Trans Pattern Anal Mach Intell ; 30(9): 1557-71, 2008 Sep.

Article in English | MEDLINE | ID: mdl-18617715

ABSTRACT

There has recently been growing interest in the use of transductive inference for learning. We expand here the scope of transductive inference to active learning in a stream-based setting. Toward that end, this paper proposes Query-by-Transduction (QBT) as a novel active learning algorithm. QBT queries the label of an example based on the p-values obtained using transduction. We show that QBT is closely related to Query-by-Committee (QBC) using relations between transduction, Bayesian statistical testing, Kullback-Leibler divergence, and Shannon information. The feasibility and utility of QBT are shown on both binary and multi-class classification tasks using SVM as the choice classifier. Our experimental results show that QBT compares favorably, in terms of mean generalization, against random sampling, committee-based active learning, margin-based active learning, and QBC in the stream-based setting.
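A p-value-based query rule of this kind can be sketched as below. Several parts are assumptions, since the abstract gives no details: the strangeness score is a nearest-neighbor ratio rather than anything SVM-based, the rule "query when the two largest p-values differ by less than `delta`" is one plausible reading of "based on the p-values", and all function names and the `delta` value are hypothetical.

```python
import math

def strangeness(point, label, others):
    """Nearest same-label distance divided by nearest other-label distance."""
    same = min((math.dist(point, x) for x, y in others if y == label),
               default=math.inf)
    diff = min((math.dist(point, x) for x, y in others if y != label),
               default=math.inf)
    if math.isinf(same) or diff == 0.0:
        return math.inf
    return same / diff

def p_value(new_point, label, labeled):
    """Transductive p-value of assigning `label` to `new_point`."""
    augmented = labeled + [(new_point, label)]
    scores = []
    for i, (x, y) in enumerate(augmented):
        rest = augmented[:i] + augmented[i + 1:]
        scores.append(strangeness(x, y, rest))
    # Fraction of examples at least as strange as the new one.
    return sum(1 for s in scores if s >= scores[-1]) / len(augmented)

def should_query(new_point, labeled, labels, delta=0.2):
    """Ask for the true label when the top two p-values are close."""
    ranked = sorted((p_value(new_point, y, labeled) for y in labels),
                    reverse=True)
    return ranked[0] - ranked[1] < delta

# Hypothetical 1-D training set with two well-separated classes.
labeled = [((v,), 0) for v in (0.0, 0.5, 1.0, 1.5, 2.0)]
labeled += [((v,), 1) for v in (10.0, 10.5, 11.0, 11.5, 12.0)]
print(should_query((1.2,), labeled, labels=[0, 1]))  # deep inside class 0
print(should_query((6.0,), labeled, labels=[0, 1]))  # midway between classes
```

In a stream-based setting this check would run once per incoming example, and only examples that trigger a query would be sent for labeling and added to the labeled set.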

Subject(s)

Algorithms , Artificial Intelligence , Information Storage and Retrieval/methods , Pattern Recognition, Automated/methods , Sensitivity and Specificity
