Search | VHL Regional Portal

1.

Unified fair federated learning for digital healthcare.

Zhang, Fengda; Shuai, Zitao; Kuang, Kun; Wu, Fei; Zhuang, Yueting; Xiao, Jun.

Patterns (N Y) ; 5(1): 100907, 2024 Jan 12.

Article in English | MEDLINE | ID: mdl-38264718

ABSTRACT

Federated learning (FL) is a promising approach for healthcare institutions to train high-quality medical models collaboratively while protecting sensitive data privacy. However, FL models encounter fairness issues at diverse levels, leading to performance disparities across different subpopulations. To address this, we propose Federated Learning with Unified Fairness Objective (FedUFO), a unified framework consolidating diverse fairness levels within FL. By leveraging distributionally robust optimization and a unified uncertainty set, it ensures consistent performance across all subpopulations and enhances the overall efficacy of FL in healthcare and other domains while maintaining accuracy levels comparable with those of existing methods. Our model was validated by applying it to four digital healthcare tasks using real-world datasets in federated settings. Our collaborative machine learning paradigm not only promotes artificial intelligence in digital healthcare but also fosters social equity by embodying fairness.

2.

FADngs: Federated Learning for Anomaly Detection.

Dong, Boyu; Chen, Dong; Wu, Yu; Tang, Siliang; Zhuang, Yueting.

IEEE Trans Neural Netw Learn Syst ; PP2024 Jan 19.

Article in English | MEDLINE | ID: mdl-38241100

ABSTRACT

With the increasing demand for data privacy, federated learning (FL) has gained popularity for various applications. Most existing FL works focus on the classification task, overlooking those scenarios where anomaly detection may also require privacy-preserving. Traditional anomaly detection algorithms cannot be directly applied to the FL setting due to false and missing detection issues. Moreover, with common aggregation methods used in FL (e.g., averaging model parameters), the global model cannot keep the capacities of local models in discriminating anomalies deviating from local distributions, which further degrades the performance. For the aforementioned challenges, we propose Federated Anomaly Detection with Noisy Global Density Estimation, and Self-supervised Ensemble Distillation (FADngs). Specifically, FADngs aligns the knowledge of data distributions from each client by sharing processed density functions. Besides, FADngs trains local models in an improved contrastive learning way that learns more discriminative representations specific for anomaly detection based on the shared density functions. Furthermore, FADngs aggregates capacities by ensemble distillation, which distills the knowledge learned from different distributions to the global model. Our experiments demonstrate that the proposed method significantly outperforms state-of-the-art federated anomaly detection methods. We also empirically show that the shared density function is privacy-preserving. The code for the proposed method is provided for research purposes https://github.com/kanade00/Federated_Anomaly_detection.

3.

Attribute-driven streaming edge partitioning with reconciliations for distributed graph neural network training.

Mu, Zongshen; Tang, Siliang; Zhuang, Yueting; Yu, Dianhai.

Neural Netw ; 165: 987-998, 2023 Aug.

Article in English | MEDLINE | ID: mdl-37467586

ABSTRACT

Current distributed graph training frameworks evenly partition a large graph into small chunks to suit distributed storage, leverage a uniform interface to access neighbors, and train graph neural networks in a cluster of machines to update weights. Nevertheless, they consider a separate design of storage and training, resulting in huge communication costs for retrieving neighborhoods. During the storage phase, traditional heuristic graph partitioning not only suffers from memory overhead because of loading the full graph into the memory but also damages semantically related structures because of its neglecting meaningful node attributes. What is more, in the weight-update phase, directly averaging synchronization is difficult to tackle with heterogeneous local models where each machine's data are loaded from different subgraphs, resulting in slow convergence. To solve these problems, we propose a novel distributed graph training approach, attribute-driven streaming edge partitioning with reconciliations (ASEPR), where the local model loads only the subgraph stored on its own machine to make fewer communications. ASEPR firstly clusters nodes with similar attributes in the same partition to maintain semantic structure and keep multihop neighbor locality. Then streaming partitioning combined with attribute clustering is applied to subgraph assignment to alleviate memory overhead. After local graph neural network training on distributed machines, we deploy cross-layer reconciliation strategies for heterogeneous local models to improve the averaged global model by knowledge distillation and contrastive learning. Extensive experiments conducted on four large graph datasets on node classification and link prediction tasks show that our model outperforms DistDGL, with fewer resource requirements and up to quadruple the convergence speed.

Subject(s)

Communication , Learning , Cluster Analysis , Heuristics , Neural Networks, Computer

4.

Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding.

Li, Juncheng; Tang, Siliang; Zhu, Linchao; Zhang, Wenqiao; Yang, Yi; Chua, Tat-Seng; Wu, Fei; Zhuang, Yueting.

IEEE Trans Pattern Anal Mach Intell ; 45(10): 12601-12617, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37155378

ABSTRACT

Temporal grounding is the task of locating a specific segment from an untrimmed video according to a query sentence. This task has achieved significant momentum in the computer vision community as it enables activity grounding beyond pre-defined activity classes by utilizing the semantic diversity of natural language descriptions. The semantic diversity is rooted in the principle of compositionality in linguistics, where novel semantics can be systematically described by combining known words in novel ways (compositional generalization). However, existing temporal grounding datasets are not carefully designed to evaluate the compositional generalizability. To systematically benchmark the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG. We empirically find that they fail to generalize to queries with novel combinations of seen words. We argue that the inherent compositional structure (i.e., composition constituents and their relationships) inside the videos and language is the crucial factor to achieve compositional generalization. Based on this insight, we propose a variational cross-graph reasoning framework that explicitly decomposes video and language into hierarchical semantic graphs, respectively, and learns fine-grained semantic correspondence between the two graphs. Meanwhile, we introduce a novel adaptive structured semantics learning approach to derive the structure-informed and domain-generalizable graph representations, which facilitate the fine-grained semantic correspondence reasoning between the two graphs. To further evaluate the understanding of the compositional structure, we also introduce a more challenging setting, where one of the components in the novel composition is unseen. This requires more sophisticated understanding of the compositional structure to infer the potential semantics of the unseen word based on the other learned composition constituents appearing in both the video and language context, and their relationships. Extensive experiments validate the superior compositional generalizability of our approach, demonstrating its ability to handle queries with novel combinations of seen words as well as novel words in the testing composition.

5.

Unfolding and modeling the recovery process after COVID lockdowns.

Yang, Xuan; Yang, Yang; Tan, Chenhao; Lin, Yinghe; Fu, Zhengzhe; Wu, Fei; Zhuang, Yueting.

Sci Rep ; 13(1): 4131, 2023 03 13.

Article in English | MEDLINE | ID: mdl-36914698

ABSTRACT

Lockdown is a common policy used to deter the spread of COVID-19. However, the question of how our society comes back to life after a lockdown remains an open one. Understanding how cities bounce back from lockdown is critical for promoting the global economy and preparing for future pandemics. Here, we propose a novel computational method based on electricity data to study the recovery process, and conduct a case study on the city of Hangzhou. With the designed Recovery Index, we find a variety of recovery patterns in main sectors. One of the main reasons for this difference is policy; therefore, we aim to answer the question of how policies can best facilitate the recovery of society. We first analyze how policy affects sectors and employ a change-point detection algorithm to provide a non-subjective approach to policy assessment. Furthermore, we design a model that can predict future recovery, allowing policies to be adjusted accordingly in advance. Specifically, we develop a deep neural network, TPG, to model recovery trends, which utilizes the graph structure learning to perceive influences between sectors. Simulation experiments using our model offer insights for policy-making: the government should prioritize supporting sectors that have greater influence on others and are influential on the whole economy.

Subject(s)

COVID-19 , Humans , COVID-19/epidemiology , COVID-19/prevention & control , Communicable Disease Control , Policy Making , Policy , Cities

6.

Elastic Knowledge Distillation by Learning From Recollection.

Fu, Yongjian; Li, Songyuan; Zhao, Hanbin; Wang, Wenfu; Fang, Weihao; Zhuang, Yueting; Pan, Zhijie; Li, Xi.

IEEE Trans Neural Netw Learn Syst ; 34(5): 2647-2658, 2023 May.

Article in English | MEDLINE | ID: mdl-34550892

ABSTRACT

Model performance can be further improved with the extra guidance apart from the one-hot ground truth. To achieve it, recently proposed recollection-based methods utilize the valuable information contained in the past training history and derive a "recollection" from it to provide data-driven prior to guide the training. In this article, we focus on two fundamental aspects of this method, i.e., recollection construction and recollection utilization. Specifically, to meet the various demands of models with different capacities and at different training periods, we propose to construct a set of recollections with diverse distributions from the same training history. After that, all the recollections collaborate together to provide guidance, which is adaptive to different model capacities, as well as different training periods, according to our similarity-based elastic knowledge distillation (KD) algorithm. Without any external prior to guide the training, our method achieves a significant performance gain and outperforms the methods of the same category, even as well as KD with well-trained teacher. Extensive experiments and further analysis are conducted to demonstrate the effectiveness of our method.

7.

Boosting RGB-D Saliency Detection by Leveraging Unlabeled RGB Images.

Wang, Xiaoqiang; Zhu, Lei; Tang, Siliang; Fu, Huazhu; Li, Ping; Wu, Fei; Yang, Yi; Zhuang, Yueting.

IEEE Trans Image Process ; 31: 1107-1119, 2022.

Article in English | MEDLINE | ID: mdl-34990359

ABSTRACT

Training deep models for RGB-D salient object detection (SOD) often requires a large number of labeled RGB-D images. However, RGB-D data is not easily acquired, which limits the development of RGB-D SOD techniques. To alleviate this issue, we present a Dual-Semi RGB-D Salient Object Detection Network (DS-Net) to leverage unlabeled RGB images for boosting RGB-D saliency detection. We first devise a depth decoupling convolutional neural network (DDCNN), which contains a depth estimation branch and a saliency detection branch. The depth estimation branch is trained with RGB-D images and then used to estimate the pseudo depth maps for all unlabeled RGB images to form the paired data. The saliency detection branch is used to fuse the RGB feature and depth feature to predict the RGB-D saliency. Then, the whole DDCNN is assigned as the backbone in a teacher-student framework for semi-supervised learning. Moreover, we also introduce a consistency loss on the intermediate attention and saliency maps for the unlabeled data, as well as a supervised depth and saliency loss for labeled data. Experimental results on seven widely-used benchmark datasets demonstrate that our DDCNN outperforms state-of-the-art methods both quantitatively and qualitatively. We also demonstrate that our semi-supervised DS-Net can further improve the performance, even when using an RGB image with the pseudo depth map.

Subject(s)

Neural Networks, Computer , Supervised Machine Learning , Attention , Humans

8.

Adaptive Spatio-Temporal Graph Enhanced Vision-Language Representation for Video QA.

Jin, Weike; Zhao, Zhou; Cao, Xiaochun; Zhu, Jieming; He, Xiuqiang; Zhuang, Yueting.

IEEE Trans Image Process ; 30: 5477-5489, 2021.

Article in English | MEDLINE | ID: mdl-33950840

ABSTRACT

Vision-language research has become very popular, which focuses on understanding of visual contents, language semantics and relationships between them. Video question answering (Video QA) is one of the typical tasks. Recently, several BERT style pre-training methods have been proposed and shown effectiveness on various vision-language tasks. In this work, we leverage the successful vision-language transformer structure to solve the Video QA problem. However, we do not pre-train it with any video data, because video pre-training requires massive computing resources and is hard to perform with only a few GPUs. Instead, our work aims to leverage image-language pre-training to help with video-language modeling, by sharing a common module design. We further introduce an adaptive spatio-temporal graph to enhance the vision-language representation learning. That is, we adaptively refine the spatio-temporal tubes of salient objects according to their spatio-temporal relations learned through a hierarchical graph convolution process. Finally, we can obtain a number of fine-grained tube-level video object representations, as the visual inputs of the vision-language transformer module. Experiments on three widely used Video QA datasets show that our model achieves the new state-of-the-art results.

9.

End-to-End Video Saliency Detection via a Deep Contextual Spatiotemporal Network.

Wei, Lina; Zhao, Shanshan; Bourahla, Omar Farouk; Li, Xi; Wu, Fei; Zhuang, Yueting; Han, Junwei; Xu, Mingliang.

IEEE Trans Neural Netw Learn Syst ; 32(4): 1691-1702, 2021 Apr.

Article in English | MEDLINE | ID: mdl-33017291

ABSTRACT

As an interesting and important problem in computer vision, learning-based video saliency detection aims to discover the visually interesting regions in a video sequence. Capturing the information within frame and between frame at different aspects (such as spatial contexts, motion information, temporal consistency across frames, and multiscale representation) is important for this task. A key issue is how to jointly model all these factors within a unified data-driven scheme in an end-to-end fashion. In this article, we propose an end-to-end spatiotemporal deep video saliency detection approach, which captures the information on spatial contexts and motion characteristics. Furthermore, it encodes the temporal consistency information across the consecutive frames by implementing a convolutional long short-term memory (Conv-LSTM) model. In addition, the multiscale saliency properties for each frame are adaptively integrated for final saliency prediction in a collaborative feature-pyramid way. Finally, the proposed deep learning approach unifies all the aforementioned parts into an end-to-end joint deep learning scheme. Experimental results demonstrate the effectiveness of our approach in comparison with the state-of-the-art approaches.

10.

Context-aware Graph Label Propagation Network for Saliency Detection.

Ji, Wei; Li, Xi; Wei, Lina; Wu, Fei; Zhuang, Yueting.

IEEE Trans Image Process ; PP2020 Jun 25.

Article in English | MEDLINE | ID: mdl-32746240

ABSTRACT

Recently, a large number of existing methods for saliency detection have mainly focused on designing complex network architectures to aggregate powerful features from backbone networks. However, contextual information is not well utilized, which often causes false background regions and blurred object boundaries. Motivated by these issues, we propose an easyto-implement module that utilizes the edge-preserving ability of superpixels and the graph neural network to interact the context of superpixel nodes. In more detail, we first extract the features from the backbone network and obtain the superpixel information of images. This step is followed by superpixel pooling in which we transfer the irregular superpixel information to a structured feature representation. To propagate the information among the foreground and background regions, we use a graph neural network and self-attention layer to better evaluate the degree of saliency degree. Additionally, an affinity loss is proposed to regularize the affinity matrix to constrain the propagation path. Moreover, we extend our module to a multiscale structure with different numbers of superpixels. Experiments on five challenging datasets show that our approach can improve the performance of three baseline methods in terms of some popular evaluation metrics.

11.

Open-Ended Video Question Answering via Multi-Modal Conditional Adversarial Networks.

Zhao, Zhou; Xiao, Shuwen; Song, Zehan; Lu, Chujie; Xiao, Jun; Zhuang, Yueting.

IEEE Trans Image Process ; 2020 Jan 29.

Article in English | MEDLINE | ID: mdl-32011250

ABSTRACT

As a challenging task in visual information retrieval, open-ended long-form video question answering automatically generates the natural language answer from the referenced video content according to the given question. However, the existing video question answering works mainly focus on the short-form video, which may be ineffectively applied for long-form video question answering directly, due to the insufficiency of modeling the semantic representation of long-form video content. In this paper, we study the problem of open-ended long-form video question answering from the viewpoint of hierarchical multimodal conditional adversarial network learning. We propose the hierarchical attentional encoder network to learn the joint representation of long-form video content and given question with adaptive video segmentation. We then devise the reinforced decoder network to generate the natural language answer for openended video question answering with multi-modal conditional adversarial network learning. We construct three large-scale open-ended video question answering datasets. The extensive experiments validate the effectiveness of our method.

12.

Deep Group-wise Fully Convolutional Network for Co-saliency Detection with Graph Propagation.

Wei, Lina; Zhao, Shanshan; Bourahla, Omar El Farouk; Li, Xi; Wu, Fei; Zhuang, Yueting.

IEEE Trans Image Process ; 2019 Apr 15.

Article in English | MEDLINE | ID: mdl-30998462

ABSTRACT

A key problem in co-saliency detection is how to effectively model the interactive relationship of a whole image group and the individual perspective of each image in a united data-driven manner. In this paper, we propose a group-wise deep co-saliency detection approach to address the co-saliency object discovery problem based on the fully convolutional network (FCN). The proposed approach captures the group-wise interaction information for group images by learning a semantics-aware image representation based on a convolutional neural network, which adaptively learns the group-wise features for co-saliency detection. Furthermore, the proposed approach discovers the collaborative and interactive relationships between group-wise feature representation and single image individual feature representation, and model this in a collaborative learning framework. Then, we set up a unified deep learning scheme to jointly optimize the process of group-wise feature representation learning and the collaborative learning, leading to more reliable and robust co-saliency detection results. Finally, we present a graph Laplacian regularized nonlinear regression model for saliency refinement. Experimental results demonstrate the effectiveness of our approach in comparison with the state-of-the-art approaches.

13.

Identifying Objective and Subjective Words via Topic Modeling.

Wang, Hanqi; Wu, Fei; Lu, Weiming; Yang, Yi; Li, Xi; Li, Xuelong; Zhuang, Yueting.

IEEE Trans Neural Netw Learn Syst ; 29(3): 718-730, 2018 03.

Article in English | MEDLINE | ID: mdl-28103560

ABSTRACT

It is observed that distinct words in a given document have either strong or weak ability in delivering facts (i.e., the objective sense) or expressing opinions (i.e., the subjective sense) depending on the topics they associate with. Motivated by the intuitive assumption that different words have varying degree of discriminative power in delivering the objective sense or the subjective sense with respect to their assigned topics, a model named as dentified bjective- ubjective latent Dirichlet allocation (LDA) ( osLDA) is proposed in this paper. In the osLDA model, the simple Pólya urn model adopted in traditional topic models is modified by incorporating it with a probabilistic generative process, in which the novel "Bag-of-Discriminative-Words" (BoDW) representation for the documents is obtained; each document has two different BoDW representations with regard to objective and subjective senses, respectively, which are employed in the joint objective and subjective classification instead of the traditional Bag-of-Topics representation. The experiments reported on documents and images demonstrate that: 1) the BoDW representation is more predictive than the traditional ones; 2) osLDA boosts the performance of topic modeling via the joint discovery of latent topics and the different objective and subjective power hidden in every word; and 3) osLDA has lower computational complexity than supervised LDA, especially under an increasing number of topics.

14.

Data-Dependent Label Distribution Learning for Age Estimation.

He, Zhouzhou; Li, Xi; Zhang, Zhongfei; Wu, Fei; Geng, Xin; Zhang, Yaqing; Yang, Ming-Hsuan; Zhuang, Yueting.

IEEE Trans Image Process ; 26(8): 3846-3858, 2017 Aug.

Article in English | MEDLINE | ID: mdl-28103557

ABSTRACT

As an important and challenging problem in computer vision, face age estimation is typically cast as a classification or regression problem over a set of face samples with respect to several ordinal age labels, which have intrinsically cross-age correlations across adjacent age dimensions. As a result, such correlations usually lead to the age label ambiguities of the face samples. Namely, each face sample is associated with a latent label distribution that encodes the cross-age correlation information on label ambiguities. Motivated by this observation, we propose a totally data-driven label distribution learning approach to adaptively learn the latent label distributions. The proposed approach is capable of effectively discovering the intrinsic age distribution patterns for cross-age correlation analysis on the basis of the local context structures of face samples. Without any prior assumptions on the forms of label distribution learning, our approach is able to flexibly model the sample-specific context aware label distribution properties by solving a multi-task problem, which jointly optimizes the tasks of age-label distribution learning and age prediction for individuals. Experimental results demonstrate the effectiveness of our approach.

Subject(s)

Aging/physiology , Face/diagnostic imaging , Image Processing, Computer-Assisted/methods , Models, Statistical , Adolescent , Adult , Aged , Algorithms , Child , Child, Preschool , Female , Humans , Infant , Infant, Newborn , Male , Middle Aged , Racial Groups/statistics & numerical data , Young Adult

15.

DeepSaliency: Multi-Task Deep Neural Network Model for Salient Object Detection.

Li, Xi; Zhao, Liming; Wei, Lina; Yang, Ming-Hsuan; Wu, Fei; Zhuang, Yueting; Ling, Haibin; Wang, Jingdong.

IEEE Trans Image Process ; 25(8): 3919-30, 2016 08.

Article in English | MEDLINE | ID: mdl-27305676

ABSTRACT

A key problem in salient object detection is how to effectively model the semantic properties of salient objects in a data-driven manner. In this paper, we propose a multi-task deep saliency model based on a fully convolutional neural network with global input (whole raw images) and global output (whole saliency maps). In principle, the proposed saliency model takes a data-driven strategy for encoding the underlying saliency prior information, and then sets up a multi-task learning scheme for exploring the intrinsic correlations between saliency detection and semantic image segmentation. Through collaborative feature learning from such two correlated tasks, the shared fully convolutional layers produce effective features for object perception. Moreover, it is capable of capturing the semantic information on salient objects across different levels using the fully convolutional layers, which investigate the feature-sharing properties of salient object detection with a great reduction of feature redundancy. Finally, we present a graph Laplacian regularized nonlinear regression model for saliency refinement. Experimental results demonstrate the effectiveness of our approach in comparison with the state-of-the-art approaches.

16.

Scalable Linear Visual Feature Learning via Online Parallel Nonnegative Matrix Factorization.

Zhao, Xueyi; Li, Xi; Zhang, Zhongfei; Shen, Chunhua; Zhuang, Yueting; Gao, Lixin; Li, Xuelong.

IEEE Trans Neural Netw Learn Syst ; 27(12): 2628-2642, 2016 12.

Article in English | MEDLINE | ID: mdl-26625429

ABSTRACT

Visual feature learning, which aims to construct an effective feature representation for visual data, has a wide range of applications in computer vision. It is often posed as a problem of nonnegative matrix factorization (NMF), which constructs a linear representation for the data. Although NMF is typically parallelized for efficiency, traditional parallelization methods suffer from either an expensive computation or a high runtime memory usage. To alleviate this problem, we propose a parallel NMF method called alternating least square block decomposition (ALSD), which efficiently solves a set of conditionally independent optimization subproblems based on a highly parallelized fine-grained grid-based blockwise matrix decomposition. By assigning each block optimization subproblem to an individual computing node, ALSD can be effectively implemented in a MapReduce-based Hadoop framework. In order to cope with dynamically varying visual data, we further present an incremental version of ALSD, which is able to incrementally update the NMF solution with a low computational cost. Experimental results demonstrate the efficiency and scalability of the proposed methods as well as their applications to image clustering and image retrieval.

17.

Learning of Multimodal Representations With Random Walks on the Click Graph.

Wu, Fei; Lu, Xinyan; Song, Jun; Yan, Shuicheng; Zhang, Zhongfei Mark; Rui, Yong; Zhuang, Yueting.

IEEE Trans Image Process ; 25(2): 630-42, 2016 Feb.

Article in English | MEDLINE | ID: mdl-26672038

ABSTRACT

In multimedia information retrieval, most classic approaches tend to represent different modalities of media in the same feature space. With the click data collected from the users' searching behavior, existing approaches take either one-to-one paired data (text-image pairs) or ranking examples (text-query-image and/or image-query-text ranking lists) as training examples, which do not make full use of the click data, particularly the implicit connections among the data objects. In this paper, we treat the click data as a large click graph, in which vertices are images/text queries and edges indicate the clicks between an image and a query. We consider learning a multimodal representation from the perspective of encoding the explicit/implicit relevance relationship between the vertices in the click graph. By minimizing both the truncated random walk loss as well as the distance between the learned representation of vertices and their corresponding deep neural network output, the proposed model which is named multimodal random walk neural network (MRW-NN) can be applied to not only learn robust representation of the existing multimodal data in the click graph, but also deal with the unseen queries and images to support cross-modal retrieval. We evaluate the latent representation learned by MRW-NN on a public large-scale click log data set Clickture and further show that MRW-NN achieves much better cross-modal retrieval performance on the unseen queries/images than the other state-of-the-art methods.

18.

Online Metric-Weighted Linear Representations for Robust Visual Tracking.

Li, Xi; Shen, Chunhua; Dick, Anthony; Zhang, Zhongfei Mark; Zhuang, Yueting.

IEEE Trans Pattern Anal Mach Intell ; 38(5): 931-50, 2016 May.

Article in English | MEDLINE | ID: mdl-26390446

ABSTRACT

In this paper, we propose a visual tracker based on a metric-weighted linear representation of appearance. In order to capture the interdependence of different feature dimensions, we develop two online distance metric learning methods using proximity comparison information and structured output learning. The learned metric is then incorporated into a linear representation of appearance. We show that online distance metric learning significantly improves the robustness of the tracker, especially on those sequences exhibiting drastic appearance changes. In order to bound growth in the number of training samples, we design a time-weighted reservoir sampling method. Moreover, we enable our tracker to automatically perform object identification during the process of object tracking, by introducing a collection of static template samples belonging to several object classes of interest. Object identification results for an entire video sequence are achieved by systematically combining the tracking information and visual recognition at each frame. Experimental results on challenging video sequences demonstrate the effectiveness of the method for both inter-frame tracking and object identification.

19.

Cross-modal learning to rank via latent joint representation.

Wu, Fei; Jiang, Xinyang; Li, Xi; Tang, Siliang; Lu, Weiming; Zhang, Zhongfei; Zhuang, Yueting.

IEEE Trans Image Process ; 24(5): 1497-509, 2015 May.

Article in English | MEDLINE | ID: mdl-25700450

ABSTRACT

Cross-modal ranking is a research topic that is imperative to many applications involving multimodal data. Discovering a joint representation for multimodal data and learning a ranking function are essential in order to boost the cross-media retrieval (i.e., image-query-text or text-query-image). In this paper, we propose an approach to discover the latent joint representation of pairs of multimodal data (e.g., pairs of an image query and a text document) via a conditional random field and structural learning in a listwise ranking manner. We call this approach cross-modal learning to rank via latent joint representation (CML²R). In CML²R, the correlations between multimodal data are captured in terms of their sharing hidden variables (e.g., topics), and a hidden-topic-driven discriminative ranking function is learned in a listwise ranking manner. The experiments show that the proposed approach achieves a good performance in cross-media retrieval and meanwhile has the capability to learn the discriminative representation of multimodal data.

20.

Mining Spatial-Temporal Patterns and Structural Sparsity for Human Motion Data Denoising.

Feng, Yinfu; Ji, Mingming; Xiao, Jun; Yang, Xiaosong; Zhang, Jian J; Zhuang, Yueting; Li, Xuelong.

IEEE Trans Cybern ; 45(12): 2693-706, 2015 Dec.

Article in English | MEDLINE | ID: mdl-25561602

ABSTRACT

Motion capture is an important technique with a wide range of applications in areas such as computer vision, computer animation, film production, and medical rehabilitation. Even with the professional motion capture systems, the acquired raw data mostly contain inevitable noises and outliers. To denoise the data, numerous methods have been developed, while this problem still remains a challenge due to the high complexity of human motion and the diversity of real-life situations. In this paper, we propose a data-driven-based robust human motion denoising approach by mining the spatial-temporal patterns and the structural sparsity embedded in motion data. We first replace the regularly used entire pose model with a much fine-grained partlet model as feature representation to exploit the abundant local body part posture and movement similarities. Then, a robust dictionary learning algorithm is proposed to learn multiple compact and representative motion dictionaries from the training data in parallel. Finally, we reformulate the human motion denoising problem as a robust structured sparse coding problem in which both the noise distribution information and the temporal smoothness property of human motion have been jointly taken into account. Compared with several state-of-the-art motion denoising methods on both the synthetic and real noisy motion data, our method consistently yields better performance than its counterparts. The outputs of our approach are much more stable than that of the others. In addition, it is much easier to setup the training dataset of our method than that of the other data-driven-based methods.

Subject(s)

Algorithms , Image Processing, Computer-Assisted/methods , Movement/physiology , Pattern Recognition, Automated/methods , Signal Processing, Computer-Assisted , Data Mining , Human Activities/classification , Humans , Machine Learning

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL