Results 1 - 15 of 15
1.
IEEE Trans Pattern Anal Mach Intell ; 45(2): 1545-1562, 2023 Feb.
Article in English | MEDLINE | ID: mdl-35380955

ABSTRACT

Meta-learning methods have been shown to be effective in quickly adapting a model to novel tasks. Most existing meta-learning methods represent data and carry out fast adaptation in Euclidean space. In fact, data in real-world applications usually reside on complex and varied Riemannian manifolds. In this paper, we propose a curvature-adaptive meta-learning method that achieves fast adaptation to manifold data by producing suitable curvature. Specifically, we represent data in the product manifold of multiple constant curvature spaces and build a product manifold neural network as the base-learner. In this way, our method is capable of encoding complex manifold data into discriminative and generic representations. Then, we introduce curvature generation and curvature updating schemes, through which suitable product manifolds for various forms of data manifolds are constructed in a few optimization steps. The curvature generation scheme identifies task-specific curvature initialization, leading to a shorter optimization trajectory. The curvature updating scheme automatically produces an appropriate learning rate and search direction for curvature, yielding a faster and more adaptive optimization paradigm than hand-designed optimization schemes. We evaluate our method on a broad set of problems including few-shot classification, few-shot regression, and reinforcement learning tasks. Experimental results show that our method achieves substantial improvements compared to meta-learning methods that ignore the geometry of the underlying space.
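To make the constant-curvature idea concrete, here is a minimal sketch (not the authors' network) of embedding Euclidean features into a product of a hyperbolic component and a flat component; the curvature value, feature split, and shapes are illustrative assumptions.

```python
import numpy as np

def expmap0_poincare(v, c=1.0, eps=1e-9):
    # Exponential map at the origin of the Poincare ball with curvature -c:
    # a standard constant-curvature embedding of a Euclidean tangent vector.
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

# A product-manifold representation can simply concatenate components that
# live in spaces of different (and, in the paper, learnable) curvature.
feats = np.random.randn(4, 16)                       # hypothetical features
embedding = np.concatenate(
    [expmap0_poincare(feats[:, :8], c=0.5),          # curved component
     feats[:, 8:]],                                  # flat (zero-curvature) component
    axis=-1)
```

In the paper the curvature of each component is generated and updated per task rather than fixed as it is here.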

2.
IEEE Trans Pattern Anal Mach Intell ; 45(5): 5935-5952, 2023 May.
Article in English | MEDLINE | ID: mdl-36260581

ABSTRACT

Many learning tasks are modeled as optimization problems with nonlinear constraints, such as principal component analysis and fitting a Gaussian mixture model. A popular way to solve such problems is to resort to Riemannian optimization algorithms, which nevertheless rely heavily on both human involvement and expert knowledge about Riemannian manifolds. In this paper, we propose a Riemannian meta-optimization method to automatically learn a Riemannian optimizer. We parameterize the Riemannian optimizer by a novel recurrent network and utilize Riemannian operations to ensure that our method is faithful to the geometry of manifolds. The proposed method explores the distribution of the underlying data by minimizing the objective of updated parameters, and hence is capable of learning task-specific optimizers. We introduce a Riemannian implicit differentiation training scheme to achieve efficient training in terms of numerical stability and computational cost. Unlike conventional meta-optimization training schemes that need to differentiate through the whole optimization trajectory, our training scheme involves only the final two optimization steps. In this way, our training scheme avoids the exploding gradient problem and significantly reduces the computational load and memory footprint. We discuss experimental results across various constrained problems, including principal component analysis on Grassmann manifolds, face recognition, person re-identification, and texture image classification on Stiefel manifolds, clustering and similarity learning on symmetric positive definite manifolds, and few-shot learning on hyperbolic manifolds.
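As a point of reference for the Riemannian operations such a learned optimizer must respect, below is a hand-designed gradient step on the Stiefel manifold (tangent-space projection followed by a QR retraction). The paper replaces the fixed learning rate and search direction with outputs of a recurrent network, which is not shown here; the step size is an illustrative assumption.

```python
import numpy as np

def stiefel_step(X, egrad, lr=0.1):
    # One hand-designed Riemannian gradient step on the Stiefel manifold
    # St(n, p) = {X : X^T X = I}, given the Euclidean gradient `egrad` at X.
    sym = (X.T @ egrad + egrad.T @ X) / 2.0
    rgrad = egrad - X @ sym                       # project onto the tangent space at X
    Q, R = np.linalg.qr(X - lr * rgrad)           # retract back onto the manifold
    return Q * np.sign(np.sign(np.diag(R)) + 0.5)  # fix the QR sign ambiguity
```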

3.
Article in English | MEDLINE | ID: mdl-37015388

ABSTRACT

Image-text matching is a challenging task due to the modality gap. Many recent methods focus on modeling entity relationships to learn a common embedding space of image and text. However, these methods suffer from distractions in entity relationships, such as irrelevant visual regions in an image and noisy textual words in a text. In this paper, we propose an adaptive latent graph representation learning method to reduce the distractions of entity relationships for image-text matching. Specifically, we use an improved graph variational autoencoder to separate the distracting factors from the latent factors of relationships and jointly learn latent textual graph representations, latent visual graph representations, and a visual-textual graph embedding space. We also introduce an adaptive cross-attention mechanism to attend over the latent graph representations across images and texts, further narrowing the modality gap to boost matching performance. Extensive experiments on two public datasets, Flickr30K and COCO, show the effectiveness of our method.
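For readers unfamiliar with cross-attention, here is a plain scaled dot-product version attending text tokens over image regions; it is a simplified stand-in for the paper's adaptive cross-attention over latent graph representations, and the shapes and names are illustrative.

```python
import numpy as np

def cross_attention(text_feats, region_feats):
    # text_feats: (T, d) token features; region_feats: (R, d) region features.
    d = text_feats.shape[-1]
    scores = text_feats @ region_feats.T / np.sqrt(d)        # (T, R) similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)            # softmax over regions
    return weights @ region_feats                            # attended visual context per word
```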

4.
IEEE Trans Pattern Anal Mach Intell ; 44(8): 4454-4468, 2022 08.
Article in English | MEDLINE | ID: mdl-33656990

ABSTRACT

It is quite laborious and costly to manually label LiDAR point cloud data for training high-quality 3D object detectors. This work proposes a weakly supervised framework that allows learning 3D detection from a few weakly annotated examples. This is achieved by a two-stage architecture design. Stage 1 learns to generate cylindrical object proposals under inaccurate and inexact supervision, obtained by our proposed BEV center-click annotation strategy, where only the horizontal object centers are click-annotated in bird's-eye-view scenes. Stage 2 learns to predict cuboids and confidence scores in a coarse-to-fine, cascade manner, under incomplete supervision, i.e., only a small portion of object cuboids are precisely annotated. On the KITTI dataset, using only 500 weakly annotated scenes and 534 precisely labeled vehicle instances, our method achieves 86-97 percent of the performance of current top-leading, fully supervised detectors (which require 3,712 exhaustively annotated scenes with 15,654 instances). More importantly, with our elaborately designed network architecture, our trained model can be applied as a 3D object annotator, supporting both automatic and active (human-in-the-loop) working modes. The annotations generated by our model can be used to train 3D object detectors, achieving over 95 percent of their original performance (with manually labeled training data). Our experiments also show our model's potential in boosting performance when given more training data. The above designs make our approach highly practical and open up opportunities for learning 3D detection at reduced annotation cost.


Subjects
Algorithms; Learning; Humans
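A minimal sketch of the kind of proposal the BEV click supervision could induce: keep the LiDAR points falling inside a fixed-radius vertical cylinder around the clicked horizontal center. The radius, array layout, and function name are assumptions for illustration, not the paper's exact proposal generator.

```python
import numpy as np

def click_to_cylinder(points, click_xy, radius=2.0):
    # points: (N, 3) LiDAR points laid out as x, y, z per row (assumed layout).
    # Keep every point whose horizontal distance to the clicked bird's-eye-view
    # center is within `radius`, i.e. a vertical cylindrical proposal.
    d = np.linalg.norm(points[:, :2] - np.asarray(click_xy, dtype=float), axis=1)
    return points[d < radius]
```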
5.
Article in English | MEDLINE | ID: mdl-32310773

ABSTRACT

Many existing methods formulate the action prediction task as recognizing the early parts of actions in trimmed videos. In this paper, we focus on predicting actions from ongoing untrimmed videos, where actions might not happen at the very beginning of the video. Predicting actions in such untrimmed videos is extremely challenging because their early parts carry ambiguous or even no information about the actions. To address this problem, we propose a prediction confidence that assesses the decision quality of a prediction model. Guided by this confidence, the model continuously refines its prediction results as more video frames are observed. Specifically, we build a Self Prediction Refining Network (SPR-Net) which incrementally learns the confidence for action prediction. SPR-Net consists of three modules: a temporal hybrid network, an incremental confidence learner, and a self-refining Gumbel softmax sampler. The temporal hybrid network generates the action category distributions by integrating static scene and dynamic motion information. The incremental confidence learner calculates the confidence in an incremental manner, judging the extent to which the temporal hybrid network should believe its prediction result. The self-refining Gumbel softmax sampler models the mutual relationship between the prediction confidence and the category distribution, which enables them to be jointly learned in an end-to-end fashion. We also present a sparse self-attention mechanism to encode local spatio-temporal features into the frame-level motion representation to further improve the prediction performance. Extensive experiments on five datasets (i.e., UT-Interaction, BIT-Interaction, UCF101, THUMOS14, and ActivityNet) validate the effectiveness of the proposed method.
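The Gumbel-softmax trick that the sampler builds on is standard; the sketch below draws a single relaxed categorical sample from class logits. The confidence-guided self-refinement around it is specific to the paper and not shown, and the temperature value is an illustrative assumption.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    # Relaxed, differentiable sample from a categorical distribution:
    # perturb the logits with Gumbel(0, 1) noise, then apply a temperature softmax.
    rng = rng or np.random.default_rng()
    u = np.clip(rng.uniform(size=logits.shape), 1e-12, 1.0)
    g = -np.log(-np.log(u))
    y = (logits + g) / tau
    y = np.exp(y - y.max())
    return y / y.sum()
```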

6.
IEEE Trans Neural Netw Learn Syst ; 31(9): 3230-3244, 2020 Sep.
Article in English | MEDLINE | ID: mdl-31567102

ABSTRACT

Symmetric positive definite (SPD) matrices, which form a Riemannian manifold, are commonly used as visual representations. The non-Euclidean geometry of the manifold often makes developing learning algorithms (e.g., classifiers) difficult and complicated. The concept of similarity-based learning has been shown to be effective in addressing various problems on SPD manifolds. This is mainly because similarity-based algorithms are agnostic to the geometry and work purely on the notion of similarities/distances. However, existing similarity-based models on SPD manifolds opt for holistic representations, ignoring the characteristics of the information captured by SPD matrices. To circumvent this limitation, we propose a novel SPD distance measure for similarity-based algorithms. Specifically, we introduce the concept of a point-to-set transformation, which enables us to learn multiple lower-dimensional and discriminative SPD manifolds from a higher-dimensional one. For the lower-dimensional SPD manifolds obtained by the point-to-set transformation, we propose a tailored set-to-set distance measure that makes use of the family of alpha-beta divergences. We further propose to learn the point-to-set transformation and the set-to-set distance measure jointly, yielding a powerful similarity-based algorithm on SPD manifolds. Our thorough evaluations on several visual recognition tasks (e.g., action classification and face recognition) suggest that our algorithm comfortably outperforms various state-of-the-art algorithms.
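To make the two ingredients concrete, the sketch below shows (i) a standard distance between SPD matrices (the log-Euclidean distance, used here as a simpler stand-in for the paper's alpha-beta divergences) and (ii) the point-to-set idea of mapping one SPD matrix through several thin projections to obtain a set of smaller SPD matrices. All shapes and the random projections are illustrative assumptions.

```python
import numpy as np

def spd_log(S):
    # Matrix logarithm of an SPD matrix via eigendecomposition.
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def log_euclidean_distance(A, B):
    # A standard SPD distance; the paper's set-to-set measure instead
    # builds on alpha-beta divergences.
    return np.linalg.norm(spd_log(A) - spd_log(B), ord="fro")

# Point-to-set transformation: one d x d SPD matrix mapped through several
# thin projections W_k yields a set of smaller (p x p) SPD matrices.
d, p = 8, 3
S = np.cov(np.random.randn(d, 50))
Ws = [np.linalg.qr(np.random.randn(d, p))[0] for _ in range(4)]
smaller_spds = [W.T @ S @ W for W in Ws]
```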

7.
Article in English | MEDLINE | ID: mdl-29993980

ABSTRACT

To enhance the resolution and accuracy of depth data, several video-based depth super-resolution methods have been proposed that utilize temporally neighboring depth images. They often consist of two main stages: motion compensation of temporally neighboring depth images and fusion of the compensated depth images. However, large-displacement 3D motion often leads to compensation error, and this error is further propagated into the fusion. In this paper, we propose a video-based depth super-resolution method with novel motion compensation and fusion approaches. We claim that the 3D Nearest Neighbor Field (NNF) is a better choice than positions given by the true motion displacement for depth enhancement. To handle large-displacement 3D motion, the compensation stage utilizes the 3D NNF instead of the true motion used in previous methods. Next, the fusion approach is modeled as a regression problem that efficiently predicts the super-resolution result for each depth image from its compensated depth images. A new deep convolutional neural network architecture is designed for fusion, which is able to employ a large amount of video data to learn the complicated regression function. We comprehensively evaluate our method on various RGB-D video sequences to show its superior performance.
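A brute-force illustration of what a nearest-neighbor field is: every patch of one depth map is matched to its most similar patch in another via a k-d tree over flattened patches. The paper's 3D NNF and its efficient search differ; the patch size and the purely 2D formulation here are assumptions for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from scipy.spatial import cKDTree

def patch_nnf(src, ref, patch=5):
    # For each patch in `src`, return the (row, col) of its most similar
    # patch in `ref` under squared Euclidean distance of patch intensities.
    sh, sw = src.shape[0] - patch + 1, src.shape[1] - patch + 1
    rh, rw = ref.shape[0] - patch + 1, ref.shape[1] - patch + 1
    sp = sliding_window_view(src, (patch, patch)).reshape(sh * sw, -1)
    rp = sliding_window_view(ref, (patch, patch)).reshape(rh * rw, -1)
    idx = cKDTree(rp).query(sp)[1]
    return np.stack(np.unravel_index(idx, (rh, rw)), axis=-1).reshape(sh, sw, 2)
```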

8.
IEEE Trans Pattern Anal Mach Intell ; 38(11): 2241-2254, 2016 11.
Article in English | MEDLINE | ID: mdl-26731638

ABSTRACT

The Iterative Closest Point (ICP) algorithm is one of the most widely used methods for point-set registration. However, being based on local iterative optimization, ICP is known to be susceptible to local minima. Its performance critically relies on the quality of the initialization and only local optimality is guaranteed. This paper presents the first globally optimal algorithm, named Go-ICP, for Euclidean (rigid) registration of two 3D point-sets under the L2 error metric defined in ICP. The Go-ICP method is based on a branch-and-bound scheme that searches the entire 3D motion space SE(3). By exploiting the special structure of SE(3) geometry, we derive novel upper and lower bounds for the registration error function. Local ICP is integrated into the BnB scheme, which speeds up the new method while guaranteeing global optimality. We also discuss extensions, addressing the issue of outlier robustness. The evaluation demonstrates that the proposed method is able to produce reliable registration results regardless of the initialization. Go-ICP can be applied in scenarios where an optimal solution is desirable or where a good initialization is not always available.
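For context, one iteration of vanilla ICP looks like the sketch below (closest-point correspondences via a k-d tree, then the SVD/Kabsch solution for the rigid transform). Go-ICP nests local steps like this inside a branch-and-bound search over SE(3) with the derived error bounds; the bounds themselves are not reproduced here.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_step(src, dst):
    # One plain ICP iteration on two (N, 3) point sets.
    nn = cKDTree(dst).query(src)[1]            # closest dst point for each src point
    matched = dst[nn]
    mu_s, mu_d = src.mean(axis=0), matched.mean(axis=0)
    H = (src - mu_s).T @ (matched - mu_d)      # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return src @ R.T + t, R, t                 # updated src, rotation, translation
```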

9.
IEEE Trans Cybern ; 46(11): 2596-2608, 2016 Nov.
Article in English | MEDLINE | ID: mdl-26485728

ABSTRACT

In this paper, we develop a novel transfer latent support vector machine for joint recognition and localization of actions using Web images and weakly annotated training videos. The model takes training videos annotated only with action labels as input, alleviating the laborious and time-consuming manual annotation of action locations. Since the ground-truth action locations in videos are not available, the locations are modeled as latent variables in our method and are inferred during both the training and testing phases. To improve localization accuracy with some prior information about action locations, we collect a number of Web images annotated with both action labels and action locations and learn a discriminative model by enforcing local similarities between videos and Web images. A structural transformation based on randomized clustering forests is used to map the Web images to videos, handling the heterogeneous features of Web images and videos. Experiments on two public action datasets demonstrate the effectiveness of the proposed model for both action localization and action recognition.

10.
IEEE Trans Image Process ; 24(11): 3729-41, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26151938

ABSTRACT

Symmetric positive-definite (SPD) matrices, which form a connected Riemannian manifold, have become increasingly popular for encoding image information. Most existing sparse models are still primarily developed in Euclidean space. They do not consider the non-linear geometric structure of the data space and thus are not directly applicable to the Riemannian manifold. In this paper, we propose a novel sparse representation method for SPD matrices in a data-dependent manifold kernel space. The graph Laplacian is incorporated into the kernel space to better reflect the underlying geometry of SPD matrices. Under the proposed framework, we design two different positive definite kernel functions that can be readily transformed to the corresponding manifold kernels. The resulting sparse representation has more discriminating power. Extensive experimental results demonstrate the good performance of manifold kernel sparse codes in image classification, face recognition, and visual tracking.


Subjects
Algorithms; Image Processing, Computer-Assisted/methods; Biometric Identification; Databases, Factual; Humans; Machine Learning
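One simple way to obtain a positive-definite kernel on SPD matrices, shown below, is a Gaussian kernel computed after mapping each matrix to its matrix logarithm; the resulting kernel matrix could then feed a standard kernel sparse-coding solver. The paper's data-dependent, graph-Laplacian-regularized kernels are not reproduced, and the bandwidth is an illustrative assumption.

```python
import numpy as np

def spd_log_vector(S):
    # Flattened matrix logarithm of an SPD matrix (tangent-space coordinates).
    w, V = np.linalg.eigh(S)
    return ((V * np.log(w)) @ V.T).ravel()

def log_euclidean_gaussian_kernel(spds, gamma=0.5):
    # Gaussian kernel between SPD matrices in log-Euclidean coordinates.
    logs = np.stack([spd_log_vector(S) for S in spds])
    sq = ((logs[:, None, :] - logs[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)
```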
11.
IEEE Trans Image Process ; 24(11): 4096-108, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26080383

ABSTRACT

In cross-view action recognition, what you see in one view differs from what you recognize in another view, since the data distribution and even the feature space can change from one view to another. In this paper, we address the problem of transferring action models learned in one view (the source view) to a different view (the target view), where action instances from the two views are represented by heterogeneous features. A novel learning method, called heterogeneous transfer discriminant-analysis of canonical correlations (HTDCC), is proposed to discover a discriminative common feature space that links the source and target views and transfers knowledge between them. Two projection matrices are learned to map data from the source view and the target view, respectively, into a common feature space by simultaneously minimizing the canonical correlations of interclass training data, maximizing the canonical correlations of intraclass training data, and reducing the data distribution mismatch between the source and target views in the common feature space. In our method, the source view and the target view neither share any common features nor have any corresponding action instances. Moreover, our HTDCC method can handle the case where only a few or even no labeled samples are available in the target view, and it can easily be extended to multiple source views. We additionally propose a weighting learning framework for multiple-source-view adaptation to effectively leverage action knowledge learned from multiple source views for the recognition task in the target view. Under this framework, different source views are assigned different weights according to their relevance to the target view, with each weight representing how much the corresponding source view contributes to the target view. Extensive experiments on the IXMAS dataset demonstrate the effectiveness of HTDCC in learning the common feature space for heterogeneous cross-view action recognition. In addition, the weighting learning framework achieves promising results in automatically adapting knowledge transferred from multiple source views to the target view.
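Plain canonical correlation analysis, sketched below with scikit-learn, gives a rough feel for projecting two heterogeneous views into a shared low-dimensional space. Note that ordinary CCA requires paired samples across views, whereas HTDCC explicitly does not, and HTDCC additionally uses class labels and a distribution-mismatch term; the data and dimensionalities here are hypothetical.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_src = rng.standard_normal((100, 60))      # hypothetical source-view features
X_tgt = rng.standard_normal((100, 40))      # hypothetical target-view features

# CCA assumes the i-th row of each view describes the same instance,
# an assumption the paper drops.
cca = CCA(n_components=10).fit(X_src, X_tgt)
Z_src, Z_tgt = cca.transform(X_src, X_tgt)  # both views in a common 10-D space
```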

12.
IEEE Trans Image Process ; 24(5): 1510-23, 2015 May.
Article in English | MEDLINE | ID: mdl-25706637

ABSTRACT

The appearance of an object can change continuously during tracking, so appearance samples are not independent and identically distributed. A good discriminative tracker often needs a large number of training samples to fit the underlying data distribution, which is impractical for visual tracking. In this paper, we present a new discriminative tracker based on landmark-based label propagation (LLP) that is nonparametric and makes no specific assumption about the sample distribution. With an undirected graph representation of the samples, LLP locally approximates the soft label of each sample by a linear combination of the labels of its nearby landmarks. It is thereby able to effectively propagate a limited number of initial labels to a large number of unlabeled samples. To this end, we introduce a local landmark approximation method to compute the cross-similarity matrix between the whole data set and the landmarks. Moreover, a soft label prediction function incorporating a graph Laplacian regularizer is used to diffuse the known labels to all unlabeled vertices in the graph, which explicitly considers the local geometric structure of all samples. Tracking is then carried out within a Bayesian inference framework, where the soft label prediction value is used to construct the observation model. Both qualitative and quantitative evaluations on a benchmark data set containing 51 challenging image sequences demonstrate that the proposed algorithm outperforms state-of-the-art methods.
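Classic graph-based label propagation (Zhou et al. style), sketched below, shows how a few known labels can be diffused over a similarity graph to all samples. The paper's landmark approximation and the Bayesian tracking loop built on top are not included, and the kernel bandwidth and iteration count are assumptions.

```python
import numpy as np

def propagate_labels(X, Y, alpha=0.99, sigma=1.0, iters=50):
    # X: (n, d) samples; Y: (n, c) one-hot rows for labeled samples, zeros otherwise.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    Dinv = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * Dinv[:, None] * Dinv[None, :]       # normalized affinity D^-1/2 W D^-1/2
    F = Y.astype(float).copy()
    for _ in range(iters):                      # iterative diffusion, clamping the seeds
        F = alpha * S @ F + (1 - alpha) * Y
    return F                                    # soft labels for every sample
```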

13.
IEEE Trans Cybern ; 45(8): 1549-60, 2015 Aug.
Article in English | MEDLINE | ID: mdl-25248209

ABSTRACT

In this paper, we present a novel patch-based matching and fusion algorithm that accounts for scene motion in a multiple-exposure image sequence using optimization. A uniform iterative approach is developed to match and find the corresponding patches across different exposure images, which are then fused in each iteration. Our approach does not need to align the input multiple-exposure images before the fusion process. Considering that pixel values are affected by the varying exposure times, we design a new patch-based energy function that is optimized to improve the matching accuracy. An efficient patch-based exposure fusion approach using the random walker algorithm is developed to preserve the moving objects in the input multiple-exposure images. To the best of our knowledge, our algorithm is the first patch-based exposure fusion work that preserves the moving objects of dynamic scenes without requiring registration of the different exposure images. Experimental results on moving scenes demonstrate that our algorithm achieves visually pleasing fusion results without ghosting artifacts, while the results produced by state-of-the-art exposure fusion and tone mapping algorithms exhibit different levels of ghosting artifacts.
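For contrast with the paper's motion-aware pipeline, below is the simplest static-scene exposure fusion: per-pixel well-exposedness weights (high near mid-gray) followed by a weighted average over the stack. The patch matching and random-walker steps that handle moving objects are deliberately left out, and the grayscale [0, 1] input and weight bandwidth are assumptions.

```python
import numpy as np

def naive_exposure_fusion(stack, sigma=0.2):
    # stack: (K, H, W) grayscale exposures with values in [0, 1].
    stack = np.asarray(stack, dtype=float)
    weights = np.exp(-((stack - 0.5) ** 2) / (2 * sigma ** 2))  # well-exposedness
    weights /= weights.sum(axis=0, keepdims=True) + 1e-12
    return (weights * stack).sum(axis=0)                        # fused image
```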

14.
IEEE Trans Pattern Anal Mach Intell ; 36(9): 1775-88, 2014 Sep.
Article in English | MEDLINE | ID: mdl-26352231

ABSTRACT

This paper addresses the problem of recognizing human interactions from videos. We propose a novel approach that recognizes human interactions through learned high-level descriptions called interactive phrases. Interactive phrases describe motion relationships between interacting people. These phrases naturally exploit human knowledge and allow us to construct a more descriptive model for recognizing human interactions. We propose a discriminative model to encode interactive phrases based on the latent SVM formulation, in which interactive phrases are treated as latent variables and used as mid-level features. To complement manually specified interactive phrases, we also discover data-driven phrases in order to find potentially useful and discriminative phrases for differentiating human interactions. An information-theoretic approach is employed to learn the data-driven phrases. The interdependencies between interactive phrases are explicitly captured in the model to deal with motion ambiguity and partial occlusion in the interactions. We evaluate our method on the BIT-Interaction, UT-Interaction, and Collective Activity data sets. Experimental results show that our approach achieves superior performance over previous approaches.


Subjects
Interpersonal Relations; Pattern Recognition, Automated/methods; Photography/methods; Social Behavior; Social Perception; Whole Body Imaging/methods; Algorithms; Humans; Image Interpretation, Computer-Assisted/methods; Reproducibility of Results; Semantics; Sensitivity and Specificity; Video Recording/methods
15.
IEEE Trans Cybern ; 43(2): 425-36, 2013 Apr.
Article in English | MEDLINE | ID: mdl-22907970

ABSTRACT

In this paper, we present a novel high-quality intrinsic image recovery approach using optimization and user scribbles. Our approach is based on an assumption about the color characteristics within local windows of natural images: neighboring pixels in a local window that have similar intensity values should have similar reflectance values. The intrinsic image decomposition is thus formulated as minimizing an energy function with a weighting constraint on local image properties. To further improve the decomposition results, we specify local constraint cues by integrating user strokes into our energy formulation, including constant-reflectance, constant-illumination, and fixed-illumination brushes. Our experimental results demonstrate that the proposed approach recovers the intrinsic reflectance and illumination components better than previous approaches.
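The sketch below is a toy, scribble-free version of the local-window premise on a grayscale log-image: neighboring pixels with similar intensity are encouraged to share reflectance through a weighted graph Laplacian, a small data term keeps the system well-posed, and log-shading is taken as the residual. The weights, the data term, and the grayscale simplification are assumptions; the paper's color model and user-stroke constraints are omitted.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def toy_intrinsic_decomposition(log_I, sigma=0.1, lam=1e-3):
    # log_I: (H, W) log-intensity image. Returns (log-reflectance, log-shading).
    h, w = log_I.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)
    rows, cols, vals = [], [], []
    def add_edge(i, j, wij):
        # Graph-Laplacian contribution of the term wij * (r_i - r_j)^2.
        rows.extend([i, i, j, j]); cols.extend([i, j, j, i])
        vals.extend([wij, -wij, wij, -wij])
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, 0)):            # right and down neighbors
                yy, xx = y + dy, x + dx
                if yy < h and xx < w:
                    wij = np.exp(-(log_I[y, x] - log_I[yy, xx]) ** 2 / (2 * sigma ** 2))
                    add_edge(idx[y, x], idx[yy, xx], wij)
    L = sparse.csr_matrix((vals, (rows, cols)), shape=(n, n))
    # Small data term pulling reflectance toward the image keeps the system solvable.
    r = spsolve((L + lam * sparse.eye(n)).tocsc(), lam * log_I.ravel())
    s = log_I.ravel() - r
    return r.reshape(h, w), s.reshape(h, w)
```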
