Results 1 - 10 of 10
1.
Article in English | MEDLINE | ID: mdl-38381637

ABSTRACT

Salient object ranking (SOR) aims to segment the salient objects in an image and simultaneously predict their saliency rankings, according to how human attention shifts over different objects. Existing SOR approaches focus mainly on object-based attention, e.g., the semantics and appearance of each object. However, we find that scene context plays a vital role in SOR: the saliency ranking of the same object can vary greatly across different scenes. In this paper, we therefore make the first attempt to explicitly learn scene context for SOR. Specifically, we establish a large-scale SOR dataset of 24,373 images with rich context annotations, i.e., scene graphs, segmentations, and saliency rankings. Inspired by analysis of this dataset, we propose a novel graph hypernetwork, named HyperSOR, for context-aware SOR. In HyperSOR, an initial graph module segments objects and constructs an initial graph from both geometric and semantic information. A scene graph generation module with a multi-path graph attention mechanism then learns semantic relationships among objects from the initial graph. Finally, a saliency ranking prediction module dynamically incorporates the learned scene context through a novel graph hypernetwork to infer the saliency rankings. Experimental results show that HyperSOR significantly improves SOR performance.
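A minimal sketch of the hypernetwork idea at the core of this design, assuming a pooled scene-graph embedding is already available: a small network maps the scene context to the weights of a per-object scoring head, so the ranking function itself is conditioned on the scene. Module names and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RankingHyperNetwork(nn.Module):
    def __init__(self, ctx_dim=256, obj_dim=256, hidden=128):
        super().__init__()
        # Generates the parameters (W, b) of a scoring layer from the scene context.
        self.weight_gen = nn.Linear(ctx_dim, obj_dim * hidden)
        self.bias_gen = nn.Linear(ctx_dim, hidden)
        self.score = nn.Linear(hidden, 1)
        self.hidden, self.obj_dim = hidden, obj_dim

    def forward(self, obj_feats, scene_ctx):
        # obj_feats: (N, obj_dim) node features from the scene graph
        # scene_ctx: (ctx_dim,) pooled scene-graph embedding
        W = self.weight_gen(scene_ctx).view(self.hidden, self.obj_dim)
        b = self.bias_gen(scene_ctx)
        h = torch.relu(obj_feats @ W.t() + b)  # context-conditioned object features
        return self.score(h).squeeze(-1)       # one saliency score per object
```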

2.
Article in English | MEDLINE | ID: mdl-37756168

ABSTRACT

Classical light field rendering for novel view synthesis can accurately reproduce view-dependent effects such as reflection, refraction, and translucency, but requires a dense view sampling of the scene. Methods based on geometric reconstruction need only sparse views, but cannot accurately model non-Lambertian effects. We introduce a model that combines the strengths and mitigates the limitations of these two directions. By operating on a four-dimensional representation of the light field, our model learns to represent view-dependent effects accurately. By enforcing geometric constraints during training and inference, the scene geometry is implicitly learned from a sparse set of views. Concretely, we introduce a two-stage transformer-based model that first aggregates features along epipolar lines, then aggregates features across reference views to produce the color of a target ray. Additionally, we propose modifications that allow the model to generalize to scenes without any fine-tuning. Our model outperforms the state of the art on multiple forward-facing and 360° datasets, with larger margins on scenes with severe view-dependent variations. Code and results can be found at https://light-field-neural-rendering.github.io/.
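A rough sketch of the two-stage aggregation described above, built from standard PyTorch attention layers; the shapes, pooling choices, and sigmoid output are assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn

class TwoStageAggregator(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.epipolar_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_rgb = nn.Linear(dim, 3)

    def forward(self, feats):
        # feats: (views, points_on_epipolar_line, dim) features for one target ray
        # Stage 1: attend along each epipolar line, reduce to one token per view.
        per_view, _ = self.epipolar_attn(feats, feats, feats)
        per_view = per_view.mean(dim=1)                      # (views, dim)
        # Stage 2: attend across reference views to fuse into one ray descriptor.
        v = per_view.unsqueeze(0)                            # (1, views, dim)
        fused, _ = self.view_attn(v, v, v)
        return torch.sigmoid(self.to_rgb(fused.mean(dim=1)))  # (1, 3) ray color
```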

3.
IEEE Trans Pattern Anal Mach Intell ; 42(12): 3136-3152, 2020 Dec.
Article in English | MEDLINE | ID: mdl-31199251

ABSTRACT

Despite significant progress in object categorization in recent years, a number of important challenges remain; chief among them are the ability to learn from limited labeled data and to recognize object classes within a large, potentially open, set of labels. Zero-shot learning is one way of addressing these challenges, but it has only been shown to work with limited-size class vocabularies and typically requires a separation between supervised and unsupervised classes, allowing the former to inform the latter but not vice versa. We propose the notion of vocabulary-informed learning to alleviate the above challenges and to address supervised, zero-shot, generalized zero-shot, and open set recognition within a unified framework. Specifically, we propose a weighted maximum margin framework for semantic manifold-based recognition that incorporates distance constraints from (both supervised and unsupervised) vocabulary atoms. The distance constraints ensure that labeled samples are projected closer to their correct prototypes in the embedding space than to others. We show that the resulting model improves supervised, zero-shot, generalized zero-shot, and large open set recognition, with a class vocabulary of up to 310K on the Animals with Attributes and ImageNet datasets.
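One way to write the distance constraint described above is as a hinge penalty over prototype distances; the notation (embedding f, vocabulary atoms u_c, margin m) is assumed here, and the per-atom weighting of the full framework is omitted for brevity.

```latex
% Each labeled sample x_i, embedded by f, must lie closer to its class
% prototype u_{y_i} than to any other vocabulary atom u_c, by a margin m:
\min_{f}\; \sum_i \sum_{c \neq y_i}
  \max\!\bigl(0,\; m + \|f(x_i) - u_{y_i}\|_2^2 - \|f(x_i) - u_c\|_2^2 \bigr)
```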

4.
Article in English | MEDLINE | ID: mdl-30969924

ABSTRACT

The ability to quickly recognize and learn new visual concepts from limited samples enables humans to quickly adapt to new tasks and environments. This ability is enabled by the semantic association of novel concepts with those that have already been learned and stored in memory. Computers can begin to acquire similar abilities by utilizing a semantic concept space: a high-dimensional space in which similar abstract concepts appear close together and dissimilar ones far apart. In this paper, we propose a novel approach to one-shot learning that builds on this core idea. Our approach learns to map a novel sample instance to a concept, relates that concept to the existing ones in the concept space, and, using these relationships, generates new instances by interpolating among the concepts to help learning. Instead of synthesizing new image instances, we propose to directly synthesize instance features by leveraging semantics, using a novel auto-encoder network we call dual TriNet. The encoder part of the TriNet learns to map multi-layer visual features from a CNN to a semantic vector. In the semantic space, we search for related concepts, which are then projected back into the image feature space by the decoder portion of the TriNet. Two strategies in the semantic space are explored. Notably, this seemingly simple strategy results in complex augmented feature distributions in the image feature space, leading to substantially better performance. Code and models are released at https://github.com/tankche1/Semantic-Feature-Augmentation-in-Few-shot-Learning.
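A hedged sketch of the encode-perturb-decode augmentation loop: a visual feature is encoded into semantic space, nudged toward a related concept, and decoded back into a synthetic training feature. The module sizes and the linear interpolation rule are illustrative assumptions, not the released dual TriNet.

```python
import torch
import torch.nn as nn

class FeatureAugmenter(nn.Module):
    def __init__(self, feat_dim=512, sem_dim=300):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, sem_dim), nn.ReLU(),
                                     nn.Linear(sem_dim, sem_dim))
        self.decoder = nn.Sequential(nn.Linear(sem_dim, sem_dim), nn.ReLU(),
                                     nn.Linear(sem_dim, feat_dim))

    def augment(self, feat, neighbor_sem, alpha=0.3):
        # feat: (feat_dim,) CNN feature of the one-shot sample
        # neighbor_sem: (sem_dim,) embedding of a related concept (e.g. word2vec)
        z = self.encoder(feat)
        z_aug = (1 - alpha) * z + alpha * neighbor_sem  # interpolate in semantic space
        return self.decoder(z_aug)                      # synthesized training feature
```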

5.
IEEE Trans Pattern Anal Mach Intell ; 37(9): 1764-76, 2015 Sep.
Article in English | MEDLINE | ID: mdl-26353125

ABSTRACT

The goal of cross-domain matching (CDM) is to find correspondences between two sets of objects in different domains in an unsupervised way. CDM has various interesting applications, including photo album summarization, where photos are automatically aligned into a designed frame expressed in a Cartesian coordinate system, and temporal alignment, which aligns sequences such as videos that are potentially expressed using different features. In this paper, we propose an information-theoretic CDM framework based on squared-loss mutual information (SMI). The proposed approach can directly handle non-linearly related objects/sequences with different dimensionalities, and its hyper-parameters can be objectively optimized by cross-validation. We apply the proposed method to several real-world problems, including image matching, unpaired voice conversion, photo album summarization, cross-feature video alignment, cross-domain video-to-mocap alignment, and Kinect-based action recognition, and experimentally demonstrate that it is a promising alternative to state-of-the-art CDM methods.
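For reference, SMI replaces the logarithm in ordinary mutual information with the squared deviation of the density ratio from one, which admits an analytic least-squares estimator:

```latex
% Squared-loss mutual information between random variables X and Y:
\mathrm{SMI}(X, Y) = \frac{1}{2} \iint p_x(x)\, p_y(y)
  \left( \frac{p_{xy}(x, y)}{p_x(x)\, p_y(y)} - 1 \right)^{\!2} \mathrm{d}x\, \mathrm{d}y
```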

6.
Neural Comput ; 26(1): 185-207, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24102126

ABSTRACT

The goal of supervised feature selection is to find a subset of input features that are responsible for predicting output values. The least absolute shrinkage and selection operator (Lasso) allows computationally efficient feature selection based on linear dependency between input features and output values. In this letter, we consider a feature-wise kernelized Lasso for capturing nonlinear input-output dependency. We first show that with particular choices of kernel functions, nonredundant features with strong statistical dependence on output values can be found in terms of kernel-based independence measures such as the Hilbert-Schmidt independence criterion. We then show that the globally optimal solution can be efficiently computed; this makes the approach scalable to high-dimensional problems. The effectiveness of the proposed method is demonstrated through feature selection experiments for classification and regression with thousands of features.
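Concretely, the feature-wise kernelized Lasso (HSIC Lasso) selects features by a non-negative Lasso over centered Gram matrices, with one matrix per input feature:

```latex
% \bar{K}^{(k)} is the centered Gram matrix computed from the k-th input
% feature, \bar{L} the centered Gram matrix of the output values:
\min_{\alpha \ge 0}\;
  \frac{1}{2} \Bigl\| \bar{L} - \sum_{k=1}^{d} \alpha_k \bar{K}^{(k)} \Bigr\|_F^2
  + \lambda \|\alpha\|_1
```

Features with nonzero coefficients α_k are the selected ones; non-negativity together with the Frobenius-norm fit is what makes the global optimum efficiently computable.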


Subject(s)
Algorithms; Artificial Intelligence; Nonlinear Dynamics; Pattern Recognition, Automated/methods; Animals; Oligonucleotide Array Sequence Analysis; Rats
7.
IEEE Trans Pattern Anal Mach Intell ; 36(2): 235-47, 2014 Feb.
Article in English | MEDLINE | ID: mdl-24356346

ABSTRACT

Discriminative, or (structured) prediction, methods have proved effective for a variety of problems in computer vision; a notable example is 3D monocular pose estimation. All methods to date, however, have relied on the assumption that training (source) and test (target) data come from the same underlying joint distribution. In many real cases, including standard data sets, this assumption is flawed. In the presence of training set bias, learning results in a biased model whose performance degrades on the (target) test set. Under the assumption of covariate shift, we propose an unsupervised domain adaptation approach to address this problem. The approach takes the form of training instance reweighting, where the weights are assigned based on the ratio of training and test marginals evaluated at the samples. Learning with the resulting weighted training samples alleviates the bias in the learned models. We show the efficacy of our approach by proposing weighted variants of kernel regression (KR) and twin Gaussian processes (TGP). We show that our weighted variants outperform their unweighted counterparts and improve on the state-of-the-art performance on the public HumanEva data set.
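A minimal sketch of instance reweighting for a kernelized regressor, assuming the density-ratio weights w(x) = p_test(x)/p_train(x) have already been estimated by some ratio estimator; this is a generic weighted kernel ridge regression, not the paper's exact KR or TGP variants.

```python
import numpy as np

def weighted_kernel_ridge(K, y, w, lam=1e-3):
    # K: (n, n) kernel matrix over training inputs
    # y: (n,) or (n, d) training targets; w: (n,) importance weights
    W = np.diag(w)
    # Minimizing sum_i w_i (y_i - (K a)_i)^2 + lam * a^T K a gives the
    # normal equations (K W K + lam K) a = K W y.
    A = K @ W @ K + lam * K
    b = K @ W @ y
    return np.linalg.solve(A, b)  # dual coefficients; predict with K_test @ a
```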


Subject(s)
Algorithms; Artificial Intelligence; Image Interpretation, Computer-Assisted/methods; Imaging, Three-Dimensional/methods; Pattern Recognition, Automated/methods; Computer Simulation; Discriminant Analysis; Image Enhancement/methods; Models, Statistical; Reproducibility of Results; Sensitivity and Specificity
8.
IEEE Trans Pattern Anal Mach Intell ; 35(1): 52-65, 2013 Jan.
Article in English | MEDLINE | ID: mdl-22392709

ABSTRACT

We propose a simulation-based dynamical motion prior for tracking human motion from video in the presence of physical ground-person interactions. Most tracking approaches to date have focused on efficient inference algorithms and/or learning of prior kinematic motion models; however, few can explicitly account for the physical plausibility of the recovered motion. Here, we aim to recover physically plausible motion of a single articulated human subject. Toward this end, we propose a full-body 3D physical simulation-based prior that explicitly incorporates a model of human dynamics into the Bayesian filtering framework. We consider the motion of the subject to be generated by a feedback "control loop" in which Newtonian physics approximates the rigid-body motion dynamics of the human and the environment through the application and integration of interaction forces, motor forces, and gravity. Interaction forces prevent physically impossible hypotheses, enable more appropriate reactions to the environment (e.g., ground contacts), and are produced from detected human-environment collisions. Motor forces actuate the body, ensure that proposed pose transitions are physically feasible, and are generated using a motion controller. For efficient inference in the resulting high-dimensional state space, we utilize an exemplar-based control strategy that reduces the effective search space of motor forces. As a result, we are able to recover physically plausible motion of human subjects from monocular and multiview video. We show, both quantitatively and qualitatively, that our approach performs favorably with respect to Bayesian filtering methods with standard motion priors.
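A schematic of how a physics simulation can act as the motion prior inside a particle filter: each particle is advanced by motor and interaction forces before being scored against the image. Here simulate_step, motor_controller, contact_forces, and image_likelihood are hypothetical placeholders, not the paper's components.

```python
import numpy as np

def filter_step(particles, weights, observation, dt=1.0 / 30):
    new_particles = []
    for state in particles:
        tau = motor_controller(state)    # hypothetical: motor forces toward feasible poses
        f_ext = contact_forces(state)    # hypothetical: ground-contact interaction forces
        # Newtonian update of the articulated body under tau, f_ext, and gravity.
        new_particles.append(simulate_step(state, tau, f_ext, dt))
    # Reweight by agreement with the image, then resample as usual.
    w = np.array([image_likelihood(s, observation) for s in new_particles]) * weights
    w /= w.sum()
    idx = np.random.choice(len(new_particles), size=len(new_particles), p=w)
    return [new_particles[i] for i in idx], np.full(len(w), 1.0 / len(w))
```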


Subject(s)
Artificial Intelligence; Image Interpretation, Computer-Assisted/methods; Joints/physiology; Models, Biological; Movement/physiology; Pattern Recognition, Automated/methods; Whole Body Imaging/methods; Computer Simulation; Humans; Joints/anatomy & histology
9.
IEEE Trans Pattern Anal Mach Intell ; 34(4): 778-90, 2012 Apr.
Article in English | MEDLINE | ID: mdl-21808087

ABSTRACT

Latent variable models (LVMs), such as the GPLVM and related methods, help mitigate overfitting when learning from small or moderately sized training sets. Nevertheless, existing methods suffer from several problems: 1) complexity, 2) the lack of explicit mappings to and from the latent space, 3) an inability to cope with multimodality, and 4) the lack of a well-defined density over the latent space. We propose an LVM called the Kernel Information Embedding (KIE) that defines a coherent joint density over the input and a learned latent space. Learning is quadratic in the number of training examples, and the model works well on small data sets. We also introduce a generalization, the shared KIE (sKIE), that allows us to model multiple input spaces (e.g., image features and poses) using a single, shared latent representation. KIE and sKIE permit missing data during inference and partially labeled data during learning. We show that for data sets too large to learn a coherent global model, the sKIE can be used to learn local online models. We use the sKIE for human pose inference.
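A loose sketch of what an explicit input-to-latent mapping can look like, assuming training inputs X are paired with learned latent coordinates Z: a new input is placed in latent space by kernel-weighted averaging (a Nadaraya-Watson style conditional mean). This illustrates the idea of a well-defined mapping, not the exact KIE formulation.

```python
import numpy as np

def to_latent(x_new, X, Z, bandwidth=1.0):
    # X: (n, dx) training inputs; Z: (n, dz) learned latent coordinates
    d2 = np.sum((X - x_new) ** 2, axis=1)
    k = np.exp(-0.5 * d2 / bandwidth**2)  # Gaussian kernel responsibilities
    k /= k.sum()
    return k @ Z                           # conditional mean in latent space
```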


Subject(s)
Models, Theoretical; Humans; Pattern Recognition, Automated/methods
10.
IEEE Trans Pattern Anal Mach Intell ; 26(7): 862-77, 2004 Jul.
Article in English | MEDLINE | ID: mdl-18579945

ABSTRACT

A novel approach for real-time skin segmentation in video sequences is described. The approach enables reliable skin segmentation despite wide variations in illumination during tracking. An explicit second-order Markov model is used to predict the evolution of the skin-color (HSV) histogram over time. Histograms are dynamically updated based on feedback from the current segmentation and the predictions of the Markov model. The evolution of the skin-color distribution at each frame is parameterized by translation, scaling, and rotation in color space. Consequent changes in the geometric parameterization of the distribution are propagated by warping and resampling the histogram. The parameters of the discrete-time dynamic Markov model are estimated by maximum likelihood and also evolve over time. The accuracy of the new dynamic skin-color segmentation algorithm is compared to that obtained with a static color model. Segmentation accuracy is evaluated using labeled ground-truth video sequences taken from staged experiments and popular movies. An overall increase in segmentation accuracy of up to 24 percent is observed in 17 of 21 test sequences. In all but one case, the skin-color classification rates of our system were higher, with background classification rates comparable to those of the static segmentation.
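A simplified sketch of the dynamic-histogram idea, with the per-frame change of the skin-color distribution reduced to a translation in HSV for illustration: a second-order (constant-velocity) model predicts the shift, and the warped histogram is blended with feedback from the current segmentation. The translation-only parameterization and the fixed blending gain are assumptions, not the paper's ML-estimated model.

```python
import numpy as np

def predict_shift(shift_prev, shift_prev2):
    # Second-order (constant-velocity) prediction from the two previous shifts.
    return shift_prev + (shift_prev - shift_prev2)

def update_histogram(hist, shift, measured_hist, gain=0.3):
    # Warp the 3D HSV histogram along each axis by the predicted shift, then
    # blend with the histogram measured from the current segmentation feedback.
    warped = hist
    for axis, s in enumerate(np.round(shift).astype(int)):
        warped = np.roll(warped, s, axis=axis)
    return (1 - gain) * warped + gain * measured_hist
```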


Subject(s)
Color; Colorimetry/methods; Image Interpretation, Computer-Assisted/methods; Lighting/methods; Pattern Recognition, Automated/methods; Skin Physiological Phenomena; Video Recording/methods; Algorithms; Artificial Intelligence; Humans; Image Enhancement/methods