Results 1 - 16 of 16
1.
IEEE Trans Pattern Anal Mach Intell ; 46(2): 823-836, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37874700

ABSTRACT

In this work, we first propose a fully differentiable Many-to-Many (M2M) splatting framework to interpolate frames efficiently. Given a frame pair, we estimate multiple bidirectional flows to directly forward warp the pixels to the desired time step before fusing any overlapping pixels. In doing so, each source pixel renders multiple target pixels and each target pixel can be synthesized from a larger area of visual context, establishing a many-to-many splatting scheme with robustness to undesirable artifacts. For each input frame pair, M2M incurs only minuscule computational overhead when interpolating an arbitrary number of in-between frames, hence achieving fast multi-frame interpolation. However, directly warping and fusing pixels in the intensity domain is sensitive to the quality of motion estimation and may suffer from limited representational capacity. To improve interpolation accuracy, we further extend M2M into an M2M++ framework by introducing a flexible Spatial Selective Refinement (SSR) component, which allows trading computational efficiency for interpolation quality and vice versa. Instead of refining the entire interpolated frame, SSR only processes difficult regions selected under the guidance of an estimated error map, thereby avoiding redundant computation. Evaluation on multiple benchmark datasets shows that our method improves efficiency while maintaining competitive video interpolation quality, and that it can be adjusted to use more or less compute as needed.
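
As a rough illustration of the core splatting operation, here is a minimal NumPy sketch of forward warping with bilinear splatting: every source pixel is pushed along its flow to time t, contributes to its four nearest target pixels, and overlapping contributions are resolved by weight normalization. The multiple flow hypotheses, learned fusion weights, and the end-to-end differentiable implementation of the actual M2M pipeline are all omitted here.

```python
import numpy as np

def forward_warp_splat(frame, flow, t):
    """Splat `frame` (H, W, C) to time step t along `flow` (H, W, 2).

    Each source pixel contributes to the four nearest target pixels
    with bilinear weights; overlapping pixels are averaged.
    """
    H, W, C = frame.shape
    acc = np.zeros((H, W, C))
    wgt = np.zeros((H, W, 1))
    ys, xs = np.mgrid[0:H, 0:W]
    tx = xs + t * flow[..., 0]                # target x coordinates
    ty = ys + t * flow[..., 1]                # target y coordinates
    x0, y0 = np.floor(tx).astype(int), np.floor(ty).astype(int)
    for dy in (0, 1):
        for dx in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            w = (1 - np.abs(tx - xi)) * (1 - np.abs(ty - yi))
            ok = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
            np.add.at(acc, (yi[ok], xi[ok]), frame[ok] * w[ok, None])
            np.add.at(wgt, (yi[ok], xi[ok]), w[ok, None])
    return acc / np.maximum(wgt, 1e-8)        # normalize fused overlaps
```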

2.
IEEE Trans Image Process ; 31: 6320-6330, 2022.
Article in English | MEDLINE | ID: mdl-36194704

ABSTRACT

Label-efficient scene segmentation aims to achieve effective per-pixel classification with reduced labeling effort. Recent approaches for this task focus on leveraging unlabelled images by formulating consistency regularization or pseudo labels for individual pixels. Yet most of these methods ignore the 3D geometric structures naturally conveyed by image scenes, which come for free and can enhance the training of segmentation models with better discrimination of image details. In this work, we present a novel Geometric Structure Refinement (GSR) framework that explicitly exploits the geometric structures of image scenes to enhance the semi-supervised training of segmentation models. In the training phase, we generate initial dense pseudo labels based on fast and coarse annotations, and then utilize the free, unsupervised 3D reconstruction of the image scene to calibrate the dense pseudo labels with more reliable details. With the calibrated pseudo ground truth, we can conveniently train any existing image segmentation model without increasing the cost of annotation or modifying the model's architecture. Moreover, we explore different strategies for allocating labeling effort in semi-supervised scene segmentation, and find that a combination of finely-labeled and coarsely-labeled samples performs better than the traditional fine-only dense annotations. Extensive experiments on datasets including Cityscapes and KITTI are conducted to evaluate our proposed methods. The results demonstrate that GSR can be easily applied to boost the performance of existing models such as PSPNet and DeepLabv3+ with reduced annotations. With half of the annotation effort, GSR achieves 99% of the accuracy of its fully supervised state-of-the-art counterparts.
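
To make the calibration idea concrete, the following Python sketch snaps each pixel's pseudo label to the dominant label of the geometric segment containing it. The majority-vote rule, the `segments` input (e.g., regions made consistent by the unsupervised 3D reconstruction), and the `ignore` convention are simplifying assumptions; the paper's calibration exploits the reconstruction in a more fine-grained way.

```python
import numpy as np

def calibrate_pseudo_labels(pseudo, segments, ignore=255):
    """Majority-vote calibration of dense pseudo labels (H, W) within
    geometry-consistent segments (H, W): each segment is snapped to
    its dominant non-ignored label."""
    out = pseudo.copy()
    for s in np.unique(segments):
        mask = segments == s
        labels = pseudo[mask]
        labels = labels[labels != ignore]
        if labels.size:
            out[mask] = np.bincount(labels).argmax()
    return out
```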

3.
IEEE Trans Pattern Anal Mach Intell ; 44(4): 2155-2167, 2022 04.
Article in English | MEDLINE | ID: mdl-33021939

ABSTRACT

Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image. In this paper we address a more realistic version of the natural language grounding task where we must both identify whether the phrase is relevant to an image and localize the phrase. This can also be viewed as a generalization of object detection to an open-ended vocabulary, introducing elements of few- and zero-shot detection. We propose an approach for this task that extends Faster R-CNN to relate image regions and phrases. By carefully initializing the classification layers of our network using canonical correlation analysis (CCA), we encourage a solution that is more discerning when reasoning between similar phrases, resulting in over double the performance compared to a naive adaptation on three popular phrase grounding datasets, Flickr30K Entities, ReferIt Game, and Visual Genome, with test-time phrase vocabulary sizes of 5K, 32K, and 159K, respectively.
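
A hedged sketch of the CCA-based initialization, assuming paired region features and phrase embeddings from training data, a vocabulary matrix `vocab_embs`, and scikit-learn's CCA (mean-centering is glossed over and the dimensions are illustrative); the actual network head and training procedure are not reproduced here.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_classifier_init(region_feats, phrase_embs, vocab_embs, k=64):
    """Fit CCA on paired (region feature, phrase embedding) data, then
    use the learned projections to initialize a phrase-scoring layer:
    regions and vocabulary phrases meet in the shared k-dim space."""
    cca = CCA(n_components=k).fit(region_feats, phrase_embs)
    Wv, Wt = cca.x_rotations_, cca.y_rotations_   # (Dv, k), (Dt, k)
    w_cls = vocab_embs @ Wt                       # one weight row per phrase

    def scores(region_feat):                      # (Dv,) -> (V,) phrase scores
        return w_cls @ (Wv.T @ region_feat)

    return scores
```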


Subject(s)
Algorithms , Language , Natural Language Processing , Vocabulary
4.
IEEE Trans Pattern Anal Mach Intell ; 43(11): 4196-4202, 2021 Nov.
Article in English | MEDLINE | ID: mdl-33493111

ABSTRACT

In state-of-the-art deep single-label classification models, the top-k (k = 2, 3, 4, ...) accuracy is usually significantly higher than the top-1 accuracy. This is more evident in fine-grained datasets, where differences between classes are quite subtle. Exploiting the information provided in the top-k predicted classes boosts the final prediction of a model. We propose Guided Zoom, a novel way in which explainability could be used to improve model performance. We do so by making sure the model has "the right reasons" for a prediction. The reason/evidence upon which a deep neural network makes a prediction is defined to be the grounding, in the pixel space, for a specific class conditional probability in the model output. Guided Zoom examines how reasonable the evidence used to make each of the top-k predictions is. Test time evidence is deemed reasonable if it is coherent with evidence used to make similar correct decisions at training time. This leads to better informed predictions. We explore a variety of grounding techniques and study their complementarity for computing evidence. We show that Guided Zoom results in an improvement of a model's classification accuracy and achieves state-of-the-art classification performance on four fine-grained classification datasets. Our code is available at https://github.com/andreazuna89/Guided-Zoom.
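
A minimal sketch of the re-ranking step, assuming an `evidence_emb` vector extracted from the test image's grounding (e.g., a pooled saliency-weighted feature) and per-class `prototypes` aggregated from training-time evidence; the combination rule and names are illustrative, not the paper's exact formulation.

```python
import numpy as np

def rerank_topk(logits, evidence_emb, prototypes, k=3, lam=0.5):
    """Re-rank the top-k classes by evidence coherence: each candidate's
    logit is adjusted by the cosine similarity between test-time
    evidence and that class's training-time evidence prototype."""
    topk = np.argsort(logits)[::-1][:k]

    def coherence(c):
        p = prototypes[c]
        return (evidence_emb @ p) / (
            np.linalg.norm(evidence_emb) * np.linalg.norm(p) + 1e-12)

    return max(topk, key=lambda c: logits[c] + lam * coherence(c))
```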

5.
IEEE Trans Pattern Anal Mach Intell ; 41(10): 2424-2437, 2019 Oct.
Article in English | MEDLINE | ID: mdl-31059428

ABSTRACT

Binary vector embeddings enable fast nearest neighbor retrieval in large databases of high-dimensional objects, and play an important role in many practical applications, such as image and video retrieval. We study the problem of learning binary vector embeddings under a supervised setting, also known as hashing. We propose a novel supervised hashing method based on optimizing an information-theoretic quantity, mutual information. We show that optimizing mutual information can reduce ambiguity in the induced neighborhood structure in the learned Hamming space, which is essential in obtaining high retrieval performance. To this end, we optimize mutual information in deep neural networks with minibatch stochastic gradient descent, with a formulation that maximally and efficiently utilizes available supervision. Experiments on four image retrieval benchmarks, including ImageNet, confirm the effectiveness of our method in learning high-quality binary embeddings for nearest neighbor retrieval.
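
The optimized quantity can be written down directly. Below is a non-differentiable NumPy estimate of the mutual information between the Hamming distance of a code pair and its neighborhood indicator; the paper's contribution is a minibatch SGD formulation that makes a quantity of this kind trainable end-to-end, which this sketch does not attempt.

```python
import numpy as np

def hamming_mutual_information(codes, neighbor):
    """Estimate I(D; N), with D the Hamming distance of a pair of binary
    codes (N_pts, b) and N the boolean neighbor indicator (N_pts, N_pts)."""
    b = codes.shape[1]
    D = (codes[:, None, :] != codes[None, :, :]).sum(-1)
    iu = np.triu_indices(len(codes), k=1)          # each pair once
    d, n = D[iu], neighbor[iu]

    def entropy(counts):
        p = counts / max(counts.sum(), 1)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    h_d = entropy(np.bincount(d, minlength=b + 1))
    h_d_given_n = (n.mean() * entropy(np.bincount(d[n], minlength=b + 1))
                   + (1 - n.mean()) * entropy(np.bincount(d[~n], minlength=b + 1)))
    return h_d - h_d_given_n                       # I(D; N) = H(D) - H(D|N)
```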

6.
IEEE Trans Pattern Anal Mach Intell ; 38(5): 889-902, 2016 May.
Article in English | MEDLINE | ID: mdl-26336114

ABSTRACT

We demonstrate the usefulness of surroundedness for eye fixation prediction by proposing a Boolean Map based Saliency model (BMS). In our formulation, an image is characterized by a set of binary images, which are generated by randomly thresholding the image's feature maps in a whitened feature space. Based on a Gestalt principle of figure-ground segregation, BMS computes a saliency map by discovering surrounded regions via topological analysis of Boolean maps. Furthermore, we draw a connection between BMS and the Minimum Barrier Distance to provide insight into why and how BMS can properly capture the surroundedness cue via Boolean maps. The strength of BMS is verified by its simplicity, efficiency, and superior performance compared with 10 state-of-the-art methods on seven eye tracking benchmark datasets.
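
A simplified Python rendition of the pipeline, assuming whitened feature maps as input: random thresholds produce Boolean maps, and only connected regions that do not touch the image border (i.e., surrounded regions) are activated. BMS's opening, normalization, and other post-processing steps are omitted.

```python
import numpy as np
from scipy.ndimage import label

def surrounded(bool_map):
    """Activate components of the map and its complement that do not
    touch the image border (the surroundedness cue)."""
    att = np.zeros(bool_map.shape)
    for bm in (bool_map, ~bool_map):
        lab, _ = label(bm)
        border = np.unique(np.concatenate(
            [lab[0], lab[-1], lab[:, 0], lab[:, -1]]))
        att += np.isin(lab, border, invert=True) & bm
    return att

def bms_saliency(feats, n_thresh=8, seed=0):
    """Mean attention map over Boolean maps sampled by randomly
    thresholding each (whitened) feature channel of `feats` (H, W, C)."""
    rng = np.random.default_rng(seed)
    maps = []
    for c in range(feats.shape[-1]):
        ch = feats[..., c]
        for t in rng.uniform(ch.min(), ch.max(), n_thresh):
            maps.append(surrounded(ch > t))
    return np.mean(maps, axis=0)
```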

7.
IEEE Trans Pattern Anal Mach Intell ; 37(12): 2558-72, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26539858

ABSTRACT

We propose a novel linearly augmented tree method for efficient scale and rotation invariant object matching. The proposed method enforces pairwise matching consistency defined on trees, together with high-order constraints on all the sites of a template. The pairwise constraints admit arbitrary metrics, while the high-order constraints use L1 norms and therefore can be linearized. Such a linearly augmented tree formulation introduces hyperedges and loops into the basic tree structure. However, unlike a general loopy graph, its special structure allows us to relax and decompose the optimization into a sequence of tree matching problems that are efficiently solvable by dynamic programming. The proposed method also works on continuous scale and rotation parameters, matching at arbitrarily large scales with the same efficiency. Our experiments on ground truth data and a variety of real images and videos show that the proposed method is efficient, accurate, and reliable.
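
The tree-matching core that the decomposition relies on is classical and small enough to sketch; the linearized high-order terms and the sequence of relaxed subproblems that make up the full method are omitted. `unary[i][x]` is the cost of placing template site i at candidate x, and `pair[(i, j)][x][y]` the pairwise consistency cost.

```python
def tree_match_cost(children, unary, pair, root=0):
    """Exact MAP matching on a tree by leaf-to-root dynamic programming.
    Returns the minimal total cost at the root (argmin bookkeeping is
    omitted for brevity)."""
    def solve(i):
        cost = list(unary[i])                 # start from unary costs
        for j in children.get(i, []):         # fold in each subtree
            cj = solve(j)
            for x in range(len(cost)):
                cost[x] += min(cj[y] + pair[(i, j)][x][y]
                               for y in range(len(cj)))
        return cost
    return min(solve(root))
```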

8.
IEEE Trans Pattern Anal Mach Intell ; 34(4): 654-69, 2012 Apr.
Article in English | MEDLINE | ID: mdl-21808088

ABSTRACT

The goal of this work is to learn a parsimonious and informative representation for high-dimensional time series. Conceptually, this comprises two distinct yet tightly coupled tasks: learning a low-dimensional manifold and modeling the dynamical process. These two tasks have a complementary relationship as the temporal constraints provide valuable neighborhood information for dimensionality reduction and, conversely, the low-dimensional space allows dynamics to be learned efficiently. Solving these two tasks simultaneously allows important information to be exchanged mutually. If nonlinear models are required to capture the rich complexity of time series, then the learning problem becomes harder as the nonlinearities in both tasks are coupled. A divide, conquer, and coordinate method is proposed. The solution approximates the nonlinear manifold and dynamics using simple piecewise linear models. The interactions and coordinations among the linear models are captured in a graphical model. The model structure setup and parameter learning are done using a variational Bayesian approach, which enables automatic Bayesian model structure selection, hence solving the problem of overfitting. By exploiting the model structure, efficient inference and learning algorithms are obtained without oversimplifying the model of the underlying dynamical process. Evaluation of the proposed framework with competing approaches is conducted in three sets of experiments: dimensionality reduction and reconstruction using synthetic time series, video synthesis using a dynamic texture database, and human motion synthesis, classification, and tracking on a benchmark data set. In all experiments, the proposed approach provides superior performance.
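
At prediction time, the piecewise linear modeling idea reduces to blending local linear dynamics, as in the sketch below; the graphical model coordinating the pieces and the variational Bayesian structure learning are the paper's actual substance and are not reproduced. The `gating` interface and the `(A, b)` parameterization are illustrative assumptions.

```python
import numpy as np

def predict_next_state(x, models, gating):
    """One step of piecewise linear dynamics: soft-assign the latent
    state x to local linear models (A, b) via `gating` (responsibilities
    summing to one) and blend their one-step predictions."""
    r = gating(x)
    return sum(ri * (A @ x + b) for ri, (A, b) in zip(r, models))
```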


Subject(s)
Algorithms , Linear Models , Movement/physiology , Bayes Theorem , Computer Simulation , Gait/physiology , Humans , Nonlinear Dynamics , Pattern Recognition, Automated/methods
9.
IEEE Trans Pattern Anal Mach Intell ; 33(9): 1758-75, 2011 Sep.
Article in English | MEDLINE | ID: mdl-21383394

ABSTRACT

We propose a representation for scenes containing relocatable objects that can cause partial occlusions of people in a camera's field of view. In many practical applications, relocatable objects tend to appear often; therefore, models for them can be learned offline and stored in a database. We formulate an occluder-centric representation, called a graphical model layer, where a person's motion in the ground plane is defined as a first-order Markov process on activity zones, while image evidence is aggregated in 2D observation regions that are depth-ordered with respect to the occlusion mask of the relocatable object. We represent real-world scenes as a composition of depth-ordered, interacting graphical model layers, and account for image evidence in a way that handles mutual overlap of the observation regions and their occlusions by the relocatable objects. These layers interact: Proximate ground-plane zones of different model instances are linked to allow a person to move between the layers, and image evidence is shared between the observation regions of these models. We demonstrate our formulation in tracking pedestrians in the vicinity of parked vehicles. Our results compare favorably with a sprite-learning algorithm, with a pedestrian tracker based on deformable contours, and with pedestrian detectors.
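
The first-order Markov process on activity zones boils down to a standard forward filtering step, sketched below; how the depth-ordered observation regions and occlusion masks produce the per-zone likelihoods is the substance of the paper and is abstracted into `lik` here.

```python
import numpy as np

def zone_forward_step(prior, T, lik):
    """One forward step of the zone-level Markov filter: `prior` (Z,),
    transition matrix `T` (Z, Z) with T[i, j] = P(zone j | zone i), and
    per-zone observation likelihoods `lik` (Z,)."""
    post = lik * (T.T @ prior)
    return post / post.sum()
```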

10.
IEEE Trans Pattern Anal Mach Intell ; 33(3): 514-30, 2011 Mar.
Article in English | MEDLINE | ID: mdl-20548107

ABSTRACT

Object detection is challenging when the object class exhibits large within-class variations. In this work, we show that foreground-background classification (detection) and within-class classification of the foreground class (pose estimation) can be jointly learned in a multiplicative form of two kernel functions. Model training is accomplished via standard SVM learning. When the foreground object masks are provided in training, the detectors can also produce object segmentations. A tracking-by-detection framework to recover foreground state in video sequences is also proposed with our model. The advantages of our method are demonstrated on tasks of object detection, view angle estimation, and tracking. Our approach compares favorably to existing methods on hand and vehicle detection tasks. Quantitative tracking results are given on sequences of moving vehicles and human faces.
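
The multiplicative form is easy to state in code: the joint detector/pose kernel is the elementwise product of an appearance kernel and a within-class kernel, and training reduces to a standard SVM on the product Gram matrix. The Gaussian kernel choices below are illustrative, not the paper's.

```python
import numpy as np
from sklearn.svm import SVC

def rbf_gram(A, B, gamma):
    """Gaussian kernel matrix between row-stacked samples A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def train_joint_detector(X_app, X_pose, y, g_app=1.0, g_pose=1.0):
    """Train one SVM whose kernel multiplies an appearance kernel with a
    pose kernel, so detection and within-class (pose) classification
    are learned jointly."""
    K = rbf_gram(X_app, X_app, g_app) * rbf_gram(X_pose, X_pose, g_pose)
    return SVC(kernel="precomputed").fit(K, y)
```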


Subject(s)
Algorithms , Image Interpretation, Computer-Assisted/instrumentation , Numerical Analysis, Computer-Assisted/instrumentation , Pattern Recognition, Automated/methods , Phantoms, Imaging , Artificial Intelligence , Computer Simulation , Humans , Image Enhancement/methods , Imaging, Three-Dimensional/instrumentation , Learning , Markov Chains , Motion , Motor Vehicles , Reproducibility of Results , Sensitivity and Specificity , Subtraction Technique/instrumentation
11.
IEEE Trans Pattern Anal Mach Intell ; 32(3): 517-29, 2010 Mar.
Article in English | MEDLINE | ID: mdl-20075475

ABSTRACT

Identifying correspondences between trajectory segments observed from nonsynchronized cameras is important for reconstruction of the complete trajectory of moving targets in a large scene. Such a reconstruction can be obtained from motion data by comparing the trajectory segments and estimating both the spatial and temporal alignments. Exhaustive testing of all possible correspondences of trajectories over a temporal window is only viable in cases with a limited number of moving targets and large view overlaps. Therefore, alternative solutions are required for situations with several trajectories that are only partially visible in each view. In this paper, we propose a new method based on a view-invariant representation of trajectories, which is used to produce a sparse set of salient points for trajectory segments observed in each view. Only the neighborhoods at these salient points in the view-invariant representation are then used to estimate the spatial and temporal alignment of trajectory pairs in different views. It is demonstrated that, for planar scenes, the method is able to recover with good precision and efficiency both spatial and temporal alignments, even given relatively small overlap between views and arbitrary (unknown) temporal shifts of the cameras. The method also provides the same capabilities in the case of trajectories that are only locally planar, but exhibit some nonplanarity at a global level.
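
For the temporal part alone, alignment can be illustrated by scanning candidate time shifts between two 1D view-invariant signature sequences and scoring their overlap correlation, as below; the paper's actual method matches sparse salient points and recovers the spatial (planar) alignment as well, which this sketch ignores.

```python
import numpy as np

def best_time_shift(sig_a, sig_b, max_shift):
    """Scan integer time shifts between two 1D invariant signatures and
    return the shift with the highest overlap correlation."""
    best_c, best_s = -np.inf, 0
    for s in range(-max_shift, max_shift + 1):
        a = sig_a[max(0, s):]
        b = sig_b[max(0, -s):]
        n = min(len(a), len(b))
        if n < 2:
            continue
        c = np.corrcoef(a[:n], b[:n])[0, 1]
        if c > best_c:
            best_c, best_s = c, s
    return best_s
```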

12.
IEEE Trans Pattern Anal Mach Intell ; 31(9): 1685-99, 2009 Sep.
Article in English | MEDLINE | ID: mdl-19574627

ABSTRACT

Within the context of hand gesture recognition, spatiotemporal gesture segmentation is the task of determining, in a video sequence, where the gesturing hand is located and when the gesture starts and ends. Existing gesture recognition methods typically assume either known spatial segmentation or known temporal segmentation, or both. This paper introduces a unified framework for simultaneously performing spatial segmentation, temporal segmentation, and recognition. In the proposed framework, information flows both bottom-up and top-down. A gesture can be recognized even when the hand location is highly ambiguous and when information about when the gesture begins and ends is unavailable. Thus, the method can be applied to continuous image streams where gestures are performed in front of moving, cluttered backgrounds. The proposed method consists of three novel contributions: a spatiotemporal matching algorithm that can accommodate multiple candidate hand detections in every frame, a classifier-based pruning framework that enables accurate and early rejection of poor matches to gesture models, and a subgesture reasoning algorithm that learns which gesture models can falsely match parts of other longer gestures. The performance of the approach is evaluated on two challenging applications: recognition of hand-signed digits gestured by users wearing short-sleeved shirts, in front of a cluttered background, and retrieval of occurrences of signs of interest in a video database containing continuous, unsegmented signing in American Sign Language (ASL).
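
The heart of the spatiotemporal matching can be sketched as dynamic programming over (frame, model state, candidate detection) triples, so that spatial selection among candidate hands and temporal alignment are solved jointly; the classifier-based pruning and subgesture reasoning described above are left out, and `obs_cost`/`trans_cost` are placeholder interfaces.

```python
import math

def spatiotemporal_match(cands, n_states, obs_cost, trans_cost):
    """DP over (frame t, model state s, candidate c). States advance
    monotonically; every frame keeps all candidate hand detections in
    play. Returns the best total matching cost."""
    prev = {(0, c): obs_cost(0, 0, c) for c in range(len(cands[0]))}
    for t in range(1, len(cands)):
        cur = {}
        for s in range(n_states):
            for c in range(len(cands[t])):
                best = min((prev[(sp, cp)] + trans_cost(sp, s)
                            for sp in (s - 1, s)
                            for cp in range(len(cands[t - 1]))
                            if (sp, cp) in prev), default=math.inf)
                cur[(s, c)] = obs_cost(t, s, c) + best
        prev = cur
    return min((v for (s, _), v in prev.items() if s == n_states - 1),
               default=math.inf)
```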


Subject(s)
Algorithms , Gestures , Hand/anatomy & histology , Image Interpretation, Computer-Assisted/methods , Imaging, Three-Dimensional/methods , Pattern Recognition, Automated/methods , Sign Language , Artificial Intelligence , Humans , Image Enhancement/methods , Reproducibility of Results , Sensitivity and Specificity
13.
IEEE Trans Pattern Anal Mach Intell ; 31(7): 1264-77, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19443924

ABSTRACT

Sign language spotting is the task of detecting and recognizing signs in a signed utterance, in a set vocabulary. The difficulty of sign language spotting is that instances of signs vary in both motion and appearance. Moreover, signs appear within a continuous gesture stream, interspersed with transitional movements between signs in a vocabulary and nonsign patterns (which include out-of-vocabulary signs, epentheses, and other movements that do not correspond to signs). In this paper, a novel method for designing threshold models in a conditional random field (CRF) model is proposed which provides an adaptive threshold for distinguishing between signs in a vocabulary and nonsign patterns. A short-sign detector, a hand appearance-based sign verification method, and a subsign reasoning method are included to further improve sign language spotting accuracy. Experiments demonstrate that our system can spot signs from continuous data with an 87.0 percent spotting rate and can recognize signs from isolated data with a 93.5 percent recognition rate, versus 73.5 percent and 85.4 percent, respectively, for CRFs without a threshold model, short-sign detection, subsign reasoning, and hand appearance-based sign verification. Our system can also achieve a 15.0 percent sign error rate (SER) on continuous data and a 6.4 percent SER on isolated data, versus 76.2 percent and 14.5 percent, respectively, for conventional CRFs.
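
The adaptive-threshold idea can be caricatured in a few lines: the nonsign (threshold) model is scored like any vocabulary sign, and a detection is accepted only if some sign beats it. The `margin` parameter and dictionary interface are assumptions for illustration; the paper learns the threshold model inside the CRF.

```python
def spot_sign(sign_scores, nonsign_score, margin=0.0):
    """Accept the best in-vocabulary sign for a candidate segment only if
    its score exceeds the learned nonsign (threshold) model's score.
    `sign_scores` maps sign labels to CRF scores for the segment."""
    best = max(sign_scores, key=sign_scores.get)
    return best if sign_scores[best] > nonsign_score + margin else None
```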


Subject(s)
Algorithms , Artificial Intelligence , Image Interpretation, Computer-Assisted/methods , Pattern Recognition, Automated/methods , Sign Language , Subtraction Technique , Computer Simulation , Image Enhancement/methods , Models, Statistical , Reproducibility of Results , Sensitivity and Specificity
14.
IEEE Trans Pattern Anal Mach Intell ; 30(3): 477-92, 2008 Mar.
Article in English | MEDLINE | ID: mdl-18195441

ABSTRACT

This paper proposes a method for detecting object classes that exhibit variable shape structure in heavily cluttered images. The term "variable shape structure" is used to characterize object classes in which some shape parts can be repeated an arbitrary number of times, some parts can be optional, and some parts can have several alternative appearances. Hidden State Shape Models (HSSMs), a generalization of Hidden Markov Models (HMMs), are introduced to model object classes of variable shape structure using a probabilistic framework. A polynomial inference algorithm automatically determines object location, orientation, scale and structure by finding the globally optimal registration of model states with the image features, even in the presence of clutter. Experiments with real images demonstrate that the proposed method can localize objects of variable shape structure with high accuracy. For the task of hand shape localization and structure identification, the proposed method is significantly more accurate than previously proposed methods based on chamfer-distance matching. Furthermore, by integrating simple temporal constraints, the proposed method gains speed-ups of more than an order of magnitude, and produces highly accurate results in experiments on non-rigid hand motion tracking.
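
A Viterbi-style sketch of the registration that HSSMs perform, under the simplifying assumption that image features arrive as an ordered sequence: the transition structure encodes variable shape, with self-loops for repeatable parts, skip edges for optional parts, and branches for alternative appearances. Clutter handling and the paper's exact inference algorithm are omitted.

```python
import math

def hssm_register(n_feats, states, cost, trans, start, accept):
    """Min-cost registration of HSSM states to an ordered feature
    sequence of length n_feats. `trans[s]` is the set of allowed
    successor states of s; `cost(t, s)` scores assigning feature t to
    state s; `start`/`accept` are the legal initial/final states."""
    pred = {s: [p for p in states if s in trans[p]] for s in states}
    dp = {s: cost(0, s) if s in start else math.inf for s in states}
    for t in range(1, n_feats):
        dp_new = {}
        for s in states:
            step = min((dp[p] for p in pred[s]), default=math.inf)
            dp_new[s] = cost(t, s) + step
        dp = dp_new
    return min(dp[s] for s in accept)
```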


Subject(s)
Artificial Intelligence , Biometry/methods , Hand/anatomy & histology , Image Enhancement/methods , Image Interpretation, Computer-Assisted/methods , Models, Statistical , Pattern Recognition, Automated/methods , Algorithms , Computer Simulation , Data Interpretation, Statistical , Humans , Information Storage and Retrieval/methods , Markov Chains , Reproducibility of Results , Sensitivity and Specificity
15.
IEEE Trans Pattern Anal Mach Intell ; 30(1): 89-104, 2008 Jan.
Article in English | MEDLINE | ID: mdl-18000327

ABSTRACT

This paper describes BoostMap, a method for efficient nearest neighbor retrieval under computationally expensive distance measures. Database and query objects are embedded into a vector space, in which distances can be measured efficiently. Each embedding is treated as a classifier that predicts, for any three objects X, A, B, whether X is closer to A or to B. It is shown that a linear combination of such embedding-based classifiers naturally corresponds to an embedding and a distance measure. Based on this property, the BoostMap method reduces the problem of embedding construction to the classical boosting problem of combining many weak classifiers into an optimized strong classifier. The classification accuracy of the resulting strong classifier is a direct measure of the amount of nearest neighbor structure preserved by the embedding. An important property of BoostMap is that the embedding optimization criterion is equally valid in both metric and non-metric spaces. Performance is evaluated in databases of hand images, handwritten digits, and time series. In all cases, BoostMap significantly improves retrieval efficiency with small losses in accuracy compared to brute-force search. Moreover, BoostMap significantly outperforms existing nearest neighbor retrieval methods, such as Lipschitz embeddings, FastMap, and VP-trees.
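
A rough sketch of the boosting loop under the simplest family of weak embeddings, F_r(x) = d(x, r) for a reference object r: each weak learner classifies a triple (x, a, b), where ground truth says x is closer to a, by comparing 1D embedded distances. The AdaBoost-style reweighting below is a generic stand-in for the paper's exact training procedure.

```python
import numpy as np

def train_boostmap(objs, dist, triples, n_dims):
    """Greedily pick reference objects whose 1D embeddings best preserve
    proximity order on weighted triples (x, a, b). Returns the chosen
    references with their weights, i.e., a weighted-L1 embedding."""
    w = np.ones(len(triples)) / len(triples)
    chosen = []
    for _ in range(n_dims):
        best = None
        for r in objs:
            # +1 if the 1D embedding agrees that x is closer to a than b
            h = np.array([np.sign(abs(dist(x, r) - dist(b, r))
                                  - abs(dist(x, r) - dist(a, r)))
                          for x, a, b in triples])
            err = w[h <= 0].sum()
            if best is None or err < best[0]:
                best = (err, r, h)
        err, r, h = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        chosen.append((r, alpha))
        w *= np.exp(-alpha * h)     # upweight misclassified triples
        w /= w.sum()
    return chosen
```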


Subject(s)
Algorithms , Artificial Intelligence , Image Enhancement/methods , Image Interpretation, Computer-Assisted/methods , Imaging, Three-Dimensional/methods , Pattern Recognition, Automated/methods , Subtraction Technique , Reproducibility of Results , Sensitivity and Specificity
16.
IEEE Trans Pattern Anal Mach Intell ; 26(7): 862-77, 2004 Jul.
Article in English | MEDLINE | ID: mdl-18579945

ABSTRACT

A novel approach for real-time skin segmentation in video sequences is described. The approach enables reliable skin segmentation despite wide variation in illumination during tracking. An explicit second order Markov model is used to predict evolution of the skin-color (HSV) histogram over time. Histograms are dynamically updated based on feedback from the current segmentation and predictions of the Markov model. The evolution of the skin-color distribution at each frame is parameterized by translation, scaling, and rotation in color space. Consequent changes in geometric parameterization of the distribution are propagated by warping and resampling the histogram. The parameters of the discrete-time dynamic Markov model are estimated using Maximum Likelihood Estimation and also evolve over time. The accuracy of the new dynamic skin color segmentation algorithm is compared to that obtained via a static color model. Segmentation accuracy is evaluated using labeled ground-truth video sequences taken from staged experiments and popular movies. An overall increase in segmentation accuracy of up to 24 percent is observed in 17 out of 21 test sequences. In all but one case, the skin-color classification rates for our system were higher, with background classification rates comparable to those of the static segmentation.
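
One feedback iteration of the dynamic model can be sketched as blending a predicted histogram with the histogram observed from the current segmentation, as below. The blend weight and fixed binning are assumptions; in the paper the prediction comes from the learned second-order Markov model over translation, scaling, and rotation of the color distribution, with the histogram warped and resampled accordingly.

```python
import numpy as np

def update_skin_histogram(pred_hist, skin_pixels, bins, alpha=0.5):
    """Blend the model-predicted skin-color histogram with the histogram
    of HSV pixels (N, 3), assumed normalized to [0, 1], labeled skin by
    the current segmentation. `bins` must match `pred_hist`'s binning."""
    obs, _ = np.histogramdd(skin_pixels, bins=bins, range=[(0, 1)] * 3)
    obs /= max(obs.sum(), 1)
    return alpha * pred_hist + (1 - alpha) * obs
```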


Subject(s)
Color , Colorimetry/methods , Image Interpretation, Computer-Assisted/methods , Lighting/methods , Pattern Recognition, Automated/methods , Skin Physiological Phenomena , Video Recording/methods , Algorithms , Artificial Intelligence , Humans , Image Enhancement/methods