1.
Article in English | MEDLINE | ID: mdl-35533174

ABSTRACT

Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious and expensive, and it prevents scalability. In this work, we propose to avoid manual annotation and to generate a large-scale training dataset for video question answering by making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and the VideoQA feature probe evaluation setting, and show excellent results. Furthermore, our method achieves competitive results on the MSRVTT-QA, ActivityNet-QA, MSVD-QA and How2QA datasets. We also show that our approach generalizes to another source of web video and text data: we generate the WebVidVQA3M dataset from videos with alt-text annotations and show its benefits for training VideoQA models. Finally, for a detailed evaluation, we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.
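To make the contrastive training procedure concrete, here is a minimal NumPy sketch of an InfoNCE-style loss between video-question embeddings and answer embeddings; the embedding inputs, batch construction, and exact loss form are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def contrastive_loss(vq_emb: np.ndarray, ans_emb: np.ndarray) -> float:
    """InfoNCE-style contrastive loss. Row i of each matrix is assumed to
    come from the same (video, question, answer) triplet; the other rows
    of ans_emb serve as in-batch negatives."""
    logits = vq_emb @ ans_emb.T                    # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob.diagonal().mean())      # push matched pairs up

rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))
```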

2.
IEEE Trans Pattern Anal Mach Intell ; 43(6): 2014-2028, 2021 06.
Article in English | MEDLINE | ID: mdl-31880540

ABSTRACT

Performing data augmentation when learning deep neural networks is known to be important for training visual recognition systems. By artificially increasing the number of training examples, it helps reduce overfitting and improves generalization. While simple image transformations can already improve predictive performance in most vision tasks, larger gains can be obtained by leveraging task-specific prior knowledge. In this work, we consider object detection as well as semantic and instance segmentation, and augment the training images by blending objects into existing scenes, using instance segmentation annotations. We observe that randomly pasting objects on images hurts performance, unless the object is placed in the right context. To resolve this issue, we propose an explicit context model using a convolutional neural network, which predicts whether an image region is suitable for placing a given object or not. In our experiments, we show that our approach is able to improve object detection and semantic and instance segmentation on the PASCAL VOC12 and COCO datasets, with significant gains in a limited-annotation scenario, i.e., when only one category is annotated. We also show that the method is not limited to datasets that come with expensive pixel-wise instance annotations and can be used when only bounding boxes are available, by employing weakly supervised learning for instance mask approximation.
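A minimal sketch of the augmentation step, assuming a hypothetical context_score already produced by a separate context CNN (not implemented here); the blending itself is plain alpha compositing of an RGBA object cutout into the scene.

```python
import numpy as np

def paste_object(scene: np.ndarray, obj_rgba: np.ndarray,
                 x: int, y: int, context_score: float,
                 threshold: float = 0.5) -> np.ndarray:
    """Blend a segmented object into `scene` at (x, y), but only when the
    context model considers the location plausible; random placements are
    skipped, since they were observed to hurt detection performance."""
    if context_score < threshold:
        return scene
    h, w = obj_rgba.shape[:2]
    alpha = obj_rgba[..., 3:4].astype(float) / 255.0   # instance mask
    region = scene[y:y + h, x:x + w].astype(float)
    blended = alpha * obj_rgba[..., :3] + (1.0 - alpha) * region
    scene[y:y + h, x:x + w] = blended.astype(scene.dtype)
    return scene
```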

3.
IEEE Trans Pattern Anal Mach Intell ; 42(5): 1146-1161, 2020 05.
Article in English | MEDLINE | ID: mdl-30640602

ABSTRACT

We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images. Key to our approach is the generation and scoring of a number of pose proposals per image, which allows us to predict 2D and 3D poses of multiple people simultaneously. Hence, our approach does not require an approximate localization of the humans for initialization. Our Localization-Classification-Regression architecture, named LCR-Net, contains three main components: 1) a pose proposal generator that suggests candidate poses at different locations in the image; 2) a classifier that scores the different pose proposals; and 3) a regressor that refines pose proposals in both 2D and 3D. All three stages share the convolutional feature layers and are trained jointly. The final pose estimation is obtained by integrating over neighboring pose hypotheses, which is shown to improve over a standard non-maximum suppression algorithm. Our method recovers full-body 2D and 3D poses, hallucinating plausible body parts when people are partially occluded or truncated by the image boundary. Our approach significantly outperforms the state of the art in 3D pose estimation on Human3.6M, a controlled environment. Moreover, it shows promising results on real images for both the single-person and multi-person subsets of the MPII 2D pose benchmark, and demonstrates satisfactory 3D pose results even for multi-person images.
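The pose integration step can be sketched as follows: instead of keeping only the top proposal (as hard non-maximum suppression would), neighboring hypotheses are averaged with their classification scores as weights. The pose format and neighborhood radius below are illustrative assumptions.

```python
import numpy as np

def integrate_pose_hypotheses(poses: np.ndarray, scores: np.ndarray,
                              radius: float = 20.0) -> np.ndarray:
    """Score-weighted average of pose proposals close to the top-scoring
    one, instead of keeping only the top proposal as hard NMS would.
    poses: (N, J, 2) array of N proposals over J 2D joints."""
    best = scores.argmax()
    # Mean per-joint distance to the best proposal defines "neighboring".
    dists = np.linalg.norm(poses - poses[best], axis=-1).mean(axis=1)
    neighbors = dists < radius
    w = scores[neighbors] / scores[neighbors].sum()
    return (w[:, None, None] * poses[neighbors]).sum(axis=0)

poses = np.random.default_rng(6).uniform(0, 100, size=(5, 13, 2))
scores = np.array([0.1, 0.9, 0.7, 0.2, 0.05])
print(integrate_pose_hypotheses(poses, scores).shape)   # (13, 2)
```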


Subjects
Imaging, Three-Dimensional/methods; Neural Networks, Computer; Posture/physiology; Databases, Factual; Humans; Video Recording
4.
IEEE Trans Pattern Anal Mach Intell ; 40(6): 1510-1517, 2018 06.
Article in English | MEDLINE | ID: mdl-28600238

ABSTRACT

Typical human actions last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of a few video frames, failing to model actions at their full temporal extent. In this work we learn video representations using neural networks with long-term temporal convolutions (LTC). We demonstrate that LTC-CNN models with increased temporal extents improve the accuracy of action recognition. We also study the impact of different low-level representations, such as raw video pixel values and optical flow vector fields, and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition: UCF101 (92.7%) and HMDB51 (67.2%).
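A minimal PyTorch sketch of one space-time convolution block applied to a long clip; the exact LTC-CNN layer configuration differs, so treat the channel counts and clip size as assumptions.

```python
import torch
import torch.nn as nn

class LTCBlock(nn.Module):
    """One 3D-convolution block operating on long clips (e.g., 100 frames
    rather than the more common 16), so temporal structure is modeled at
    a larger extent."""
    def __init__(self, in_ch: int = 3, out_ch: int = 64):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.MaxPool3d(kernel_size=2)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, frames, height, width)
        return self.pool(torch.relu(self.conv(clip)))

x = torch.randn(2, 3, 100, 58, 58)   # 100-frame clips at low resolution
print(LTCBlock()(x).shape)           # torch.Size([2, 64, 50, 29, 29])
```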

5.
IEEE Trans Pattern Anal Mach Intell ; 40(7): 1711-1725, 2018 07.
Article in English | MEDLINE | ID: mdl-28708543

ABSTRACT

Finding image correspondences remains a challenging problem in the presence of intra-class variations and large changes in scene layout. Semantic flow methods are designed to handle images depicting different instances of the same object or scene category. We introduce a novel approach to semantic flow, dubbed proposal flow, that establishes reliable correspondences using object proposals. Unlike prevailing semantic flow approaches that operate on pixels or regularly sampled local regions, proposal flow benefits from the characteristics of modern object proposals, which exhibit high repeatability at multiple scales, and can take advantage of both local and geometric consistency constraints among proposals. We also show that the corresponding sparse proposal flow can effectively be transformed into a conventional dense flow field. We introduce two new challenging datasets that can be used to evaluate both general semantic flow techniques and region-based approaches such as proposal flow. We use these benchmarks to compare different matching algorithms, object proposals, and region features within proposal flow against the state of the art in semantic flow. This comparison, along with experiments on standard datasets, demonstrates that proposal flow significantly outperforms existing semantic flow methods in various settings.

6.
IEEE Trans Pattern Anal Mach Intell ; 39(1): 189-203, 2017 01.
Article in English | MEDLINE | ID: mdl-26930676

ABSTRACT

Object category localization is a challenging problem in computer vision. Standard supervised training requires bounding box annotations of object instances. This time-consuming annotation process is sidestepped in weakly supervised learning. In this case, the supervised information is restricted to binary labels that indicate the absence/presence of object instances in the image, without their locations. We follow a multiple-instance learning approach that iteratively trains the detector and infers the object locations in the positive training images. Our main contribution is a multi-fold multiple instance learning procedure, which prevents training from prematurely locking onto erroneous object locations. This procedure is particularly important when using high-dimensional representations, such as Fisher vectors and convolutional neural network features. We also propose a window refinement method, which improves the localization accuracy by incorporating an objectness prior. We present a detailed experimental evaluation using the PASCAL VOC 2007 dataset, which verifies the effectiveness of our approach.
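The multi-fold re-localization loop can be sketched as below; the linear scoring rule stands in for the actual detector training, and the feature shapes are assumptions.

```python
import numpy as np

def multifold_mil(pos_windows, neg_feats, n_folds=3, n_iters=5):
    """Multi-fold MIL sketch: windows for images in fold k are re-localized
    with a model trained on the other folds, so training cannot lock onto
    the very windows it selected.
    pos_windows: list of (n_candidates, d) arrays, one per positive image.
    neg_feats: (n_neg, d) array of negative-window features."""
    folds = np.arange(len(pos_windows)) % n_folds
    selected = [wins[0] for wins in pos_windows]        # arbitrary init
    w = None
    for _ in range(n_iters):
        for k in range(n_folds):
            train = [s for i, s in enumerate(selected) if folds[i] != k]
            # Stand-in for detector training: a mean-difference direction.
            w = np.mean(train, axis=0) - neg_feats.mean(axis=0)
            for i in np.where(folds == k)[0]:           # re-localize fold k
                selected[i] = pos_windows[i][np.argmax(pos_windows[i] @ w)]
    return w, selected

rng = np.random.default_rng(1)
w, sel = multifold_mil([rng.normal(size=(10, 16)) for _ in range(9)],
                       rng.normal(size=(50, 16)))
```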

7.
IEEE Trans Pattern Anal Mach Intell ; 39(1): 87-101, 2017 01.
Article in English | MEDLINE | ID: mdl-26955016

ABSTRACT

We introduce an Expanded Parts Model (EPM) for recognizing human attributes (e.g., young, short hair, wearing suits) and actions (e.g., running, jumping) in still images. An EPM is a collection of part templates that are learnt discriminatively to explain specific scale-space regions in the images (in human-centric coordinates). This is in contrast to current models, which consist of relatively few (i.e., a mixture of) 'average' templates. An EPM uses only a subset of the parts to score an image, and it scores the image sparsely in space, i.e., it ignores redundant and random background in an image. To learn our model, we propose an algorithm that automatically mines parts and learns the corresponding discriminative templates, together with their respective locations, from a large number of candidate parts. We validate our method on three recent challenging datasets of human attributes and actions, obtaining convincing qualitative and state-of-the-art quantitative results.

8.
IEEE Trans Pattern Anal Mach Intell ; 38(11): 2327-2334, 2016 11.
Article in English | MEDLINE | ID: mdl-27071159

ABSTRACT

Object detection is one of the most important challenges in computer vision. Object detectors are usually trained on bounding boxes from still images. Recently, video has been used as an alternative source of data. Yet, for a given test domain (image or video), the performance of the detector depends on the domain it was trained on. In this paper, we examine the reasons behind this performance gap. We define and evaluate different domain shift factors: spatial location accuracy, appearance diversity, image quality and aspect distribution. We examine the impact of these factors by comparing performance before and after factoring them out. The results show that all four factors affect the performance of the detectors and that their combined effect explains nearly the whole performance gap.

9.
IEEE Trans Pattern Anal Mach Intell ; 38(6): 1084-98, 2016 06.
Article in English | MEDLINE | ID: mdl-26441445

ABSTRACT

The bag-of-words (BoW) model treats images as sets of local descriptors and represents them by visual word histograms. The Fisher vector (FV) representation extends BoW by considering the first- and second-order statistics of local descriptors. In both representations, local descriptors are assumed to be independent and identically distributed (iid), which is a poor assumption from a modeling perspective. It has been experimentally observed that the performance of BoW and FV representations can be improved by employing discounting transformations such as power normalization. In this paper, we introduce non-iid models by treating the model parameters as latent variables which are integrated out, rendering all local regions dependent. Using the Fisher kernel principle, we encode an image by the gradient of the data log-likelihood w.r.t. the model hyper-parameters. Our models naturally generate discounting effects in the representations, suggesting that such transformations have proven successful because they closely correspond to the representations obtained for non-iid models. To enable tractable computation, we rely on variational free-energy bounds to learn the hyper-parameters and to compute approximate Fisher kernels. Our experimental evaluation validates that our models lead to performance improvements comparable to using power normalization, as employed in state-of-the-art feature aggregation methods.
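For reference, the discounting transformation mentioned above — signed power normalization followed by L2 normalization — is a one-liner; the exponent value below is the common default, not taken from this paper.

```python
import numpy as np

def power_normalize(fv: np.ndarray, rho: float = 0.5) -> np.ndarray:
    """Signed power ('rooting') normalization: shrinks large components and
    boosts small ones, mimicking the discounting effect that the non-iid
    models above produce naturally."""
    out = np.sign(fv) * np.abs(fv) ** rho
    return out / (np.linalg.norm(out) + 1e-12)   # follow with L2 norm

print(power_normalize(np.array([4.0, -0.25, 0.0, 1.0])))
```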

10.
IEEE Trans Pattern Anal Mach Intell ; 38(7): 1425-38, 2016 07.
Article in English | MEDLINE | ID: mdl-26452251

ABSTRACT

Attributes act as intermediate representations that enable parameter sharing between classes, a must when training data is scarce. We propose to view attribute-based image classification as a label-embedding problem: each class is embedded in the space of attribute vectors. We introduce a function that measures the compatibility between an image and a label embedding. The parameters of this function are learned on a training set of labeled samples to ensure that, given an image, the correct classes rank higher than the incorrect ones. Results on the Animals With Attributes and Caltech-UCSD-Birds datasets show that the proposed framework outperforms the standard Direct Attribute Prediction baseline in a zero-shot learning scenario. Label embedding enjoys a built-in ability to leverage alternative sources of information instead of, or in addition to, attributes, such as class hierarchies or textual descriptions. Moreover, label embedding encompasses the whole range of learning settings, from zero-shot learning to regular learning with a large number of labeled examples.
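The compatibility function can be sketched as a bilinear form F(x, y) = theta(x)^T W phi(y); the dimensions and binary attribute vectors below are illustrative (85 attributes echoes the Animals With Attributes setup).

```python
import numpy as np

def compatibility(img_feat: np.ndarray, W: np.ndarray,
                  label_emb: np.ndarray) -> np.ndarray:
    """Bilinear compatibility between one image feature theta(x) and the
    attribute embedding phi(y) of every class (rows of label_emb)."""
    return img_feat @ W @ label_emb.T

rng = np.random.default_rng(2)
W = rng.normal(size=(64, 85))                    # learned on seen classes
attrs = rng.integers(0, 2, size=(10, 85)).astype(float)  # class embeddings
scores = compatibility(rng.normal(size=64), W, attrs)
print(scores.argmax())   # zero-shot prediction: highest-ranked class
```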

11.
IEEE Trans Pattern Anal Mach Intell ; 36(3): 507-20, 2014 Mar.
Article in English | MEDLINE | ID: mdl-24457507

ABSTRACT

We benchmark several SVM objective functions for large-scale image classification. We consider one-versus-rest, multiclass, ranking, and weighted approximate ranking SVMs. A comparison of online and batch methods for optimizing the objectives shows that online methods perform as well as batch methods in terms of classification accuracy, but with a significant gain in training speed. Using stochastic gradient descent, we can scale the training to millions of images and thousands of classes. Our experimental evaluation shows that ranking-based algorithms do not outperform the one-versus-rest strategy when a large number of training examples are used. Furthermore, the gap in accuracy between the different algorithms shrinks as the dimension of the features increases. We also show that learning the optimal rebalancing of positive and negative examples through cross-validation can result in a significant improvement for the one-versus-rest strategy. Finally, early stopping can be used as an effective regularization strategy when training with online algorithms. Following these "good practices," we were able to improve the state of the art on a large subset of 10K classes and 9M images of ImageNet from 16.7 percent Top-1 accuracy to 19.1 percent.
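A minimal SGD sketch of the one-versus-rest hinge-loss training with a positive-rebalancing weight; plain SGD on the L2-regularized hinge loss, with `beta` standing in for the cross-validated rebalancing factor.

```python
import numpy as np

def sgd_ovr_hinge(X, y, cls, epochs=5, lr=0.01, lam=1e-4, beta=1.0):
    """One-versus-rest binary SVM for class `cls`, trained by SGD on the
    L2-regularized hinge loss. `beta` up-weights positives (the rebalancing
    tuned by cross-validation); early stopping amounts to a small `epochs`."""
    rng = np.random.default_rng(3)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t = 1.0 if y[i] == cls else -1.0
            c = beta if t > 0 else 1.0
            if t * (X[i] @ w) < 1.0:           # margin violated: full update
                w += lr * (c * t * X[i] - lam * w)
            else:                              # only shrink (regularizer)
                w -= lr * lam * w
    return w
```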

12.
IEEE Trans Pattern Anal Mach Intell ; 35(11): 2782-95, 2013 Nov.
Article in English | MEDLINE | ID: mdl-24051735

ABSTRACT

We address the problem of localizing actions, such as opening a door, in hours of challenging video data. We propose a model based on a sequence of atomic action units, termed "actoms," that are semantically meaningful and characteristic for the action. Our actom sequence model (ASM) represents an action as a sequence of histograms of actom-anchored visual features, which can be seen as a temporally structured extension of the bag-of-features. Training requires the annotation of actoms for action examples. At test time, actoms are localized automatically based on a nonparametric model of the distribution of actoms, which also acts as a prior on an action's temporal structure. We present experimental results on two recent benchmarks for action localization: "Coffee and Cigarettes" and the "DLSBP" dataset. We also adapt our approach to a classification-by-localization setup and demonstrate its applicability on the challenging "Hollywood 2" dataset. We show that our ASM method outperforms the current state of the art in temporal action localization, as well as baselines that localize actions with a sliding-window method.


Subjects
Actigraphy/methods; Artificial Intelligence; Image Interpretation, Computer-Assisted/methods; Pattern Recognition, Automated/methods; Subtraction Technique; Video Recording/methods; Whole Body Imaging/methods; Algorithms; Humans
13.
IEEE Trans Pattern Anal Mach Intell ; 35(4): 835-48, 2013 Apr.
Article in English | MEDLINE | ID: mdl-22889819

ABSTRACT

We introduce an approach for learning human actions as interactions between persons and objects in realistic videos. Previous work typically represents actions with low-level features such as image gradients or optical flow. In contrast, we explicitly localize in space and track over time both the object and the person, and represent an action as the trajectory of the object w.r.t. the person's position. Our approach relies on state-of-the-art techniques for human detection, object detection, and tracking. We show that this results in human and object tracks of sufficient quality to model and localize human-object interactions in realistic videos. Our human-object interaction features capture the relative trajectory of the object w.r.t. the human. Experimental results on the Coffee and Cigarettes dataset, the video dataset of, and the Rochester Daily Activities dataset show that 1) our explicit human-object model is an informative cue for action recognition; and 2) it is complementary to traditional low-level descriptors such as 3D-HOG extracted over human tracks. We show that combining our human-object interaction features with 3D-HOG improves over their individual performance as well as over the state of the art.
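The interaction feature can be sketched as the object's trajectory expressed in person-centered, size-normalized coordinates; the [x, y, w, h] track format below is an assumption for illustration.

```python
import numpy as np

def interaction_feature(person_track: np.ndarray,
                        object_track: np.ndarray) -> np.ndarray:
    """Relative object trajectory w.r.t. the person. Both tracks are (T, 4)
    arrays of [x, y, w, h] boxes, one per frame; the offset is normalized
    by the person's width to factor out scale."""
    p_center = person_track[:, :2] + person_track[:, 2:] / 2.0
    o_center = object_track[:, :2] + object_track[:, 2:] / 2.0
    scale = person_track[:, 2:3]              # person width per frame
    return ((o_center - p_center) / scale).ravel()
```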


Subjects
Human Activities/classification; Image Processing, Computer-Assisted/methods; Pattern Recognition, Automated/methods; Video Recording/methods; Algorithms; Artificial Intelligence; Databases, Factual; Humans
14.
IEEE Trans Pattern Anal Mach Intell ; 34(3): 601-14, 2012 Mar.
Article in English | MEDLINE | ID: mdl-21808083

ABSTRACT

We introduce a weakly supervised approach for learning human actions modeled as interactions between humans and objects. Our approach is human-centric: We first localize a human in the image and then determine the object relevant for the action and its spatial relation with the human. The model is learned automatically from a set of still images annotated only with the action label. Our approach relies on a human detector to initialize the model learning. For robustness to various degrees of visibility, we build a detector that learns to combine a set of existing part detectors. Starting from humans detected in a set of images depicting the action, our approach determines the action object and its spatial relation to the human. Its final output is a probabilistic model of the human-object interaction, i.e., the spatial relation between the human and the object. We present an extensive experimental evaluation on the sports action data set from [1], the PASCAL Action 2010 data set [2], and a new human-object interaction data set.


Subjects
Models, Statistical; Pattern Recognition, Automated/methods; Algorithms; Artificial Intelligence; Humans; Image Interpretation, Computer-Assisted/methods
15.
IEEE Trans Pattern Anal Mach Intell ; 34(9): 1704-16, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22156101

ABSTRACT

This paper addresses the problem of large-scale image search. Three constraints have to be taken into account: search accuracy, efficiency, and memory usage. We first present and evaluate different ways of aggregating local image descriptors into a vector and show that the Fisher kernel achieves better performance than the reference bag-of-visual words approach for any given vector dimension. We then jointly optimize dimensionality reduction and indexing in order to obtain a precise vector comparison as well as a compact representation. The evaluation shows that the image representation can be reduced to a few dozen bytes while preserving high accuracy. Searching a 100 million image data set takes about 250 ms on one processor core.

16.
IEEE Trans Pattern Anal Mach Intell ; 33(1): 117-28, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21088323

ABSTRACT

This paper introduces a product quantization-based approach for approximate nearest neighbor search. The idea is to decompose the space into a Cartesian product of low-dimensional subspaces and to quantize each subspace separately. A vector is represented by a short code composed of its subspace quantization indices. The Euclidean distance between two vectors can be efficiently estimated from their codes. An asymmetric version increases precision, as it computes the approximate distance between a vector and a code. Experimental results show that our approach searches for nearest neighbors efficiently, in particular in combination with an inverted file system. Results for SIFT and GIST image descriptors show excellent search accuracy, outperforming three state-of-the-art approaches. The scalability of our approach is validated on a data set of two billion vectors.
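A compact NumPy sketch of the core mechanics — per-subspace codebooks, short integer codes, and asymmetric distance computation via lookup tables; the codebook sizes and toy k-means are simplifications of the paper's setup.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Tiny k-means, enough to train one sub-quantizer's codebook."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        assign = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                C[j] = X[assign == j].mean(0)
    return C

def pq_train(X, m=4, k=256):
    """Split d-dim vectors into m sub-vectors; one codebook per block."""
    return [kmeans(b, k) for b in np.split(X, m, axis=1)]

def pq_encode(X, codebooks):
    """Each vector becomes m small integers (one byte each for k <= 256)."""
    blocks = np.split(X, len(codebooks), axis=1)
    return np.stack([((b[:, None] - C[None]) ** 2).sum(-1).argmin(1)
                     for b, C in zip(blocks, codebooks)], axis=1)

def pq_adc(query, codes, codebooks):
    """Asymmetric distance: exact query against quantized database vectors,
    using one table of query-to-centroid distances per subspace."""
    q_blocks = np.split(query, len(codebooks))
    tables = [((C - qb) ** 2).sum(1) for qb, C in zip(q_blocks, codebooks)]
    return sum(t[codes[:, j]] for j, t in enumerate(tables))

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 32))
cbs = pq_train(X, m=4, k=16)                  # small k for the toy example
codes = pq_encode(X, cbs)
print(pq_adc(X[0], codes, cbs).argmin())      # typically 0: query itself
```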


Subjects
Algorithms; Image Interpretation, Computer-Assisted/methods; Image Processing, Computer-Assisted/methods; Pattern Recognition, Automated/methods; Artificial Intelligence; Cluster Analysis; Information Storage and Retrieval; Models, Statistical
17.
IEEE Trans Pattern Anal Mach Intell ; 32(1): 2-11, 2010 Jan.
Article in English | MEDLINE | ID: mdl-19926895

ABSTRACT

This paper introduces the contextual dissimilarity measure, which significantly improves the accuracy of bag-of-features-based image search. Our measure takes into account the local distribution of the vectors and iteratively estimates distance update terms in the spirit of Sinkhorn's scaling algorithm, thereby modifying the neighborhood structure. Experimental results show that our approach gives significantly better results than a standard distance and outperforms the state of the art in terms of accuracy on the Nistér-Stewénius and Lola data sets. This paper also evaluates the impact of a large number of parameters, including the number of descriptors, the clustering method, the visual vocabulary size, and the distance measure. The optimal parameter choice is shown to be quite context-dependent. In particular, using a large number of descriptors is beneficial only when combined with our dissimilarity measure. We have also evaluated two novel variants, multiple assignment and rank aggregation, which further improve accuracy at the cost of higher memory usage and lower efficiency.
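A sketch of an iterative distance update in the spirit described above: each point's distances are rescaled so that its mean neighborhood distance moves toward the global mean. The neighborhood size, exponent, and iteration count are illustrative assumptions, not the paper's exact update.

```python
import numpy as np

def contextual_dissimilarity(D, k=10, alpha=0.5, iters=5):
    """Iteratively rescale a symmetric (N, N) distance matrix so that every
    point's mean distance to its k nearest neighbors approaches the global
    mean — a Sinkhorn-like regularization of the neighborhood structure."""
    D = D.astype(float).copy()
    for _ in range(iters):
        # Mean distance to the k nearest neighbors, excluding self (col 0).
        nn = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)
        delta = (nn.mean() / nn) ** alpha       # per-point update terms
        D *= delta[:, None] * delta[None, :]    # symmetric rescaling
    return D
```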

18.
IEEE Trans Image Process ; 18(7): 1512-23, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19482579

ABSTRACT

Color names are required in real-world applications such as image retrieval and image annotation. Traditionally, they are learned from a collection of labeled color chips. These color chips are labeled with color names within a well-defined experimental setup by human test subjects. However, naming colors in real-world images differs significantly from this experimental setting. In this paper, we investigate how color names learned from color chips compare to color names learned from real-world images. To avoid hand labeling real-world images with color names, we use Google Image to collect a data set. Due to the limitations of Google Image, this data set contains a substantial quantity of wrongly labeled data. We propose several variants of the PLSA model to learn color names from this noisy data. Experimental results show that color names learned from real-world images significantly outperform color names learned from labeled color chips for both image retrieval and image annotation.

19.
IEEE Trans Pattern Anal Mach Intell ; 29(3): 477-91, 2007 Mar.
Article in English | MEDLINE | ID: mdl-17224617

ABSTRACT

This paper presents a novel representation for dynamic scenes composed of multiple rigid objects that may undergo different motions and are observed by a moving camera. Multiview constraints associated with groups of affine-covariant scene patches and a normalized description of their appearance are used to segment a scene into its rigid components, construct three-dimensional models of these components, and match instances of models recovered from different image sequences. The proposed approach has been applied to the detection and matching of moving objects in video sequences and to shot matching, i.e., the identification of shots that depict the same scene in a video clip.


Subjects
Algorithms; Artificial Intelligence; Image Enhancement/methods; Image Interpretation, Computer-Assisted/methods; Models, Theoretical; Pattern Recognition, Automated/methods; Video Recording/methods; Computer Simulation; Information Storage and Retrieval/methods; Motion; Movement; Reproducibility of Results; Sensitivity and Specificity; Subtraction Technique
20.
IEEE Trans Pattern Anal Mach Intell ; 27(10): 1615-30, 2005 Oct.
Article in English | MEDLINE | ID: mdl-16237996

ABSTRACT

In this paper, we compare the performance of descriptors computed for local interest regions, such as those extracted by the Harris-Affine detector. Many different descriptors have been proposed in the literature. It is unclear which descriptors are more appropriate and how their performance depends on the interest region detector. The descriptors should be distinctive and at the same time robust to changes in viewing conditions, as well as to errors of the detector. Our evaluation uses recall with respect to precision as its criterion and is carried out for different image transformations. We compare shape context, steerable filters, PCA-SIFT, differential invariants, spin images, SIFT, complex filters, moment invariants, and cross-correlation for different types of interest regions. We also propose an extension of the SIFT descriptor and show that it outperforms the original method. Furthermore, we observe that the ranking of the descriptors is mostly independent of the interest region detector and that the SIFT-based descriptors perform best. Moments and steerable filters show the best performance among the low-dimensional descriptors.
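The evaluation criterion can be sketched as below: sweep a matching threshold and record recall against 1 - precision, with ground-truth correspondences assumed to be given by the known image transformation.

```python
import numpy as np

def recall_vs_precision(desc_a, desc_b, gt_matches, thresholds):
    """For each distance threshold, match descriptors between two images and
    report (recall, 1 - precision). gt_matches is a set of (i, j) index
    pairs that are correct correspondences under the known transformation."""
    dists = np.linalg.norm(desc_a[:, None] - desc_b[None], axis=-1)
    curve = []
    for t in thresholds:
        matches = {(i, j) for i, j in zip(*np.where(dists < t))}
        correct = len(matches & gt_matches)
        recall = correct / max(len(gt_matches), 1)
        fp_rate = 1.0 - correct / max(len(matches), 1)   # 1 - precision
        curve.append((recall, fp_rate))
    return curve

rng = np.random.default_rng(5)
a = rng.normal(size=(20, 16))
b = a + 0.05 * rng.normal(size=(20, 16))   # same regions, slightly perturbed
print(recall_vs_precision(a, b, {(i, i) for i in range(20)}, [0.3, 0.6, 1.0]))
```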


Subjects
Algorithms; Artificial Intelligence; Data Interpretation, Statistical; Image Interpretation, Computer-Assisted/methods; Pattern Recognition, Automated/methods; Software Validation; Software; Computer Simulation; Image Enhancement/methods; Information Storage and Retrieval/methods; Models, Statistical