Results 1 - 7 of 7
1.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 3955-3971, 2024 May.
Article in English | MEDLINE | ID: mdl-38215322

ABSTRACT

Motion prediction is crucial for autonomous driving systems to understand complex driving scenarios and make informed decisions. However, this task is challenging due to the diverse behaviors of traffic participants and complex environmental contexts. In this paper, we propose Motion TRansformer (MTR) frameworks to address these challenges. The initial MTR framework utilizes a transformer encoder-decoder structure with learnable intention queries, enabling efficient and accurate prediction of future trajectories. By customizing intention queries for distinct motion modalities, MTR improves multimodal motion prediction while reducing reliance on dense goal candidates. The framework comprises two essential processes: global intention localization, identifying the agent's intent to enhance overall efficiency, and local movement refinement, adaptively refining predicted trajectories for improved accuracy. Moreover, we introduce an advanced MTR++ framework, extending the capability of MTR to simultaneously predict multimodal motion for multiple agents. MTR++ incorporates symmetric context modeling and mutually-guided intention querying modules to facilitate future behavior interaction among multiple agents, resulting in scene-compliant future trajectories. Extensive experimental results demonstrate that the MTR framework achieves state-of-the-art performance on the highly-competitive motion prediction benchmarks, while the MTR++ framework surpasses its precursor, exhibiting enhanced performance and efficiency in predicting accurate multimodal future trajectories for multiple agents.
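
The learnable intention queries are the core mechanism here: each query stands for one motion mode and attends to the encoded scene context through a transformer decoder, producing one trajectory and one confidence score per mode. Below is a minimal PyTorch sketch of that idea; the module layout, sizes, and single-stage decoding are illustrative assumptions, not the authors' MTR implementation (which adds global intention localization and iterative local refinement).

```python
# Sketch of learnable intention queries for multimodal trajectory prediction.
# NOT the authors' MTR code; sizes and structure are illustrative assumptions.
import torch
import torch.nn as nn

class IntentionQueryDecoder(nn.Module):
    def __init__(self, d_model=128, num_modes=6, horizon=80):
        super().__init__()
        # One learnable query per motion mode (intention).
        self.intention_queries = nn.Parameter(torch.randn(num_modes, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.traj_head = nn.Linear(d_model, horizon * 2)   # (x, y) per future step
        self.score_head = nn.Linear(d_model, 1)            # confidence per mode

    def forward(self, scene_tokens):
        # scene_tokens: (B, N, d_model) encoded agent / map context.
        B = scene_tokens.size(0)
        q = self.intention_queries.unsqueeze(0).expand(B, -1, -1)
        h = self.decoder(q, scene_tokens)                  # queries attend to context
        trajs = self.traj_head(h).reshape(B, h.size(1), -1, 2)
        scores = self.score_head(h).squeeze(-1)
        return trajs, scores                               # multimodal trajectories + scores

# Example: 4 scenes, 32 context tokens each
trajs, scores = IntentionQueryDecoder()(torch.randn(4, 32, 128))
```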

2.
IEEE Trans Pattern Anal Mach Intell ; 45(1): 123-136, 2023 01.
Article in English | MEDLINE | ID: mdl-35239475

ABSTRACT

Humans can robustly recognize and localize objects by using visual and/or auditory cues. While machines are able to do the same with visual data already, less work has been done with sounds. This work develops an approach for scene understanding purely based on binaural sounds. The considered tasks include predicting the semantic masks of sound-making objects, the motion of sound-making objects, and the depth map of the scene. To this aim, we propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360° camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of multiple vision 'teacher' methods and a sound 'student' method - the student method is trained to generate the same results as the teacher methods do. This way, the auditory system can be trained without using human annotations. To further boost the performance, we propose another novel auxiliary task, coined Spatial Sound Super-Resolution, to increase the directional resolution of sounds. We then formulate the four tasks into one end-to-end trainable multi-tasking network aiming to boost the overall performance. Experimental results show that 1) our method achieves good results for all four tasks, 2) the four tasks are mutually beneficial - training them together achieves the best performance, 3) the number and orientation of microphones are both important, and 4) features learned from the standard spectrogram and features obtained by the classic signal processing pipeline are complementary for auditory perception tasks. The data and code are released on the project page: https://www.trace.ethz.ch/publications/2020/sound_perception/index.html.
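
The cross-modal distillation recipe reduces to training the sound 'student' to reproduce the outputs of a frozen vision 'teacher' on time-aligned frames and spectrograms. The sketch below illustrates that training step; the stand-in networks and the MSE distillation loss are assumptions for illustration, not the authors' exact models.

```python
# Sketch of cross-modal distillation: a sound "student" mimics a frozen vision
# "teacher", so no human labels are needed. Networks and loss are stand-ins.
import torch
import torch.nn as nn

vision_teacher = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                               nn.Conv2d(16, 8, 1))           # stand-in segmentation net
audio_student = nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
                              nn.Conv2d(16, 8, 1))            # 8-channel binaural spectrograms
vision_teacher.eval()                                          # teacher is frozen
opt = torch.optim.Adam(audio_student.parameters(), lr=1e-4)

def distillation_step(frames, spectrograms):
    with torch.no_grad():
        target = vision_teacher(frames)                        # pseudo-labels from vision
    pred = audio_student(spectrograms)
    loss = nn.functional.mse_loss(pred, target)                # student mimics teacher
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example: batch of 2 video frames and time-aligned 8-mic spectrograms
print(distillation_step(torch.randn(2, 3, 64, 64), torch.randn(2, 8, 64, 64)))
```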


Subject(s)
Algorithms, Semantics, Humans, Sound, Learning, Cues
3.
IEEE Trans Pattern Anal Mach Intell ; 44(6): 3139-3153, 2022 06.
Article in English | MEDLINE | ID: mdl-33338013

ABSTRACT

We address the problem of semantic nighttime image segmentation and improve the state of the art by adapting daytime models to nighttime without using nighttime annotations. Moreover, we design a new evaluation framework to address the substantial uncertainty of semantics in nighttime images. Our central contributions are: 1) a curriculum framework to gradually adapt semantic segmentation models from day to night through progressively darker times of day, exploiting cross-time-of-day correspondences between daytime images from a reference map and dark images to guide the label inference in the dark domains; 2) a novel uncertainty-aware annotation and evaluation framework and metric for semantic segmentation, including image regions beyond human recognition capability in the evaluation in a principled fashion; 3) the Dark Zurich dataset, comprising 2416 unlabeled nighttime and 2920 unlabeled twilight images with correspondences to their daytime counterparts plus a set of 201 nighttime images with fine pixel-level annotations created with our protocol, which serves as a first benchmark for our novel evaluation. Experiments show that our map-guided curriculum adaptation significantly outperforms state-of-the-art methods on nighttime sets both for standard metrics and our uncertainty-aware metric. Furthermore, our uncertainty-aware evaluation reveals that selective invalidation of predictions can improve results on data with ambiguous content such as our benchmark and benefit safety-oriented applications involving invalid inputs.
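
The curriculum adaptation loop can be summarized as: train on labeled daytime data, then repeatedly pseudo-label the next darker split with the current model and fine-tune on those pseudo-labels. The sketch below captures that loop; the confidence threshold, ignore index, and helper functions are hypothetical placeholders rather than the paper's map-guided label inference using daytime correspondences.

```python
# Sketch of a day -> twilight -> night curriculum via confident pseudo-labels.
# Helper functions and thresholds are hypothetical, not the authors' code.
import torch

def pseudo_label(model, images, conf_thresh=0.9):
    """Keep only confident predictions; uncertain pixels get an ignore label."""
    with torch.no_grad():
        probs = torch.softmax(model(images), dim=1)
        conf, labels = probs.max(dim=1)
        labels[conf < conf_thresh] = 255        # 255 = ignore index
    return labels

def curriculum_adapt(model, day_loader, stage_loaders, finetune_fn):
    finetune_fn(model, day_loader)              # supervised daytime training
    for loader in stage_loaders:                # e.g. [twilight_loader, night_loader]
        pseudo_sets = [(imgs, pseudo_label(model, imgs)) for imgs, _ in loader]
        finetune_fn(model, pseudo_sets)         # adapt to the darker domain
    return model
```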


Subject(s)
Algorithms, Semantics, Curriculum, Humans, Image Processing, Computer-Assisted/methods, Uncertainty
4.
IEEE Trans Pattern Anal Mach Intell ; 44(7): 3614-3633, 2022 Jul.
Article in English | MEDLINE | ID: mdl-33497328

ABSTRACT

With the advent of deep learning, many dense prediction tasks, i.e., tasks that produce pixel-level predictions, have seen significant performance improvements. The typical approach is to learn these tasks in isolation, that is, a separate neural network is trained for each individual task. Yet, recent multi-task learning (MTL) techniques have shown promising results w.r.t. performance, computations and/or memory footprint, by jointly tackling multiple tasks through a learned shared representation. In this survey, we provide a well-rounded view on state-of-the-art deep learning approaches for MTL in computer vision, with an explicit emphasis on dense prediction tasks. Our contributions are as follows. First, we consider MTL from a network architecture point of view. We include an extensive overview and discuss the advantages/disadvantages of recent popular MTL models. Second, we examine various optimization methods to tackle the joint learning of multiple tasks. We summarize the qualitative elements of these works and explore their commonalities and differences. Finally, we provide an extensive experimental evaluation across a variety of dense prediction benchmarks to examine the pros and cons of the different methods, including both architectural and optimization-based strategies.
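
The simplest architecture family the survey covers is hard parameter sharing: one shared encoder feeds several task-specific heads, and the per-task losses are combined into a single objective. A minimal sketch follows; the toy encoder, the two example tasks, and the fixed loss weights are illustrative assumptions.

```python
# Sketch of hard parameter sharing for multi-task dense prediction.
# Encoder, tasks, and loss weights are toy choices for illustration.
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, num_classes=19):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.seg_head = nn.Conv2d(32, num_classes, 1)    # semantic segmentation
        self.depth_head = nn.Conv2d(32, 1, 1)            # monocular depth

    def forward(self, x):
        feats = self.encoder(x)                          # shared representation
        return self.seg_head(feats), self.depth_head(feats)

net = MultiTaskNet()
seg_logits, depth = net(torch.randn(2, 3, 64, 64))
seg_gt = torch.randint(0, 19, (2, 64, 64))
depth_gt = torch.rand(2, 1, 64, 64)
loss = nn.functional.cross_entropy(seg_logits, seg_gt) \
       + 0.5 * nn.functional.l1_loss(depth, depth_gt)   # fixed task weighting
loss.backward()
```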

5.
Article in English | MEDLINE | ID: mdl-30571632

ABSTRACT

Exemplar-based dynamic texture synthesis (EDTS) aims to generate new high-quality samples that are perceptually similar to a given input dynamic texture exemplar. This paper addresses the issue of learning the synthesizability of dynamic texture (DT) samples: given a DT sample, how can we estimate how well it can be synthesized by EDTS methods, and which EDTS algorithm is best suited to the task? To this end, we propose associating DT samples with synthesizability scores by learning regression models on a compiled dynamic texture dataset annotated in terms of synthesizability. More precisely, we first define the synthesizability of DT samples and characterize them by a set of spatiotemporal features. We then train regression models on this feature representation over the annotated dataset to predict the synthesizability scores of the DT samples, and learn classifiers to select the most suitable EDTS algorithm. We further perform selection, partitioning, and synthesizability prediction of the DT samples in a hierarchical scheme. The learned synthesizability is finally applied to detecting synthesizable regions in videos. Both quantitative and qualitative experiments demonstrate that our method can efficiently learn and predict the synthesizability of DT samples.
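
Operationally, the pipeline amounts to extracting a feature vector per dynamic texture sample, regressing its synthesizability score, and classifying which EDTS algorithm to apply. The sketch below mirrors that setup with random stand-in features and generic scikit-learn models; it is not the paper's feature set or its particular regressors and classifiers.

```python
# Sketch: regress a synthesizability score and pick an EDTS algorithm from
# spatiotemporal features. Random data and forest models are stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))            # spatiotemporal features per DT sample
scores = rng.uniform(size=200)            # annotated synthesizability in [0, 1]
best_algo = rng.integers(0, 3, size=200)  # index of the best EDTS method

score_model = RandomForestRegressor().fit(X, scores)
algo_model = RandomForestClassifier().fit(X, best_algo)

x_new = rng.normal(size=(1, 64))
print("predicted synthesizability:", score_model.predict(x_new)[0])
print("suggested EDTS algorithm:", algo_model.predict(x_new)[0])
```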

6.
IEEE Trans Pattern Anal Mach Intell ; 40(5): 1114-1127, 2018 05.
Article in English | MEDLINE | ID: mdl-28534767

ABSTRACT

Domain adaptation between diverse source and target domains is challenging, especially in real-world visual recognition tasks where the images and videos exhibit significant variations in viewpoints, illuminations, qualities, etc. In this paper, we propose a new approach for domain generalization and domain adaptation based on exemplar SVMs. Specifically, we decompose the source domain into many subdomains, each of which contains only one positive training sample and all negative samples. Each subdomain is relatively less diverse and is expected to have a simpler distribution. By training one exemplar SVM for each subdomain, we obtain a set of exemplar SVMs. To further exploit the inherent structure of the source domain, we introduce a nuclear-norm based regularizer into the objective function in order to enforce the exemplar SVMs to produce a low-rank output on training samples. In the prediction process, the confident exemplar SVM classifiers are selected and reweighted according to the distribution mismatch between each subdomain and the test sample in the target domain. We formulate our approach based on the logistic regression and least square SVM algorithms, which are referred to as low-rank exemplar SVMs (LRE-SVMs) and low-rank exemplar least square SVMs (LRE-LSSVMs), respectively. A fast algorithm is also developed for accelerating the training of LRE-LSSVMs. We further extend the Domain Adaptation Machine (DAM) to learn an optimal target classifier for domain adaptation, and show that our approach can also be applied to domain adaptation with an evolving target domain, where the target data distribution is gradually changing. Comprehensive experiments on object recognition and action recognition demonstrate the effectiveness of our approach for domain generalization and domain adaptation with fixed and evolving target domains.
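
The exemplar-SVM decomposition itself is straightforward: each positive source sample defines a subdomain and is trained against all negatives, and the resulting classifiers are combined at test time. The sketch below shows that baseline; the nuclear-norm (low-rank) regularizer and the mismatch-based reweighting that distinguish LRE-SVMs are omitted, and the toy data is an assumption.

```python
# Sketch of the exemplar-SVM baseline: one linear SVM per positive sample,
# trained against all negatives. Low-rank regularization and reweighting
# from the paper are omitted; data and parameters are illustrative.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
positives = rng.normal(loc=1.0, size=(20, 16))    # source-domain positive samples
negatives = rng.normal(loc=-1.0, size=(200, 16))  # all negative samples

exemplar_svms = []
for pos in positives:                              # one subdomain per positive
    X = np.vstack([pos[None, :], negatives])
    y = np.array([1] + [0] * len(negatives))
    exemplar_svms.append(LinearSVC(C=1.0).fit(X, y))

def ensemble_score(x):
    # Simple average of exemplar decision values (the paper instead reweights
    # them by the mismatch between each subdomain and the target sample).
    return np.mean([clf.decision_function(x[None, :])[0] for clf in exemplar_svms])

print(ensemble_score(rng.normal(loc=1.0, size=16)))
```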

7.
IEEE Trans Image Process ; 21(9): 4232-43, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22614643

ABSTRACT

We introduce the hierarchical Markov aspect model (HMAM), a computationally efficient graphical model for densely labeling large remote sensing images with their underlying terrain classes. HMAM resolves local ambiguities efficiently by combining the benefits of quadtree representations and aspect models: the former incorporate multiscale visual features and hierarchical smoothing to provide improved local label consistency, while the latter sharpen the labelings by focusing them on the classes that are most relevant for the broader local image context. The full HMAM model takes a grid of local hierarchical Markov quadtrees over image patches and augments it by incorporating a probabilistic latent semantic analysis aspect model over a larger local image tile at each level of the quadtree forest. Bag-of-words visual features are extracted for each level and patch; given these, the parent-child transition probabilities from the quadtree, and the label probabilities from the tile-level aspect models, an efficient forward-backward inference pass yields local posteriors over the class labels for each patch. Variational expectation-maximization is then used to train the complete model from either pixel-level or tile-keyword-level labelings. Experiments on a complete TerraSAR-X synthetic aperture radar terrain map with pixel-level ground truth show that HMAM is both accurate and efficient, providing significantly better results than comparable single-scale aspect models with only a modest increase in training and test complexity. Keyword-level training greatly reduces the cost of providing training data with little loss of accuracy relative to pixel-level training.
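
At the heart of the inference is a standard upward-downward (forward-backward) pass on a tree: child evidence is sent up through the parent-child transition matrix, combined with the tile-level aspect prior at the parent, and redistributed down to the children. The toy sketch below runs that pass for a single parent with four children; the transition matrix, prior, and leaf likelihoods are made-up values, not learned HMAM parameters.

```python
# Toy upward-downward pass for one parent patch with four children and K classes.
# All parameters are made-up; the real HMAM runs this over a full quadtree forest.
import numpy as np

K = 3
transition = np.full((K, K), 0.1) + 0.7 * np.eye(K)   # P(child label | parent label)
transition /= transition.sum(axis=1, keepdims=True)
aspect_prior = np.array([0.5, 0.3, 0.2])              # tile-level aspect-model prior
leaf_like = np.random.default_rng(0).random((4, K))   # P(features | label) per child

# Upward pass: each child sends a message about the parent's label.
up_msgs = leaf_like @ transition.T                    # (4, K): evidence for parent labels
parent_post = aspect_prior * up_msgs.prod(axis=0)
parent_post /= parent_post.sum()

# Downward pass: parent belief (minus the child's own message) flows back down.
child_posts = []
for c in range(4):
    others = parent_post / up_msgs[c]                 # remove child's own message
    post = leaf_like[c] * (others @ transition)       # combine with downward message
    child_posts.append(post / post.sum())
print(np.round(child_posts, 3))
```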
