1.
Article in English | MEDLINE | ID: mdl-38598386

ABSTRACT

Deep learning-based semantic segmentation solutions have yielded compelling results over the preceding decade. They encompass diverse network architectures (FCN-based or attention-based), along with various mask decoding schemes (parametric softmax-based or pixel-query-based). Despite the divergence, they can be grouped within a unified framework by interpreting the softmax weights or query vectors as learnable class prototypes. In light of this prototype view, we reveal inherent limitations within the parametric segmentation regime, and accordingly develop a nonparametric alternative based on non-learnable prototypes. In contrast to previous approaches that entail the learning of a single weight/query vector per class in a fully parametric manner, our approach represents each class as a set of non-learnable prototypes, relying solely upon the mean features of training pixels within that class. The pixel-wise prediction is thus achieved by nonparametric nearest prototype retrieval. This allows our model to directly shape the pixel embedding space by optimizing the arrangement between embedded pixels and anchored prototypes. It is able to accommodate an arbitrary number of classes with a constant number of learnable parameters. Through empirical evaluation with FCN-based and Transformer-based segmentation models (i.e., HRNet, Swin, SegFormer, Mask2Former) and backbones (i.e., ResNet, HRNet, Swin, MiT), our nonparametric framework shows superior performance on standard segmentation datasets (i.e., ADE20K, Cityscapes, COCO-Stuff), as well as in large-vocabulary semantic segmentation scenarios. We expect that this study will prompt a rethinking of the current de facto semantic segmentation model design.
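
A minimal sketch of the non-learnable prototype idea described above (not the authors' implementation, which uses several prototypes per class): each class prototype is the mean embedding of that class's training pixels, and every test pixel is assigned to the class of its nearest prototype by cosine similarity. All shapes and features below are toy placeholders for a backbone's pixel embeddings.

```python
import numpy as np

def build_prototypes(embeddings, labels, num_classes):
    """Non-learnable class prototypes: the mean embedding of each class's training pixels."""
    return np.stack([embeddings[labels == c].mean(axis=0) for c in range(num_classes)])

def predict(pixel_feats, protos):
    """Assign each pixel to the class of its nearest prototype (cosine similarity)."""
    f = pixel_feats / np.linalg.norm(pixel_feats, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return (f @ p.T).argmax(axis=1)          # (num_pixels,) predicted class indices

# toy usage: random vectors stand in for training-pixel and test-pixel embeddings
rng = np.random.default_rng(0)
train_feats, train_labels = rng.normal(size=(1000, 64)), rng.integers(0, 3, 1000)
protos = build_prototypes(train_feats, train_labels, num_classes=3)
print(predict(rng.normal(size=(5, 64)), protos))
```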

2.
Article in English | MEDLINE | ID: mdl-38593013

ABSTRACT

Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. It is becoming increasingly crucial in the field of embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we propose to address a more practical yet challenging counterpart setting: vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability to perform obstacle-avoiding control in continuous environments. ETPNav performs online topological mapping of environments by self-organizing predicted waypoints along a traversed path, without prior environmental experience. This allows the agent to break down the navigation procedure into high-level planning and low-level control. Concurrently, ETPNav utilizes a transformer-based cross-modal planner to generate navigation plans based on topological maps and instructions. The plan is then performed through an obstacle-avoiding controller that leverages a trial-and-error heuristic to prevent navigation from getting stuck in obstacles. Experimental results demonstrate the effectiveness of the proposed method. ETPNav yields more than 10% and 20% improvements over the prior state-of-the-art on the R2R-CE and RxR-CE datasets, respectively. Our code is available at https://github.com/MarSaKi/ETPNav.
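
To make the topological-mapping idea concrete, here is a hedged, self-contained sketch (not ETPNav's code): predicted waypoints are inserted as graph nodes, nearby nodes are linked, and a high-level plan is a shortest path over that graph. The connection radius and coordinates are illustrative.

```python
import math, heapq

class TopoMap:
    """Toy topological map: nodes are waypoints, edges connect nearby waypoints."""

    def __init__(self, connect_radius=2.0):
        self.nodes, self.edges = {}, {}          # id -> (x, y), id -> {neighbor: cost}
        self.connect_radius = connect_radius

    def add_waypoint(self, node_id, xy):
        self.nodes[node_id] = xy
        self.edges.setdefault(node_id, {})
        for other, oxy in self.nodes.items():
            d = math.dist(xy, oxy)
            if other != node_id and d <= self.connect_radius:
                self.edges[node_id][other] = d
                self.edges[other][node_id] = d

    def plan(self, start, goal):
        """Dijkstra shortest path over the topological graph."""
        dist, prev, pq = {start: 0.0}, {}, [(0.0, start)]
        while pq:
            d, u = heapq.heappop(pq)
            if u == goal:
                break
            for v, w in self.edges[u].items():
                if d + w < dist.get(v, float("inf")):
                    dist[v], prev[v] = d + w, u
                    heapq.heappush(pq, (d + w, v))
        path = [goal]
        while path[-1] != start:
            path.append(prev[path[-1]])
        return path[::-1]

m = TopoMap()
for i, xy in enumerate([(0, 0), (1.5, 0), (3.0, 0.5), (3.0, 2.0)]):
    m.add_waypoint(i, xy)                        # waypoints predicted along the traversed path
print(m.plan(0, 3))                              # [0, 1, 2, 3]
```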

3.
Article in English | MEDLINE | ID: mdl-38564351

ABSTRACT

This paper delves into the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS). Previous VOS methods decode features with a single positive object, limiting the learning of multi-object representations as they must match and segment each target separately under multi-object scenarios. Additionally, earlier techniques catered to specific application objectives and lacked the flexibility to fulfill different speed-accuracy requirements. To address these problems, we present two innovative approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST). In pursuit of effective multi-object modeling, AOT introduces the IDentification (ID) mechanism to allocate each object a unique identity. This approach enables the network to model the associations among all objects simultaneously, thus facilitating the tracking and segmentation of objects in a single network pass. To address the challenge of inflexible deployment, AOST further integrates scalable long short-term transformers that incorporate scalable supervision and layer-wise ID-based attention. This enables online architecture scalability in VOS for the first time and overcomes the representation limitations of ID embeddings. Given the absence of a benchmark for VOS involving dense multi-object annotations, we propose a challenging Video Object Segmentation in the Wild (VOSW) benchmark to validate our approaches. We evaluated various AOT and AOST variants using extensive experiments across VOSW and five commonly used VOS benchmarks, including YouTube-VOS 2018 & 2019 Val, DAVIS-2017 Val & Test, and DAVIS-2016. Our approaches surpass the state-of-the-art competitors and display exceptional efficiency and scalability consistently across all six benchmarks. Moreover, we achieved first place in the 3rd Large-scale Video Object Segmentation Challenge. Project page: https://github.com/yoxu515/aot-benchmark.
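
A rough sketch of the identification (ID) idea only, under my own assumptions about shapes: each object in a reference frame is assigned an identity vector from a small learnable bank, and that vector is painted onto the object's mask region and added to the frame features, so all objects can be encoded in one pass. AOT/AOST's actual attention layers and training are not reproduced here.

```python
import torch

def encode_identities(frame_feat, obj_masks, id_bank):
    """Embed all objects of a reference frame at once via identity vectors.

    frame_feat: (C, H, W) visual features of the reference frame
    obj_masks:  (N, H, W) binary mask per object
    id_bank:    nn.Embedding holding one learnable vector per identity slot
    """
    n = obj_masks.shape[0]
    ids = id_bank(torch.arange(n))                             # (N, C) identity vectors
    # paint each object's region with its identity vector, then add to the features
    id_map = torch.einsum("nhw,nc->chw", obj_masks.float(), ids)
    return frame_feat + id_map

C, H, W, num_slots = 16, 8, 8, 10                              # illustrative sizes
id_bank = torch.nn.Embedding(num_slots, C)
feats = torch.randn(C, H, W)
masks = torch.zeros(2, H, W)
masks[0, :4] = 1.0                                             # object 1: top half
masks[1, 4:] = 1.0                                             # object 2: bottom half
print(encode_identities(feats, masks, id_bank).shape)          # torch.Size([16, 8, 8])
```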

4.
Article in English | MEDLINE | ID: mdl-38386572

ABSTRACT

This work studies the problem of image semantic segmentation. Current approaches focus mainly on mining "local" context, i.e., dependencies between pixels within individual images, via specifically designed context aggregation modules (e.g., dilated convolution, neural attention) or structure-aware optimization objectives (e.g., IoU-like loss). However, they ignore the "global" context of the training data, i.e., rich semantic relations between pixels across different images. Inspired by recent advances in unsupervised contrastive representation learning, we propose a pixel-wise contrastive algorithm, dubbed PiCo, for semantic segmentation in the fully supervised learning setting. The core idea is to enforce pixel embeddings belonging to the same semantic class to be more similar than embeddings from different classes. This yields a pixel-wise metric learning paradigm for semantic segmentation that explicitly explores the structure of labeled pixels, which was rarely studied before. Our training algorithm is compatible with modern segmentation solutions without extra overhead during testing. We experimentally show that, with well-known segmentation models (i.e., DeepLabV3, HRNet, OCRNet, SegFormer, Segmenter, MaskFormer) and backbones (i.e., MobileNet, ResNet, HRNet, MiT, ViT), our algorithm brings consistent performance improvements across diverse datasets (i.e., Cityscapes, ADE20K, PASCAL-Context, COCO-Stuff, CamVid). We expect that this work will encourage our community to rethink the current de facto training paradigm in semantic segmentation. Our code is available at https://github.com/tfzhou/ContrastiveSeg.
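
As a hedged illustration of the core idea (pixels of the same class pulled together, different classes pushed apart), here is a minimal supervised, pixel-level InfoNCE-style loss over a sample of pixel embeddings. The paper's sampling strategies, memory bank, and exact temperature are omitted; all names and values below are illustrative.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over sampled pixel embeddings.

    embeddings: (N, D) pixel features sampled from a training batch
    labels:     (N,)   semantic class id of each sampled pixel
    """
    z = F.normalize(embeddings, dim=1)
    logits = z @ z.t() / temperature                          # (N, N) pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask

    logits = logits.masked_fill(self_mask, -1e9)              # never contrast a pixel with itself
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    per_anchor = -(log_prob * pos_mask.float()).sum(1) / pos_mask.sum(1).clamp(min=1)
    return per_anchor[pos_mask.any(dim=1)].mean()             # skip anchors with no positive

emb = torch.randn(32, 64, requires_grad=True)                 # stand-in pixel embeddings
lab = torch.randint(0, 5, (32,))                              # stand-in class labels
print(pixel_contrastive_loss(emb, lab).item())
```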

5.
Genome Biol; 24(1): 235, 2023 Oct 19.
Article in English | MEDLINE | ID: mdl-37858204

ABSTRACT

When analyzing data from in situ RNA detection technologies, cell segmentation is an essential step in identifying cell boundaries, assigning RNA reads to cells, and studying the gene expression and morphological features of cells. We developed a deep-learning-based method, GeneSegNet, that integrates both gene expression and imaging information to perform cell segmentation. GeneSegNet also employs a recursive training strategy to deal with noisy training labels. We show that GeneSegNet significantly improves cell segmentation performances over existing methods that either ignore gene expression information or underutilize imaging information.


Subject(s)
Deep Learning; Tomography, X-Ray Computed; RNA; Gene Expression; Image Processing, Computer-Assisted/methods
6.
IEEE Trans Pattern Anal Mach Intell; 45(8): 10055-10069, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37819831

ABSTRACT

We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representations, struggling to capture long-term context and easily suffering from visual-linguistic misalignment. In light of this, we present Locater (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner. The memory is designed to involve two components: one for persistently preserving global video content, and one for dynamically gathering local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, Locater holistically and flexibly comprehends the expression as an adaptive query vector for each frame. The vector is used to query the corresponding frame for mask generation. The memory also allows Locater to process videos with linear time complexity and constant-size memory, whereas Transformer-style self-attention scales quadratically with sequence length. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset but poses increased challenges in disambiguating among similar objects. Experiments on three LVS datasets and our A2D-S+ show that Locater outperforms previous state-of-the-art methods. Further, we won first place in the Referring Video Object Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge, where Locater served as the foundation for the winning solution.
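
A toy sketch of the two-part memory idea only, under simplifying assumptions: the global component is kept as a running average of frame features (so its size never grows), and the local component is a short FIFO of recent frames. Locater's learned, attention-based memory read/write is not reproduced.

```python
from collections import deque
import torch

class FiniteMemory:
    """Constant-size memory: a global summary plus a short local window."""

    def __init__(self, dim, local_len=4):
        self.global_mem = torch.zeros(dim)        # persistent video-level summary
        self.local_mem = deque(maxlen=local_len)  # recent frames / segmentation history only
        self.count = 0

    def write(self, frame_feat):
        self.count += 1
        # running average keeps memory size constant regardless of video length
        self.global_mem += (frame_feat - self.global_mem) / self.count
        self.local_mem.append(frame_feat)

    def read(self):
        local = torch.stack(list(self.local_mem)) if self.local_mem else self.global_mem[None]
        return self.global_mem, local

mem = FiniteMemory(dim=256)
for t in range(10):                               # stream of per-frame feature vectors
    mem.write(torch.randn(256))
g, l = mem.read()
print(g.shape, l.shape)                           # torch.Size([256]) torch.Size([4, 256])
```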

7.
IEEE Trans Pattern Anal Mach Intell; 45(7): 8296-8310, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37022259

ABSTRACT

In this work, we study the challenging problem of instance-aware human body part parsing. We introduce a new bottom-up regime which achieves the task through learning category-level human semantic segmentation as well as multi-person pose estimation in a joint and end-to-end manner. The result is a compact, efficient, and powerful framework that exploits structural information over different human granularities and eases the difficulty of person partitioning. Specifically, a dense-to-sparse projection field, which allows explicitly associating dense human semantics with sparse keypoints, is learnt and progressively improved over the network feature pyramid for robustness. Then, the difficult pixel grouping problem is cast as an easier, multi-person joint assembling task. By formulating joint association as maximum-weight bipartite matching, we develop two novel algorithms, based on projected gradient descent and unbalanced optimal transport respectively, to solve the matching problem differentiably. These algorithms make our method end-to-end trainable and allow back-propagating the grouping error to directly supervise multi-granularity human representation learning. This clearly distinguishes our method from current bottom-up human parsers or pose estimators, which require sophisticated post-processing or heuristic greedy algorithms. Extensive experiments on three instance-aware human parsing datasets (i.e., MHP-v2, DensePose-COCO, PASCAL-Person-Part) demonstrate that our approach outperforms most existing human parsers with much more efficient inference. Our code is available at https://github.com/tfzhou/MG-HumanParsing.
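
For intuition about differentiable joint-to-person matching, here is a generic entropic-regularized (Sinkhorn) soft assignment over a joint-to-person affinity matrix. It is not the paper's projected-gradient or unbalanced-OT solver, and the uniform marginals and values are toy assumptions.

```python
import numpy as np

def sinkhorn_match(score, n_iters=50, epsilon=0.1):
    """Soft bipartite assignment from a joint-to-person affinity matrix.

    score: (J, P) affinities between J candidate joints and P person centers
    Returns a transport plan whose entries act as soft grouping weights.
    """
    K = np.exp(score / epsilon)
    u, v = np.ones(K.shape[0]), np.ones(K.shape[1])
    r = np.full(K.shape[0], 1.0 / K.shape[0])      # uniform marginals (toy choice)
    c = np.full(K.shape[1], 1.0 / K.shape[1])
    for _ in range(n_iters):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
affinity = rng.normal(size=(6, 2))                 # 6 candidate joints, 2 people
plan = sinkhorn_match(affinity)
print(plan.argmax(axis=1))                         # hard grouping read-out per joint
print(plan.sum())                                  # ~1.0: total transported mass
```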


Subject(s)
Algorithms; Learning; Humans; Semantics; Software
8.
IEEE Trans Pattern Anal Mach Intell; 45(7): 8646-8659, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37018636

ABSTRACT

Given a natural language referring expression, the goal of the referring video segmentation task is to predict the segmentation mask of the referred object in the video. Previous methods only adopt 3D CNNs over the video clip as a single encoder to extract a mixed spatio-temporal feature for the target frame. Though 3D convolutions are able to recognize which object is performing the described actions, they still introduce misaligned spatial information from adjacent frames, which inevitably confuses features of the target frame and leads to inaccurate segmentation. To tackle this issue, we propose a language-aware spatial-temporal collaboration framework that contains a 3D temporal encoder over the video clip to recognize the described actions, and a 2D spatial encoder over the target frame to provide undisturbed spatial features of the referred object. For multimodal feature extraction, we propose a Cross-Modal Adaptive Modulation (CMAM) module and its improved version, CMAM+, to conduct adaptive cross-modal interaction in the encoders with spatial- or temporal-relevant language features, which are also progressively updated to enrich the global linguistic context. In addition, we propose a Language-Aware Semantic Propagation (LASP) module in the decoder to propagate semantic information from deep stages to shallow stages with language-aware sampling and assignment, which is able to highlight language-compatible foreground visual features and suppress language-incompatible background visual features, better facilitating spatial-temporal collaboration. Extensive experiments on four popular referring video segmentation benchmarks demonstrate the superiority of our method over previous state-of-the-art methods.
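
As a hedged, greatly simplified stand-in for cross-modal modulation, the sketch below conditions visual features on a pooled sentence embedding with a predicted scale and shift (FiLM-style). CMAM/CMAM+ perform richer adaptive interaction with progressively updated language features; the class name and dimensions here are my own.

```python
import torch
import torch.nn as nn

class LanguageModulation(nn.Module):
    """Scale-and-shift visual features with parameters predicted from language."""

    def __init__(self, lang_dim, vis_channels):
        super().__init__()
        self.to_gamma = nn.Linear(lang_dim, vis_channels)
        self.to_beta = nn.Linear(lang_dim, vis_channels)

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, C, H, W); lang_feat: (B, L) pooled sentence embedding
        gamma = self.to_gamma(lang_feat)[:, :, None, None]
        beta = self.to_beta(lang_feat)[:, :, None, None]
        return (1 + gamma) * vis_feat + beta

mod = LanguageModulation(lang_dim=300, vis_channels=64)
v = torch.randn(2, 64, 32, 32)                     # toy visual feature map
s = torch.randn(2, 300)                            # toy sentence embedding
print(mod(v, s).shape)                             # torch.Size([2, 64, 32, 32])
```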

9.
IEEE Trans Pattern Anal Mach Intell; 45(6): 7099-7122, 2023 Jun.
Article in English | MEDLINE | ID: mdl-36449595

ABSTRACT

Video segmentation, the partitioning of video frames into multiple segments or objects, plays a critical role in a broad range of practical applications, from enhancing visual effects in movies, to understanding scenes in autonomous driving, to creating virtual backgrounds in video conferencing. Recently, with the renaissance of connectionism in computer vision, there has been an influx of deep learning based approaches for video segmentation that have delivered compelling performance. In this survey, we comprehensively review two basic lines of research, generic object segmentation (of unknown categories) in videos and video semantic segmentation, by introducing their respective task settings, background concepts, perceived need, development history, and main challenges. We also offer a detailed overview of representative literature on both methods and datasets. We further benchmark the reviewed methods on several well-known datasets. Finally, we point out open issues in this field, and suggest opportunities for further research. We also provide a public website to continuously track developments in this fast-advancing field: https://github.com/tfzhou/VS-Survey.

10.
Article in English | MEDLINE | ID: mdl-35439127

ABSTRACT

This article studies the problem of learning weakly supervised semantic segmentation (WSSS) from image-level supervision only. Rather than previous efforts that primarily focus on intra-image information, we address the value of cross-image semantic relations for comprehensive object pattern mining. To achieve this, two neural co-attentions are incorporated into the classifier to complementarily capture cross-image semantic similarities and differences. In particular, given a pair of training images, one co-attention forces the classifier to recognize the common semantics from co-attentive objects, while the other, called contrastive co-attention, drives the classifier to identify the unique semantics from the remaining, unshared objects. This helps the classifier discover more object patterns and better ground semantics in image regions. More importantly, our algorithm provides a unified framework that handles different WSSS settings well, i.e., learning WSSS with (1) precise image-level supervision only, (2) extra simple single-label data, and (3) extra noisy web data. Without bells and whistles, it sets a new state of the art in all these settings. Moreover, our approach ranked first in the WSSS Track of the CVPR 2020 LID Challenge. Extensive experimental results clearly demonstrate the efficacy and high utility of our method.

11.
IEEE Trans Pattern Anal Mach Intell; 44(6): 3239-3259, 2022 Jun.
Article in English | MEDLINE | ID: mdl-33434124

ABSTRACT

As an essential problem in computer vision, salient object detection (SOD) has attracted an increasing amount of research attention over the years. Recent advances in SOD are predominantly led by deep learning-based solutions (named deep SOD). To enable in-depth understanding of deep SOD, in this paper, we provide a comprehensive survey covering various aspects, ranging from algorithm taxonomy to unsolved issues. In particular, we first review deep SOD algorithms from different perspectives, including network architecture, level of supervision, learning paradigm, and object-/instance-level detection. Following that, we summarize and analyze existing SOD datasets and evaluation metrics. Then, we benchmark a large group of representative SOD models, and provide detailed analyses of the comparison results. Moreover, we study the performance of SOD algorithms under different attribute settings, which has not been thoroughly explored previously, by constructing a novel SOD dataset with rich attribute annotations covering various salient object types, challenging factors, and scene categories. We further analyze, for the first time in the field, the robustness of SOD models to random input perturbations and adversarial attacks. We also look into the generalization and difficulty of existing SOD datasets. Finally, we discuss several open issues of SOD and outline future research directions. All the saliency prediction maps, our constructed dataset with annotations, and codes for evaluation are publicly available at https://github.com/wenguanwang/SODsurvey.


Subject(s)
Deep Learning; Algorithms; Attention; Benchmarking; Superoxide Dismutase
12.
IEEE Trans Pattern Anal Mach Intell; 44(4): 2228-2242, 2022 Apr.
Article in English | MEDLINE | ID: mdl-33232224

ABSTRACT

We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the zero-shot video object segmentation task in a holistic fashion. We exploit the inherent correlation among video frames and incorporate a global co-attention mechanism to further improve state-of-the-art deep learning based solutions that primarily focus on learning discriminative foreground representations over appearance and motion in short-term temporal segments. The co-attention layers in COSNet provide an efficient and effective means of capturing global correlations and scene context, by jointly computing and appending co-attention responses into a joint feature space. COSNet is a unified and end-to-end trainable framework in which different co-attention variants can be derived for capturing diverse properties of the learned joint feature space. We train COSNet with pairs (or groups) of video frames, which naturally augments training data and allows increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to better infer the frequently reappearing and salient foreground objects. Our extensive experiments over three large benchmarks demonstrate that COSNet outperforms the current alternatives by a large margin. Our implementations are available at https://github.com/carrierlxk/COSNet.
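
A bare-bones sketch of co-attention between two frames, under my own simplifications: an affinity matrix over all location pairs lets each frame summarize the other, and the attended features are appended to the original ones. COSNet's learnable affinity weights, normalization variants, and gating are omitted.

```python
import torch
import torch.nn.functional as F

def co_attention(feat_a, feat_b):
    """Exchange information between two frames via a pairwise affinity matrix.

    feat_a, feat_b: (C, H, W) feature maps of two frames of the same video
    Returns co-attention-enhanced versions of both feature maps.
    """
    C, H, W = feat_a.shape
    A = feat_a.reshape(C, H * W)                       # (C, N)
    B = feat_b.reshape(C, H * W)
    affinity = A.t() @ B                               # (N, N) similarity of all location pairs
    attended_a = B @ F.softmax(affinity, dim=1).t()    # b-content summarized for each a-location
    attended_b = A @ F.softmax(affinity, dim=0)        # a-content summarized for each b-location
    out_a = torch.cat([A, attended_a], dim=0).reshape(2 * C, H, W)
    out_b = torch.cat([B, attended_b], dim=0).reshape(2 * C, H, W)
    return out_a, out_b

fa, fb = torch.randn(64, 16, 16), torch.randn(64, 16, 16)
oa, ob = co_attention(fa, fb)
print(oa.shape, ob.shape)                              # torch.Size([128, 16, 16]) each
```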

13.
IEEE Trans Pattern Anal Mach Intell; 44(6): 2827-2840, 2022 Jun.
Article in English | MEDLINE | ID: mdl-33400648

ABSTRACT

This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images. Considering the intrinsic complexity and structural nature of the task, we introduce a cascaded parsing network (CP-HOI) for multi-stage, structured HOI understanding. At each cascade stage, an instance detection module progressively refines HOI proposals and feeds them into a structured interaction reasoning module. Each of the two modules is also connected to its predecessor in the previous stage, enabling efficient cross-stage information propagation. The structured interaction reasoning module is built upon a graph parsing neural network (GPNN), which efficiently models potential HOI structures as graphs and mines rich context for comprehensive relation understanding. In particular, GPNN infers a parse graph that i) interprets meaningful HOI structures through a learnable adjacency matrix, and ii) predicts action (edge) labels. Within an end-to-end, message-passing framework, GPNN blends learning and inference, iteratively parsing HOI structures and reasoning over HOI representations (i.e., instance and relation features). Beyond relation detection at the bounding-box level, we make our framework flexible enough to perform fine-grained, pixel-wise relation segmentation; this provides a new glimpse into better relation modeling. A preliminary version of our CP-HOI model reached first place in the ICCV 2019 Person in Context Challenge, on both relation detection and segmentation. In addition, CP-HOI shows promising results on two popular HOI recognition benchmarks, i.e., V-COCO and HICO-DET.
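
To illustrate the learnable-adjacency idea in isolation, here is a single message-passing step over human/object node features in which edge weights are predicted from node pairs. The layer and dimensions are my own assumptions; CP-HOI embeds this kind of step inside a cascade with separate edge/action heads.

```python
import torch
import torch.nn as nn

class SoftGraphLayer(nn.Module):
    """One message-passing step over a parse graph with a predicted soft adjacency."""

    def __init__(self, dim):
        super().__init__()
        self.edge_score = nn.Linear(2 * dim, 1)    # infers the (soft) adjacency matrix
        self.msg = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, nodes):                      # nodes: (N, D) human/object features
        n = nodes.shape[0]
        pairs = torch.cat([nodes[:, None].expand(n, n, -1),
                           nodes[None, :].expand(n, n, -1)], dim=-1)
        adj = torch.sigmoid(self.edge_score(pairs)).squeeze(-1)   # (N, N) learnable adjacency
        messages = adj @ self.msg(nodes)                          # aggregate neighbor messages
        return self.update(messages, nodes), adj

layer = SoftGraphLayer(dim=128)
feats = torch.randn(5, 128)                        # e.g. 2 humans + 3 objects
new_feats, adjacency = layer(feats)
print(new_feats.shape, adjacency.shape)            # torch.Size([5, 128]) torch.Size([5, 5])
```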


Subject(s)
Algorithms; Neural Networks, Computer; Humans; Learning; Visual Perception
14.
IEEE Trans Pattern Anal Mach Intell; 44(5): 2468-2484, 2022 May.
Article in English | MEDLINE | ID: mdl-33320811

ABSTRACT

3D data that contains rich geometric information about objects and scenes is valuable for understanding the 3D physical world. With the recent emergence of large-scale 3D datasets, it becomes increasingly crucial to have a powerful 3D generative model for 3D shape synthesis and analysis. This paper proposes a deep 3D energy-based model to represent volumetric shapes. The maximum likelihood training of the model follows an "analysis by synthesis" scheme. The benefits of the proposed model are six-fold: first, unlike GANs and VAEs, the model training does not rely on any auxiliary models; second, the model can synthesize realistic 3D shapes by Markov chain Monte Carlo (MCMC); third, the conditional model can be applied to 3D object recovery and super-resolution; fourth, the model can serve as a building block in a multi-grid modeling and sampling framework for high-resolution 3D shape synthesis; fifth, the model can be used to train a 3D generator via MCMC teaching; sixth, the model, trained without supervision, provides a powerful feature extractor for 3D data, which is useful for 3D object classification. Experiments demonstrate that the proposed model can generate high-quality 3D shape patterns and can be useful for a wide variety of 3D shape analysis tasks.
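
The MCMC synthesis step can be made concrete with a short Langevin-dynamics sketch: starting from noise, samples descend the energy with injected Gaussian noise. The paper's energy function is a 3D ConvNet over voxel grids; a small MLP over toy 8x8x8 volumes stands in here, and the step count and step size are arbitrary.

```python
import torch
import torch.nn as nn

energy_net = nn.Sequential(                  # stand-in for the paper's 3D ConvNet energy
    nn.Flatten(), nn.Linear(8 * 8 * 8, 128), nn.ReLU(), nn.Linear(128, 1))

def langevin_sample(energy_net, shape, steps=60, step_size=0.01):
    """MCMC synthesis: follow the energy gradient downhill with injected Gaussian noise."""
    x = torch.randn(shape)                   # start from noise
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        energy = energy_net(x).sum()
        grad, = torch.autograd.grad(energy, x)
        x = x - 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(x)
    return x.detach()

samples = langevin_sample(energy_net, shape=(4, 1, 8, 8, 8))   # four toy 8^3 volumes
print(samples.shape)                         # torch.Size([4, 1, 8, 8, 8])
```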

15.
IEEE Trans Pattern Anal Mach Intell; 44(7): 3508-3522, 2022 Jul.
Article in English | MEDLINE | ID: mdl-33513100

ABSTRACT

Modeling the human structure is central to human parsing, which extracts pixel-wise semantic information from images. We start by analyzing three types of inference processes over the hierarchical structure of human bodies: direct inference (directly predicting human semantic parts using image information), bottom-up inference (assembling knowledge from constituent parts), and top-down inference (leveraging context from parent nodes). We then formulate the problem as a compositional neural information fusion (CNIF) framework, which assembles the information from the three inference processes in a conditional manner, i.e., considering the confidence of the sources. Based on CNIF, we further present a part-relation-aware human parser (PRHP), which precisely describes three kinds of human part relations, i.e., decomposition, composition, and dependency, with three distinct relation networks. Expressive relation information can be captured by constraining the parameters of the relation networks to satisfy the specific geometric characteristics of different relations. By combining generic message-passing networks with their edge-typed, convolutional counterparts, PRHP performs iterative reasoning over the human body hierarchy. With these efforts, PRHP provides a more general and powerful form of CNIF, and lays the foundation for reasoning over more sophisticated and flexible human relation patterns. Experiments on five datasets demonstrate that our two human parsers outperform the state-of-the-art in all cases.


Subject(s)
Algorithms; Semantics; Humans; Software
16.
IEEE Trans Pattern Anal Mach Intell; 44(8): 4454-4468, 2022 Aug.
Article in English | MEDLINE | ID: mdl-33656990

ABSTRACT

It is quite laborious and costly to manually label LiDAR point cloud data for training high-quality 3D object detectors. This work proposes a weakly supervised framework which allows learning 3D detection from a few weakly annotated examples. This is achieved by a two-stage architecture design. Stage-1 learns to generate cylindrical object proposals under inaccurate and inexact supervision, obtained by our proposed BEV center-click annotation strategy, where only the horizontal object centers are click-annotated in bird's-eye-view scenes. Stage-2 learns to predict cuboids and confidence scores in a coarse-to-fine, cascade manner, under incomplete supervision, i.e., only a small portion of object cuboids are precisely annotated. On the KITTI dataset, using only 500 weakly annotated scenes and 534 precisely labeled vehicle instances, our method achieves 86-97 percent of the performance of current top-performing, fully supervised detectors (which require 3,712 exhaustively annotated scenes with 15,654 instances). More importantly, with our elaborately designed network architecture, our trained model can be applied as a 3D object annotator, supporting both automatic and active (human-in-the-loop) working modes. The annotations generated by our model can be used to train 3D object detectors, achieving over 95 percent of their original performance (with manually labeled training data). Our experiments also show our model's potential for boosting performance when given more training data. The above designs make our approach highly practical and open up opportunities for learning 3D detection at a reduced annotation cost.
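
The geometry behind the BEV center-click supervision can be illustrated with a few lines: keep the LiDAR points inside a vertical cylinder around the clicked horizontal center and treat them as the (inexact) object proposal. The radius and coordinate conventions below are my own toy choices, not the paper's.

```python
import numpy as np

def cylinder_proposal(points, click_xy, radius=2.5):
    """Return LiDAR points falling inside a vertical cylinder around a BEV click.

    points:   (N, 3) LiDAR points as (x, y, z)
    click_xy: (2,)   horizontal object center clicked in bird's-eye view
    radius:   cylinder radius in metres (illustrative value)
    """
    horizontal_dist = np.linalg.norm(points[:, :2] - np.asarray(click_xy), axis=1)
    return points[horizontal_dist <= radius]       # z unconstrained: a cylinder, not a box

rng = np.random.default_rng(0)
cloud = rng.uniform(low=[-40, -40, -2], high=[40, 40, 2], size=(20000, 3))
proposal = cylinder_proposal(cloud, click_xy=(5.0, -3.0))
print(proposal.shape)                              # points feeding Stage-1's proposal learner
```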


Subject(s)
Algorithms; Learning; Humans
17.
IEEE Trans Pattern Anal Mach Intell; 44(10): 6327-6344, 2022 Oct.
Article in English | MEDLINE | ID: mdl-34106844

ABSTRACT

In this paper, we propose a pose grammar to tackle the problem of 3D human pose estimation from a monocular RGB image. Our model takes an estimated 2D pose as input and learns a generalized 2D-to-3D mapping function to lift it to a 3D pose. The proposed model consists of a base network, which efficiently captures pose-aligned features, and a hierarchy of bidirectional RNNs (BRNNs) on top that explicitly incorporates knowledge of human body configuration (i.e., kinematics, symmetry, motor coordination). The proposed model thus enforces high-level constraints over human poses. For learning, we develop a data augmentation algorithm to further improve model robustness against appearance variations and cross-view generalization ability. We validate our method on public 3D human pose benchmarks and propose a new evaluation protocol for the cross-view setting to verify the generalization capability of different methods. We empirically observe that most state-of-the-art methods encounter difficulty in this setting, whereas our method handles such challenges well.
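
One of the body-configuration cues named above, symmetry, can be written down as a simple penalty: mirrored limbs should have equal bone lengths. The joint indices below are hypothetical, and the paper encodes such knowledge inside BRNNs rather than as an explicit loss like this.

```python
import numpy as np

# hypothetical joint indices for a 16-joint skeleton (for illustration only)
LIMB_PAIRS = [((11, 12), (14, 15)),   # left/right lower arm
              ((10, 11), (13, 14)),   # left/right upper arm
              ((4, 5), (1, 2)),       # left/right thigh
              ((5, 6), (2, 3))]       # left/right shin

def symmetry_penalty(pose3d):
    """Sum of absolute length differences between mirrored limbs of a 3D pose (J, 3)."""
    def bone_len(a, b):
        return np.linalg.norm(pose3d[a] - pose3d[b])
    return sum(abs(bone_len(*left) - bone_len(*right)) for left, right in LIMB_PAIRS)

pose = np.random.default_rng(0).normal(size=(16, 3))
print(symmetry_penalty(pose))         # large for a random "pose", near zero for a plausible one
```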


Subject(s)
Algorithms; Posture; Biomechanical Phenomena; Humans
18.
IEEE Trans Pattern Anal Mach Intell; 44(11): 7885-7897, 2022 Nov.
Article in English | MEDLINE | ID: mdl-34582345

ABSTRACT

In this article, we model a set of pixel-wise object segmentation tasks, namely automatic video segmentation (AVS), image co-segmentation (ICS), and few-shot semantic segmentation (FSS), in a unified view of segmenting objects from relational visual data. To this end, we propose an attentive graph neural network (AGNN) that addresses these tasks in a holistic fashion, by formulating them as a process of iterative information fusion over data graphs. It builds a fully-connected graph to efficiently represent visual data as nodes and relations between data instances as edges. The underlying relations are described by a differentiable attention mechanism, which thoroughly examines fine-grained semantic similarities between all possible location pairs in two data instances. Through parametric message passing, AGNN is able to capture knowledge from the relational visual data, enabling more accurate object discovery and segmentation. Experiments show that AGNN can automatically highlight primary foreground objects from video sequences (i.e., automatic video segmentation) and extract common objects from noisy collections of semantically related images (i.e., image co-segmentation). AGNN can even generalize to segment new categories with little annotated data (i.e., few-shot semantic segmentation). Taken together, our results demonstrate that AGNN provides a powerful tool that is applicable to a wide range of pixel-wise object pattern understanding tasks with relational visual data. Our algorithm implementations have been made publicly available at https://github.com/carrierlxk/AGNN.


Subject(s)
Algorithms; Neural Networks, Computer
19.
IEEE Trans Pattern Anal Mach Intell; 43(5): 1515-1529, 2021 May.
Article in English | MEDLINE | ID: mdl-31796388

ABSTRACT

Hyperparameters are numerical pre-sets whose values are assigned before a learning process begins. Selecting appropriate hyperparameters is often critical for achieving satisfactory performance in many vision problems, such as deep learning-based visual object tracking. However, it is often difficult to determine their optimal values, especially if they are specific to each video input. Most hyperparameter optimization algorithms tend to search a generic range and are applied blindly to all sequences. In this paper, we propose a novel dynamical hyperparameter optimization method that adaptively optimizes hyperparameters for a given sequence using an action-prediction network built on continuous deep Q-learning. Since the observation space for object tracking is significantly more complex than those in traditional control problems, existing continuous deep Q-learning algorithms cannot be directly applied. To overcome this challenge, we introduce an efficient heuristic strategy to handle the high-dimensional state space, while also accelerating convergence. The proposed algorithm is applied to improve two representative trackers, a Siamese-based one and a correlation-filter-based one, to evaluate its generalizability. Their superior performance on several popular benchmarks is clearly demonstrated. Our source code is available at https://github.com/shenjianbing/dqltracking.

20.
IEEE Trans Pattern Anal Mach Intell; 43(7): 2413-2428, 2021 Jul.
Article in English | MEDLINE | ID: mdl-31940522

ABSTRACT

This paper conducts a systematic study of the role of visual attention in video object pattern understanding. By elaborately annotating three popular video segmentation datasets (DAVIS 16, Youtube-Objects, and SegTrack V2) with dynamic eye-tracking data in the unsupervised video object segmentation (UVOS) setting, we quantitatively verified, for the first time, the high consistency of visual attention behavior among human observers, and found a strong correlation between human attention and explicit primary object judgments during dynamic, task-driven viewing. These novel observations provide in-depth insight into the underlying rationale behind video object patterns. Inspired by these findings, we decouple UVOS into two sub-tasks: UVOS-driven Dynamic Visual Attention Prediction (DVAP) in the spatiotemporal domain, and Attention-Guided Object Segmentation (AGOS) in the spatial domain. Our UVOS solution enjoys three major advantages: 1) modular training without using expensive video segmentation annotations; instead, more affordable dynamic fixation data are used to train the initial video attention module, and existing fixation-segmentation paired static/image data are used to train the subsequent segmentation module; 2) comprehensive foreground understanding through multi-source learning; and 3) additional interpretability from the biologically inspired and assessable attention. Experiments on four popular benchmarks show that, even without using expensive video object mask annotations, our model achieves compelling performance compared with state-of-the-art methods and enjoys fast processing (10 fps on a single GPU). Our collected eye-tracking data and algorithm implementations have been made publicly available at https://github.com/wenguanwang/AGS.
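
At its simplest, the attention-guided segmentation stage amounts to gating frame features with a predicted fixation/attention map before a segmentation head; the sketch below shows only that gating step, with made-up shapes and layer names (AGOS itself is a full segmentation network).

```python
import torch
import torch.nn as nn

class AttentionGuidedHead(nn.Module):
    """Segment the primary object from features gated by a predicted attention map."""

    def __init__(self, channels):
        super().__init__()
        self.head = nn.Conv2d(channels, 1, kernel_size=1)   # binary object logits

    def forward(self, frame_feat, attention_map):
        # frame_feat: (B, C, H, W); attention_map: (B, 1, H, W) from a DVAP-like module
        gated = frame_feat * attention_map        # suppress regions nobody fixates on
        return self.head(gated)

head = AttentionGuidedHead(channels=64)
feat = torch.randn(2, 64, 60, 60)
attn = torch.rand(2, 1, 60, 60)                   # stand-in for a predicted fixation map
print(head(feat, attn).shape)                     # torch.Size([2, 1, 60, 60])
```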
