Results 1 - 20 of 76
1.
Article in English | MEDLINE | ID: mdl-38976474

ABSTRACT

In this article, we address the challenges in unsupervised video object segmentation (UVOS) by proposing an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. Unlike previous methods that focus solely on integrating appearance with motion or on modeling temporal relations, our method unifies both aspects within a single framework. MTNet is devised by effectively merging appearance and motion features during feature extraction within the encoders, promoting a more complementary representation. To capture the intricate long-range contextual dynamics embedded within videos, a temporal transformer module is introduced, facilitating effective interframe interactions throughout a video clip. Furthermore, we employ a cascade of decoders across all feature levels to optimally exploit the derived features, aiming to generate increasingly precise segmentation masks. As a result, MTNet provides a strong and compact framework that exploits both temporal and cross-modality knowledge to robustly and efficiently localize and track the primary object in various challenging scenarios. Extensive experiments across diverse benchmarks conclusively show that our method not only attains state-of-the-art performance in UVOS but also delivers competitive results in video salient object detection (VSOD). These findings highlight the method's versatility and its adeptness in adapting to a range of segmentation tasks. The source code is available at https://github.com/hy0523/MTNet.
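To make the temporal transformer idea concrete, here is a minimal PyTorch sketch written under my own assumptions about tensor shapes (it is not the authors' implementation): each spatial location's features are treated as a short sequence over the clip's frames and passed through a standard transformer encoder so that frames can interact.

import torch
import torch.nn as nn

class TemporalTransformer(nn.Module):
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, feats):
        # feats: (B, T, C, H, W) per-frame encoder features
        b, t, c, h, w = feats.shape
        # treat every spatial location as a separate sequence over time
        tokens = feats.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        tokens = self.encoder(tokens)            # inter-frame interactions
        return tokens.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

feats = torch.randn(2, 5, 256, 16, 16)           # toy clip of 5 frames
out = TemporalTransformer()(feats)               # same shape, temporally mixed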

2.
Article in English | MEDLINE | ID: mdl-38905087

ABSTRACT

Recent camouflaged object detection (COD) attempts to segment objects visually blended into their surroundings, which is extremely complex and difficult in real-world scenarios. Apart from the high intrinsic similarity between camouflaged objects and their background, the objects are usually diverse in scale, fuzzy in appearance, and even severely occluded. To this end, we propose an effective unified collaborative pyramid network that mimics human behavior when observing vague images and videos, i.e., zooming in and out. Specifically, our approach employs this zooming strategy to learn discriminative mixed-scale semantics through multi-head scale integration and rich granularity perception units, which are designed to fully explore imperceptible clues between candidate objects and background surroundings. The former's intrinsic multi-head aggregation provides more diverse visual patterns. The latter's routing mechanism can effectively propagate inter-frame differences in spatiotemporal scenarios and is adaptively deactivated, outputting all-zero results, for static representations. Together they provide a solid foundation for realizing a unified architecture for static and dynamic COD. Moreover, considering the uncertainty and ambiguity derived from indistinguishable textures, we construct a simple yet effective regularization, the uncertainty awareness loss, to encourage predictions with higher confidence in candidate regions. Our highly task-friendly framework consistently outperforms existing state-of-the-art methods on image and video COD benchmarks.
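As an illustration of the zooming strategy (a hedged sketch, not the paper's network; the scale factors and tiny backbone are placeholders), the same image can be encoded at several scales and the resulting features merged into mixed-scale semantics.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ZoomFusion(nn.Module):
    def __init__(self, channels=64, scales=(0.5, 1.0, 1.5)):
        super().__init__()
        self.scales = scales
        self.backbone = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        self.merge = nn.Conv2d(channels * len(scales), channels, 1)

    def forward(self, img):
        h, w = img.shape[-2:]
        feats = []
        for s in self.scales:
            zoomed = F.interpolate(img, scale_factor=s, mode='bilinear', align_corners=False)
            f = self.backbone(zoomed)
            feats.append(F.interpolate(f, size=(h, w), mode='bilinear', align_corners=False))
        return self.merge(torch.cat(feats, dim=1))   # mixed-scale semantics

out = ZoomFusion()(torch.randn(1, 3, 64, 64))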

3.
IEEE Trans Med Imaging ; PP2024 May 13.
Article in English | MEDLINE | ID: mdl-38739510

ABSTRACT

Pyramid-based deformation decomposition is a promising registration framework, which gradually decomposes the deformation field into multi-resolution subfields for precise registration. However, most pyramid-based methods directly produce one subfield per resolution level, which does not fully depict the spatial deformation. In this paper, we propose a novel registration model, called GroupMorph. Different from typical pyramid-based methods, we adopt a grouping-combination strategy to predict the deformation field at each resolution. Specifically, we perform group-wise correlation calculation to measure the similarities of grouped features. After that, n groups of deformation subfields with different receptive fields are predicted in parallel. By composing these subfields, a deformation field with multi-receptive-field ranges is formed, which can effectively identify both large and small deformations. Meanwhile, a contextual fusion module is designed to fuse the contextual features and provide the inter-group information to the field estimator of the next level. By leveraging the inter-group correspondence, the synergy among deformation subfields is enhanced. Extensive experiments on four public datasets demonstrate the effectiveness of GroupMorph. Code is available at https://github.com/TVayne/GroupMorph.
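A rough sketch of the grouping-combination idea, simplified to 2-D and using my own head design (one dilation rate per group to vary the receptive field): several deformation subfields are predicted in parallel and composed into one field.

import torch
import torch.nn as nn

class GroupedFlowHeads(nn.Module):
    def __init__(self, in_ch=32, groups=4):
        super().__init__()
        # one head per group; larger dilation means a larger receptive field
        self.heads = nn.ModuleList(
            nn.Conv2d(in_ch * 2, 2, 3, padding=d, dilation=d) for d in range(1, groups + 1)
        )

    def forward(self, fixed_feat, moving_feat):
        x = torch.cat([fixed_feat, moving_feat], dim=1)
        subfields = [head(x) for head in self.heads]   # n subfields in parallel
        return torch.stack(subfields).sum(dim=0)       # composed deformation field

flow = GroupedFlowHeads()(torch.randn(1, 32, 48, 48), torch.randn(1, 32, 48, 48))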

4.
IEEE Trans Image Process ; 33: 3341-3352, 2024.
Article in English | MEDLINE | ID: mdl-38713578

ABSTRACT

Image-text matching remains a challenging task due to heterogeneous semantic diversity across modalities and insufficient distance separability within triplets. Different from previous approaches that focus on enhancing multi-modal representations or exploiting cross-modal correspondence for more accurate retrieval, in this paper we aim to leverage the knowledge transfer between peer branches in a boosting manner to seek a more powerful matching model. Specifically, we propose a brand-new Deep Boosting Learning (DBL) algorithm, in which an anchor branch is first trained to provide insights into the data properties, while a target branch gains more advanced knowledge to develop optimal features and distance metrics. Concretely, the anchor branch initially learns the absolute or relative distance between positive and negative pairs, providing a foundational understanding of the particular network and data distribution. Building upon this knowledge, the target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples. Extensive experiments validate that our DBL can achieve impressive and consistent improvements based on various recent state-of-the-art models in the image-text matching field, and outperforms related popular cooperative strategies, e.g., Conventional Distillation, Mutual Learning, and Contrastive Learning. Beyond the above, we confirm that DBL can be seamlessly integrated into their training scenarios and achieve superior performance under the same computational costs, demonstrating the flexibility and broad applicability of our proposed method.
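The following is a hedged sketch of how the anchor/target margin scheme might look in code; the adaptive-margin rule (reusing the gap the anchor branch has already achieved) is my own assumption, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def triplet_loss(sim_pos, sim_neg, margin):
    # sim_*: cosine similarities of matched / unmatched image-text pairs
    return F.relu(margin - sim_pos + sim_neg).mean()

def dbl_losses(anchor_pos, anchor_neg, target_pos, target_neg, base_margin=0.2):
    anchor_loss = triplet_loss(anchor_pos, anchor_neg, base_margin)
    # target branch: enlarge the required gap using what the anchor achieved (assumed rule)
    achieved_gap = (anchor_pos - anchor_neg).detach().clamp(min=0)
    target_loss = triplet_loss(target_pos, target_neg, base_margin + achieved_gap)
    return anchor_loss, target_loss

a_pos, a_neg = torch.rand(8), torch.rand(8)
t_pos, t_neg = torch.rand(8), torch.rand(8)
l_anchor, l_target = dbl_losses(a_pos, a_neg, t_pos, t_neg)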

5.
Article in English | MEDLINE | ID: mdl-37910414

ABSTRACT

With the growing demands of applications on online devices, the speed-accuracy trade-off is critical in semantic segmentation systems. Recently, the bilateral segmentation network has shown a promising capacity to balance favorable accuracy with fast speed and has become the mainstream backbone in real-time semantic segmentation. Segmentation of target objects relies on high-level semantics, whereas accurate localization requires detailed low-level features to model specific local patterns. However, the lightweight backbone of the bilateral architecture limits the extraction of semantic context and spatial details, and the late fusion of the bilateral streams leads to insufficient aggregation of the two. In this article, we propose a densely aggregated bilateral network (DAB-Net) for real-time semantic segmentation. In the context path, a patchwise context enhancement (PCE) module is proposed to efficiently capture local semantic contextual information along the spatial and channel dimensions, respectively. Meanwhile, a context-guided spatial path (CGSP) is designed to exploit more spatial information by encoding finer details from the raw image and the transition from the context path. Finally, with multiple interactions between the bilateral branches, the intertwined outputs from the two streams are combined in a unified decoder for a final interaction that further enhances the feature representation and generates the final segmentation prediction. Experimental results on three public benchmarks demonstrate that our proposed method achieves higher accuracy with a limited decay in speed, performs favorably against state-of-the-art real-time approaches, and runs at 31.1 frames/s (FPS) at the high resolution of [Formula: see text]. The source code is released at https://github.com/isyangshu/DABNet.
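For illustration only, here is a tiny module in the spirit of the described context enhancement, capturing context along the channel and spatial dimensions with separate gates; the actual PCE module is more elaborate, so treat this as an assumption-laden sketch.

import torch
import torch.nn as nn

class DualContextEnhance(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(nn.Conv2d(ch, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel_gate(x)   # channel-wise context weighting
        x = x * self.spatial_gate(x)   # spatial-wise context weighting
        return x

out = DualContextEnhance()(torch.randn(1, 128, 32, 32))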

6.
Article in English | MEDLINE | ID: mdl-37883252

ABSTRACT

Existing image inpainting methods often produce artifacts because they use vanilla convolution layers, which treat all image regions equally, as building blocks, and because they generate training holes at random locations with equal probability. This design neither differentiates missing regions from valid regions at inference nor considers the predictability of missing regions during training. To address these issues, we propose a deformable dynamic sampling (DDS) mechanism built on deformable convolutions (DCs), together with a constraint that prevents the deformably sampled elements from falling into corrupted regions. Furthermore, to select both valid sample locations and suitable kernels dynamically, we equip DCs with content-aware dynamic kernel selection (DKS). In addition, to further encourage the DDS mechanism to find meaningful sampling locations, we propose to train the inpainting model with mined predictable regions as holes. During training, we jointly train a mask generator with the inpainting network to generate hole masks dynamically for each training sample. Thus, the mask generator can find large yet predictable missing regions as a better alternative to random masks. Extensive experiments demonstrate the advantages of our method over state-of-the-art methods qualitatively and quantitatively.
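A small sketch of deformable sampling with predicted offsets, using torchvision's deformable convolution; the validity constraint and dynamic kernel selection described above are omitted, so this only illustrates the sampling part under my own assumptions.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableSample(nn.Module):
    def __init__(self, in_ch=64, out_ch=64, k=3):
        super().__init__()
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, 3, padding=1)
        self.deform_conv = DeformConv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, feat):
        offsets = self.offset_pred(feat)          # where to sample from
        return self.deform_conv(feat, offsets)    # content-dependent sampling

out = DeformableSample()(torch.randn(1, 64, 32, 32))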

7.
Article in English | MEDLINE | ID: mdl-37713223

ABSTRACT

Existing works mainly focus on the crowd itself and ignore confusion regions, i.e., background regions whose appearance is extremely similar to the crowd, even though crowd counting must handle both at the same time. To address this issue, we propose a novel end-to-end trainable confusion region discriminating and erasing network called CDENet. Specifically, CDENet is composed of two modules: a confusion region mining module (CRM) and a guided erasing module (GEM). CRM consists of a basic density estimation (BDE) network, a confusion region aware bridge, and a confusion region discriminating network. The BDE network first generates a primary density map, and then the confusion region aware bridge excavates the confusion regions by comparing the primary prediction result with the ground-truth density map. Finally, the confusion region discriminating network learns the difference between feature representations in confusion regions and in crowds. Furthermore, GEM produces the refined density map by erasing the confusion regions. We evaluate the proposed method on four crowd counting benchmarks, including ShanghaiTech Part_A, ShanghaiTech Part_B, UCF_CC_50, and UCF-QNRF, and our CDENet achieves superior performance compared with state-of-the-art methods.
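As a hedged illustration of how confusion regions could be excavated by comparing the primary prediction with the ground-truth density map (the thresholds are my own, not the paper's):

import torch

def mine_confusion_regions(pred_density, gt_density, pred_thr=0.05, gt_thr=0.01):
    # pred_density, gt_density: (B, 1, H, W)
    false_response = pred_density > pred_thr       # model thinks crowd is here
    truly_empty = gt_density < gt_thr              # annotation says it is not
    return (false_response & truly_empty).float()  # confusion-region mask

mask = mine_confusion_regions(torch.rand(1, 1, 64, 64), torch.zeros(1, 1, 64, 64))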

8.
IEEE Trans Image Process ; 32: 5340-5352, 2023.
Article in English | MEDLINE | ID: mdl-37729570

ABSTRACT

Depth data, whose discriminative power lies mainly in location cues, are advantageous for accurate salient object detection (SOD). Existing RGBD SOD methods have focused on how to properly use depth information for complementary fusion with RGB data and have achieved great success. In this work, we attempt a far more ambitious use of the depth information by injecting the depth maps into the encoder of a single-stream model. Specifically, we propose a depth injection framework (DIF) equipped with an Injection Scheme (IS) and a Depth Injection Module (DIM). The proposed IS enhances the semantic representation of the RGB features in the encoder by directly injecting depth maps into the high-level encoder blocks, while helping our model maintain computational convenience. Our proposed DIM acts as a bridge between the depth maps and the hierarchical RGB features of the encoder and helps the information of the two modalities complement and guide each other, contributing to a strong fusion effect. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on six RGBD datasets. Moreover, our method achieves excellent performance on RGBT SOD, and our DIM can be easily applied to single-stream SOD models and the transformer architecture, demonstrating its powerful generalization ability.
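A minimal sketch of the depth-injection idea under my own assumptions: the raw depth map is resized to a high-level feature resolution and used to modulate the RGB features, rather than being fused in a second stream.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthInjection(nn.Module):
    def __init__(self, rgb_ch=512):
        super().__init__()
        self.depth_proj = nn.Conv2d(1, rgb_ch, 1)

    def forward(self, rgb_feat, depth_map):
        d = F.interpolate(depth_map, size=rgb_feat.shape[-2:], mode='bilinear',
                          align_corners=False)
        gate = torch.sigmoid(self.depth_proj(d))   # depth-derived modulation
        return rgb_feat * gate + rgb_feat          # inject without discarding RGB

out = DepthInjection()(torch.randn(1, 512, 14, 14), torch.rand(1, 1, 224, 224))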

9.
Article in English | MEDLINE | ID: mdl-37310821

ABSTRACT

Recently, referring image segmentation has attracted wide attention given its huge potential in human-robot interaction. Networks that identify the referred region must have a deep understanding of both the image and the language semantics. To do so, existing works tend to design various mechanisms to achieve cross-modality fusion, for example, tile-and-concatenation and vanilla nonlocal manipulation. However, such plain fusion is usually either too coarse or constrained by exorbitant computation overhead, ultimately leaving the referent insufficiently understood. In this work, we propose a fine-grained semantic funneling infusion (FSFI) mechanism to solve the problem. The FSFI imposes a constant spatial constraint on the querying entities from different encoding stages and dynamically infuses the gleaned language semantics into the vision branch. Moreover, it decomposes the features from different modalities into more delicate components, allowing the fusion to happen in multiple low-dimensional spaces. Such fusion is more effective than fusion in a single high-dimensional space, given its ability to sink more representative information along the channel dimension. Another problem haunting the task is that instilling highly abstract semantics blurs the details of the referent. To address this, we propose a multiscale attention-enhanced decoder (MAED) to alleviate the problem. We design a detail enhancement operator (DeEh) and apply it in a multiscale and progressive way. Features from the higher level are used to generate attention guidance that directs the lower-level features to attend more to the detail regions. Extensive results on the challenging benchmarks show that our network performs favorably against the state-of-the-art methods (SOTAs).

10.
Article in English | MEDLINE | ID: mdl-37235467

ABSTRACT

Advanced deep convolutional neural networks (CNNs) have shown great success in video-based person re-identification (Re-ID). However, they usually focus on the most obvious regions of persons and have a limited global representation ability. Recently, Transformers have been shown to explore interpatch relationships with global observations for performance improvements. In this work, we take advantage of both sides and propose a novel spatial-temporal complementary learning framework named deeply coupled convolution-transformer (DCCT) for high-performance video-based person Re-ID. First, we couple CNNs and Transformers to extract two kinds of visual features and experimentally verify their complementarity. Furthermore, in the spatial domain, we propose a complementary content attention (CCA) that takes advantage of the coupled structure and guides independent features for spatial complementary learning. In the temporal domain, a hierarchical temporal aggregation (HTA) is proposed to progressively capture the interframe dependencies and encode temporal information. Besides, a gated attention (GA) is used to deliver the aggregated temporal information into the CNN and Transformer branches for temporal complementary learning. Finally, we introduce a self-distillation training strategy to transfer the superior spatial-temporal knowledge to backbone networks for higher accuracy and efficiency. In this way, two kinds of typical features from the same videos are integrated into more informative representations. Extensive experiments on four public Re-ID benchmarks demonstrate that our framework attains better performance than most state-of-the-art methods.
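A toy sketch (my own construction, not DCCT) of coupling a CNN branch with a Transformer branch and letting each branch's descriptor gate the other, in the spirit of the complementary learning described above.

import torch
import torch.nn as nn

class CoupledBranches(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.cnn = nn.Conv2d(3, dim, 3, padding=1)
        self.to_tokens = nn.Conv2d(3, dim, 16, stride=16)        # crude patch embedding
        self.transformer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.gate_c = nn.Linear(dim, dim)
        self.gate_t = nn.Linear(dim, dim)

    def forward(self, img):
        c = self.cnn(img).mean(dim=(2, 3))                        # CNN descriptor
        t = self.to_tokens(img).flatten(2).transpose(1, 2)        # (B, N, dim)
        t = self.transformer(t).mean(dim=1)                       # Transformer descriptor
        # each branch attends to what the other found
        c_out = c * torch.sigmoid(self.gate_t(t))
        t_out = t * torch.sigmoid(self.gate_c(c))
        return torch.cat([c_out, t_out], dim=1)                   # complementary feature

feat = CoupledBranches()(torch.randn(2, 3, 64, 64))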

11.
IEEE Trans Image Process ; 32: 3108-3120, 2023.
Article in English | MEDLINE | ID: mdl-37220043

ABSTRACT

Both salient object detection (SOD) and camouflaged object detection (COD) are typical object segmentation tasks. They are intuitively contradictory, but are intrinsically related. In this paper, we explore the relationship between SOD and COD, and then borrow successful SOD models to detect camouflaged objects to save the design cost of COD models. The core insight is that both SOD and COD leverage two aspects of information: object semantic representations for distinguishing object from background, and context attributes that decide the object category. Specifically, we start by decoupling context attributes and object semantic representations from both SOD and COD datasets through a novel decoupling framework with triple measure constraints. Then, we transfer saliency context attributes to the camouflaged images by introducing an attribute transfer network. The generated weakly camouflaged images can bridge the context attribute gap between SOD and COD, thereby improving the SOD models' performance on COD datasets. Comprehensive experiments on three widely used COD datasets verify the effectiveness of the proposed method. Code and model are available at: https://github.com/wdzhao123/SAT.

12.
Article in English | MEDLINE | ID: mdl-37040245

ABSTRACT

General deep learning-based methods for infrared and visible image fusion rely on the unsupervised mechanism for vital information retention by utilizing elaborately designed loss functions. However, the unsupervised mechanism depends on a well-designed loss function, which cannot guarantee that all vital information of source images is sufficiently extracted. In this work, we propose a novel interactive feature embedding in a self-supervised learning framework for infrared and visible image fusion, attempting to overcome the issue of vital information degradation. With the help of a self-supervised learning framework, hierarchical representations of source images can be efficiently extracted. In particular, interactive feature embedding models are tactfully designed to build a bridge between self-supervised learning and infrared and visible image fusion learning, achieving vital information retention. Qualitative and quantitative evaluations exhibit that the proposed method performs favorably against state-of-the-art methods.

13.
IEEE Trans Pattern Anal Mach Intell ; 45(7): 8507-8523, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37015509

ABSTRACT

Correlation plays a critical role in the tracking field, especially in recent popular Siamese-based trackers. The correlation operation is a simple fusion method that considers the similarity between the template and the search region. However, the correlation operation is a local linear matching process that loses semantic information and easily falls into a local optimum, which may be the bottleneck in designing high-accuracy tracking algorithms. In this work, to determine whether a better feature fusion method exists than correlation, a novel attention-based feature fusion network, inspired by the transformer, is presented. This network effectively combines the template and search region features using an attention mechanism. Specifically, the proposed method includes an ego-context augment module based on self-attention and a cross-feature augment module based on cross-attention. First, we present a transformer tracking (named TransT) method based on the Siamese-like feature extraction backbone, the designed attention-based fusion mechanism, and the classification and regression heads. Based on the TransT baseline, we also design a segmentation branch to generate an accurate mask. Finally, we propose a stronger version of TransT by extending it with a multi-template scheme and an IoU prediction head, named TransT-M. Experiments show that our TransT and TransT-M methods achieve promising results on seven popular benchmarks. Code and models are available at https://github.com/chenxin-dlut/TransT-M.
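To make the fusion concrete, here is a minimal sketch of attention-based template/search fusion in the spirit of TransT; the token counts, dimensions, and single-layer design are assumptions for illustration only.

import torch
import torch.nn as nn

class CrossFeatureFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, search_tokens, template_tokens):
        # ego-context augment: search tokens attend to themselves
        s, _ = self.self_attn(search_tokens, search_tokens, search_tokens)
        # cross-feature augment: search tokens query the template
        fused, _ = self.cross_attn(s, template_tokens, template_tokens)
        return fused + search_tokens   # residual connection

search = torch.randn(1, 1024, 256)      # 32x32 search-region tokens
template = torch.randn(1, 64, 256)      # 8x8 template tokens
out = CrossFeatureFusion()(search, template)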

14.
Article in English | MEDLINE | ID: mdl-37018296

ABSTRACT

While deep-learning-based tracking methods have achieved substantial progress, they entail large-scale and high-quality annotated data for sufficient training. To eliminate expensive and exhaustive annotation, we study self-supervised (SS) learning for visual tracking. In this work, we develop the crop-transform-paste operation, which is able to synthesize sufficient training data by simulating various appearance variations during tracking, including appearance variations of objects and background interference. Since the target state is known in all synthesized data, existing deep trackers can be trained in routine ways using the synthesized data without human annotation. The proposed target-aware data-synthesis method adapts existing tracking approaches within a SS learning framework without algorithmic changes. Thus, the proposed SS learning mechanism can be seamlessly integrated into existing tracking frameworks to perform training. Extensive experiments show that our method: 1) achieves favorable performance against supervised (Su) learning schemes under the cases with limited annotations; 2) helps deal with various tracking challenges such as object deformation, occlusion (OCC), or background clutter (BC) due to its manipulability; 3) performs favorably against the state-of-the-art unsupervised tracking methods; and 4) boosts the performance of various state-of-the-art Su learning frameworks, including SiamRPN++, DiMP, and TransT.
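A simplified sketch of the crop-transform-paste operation: the target is cropped from one frame, transformed to simulate appearance variation, and pasted onto another image so that the new ground-truth box is known by construction; the particular transforms here are toy choices, not the paper's.

import torch
import torchvision.transforms.functional as TF

def crop_transform_paste(src_img, src_box, bg_img, paste_xy, angle=15.0):
    # src_img, bg_img: (C, H, W) tensors; src_box = (x, y, w, h); paste_xy = (x, y)
    x, y, w, h = src_box
    patch = src_img[:, y:y + h, x:x + w].clone()
    patch = TF.rotate(patch, angle)                  # simulate appearance variation
    patch = TF.adjust_brightness(patch, 1.2)
    px, py = paste_xy
    out = bg_img.clone()
    out[:, py:py + h, px:px + w] = patch             # paste; new box is (px, py, w, h)
    return out, (px, py, w, h)

img, box = crop_transform_paste(torch.rand(3, 128, 128), (20, 20, 40, 40),
                                torch.rand(3, 128, 128), (60, 60))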

15.
Article in English | MEDLINE | ID: mdl-37018701

ABSTRACT

Most of the existing bi-modal (RGB-D and RGB-T) salient object detection methods utilize the convolution operation and construct complex interweave fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation constrains the performance of the convolution-based methods to a ceiling. In this work, we rethink these tasks from the perspective of global information alignment and transformation. Specifically, the proposed cross-modal view-mixed transformer (CAVER) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path. CAVER treats the multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on a novel view-mixed attention mechanism. Besides, considering the quadratic complexity w.r.t. the number of input tokens, we design a parameter-free patch-wise token re-embedding strategy to simplify operations. Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that such a simple two-stream encoder-decoder framework can surpass recent state-of-the-art methods when it is equipped with the proposed components.

16.
IEEE Trans Image Process ; 32: 2322-2334, 2023.
Article in English | MEDLINE | ID: mdl-37071519

ABSTRACT

Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching. Generally, recent approaches first employ a cross-modal attention unit to capture latent region-word interactions, and then integrate all the alignments to obtain the final similarity. However, most of them adopt one-time forward association or aggregation strategies with complex architectures or additional information, while ignoring the regulation ability of network feedback. In this paper, we develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations. Specifically, we propose (i) a Recurrent Correspondence Regulator (RCR) which facilitates the cross-modal attention unit progressively with adaptive attention factors to capture more flexible correspondence, and (ii) a Recurrent Aggregation Regulator (RAR) which adjusts the aggregation weights repeatedly to increasingly emphasize important alignments and dilute unimportant ones. Besides, it is interesting that RCR and RAR are "plug-and-play": both of them can be incorporated into many frameworks based on cross-modal interaction to obtain significant benefits, and their cooperation achieves further improvements. Extensive experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models, confirming the general effectiveness and generalization ability of the proposed methods.
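A hedged sketch of the recurrent aggregation idea: per-alignment weights are refined over a few iterations so strong alignments are emphasized and weak ones diluted before pooling into one similarity; the GRU-based update rule is my own stand-in, not the paper's RAR.

import torch
import torch.nn as nn

class RecurrentAggregator(nn.Module):
    def __init__(self, steps=3):
        super().__init__()
        self.steps = steps
        self.update = nn.GRUCell(1, 1)   # tiny recurrent state per alignment score

    def forward(self, align_scores):
        # align_scores: (B, N) similarity of each region-word alignment
        b, n = align_scores.shape
        state = torch.zeros(b * n, 1)
        for _ in range(self.steps):
            state = self.update(align_scores.reshape(-1, 1), state)
        weights = torch.softmax(state.reshape(b, n), dim=1)   # refined aggregation weights
        return (weights * align_scores).sum(dim=1)            # final image-text similarity

sim = RecurrentAggregator()(torch.rand(4, 36))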

17.
IEEE Trans Cybern ; 53(1): 379-391, 2023 Jan.
Article in English | MEDLINE | ID: mdl-34406954

ABSTRACT

Most existing light field saliency detection methods have achieved great success by exploiting the focus information in focal slices that is unique to light field data. However, they process light field data in a slicewise way, leading to suboptimal results because the relative contribution of different regions within the focal slices is ignored. How can we comprehensively explore and integrate the focused salient regions that positively contribute to accurate saliency detection? Answering this question inspires us to develop a new insight. In this article, we propose a patch-aware network that explores light field data in a regionwise way. First, we excavate focused salient regions with a proposed multisource learning module (MSLM), which generates a filtering strategy for integration followed by three guidances based on saliency, boundary, and position. Second, we design a sharpness recognition module (SRM) to refine and update this strategy and perform feature integration. With our proposed MSLM and SRM, we can obtain more accurate and complete saliency maps. Comprehensive experiments on three benchmark datasets prove that our proposed method achieves competitive performance over 2-D, 3-D, and 4-D salient object detection methods. The code and results of our method are available at https://github.com/OIPLab-DUT/IEEE-TCYB-PANet.

18.
IEEE Trans Neural Netw Learn Syst ; 34(5): 2246-2258, 2023 May.
Article in English | MEDLINE | ID: mdl-34469313

ABSTRACT

Recently, referring image localization and segmentation has aroused widespread interest. However, existing methods lack a clear description of the interdependence between language and vision. To this end, we present a bidirectional relationship inferring network (BRINet) to effectively address these challenging tasks. Specifically, we first employ a vision-guided linguistic attention module to perceive the keywords corresponding to each image region. Then, language-guided visual attention adopts the learned adaptive language to guide the update of the visual features. Together, they form a bidirectional cross-modal attention module (BCAM) that achieves mutual guidance between language and vision and helps the network better align the cross-modal features. Based on the vanilla language-guided visual attention, we further design an asymmetric language-guided visual attention, which significantly reduces the computational cost by modeling the relationship between each pixel and each pooled subregion. In addition, a segmentation-guided bottom-up augmentation module (SBAM) is utilized to selectively combine multilevel information flow for object localization. Experiments show that our method outperforms other state-of-the-art methods on three referring image localization datasets and four referring image segmentation datasets.

19.
IEEE Trans Pattern Anal Mach Intell ; 45(1): 460-474, 2023 Jan.
Article in English | MEDLINE | ID: mdl-35196229

ABSTRACT

Compared with short-term tracking, long-term tracking remains a challenging task that usually requires the tracking algorithm to track targets within a local region and re-detect targets over the entire image. However, few works have addressed this setting, and their performance has been limited. In this paper, we present a novel robust and real-time long-term tracking framework based on a proposed local search module and re-detection module. The local search module consists of an effective bounding box regressor that generates a series of candidate proposals and a target verifier that infers the optimal candidate with its confidence score. For local search, we design a long short-term update scheme to improve the target verifier: its verification capability is improved by using several templates updated at different times. Based on the verification scores, our tracker determines whether the tracked object is present or absent and accordingly chooses a local or global search strategy in the next frame. For global re-detection, we develop a novel re-detection module that can estimate the target position and target size for a given base tracker. We conduct a series of experiments to demonstrate that this module can be flexibly integrated into many other tracking algorithms for long-term tracking and that it improves long-term tracking performance effectively. Numerous experiments and discussions are conducted on several popular tracking datasets, including VOT, OxUvA, TLP, and LaSOT. The experimental results demonstrate that the proposed tracker achieves satisfactory performance at a real-time speed. Code is available at https://github.com/difhnp/ELGLT.
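A control-flow sketch (my own simplification) of the present/absent decision described above: the tracker searches locally while the verifier is confident and falls back to global re-detection otherwise; local_search, global_redetect, and verify are hypothetical callables, not the authors' API.

def track_sequence(frames, local_search, global_redetect, verify, thr=0.5):
    state, results = None, []
    absent = False
    for frame in frames:
        if absent:
            state = global_redetect(frame)            # re-detect over the whole image
        else:
            state = local_search(frame, state)        # search near the previous state
        score = verify(frame, state)                  # confidence from the verifier
        absent = score < thr                          # choose the strategy for the next frame
        results.append((state, score))
    return results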

20.
IEEE Trans Pattern Anal Mach Intell ; 45(5): 6168-6182, 2023 May.
Article in English | MEDLINE | ID: mdl-36040937

ABSTRACT

In a sequence, the appearance of both the target and the background often changes dramatically. Offline-trained models may not handle huge appearance variations well, causing tracking failures. Most discriminative trackers address this issue by introducing an online update scheme that makes the model dynamically adapt to the changes of the target and background. Although the online update scheme plays an important role in improving the tracker's accuracy, it inevitably pollutes the model with noisy observation samples, so it is necessary to reduce the risk of online updating for better tracking. In this work, we propose a novel offline-trained Meta-Updater to address an important but unsolved problem: is the tracker ready for updating in the current frame? The proposed module effectively integrates geometric, discriminative, and appearance cues in a sequential manner, and then mines the sequential information with a designed cascaded LSTM module. Moreover, we strengthen the effect of appearance information on the module, i.e., an additional local outlier factor is introduced and integrated into a newly designed network. We integrate our meta-updater into eight different types of online-update trackers. Extensive experiments on four long-term and two short-term tracking benchmarks demonstrate that our meta-updater is effective and has strong generalization ability.
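A rough sketch, on assumed inputs, of a meta-updater: per-frame cues (geometric, discriminative, appearance) are stacked into a short sequence and an LSTM head outputs whether updating in the current frame is safe; the cue dimensionality and the two-layer LSTM are assumptions.

import torch
import torch.nn as nn

class MetaUpdater(nn.Module):
    def __init__(self, cue_dim=8, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(cue_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, cue_seq):
        # cue_seq: (B, T, cue_dim) cues from the last T frames
        out, _ = self.lstm(cue_seq)
        return torch.sigmoid(self.head(out[:, -1]))   # probability that updating is safe

p_update = MetaUpdater()(torch.randn(1, 20, 8))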
