1.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 3137-3155, 2024 May.
Article in English | MEDLINE | ID: mdl-38090832

ABSTRACT

Cross-domain generalizable depth estimation aims to estimate the depth of target domains (i.e., real-world) using models trained on source domains (i.e., synthetic). Previous methods mainly use additional real-world datasets to extract depth-specific information for cross-domain generalizable depth estimation. Unfortunately, due to the large domain gap, adequate depth-specific information is hard to obtain and interference is difficult to remove, which limits performance. To relieve these problems, we propose a domain-generalizable feature extraction network with adaptive guidance fusion (AGDF-Net) that acquires the essential features for depth estimation at multiple feature scales. Specifically, AGDF-Net first separates the image into initial-depth and weakly depth-related components using reconstruction and contrary losses. Subsequently, an adaptive guidance fusion module intensifies the initial-depth features to obtain domain-generalizable intensified depth features. Finally, taking the intensified depth features as input, an arbitrary depth estimation network can be used for real-world depth estimation. Trained only on synthetic datasets, AGDF-Net achieves state-of-the-art performance on various real-world datasets (KITTI, NYUDv2, NuScenes, DrivingStereo, and CityScapes). Furthermore, experiments with a small amount of real-world data in a semi-supervised setting also demonstrate the superiority of AGDF-Net over state-of-the-art approaches.
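
To make the separation step concrete, here is a minimal PyTorch sketch of a two-component split trained with reconstruction and contrary losses. The module layout, loss forms, and weights are illustrative assumptions, not the paper's AGDF-Net architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureSeparator(nn.Module):
        """Splits an image into an initial-depth component and a weakly
        depth-related component (stand-in single-conv encoders)."""
        def __init__(self, channels=32):
            super().__init__()
            self.depth_branch = nn.Conv2d(3, channels, 3, padding=1)
            self.weak_branch = nn.Conv2d(3, channels, 3, padding=1)
            self.decoder = nn.Conv2d(2 * channels, 3, 3, padding=1)

        def forward(self, image):
            f_depth = self.depth_branch(image)
            f_weak = self.weak_branch(image)
            recon = self.decoder(torch.cat([f_depth, f_weak], dim=1))
            return f_depth, f_weak, recon

    def separation_losses(image, f_depth, f_weak, recon):
        # Reconstruction loss: the two components together must explain the image.
        l_recon = F.l1_loss(recon, image)
        # "Contrary" loss: discourage similarity between the two components
        # (absolute cosine similarity pushed toward zero).
        sim = F.cosine_similarity(f_depth.flatten(1), f_weak.flatten(1), dim=1)
        return l_recon + sim.abs().mean()

    model = FeatureSeparator()
    img = torch.rand(2, 3, 64, 64)
    loss = separation_losses(img, *model(img))
    loss.backward()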

2.
IEEE Trans Pattern Anal Mach Intell ; 45(12): 14301-14320, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37590113

ABSTRACT

Due to domain differences and unbalanced disparity distributions across datasets, current stereo matching approaches are commonly limited to a specific dataset and generalize poorly to others. This domain shift issue is usually addressed by substantial adaptation on costly target-domain ground-truth data, which cannot be easily obtained in practical settings. In this paper, we propose to dig into uncertainty estimation for robust stereo matching. Specifically, to balance the disparity distribution, we employ pixel-level uncertainty estimation to adaptively adjust the disparity search space of the next stage, driving the network to progressively prune away unlikely correspondences. Then, to cope with the scarcity of ground-truth data, an uncertainty-based pseudo-labeling scheme is proposed to adapt the pre-trained model to the new domain: pixel-level and area-level uncertainty estimation filter out high-uncertainty pixels of the predicted disparity maps and generate sparse yet reliable pseudo-labels to bridge the domain gap. Experimentally, our method shows strong cross-domain, adaptation, and joint generalization, and obtained 1st place on the stereo task of the Robust Vision Challenge 2020. Additionally, our uncertainty-based pseudo-labels can be extended to train monocular depth estimation networks in an unsupervised way and even achieve performance comparable to supervised methods.
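
The pseudo-labeling step can be pictured with a short sketch. This assumes a pre-trained model that outputs a disparity map plus a per-pixel uncertainty map; the pixel threshold and the keep-ratio used for the area-level filter are illustrative values, not the paper's.

    import torch

    def make_pseudo_labels(disparity, uncertainty, pixel_thresh=0.2,
                           area_keep_ratio=0.5):
        """Keep only low-uncertainty pixels, then keep only the most
        confident fraction of the surviving area."""
        mask = uncertainty < pixel_thresh                       # pixel-level filter
        vals = uncertainty[mask]
        if vals.numel() > 0:
            k = max(1, int(vals.numel() * area_keep_ratio))
            area_thresh = vals.topk(k, largest=False).values.max()
            mask = mask & (uncertainty <= area_thresh)          # area-level filter
        pseudo = torch.where(mask, disparity,
                             torch.full_like(disparity, float('nan')))
        return pseudo, mask

    disp = torch.rand(1, 1, 64, 128) * 192
    unc = torch.rand(1, 1, 64, 128)
    pseudo, valid = make_pseudo_labels(disp, unc)
    print(valid.float().mean())  # fraction of pixels kept as sparse supervision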

3.
Article in English | MEDLINE | ID: mdl-37022903

ABSTRACT

Single image dehazing is a challenging and ill-posed problem due to severe information degeneration in images captured under hazy conditions. Remarkable progress has been achieved by deep-learning-based image dehazing methods, where residual learning is commonly used to separate the hazy image into clear and haze components. However, the inherently low similarity between the haze and clear components is commonly neglected, and the lack of a contrastive constraint between the two components limits the performance of these approaches. To deal with these problems, we propose an end-to-end self-regularized network (TUSR-Net) that exploits the contrastive peculiarity of the different components of the hazy image, i.e., self-regularization (SR). Specifically, the hazy image is separated into clear and haze components, and the constraint between the components, i.e., self-regularization, is leveraged to pull the recovered clear image closer to the ground truth, which substantially improves dehazing performance. Meanwhile, an effective triple-unfolding framework combined with dual feature-to-pixel attention is proposed to intensify and fuse intermediate information at the feature, channel, and pixel levels, respectively, yielding features with better representational ability. With a weight-sharing strategy, TUSR-Net achieves a better trade-off between performance and parameter count and is much more flexible. Experiments on various benchmark datasets demonstrate the superiority of TUSR-Net over state-of-the-art single image dehazing methods.
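
The self-regularization constraint can be expressed as a simple loss. The sketch below assumes a residual decomposition (hazy = clear + haze) and uses a cosine-similarity penalty as a stand-in for the paper's contrastive constraint; the loss weighting is illustrative.

    import torch
    import torch.nn.functional as F

    def self_regularized_loss(hazy, clear_pred, gt_clear):
        haze_pred = hazy - clear_pred                 # residual haze component
        l_fidelity = F.l1_loss(clear_pred, gt_clear)  # pull toward ground truth
        # Self-regularization: the haze and clear components should be
        # dissimilar, so penalize their cosine similarity.
        sim = F.cosine_similarity(clear_pred.flatten(1),
                                  haze_pred.flatten(1), dim=1)
        return l_fidelity + 0.1 * sim.abs().mean()

    hazy = torch.rand(2, 3, 64, 64)
    gt = torch.rand(2, 3, 64, 64)
    pred = torch.rand(2, 3, 64, 64, requires_grad=True)
    self_regularized_loss(hazy, pred, gt).backward()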

4.
Sensors (Basel) ; 22(9)2022 Apr 28.
Article in English | MEDLINE | ID: mdl-35591079

ABSTRACT

Recently, generating dense maps in real time has become a hot research topic in the mobile robotics community, since dense maps provide more informative and continuous features than sparse maps. Implicit depth representations (e.g., depth codes) derived from deep neural networks have been employed in visual-only or visual-inertial simultaneous localization and mapping (SLAM) systems, achieving promising performance on both camera motion and local dense geometry estimation from monocular images. However, existing visual-inertial SLAM systems that use depth codes are either built on a filter-based SLAM framework, which can only update poses and maps within a relatively small local time window, or on a loosely-coupled framework, in which the prior geometric constraints from the depth estimation network are not exploited to improve state estimation. To address these drawbacks, we propose DiT-SLAM, a novel real-time Dense visual-inertial SLAM with implicit depth representation and Tightly-coupled graph optimization. Most importantly, poses, sparse maps, and low-dimensional depth codes are optimized in a tightly-coupled graph that considers the visual, inertial, and depth residuals simultaneously. Meanwhile, we propose a lightweight monocular depth estimation and completion network, which combines attention mechanisms with a conditional variational auto-encoder (CVAE) to predict uncertainty-aware dense depth maps from low-dimensional codes. Furthermore, a robust point sampling strategy based on the spatial distribution of 2D feature points is proposed to provide geometric constraints in the tightly-coupled optimization, especially for textureless or featureless cases in indoor environments. We evaluate our system on open benchmarks. The proposed method achieves better performance on both dense depth estimation and trajectory estimation than the baseline and other systems.
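
The tightly-coupled objective can be illustrated with a toy joint optimization over a pose increment and a latent depth code. The residuals below are placeholders for the reprojection, IMU-preintegration, and depth-consistency terms; none of this reproduces the DiT-SLAM factor graph.

    import torch

    pose = torch.zeros(6, requires_grad=True)          # se(3) pose increment
    depth_code = torch.zeros(32, requires_grad=True)   # latent dense-depth code

    # Toy targets standing in for measurements (placeholders).
    target_pose = torch.randn(6)
    target_code = torch.randn(32)

    def total_cost(pose, depth_code):
        r_visual = pose - target_pose             # reprojection stand-in
        r_inertial = 0.5 * (pose - target_pose)   # IMU preintegration stand-in
        r_depth = depth_code - target_code        # depth-consistency stand-in
        # All residuals enter one joint cost, so pose and depth code are
        # updated together rather than in separate loosely-coupled stages.
        return (r_visual @ r_visual + r_inertial @ r_inertial
                + r_depth @ r_depth)

    opt = torch.optim.Adam([pose, depth_code], lr=0.1)
    for _ in range(100):
        opt.zero_grad()
        total_cost(pose, depth_code).backward()
        opt.step()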

5.
Sci Robot ; 6(55)2021 06 30.
Article in English | MEDLINE | ID: mdl-34193561

ABSTRACT

Excavators are widely used for material-handling applications in unstructured environments, including mining and construction. Operating excavators in a real-world environment can be challenging due to extreme conditions, such as rock slides, ground collapse, or excessive dust, and can result in fatalities and injuries. Here, we present an autonomous excavator system (AES) for material loading tasks. Our system can handle different environments and uses an architecture that combines perception and planning. We fuse multimodal perception sensors, including LiDAR and cameras, with advanced image enhancement, material and texture classification, and object detection algorithms. We also present hierarchical task and motion planning algorithms that combine learning-based techniques with optimization-based methods and are tightly integrated with the perception and controller modules. We have evaluated AES performance on compact and standard excavators in many complex indoor and outdoor scenarios, including material loading into dump trucks, waste material handling, rock capturing, pile removal, and trenching tasks. We demonstrate that our architecture improves efficiency and autonomously handles different scenarios. AES has been deployed for real-world operations over long periods and operates robustly in challenging scenarios. AES achieves 24 hours per intervention; that is, the system can operate continuously for 24 hours without any human intervention. Moreover, the amount of material handled by AES per hour is comparable to that of an experienced human operator.
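
The perception-to-planning split can be sketched schematically. The class and function names below are hypothetical, and the two planner levels are reduced to stubs; this only illustrates the hierarchical structure described in the abstract, not the AES implementation.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SceneEstimate:
        material: str          # from material/texture classification
        pile_height_m: float   # from fused LiDAR + camera geometry

    def task_planner(scene: SceneEstimate) -> List[str]:
        # High-level stage (learning-based in the paper): choose an action sequence.
        if scene.pile_height_m < 0.2:
            return ["relocate"]
        return ["approach", "dig", "lift", "swing", "dump"]

    def motion_planner(action: str) -> str:
        # Low-level stage (optimization-based in the paper): produce a trajectory
        # for the controller; a string stands in for the real trajectory type.
        return "trajectory<" + action + ">"

    scene = SceneEstimate(material="gravel", pile_height_m=1.5)
    for action in task_planner(scene):
        print(motion_planner(action))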

6.
IEEE Trans Image Process ; 30: 4691-4705, 2021.
Article in English | MEDLINE | ID: mdl-33900917

ABSTRACT

The success of supervised learning-based single image depth estimation methods critically depends on the availability of large-scale dense per-pixel depth annotations, which require a laborious and expensive annotation process. Self-supervised methods are therefore highly desirable and have attracted significant attention recently. However, depth maps predicted by existing self-supervised methods tend to be blurry, with many depth details lost. To overcome these limitations, we propose a novel framework, named MLDA-Net, to obtain per-pixel depth maps with sharper boundaries and richer depth details. Our first innovation is a multi-level feature extraction (MLFE) strategy that learns rich hierarchical representations. Then, a dual-attention strategy, combining global attention and structure attention, is proposed to intensify the obtained features both globally and locally, resulting in improved depth maps with sharper boundaries. Finally, a reweighted loss strategy based on multi-level outputs is proposed to provide effective supervision for self-supervised depth estimation. Experimental results demonstrate that MLDA-Net achieves state-of-the-art depth prediction results on the KITTI benchmark for self-supervised monocular depth estimation across different input and training modes. Extensive experiments on other benchmark datasets further confirm the superiority of the proposed approach.
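
The reweighted multi-level supervision can be sketched as a weighted sum of per-scale losses. The L1 term below stands in for the photometric objective used in self-supervised training, and the per-scale weights are illustrative, not the paper's values.

    import torch
    import torch.nn.functional as F

    def reweighted_multilevel_loss(preds, target,
                                   weights=(0.125, 0.25, 0.5, 1.0)):
        """preds: list of depth maps at increasing resolution; each is
        upsampled to full resolution and weighted before summing."""
        total = 0.0
        for pred, w in zip(preds, weights):
            up = F.interpolate(pred, size=target.shape[-2:],
                               mode='bilinear', align_corners=False)
            total = total + w * F.l1_loss(up, target)
        return total

    target = torch.rand(2, 1, 64, 64)   # stand-in for the photometric target
    preds = [torch.rand(2, 1, 64 // s, 64 // s, requires_grad=True)
             for s in (8, 4, 2, 1)]
    reweighted_multilevel_loss(preds, target).backward()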
