Results 1 - 4 of 4
1.
IEEE Trans Pattern Anal Mach Intell; 46(8): 5384-5397, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38335082

ABSTRACT

Understanding human posture is a challenging topic that encompasses several tasks, e.g., pose estimation, body mesh recovery, and pose tracking. In this article, we propose a novel Distribution-Aware Single-stage (DAS) model for these pose-related tasks. DAS estimates human position and localizes joints simultaneously, requiring only a single forward pass. We use normalizing flows to let DAS learn the true distribution of joint locations rather than assuming a simple Gaussian or Laplacian form; this learned prior greatly boosts the accuracy of regression-based methods, allowing DAS to match the performance of volumetric-based methods. We also introduce a recursive update strategy that progressively approaches the regression target, reducing the difficulty of regression and improving its performance. We further adapt DAS to multi-person mesh recovery and pose tracking, achieving strong results on both tasks. Comprehensive experiments on CMU Panoptic and MuPoTS-3D demonstrate the superior efficiency of DAS, specifically a 1.5× speedup over the previous best method, together with state-of-the-art accuracy for multi-person pose estimation. Extensive experiments on 3DPW and PoseTrack2018 show the effectiveness and efficiency of DAS for human body mesh recovery and pose tracking, respectively, confirming the generality of the proposed model.
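
The recursive update idea, regressing a correction to the current joint estimate at each step instead of the absolute location, is easy to illustrate. The Python/PyTorch sketch below is a hypothetical illustration only; the layer sizes, the step count, and the assumption that per-joint features are already pooled are not the authors' DAS architecture.

import torch
import torch.nn as nn

class RecursiveJointRegressor(nn.Module):
    """Refine 2D joint coordinates by repeatedly regressing offsets."""

    def __init__(self, feat_dim=256, steps=3):
        super().__init__()
        self.steps = steps
        # Shared head: per-joint features + current estimate -> offset.
        self.offset_head = nn.Sequential(
            nn.Linear(feat_dim + 2, 128),
            nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, joint_feats, init_coords):
        # joint_feats: (B, J, feat_dim); init_coords: (B, J, 2) in [0, 1].
        coords = init_coords
        for _ in range(self.steps):
            # Conditioning on the current estimate makes each step a
            # small residual correction rather than a full regression.
            delta = self.offset_head(torch.cat([joint_feats, coords], dim=-1))
            coords = coords + delta
        return coords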


Subject(s)
Algorithms; Posture; Humans; Posture/physiology; Imaging, Three-Dimensional/methods; Image Processing, Computer-Assisted/methods; Regression Analysis
2.
IEEE Trans Pattern Anal Mach Intell; 45(7): 8646-8659, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37018636

ABSTRACT

Given a natural language referring expression, the referring video segmentation task aims to predict the segmentation mask of the referred object in the video. Previous methods apply a single 3D CNN encoder to the whole clip to extract mixed spatio-temporal features for the target frame. Although 3D convolutions can recognize which object is performing the described actions, they also introduce misaligned spatial information from adjacent frames, which confuses the features of the target frame and leads to inaccurate segmentation. To tackle this issue, we propose a language-aware spatial-temporal collaboration framework that contains a 3D temporal encoder over the video clip to recognize the described actions, and a 2D spatial encoder over the target frame to provide undisturbed spatial features of the referred object. For multimodal feature extraction, we propose a Cross-Modal Adaptive Modulation (CMAM) module and its improved version, CMAM+, which conduct adaptive cross-modal interaction in the encoders with spatially or temporally relevant language features that are progressively updated to enrich the global linguistic context. In addition, we propose a Language-Aware Semantic Propagation (LASP) module in the decoder that propagates semantic information from deep to shallow stages via language-aware sampling and assignment; it highlights language-compatible foreground visual features and suppresses language-incompatible background features, better facilitating the spatial-temporal collaboration. Extensive experiments on four popular referring video segmentation benchmarks demonstrate the superiority of our method over previous state-of-the-art methods.
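
The abstract does not detail CMAM's internals. One common, generic way to realize adaptive cross-modal modulation is FiLM-style channel-wise scaling and shifting of visual features by language-derived parameters; the sketch below shows that generic pattern, with all names and dimensions assumed rather than taken from the paper.

import torch
import torch.nn as nn

class CrossModalModulation(nn.Module):
    """FiLM-style modulation of visual features by a language vector."""

    def __init__(self, vis_dim=256, lang_dim=300):
        super().__init__()
        # Language features predict a per-channel scale and shift.
        self.to_gamma = nn.Linear(lang_dim, vis_dim)
        self.to_beta = nn.Linear(lang_dim, vis_dim)

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, C, H, W); lang_feat: (B, lang_dim).
        gamma = self.to_gamma(lang_feat)[:, :, None, None]
        beta = self.to_beta(lang_feat)[:, :, None, None]
        # Channel-wise affine modulation; spatial layout is untouched.
        return (1.0 + gamma) * vis_feat + beta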

3.
IEEE Trans Cybern; 47(2): 449-460, 2017 Feb.
Article in English | MEDLINE | ID: mdl-27046859

ABSTRACT

Recently, convolutional neural network (CNN) visual features have demonstrated their power as a universal representation for various recognition tasks. In this paper, cross-modal retrieval with CNN visual features is implemented with several classic methods. Specifically, off-the-shelf CNN visual features, extracted from a model pretrained on ImageNet with more than one million images from 1000 object categories, serve as a generic image representation for cross-modal retrieval. To further enhance their representational ability, the pretrained model is fine-tuned on each target data set using the open-source Caffe CNN library. In addition, we propose a deep semantic matching method to address cross-modal retrieval for samples annotated with one or multiple labels. Extensive experiments on five popular publicly available data sets demonstrate the superiority of CNN visual features for cross-modal retrieval.
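
Once images and texts are embedded in a shared semantic space, cross-modal retrieval reduces to nearest-neighbour ranking under a similarity measure such as cosine similarity. Below is a minimal Python sketch of that ranking step; the feature arrays are placeholders standing in for the fine-tuned CNN's visual features and text features already projected into the same space.

import numpy as np

def rank_by_cosine(query_vecs, gallery_vecs):
    """Return gallery indices sorted best-first for each query."""
    # L2-normalize so the dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    sims = q @ g.T                      # (num_queries, num_gallery)
    return np.argsort(-sims, axis=1)    # highest similarity first

# Placeholder features: 4 image queries against 10 text candidates.
img_feats = np.random.randn(4, 512)
txt_feats = np.random.randn(10, 512)
ranking = rank_by_cosine(img_feats, txt_feats)   # (4, 10) index matrix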

4.
IEEE Trans Pattern Anal Mach Intell; 37(12): 2402-2414, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26539846

ABSTRACT

In this work, the human parsing task, namely decomposing a human image into semantic fashion/body regions, is formulated as an active template regression (ATR) problem. The normalized mask of each fashion/body item is expressed as a linear combination of learned mask templates, then morphed into a more precise mask by active shape parameters: the position, scale, and visibility of each semantic region. The mask template coefficients and the active shape parameters together generate the human parsing result and are thus called the structure outputs for human parsing. A deep convolutional neural network (CNN) builds the end-to-end relation between the input human image and these structure outputs. More specifically, the structure outputs are predicted by two separate networks: the first uses max-pooling and predicts the template coefficients for each label mask, while the second omits max-pooling to preserve sensitivity to label mask position and accurately predict the active shape parameters. For a new image, the outputs of the two networks are fused to generate the probability of each label for each pixel, and super-pixel smoothing refines the final parsing result. Comprehensive evaluations on a large dataset demonstrate the significant superiority of the ATR framework over other state-of-the-art methods for human parsing. In particular, our ATR framework reaches an F1-score of 64.38 percent, significantly higher than the 44.76 percent of the state-of-the-art algorithm [28].
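
The decoding direction of ATR, turning template coefficients and active shape parameters back into a region mask, can be sketched compactly in Python. The template count, resolution, and the exact warp parameterization below are assumptions for illustration, not the paper's implementation.

import numpy as np
from scipy.ndimage import affine_transform

def decode_mask(coeffs, templates, dy, dx, scale, visible):
    """Rebuild one region's mask from ATR-style structure outputs.

    coeffs:    (K,) weights over K learned mask templates
    templates: (K, H, W) normalized mask templates
    dy, dx:    translation in pixels; scale: isotropic scale factor
    visible:   whether the semantic region is present at all
    """
    if not visible:
        return np.zeros(templates.shape[1:])
    # Normalized mask = linear combination of the learned templates.
    mask = np.tensordot(coeffs, templates, axes=1)          # (H, W)
    # Morph with the active shape parameters. affine_transform maps
    # output coordinates to input coordinates, hence the inversion.
    matrix = np.diag([1.0 / scale, 1.0 / scale])
    offset = np.array([-dy, -dx]) / scale
    return affine_transform(mask, matrix, offset=offset, order=1)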


Subject(s)
Image Interpretation, Computer-Assisted/methods; Imaging, Three-Dimensional/methods; Pattern Recognition, Automated/methods; Photography/methods; Subtraction Technique; Whole Body Imaging/methods; Algorithms; Computer Simulation; Humans; Image Enhancement/methods; Machine Learning; Models, Biological; Models, Statistical; Regression Analysis; Reproducibility of Results; Sensitivity and Specificity