1.
IEEE Trans Image Process; 33: 1227-1240, 2024.
Article in English | MEDLINE | ID: mdl-38329847

ABSTRACT

In this paper, a simple yet effective semi-supervised method based on consistency regularization is proposed for crowd counting, in which a hybrid perturbation strategy generates strong, diverse perturbations and enhances the mining of information from unlabeled images. Conventional CNN-based counting methods are sensitive to texture perturbations and to the imperceptible noise introduced by adversarial attacks; the hybrid strategy therefore combines a spatial texture transformation with an adversarial perturbation module to perturb the unlabeled data in the semantic and non-semantic spaces, respectively. Moreover, a cross-distribution normalization technique is introduced to address the optimization failure caused by the batch normalization (BN) layers under strong perturbation and to stabilize training. Extensive experiments have been conducted on the ShanghaiTech, UCF-QNRF, NWPU-Crowd, and JHU-Crowd++ datasets. The results demonstrate that the proposed semi-supervised counting method outperforms state-of-the-art methods and shows better robustness to various perturbations.
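The consistency-regularization training step implied by this abstract can be sketched as follows. This is a minimal sketch, assuming a density-map regression model and placeholder weak_aug / strong_aug perturbation functions; the paper's hybrid texture/adversarial perturbations and cross-distribution normalization are not reproduced here.

    import torch
    import torch.nn.functional as F

    def semi_supervised_step(model, labeled_img, gt_density, unlabeled_img,
                             weak_aug, strong_aug, unsup_weight=1.0):
        """One training step: supervised density regression on labeled data plus
        a consistency loss between weakly and strongly perturbed unlabeled views.
        `model`, `weak_aug`, and `strong_aug` are placeholders."""
        # Supervised branch: standard density-map regression.
        sup_loss = F.mse_loss(model(labeled_img), gt_density)

        # Unsupervised branch: the weak view provides a pseudo target
        # (stop-gradient); the strongly perturbed view must match it.
        with torch.no_grad():
            target = model(weak_aug(unlabeled_img))
        consistency = F.mse_loss(model(strong_aug(unlabeled_img)), target)

        return sup_loss + unsup_weight * consistency

The stop-gradient on the weakly perturbed branch is a common design choice in consistency training; the paper may handle the target branch differently.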

2.
IEEE Trans Neural Netw Learn Syst; 34(3): 1304-1318, 2023 Mar.
Article in English | MEDLINE | ID: mdl-34424850

ABSTRACT

The feature pyramid has been widely used in many visual tasks, such as fine-grained image classification, instance segmentation, and object detection, and has achieved promising performance. Although many algorithms exploit different-level features to construct the feature pyramid, they usually treat them equally and do not investigate in depth the inherent complementary advantages of different-level features. In this article, to learn a pyramid feature with robust representational ability for action recognition, we propose a novel collaborative and multilevel feature selection network (FSNet) that applies feature selection and aggregation to multilevel features according to the action context. Unlike previous works that learn the pattern of frame appearance by enhancing spatial encoding, the proposed network consists of a position selection module and a channel selection module that adaptively aggregate multilevel features into a new informative feature along both the position and channel dimensions. The position selection module integrates the vectors at the same spatial location across multilevel features with positionwise attention. Similarly, the channel selection module selectively aggregates the channel maps at the same channel location across multilevel features with channelwise attention. Positionwise features with different receptive fields and channelwise features with different pattern-specific responses are emphasized according to their correlations with the action, and are fused into a new informative feature for action recognition. The proposed FSNet can be flexibly inserted into different backbone networks, and extensive experiments are conducted on three benchmark action datasets: Kinetics, UCF101, and HMDB51. Experimental results show that FSNet is practical and can be collaboratively trained to boost the representational ability of existing networks. FSNet achieves superior performance against most top-tier models on Kinetics and all models on UCF101 and HMDB51.
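As an illustration of the channelwise selection idea, the following PyTorch-style sketch aggregates same-sized pyramid levels with per-channel attention over levels; the scoring layer and its size are assumptions, not FSNet's exact module.

    import torch
    import torch.nn as nn

    class ChannelSelection(nn.Module):
        """Per-channel attention over pyramid levels: for every channel, decide
        how much each level contributes to the aggregated feature."""
        def __init__(self, channels):
            super().__init__()
            self.score = nn.Linear(channels, channels)  # hypothetical scoring layer

        def forward(self, feats):
            # feats: list of L tensors, each of shape (B, C, H, W), same size.
            stacked = torch.stack(feats, dim=1)              # (B, L, C, H, W)
            pooled = stacked.mean(dim=(3, 4))                # (B, L, C) global pooling
            weights = self.score(pooled).softmax(dim=1)      # attention over levels, per channel
            return (stacked * weights[..., None, None]).sum(dim=1)  # (B, C, H, W)

    # Usage: fuse three resized pyramid levels with 256 channels each.
    # fused = ChannelSelection(256)([f3, f4, f5])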

3.
IEEE Trans Neural Netw Learn Syst; 32(1): 334-347, 2021 Jan.
Article in English | MEDLINE | ID: mdl-32224465

ABSTRACT

Convolutional neural networks (CNNs) have proven to be an effective way to learn spatiotemporal representations for action recognition in videos. However, most traditional action recognition algorithms do not employ an attention mechanism to focus on the parts of video frames that are relevant to the action. In this article, we propose a novel global and local knowledge-aware attention network to address this challenge. The proposed network incorporates two types of attention mechanism, statistic-based attention (SA) and learning-based attention (LA), to attach higher importance to the crucial elements in each video frame. Because global pooling (GP) models capture global information while attention models focus on significant details, the network adopts a three-stream architecture, with two attention streams and a GP stream, to make full use of their implicit complementary advantages. Each attention stream employs a fusion layer to combine global and local information and produce composite features. Furthermore, global-attention (GA) regularization is proposed to guide the two attention streams to better model the dynamics of the composite features with reference to the global information. Fusion at the softmax layer is adopted to exploit the complementary advantages of the SA, LA, and GP streams and obtain the final predictions. The proposed network is trained end to end and learns efficient video-level features both spatially and temporally. Extensive experiments on three challenging benchmarks, Kinetics, HMDB51, and UCF101, demonstrate that the proposed network outperforms most state-of-the-art methods.
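A minimal sketch of learning-based attention pooling over per-frame features, followed by late fusion of stream probabilities at the softmax layer. The layer dimensions and the averaging fusion rule are illustrative assumptions rather than the paper's exact SA/LA/GP design.

    import torch
    import torch.nn as nn

    class AttentionPooling(nn.Module):
        """Learning-based attention over per-frame features: frames judged more
        relevant to the action receive larger weights (hypothetical layer size)."""
        def __init__(self, feat_dim):
            super().__init__()
            self.score = nn.Linear(feat_dim, 1)

        def forward(self, x):
            # x: (B, T, D) features for T frames.
            weights = self.score(x).softmax(dim=1)   # (B, T, 1) frame importance
            return (weights * x).sum(dim=1)          # (B, D) attended video feature

    def fuse_at_softmax(logits_list):
        # Late fusion: average the class probabilities produced by the
        # attention streams and the global-pooling stream.
        probs = [logits.softmax(dim=-1) for logits in logits_list]
        return torch.stack(probs).mean(dim=0)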


Subjects
Movement; Neural Networks, Computer; Pattern Recognition, Automated/methods; Algorithms; Attention; Benchmarking; Computer Systems; Databases, Factual; Humans; Image Processing, Computer-Assisted; Knowledge; Machine Learning; Reproducibility of Results; Video Recording
4.
IEEE Trans Image Process; 27(4): 1748-1762, 2018 Apr.
Article in English | MEDLINE | ID: mdl-29346092

ABSTRACT

In this paper, we present a novel two-layer video representation for human action recognition that employs a hierarchical group sparse encoding technique and spatio-temporal structure. In the first layer, a new sparse encoding method named locally consistent group sparse coding (LCGSC) is proposed to make full use of the motion and appearance information of local features. LCGSC not only encodes the global layouts of features within the same video-level groups but also captures the local correlations between them, yielding expressive sparse representations of video sequences. Meanwhile, two efficient location estimation models, an absolute location model and a relative location model, are developed to incorporate spatio-temporal structure into the LCGSC representations. In the second layer, an action-level group is established, and a hierarchical LCGSC encoding scheme is applied to describe videos at different levels of abstraction. On the one hand, the new layer captures higher-order dependencies between video sequences; on the other hand, it takes label information into consideration to improve the discrimination of the video representations. The advantages of our hierarchical framework are demonstrated on several challenging datasets.
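For readers unfamiliar with group-sparse encoding, the sketch below encodes a feature vector over a dictionary with a generic group-lasso penalty solved by proximal gradient descent. It conveys only the flavour of group-sparse coding; LCGSC's locality-consistency term, location models, and label-aware second layer are not modelled here.

    import numpy as np

    def group_soft_threshold(z, groups, thresh):
        # Proximal operator of the group-lasso penalty: shrink each group
        # of coefficients toward zero by its l2 norm.
        out = z.copy()
        for g in groups:
            norm = np.linalg.norm(z[g])
            out[g] = 0.0 if norm <= thresh else z[g] * (1.0 - thresh / norm)
        return out

    def group_sparse_code(D, x, groups, lam=0.1, iters=200):
        # Encode x over dictionary D (columns are atoms) with a group-sparse
        # penalty, via proximal gradient descent on 0.5 * ||x - D a||^2.
        a = np.zeros(D.shape[1])
        step = 1.0 / np.linalg.norm(D, 2) ** 2     # 1 / Lipschitz constant of the gradient
        for _ in range(iters):
            grad = D.T @ (D @ a - x)
            a = group_soft_threshold(a - step * grad, groups, lam * step)
        return a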

5.
IEEE Trans Image Process; 23(7): 3152-3165, 2014 Jul.
Article in English | MEDLINE | ID: mdl-24983106

ABSTRACT

In this paper, we propose a novel approach called class-specific maximization of mutual information (CSMMI) using a submodular method, which aims to learn a compact and discriminative dictionary for each class. Unlike traditional dictionary-based algorithms, which typically learn a shared dictionary for all classes, we unify the intraclass and interclass mutual information (MI) into a single objective function to optimize the class-specific dictionaries. The objective function has two aims: 1) maximizing the MI between dictionary items within a specific class (intrinsic structure) and 2) minimizing the MI between the dictionary items of a given class and those of the other classes (extrinsic structure). We significantly reduce the computational complexity of CSMMI by introducing a novel submodular method, which is one of the important contributions of this paper. This paper also contributes a state-of-the-art end-to-end system for action and gesture recognition incorporating CSMMI, comprising feature extraction, learning an initial per-class dictionary by sparse coding, CSMMI via submodularity, and classification based on reconstruction errors. We performed extensive experiments on synthetic data and eight benchmark datasets. Our experimental results show that CSMMI outperforms shared-dictionary methods and that our end-to-end system is competitive with other state-of-the-art approaches.
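The greedy submodular selection flavour can be illustrated with a facility-location routine: each selected atom should cover the class's own samples well (intrinsic structure) while being dissimilar to the other classes (extrinsic structure). The similarity inputs below are placeholders for the paper's mutual-information estimates, and the objective is a stand-in, not CSMMI itself.

    import numpy as np

    def greedy_select_atoms(sim_to_class, inter_sim, k, alpha=1.0):
        # Greedy maximisation of a submodular facility-location score minus a
        # modular interclass penalty. `sim_to_class` is (n_atoms, n_samples):
        # similarity of each candidate atom to the class's own samples;
        # `inter_sim` is (n_atoms,): similarity to the other classes.
        n_atoms, n_samples = sim_to_class.shape
        coverage = np.zeros(n_samples)            # best coverage so far per sample
        selected, remaining = [], set(range(n_atoms))
        for _ in range(k):
            best, best_gain = None, -np.inf
            for j in remaining:
                gain = np.maximum(coverage, sim_to_class[j]).sum() - coverage.sum()
                gain -= alpha * inter_sim[j]      # penalise extrinsic similarity
                if gain > best_gain:
                    best, best_gain = j, gain
            selected.append(best)
            remaining.remove(best)
            coverage = np.maximum(coverage, sim_to_class[best])
        return selected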


Subjects
Algorithms; Artificial Intelligence; Gestures; Image Processing, Computer-Assisted/methods; Pattern Recognition, Automated/methods; Databases, Factual; Humans; Movement; Sports
6.
IEEE Trans Syst Man Cybern B Cybern; 41(1): 38-52, 2011 Feb.
Article in English | MEDLINE | ID: mdl-20403788

ABSTRACT

In this paper, a novel graph-preserving sparse nonnegative matrix factorization (GSNMF) algorithm is proposed for facial expression recognition. The GSNMF algorithm is derived from the original NMF algorithm by exploiting both sparseness and graph-preserving properties; the latter may encode the class information of the samples, so GSNMF can be used as either an unsupervised or a supervised dimension reduction method. A sparse representation of the facial images is obtained by minimizing the l1-norm of the basis images. Furthermore, following graph embedding theory, the neighborhood of the samples is preserved by retaining the graph structure in the mapped space. The GSNMF decomposition thus transforms the high-dimensional facial expression images into a locality-preserving subspace with sparse representation. To guarantee convergence, the projected gradient method is used to compute the nonnegative solution of GSNMF. Experiments are conducted on the JAFFE and Cohn-Kanade databases with unoccluded and partially occluded facial images. The results show that the GSNMF algorithm provides better facial representations and achieves higher recognition rates than nonnegative matrix factorization, and that it is more robust to partial occlusions than the other tested methods.
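A projected-gradient sketch of a sparse, graph-regularized NMF objective is given below. The placement of the penalties, the fixed step size, and the update order are assumptions and differ in detail from the GSNMF updates in the paper.

    import numpy as np

    def gsnmf_sketch(X, L, rank, lam=0.1, mu=0.1, step=1e-3, iters=500, seed=0):
        # Minimise ||X - W H||_F^2 + mu * ||W||_1 + lam * tr(H L H^T)
        # subject to W, H >= 0, by projected gradient descent.
        # X: (features, samples); L: (samples, samples) graph Laplacian.
        rng = np.random.default_rng(seed)
        W = rng.random((X.shape[0], rank))
        H = rng.random((rank, X.shape[1]))
        for _ in range(iters):
            R = W @ H - X                           # reconstruction residual
            grad_W = R @ H.T + mu                   # l1 term contributes +mu on W >= 0
            grad_H = W.T @ R + lam * (H @ L)        # graph term (constant factors folded into lam/step)
            W = np.maximum(W - step * grad_W, 0.0)  # projection onto the nonnegative orthant
            H = np.maximum(H - step * grad_H, 0.0)
        return W, H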


Subjects
Algorithms; Biometric Identification/methods; Facial Expression; Image Processing, Computer-Assisted/methods; Databases, Factual; Female; Humans; Male; Multivariate Analysis
7.
Forensic Sci Int; 179(1): 54-62, 2008 Jul 18.
Article in English | MEDLINE | ID: mdl-18541396

ABSTRACT

Watermarking has been an active research field over the past ten years, with applications in copyright management, content authentication, and related areas. For authentication watermarking, tamper localization and detection accuracy are two important performance measures; however, most methods in the literature cannot achieve precise localization, and few researchers have addressed the problem of detection accuracy. In this paper, a pinpoint authentication watermarking scheme is proposed based on a chaotic system, which is sensitive to its initial value. The approach can not only exactly localize malicious manipulations but also reveal block substitutions when the Holliman-Memon (vector quantization, VQ) attack occurs. An image is partitioned into non-overlapping regions according to the required precision. In each region, a chaotic model is iterated to produce chaotic sequences from initial values determined by combining the prominent luminance values of the pixels, the position information, and an image key. An authentication watermark is then constructed from the binary chaotic sequences and embedded in the embedding space. At the receiver, a detector extracts the watermark and localizes the tampered regions without access to the host image or the original watermark. The precision of spatial localization can reach one pixel, which is valuable for images examined under close scrutiny, such as medical and military images. The detection accuracy rate is defined and analyzed to quantify the probability that the detector makes correct decisions. Experimental results demonstrate the effectiveness and advantages of our algorithm.
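The per-region watermark generation described above can be illustrated with a logistic map. The specific chaotic map and the way the luminance, position, and key are mixed into the initial value are assumptions; the paper only specifies that a chaotic system sensitive to its initial value is iterated per region.

    import numpy as np

    def chaotic_block_bits(block, position, key, length=64):
        # Derive an initial value in (0, 1) from the block's prominent
        # luminance, its position, and the image key, then iterate a
        # logistic map and binarise the sequence into watermark bits.
        luminance = int(np.mean(block))
        seed = ((31 * luminance + 17 * position[0] + 13 * position[1] + key) % 9973) / 9973.0
        x = min(max(seed, 1e-6), 1.0 - 1e-6)       # keep the seed strictly inside (0, 1)
        bits = np.empty(length, dtype=np.uint8)
        for i in range(length):
            x = 3.99 * x * (1.0 - x)               # logistic map in its chaotic regime
            bits[i] = 1 if x >= 0.5 else 0
        return bits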
