1.
Article in English | MEDLINE | ID: mdl-37436863

ABSTRACT

The framework of visually guided sound source separation generally consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing. An ongoing trend in this field has been to tailor the visual feature extractor for informative visual guidance and to devise a separate module for feature fusion, while using U-Net by default for sound analysis. However, such a divide-and-conquer paradigm is parameter-inefficient and may yield suboptimal performance, as jointly optimizing and harmonizing various model components is challenging. By contrast, this article presents a novel approach, dubbed audio-visual predictive coding (AVPC), that tackles this task in a more parameter-efficient and effective manner. The AVPC network combines a simple ResNet-based video analysis network, which derives semantic visual features, with a predictive coding (PC)-based sound separation network that extracts audio features, fuses multimodal information, and predicts sound separation masks within a single architecture. By iteratively minimizing the prediction error between features, AVPC integrates audio and visual information recursively, leading to progressively improved performance. In addition, we develop a valid self-supervised learning strategy for AVPC by co-predicting two audio-visual representations of the same sound source. Extensive evaluations demonstrate that AVPC outperforms several baselines in separating musical instrument sounds while significantly reducing model size. Code is available at: https://github.com/zjsong/Audio-Visual-Predictive-Coding.
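The core mechanism described above, refining a representation by iteratively reducing the prediction error between modalities, can be illustrated with a toy sketch. Everything here (the function name, the identity prediction, the step size) is a hypothetical simplification for illustration, not the AVPC implementation:

```python
import numpy as np

def avpc_fusion_sketch(audio_feat, visual_feat, steps=5, lr=0.5):
    """Toy predictive-coding-style fusion: the audio representation is
    corrected iteratively so that its (here, identity) prediction of the
    visual features matches them better at each step."""
    rep = audio_feat.astype(float).copy()
    errors = []
    for _ in range(steps):
        pred_error = visual_feat - rep   # prediction error between modalities
        rep = rep + lr * pred_error      # gradient-style correction of the representation
        errors.append(float(np.abs(pred_error).mean()))
    return rep, errors
```

The point of the sketch is the monotonically shrinking prediction error: each pass pulls the fused representation toward agreement across modalities, mirroring the "progressively improved performance" the abstract attributes to recursion.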

2.
IEEE Trans Cybern ; 52(4): 2491-2504, 2022 Apr.
Article in English | MEDLINE | ID: mdl-32667884

ABSTRACT

This article addresses two crucial problems in learning disentangled image representations: controlling the degree of disentanglement during image editing, and balancing disentanglement strength against reconstruction quality. To encourage disentanglement, we devise a distance covariance-based decorrelation regularization. For the reconstruction step, our model leverages a soft target representation combined with the latent image code. By exploring the real-valued space of the soft target representation, we can synthesize novel images with designated properties. To improve the perceptual quality of images generated by autoencoder (AE)-based models, we extend the encoder-decoder architecture with a generative adversarial network (GAN) by collapsing the AE decoder and the GAN generator into one network. We also design a classification-based protocol to quantitatively evaluate the disentanglement strength of our model. The experimental results showcase the benefits of the proposed model.
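The decorrelation regularizer above is built on distance covariance, which, unlike ordinary covariance, is zero only when two variables are statistically independent. A minimal sketch of the sample statistic (double-centered pairwise distance matrices, then an inner product) is below; this is the standard estimator, not the authors' training code, and the function name is ours:

```python
import numpy as np

def distance_covariance(x, y):
    """Sample distance covariance between two batches of features.
    Penalizing this quantity between latent factors encourages
    statistical independence, not merely zero linear correlation.
    x, y: (n, d) arrays holding n paired samples."""
    def centered_dist(z):
        # Pairwise Euclidean distance matrix, double-centered.
        d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()
    a, b = centered_dist(x), centered_dist(y)
    return np.sqrt(max(float((a * b).mean()), 0.0))
```

A constant feature has zero pairwise distances everywhere, so its distance covariance with anything is zero, while a feature paired with itself yields a strictly positive value; a regularizer would add this quantity (between latent blocks) to the training loss.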

3.
IEEE Trans Image Process ; 30: 868-877, 2021.
Article in English | MEDLINE | ID: mdl-33237859

ABSTRACT

3D object recognition is one of the most important tasks in 3D data processing and has been studied extensively in recent years. Researchers have proposed various deep learning-based 3D recognition methods, among which view-based approaches form a typical class. However, the view pooling layer commonly used in view-based methods to fuse multi-view features causes a loss of visual information. To alleviate this problem, we construct a novel layer, called the Dynamic Routing Layer (DRL), by modifying the dynamic routing algorithm of the capsule network to fuse the features of each view more effectively. Concretely, DRL uses rearrangement and affine transformation to convert features, and then leverages the modified dynamic routing algorithm to adaptively select among the converted features, instead of discarding all but the most active feature as the view pooling layer does. We also show that the view pooling layer is a special case of our DRL. Based on DRL, we further present a Dynamic Routing Convolutional Neural Network (DRCNN) for multi-view 3D object recognition. Experiments on three 3D benchmark datasets show that the proposed DRCNN outperforms many state-of-the-art methods, demonstrating the efficacy of our approach.
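The contrast the abstract draws, routing-by-agreement versus winner-take-all view pooling, can be sketched with a bare-bones routing loop over per-view feature vectors. This is a generic capsule-style routing illustration under our own simplifications (no rearrangement or affine transformation step, plain dot-product agreement), not the DRL as published:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route_views(view_feats, iters=3):
    """Routing-by-agreement over per-view features of shape (V, D).
    Instead of keeping only the most active view (view pooling), coupling
    coefficients adapt so each view contributes in proportion to its
    agreement with the fused feature."""
    logits = np.zeros(view_feats.shape[0])
    fused, c = None, None
    for _ in range(iters):
        c = softmax(logits)                       # coupling coefficients over views
        fused = (c[:, None] * view_feats).sum(0)  # weighted combination of views
        logits = logits + view_feats @ fused      # agreement raises a view's logit
    return fused, c
```

Note how view pooling falls out as a limiting case: if the coupling were forced to a one-hot vector on the most active view, the weighted sum would reduce to selecting a single view's feature, which matches the abstract's claim that view pooling is a special case of DRL.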

4.
IEEE Trans Neural Netw Learn Syst ; 30(4): 1150-1165, 2019 04.
Article in English | MEDLINE | ID: mdl-30892198

ABSTRACT

As a biomimetic model of visual information processing, predictive coding (PC) has become increasingly popular for explaining a range of neural responses and many aspects of brain organization. While the development of PC models is encouraging in the neurobiology community, their practical applications in machine learning (e.g., image classification) have not yet been fully explored. In this paper, a novel image processing model called fast inference PC (FIPC) is presented for image representation and classification. Compared with the basic PC model, the proposed FIPC model adds a regression procedure and a classification layer. The regression procedure learns regression mappings that enable fast inference at test time, while the classification layer guides the model to extract more discriminative features. In addition, effective learning and fine-tuning algorithms are developed for the proposed model. Experimental results on four image benchmark datasets show that our model can infer representations directly and quickly while simultaneously achieving lower error rates on image classification tasks.
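The fast-inference idea above amortizes a slow, iterative inference step into a learned direct mapping. A minimal sketch under our own assumptions (linear regression fit by least squares; function names are hypothetical) shows the shape of the trick, not the FIPC algorithm itself:

```python
import numpy as np

def fit_fast_inference(inputs, codes):
    """Given training inputs (n, d_in) and codes (n, d_code) produced by
    slow iterative inference, fit a direct linear mapping so that
    test-time inference becomes a single matrix multiply."""
    w, *_ = np.linalg.lstsq(inputs, codes, rcond=None)
    return w  # (d_in, d_code)

def fast_infer(w, x):
    """One-shot inference: no per-sample optimization loop."""
    return x @ w
```

If the slow inference really is (approximately) a fixed mapping of the input, the regression recovers it and test-time cost drops from many iterations to one matrix product, which is the speedup the abstract claims.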
