Results 1 - 4 of 4
1.
Neural Netw ; 143: 171-182, 2021 Nov.
Article in English | MEDLINE | ID: mdl-34157642

ABSTRACT

In this paper, we propose a visual embedding approach that improves embedding-aware speech enhancement (EASE) by synchronizing visual lip frames at the phone and place-of-articulation levels. We first extract visual embeddings from lip frames using a pre-trained phone or articulation-place recognizer for visual-only EASE (VEASE). Next, we extract audio-visual embeddings from noisy speech and lip frames in an information-intersection manner, exploiting the complementarity of audio and visual features for multi-modal EASE (MEASE). Experiments on the TCD-TIMIT corpus corrupted by simulated additive noises show that our proposed subword-based VEASE approach is more effective than conventional word-level embedding. Moreover, visual embedding at the articulation-place level, which leverages the high correlation between place of articulation and lip shapes, performs even better than embedding at the phone level. Finally, the experiments establish that the proposed MEASE framework, incorporating both audio and visual embeddings, yields significantly better speech quality and intelligibility than the best visual-only and audio-only EASE systems.


Subject(s)
Speech Perception, Speech, Lip, Noise
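The fusion step described in the abstract can be illustrated with a minimal numpy sketch. Everything here is an illustrative assumption: the dimensions, the concatenation-plus-projection fusion, and the sigmoid time-frequency mask are stand-ins for the paper's learned networks, not its actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not from the paper): 256-dim audio embedding,
# 128-dim visual embedding per frame, fused to 256 dims; 257 frequency bins.
T, DA, DV, DF, F = 100, 256, 128, 256, 257

audio_emb = rng.normal(size=(T, DA))    # stand-in for an audio encoder output
visual_emb = rng.normal(size=(T, DV))   # stand-in for a lip-frame recognizer output
W = rng.normal(size=(DA + DV, DF)) / np.sqrt(DA + DV)
V = rng.normal(size=(DF, F)) / np.sqrt(DF)

# Fuse by concatenation followed by a linear projection (a stand-in
# for the learned multi-modal fusion network).
fused = np.concatenate([audio_emb, visual_emb], axis=-1) @ W

# Embedding-aware enhancement: predict a sigmoid time-frequency mask from
# the fused embedding and apply it to the noisy magnitude spectrogram.
noisy_spec = np.abs(rng.normal(size=(T, F)))
mask = 1.0 / (1.0 + np.exp(-(fused @ V)))
enhanced = noisy_spec * mask
```

The sketch only shows the data flow: per-frame audio and visual embeddings become one joint embedding, which conditions the mask applied to the noisy spectrogram.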
2.
IEEE Trans Cybern ; 46(2): 535-45, 2016 Feb.
Article in English | MEDLINE | ID: mdl-25794407

ABSTRACT

Quickly and accurately categorizing the millions of aerial images on Google Maps is a useful capability in pattern recognition. Existing methods cannot handle this task successfully for two reasons: 1) the topology of an aerial image is the key feature for distinguishing its category, but topology cannot be effectively encoded by a conventional visual codebook; and 2) building a real-time image categorization system is challenging, as some geo-aware apps update over 20 aerial images per second. To solve these problems, we propose an efficient aerial image categorization algorithm that learns a discriminative topological codebook of aerial images under a multitask learning framework. The pipeline can be summarized as follows. We first construct a region adjacency graph (RAG) that describes the topology of each aerial image; aerial image categorization can then be formulated naturally as RAG-to-RAG matching. Following graph theory, RAG-to-RAG matching is conducted by enumeratively comparing all their respective graphlets (i.e., small subgraphs). To reduce the high computational cost of this enumeration, we learn a codebook of topologies that are jointly discriminative for multiple categories. The learned topological codebook guides the extraction of the discriminative graphlets. Finally, these graphlets are integrated into an AdaBoost model for predicting aerial image categories. Experimental results show that our approach is competitive with several existing recognition models. Furthermore, it processes over 24 aerial images per second, demonstrating that it is ready for real-world applications.
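The RAG construction and graphlet enumeration that the pipeline rests on can be sketched in pure Python. The toy graph, region labels, and brute-force enumeration below are illustrative assumptions; the paper's learned codebook, which prunes this enumeration, is not modeled here.

```python
from itertools import combinations

# Toy region adjacency graph (RAG): vertices are image regions, edges link
# spatially adjacent regions. The labels stand in for region descriptors.
rag = {
    "sky":  {"roof", "tree"},
    "roof": {"sky", "wall", "tree"},
    "wall": {"roof", "road"},
    "road": {"wall", "tree"},
    "tree": {"sky", "roof", "road"},
}

def connected_graphlets(graph, k):
    """Enumerate the k-node connected subgraphs (graphlets) of a small graph."""
    nodes = sorted(graph)
    out = []
    for subset in combinations(nodes, k):
        sub = set(subset)
        # Induced edges, each recorded once in lexicographic order.
        edges = {(u, v) for u in sub for v in graph[u] & sub if u < v}
        if not edges:
            continue
        # Connectivity check via depth-first search over the induced edges.
        adj = {n: set() for n in sub}
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        seen, stack = set(), [subset[0]]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(adj[n] - seen)
        if seen == sub:
            out.append((subset, frozenset(edges)))
    return out

g3 = connected_graphlets(rag, 3)
print(len(g3))   # 7 connected 3-node graphlets in this toy RAG
```

Even on this 5-region toy graph, the enumeration walks all C(5, 3) = 10 node subsets; on real RAGs the count explodes, which is the cost the learned codebook is meant to avoid.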

3.
IEEE Trans Cybern ; 45(11): 2461-71, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26470061

ABSTRACT

With the continued development of digital capture technologies and social media services, a vast number of media documents are captured and shared online to help attendees record their experiences during events. In this paper, we present a method that combines semantic inference and multimodal analysis to automatically find media content illustrating events, using an adaptive probabilistic hypergraph model. In this model, media items are taken as vertices of a weighted hypergraph, and the task of enriching media to illustrate events is formulated as a ranking problem. Each hyperedge is constructed from the K-nearest neighbors of a given media document. We also employ a probabilistic representation that assigns each vertex to a hyperedge in a probabilistic way, further exploiting the correlation among media data. Furthermore, we optimize the hypergraph weights in a regularization framework, which is solved as a second-order cone problem. The approach is initialized with seed media and then ranks the media documents through a transductive inference process. Results obtained by validating the approach on an event dataset collected from EventMedia demonstrate its effectiveness.
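The kNN hyperedge construction with probabilistic vertex membership and the transductive score propagation can be sketched with numpy. All values here are toy assumptions: random features, Gaussian-kernel membership weights, and a damped propagation stand in for the paper's learned hypergraph weights and regularized ranking.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy media items as feature vectors; each hyperedge groups an item with
# its K nearest neighbours, with soft (probabilistic) vertex membership.
N, K = 30, 4
X = rng.normal(size=(N, 8))

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
H = np.zeros((N, N))                                  # incidence: vertex x hyperedge
for e in range(N):
    nn = np.argsort(d2[e])[:K + 1]                    # item e plus its K neighbours
    w = np.exp(-d2[e, nn] / max(d2[e, nn].mean(), 1e-9))
    H[nn, e] = w / w.sum()                            # probabilistic membership

# Simple transductive ranking: propagate scores from a seed item over the
# vertex-vertex affinity induced by the hypergraph incidence.
A = H @ H.T
A /= A.sum(axis=1, keepdims=True)
seed = np.zeros(N)
seed[0] = 1.0                                         # the seed media item
f = seed.copy()
for _ in range(50):
    f = 0.85 * A @ f + 0.15 * seed                    # damped propagation

ranking = np.argsort(-f)                              # high score = more relevant
```

Items that share (soft) hyperedge membership with the seed tend to receive higher scores, which is the qualitative behavior the ranking formulation relies on.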

4.
IEEE Trans Image Process ; 24(12): 5609-18, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26285152

ABSTRACT

The goal of light field capture is to acquire light fields with high angular resolution and high spatial resolution at low cost. Light field capture equipment is usually designed by combining or modifying traditional optical cameras; most such designs must trade off angular resolution against spatial resolution, whereas adding a coded aperture avoids this trade-off by multiplexing information from different views. Building on the coded aperture, this paper proposes an improved light field camera model with two compressive measurements and a single mask. The two measurements are realized, respectively, by a coded aperture and by a random-convolution CMOS imager, the latter serving as the camera's imaging sensor. The single-mask design permits high light efficiency, which gives the sampled images high clarity; the double-measurement design retains more correlation information, which aids reconstruction of the light field. Higher clarity and stronger correlation in the samples yield a higher-quality reconstructed light field, and thus higher resolution for a given PSNR requirement on the reconstruction. Experimental results verify the advantages of the proposed design: compared with representative mask-based light field camera models, the proposed model achieves the highest reconstruction quality and higher light efficiency.
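The double-measurement idea can be simulated with a small numpy sketch: a toy light field is first multiplexed through a binary aperture code, then passed through a random circular convolution standing in for the random-convolution CMOS imager. The sizes, codes, and kernels are illustrative assumptions, and no reconstruction step is modeled.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy light field: V angular views of an H x W scene.
V, H, W = 9, 16, 16
light_field = rng.random((V, H, W))

# Measurement 1: a single coded aperture multiplexes the views -- each view
# is weighted by a binary aperture code and the views sum on the sensor path.
code = rng.integers(0, 2, size=V).astype(float)
multiplexed = (code[:, None, None] * light_field).sum(axis=0)

# Measurement 2: a random-convolution imager -- circular convolution of the
# multiplexed image with a random kernel, computed here via the 2-D FFT.
kernel = rng.normal(size=(H, W))
sensor = np.real(np.fft.ifft2(np.fft.fft2(multiplexed) * np.fft.fft2(kernel)))

print(multiplexed.shape, sensor.shape)   # (16, 16) (16, 16)
```

Both measurements are linear in the light field, which is what makes a compressive-sensing-style reconstruction of the views possible in principle.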
