Results 1 - 20 of 79
1.
Article in English | MEDLINE | ID: mdl-38848233

ABSTRACT

Temporal answer grounding in instructional video (TAGV) is a new task naturally derived from temporal sentence grounding in general video (TSGV). Given an untrimmed instructional video and a text question, the task aims to locate the frame span in the video that semantically answers the question, i.e., the visual answer. Existing methods tend to solve the TAGV problem with a visual span-based predictor, using visual information to predict the start and end frames in the video. However, due to the weak correlation between the semantic features of the textual question and the visual answer, current methods using a visual span-based predictor do not work well on the TAGV task. In this paper, we propose a visual-prompt text span localization (VPTSL) method, which introduces timestamped subtitles for a text span-based predictor. Specifically, the visual prompt is a learnable feature embedding that brings visual knowledge to the pre-trained language model. Meanwhile, the text span-based predictor learns joint semantic representations from the input text question, video subtitles, and visual prompt feature with the pre-trained language model. TAGV is thus reformulated as visual-prompt subtitle span localization for the visual answer. Extensive experiments on five instructional video datasets, namely MedVidQA, TutorialVQA, VehicleVQA, CrossTalk, and Coin, show that the proposed method outperforms several state-of-the-art (SOTA) methods by a large margin in terms of mIoU score, demonstrating the effectiveness of the proposed visual prompt and text span-based predictor. In addition, all experimental code and datasets are open-sourced at https://github.com/wengsyx/VPTSL.
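For context, the mIoU score used here is the mean temporal IoU between predicted and ground-truth frame spans. A minimal sketch of the standard definition (not code from the VPTSL repository):

```python
def span_iou(pred, gt):
    """Temporal IoU between two frame spans given as (start, end) pairs."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """mIoU over paired lists of predicted and ground-truth spans."""
    return sum(span_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```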

2.
Article in English | MEDLINE | ID: mdl-38717882

ABSTRACT

Recently, low-rank tensor regularization has received increasing attention in hyperspectral and multispectral fusion (HMF). However, existing methods often suffer from an inflexible low-rank tensor definition and are highly sensitive to the permutation of tensor modes, which hinders their performance. To tackle this problem, we propose a novel generalized tensor nuclear norm (GTNN)-based approach for HMF. First, we define a novel GTNN by extending the existing third-mode-based tensor nuclear norm (TNN) to an arbitrary mode: the Fourier transform is conducted on an arbitrary single mode, and the TNN is then computed for each mode. In this way, we not only capture more extensive correlations across the three modes of a tensor but also avoid the adverse effect of the permutation of tensor modes. To utilize the correlations among spectral bands, the high-resolution hyperspectral image (HSI) is approximated as the product of a low-rank spectral basis and coefficients, and we estimate the spectral basis by conducting singular-value decomposition (SVD) on the HSI. The coefficients are then estimated by solving the proposed GTNN-regularized optimization problem. Specifically, to exploit the non-local similarities of the HSI, we first cluster the patches of the coefficients into 3-D tensors containing spatial, spectral, and non-local modes. Since these collected tensors exhibit the strong non-local spatial-spectral similarities of the HSI, the proposed low-rank tensor regularization is imposed on them, fully modeling the non-local self-similarities. Fusion experiments on both simulated and real datasets prove the advantages of this approach. The code is available at https://github.com/renweidian/GTNN.
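The mode-k construction can be sketched as: transform one mode with the FFT, then sum the nuclear norms of the slices across that mode. The following is an illustrative NumPy version of the standard construction, not the authors' implementation (normalization conventions vary across the literature):

```python
import numpy as np

def mode_k_tnn(x, k):
    """TNN with the Fourier transform taken along mode k: FFT that mode,
    then sum the singular values of every slice across the transformed mode."""
    xf = np.moveaxis(np.fft.fft(x, axis=k), k, 0)
    return sum(np.linalg.svd(s, compute_uv=False).sum() for s in xf)

def gtnn(x):
    """Average the mode-k norms over all three modes, which removes the
    sensitivity to how the tensor modes are permuted."""
    return sum(mode_k_tnn(x, k) for k in range(3)) / 3.0
```

Averaging over all three modes is what makes the quantity invariant to mode permutation, since a permuted tensor just reorders which mode-k term is which.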

3.
BMC Chem ; 18(1): 91, 2024 May 09.
Article in English | MEDLINE | ID: mdl-38724989

ABSTRACT

To improve the thermal and combustion properties of nanothermites, a design theory of changing the state of matter and the structural state of the reactants during the reaction was proposed. The Al/MoO3/KClO4 (Kp) nanothermite was prepared, with the Al/MoO3 nanothermite used as a control. SEM and XRD were used to characterize the nanothermites; DSC was used to test thermal properties; and constant-volume and open combustion tests were performed to examine their combustion performance. Phase and morphology characterization of the combustion products was performed to reveal the mechanism of the aluminothermic reaction. The results show that the Al/MoO3/Kp nanothermite exhibited excellent thermal properties, with a total heat release of 1976 J·g⁻¹, an increase of approximately 33% over the 1486 J·g⁻¹ of the Al/MoO3 nanothermite, and an activation energy of 269.66 kJ·mol⁻¹, demonstrating higher stability than the Al/MoO3 nanothermite (205.64 kJ·mol⁻¹). During the combustion test, the peak pressure of the Al/MoO3/Kp nanothermite was 0.751 MPa and the average pressure rise rate was 25.03 MPa·s⁻¹, much higher than the 0.188 MPa and 6.27 MPa·s⁻¹ of the Al/MoO3 nanothermite. The combustion products of the Al/MoO3 nanothermite were Al2O3, MoO, and Mo, indicating insufficient combustion and an incomplete reaction, whereas the combustion products of the Al/MoO3/Kp nanothermite were Al2O3, MoO, and KCl, indicating a complete reaction. Their "coral-like" morphology was the effect of the reactants solidifying after melting during the combustion process. The characterization of the reactants and the pressure test during combustion reveal the three stages of the aluminothermic reaction in thermites. The excellent thermal and combustion performance of the Al/MoO3/Kp nanothermite is attributed to the melting and decomposition of Kp into O2 in the third stage. This study provides new ideas and guidance for the design of high-performance nanothermites.
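The reported ~33% heat-release gain follows directly from the two measured values; a quick arithmetic check:

```python
heat_kp = 1976.0    # J/g, Al/MoO3/Kp total heat release
heat_base = 1486.0  # J/g, Al/MoO3 control
gain = (heat_kp - heat_base) / heat_base
print(f"heat-release gain: {gain:.1%}")  # prints: heat-release gain: 33.0%
```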

4.
Article in English | MEDLINE | ID: mdl-38466604

ABSTRACT

Spectral super-resolution has attracted increasing attention from researchers as a simpler and cheaper way to obtain hyperspectral images (HSIs). Although many convolutional neural network (CNN)-based approaches have yielded impressive results, most of them ignore the low-rank prior of HSIs, resulting in huge computational and storage costs. In addition, the ability of CNN-based methods to capture correlations in global information is limited by the receptive field. To surmount these problems, we design a novel low-rank tensor reconstruction network (LTRN) for spectral super-resolution. Specifically, we treat the features of HSIs as 3-D tensors with low-rank properties due to their spectral similarity and spatial sparsity. Then, we combine canonical-polyadic (CP) decomposition with neural networks to design an adaptive low-rank prior learning (ALPL) module that enables feature learning in a 1-D space. This module has two core components: the adaptive vector learning (AVL) module and the multidimensionwise multihead self-attention (MMSA) module. The AVL module compresses an HSI into a 1-D space by using a vector to represent its information. The MMSA module improves the ability to capture long-range dependencies in the row, column, and spectral dimensions, respectively. Finally, our LTRN, mainly a cascade of several ALPL modules and feedforward networks (FFNs), achieves high-quality spectral super-resolution with fewer parameters. To evaluate our method, we conduct experiments on two datasets: the CAVE dataset and the Harvard dataset. Experimental results show that our LTRN is not only as effective as state-of-the-art methods but also has fewer parameters. The code is available at https://github.com/renweidian/LTRN.
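CP decomposition, which the ALPL module builds on, expresses a tensor as a sum of rank-1 outer products of vectors. A minimal illustration of reconstructing a 3-D tensor from CP factor matrices (this is the textbook operation, not the paper's learned module):

```python
import numpy as np

def cp_reconstruct(a, b, c, weights=None):
    """Rebuild a 3-D tensor from CP factor matrices of shapes (I, R), (J, R),
    (K, R): a weighted sum of R rank-1 outer products."""
    if weights is None:
        weights = np.ones(a.shape[1])
    return np.einsum('r,ir,jr,kr->ijk', weights, a, b, c)
```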

5.
Article in English | MEDLINE | ID: mdl-38546989

ABSTRACT

Interactive image segmentation (IIS) has emerged as a promising technique for decreasing annotation time. Substantial progress has been made in pre- and post-processing for IIS, but the critical issue of interaction ambiguity, which notably hinders segmentation quality, has been under-researched. To address this, we introduce AdaptiveClick, a click-aware transformer incorporating an adaptive focal loss (AFL) that tackles annotation inconsistencies with tools for mask- and pixel-level ambiguity resolution. To the best of our knowledge, AdaptiveClick is the first transformer-based, mask-adaptive segmentation framework for IIS. The key ingredient of our method is the click-aware mask-adaptive transformer decoder (CAMD), which enhances the interaction between click and image features. Additionally, AdaptiveClick enables pixel-adaptive differentiation of hard and easy samples in the decision space, independent of their varying distributions. This is primarily achieved by optimizing a generalized AFL with a theoretical guarantee, where two adaptive coefficients control the ratio of gradient values for hard and easy pixels. Our analysis reveals that the commonly used Focal and BCE losses can be considered special cases of the proposed AFL. With a plain ViT backbone, extensive experimental results on nine datasets demonstrate the superiority of AdaptiveClick compared to state-of-the-art methods. The source code is publicly available at https://github.com/lab206/AdaptiveClick.
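The special-case relationship described above is of the same kind as the classic one between the focal loss and BCE. A sketch of the standard binary focal loss, showing the gamma = 0 reduction to BCE (this is the classic loss, not the paper's AFL):

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss for one pixel (p: predicted probability, y in {0, 1}).
    gamma = 0 recovers plain binary cross-entropy; larger gamma down-weights
    easy, well-classified pixels so training focuses on hard ones."""
    pt = p if y == 1 else 1.0 - p
    return -((1.0 - pt) ** gamma) * math.log(pt)
```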

6.
IEEE Trans Image Process ; 33: 177-190, 2024.
Article in English | MEDLINE | ID: mdl-38055358

ABSTRACT

Interactive image segmentation (IIS) has been widely used in various fields, such as medicine and industry. However, some core issues, such as pixel imbalance, remain unresolved. Different from existing methods based on pre-processing or post-processing, we analyze the cause of pixel imbalance in depth from the two perspectives of pixel number and pixel difficulty. Based on this, a novel and unified Click-pixel Cognition Fusion network with Balanced Cut (CCF-BC) is proposed in this paper. On the one hand, the Click-pixel Cognition Fusion (CCF) module, inspired by the human cognition mechanism, is designed to increase the number of click-related pixels (namely, positive pixels) that are correctly segmented, where the click and visual information are fully fused by using a progressive three-tier interaction strategy. On the other hand, a general loss, the Balanced Normalized Focal Loss (BNFL), is proposed. Its core is to use a group of control coefficients related to sample gradients, forcing the network to pay more attention to positive and hard-to-segment pixels during training. As a result, BNFL tends to obtain a balanced cut of positive and negative samples in the decision space. Theoretical analysis shows that the commonly used Focal and BCE losses can be regarded as special cases of BNFL. Experimental results on five well-recognized datasets show the superiority of the proposed CCF-BC method compared to other state-of-the-art methods. The source code is publicly available at https://github.com/lab206/CCF-BC.

7.
Article in English | MEDLINE | ID: mdl-38090866

ABSTRACT

Real-time semantic segmentation plays an important role in autonomous vehicles. However, most real-time segmentation methods fail to obtain satisfactory performance on small objects, such as cars and sign symbols, since large objects usually contribute more to the segmentation result. To solve this issue, we propose an efficient and effective architecture, termed the small objects segmentation network (SOSNet), to improve segmentation performance on small objects. SOSNet works from two perspectives: methodology and data. For the former, we propose a dual-branch hierarchical decoder (DBHD), which serves as a small-object-sensitive segmentation head. The DBHD consists of a top segmentation head that predicts whether a pixel belongs to a small-object class and a bottom one that estimates the pixel class. In this way, the latent correlation among small objects can be fully explored. For the latter, we propose a small object example mining (SOEM) algorithm that automatically balances examples between small and large objects. The core idea of SOEM is that most of the hard examples of small-object classes are reserved for training while most of the easy examples of large-object classes are banned. Experiments on three commonly used datasets show that the proposed SOSNet architecture greatly improves accuracy over existing real-time semantic segmentation methods while remaining efficient. The code will be available at https://github.com/StuLiu/SOSNet.
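The mining idea can be sketched as a per-pixel training mask that keeps small-object pixels and only the hardest large-object pixels. This is a toy illustration of the strategy, not the paper's SOEM algorithm (the class list and keep fraction are assumptions):

```python
import numpy as np

def mine_examples(pixel_loss, labels, small_classes, keep_hard_frac=0.5):
    """Toy mining mask: keep every pixel of a small-object class, but only the
    hardest fraction (ranked by per-pixel loss) of large-class pixels."""
    small = np.isin(labels, small_classes)
    mask = small.copy()
    large_losses = pixel_loss[~small]
    if large_losses.size:
        thresh = np.quantile(large_losses, 1.0 - keep_hard_frac)
        mask |= (~small) & (pixel_loss >= thresh)
    return mask
```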

8.
Article in English | MEDLINE | ID: mdl-37819817

ABSTRACT

Camouflaged object detection (COD) aims to identify object pixels visually embedded in the background environment. Existing deep learning methods fail to utilize the context information around different pixels adequately and efficiently. In order to solve this problem, a novel pixel-centric context perception network (PCPNet) is proposed, the core of which is to customize the personalized context of each pixel based on the automatic estimation of its surroundings. Specifically, PCPNet first employs an elegant encoder equipped with the designed vital component generation (VCG) module to obtain a set of compact features rich in low-level spatial and high-level semantic information across multiple subspaces. Then, we present a parameter-free pixel importance estimation (PIE) function based on multiwindow information fusion. Object pixels with complex backgrounds will be assigned with higher PIE values. Subsequently, PIE is utilized to regularize the optimization loss. In this way, the network can pay more attention to those pixels with higher PIE values in the decoding stage. Finally, a local continuity refinement module (LCRM) is used to refine the detection results. Extensive experiments on four COD benchmarks, five salient object detection (SOD) benchmarks, and five polyp segmentation benchmarks demonstrate the superiority of PCPNet with respect to other state-of-the-art methods.
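A parameter-free importance score of the kind described can be illustrated by averaging local variance over several window sizes, so pixels in more complex neighborhoods score higher. This toy version conveys the multi-window idea only; it is an assumption, not the paper's PIE function:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def pixel_importance(img, windows=(3, 5, 7)):
    """Toy multi-window importance: average the local variance over several
    odd window sizes; flat regions score 0, textured regions score higher."""
    img = img.astype(float)
    maps = []
    for w in windows:
        var = sliding_window_view(img, (w, w)).var(axis=(-2, -1))
        pad = (w - 1) // 2
        maps.append(np.pad(var, pad, mode='edge'))  # restore (H, W) shape
    return np.mean(maps, axis=0)
```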

9.
Article in English | MEDLINE | ID: mdl-37819819

ABSTRACT

In recent years, deep-learning-based pixel-level unified image fusion methods have received increasing attention due to their practicality and robustness. However, they usually require a complex network to achieve effective fusion, leading to high computational cost. To achieve more efficient and accurate image fusion, a lightweight pixel-level unified image fusion (L-PUIF) network is proposed. Specifically, an information refinement and measurement process is used to extract gradient and intensity information and enhance the feature extraction capability of the network. In addition, this information is converted into weights that guide the loss function adaptively. Thus, more effective image fusion can be achieved while keeping the network lightweight. Extensive experiments have been conducted on four public image fusion datasets across multimodal fusion, multifocus fusion, and multiexposure fusion. Experimental results show that L-PUIF achieves better fusion efficiency and greater visual quality than state-of-the-art methods. In addition, the practicability of L-PUIF in high-level computer vision tasks, i.e., object detection and image segmentation, has been verified.

10.
Article in English | MEDLINE | ID: mdl-37738195

ABSTRACT

To obtain a high-resolution hyperspectral image (HR-HSI), fusing a low-resolution hyperspectral image (LR-HSI) and a high-resolution multispectral image (HR-MSI) is a prominent approach. Numerous approaches based on convolutional neural networks (CNNs) have been presented for hyperspectral image (HSI) and multispectral image (MSI) fusion. Nevertheless, these CNN-based methods may ignore globally relevant features of the input image due to the geometric limitations of convolutional kernels. To obtain more accurate fusion results, we provide a spatial-spectral transformer-based U-net (SSTF-Unet). Our SSTF-Unet can capture associations between distant features and explore the intrinsic information of images. More specifically, we use the spatial transformer block (SATB) and the spectral transformer block (SETB) to calculate the spatial and spectral self-attention, respectively. SATB and SETB are then connected in parallel to form the spatial-spectral fusion block (SSFB). Inspired by the U-net architecture, we build our SSTF-Unet by stacking several SSFBs for multiscale spatial-spectral feature fusion. Experimental results on public HSI datasets demonstrate that the designed SSTF-Unet achieves better performance than other existing HSI and MSI fusion approaches.
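The difference between spatial and spectral self-attention comes down to how the HSI cube is tokenized. A minimal NumPy sketch with no learned projections (the real SATB/SETB blocks are trained transformer layers; this only shows the two tokenizations):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention over rows, with no learned projections."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

# An HSI feature cube (H, W, B) can be tokenized two ways:
cube = np.random.rand(4, 4, 8)
spatial_out = self_attention(cube.reshape(-1, 8))     # 16 pixel tokens of length B
spectral_out = self_attention(cube.reshape(-1, 8).T)  # 8 band tokens of length HW
```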

11.
Natl Sci Rev ; 10(6): nwad130, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37347038

ABSTRACT

This paper reports the background and results of the Surface Defect Detection Competition with Bio-inspired Vision Sensor, and summarizes the champion solutions, current challenges, and future directions.

12.
Article in English | MEDLINE | ID: mdl-37279125

ABSTRACT

Visible-infrared object detection aims to improve detector performance by fusing the complementarity of visible and infrared images. However, most existing methods use only local intramodality information to enhance the feature representation while ignoring the efficient latent interaction of long-range dependence between different modalities, which leads to unsatisfactory detection performance in complex scenes. To solve these problems, we propose a feature-enhanced long-range attention fusion network (LRAF-Net), which improves detection performance by fusing the long-range dependence of the enhanced visible and infrared features. First, a two-stream CSPDarknet53 network is used to extract deep features from visible and infrared images, in which a novel data augmentation (DA) method is designed to reduce the bias toward a single modality through asymmetric complementary masks. Then, we propose a cross-feature enhancement (CFE) module to improve the intramodality feature representation by exploiting the discrepancy between visible and infrared images. Next, we propose a long-range dependence fusion (LDF) module to fuse the enhanced features by associating the positional encoding of multimodality features. Finally, the fused features are fed into a detection head to obtain the final detection results. Experiments on several public datasets, i.e., VEDAI, FLIR, and LLVIP, show that the proposed method obtains state-of-the-art performance compared with other methods.

13.
Chem Res Chin Univ ; 39(3): 408-414, 2023.
Article in English | MEDLINE | ID: mdl-37303471

ABSTRACT

Improving the technical performance of related industrial products is an efficient strategy for reducing the application quantities and environmental burden of toxic chemicals. A novel polyfluoroalkyl surfactant, potassium 1,1,2,2,3,3,4,4-octafluoro-4-(perfluorobutoxy)butane-1-sulfonate (F404), was synthesized by a commercializable route. It had a surface tension (γ) of 18.2 mN/m at the critical micelle concentration (CMC, 1.04 g/L), significantly lower than that of perfluorooctane sulfonate (PFOS, ca. 33.0 mN/m, 0.72 g/L), and exhibited remarkable suppression of chromium fog at a dose half that of PFOS. The half-maximal inhibitory concentration (IC50) values in HepG2 cells and the 50% lethal concentration (LC50) in zebrafish embryos after 72 hpf indicated a lower toxicity for F404 than for PFOS. In a UV/sulphite system, 89.3% of F404 was decomposed after 3 h, representing a defluorination efficiency of 43%. Cleavage of the ether C-O bond during decomposition would be expected to form a short-chain ·C4F9 radical, as the ether C-O in the F404 fluorocarbon chain sits at the C4-O5 position. The ether unit is introduced into the perfluoroalkyl chain to improve water solubility, biocompatibility, and degradation, thereby minimizing the environmental burden. Electronic Supplementary Material: Supplementary material is available in the online version of this article at 10.1007/s40242-023-3030-4.

14.
IEEE Trans Pattern Anal Mach Intell ; 45(10): 12650-12666, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37235456

ABSTRACT

Fusing hyperspectral images (HSIs) with multispectral images (MSIs) of higher spatial resolution has become an effective way to sharpen HSIs. Recently, deep convolutional neural networks (CNNs) have achieved promising fusion performance. However, these methods often suffer from a lack of training data and limited generalization ability. To address these problems, we present a zero-shot learning (ZSL) method for HSI sharpening. Specifically, we first propose a novel method to quantitatively estimate the spectral and spatial responses of imaging sensors with high accuracy. In the training procedure, we spatially subsample the MSI and HSI based on the estimated spatial response and use the downsampled HSI and MSI to infer the original HSI. In this way, we not only exploit the inherent information in the HSI and MSI, but also ensure that the trained CNN generalizes well to the test data. In addition, we apply dimensionality reduction to the HSI, which reduces the model size and storage usage without sacrificing fusion accuracy. Furthermore, we design an imaging-model-based loss function for the CNN, which further boosts the fusion performance. The experimental results show the significantly high efficiency and accuracy of our approach.
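The training-pair construction can be sketched as blurring each band with the estimated spatial response and then subsampling, so the network learns to invert the degradation it will face at test time. An illustrative version (the kernel and stride here are placeholders, not the estimated sensor responses):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def degrade(img, kernel, stride):
    """Blur each band with the spatial kernel ('valid' correlation), then
    subsample by the stride. img: (H, W, B); kernel: (k, k)."""
    win = sliding_window_view(img, kernel.shape, axis=(0, 1))  # (H', W', B, k, k)
    blurred = np.einsum('hwbij,ij->hwb', win, kernel)
    return blurred[::stride, ::stride, :]
```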

15.
IEEE Trans Pattern Anal Mach Intell ; 45(7): 7939-7954, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37015605

ABSTRACT

Recently, fusing a low-resolution hyperspectral image (LR-HSI) with a high-resolution multispectral image (HR-MSI) from a different satellite has become an effective way to improve the resolution of an HSI. However, because of different imaging satellites, different illumination, and adjacent imaging times, the LR-HSI and HR-MSI may not satisfy the observation models established by existing works, and the two images are difficult to register. To solve these problems, we establish new observation models for LR-HSIs and HR-MSIs from different satellites, and then propose a deep-learning-based framework for the key steps in multi-satellite HSI fusion, including image registration, blur kernel learning, and image fusion. Specifically, we first construct a convolutional neural network (CNN), called RegNet, to produce pixel-wise offsets between the LR-HSI and HR-MSI, which are utilized to register the LR-HSI. Next, according to the new observation models, a tiny network, called BKLNet, is built to learn the spectral and spatial blur kernels, where BKLNet and RegNet can be trained jointly. In the fusion part, we further train a FusNet by downsampling the registered data with the learned spatial blur kernel. Extensive experiments demonstrate the superiority of the proposed framework in HSI registration and fusion accuracy.

16.
IEEE Trans Image Process ; 32: 2267-2278, 2023.
Article in English | MEDLINE | ID: mdl-37067971

ABSTRACT

Camouflaged object detection (COD) aims to discover objects that blend in with the background due to similar colors, textures, etc. Existing deep learning methods do not systematically illustrate the key tasks in COD, which seriously hinders improvement of their performance. In this paper, we introduce the concept of focus areas, regions containing discernable colors or textures, and develop a two-stage focus scanning network for camouflaged object detection. Specifically, a novel encoder-decoder module is first designed to determine a region where the focus areas may appear. In this process, a multi-layer Swin transformer is deployed to encode global context information between the object and the background, and a novel cross-connection decoder is proposed to fuse cross-layer textures and semantics. Then, we utilize multi-scale dilated convolution to obtain discriminative features at different scales in focus areas. Meanwhile, a dynamic difficulty-aware loss is designed to guide the network to pay more attention to structural details. Extensive experimental results on the benchmarks, including CAMO, CHAMELEON, COD10K, and NC4K, illustrate that the proposed method performs favorably against other state-of-the-art methods.

17.
Article in English | MEDLINE | ID: mdl-37018602

ABSTRACT

Hyperspectral image (HSI) classification methods have made great progress in recent years. However, most of these methods are rooted in the closed-set assumption that the class distribution in the training and testing stages is consistent, which cannot handle the unknown class in open-world scenes. In this work, we propose a feature consistency-based prototype network (FCPN) for open-set HSI classification, which is composed of three steps. First, a three-layer convolutional network is designed to extract the discriminative features, where a contrastive clustering module is introduced to enhance the discrimination. Then, the extracted features are used to construct a scalable prototype set. Finally, a prototype-guided open-set module (POSM) is proposed to identify the known samples and unknown samples. Extensive experiments reveal that our method achieves remarkable classification performance over other state-of-the-art classification techniques.
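Prototype-guided open-set classification of the kind described reduces, at its core, to a nearest-prototype rule with a rejection threshold. A minimal sketch (the Euclidean distance and fixed threshold are assumptions; the paper's POSM is learned):

```python
import numpy as np

def classify_open_set(feature, prototypes, threshold):
    """Nearest-prototype rule with rejection: return the class index of the
    closest prototype, or -1 ('unknown') if that distance exceeds the
    threshold. prototypes: (C, D) array, one row per known class."""
    dists = np.linalg.norm(prototypes - feature, axis=1)
    c = int(np.argmin(dists))
    return c if dists[c] <= threshold else -1
```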

18.
Front Immunol ; 14: 1132129, 2023.
Article in English | MEDLINE | ID: mdl-36845130

ABSTRACT

Objective: Mucosal immunization is an effective defense against pathogens. Nasal vaccines can activate both systemic and mucosal immunity to trigger protective immune responses. However, due to the weak immunogenicity of nasal vaccines and the lack of appropriate antigen carriers, very few nasal vaccines have been clinically approved for human use, which is a major barrier to their development. Plant-derived adjuvants are promising candidates for vaccine delivery systems due to their relatively safe immunogenic properties. In particular, the distinctive structure of pollen is beneficial to the stability and retention of antigen in the nasal mucosa. Methods: Herein, a novel wild-type chrysanthemum sporopollenin vaccine delivery system loaded with a w/o/w emulsion containing squalane and protein antigen was fabricated. The unique internal cavities and the rigid external walls of the sporopollenin skeleton preserve and stabilize the inner proteins, and the external morphological characteristics are suitable for nasal mucosal administration, with high adhesion and retention. Results: Secretory IgA antibodies in the nasal mucosa can be induced by the w/o/w emulsion with the chrysanthemum sporopollenin vaccine delivery system. Moreover, the nasal adjuvant produces a stronger humoral response (IgA and IgG) than a squalene emulsion adjuvant. The mucosal adjuvant benefits primarily from prolonged antigen residence in the nasal cavity, improved antigen penetration in the submucosa, and promotion of CD8+ T cells in the spleen. Discussion: By effectively delivering both the adjuvant and the antigen, increasing protein antigen stability, and achieving mucosal retention, the chrysanthemum sporopollenin vaccine delivery system has the potential to be a promising adjuvant platform. This work provides a novel idea for the fabrication of protein mucosal delivery vaccines.


Subjects
Immunity, Mucosal; Vaccines; Humans; Emulsions/pharmacology; Nasal Mucosa; Adjuvants, Immunologic/pharmacology; Antigens
19.
IEEE Trans Neural Netw Learn Syst ; 34(3): 1613-1626, 2023 Mar.
Article in English | MEDLINE | ID: mdl-34432641

ABSTRACT

Graphs are essential to improve the performance of graph-based machine learning methods, such as spectral clustering. Various well-designed methods have been proposed to learn graphs that depict specific properties of real-world data. Joint learning of knowledge in different graphs is an effective means to uncover the intrinsic structure of samples. However, the existing methods fail to simultaneously mine the global and local information related to sample structure and distribution when multiple graphs are available, and further research is needed. Hence, we propose a novel intrinsic graph learning (IGL) with discrete constrained diffusion-fusion to solve the above problem in this article. In detail, given a set of the predefined graphs, IGL first obtains the graph encoding the global high-order manifold structure via the diffusion-fusion mechanism based on the tensor product graph. Then, two discrete operators are integrated to fine-prune the obtained graph. One of them limits the maximum number of neighbors connected to each sample, thereby removing redundant and erroneous edges. The other one forces the rank of the Laplacian matrix of the obtained graph to be equal to the number of sample clusters, which guarantees that samples from the same subgraph belong to the same cluster and vice versa. Moreover, a new strategy of weight learning is designed to accurately quantify the contribution of pairwise predefined graphs in the optimization process. Extensive experiments on six single-view and two multiview datasets have demonstrated that our proposed method outperforms the previous state-of-the-art methods on the clustering task.
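Diffusion-fusion over predefined graphs can be illustrated by averaging the graphs, row-normalizing, and iterating a diffusion update. This toy recursion conveys the flavor of tensor-product-graph diffusion but is not the paper's exact scheme (the update rule, uniform graph weighting, and fixed iteration count are assumptions):

```python
import numpy as np

def diffuse(affinities, iters=20):
    """Toy diffusion-fusion: average the predefined graphs (stacked as
    (G, N, N)), row-normalize, then run a fixed number of updates
    A <- S A S^T + I, spreading similarity along shared neighborhoods."""
    s = np.mean(affinities, axis=0)
    s = s / s.sum(axis=1, keepdims=True)
    a = s.copy()
    for _ in range(iters):
        a = s @ a @ s.T + np.eye(s.shape[0])
    return a
```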

20.
IEEE Trans Neural Netw Learn Syst ; 34(12): 10028-10038, 2023 Dec.
Article in English | MEDLINE | ID: mdl-35412992

ABSTRACT

Automatic speech recognition (ASR) is the major human-machine interface in many intelligent systems, such as intelligent homes, autonomous driving, and servant robots. However, its performance usually deteriorates significantly in the presence of external noise, limiting its application scenarios. Audio-visual speech recognition (AVSR) takes visual information as a complementary modality to effectively enhance the performance of audio speech recognition, particularly in noisy conditions. Recently, transformer-based architectures have been used to model the audio and video sequences for AVSR, achieving superior performance. However, performance may degrade in these architectures because irrelevant information is extracted when modeling long-term dependencies. In addition, motion features are essential for capturing the spatio-temporal information within the lip region to best utilize visual sequences, but they have not been considered in AVSR tasks. Therefore, we propose a multimodal sparse transformer network (MMST) in this article. The sparse self-attention mechanism improves the concentration of attention on global information by wisely selecting the most relevant parts. Moreover, motion features are seamlessly introduced into the MMST model. We subtly allow motion-modality information to flow into the visual modality through a cross-modal attention module to enhance visual features, thereby further improving recognition performance. Extensive experiments conducted on different datasets validate that our proposed method outperforms several state-of-the-art methods in terms of word error rate (WER).
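Sparse self-attention of the kind described keeps, for each query, only the highest-scoring keys before the softmax. A top-k NumPy sketch (the paper's selection rule may differ; top-k is one common choice):

```python
import numpy as np

def sparse_attention(q, k, v, top_k=2):
    """Top-k sparse self-attention: each query attends only to its top_k
    highest-scoring keys; all other scores are masked before the softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    kth = np.sort(scores, axis=-1)[:, -top_k][:, None]  # per-row top_k-th score
    masked = np.where(scores >= kth, scores, -np.inf)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    attn = e / e.sum(axis=-1, keepdims=True)
    return attn @ v, attn
```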


Subjects
Speech Perception; Humans; Neural Networks, Computer; Recognition, Psychology; Speech