Results 1 - 3 of 3
1.
Article in English | MEDLINE | ID: mdl-39172610

ABSTRACT

Multimodal summarization (MS) for videos aims to generate summaries from multi-source information (e.g., video and text transcript), and this technique has made promising progress recently. However, existing works are limited to monolingual video scenarios, overlooking the need of non-native viewers to understand cross-lingual videos in practical applications. This motivates us to introduce multimodal cross-lingual summarization for videos (MCLS), which generates cross-lingual summaries from the multimodal input of videos. Given the high annotation cost and resource constraints of MCLS, we propose a knowledge distillation (KD) induced triple-stage training method that assists MCLS by transferring knowledge from abundant monolingual MS data to the low-resource MCLS task. In the triple-stage training method, a video-guided dual fusion network (VDF) is designed as the backbone to integrate multimodal and cross-lingual information through different fusion strategies in the encoder and decoder. In addition, we propose two cross-lingual knowledge distillation strategies: adaptive pooling distillation and language-adaptive warping distillation (LAWD). These strategies are tailored to the distillation objects (i.e., encoder-level and vocab-level KD) to enable effective knowledge transfer across cross-lingual sequences of varying lengths between MS and MCLS models. Specifically, to tackle the unequal lengths of parallel cross-lingual sequences in KD, the proposed LAWD conducts cross-lingual distillation directly while keeping the language feature shape unchanged, reducing potential information loss. We meticulously annotated the How2-MCLS dataset, built on the How2 dataset, to simulate the MCLS scenario. Experimental results show that the proposed method achieves competitive performance against strong baselines, and brings substantial performance improvements to MCLS models by transferring knowledge from the MS model.
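The encoder-level alignment problem described above (teacher and student sequences of unequal length) can be sketched minimally: pool the teacher's feature sequence down to the student's length, then penalize the distance between the aligned features. The function names and the plain nested-list feature representation below are illustrative assumptions, not the paper's implementation.

```python
def adaptive_avg_pool(features, target_len):
    """Average-pool a sequence of feature vectors down to target_len steps."""
    src_len = len(features)
    dim = len(features[0])
    pooled = []
    for i in range(target_len):
        # Map output step i to a contiguous chunk of input steps.
        start = (i * src_len) // target_len
        end = max(((i + 1) * src_len) // target_len, start + 1)
        chunk = features[start:end]
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled

def mse_distill_loss(teacher_feats, student_feats):
    """MSE between teacher features pooled to the student's sequence length."""
    aligned = adaptive_avg_pool(teacher_feats, len(student_feats))
    n = len(student_feats) * len(student_feats[0])
    return sum((t[d] - s[d]) ** 2
               for t, s in zip(aligned, student_feats)
               for d in range(len(s))) / n
```

In practice both sequences would be batched tensors and the pooling differentiable, but the length-matching idea is the same: the student never has to see teacher features at a mismatched resolution.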

2.
Article in English | MEDLINE | ID: mdl-37027272

ABSTRACT

Natural language moment localization aims to localize the target moment that matches a given natural language query in an untrimmed video. The key to this challenging task is capturing fine-grained video-language correlations to establish the alignment between the query and the target moment. Most existing works adopt a single-pass interaction schema to capture correlations between queries and moments. Given the complex feature space of lengthy videos and the diverse information across frames, the weight distribution of the information interaction flow is prone to dispersion or misalignment, and the resulting redundant information flow affects the final prediction. We address this issue with a capsule-based approach to modeling query-video interactions, termed the Multimodal, Multichannel, and Dual-step Capsule Network (M²DCapsN), derived from the intuition that "multiple people viewing multiple times is better than one person viewing one time." First, we introduce a multimodal capsule network, replacing the single-pass interaction schema of "one person viewing one time" with the iterative interaction schema of "one person viewing multiple times," which cyclically updates cross-modal interactions and suppresses potentially redundant interactions via routing-by-agreement. Then, since the conventional routing mechanism learns only a single iterative interaction schema, we further propose a multichannel dynamic routing mechanism to learn multiple iterative interaction schemas, where each channel performs independent routing iterations to collectively capture cross-modal correlations from multiple subspaces, that is, "multiple people viewing." Moreover, we design a dual-step capsule network structure on top of the multimodal, multichannel capsule network, bringing together the query and query-guided key moments to jointly enhance the original video, so as to select the target moments from the enhanced parts.
Experimental results on three public datasets demonstrate the superiority of our approach in comparison with state-of-the-art methods, and comprehensive ablation and visualization analysis validate the effectiveness of each component of the proposed model.
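The routing-by-agreement mechanism the abstract relies on can be sketched for a single output capsule: coupling coefficients start uniform and are iteratively sharpened toward inputs that agree with the current output, which is how redundant (disagreeing) interactions get down-weighted. This is a generic dynamic-routing sketch in pure Python, not the paper's multichannel variant.

```python
import math

def squash(v):
    """Capsule nonlinearity: preserves direction, bounds the norm to (0, 1)."""
    norm2 = sum(x * x for x in v)
    if norm2 == 0.0:
        return [0.0] * len(v)
    scale = norm2 / (1.0 + norm2) / math.sqrt(norm2)
    return [scale * x for x in v]

def routing_by_agreement(inputs, iters=3):
    """Route input capsule vectors to one output capsule.

    Returns the output vector and the final coupling coefficients;
    inputs that agree with the consensus receive larger coefficients.
    """
    logits = [0.0] * len(inputs)
    for _ in range(iters):
        exps = [math.exp(b) for b in logits]
        total = sum(exps)
        c = [e / total for e in exps]                  # softmax couplings
        s = [sum(ci * u[d] for ci, u in zip(c, inputs))
             for d in range(len(inputs[0]))]           # weighted sum of inputs
        v = squash(s)
        logits = [b + sum(u[d] * v[d] for d in range(len(v)))
                  for b, u in zip(logits, inputs)]     # agreement update
    return v, c
```

With two agreeing inputs and one pointing the opposite way, the two agreeing capsules end up with larger coupling coefficients after a few iterations, which is the "modifying potential redundant interactions" behavior in miniature.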

3.
Nat Commun; 14(1): 1444, 2023 Mar 15.
Article in English | MEDLINE | ID: mdl-36922495

ABSTRACT

With the advancement of global civilisation, monitoring and managing dumpsites have become essential parts of environmental governance in many countries. Dumpsite locations are difficult for local government agencies and environmental groups to obtain in a timely manner. According to the World Bank, governments must spend substantial labour and money to locate illegal dumpsites before they can be managed. Here we show that applying novel deep convolutional networks to high-resolution satellite images provides an effective, efficient, and low-cost method to detect dumpsites. In sampled areas of 28 cities around the world, our model detects nearly 1000 dumpsites that appeared around 2021. This approach reduces the investigation time by more than 96.8% compared with the manual method. With this methodology, it is now possible to analyse the relationship between dumpsites and various social attributes on a global scale, temporally and spatially.
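Satellite-based detection of this kind typically reduces to tiling a large image and scoring each tile with a classifier. A minimal sketch of that sliding-window step follows; the `classify` callable stands in for the paper's convolutional network, whose architecture the abstract does not specify.

```python
def detect_dumpsites(image, tile_size, classify, threshold=0.5):
    """Tile a 2-D image and return (row, col) offsets of tiles whose
    dumpsite score from `classify` exceeds `threshold`.

    `image` is a nested list of pixel values; `classify` maps a tile
    (nested list) to a score in [0, 1].  Both are illustrative stand-ins.
    """
    h, w = len(image), len(image[0])
    hits = []
    for r in range(0, h - tile_size + 1, tile_size):
        for c in range(0, w - tile_size + 1, tile_size):
            tile = [row[c:c + tile_size] for row in image[r:r + tile_size]]
            if classify(tile) > threshold:
                hits.append((r, c))
    return hits
```

A production pipeline would use overlapping windows and batched GPU inference, but the cost structure is the same: one cheap classifier call per tile replaces manual inspection of the full scene, which is where the reported time savings come from.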
