Article in English | MEDLINE | ID: mdl-38889039

ABSTRACT

Spatio-Temporal Video Grounding (STVG) aims to localize the spatio-temporal tube of a specific object in an untrimmed video, given a free-form natural language query. Because annotating tubes is labor-intensive, recent works have explored weakly supervised approaches, which usually incur significant performance degradation. To obtain a less expensive STVG method with acceptable accuracy, this work investigates the "single-frame supervision" paradigm, which requires only a single frame labeled with a bounding box within the temporal boundary of the fully supervised counterpart as the supervisory signal. Based on the characteristics of the STVG problem, we propose a Two-Stage Multiple Instance Learning (T-SMILE) method, which creates pseudo labels by expanding the annotated frame to its contextual frames, thereby establishing a fully supervised problem that facilitates further model training. The innovations of the proposed method are threefold: 1) using multiple instance learning to dynamically select instances in positive bags for recognizing starting and ending timestamps; 2) learning highly discriminative query features by incorporating spatial prior constraints into cross-attention; and 3) designing a curriculum learning-based strategy that iteratively assigns dynamic weights to the spatial and temporal branches, thereby gradually adapting to the branch that is harder to learn. To facilitate future research on this task, we also contribute a large-scale benchmark containing 12,469 videos of complex scenes with single-frame annotations. Extensive experiments on two benchmarks demonstrate that T-SMILE significantly outperforms all weakly supervised methods. Remarkably, it also outperforms some fully supervised methods that require far higher annotation labor costs. The dataset and code are available at https://github.com/qumengxue/T-SMILE.
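The core idea of expanding a single annotated frame to its contextual frames can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository above for that); the per-frame feature vectors, the cosine-similarity criterion, and the `threshold` value are all hypothetical stand-ins for whatever expansion rule T-SMILE actually uses:

```python
import math


def expand_single_frame(features, annotated_idx, threshold=0.8):
    """Grow a contiguous pseudo-labeled temporal window around the single
    annotated frame, by absorbing neighboring frames whose features are
    similar enough to the annotated frame's features.

    features: list of per-frame feature vectors (lists of floats)
    annotated_idx: index of the one labeled frame
    threshold: cosine-similarity cutoff (hypothetical value)
    Returns (start, end), inclusive frame indices of the pseudo label.
    """

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    anchor = features[annotated_idx]

    # Expand leftward while the previous frame still resembles the anchor.
    start = annotated_idx
    while start > 0 and cosine(features[start - 1], anchor) >= threshold:
        start -= 1

    # Expand rightward symmetrically.
    end = annotated_idx
    while end < len(features) - 1 and cosine(features[end + 1], anchor) >= threshold:
        end += 1

    return start, end


# Toy usage: frames 2-4 share appearance with the annotated frame 3,
# so the pseudo temporal label covers exactly that window.
frames = [[0.0, 1.0], [0.2, 1.0], [1.0, 0.1], [1.0, 0.0], [1.0, 0.05], [0.1, 1.0]]
print(expand_single_frame(frames, annotated_idx=3, threshold=0.9))  # → (2, 4)
```

The resulting window plays the role of a full temporal annotation, which is what lets the rest of the pipeline be trained as if it were fully supervised.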
