Visual Commonsense-Aware Representation Network for Video Captioning.
Article in English | MEDLINE | ID: mdl-38127607
ABSTRACT
Generating consecutive descriptions for videos, that is, video captioning, requires taking full advantage of visual representations throughout the generation process. Existing video captioning methods focus on exploring spatial-temporal representations and their relationships to make inferences. However, such methods only exploit the superficial associations contained in a video itself, without considering the intrinsic visual commonsense knowledge that exists across a video dataset, which may hinder their capacity to reason about and generate accurate descriptions. To address this problem, we propose a simple yet effective method, called the visual commonsense-aware representation network (VCRN), for video captioning. Specifically, we construct a Video Dictionary, a plug-and-play component, obtained by clustering all video features from the entire dataset into multiple cluster centers without additional annotation. Each center implicitly represents a visual commonsense concept in the video domain, which our proposed visual concept selection (VCS) component uses to obtain a video-related concept feature. Next, a concept-integrated generation (CIG) component is proposed to enhance caption generation. Extensive experiments on three public video captioning benchmarks, MSVD, MSR-VTT, and VATEX, demonstrate that our method achieves state-of-the-art performance, indicating its effectiveness. In addition, integrating our method into an existing video question answering (VideoQA) method improves that method's performance, further demonstrating the generalization capability of our approach. The source code has been released at https://github.com/zchoi/VCRN.
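The dictionary-construction and concept-selection steps described in the abstract can be sketched as follows. This is a minimal illustration only, assuming pooled per-video features and plain k-means clustering; the function names (build_video_dictionary, select_concepts) and parameters (num_concepts, top_k) are hypothetical and not taken from the paper, whose actual implementation is at the repository linked above.

import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def build_video_dictionary(video_features: torch.Tensor,
                           num_concepts: int = 512) -> torch.Tensor:
    # Cluster all (N, D) pooled video features from the dataset into
    # num_concepts centers; each center stands in for an implicit
    # visual-commonsense concept. No annotation is required.
    kmeans = KMeans(n_clusters=num_concepts, n_init=10, random_state=0)
    kmeans.fit(video_features.cpu().numpy())
    return torch.from_numpy(kmeans.cluster_centers_).float()  # (K, D)

def select_concepts(query: torch.Tensor,
                    dictionary: torch.Tensor,
                    top_k: int = 8) -> torch.Tensor:
    # A plausible form of visual concept selection: retrieve the top_k
    # centers most similar to the query video feature and fuse them by
    # softmax-normalized similarity into one video-related concept feature.
    sims = F.cosine_similarity(query.unsqueeze(0), dictionary)  # (K,)
    weights, idx = sims.topk(top_k)
    weights = F.softmax(weights, dim=0)
    return (weights.unsqueeze(1) * dictionary[idx]).sum(dim=0)  # (D,)

# Toy usage: 1000 videos with 256-d features, then one query video.
feats = torch.randn(1000, 256)
dictionary = build_video_dictionary(feats, num_concepts=64)
concept_feature = select_concepts(feats[0], dictionary, top_k=8)

In a full captioning model, the fused concept feature would then be injected into the decoder (the role the paper assigns to its concept-integrated generation component); how that fusion is done is not specified in the abstract.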

Full text: 1 Collection: 01-international Database: MEDLINE Language: English Journal: IEEE Trans Neural Netw Learn Syst Year: 2023 Document type: Article Country of publication: United States