Search | VHL Regional Portal

PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling.

Baradel, Fabien; Bregier, Romain; Groueix, Thibault; Weinzaepfel, Philippe; Kalantidis, Yannis; Rogez, Gregory.

IEEE Trans Pattern Anal Mach Intell; 45(11): 12798-12815, 2023 Nov.

Article in English | MEDLINE | ID: mdl-37015699

ABSTRACT

Training state-of-the-art models for human pose estimation in videos requires datasets with annotations that are hard and expensive to obtain. Although transformers have recently been used for body pose sequence modeling, related methods rely on pseudo-ground truth to augment the limited training data currently available for learning such models. In this paper, we introduce PoseBERT, a transformer module that is fully trained on 3D Motion Capture (MoCap) data via masked modeling. It is simple, generic and versatile, as it can be plugged on top of any image-based model to turn it into a video-based model that leverages temporal information. We showcase variants of PoseBERT with different inputs, ranging from 3D skeleton keypoints to rotations of a 3D parametric model for either the full body (SMPL) or just the hands (MANO). Since PoseBERT training is task agnostic, the model can be applied to several tasks such as pose refinement, future pose prediction or motion completion without finetuning. Our experimental results validate that adding PoseBERT on top of various state-of-the-art pose estimation methods consistently improves their performance, while its low computational cost allows us to use it in a real-time demo for smoothly animating a robotic hand via a webcam. Test code and models are available at https://github.com/naver/posebert.
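The masked-modeling idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the PoseBERT code from the repository above: the function names, the 25% mask ratio, the zero mask value and the plain-list pose representation are all assumptions chosen for illustration. The sketch only shows the data side of masked modeling, that is, hiding some frames of a pose sequence and scoring a reconstruction on the hidden frames alone.

```python
import random


def mask_pose_sequence(poses, mask_ratio=0.25, mask_value=0.0, seed=None):
    """Hide a random subset of frames in a pose sequence.

    poses: list of frames, each frame a list of floats (hypothetical layout).
    Returns (masked_poses, masked_indices). The input is not modified.
    """
    rng = random.Random(seed)
    num_frames = len(poses)
    # Mask at least one frame whenever the sequence is non-empty.
    num_masked = max(1, round(num_frames * mask_ratio)) if num_frames else 0
    masked_indices = sorted(rng.sample(range(num_frames), num_masked))
    hidden = set(masked_indices)
    masked_poses = [
        [mask_value] * len(frame) if t in hidden else list(frame)
        for t, frame in enumerate(poses)
    ]
    return masked_poses, masked_indices


def masked_reconstruction_error(predicted, target, masked_indices):
    """Mean squared error computed over the hidden frames only."""
    total, count = 0.0, 0
    for t in masked_indices:
        for p, y in zip(predicted[t], target[t]):
            total += (p - y) ** 2
            count += 1
    return total / count if count else 0.0


# Toy sequence: 8 frames with 3 values each (stand-in for MoCap pose parameters).
poses = [[float(t), float(t) + 0.5, 1.0] for t in range(8)]
masked, hidden_frames = mask_pose_sequence(poses, mask_ratio=0.25, seed=0)
```

In a training loop, a sequence model would receive `masked` and be asked to predict the original frames; `masked_reconstruction_error` would then be evaluated between that prediction and `poses`. Because the objective is only to fill in hidden frames, the same trained model could plausibly be reused for refinement, completion or prediction by choosing which frames to hide at test time.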

Subjects

Algorithms, Hand, Humans, Learning, Motion, Motion Capture

Focal Visual-Text Attention for Memex Question Answering.

Liang, Junwei; Jiang, Lu; Cao, Liangliang; Kalantidis, Yannis; Li, Li-Jia; Hauptmann, Alexander G.

IEEE Trans Pattern Anal Mach Intell; 41(8): 1893-1908, 2019 Aug.

Article in English | MEDLINE | ID: mdl-30624212

ABSTRACT

Recent insights on language and vision with neural networks have been successfully applied to simple single-image visual question answering. However, to tackle real-life question answering problems on multimedia collections such as personal photo albums, we have to look at whole collections with sequences of photos. This paper proposes a new multimodal MemexQA task: given a sequence of photos from a user, the goal is to automatically answer questions that help users recover their memory about an event captured in these photos. In addition to a text answer, a few grounding photos are also given to justify the answer. The grounding photos are necessary as they help users quickly verify the answer. Towards solving the task, we 1) present the MemexQA dataset, the first publicly available multimodal question answering dataset consisting of real personal photo albums; and 2) propose an end-to-end trainable network that uses a hierarchical process to dynamically determine what media and what time to focus on in the sequential data to answer the question. Experimental results on the MemexQA dataset demonstrate that our model outperforms strong baselines and yields the most relevant grounding photos on this challenging task.
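The "what media and what time" selection described in the abstract can be pictured as a two-level attention. The sketch below is a generic hierarchical softmax attention, not the paper's Focal Visual-Text Attention network: the function names, the dot-product scoring and the dictionary layout of the sequence are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hierarchical_attention(question, sequence):
    """Attend first over media within each time step, then over time steps.

    question: list of floats.
    sequence: list of time steps, each a dict mapping a media name
              (e.g. "photo", "caption") to a vector of the same length.
    Returns (context, time_weights, media_weights).
    """
    dim = len(question)
    step_summaries, step_scores, media_weights = [], [], []
    for step in sequence:
        names = list(step)
        # Level 1: which media matters at this time step?
        weights = softmax([dot(question, step[n]) for n in names])
        media_weights.append(dict(zip(names, weights)))
        summary = [
            sum(w * step[n][d] for w, n in zip(weights, names))
            for d in range(dim)
        ]
        step_summaries.append(summary)
        step_scores.append(dot(question, summary))
    # Level 2: which time step matters for the question?
    time_weights = softmax(step_scores)
    context = [
        sum(tw * s[d] for tw, s in zip(time_weights, step_summaries))
        for d in range(dim)
    ]
    return context, time_weights, media_weights


# Toy album with two time steps and two media per step.
question = [1.0, 0.0]
album = [
    {"photo": [0.0, 1.0], "caption": [0.0, 1.0]},
    {"photo": [3.0, 0.0], "caption": [0.0, 1.0]},
]
context, time_weights, media_weights = hierarchical_attention(question, album)
```

In this toy setup, the time step with the largest entry in `time_weights` could be read as a candidate grounding photo, which mirrors the idea of returning evidence alongside the answer.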
