Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 4 de 4
Filtrar
Mais filtros










Base de dados
Intervalo de ano de publicação
1.
Artigo em Inglês | MEDLINE | ID: mdl-38814779

RESUMO

Natural language plays a critical role in many computer vision applications, such as image captioning, visual question answering, and cross-modal retrieval, to provide fine-grained semantic information. Unfortunately, while human pose is key to human understanding, current 3D human pose datasets lack detailed language descriptions. To address this issue, we have introduced the PoseScript dataset. This dataset pairs more than six thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships. Additionally, to increase the size of the dataset to a scale that is compatible with data-hungry learning algorithms, we have proposed an elaborate captioning process that generates automatic synthetic descriptions in natural language from given 3D keypoints. This process extracts low-level pose information, known as "posecodes", using a set of simple but generic rules on the 3D keypoints. These posecodes are then combined into higher level textual descriptions using syntactic rules. With automatic annotations, the amount of available data significantly scales up (100k), making it possible to effectively pretrain deep models for finetuning on human captions. To showcase the potential of annotated poses, we present three multi-modal learning tasks that utilize the PoseScript dataset. Firstly, we develop a pipeline that maps 3D poses and textual descriptions into a joint embedding space, allowing for cross-modal retrieval of relevant poses from large-scale datasets. Secondly, we establish a baseline for a text-conditioned model generating 3D poses. Thirdly, we present a learned process for generating pose descriptions. These applications demonstrate the versatility and usefulness of annotated poses in various tasks and pave the way for future research in the field. The dataset is available at https://europe.naverlabs.com/research/computer-vision/posescript/.

2.
IEEE Trans Pattern Anal Mach Intell ; 45(11): 12798-12815, 2023 11.
Artigo em Inglês | MEDLINE | ID: mdl-37015699

RESUMO

Training state-of-the-art models for human pose estimation in videos requires datasets with annotations that are really hard and expensive to obtain. Although transformers have been recently utilized for body pose sequence modeling, related methods rely on pseudo-ground truth to augment the currently limited training data available for learning such models. In this paper, we introduce PoseBERT, a transformer module that is fully trained on 3D Motion Capture (MoCap) data via masked modeling. It is simple, generic and versatile, as it can be plugged on top of any image-based model to transform it in a video-based model leveraging temporal information. We showcase variants of PoseBERT with different inputs varying from 3D skeleton keypoints to rotations of a 3D parametric model for either the full body (SMPL) or just the hands (MANO). Since PoseBERT training is task agnostic, the model can be applied to several tasks such as pose refinement, future pose prediction or motion completion without finetuning. Our experimental results validate that adding PoseBERT on top of various state-of-the-art pose estimation methods consistently improves their performances, while its low computational cost allows us to use it in a real-time demo for smoothly animating a robotic hand via a webcam. Test code and models are available at https://github.com/naver/posebert.


Assuntos
Algoritmos , Mãos , Humanos , Aprendizagem , Movimento (Física) , Captura de Movimento
3.
IEEE Trans Pattern Anal Mach Intell ; 42(5): 1146-1161, 2020 05.
Artigo em Inglês | MEDLINE | ID: mdl-30640602

RESUMO

We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images. Key to our approach is the generation and scoring of a number of pose proposals per image, which allows us to predict 2D and 3D poses of multiple people simultaneously. Hence, our approach does not require an approximate localization of the humans for initialization. Our Localization-Classification-Regression architecture, named LCR-Net, contains 3 main components: 1) the pose proposal generator that suggests candidate poses at different locations in the image; 2) a classifier that scores the different pose proposals; and 3) a regressor that refines pose proposals both in 2D and 3D. All three stages share the convolutional feature layers and are trained jointly. The final pose estimation is obtained by integrating over neighboring pose hypotheses, which is shown to improve over a standard non maximum suppression algorithm. Our method recovers full-body 2D and 3D poses, hallucinating plausible body parts when the persons are partially occluded or truncated by the image boundary. Our approach significantly outperforms the state of the art in 3D pose estimation on Human3.6M, a controlled environment. Moreover, it shows promising results on real images for both single and multi-person subsets of the MPII 2D pose benchmark and demonstrates satisfying 3D pose results even for multi-person images.


Assuntos
Imageamento Tridimensional/métodos , Redes Neurais de Computação , Postura/fisiologia , Bases de Dados Factuais , Humanos , Gravação em Vídeo
4.
IEEE Trans Cybern ; 44(6): 894-909, 2014 Jun.
Artigo em Inglês | MEDLINE | ID: mdl-23955796

RESUMO

Gait recognition can potentially provide a noninvasive and effective biometric authentication from a distance. However, the performance of gait recognition systems will suffer in real surveillance scenarios with multiple interacting individuals and where the camera is usually placed at a significant angle and distance from the floor. We present a methodology for view-invariant monocular 3-D human pose tracking in man-made environments in which we assume that observed people move on a known ground plane. First, we model 3-D body poses and camera viewpoints with a low dimensional manifold and learn a generative model of the silhouette from this manifold to a reduced set of training views. During the online stage, 3-D body poses are tracked using recursive Bayesian sampling conducted jointly over the scene's ground plane and the pose-viewpoint manifold. For each sample, the homography that relates the corresponding training plane to the image points is calculated using the dominant 3-D directions of the scene, the sampled location on the ground plane and the sampled camera view. Each regressed silhouette shape is projected using this homographic transformation and is matched in the image to estimate its likelihood. Our framework is able to track 3-D human walking poses in a 3-D environment exploring only a 4-D state space with success. In our experimental evaluation, we demonstrate the significant improvements of the homographic alignment over a commonly used similarity transformation and provide quantitative pose tracking results for the monocular sequences with a high perspective effect from the CAVIAR dataset.


Assuntos
Marcha/fisiologia , Imageamento Tridimensional/métodos , Reconhecimento Automatizado de Padrão/métodos , Algoritmos , Teorema de Bayes , Humanos
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA
...