Results 1 - 16 of 16
1.
Int J Comput Vis; 132(4): 1148-1166, 2024.
Article in English | MEDLINE | ID: mdl-38549787

ABSTRACT

Portrait viewpoint and illumination editing is an important problem with several applications in VR/AR, movies, and photography. Comprehensive knowledge of geometry and illumination is critical for obtaining photorealistic results. Current methods are unable to explicitly model in 3D while handling both viewpoint and illumination editing from a single image. In this paper, we propose VoRF, a novel approach that can take even a single portrait image as input and relight human heads under novel illuminations that can be viewed from arbitrary viewpoints. VoRF represents a human head as a continuous volumetric field and learns a prior model of human heads using a coordinate-based MLP with individual latent spaces for identity and illumination. The prior model is learned in an auto-decoder manner over a diverse class of head shapes and appearances, allowing VoRF to generalize to novel test identities from a single input image. Additionally, VoRF has a reflectance MLP that uses the intermediate features of the prior model for rendering One-Light-at-A-Time (OLAT) images under novel views. We synthesize novel illuminations by combining these OLAT images with target environment maps. Qualitative and quantitative evaluations demonstrate the effectiveness of VoRF for relighting and novel view synthesis, even when applied to unseen subjects under uncontrolled illumination. This work is an extension of Rao et al. (VoRF: Volumetric Relightable Faces, 2022). We provide extensive evaluations and ablation studies of our model, and we also present an application in which any face can be relit using textual input.
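For illustration, a minimal sketch of the kind of architecture the abstract describes follows: a coordinate-based MLP conditioned on separate identity and illumination latent codes, and relighting by linearly combining OLAT renderings with environment-map weights. All names, dimensions, and the training setup are assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a coordinate-based MLP prior with
# separate identity and illumination latent codes, plus OLAT-based relighting.
import torch
import torch.nn as nn

class HeadPriorField(nn.Module):
    def __init__(self, id_dim=64, illum_dim=16, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + id_dim + illum_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),            # density + RGB radiance
        )

    def forward(self, xyz, z_id, z_illum):
        # xyz: (N, 3) sample points; z_id / z_illum: per-subject latent codes.
        z = torch.cat([xyz,
                       z_id.expand(xyz.shape[0], -1),
                       z_illum.expand(xyz.shape[0], -1)], dim=-1)
        out = self.mlp(z)
        return out[..., :1], out[..., 1:]    # (density, rgb)

def relight_from_olat(olat_images, env_map_weights):
    # olat_images: (L, H, W, 3) torch tensor, one rendering per light direction;
    # env_map_weights: (L,) intensity of the target environment map per light.
    # A novel illumination is a weighted sum of the OLAT basis images.
    return (env_map_weights[:, None, None, None] * olat_images).sum(dim=0)

# Auto-decoder style training would optimize the latent codes jointly with the
# MLP weights, one (z_id, z_illum) pair per training subject and lighting.
```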

2.
Article in English | MEDLINE | ID: mdl-38376960

ABSTRACT

The reconstruction and novel view synthesis of dynamic scenes recently gained increased attention. As reconstruction from large-scale multi-view data involves immense memory and computational requirements, recent benchmark datasets provide collections of single monocular views per timestamp sampled from multiple (virtual) cameras. We refer to this form of input as monocularized data. Existing work shows impressive results for synthetic setups and forward-facing real-world data, but is often limited in training speed and in the angular range over which novel views can be generated. This paper addresses these limitations and proposes a new method for full 360° inward-facing novel view synthesis of non-rigidly deforming scenes. At the core of our method are: 1) an efficient deformation module that decouples the processing of spatial and temporal information for accelerated training and inference; and 2) a static module representing the canonical scene as a fast hash-encoded neural radiance field. In addition to existing synthetic monocularized data, we systematically analyze the performance on real-world inward-facing scenes using a newly recorded challenging dataset sampled from a synchronized large-scale multi-view rig. In both cases, our method is significantly faster than previous methods, converging in less than 7 minutes and achieving real-time framerates at 1K resolution, while obtaining a higher visual accuracy for generated novel views. Our code and dataset are available online: https://github.com/MoritzKappel/MoNeRF.
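A hedged sketch of the decoupling idea described above: a temporal branch maps the timestamp to a compact latent, a spatial branch bends sample points into a canonical frame, and a static canonical field is queried there. The actual method uses a fast hash-encoded radiance field; a plain MLP stands in for it here, and all layer sizes are assumptions.

```python
# Sketch of a deformation module that decouples temporal and spatial processing,
# feeding a static canonical field (a plain MLP stand-in for the hash-encoded NeRF).
import torch
import torch.nn as nn

class DecoupledDeformation(nn.Module):
    def __init__(self, t_dim=8, hidden=64):
        super().__init__()
        self.temporal = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                      nn.Linear(hidden, t_dim))
        self.spatial = nn.Sequential(nn.Linear(3 + t_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 3))

    def forward(self, xyz, t):
        # xyz: (N, 3) sample points, t: scalar timestamp in [0, 1].
        z_t = self.temporal(t.view(1, 1)).expand(xyz.shape[0], -1)
        offset = self.spatial(torch.cat([xyz, z_t], dim=-1))
        return xyz + offset                  # points in the canonical frame

canonical_field = nn.Sequential(             # stand-in for the hash-encoded NeRF
    nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 4))

deform = DecoupledDeformation()
pts = torch.rand(1024, 3)
density_rgb = canonical_field(deform(pts, torch.tensor(0.5)))
```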

3.
IEEE Trans Pattern Anal Mach Intell; 45(12): 15098-15119, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37624713

ABSTRACT

As information exists in various modalities in the real world, the effective interaction and fusion of multimodal information play a key role in the creation and perception of multimodal data in computer vision and deep learning research. Owing to its power in modeling the interaction among multimodal information, multimodal image synthesis and editing has become a hot research topic in recent years. Instead of providing explicit guidance for network training, multimodal guidance offers intuitive and flexible means for image synthesis and editing. On the other hand, this field also faces several challenges, such as the alignment of multimodal features, the synthesis of high-resolution images, and faithful evaluation metrics. In this survey, we comprehensively contextualize recent advances in multimodal image synthesis and editing and formulate taxonomies according to data modalities and model types. We start with an introduction to different guidance modalities in image synthesis and editing, and then describe multimodal image synthesis and editing approaches extensively according to their model types. After that, we describe benchmark datasets and evaluation metrics as well as corresponding experimental results. Finally, we provide insights about the current research challenges and possible directions for future research.

4.
Article in English | MEDLINE | ID: mdl-37585333

ABSTRACT

We propose a new method for learning a generalized animatable neural human representation from a sparse set of multi-view imagery of multiple persons. The learned representation can be used to synthesize novel view images of an arbitrary person and further animate them with the user's pose control. While most existing methods can either generalize to new persons or synthesize animations with user control, none of them can achieve both at the same time. We attribute this capability to two design choices: the use of a 3D proxy as a shared multi-person human model, and the warping of different pose spaces into a shared canonical pose space, in which we learn a neural field and predict person- and pose-dependent deformations, as well as appearance, from features extracted from the input images. To cope with the complexity of the large variations in body shapes, poses, and clothing deformations, we design our neural human model with disentangled geometry and appearance. Furthermore, we utilize the image features both at the spatial point and on the surface points of the 3D proxy for predicting person- and pose-dependent properties. Experiments show that our method significantly outperforms the state of the art on both tasks.
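To make the canonical-space idea concrete, here is an illustrative sketch (an assumption, not the paper's code) of warping an observed-space point into a shared canonical pose space by borrowing the linear blend skinning transform of the nearest vertex of a posed body proxy.

```python
# Illustrative canonical-space warp via inverse linear blend skinning with
# nearest-vertex skinning weights. All inputs and names are assumptions.
import numpy as np

def warp_to_canonical(x, posed_verts, skin_weights, bone_transforms):
    # x: (3,) query point in the observed pose.
    # posed_verts: (V, 3) proxy vertices in the observed pose.
    # skin_weights: (V, J) linear blend skinning weights.
    # bone_transforms: (J, 4, 4) canonical-to-posed bone transforms.
    nearest = np.argmin(np.linalg.norm(posed_verts - x, axis=1))
    # Blend the bone transforms with the nearest vertex's weights ...
    T = np.tensordot(skin_weights[nearest], bone_transforms, axes=1)  # (4, 4)
    # ... and apply the inverse to move the point into canonical space.
    x_h = np.append(x, 1.0)
    return (np.linalg.inv(T) @ x_h)[:3]
```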

5.
IEEE Trans Pattern Anal Mach Intell; 45(4): 4009-4022, 2023 Apr.
Article in English | MEDLINE | ID: mdl-34191722

ABSTRACT

Human performance capture is a highly important computer vision problem with many applications in movie production and virtual/augmented reality. Many previous performance capture approaches either required expensive multi-view setups or did not recover dense space-time coherent geometry with frame-to-frame correspondences. We propose a novel deep learning approach for monocular dense human performance capture. Our method is trained in a weakly supervised manner based on multi-view supervision, completely removing the need for training data with 3D ground truth annotations. The network architecture is based on two separate networks that disentangle the task into a pose estimation and a non-rigid surface deformation step. Extensive qualitative and quantitative evaluations show that our approach outperforms the state of the art in terms of quality and robustness. This work is an extended version of [1], where we provide more detailed explanations, comparisons, and results, as well as applications.
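As a minimal illustration of how multi-view supervision can replace 3D ground truth, the sketch below projects predicted 3D joints into all calibrated training cameras and penalizes the distance to 2D detections. The tensor shapes and names are assumptions, and the paper's full losses (including the non-rigid surface term) are not reproduced.

```python
# Minimal multi-view reprojection loss: 3D joints predicted from a single view
# are projected into every calibrated camera and compared with 2D detections.
import torch

def multiview_reprojection_loss(joints_3d, cameras, joints_2d):
    # joints_3d: (J, 3) prediction from the monocular network.
    # cameras:   (C, 3, 4) projection matrices of the training views.
    # joints_2d: (C, J, 2) 2D joint detections in each view.
    J = joints_3d.shape[0]
    homo = torch.cat([joints_3d, torch.ones(J, 1)], dim=-1)        # (J, 4)
    proj = torch.einsum('cij,kj->cki', cameras, homo)              # (C, J, 3)
    proj_2d = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)
    return ((proj_2d - joints_2d) ** 2).mean()
```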

6.
IEEE Trans Pattern Anal Mach Intell; 44(12): 8962-8974, 2022 Dec.
Article in English | MEDLINE | ID: mdl-34727024

ABSTRACT

3D hand shape and pose estimation from a single depth map is a new and challenging computer vision problem with many applications. Existing methods addressing it directly regress hand meshes via 2D convolutional neural networks, which leads to artifacts due to perspective distortions in the images. To address the limitations of the existing methods, we develop HandVoxNet++, i.e., a voxel-based deep network with 3D and graph convolutions trained in a fully supervised manner. The input to our network is a 3D voxelized depth map based on the truncated signed distance function (TSDF). HandVoxNet++ relies on two hand shape representations. The first one is the 3D voxelized grid of hand shape, which does not preserve the mesh topology and which is the most accurate representation. The second representation is the hand surface, which preserves the mesh topology. We combine the advantages of both representations by aligning the hand surface to the voxelized hand shape either with a new neural Graph-Convolutions-based Mesh Registration (GCN-MeshReg) or with a classical segment-wise Non-Rigid Gravitational Approach (NRGA++), which does not rely on training data. In extensive evaluations on three public benchmarks, i.e., SynHand5M, the depth-based HANDS19 challenge, and HO-3D, the proposed HandVoxNet++ achieves state-of-the-art performance. In this journal extension of our previous approach presented at CVPR 2020, we gain 41.09% and 13.7% higher shape alignment accuracy on the SynHand5M and HANDS19 datasets, respectively. Our method was ranked first on the HANDS19 challenge dataset (Task 1: Depth-Based 3D Hand Pose Estimation) at the time of the submission of our results to the portal in August 2020.


Subjects
Algorithms; Neural Networks, Computer; Hand/diagnostic imaging
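The TSDF voxelization step can be pictured with the rough sketch below (not the published pipeline): every voxel center is projected into the depth map and its signed distance to the observed surface along the viewing ray is truncated to [-1, 1]. The intrinsics, grid layout, and truncation band are assumptions, and the grid is assumed to lie in front of the camera (z > 0).

```python
# Rough sketch of converting a depth map into a truncated signed distance
# function (TSDF) voxel grid. Assumes the grid lies entirely in front of the camera.
import numpy as np

def depth_to_tsdf(depth, fx, fy, cx, cy, grid_min, voxel_size, dims, trunc=0.03):
    xs, ys, zs = np.meshgrid(*[np.arange(d) for d in dims], indexing='ij')
    pts = grid_min + np.stack([xs, ys, zs], -1) * voxel_size   # voxel centers
    u = np.round(pts[..., 0] / pts[..., 2] * fx + cx).astype(int)
    v = np.round(pts[..., 1] / pts[..., 2] * fy + cy).astype(int)
    valid = (u >= 0) & (u < depth.shape[1]) & (v >= 0) & (v < depth.shape[0])
    tsdf = np.ones(dims, dtype=np.float32)                     # empty by default
    d = depth[v[valid], u[valid]]
    # signed distance along the viewing ray, truncated and normalised to [-1, 1]
    sdf = np.clip((d - pts[..., 2][valid]) / trunc, -1.0, 1.0)
    tsdf[valid] = np.where(d > 0, sdf, 1.0)
    return tsdf
```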
7.
IEEE Trans Vis Comput Graph; 27(10): 4009-4022, 2021 Oct.
Article in English | MEDLINE | ID: mdl-32746256

ABSTRACT

Synthesizing realistic videos of humans using neural networks has been a popular alternative to the conventional graphics-based rendering pipeline due to its high efficiency. Existing works typically formulate this as an image-to-image translation problem in 2D screen space, which leads to artifacts such as over-smoothing, missing body parts, and temporal instability of fine-scale detail, such as pose-dependent wrinkles in the clothing. In this article, we propose a novel human video synthesis method that approaches these limiting factors by explicitly disentangling the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space. More specifically, our method relies on the combination of two convolutional neural networks (CNNs). Given the pose information, the first CNN predicts a dynamic texture map that contains time-coherent high-frequency details, and the second CNN conditions the generation of the final video on the temporally coherent output of the first CNN. We demonstrate several applications of our approach, such as human reenactment and novel view synthesis from monocular video, where we show significant improvement over the state of the art both qualitatively and quantitatively.


Subjects
Computer Graphics; Image Processing, Computer-Assisted/methods; Neural Networks, Computer; Video Recording/methods; Artifacts; Deep Learning; Humans; Virtual Reality
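The two-network split described in the abstract can be sketched as follows; both networks here are schematic placeholders, and the texture-space renderer, channel counts, and layer choices are assumptions rather than the published architecture.

```python
# Schematic two-CNN pipeline: a texture network predicts a pose-dependent
# dynamic texture; its rendered result conditions a refinement network that
# produces the final frame. Everything here is a placeholder sketch.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())

texture_net = nn.Sequential(conv_block(3, 32), conv_block(32, 32),
                            nn.Conv2d(32, 3, 3, padding=1))   # pose map -> texture
refine_net  = nn.Sequential(conv_block(3, 32), conv_block(32, 32),
                            nn.Conv2d(32, 3, 3, padding=1))   # rendered -> final frame

def synthesize(pose_map, render_with_texture):
    # pose_map: (B, 3, H, W) encoding of the target pose (an assumed input format);
    # render_with_texture: callable projecting the texture onto the body model.
    texture = texture_net(pose_map)              # time-coherent dynamic texture
    rendered = render_with_texture(texture)      # texture applied to the 3D model
    return refine_net(rendered)                  # screen-space refinement
```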
8.
IEEE Trans Pattern Anal Mach Intell; 42(2): 357-370, 2020 Feb.
Article in English | MEDLINE | ID: mdl-30334783

ABSTRACT

In this work, we propose a novel model-based deep convolutional autoencoder that addresses the highly challenging problem of reconstructing a 3D human face from a single in-the-wild color image. To this end, we combine a convolutional encoder network with an expert-designed generative model that serves as decoder. The core innovation is the differentiable parametric decoder that encapsulates image formation analytically based on a generative model. Our decoder takes as input a code vector with exactly defined semantic meaning that encodes detailed face pose, shape, expression, skin reflectance, and scene illumination. Due to this new way of combining CNN-based with model-based face reconstruction, the CNN-based encoder learns to extract semantically meaningful parameters from a single monocular input image. For the first time, a CNN encoder and an expert-designed generative model can be trained end-to-end in an unsupervised manner, which renders training on very large (unlabeled) real-world datasets feasible. The obtained reconstructions compare favorably to current state-of-the-art approaches in terms of quality and richness of representation. This work is an extended version of [1], where we additionally present a stochastic vertex sampling technique for faster training of our networks, and moreover, we propose and evaluate analysis-by-synthesis and shape-from-shading refinement approaches to achieve a high-fidelity reconstruction.


Subjects
Face/anatomy & histology; Face/diagnostic imaging; Imaging, Three-Dimensional/methods; Unsupervised Machine Learning; Deep Learning; Female; Humans; Male; Neural Networks, Computer
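As a toy illustration of the parametric decoder idea, the sketch below decodes a semantic code vector against linear shape, expression, and reflectance bases. Basis names, dimensions, and the coefficient split are assumptions; the actual decoder additionally models pose, illumination, and differentiable image formation.

```python
# Toy linear face-model decoder: a semantic code is split into shape,
# expression and reflectance coefficients and decoded with PCA bases.
import numpy as np

def decode_face(code, mean_shape, shape_basis, expr_basis,
                mean_albedo, albedo_basis, n_shape=80, n_expr=64):
    alpha = code[:n_shape]                      # identity shape coefficients
    delta = code[n_shape:n_shape + n_expr]      # expression coefficients
    beta  = code[n_shape + n_expr:]             # skin reflectance coefficients
    vertices = mean_shape + shape_basis @ alpha + expr_basis @ delta   # (3V,)
    albedo   = mean_albedo + albedo_basis @ beta                       # (3V,)
    return vertices.reshape(-1, 3), albedo.reshape(-1, 3)
```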
9.
IEEE Trans Vis Comput Graph; 25(5): 2093-2101, 2019 May.
Article in English | MEDLINE | ID: mdl-30794176

ABSTRACT

We propose the first real-time system for the egocentric estimation of 3D human body pose in a wide range of unconstrained everyday activities. This setting has a unique set of challenges, such as mobility of the hardware setup and robustness to long capture sessions with fast recovery from tracking failures. We tackle these challenges based on a novel lightweight setup that converts a standard baseball cap to a device for high-quality pose estimation based on a single cap-mounted fisheye camera. From the captured egocentric live stream, our CNN-based 3D pose estimation approach runs at 60 Hz on a consumer-level GPU. In addition to the lightweight hardware setup, our other main contributions are: 1) a large ground truth training corpus of top-down fisheye images and 2) a disentangled 3D pose estimation approach that takes the unique properties of the egocentric viewpoint into account. As shown by our evaluation, we achieve lower 3D joint error as well as better 2D overlay than the existing baselines.


Subjects
Imaging, Three-Dimensional/methods; Smart Glasses; Databases, Factual; Deep Learning; Human Activities; Humans; Posture; Software; Video Recording
10.
IEEE Trans Vis Comput Graph; 23(11): 2447-2454, 2017 Nov.
Article in English | MEDLINE | ID: mdl-28809688

ABSTRACT

We present a novel real-time approach for user-guided intrinsic decomposition of static scenes captured by an RGB-D sensor. In the first step, we acquire a three-dimensional representation of the scene using a dense volumetric reconstruction framework. The obtained reconstruction serves as a proxy to densely fuse reflectance estimates and to store user-provided constraints in three-dimensional space. User constraints, in the form of constant shading and reflectance strokes, can be placed directly on the real-world geometry using an intuitive touch-based interaction metaphor, or using interactive mouse strokes. Fusing the decomposition results and constraints in three-dimensional space allows for robust propagation of this information to novel views by re-projection. We leverage this information to improve on the decomposition quality of existing intrinsic video decomposition techniques by further constraining the ill-posed decomposition problem. In addition to improved decomposition quality, we show a variety of live augmented reality applications such as recoloring of objects, relighting of scenes and editing of material appearance.


Subjects
Imaging, Three-Dimensional/methods; Video Recording/methods; Algorithms; Humans
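A small sketch of the constraint-propagation step: user strokes fused onto the 3D reconstruction are carried into a novel view by re-projecting the constrained surface points with that view's camera. The pinhole camera model and all names are illustrative assumptions.

```python
# Re-projecting stroke constraints stored on 3D geometry into a novel view.
import numpy as np

def reproject_constraints(points_3d, labels, K, R, t, height, width):
    # points_3d: (N, 3) constrained surface points; labels: (N,) integer stroke ids.
    cam = R @ points_3d.T + t[:, None]              # camera-space points, (3, N)
    uv = (K @ cam)[:2] / cam[2]                     # pinhole projection
    u, v = np.round(uv).astype(int)
    constraint_map = -np.ones((height, width), dtype=int)   # -1 = unconstrained
    ok = (u >= 0) & (u < width) & (v >= 0) & (v < height) & (cam[2] > 0)
    constraint_map[v[ok], u[ok]] = labels[ok]
    return constraint_map
```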
11.
Int J Comput Vis; 124(1): 96-113, 2017.
Article in English | MEDLINE | ID: mdl-32025094

ABSTRACT

This paper presents a novel approach to recover true fine surface detail of deforming meshes reconstructed from multi-view video. Template-based methods for performance capture usually produce a coarse-to-medium scale detail 4D surface reconstruction which does not contain the real high-frequency geometric detail present in the original video footage. Fine scale deformation is often incorporated in a second pass by using stereo constraints, features, or shading-based refinement. In this paper, we propose an alternative solution to this second stage by formulating dense dynamic surface reconstruction as a global optimization problem of the densely deforming surface. Our main contribution is an implicit representation of a deformable mesh that uses a set of Gaussian functions on the surface to represent the initial coarse mesh, and a set of Gaussians for the images to represent the original captured multi-view images. We effectively find the fine scale deformations for all mesh vertices, which maximize photo-temporal-consistency, by densely optimizing our model-to-image consistency energy on all vertex positions. Our formulation yields a smooth closed form energy with implicit occlusion handling and analytic derivatives. Furthermore, it does not require error-prone correspondence finding or discrete sampling of surface displacement values. We demonstrate our approach on a variety of datasets of human subjects wearing loose clothing and performing different motions. We qualitatively and quantitatively demonstrate that our technique successfully reproduces finer detail than the input baseline geometry.
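In the spirit of the Gaussian formulation above, the following sketch evaluates a smooth model-to-image consistency term between projected surface Gaussians and image Gaussians, weighting spatial overlap by colour similarity. The bandwidths, the exact energy, and occlusion handling are assumptions, not the paper's formulation.

```python
# Hedged sketch of a Gaussian-to-Gaussian photo-consistency energy: projected
# surface Gaussians are compared against image Gaussians with a smooth,
# analytically differentiable overlap term weighted by colour similarity.
import numpy as np

def consistency_energy(proj_xy, surf_color, img_xy, img_color,
                       sigma_spatial=4.0, sigma_color=0.1):
    # proj_xy: (M, 2) projected surface Gaussians; img_xy: (K, 2) image Gaussians.
    d_xy = proj_xy[:, None, :] - img_xy[None, :, :]
    d_c  = surf_color[:, None, :] - img_color[None, :, :]
    spatial = np.exp(-(d_xy ** 2).sum(-1) / (2 * sigma_spatial ** 2))
    colour  = np.exp(-(d_c ** 2).sum(-1) / (2 * sigma_color ** 2))
    return -(spatial * colour).sum()        # minimised over vertex positions
```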

12.
IEEE Trans Pattern Anal Mach Intell; 37(9): 1792-805, 2015 Sep.
Article in English | MEDLINE | ID: mdl-26353127

ABSTRACT

Improving the quality of degraded images is a key problem in image processing, but the breadth of the problem leads to domain-specific approaches for tasks such as super-resolution and compression artifact removal. Recent approaches have shown that a general approach is possible by learning application-specific models from examples; however, learning models sophisticated enough to generate high-quality images is computationally expensive, and so specific per-application or per-dataset models are impractical. To solve this problem, we present an efficient semi-local approximation scheme to large-scale Gaussian processes. This allows efficient learning of task-specific image enhancements from example images without reducing quality. As such, our algorithm can be easily customized to specific applications and datasets, and we show the efficiency and effectiveness of our approach across five domains: single-image super-resolution for scene, human face, and text images, and artifact removal in JPEG- and JPEG 2000-encoded images.
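A toy sketch of the semi-local approximation idea: rather than inverting one kernel matrix over all training patches, each query is answered by a Gaussian-process regressor restricted to its nearest training examples. The RBF kernel, neighbourhood size, and noise level are assumptions chosen for illustration.

```python
# Semi-local Gaussian-process prediction: a standard GP posterior mean computed
# only over the k training patches nearest to the query patch.
import numpy as np

def rbf(A, B, length=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length ** 2))

def semi_local_gp_predict(x_test, X_train, Y_train, k=256, noise=1e-2):
    # x_test: (D,) query patch; X_train: (N, D); Y_train: (N, D_out). Assumes N >= k.
    idx = np.argsort(((X_train - x_test) ** 2).sum(-1))[:k]
    Xn, Yn = X_train[idx], Y_train[idx]
    K = rbf(Xn, Xn) + noise * np.eye(Xn.shape[0])
    k_star = rbf(x_test[None, :], Xn)                  # (1, k)
    return k_star @ np.linalg.solve(K, Yn)             # GP posterior mean
```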

13.
IEEE Trans Pattern Anal Mach Intell; 35(11): 2720-35, 2013 Nov.
Article in English | MEDLINE | ID: mdl-24051731

ABSTRACT

Capturing the skeleton motion and detailed time-varying surface geometry of multiple, closely interacting people is a very challenging task, even in a multicamera setup, due to frequent occlusions and ambiguities in feature-to-person assignments. To address this task, we propose a framework that exploits multiview image segmentation. To this end, a probabilistic shape and appearance model is employed to segment the input images and to assign each pixel uniquely to one person. Given the articulated template models of each person and the labeled pixels, a combined optimization scheme, which splits the skeleton pose optimization problem into a local one and a lower-dimensional global one, is applied to each individual in turn, followed by surface estimation to capture detailed nonrigid deformations. We show on various sequences that our approach can capture the 3D motion of humans accurately even if they move rapidly, if they wear wide apparel, and if they are engaged in challenging multiperson motions, including dancing, wrestling, and hugging.


Subjects
Algorithms; Artificial Intelligence; Image Interpretation, Computer-Assisted/methods; Imaging, Three-Dimensional/methods; Movement/physiology; Pattern Recognition, Automated/methods; Whole Body Imaging/methods; Humans
14.
IEEE Trans Cybern; 43(5): 1370-82, 2013 Oct.
Article in English | MEDLINE | ID: mdl-23893757

ABSTRACT

We present an algorithm for creating free-viewpoint video of interacting humans using three handheld Kinect cameras. Our method reconstructs deforming surface geometry and temporally varying texture of humans through estimation of human poses and camera poses for every time step of the RGBZ video. Skeletal configurations and camera poses are found by solving a joint energy minimization problem, which optimizes the alignment of RGBZ data from all cameras, as well as the alignment of human shape templates to the Kinect data. The energy function is based on a combination of geometric correspondence finding, implicit scene segmentation, and correspondence finding using image features. Finally, texture recovery is achieved through joint optimization over spatio-temporal RGB data using matrix completion. As opposed to previous methods, our algorithm succeeds on free-viewpoint video of human actors in general uncontrolled indoor scenes with potentially dynamic backgrounds, and it succeeds even if the cameras are moving.


Subjects
Actigraphy/methods; Artificial Intelligence; Computer Peripherals; Imaging, Three-Dimensional/methods; Pattern Recognition, Automated/methods; Video Recording/methods; Whole Body Imaging/methods; Actigraphy/instrumentation; Algorithms; Computer Simulation; Humans; Image Enhancement/instrumentation; Image Enhancement/methods; Transducers; Video Games; Whole Body Imaging/instrumentation
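The matrix-completion step for texture recovery can be pictured with the toy sketch below: per-texel colour observations across frames form a partially observed matrix that is filled in by iterative singular-value thresholding. The threshold, iteration count, and data layout are assumptions and do not reproduce the paper's spatio-temporal optimization.

```python
# Toy low-rank matrix completion by iterative singular-value thresholding.
import numpy as np

def complete_texture(M, observed_mask, tau=1.0, iters=100):
    # M: (texels, frames) observations, zeros where unobserved;
    # observed_mask: boolean array of the same shape marking known entries.
    X = M.copy()
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt        # shrink singular values
        X[observed_mask] = M[observed_mask]            # keep known entries fixed
    return X
```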
15.
IEEE Trans Pattern Anal Mach Intell; 35(5): 1039-50, 2013 May.
Article in English | MEDLINE | ID: mdl-23520250

ABSTRACT

We describe a method for 3D object scanning by aligning depth scans that were taken from around an object with a Time-of-Flight (ToF) camera. These ToF cameras can measure depth scans at video rate. Due to their comparably simple technology, they bear potential for economical production in large volumes. Our easy-to-use, cost-effective scanning solution, which is based on such a sensor, could make 3D scanning technology more accessible to everyday users. The algorithmic challenge we face is that the sensor's level of random noise is substantial and there is a nontrivial systematic bias. In this paper, we show the surprising result that 3D scans of reasonable quality can also be obtained with a sensor of such low data quality. Established filtering and scan alignment techniques from the literature fail to achieve this goal. In contrast, our algorithm is based on a new combination of a 3D superresolution method with a probabilistic scan alignment approach that explicitly takes into account the sensor's noise characteristics.

16.
IEEE Trans Vis Comput Graph; 13(4): 663-74, 2007.
Article in English | MEDLINE | ID: mdl-17495327

ABSTRACT

By means of passive optical motion capture, real people can be authentically animated and photo-realistically textured. To import real-world characters into virtual environments, however, surface reflectance properties must also be known. We describe a video-based modeling approach that captures human shape and motion as well as reflectance characteristics from a handful of synchronized video recordings. The presented method is able to recover spatially varying surface reflectance properties of clothes from multiview video footage. The resulting model description enables us to realistically reproduce the appearance of animated virtual actors under different lighting conditions, as well as to interchange surface attributes among different people, e.g., for virtual dressing. Our contribution can be used to create 3D renditions of real-world people under arbitrary novel lighting conditions on standard graphics hardware.


Subjects
Computer Graphics; Image Interpretation, Computer-Assisted/methods; Joints/anatomy & histology; Joints/physiology; Lighting/methods; Models, Biological; Movement/physiology; Computer Simulation; Image Enhancement/methods; Imaging, Three-Dimensional/methods; User-Computer Interface
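As a minimal, hedged illustration of per-texel reflectance recovery from multiview footage, the sketch below fits a Lambertian albedo to one texel's observations under known light directions by least squares. The inputs and the purely diffuse model are assumptions; the published method estimates richer spatially varying reflectance.

```python
# Least-squares fit of a diffuse (Lambertian) albedo for a single texel
# observed in several frames under known lighting. Purely illustrative.
import numpy as np

def fit_diffuse_albedo(colors, normals, light_dirs, light_intensity=1.0):
    # colors: (F, 3) observed texel colours over F frames,
    # normals: (F, 3) surface normal per frame, light_dirs: (F, 3) unit vectors.
    shading = np.clip((normals * light_dirs).sum(-1), 0.0, None) * light_intensity
    denom = np.maximum((shading ** 2).sum(), 1e-8)
    return (colors * shading[:, None]).sum(0) / denom   # per-channel albedo, (3,)
```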