Results 1 - 9 of 9
1.
Article in English | MEDLINE | ID: mdl-37195841

ABSTRACT

People may perform diverse gestures, affected by various mental and physical factors, when speaking the same sentences. This inherent one-to-many relationship makes co-speech gesture generation from audio particularly challenging. Conventional CNNs/RNNs assume a one-to-one mapping and thus tend to predict the average of all possible target motions, easily resulting in plain/boring motions at inference time. We therefore propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into a shared code and a motion-specific code. The shared code is expected to be responsible for the motion component that is more correlated with the audio, while the motion-specific code is expected to capture diverse motion information that is more independent of the audio. However, splitting the latent code into two parts poses extra training difficulties. Several crucial training losses/strategies, including a relaxed motion loss, a bicycle constraint, and a diversity loss, are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than previous state-of-the-art methods, both quantitatively and qualitatively. Moreover, our formulation is compatible with discrete cosine transform (DCT) modeling and other popular backbones (e.g., RNN, Transformer). As for motion losses and quantitative motion evaluation, we find that structured losses/metrics (e.g., STFT) that consider temporal and/or spatial context complement the most commonly used point-wise losses (e.g., PCK), resulting in better motion dynamics and more nuanced motion details. Finally, we demonstrate that our method can readily be used to generate motion sequences with user-specified motion clips on the timeline.
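The sketch below illustrates the shared/motion-specific latent split described in this abstract. It is not the authors' code; the module names, dimensions, and single-layer GRU encoder are assumptions made for illustration.

```python
# Illustrative sketch (not the authors' code): split the cross-modal latent
# code into a shared part and a motion-specific part, as described above.
import torch
import torch.nn as nn

class SplitLatentEncoder(nn.Module):
    def __init__(self, motion_dim=64, shared_dim=32, specific_dim=32):
        super().__init__()
        self.motion_enc = nn.GRU(motion_dim, 256, batch_first=True)
        self.to_shared = nn.Linear(256, shared_dim * 2)      # mean and log-variance
        self.to_specific = nn.Linear(256, specific_dim * 2)

    def forward(self, motion):                               # motion: (B, T, motion_dim)
        _, h = self.motion_enc(motion)                       # h: (1, B, 256)
        h = h.squeeze(0)
        shared_mu, shared_logvar = self.to_shared(h).chunk(2, dim=-1)
        spec_mu, spec_logvar = self.to_specific(h).chunk(2, dim=-1)
        # Reparameterization: the shared code is meant to track the audio, the
        # motion-specific code to capture audio-independent variation.
        z_shared = shared_mu + torch.randn_like(shared_mu) * (0.5 * shared_logvar).exp()
        z_spec = spec_mu + torch.randn_like(spec_mu) * (0.5 * spec_logvar).exp()
        return z_shared, z_spec
```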

2.
IEEE Trans Pattern Anal Mach Intell ; 45(8): 9469-9485, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37027607

ABSTRACT

We present a method for reconstructing accurate and consistent 3D hands from a monocular video. We observe that the detected 2D hand keypoints and the image texture provide important cues about the geometry and texture of the 3D hand, which can reduce or even eliminate the need for 3D hand annotation. Accordingly, in this work, we propose S2HAND, a self-supervised 3D hand reconstruction model that can jointly estimate pose, shape, texture, and the camera viewpoint from a single RGB input, supervised only by easily accessible detected 2D keypoints. We further leverage the continuous hand motion information contained in unlabeled video data and explore S2HAND(V), which uses a weight-shared S2HAND model to process each frame and exploits additional motion, texture, and shape consistency constraints to obtain more accurate hand poses and more consistent shapes and textures. Experiments on benchmark datasets demonstrate that our self-supervised method achieves hand reconstruction performance comparable to recent fully supervised methods in the single-frame input setting, and notably improves reconstruction accuracy and consistency when trained on video data.
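A minimal sketch of the kind of 2D-keypoint reprojection supervision the abstract describes. It is not the released S2HAND code; the pinhole intrinsics, tensor shapes, and confidence weighting are assumptions.

```python
# Illustrative sketch (not S2HAND itself): supervise predicted 3D hand joints
# with detected 2D keypoints via camera projection.
import torch

def keypoint_reprojection_loss(joints_3d, keypoints_2d, confidence, fx, fy, cx, cy):
    """joints_3d: (B, 21, 3) in camera space; keypoints_2d: (B, 21, 2) detections;
    confidence: (B, 21) detector confidence used to down-weight noisy keypoints."""
    x, y, z = joints_3d.unbind(dim=-1)
    u = fx * x / z.clamp(min=1e-6) + cx            # perspective projection
    v = fy * y / z.clamp(min=1e-6) + cy
    projected = torch.stack([u, v], dim=-1)
    err = (projected - keypoints_2d).norm(dim=-1)  # per-joint pixel error
    return (confidence * err).mean()
```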


Subject(s)
Algorithms, Benchmarking, Cues, Motion, Supervised Machine Learning
3.
IEEE Trans Pattern Anal Mach Intell ; 44(7): 3791-3806, 2022 Jul.
Article in English | MEDLINE | ID: mdl-33566757

ABSTRACT

This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, or the spatial location and dominant color of the largest color diversity along the temporal axis. A neural network is then built and trained to predict these statistical summaries given the video frames as input. To alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing content in the visual field and needs only impressions of rough spatial locations to understand visual content. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D, and S3D-G. The results show that our approach outperforms existing approaches across these backbone networks on four downstream video analysis tasks: action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts.
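A minimal sketch of one such spatio-temporal summary, the rough grid location of the largest motion. This is not the authors' implementation; the 3x3 grid and the use of frame differencing instead of optical flow are simplifying assumptions.

```python
# Illustrative sketch: derive a pretext-task label from a video clip -- the
# coarse grid cell containing the largest accumulated temporal change.
import numpy as np

def largest_motion_block(clip, grid=3):
    """clip: (T, H, W) grayscale clip; returns the index of the grid cell with
    the largest accumulated frame-to-frame change."""
    motion = np.abs(np.diff(clip.astype(np.float32), axis=0)).sum(axis=0)  # (H, W)
    H, W = motion.shape
    hs, ws = H // grid, W // grid
    energies = [motion[i*hs:(i+1)*hs, j*ws:(j+1)*ws].sum()
                for i in range(grid) for j in range(grid)]
    return int(np.argmax(energies))   # one of grid*grid rough locations

# Example: a random 16-frame clip yields a label in [0, 8] for a 3x3 partition.
label = largest_motion_block(np.random.rand(16, 112, 112), grid=3)
```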


Subject(s)
Algorithms, Neural Networks (Computer), Humans, Motion, Software
4.
IEEE Trans Image Process ; 30: 6107-6116, 2021.
Article in English | MEDLINE | ID: mdl-34166189

ABSTRACT

Recent research has witnessed advances in facial image editing tasks such as face swapping and face reenactment. However, these methods are confined to dealing with one specific task at a time. In addition, for video facial editing, previous methods either simply apply transformations frame by frame or utilize multiple frames in a concatenated or iterative fashion, which leads to noticeable visual flicker. In this paper, we propose a unified, temporally consistent facial video editing framework termed UniFaceGAN. Based on a 3D reconstruction model and a simple yet efficient dynamic training sample selection mechanism, our framework is designed to handle face swapping and face reenactment simultaneously. To enforce temporal consistency, a novel 3D temporal loss constraint is introduced based on barycentric coordinate interpolation. Besides, we propose a region-aware conditional normalization layer to replace the traditional AdaIN or SPADE and synthesize more context-harmonious results. Compared with state-of-the-art facial image editing methods, our framework generates video portraits that are more photo-realistic and temporally smooth.
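A hedged sketch of a region-aware conditional normalization layer in the spirit of the one described above; this is not UniFaceGAN itself, and the InstanceNorm base, convolutional modulation, and one-hot region mask format are assumptions.

```python
# Illustrative sketch: per-region scale and bias predicted from a facial region
# mask and applied to normalized features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAwareNorm(nn.Module):
    def __init__(self, channels, num_regions):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.gamma = nn.Conv2d(num_regions, channels, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(num_regions, channels, kernel_size=3, padding=1)

    def forward(self, feat, region_mask):
        # feat: (B, C, H, W); region_mask: (B, num_regions, H', W') one-hot regions
        mask = F.interpolate(region_mask, size=feat.shape[2:], mode='nearest')
        normalized = self.norm(feat)
        return normalized * (1 + self.gamma(mask)) + self.beta(mask)
```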

5.
IEEE Trans Image Process ; 30: 5600-5612, 2021.
Article in English | MEDLINE | ID: mdl-34110993

ABSTRACT

Face Super-Resolution (FSR) aims to infer a High-Resolution (HR) face image from a captured Low-Resolution (LR) face image with the assistance of external information. Existing FSR methods are less effective for LR face images captured with severe quality degradation, because they do not account for the large imaging/degradation gap between different imaging scenarios (i.e., the complex practical imaging scenario that generates the test LR images versus the simple manual degradation that generates the training LR images). In this paper, we propose an image homogenization strategy via re-expression to solve this problem. In contrast to existing methods, we introduce homogenization projections in LR space and HR space as compensation for the classical LR-to-HR projection, formulating FSR as a multi-stage framework. We then develop a re-expression process to bridge the gap between the complex degradation and the simple degradation, which removes heterogeneous factors such as severe noise and blur. To further improve the accuracy of the homogenization, we extract an image patch set that is invariant to degradation changes as Robust Neighbor Resources (RNR), with which the two homogenization projections re-express the input LR images and the initially inferred HR images successively. Both quantitative and qualitative results on public datasets demonstrate the effectiveness of the proposed algorithm against state-of-the-art methods.
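A rough sketch of the re-expression idea: replace degraded patches with their nearest neighbors from a degradation-robust patch set. This is only an interpretation of the abstract; the patch size, non-overlapping tiling, and plain L2 matching are assumptions.

```python
# Illustrative sketch (not the paper's pipeline): re-express a degraded image
# with nearest neighbors drawn from a robust reference patch set.
import numpy as np

def reexpress(lr_image, rnr_patches, patch=8):
    """lr_image: (H, W) float array; rnr_patches: (N, patch, patch) robust patch set.
    Returns an image whose non-overlapping patches are replaced by their nearest
    neighbors in the robust set."""
    out = lr_image.copy()
    flat_refs = rnr_patches.reshape(len(rnr_patches), -1)     # (N, patch*patch)
    H, W = lr_image.shape
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            p = lr_image[i:i+patch, j:j+patch].reshape(-1)
            idx = np.argmin(((flat_refs - p) ** 2).sum(axis=1))
            out[i:i+patch, j:j+patch] = rnr_patches[idx]
    return out
```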

6.
IEEE Trans Image Process ; 30: 4008-4021, 2021.
Article in English | MEDLINE | ID: mdl-33784621

ABSTRACT

Accurate 3D reconstruction of hand and object shape from a hand-object image is important for understanding human-object interaction as well as human daily activities. Unlike bare hand pose estimation, hand-object interaction imposes strong constraints on both the hand and its manipulated object, which suggests that the hand configuration can provide crucial contextual information for the object, and vice versa. However, current approaches address this task by training a two-branch network to reconstruct the hand and object separately, with little communication between the two branches. In this work, we propose to consider the hand and object jointly in feature space and explore the reciprocity of the two branches. We extensively investigate cross-branch feature fusion architectures with MLP or LSTM units. Among the investigated architectures, a variant with LSTM units that enhances the object features with hand features yields the best performance gain. Moreover, we employ an auxiliary depth estimation module to augment the input RGB image with an estimated depth map, which further improves reconstruction accuracy. Experiments on public datasets demonstrate that our approach significantly outperforms existing approaches in terms of object reconstruction accuracy.
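A minimal sketch of a cross-branch LSTM fusion unit that enhances object features with hand features, as the abstract describes. It is not the authors' network; the feature dimension and single-step LSTMCell formulation are assumptions.

```python
# Illustrative sketch: refine the object-branch feature with hand-branch context
# through an LSTM cell.
import torch
import torch.nn as nn

class HandToObjectFusion(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, feat_dim)

    def forward(self, hand_feat, obj_feat):
        # hand_feat, obj_feat: (B, feat_dim) global features from the two branches.
        # The object feature initializes the hidden state; the hand feature is fed
        # as input, so the updated hidden state is an object feature refined by
        # hand context.
        h, c = self.cell(hand_feat, (obj_feat, torch.zeros_like(obj_feat)))
        return h
```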

7.
Article in English | MEDLINE | ID: mdl-32853150

ABSTRACT

In this paper, we present an end-to-end learning framework for detailed 3D face reconstruction from a single image. Our approach uses a 3DMM-based coarse model and a displacement map in UV space to represent a 3D face. Unlike previous work addressing this problem, our learning framework does not require supervision from surrogate ground-truth 3D models computed with traditional approaches; instead, we use the input image itself as supervision during learning. In the first stage, we combine a photometric loss and a facial perceptual loss between the input face and the rendered face to regress a 3DMM-based coarse model. In the second stage, both the input image and the regressed texture of the coarse model are unwrapped into UV space and then passed through an image-to-image translation network to predict a displacement map in UV space. The displacement map and the coarse model are used to render a final detailed face, which is again compared with the original input image to serve as a photometric loss for the second stage. The advantage of learning the displacement map in UV space is that face alignment is performed explicitly during unwrapping, so facial details are easier to learn from large amounts of data. Extensive experiments demonstrate the superiority of the proposed method over previous work.
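A minimal sketch of the first-stage photometric self-supervision described above; the perceptual term, which would typically use a pretrained face network, is omitted. It is not the authors' training code, and the face-mask weighting is an assumption.

```python
# Illustrative sketch: masked photometric loss between the rendered coarse face
# and the input image, used as self-supervision.
import torch

def photometric_loss(rendered, target, face_mask):
    """rendered, target: (B, 3, H, W) images in [0, 1]; face_mask: (B, 1, H, W)
    with 1 on rendered face pixels."""
    diff = (rendered - target).abs() * face_mask
    return diff.sum() / face_mask.sum().clamp(min=1.0)
```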

8.
IEEE Trans Image Process ; 23(12): 4996-5006, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25252282

ABSTRACT

The speed of an optical flow algorithm is crucial for many video editing tasks such as slow-motion synthesis, selection propagation, and tone adjustment propagation. Variational coarse-to-fine optical flow algorithms can generally produce high-quality results but cannot meet the speed requirements of many practical applications. Moreover, large motions in real-world videos pose a difficult problem for coarse-to-fine variational approaches. In this paper, we present a fast optical flow algorithm that can handle large displacement motions. Our algorithm is inspired by recent successes of local methods in visual correspondence search, as well as approximate nearest neighbor field algorithms. The main novelty is a fast randomized edge-preserving approximate nearest neighbor field algorithm that propagates self-similarity patterns in addition to offsets. Experimental results on public optical flow benchmarks show that our method is significantly faster than state-of-the-art methods without compromising quality, especially when scenes contain large motions. Finally, we show demo applications by applying our technique to real-world video editing tasks.
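A minimal sketch of the propagation step of a PatchMatch-style approximate nearest-neighbor field, the family of methods this fast flow algorithm builds on. The plain SSD patch cost is an assumption; the paper's edge-preserving weighting and self-similarity propagation are not reproduced here.

```python
# Illustrative sketch: one left-to-right, top-to-bottom propagation pass of a
# PatchMatch-style nearest-neighbor field.
import numpy as np

def propagate(a, b, offsets, p=4):
    """a, b: (H, W) grayscale images; offsets: (H, W, 2) integer field mapping
    patches of a into b; returns the field after one propagation pass."""
    H, W = offsets.shape[:2]

    def cost(y, x, oy, ox):
        by, bx = y + oy, x + ox
        if not (0 <= by <= H - p and 0 <= bx <= W - p):
            return np.inf
        return ((a[y:y+p, x:x+p] - b[by:by+p, bx:bx+p]) ** 2).sum()

    for y in range(H - p + 1):
        for x in range(W - p + 1):
            best = cost(y, x, *offsets[y, x])
            for ny, nx in ((y, x - 1), (y - 1, x)):       # left and top neighbors
                if ny < 0 or nx < 0:
                    continue
                c = cost(y, x, *offsets[ny, nx])          # reuse neighbor's offset
                if c < best:
                    best, offsets[y, x] = c, offsets[ny, nx]
    return offsets
```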

9.
IEEE Trans Image Process ; 23(2): 555-569, 2014 Feb.
Article in English | MEDLINE | ID: mdl-26270908

ABSTRACT

We present a new efficient edge-preserving filter, the "tree filter", to achieve strong image smoothing. The proposed filter can smooth out high-contrast details while preserving major edges, which is not achievable with bilateral-filter-like techniques. The tree filter is a weighted-average filter whose kernel is derived by viewing pixel affinity in a probabilistic framework that simultaneously considers pixel spatial distance, color/intensity difference, and connectedness. Pixel connectedness is obtained by treating pixels as nodes of a minimum spanning tree (MST) extracted from the image. The fact that the MST connects all image pixels endows the filter with the power to smooth out high-contrast, fine-scale details while preserving major image structures: pixels in a small isolated region are closely connected through the tree to the surrounding majority of pixels, while pixels inside a large homogeneous region are automatically pulled away from pixels outside the region. The tree filter can be decomposed into two other filters, both of which admit fast algorithms. We also propose an efficient linear-time MST extraction algorithm to further improve the overall filtering speed. These algorithms give the tree filter a great advantage in computational complexity (linear in the number of image pixels) and speed: it can process a 1-megapixel 8-bit image in about 0.25 s on an Intel 3.4 GHz Core i7 CPU (including the construction of the MST). The proposed tree filter is demonstrated on a variety of applications.
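A brute-force sketch of the tree-filter idea on a tiny image, not the paper's linear-time algorithm: build an MST over a 4-connected pixel graph with color-difference edge weights, then average pixels with weights exp(-D/sigma), where D is the path distance along the tree. The sigma value and grayscale input are assumptions.

```python
# Illustrative sketch: MST-based weighted-average filtering on a small image.
# All-pairs tree distances are computed directly, so this only scales to tiny inputs.
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path

def tree_filter(img, sigma=0.1):
    H, W = img.shape
    n = H * W
    graph = lil_matrix((n, n))
    for y in range(H):
        for x in range(W):
            i = y * W + x
            if x + 1 < W:                                   # right neighbor
                graph[i, i + 1] = abs(img[y, x] - img[y, x + 1]) + 1e-6
            if y + 1 < H:                                   # bottom neighbor
                graph[i, i + W] = abs(img[y, x] - img[y + 1, x]) + 1e-6
    mst = minimum_spanning_tree(graph)                      # sparse MST over pixels
    dist = shortest_path(mst, directed=False)               # path length along the tree
    weights = np.exp(-dist / sigma)                         # tree-distance affinity
    filtered = weights @ img.reshape(-1) / weights.sum(axis=1)
    return filtered.reshape(H, W)

# Example on a 16x16 image.
out = tree_filter(np.random.rand(16, 16).astype(np.float32), sigma=0.1)
```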
