Results 1 - 6 of 6
1.
Article in English | MEDLINE | ID: mdl-39190514

ABSTRACT

Video-to-video synthesis (Vid2Vid) achieves remarkable performance in generating a photo-realistic video from a sequence of semantic maps, such as segmentations, sketches, and poses. However, this pipeline suffers from high computational cost and long inference latency, mainly attributable to two factors: 1) network architecture parameters and 2) the sequential data stream. Recently, the parameters of image-based generative models have been significantly reduced via more efficient network architectures. However, existing methods mainly focus on slimming network architectures and ignore the size of the sequential data stream. Moreover, because it lacks temporal coherence, image-based compression is not sufficient for video synthesis. In this paper, we present a spatial-temporal hybrid distillation compression framework, Fast-Vid2Vid++, which focuses on knowledge distillation from the teacher network and on the data stream of generative models in both space and time. Fast-Vid2Vid++ makes the first attempt along the time dimension to transfer hierarchical features and temporal-coherence knowledge, reducing computational cost and accelerating inference. Specifically, we compress the data stream spatially and reduce its temporal redundancy. We distill the knowledge of the hierarchical features and the final response from the teacher network to the student network in the high-resolution and full-time domains, and we transfer the long-term dependencies of the features and video frames to the student model. After the proposed spatial-temporal hybrid knowledge distillation (Spatial-Temporal-HKD), our model can synthesize high-resolution key frames from the low-resolution data stream. Finally, Fast-Vid2Vid++ interpolates intermediate frames by motion compensation with little latency and generates full-length sequences with motion-aware inference (MAI). On standard benchmarks, Fast-Vid2Vid++ achieves real-time performance of 30-59 FPS and reduces computational cost by 28-35x on a single V100 GPU. Code and models are publicly available.
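As a rough illustration of the distillation objective described above (a minimal PyTorch sketch under assumed tensor shapes; the function name and equal loss weighting are placeholders, not the authors' released code), hierarchical-feature and final-response distillation could be combined as:

```python
import torch.nn.functional as F

def spatial_temporal_distillation(teacher_feats, student_feats, teacher_out, student_out):
    """Distill hierarchical features and the final response from teacher to student.

    teacher_feats / student_feats: lists of (N, C, H, W) feature maps from matching
    layers (matching channel counts are assumed here for brevity).
    teacher_out / student_out: final synthesized frames at full resolution.
    """
    feat_loss = 0.0
    for s, t in zip(student_feats, teacher_feats):
        # The student runs on a spatially compressed data stream, so upsample
        # its features to the teacher's (high-resolution) spatial size.
        if s.shape[-2:] != t.shape[-2:]:
            s = F.interpolate(s, size=t.shape[-2:], mode="bilinear", align_corners=False)
        feat_loss = feat_loss + F.mse_loss(s, t.detach())
    # Final-response distillation: match the teacher's synthesized frame.
    resp_loss = F.l1_loss(student_out, teacher_out.detach())
    return feat_loss + resp_loss
```

In practice a learned adapter (e.g. a 1x1 convolution) would be used when student and teacher channel counts differ; matching shapes are assumed here to keep the sketch short.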

2.
IEEE Trans Pattern Anal Mach Intell; 46(10): 6905-6918, 2024 Oct.
Article in English | MEDLINE | ID: mdl-38598389

ABSTRACT

Neural Radiance Fields (NeRF) have achieved substantial progress in novel view synthesis given multi-view images. Recently, some works have attempted to train a NeRF from a single image with 3D priors. They mainly focus on a limited field of view with few occlusions, which greatly limits their scalability to real-world 360-degree panoramic scenarios with large occlusions. In this paper, we present PERF, a 360-degree novel view synthesis framework that trains a panoramic neural radiance field from a single panorama. Notably, PERF allows 3D roaming in a complex scene without expensive and tedious image collection. To achieve this goal, we propose a novel collaborative RGBD inpainting method and a progressive inpainting-and-erasing method to lift a 360-degree 2D scene to 3D. Specifically, given a single panorama, we first predict a panoramic depth map as initialization and reconstruct the visible 3D regions with volume rendering. Then we introduce a collaborative RGBD inpainting approach, built on an RGB Stable Diffusion model and a monocular depth estimator, to complete the RGB images and depth maps of randomly sampled views during NeRF training. Finally, we introduce an inpainting-and-erasing strategy to avoid inconsistent geometry between a newly sampled view and the reference views. The two components are integrated into NeRF learning in a unified optimization framework and achieve promising results. Extensive experiments on Replica and a new dataset, PERF-in-the-wild, demonstrate the superiority of PERF over state-of-the-art methods. PERF can be widely used for real-world applications such as panorama-to-3D, text-to-3D, and 3D scene stylization.
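As a concrete starting point for the "reconstruct visible 3D regions" step (a numpy-only sketch; the equirectangular convention and axis ordering are assumptions, not necessarily PERF's implementation), a predicted panoramic depth map can be back-projected into a 3D point set like this:

```python
import numpy as np

def panorama_depth_to_points(depth, min_depth=1e-3):
    """Back-project an equirectangular depth map of shape (H, W) into 3D points.

    Columns map to longitude in [-pi, pi), rows to latitude in [pi/2, -pi/2];
    each depth value is the ray length from the panorama's camera center.
    """
    h, w = depth.shape
    theta = (np.arange(w) + 0.5) / w * 2.0 * np.pi - np.pi   # longitude per column
    phi = np.pi / 2.0 - (np.arange(h) + 0.5) / h * np.pi     # latitude per row
    theta, phi = np.meshgrid(theta, phi)                     # both (H, W)
    # Unit ray directions on the sphere, then scaled by depth.
    dirs = np.stack([np.cos(phi) * np.sin(theta),
                     np.sin(phi),
                     np.cos(phi) * np.cos(theta)], axis=-1)  # (H, W, 3)
    valid = depth > min_depth
    return (dirs * depth[..., None])[valid]                  # (N, 3) point cloud
```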

3.
IEEE Trans Pattern Anal Mach Intell; 45(12): 15562-15576, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37788193

ABSTRACT

In this work, we present SceneDreamer, an unconditional generative model for unbounded 3D scenes that synthesizes large-scale 3D landscapes from random noise. Our framework is learned solely from in-the-wild 2D image collections, without any 3D annotations. At the core of SceneDreamer is a principled learning paradigm comprising 1) an efficient yet expressive 3D scene representation, 2) a generative scene parameterization, and 3) an effective renderer that can leverage knowledge from 2D images. Our approach begins with an efficient bird's-eye-view (BEV) representation generated from simplex noise, which consists of a height field for surface elevation and a semantic field for detailed scene semantics. This BEV scene representation enables 1) representing a 3D scene with quadratic complexity, 2) disentangled geometry and semantics, and 3) efficient training. Moreover, we propose a novel generative neural hash grid to parameterize the latent space based on 3D positions and scene semantics, aiming to encode features that generalize across scenes. Lastly, a neural volumetric renderer, learned from 2D image collections through adversarial training, is employed to produce photorealistic images. Extensive experiments demonstrate the effectiveness of SceneDreamer and its superiority over state-of-the-art methods in generating vivid and diverse unbounded 3D worlds.
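A toy sketch of the BEV idea (numpy only, using simple fractal value noise rather than SceneDreamer's simplex noise; all names and thresholds are illustrative): a height field is sampled from layered noise and a coarse semantic field is derived from it.

```python
import numpy as np

def _interp_noise(rng, res, size):
    """Bilinearly upsample a (res+1, res+1) random grid to (size, size)."""
    coarse = rng.random((res + 1, res + 1))
    xs = np.linspace(0.0, res, size)
    i0 = np.minimum(np.floor(xs).astype(int), res - 1)
    t = xs - i0
    rows = coarse[i0] * (1 - t)[:, None] + coarse[i0 + 1] * t[:, None]  # blend rows
    return rows[:, i0] * (1 - t) + rows[:, i0 + 1] * t                  # blend columns

def bev_scene(size=256, octaves=4, n_classes=4, seed=0):
    """Toy BEV scene: a height field plus a semantic field derived from it."""
    rng = np.random.default_rng(seed)
    height = np.zeros((size, size))
    for o in range(octaves):
        height += _interp_noise(rng, 2 ** (o + 2), size) / 2 ** o  # layered octaves
    height /= height.max()
    # Coarse semantic labels by elevation (e.g. water / sand / grass / rock).
    semantics = np.digitize(height, np.linspace(0.0, 1.0, n_classes + 1)[1:-1])
    return height, semantics
```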

4.
IEEE Trans Neural Netw Learn Syst; 33(10): 5401-5415, 2022 Oct.
Article in English | MEDLINE | ID: mdl-33872158

ABSTRACT

Current state-of-the-art visual recognition systems usually rely on the following pipeline: 1) pretraining a neural network on a large-scale dataset (e.g., ImageNet) and 2) finetuning the network weights on a smaller, task-specific dataset. Such a pipeline assumes that weight adaptation alone can transfer the network's capability from one domain to another, under the strong assumption that a fixed architecture is appropriate for all domains. However, each domain, with its distinct recognition target, may need different levels or paths of the feature hierarchy, where some neurons become redundant and others are reactivated to form new network structures. In this work, we show that dynamically adapting the network architecture to each domain task, along with weight finetuning, benefits both efficiency and effectiveness compared to the existing image recognition pipeline, which tunes only the weights regardless of the architecture. Our method generalizes easily to an unsupervised paradigm by replacing supernet training with self-supervised learning on the source domain tasks and performing linear evaluation on the downstream tasks, which further improves search efficiency. Moreover, we provide principled and empirical analysis of why our approach works by investigating the ineffectiveness of existing neural architecture search methods; we find that preserving the joint distribution of the network architecture and weights is important. This analysis benefits not only image recognition but also the design of neural networks in general. Experiments on five representative image recognition tasks, namely person re-identification, age estimation, gender recognition, image classification, and unsupervised domain adaptation, demonstrate the effectiveness of our method.
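A minimal sketch of adapting architecture and weights jointly (PyTorch, a DARTS-style softmax relaxation over candidate operations; an illustrative stand-in, not the paper's actual search space or algorithm):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedBlock(nn.Module):
    """One supernet block: a softmax-weighted mix of candidate operations.

    Architecture parameters (alpha) and operation weights are finetuned jointly
    on the target domain, instead of finetuning weights under a fixed topology.
    """

    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),                   # regular conv
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depthwise conv
            nn.Identity(),                                                 # skip (prunes the block)
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```

After adaptation on the target domain, the operation with the largest alpha in each block can be retained to derive a compact, domain-specific architecture.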


Subject(s)
Neural Networks, Computer; Recognition, Psychology; Humans; Neurons
5.
IEEE Trans Image Process; 30: 6392-6407, 2021.
Article in English | MEDLINE | ID: mdl-34197322

ABSTRACT

RGB-infrared (RGB-IR) cross-modality person re-identification (re-ID) is attracting increasing attention owing to the demands of 24-hour scene surveillance. However, the high cost of labeling person identities in an RGB-IR dataset largely limits the scalability of supervised models in real-world scenarios. In this paper, we study the unsupervised RGB-IR person re-ID problem (uRGB-IR re-ID), in which no identity annotations are available in the RGB-IR cross-modality dataset. Considering that intra-modality (i.e., RGB-RGB or IR-IR) re-ID is much easier than cross-modality re-ID and can provide shared knowledge for RGB-IR re-ID, we propose a two-stage method, homogeneous-to-heterogeneous learning, to solve uRGB-IR re-ID. In the first stage, an unsupervised self-learning method learns an intra-modality feature representation and generates pseudo-labeled identities for the person images of each modality separately. In the second stage, heterogeneous learning learns a shared discriminative feature representation by distilling knowledge from the intra-modality pseudo-labels, aligning the two modalities via a modality-based consistent learning module, and targeting modality-invariant learning via a pseudo-labeled positive-instance selection module. With homogeneous-to-heterogeneous learning, the proposed unsupervised framework greatly reduces the modality gap and thus learns a feature representation that is robust across the RGB and infrared modalities, leading to promising accuracy. We also propose a novel cross-modality re-ranking approach, consisting of a self-modality search and a cycle-modality search, tailored to uRGB-IR re-ID. Unlike conventional re-ranking, the proposed method incorporates a modality-based constraint and can therefore select more reliable nearest neighbors, which greatly improves uRGB-IR re-ID. Experimental results demonstrate the superiority of our approach on the SYSU-MM01 and RegDB datasets.
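For the first stage, intra-modality pseudo-identities are commonly obtained by clustering each modality's features separately; a minimal sketch with scikit-learn's DBSCAN (the clustering choice and hyperparameters are assumptions, not necessarily the paper's self-learning procedure):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

def intra_modality_pseudo_labels(features, eps=0.5, min_samples=4):
    """Cluster one modality's features into pseudo identities.

    Returns one label per image; -1 marks outliers that are usually dropped.
    """
    feats = normalize(np.asarray(features))  # unit-norm features -> cosine-like distances
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)

# Stage one runs per modality on features from the current encoder (random
# placeholders here); stage two then distills the two pseudo-label sets into
# one shared cross-modality model.
rgb_labels = intra_modality_pseudo_labels(np.random.rand(1000, 128))
ir_labels = intra_modality_pseudo_labels(np.random.rand(1000, 128))
```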

6.
IEEE Trans Neural Netw Learn Syst; 32(5): 2142-2156, 2021 May.
Article in English | MEDLINE | ID: mdl-32663130

ABSTRACT

Person re-identification (Re-ID) benefits greatly from the accurate annotations of existing datasets (e.g., CUHK03 and Market-1501), which are quite expensive because each image has to be assigned a proper label. In this work, we ease the annotation of Re-ID by replacing accurate, image-level annotation with inexact, bag-level annotation: we group the images into bags by time and assign a bag-level label to each bag. This greatly reduces the annotation effort and enables the creation of a large-scale Re-ID benchmark called SYSU-30k. The new benchmark contains 30k individuals, about 20 times more than CUHK03 (1.3k individuals) and Market-1501 (1.5k individuals), and 30 times more than ImageNet (1k categories), with 29,606,918 images in total. Learning a Re-ID model with bag-level annotation is the weakly supervised Re-ID problem. To solve it, we introduce a differentiable graphical model that captures the dependencies among all images in a bag and generates a reliable pseudo-label for each person image; the pseudo-labels then supervise the learning of the Re-ID model. Compared with fully supervised Re-ID models, our method achieves state-of-the-art performance on SYSU-30k and other datasets. The code, dataset, and pretrained model will be available at https://github.com/wanggrun/SYSU-30k.
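A minimal sketch of bag-level supervision (PyTorch, a simple multiple-instance-style loss; an illustrative stand-in for, not a reproduction of, the paper's differentiable graphical model):

```python
import torch
import torch.nn.functional as F

def bag_level_loss(image_logits, bag_label_ids):
    """Weak supervision from a bag-level label set (a simple MIL-style stand-in).

    image_logits: (N, num_ids) per-image identity logits for the N images in one bag.
    bag_label_ids: 1-D LongTensor of identity ids annotated for this bag.
    """
    # Average per-image identity probabilities into one bag-level distribution
    # (computed in log space for numerical stability).
    log_probs = F.log_softmax(image_logits, dim=1)                # (N, num_ids)
    bag_log_prob = torch.logsumexp(log_probs, dim=0) - torch.log(
        torch.tensor(float(image_logits.shape[0])))               # (num_ids,)
    # Encourage the bag-level prediction to cover every identity labeled for the bag.
    return -bag_log_prob[bag_label_ids].mean()
```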
