Search | VHL Regional Portal

1.

Learning Degradation-Robust Spatiotemporal Frequency-Transformer for Video Super-Resolution.

Qiu, Zhongwei; Yang, Huan; Fu, Jianlong; Liu, Daochang; Xu, Chang; Fu, Dongmei.

IEEE Trans Pattern Anal Mach Intell ; 45(12): 14888-14904, 2023 Dec.

Article in English | MEDLINE | ID: mdl-37669199

ABSTRACT

Video Super-Resolution (VSR) aims to restore high-resolution (HR) videos from low-resolution (LR) videos. Existing VSR techniques usually recover HR frames by extracting pertinent textures from nearby frames with known degradation processes. Despite significant progress, grand challenges remain to effectively extract and transmit high-quality textures from high-degraded low-quality sequences, such as blur, additive noises, and compression artifacts. This work proposes a novel degradation-robust Frequency-Transformer (FTVSR++) for handling low-quality videos that carry out self-attention in a combined space-time-frequency domain. First, video frames are split into patches and each patch is transformed into spectral maps in which each channel represents a frequency band. It permits a fine-grained self-attention on each frequency band so that real visual texture can be distinguished from artifacts. Second, a novel dual frequency attention (DFA) mechanism is proposed to capture the global and local frequency relations, which can handle different complicated degradation processes in real-world scenarios. Third, we explore different self-attention schemes for video processing in the frequency domain and discover that a "divided attention" which conducts joint space-frequency attention before applying temporal-frequency attention, leads to the best video enhancement quality. Extensive experiments on three widely-used VSR datasets show that FTVSR++ outperforms state-of-the-art methods on different low-quality videos with clear visual margins.

2.

4D LUT: Learnable Context-Aware 4D Lookup Table for Image Enhancement.

Liu, Chengxu; Yang, Huan; Fu, Jianlong; Qian, Xueming.

IEEE Trans Image Process ; 32: 4742-4756, 2023.

Article in English | MEDLINE | ID: mdl-37607133

ABSTRACT

Image enhancement aims at improving the aesthetic visual quality of photos by retouching the color and tone, and is an essential technology for professional digital photography. Recent years deep learning-based image enhancement algorithms have achieved promising performance and attracted increasing popularity. However, typical efforts attempt to construct a uniform enhancer for all pixels' color transformation. It ignores the pixel differences between different content (e.g., sky, ocean, etc.) that are significant for photographs, causing unsatisfactory results. In this paper, we propose a novel learnable context-aware 4-dimensional lookup table (4D LUT), which achieves content-dependent enhancement of different contents in each image via adaptively learning of photo context. In particular, we first introduce a lightweight context encoder and a parameter encoder to learn a context map for the pixel-level category and a group of image-adaptive coefficients, respectively. Then, the context-aware 4D LUT is generated by integrating multiple basis 4D LUTs via the coefficients. Finally, the enhanced image can be obtained by feeding the source image and context map into fused context-aware 4D LUT via quadrilinear interpolation. Compared with traditional 3D LUT, i.e., RGB mapping to RGB, which is usually used in camera imaging pipeline systems or tools, 4D LUT, i.e., RGBC(RGB+Context) mapping to RGB, enables finer control of color transformations for pixels with different content in each image, even though they have the same RGB values. Experimental results demonstrate that our method outperforms other state-of-the-art methods in widely-used benchmarks.

3.

TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation.

Liu, Chengxu; Yang, Huan; Fu, Jianlong; Qian, Xueming.

IEEE Trans Image Process ; 32: 4728-4741, 2023.

Article in English | MEDLINE | ID: mdl-37566503

ABSTRACT

Video frame interpolation (VFI) aims to synthesize an intermediate frame between two consecutive frames. State-of-the-art approaches usually adopt a two-step solution, which includes 1) generating locally-warped pixels by calculating the optical flow based on pre-defined motion patterns (e.g., uniform motion, symmetric motion), 2) blending the warped pixels to form a full frame through deep neural synthesis networks. However, for various complicated motions (e.g., non-uniform motion, turn around), such improper assumptions about pre-defined motion patterns introduce the inconsistent warping from the two consecutive frames. This leads to the warped features for new frames are usually not aligned, yielding distortion and blur, especially when large and complex motions occur. To solve this issue, in this paper we propose a novel Trajectory-aware Transformer for Video Frame Interpolation (TTVFI). In particular, we formulate the warped features with inconsistent motions as query tokens, and formulate relevant regions in a motion trajectory from two original consecutive frames into keys and values. Self-attention is learned on relevant tokens along the trajectory to blend the pristine features into intermediate frames through end-to-end training. Experimental results demonstrate that our method outperforms other state-of-the-art methods in four widely-used VFI benchmarks. Both code and pre-trained models will be released at https://github.com/ChengxuLiu/TTVFI.

4.

Aggregated Contextual Transformations for High-Resolution Image Inpainting.

Zeng, Yanhong; Fu, Jianlong; Chao, Hongyang; Guo, Baining.

IEEE Trans Vis Comput Graph ; 29(7): 3266-3280, 2023 Jul.

Article in English | MEDLINE | ID: mdl-35254985

ABSTRACT

Image inpainting that completes large free-form missing regions in images is a promising yet challenging task. State-of-the-art approaches have achieved significant progress by taking advantage of generative adversarial networks (GAN). However, these approaches can suffer from generating distorted structures and blurry textures in high-resolution images (e.g., 512×512). The challenges mainly drive from (1) image content reasoning from distant contexts, and (2) fine-grained texture synthesis for a large missing region. To overcome these two challenges, we propose an enhanced GAN-based model, named Aggregated COntextual-Transformation GAN (AOT-GAN), for high-resolution image inpainting. Specifically, to enhance context reasoning, we construct the generator of AOT-GAN by stacking multiple layers of a proposed AOT block. The AOT blocks aggregate contextual transformations from various receptive fields, allowing to capture both informative distant image contexts and rich patterns of interest for context reasoning. For improving texture synthesis, we enhance the discriminator of AOT-GAN by training it with a tailored mask-prediction task. Such a training objective forces the discriminator to distinguish the detailed appearances of real and synthesized patches, and in turn facilitates the generator to synthesize clear textures. Extensive comparisons on Places2, the most challenging benchmark with 1.8 million high-resolution images of 365 complex scenes, show that our model outperforms the state-of-the-art. A user study including more than 30 subjects further validates the superiority of AOT-GAN. We further evaluate the proposed AOT-GAN in practical applications, e.g., logo removal, face editing, and object removal. Results show that our model achieves promising completions in the real world. We release codes and models in https://github.com/researchmm/AOT-GAN-for-Inpainting.

5.

Cyclic Differentiable Architecture Search.

Yu, Hongyuan; Peng, Houwen; Huang, Yan; Fu, Jianlong; Du, Hao; Wang, Liang; Ling, Haibin.

IEEE Trans Pattern Anal Mach Intell ; 45(1): 211-228, 2023 Jan.

Article in English | MEDLINE | ID: mdl-35196225

ABSTRACT

Differentiable ARchiTecture Search, i.e., DARTS, has drawn great attention in neural architecture search. It tries to find the optimal architecture in a shallow search network and then measures its performance in a deep evaluation network. The independent optimization of the search and evaluation networks, however, leaves a room for potential improvement by allowing interaction between the two networks. To address the problematic optimization issue, we propose new joint optimization objectives and a novel Cyclic Differentiable ARchiTecture Search framework, dubbed CDARTS. Considering the structure difference, CDARTS builds a cyclic feedback mechanism between the search and evaluation networks with introspective distillation. First, the search network generates an initial architecture for evaluation, and the weights of the evaluation network are optimized. Second, the architecture weights in the search network are further optimized by the label supervision in classification, as well as the regularization from the evaluation network through feature distillation. Repeating the above cycle results in a joint optimization of the search and evaluation networks and thus enables the evolution of the architecture to fit the final evaluation network. The experiments and analysis on CIFAR, ImageNet and NATS-Bench [95] demonstrate the effectiveness of the proposed approach over the state-of-the-art ones. Specifically, in the DARTS search space, we achieve 97.52% top-1 accuracy on CIFAR10 and 76.3% top-1 accuracy on ImageNet. In the chain-structured search space, we achieve 78.2% top-1 accuracy on ImageNet, which is 1.1% higher than EfficientNet-B0. Our code and models are publicly available at https://github.com/microsoft/Cream.

6.

Multi-Scale 2D Temporal Adjacency Networks for Moment Localization With Natural Language.

Zhang, Songyang; Peng, Houwen; Fu, Jianlong; Lu, Yijuan; Luo, Jiebo.

IEEE Trans Pattern Anal Mach Intell ; 44(12): 9073-9087, 2022 Dec.

Article in English | MEDLINE | ID: mdl-34665720

ABSTRACT

We address the problem of retrieving a specific moment from an untrimmed video by natural language. It is a challenging problem because a target moment may take place in the context of other temporal moments in the untrimmed video. Existing methods cannot tackle this challenge well since they do not fully consider the temporal contexts between temporal moments. In this paper, we model the temporal context between video moments by a set of predefined two-dimensional maps under different temporal scales. For each map, one dimension indicates the starting time of a moment and the other indicates the duration. These 2D temporal maps can cover diverse video moments with different lengths, while representing their adjacent contexts at different temporal scales. Based on the 2D temporal maps, we propose a Multi-Scale Temporal Adjacency Network (MS-2D-TAN), a single-shot framework for moment localization. It is capable of encoding the adjacent temporal contexts at each scale, while learning discriminative features for matching video moments with referring expressions. We evaluate the proposed MS-2D-TAN on three challenging benchmarks, i.e., Charades-STA, ActivityNet Captions, and TACoS, where our MS-2D-TAN outperforms the state of the art.

7.

Reference-Based Defect Detection Network.

Zeng, Zhaoyang; Liu, Bei; Fu, Jianlong; Chao, Hongyang.

IEEE Trans Image Process ; 30: 6637-6647, 2021.

Article in English | MEDLINE | ID: mdl-34280100

ABSTRACT

The defect detection task can be regarded as a realistic scenario of object detection in the computer vision field and it is widely used in the industrial field. Directly applying vanilla object detector to defect detection task can achieve promising results, while there still exists challenging issues that have not been solved. The first issue is the texture shift which means a trained defect detector model will be easily affected by unseen texture, and the second issue is partial visual confusion which indicates that a partial defect box is visually similar with a complete box. To tackle these two problems, we propose a Reference-based Defect Detection Network (RDDN). Specifically, we introduce template reference and context reference to against those two problems, respectively. Template reference can reduce the texture shift from image, feature or region levels, and encourage the detectors to focus more on the defective area as a result. We can use either well-aligned template images or the outputs of a pseudo template generator as template references in this work, and they are jointly trained with detectors by the supervision of normal samples. To solve the partial visual confusion issue, we propose to leverage the carried context information of context reference, which is the concentric bigger box of each region proposal, to perform more accurate region classification and regression. Experiments on two defect detection datasets demonstrate the effectiveness of our proposed approach.

8.

Revisiting Anchor Mechanisms for Temporal Action Localization.

Yang, Le; Peng, Houwen; Zhang, Dingwen; Fu, Jianlong; Han, Junwei.

IEEE Trans Image Process ; PP2020 Aug 19.

Article in English | MEDLINE | ID: mdl-32813656

ABSTRACT

Most of the current action localization methods follow an anchor-based pipeline: depicting action instances by pre-defined anchors, learning to select the anchors closest to the ground truth, and predicting the confidence of anchors with refinements. Pre-defined anchors set prior about the location and duration for action instances, which facilitates the localization for common action instances but limits the flexibility for tackling action instances with drastic varieties, especially for extremely short or extremely long ones. To address this problem, this paper proposes a novel anchor-free action localization module that assists action localization by temporal points. Specifically, this module represents an action instance as a point with its distances to the starting boundary and ending boundary, alleviating the pre-defined anchor restrictions in terms of action localization and duration. The proposed anchor-free module is capable of predicting the action instances whose duration is either extremely short or extremely long. By combining the proposed anchor-free module with a conventional anchor-based module, we propose a novel action localization framework, called A2Net. The cooperation between anchor-free and anchor-based modules achieves superior performance to the state-of-the-art on THUMOS14 (45.5% vs. 42.8%). Furthermore, comprehensive experiments demonstrate the complementarity between the anchor-free and the anchor-based module, making A2Net simple but effective.

9.

Learning Rich Part Hierarchies with Progressive Attention Networks for Fine-Grained Image Recognition.

Zheng, Heliang; Fu, Jianlong; Zha, Zheng-Jun; Luo, Jiebo; Mei, Tao.

IEEE Trans Image Process ; 2019 Jun 14.

Article in English | MEDLINE | ID: mdl-31217107

ABSTRACT

We investigate the localization of subtle yet discriminative parts for fine-grained image recognition. Based on the observation that such parts typically exist within a hierarchical structure (e.g., from a coarse-scale "head" to a fine-scale "eye" when recognizing bird species), we propose a novel progressive-attention convolutional neural network (PA-CNN) to progressively localize parts at multiple scales. The PA-CNN localizes parts in two steps, where a part proposal network (PPN) generates multiple local attention maps, and a part rectification network (PRN) learns part-specific features from each proposal and provides the PPN with refined part locations. This coupling of the PPN and PRN allows them to be optimized in a mutually reinforcing manner, leading to improved pinpointing of fine-grained parts. Moreover, the convolutional parameters for a PPN at a finer scale can be inherited from the PRN at a coarser scale, enabling a rich part hierarchy (e.g., eye and beak in a bird's head) to be learned in a stacked fashion. Case studies show that PA-CNN can precisely identify parts without using bounding box/part annotations. In addition, quantitative evaluations demonstrate that PA-CNN yields state-of-the-art performance in three challenging fine-grained recognition tasks. i.e., CUB-200-2011, FGVC-Aircraft, and Stanford Cars.

10.

Microstructural change of degummed Bombyx mori silk: an in situ stretching wide-angle X-ray-scattering study.

Liang, Ku; Gong, Yu; Fu, Jianlong; Yan, Shi; Tan, Yuanyuan; Du, Rong; Xing, Xueqing; Mo, Guang; Chen, Zhongjun; Cai, Quan; Sun, Dongbai; Wu, Zhonghua.

Int J Biol Macromol ; 57: 99-104, 2013 Jun.

Article in English | MEDLINE | ID: mdl-23466498

ABSTRACT

The microstructural change of degummed Bombyx mori silk was examined by in situ wide-angle X-ray-scattering (WAXS) with applied stretching force. WAXS patterns confirmed that the crystalline and amorphous regions coexist in the silk fibers. The crystallites with ß-sheet structure have an orthorhombic unit cell with lattice parameters: a=9.10 Å, b=9.71 Å and c=6.80 Å. The crystallite size, crystallite orientation and crystallinity were also estimated based on the WAXS patterns. The results demonstrate that the crystallite size is almost unchanged with the stretching strain. The crystallinity is approximately linearly increasing with the applied stretching force. However, the change of the unit-cell orientation degree with c-axis along the fiber axis behaves as a fast stage and an approximately unchanged stage during the in situ stretching process. All these experimental phenomena confirm that the microstructure of the degummed silk fibers can be well explained by the model of oriented ß-sheet structure with a banded feature.

Subject(s)

Bombyx/chemistry , Models, Molecular , Silk/chemistry , Animals , Protein Structure, Secondary , Scattering, Radiation , X-Rays

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL