Results 1 - 15 of 15
1.
Article in English | MEDLINE | ID: mdl-38662566

ABSTRACT

Video Coding for Machines (VCM) aims to compress visual signals for machine analysis. However, existing methods consider only a few machines and neglect the majority. Moreover, machines' perceptual characteristics are not leveraged effectively, resulting in suboptimal compression efficiency. To overcome these limitations, this paper introduces the Satisfied Machine Ratio (SMR), a metric that statistically evaluates the perceptual quality of compressed images and videos for machines by aggregating satisfaction scores across them. Each score is derived from a machine's perceptual difference between the original and compressed images. Targeting image classification and object detection tasks, we build two representative machine libraries for SMR annotation and create a large-scale SMR dataset to facilitate SMR studies. We then propose an SMR prediction model based on the correlation between deep feature differences and SMR. Furthermore, we introduce an auxiliary task that increases prediction accuracy by predicting the SMR difference between two images of different quality. Extensive experiments demonstrate that SMR models significantly improve compression performance for machines and generalize robustly to unseen machines, codecs, datasets, and frame types. Code is available at https://github.com/ywwynm/SMR.
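A minimal sketch of the core SMR idea, assuming cosine similarity between deep features as the per-machine satisfaction score and a fixed satisfaction threshold; both choices are illustrative stand-ins, not the paper's exact definitions:

```python
import numpy as np

def satisfaction_score(feat_orig: np.ndarray, feat_comp: np.ndarray) -> float:
    """Per-machine score from the perceptual (deep feature) difference
    between the original and compressed image; cosine similarity is one
    plausible stand-in for the paper's score."""
    num = float(np.dot(feat_orig, feat_comp))
    den = float(np.linalg.norm(feat_orig) * np.linalg.norm(feat_comp)) + 1e-12
    return num / den

def smr(feats_orig, feats_comp, threshold=0.95):
    """Fraction of machines in the library satisfied by this compressed
    image: the Satisfied Machine Ratio."""
    scores = [satisfaction_score(fo, fc)
              for fo, fc in zip(feats_orig, feats_comp)]
    return float(np.mean([s >= threshold for s in scores]))
```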

2.
IEEE Trans Image Process ; 32: 5478-5493, 2023.
Article in English | MEDLINE | ID: mdl-37782618

ABSTRACT

Learned image compression methods have achieved satisfactory results in recent years. However, existing methods are typically designed for the RGB format and are not well suited to the YUV420 format because of the differences between the two formats. In this paper, we propose an information-guided compression framework with a cross-component attention mechanism that achieves efficient image compression in the YUV420 format. Specifically, we design a dual-branch advanced information-preserving module (AIPM) based on an information-guided unit (IGU) and an attention mechanism. On the one hand, the dual-branch architecture prevents changes in the original data distribution and avoids information disturbance between components, while the feature attention block (FAB) preserves the important information. On the other hand, the IGU efficiently exploits the correlations between the Y and UV components, further preserving the UV information under the guidance of Y. Furthermore, we design an adaptive cross-channel enhancement module (ACEM) that reconstructs details by exploiting the relations between components, using the reconstructed Y as textural and structural guidance for the UV components. Extensive experiments show that the proposed framework achieves state-of-the-art performance in image compression for the YUV420 format. More importantly, it outperforms Versatile Video Coding (VVC) with an 8.37% BD-rate reduction on common test conditions (CTC) sequences on average. In addition, we propose a quantization scheme for the context model that requires no retraining, overcomes the cross-platform decoding errors caused by floating-point operations in the context model, and provides a reference approach for deploying neural codecs on different platforms.
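A minimal PyTorch sketch of luma-guided chroma processing in the spirit of the IGU; the layer shapes, channel counts, and names are assumptions, not the paper's architecture. In YUV420 the Y plane has twice the UV resolution, so a stride-2 convolution aligns the guidance with the chroma branch:

```python
import torch
import torch.nn as nn

class YGuidedUV(nn.Module):
    """Fuse luma guidance into the chroma branch (IGU-like, simplified)."""
    def __init__(self, ch: int = 32):
        super().__init__()
        # Y is full resolution in YUV420; a stride-2 conv aligns it with UV.
        self.y_down = nn.Conv2d(1, ch, kernel_size=3, stride=2, padding=1)
        self.uv_in = nn.Conv2d(2, ch, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)

    def forward(self, y: torch.Tensor, uv: torch.Tensor) -> torch.Tensor:
        g = torch.relu(self.y_down(y))    # guidance features from Y
        f = torch.relu(self.uv_in(uv))    # chroma features
        return self.fuse(torch.cat([g, f], dim=1))

# Expected shapes for YUV420: y is (N, 1, H, W), uv is (N, 2, H // 2, W // 2).
```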

3.
IEEE Trans Image Process ; 31: 7222-7236, 2022.
Article in English | MEDLINE | ID: mdl-36374881

ABSTRACT

In-loop filters have attracted increasing attention due to their remarkable noise-reduction capability in the hybrid video coding framework. However, the existing in-loop filters in Versatile Video Coding (VVC) mainly exploit local image similarity. Although some non-local in-loop filters can make up for this shortcoming, the unsupervised parameter estimation widely used by non-local filters limits their performance. In view of this, we propose a deformable Wiener filter (DWF), which combines local and non-local characteristics and trains the filter coefficients in a supervised manner based on Wiener filter theory. In the filtering process, local adjacent samples and non-local similar samples are first derived for each sample of interest. The to-be-filtered samples are then classified into specific groups based on patch-level noise and sample-level characteristics, and samples in each group share the same filter coefficients. After that, the local and non-local reference samples are adaptively fused based on the classification results. Finally, a filtering operation with outlier data constraints is conducted for each to-be-filtered sample. Moreover, the performance of the proposed DWF is analyzed in detail with different reference sample derivation schemes. Simulation results show that the proposed approach achieves 1.16%, 1.92%, and 2.67% bit-rate savings on average compared to VTM-11.0 for the All Intra, Random Access, and Low-Delay B configurations, respectively.
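A toy sketch of the supervised coefficient training step, assuming the paper's sample classification and fusion are already done: each to-be-filtered sample contributes a row of stacked local and non-local reference samples, and the Wiener coefficients for a group are the least-squares solution against the original samples. Variable names are illustrative:

```python
import numpy as np

def train_wiener_coeffs(refs: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Solve refs @ w ~= targets in the least-squares sense (the matrix
    form of the Wiener-Hopf equations). refs: (num_samples, num_taps)
    stacked local + non-local reference samples per to-be-filtered sample;
    targets: the original (pre-compression) sample values."""
    w, *_ = np.linalg.lstsq(refs, targets, rcond=None)
    return w

def apply_wiener(refs: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Filter each sample as a weighted sum of its reference samples."""
    return refs @ w
```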

4.
Article in English | MEDLINE | ID: mdl-35930518

ABSTRACT

As a highly ill-posed problem, single-image super-resolution (SISR) has been widely investigated in recent years. The main task of SISR is to recover the information lost during degradation. According to Nyquist sampling theory, the degradation introduces aliasing, which makes it hard to restore correct textures from low-resolution (LR) images. In practice, there are correlations and self-similarities among adjacent patches in natural images. This article exploits this self-similarity and proposes a hierarchical image super-resolution network (HSRNet) to suppress the influence of aliasing. We treat SISR from an optimization perspective and propose an iterative solution based on the half-quadratic splitting (HQS) method. To explore texture with a local image prior, we design a hierarchical exploration block (HEB) that progressively increases the receptive field. Furthermore, a multilevel spatial attention (MSA) mechanism is devised to capture the relations among adjacent features and enhance the high-frequency information, which plays a crucial role in visual quality. Experimental results show that HSRNet achieves better quantitative and visual performance than other works and suppresses aliasing more effectively.
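A schematic HQS loop for SISR, assuming a simple bilinear downsampling degradation and a Gaussian blur as a stand-in for the learned prior; both are placeholders for the network modules in the paper, shown only to make the alternating structure concrete:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def hqs_sr(y: np.ndarray, scale: int = 2, iters: int = 5,
           step: float = 1.0, mu: float = 0.1) -> np.ndarray:
    """Alternate a data-fidelity gradient step with a prior (denoising)
    step, the two sub-problems produced by half-quadratic splitting."""
    x = zoom(y, scale, order=1)                       # initial upsampling
    for _ in range(iters):
        resid = zoom(x, 1.0 / scale, order=1) - y     # gradient of ||D(x)-y||^2
        x = x - step * zoom(resid, scale, order=1)    # data step
        x = (x + mu * gaussian_filter(x, sigma=0.8)) / (1.0 + mu)  # prior step
    return x
```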

5.
IEEE Trans Image Process ; 31: 2824-2838, 2022.
Article in English | MEDLINE | ID: mdl-35349440

ABSTRACT

In the latest video coding standard, Versatile Video Coding (VVC), more directional intra modes and reference lines are utilized to improve prediction efficiency. However, complex content still cannot be predicted well with only the adjacent reference samples. Although nonlocal prediction has been proposed to further improve prediction efficiency in existing algorithms, explicit signalling or matching error potentially limits the coding efficiency. To address these issues, we propose a joint local and nonlocal progressive prediction scheme that improves nonlocal prediction accuracy without additional signalling. Specifically, template matching based prediction (TMP) is first conducted to derive an initial nonlocal predictor. Based on this first prediction and previously decoded reconstruction information, a local template, including inner textures and neighboring reconstruction, is carefully designed. With the local template involved in the nonlocal matching process, a more accurate nonlocal predictor can be found progressively in the second prediction. Finally, the coefficients from the two predictions are fused and transmitted in the bitstream. In this way, a more accurate nonlocal predictor is derived implicitly from local information instead of being explicitly signalled. Experimental results on the VVC reference software VTM-9.0 show that the method achieves 1.02% BD-rate reduction for natural sequences and 2.31% BD-rate reduction for screen content videos under the all intra (AI) configuration.
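A simplified template-matching prediction sketch, assuming the current block's L-shaped template of previously reconstructed samples is compared by SAD against candidate templates in the causal area; the causal-area handling here is deliberately coarse (rows strictly above the current block), unlike a real codec:

```python
import numpy as np

def tmp_predict(recon, bx, by, bs, tw=2, search=16):
    """recon: causally reconstructed frame; (bx, by): current block origin
    (both >= tw); bs: block size; tw: template thickness."""
    def template(x, y):
        top = recon[y-tw:y, x-tw:x+bs]      # top strip, incl. corner
        left = recon[y:y+bs, x-tw:x]        # left strip
        return np.concatenate([top.ravel(), left.ravel()]).astype(np.int64)

    cur = template(bx, by)
    best, best_cost = None, np.inf
    for cy in range(max(tw, by - search), by):                  # causal rows
        for cx in range(max(tw, bx - search),
                        min(recon.shape[1] - bs, bx + search)):
            cost = np.abs(template(cx, cy) - cur).sum()         # template SAD
            if cost < best_cost:
                best_cost, best = cost, recon[cy:cy+bs, cx:cx+bs]
    return best   # block under the best-matching template: nonlocal predictor
```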

6.
IEEE Trans Image Process ; 31: 1298-1310, 2022.
Article in English | MEDLINE | ID: mdl-35015635

ABSTRACT

Transform coding removes redundancy by de-correlating the residual data, which plays a crucial role in video coding. Since different transform cores adapt to different video content and residual data, multiple-core transforms have been proposed to improve coding performance. However, the overhead of representing the transform type is inevitable. This paper studies the statistical characteristics of transform coefficients in depth. Theoretical analysis shows that the implicit method in the Implicitly Selected Transform (IST) is superior to explicit signalling. Accordingly, we propose a parity adjustment scheme that seamlessly cooperates with rate-distortion optimized quantization for IST. Furthermore, we propose two combination methods, size-based and number-based, to optimize IST, and apply a restriction region to reduce complexity. Experimental results on HPM-6.0, the reference software of the AVS3 video coding standard, show that the proposed method achieves 1.76% and 0.76% BD-rate savings on average under the AI and RA configurations, respectively, with negligible decoding time variation. Comparisons with the explicit signalling method show that the proposed method achieves better coding gain. Our method has been adopted in AVS3.
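An illustrative parity-based implicit signalling sketch: the decoder infers the transform choice from the parity of the sum of absolute quantized coefficients, so the encoder nudges one coefficient to force the desired parity. This shows the general idea of implicit selection, not the exact IST rule or its RD-aware adjustment:

```python
import numpy as np

def embed_transform_flag(coeffs: np.ndarray, flag: int) -> np.ndarray:
    """Force the parity of the absolute coefficient sum to encode `flag`.
    coeffs: integer quantized coefficients of one block."""
    coeffs = coeffs.copy()
    if int(np.abs(coeffs).sum()) % 2 != flag:
        nz = np.flatnonzero(coeffs)
        # nudge the smallest nonzero coefficient by one level (or create one)
        idx = nz[np.argmin(np.abs(coeffs.flat[nz]))] if nz.size else coeffs.size - 1
        coeffs.flat[idx] += 1 if coeffs.flat[idx] >= 0 else -1
    return coeffs

def infer_transform_flag(coeffs: np.ndarray) -> int:
    """Decoder side: recover the transform choice with zero signalled bits."""
    return int(np.abs(coeffs).sum()) % 2
```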

7.
IEEE Trans Image Process ; 31: 30-42, 2022.
Article in English | MEDLINE | ID: mdl-34793298

ABSTRACT

Geometric partitioning has attracted increasing attention due to its remarkable motion field description capability in the hybrid video coding framework. However, the existing geometric partitioning (GEO) scheme in Versatile Video Coding (VVC) incurs a non-negligible burden for signaling the side information, which limits coding efficiency. In view of this, we propose a spatio-temporal correlation guided geometric partitioning (STGEO) scheme to efficiently describe object information in the motion field. The proposed method economizes the bits consumed by side information signaling, including the partitioning mode and motion information. We first analyze the characteristics of partitioning mode decision and motion vector selection in a statistically sound way. Based on the observed spatio-temporal correlation, we design a mode prediction and coding method to reduce the overhead of representing this side information. The main idea is to predict the STGEO modes and motion candidates with higher selection probabilities and use the prediction to guide entropy coding, i.e., representing the predicted high-probability modes and motion candidates with fewer bits. In particular, high-probability STGEO modes are predicted from the edge information and the history modes of adjacent STGEO-coded blocks. The corresponding motion information is represented by an index into a merge candidate list, which is adaptively inferred from off-line trained merge candidate selection probabilities. Simulation results show that the proposed approach achieves 0.95% and 1.98% bit-rate savings on average compared to VTM-8.0 without GEO for the Random Access and Low-Delay B configurations, respectively.
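A small sketch of the bit-saving mechanism: ranking modes by predicted probability and giving likelier modes shorter truncated-unary codewords lowers the expected signalling cost relative to a fixed-length code. The probabilities are placeholders for the edge/history-based predictions in the paper:

```python
import math

# Placeholder predicted probabilities for five candidate modes (the paper
# derives these from edge information and neighbours' history modes).
probs = {0: 0.45, 1: 0.25, 2: 0.15, 3: 0.10, 4: 0.05}

ranked = sorted(probs, key=probs.get, reverse=True)          # likely first
tu_len = {m: min(r + 1, len(ranked) - 1) for r, m in enumerate(ranked)}
expected = sum(probs[m] * tu_len[m] for m in probs)          # 2.00 bins
fixed = math.ceil(math.log2(len(probs)))                     # 3 bits
print(f"expected truncated-unary bins: {expected:.2f} vs fixed: {fixed}")
```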

8.
IEEE Trans Image Process ; 30: 7305-7316, 2021.
Article in English | MEDLINE | ID: mdl-34403346

ABSTRACT

Cross-component linear model (CCLM) prediction has repeatedly proven effective in reducing inter-channel redundancies in video compression. Essentially, the linear model is trained identically at both encoder and decoder using accessible luma and chroma reference samples, which raises operational complexity due to the least-squares regression or max-min based model parameter derivation. In this paper, we investigate the capability of the linear model in the context of sub-sampling based cross-component correlation mining, as a means of significantly reducing the operational burden and facilitating hardware and software design for both encoder and decoder. In particular, the sub-sampling ratios and positions are elaborately designed by exploiting the spatial correlation and the inter-channel correlation. Extensive experiments verify that the proposed method is characterized by operational simplicity and robust rate-distortion performance, leading to its adoption in the Versatile Video Coding (VVC) standard and the third generation of the Audio Video Coding Standard (AVS3).
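A simplified max-min CCLM derivation as a sketch: fit chroma = a * luma + b from sub-sampled reference pairs using only the min and max luma points, avoiding least-squares regression. The sub-sampling step and the single min/max pair are illustrative; VVC's actual rule differs in detail:

```python
import numpy as np

def cclm_params(luma_ref: np.ndarray, chroma_ref: np.ndarray, step: int = 2):
    """Derive chroma = a * luma + b from the min/max luma reference points,
    after sub-sampling the reference samples to cut the derivation cost."""
    l, c = luma_ref[::step], chroma_ref[::step]
    i_min, i_max = int(np.argmin(l)), int(np.argmax(l))
    denom = float(l[i_max]) - float(l[i_min])
    a = (float(c[i_max]) - float(c[i_min])) / denom if denom else 0.0
    b = float(c[i_min]) - a * float(l[i_min])
    return a, b

def cclm_predict(luma_block: np.ndarray, a: float, b: float) -> np.ndarray:
    """Predict the chroma block from the co-located reconstructed luma."""
    return a * luma_block + b
```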

9.
IEEE Trans Image Process ; 30: 2422-2435, 2021.
Article in English | MEDLINE | ID: mdl-33493117

ABSTRACT

Human pose transfer (HPT) is an emerging research topic with huge potential in fashion design, media production, online advertising and virtual reality. For these applications, the visual realism of fine-grained appearance details is crucial for production quality and user engagement, yet existing HPT methods often suffer from three fundamental issues: detail deficiency, content ambiguity and style inconsistency, which severely degrade the visual quality and realism of generated images. Aiming at real-world applications, we develop a more challenging yet practical HPT setting, termed Fine-grained Human Pose Transfer (FHPT), with a stronger focus on semantic fidelity and detail replenishment. Concretely, we analyze the potential design flaws of existing methods via an illustrative example, and establish the core FHPT methodology by combining content synthesis and feature transfer in a mutually guided fashion. We then substantiate the proposed methodology with a Detail Replenishing Network (DRN) and a corresponding coarse-to-fine model training scheme. Moreover, we build a complete suite of fine-grained evaluation protocols to address the challenges of FHPT comprehensively, including semantic analysis, structural detection and perceptual quality assessment. Extensive experiments on the DeepFashion benchmark dataset verify the advantages of the proposed method over state-of-the-art works, with a 12%-14% gain on top-10 retrieval recall, 5% higher joint localization accuracy, and a nearly 40% gain on face identity preservation. Our codes, models and evaluation tools will be released at https://github.com/Lotayou/RATE.


Subjects
Image Processing, Computer-Assisted/methods, Machine Learning, Posture/physiology, Algorithms, Female, Humans, Male
10.
Article in English | MEDLINE | ID: mdl-32011252

ABSTRACT

In recent years, supervised deep learning methods have shown great promise in dense depth estimation; however, massive high-quality training data are expensive and impractical to acquire. Alternatively, self-supervised depth estimators can learn the latent transformation from monocular or binocular video sequences by minimizing the photometric warp error between consecutive frames, but they suffer from the scale ambiguity problem or have difficulty estimating precise pose changes between frames. In this paper, we propose a joint self-supervised deep learning pipeline for depth and ego-motion estimation that exploits adversarial learning and joint optimization with spatial-temporal geometric constraints. The stereo reconstruction error provides the spatial geometric constraint needed to estimate depth at absolute scale. Meanwhile, the absolute-scale depth map and a pre-trained pose network serve as a good starting point for direct visual odometry (DVO). DVO optimization based on spatial geometric constraints yields fine-grained ego-motion estimation and provides additional backpropagation signals to the depth estimation network. Finally, the spatially and temporally reconstructed views are concatenated, and an iterative coupling optimization process is implemented in combination with adversarial learning for accurate depth and precise ego-motion estimation. Experimental results show superior performance compared with state-of-the-art methods for monocular depth and ego-motion estimation on the KITTI dataset, as well as strong generalization ability of the proposed approach.
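A compact sketch of the photometric warp error such methods minimize: back-project pixels with the predicted depth, transform by the predicted pose (R, t), re-project into the source frame, and compare intensities. The intrinsics K, depth, and pose are assumed inputs; nearest-neighbour sampling keeps the sketch short where real pipelines use differentiable bilinear sampling:

```python
import numpy as np

def photometric_error(target, source, depth, K, R, t):
    """Mean absolute intensity error between the target frame and the
    source frame warped by depth and relative pose."""
    H, W = target.shape
    K_inv = np.linalg.inv(K)
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])   # 3 x HW homogeneous
    cam = (K_inv @ pix) * depth.ravel()                      # back-project
    proj = K @ (R @ cam + t[:, None])                        # re-project
    us = np.round(proj[0] / proj[2]).astype(int)
    vs = np.round(proj[1] / proj[2]).astype(int)
    valid = (us >= 0) & (us < W) & (vs >= 0) & (vs < H) & (proj[2] > 0)
    err = np.abs(target.ravel()[valid] - source[vs[valid], us[valid]])
    return err.mean() if err.size else 0.0
```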

11.
IEEE Trans Image Process ; 28(10): 4832-4844, 2019 Oct.
Article in English | MEDLINE | ID: mdl-31059444

ABSTRACT

In this paper, we propose an efficient inter prediction scheme by introducing a deep virtual reference frame (VRF), which serves as a better reference for temporal redundancy removal in video coding. In particular, a high-quality VRF is generated from two reconstructed bi-directional frames with a deep learning based frame rate up-conversion (FRUC) algorithm and subsequently incorporated into the reference list as a high-quality reference. Moreover, to alleviate the compression artifacts of the VRF, we develop a convolutional neural network (CNN) based enhancement model to further improve its quality. To facilitate better utilization of the VRF, a CTU-level coding mode termed direct virtual reference frame (DVRF) is devised, which achieves a better trade-off between compression performance and complexity. The proposed scheme is integrated into the HM-16.6 and JEM-7.1 software platforms, and simulation results under the random access (RA) configuration demonstrate the significant superiority of the proposed method. When adding the VRF to the RPS, more than 6% average BD-rate gain is achieved for HEVC test sequences on HM-16.6, and 0.8% BD-rate gain is observed on JEM-7.1. With the DVRF mode, 3.6% bitrate saving is achieved on HM-16.6 with the computational complexity effectively reduced.
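A toy VRF construction to show where it plugs in: interpolate a middle frame from two reconstructed frames by symmetric block matching, then treat the result as an extra reference. This classical FRUC stand-in only illustrates the pipeline; the paper uses a deep FRUC network plus CNN enhancement:

```python
import numpy as np

def make_vrf(prev_f: np.ndarray, next_f: np.ndarray,
             bs: int = 8, search: int = 4) -> np.ndarray:
    """Bidirectionally interpolate a virtual frame between two
    reconstructed frames via symmetric block matching."""
    H, W = prev_f.shape
    vrf = np.zeros((H, W), dtype=np.float64)
    for y in range(0, H - bs + 1, bs):
        for x in range(0, W - bs + 1, bs):
            best_cost, best = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y0, x0, y1, x1 = y - dy, x - dx, y + dy, x + dx  # symmetric
                    if (min(y0, x0, y1, x1) < 0 or max(y0, y1) + bs > H
                            or max(x0, x1) + bs > W):
                        continue
                    cost = np.abs(prev_f[y0:y0+bs, x0:x0+bs].astype(np.int64)
                                  - next_f[y1:y1+bs, x1:x1+bs]).sum()
                    if cost < best_cost:
                        best_cost, best = cost, (dy, dx)
            dy, dx = best
            vrf[y:y+bs, x:x+bs] = 0.5 * (
                prev_f[y-dy:y-dy+bs, x-dx:x-dx+bs].astype(np.float64)
                + next_f[y+dy:y+dy+bs, x+dx:x+dx+bs])
    return vrf

# Hypothetical usage: prepend the VRF to the reference picture list.
# reference_list.insert(0, make_vrf(rec_prev, rec_next))
```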

12.
Article in English | MEDLINE | ID: mdl-30998463

ABSTRACT

Motion compensation has been widely employed to remove temporal redundancies in the typical hybrid video coding framework. Popular video compression standards, such as H.264/AVC and HEVC, adopt the block based partitioning model to describe the motion field due to its high compression efficiency and relatively low computational complexity. However, block based motion compensation may not align with actual object motion boundaries, potentially limiting compression efficiency. In view of this, we propose a three-zone segmentation based motion compensation scheme to improve the description accuracy of the motion field as well as the coding efficiency. In particular, the segmentation information is implied in the reference frame instead of being explicitly signalled. Based on the segmentation information, three motion compensation zones are identified: an edge zone, a foreground zone, and a background zone. The foreground zone is motion compensated by the signalled motion vector of the block, and the background zone is motion compensated by motion information implicitly derived from the local motion field. The edge zone is viewed as an overlapped area, and a weighted compensation strategy is adopted for it. The proposed algorithm is implemented in the reference software VTM-1.0 of Versatile Video Coding (VVC), and simulation results show that it achieves 1.14% and 1.06% bitrate savings for the random access and low-delay configurations, respectively.
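A schematic three-zone compensation sketch: segment the reference block into foreground and background (a mean threshold stands in for the paper's implicit derivation from the reference frame), take the boundary band as the edge zone, compensate each zone with its own motion, and blend the edge zone with equal weights:

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def three_zone_compensate(ref, bx, by, bs, mv_fg, mv_bg):
    """Compensate one block with separate foreground/background motion and
    weighted blending in the edge zone. Motion vectors must keep the
    shifted blocks inside the reference frame."""
    block = ref[by:by+bs, bx:bx+bs]
    fg = block > block.mean()                         # placeholder segmentation
    edge = binary_dilation(fg) & ~binary_erosion(fg)  # boundary band

    def shift(mv):
        dy, dx = mv
        return ref[by+dy:by+dy+bs, bx+dx:bx+dx+bs].astype(np.float64)

    pred_fg, pred_bg = shift(mv_fg), shift(mv_bg)     # per-zone compensation
    pred = np.where(fg, pred_fg, pred_bg)
    pred[edge] = 0.5 * (pred_fg[edge] + pred_bg[edge])  # overlapped blending
    return pred
```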

13.
IEEE Trans Image Process ; 28(7): 3343-3356, 2019 Jul.
Article in English | MEDLINE | ID: mdl-30714920

ABSTRACT

Recently, convolutional neural networks (CNNs) have attracted tremendous attention and achieved great success in many image processing tasks. In this paper, we combine CNN-based image restoration with video coding and propose a content-aware CNN based in-loop filter for High Efficiency Video Coding (HEVC). In particular, we quantitatively analyze the structure of the proposed CNN model from multiple dimensions to make the model interpretable and optimal for CNN-based loop filtering. More specifically, each coding tree unit (CTU) is treated as an independent region for processing, such that the proposed content-aware multimodel filtering mechanism is realized by restoring different regions with different CNN models under the guidance of a discriminative network. To adapt to the image content, the discriminative network learns to analyze the content characteristics of each region and adaptively select the deep learning model to apply. CTU-level control is also enabled in the sense of rate-distortion optimization. To learn the CNN model, an iterative training method is proposed that simultaneously labels filter categories at the CTU level and fine-tunes the CNN model parameters. The CNN based in-loop filter is placed after sample adaptive offset in HEVC, and extensive experiments show that the proposed approach significantly improves coding performance, achieving up to 10.0% bit-rate reduction. On average, 4.1%, 6.0%, 4.7%, and 6.0% bit-rate reductions are obtained under the all intra, low delay, low delay P, and random access configurations, respectively.
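A sketch of the per-CTU control flow: a discriminative scorer picks one of several restoration models for each CTU, and an RD check can still switch filtering off for that CTU. All three callables below are toy stand-ins for the trained networks and the RD decision, included only to make the selection logic concrete:

```python
import numpy as np

# Stand-ins for the trained components (illustrative, not the paper's models):
models = [lambda r: r,                        # "CNN model 0": pass-through
          lambda r: 0.5 * (r + r.mean())]     # "CNN model 1": toy smoother
select = lambda r: int(r.std() > 10.0)        # toy discriminative network
rd_gain = lambda before, after: 1.0           # toy CTU-level RD check

def filter_frame(frame: np.ndarray, ctu: int = 64) -> np.ndarray:
    """Apply the selected restoration model independently per CTU."""
    out = frame.astype(np.float64).copy()
    H, W = frame.shape
    for y in range(0, H, ctu):
        for x in range(0, W, ctu):
            region = out[y:y+ctu, x:x+ctu]
            restored = models[select(region)](region)   # model choice per CTU
            if rd_gain(region, restored) > 0:           # CTU-level on/off
                out[y:y+ctu, x:x+ctu] = restored
    return out
```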

14.
IEEE Trans Image Process ; 27(10): 4987-5001, 2018 Oct.
Article in English | MEDLINE | ID: mdl-29985138

ABSTRACT

Transform and quantization account for a considerable share of the computation time in the video encoding process, yet a large number of discrete cosine transform coefficients are ultimately quantized to zero. In essence, blocks whose quantized coefficients are all zero transmit no information but still occupy substantial computational resources. As such, detecting all-zero blocks (AZBs) before transform and quantization is recognized as an efficient way to speed up encoding. Instead of considering hard-decision quantization (HDQ) only, in this paper we incorporate the properties of soft-decision quantization into AZB detection. In particular, we categorize AZBs into genuine AZBs (G-AZBs) and pseudo AZBs (P-AZBs) to distinguish their origins. For G-AZBs, which arise directly from HDQ, a sum of absolute transformed differences based approach is adopted for early termination. For classifying P-AZBs, which arise in the sense of rate-distortion optimization, rate-distortion models established from the transform coefficients are employed jointly with adaptive searching of the maximum transform coefficient. Experimental results show that our algorithm achieves up to 24.16% transform and quantization time savings with less than 0.06% RD performance loss. The total encoder time saving is about 5.18% on average, with a maximum of 9.12%. Moreover, the detection accuracy for larger TU sizes reaches 95% on average.
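A simplified early-termination test in the spirit of the G-AZB path: if the sum of absolute transformed differences (SATD) of a residual block falls below a threshold tied to the quantization step, every coefficient will quantize to zero and the transform/quantization stage can be skipped. The threshold form and margin are illustrative:

```python
import numpy as np
from scipy.linalg import hadamard

def is_all_zero_block(residual: np.ndarray, qstep: float,
                      margin: float = 1.0) -> bool:
    """residual: n x n block with n a power of two (Hadamard requirement).
    Returns True when transform + quantization can safely be skipped."""
    n = residual.shape[0]
    Hd = hadamard(n)
    satd = np.abs(Hd @ residual @ Hd.T).sum() / n   # Hadamard-domain SATD
    return satd < margin * qstep * n * n            # illustrative threshold
```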

15.
IEEE Trans Image Process ; 26(8): 3802-3816, 2017 Aug.
Article in English | MEDLINE | ID: mdl-28500003

ABSTRACT

Rate-distortion optimized quantization (RDOQ) is an efficient encoder optimization method that plays an important role in improving the rate-distortion (RD) performance of High Efficiency Video Coding (HEVC) codecs. However, the superior performance of RDOQ comes at the expense of high computational complexity: a two-stage RD minimization that determines the optimal quantized level among the available candidates for each transform coefficient and then determines the best quantized coefficients for each transform unit with the minimum total cost, softly optimizing the quantized coefficients. To reduce the computational cost of the RDOQ algorithm in HEVC, we propose a low-complexity RDOQ scheme that models the statistics of the transform coefficients with a hybrid Laplace distribution. Specifically designed block-level rate and distortion models are established based on the coefficient distribution, so the optimal quantization levels can be determined directly by optimizing the RD performance of the whole block, and the complicated RD cost calculations can be avoided entirely. Extensive experimental results show that, with about 0.3%-0.4% RD performance degradation, the proposed low-complexity RDOQ algorithm reduces quantization time by around 70% and total encoding time by up to 17% compared with the original RDOQ implementation in HEVC on average.
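A toy per-coefficient level decision with a Laplacian rate model, to make the D + lambda*R trade-off concrete: under a Laplacian coefficient distribution the negative log-probability, and hence the estimated rate, grows roughly linearly with the level, so the best level among a few candidates can be picked without full RD cost evaluation. The rate model, candidate set, and parameters are illustrative, not the paper's block-level models:

```python
import math

def rdoq_level(coeff: float, qstep: float, lam: float, b: float) -> int:
    """Pick the quantized level minimizing D + lambda * R.
    b: Laplacian scale (diversity) of the transform coefficients."""
    base = int(abs(coeff) / qstep)                    # hard-decision level
    best_l, best_cost = 0, float("inf")
    for l in {max(base - 1, 0), base, base + 1}:      # candidate levels
        dist = (abs(coeff) - l * qstep) ** 2
        # -log2 P(level) under a Laplacian: rate grows linearly with level
        rate = 1.0 if l == 0 else 1.0 + l * qstep / (b * math.log(2))
        cost = dist + lam * rate
        if cost < best_cost:
            best_cost, best_l = cost, l
    return int(math.copysign(best_l, coeff)) if best_l else 0
```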
