1.
IEEE Trans Image Process; 33: 2074-2089, 2024.
Article in English | MEDLINE | ID: mdl-38470584

ABSTRACT

Recently, attempts to learn the underlying 3D structure of a scene from monocular videos in a fully self-supervised fashion have drawn much attention. One of the most challenging aspects of this task is handling independently moving objects, as they break the rigid-scene assumption. In this paper, we show for the first time that pixel positional information can be exploited to learn single view depth estimation (SVDE) from videos. The proposed moving object (MO) masks, which are induced by the depth variance under shifted positional information (SPI) and are referred to as 'SPIMO' masks, are highly robust and consistently remove independently moving objects from the scenes, enabling reliable learning of SVDE from videos. Additionally, we introduce a new adaptive quantization scheme that assigns the best per-pixel quantization curve for depth discretization, improving the fine granularity and accuracy of the final aggregated depth maps. Finally, we employ existing boosting techniques in a new way that further self-supervises the depths of moving objects. With these features, our pipeline is robust against moving objects and generalizes well to high-resolution images, even when trained with small patches, yielding state-of-the-art (SOTA) results with four- to eight-fold fewer parameters than previous SOTA techniques that learn from videos. We present extensive experiments on KITTI and CityScapes that show the effectiveness of our method.
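To make the SPIMO idea concrete, here is a minimal sketch, not the authors' code: if a depth network is run several times with shifted positional inputs, pixels on independently moving objects tend to show high depth variance across runs and can be masked out. The function name, threshold, and normalization below are illustrative assumptions.

```python
# Hypothetical sketch of a SPIMO-style mask: flag pixels whose predicted
# depth varies strongly across K runs with shifted positional information.
import numpy as np

def spimo_mask(depth_preds, tau=0.05):
    """depth_preds: (K, H, W) depths from K shifted-positional-input runs.
    tau is an illustrative threshold, not a value from the paper."""
    mean_d = np.mean(depth_preds, axis=0)
    rel_var = np.var(depth_preds, axis=0) / (mean_d ** 2 + 1e-8)
    return rel_var > tau  # True = likely independently moving pixel

# toy usage: a mostly static scene with one high-variance region
preds = 1.0 + 0.01 * np.random.rand(4, 8, 8)
preds[:, 2:4, 2:4] += np.random.rand(4, 2, 2)
print(spimo_mask(preds).sum(), "pixels flagged as moving")
```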

2.
IEEE Trans Pattern Anal Mach Intell; 44(12): 9131-9149, 2022 Dec.
Article in English | MEDLINE | ID: mdl-34727025

ABSTRACT

We propose a novel two-stage training strategy with ambiguity boosting for the self-supervised learning of single view depths from stereo images. The first stage obtains a coarse depth prior by training an auto-encoder network on a stereoscopic view synthesis task. This prior knowledge is then boosted and used to self-supervise the model in the second stage of training through our novel ambiguity boosting loss. The ambiguity boosting loss is a confidence-guided data-augmentation loss that improves the accuracy and consistency of generated depth maps under several transformations of the single-image input. To show the benefits of the proposed two-stage training strategy with boosting, our two previous depth estimation (DE) networks, one with t-shaped adaptive kernels and the other with exponential disparity volumes, are extended with the new learning strategy and referred to as DBoosterNet-t and DBoosterNet-e, respectively. Our self-supervised DBoosterNets are competitive with, and in some cases surpass, the most recent supervised SOTA methods, and are remarkably superior to previous self-supervised methods for monocular DE on the challenging KITTI dataset. We present extensive experimental results showing the efficacy of our method for the self-supervised monocular DE task.
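As a rough illustration of a confidence-guided consistency loss in this spirit (a sketch under assumptions, not DBoosterNet itself), one can weight the disagreement between depths predicted from an input and its horizontally flipped version by a per-pixel confidence map derived from the stage-1 prior:

```python
# Hedged sketch: confidence-weighted left-right-flip consistency loss.
import torch

def ambiguity_boost_loss(net, img, conf):
    """img: (B,3,H,W); conf: (B,1,H,W) in [0,1], e.g. agreement with the
    coarse stage-1 depth prior. The weighting scheme is an assumption."""
    d = net(img)
    d_flip = torch.flip(net(torch.flip(img, dims=[3])), dims=[3])
    return (conf * (d - d_flip).abs()).mean()

net = torch.nn.Conv2d(3, 1, 3, padding=1)  # stand-in for a depth network
img, conf = torch.rand(2, 3, 32, 32), torch.rand(2, 1, 32, 32)
print(ambiguity_boost_loss(net, img, conf).item())
```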

3.
IEEE Trans Image Process; 28(12): 5839-5851, 2019 Dec.
Article in English | MEDLINE | ID: mdl-30802861

ABSTRACT

Joint exploration model (JEM) reference codecs of ISO/IEC and ITU-T utilize multiple types of integer transforms based on the DCT and DST in various transform sizes for intra- and inter-predictive coding, which has brought a significant improvement in coding efficiency. JEM adopts three types of integer DCTs (DCT-II, DCT-V, and DCT-VIII) and two types of integer DSTs (DST-I and DST-VII). Fast computations of the integer DCT-II and DST-I are well known, but few studies have addressed the other types, such as DCT-V, DCT-VIII, and DST-VII, for all transform sizes. In this paper, we present fast computation methods for the N-point DCT-V and DCT-VIII. For this, we first decompose the DCT-VIII into a preprocessing matrix, the DST-VII, and a post-processing matrix, exploiting the linear relation between the DCT-VIII and the DST-VII to compute it quickly. Then, we approximate integer kernels of N = 4, 8, 16, and 32 for DCT-V, DCT-VIII, and DST-VII with norm scaling and bit-shifts so as to be compatible with quantization at each stage of the multiplications between the decomposed matrices for video coding. In various experiments, the proposed fast computation methods are shown to effectively reduce the total complexity of the matrix operations with little loss in BD-BR performance. In particular, our methods reduce the number of addition and multiplication operations by 38% and 80.3%, respectively, on average, compared with the original JEM.
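The linear relation underlying the decomposition can be checked numerically: with the standard real DST-VII and DCT-VIII definitions, the DCT-VIII matrix equals a sign-alternating diagonal matrix times the DST-VII matrix times an order-reversal matrix, so a fast DST-VII yields the DCT-VIII almost for free. The sketch below verifies this for the floating-point (non-integer) transforms:

```python
# Verify: DCT-VIII = diag((-1)^k) @ DST-VII @ reversal, for real transforms.
import numpy as np

def dst7(N):
    k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.sqrt(4 / (2 * N + 1)) * np.sin(np.pi * (2 * k + 1) * (n + 1) / (2 * N + 1))

def dct8(N):
    k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.sqrt(4 / (2 * N + 1)) * np.cos(np.pi * (2 * k + 1) * (2 * n + 1) / (4 * N + 2))

N = 8
D = np.diag((-1.0) ** np.arange(N))  # sign-alternation (preprocessing)
J = np.eye(N)[::-1]                  # input-order reversal (post-processing)
print(np.allclose(dct8(N), D @ dst7(N) @ J))  # True
```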

4.
IEEE Trans Image Process; 27(12): 5918-5932, 2018 Dec.
Article in English | MEDLINE | ID: mdl-30072323

ABSTRACT

We present a novel and effective learning-based frame rate up-conversion (FRUC) scheme using linear mapping. The proposed scheme consists of: 1) a new hierarchical extended bilateral motion estimation (HEBME) method; 2) a light-weight motion deblur (LWMD) method; and 3) a synthesis-based motion-compensated frame interpolation (S-MCFI) method. First, the HEBME method considerably enhances the accuracy of motion estimation (ME), which leads to a significant improvement in FRUC performance. It consists of two ME pyramids with a three-layered hierarchy, where motion vectors (MVs) are searched in a coarse-to-fine manner in each pyramid. The found MVs are further refined at four-fold enhanced resolution by jointly combining the MVs from the two pyramids. The HEBME method employs a new, elaborate matching criterion for precise ME that effectively combines a bilateral absolute difference, an edge variance, pixel variances, and an MV difference between two consecutive blocks and their neighboring blocks. Second, the LWMD method uses the MVs found by the HEBME method and removes small motion blurs in the original frames via linear-mapping transformations. Third, the S-MCFI method generates the interpolated frames by applying linear mapping kernels to the deblurred original frames. Consequently, our FRUC scheme precisely generates interpolated frames based on the HEBME for accurate ME, the S-MCFI for elaborate frame interpolation, and the LWMD for motion deblurring. The experimental results show that our FRUC scheme significantly outperforms state-of-the-art non-deep-learning-based schemes, by 1.42 dB on average in peak signal-to-noise ratio, and shows performance comparable to a state-of-the-art deep-learning-based scheme.
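As a rough illustration of bilateral ME (a single-layer toy, nothing like the full three-layer HEBME; the block size, search range, and SAD-only criterion are assumptions), one can search a symmetric motion vector per block of the intermediate frame and average the two matched patches:

```python
# Toy bilateral motion-compensated interpolation: for each block of the
# middle frame, find a symmetric MV minimizing the bilateral absolute
# difference between the two anchor frames, then average matched patches.
import numpy as np

def interpolate_midframe(f0, f1, block=8, search=4):
    H, W = f0.shape
    out = np.zeros_like(f0)
    for y in range(0, H, block):
        for x in range(0, W, block):
            best, best_mv = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y0, x0, y1, x1 = y - dy, x - dx, y + dy, x + dx
                    if min(y0, x0, y1, x1) < 0:
                        continue  # candidate trajectory leaves the frame
                    if max(y0, y1) + block > H or max(x0, x1) + block > W:
                        continue
                    sad = np.abs(f0[y0:y0 + block, x0:x0 + block]
                                 - f1[y1:y1 + block, x1:x1 + block]).sum()
                    if sad < best:
                        best, best_mv = sad, (dy, dx)
            dy, dx = best_mv
            out[y:y + block, x:x + block] = 0.5 * (
                f0[y - dy:y - dy + block, x - dx:x - dx + block]
                + f1[y + dy:y + dy + block, x + dx:x + dx + block])
    return out

f0 = np.random.rand(32, 32)
mid = interpolate_midframe(f0, np.roll(f0, 2, axis=1))  # toy 2-pixel pan
```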

5.
IEEE Trans Image Process; 27(7): 3178-3193, 2018 Jul.
Article in English | MEDLINE | ID: mdl-29641399

ABSTRACT

Conventional predictive video coding approaches are reaching the limit of their potential coding efficiency improvements because of severely increasing computational complexity. As an alternative, perceptual video coding (PVC) attempts to achieve high coding efficiency by eliminating perceptual redundancy, using just-noticeable-distortion (JND) directed coding. Previous JND models were built by adding white Gaussian noise or specific signal patterns to the original images, which is not appropriate for finding JND thresholds because coding distortion reduces signal energy. In this paper, we present a novel discrete cosine transform-based energy-reduced JND model, called ERJND, that is more suitable for JND-based PVC schemes. The proposed ERJND model is then extended to two learning-based just-noticeable-quantization-distortion (JNQD) models that can be applied as preprocessing for perceptual video coding. The two JNQD models automatically adjust JND levels based on given quantization step sizes. The first, called LR-JNQD, is based on linear regression and determines the JNQD model parameters from extracted handcrafted features. The second is based on a convolutional neural network (CNN) and is called CNN-JNQD. To the best of our knowledge, ours is the first approach to automatically adjust JND levels according to quantization step sizes when preprocessing the input to video encoders. In experiments, both the LR-JNQD and CNN-JNQD models were applied to high efficiency video coding (HEVC) and yielded maximum (average) bitrate reductions of 38.51% (10.38%) and 67.88% (24.91%), respectively, with little subjective video quality degradation compared with encoding the unpreprocessed input.
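A minimal sketch of JND-guided preprocessing under stated assumptions (the threshold model below is a placeholder; in the paper the JND level is predicted from the quantization step size by linear regression or a CNN): DCT coefficients whose magnitude falls below the threshold are zeroed before encoding, so the encoder spends no bits on imperceptible detail.

```python
# Hedged sketch: suppress DCT coefficients below a quantization-step-aware
# JND threshold before handing the block to the encoder.
import numpy as np

def dct_mat(N):
    k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    M = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    M[0] /= np.sqrt(2.0)
    return M  # orthonormal DCT-II matrix

def jnqd_preprocess(block, q_step, base_jnd=2.0, alpha=0.5):
    """base_jnd and alpha are illustrative placeholders, not learned values."""
    M = dct_mat(block.shape[0])
    coef = M @ block @ M.T
    coef[np.abs(coef) < base_jnd + alpha * q_step] = 0.0  # sub-JND energy out
    return M.T @ coef @ M

blk = 255.0 * np.random.rand(8, 8)
print(np.abs(jnqd_preprocess(blk, q_step=10.0) - blk).max())
```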

6.
IEEE Trans Image Process; 26(3): 1300-1314, 2017 Mar.
Article in English | MEDLINE | ID: mdl-28092557

ABSTRACT

Super-resolution (SR) has become increasingly important because of its capability to generate high-quality ultra-high-definition (UHD) high-resolution (HR) images from low-resolution (LR) input images. Conventional SR methods entail high computational complexity, which makes them difficult to implement for up-scaling full-high-definition input images to UHD resolution. Our previous super-interpolation (SI) method showed a good compromise between peak signal-to-noise ratio (PSNR) performance and computational complexity, but because SI utilizes only simple linear mappings, it may fail to precisely reconstruct HR patches with complex texture. In this paper, we present a novel SR method that inherits the large-to-small patch conversion scheme from SI but uses global regression based on local linear mappings (GLM); our new SR method is therefore called GLM-SI. In GLM-SI, each LR input patch is divided into 25 overlapped subpatches. Based on the local properties of these subpatches, 25 different local linear mappings are applied to the current LR input patch to generate 25 HR patch candidates, which are then regressed into one final HR patch by a global regressor. The local linear mappings are learned cluster-wise in an off-line training phase. The main contribution of this paper is as follows: previous linear-mapping-based SR methods, including SI, applied only one simple yet coarse linear mapping to each patch to reconstruct its HR version. In contrast, for each LR input patch, GLM-SI is the first to apply a combination of multiple local linear mappings, where each local linear mapping is chosen according to the local properties of the current LR patch. It can therefore better approximate the nonlinear LR-to-HR mappings of HR patches with complex texture. Experimental results show that the proposed GLM-SI method outperforms most state-of-the-art methods, and shows comparable PSNR performance with much lower computational complexity than a super-resolution method based on convolutional neural networks (SRCNN15). Compared with the previous SI method, which is limited to a scale factor of 2, GLM-SI is on average 0.79 dB higher in PSNR and can be used for scale factors of 3 or higher.
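A shape-level sketch of the per-patch computation (the patch dimensions, mapping count, and uniform global weights are placeholders; the paper selects the 25 local mappings cluster-wise from offline training and learns the global regressor):

```python
# Sketch: multiple local linear mappings produce HR candidates; a global
# regressor (here, a plain weighted sum) fuses them into one HR patch.
import numpy as np

rng = np.random.default_rng(0)
lr_dim, hr_dim, n_maps = 25, 81, 25  # e.g. 5x5 LR patch -> 9x9 HR patch
local_maps = 0.1 * rng.standard_normal((n_maps, hr_dim, lr_dim))  # stand-ins
global_w = np.full(n_maps, 1.0 / n_maps)  # stand-in global regressor weights

def glm_si_patch(lr_vec):
    candidates = np.stack([M @ lr_vec for M in local_maps])  # HR candidates
    return global_w @ candidates  # global regression into the final HR patch

print(glm_si_patch(rng.standard_normal(lr_dim)).shape)  # (81,)
```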

7.
IEEE Trans Image Process; 25(8): 3787-3800, 2016 Aug.
Article in English | MEDLINE | ID: mdl-27305681

ABSTRACT

In this paper, a low-complexity coding unit (CU)-level rate and distortion estimation scheme is proposed for hardware-friendly implementation of High Efficiency Video Coding (HEVC), where a Walsh-Hadamard transform (WHT)-based low-complexity integer discrete cosine transform (DCT) is employed for distortion estimation. Since HEVC adopts quadtree structures of coding blocks with hierarchical coding depths, it is difficult to estimate accurate rate and distortion values without actually performing transform, quantization, inverse transform, de-quantization, and entropy coding. Furthermore, the DCT for rate-distortion optimization (RDO) is computationally expensive, because it requires many multiplication and addition operations for the transform block sizes of orders 4, 8, 16, and 32, and requires recursive computations to decide the optimal depths of CUs or transform units (TUs). Therefore, full-RDO encoding is highly complex, especially for low-power implementations of HEVC encoders. In this paper, a CU-level rate and distortion estimation scheme is proposed based on a low-complexity integer DCT that can be computed in terms of the WHT, whose coefficients are produced in the prediction stages. For this estimation, two orthogonal matrices of sizes 4×4 and 8×8, applied on top of the WHT, are newly designed in a butterfly structure using only addition and shift operations. By applying the WHT-based integer DCT with the newly designed transforms in each CU block, the texture rate can be precisely estimated after quantization from the number of non-zero quantized coefficients, and the distortion can likewise be precisely estimated in the transform domain, with no de-quantization or inverse transform required. In addition, a non-texture rate estimation using a pseudo-entropy code is proposed to obtain accurate total rate estimates. The proposed rate and distortion estimation scheme can effectively be used for hardware-friendly implementations of HEVC encoders, incurring a 9.8% loss relative to HEVC full RDO, which is much less than the 20.3% and 30.2% losses of a conventional approach and a Hadamard-only scheme, respectively.
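A simplified sketch of transform-domain rate-distortion estimation (the orthonormal WHT and the flat bits-per-coefficient rate model below are assumptions; the paper uses dedicated integer butterflies and a pseudo-entropy code): because the transform is orthogonal, the quantization error energy in the coefficient domain equals the pixel-domain distortion, so neither de-quantization nor an inverse transform is needed.

```python
# Sketch: estimate rate from non-zero quantized coefficients and distortion
# from coefficient-domain quantization error (Parseval, orthonormal WHT).
import numpy as np

def wht_mat(N):  # orthonormal Walsh-Hadamard matrix, N a power of two
    H = np.array([[1.0]])
    while H.shape[0] < N:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(N)

def estimate_rd(residual, q_step, bits_per_coef=4.0):
    """bits_per_coef is a crude placeholder for a pseudo-entropy rate model."""
    W = wht_mat(residual.shape[0])
    coef = W @ residual @ W.T
    q = np.round(coef / q_step)
    dist = np.sum((coef - q * q_step) ** 2)  # equals pixel-domain SSE
    rate = bits_per_coef * np.count_nonzero(q)
    return rate, dist

print(estimate_rd(10.0 * np.random.randn(8, 8), q_step=8.0))
```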

8.
IEEE Trans Image Process; 25(5): 2392-2406, 2016 May.
Article in English | MEDLINE | ID: mdl-27046873

ABSTRACT

Computational models for image quality assessment (IQA) have been developed by exploring effective features that are consistent with the characteristics of the human visual system (HVS) for visual quality perception. In this paper, we first show that many existing features used in computational IQA methods can hardly characterize visual quality perception across local image characteristics and various distortion types. To solve this problem, we propose a new IQA method, called the structural contrast-quality index (SC-QI), which adopts a structural contrast index (SCI) that can well characterize local and global visual quality perception for various image characteristics and structural-distortion types. In addition to the SCI, we devise other perceptually important features for SC-QI that effectively reflect the HVS's contrast sensitivity and its response to chrominance component variation. Furthermore, we develop a modified SC-QI, called the structural contrast distortion metric (SC-DM), which has the desirable mathematical properties of being a valid distance metric and quasi-convex, so it can effectively be used as a distance metric in image quality optimization problems. Extensive experimental results show that both SC-QI and SC-DM characterize the HVS's visual quality perception very well across local image characteristics and various distortion types, which is a distinctive merit of our methods compared with other IQA methods. As a result, both SC-QI and SC-DM achieve better performance, with strong consilience between global and local visual quality perception as well as much lower computational complexity, than state-of-the-art IQA methods. The MATLAB source code of the proposed SC-QI and SC-DM is publicly available at https://sites.google.com/site/sunghobaecv/iqa.
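As a loose, hypothetical illustration of contrast-similarity scoring (a crude stand-in; the actual SCI, feature set, and pooling of SC-QI are considerably more elaborate):

```python
# Toy sketch: compare block-wise contrast maps of reference and distorted
# images with an SSIM-style similarity, then pool by averaging.
import numpy as np

def local_contrast(img, block=8, eps=1e-6):
    h, w = img.shape[0] // block, img.shape[1] // block
    blocks = img[:h * block, :w * block].reshape(h, block, w, block)
    return blocks.std(axis=(1, 3)) / (blocks.mean(axis=(1, 3)) + eps)

def sc_quality(ref, dist, c=0.01):
    cr, cd = local_contrast(ref), local_contrast(dist)
    return np.mean((2 * cr * cd + c) / (cr ** 2 + cd ** 2 + c))

ref = np.random.rand(64, 64)
print(sc_quality(ref, ref + 0.05 * np.random.randn(64, 64)))
```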

9.
IEEE Trans Cybern; 45(8): 1476-1490, 2015 Aug.
Article in English | MEDLINE | ID: mdl-25291807

ABSTRACT

Social TV is a social media service, delivered via TV and social networks, through which TV users exchange their experiences of the TV programs they are viewing. For a social TV service, two technical components are envisioned: grouping similar TV users to create social TV communities, and recommending TV programs based on group and personal interests to personalize TV. In this paper, we propose a unified topic model that groups similar TV users and recommends TV programs as a social TV service. The proposed unified topic model employs two latent Dirichlet allocation (LDA) models: one is a topic model of TV users, and the other is a topic model of the description words of the viewed TV programs. The two LDA models are integrated via a topic-proportion parameter for TV programs, which enforces the grouping of similar TV users and of the associated description words for watched TV programs at the same time in a unified topic-modeling framework. The unified model identifies the semantic relation between TV user groups and TV program description word groups, so that more meaningful TV program recommendations can be made. It also overcomes the item ramp-up problem, so that new TV programs can be reliably recommended to TV users. Furthermore, from the topic model of TV users, users with similar tastes can be grouped by topic, and these groups can be recommended as social TV communities. To verify the proposed unified topic-modeling-based TV user grouping and TV program recommendation for social TV services, our experiments used real TV viewing history data and electronic program guide data from a seven-month period, collected by a TV poll agency. The experimental results show that the proposed unified topic model yields an average precision of 81.4% for 50 topics in TV program recommendation, 6.5% higher on average than the topic model of TV users alone. For TV user prediction with new TV programs, the average prediction precision was 79.6%. We also show the superiority of the proposed model, in both topic-modeling and recommendation performance, over two related topic models: the polylingual topic model and the bilingual topic model.
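The key coupling is that a single topic-proportion vector per TV program generates both the program's viewers and its description words. A toy generative sketch follows (the sizes and symmetric Dirichlet priors are illustrative; this is not the authors' inference code):

```python
# Toy sketch: one shared theta ties the user-side LDA to the word-side LDA.
import numpy as np

rng = np.random.default_rng(1)
K, n_users, n_words = 5, 40, 60
topic_user = rng.dirichlet(np.ones(n_users), size=K)  # per-topic viewer dist.
topic_word = rng.dirichlet(np.ones(n_words), size=K)  # per-topic word dist.

theta = rng.dirichlet(np.ones(K))  # one program's shared topic proportions
viewers = [rng.choice(n_users, p=topic_user[z])
           for z in rng.choice(K, size=20, p=theta)]
words = [rng.choice(n_words, p=topic_word[z])
         for z in rng.choice(K, size=15, p=theta)]
print(viewers[:5], words[:5])
```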


Subject(s)
Computer Communication Networks; Models, Theoretical; Social Media; Television; Humans; Reproducibility of Results
10.
IEEE Trans Image Process; 23(8): 3227-3240, 2014 Aug.
Article in English | MEDLINE | ID: mdl-24960103

ABSTRACT

In this paper, we propose a new DCT-based just noticeable difference (JND) profile incorporating the spatial contrast sensitivity function, the luminance adaptation effect, and the contrast masking (CM) effect. The proposed JND profile overcomes two limitations of conventional JND profiles: 1) the CM-JND models in conventional JND profiles employ simple texture complexity metrics that often correlate poorly with perceived complexity, especially for unstructured patterns. We therefore propose a new texture complexity metric, called the structural contrast index, which considers not only contrast intensity but also the structuredness of image patterns. We also newly observe that, as the structural contrast index of a background texture pattern increases, the CM-JND modulation factors show a bandpass property in frequency. Based on this observation, the new CM-JND is modeled as a function of DCT frequency and the proposed structural contrast index, showing significantly high correlation with measured CM-JND values; and 2) whereas conventional DCT-based JND profiles are applicable only to specific transform block sizes, our proposed DCT-based JND profile is the first designed to be applicable to any transform size, achieved by deriving a new summation-effect function, which can also be applied to the quad-tree transforms of high efficiency video coding. Overall, the proposed DCT-based JND profile tolerates more distortion at better perceptual quality than the other JND profiles under comparison.
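A hypothetical sketch of how the three factors could compose multiplicatively into a single DCT-domain threshold (every constant and functional form below is a placeholder, not the paper's fitted model; note the masking term is bandpass in frequency, echoing the observation above):

```python
# Illustrative JND threshold = inverse-CSF base * luminance adaptation
# * contrast masking that grows with the structural contrast index (SCI).
import numpy as np

def jnd_threshold(f, lum, sci):
    """f: DCT frequency (cycles/degree); lum: mean luminance in [0, 1];
    sci: structural contrast index of the background pattern."""
    base = 0.02 * np.exp(0.2 * f)              # threshold rises as CSF falls
    lum_adapt = 1.0 + 0.5 * abs(lum - 0.5)     # U-shaped luminance adaptation
    masking = 1.0 + sci * np.exp(-(f - 4.0) ** 2 / 8.0)  # bandpass in frequency
    return base * lum_adapt * masking

print(jnd_threshold(f=4.0, lum=0.3, sci=1.5))
```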
