Search | VHL Regional Portal

SipMaskv2: Enhanced Fast Image and Video Instance Segmentation.

Cao, Jiale; Pang, Yanwei; Anwer, Rao Muhammad; Cholakkal, Hisham; Khan, Fahad Shahbaz; Shao, Ling.

IEEE Trans Pattern Anal Mach Intell ; 45(3): 3798-3812, 2023 Mar.

Article in English | MEDLINE | ID: mdl-37815954

ABSTRACT

We propose a fast single-stage method for both image and video instance segmentation, called SipMask, that preserves the instance spatial information by performing multiple sub-region mask predictions. The main module in our method is a light-weight spatial preservation (SP) module that generates a separate set of spatial coefficients for the sub-regions within a bounding-box, enabling a better delineation of spatially adjacent instances. To better correlate mask prediction with object detection, we further propose a mask alignment weighting loss and a feature alignment scheme. In addition, we identify two issues that impede the performance of single-stage instance segmentation and introduce two modules, a sample selection scheme and an instance refinement module, to address them. Experiments are performed on both the image instance segmentation dataset MS COCO and the video instance segmentation dataset YouTube-VIS. On the MS COCO test-dev set, our method achieves state-of-the-art performance. In terms of real-time capabilities, it outperforms YOLACT by 3.0% (mask AP) under similar settings, while operating at a comparable speed. On the YouTube-VIS validation set, our method also achieves promising results. The source code is available at https://github.com/JialeCao001/SipMask.
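
The idea of sub-region mask prediction can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the abstract or the linked repository: the 2x2 split of the box, the linear combination of shared basis masks followed by a sigmoid, and the names `subregion_mask`, `bases`, and `coeffs` are all hypothetical.

```python
import math
from typing import List, Tuple

Mask = List[List[float]]


def subregion_mask(bases: List[Mask],
                   coeffs: List[List[float]],
                   box: Tuple[int, int, int, int]) -> Mask:
    """Combine shared basis masks using a separate coefficient set per quadrant.

    coeffs[q] holds one weight per basis mask for quadrant q of the box
    (0: top-left, 1: top-right, 2: bottom-left, 3: bottom-right).
    The box is half-open: columns [x0, x1) and rows [y0, y1).
    Pixels outside the box stay at 0.0.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2  # quadrant boundaries
    height, width = len(bases[0]), len(bases[0][0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(y0, y1):
        for x in range(x0, x1):
            # Pick the coefficient set of the quadrant this pixel falls in.
            q = (2 if y >= cy else 0) + (1 if x >= cx else 0)
            logit = sum(c * b[y][x] for c, b in zip(coeffs[q], bases))
            out[y][x] = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return out


# One constant basis; opposite quadrants get opposite-sign weights.
bases = [[[1.0] * 4 for _ in range(4)]]
coeffs = [[10.0], [-10.0], [-10.0], [10.0]]
mask = subregion_mask(bases, coeffs, (0, 0, 4, 4))
```

With a single set of coefficients per box, every pixel would share the same weights; letting each quadrant use its own set is what allows the toy mask above to switch on in two corners and off in the other two.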

Towards Partial Supervision for Generic Object Counting in Natural Scenes.

Cholakkal, Hisham; Sun, Guolei; Khan, Salman; Khan, Fahad Shahbaz; Shao, Ling; Van Gool, Luc.

IEEE Trans Pattern Anal Mach Intell ; 44(3): 1604-1622, 2022 Mar.

Article in English | MEDLINE | ID: mdl-32870786

ABSTRACT

Generic object counting in natural scenes is a challenging computer vision problem. Existing approaches rely on either instance-level supervision or absolute count information to train a generic object counter. We introduce a partially supervised setting that significantly reduces the supervision level required for generic object counting. We propose two novel frameworks, named lower-count (LC) and reduced lower-count (RLC), to enable object counting under this setting. Our frameworks are built on a novel dual-branch architecture that has an image classification branch and a density branch. Our LC framework reduces the annotation cost due to multiple instances in an image by using only lower-count supervision for all object categories. Our RLC framework further reduces the annotation cost arising from large numbers of object categories in a dataset by using lower-count supervision for only a subset of categories and class labels for the remaining ones. The RLC framework extends our dual-branch LC framework with a novel weight modulation layer and a category-independent density map prediction. Experiments are performed on the COCO, Visual Genome and PASCAL 2007 datasets. Our frameworks perform on par with state-of-the-art approaches using higher levels of supervision. Additionally, we demonstrate the applicability of our LC supervised density map for image-level supervised instance segmentation.

Subject(s)

Algorithms

Backtracking Spatial Pyramid Pooling (SPP)-based Image Classifier for Weakly Supervised Top-down Salient Object Detection.

Cholakkal, Hisham; Johnson, Jubin; Rajan, Deepu.

IEEE Trans Image Process ; 2018 Aug 10.

Article in English | MEDLINE | ID: mdl-30106724

ABSTRACT

Top-down saliency models produce a probability map that peaks at target locations specified by a task/goal such as object detection. They are usually trained in a fully supervised setting involving pixel-level annotations of objects. We propose a weakly supervised top-down saliency framework using only binary labels that indicate the presence/absence of an object in an image. First, the probabilistic contribution of each image region to the confidence of a CNN-based image classifier is computed through a backtracking strategy to produce top-down saliency. From a set of saliency maps of an image produced by fast bottom-up saliency approaches, we select the saliency map best suited to the top-down task. The selected bottom-up saliency map is combined with the top-down saliency map. Features having high combined saliency are used to train a linear SVM classifier to estimate feature saliency. This is integrated with combined saliency and further refined through a multi-scale superpixel-averaging of the saliency map. We evaluate the performance of the proposed weakly supervised top-down saliency and achieve comparable performance with fully supervised approaches. Experiments are carried out on seven challenging datasets and quantitative results are compared with 40 closely related approaches across 4 different applications. Code will be made publicly available.

Sparse Coding for Alpha Matting.

Johnson, Jubin; Varnousfaderani, Ehsan Shahrian; Cholakkal, Hisham; Rajan, Deepu.

IEEE Trans Image Process ; 25(7): 3032-3043, 2016 Jul.

Article in English | MEDLINE | ID: mdl-28113175

ABSTRACT

Existing color sampling-based alpha matting methods use the compositing equation to estimate alpha at a pixel from the pairs of foreground (F) and background (B) samples. The quality of the matte depends on the selected (F, B) pairs. In this paper, the matting problem is reinterpreted as a sparse coding of pixel features, wherein the sum of the codes gives the estimate of the alpha matte from a set of unpaired F and B samples. A non-parametric probabilistic segmentation provides a certainty measure on the pixel belonging to foreground or background, based on which a dictionary is formed for use in sparse coding. By removing the restriction to conform to (F, B) pairs, this method allows for better alpha estimation from multiple F and B samples. The same framework is extended to videos, where the requirement of temporal coherence is handled effectively. Here, the dictionary is formed by samples from multiple frames. A multi-frame graph model, as opposed to a single image as for image matting, is proposed that can be solved efficiently in closed form. Quantitative and qualitative evaluations on a benchmark dataset are provided to show that the proposed method outperforms the current state of the art in image and video matting.
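
For context, the pair-based baseline that the abstract contrasts itself with can be sketched as follows. This is a minimal illustration of the standard compositing equation I = alpha * F + (1 - alpha) * B and the usual least-squares alpha estimate for a single (F, B) pair; it is not the sparse coding method of the paper, and the function names `composite` and `estimate_alpha` are hypothetical.

```python
from typing import List, Sequence

Color = Sequence[float]


def composite(alpha: float, fg: Color, bg: Color) -> List[float]:
    """Compositing equation, per channel: I = alpha * F + (1 - alpha) * B."""
    return [alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg)]


def estimate_alpha(pixel: Color, fg: Color, bg: Color) -> float:
    """Least-squares alpha for one (F, B) sample pair, clamped to [0, 1].

    Projects (I - B) onto (F - B):
        alpha = (I - B) . (F - B) / ||F - B||^2
    """
    diff = [f - b for f, b in zip(fg, bg)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        raise ValueError("foreground and background samples are identical")
    num = sum((p - b) * d for p, b, d in zip(pixel, bg, diff))
    return min(1.0, max(0.0, num / denom))


# Round trip: blend a red foreground over a blue background, then recover alpha.
fg, bg = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)
pixel = composite(0.25, fg, bg)
alpha = estimate_alpha(pixel, fg, bg)
```

The estimate is only as good as the chosen pair: if the sampled F or B does not match the true colors at the pixel, the projection yields a wrong alpha, which is the dependence on (F, B) pair selection that the abstract describes.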
