Results 1 - 20 of 27
1.
IEEE Trans Neural Netw Learn Syst ; 33(3): 1051-1065, 2022 03.
Article in English | MEDLINE | ID: mdl-33296311

ABSTRACT

Deep neural networks are vulnerable to adversarial attacks. More importantly, some adversarial examples crafted against an ensemble of source models transfer to other target models and, thus, pose a security threat to black-box applications (when attackers have no access to the target models). Current transfer-based ensemble attacks, however, only consider a limited number of source models to craft an adversarial example and, thus, obtain poor transferability. In addition, recent query-based black-box attacks require numerous queries to the target model, which not only raises suspicion on the target side but also incurs a high query cost. In this article, we propose a novel transfer-based black-box attack, dubbed serial-minigroup-ensemble-attack (SMGEA). Concretely, SMGEA first divides a large number of pretrained white-box source models into several "minigroups." For each minigroup, we design three new ensemble strategies to improve the intragroup transferability. Moreover, we propose a new algorithm that recursively accumulates the "long-term" gradient memories of the previous minigroup into the subsequent minigroup. This way, the learned adversarial information is preserved and the intergroup transferability is improved. Experiments indicate that SMGEA not only achieves state-of-the-art black-box attack ability over several data sets but also deceives two online black-box saliency prediction systems in the real world, i.e., DeepGaze-II (https://deepgaze.bethgelab.org/) and SALICON (http://salicon.net/demo/). Finally, we contribute a new code repository to promote research on adversarial attack and defense over ubiquitous pixel-to-pixel computer vision tasks. We share our code, together with the pretrained substitute model zoo, at https://github.com/CZHQuality/AAA-Pix2pix.
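To make the serial-minigroup idea concrete, below is a minimal toy sketch, not the authors' SMGEA code: the white-box source models are stand-in quadratic losses, the three intragroup ensemble strategies are reduced to plain gradient averaging, and the long-term gradient memory is approximated by a momentum-style accumulator carried from one minigroup to the next.

```python
"""Toy sketch of a serial minigroup ensemble attack with accumulated
gradient memory. NOT the SMGEA implementation: 'models' are quadratic
losses and the memory rule is a simple momentum-style accumulation."""
import numpy as np

rng = np.random.default_rng(0)
D = 64                                   # toy input dimensionality
targets = rng.normal(size=(6, D))        # 6 toy source "models"

def model_grad(m, x):
    # Gradient of a toy loss ||x - target_m||^2 w.r.t. the input x.
    return 2.0 * (x - targets[m])

def minigroup_attack(x, x_clean, group, memory, eps=0.3, alpha=0.02, steps=10):
    """Attack against one minigroup, warm-started with the gradient
    memory accumulated from the previous minigroups."""
    for _ in range(steps):
        g = np.mean([model_grad(m, x) for m in group], axis=0)  # intragroup ensemble
        memory = 0.9 * memory + g                                # long-term memory
        x = x + alpha * np.sign(memory)                          # ascent step
        x = x_clean + np.clip(x - x_clean, -eps, eps)            # L_inf projection
    return x, memory

x_clean = rng.normal(size=D)
x_adv, mem = x_clean.copy(), np.zeros(D)
for group in [(0, 1), (2, 3), (4, 5)]:                           # serial minigroups
    x_adv, mem = minigroup_attack(x_adv, x_clean, group, mem)
print("perturbation L_inf norm:", np.abs(x_adv - x_clean).max())
```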


Subject(s)
Algorithms , Neural Networks, Computer , Learning , Memory, Long-Term
2.
IEEE Trans Pattern Anal Mach Intell ; 44(11): 8006-8021, 2022 11.
Article in English | MEDLINE | ID: mdl-34437058

ABSTRACT

CNN-based salient object detection (SOD) methods achieve impressive performance. However, the way semantic information is encoded in them, and whether they are category-agnostic, is less explored. One major obstacle in studying these questions is the fact that SOD models are built on top of ImageNet pre-trained backbones, which may cause information leakage and feature redundancy. To remedy this, here we first propose an extremely light-weight holistic model (CSNet) tied to the SOD task that can be freed from classification backbones and trained from scratch, and then employ it to study the semantics of SOD models. With the holistic network and representation redundancy reduction via a novel dynamic weight decay scheme, our model has only 100K parameters, ∼0.2% of the parameters of large models, and performs on par with the SOTA on popular SOD benchmarks. Using CSNet, we find that a) SOD and classification methods use different mechanisms, b) SOD models are category insensitive, c) ImageNet pre-training is not necessary for SOD training, and d) SOD models require far fewer parameters than classification models. The source code is publicly available at https://mmcheng.net/sod100k/.


Subject(s)
Neural Networks, Computer , Semantics , Algorithms
3.
IEEE Trans Image Process ; 30: 8727-8742, 2021.
Article in English | MEDLINE | ID: mdl-34613915

ABSTRACT

Multi-level feature fusion is a fundamental topic in computer vision. It has been exploited to detect, segment and classify objects at various scales. When multi-level features meet multi-modal cues, the optimal feature aggregation and multi-modal learning strategy become an open question. In this paper, we leverage the inherent multi-modal and multi-level nature of RGB-D salient object detection to devise a novel Bifurcated Backbone Strategy Network (BBS-Net). Our architecture is simple, efficient, and backbone-independent. In particular, we first propose to regroup the multi-level features into teacher and student features using a bifurcated backbone strategy (BBS). Second, we introduce a depth-enhanced module (DEM) to excavate informative depth cues from the channel and spatial views. Then, RGB and depth modalities are fused in a complementary way. Extensive experiments show that BBS-Net significantly outperforms 18 state-of-the-art (SOTA) models on eight challenging datasets under five evaluation measures, demonstrating the superiority of our approach (~4% improvement in S-measure vs. the top-ranked model, DMRA). In addition, we provide a comprehensive analysis of the generalization ability of different RGB-D datasets and provide a powerful training set for future research. The complete algorithm, benchmark results, and post-processing toolbox are publicly available at https://github.com/zyjwuyan/BBS-Net.
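As an illustration of the channel-and-spatial attention described for the depth-enhanced module, here is a generic CBAM-style sketch in PyTorch; the layer sizes, the 7x7 spatial kernel, and the additive RGB-depth fusion are assumptions made for the example and are not taken from the BBS-Net paper.

```python
"""Minimal sketch of a depth-enhanced module in the spirit of the abstract
(channel attention followed by spatial attention over depth features).
Generic CBAM-style layout for illustration, not the exact DEM of BBS-Net."""
import torch
import torch.nn as nn

class DepthEnhancedModule(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        # Channel attention: squeeze spatially, excite per channel.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention: 7x7 conv over channel-pooled maps.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, depth_feat, rgb_feat):
        d = depth_feat * self.channel_mlp(depth_feat)              # channel view
        pooled = torch.cat([d.mean(1, keepdim=True),
                            d.max(1, keepdim=True).values], dim=1)
        d = d * self.spatial_conv(pooled)                          # spatial view
        return rgb_feat + d                                        # complementary fusion

dem = DepthEnhancedModule(32)
out = dem(torch.randn(1, 32, 56, 56), torch.randn(1, 32, 56, 56))
print(out.shape)  # torch.Size([1, 32, 56, 56])
```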

4.
IEEE Trans Image Process ; 30: 1973-1988, 2021.
Article in English | MEDLINE | ID: mdl-33444138

ABSTRACT

Saliency detection is an effective front-end process for many security-related tasks, e.g., autonomous driving and tracking. Adversarial attack serves as an efficient surrogate to evaluate the robustness of deep saliency models before they are deployed in the real world. However, most current adversarial attacks exploit gradients spanning the entire image space to craft adversarial examples, ignoring the fact that natural images are high-dimensional and spatially over-redundant, thus incurring a high attack cost and poor imperceptibility. To circumvent these issues, this paper builds an efficient bridge between the accessible partially-white-box source models and the unknown black-box target models. The proposed method includes two steps: 1) We design a new partially-white-box attack, which defines the cost function in the compact hidden space to punish a fraction of feature activations corresponding to the salient regions, instead of punishing every pixel spanning the entire dense output space. This partially-white-box attack reduces the redundancy of the adversarial perturbation. 2) We exploit the non-redundant perturbations from some source models as prior cues, and use an iterative zeroth-order optimizer to compute the directional derivatives along the non-redundant prior directions, in order to estimate the actual gradient of the black-box target model. The non-redundant priors boost the update of some "critical" pixels located at the non-zero coordinates of the prior cues, while keeping the other redundant pixels located at the zero coordinates unaffected. Our method achieves the best tradeoff between attack ability and perturbation redundancy. Finally, we conduct a comprehensive experiment to test the robustness of 18 state-of-the-art deep saliency models against 16 malicious attacks, under both white-box and black-box settings, which contributes a new robustness benchmark to the saliency community for the first time.
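The second step above (zeroth-order gradient estimation restricted to the support of the sparse prior) can be sketched as follows; the black-box model is replaced by a toy quadratic loss and the prior is random, so this only shows the shape of the estimator, not the paper's attack.

```python
"""Toy sketch of zeroth-order gradient estimation along sparse prior
directions. The black-box 'model' is a stand-in quadratic loss; nothing
below is the paper's code."""
import numpy as np

rng = np.random.default_rng(1)
D = 100
target = rng.normal(size=D)

def black_box_loss(x):
    # Stand-in for the unknown target model; only function values are used.
    return float(np.sum((x - target) ** 2))

def zo_gradient(x, prior, n_dirs=20, mu=1e-3):
    """Estimate the gradient only along coordinates where the prior is non-zero."""
    support = prior != 0
    g = np.zeros_like(x)
    f0 = black_box_loss(x)
    for _ in range(n_dirs):
        u = np.zeros_like(x)
        u[support] = rng.normal(size=support.sum())      # directions in the prior support
        u /= np.linalg.norm(u) + 1e-12
        g += (black_box_loss(x + mu * u) - f0) / mu * u  # directional derivative
    return g / n_dirs

prior = np.zeros(D)
prior[rng.choice(D, 15, replace=False)] = 1.0            # sparse prior cues
x = rng.normal(size=D)
print("loss before:", black_box_loss(x))
x = x - 0.1 * zo_gradient(x, prior)                      # one descent step on the toy loss
print("loss after :", black_box_loss(x))
```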

5.
IEEE Trans Pattern Anal Mach Intell ; 43(2): 679-700, 2021 02.
Article in English | MEDLINE | ID: mdl-31425064

ABSTRACT

Visual saliency models have enjoyed a big leap in performance in recent years, thanks to advances in deep learning and large-scale annotated data. Despite enormous effort and huge breakthroughs, however, models still fall short of reaching human-level accuracy. In this work, I explore the landscape of the field, emphasizing new deep saliency models, benchmarks, and datasets. A large number of image and video saliency models are reviewed and compared over two image benchmarks and two large-scale video datasets. Further, I identify factors that contribute to the gap between models and humans and discuss the remaining issues that need to be addressed to build the next generation of more powerful saliency models. Specific questions addressed include: in what ways current models fail, how to remedy them, what can be learned from cognitive studies of attention, how explicit saliency judgments relate to fixations, how to conduct fair model comparisons, and what the emerging applications of saliency models are.

6.
IEEE Trans Pattern Anal Mach Intell ; 43(1): 220-237, 2021 01.
Article in English | MEDLINE | ID: mdl-31247542

ABSTRACT

Predicting where people look in static scenes, a.k.a. visual saliency, has received significant research interest recently. However, relatively less effort has been spent on understanding and modeling visual attention over dynamic scenes. This work makes three contributions to video saliency research. First, we introduce a new benchmark, called DHF1K (Dynamic Human Fixation 1K), for predicting fixations during dynamic scene free-viewing, addressing a long-standing need in this field. DHF1K consists of 1K high-quality, elaborately selected video sequences annotated by 17 observers using an eye tracker. The videos span a wide range of scenes, motions, object types, and backgrounds. Second, we propose a novel video saliency model, called ACLNet (Attentive CNN-LSTM Network), that augments the CNN-LSTM architecture with a supervised attention mechanism to enable fast end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, thus allowing the LSTM to focus on learning a more flexible temporal saliency representation across successive frames. Such a design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. Third, we perform an extensive evaluation of state-of-the-art saliency models on three datasets: DHF1K, Hollywood-2, and UCF Sports. An attribute-based analysis of previous saliency models and cross-dataset generalization are also presented. Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that ACLNet outperforms other contenders and has a fast processing speed (40 fps using a single GPU). Our code and all the results are available at https://github.com/wenguanwang/DHF1K.
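A schematic sketch of the attention-augmented CNN-LSTM idea is given below; the layer sizes are placeholders and the per-frame readout is collapsed to a scalar instead of a full saliency map, so this is only a structural illustration, not the ACLNet architecture.

```python
"""Schematic sketch: a static attention map reweights per-frame CNN features
before a recurrent layer aggregates them over time. Layer sizes and the CNN
itself are placeholders, not ACLNet."""
import torch
import torch.nn as nn

class TinyAttentiveCNNLSTM(nn.Module):
    def __init__(self, feat_ch=16, hidden=32):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU())
        self.attn = nn.Sequential(nn.Conv2d(feat_ch, 1, 1), nn.Sigmoid())  # static attention
        self.lstm = nn.LSTM(feat_ch, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, 1)

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        T = frames.shape[1]
        feats = []
        for t in range(T):
            f = self.cnn(frames[:, t])              # per-frame CNN features
            f = f * self.attn(f)                    # modulate by attention map
            feats.append(f.mean(dim=(2, 3)))        # global average pool
        h, _ = self.lstm(torch.stack(feats, dim=1)) # temporal aggregation
        return self.readout(h)                      # per-frame saliency score

model = TinyAttentiveCNNLSTM()
print(model(torch.randn(2, 5, 3, 64, 64)).shape)    # torch.Size([2, 5, 1])
```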


Subject(s)
Deep Learning , Algorithms , Humans
7.
PLoS Comput Biol ; 16(4): e1007698, 2020 04.
Article in English | MEDLINE | ID: mdl-32271746

ABSTRACT

Humans are able to track multiple objects at any given time in their daily activities; for example, we can drive a car while monitoring obstacles, pedestrians, and other vehicles. Several past studies have examined how humans track targets simultaneously and what underlying behavioral and neural mechanisms they use. At the same time, computer-vision researchers have proposed different algorithms to track multiple targets automatically. These algorithms are useful for video surveillance, team-sport analysis, video analysis, video summarization, and human-computer interaction. Although there are several efficient biologically inspired algorithms in artificial intelligence, the human multiple-target tracking (MTT) ability is rarely imitated in computer-vision algorithms. In this paper, we review MTT studies in neuroscience and biologically inspired MTT methods in computer vision and discuss the ways in which they can be seen as complementary.


Subject(s)
Artificial Intelligence , Memory/physiology , Vision, Ocular/physiology , Algorithms , Animals , Brain/physiology , Cognition , Humans , Image Processing, Computer-Assisted/methods , Motion , Neurosciences , Video Recording/methods
8.
Article in English | MEDLINE | ID: mdl-31905138

ABSTRACT

Deep convolutional neural networks (CNNs) have been successfully applied to a wide variety of problems in computer vision, including salient object detection. To accurately detect and segment salient objects, it is necessary to extract and combine high-level semantic features with low-level fine details simultaneously. This is challenging for CNNs because repeated subsampling operations such as pooling and convolution lead to a significant decrease in feature resolution, which results in the loss of spatial details and finer structures. Therefore, we propose augmenting feedforward neural networks with a multistage refinement mechanism. In the first stage, a master net is built to generate a coarse prediction map in which most detailed structures are missing. In the following stages, a refinement net with layerwise recurrent connections to the master net progressively combines local context information across stages to refine the preceding saliency maps in a stagewise manner. Furthermore, a pyramid pooling module and a channel attention module are applied to aggregate different-region-based global contexts. Extensive evaluations over six benchmark datasets show that the proposed method performs favorably against state-of-the-art approaches.

9.
IEEE Trans Pattern Anal Mach Intell ; 42(8): 1913-1927, 2020 08.
Article in English | MEDLINE | ID: mdl-30892201

ABSTRACT

Previous research in visual saliency has focused on two major types of models, namely fixation prediction and salient object detection. The relationship between the two, however, has been less explored. In this work, we propose to employ the former model type to identify salient objects. We build a novel Attentive Saliency Network (ASNet, available at https://github.com/wenguanwang/ASNet) that learns to detect salient objects from fixations. The fixation map, derived at the upper network layers, mimics human visual attention mechanisms and captures a high-level understanding of the scene from a global view. Salient object detection is then viewed as fine-grained object-level saliency segmentation and is progressively optimized with the guidance of the fixation map in a top-down manner. ASNet is based on a hierarchy of convLSTMs that offers an efficient recurrent mechanism to sequentially refine the saliency features over multiple steps. Several loss functions, derived from existing saliency evaluation metrics, are incorporated to further boost the performance. Extensive experiments on several challenging datasets show that our ASNet outperforms existing methods and is capable of generating accurate segmentation maps with the help of the computed fixation prior. Our work offers a deeper insight into the mechanisms of attention and narrows the gap between salient object detection and fixation prediction.

10.
Article in English | MEDLINE | ID: mdl-31613763

ABSTRACT

Data size is the bottleneck for developing deep saliency models, because collecting eye-movement data is very time-consuming and expensive. Most current studies on human attention and saliency modeling have used high-quality, stereotyped stimuli. In the real world, however, captured images undergo various types of transformations. Can we use these transformations to augment existing saliency datasets? Here, we first create a novel saliency dataset including fixations of 10 observers over 1900 images degraded by 19 types of transformations. Second, by analyzing eye movements, we find that observers look at different locations over transformed versus original images. Third, we utilize the new data over transformed images, called data augmentation transformation (DAT), to train deep saliency models. We find that label-preserving DATs with negligible impact on human gaze boost saliency prediction, whereas some other DATs that severely impact human gaze degrade the performance. These label-preserving, valid augmentation transformations provide a solution to enlarge existing saliency datasets. Finally, we introduce a novel saliency model based on generative adversarial networks (dubbed GazeGAN). A modified U-Net is utilized as the generator of GazeGAN, which combines the classic "skip connection" with a novel "center-surround connection" (CSC) module. Our proposed CSC module mitigates trivial artifacts while emphasizing semantic salient regions, and increases model nonlinearity, thus demonstrating better robustness against transformations. Extensive experiments and comparisons indicate that GazeGAN achieves state-of-the-art performance over multiple datasets. We also provide a comprehensive comparison of 22 saliency models on various transformed scenes, which contributes a new robustness benchmark to the saliency community. Our code and dataset are publicly available.

11.
IEEE Trans Pattern Anal Mach Intell ; 41(4): 815-828, 2019 04.
Article in English | MEDLINE | ID: mdl-29993862

ABSTRACT

Recent progress on salient object detection is substantial, benefiting mostly from the explosive development of Convolutional Neural Networks (CNNs). Semantic segmentation and salient object detection algorithms developed lately have mostly been based on Fully Convolutional Neural Networks (FCNs). There is still considerable room for improvement over generic FCN models that do not explicitly deal with the scale-space problem. The Holistically-Nested Edge Detector (HED) provides a skip-layer structure with deep supervision for edge and boundary detection, but the performance gain of HED on saliency detection is not obvious. In this paper, we propose a new salient object detection method by introducing short connections to the skip-layer structures within the HED architecture. Our framework takes full advantage of multi-level and multi-scale features extracted from FCNs, providing more advanced representations at each layer, a property that is critically needed to perform segment detection. Our method produces state-of-the-art results on 5 widely tested salient object detection benchmarks, with advantages in terms of efficiency (0.08 seconds per image), effectiveness, and simplicity over existing algorithms. Beyond that, we conduct an exhaustive analysis of the role of training data on performance. We provide a training set for future research and fair comparisons.
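The short-connection pattern, where each shallow side output is refined by upsampled deeper side outputs before the final fusion, can be illustrated with the following numpy sketch over random placeholder maps; the equal-weight fusion and nearest-neighbour upsampling are simplifications for clarity, not the trained network.

```python
"""Simplified sketch of HED-style side outputs with 'short connections':
each shallow side output receives upsampled deeper side outputs before
fusion. Feature maps are random placeholders; only the connectivity
pattern is illustrated."""
import numpy as np

def upsample(m, size):
    # Nearest-neighbour upsampling to a common spatial size.
    ry, rx = size[0] // m.shape[0], size[1] // m.shape[1]
    return np.repeat(np.repeat(m, ry, axis=0), rx, axis=1)

rng = np.random.default_rng(0)
# Side outputs from shallow (fine, 64x64) to deep (coarse, 8x8) layers.
sides = [rng.random((64 // 2**i, 64 // 2**i)) for i in range(4)]

refined = []
for i, s in enumerate(sides):
    deeper = [upsample(d, s.shape) for d in sides[i + 1:]]   # short connections from deeper layers
    refined.append(s + sum(deeper) if deeper else s)

fused = np.mean([upsample(r, (64, 64)) for r in refined], axis=0)  # fusion (equal weights here)
print(fused.shape)  # (64, 64)
```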

12.
IEEE Trans Pattern Anal Mach Intell ; 41(6): 1353-1366, 2019 06.
Article in English | MEDLINE | ID: mdl-29994045

ABSTRACT

Thanks to the availability and increasing popularity of wearable devices such as GoPro cameras, smart phones, and glasses, we have access to a plethora of videos captured from a first-person perspective. Surveillance cameras and Unmanned Aerial Vehicles (UAVs) also offer tremendous amounts of video data recorded from top and oblique viewpoints. Egocentric and surveillance vision have been studied extensively but separately in the computer vision community. The relationship between these two domains, however, remains unexplored. In this study, we make the first attempt in this direction by addressing two basic yet challenging questions. First, having a set of egocentric videos and a top-view surveillance video, does the top-view video contain all or some of the egocentric viewers? In other words, have these videos been shot in the same environment at the same time? Second, if so, can we identify the egocentric viewers in the top-view video? These problems can become extremely challenging when the videos are not temporally aligned. To address them, each view, egocentric or top, is modeled by a graph, and the assignment and time delays are computed iteratively using a spectral graph matching framework. We evaluate our method in terms of ranking and assigning egocentric viewers to identities present in the top-view video over a dataset of 50 top-view and 188 egocentric videos captured under different conditions. We also evaluate the capability of our proposed approaches in terms of temporal alignment. The experiments and results demonstrate the capability of the proposed approaches in jointly addressing the temporal alignment and assignment tasks.

13.
Article in English | MEDLINE | ID: mdl-29994308

ABSTRACT

We propose a novel unsupervised game-theoretic salient object detection algorithm that does not require labeled training data. First, the saliency detection problem is formulated as a non-cooperative game, hereinafter referred to as the Saliency Game, in which image regions are players that choose to be "background" or "foreground" as their pure strategies. A payoff function is constructed by exploiting multiple cues and combining complementary features. Saliency maps are generated according to each region's strategy in the Nash equilibrium of the proposed Saliency Game. Second, we explore the complementary relationship between color and deep features and propose an Iterative Random Walk algorithm to combine the saliency maps produced by the Saliency Game using different features. The iterative random walk allows information to be shared across feature spaces and detects objects that are otherwise very hard to detect. Extensive experiments over six challenging datasets demonstrate the superiority of our unsupervised algorithm compared to several state-of-the-art supervised algorithms.

14.
J Vis ; 16(14): 18, 2016 11 01.
Article in English | MEDLINE | ID: mdl-27903005

ABSTRACT

Several structural scene cues such as gist, layout, horizontal line, openness, and depth have been shown to guide scene perception (e.g., Oliva & Torralba, 2001; Ross & Oliva, 2009). Here, to investigate whether the vanishing point (VP) plays a significant role in gaze guidance, we ran two experiments. In the first one, we recorded fixations of 10 observers (six male, four female; mean age 22; SD = 0.84) freely viewing 532 images, of which 319 had a VP (shuffled presentation; each image shown for 4 s). We found that the average number of fixations in a local region (80 × 80 pixels) centered on the VP is significantly higher than the average number of fixations at random locations (t test; n = 319; p < 0.001). To address the confounding factor of saliency, we learned a combined model of bottom-up saliency and VP. The AUC (area under curve) score of our model (0.85; SD = 0.01) is significantly higher than that of the base saliency model (e.g., 0.8 for the attention for information maximization (AIM) model by Bruce & Tsotsos, 2005; t test; p = 3.14e-16) and the VP-only model (0.64; t test; p < 0.001). In the second experiment, we asked 14 subjects (10 male, four female; mean age 23.07, SD = 1.26) to search for a target character (T or L) placed randomly on a 3 × 3 imaginary grid overlaid on top of an image. Subjects reported their answers by pressing one of two keys. Stimuli consisted of 270 color images (180 with a single VP, 90 without). The target appeared with equal probability inside each cell (15 times L, 15 times T). We found that subjects were significantly faster (and more accurate) when the target appeared inside the cell containing the VP compared to cells without the VP (median across 14 subjects: 1.34 s vs. 1.96 s; Wilcoxon rank-sum test; p = 0.0014). These findings support the hypothesis that the vanishing point, similar to faces, text (Cerf, Frady, & Koch, 2009), and gaze direction (Borji, Parks, & Itti, 2014), guides attention in free-viewing and visual search tasks.
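A toy version of the combined bottom-up-saliency plus vanishing-point map, evaluated with the usual AUC-over-fixations protocol, might look as follows; the saliency map, VP location, mixing weights, and fixations are all synthetic and chosen only for illustration.

```python
"""Toy sketch of a combined saliency + vanishing-point map and its AUC
evaluation. All data below are synthetic; this is not the paper's model
or protocol."""
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
H = W = 128
yy, xx = np.mgrid[0:H, 0:W]

saliency = rng.random((H, W))                                  # stand-in bottom-up saliency
vp = (40, 90)                                                  # assumed vanishing-point location
vp_map = np.exp(-((yy - vp[0])**2 + (xx - vp[1])**2) / (2 * 15**2))

combined = 0.6 * saliency / saliency.max() + 0.4 * vp_map      # simple weighted combination

# Synthetic fixations biased toward the VP, plus random non-fixated points.
fix = np.clip(rng.normal(loc=vp, scale=20, size=(50, 2)).astype(int), 0, H - 1)
neg = rng.integers(0, H, size=(50, 2))
labels = np.r_[np.ones(50), np.zeros(50)]
scores = np.r_[combined[fix[:, 0], fix[:, 1]], combined[neg[:, 0], neg[:, 1]]]
print("AUC of combined map:", roc_auc_score(labels, scores))
```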


Subject(s)
Eye Movements/physiology , Fixation, Ocular/physiology , Pattern Recognition, Visual/physiology , Visual Perception/physiology , Attention/physiology , Cues , Female , Humans , Male , Probability , Young Adult
15.
IEEE Trans Image Process ; 25(4): 1566-79, 2016 Apr.
Article in English | MEDLINE | ID: mdl-26829792

ABSTRACT

A large number of saliency models, each based on a different hypothesis, have been proposed over the past 20 years. In practice, while subscribing to one hypothesis or computational principle makes a model that performs well on some types of images, it hinders the general performance of a model on arbitrary images and large-scale data sets. One natural approach to improve overall saliency detection accuracy would then be fusing different types of models. In this paper, inspired by the success of late-fusion strategies in semantic analysis and multi-modal biometrics, we propose to fuse the state-of-the-art saliency models at the score level in a para-boosting learning fashion. First, saliency maps generated by several models are used as confidence scores. Then, these scores are fed into our para-boosting learner (i.e., support vector machine, adaptive boosting, or probability density estimator) to generate the final saliency map. In order to explore the strength of para-boosting learners, traditional transformation-based fusion strategies, such as Sum, Min, and Max, are also explored and compared in this paper. To further reduce the computation cost of fusing too many models, only a few of them are considered in the next step. Experimental results show that score-level fusion outperforms each individual model and can further reduce the performance gap between the current models and the human inter-observer model.
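Score-level fusion of saliency models can be sketched as follows, with a linear SVM standing in for the para-boosting learner; the per-pixel model scores and ground truth are synthetic, so this illustrates only the fusion pipeline, not the paper's learners or data.

```python
"""Minimal sketch of score-level saliency fusion: per-pixel confidence
scores from several models are stacked as features and a learner predicts
the fused saliency. Synthetic data; not the paper's para-boosting setup."""
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_pixels, n_models = 5000, 4
scores = rng.random((n_pixels, n_models))              # saliency scores from 4 models
truth = (scores.mean(axis=1) + 0.1 * rng.normal(size=n_pixels)) > 0.5  # synthetic ground truth

fuser = LinearSVC(C=1.0).fit(scores[:4000], truth[:4000])
fused = fuser.decision_function(scores[4000:])         # fused saliency scores on held-out pixels
print("held-out accuracy:", fuser.score(scores[4000:], truth[4000:]))
```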

16.
IEEE Trans Neural Netw Learn Syst ; 27(6): 1266-78, 2016 06.
Article in English | MEDLINE | ID: mdl-26277009

ABSTRACT

Advances in image quality assessment have shown the potential added value of including visual attention aspects in objective assessment. Numerous models of visual saliency have been implemented and integrated into different image quality metrics (IQMs), but the gain in reliability of the resulting IQMs varies to a large extent. The causes and trends of this variation are not yet fully understood, although knowing them would be highly beneficial for further improving IQMs. In this paper, an exhaustive statistical evaluation is conducted to justify the added value of computational saliency in objective image quality assessment, using 20 state-of-the-art saliency models and 12 of the best-known IQMs. Quantitative results show that the differences between saliency models in predicting human fixations are sufficient to yield a significant difference in performance gain when these saliency models are added to IQMs. However, surprisingly, the extent to which an IQM can profit from adding a saliency model does not appear to have direct relevance to how well that saliency model can predict human fixations. Our statistical analysis provides useful guidance for applying saliency models in IQMs, in terms of the effect of saliency model dependence, IQM dependence, and image distortion dependence. The testbed and software are made publicly available to the research community.
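To make the integration step concrete, here is a tiny sketch that weights local distortion by a saliency map inside a PSNR-style metric; this is far simpler than the 12 IQMs and 20 saliency models studied in the paper and is meant only to show where saliency enters the pooling.

```python
"""Tiny sketch of saliency-weighted quality pooling: local squared error is
weighted by a (normalised) saliency map before PSNR-style pooling. Purely
illustrative; not one of the IQMs from the paper."""
import numpy as np

def saliency_weighted_psnr(ref, dist, saliency, max_val=1.0):
    w = saliency / (saliency.sum() + 1e-12)            # normalise weights
    mse = np.sum(w * (ref - dist) ** 2)                # saliency-weighted pooling
    return 10 * np.log10(max_val**2 / (mse + 1e-12))

rng = np.random.default_rng(0)
ref = rng.random((64, 64))
dist = np.clip(ref + 0.05 * rng.normal(size=ref.shape), 0, 1)
sal = rng.random((64, 64))                             # stand-in saliency map
print("saliency-weighted PSNR:", saliency_weighted_psnr(ref, dist, sal))
```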

17.
IEEE Trans Neural Netw Learn Syst ; 27(6): 1214-26, 2016 06.
Article in English | MEDLINE | ID: mdl-26452292

ABSTRACT

Predicting where people look in natural scenes has attracted a lot of interest in computer vision and computational neuroscience over the past two decades. Two seemingly contrasting categories of cues have been proposed to influence where people look: 1) low-level image saliency and 2) high-level semantic information. Our first contribution is to take a detailed look at these cues to confirm the hypothesis, proposed by Henderson and by Nuthmann and Henderson, that observers tend to look at the center of objects. We analyzed fixation data for scene free-viewing over 17 observers on 60 object-annotated images with various types of objects. The images contained different types of scenes, such as natural scenes, line drawings, and 3-D rendered scenes. Our second contribution is to propose a simple combined model of low-level saliency and object center bias that significantly outperforms each individual component over our data, as well as on the Object and Semantic Images and Eye-tracking data set by Xu et al. The results reconcile the saliency and object center-bias hypotheses and highlight that both types of cues are important in guiding fixations. Our work opens new directions for understanding the strategies that humans use in observing scenes and objects, and demonstrates the construction of combined models of low-level saliency and high-level object-based information.
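A minimal sketch of such a combined map, blending a bottom-up saliency map with Gaussians placed at object centers, is shown below; the bounding boxes, Gaussian width, and mixing weight are made-up values for illustration, not the fitted combination from the paper.

```python
"""Sketch of a combined low-level-saliency + object-center-bias map: a
Gaussian is placed at the centre of each annotated object and blended with
a bottom-up saliency map. All inputs are synthetic placeholders."""
import numpy as np

def center_bias_map(shape, boxes, sigma=20.0):
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    m = np.zeros(shape)
    for (y0, x0, y1, x1) in boxes:                     # object bounding boxes
        cy, cx = (y0 + y1) / 2, (x0 + x1) / 2          # object centre
        m += np.exp(-((yy - cy)**2 + (xx - cx)**2) / (2 * sigma**2))
    return m / (m.max() + 1e-12)

rng = np.random.default_rng(0)
saliency = rng.random((128, 128))                      # stand-in bottom-up saliency
boxes = [(20, 30, 60, 80), (70, 90, 110, 120)]         # hypothetical object annotations
combined = 0.5 * saliency / saliency.max() + 0.5 * center_bias_map((128, 128), boxes)
print(combined.shape)  # (128, 128)
```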

18.
IEEE Trans Image Process ; 24(12): 5706-22, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26452281

ABSTRACT

We extensively compare, qualitatively and quantitatively, 41 state-of-the-art models (29 salient object detection, 10 fixation prediction, 1 objectness, and 1 baseline) over seven challenging data sets for the purpose of benchmarking salient object detection and segmentation methods. From the results obtained so far, our evaluation shows consistent, rapid progress over the last few years in terms of both accuracy and running time. The top contenders in this benchmark significantly outperform the models identified as the best in the previous benchmark conducted three years ago. We find that the models designed specifically for salient object detection generally work better than models in closely related areas, which in turn provides a precise definition and suggests an appropriate treatment of this problem that distinguishes it from other problems. In particular, we analyze the influences of center bias and scene complexity on model performance, which, along with the hard cases for the state-of-the-art models, provide useful hints toward constructing more challenging large-scale data sets and better saliency models. Finally, we propose probable solutions for tackling several open problems, such as evaluation scores and data set bias, which also suggest future research directions in the rapidly growing field of salient object detection.

19.
Vision Res ; 116(Pt B): 113-26, 2015 Nov.
Article in English | MEDLINE | ID: mdl-25448115

ABSTRACT

Previous studies have shown that gaze direction of actors in a scene influences eye movements of passive observers during free-viewing (Castelhano, Wieth, & Henderson, 2007; Borji, Parks, & Itti, 2014). However, no computational model has been proposed to combine bottom-up saliency with actor's head pose and gaze direction for predicting where observers look. Here, we first learn probability maps that predict fixations leaving head regions (gaze following fixations), as well as fixations on head regions (head fixations), both dependent on the actor's head size and pose angle. We then learn a combination of gaze following, head region, and bottom-up saliency maps with a Markov chain composed of head region and non-head region states. This simple structure allows us to inspect the model and make comments about the nature of eye movements originating from heads as opposed to other regions. Here, we assume perfect knowledge of actor head pose direction (from an oracle). The combined model, which we call the Dynamic Weighting of Cues model (DWOC), explains observers' fixations significantly better than each of the constituent components. Finally, in a fully automatic combined model, we replace the oracle head pose direction data with detections from a computer vision model of head pose. Using these (imperfect) automated detections, we again find that the combined model significantly outperforms its individual components. Our work extends the engineering and scientific applications of saliency models and helps better understand mechanisms of visual attention.


Subject(s)
Eye Movements/physiology , Fixation, Ocular/physiology , Head , Pattern Recognition, Visual/physiology , Posture/physiology , Female , Humans , Imaging, Three-Dimensional , Male , Probability
20.
IEEE Trans Image Process ; 24(2): 742-56, 2015 Feb.
Article in English | MEDLINE | ID: mdl-25532178

ABSTRACT

Salient object detection or salient region detection models, diverging from fixation prediction models, have traditionally dealt with locating and segmenting the most salient object or region in a scene. While the notion of the most salient object is sensible when multiple objects exist in a scene, current datasets for evaluating saliency detection approaches often contain scenes with only a single object. We introduce three main contributions in this paper. First, we take an in-depth look at the problem of salient object detection by studying the relationship between where people look in scenes and what they choose as the most salient object when they are explicitly asked. Based on the agreement between fixations and saliency judgments, we then suggest that the most salient object is the one that attracts the highest fraction of fixations. Second, we provide two new, less biased benchmark data sets containing scenes with multiple objects that challenge existing saliency models. Indeed, we observed a severe drop (40%-70%) in the performance of eight state-of-the-art models on our data sets. Third, we propose a very simple yet powerful model based on superpixels to be used as a baseline for model evaluation and comparison. While on par with the best models on the MSRA-5K data set, our model wins over other models on our data, highlighting a serious drawback of existing models: conflating the processes of locating the most salient object and segmenting it. We also provide a review and statistical analysis of some labeled scene data sets that can be used for evaluating salient object detection models. We believe that our work can greatly help remedy the over-fitting of models to existing biased data sets and opens new avenues for future research in this fast-evolving field.
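A generic superpixel-based baseline in this spirit can be sketched as follows: SLIC superpixels scored by their mean-color contrast to all other superpixels. This is an illustrative construction under assumed parameters (200 segments, global color contrast), not a reproduction of the paper's baseline model.

```python
"""Rough sketch of a superpixel saliency baseline: segment with SLIC and
score each superpixel by its colour contrast to the rest of the image.
Generic construction for illustration only."""
import numpy as np
from skimage.segmentation import slic
from skimage.data import astronaut

img = astronaut() / 255.0
labels = slic(img, n_segments=200, compactness=10, start_label=0)

# Mean colour of each superpixel.
means = np.array([img[labels == i].mean(axis=0) for i in range(labels.max() + 1)])
# Saliency of a superpixel = mean colour distance to all other superpixels.
contrast = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=2).mean(axis=1)
saliency = contrast[labels]                            # paint scores back onto pixels
saliency = (saliency - saliency.min()) / (np.ptp(saliency) + 1e-12)
print(saliency.shape, saliency.min(), saliency.max())
```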
