Results 1 - 20 of 59
1.
IEEE Trans Pattern Anal Mach Intell ; 46(7): 4747-4762, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38261478

ABSTRACT

Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made exploring its potential for zero-shot video recognition. This paper presents Open-VCLIP++, a simple yet effective framework that adapts CLIP to a strong zero-shot video classifier, capable of identifying novel actions and events during testing. Open-VCLIP++ minimally modifies CLIP to capture spatial-temporal relationships in videos, thereby creating a specialized video classifier while striving for generalization. We formally demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data. To address this problem, we introduce Interpolated Weight Optimization, a technique that leverages the advantages of weight interpolation during both training and testing. Furthermore, we build upon large language models to produce fine-grained video descriptions. These detailed descriptions are further aligned with video features, facilitating a better transfer of CLIP to the video domain. Our approach is evaluated on three widely used action recognition datasets, following a variety of zero-shot evaluation protocols. The results demonstrate that our method surpasses existing state-of-the-art techniques by significant margins. Specifically, we achieve zero-shot accuracy scores of 88.1%, 58.7%, and 81.2% on UCF, HMDB, and Kinetics-600 datasets respectively, outpacing the best-performing alternative methods by 8.5%, 8.2%, and 12.3%. We also evaluate our approach on the MSR-VTT video-text retrieval dataset, where it delivers competitive video-to-text and text-to-video retrieval performance, while utilizing substantially less fine-tuning data compared to other methods.
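The weight interpolation mentioned in the abstract can be illustrated in its most generic form: a linear blend of two checkpoints that share parameter names. This is a minimal sketch, not the paper's Interpolated Weight Optimization procedure; the function name `interpolate_weights`, the parameter `alpha`, and the toy checkpoints are assumptions made for illustration.

```python
def interpolate_weights(pretrained, finetuned, alpha):
    """Return (1 - alpha) * pretrained + alpha * finetuned for every parameter.

    Both arguments map a parameter name to a flat list of floats.
    alpha = 0 gives back the pretrained weights, alpha = 1 the fine-tuned ones.
    """
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both checkpoints must share the same parameter names")
    return {
        name: [(1.0 - alpha) * p + alpha * f
               for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }


# Hypothetical toy checkpoints (not real CLIP weights).
clip_weights = {"proj": [0.0, 2.0]}
video_weights = {"proj": [1.0, 4.0]}
blended = interpolate_weights(clip_weights, video_weights, alpha=0.5)
```

With `alpha=0.5` the blended checkpoint sits halfway between the two, which is the usual way such an interpolation trades specialization against the generality of the original model.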

2.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 3772-3783, 2024 May.
Article in English | MEDLINE | ID: mdl-38153825

ABSTRACT

The cross-model transferability of adversarial examples makes black-box attacks practical. However, it typically requires access to inputs of the same modality as the black-box models to attain reliable transferability. Unfortunately, the collection of datasets may be difficult in security-critical scenarios. Hence, developing cross-modal attacks for fooling models with different input modalities would pose a serious threat to real-world DNN applications. The above considerations motivate us to investigate cross-modal transferability of adversarial examples. In particular, we aim to generate video adversarial examples from white-box image models to attack video CNN and ViT models. We introduce the Image To Video (I2V) attack based on the observation that image and video models share similar low-level features. For each video frame, I2V optimizes perturbations by reducing the similarity of intermediate features between benign and adversarial frames on image models. Then I2V combines adversarial frames together to generate video adversarial examples. I2V can be easily extended to simultaneously perturb multi-layer features extracted from an ensemble of image models. To efficiently integrate various features, we introduce an adaptive approach to re-weight the contributions of each layer based on its cosine similarity values from the previous attack step. Experimental results demonstrate the effectiveness of the proposed method.
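The quantity the abstract says is reduced, the similarity between benign and adversarial intermediate features, can be sketched as a plain cosine similarity. This is an illustrative sketch only: the feature vectors below are made-up numbers, and the real attack would obtain them from an image model's intermediate layer and minimize the value by gradient steps on the perturbation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical intermediate features of one benign and one perturbed frame.
benign_features = [1.0, 2.0, 3.0]
adversarial_features = [1.0, 2.0, -3.0]
similarity = cosine_similarity(benign_features, adversarial_features)
```

A lower value means the perturbed frame's features point further away from the benign ones, which is the direction an attack of this kind pushes.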

3.
IEEE Trans Image Process ; 32: 6346-6358, 2023.
Article in English | MEDLINE | ID: mdl-37966925

ABSTRACT

The transferability of adversarial examples across different convolutional neural networks (CNNs) makes it feasible to perform black-box attacks, resulting in security threats for CNNs. However, fewer endeavors have been made to investigate transferable attacks for vision transformers (ViTs), which achieve superior performance on various computer vision tasks. Unlike CNNs, ViTs establish relationships between patches extracted from inputs by the self-attention module. Thus, adversarial examples crafted on CNNs might hardly attack ViTs. To assess the security of ViTs comprehensively, we investigate the transferability across different ViTs in both untargeted and targeted scenarios. More specifically, we propose a Pay No Attention (PNA) attack, which ignores attention gradients during backpropagation to improve the linearity of backpropagation. Additionally, we introduce a PatchOut/CubeOut attack for image/video ViTs. They optimize perturbations within a randomly selected subset of patches/cubes during each iteration, preventing over-fitting to the white-box surrogate ViT model. Furthermore, we maximize the L2 norm of perturbations, ensuring that the generated adversarial examples deviate significantly from the benign ones. These strategies are designed to be harmoniously compatible. Combining them can enhance transferability by jointly considering patch-based inputs and the self-attention of ViTs. Moreover, the proposed combined attack seamlessly integrates with existing transferable attacks, providing an additional boost to transferability. We conduct experiments on ImageNet and Kinetics-400 for image and video ViTs, respectively. Experimental results demonstrate the effectiveness of the proposed method.
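The PatchOut idea described above, updating the perturbation only on a random subset of patches in each iteration, can be sketched with a binary mask. This is a simplified illustration under stated assumptions: each patch is reduced to a single scalar, and the helper names `patchout_mask` and `masked_update`, the patch count, and the step size are hypothetical rather than taken from the paper.

```python
import random


def patchout_mask(num_patches, keep, rng):
    """Binary mask that selects `keep` random patches out of `num_patches`."""
    chosen = set(rng.sample(range(num_patches), keep))
    return [1 if i in chosen else 0 for i in range(num_patches)]


def masked_update(perturbation, gradient_sign, mask, step_size):
    """Take a signed-gradient step only on the patches selected by the mask."""
    return [d + step_size * g * m
            for d, g, m in zip(perturbation, gradient_sign, mask)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
mask = patchout_mask(num_patches=16, keep=4, rng=rng)
delta = masked_update([0.0] * 16, [1.0] * 16, mask, step_size=0.01)
```

Drawing a fresh mask at every iteration means no single step sees the full perturbation, which is the mechanism the abstract credits with reducing over-fitting to the surrogate model.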

4.
World J Surg Oncol ; 21(1): 203, 2023 Jul 11.
Article in English | MEDLINE | ID: mdl-37430268

ABSTRACT

PURPOSE: Thymoma is the most common primary tumor in the anterior mediastinum. The prognostic factors of patients with thymoma still need to be clarified. In this study, we aimed to investigate the prognostic factors of patients with thymoma who received radical resection and establish the nomogram to predict the prognosis of these patients. MATERIALS AND METHODS: Patients who underwent radical resection for thymoma with complete follow-up data between 2005 and 2021 were enrolled. Their clinicopathological characteristics and treatment methods were retrospectively analyzed. Progression-free survival (PFS) and overall survival (OS) were estimated using the Kaplan-Meier method and compared by the log-rank test. Univariate and multivariate Cox proportional hazards regression analyses were performed to identify the independent prognostic factors. According to the results of the univariate analysis in the Cox regression model, the predictive nomograms were created. RESULTS: A total of 137 patients with thymoma were enrolled. With a median follow-up of 52 months, the 5-year and 10-year PFS rates were 79.5% and 68.1%, respectively. The 5-year and 10-year OS rates were 88.4% and 73.1%, respectively. Smoking status (P = 0.022) and tumor size (P = 0.039) were identified as independent prognostic factors for PFS. Multivariate analysis showed that a high level of neutrophils (P = 0.040) was independently associated with OS. The nomogram showed that the World Health Organization (WHO) histological classification contributed more to the risk of recurrence than other factors. Neutrophil count was the most important predictor of OS in patients with thymoma. CONCLUSION: Smoking status and tumor size are risk factors for PFS in patients with thymoma. A high level of neutrophils is an independent prognostic factor for OS. The nomograms developed in this study accurately predict PFS and OS rates at 5 and 10 years in patients with thymoma based on individual characteristics.
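The Kaplan-Meier method named in the methods can be sketched in a few lines. The follow-up times and event flags below are invented toy values, not data from the study, and the function name `kaplan_meier` is an assumption for illustration.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event occurred.

    times  -- follow-up time for each patient (e.g. months)
    events -- 1 if the event was observed at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Events observed exactly at time t (censored cases are excluded).
        observed = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if observed:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
    return curve


# Toy cohort of five patients; 0 marks a censored observation.
toy_times = [6, 12, 12, 30, 52]
toy_events = [1, 1, 0, 1, 0]
curve = kaplan_meier(toy_times, toy_events)
```

For this toy cohort the estimate drops to roughly 0.8, 0.6 and 0.3 at months 6, 12 and 30; the censored patients reduce the number at risk without producing a step of their own.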


Subject(s)
Thymoma , Thymus Neoplasms , Humans , Thymoma/surgery , Prognosis , Retrospective Studies , Thymus Neoplasms/surgery , World Health Organization
5.
IEEE Trans Image Process ; 31: 7078-7090, 2022.
Article in English | MEDLINE | ID: mdl-36346859

ABSTRACT

The vanilla Few-shot Learning (FSL) learns to build a classifier for a new concept from one or very few target examples, with the general assumption that source and target classes are sampled from the same domain. Recently, the task of Cross-Domain Few-Shot Learning (CD-FSL) aims at tackling the FSL where there is a huge domain shift between the source and target datasets. Extensive efforts on CD-FSL have been made via either directly extending the meta-learning paradigm of vanilla FSL methods, or employing massive unlabeled target data to help learn models. In this paper, we notice that in the CD-FSL task, the few labeled target images have never been explicitly leveraged to inform the model in the training stage. However, such a labeled target example set is very important to bridge the huge domain gap. Critically, this paper advocates a more practical training scenario for CD-FSL. And our key insight is to utilize a few labeled target data to guide the learning of the CD-FSL model. Technically, we propose a novel Generalized Meta-learning based Feature-Disentangled Mixup network, namely GMeta-FDMixup. We make three key contributions of utilizing GMeta-FDMixup to address CD-FSL. Firstly, we present two mixup modules - mixup-P and mixup-M that help facilitate utilizing the unbalanced and disjoint source and target datasets. These two novel modules enable diverse image generation for training the model on the source domain. Secondly, to narrow the domain gap explicitly, we contribute a novel feature disentanglement module that learns to decouple the domain-irrelevant and domain-specific features. By stripping the domain-specific features, we alleviate the negative effects caused by the domain inductive bias. Finally, we repurpose a new contrastive learning module, dubbed ConL. ConL prevents the model from only capturing category-related features via introducing contrastive loss. Thus, the generalization ability on novel categories is improved. 
Extensive experimental results on two benchmarks show the superiority of our setting and the effectiveness of our method. Code and models will be released.
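The basic mixup operation that the two modules build on is a convex combination of a source sample and a target sample. The sketch below shows only this generic operation; it does not reproduce mixup-P or mixup-M, whose details are not given in the abstract, and the function name and toy vectors are assumptions.

```python
def mixup(x_source, x_target, lam):
    """Convex combination lam * x_source + (1 - lam) * x_target."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * s + (1.0 - lam) * t for s, t in zip(x_source, x_target)]


# Hypothetical two-dimensional features of one source and one target image.
mixed_sample = mixup([1.0, 0.0], [0.0, 1.0], lam=0.75)
```

With `lam=0.75` the result keeps three quarters of the source sample, so a small labeled target set can still influence training without dominating it.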

6.
Nutr Neurosci ; 25(5): 1001-1010, 2022 May.
Article in English | MEDLINE | ID: mdl-33078688

ABSTRACT

OBJECTIVE: To investigate the effect of maternal zinc deficiency on learning and memory in offspring and the changes in DNA methylation patterns. METHODS: Pregnant rats were divided into zinc adequate (ZA), zinc deficient (ZD), and paired fed (PF) groups. Serum zinc contents and AKP activity in mother rats and offspring at P21 (end of lactation) and P60 (weaned, adult) were detected. Cognitive ability of offspring at P21 and P60 was determined by Morris water maze. The expression of proteins including DNMT3a, DNMT1, GADD45β, MeCP2 and BDNF in the offspring hippocampus was detected by Western blot. The methylation status of the BDNF promoter region in the hippocampus of offspring rats was detected by MS-qPCR. RESULTS: Compared with the ZA and PF groups, pups in the ZD group had lower zinc levels and AKP activity in the serum, spent more time finding the platform and spent less time going through the platform area. Protein expression of DNMT1 and GADD45b was downregulated in the ZD group at P0 and P21 but not P60 compared with the ZA and PF groups; these results were consistent with a reduction in BDNF protein at P0 (neonate) and P21. However, when pups of rats in the ZD group were supplemented with zinc ion from P21 to P60, MeCP2 and GADD45b expression were significantly downregulated compared with the ZA and PF groups. CONCLUSION: Post-weaning zinc supplementation may improve cognitive impairment induced by early life zinc deficiency, whereas it may not completely reverse the abnormal expression of particular genes that are involved in DNA methylation, binding to methylated DNA and neurogenesis.


Subject(s)
DNA Methylation , Malnutrition , Animals , Antigens, Differentiation/genetics , Brain-Derived Neurotrophic Factor/genetics , Brain-Derived Neurotrophic Factor/metabolism , Female , Hippocampus/metabolism , Learning , Malnutrition/metabolism , Pregnancy , Rats , Zinc
8.
IEEE Trans Pattern Anal Mach Intell ; 44(4): 1699-1711, 2022 04.
Article in English | MEDLINE | ID: mdl-33026981

ABSTRACT

We introduce AdaFrame, a conditional computation framework that adaptively selects relevant frames on a per-input basis for fast video recognition. AdaFrame, which contains a Long Short-Term Memory augmented with a global memory to provide context information, operates as an agent that interacts with video sequences to search over time for which frames to use. Trained with policy search methods, at each time step, AdaFrame computes a prediction, decides where to observe next, and estimates a utility, i.e., expected future rewards, of viewing more frames in the future. Exploiting predicted utilities at testing time, AdaFrame is able to achieve adaptive lookahead inference so as to minimize the overall computational cost without incurring a degradation in accuracy. We conduct extensive experiments on two large-scale video benchmarks, FCVID and ActivityNet. With a vanilla ResNet-101 model, AdaFrame achieves performance similar to using all frames while only requiring, on average, 8.21 and 8.65 frames on FCVID and ActivityNet, respectively. We also demonstrate that AdaFrame is compatible with modern 2D and 3D networks for video recognition. Furthermore, we show, among other things, that learned frame usage can reflect the difficulty of making prediction decisions both at instance-level within the same class and at class-level among different categories.


Subject(s)
Algorithms
9.
Nat Nanotechnol ; 16(8): 874-881, 2021 08.
Article in English | MEDLINE | ID: mdl-34083773

ABSTRACT

Flash memory has become a ubiquitous solid-state memory device widely used in portable digital devices, computers and enterprise applications. The development of the information age has demanded improvements in memory speed and retention performance. Here we demonstrate an ultrafast non-volatile flash memory based on MoS2/hBN/multilayer graphene van der Waals heterostructures, which achieves an ultrafast writing/erasing speed of 20 ns through two-triangle-barrier modified Fowler-Nordheim tunnelling. Using detailed theoretical analysis and experimental verification, we postulate that a suitable barrier height, gate coupling ratio and clean interface are the main reasons for the breakthrough writing/erasing speed of our flash memory devices. Because of its non-volatility this ultrafast flash memory could provide the foundation for the next generation of high-speed non-volatile memory.

10.
Nano Lett ; 21(4): 1758-1764, 2021 Feb 24.
Article in English | MEDLINE | ID: mdl-33565310

ABSTRACT

As transistor feature sizes continue to scale down, the scaling of the supply voltage has stagnated because of the subthreshold swing (SS) limit. A transistor with a new mechanism is needed to break through the thermionic limit of SS while maintaining a large drive current. Here, by adopting the recently proposed Dirac-source field-effect transistor (DSFET) technology, we experimentally demonstrate a MoS2/graphene (1.8 nm/0.3 nm) DSFET for the first time, and a steep SS of 37.9 mV/dec at room temperature with nearly no hysteresis is observed. In addition, by introducing a gate-all-around (GAA) structure, the MoS2/graphene DSFET exhibits a steeper SS of 33.5 mV/dec and a 40% increased normalized drive current up to 52.7 µA·µm/µm (VDS = 1 V) with a current on/off ratio of 10^8, which shows potential for low-power and high-performance electronics applications.
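For context, the thermionic limit referred to in the abstract follows from Boltzmann statistics. The worked value below assumes a temperature of 300 K, which is a conventional choice for "room temperature" and is not stated in the abstract.

```latex
\mathrm{SS} \;=\; \frac{\partial V_{G}}{\partial \log_{10} I_{D}}
\;\ge\; \frac{k_{B} T}{q}\,\ln 10
\;\approx\; 25.85\ \mathrm{mV} \times 2.303
\;\approx\; 59.5\ \mathrm{mV/dec}
\qquad (T = 300\ \mathrm{K})
```

The reported values of 37.9 mV/dec and 33.5 mV/dec are therefore below the roughly 60 mV/dec bound of a conventional thermionic source.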

11.
Aging (Albany NY) ; 13(3): 4115-4137, 2021 01 20.
Article in English | MEDLINE | ID: mdl-33494069

ABSTRACT

In vitro and in vivo models of Parkinson's disease were established to investigate the effects of the lncRNA XIST/miR-199a-3p/Sp1/LRRK2 axis. The binding between XIST and miR-199a-3p as well as miR-199a-3p and Sp1 were examined by luciferase reporter assay and confirmed by RNA immunoprecipitation analysis. Following the Parkinson's disease animal behavioural assessment by suspension and swim tests, the brain tissue injuries were evaluated by hematoxylin and eosin, TdT-mediated dUTP-biotin nick end labelling, and tyrosine hydroxylase stainings. The results indicated that miR-199a-3p expression was downregulated, whereas that of XIST, Sp1 and LRRK2 were upregulated in Parkinson's disease. Moreover, miR-199a-3p overexpression or XIST knockdown inhibited the cell apoptosis induced by MPP+ treatment and promoted cell proliferation. The neurodegenerative defects were significantly recovered by treating the cells with shXIST or shSp1, whereas miR-199a-3p inhibition or Sp1 and LRRK2 overexpression abrogated these beneficial effects. Furthermore, the results of our in vivo experiments confirmed the neuroprotective effects of shXIST and miR-199a-3p against MPTP-induced brain injuries, and the Parkinson's disease behavioural symptoms were effectively alleviated upon shXIST or miR-199a-3p treatment. In summary, the results of the present study showed that lncRNA XIST sponges miR-199a-3p to modulate Sp1 expression and further accelerates Parkinson's disease progression by targeting LRRK2.


Subject(s)
Apoptosis/genetics , Carrier Proteins/genetics , Leucine-Rich Repeat Serine-Threonine Protein Kinase-2/genetics , MicroRNAs/genetics , Nerve Tissue Proteins/genetics , Neurons/metabolism , Parkinson Disease/genetics , RNA, Long Noncoding/genetics , 1-Methyl-4-phenylpyridinium/toxicity , Animals , Apoptosis/drug effects , Carrier Proteins/metabolism , Cell Line, Tumor , Disease Progression , Gene Knockdown Techniques , Herbicides/toxicity , Humans , Intracellular Signaling Peptides and Proteins/genetics , Intracellular Signaling Peptides and Proteins/metabolism , Leucine-Rich Repeat Serine-Threonine Protein Kinase-2/metabolism , Mice , MicroRNAs/metabolism , Nerve Tissue Proteins/metabolism , Neurons/drug effects , PC12 Cells , Parkinson Disease/metabolism , Parkinson Disease/physiopathology , Parkinsonian Disorders/genetics , Parkinsonian Disorders/metabolism , Parkinsonian Disorders/physiopathology , RNA, Long Noncoding/metabolism , Rats
12.
IEEE Trans Image Process ; 30: 1514-1526, 2021.
Article in English | MEDLINE | ID: mdl-33360994

ABSTRACT

Food recognition has attracted considerable research attention because of its importance for health-related applications. The existing approaches mostly focus on the categorization of food according to dish names, while ignoring the underlying ingredient composition. In reality, two dishes with the same name do not necessarily share the exact list of ingredients. Therefore, the dishes under the same food category are not necessarily equal in nutrition content. Nevertheless, due to the limited datasets available with ingredient labels, the problem of ingredient recognition is often overlooked. Furthermore, as the number of ingredients is expected to be much smaller than the number of food categories, ingredient recognition is more tractable in real-world scenarios. This paper provides an insightful analysis of three compelling issues in ingredient recognition. These issues involve recognition at either image level or region level, pooling at either single or multiple image scales, and learning in either a single-task or multi-task manner. The analysis is conducted on a large food dataset, Vireo Food-251, contributed by this paper. The dataset is composed of 169,673 images covering 251 popular Chinese food categories and 406 ingredients. The dataset includes adequate challenges in scale and complexity to reveal the limits of the current approaches in ingredient recognition.


Subject(s)
Deep Learning , Food Ingredients/classification , Image Processing, Computer-Assisted/methods , Pattern Recognition, Automated/methods , China , Cooking , Humans
13.
IEEE Trans Pattern Anal Mach Intell ; 43(10): 3600-3613, 2021 10.
Article in English | MEDLINE | ID: mdl-32248097

ABSTRACT

In this paper, we propose an end-to-end deep learning architecture that generates 3D triangular meshes from single color images. Restricted by the nature of prevalent deep learning techniques, the majority of previous works represent 3D shapes in volumes or point clouds. However, it is non-trivial to convert these representations to compact and ready-to-use mesh models. Unlike the existing methods, our network represents 3D shapes in meshes, which are essentially graphs and well suited for graph-based convolutional neural networks. Leveraging perceptual features extracted from an input image, our network produces the correct geometry by progressively deforming an ellipsoid. To make the whole deformation procedure stable, we adopt a coarse-to-fine strategy, and define various mesh/surface related losses to capture properties of various aspects, which helps produce visually appealing and physically accurate 3D geometry. In addition, our model by nature can be adapted to objects in specific domains, e.g., human faces, and be easily extended to learn per-vertex properties, e.g., color. Extensive experiments show that our method not only qualitatively produces mesh models with better details, but also achieves higher 3D shape estimation accuracy compared with the state of the art.

14.
Medicine (Baltimore) ; 99(38): e22238, 2020 Sep 18.
Article in English | MEDLINE | ID: mdl-32957367

ABSTRACT

BACKGROUND: Systematic evaluation of the effectiveness and safety of combined procarbazine, lomustine, and vincristine for treating recurrent high-grade glioma. METHODS: Electronic databases including PubMed, MEDLINE, EMBASE, Cochrane Library Central Register of Controlled Trials, WanFang, and China National Knowledge Infrastructure (CNKI) were used to search for studies related to the utilization of combined procarbazine, lomustine, and vincristine as a therapeutic method for recurrent high-grade glioma. Literature screening, extraction of data, and evaluation of high standard studies were conducted by 2 independent researchers. The robustness and strength of the effectiveness and safety of combined procarbazine, lomustine, and vincristine as a therapeutic methodology for recurrent high-grade glioma was assessed based on the odds ratio (OR), mean differences (MDs), and 95% confidence interval (CI). RevMan 5.3 software was used for carrying out the statistical analysis. RESULTS: These results obtained in this study will be published in a peer-reviewed journal. CONCLUSION: Evidently, the conclusion of this study will provide an assessment on whether combined procarbazine, lomustine, and vincristine provides an effective and safe form of treatment for recurrent high-grade glioma. SYSTEMATIC REVIEW REGISTRATION NUMBER: INPLASY202080078.
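The effect measure named in the methods, an odds ratio with a 95% confidence interval, can be computed from a single 2x2 table as sketched below. This uses the standard Wald interval on the log odds ratio; the counts are invented for illustration and do not come from any study in the review, and the pooling across studies that RevMan performs is not shown.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a, b -- patients with / without the outcome in the treatment arm
    c, d -- patients with / without the outcome in the control arm
    z    -- normal quantile (1.96 corresponds to a 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 20/50 responders on treatment, 10/50 on control.
estimate, ci_low, ci_high = odds_ratio_ci(20, 30, 10, 40)
```

An interval that excludes 1 is conventionally read as a statistically significant difference between the two arms at the 5% level.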


Subject(s)
Antineoplastic Combined Chemotherapy Protocols/adverse effects , Antineoplastic Combined Chemotherapy Protocols/therapeutic use , Brain Neoplasms/drug therapy , Glioma/drug therapy , Meta-Analysis as Topic , Neoplasm Recurrence, Local/drug therapy , Systematic Reviews as Topic , Adolescent , Adult , Brain Neoplasms/pathology , Glioma/pathology , Humans , Lomustine/adverse effects , Lomustine/therapeutic use , Neoplasm Grading , Neoplasm Recurrence, Local/pathology , Procarbazine/adverse effects , Procarbazine/therapeutic use , Vincristine/adverse effects , Vincristine/therapeutic use , Young Adult
15.
Article in English | MEDLINE | ID: mdl-32946393

ABSTRACT

Generating realistic images with the guidance of reference images and human poses is challenging. Despite the success of previous works on synthesizing person images in iconic views, no effort has been made towards the task of pose-guided image synthesis in non-iconic views. In particular, we find that previous models cannot handle such a complex task, where the person images are captured in non-iconic views by commercially available digital cameras. To this end, we propose a new framework, the Multi-branch Refinement Network (MR-Net), which utilizes several visual cues, including target person poses, foreground person body and parsed scene images. Furthermore, a novel Region of Interest (RoI) perceptual loss is proposed to optimize the MR-Net. Extensive experiments on two non-iconic datasets, Penn Action and BBC-Pose, as well as an iconic dataset, Market-1501, show the efficacy of the proposed model, which can tackle the problem of pose-guided person image generation from non-iconic views. The data, models, and codes are downloadable from https://github.com/loadder/MR-Net.

16.
Article in English | MEDLINE | ID: mdl-32857695

ABSTRACT

The process of learning good representations for machine learning tasks can be very computationally expensive. Typically, we use the same backbones learned on the training set to infer the labels of testing data. Interestingly, however, this learning and inference paradigm is quite different from the typical inference scheme of human biological visual systems. Essentially, neuroscience studies have shown that the right hemisphere of the human brain predominantly performs fast processing of low-frequency spatial signals, while the left hemisphere focuses more on analyzing high-frequency information in a slower way. The low-pass analysis helps facilitate the high-pass analysis via feedback. Inspired by this biological vision mechanism, this paper explores the possibility of learning a layer-skippable inference network. Specifically, we propose a layer-skippable network that dynamically carries out coarse-to-fine object categorization. Such a network has two branches to jointly deal with both coarse and fine-grained classification tasks. The layer-skipping mechanism is proposed to learn a gating network by generating dynamic inference graphs, and reducing the computational cost by detouring the inference path from some layers. This adaptive path inference strategy endows the network with better flexibility and larger capacity and yields high-performance deep networks with dynamic structures. To efficiently train the gating network, a novel ranking-based loss function is presented. Furthermore, the learned representations are enhanced by the proposed top-down feedback facilitation and feature-wise affine transformation, individually. The former employs features of a coarse branch to help the fine-grained object recognition task, while the latter encodes the selected path to enhance the final feature representations. 
Extensive experiments are conducted on several widely used coarse-to-fine object categorization benchmarks, and promising results are achieved by our proposed model. Quite surprisingly, our layer-skipping mechanism improves the network robustness to adversarial attacks. The codes and models are released on https://github.com/avalonstrel/DSN.

17.
Nat Nanotechnol ; 15(7): 545-557, 2020 07.
Article in English | MEDLINE | ID: mdl-32647168

ABSTRACT

Rapid digital technology advancement has resulted in a tremendous increase in computing tasks imposing stringent energy efficiency and area efficiency requirements on next-generation computing. To meet the growing data-driven demand, in-memory computing and transistor-based computing have emerged as potent technologies for the implementation of matrix and logic computing. However, to fulfil the future computing requirements new materials are urgently needed to complement the existing Si complementary metal-oxide-semiconductor technology and new technologies must be developed to enable further diversification of electronics and their applications. The abundance and rich variety of electronic properties of two-dimensional materials have endowed them with the potential to enhance computing energy efficiency while enabling continued device downscaling to a feature size below 5 nm. In this Review, from the perspective of matrix and logic computing, we discuss the opportunities, progress and challenges of integrating two-dimensional materials with in-memory computing and transistor-based computing technologies.

18.
Article in English | MEDLINE | ID: mdl-32406834

ABSTRACT

During the past decade, both multi-label learning and zero-shot learning have attracted huge research attention, and significant progress has been made. Multi-label learning algorithms aim to predict multiple labels given one instance, while most existing zero-shot learning approaches aim at predicting a single testing label for each unseen class via transferring knowledge from auxiliary seen classes to target unseen classes. However, relatively less effort has been made on predicting multiple labels in the zero-shot setting, which is nevertheless a quite challenging task. In this work, we investigate and formalize a flexible framework consisting of two components, i.e., visual-semantic embedding and zero-shot multi-label prediction. First, we present a deep regression model to project the visual features into the semantic space, which explicitly exploits the correlations in the intermediate semantic layer of word vectors and makes label prediction possible. Then, we formulate the label prediction problem as a pairwise one and employ Ranking SVM to seek the unique multi-label correlations in the embedding space. Furthermore, we provide a transductive multi-label zero-shot prediction approach that exploits the testing data manifold structure. We demonstrate the effectiveness of the proposed approach on three popular multi-label datasets with state-of-the-art performance obtained on both conventional and generalized ZSL settings.

19.
IEEE Trans Pattern Anal Mach Intell ; 42(12): 3136-3152, 2020 12.
Article in English | MEDLINE | ID: mdl-31199251

ABSTRACT

Despite significant progress in object categorization in recent years, a number of important challenges remain; mainly, the ability to learn from limited labeled data and to recognize object classes within a large, potentially open, set of labels. Zero-shot learning is one way of addressing these challenges, but it has only been shown to work with limited-size class vocabularies and typically requires separation between supervised and unsupervised classes, allowing the former to inform the latter but not vice versa. We propose the notion of vocabulary-informed learning to alleviate the above-mentioned challenges and address problems of supervised, zero-shot, generalized zero-shot and open set recognition using a unified framework. Specifically, we propose a weighted maximum margin framework for semantic manifold-based recognition that incorporates distance constraints from (both supervised and unsupervised) vocabulary atoms. Distance constraints ensure that labeled samples are projected closer to their correct prototypes, in the embedding space, than to others. We illustrate that the resulting model shows improvements in supervised, zero-shot, generalized zero-shot, and large open set recognition, with up to a 310K class vocabulary on the Animals with Attributes and ImageNet datasets.

20.
IEEE Trans Pattern Anal Mach Intell ; 42(2): 398-412, 2020 02.
Article in English | MEDLINE | ID: mdl-31199252

ABSTRACT

In this paper, we propose Deeply Supervised Object Detectors (DSOD), an object detection framework that can be trained from scratch. Recent advances in object detection heavily depend on the off-the-shelf models pre-trained on large-scale classification datasets like ImageNet and OpenImage. However, one problem is that adopting pre-trained models from classification to detection task may incur learning bias due to the different objective function and diverse distributions of object categories. Techniques like fine-tuning on detection task could alleviate this issue to some extent but are still not fundamental. Furthermore, transferring these pre-trained models across discrepant domains will be more difficult (e.g., from RGB to depth images). Thus, a better solution to handle these critical problems is to train object detectors from scratch, which motivates our proposed method. Previous efforts in this direction mainly failed because of the limited training data and naive backbone network structures for object detection. In DSOD, we contribute a set of design principles for learning object detectors from scratch. One of the key principles is deep supervision, enabled by layer-wise dense connections in both backbone networks and prediction layers, which plays a critical role in learning good detectors from scratch. After involving several other principles, we build our DSOD based on the single-shot detection framework (SSD). We evaluate our method on PASCAL VOC 2007, 2012 and COCO datasets. DSOD achieves consistently better results than the state-of-the-art methods with much more compact models. Specifically, DSOD outperforms the baseline method SSD on all three benchmarks, while requiring only 1/2 of the parameters. We also observe that DSOD can achieve comparable/slightly better results than Mask RCNN [1] + FPN [2] (under similar input size) with only 1/3 of the parameters, using no extra data or pre-trained models.
