1.
Article in English | MEDLINE | ID: mdl-38809744

ABSTRACT

We study multi-sensor fusion for 3D semantic segmentation, which is important to scene understanding in many applications, such as autonomous driving and robotics. For example, for autonomous cars equipped with RGB cameras and LiDAR, it is crucial to fuse complementary information from different sensors for robust and accurate segmentation. Existing fusion-based methods, however, may not achieve promising performance due to the vast difference between the two modalities. In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF) to effectively exploit perceptual information from two modalities, namely, appearance information from RGB images and spatio-depth information from point clouds. To this end, we first project point clouds to the camera coordinate system using perspective projection. In this way, we can process both inputs from LiDAR and cameras in 2D space while preventing the information loss of RGB images. Then, we propose a two-stream network that consists of a LiDAR stream and a camera stream to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules. Moreover, we introduce additional perception-aware losses to measure the perceptual difference between the two modalities. Last, we propose an improved version of PMF, i.e., EPMF, which is more efficient and effective by optimizing data pre-processing and network architecture under perspective projection. Specifically, we propose cross-modal alignment and cropping to obtain tight inputs and reduce unnecessary computational costs. We then explore more efficient contextual modules under perspective projection and fuse the LiDAR features into the camera stream to boost the performance of the two-stream network. Extensive experiments on benchmark data sets show the superiority of our method.
For example, on the nuScenes test set, our EPMF outperforms the state-of-the-art method, i.e., RangeFormer, by 0.9% in mIoU. Compared to PMF, EPMF also achieves 2.06× acceleration with a 2.0% improvement in mIoU. Our source code is available at https://github.com/ICEORY/PMF.
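
The projection step mentioned in the abstract can be illustrated with a standard pinhole camera model. The sketch below is not taken from the PMF code base: the function name `project_points`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the assumption that the points are already expressed in the camera frame are all illustrative choices.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]   # (x, y, z) in the camera frame (assumed)
Pixel = Tuple[float, float, float]     # (u, v, depth)


def project_points(points_cam: List[Point3D],
                   fx: float, fy: float, cx: float, cy: float,
                   width: int, height: int) -> List[Pixel]:
    """Project 3D points onto the image plane with a pinhole model.

    Points behind the camera or outside the image bounds are dropped.
    """
    pixels: List[Pixel] = []
    for x, y, z in points_cam:
        if z <= 0.0:
            continue  # behind the camera, not visible
        u = fx * x / z + cx
        v = fy * y / z + cy
        if 0.0 <= u < width and 0.0 <= v < height:
            pixels.append((u, v, z))
    return pixels


if __name__ == "__main__":
    pts = [(0.0, 0.0, 10.0), (1.0, 0.0, -1.0), (100.0, 0.0, 1.0)]
    print(project_points(pts, 100.0, 100.0, 50.0, 50.0, 100, 100))
```

Keeping the depth alongside each pixel is one simple way to carry LiDAR information into 2D space; the actual input channels used by the authors may differ.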

2.
Neural Netw ; 175: 106275, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38653078

ABSTRACT

Face Anti-Spoofing (FAS) seeks to protect face recognition systems from spoofing attacks and is applied extensively in scenarios such as access control, electronic payment, and security surveillance systems. Face anti-spoofing requires the integration of local details and global semantic information. Existing CNN-based methods rely on small-stride or image patch-based feature extraction structures, which struggle to capture spatial and cross-layer feature correlations effectively. Meanwhile, Transformer-based methods have limitations in extracting discriminative detailed features. To address these issues, we introduce a multi-stage CNN-Transformer-based framework, which extracts local features through the convolutional layer and long-distance feature relationships via self-attention. Based on this, we propose a cross-attention multi-stage feature fusion, employing semantically high-stage features to query task-relevant features in low-stage features for further cross-stage feature fusion. To enhance the discrimination of local features for subtle differences, we design pixel-wise material classification supervision and add an auxiliary branch in the intermediate layers of the model. Moreover, to address the limitations of a single acquisition environment and scarcity of acquisition devices in the existing Near-Infrared dataset, we create a large-scale Near-Infrared Face Anti-Spoofing dataset with 380k pictures of 1040 identities. The proposed method achieves state-of-the-art performance on OULU-NPU and our proposed Near-Infrared dataset with just 1.3 GFLOPs and 3.2M parameters, which demonstrates the effectiveness of the proposed method.


Subject(s)
Neural Networks, Computer , Humans , Automated Facial Recognition/methods , Image Processing, Computer-Assisted/methods , Face , Computer Security , Algorithms
3.
IEEE Trans Med Imaging ; PP, 2024 Apr 26.
Article in English | MEDLINE | ID: mdl-38669168

ABSTRACT

Many of the tissues/lesions in medical images may be ambiguous. Therefore, medical segmentation is typically annotated by a group of clinical experts to mitigate personal bias. A common solution to fuse different annotations is the majority vote, e.g., taking the average of multiple labels. However, such a strategy ignores differences in grader expertise. Inspired by the observation that medical image segmentation is usually used to assist disease diagnosis in clinical practice, we propose the diagnosis-first principle, which takes disease diagnosis as the criterion to calibrate the inter-observer segmentation uncertainty. Following this idea, a framework named Diagnosis-First segmentation Framework (DiFF) is proposed. Specifically, DiFF first learns to fuse the multi-rater segmentation labels into a single ground-truth that maximizes the disease diagnosis performance. We dub the fused ground-truth the Diagnosis-First Ground-truth (DF-GT). Then, the Take and Give Model (T&G Model) is proposed to segment DF-GT from the raw image. With the T&G Model, DiFF can learn the segmentation with calibrated uncertainty that facilitates disease diagnosis. We verify the effectiveness of DiFF on three different medical segmentation tasks: optic-disc/optic-cup (OD/OC) segmentation on fundus images, thyroid nodule segmentation on ultrasound images, and skin lesion segmentation on dermoscopic images. Experimental results show that the proposed DiFF can effectively calibrate the segmentation uncertainty and thus significantly facilitate the corresponding disease diagnosis, outperforming previous state-of-the-art multi-rater learning methods.
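
For reference, the majority-vote baseline that the abstract contrasts with DiFF can be sketched as a per-pixel average followed by a threshold. This is only the baseline, not the DiFF method; the function name `fuse_majority`, the nested-list mask format, and the 0.5 threshold are assumptions made for illustration.

```python
from typing import List

Mask = List[List[int]]  # binary mask, rows of 0/1 labels (assumed format)


def fuse_majority(masks: List[Mask], threshold: float = 0.5) -> Mask:
    """Fuse multi-rater binary masks by averaging labels per pixel.

    A pixel is foreground when the mean label exceeds the threshold,
    so every rater contributes equally regardless of expertise.
    """
    if not masks:
        raise ValueError("at least one mask is required")
    n = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    fused: Mask = []
    for r in range(height):
        row = []
        for c in range(width):
            mean = sum(m[r][c] for m in masks) / n
            row.append(1 if mean > threshold else 0)
        fused.append(row)
    return fused


if __name__ == "__main__":
    raters = [[[1, 0]], [[1, 1]], [[0, 0]]]
    print(fuse_majority(raters))
```

Because each rater has the same weight here, a single expert who disagrees with two less experienced graders is always outvoted, which is the limitation the abstract points to.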

4.
IEEE Trans Pattern Anal Mach Intell ; 46(2): 764-779, 2024 Feb.
Article in English | MEDLINE | ID: mdl-37930907

ABSTRACT

Image captioning is a core challenge in computer vision, attracting significant attention. Traditional methods prioritize caption quality, often overlooking style control. Our research enhances method controllability, enabling descriptions of varying detail. By integrating a length level embedding into current models, they can produce detailed or concise captions, increasing diversity. We introduce a length-level reranking transformer to correlate image and text complexity, optimizing caption length for informativeness without redundancy. Additionally, with caption length increase, computational complexity grows due to the autoregressive (AR) design of existing methods. To address this, our non-autoregressive (NAR) model maintains constant complexity regardless of caption length. We've developed a training approach that includes refinement sequence training and sequence-level knowledge distillation to close the performance gap between NAR and AR models. In testing, our models set new standards for caption quality on the MS COCO dataset and offer enhanced controllability and diversity. Our NAR model excels over AR models in these aspects and shows greater efficiency with longer captions. With advanced training techniques, our NAR's caption quality rivals that of leading AR models.

5.
IEEE J Biomed Health Inform ; 27(12): 5904-5913, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37682645

ABSTRACT

Videofluoroscopic swallowing study (VFSS) visualizes the swallowing movement by using X-ray fluoroscopy and is the most widely used method for dysphagia examination. Temporal parameters are among the most important indicators for swallowing assessment. However, most of this information is acquired manually, which is time-consuming and makes objectivity and accuracy difficult to ensure. In this article, we propose to formulate this task as a temporal action localization task and solve it using deep neural networks. However, actions in VFSS are characterized by small motion targets, small action amplitudes, large sample variances, short duration, and variations in duration. Furthermore, existing methods are mostly designed for daily behaviors, which makes locating and recognizing micro-actions more challenging. To address the above issues, and given the lack of benchmarks, we first collect and annotate the VFSS micro-action dataset, which includes 847 VFSS samples from 71 subjects. We then introduce a coarse-to-fine mechanism to handle the short and repeated nature of micro-actions, which significantly enhances micro-action localization accuracy. Moreover, we propose a Variable-Size Window Generator method, which improves the model's characterization performance and addresses the issue of different action timings, leading to further improvements in localization accuracy. The results of our experiments demonstrate the superiority of our method, with significantly improved performance (46.10% vs. 37.70%).


Subject(s)
Deglutition Disorders , Deglutition , Humans , Fluoroscopy/methods , Deglutition Disorders/diagnostic imaging , Neural Networks, Computer , Time Factors
6.
Neural Netw ; 164: 177-185, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37149918

ABSTRACT

Deep neural networks (DNNs) are vulnerable to adversarial examples with small perturbations. Adversarial defense has thus been an important means of improving the robustness of DNNs against adversarial examples. Existing defense methods focus on some specific types of adversarial examples and may fail to defend well in real-world applications. In practice, we may face many types of attacks, and the exact type of adversarial examples in real-world applications may even be unknown. In this paper, motivated by the observation that adversarial examples are more likely to appear near the classification boundary and are vulnerable to some transformations, we study adversarial examples from a new perspective: whether we can defend against adversarial examples by pulling them back to the original clean distribution. We empirically verify the existence of defense affine transformations that restore adversarial examples. Relying on this, we learn defense transformations to counterattack the adversarial examples by parameterizing the affine transformations and exploiting the boundary information of DNNs. Extensive experiments on both toy and real-world data sets demonstrate the effectiveness and generalization of our defense method. The code is available at https://github.com/SCUTjinchengli/DefenseTransformer.


Subject(s)
Generalization, Psychological , Learning , Neural Networks, Computer
7.
IEEE Trans Pattern Anal Mach Intell ; 45(10): 12459-12473, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37167046

ABSTRACT

Network pruning and quantization are proven to be effective approaches to deep model compression. To obtain a highly compact model, most methods first perform network pruning and then conduct quantization based on the pruned model. However, this strategy ignores that pruning and quantization affect each other, and thus performing them separately may lead to sub-optimal performance. To address this, performing pruning and quantization jointly is essential. Nevertheless, how to make a trade-off between pruning and quantization is non-trivial. Moreover, existing compression methods often rely on some pre-defined compression configurations (i.e., pruning rates or bitwidths). Some attempts have been made to search for optimal configurations, which, however, may incur prohibitive optimization cost. To address these issues, we devise a simple yet effective method named Single-path Bit Sharing (SBS) for automatic loss-aware model compression. To this end, we consider network pruning as a special case of quantization and provide a unified view of model pruning and quantization. We then introduce a single-path model to encode all candidate compression configurations, where a high bitwidth value is decomposed into the sum of the lowest bitwidth value and a series of re-assignment offsets. Relying on the single-path model, we introduce learnable binary gates to encode the choice of configurations and learn the binary gates and model parameters jointly. More importantly, the configuration search problem can be transformed into a subset selection problem, which helps to significantly reduce the optimization difficulty and computation cost. In this way, the compression configurations of each layer and the trade-off between pruning and quantization can be automatically determined. Extensive experiments on CIFAR-100 and ImageNet show that SBS significantly reduces computation cost while achieving promising performance.
For example, our SBS compressed MobileNetV2 achieves 22.6× Bit-Operation (BOP) reduction with only 0.1% drop in the Top-1 accuracy.
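
The decomposition described in the abstract (a high-bitwidth value as the lowest-bitwidth value plus re-assignment offsets, selected by binary gates) can be illustrated on a single scalar. The sketch below is an assumption-laden reading of the abstract, not the authors' implementation: the uniform quantizer on [0, 1], the nested gating rule, and the names `quantize` and `single_path_value` are all hypothetical.

```python
from typing import Sequence


def quantize(x: float, bits: int) -> float:
    """Uniformly quantize x, clipped to [0, 1], onto 2**bits - 1 steps (assumed quantizer)."""
    levels = 2 ** bits - 1
    x = min(max(x, 0.0), 1.0)
    return round(x * levels) / levels


def single_path_value(x: float, bitwidths: Sequence[int], gates: Sequence[int]) -> float:
    """Sum the lowest-bitwidth value and gated re-assignment offsets.

    bitwidths must be increasing, e.g. (2, 4, 8). Gates are nested:
    once a gate is 0, all higher bitwidths are ignored. A closed first
    gate yields 0.0, which corresponds to pruning the value.
    """
    value = 0.0
    previous = 0.0
    for bits, gate in zip(bitwidths, gates):
        if not gate:
            break
        current = quantize(x, bits)
        value += current - previous  # offset from the previous bitwidth
        previous = current
    return value


if __name__ == "__main__":
    for gates in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]:
        print(gates, single_path_value(0.7, (2, 4, 8), gates))
```

Under this reading, choosing a configuration amounts to choosing how many gates are open, which is one way to see why pruning can be treated as a special case of quantization.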

8.
Br J Ophthalmol ; 107(5): 650-656, 2023 05.
Article in English | MEDLINE | ID: mdl-34893473

ABSTRACT

AIMS: To characterise the influence of primary open-angle glaucoma (POAG) and high myopia (HM) on the macular and choroidal capillary density (CD). METHODS: Two hundred and seven eyes were enrolled, including 80 POAG without HM, 50 POAG with HM, 31 HM without POAG and 46 normal controls. A fovea-centred 6×6 mm optical coherence tomography angiography scan was performed to obtain the CD of the superficial capillary plexus (SCP), deep capillary plexus (DCP) and choriocapillaris. Macular and choroidal CDs were compared among the groups and the association of CDs with visual field mean deviation (MD) was determined using linear regression models. RESULTS: Compared with normal eyes, SCP CD was decreased in the POAG without HM group (p<0.05), while DCP CD was significantly decreased in the HM without POAG group (p<0.05). Both SCP and DCP CDs were significantly decreased in the POAG with HM group (p<0.05). CD reduction occurred mainly in the outer rather than inner ring of the 6×6 mm scan size. In multivariate regression analysis, worse MD was associated with lower CD in the outer ring of the SCP in all the HM eyes (p<0.05). CONCLUSIONS: POAG and HM reduced macular CD in different layers of the retinal capillary plexus and both particularly in the outer ring of the 6×6 mm scans. Furthermore, assessment of the CD in the outer ring of the SCP may facilitate the diagnosis of glaucoma in eyes with HM.


Subject(s)
Glaucoma, Open-Angle , Myopia , Humans , Glaucoma, Open-Angle/diagnosis , Retina , Choroid/blood supply , Microvessels , Tomography, Optical Coherence/methods , Retinal Vessels , Fluorescein Angiography/methods
9.
IEEE Trans Image Process ; 31: 1870-1881, 2022.
Article in English | MEDLINE | ID: mdl-35139015

ABSTRACT

OCT fluid segmentation is a crucial task for diagnosis and therapy in ophthalmology. Current convolutional neural networks (CNNs) supervised by pixel-wise annotated masks achieve great success in OCT fluid segmentation. However, obtaining pixel-wise masks for OCT images is time-consuming, expensive, and requires expertise. This paper proposes an Intra- and inter-Slice Contrastive Learning Network (ISCLNet) for OCT fluid segmentation with only point supervision. Our ISCLNet learns visual representation by designing contrastive tasks that exploit the inherent similarity or dissimilarity from unlabeled OCT data. Specifically, we propose an intra-slice contrastive learning strategy to leverage the fluid-background similarity and the retinal layer-background dissimilarity. Moreover, we construct an inter-slice contrastive learning architecture to learn the similarity of adjacent OCT slices from one OCT volume. Finally, an end-to-end model combining intra- and inter-slice contrastive learning processes learns to segment fluid under point supervision. The experimental results on two public OCT fluid segmentation datasets (i.e., AI Challenger and RETOUCH) demonstrate that the ISCLNet bridges the gap between fully-supervised and weakly-supervised OCT fluid segmentation and outperforms other well-known point-supervised segmentation methods.


Subject(s)
Image Processing, Computer-Assisted , Neural Networks, Computer , Image Processing, Computer-Assisted/methods , Retina , Supervised Machine Learning
10.
IEEE Trans Pattern Anal Mach Intell ; 44(3): 1670-1684, 2022 03.
Article in English | MEDLINE | ID: mdl-32956036

ABSTRACT

Visual grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. Generally, it requires the machine to first understand the query, identify the key concepts in the image, and then locate the target object by specifying its bounding box. However, in many real-world visual grounding applications, we have to face ambiguous queries and images with complicated scene structures. Identifying the target based on highly redundant and correlated information can be very challenging, often leading to unsatisfactory performance. To tackle this, in this paper, we exploit an attention module for each kind of information to reduce internal redundancies. We then propose an accumulated attention (A-ATT) mechanism to reason among all the attention modules jointly. In this way, the relation among different kinds of information can be explicitly captured. Moreover, to improve the performance and robustness of our VG models, we additionally introduce some noise into the training procedure to bridge the distribution gap between the human-labeled training data and poor-quality real-world data. With this "noised" training strategy, we can further learn a bounding box regressor, which can be used to refine the bounding box of the target object. We evaluate the proposed methods on four popular datasets (namely ReferCOCO, ReferCOCO+, ReferCOCOg, and GuessWhat?!). The experimental results show that our methods significantly outperform all previous works on every dataset in terms of accuracy.


Subject(s)
Algorithms , Attention , Humans
11.
IEEE Trans Pattern Anal Mach Intell ; 44(1): 211-227, 2022 Jan.
Article in English | MEDLINE | ID: mdl-32750833

ABSTRACT

Generative adversarial networks (GANs) have shown remarkable success in generating realistic data from some predefined prior distribution (e.g., Gaussian noise). However, such a prior distribution is often independent of real data and thus may lose semantic information (e.g., geometric structure or content in images) of the data. In practice, the semantic information might be represented by some latent distribution learned from data. However, such a latent distribution may incur difficulties in data sampling for GAN methods. In this paper, rather than sampling from the predefined prior distribution, we propose a GAN model with local coordinate coding (LCC), termed LCCGAN, to improve the performance of image generation. First, we propose an LCC sampling method in LCCGAN to sample meaningful points from the latent manifold. With the LCC sampling method, we can explicitly exploit the local information on the latent manifold and thus produce new data with promising quality. Second, we propose an improved version, namely LCCGAN++, by introducing a higher-order term in the generator approximation. This term is able to achieve better approximation and thus further improve the performance. More critically, we derive the generalization bound for both LCCGAN and LCCGAN++ and prove that a low-dimensional input is sufficient to achieve good generalization performance. Extensive experiments on several benchmark datasets demonstrate the superiority of the proposed method over existing GAN methods.

12.
IEEE Trans Pattern Anal Mach Intell ; 44(10): 6501-6516, 2022 10.
Article in English | MEDLINE | ID: mdl-34097606

ABSTRACT

Designing effective architectures is one of the key factors behind the success of deep neural networks. Existing deep architectures are either manually designed or automatically searched by some Neural Architecture Search (NAS) methods. However, even a well-designed/searched architecture may still contain many nonsignificant or redundant modules/operations (e.g., some intermediate convolution or pooling layers). Such redundancy may not only incur substantial memory consumption and computational cost but also deteriorate the performance. Thus, it is necessary to optimize the operations inside an architecture to improve the performance without introducing extra computational cost. To this end, we have proposed a Neural Architecture Transformer (NAT) method which casts the optimization problem into a Markov Decision Process (MDP) and seeks to replace the redundant operations with more efficient operations, such as skip or null connection. Note that NAT only considers a small number of possible replacements/transitions and thus comes with a limited search space. As a result, such a small search space may hamper the performance of architecture optimization. To address this issue, we propose a Neural Architecture Transformer++ (NAT++) method which further enlarges the set of candidate transitions to improve the performance of architecture optimization. Specifically, we present a two-level transition rule to obtain valid transitions, i.e., allowing operations to have more efficient types (e.g., convolution → separable convolution) or smaller kernel sizes (e.g., 5×5 → 3×3). Note that different operations may have different valid transitions. We further propose a Binary-Masked Softmax (BMSoftmax) layer to omit the possible invalid transitions. Last, based on the MDP formulation, we apply policy gradient to learn an optimal policy, which will be used to infer the optimized architectures. 
Extensive experiments show that the transformed architectures significantly outperform both their original counterparts and the architectures optimized by existing methods.
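
The idea of a softmax that omits invalid transitions can be sketched as follows. This is a generic masked softmax written for illustration, assuming a 0/1 validity mask per candidate transition; the function name `binary_masked_softmax` is hypothetical and the exact formulation of the BMSoftmax layer in NAT++ may differ.

```python
import math
from typing import List, Sequence


def binary_masked_softmax(logits: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Softmax over valid entries only; masked-out entries get probability 0."""
    if len(logits) != len(mask):
        raise ValueError("logits and mask must have the same length")
    valid = [l for l, keep in zip(logits, mask) if keep]
    if not valid:
        raise ValueError("at least one transition must be valid")
    peak = max(valid)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) if keep else 0.0 for l, keep in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Third transition is invalid for this operation, so it is masked out.
    print(binary_masked_softmax([0.0, 0.0, 5.0], [1, 1, 0]))
```

Masking before normalization means the policy can never sample an invalid transition, however large its logit is.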


Subject(s)
Algorithms , Neural Networks, Computer
13.
Ophthalmology ; 129(1): 45-53, 2022 01.
Article in English | MEDLINE | ID: mdl-34619247

ABSTRACT

PURPOSE: To develop and evaluate the performance of a 3-dimensional (3D) deep-learning-based automated digital gonioscopy system (DGS) in detecting 2 major characteristics in eyes with suspected primary angle-closure glaucoma (PACG): (1) narrow iridocorneal angles (static gonioscopy, Task I) and (2) peripheral anterior synechiae (PAS) (dynamic gonioscopy, Task II) on OCT scans. DESIGN: International, cross-sectional, multicenter study. PARTICIPANTS: A total of 1.112 million images of 8694 volume scans (2294 patients) from 3 centers were included in this study (Task I, training/internal validation/external testing: 4515, 1101, and 2222 volume scans, respectively; Task II, training/internal validation/external testing: 378, 376, and 102 volume scans, respectively). METHODS: For Task I, a narrow angle was defined as an eye in which the posterior pigmented trabecular meshwork was not visible in more than 180° without indentation in the primary position captured in the dark room from the scans. For Task II, PAS was defined as the adhesion of the iris to the trabecular meshwork. The diagnostic performance of the 3D DGS was evaluated in both tasks with gonioscopic records as reference. MAIN OUTCOME MEASURES: The area under the curve (AUC), sensitivity, and specificity of the 3D DGS were calculated. RESULTS: In Task I, 29.4% of patients had a narrow angle. The AUC, sensitivity, and specificity of 3D DGS on the external testing datasets were 0.943 (0.933-0.953), 0.867 (0.838-0.895), and 0.878 (0.859-0.896), respectively. For Task II, 13.8% of patients had PAS. The AUC, sensitivity, and specificity of 3D DGS were 0.902 (0.818-0.985), 0.900 (0.714-1.000), and 0.890 (0.841-0.938), respectively, on the external testing set at quadrant level following normal clinical practice; and 0.885 (0.836-0.933), 0.912 (0.816-1.000), and 0.700 (0.660-0.741), respectively, on the external testing set at clock-hour level. 
CONCLUSIONS: The 3D DGS is effective in detecting eyes with suspected PACG. It has the potential to be used widely in the primary eye care community for screening of subjects at high risk of developing PACG.


Subject(s)
Cornea/pathology , Glaucoma, Angle-Closure/diagnosis , Gonioscopy/methods , Imaging, Three-Dimensional/methods , Iris/pathology , Tomography, Optical Coherence/methods , Trabecular Meshwork/pathology , Adult , Aged , Area Under Curve , Cornea/diagnostic imaging , Cross-Sectional Studies , Diagnosis, Computer-Assisted , Female , Humans , Intraocular Pressure , Iris/diagnostic imaging , Male , Middle Aged , Sensitivity and Specificity
14.
IEEE Trans Pattern Anal Mach Intell ; 44(10): 6454-6471, 2022 10.
Article in English | MEDLINE | ID: mdl-34101584

ABSTRACT

This paper focuses on the challenging task of learning 3D object surface reconstructions from RGB images. Existing methods achieve varying degrees of success by using different surface representations. However, they all have their own drawbacks, and cannot properly reconstruct the surface shapes of complex topologies, arguably due to a lack of constraints on the topological structures in their learning frameworks. To this end, we propose to learn and use the topology-preserved, skeletal shape representation to assist the downstream task of object surface reconstruction from RGB images. Technically, we propose the novel SkeletonNet design that learns a volumetric representation of a skeleton via a bridged learning of a skeletal point set, where we use parallel decoders each responsible for the learning of points on 1D skeletal curves and 2D skeletal sheets, as well as an efficient module of globally guided subvolume synthesis for a refined, high-resolution skeletal volume; we present a differentiable Point2Voxel layer to make SkeletonNet end-to-end and trainable. With the learned skeletal volumes, we propose two models, the Skeleton-Based Graph Convolutional Neural Network (SkeGCNN) and the Skeleton-Regularized Deep Implicit Surface Network (SkeDISN), which respectively build upon and improve over the existing frameworks of explicit mesh deformation and implicit field learning for the downstream surface reconstruction task. We conduct thorough experiments that verify the efficacy of our proposed SkeletonNet. SkeGCNN and SkeDISN outperform existing methods as well, and they have their own merits when measured by different metrics. Additional results in generalized task settings further demonstrate the usefulness of our proposed methods. We have made our implementation code publicly available at https://github.com/tangjiapeng/SkeletonNet.


Subject(s)
Algorithms , Learning , Machine Learning , Neural Networks, Computer
15.
IEEE Trans Pattern Anal Mach Intell ; 44(10): 6649-6666, 2022 10.
Article in English | MEDLINE | ID: mdl-34181534

ABSTRACT

Person re-identification (Re-ID) via gait features within 3D skeleton sequences is a newly-emerging topic with several advantages. Existing solutions rely on either hand-crafted descriptors or supervised gait representation learning. This paper proposes a self-supervised gait encoding approach that can leverage unlabeled skeleton data to learn gait representations for person Re-ID. Specifically, we first create self-supervision by learning to reconstruct unlabeled skeleton sequences reversely, which involves richer high-level semantics to obtain better gait representations. Other pretext tasks are also explored to further improve self-supervised learning. Second, inspired by the fact that motion's continuity endows adjacent skeletons in one skeleton sequence and temporally consecutive skeleton sequences with higher correlations (referred to as locality in 3D skeleton data), we propose a locality-aware attention mechanism and a locality-aware contrastive learning scheme, which aim to preserve locality-awareness on the intra-sequence level and inter-sequence level, respectively, during self-supervised learning. Last, with context vectors learned by our locality-aware attention mechanism and contrastive learning scheme, a novel feature named Contrastive Attention-based Gait Encodings (CAGEs) is designed to represent gait effectively. Empirical evaluations show that our approach significantly outperforms skeleton-based counterparts by 15-40 percent Rank-1 accuracy, and it even achieves superior performance to numerous multi-modal methods with extra RGB or depth information. Our code is available at https://github.com/Kali-Hac/Locality-Awareness-SGE.


Subject(s)
Algorithms , Gait , Humans , Skeleton
16.
IEEE Trans Pattern Anal Mach Intell ; 44(10): 6140-6152, 2022 10.
Article in English | MEDLINE | ID: mdl-34125669

ABSTRACT

This paper tackles the problem of training a deep convolutional neural network of both low-bitwidth weights and activations. Optimizing a low-precision network is very challenging due to the non-differentiability of the quantizer, which may result in substantial accuracy loss. To address this, we propose three practical approaches, including (i) progressive quantization; (ii) stochastic precision; and (iii) joint knowledge distillation to improve the network training. First, for progressive quantization, we propose two schemes to progressively find good local minima. Specifically, we propose to first optimize a network with quantized weights and subsequently quantize activations. This is in contrast to the traditional methods which optimize them simultaneously. Furthermore, we propose a second progressive quantization scheme which gradually decreases the bitwidth from high-precision to low-precision during training. Second, to alleviate the excessive training burden due to the multi-round training stages, we further propose a one-stage stochastic precision strategy to randomly sample and quantize sub-networks while keeping other parts in full-precision. Finally, we adopt a novel learning scheme to jointly train a full-precision model alongside the low-precision one. By doing so, the full-precision model provides hints to guide the low-precision model training and significantly improves the performance of the low-precision network. Extensive experiments on various datasets (e.g., CIFAR-100, ImageNet) show the effectiveness of the proposed methods.


Subject(s)
Algorithms , Neural Networks, Computer
17.
IEEE Trans Pattern Anal Mach Intell ; 44(10): 6209-6223, 2022 Oct.
Article in English | MEDLINE | ID: mdl-34138701

ABSTRACT

Temporal action localization, which requires a machine to recognize the location as well as the category of action instances in videos, has long been researched in computer vision. The main challenge of temporal action localization lies in that videos are usually long and untrimmed with diverse action contents involved. Existing state-of-the-art action localization methods divide each video into multiple action units (i.e., proposals in two-stage methods and segments in one-stage methods) and then perform action recognition/regression on each of them individually, without explicitly exploiting their relations during learning. In this paper, we claim that the relations between action units play an important role in action localization, and a more powerful action detector should not only capture the local content of each action unit but also allow a wider field of view on the context related to it. To this end, we propose a general graph convolutional module (GCM) that can be easily plugged into existing action localization methods, including two-stage and one-stage paradigms. To be specific, we first construct a graph, where each action unit is represented as a node and the relation between two action units as an edge. Here, we use two types of relations, one for capturing the temporal connections between different action units, and the other for characterizing their semantic relationship. Particularly for the temporal connections in two-stage methods, we further explore two different kinds of edges, one connecting overlapping action units and the other connecting surrounding but disjoint units. On the constructed graph, we then apply graph convolutional networks (GCNs) to model the relations among different action units, which yields more informative representations to enhance action localization.
Experimental results show that our GCM consistently improves the performance of existing action localization methods, including two-stage methods (e.g., CBR [15] and R-C3D [47]) and one-stage methods (e.g., D-SSAD [22]), verifying the generality and effectiveness of our GCM. Moreover, with the aid of GCM, our approach significantly outperforms the state-of-the-art on THUMOS14 (50.9 percent versus 42.8 percent). Augmentation experiments on ActivityNet also verify the efficacy of modeling the relationships between action units. The source code and the pre-trained models are available at https://github.com/Alvin-Zeng/GCM.
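
The graph construction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that): the function names, the tIoU threshold `iou_thresh`, the gap limit `max_gap`, and the row-normalized single GCN layer are all assumptions chosen for illustration. Overlapping units are linked when their temporal IoU is high enough, and disjoint units are linked when the gap between them is small.

```python
import numpy as np

def temporal_iou(a, b):
    """Temporal IoU of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def build_adjacency(units, iou_thresh=0.5, max_gap=2.0):
    """Adjacency with self-loops over action units.

    Hypothetical rules: overlapping units are linked if tIoU >= iou_thresh;
    disjoint ("surrounding") units are linked if the gap between them
    is at most max_gap.
    """
    n = len(units)
    adj = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            iou = temporal_iou(units[i], units[j])
            if iou > 0:
                linked = iou >= iou_thresh
            else:
                # Gap = start of the later unit minus end of the earlier one.
                gap = max(units[i][0], units[j][0]) - min(units[i][1], units[j][1])
                linked = gap <= max_gap
            if linked:
                adj[i, j] = adj[j, i] = 1.0
    return adj

def gcn_layer(adj, feats, weight):
    """One graph convolution: ReLU(row-normalized(adj) @ feats @ weight)."""
    norm_adj = adj / adj.sum(axis=1, keepdims=True)
    return np.maximum(norm_adj @ feats @ weight, 0.0)

# Four action units given as (start, end) in seconds, with random features.
units = [(0.0, 4.0), (1.0, 5.0), (6.0, 8.0), (20.0, 22.0)]
rng = np.random.default_rng(0)
feats = rng.normal(size=(len(units), 8))
weight = rng.normal(size=(8, 4))
adj = build_adjacency(units)
out = gcn_layer(adj, feats, weight)
```

In this toy example the last unit is far from all others, so it keeps only its self-loop and its output depends on its own features alone.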

18.
IEEE Trans Pattern Anal Mach Intell ; 44(8): 4035-4051, 2022 Aug.
Article in English | MEDLINE | ID: mdl-33755553

ABSTRACT

We study network pruning which aims to remove redundant channels/kernels and hence speed up the inference of deep networks. Existing pruning methods either train from scratch with sparsity constraints or minimize the reconstruction error between the feature maps of the pre-trained models and the compressed ones. Both strategies suffer from some limitations: the former kind is computationally expensive and difficult to converge, while the latter kind optimizes the reconstruction error but ignores the discriminative power of channels. In this paper, we propose a simple-yet-effective method called discrimination-aware channel pruning (DCP) to choose the channels that actually contribute to the discriminative power. To this end, we first introduce additional discrimination-aware losses into the network to increase the discriminative power of the intermediate layers. Next, we select the most discriminative channels for each layer by considering the discrimination-aware loss and the reconstruction error, simultaneously. We then formulate channel pruning as a sparsity-inducing optimization problem with a convex objective and propose a greedy algorithm to solve the resultant problem. Note that a channel (3D tensor) often consists of a set of kernels (each with a 2D matrix). Besides the redundancy in channels, some kernels in a channel may also be redundant and fail to contribute to the discriminative power of the network, resulting in kernel level redundancy. To solve this issue, we propose a discrimination-aware kernel pruning (DKP) method to further compress deep networks by removing redundant kernels. To avoid manually determining the pruning rate for each layer, we propose two adaptive stopping conditions to automatically determine the number of selected channels/kernels. The proposed adaptive stopping conditions tend to yield more efficient models with better performance in practice. 
Extensive experiments on both image classification and face recognition demonstrate the effectiveness of our methods. For example, on ILSVRC-12, the resultant ResNet-50 model with 30 percent reduction of channels even outperforms the baseline model by 0.36 percent in terms of Top-1 accuracy. We also deploy the pruned models on a smartphone (equipped with a Qualcomm Snapdragon 845 processor). The pruned MobileNetV1 and MobileNetV2 achieve 1.93× and 1.42× inference acceleration on the mobile device, respectively, with negligible performance degradation. The source code and the pre-trained models are available at https://github.com/SCUT-AILab/DCP.
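
To make the idea of greedy channel selection under a joint objective concrete, here is a toy sketch. It is not the DCP algorithm from the paper: it assumes a linear layer in which each column of `X` is one channel, uses a squared-error term as a stand-in for the discrimination-aware loss, and stops when the loss decrease falls below `tol` times the initial loss. All names and parameters are hypothetical.

```python
import numpy as np

def greedy_channel_selection(X, y_recon, y_disc, lam=1.0, tol=1e-3, max_channels=None):
    """Greedily pick columns (channels) of X for a joint least-squares objective.

    Objective (toy stand-in): ||y_recon - X w||^2 + lam * ||y_disc - X w||^2,
    with w restricted to the selected channels.
    """
    n, c = X.shape
    max_channels = c if max_channels is None else max_channels
    # Stack both terms into a single least-squares problem.
    A = np.vstack([X, np.sqrt(lam) * X])
    b = np.concatenate([y_recon, np.sqrt(lam) * y_disc])

    selected = []
    chosen = np.zeros(c, dtype=bool)
    w = np.zeros(c)
    init_loss = float(np.sum((b - A @ w) ** 2))
    prev_loss = init_loss
    while len(selected) < max_channels:
        # Score unselected channels by gradient magnitude of the joint loss.
        grad = -2.0 * A.T @ (b - A @ w)
        scores = np.abs(grad)
        scores[chosen] = -np.inf
        best = int(np.argmax(scores))
        trial = selected + [best]
        # Refit the weights on the trial set of channels.
        coef, *_ = np.linalg.lstsq(A[:, trial], b, rcond=None)
        w_trial = np.zeros(c)
        w_trial[trial] = coef
        loss = float(np.sum((b - A @ w_trial) ** 2))
        # Adaptive stopping: quit once an extra channel barely helps.
        if prev_loss - loss <= tol * init_loss:
            break
        selected, w, prev_loss = trial, w_trial, loss
        chosen[best] = True
    return selected, w

# Synthetic data in which only channels 1 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3]
selected, w = greedy_channel_selection(X, y_recon=y, y_disc=y)
```

The stopping rule plays the role of an adaptive pruning rate: the number of kept channels is determined by the data rather than fixed in advance.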


Subject(s)
Algorithms , Data Compression , Data Compression/methods , Pressure
19.
Invest Ophthalmol Vis Sci ; 62(15): 1, 2021 Dec 01.
Article in English | MEDLINE | ID: mdl-34851376

ABSTRACT

Purpose: The purpose of this study was to determine the longitudinal changes in macular retinal and choroidal microvasculature in normal healthy and highly myopic eyes. Methods: Seventy-one eyes, including 32 eyes with high myopia and 39 healthy control eyes, followed for at least 12 months and examined using optical coherence tomography angiography imaging in at least 3 visits, were included in this study. Fovea-centered 6 × 6 mm scans were performed to measure capillary density (CD) of the superficial capillary plexus (SCP), deep capillary plexus (DCP), and choriocapillaris (CC). The rates of CD changes in both groups were estimated using a linear mixed model. Results: Over a mean 14-month follow-up period, highly myopic eyes exhibited a faster rate of whole image CD (wiCD) loss (-1.44%/year vs. -0.11%/year, P = 0.001) and CD loss in the outer ring of the DCP (-1.67%/year vs. -0.14%/year, P < 0.001) than healthy eyes. In multivariate regression analysis, baseline axial length (AL) was negatively correlated with the rate of wiCD loss (estimate = -0.27, 95% confidence interval [CI] = -0.48 to -0.06, P = 0.012) and CD loss in the outer ring (estimate = -0.33, 95% CI = -0.56 to -0.11, P = 0.005), of the DCP. The CD reduction rates in the SCP and CC were comparable in both groups (all P values > 0.05). Conclusions: The rate of CD loss in the DCP is significantly faster in highly myopic eyes than in healthy eyes and is related to baseline AL. The CD in the outer ring reduces faster in eyes with longer baseline AL.


Subject(s)
Choroid/blood supply , Myopia, Degenerative/physiopathology , Retinal Vessels/physiopathology , Adult , Capillaries/diagnostic imaging , Capillaries/physiopathology , Choroid/diagnostic imaging , Female , Fluorescein Angiography , Follow-Up Studies , Healthy Volunteers , Humans , Intraocular Pressure/physiology , Longitudinal Studies , Male , Middle Aged , Myopia, Degenerative/diagnostic imaging , Prospective Studies , Retinal Vessels/diagnostic imaging , Tomography, Optical Coherence , Visual Acuity/physiology
20.
Neural Netw ; 144: 553-564, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34627120

ABSTRACT

Neural architecture search (NAS) has gained increasing attention in the community of architecture design. One of the key factors behind the success lies in the training efficiency brought by the weight sharing (WS) technique. However, WS-based NAS methods often suffer from a performance disturbance (PD) issue. That is, the training of subsequent architectures inevitably disturbs the performance of previously trained architectures due to the partially shared weights. This leads to inaccurate performance estimation for the previous architectures, which makes it hard to learn a good search strategy. To alleviate the performance disturbance issue, we propose a new disturbance-immune update strategy for model updating. Specifically, to preserve the knowledge learned by previous architectures, we constrain the training of subsequent architectures in an orthogonal space via orthogonal gradient descent. Equipped with this strategy, we propose a novel disturbance-immune training scheme for NAS. We theoretically analyze the effectiveness of our strategy in alleviating the PD risk. Extensive experiments on CIFAR-10 and ImageNet verify the superiority of our method.
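
The orthogonal-update idea mentioned in this abstract can be sketched in a few lines. This is a generic illustration of projecting a gradient onto the orthogonal complement of previously stored directions (Gram-Schmidt style), not the paper's training scheme; the function names and the choice of which directions to store are assumptions.

```python
import numpy as np

def project_orthogonal(basis, grad):
    """Remove from grad its components along each orthonormal basis vector."""
    out = np.asarray(grad, dtype=float).copy()
    for u in basis:
        out -= (out @ u) * u
    return out

def add_direction(basis, grad, eps=1e-10):
    """Orthonormalize grad against the basis and store it if it is new."""
    residual = project_orthogonal(basis, grad)
    norm = np.linalg.norm(residual)
    if norm > eps:
        basis.append(residual / norm)

# Directions assumed to matter for previously trained architectures.
basis = []
add_direction(basis, np.array([1.0, 0.0, 0.0]))
add_direction(basis, np.array([1.0, 1.0, 0.0]))

# A new gradient is projected before the update, so the step does not
# move the shared weights along the stored directions.
new_grad = np.array([3.0, 4.0, 5.0])
safe_grad = project_orthogonal(basis, new_grad)
```

Here the stored directions span the first two axes, so only the third component of the new gradient survives the projection.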


Subject(s)
Learning , Neural Networks, Computer